ELRA Newsletter
Issue #4 | January 2023
Language Resources
LRs @ELRA
LRs in the ELRA Catalogue this month
Since October 2022, 2 new Written Corpora and 67 new Speech Resources have been added to our catalogue.
Written Corpora
German Political Speeches Corpus
ISLRN: 381-445-879-769-5
This corpus consists of a collection of political speeches in German crawled from the online archives of the German Presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency, the speeches cover the period from July 1, 1984, to February 17, 2012, and the corpus contains a total of 1442 texts comprising 2 392 074 tokens. For the German Chancellery, the corpus contains a total of 1831 texts comprising 3 891 588 tokens, covering the period from December 11, 1998, to December 6, 2011. The Chancellery part contains speeches from the Chancellor but also from other politicians.
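As a quick sanity check on the figures quoted for this corpus, the short sketch below sums the two sub-corpora and derives an average text length for each. The variable and function names are illustrative only; the counts are the ones given in the description.

```python
# Counts copied from the corpus description; names are illustrative.
subcorpora = {
    "Presidency": {"texts": 1442, "tokens": 2_392_074},
    "Chancellery": {"texts": 1831, "tokens": 3_891_588},
}


def totals(parts):
    """Sum text and token counts over all sub-corpora."""
    texts = sum(p["texts"] for p in parts.values())
    tokens = sum(p["tokens"] for p in parts.values())
    return texts, tokens


def mean_tokens_per_text(part):
    """Average number of tokens per text in one sub-corpus."""
    return part["tokens"] / part["texts"]


total_texts, total_tokens = totals(subcorpora)
print(f"{total_texts} texts, {total_tokens} tokens overall")
for name, part in subcorpora.items():
    print(f"{name}: about {mean_tokens_per_text(part):.0f} tokens per text")
```

Taken together, the two parts amount to 3273 texts and 6 283 662 tokens.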
Learner Corpus of Portuguese L2 – COPLE2
ISLRN: 936-320-703-366-7
The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is a corpus of written and oral texts produced by students of Portuguese as Foreign/Second Language courses at the Instituto de Cultura e Língua Portuguesa (Institute of Portuguese Language and Culture, ICLP – FLUL) and by applicants for examinations at the Centro de Avaliação de Português Língua Estrangeira (Center for Evaluation of Portuguese as a Foreign Language, CAPLE – FLUL). The corpus contains texts from learners with 15 different native languages (L1s) and proficiency levels from A1 to C1, and covers different topics and types of tasks. It is encoded in TEI format through the TEITOK environment. The corpus currently includes a total of 182,474 tokens and 978 texts, classified according to the CEFR scales. It contains annotations for part of speech, lemma and learner errors. All the encoded information is searchable through the CQP query language.
Speech Resources
LR Agreement with Datatang for 67 Speech Resources
ELRA and Datatang signed a Language Resources distribution agreement to release a total of 67 Speech Resources distributed by ELRA. With this agreement, ELRA is strengthening its position as the leading worldwide distribution centre and Datatang is getting more visibility on the European market.
Those resources were designed and collected to boost Speech Recognition in particular. They cover the following languages:
- Cantonese
- Chinese Mandarin
- Various dialects from China: Changsha, Kunming, Shanghai, Sichuan, Wuhan
- Several variants of English (English from Australia, Canada, China, France, Germany, India, Italy, Japan, Korea, Latin America, Portugal, Russia, Singapore, Spain, United Kingdom, USA)
- French, German, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Portuguese (Brazilian), Russian, Spanish (including non-hispanic Spanish), Thai, Uyghur, Vietnamese
ISLRN submissions
The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. The aim is to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.
Latest figures
- 82 new ISLRN numbers assigned between October and December 2022
- A total of 3342 ISLRN numbers assigned since January 2014
- A total of 270 distinct languages.
The latest LRs for which an ISLRN number was requested and accepted are as follows:
- Learner Corpus of Portuguese L2 – COPLE2 – ISLRN: 936-320-703-366-7
- LORELEI Swahili Representative Language Pack – ISLRN: 699-485-644-732-3
- AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts – ISLRN: 699-485-644-732-3
- CAMIO Transcription Languages – ISLRN: 014-810-264-834-8
- Global TIMIT Thai – ISLRN: 471-761-626-780-5
- Third DIHARD Challenge Evaluation – ISLRN: 805-666-543-566-5
More about ISLRN.
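Because an ISLRN is meant to identify exactly one resource, a list of identifiers such as the one above can be checked mechanically. The following is a minimal sketch, not an official validator: the 3-3-3-3-1 digit grouping is an assumption inferred from the identifiers printed in this newsletter, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: the 3-3-3-3-1 digit grouping is inferred from the identifiers
# printed in this newsletter, not taken from an official ISLRN specification.
ISLRN_PATTERN = re.compile(r"\d{3}-\d{3}-\d{3}-\d{3}-\d")


def is_well_formed(islrn: str) -> bool:
    """Return True if the string matches the assumed ISLRN layout."""
    return ISLRN_PATTERN.fullmatch(islrn) is not None


def find_duplicates(islrns):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(islrns)
    return sorted(i for i, n in counts.items() if n > 1)


# The six identifiers as listed in this issue.
listed = [
    "936-320-703-366-7",
    "699-485-644-732-3",
    "699-485-644-732-3",
    "014-810-264-834-8",
    "471-761-626-780-5",
    "805-666-543-566-5",
]

malformed = [i for i in listed if not is_well_formed(i)]
print("Malformed:", malformed)
print("Repeated:", find_duplicates(listed))
```

Run on the list as printed, the check finds no malformed identifier but reports one value that appears twice, which suggests that one of the two entries sharing it should be verified against the ISLRN portal.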
Legal Issues
Publication of the EU-US Draft Adequacy Decision
Following the Schrems II ruling, which invalidated the Privacy Shield framework for transatlantic transfers of personal data, the US and EU authorities have been working to put a new framework in place.
On December 13, 2022, the European Commission published a draft adequacy decision detailing the new redress mechanisms and other changes introduced by the American authorities in order to comply with the GDPR.
The next steps are the submission to the European Data Protection Board for its opinion and the approval by a committee of Member States representatives.
It is also possible that individuals, the Parliament, or the European Council may challenge the validity of this new framework before the European Court of Justice.
Full draft adequacy decision available here.
Out with the old Standard Contractual Clauses for data transfers between EU and non-EU Countries
After issuing new Standard Contractual Clauses (SCCs) for data transfers between EU and non-EU countries on June 4, 2021, the European Commission allowed controllers and processors to rely on the earlier version of the SCCs until December 27, 2022, but only for contracts concluded before September 27, 2021.
Now that the deadline has passed, transfers between EU and non-EU countries can only be made pursuant to the new SCCs.
The new SCCs are available here.
Over €300 million in fines against Meta group announced by the Irish Data Protection Commission
Following inquiries, the Irish Data Protection Commission (DPC) announced two sets of sanctions against the Meta group, to which Facebook, Instagram and WhatsApp belong.
On January 4, 2023, the Irish DPC imposed a €210 million fine on the Facebook service and a €180 million fine on the Instagram service. During this inquiry, it was found that Meta could not rely on the “contract” legal basis to process the personal data of its users and therefore was in breach of its transparency obligations.
On January 19, 2023, the Irish DPC imposed a fine of €5.5 million on the WhatsApp service operated by the Meta group. During this inquiry, it was likewise found that Meta could not rely on the “contract” legal basis to process the personal data of its users and therefore was in breach of its transparency obligations.
Reports of the decisions are available here for the decision published on January 4, 2023 and here for the decision published on January 19, 2023.
Swedish presidency circulates option papers on the Data Act
On January 10, 2023, the Swedish Presidency sought the Member States’ opinions on the most crucial aspects of the upcoming Data Act in order to resolve some of the main outstanding issues.
The paper addresses the following questions:
- Exclusion of SMEs from the Act
- Business to Government data sharing
- Protection of trade secrets
The full report is available here.
Berlin provides its position on the Data Act
Germany submitted its position paper on the adoption of the upcoming Data Act to the Swedish Presidency. The paper covers the following points:
- Clarification of the scope of the regulation, especially the products covered by the Act
- Overlap and inconsistencies between the Data Act and the GDPR
- Differentiation between data sharing conditions of Business to Business (B2B) and Business to Consumer (B2C) use cases
- Protection of Trade Secrets
- Extension of the protection against unfair contractual terms to all companies
- Contractual freedom regarding cloud switching
The full report is available here.
Event Review – CLARIN Café on the Text and Data Mining Exception
On November 8, 2022, CLARIN organised a CLARIN Café dedicated to the implementation of the Text and Data Mining Exception provided by the new Directive on Copyright in the Digital Single Market.
The event featured presentations by Thomas Margoni from KU Leuven, Toby Bond from Bird & Bird, and Jan Hajic from Prague University.
Thomas Margoni gave an overview of the legislative framework for Text and Data Mining. He argued that the Text and Data Mining Exception, as it is articulated today, does not make the EU market attractive for Text and Data Mining because of legal uncertainties, while it creates a market for right-holders in the downstream markets (AI developments).
Toby Bond presented the state of the legislation in the post-Brexit United Kingdom. He also provided an outlook on the future of the Text and Data Mining legislation: he expects the UK government to implement a broad Text and Data Mining exception allowing these operations for both commercial and non-commercial organisations, with no “opt-out” provision.
Jan Hajic presented the High-Performance Language Technology Project (HPLT), whose goal is to collect large amounts of data in 30 languages, create Large Language Models (LLMs) and make them openly available, free of charge, on large repositories for the language community.
The full recording of the event can be found here and the slides are available here.
ELRA/ELDA Projects
Information on the on-going projects
Conclusion of the ELRC initiative
The European Language Resource Coordination (ELRC) initiative was officially concluded on January 16, 2023. The initial purpose of ELRC was to collect language data within the CEF-AT countries (EU Member States plus Iceland and Norway) to train eTranslation, the European Commission’s MT service.
Since the beginning of the initiative in 2014, substantial results have been achieved, as the figures below show:
- 3,306 LRs available on the ELRC-SHARE repository, 80% of which are freely re-usable.
- 86 workshops and 6 conferences organized throughout Europe to highlight the importance of language data and language technologies and to promote the collection of multilingual language data.
- White Paper
- Development of LTs: NER, Speech-to-Text, Social Media Translation, etc. See the CEF AT services page for more information on the available services.
Common European Language Data Space (LDS)
The Common European Language Data Space (LDS) project was launched on January 19, 2023. The 3-year project aims to establish a European platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data.
The service contract has been established between the European Commission and a four-partner consortium composed of:
- German Research Center for Artificial Intelligence (DFKI) (coordinator),
- Evaluations and Language Resources Distribution Agency (ELDA),
- Athena Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP),
- SIA Tilde
ELRA, through its operational body ELDA, will be involved in several work packages.
More details will be provided soon on the dedicated website. In the meantime, you can subscribe to the @LangDataSpace Twitter account.
Language Technology Solutions – CNECT/LUX/2022/OP/0030
This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals: 1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites; 2. support the creation of open-source European language speech recognition solutions; 3. carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.
ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.
LOT 1 – Solutions Supporting the Use of Automated Translations on Websites
The project was officially launched on December 12, 2022 under the name “European Multilingual Web (EMW)”. The EMW consortium is coordinated by Tilde (Latvia) with the participation of ELDA (Evaluations and Language Resources Distribution Agency, France), IDC (International Data Corporation), Ogilvy (SIA Guilty, Latvia) and Rīga Stradiņš University (Latvia).
It involves four major tasks:
Task 1: carrying out a comprehensive and evidence-based market study on the multilingualism of websites.
Task 2: delivering a set of ready-to-use open-source automated website translation solutions, and their subsequent maintenance and support (including helpdesk), including regular updating of relevant documentation, as required.
Task 3: publishing the set of open-source automated website translation solutions developed during Task 2 on a dedicated solutions website, achieving widespread use of the solutions through promotional activities, and building awareness of EU actions to support and nurture multilingualism.
Task 4: developing and implementing the strategy to ensure the sustainability, after the end of the contract, of the set of ready-to-use open-source automated website translation solutions developed or supported under Task 2.
LOT 2 – Language Technologies Solutions
This call for tenders implementing the Digital Europe programme (DIGITAL) in the field of language technologies is to achieve three specific goals:
1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites;
2. support the creation of open source European language speech recognition solutions;
3. carry out market studies on language technologies and widely disseminate their results in order to foster the take-up of language technologies in Europe.
This call for tenders covers:
- the creation and promotion of a set of ready-to-use open-source automated website translation solutions,
- the creation of an open-source basic speech recognition prototype solution,
- the conduct of market research on language technologies and the wide dissemination of its results.
Evaluation Campaigns
Current campaigns
- IWSLT 2023 Evaluation Campaign – https://iwslt.org/2023/
- SemEval-2023 – The 17th International Workshop on Semantic Evaluation – https://semeval.github.io/SemEval2023/
- VarDial Evaluation Campaign 2023 – https://sites.google.com/view/vardial-2023/shared-tasks
- HaSpeeDe 3 (Hate Speech Detection) shared task within Evalita 2023 – http://www.di.unito.it/~tutreeb/haspeede-evalita23/
Dissemination
News from ELRA
Six Board members reached the end of their term in 2022, and elections have been organized to replace them. The two-step process started in November 2022 with the nomination of 7 candidates by the ELRA members and the ELRA Board, and continues in early 2023 with online voting.
The 7 nominees were:
- Jan Niehues
- Sakti Sakriani
- Francesca Frontini
- Nancy Ide
- Patrick Paroubek
- Teresa Lynn
- Aline Villavicencio
Election results will be shared shortly on ELRA’s usual channels, including the @ELRANews Twitter account.
Language Resources and Evaluation Journal
The 4 Regular Issues were published in 2022 in Volume 56:
News from the community
SMaLL-100 Model
SMaLL-100, a Shallow Multilingual MT Model for Low-Resource Languages, is a compact and fast massively multilingual machine translation model covering more than 10K language pairs. It is a distilled version of the large 12B M2M-100 model released by Meta.
Researchers working on NMT for low-resource languages may be interested in SMaLL-100. Pre-trained models are provided as a starting point for developing MT for low-resource language pairs. A demo platform is also available to try out MT for those 10,000 language pairs.
- Online MT demo: https://huggingface.co/spaces/alirezamsh/small100
- SMaLL-100 paper, accepted at EMNLP 2022: https://arxiv.org/abs/2210.11621
Call for proposals
The European Commission recently published a €20 million call for proposals on Natural Language Understanding and Interaction in Advanced Language Technologies under the Horizon Europe research programme (topic ID: HORIZON-CL4-2023-HUMAN-01-03).
The call will close on 29 March 2023 and the evaluation is expected to happen during April and May 2023.
The call covers the following topics:
- Improve context-aware human-machine interaction to increase understanding and exploitation of the interaction context and content in multimodal settings, thus increasing responsiveness of interactive AI solutions, such as smart assistants, conversational and dialogue systems, content generation models, etc.
- Support and enhance seamless human-to-human communication across languages e.g., by means of automatic translation or interpretation (incl. automatic subtitling) in real time with a greater understanding of the communication context and the meaning involved in it.
Call for submissions
ERCIM News #132 has just been published.
Featuring the special theme “Cognitive AI and Cobots” and showcasing remarkable achievements from research teams in Europe, this issue was coordinated by our guest editors Theodore Patkos (ICS-FORTH) and Zsolt Viharos (SZTAKI).
Submissions to the next issue (#133, April 2023), on the special theme “Data infrastructures and management”, are open until February 28, 2023. See the call for contributions for more details.
ERCIM Twitter: @ercim_news and other social media.