ELRA Newsletter

Issue #7 | December 2023

Language Resources

LRs @ELRA

Language Resources in the ELRA Catalogue

We are happy to announce that, since July 2023, a large series of monolingual and multilingual lexicons, as well as 1 written corpus and 4 speech resources, have been released in our catalogue.

Monolingual and multilingual lexicons

Bitext Lexical Datasets and Synthetic Data

The series of Bitext Lexical Datasets for the generic vocabulary includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided. The following 15 languages are available: Arabic, Chinese, Dutch, English, Finnish, French, German, Indonesian, Italian, Malay, Norwegian (Bokmål), Portuguese, Spanish and Ukrainian.

As a complement to the datasets mentioned above, 11 datasets of Language Variants can also be obtained for the following languages: Arabic, Chinese, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese and Spanish.

A Synonym dataset that can be linked with the English dataset is also provided.

The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals in English and Spanish. They cover the most common intents for each vertical and include many example utterances for each intent, with optional entity/slot annotations for each utterance. The data are distributed as models or open text files.

For each language, the following verticals are available: Automotive, Retail banking, Education, Event and ticketing, Field Service, Healthcare, Hospitality, Insurance, Legal, Manufacturing, Media Streaming, Mortgage and loans, Moving and storage, Real estate and construction, Restaurant/bar chains, Retail Ecomm, Telecommunication, Travel, Utilities, Wealth management.

Lexicala Multilingual Lexical Data

The Lexicala resources consist of different groups of datasets. Full descriptions can be found in the ELRA Catalogue of Language Resources under the following links:

  1. GLOBAL Multilingual Lexical Data: a network of lexicographic cores for major world languages, comprising monolingual cores, bilingual pairs, and multilingual combinations for 25 languages.
  2. MULTIGLOSS Multilingual Glossaries: a series of innovative word-to-sense glossaries for over 30 languages into 45 more languages.
  3. Morphological lexicons: extensive morphological lists linking inflected forms to main lemmas for 15 languages.
  4. Parallel Corpora & Domains: parallel corpora for nearly 400 language pairs and numerous multilingual combinations, featuring general language and vertical domain vocabularies.
  5. Biographical & Geographical Names:
  6. Audio Pronunciation & Phonetic Transcription: human voice recordings of single-word lemmas and multiword expressions, as well as IPA and alternative scripts for 21 languages.

Written Corpora

Corpus for fine-grained analysis and automatic detection of irony on Twitter

This corpus was annotated by trained annotators (master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. It consists of 4791 instances with an irony label and a tweet ID.

Speech Resources

ATCO2 Project Data

The ATCO2 project collected real-time voice communication between air-traffic controllers and pilots, available either directly through publicly accessible radio frequency channels or indirectly from air-navigation service providers (ANSPs). In addition to the voice communication data, contextual information is available in the form of metadata (i.e. surveillance data).

The dataset consists of two subsets:

  • a corpus of more than 4000 hours of untranscribed data
  • a corpus of 4 hours of transcribed air-traffic control speech collected across different airports (Sion, Bern, Zurich, etc.). About 1 hour of the annotations has been re-checked by a human.

Corpus of Spontaneous Japanese (CSJ)

The Corpus of Spontaneous Japanese (CSJ) contains about 650 hours of spontaneous speech, corresponding to about 7 million words. All the speech material was recorded using head-worn close-talking microphones and DAT, and down-sampled to 16 kHz with 16-bit resolution. The speech material is transcribed at both the orthographic and phonetic levels. In addition, segment labels, intonation labels, and other miscellaneous annotations are provided for a subset of CSJ, called the Core, which contains about 500k words or 45 hours of speech.

EWA-DB – Early Warning of Alzheimer speech database

EWA-DB is a speech database that contains data from 3 clinical groups (Alzheimer’s disease, Parkinson’s disease and mild cognitive impairment) and a control group of healthy subjects. Speech samples for each group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object and action naming (30 objects and 30 actions), and picture description (two single pictures and three complex pictures). The total number of speakers in the database is 1649: 87 people with Alzheimer’s disease, 175 people with Parkinson’s disease, 62 people with mild cognitive impairment, 2 people with a mixed diagnosis of Alzheimer’s + Parkinson’s disease and 1323 healthy controls.
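
As a quick consistency check, the speaker counts quoted above can be summed and compared with the stated total. The short Python sketch below uses only the figures from this description; the dictionary keys and variable names are illustrative and are not part of the database itself.

```python
# Speaker counts per group, as quoted in the EWA-DB description.
# The group labels used as keys are illustrative names only.
speakers = {
    "alzheimers": 87,
    "parkinsons": 175,
    "mild_cognitive_impairment": 62,
    "mixed_alzheimers_parkinsons": 2,
    "healthy_controls": 1323,
}

stated_total = 1649  # total number of speakers given in the description

computed_total = sum(speakers.values())
clinical_total = computed_total - speakers["healthy_controls"]

print(f"Computed total: {computed_total} (stated: {stated_total})")
print(f"Speakers with a clinical diagnosis: {clinical_total}")
```

The computed total matches the stated 1649 speakers, of whom 326 have a clinical diagnosis.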

Persian Kids’ Speech Corpus

The Persian Kids’ Speech Corpus consists of speech signals recorded from 286 children (141 girls, 145 boys) aged 6 to 9, using an Andreas Mic Anti-Noise microphone and a Premium Speechmike headphone. The recorded data was manually checked and labelled, resulting in a corpus of 162,395 samples with a total duration of 33 hours and 44 minutes. The samples are distributed as follows: 29,057 Words (478 minutes), 17,429 SubWords (260 minutes), 43,838 Syllables (485 minutes), 70,078 Phonemes (765 minutes), 1,993 Extra Vocabulary (36 minutes).
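
The breakdown above can be cross-checked against the stated totals with a few lines of Python. This is only a sketch based on the figures quoted in this description; the variable names are illustrative.

```python
# (number of samples, duration in minutes) per unit type,
# as quoted in the corpus description.
breakdown = {
    "Words": (29_057, 478),
    "SubWords": (17_429, 260),
    "Syllables": (43_838, 485),
    "Phonemes": (70_078, 765),
    "Extra Vocabulary": (1_993, 36),
}

total_samples = sum(count for count, _ in breakdown.values())
total_minutes = sum(minutes for _, minutes in breakdown.values())

# Convert the summed minutes into hours and remaining minutes.
hours, minutes = divmod(total_minutes, 60)

print(f"Samples: {total_samples}")
print(f"Duration: {hours} h {minutes} min")
```

Both results agree with the description: 162,395 samples and 33 hours 44 minutes.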

ISLRN submissions

The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.

Latest figures

  • 134 new ISLRN numbers assigned between July and October 2023.
  • A total of 3503 ISLRN numbers assigned since January 2014.
  • A total of 275 distinct languages.

The latest LRs for which an ISLRN number was requested and accepted are as follows:


Legal Issues

ELRA Legal Issues Publications

AFDIT Conference on October 5, 2023

On October 5, 2023, ELDA attended the annual AFDIT conference, whose theme was the Law as a “lever” in the data market. The conference addressed the implementation of the upcoming regulations in the data market (Data Act and Data Governance Act) and their impact on data availability and data sharing. It also examined the interplay of these new regulations with existing frameworks such as the GDPR and the Copyright Directive, and featured presentations on how legislation will impact the development of the data market and economy.

The full recording (in French) is available on demand at the following link: https://drive.google.com/file/d/1hoDZalfAEus8VgxFDJSssOH_k88x2USR/view?usp=share_link

French Legislative proposal on generative AI

A few French Members of Parliament are proposing new legislation to frame the use of Artificial Intelligence through copyright law. The proposal would introduce four provisions into French law:

  • The principle that the integration of works into AI systems is subject to the authorisation of the copyright holder.
  • The principle that a work created independently by AI, without human intervention, is the property of the authors whose works were used to create it.
  • The obligation to insert the mention “AI Generated work” on any work created with the help of an AI system.
  • The creation of a taxation measure in favour of creators when works are generated by an AI system using works whose author cannot be determined.

Full text of the proposal (in French) available here: https://www.assemblee-nationale.fr/dyn/16/textes/l16b1630_proposition-loi

Guidelines for the constitution of databases for development of AI Systems

The French CNIL has issued guidelines on the constitution of databases for the development of AI systems.

These guidelines cover all the issues related to data protection such as:

  • The legal regime applicable to the development of the AI system
  • The legal base for processing
  • The legal regime applicable to the AI system
  • Lawfulness of the data processing
  • Drafting of a Data Protection Impact Assessment
  • Data Protection by Default and by Design in the systems
  • Data Protection by Default and by Design in the collection of data

The Guidelines are available (in French only): https://www.cnil.fr/fr/les-fiches-pratiques-ia

Joint Research Centre publishes a report on cybersecurity in the AI Act

The European Commission's Joint Research Centre has published a report on cybersecurity requirements in the AI Act.

It presents a high-level analysis in the context of the rapidly evolving AI landscape, and provides a set of key guiding principles to achieve compliance with the AI Act.

The report is available at the following link: https://publications.jrc.ec.europa.eu/repository/handle/JRC134461

EU Model Contractual Clauses for procurements involving AI

The European Commission has provided model contractual clauses for public stakeholders seeking external providers to develop AI Systems.

These EU model contractual AI clauses contain provisions specific to AI systems and are aligned with the proposed AI Act. Note that they do not cover other obligations or requirements that may arise under relevant applicable legislation, such as the General Data Protection Regulation.

These model clauses are available at the following link:

https://public-buyers-community.ec.europa.eu/communities/procurement-ai/resources/eu-model-contractual-ai-clauses-pilot-procurements-ai#:~:text=The%20EU%20model%20contractual%20AI%20clauses%20contain%20provisions%20specific%20to,the%20General%20Data%20Protection%20Regulation

EDPS Opinions on the AI Act and Directives on AI Liability

The European Data Protection Supervisor (EDPS) has published two opinions on the upcoming legislation on Artificial Intelligence (AI Act and AI Product Liability Directive).

These two opinions are available at the following links:

Opinion on AI Liability Rules: https://edps.europa.eu/system/files/2023-10/23-10-11_opinion_ai_liability_rules.pdf

ELRA/ELDA Projects

Information on the on-going projects

Common European Language Data Space (LDS)

The Common European Language Data Space (LDS) project was launched on January 19, 2023. This 3-year project aims at establishing a European platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data.

ELDA is currently working on the different tasks under its responsibility:

  • Setting up of a technical and legal helpdesk to provide support and guidance to all platform users. The technical platform to run this helpdesk is currently being implemented.
  • Definition and establishment of a Multistakeholder Data and Services Governance Scheme. For that purpose, a solid collaboration has been established between the LDS and the Data Spaces Support Centre (DSSC), and work towards a full alignment on governance issues is under way. A first version of this governance scheme is currently under development and is due in January 2024. Likewise, a scheme for the LDS infrastructure governance is being drafted and will be finalised by January 2024.
  • Organisation of events. The second Technology Workshop is currently being organised and will be held in early 2024. This workshop will be entitled “Legislation and Regulations for Data Spaces: An Environment for the Development of a European Data Market”. It will aim to give a panorama of the regulations that have been adopted or that are still in the pipeline for implementation in the data market (Digital Services Act (DSA), Directive on Copyright in the Digital Single Market, Data Act (DA), Data Governance Act (DGA), General Data Protection Regulation (GDPR) and Artificial Intelligence Act). It will also provide keys to understanding the interactions between these different legal instruments, their impact on the landscapes of Data Spaces and Artificial Intelligence, and how they will influence the data economy. The final workshop agenda will be published shortly.
  • Ensuring compliance with personal data protection requirements in all the data processing operations included in the project. ELDA has produced a preliminary document identifying all the possible processing operations involving personal data in each of the LDS tasks. This work is being expanded to take into account information from the other tasks (e.g., governance, technical implementation and conceptual definitions). Furthermore, compliance with existing Data Protection Records (DPR) from the EC Data Protection public register has been investigated and defined. At present, work is focusing on preparing two Data Protection analyses to meet upcoming needs: one for the Data Space Connector (a set of software components that technically enable a participant's membership of a data space and the performance of operations within the data space), and one for the EC node (EC-owned framework and connector as an LDS participant).

LDS: Face-to-face Consortium meeting in Berlin

On November 22 and 23, 2023, the 4 partners in the LDS consortium, DFKI, ELDA, ILSP and TILDE, convened at the DFKI premises in Berlin.
The primary purpose of this meeting was to engage in discussions and assess the activities conducted by all partners within LDS related to: project coordination, governance, technical infrastructure, promotion, as well as legal aspects pertaining to data protection and copyright compliance.

Language Technology Solutions – CNECT/LUX/2022/OP/0030

This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals: 

  1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites; 
  2. support the creation of open-source European language speech recognition solutions; 
  3. carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.

ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.

LOT 1 – Solutions Supporting the Use of Automated Translations on Websites 

The main goal of the European Multilingual Web (EMW) project is to provide a set of ready-to-use, open-source automated translation solutions for websites. In this framework, ELDA will manage the Helpdesk team, which will support users who report issues or ask questions about the solutions.

ELDA is currently setting up the Helpdesk, and the technical platform to run it is being implemented. During this period, ELDA also offered legal expertise to the project consortium by addressing the data protection issues relevant to the solution.

LOT 2 – Automated speech recognition prototype solution

Currently, ELDA is involved in the creation of resources for three under-resourced European languages: Czech, Estonian, and Greek. The LTS Lot 2 initiative aims to create two types of resources for each of the three languages: ASR solutions and a dataset of 4500 hours of data, 1500 of which are to be transcribed.

A first iteration of the ASR solutions will be presented to the EC in December; BUT and TILDE are the main contributors to the solutions’ development. The solutions have been used to kick off a simplified transcription task, providing transcribers with a pre-transcription of the data to be corrected. The data collected so far consists of EC archives (interviews, debates, etc.), Czech municipality debates, as well as other sources for which negotiations are still ongoing.

As in Lot 1, ELDA offered legal expertise to the project consortium, mainly concerning data protection issues.

Consortium and Tasks

The consortium operating in this project is coordinated by Brno University of Technology (BUT) with the participation of TILDE and ELDA. Three main tasks are being performed with the participation of all members of the consortium:

  • Task 1: A comprehensive market study of Automatic Speech Recognition (ASR) solutions.
  • Task 2: Creation of an open-source speech recognition prototype solution for three under-represented European languages (Czech, Estonian, and Greek).
  • Task 3: Collection and partial transcription (one third) of speech data for the three above-mentioned European under-resourced languages.

Evaluation Campaigns

Current campaigns

Dissemination

News from ELRA

LREC-COLING 2024

Lingotto Conference Centre in Turin (Italy)

20-25 May, 2024

Two international key players in the area of computational linguistics, the ELRA Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), are joining forces to organize the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) to be held in Torino, Italy on 20-25 May, 2024.

Submissions and on-going Review Process

For this joint edition, the number of submissions to the Main conference broke all records, with 3300+ papers submitted. In addition, 57 workshop and 20 tutorial proposals have been received. All papers and proposals are currently being reviewed. According to the schedule, authors will be notified in February 2024.

The important dates are:

  • 22–29 January 2024: Author rebuttal period
  • 5 February 2024: Final reviewing
  • 19 February 2024: Notification of acceptance
  • 25 March 2024: Camera-ready due
  • 20-25 May 2024: LREC-COLING 2024 conference

The latest information on the conference can be found online at https://lrec-coling-2024.lrec-conf.org/

Twitter: @LrecColing2024

Language Resources and Evaluation Journal

Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.

Since July 2023, the following issues have been published.

Volume 57

Volume 57, issue 3, September 2023

Volume 57, issue 4, December 2023

Each of these regular issues includes a number of papers in Open Access.

News from the community

Publication on ATCO2 dataset in the Aerospace Journal

Following ELDA’s involvement in the ATCO2 project, the consortium has published a new article in the Aerospace Journal entitled “Lessons Learned in Transcribing 5000 h of Air Traffic Control Communications for Robust Automatic Speech Understanding”.

The paper reviews (i) robust automatic speech recognition (ASR), (ii) natural language processing, (iii) English language identification, and (iv) contextual ASR biasing with surveillance data, based on the corpus available in ELDA’s catalogue.

The full article is available here: https://www.mdpi.com/2226-4310/10/10/898