ELRA Newsletter

Issue #8 | January-July 2024

Language Resources
LRs @ELRA
Language Resources in the ELRA Catalogue

We are happy to announce that, since November 2023, 3 monolingual lexicons and 2 speech resources have been released in our catalogue. Moreover, 1 written corpus is now available at reduced fees.

1. New Language Resources

Monolingual lexicons

DiaLEX – Egyptian (DiaLEX-EA) 
ISLRN: 697-328-151-668-9

A comprehensive full-form lexicon of Egyptian Arabic general vocabulary (DiaLEX-EA) including 78 million entries for 31,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Egyptian Arabic, especially morphological analysis and speech technology.
Quantity and size: 75,204,644 lines / 11,217 MB (11.0 GB)

DiaLEX – Emirati (DiaLEX-UA) 
ISLRN: 836-793-503-213-8

A comprehensive full-form lexicon of Emirati Arabic general vocabulary (DiaLEX-UA) including 28 million entries for 29,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Emirati Arabic, especially morphological analysis and speech technology.
Quantity and size: 24,976,871 lines / 3,841 MB (3.8 GB)

DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA) 
ISLRN: 849-157-479-216-3

A comprehensive full-form lexicon of Hijazi Arabic general vocabulary (DiaLEX-HA) including 21 million entries for 30,000 lemmas with all inflected forms, enclitics, proclitics, case endings, declensions, and conjugated forms. Each entry is accompanied by a full and accurate diacriticization (vocalization) as well as an extensive coverage of variants. The lexicon is ideally suited to support natural language processing applications for Hijazi Arabic, especially morphological analysis and speech technology.
Quantity and size: 20,247,655 lines / 2,835 MB (2.8 GB)

Speech Resources

ÌròyìnSpeech 
ISLRN: 012-405-700-001-6

A modern, high-fidelity, multi-speaker, Yorùbá read speech corpus suitable for Speech Synthesis, Automatic Speech Recognition and Computational Linguistics research. The subject matter is drawn from the Broadcast News domain as well as fictional texts, delivering a multi-purpose, contemporary speech dataset. This corpus consists of 34,000 read sentences, amounting to 42 hours of audio recorded at 48 kHz in 16-bit Linear PCM WAV format, for ca. 12.5 Gigabytes.

Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS)
ISLRN: 016-848-885-785-1

The SANACS Corpus contains 67 recorded sessions of interactions between two native Slovak speakers. In 37 sessions an autistic child interacts with a neurotypical adult experimenter, and in 30 control sessions a neurotypical child interacts with the same neurotypical adult experimenter. The children were 6-12 years old (mean 9.2). In all sessions, the two participants are involved in a collaborative, task-oriented communication based on the Maps Task. Most tasks consist of six trials: one practice and two real trials where the experimenter is the describer and the child the follower, and then one practice and two real trials where the roles are switched, so that the child is the describer and the experimenter is the follower.

2. Reduced fees for the following written corpus

Wojood – A corpus for nested Arabic Named Entity Recognition 
ISLRN: 688-718-284-176-0

Wojood consists of about 550,000 tokens (Modern Standard Arabic and dialect) that are manually annotated with 21 entity types (person, group of people, occupation, organization, geopolitical entity, location, facility, event, date, time, language, website, law, product, cardinal number, ordinal number, percent, quantity, unit, money, currency). It covers multiple domains (Media, History, Culture, Health, Finance, ICT, Law, Elections, Politics, Migration, Terrorism, social media) and was annotated with nested entities. The corpus contains about 75K entities, 22.5% of which are nested. The corpus was annotated using the IOB2 tagging scheme and is available in CSV format.
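For readers unfamiliar with IOB2, the following minimal Python sketch shows how one layer of IOB2 tags can be turned into entity spans. It is an illustration only: the function name, the English example tokens and the tag labels (ORG, DATE) are hypothetical and are not taken from the corpus, and the actual CSV layout of Wojood is not described here. Nested entities would presumably need one tag sequence per nesting level, which is also an assumption.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert one layer of IOB2 tags into (label, start, end) spans.

    `end` is exclusive. In IOB2, every entity starts with a "B-" tag,
    continues with "I-" tags of the same label, and "O" marks tokens
    outside any entity.
    """
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # A continuation keeps the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the span that is currently open, if any.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # "B-" opens a new span; a stray "I-" is leniently treated as a start.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example (English tokens and labels chosen for readability).
tokens = ["Example", "University", "opened", "in", "1924"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-DATE"]
spans = iob2_to_spans(tags)
```

Each span can then be mapped back to its tokens with `tokens[start:end]`.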

ISLRN submissions

The International Standard Language Resource Number (ISLRN) provides Language Resources (LRs) with unique identifiers using a standardised nomenclature. This aims to ensure that LRs are correctly identified and, consequently, properly referenced when they are used in R&D projects, product evaluation and benchmarking, as well as in documents and scientific papers.

Latest figures

  • 31 new ISLRN numbers assigned between November 2023 and May 2024.
  • A total of 3534 ISLRN numbers assigned since January 2014.
  • A total of 275 distinct languages.

The latest LRs for which an ISLRN number was requested and accepted are as follows:

More about ISLRN.

Legal Issues 

Legal Workshop at LREC-COLING 2024 in Turin

During the LREC-COLING 2024 conference in Turin in May 2024, a new edition of the “Legal and Ethical Issues in Human Language Technologies” workshop took place. Co-organized by ELDA, the workshop brought together experts to discuss crucial legal and ethical considerations in language resources and technology.

The workshop featured an invited talk on “AI Regulation Perspectives from the UK” by Jennifer Williams from the University of Southampton, which set the stage for a longer discussion on international approaches to AI governance. Three engaging sessions covered a wide range of topics:

  • Legal Frameworks and Ethical Considerations: This session explored compliance methodologies for European Data Spaces, legal frameworks for language model training in Portugal, intellectual property rights in Large Language Models, and evolving ethical challenges in language technology.
  • Considerations and Implications of AI: Presentations addressed the hurdles of using LLMs in Finnish higher education, the impact of AI regulations on disinformation in election years, data broker regulations in the US, and modeling legal and ethical aspects of linguistic data collection.
  • Applications and User Perspective: The final session focused on practical applications, including data envelopes for cultural heritage, the emotional impact of annotating hate speech data, and user perspectives on anonymity in voice assistants.

The workshop highlighted the complex interplay between technological advancements and legal frameworks, emphasizing the need for continued cooperation between technologists, legal experts, and policymakers. The full list of presentations and more detailed information can be found on the LREC-COLING 2024 website. 

Link: https://legal2024.mobileds.de/…

EU AI Act Published in the EU Official Journal

The Artificial Intelligence Act (AI Act) was officially published in the EU Official Journal on July 12, 2024. This marks a milestone in the efforts towards compliant and safe Artificial Intelligence within the European Union.

The AI Act’s implementation will follow a phased approach:

  • Entry into Force: The Act will officially enter into force on August 2, 2024.
  • Ban on Prohibited AI: Starting from February 2, 2025, AI systems deemed to present unacceptable risks will be prohibited.
  • Application to General-Purpose AI (GPAI) Models: The Act will apply to GPAI models from August 2, 2025.
  • Full Implementation: The Act is expected to apply in full, with some exceptions, to all other aspects from August 2, 2026.

The AI Act reflects the EU’s commitment to creating a balanced framework for AI development and use. This regulation aims to foster innovation while ensuring the protection of fundamental rights and safety of EU citizens. As we move towards the implementation dates, organizations developing or using AI systems will need to start preparing for compliance. This includes understanding the risk categories their AI systems fall into and implementing the necessary safeguards and transparency measures.

The phased implementation approach gives businesses and organizations time to adapt to the new requirements. However, given the complexity of the regulation, it’s advisable to start preparations early. ELRA will continue to provide updates and guidance as we gain more knowledge about this complex piece of legislation.

CNIL Launches Second Round of AI Factsheets

On July 2, 2024, the French Data Protection Authority (CNIL) published a second series of how-to sheets and a questionnaire on AI system development, in an effort to guide the responsible development of artificial intelligence systems.

The sheets are designed to help professionals in the AI field strike a balance between innovation and respecting individuals’ rights.

The new how-to sheets cover the following seven topics:

  • Legal basis for legitimate interest in AI system development
  • Legitimate interest: focus on open-sourcing models
  • Legitimate interest: focus on web scraping
  • Informing data subjects
  • Respecting and facilitating the exercise of data subjects’ rights
  • Annotating data
  • Ensuring the safe development of an AI system

To ensure that these guidelines meet the needs of the AI community, the CNIL has opened the new how-to sheets for public consultation. The consultation period runs from July 2 to September 1, 2024, and all stakeholders and actors in the AI ecosystem are encouraged to review these materials and provide feedback.

This second round of consultation builds on CNIL’s ongoing efforts to provide clear guidance on AI development within the framework of data protection regulations. Your participation can make a real difference in creating a responsible AI ecosystem that balances innovation with respect for individual rights.

Link: https://www.cnil.fr/en/artific…

ELRA/ELDA Projects

Information on the on-going projects

Common European Language Data Space (LDS)

The Common European Language Data Space (LDS) project was launched on January 19, 2023. This 3-year project aims at establishing a European platform and marketplace for the collection, creation, sharing and re-use of multilingual and multimodal language data. The service contract has been established between the European Commission and a consortium of the following four partners:
  • German Research Center for Artificial Intelligence (DFKI) (coordinator),
  • Evaluations and Language Resources Distribution Agency (ELDA),
  • Athena Research and Innovation Center in Information, Communication and Knowledge Technologies (ILSP),
  • SIA Tilde.
During this period, ELDA has continued the work planned within Task 14 on Data Protection Compliance and submitted Version 1 of the Data Protection Concept to the EC on 31 March 2023. The Data Protection Concept is a document describing all personal data processing activities inherent to the Language Data Space (LDS) or enabled by the LDS. It is drafted as a product of Task 14 and will serve as the reference documentation for the project’s compliance with GDPR and EUDPR rules. The document is also intended to be updated over the course of the development and deployment of the LDS infrastructure and architecture, so as to include all ongoing and foreseen data processing activities performed during the project. Version 1.0 of the Data Protection Concept includes the following:
  • A revised list of all personal data processing activities foreseen in the original Technical Offer.
  • A first list of activities linked with existing Data Protection Records provided by the European Commission.
  • The main structure and definitions of all subtasks that will be addressed in the Data Protection Concept document at later stages (Data Processing Impact Assessment, Data Protection Risk Assessment, LDS Infrastructure, Data Protection Notices, etc.).
In addition to the active work carried out within Task 14, ELDA is also responsible for Task 10 of the project, event organisation, and a number of dedicated workshops and conferences are planned to take place in the upcoming months. More details on the first events will be provided soon. For more information, you can also subscribe to the @LangDataSpace Twitter account.

 

Language Technology Solutions – CNECT/LUX/2022/OP/0030

This call for tenders from the European Commission was published within the Digital Europe programme (DIGITAL). It aims to achieve three specific goals:

  1. facilitate uptake by SMEs, NGOs, public administration, and academia of European machine translation services for websites;
  2. support the creation of open-source European language speech recognition solutions;
  3. carry out market studies on language technologies and widely disseminate their results to foster the take-up of language technologies in Europe.

ELRA, through its operational body ELDA, is involved in two of the funded projects which are described below.

LOT 1 – Solutions Supporting the Use of Automated Translations on Websites 

The project was officially launched on December 12, 2022 under the name “European Multilingual Web (EMW)”. The EMW consortium is coordinated by Tilde (Latvia) with the participation of ELDA (Evaluations and Language resources Distribution Agency, France), IDC (International Data Corporation), Ogilvy (SIA Guilty, Latvia) and Rīga Stradiņš University (Latvia).

It involves four major tasks respectively consisting of:

Task 1: carrying out a comprehensive and evidence-based market study on the multilingualism of websites.

Task 2: delivering a set of ready-to-use open-source automated website translation solutions, along with their subsequent maintenance and support (including a helpdesk) and regular updating of the relevant documentation, as required. ELDA will be responsible for running the helpdesk, which will be set up for Month 9 of the project (September 2023).

Task 3: publishing the set of open-source automated website translation solutions developed during Task 2 on a dedicated solutions website, achieving widespread use of the solutions through promotional activities, and building awareness of EU actions to support and nurture multilingualism.

Task 4: developing and implementing the strategy to ensure the sustainability, after the end of the contract, of the set of ready-to-use open-source automated website translation solutions developed or supported under Task 2.

LOT 2 – Language Technologies Solutions

The project was officially launched on December 13, 2022.

The consortium operating in this project is coordinated by Brno University of Technology (BUT) with the participation of TILDE and ELDA. Three main tasks will be carried out, with the participation of all members of the consortium:

Task 1: A comprehensive market study of Automatic Speech Recognition (ASR) solutions. This includes an overview of the main stakeholders and techniques of the domain as well as the availability of speech and related transcription data for ASR. This task is mainly carried out by ELDA. In March 2023, a first draft of the study was produced and submitted to the European Commission (EC).

Task 2: Creation of an open-source basic speech recognition prototype solution. This task is mainly carried out by BUT and TILDE, with ELDA taking an advisory position. As with Task 1, a first deliverable describing the solution’s documentation and key features was produced and submitted to the European Commission (EC).

Task 3: Collection and partial transcription (one third) of speech data for three European under-resourced languages. A total of 4500 hours per language will be packaged under the responsibility and coordination of ELDA. The data will be used for training the solution developed in Task 2 as well as to constitute three corpora that will be delivered to the EC. In March 2023, a first legal analysis was sent to the EC with basic information about the collection and transcription timeline.

Evaluation Campaigns

Current campaigns

Dissemination

News from ELRA

LREC-COLING 2024

Lingotto Conference Center in Turin (Italy)

May 20-25, 2024

Two major international key players in the area of computational linguistics, the ELRA Language Resources Association (ELRA) and the International Committee on Computational Linguistics (ICCL), are joining forces to organize the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) to be held in Turin (Italy) on 20-25 May, 2024.

The hybrid conference will bring together researchers and practitioners in computational linguistics, speech, multimodality, and natural language processing, with special attention to evaluation and the development of resources that support work in these areas. Following in the tradition of the well-established parent conferences COLING and LREC, the joint conference will feature grand challenges and provide ample opportunity for attendees to exchange information and ideas through both oral presentations and extensive poster sessions, complemented by a friendly social program.

In addition to the three-day main conference, workshops and tutorials will be held before and after the conference.

Conference website: https://lrec-coling-2024.lrec-conf.org/

Twitter: @LrecColing2024

 

Language Resources and Evaluation Journal

Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. The Journal is edited by ELRA and published by Springer.

Since January 2024, the following issues have been published.

Volume 58

Volume 58, issue 2, June 2024, Regular Issue

Volume 58, issue 1, March 2024, Special Issue: Computational approaches to Portuguese

Each of these issues includes a number of papers in Open Access.

New ELRA Website and Logo

The ELRA Language Resources Association (ELRA) introduced its brand new image at the Members’ Meeting organized within LREC-COLING 2024 last May. The ELRA logo has been replaced by a new one and the website’s content has been restructured to better reflect ELRA’s activities.

The website gives everyone easy access to free resources, the Language Resources Catalogue, the ELRA SIGs, the latest news, and information on how to join the association.

ELRA Website: https://www.elra.info/