Free Resources
AURORA-CD0002 AURORA Project Database 2.0 – Evaluation Package
This database is intended for the evaluation of front-end feature extraction algorithms in background noise, but may also be used more widely by speech researchers to evaluate and compare the performance of noise-robust speech recognition algorithms.
AURORA-CD0004-01 AURORA Project Database – Aurora 4a – Evaluation Package
The Aurora 4a database is based on the Wall Street Journal data with artificial addition of noise over a range of signal-to-noise ratios. It contains both clean and multiple-condition training sets and 14 evaluation sets with different noise types and microphones.
AURORA-CD0004-02 AURORA Project Database – Aurora 4b – Evaluation Package
The Aurora 4b database contains noisy versions of the Nov’92 Wall Street Journal development set.
The AURORA-5 database has been developed mainly to investigate the influence of noisy room environments on the performance of automatic speech recognition with hands-free speech input. Two further test conditions are included to study the influence of transmitting the speech over a mobile communication system. It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, a set of recordings of digit sequences uttered by different speakers in hands-free mode in a meeting room, and a set of scripts for running recognition experiments on those speech data.
ELRA-E0002 TC-STAR 2005 Evaluation Package – ASR English
This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the English language. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0011 TC-STAR 2006 Evaluation Package – ASR English
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the English language. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0025 TC-STAR 2007 Evaluation Package – ASR English
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the English language. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
This resource is an acoustic and articulatory English database recorded as part of the ESPRIT-ACCOR project investigating cross-language acoustic-articulatory correlations in coarticulatory processes.
This resource is an English speech database with 797 calls received in Italy and in the UK, using different types of collecting equipment. It consists of repetitions of the same vocabulary from the “TI (Texas Instruments) words” (digits + yes, no, go, etc.).
ELRA-S0239 N4 (NATO Native and Non Native) database
This database comprises speech data recorded in the naval transmission training centers of four countries (Germany, the Netherlands, the United Kingdom and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, “The North Wind and the Sun,” in both English and the speaker’s native language. The audio files have been manually transcribed and annotated.
This resource contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages:
- English: 66,214 Word forms
- French: 306,795 Word forms
- German: 233,861 Word forms
- Italian: 145,530 Word forms
- Spanish: 510,710 Word forms
This is a multilingual aligned corpus with 1,000,000-token corpora for English, French and Spanish, with morphosyntactic annotations.
An extended version of CRATER (ref. ELRA-W0003) is available in CRATER 2 (ref. ELRA-W0033).
The CRATER 2 parallel corpus is an extension of the CRATER corpus, available in the catalogue under reference W0003. It consists of 1,500,000 tokens for English and French and of 1,000,000 tokens for Spanish, with morphosyntactic annotations.
CRATER 2 (ref. ELRA-W0033) includes CRATER (ref. ELRA-W0003).
ELRA-W0023 MLCC Multilingual and Parallel Corpora
The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words).
The second set consists of a parallel corpus of translated data in the nine European official languages (1992-1994) divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
The TUNA Corpus of Referring Expressions was built from the contributions of 50 native or fluent speakers of English and contains about 2,000 descriptions (referring expressions). Participants described objects (targets) in visual domains by typing and submitting referring expressions that distinguished them from other objects shown simultaneously (distractors). Each description is annotated with semantic information.
ELRA-E0003 TC-STAR 2005 Evaluation Package – ASR Spanish
This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Spanish language. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0012-01 TC-STAR 2006 Evaluation Package – ASR Spanish – CORTES
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Spanish language within the CORTES task. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0012-02 TC-STAR 2006 Evaluation Package – ASR Spanish – EPPS
This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Spanish language within the EPPS task. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0026-01 TC-STAR 2007 Evaluation Package – ASR Spanish – CORTES
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the Spanish language within the CORTES task. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
ELRA-E0026-02 TC-STAR 2007 Evaluation Package – ASR Spanish – EPPS
This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the Spanish language within the EPPS task. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
This corpus consists of approximately 20 minutes of speech (per speaker) from 23 German and 23 Italian intermediate learners of English. Each speaker recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions). About 2/3 of the data for each speaker was annotated by linguists. The files were first corrected at the word level, and an automatic recognizer was then used to produce phone-level annotations. The annotators then re-annotated each sentence to mark phone and stress errors (e.g., substitutions, insertions, or deletions).
ELRA-S0238 MIST Multi-lingual Interoperability in Speech Technology database
The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who uttered 10 sentences in Dutch, English, French and German. These sentences comprise 5 sentences per language which are identical for all speakers and 5 sentences per language which are unique for each speaker. Dutch sentences are orthographically annotated.
ELRA-S0268 UPC-TALP database of isolated meeting-room acoustic events
This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission’s Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED technologies as well as for testing AED algorithms in quiet environments without temporal sound overlapping. Approximately 60 sounds per sound class were recorded. Ten people (5 men and 5 women) participated in three sessions. During each session a person had to produce a complete set of sounds twice.
ELRA-W0013 TSNLP (Test Suites for NLP Testing)
The TSNLP project (LRE 62-089) has produced a database of test suites for English, French and German containing over 4,000 test items (sentences or sentence fragments) per language. The examples have been systematically constructed with detailed annotations about grammatical and other information, and are relevant to developers or users of systems with grammatical components who wish to test, benchmark, or evaluate them.
This corpus contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
ELRA-W0018 ARCADE/ROMANSEVAL corpus
The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Community). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3,700 contexts altogether. It comprises: semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian; and word-level alignment of all the occurrences of the test words between French and English.
ELRA-W0031 GeFRePaC – German French Reciprocal Parallel Corpus
The German-French Reciprocal Parallel Corpus (GeFRePaC) was produced with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging – LE4-8335). It is a 30-million-word corpus (15 million for each language) built for the purpose of developing, enhancing and improving translation aids (dictionaries, lexicons, platforms) for French-German and German-French translation.
The LT Corpus is composed of 70 fiction texts by renowned Portuguese authors. The corpus contains 1,781,083 tokens. The texts date from before 1940. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb version has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
The PTPARL Corpus contains 1,076 texts consisting of adapted transcriptions of the Portuguese Parliament sessions. The corpus contains 1,000,441 tokens. The corpus is delivered in one file, in two different formats. The txt version has one sentence per line, an identification number for each text and no further annotation. The cqpweb version has one token per line, followed by POS tag and lemma, and is annotated for NP chunks.
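The one-token-per-line layout described for the cqpweb versions above can be read with a short script. The sketch below assumes tab-separated columns in the order token, POS tag, lemma; the actual delivery may use a different separator or carry additional NP-chunk markup, so treat this as an illustration rather than a specification of the files:

```python
# Sketch: parse a CQPweb-style "vertical" file where each non-empty line
# holds a token, its POS tag and its lemma. Tab separation and column
# order are assumptions, not guaranteed properties of the LT/PTPARL files.
def parse_vertical(lines):
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between sentences, if any
        parts = line.split("\t")
        if len(parts) >= 3:
            tokens.append({"token": parts[0], "pos": parts[1], "lemma": parts[2]})
    return tokens

# Hypothetical sample lines (not taken from the corpus itself):
sample = ["O\tDA\to", "corpus\tCN\tcorpus", "contém\tV\tconter"]
print(parse_vertical(sample)[2]["lemma"])  # conter
```

A reader working with the real files would check the header or documentation shipped with the corpus for the exact column inventory before relying on this layout.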
ELRA-W0061 CINTIL-DependencyBank
The CINTIL-DependencyBank (Silva and Branco, 2012) is a corpus of sentences annotated with their syntactic dependency graphs and grammatical function tags, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens) and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) that are used for regression testing of the computational grammar that supported the annotation of the corpus.
The CINTIL-DeepBank (Branco et al., 2010) is a corpus of sentences annotated with their full-fledged deep grammatical representations, composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), and novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 tokens) used for regression testing of the computational grammar that supported the annotation of the corpus.