A non-native speech database is a speech database of non-native pronunciations of a language, most commonly English. Such databases are used in the development of multilingual automatic speech recognition systems, text-to-speech systems, pronunciation trainers, and second-language learning systems.

List

Table 1: Abbreviations for languages used in Table 2
| Language | Abbreviation | Language | Abbreviation |
|---|---|---|---|
| Arabic | A | Japanese | J |
| Chinese | C | Korean | K |
| Czech | Cze | Malaysian | M |
| Danish | D | Norwegian | N |
| Dutch | Dut | Portuguese | P |
| English | E | Russian | R |
| French | F | Spanish | S |
| German | G | Swedish | Swe |
| Greek | Gre | Thai | T |
| Indonesian | Ind | Vietnamese | V |
| Italian | I | | |

Table 2 gives an overview of the individual databases and their attributes.

Table 2: Overview of non-native Databases
| Corpus | Author | Available at | Languages | #Speakers | Native Language | #Utt. | Duration | Date | Remarks |
|---|---|---|---|---|---|---|---|---|---|
| AMI | | EU | E | | Dut and other | | 100h | | meeting recordings |
| ATR-Gruhn | Gruhn | ATR | E | 96 | C G F J Ind | 15000 | | 2004 | proficiency rating |
| BAS Strange Corpus 1+10 | | ELRA | G | 139 | 50 countries | 7500 | | 1998 | |
| Berkeley Restaurant | | ICSI | E | 55 | G I H C F S J | 2500 | | 1994 | |
| Broadcast News | | LDC | E | | | | | 1997 | |
| Cambridge-Witt | Witt | U. Cambridge | E | 10 | J I K S | 1200 | | 1999 | |
| Cambridge-Ye | Ye | U. Cambridge | E | 20 | C | 1600 | | 2005 | |
| Children News | Tomokiyo | CMU | E | 62 | J C | 7500 | | 2000 | partly spontaneous |
| CLIPS-IMAG | Tan | CLIPS-IMAG | F | 15 | C V | | 6h | 2006 | |
| CLSU | | LDC | E | | 22 countries | 5000 | | 2007 | telephone, spontaneous |
| CMU | | CMU | E | 64 | G | 452 | 0.9h | | not available |
| Cross Towns | Schaden | U. Bochum | E F G I Cze Dut | 161 | E F G I S | 72000 | 133h | 2006 | city names |
| Duke-Arslan | Arslan | Duke University | E | 93 | 15 countries | 2200 | | 1995 | partly telephone speech |
| ERJ | Minematsu | U. Tokyo | E | 200 | J | 68000 | | 2002 | proficiency rating |
| Fischer | | LDC | E | | many | | 200h | | telephone speech |
| Fitt | Fitt | U. Edinburgh | F I N Gre | 10 | E | 700 | | 1995 | city names |
| Fraenki | | U. Erlangen | E | 19 | G | 2148 | | | |
| Hispanic | Byrne | | E | 22 | S | | 20h | 1998 | partly spontaneous |
| HLTC | | HKUST | E | 44 | C | | 3h | 2010 | available on request |
| IBM-Fischer | | IBM | E | 40 | S F G I | 2000 | | 2002 | digits |
| iCALL | Chen | I2R, A*STAR | C | 305 | 24 countries | 90841 | 142h | 2015 | phonetic and tonal transcriptions (in Pinyin), proficiency ratings |
| ISLE | Atwell | EU/ELDA | E | 46 | G I | 4000 | 18h | 2000 | |
| Jupiter | Zue | MIT | E | unknown | unknown | 5146 | | 1999 | telephone speech |
| K-SEC | Rhee | SiTEC | E | unknown | K | | | 2004 | |
| LDC WSJ1 | | LDC | | 10 | | 800 | 1h | 1994 | |
| LeaP | Gut | University of Münster | E G | 127 | 41 different ones | 73.941 words | 12h | 2003 | |
| MIST | | ELRA | E F G | 75 | Dut | 2200 | | 1996 | |
| NATO HIWIRE | | NATO | E | 81 | F Gre I S | 8100 | | 2007 | clean speech |
| NATO M-ATC | Pigeon | NATO | E | 622 | F G I S | 9833 | 17h | 2007 | heavy background noise |
| NATO N4 | | NATO | E | 115 | unknown | | 7.5h | 2006 | heavy background noise |
| Onomastica | | | D Dut E F G Gre I N P S Swe | | | (121000) | | 1995 | only lexicon |
| PF-STAR | | U. Erlangen | E | 57 | G | 4627 | 3.4h | 2005 | children speech |
| Sunstar | | EU | E | 100 | G S I P D | 40000 | | 1992 | parliament speech |
| TC-STAR | Heuvel | ELDA | E S | unknown | EU countries | | 13h | 2006 | multiple data sets |
| TED | Lamel | ELDA | E | 40(188) | many | | 10h(47h) | 1994 | eurospeech 93 |
| TLTS | | DARPA | A | | E | | 1h | 2004 | |
| Tokyo-Kikuko | | U. Tokyo | J | 140 | 10 countries | 35000 | | 2004 | proficiency rating |
| Verbmobil | | U. Munich | E | 44 | G | | 1.5h | 1994 | very spontaneous |
| VODIS | | EU | F G | 178 | F G | 2500 | | 1998 | about car navigation |
| WP Arabic | Rocca | LDC | A | 35 | E | 800 | 1h | 2002 | |
| WP Russian | Rocca | LDC | R | 26 | E | 2500 | 2h | 2003 | |
| WP Spanish | Morgan | LDC | S | | E | | | 2006 | |
| WSJ Spoke | | | E | 10 | unknown | 800 | | 1993 | |

Legend

The table of non-native databases uses abbreviations for language names, which are listed in Table 1. Table 2 gives the following information about each corpus: the name of the corpus; the institution where the corpus can be obtained, or where at least further information should be available; the language actually spoken by the speakers; the number of speakers; the native language of the speakers; the total number of non-native utterances the corpus contains; the duration in hours of the non-native part; the date of the first public reference to the corpus; some free text highlighting special aspects of the database; and a reference to another publication. In most cases, the reference in the last field is to the paper in which the original collectors describe the corpus. Where no such paper could be identified, a paper that uses the corpus is referenced instead.
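As a minimal sketch of how the abbreviations can be resolved programmatically, the following Python snippet expands a space-separated cell of language codes into full names. The dictionary is copied from Table 1; the names `LANGUAGE_CODES` and `expand_languages` are hypothetical and chosen only for illustration. The helper is meant for cells that contain codes only, not for free-text cells such as "50 countries".

```python
from __future__ import annotations

# Abbreviations copied from Table 1.
LANGUAGE_CODES = {
    "A": "Arabic", "C": "Chinese", "Cze": "Czech", "D": "Danish",
    "Dut": "Dutch", "E": "English", "F": "French", "G": "German",
    "Gre": "Greek", "Ind": "Indonesian", "I": "Italian", "J": "Japanese",
    "K": "Korean", "M": "Malaysian", "N": "Norwegian", "P": "Portuguese",
    "R": "Russian", "S": "Spanish", "Swe": "Swedish", "T": "Thai",
    "V": "Vietnamese",
}


def expand_languages(cell: str) -> list[str]:
    """Expand a space-separated cell of language codes into full names.

    Codes that do not appear in Table 1 are returned unchanged rather
    than guessed at.
    """
    return [LANGUAGE_CODES.get(code, code) for code in cell.split()]


# Example: the "Native Language" cell of the ATR-Gruhn row.
print(expand_languages("C G F J Ind"))
```

Passing unlisted codes through unchanged is a deliberate choice in this sketch: a code that Table 1 does not define stays visible instead of being silently dropped or misinterpreted.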

Some entries are left blank and others are marked "unknown". Blank entries refer to attributes whose value is simply not known. Entries marked "unknown", however, indicate that the database itself provides no information about the attribute. For example, the Jupiter weather database gives no information about the origin of the speakers, which makes this data less useful for evaluating accent detection and similar tasks.
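The distinction between blank and "unknown" cells can be kept explicit when the table is processed by software. The following Python sketch shows one possible way to do so; the enum `CellStatus` and the function `classify_cell` are hypothetical names introduced here for illustration and are not part of any corpus tooling.

```python
from enum import Enum


class CellStatus(Enum):
    VALUE = "value"                # the cell holds an actual value
    NOT_DETERMINED = "blank"       # blank cell: the value is simply not known
    NOT_IN_CORPUS = "unknown"      # "unknown": the corpus itself does not record it


def classify_cell(cell: str) -> CellStatus:
    """Classify a table cell according to the legend's blank/unknown rule."""
    text = cell.strip()
    if text == "":
        return CellStatus.NOT_DETERMINED
    if text.lower() == "unknown":
        return CellStatus.NOT_IN_CORPUS
    return CellStatus.VALUE


# Example: the "#Speakers" cells of the Jupiter and AMI rows, and a plain value.
print(classify_cell("unknown"))
print(classify_cell(""))
print(classify_cell("5146"))
```

Keeping the two cases apart lets a later filtering step, for instance, exclude corpora that cannot support accent-related experiments without also excluding those whose attributes were merely not looked up.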

Where possible, the name given is the standard name of the corpus. Some of the smaller corpora, however, have no established name, so an identifier had to be created. In such cases, a combination of the institution and the collector of the database is used.

Where a database contains both native and non-native speech, only the attributes of the non-native part of the corpus are listed. Most of the corpora are collections of read speech. If a corpus instead consists partly or completely of spontaneous utterances, this is mentioned in the Remarks column.