OLAC Record
oai:www.ldc.upenn.edu:LDC2009S05

Metadata
Title:2007 NIST Language Recognition Evaluation Supplemental Training Set
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Martin, Alvin, et al. 2007 NIST Language Recognition Evaluation Supplemental Training Set LDC2009S05. Web Download. Philadelphia: Linguistic Data Consortium, 2009
Contributor:Martin, Alvin
Le, Audrey
Graff, David
van Santen, Jan
Date (W3CDTF):2009
Date Issued (W3CDTF):2009-11-20
Description:Introduction
2007 NIST Language Recognition Evaluation Supplemental Training Set consists of 118 hours of conversational telephone speech segments in the following languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu and Tamil.
The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. In 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment.
The supplemental training material in this release consists of the following:
* Approximately 53 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian, Thai and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND and Mixer collections.
* Approximately 65 hours of full telephone conversations in Mandarin Chinese (Taiwan), Spanish (Mexican) and Tamil. This material was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The test segments used in the 2005 NIST Language Recognition Evaluation were derived from these full conversations.
In addition to the supplemental material contained in this release, the training data for the 2007 NIST Language Recognition Evaluation consisted of data from previous LRE evaluation test sets, namely, 2003 NIST Language Recognition Evaluation and 2005 NIST Language Recognition Evaluation.
Samples
For an example of the data in this corpus, please listen to this sample of the Egyptian Arabic data from the data set.
Extent:Corpus size: 3323985 KB
Format:Sampling Rate: 8000 Hz
Sampling Format: 8 bit u-law
Identifier:LDC2009S05
https://catalog.ldc.upenn.edu/LDC2009S05
ISBN: 1-58563-530-8
ISLRN: 498-359-265-464-3
Language:Yue Chinese
Wu Chinese
Urdu
Thai
Tamil
Spanish
Russian
Min Nan Chinese
Mandarin Chinese
Bengali
Egyptian Arabic
Language (ISO639):yue
wuu
urd
tha
tam
spa
rus
nan
cmn
ben
arz
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2009S05
Rights Holder: Portions © 2005 Oregon Health and Science University, © 1996, 2006, 2009 Trustees of the University of Pennsylvania
Type (DCMI):Sound
Type (OLAC):primary_text
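
The Format field above (8000 samples per second, 8 bit u-law) can be illustrated with a minimal Python sketch that expands u-law bytes to 16-bit linear samples and estimates duration from a byte count. It assumes a headerless G.711 u-law payload and one channel by default; this record does not state the file container or channel layout, and the function names are hypothetical, chosen only for illustration.

```python
SAMPLE_RATE_HZ = 8000  # taken from the Format field of this record


def ulaw_to_linear(byte_value: int) -> int:
    """Decode one G.711 u-law byte to a signed 16-bit linear sample."""
    u = ~byte_value & 0xFF          # u-law bytes are stored bit-inverted
    sign = u & 0x80                 # top bit: sign
    exponent = (u >> 4) & 0x07      # next three bits: segment number
    mantissa = u & 0x0F             # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> list[int]:
    """Decode a raw (headerless) u-law payload, assumed single channel."""
    return [ulaw_to_linear(b) for b in payload]


def duration_seconds(num_bytes: int, channels: int = 1) -> float:
    """Duration of a raw payload at one byte per sample per channel."""
    return num_bytes / (SAMPLE_RATE_HZ * channels)
```

At one byte per sample, one hour of single-channel audio at this rate occupies 3600 × 8000 = 28,800,000 bytes, excluding any file headers.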

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2009S05
DateStamp:  2014-07-17
GetRecord:  OAI-PMH request for simple DC format
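
The GetRecord entries above refer to OAI-PMH requests for this identifier. A minimal sketch of how such a request URL could be assembled follows. The base URL is a hypothetical placeholder, since this record does not give the repository endpoint; the metadata prefixes `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" named above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url("oai:www.ldc.upenn.edu:LDC2009S05", "olac")
dc_url = get_record_url("oai:www.ldc.upenn.edu:LDC2009S05", "oai_dc")
```

The colons in the identifier are percent-encoded by `urlencode`, so the identifier can be passed exactly as it appears in the OaiIdentifier field.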

Search Info

Citation: Martin, Alvin; Le, Audrey; Graff, David; van Santen, Jan. 2009. Linguistic Data Consortium.
Terms: area_Africa area_Asia area_Europe country_BD country_CN country_EG country_ES country_IN country_PK country_RU country_TH dcmi_Sound iso639_arz iso639_ben iso639_cmn iso639_nan iso639_rus iso639_spa iso639_tam iso639_tha iso639_urd iso639_wuu iso639_yue olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009S05
Up-to-date as of: Thu Jul 6 1:40:49 EDT 2017