OLAC Record: Asian Spoken Language Sampler

OLAC Record
oai:www.ldc.upenn.edu:LDC2010S07

Metadata

Title: Asian Spoken Language Sampler

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Linguistic Data Consortium. Asian Spoken Language Sampler LDC2010S07. Web Download. Philadelphia: Linguistic Data Consortium, 2010

Contributor: Linguistic Data Consortium

Date (W3CDTF): 2010

Description: *Introduction*
The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected, readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with data providers and maintaining relations with other like-minded groups around the world. Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, a searchable catalog is available at http://www.ldc.upenn.edu/Catalog/.
*Data*
The Asian Spoken Language Sampler provides speech and transcript samples from a range of corpora and is designed to illustrate the variety and breadth of the speech-related resources available from LDC's Catalog. Further information about each data set can be obtained by clicking the links in the table below.
The sample files provided in this release have been modified in various ways relative to the original data as published by LDC:
* most excerpts are truncated to be much shorter than the original files; excerpt duration is typically one minute and thirty seconds
* signal amplitude has been adjusted where necessary to normalize playback volume
* some corpora are published in compressed form, but all samples here are uncompressed
* LDC frequently uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities
2005 NIST Language Recognition Evaluation: The goal of the NIST Language Recognition Evaluation is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
2007 NIST Language Recognition Evaluation Test Set: The most significant differences between previous NIST evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions.
ARL Urdu Speech Database, Training Data: The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India.
CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
CALLFRIEND Vietnamese: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Vietnamese.
CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
CALLHOME Mandarin Chinese Speech: The CALLHOME Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese.
JEIDA/JCSD-Channel 0 Mono Syllables: This collection consists of high-fidelity recordings of 150 native speakers of Japanese. Each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones.
Korean Telephone Conversations Speech and Transcripts: This publication consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. All 100 conversations have been transcribed.
Mandarin Affective Speech: Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advance Computing and System Laboratory, Zhejiang University. The speech database was recorded by eliciting speakers to express different emotional states in response to stimuli.
Russian through Switched Telephone Network (RuSTeN): The purpose of the project was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels.
TDT4 Multilingual Broadcast News Speech Corpus: This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations.
West Point Korean Speech: West Point Korean Speech is a database of digital recordings of spoken Korean. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free response answers to questions for use in domain-specific translation systems.
Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
*How to Obtain the Sampler*
The Asian Spoken Language Sampler may be downloaded freely. The sampler is a GNU-zipped tar file, which most compression utilities will readily extract. The download is 28 MB.
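For readers who prefer to unpack the sampler from a script, the following is a minimal sketch using Python's standard `tarfile` module. It assumes only what the description states, namely that the download is a gzip-compressed tar file. The function name `extract_sampler` and any file or directory names passed to it are hypothetical and do not come from LDC.

```python
import tarfile
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the names of the members that were extracted.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    dest_root = dest.resolve()

    with tarfile.open(archive_path, mode="r:gz") as archive:
        safe_members = []
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            # Keep only members whose resolved path stays inside dest_dir,
            # so a malformed archive cannot write elsewhere on disk.
            if dest_root in target.parents:
                safe_members.append(member)
        archive.extractall(path=dest, members=safe_members)
        return [member.name for member in safe_members]
```

Calling `extract_sampler("sampler.tgz", "sampler")`, where both names are placeholders for the downloaded file and a target directory, would unpack the archive and return the list of extracted paths. According to the description above, the audio among them is in MS-WAV (RIFF) format.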

Extent: Corpus size: 41369 KB

Identifier: LDC2010S07

https://catalog.ldc.upenn.edu/LDC2010S07

ISBN: 1-58563-559-6

ISLRN: 042-211-152-679-3

DOI: 10.35111/e3jx-tv33

Language: Japanese

Hindi

Persian

Mandarin Chinese

North Levantine Arabic

South Levantine Arabic

Gulf Arabic

Dari

Iranian Persian

Yue Chinese

Vietnamese

Urdu

Tamil

Russian

Korean

Language (ISO639): jpn

hin

fas

cmn

apc

ajp

afb

prs

pes

yue

vie

urd

tam

rus

kor

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2010 Trustees of the University of Pennsylvania

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2010S07

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
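
The GetRecord entries above correspond to OAI-PMH requests. As an illustration only, the sketch below shows how such a request URL is typically assembled from the three arguments the protocol defines for GetRecord (`verb`, `identifier`, `metadataPrefix`). The base URL is a hypothetical placeholder, because this record does not state the repository endpoint. `oai_dc` is the protocol's standard prefix for simple Dublin Core, while using `olac` as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def build_get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Return an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        # "oai_dc" requests simple Dublin Core; "olac" is assumed here to be
        # the prefix for the OLAC format.
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Request URLs for this record in both formats (against the placeholder endpoint).
dc_url = build_get_record_url("oai:www.ldc.upenn.edu:LDC2010S07")
olac_url = build_get_record_url("oai:www.ldc.upenn.edu:LDC2010S07", metadata_prefix="olac")
```

The colons in the identifier are percent-encoded by `urlencode`, which keeps the identifier intact as a single query argument.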

Search Info
Citation: Linguistic Data Consortium. 2010. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_AF country_CN country_IN country_IR country_JO country_JP country_KR country_KW country_PK country_RU country_SY country_VN iso639_afb iso639_ajp iso639_apc iso639_cmn iso639_fas iso639_hin iso639_jpn iso639_kor iso639_pes iso639_prs iso639_rus iso639_tam iso639_urd iso639_vie iso639_yue olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010S07
Up-to-date as of: Wed Oct 29 7:01:13 EDT 2025
