OLAC Record: CSLU: Multilanguage Telephone Speech Version 1.2

OLAC Record
oai:www.ldc.upenn.edu:LDC2006S35

Metadata

Title: CSLU: Multilanguage Telephone Speech Version 1.2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Multilanguage Telephone Speech Version 1.2 LDC2006S35. Web Download. Philadelphia: Linguistic Data Consortium, 2006

Contributor: Muthusamy, Yeshwant

Cole, Ronald Allan

Oshika, Beatrice

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-06-15

Description: *Introduction* CSLU: Multilanguage Telephone Speech Version 1.2 was developed by The Center for Spoken Language Understanding (CSLU) and consists of approximately 38.5 hours of telephone speech, about eight hours of which have time-aligned phonetic transcripts, from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed vocabulary utterances (e.g. days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers in 12,152 speech files, along with 619 phonetic transcripts. This corpus was collected and developed in 1992.

*Data* Each subject called the CSLU data collection system by dialing a toll-free number. Most subjects were respondents to postings on Usenet newsgroups. Subjects were asked to contribute their voice to science to help with the research. Participating subjects responded to prompts that were designed to elicit vocabulary of three types:

    fixed and useful -- language spoken, days of the week, numbers
    domain-specific -- short open-ended questions
    unrestricted -- monologue on subject of choice

An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.

*Samples* Audio samples from this corpus are available in Korean (WAV), Tamil (WAV) and English (WAV).

*Updates* None at this time.

Extent: Corpus size: 2202009 KB

Format: Sampling Rate: 8000

Sampling Format: pcm
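The Format fields, together with the Description, state the raw audio encoding: 8000 samples per second, stored as 16-bit linear PCM. That is enough to relate file size to duration. The following is a minimal sketch, assuming single-channel audio and ignoring any file headers; neither point is stated in this record, and the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000   # from "Sampling Rate: 8000"
BYTES_PER_SAMPLE = 2    # 16-bit linear PCM
CHANNELS = 1            # assumption: mono telephone audio (not stated in the record)


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of headerless PCM audio of the given size, in seconds."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def pcm_size_bytes(duration_seconds: float) -> int:
    """Size in bytes of headerless PCM audio of the given duration."""
    return round(duration_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# Raw audio size implied by the stated 38.5 hours of speech.
corpus_bytes = pcm_size_bytes(38.5 * 3600)
corpus_kb = corpus_bytes / 1024
print(corpus_bytes, corpus_kb)
```

Under these assumptions, 38.5 hours corresponds to 2,217,600,000 bytes, or 2,165,625 KB at 1024 bytes per KB. That is close to the 2202009 KB listed under Extent; the record does not say what accounts for the difference.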

Identifier: LDC2006S35

https://catalog.ldc.upenn.edu/LDC2006S35

ISBN: 1-58563-390-9

ISLRN: 871-936-811-171-7

DOI: 10.35111/j0p6-f049

Language: Vietnamese

Tamil

Spanish

Iranian Persian

Korean

Japanese

Hindi

French

English

German

Mandarin Chinese

Language (ISO639): vie

tam

spa

pes

kor

jpn

hin

fra

eng

deu

cmn

License: CSLU Agreement: https://catalog.ldc.upenn.edu/license/cslu-corpora-non-commercial-research-only.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006S35

Rights Holder: Portions © 1992, 2000, 2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006S35

DateStamp: 2026-03-13

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Muthusamy, Yeshwant; Cole, Ronald Allan; Oshika, Beatrice. 2006. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_DE country_ES country_FR country_GB country_IN country_IR country_JP country_KR country_VN dcmi_Sound dcmi_Text iso639_cmn iso639_deu iso639_eng iso639_fra iso639_hin iso639_jpn iso639_kor iso639_pes iso639_spa iso639_tam iso639_vie olac_primary_text
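The Terms line packs several kinds of search facet into one whitespace-separated string. The sketch below groups them by facet. It assumes that the text before the first underscore names the facet; this is inferred from the appearance of the terms above, not from a documented OLAC rule, and `group_terms` is a hypothetical helper.

```python
from collections import defaultdict


def group_terms(terms_line: str) -> dict[str, list[str]]:
    """Group search terms by the prefix before their first underscore."""
    groups = defaultdict(list)
    for term in terms_line.split():
        # partition splits on the first "_" only, so "olac_primary_text"
        # yields the facet "olac" and the value "primary_text".
        prefix, _, value = term.partition("_")
        groups[prefix].append(value)
    return dict(groups)


# A shortened excerpt of the Terms line from this record.
terms = "area_Asia area_Europe country_CN dcmi_Sound iso639_cmn olac_primary_text"
facets = group_terms(terms)
print(facets)
```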

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006S35
Up-to-date as of: Wed Jul 8 7:30:26 EDT 2026
