OLAC Record: LDC Spoken Language Sampler

OLAC Record
oai:www.ldc.upenn.edu:LDC2008S08

Metadata

Title: LDC Spoken Language Sampler

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Castelletto, Anthony, et al. LDC Spoken Language Sampler LDC2008S08. Web Download. Philadelphia: Linguistic Data Consortium, 2008.

Contributor: Castelletto, Anthony

et al.

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-11-18

Description: *Introduction* The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. In 2008, LDC is a growing consortium that includes more than 100 companies, universities, and government members and has distributed over 50,000 corpora to a global audience.

With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with potential information providers and would-be members, and maintaining relations with other like-minded groups around the world. Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials.

*Data* The LDC Spoken Language Sampler provides a variety of speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from the LDC Publication Catalog.

* most excerpts are truncated to be much shorter than the original files, typically one minute and thirty seconds of speech
* signal amplitude has been adjusted where necessary to normalize playback volume
* some corpora are published in compressed form, but all samples here are uncompressed
* LDC typically uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities

The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.

* An English Dictionary of the Tamil Verb: This dictionary contains translations for over 6000 English verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and Tamil script, and audio examples in Spoken Tamil pronunciation.
* CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
* CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
* CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers, with associated transcripts.
* CALLHOME Spanish: A corpus of 120 unscripted telephone conversations between native Spanish speakers, with associated transcripts.
* CSLU Kids Speech: Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
* Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
* Grassfields Bantu Fieldwork: Dschang Tone Paradigms: Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
* Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
* Korean Telephone Speech: A collection of 100 telephone conversations between native Korean speakers and their transcriptions.
* Mawukakan Lexicon: The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan languages (Eastern Manding languages of the Mande Group of the Niger-Congo family).
* Nationwide Speech Project: A database of speech representing current regional accents and dialects of the United States.
* NIST Pilot Meeting Speech: Collects speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
* West Point Russian Speech: Utterances of sentences in Russian from 1,891 native and non-native speakers.

*How to Obtain* The LDC Spoken Language Sampler may be downloaded freely. The sampler is a GNU-zipped (gzip) tar file. Most compression utilities will readily extract the sampler. Download (74 MB)
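As a rough illustration of working with the download described above, the following Python sketch unpacks a gzip-compressed tar archive and reports basic properties of the WAV files it contains. It is only a sketch under stated assumptions: the function names `extract_sampler` and `describe_wav` are hypothetical, the archive filename in the usage note is a placeholder rather than the real download name, and the standard-library `wave` module is assumed to be able to read the files, which holds only for uncompressed PCM WAV data.

```python
import tarfile
import wave
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive and return the extracted .wav paths.

    Hypothetical helper: the record does not prescribe any extraction tool.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions can reject members that would escape dest.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(dest.rglob("*.wav"))


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return wav.getnchannels(), rate, wav.getnframes() / rate
```

A call such as `extract_sampler("sampler.tgz", "sampler_out")` (placeholder filename) would return the list of extracted WAV paths, each of which can then be passed to `describe_wav` to check that its duration falls in the 30 to 90 second range mentioned in the description.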

Extent: Corpus size: 73728 KB

Identifier: LDC2008S08

https://catalog.ldc.upenn.edu/LDC2008S08

ISBN: 1-58563-495-6

ISLRN: 857-539-187-188-1

DOI: 10.35111/jawx-3z48

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 2008 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): lexicon

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008S08

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
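
The GetRecord entries in this record refer to OAI-PMH requests. As a minimal sketch, an OAI-PMH GetRecord request is an HTTP query carrying the `verb`, `identifier` and `metadataPrefix` arguments. The base URL below is a placeholder, not the actual endpoint of this archive, and the function name `get_record_url` is hypothetical. The prefix `oai_dc` is the standard prefix for simple Dublin Core; using `olac` for the OLAC format is an assumption about how the repository names that format.

```python
from urllib.parse import urlencode


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (base_url is supplied by the caller)."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ':' in the identifier.
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
url = get_record_url("https://example.org/oai", "oai:www.ldc.upenn.edu:LDC2008S08")
```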

Search Info
Citation: Castelletto, Anthony; et al. 2008. Linguistic Data Consortium.
Terms: dcmi_Sound olac_lexicon

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008S08
Up-to-date as of: Wed Oct 29 7:01:05 EDT 2025
