OLAC Record: West Point Russian Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2003S05

Metadata

Title: West Point Russian Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: LaRocca, Stephen A., and Christine Tomei. West Point Russian Speech LDC2003S05. Web Download. Philadelphia: Linguistic Data Consortium, 2003

Contributor: LaRocca, Stephen A.

Contributor: Tomei, Christine

Date (W3CDTF): 2003

Date Issued (W3CDTF): 2003-12-18

Description: *Introduction*

West Point Russian Speech was developed at the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. The purpose of the corpus is to provide a set of recordings for the training and development of speaker-independent speech recognition systems for use by West Point cadets enrolled in the Russian language program.

*Data*

The corpus consists of 4,181 speech files in SPHERE format, totalling approximately four hours of speech. Of these, 2,290 files are from native informants and 1,891 are from non-native informants. The following tables show the breakdown of corpus content by male, female, native and non-native speakers.

Number of speakers:

              male  female  total
native          13      16     29
non-native      16      10     26
totals          29      26     55

Number of speech files:

              male  female  total
native        1027    1263   2290
non-native    1103     788   1891
totals        2130    2051   4181

The speech data was collected using laptop computers running Windows NT. Recordings were captured as 16-bit PCM at a sampling rate of 22,050 Hz using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. Each sentence was displayed on screen together with a digital recording of the sentence as read by a native speaker. The informant pressed the Enter key to record the utterance. The informant's recording was then played back for review, and the utterance was re-recorded if necessary.

The collection script consists of 96 sentences with a total of 528 tokens and 351 types. Each waveform file has a monophone-level and a word-level transcription in HTK master label file format. A concatenated version of the master label files is provided at both the word level and the phone level. The lexicon contains 690 distinct orthographic word forms, including all words found in the collection script.

*Samples*

Please view the following samples:

* Female Speaker (S31)
* Male Speaker (S08)
* Phone Level Transcript
* Word Level Transcript

*Updates*

There are no updates available at this time.
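The description states that transcriptions are provided as HTK master label files. As a minimal sketch of how such a file could be read, the Python below parses the general HTK MLF layout: a `#!MLF!#` header, a quoted file pattern per utterance, label lines, and a closing `.` line. The sample file name, the phone labels, and the assumption that each label line is either `start end label` or a bare `label` are illustrative only; the actual files in this corpus may differ.

```python
from typing import Dict, List, Optional, Tuple

# (start, end, label); times are in HTK's 100 ns units, or None if absent.
Label = Tuple[Optional[int], Optional[int], str]

HTK_TIME_UNIT = 1e-7  # seconds per HTK time unit (100 ns)


def parse_mlf(text: str) -> Dict[str, List[Label]]:
    """Parse HTK master label file text into {file pattern: [labels]}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!MLF!#":
        raise ValueError("not an HTK master label file: missing #!MLF!# header")
    entries: Dict[str, List[Label]] = {}
    current: Optional[str] = None
    for ln in lines[1:]:
        if ln.startswith('"') and ln.endswith('"'):
            # A quoted pattern opens the label list for one utterance.
            current = ln[1:-1]
            entries[current] = []
        elif ln == ".":
            # A lone period closes the current utterance.
            current = None
        elif current is not None:
            parts = ln.split()
            if len(parts) >= 3 and parts[0].isdigit() and parts[1].isdigit():
                entries[current].append((int(parts[0]), int(parts[1]), parts[2]))
            else:
                # Assumed fallback: a label line without time stamps.
                entries[current].append((None, None, parts[0]))
    return entries


def duration_seconds(label: Label) -> float:
    """Convert a timed label's span from HTK units to seconds."""
    start, end, _ = label
    if start is None or end is None:
        raise ValueError("label has no time stamps")
    return (end - start) * HTK_TIME_UNIT


# Hypothetical phone-level entry, not taken from the corpus.
SAMPLE = '''#!MLF!#
"*/s08_001.lab"
0 1500000 sil
1500000 4200000 d
4200000 7000000 a
.
'''

if __name__ == "__main__":
    for pattern, labels in parse_mlf(SAMPLE).items():
        for lab in labels:
            print(pattern, lab[2], round(duration_seconds(lab), 3))
```

The parser keeps the quoted pattern as the dictionary key rather than resolving it to a waveform path, since how label entries map to SPHERE file names is not stated in this record.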

Format: Sampling Rate: 22050

Format: Sampling Format: 1-channel pcm
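As a rough illustration of what these format values imply, the sketch below estimates the raw audio volume and the mean utterance length from figures given in the description (16-bit samples, 22,050 Hz, one channel, about four hours, 4,181 files). The four-hour figure is approximate in the source, and SPHERE headers and any compression are ignored, so the results are ballpark estimates only.

```python
SAMPLE_RATE_HZ = 22050      # from the Format field
BYTES_PER_SAMPLE = 2        # 16-bit PCM, per the description
CHANNELS = 1                # 1-channel pcm
TOTAL_SECONDS = 4 * 3600    # "approximately four hours" (approximate)
NUM_FILES = 4181            # speech files in the corpus

# Uncompressed audio data rate and total payload size.
bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
total_bytes = bytes_per_second * TOTAL_SECONDS

# Average length of one recorded utterance.
mean_seconds_per_file = TOTAL_SECONDS / NUM_FILES

if __name__ == "__main__":
    print(f"data rate: {bytes_per_second} bytes/s")
    print(f"raw audio: {total_bytes / 1e6:.0f} MB")
    print(f"mean utterance length: {mean_seconds_per_file:.2f} s")
```

Under these assumptions the raw audio comes to roughly 635 MB, with an average utterance of about 3.4 seconds.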

Identifier: LDC2003S05

Identifier: https://catalog.ldc.upenn.edu/LDC2003S05

Identifier: ISBN: 1-58563-277-5

Identifier: ISLRN: 741-782-638-900-9

Identifier: DOI: 10.35111/7rt8-8x28

Language: Russian

Language (ISO639): rus

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2003S05

Rights Holder: Portions © 2003 United States Military Academy, © 2003 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2003S05

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
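The GetRecord entries in this record refer to OAI-PMH requests, whose targets are not preserved here. As a hedged sketch of how such a request is typically formed, the Python below builds a GetRecord URL from the standard OAI-PMH query parameters (`verb`, `identifier`, `metadataPrefix`). The base URL is a placeholder, not the archive's real endpoint, and the `olac` prefix is an assumption about how the OLAC format is named; `oai_dc` is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
HYPOTHETICAL_BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2003S05"  # from OaiIdentifier


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "olac" is assumed to be the prefix for the OLAC format.
    print(get_record_url(HYPOTHETICAL_BASE_URL, OAI_IDENTIFIER, "olac"))
    # "oai_dc" is the OAI-PMH prefix for simple Dublin Core.
    print(get_record_url(HYPOTHETICAL_BASE_URL, OAI_IDENTIFIER, "oai_dc"))
```

The colons in the identifier are percent-encoded by `urlencode`, which is valid for a query string.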

Search Info
Citation: LaRocca, Stephen A.; Tomei, Christine. 2003. Linguistic Data Consortium.
Terms: area_Europe country_RU dcmi_Sound iso639_rus olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003S05
Up-to-date as of: Wed Oct 29 7:00:14 EDT 2025
