OLAC Record: RATS Language Identification

OLAC Record
oai:www.ldc.upenn.edu:LDC2018S10

Metadata

Title: RATS Language Identification

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, et al. RATS Language Identification LDC2018S10. Web Download. Philadelphia: Linguistic Data Consortium, 2018

Contributor: Graff, David

Ma, Xiaoyi

Strassel, Stephanie

Walker, Kevin

Jones, Karen

Date (W3CDTF): 2018

Date Issued (W3CDTF): 2018-07-16

Description: *Introduction* RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and comprises approximately 600 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The audio was retransmitted over eight channels, yielding a total of 5,400 hours of audio. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. The goal of the RATS program was to develop human language technology systems capable of performing speech detection, language identification, speaker identification and keyword spotting on the severely degraded audio signals typical of radio communication channels, especially those employing handheld portable transceiver systems. To support that goal, LDC assembled a system for the transmission, reception and digital capture of audio data that allowed a single source audio signal to be distributed and recorded over eight distinct transceiver configurations simultaneously. Those configurations included three frequencies -- high, very high and ultra high -- variously combined with amplitude modulation, frequency hopping spread spectrum, narrow-band frequency modulation, single-side-band or wide-band frequency modulation. Annotations on the clear source audio signal, e.g., time boundaries for the duration of speech activity, were projected onto the corresponding eight channels recorded from the radio receivers.
*Data* The source audio consists of: (1) conversational telephone speech (CTS) recordings, taken either from previous LDC CTS corpora or from CTS data collected specifically for the RATS program from native speakers of Levantine Arabic, Pashto, Urdu, Farsi and Dari; and (2) portions of VOA broadcast news recordings, taken from data used in the 2009 NIST Language Recognition Evaluation. The 2009 LRE Test Set is available from LDC as LDC2014S06. CTS recordings were audited by annotators who listened to short segments and determined whether the audio was in the target language. Annotations on the audio files include start time, end time, speech activity detection (SAD) label, SAD provenance, language ID and LID provenance. All audio files are single-channel, 16-bit PCM, sampled at 16000 samples per second. Lossless FLAC compression is used on all files; when uncompressed, the files have typical "MS-WAV" (RIFF) file headers. The data is divided into training, initial development and initial evaluation sets.
*Samples* Please view this audio sample.
*Updates* None at this time.
*Acknowledgment* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D10PC20016. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Extent: Corpus size: 298512480 KB

Format: Sampling Rate: 16000

Sampling Format: pcm
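
The Extent and Format values can be cross-checked against the figures given in the Description. The following is a rough consistency check, not an LDC calculation. It rests on two assumptions that the record does not state: that the 5,400-hour total counts the roughly 600 source hours plus eight retransmitted copies (9 x 600), and that "KB" in the Extent field means 1,000 bytes. The corpus size also covers annotations and documentation, so the resulting ratio is only an approximate indication of FLAC compression.

```python
# Figures taken from the record; SOURCE_HOURS is approximate.
SOURCE_HOURS = 600
RETRANSMIT_CHANNELS = 8
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2          # 16-bit PCM, single channel
CORPUS_SIZE_KB = 298_512_480  # Extent field


def total_hours(source_hours: int, channels: int) -> int:
    """Source audio plus one retransmitted copy per channel (assumption)."""
    return source_hours * (channels + 1)


def uncompressed_bytes(hours: int, rate_hz: int, bytes_per_sample: int) -> int:
    """Raw PCM payload size, ignoring RIFF headers."""
    return hours * 3600 * rate_hz * bytes_per_sample


hours = total_hours(SOURCE_HOURS, RETRANSMIT_CHANNELS)
raw_bytes = uncompressed_bytes(hours, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
compressed_bytes = CORPUS_SIZE_KB * 1000  # assumption: 1 KB = 1,000 bytes
ratio = compressed_bytes / raw_bytes

print(f"total hours: {hours}")
print(f"uncompressed PCM: {raw_bytes / 1e9:.1f} GB")
print(f"distributed size: {compressed_bytes / 1e9:.1f} GB (ratio {ratio:.2f})")
```

Under these assumptions the distributed corpus is a little under half the size of the raw PCM, which is plausible for losslessly compressed speech.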

Identifier: LDC2018S10

https://catalog.ldc.upenn.edu/LDC2018S10

ISBN: 1-58563-852-8

ISLRN: 190-505-311-077-0

DOI: 10.35111/xjdn-0g13

Language: South Levantine Arabic

North Levantine Arabic

Persian

Dari

Pushto

Urdu

Language (ISO639): ajp

apc

fas

prs

pus

urd

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2018S10

Rights Holder: Portions © 2000, 2001, 2004, 2005, 2007, 2014, 2018 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2018S10

DateStamp: 2021-10-15

GetRecord: OAI-PMH request for simple DC format
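
For reference, an OAI-PMH GetRecord request is an HTTP request carrying the arguments `verb`, `identifier` and `metadataPrefix`; `oai_dc` is the prefix the protocol reserves for simple Dublin Core. The sketch below only builds such request URLs. The base URL is a placeholder rather than the archive's actual endpoint, and the `olac` prefix is an assumption about how the repository names its OLAC format.

```python
from urllib.parse import urlencode

# Placeholder endpoint: replace with the repository's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2018S10"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")  # assumed prefix

print(dc_url)
print(olac_url)
```

The colons in the identifier are percent-encoded by `urlencode`, which is what a conforming harvester is expected to send.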

Search Info
Citation: Graff, David; Ma, Xiaoyi; Strassel, Stephanie; Walker, Kevin; Jones, Karen. 2018. Linguistic Data Consortium.
Terms: area_Asia country_AF country_JO country_PK country_SY dcmi_Sound iso639_ajp iso639_apc iso639_fas iso639_prs iso639_pus iso639_urd olac_primary_text
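
The search terms appear to follow a `facet_value` pattern (for example `iso639_urd` or `country_PK`). That reading is inferred from the strings themselves rather than from any documented scheme. Assuming it holds, a minimal sketch for grouping the terms by facet could look like this:

```python
from collections import defaultdict

TERMS = (
    "area_Asia country_AF country_JO country_PK country_SY dcmi_Sound "
    "iso639_ajp iso639_apc iso639_fas iso639_prs iso639_pus iso639_urd "
    "olac_primary_text"
)


def group_terms(terms: str) -> dict[str, list[str]]:
    """Group whitespace-separated terms by the text before the first underscore."""
    grouped: defaultdict[str, list[str]] = defaultdict(list)
    for term in terms.split():
        # Split only on the first underscore so values such as
        # "primary_text" stay intact.
        facet, _, value = term.partition("_")
        grouped[facet].append(value)
    return dict(grouped)


facets = group_terms(TERMS)
for facet, values in facets.items():
    print(f"{facet}: {', '.join(values)}")
```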

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2018S10
Up-to-date as of: Wed Oct 29 7:01:49 EDT 2025
