OLAC Record: RT-03 MDE Training Data Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2004S08

Metadata

Title: RT-03 MDE Training Data Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Strassel, Stephanie, Christopher Walker, and Haejoong Lee. RT-03 MDE Training Data Speech LDC2004S08. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Strassel, Stephanie

Walker, Christopher

Lee, Haejoong

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-06-15

Description: *Introduction*

The MDE RT-03 Training Data Speech corpus was produced by the Linguistic Data Consortium (LDC) and contains approximately 70 hours of English Conversational Telephone Speech (CTS) and Broadcast News (BN) audio data. This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE) and was distributed as training data for the RT-03F evaluation cycle. The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. The corresponding transcripts and annotations for these speech files are available as MDE RT-03 Training Data Text and Annotations (LDC2004T12).

*Data*

There are 633 files, totalling approximately 5.39 GB (uncompressed). The corpus contains 23 hours of BN and over 46 hours of CTS. The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62). The BN speech data was drawn from 1997 English Broadcast News Speech (HUB4) (LDC98S71), from four distinct sources:

* American Broadcasting Company (ABC) (1998, 2001)
* National Broadcasting Company (NBC) (1998, 2001)
* Public Radio International (PRI) (1998)
* Cable News Network (CNN) (2001)

The audio data in this corpus conforms to the following technical specifications:

Type  Format  Encoding    Channels  Sample Rate
CTS   WAVE    u-Law       2         8000/sec
BN    WAVE    16-bit PCM  1         16000/sec

Note that the data is in WAVE format. This is the audio file format that our MDE annotation tool supports. Since the annotation data is best explored with this open-source annotation tool, WAVE is our choice of data format.

The transcripts corresponding to this speech have been annotated for various kinds of metadata. The goal of MDE is to create automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task.

Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and interruption points are tagged.

Annotators further identify SUs (alternately semantic units, sense units, syntactic units, slash units, or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels, and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).

General information about the EARS MDE Annotation effort can be found at LDC's past projects page.

*Samples*

* Audio (wav)

*Updates*

There are no updates available at this time.

Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania
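The technical specifications given in the description (u-Law, 2 channels, 8000 samples/sec for CTS; 16-bit PCM, 1 channel, 16000 samples/sec for BN) can be checked against a file's header. The following is a minimal sketch, assuming the files are standard RIFF/WAVE files with a `fmt ` chunk, which this record does not confirm. The format tags 1 (linear PCM) and 7 (mu-law) are the standard WAVE codec identifiers; the names `EXPECTED`, `read_fmt_chunk`, `matches`, and `make_header` are hypothetical helpers introduced for illustration, and `make_header` only builds a header-only stream in memory so the sketch can run without corpus data.

```python
import io
import struct

# Expected header fields, taken from the specification table in the description.
# WAVE format tags: 1 = linear PCM, 7 = mu-law (8 bits per sample).
EXPECTED = {
    "CTS": {"format_tag": 7, "channels": 2, "sample_rate": 8000, "bits_per_sample": 8},
    "BN": {"format_tag": 1, "channels": 1, "sample_rate": 16000, "bits_per_sample": 16},
}


def read_fmt_chunk(stream):
    """Return the main fields of the 'fmt ' chunk of a RIFF/WAVE stream."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no 'fmt ' chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            body = stream.read(chunk_size)
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", body[:16]
            )
            return {
                "format_tag": tag,
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Skip any other chunk; chunk bodies are padded to an even length.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def matches(kind, fmt):
    """True if the header fields equal the specification for 'CTS' or 'BN'."""
    return fmt == EXPECTED[kind]


def make_header(tag, channels, rate, bits):
    """Build a header-only WAVE stream in memory (demonstration data only)."""
    align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate, rate * align, align, bits)
    body = (
        b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"data" + struct.pack("<I", 0)
    )
    return io.BytesIO(b"RIFF" + struct.pack("<I", len(body)) + body)


if __name__ == "__main__":
    cts_like = read_fmt_chunk(make_header(7, 2, 8000, 8))
    print(matches("CTS", cts_like), matches("BN", cts_like))
```

For a real file, the same check would be `read_fmt_chunk(open(path, "rb"))`, where `path` is whatever local path the audio was downloaded to. The header is parsed by hand because not every audio library decodes u-Law WAVE files.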

Extent: Corpus size: 5242880 KB
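As a rough cross-check, the extent can be compared with the size implied by the durations and audio formats stated in the description. This is a sketch under stated assumptions: that "KB" here means 1024 bytes, that the durations are exactly 46 and 23 hours (the description says "over 46 hours" of CTS), and that file headers are ignored, so the estimate is a lower bound. The function name `audio_bytes` is introduced only for illustration.

```python
def audio_bytes(hours, sample_rate, channels, bytes_per_sample):
    """Raw audio payload size in bytes for uncompressed sample data."""
    return int(hours * 3600 * sample_rate * channels * bytes_per_sample)


# Extent field, assuming 1 KB = 1024 bytes.
extent_bytes = 5242880 * 1024

# CTS: u-Law (1 byte per sample), 2 channels, 8000 samples/sec.
cts_bytes = audio_bytes(46, 8000, 2, 1)
# BN: 16-bit PCM (2 bytes per sample), 1 channel, 16000 samples/sec.
bn_bytes = audio_bytes(23, 16000, 1, 2)

estimate = cts_bytes + bn_bytes

if __name__ == "__main__":
    print(f"extent:   {extent_bytes / 1e9:.2f} GB ({extent_bytes / 1024**3:.2f} GiB)")
    print(f"estimate: {estimate / 1e9:.2f} GB")
```

Under these assumptions the extent is exactly 5 GiB (about 5.37 GB), and the duration-based estimate is about 5.30 GB, which agrees with the "approximately 5.39 GB" quoted in the description.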

Format: Sampling Format: u-law, pcm

Identifier: LDC2004S08

https://catalog.ldc.upenn.edu/LDC2004S08

ISBN: 1-58563-300-3

ISLRN: 111-911-583-428-2

DOI: 10.35111/2jrd-8f06

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004S08

Rights Holder: The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston. Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004S08

DateStamp: 2024-03-29

GetRecord: OAI-PMH request for simple DC format
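The GetRecord links above correspond to OAI-PMH requests, which take the arguments `verb`, `identifier`, and `metadataPrefix`. A minimal sketch of how such a request URL could be built follows. The base URL `https://example.org/oai` is a placeholder, not the actual endpoint (this record does not state it), and the function name `get_record_url` is hypothetical. `oai_dc` is the prefix OAI-PMH defines for simple Dublin Core; `olac` as the prefix for the OLAC format is an assumption here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Placeholder endpoint; substitute the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2004S08"

dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")  # assumed prefix

if __name__ == "__main__":
    print(dc_url)
    # Decode the query string to confirm the identifier survives URL encoding.
    print(parse_qs(urlsplit(dc_url).query))
```

The identifier contains colons, so the query string is built with `urlencode` rather than by plain string concatenation.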

Search Info
Citation: Strassel, Stephanie; Walker, Christopher; Lee, Haejoong. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004S08
Up-to-date as of: Wed Oct 29 7:00:19 EDT 2025
