OLAC Record: RT-04 MDE Training Data Speech

OLAC Record
oai:www.ldc.upenn.edu:LDC2005S16

Metadata

Title: RT-04 MDE Training Data Speech

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Lee, Haejoong, and Stephanie Strassel. RT-04 MDE Training Data Speech LDC2005S16. Web Download. Philadelphia: Linguistic Data Consortium, 2005

Contributor: Lee, Haejoong

Contributor: Strassel, Stephanie

Date (W3CDTF): 2005

Date Issued (W3CDTF): 2005-08-17

Description:

*Introduction*

RT-04 MDE Training Data Speech was developed by the Linguistic Data Consortium (LDC) and contains approximately 63 hours of English broadcast news (BN) and conversational telephone speech (CTS). The corpus was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The data was previously released to the EARS MDE community as LDC2004E31.

The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means creating automatic transcripts that are maximally readable. Readability might be achieved in a number of ways:

- flagging non-content words, such as filled pauses and discourse markers, for optional removal;
- marking sections of disfluent speech;
- placing boundaries at natural breakpoints in the flow of speech, so that each sentence or other meaningful unit of speech can be presented on a separate line of the resulting transcript.

Natural capitalization, punctuation, and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements of a readable transcript.

LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE. The transcript and annotation files corresponding to this release are available as RT-04 MDE Training Data Text/Annotations (LDC2005T24).

*Data*

The corpus contains 419 files, comprising 22.6 hours of BN and 40.4 hours of CTS. The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62). The BN speech data was drawn from the 1997 English Broadcast News Speech (Hub-4) corpus, from 4 distinct sources:

Name                             Abbreviation   Years Collected
American Broadcasting Company    ABC            1998, 2001
National Broadcasting Company    NBC            1998, 2001
Public Radio International       PRI            1998
Cable News Network               CNN            2001

The audio data in this corpus conforms to the following technical specifications:

Type   Format   Encoding     Channels   Sample Rate
CTS    WAVE     u-Law        2          8000 Hz
BN     WAVE     16-bit PCM   1          16000 Hz

*Samples*

LDC provides a broadcast news (WAV) sample and a telephone conversation (WAV) sample as examples of the data in this publication.

*Updates*

None at this time.
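The audio specifications in the Description imply a raw data rate for each type of file, which gives a rough idea of the download size. The sketch below is a back-of-the-envelope estimate only. It assumes that the reported hours are the playing time of the files themselves (for CTS, of the two-channel files rather than of each side counted separately), that u-Law uses 1 byte per sample, and that WAVE header overhead and any compression in the distributed package are negligible. The names `SPECS`, `bytes_per_second`, and `estimated_bytes` are illustrative and not part of the corpus documentation.

```python
# Rough storage estimate derived from the audio specifications in this record.
# Assumptions (not stated in the record): hours are file playing time,
# u-Law stores 1 byte per sample, and header overhead is ignored.
SPECS = {
    "CTS": {"rate_hz": 8000, "channels": 2, "bytes_per_sample": 1, "hours": 40.4},
    "BN": {"rate_hz": 16000, "channels": 1, "bytes_per_sample": 2, "hours": 22.6},
}


def bytes_per_second(spec):
    """Raw data rate: sample rate x channels x bytes per sample."""
    return spec["rate_hz"] * spec["channels"] * spec["bytes_per_sample"]


def estimated_bytes(spec):
    """Approximate total size in bytes for the stated number of hours."""
    return round(spec["hours"] * 3600 * bytes_per_second(spec))


if __name__ == "__main__":
    for name, spec in SPECS.items():
        size_gb = estimated_bytes(spec) / 1e9
        print(f"{name}: {bytes_per_second(spec)} bytes/sec, about {size_gb:.2f} GB")
```

Under these assumptions both data types come out at a few gigabytes each; the actual package size may differ.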

Format: Sampling Rate: varied

Format: Sampling Format: varied

Identifier: LDC2005S16

Identifier: https://catalog.ldc.upenn.edu/LDC2005S16

ISBN: 1-58563-355-0

ISLRN: 514-959-558-272-6

DOI: 10.35111/27r9-h809

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2005S16

Rights Holder: Portions © 2004 Trustees of the University of Pennsylvania, © 2003 American Broadcasting Company, © 2003 National Broadcasting Company, © 2003 Public Radio International, © 2003 Cable News Network, Inc. All Rights Reserved, © 2003 National Cable Satellite Corporation.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Sound

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2005S16

DateStamp: 2022-01-20

GetRecord: OAI-PMH request for simple DC format
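The GetRecord entries above refer to OAI-PMH requests. As a minimal sketch of how such a request URL is put together, the function below combines the standard OAI-PMH arguments `verb`, `identifier`, and `metadataPrefix`. The base URL is a placeholder, because this record does not state the repository endpoint, and the helper name `get_record_url` is hypothetical. The prefixes `olac` and `oai_dc` are the conventional ones for the OLAC and simple Dublin Core formats; a given repository may advertise others.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2005S16"
    print(get_record_url(oai_id, "olac"))    # OLAC format
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
```

`urlencode` percent-encodes the colons in the identifier, which OAI-PMH requires for reserved characters in request arguments.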

Search Info
Citation: Lee, Haejoong; Strassel, Stephanie. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Sound iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005S16
Up-to-date as of: Wed Oct 29 7:00:51 EDT 2025
