OLAC Record: RT-03 MDE Training Data Text and Annotations

OLAC Record
oai:www.ldc.upenn.edu:LDC2004T12

Metadata

Title: RT-03 MDE Training Data Text and Annotations

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Strassel, Stephanie, Christopher Walker, and Haejoong Lee. RT-03 MDE Training Data Text and Annotations LDC2004T12. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Strassel, Stephanie

Walker, Christopher

Lee, Haejoong

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-06-15

Description: *Introduction*

MDE RT-03 Training Data Text and Annotations was produced by the Linguistic Data Consortium (LDC) and contains transcripts and metadata annotations for approximately 20 hours of Broadcast News (BN) and 40 hours of Conversational Telephone Speech (CTS) in English. This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. The corresponding speech data for these files is available as MDE RT-03 Training Data Speech (LDC2004S08).

*Data*

There are 633 files, totalling approximately 747 MB with a total of 764,978 tokens. The annotated data was originally distributed as training data for the RT-03F evaluation cycle.

The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62). There are two sets of CTS data. The main set is located in the "train" folder of the release and contains 377 files of text and annotation representing 40 hours of audio. The "meteer-mapped" folder contains another 40 files with Meteer annotation representing approximately 6 hours of audio. The Meteer annotation specifications differ from the SimpleMDE specifications in important ways; these files are included to compare the two different annotation modes.

The BN speech data was drawn from 1997 English Broadcast News Speech (HUB4) (LDC98S71), from four distinct sources:

* American Broadcasting Company (1998, 2001)
* National Broadcasting Company (1998, 2001)
* Public Radio International (1998)
* Cable News Network (2001)

In simple terms, the main goal of MDE is the creation of automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and interruption points are tagged.

Annotators further identify SUs (alternately semantic units, sense units, syntactic units, slash units, or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, in this case by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels, and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).

The data appears in two formats. The AG Atlas (ag.xml) format represents the native annotation format, and utilizes the Annotation Graph Library. The data is also provided in RTTM format developed by NIST to support the EARS Program. The RTTM format labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme; edit, filler, SU, etc.

General information about the EARS MDE Annotation effort can be found under the EARS heading at LDC's Past Projects Page.

*Samples*

* Annotation (AG Atlas)
* Annotation (RTTM)

*Updates*

There are no updates available at this time.

Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania
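The description above says that RTTM labels each token by type (lexeme, non-lexeme, edit, filler, SU, and so on). As a rough illustration of reading such a file, here is a minimal Python sketch. It assumes the commonly documented nine-field, whitespace-separated RTTM layout (type, file, channel, start time, duration, orthography, subtype, speaker name, confidence), with `<NA>` for empty fields and `;;` starting a comment line. The sample lines, file ID, and speaker name are invented for illustration and are not taken from the corpus; the layout should be checked against the documentation shipped with the release.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class RttmRecord:
    """One RTTM line, under the assumed nine-field layout."""
    type: str
    file: str
    channel: str
    tbeg: Optional[float]
    tdur: Optional[float]
    ortho: Optional[str]
    subtype: Optional[str]
    speaker: Optional[str]
    conf: Optional[float]


def _na(value: str) -> Optional[str]:
    # "<NA>" marks an empty field in the assumed layout.
    return None if value == "<NA>" else value


def _to_float(value: str) -> Optional[float]:
    cleaned = _na(value)
    return None if cleaned is None else float(cleaned)


def parse_rttm(lines: Iterable[str]) -> Iterator[RttmRecord]:
    """Yield one record per data line, skipping blanks and ';;' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
        rtype, file_id, chnl, tbeg, tdur, ortho, stype, name, conf = fields
        yield RttmRecord(
            type=rtype,
            file=file_id,
            channel=chnl,
            tbeg=_to_float(tbeg),
            tdur=_to_float(tdur),
            ortho=_na(ortho),
            subtype=_na(stype),
            speaker=_na(name),
            conf=_to_float(conf),
        )


def count_types(records: Iterable[RttmRecord]) -> Counter:
    """Tally how many records of each RTTM type were seen."""
    return Counter(record.type for record in records)


# Invented sample lines, for illustration only.
SAMPLE = """\
;; hypothetical excerpt
LEXEME sw_0001 1 0.50 0.20 uh fp spkr_A <NA>
FILLER sw_0001 1 0.50 0.20 <NA> filled_pause spkr_A <NA>
SU sw_0001 1 0.50 2.10 <NA> statement spkr_A <NA>
"""

if __name__ == "__main__":
    for rtype, n in count_types(parse_rttm(SAMPLE.splitlines())).items():
        print(rtype, n)
```

Raising on a line with an unexpected field count, rather than skipping it silently, makes a layout mismatch with the actual release visible immediately.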

Identifier: LDC2004T12

https://catalog.ldc.upenn.edu/LDC2004T12

ISBN: 1-58563-301-1

ISLRN: 754-359-961-593-5

DOI: 10.35111/ztjc-kx37

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004T12

Rights Holder: The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston. Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004T12

DateStamp: 2024-03-19

GetRecord: OAI-PMH request for simple DC format
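The GetRecord entries above refer to OAI-PMH requests for this record. A minimal sketch of how such a request URL is assembled follows. The `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, and `oai_dc` is the protocol's standard prefix for simple Dublin Core. The base URL below is a hypothetical placeholder, not the archive's real endpoint, and the `olac` prefix for the OLAC format is an assumption to confirm against the repository's ListMetadataFormats response.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical placeholder; substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


def parse_request(url: str) -> dict:
    """Decode the query string back into a dict, e.g. to check a built URL."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2004T12"
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
    print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
```

`urlencode` percent-encodes the colons in the identifier, which keeps the query string valid regardless of how the server treats reserved characters.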

Search Info
Citation: Strassel, Stephanie; Walker, Christopher; Lee, Haejoong. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T12
Up-to-date as of: Wed Oct 29 7:00:22 EDT 2025
