OLAC Record
oai:www.ldc.upenn.edu:LDC2005T24

Metadata
Title:RT-04 MDE Training Data Text/Annotations
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Walker, Christopher, et al. RT-04 MDE Training Data Text/Annotations LDC2005T24. Web Download. Philadelphia: Linguistic Data Consortium, 2005
Contributor:Walker, Christopher
Strassel, Stephanie
Shriberg, Elizabeth
Liu, Yang
Ang, Jeremy
Lee, Haejoong
Date (W3CDTF):2005
Date Issued (W3CDTF):2005-08-17
Description:*Introduction* RT-04 MDE Training Data Text/Annotations was developed by the Linguistic Data Consortium (LDC) and contains annotated transcripts of approximately 60 hours of English speech. This corpus was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The data set was created and is distributed by LDC; it was previously released to the EARS MDE community as LDC2004E31.

The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means creating automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries at natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, standardized spelling, and sensible conventions for representing speaker turns and identity are further elements in the readable transcript.

LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE. The speech files corresponding to this release are available as RT-04 MDE Training Data Speech (LDC2005S16).

*Data* In this release, some original annotations contained in LDC2004E31 have been re-mapped to new MDE elements to support better annotation consistency. In particular, the mapping affects Discourse Responses (DR), Discourse Markers (DM), and Backchannel SUs (BC). The data directories contain a variety of file formats: MDE AG XML (.ag.xml), RTTM (.rttm), and UEM (.uem) files. MDE AG XML is the LDC internal file format, RTTM is the official file format of the MDE program, and the UEM file specifies the portion of a speech file that is subject to MDE evaluation. The RTTM and UEM files have been generated using a conversion program developed by the National Institute of Standards and Technology (NIST).

*Samples* For examples of the data in this corpus, please review the following .xml samples.
* Broadcast News Annotations (XML)
* Telephone Conversation Annotation (XML)

*Updates* None at this time.
Identifier:LDC2005T24
https://catalog.ldc.upenn.edu/LDC2005T24
ISBN: 1-58563-358-5
ISLRN: 314-507-149-954-5
DOI: 10.35111/qwyc-cw15
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2005T24
Rights Holder:Portions © 2004 Trustees of the University of Pennsylvania, © 2003 American Broadcasting Company, © 2003 National Broadcasting Company, © 2003 Public Radio International, © 2003 Cable News Network, Inc. All Rights Reserved, © 2003 National Cable Satellite Corporation

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2005T24
DateStamp:  2021-07-23
GetRecord:  OAI-PMH request for simple DC format
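The GetRecord entries above refer to OAI-PMH requests. As a rough sketch of what such a request looks like, the snippet below builds a GetRecord URL from the identifier in this record. The `verb`, `identifier`, and `metadataPrefix` arguments are standard OAI-PMH; the base URL is a placeholder, since this record does not state the repository endpoint, and the `olac` prefix is assumed to be the one the archive uses for OLAC-format records.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:www.ldc.upenn.edu:LDC2005T24"

dc_url = get_record_url(OAI_ID, "oai_dc")   # simple Dublin Core
olac_url = get_record_url(OAI_ID, "olac")   # OLAC format (assumed prefix)
print(dc_url)
print(olac_url)
```

Substituting the archive's real base URL for the placeholder should yield requests equivalent to the two GetRecord links listed in this record.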

Search Info

Citation: Walker, Christopher; Strassel, Stephanie; Shriberg, Elizabeth; Liu, Yang; Ang, Jeremy; Lee, Haejoong. 2005. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2005T24
Up-to-date as of: Mon Mar 25 7:20:08 EDT 2024