OLAC Record: Unified Linguistic Annotation Text Collection

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T07

Metadata

Title: Unified Linguistic Annotation Text Collection

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Linguistic Data Consortium. Unified Linguistic Annotation Text Collection LDC2009T07. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Linguistic Data Consortium

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-03-17

Description: *Introduction* Unified Linguistic Annotation Text Collection consists of two separate corpora: The Language Understanding Annotation Corpus (LDC2009T10) and REFLEX Entity Translation Training/DevTest (LDC2009T11). Most recent annotation efforts for language have focused on small pieces of the larger problem of semantic annotation rather than producing a single unified representation. The Unified Linguistic Annotation (ULA) project, sponsored by the National Science Foundation, seeks to integrate into one framework different layers of annotation (e.g., semantics, discourse, temporal, opinions) using various existing resources, including PropBank, NomBank, TimeBank, Penn Discourse Treebank and coreference and opinion annotations. The project represents a concerted effort of researchers from several institutions to develop a large word corpus with balanced and annotated data. The ULA Text Collection is provided as a resource for the ULA effort. It consists of two datasets: the Language Understanding Annotation Corpus from the Johns Hopkins Center of Excellence in Human Language Technology, and ACE REFLEX Entity Translation Training/DevTest, developed by LDC.
*The Language Understanding Annotation Corpus (LDC2009T10)* The Language Understanding Annotation Corpus consists of over 9000 words of English text (6949 words) and Arabic text (2183 words) annotated for committed belief, event and entity coreference, dialog acts and temporal relations. The materials were chosen from various sources to represent "informal input," that is, text that contains colloquial forms. The documents in the corpus include excerpts from newswire stories, telephone conversation transcripts, emails, contracts and written instructions.
*REFLEX Entity Translation Training/DevTest (LDC2009T11)* REFLEX Entity Translation Training/DevTest is the complete set of training data and development test data for the 2007 REFLEX Entity Translation evaluation sponsored by the National Institute of Standards and Technology (NIST). It contains approximately 67.5k words of newswire and weblog text for each of English, Chinese and Arabic (or approximately 22.5k words in each language) translated into each of the other two languages. The data is annotated for entities and TIMEX2 extents and normalization.
*Samples* Please view this LDC2009T10 sample and LDC2009T11 sample.

Extent: Corpus size: 364544 KB

Identifier: LDC2009T07

https://catalog.ldc.upenn.edu/LDC2009T07

ISBN: 1-58563-511-1

ISLRN: 369-443-379-033-6

DOI: 10.35111/gh95-sk17

Language: English

Mandarin Chinese

Standard Arabic

Arabic

Language (ISO639): eng

cmn

arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Rights Holder: Portions © 1998-2000, 2003 Agence France Presse, © 2000 Al Hayat, © 2000, 2003 The Associated Press, © 2000, 2002 An Nahar, © 2003, 2005 Cable News Network, LP, LLLP, © 1987-1989 Dow Jones & Company, Inc., © 2003 Indiana Center for Intercultural Communication, © 2000 New York Times, © 1994-1998, 2000-2003 Xinhua News Agency, © 1992-2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T07

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
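
The GetRecord entries above refer to OAI-PMH requests for this record in the OLAC and simple DC formats. As a rough sketch of how such a request URL is typically formed, the Python below builds the query string from the OaiIdentifier listed in this record. The base URL is a hypothetical placeholder, not the repository's actual endpoint, and the `olac` and `oai_dc` metadata prefixes are assumed to be the ones the repository exposes.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the repository's real OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2009T07"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for a single record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by `urlencode`, so the identifier can be passed exactly as it appears in the record.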

Search Info
Citation: Linguistic Data Consortium. 2009. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_CN country_GB country_SA dcmi_Text iso639_ara iso639_arb iso639_cmn iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T07
Up-to-date as of: Wed Oct 29 7:01:07 EDT 2025
