OLAC Record: Datasets for Generic Relation Extraction (reACE)

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T08

Metadata

Title: Datasets for Generic Relation Extraction (reACE)

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hachey, Benjamin, Claire Grover, and Richard Tobin. Datasets for Generic Relation Extraction (reACE) LDC2011T08. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Hachey, Benjamin

Grover, Claire

Tobin, Richard

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-06-17

Description: *Introduction* Datasets for Generic Relation Extraction (reACE) was developed at The University of Edinburgh, Edinburgh, Scotland. It consists of English broadcast news and newswire data originally annotated for the ACE (Automatic Content Extraction) program, to which the Edinburgh Regularized ACE (reACE) mark-up has been applied. The Edinburgh relation extraction (RE) task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and to recode it in a format, such as a relational database or an RDF triple store (a database for the storage and retrieval of Resource Description Framework (RDF) metadata), that can be used more effectively for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic RE systems in different domains. However, comparative evaluation is impeded because these corpora use different markup formats and different notions of what constitutes a relation. reACE addresses this problem by converting the data to a common document type that uses token standoff and includes detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. The data in this corpus consists of newswire and broadcast news material from ACE 2004 Multilingual Training Corpus (LDC2005T09) and ACE 2005 Multilingual Training Corpus (LDC2006T06). This material has been standardised for evaluation of multi-type RE across domains. Complete documentation for this corpus is available at the publication provider's web site, Datasets for Generic Relation Extraction.
*Data* Annotation includes (1) a refactored version of the original data in a common XML document type, (2) linguistic information from LT-TTT (a system for tokenizing text and adding markup) and MINIPAR (an English parser), and (3) a normalised version of the original RE markup that complies with a shared notion of what constitutes a relation across domains. The data sources represented in the corpus were collected by LDC in 2000 and 2003 and consist of the following: ABC, Agence France Presse, Associated Press, Cable News Network, MSNBC/NBC, New York Times, Public Radio International, Voice of America and Xinhua News Agency.
*Samples* For an example of the data contained in this corpus, please examine this sample file.
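The Introduction describes recoding extracted relations into a queryable form such as a triple store. The following is a minimal sketch of that idea only, using the two example relations from the description. The `Triple` type, the `query` helper and the predicate names `works_for` and `encodes` are illustrative assumptions; they are not part of the reACE markup or of any RDF library.

```python
from collections import namedtuple

# A relation instance as a (subject, predicate, object) triple.
# Field names are illustrative assumptions, not reACE element names.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# The two example relations mentioned in the description.
# The predicate labels are hypothetical, chosen for readability.
triples = [
    Triple("PersonW", "works_for", "OrganisationX"),
    Triple("GeneY", "encodes", "ProteinZ"),
]


def query(store, subject=None, predicate=None, obj=None):
    """Return the triples that match every field that is not None."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Which organisations appear as the object of a works_for relation?
employers = [t.obj for t in query(triples, predicate="works_for")]
```

Once relations are stored as uniform triples, the same `query` call works regardless of the domain the relation came from, which is the kind of cross-domain comparison the description says the normalised markup is meant to support.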

Extent: Corpus size: 72704 KB

Identifier: LDC2011T08

https://catalog.ldc.upenn.edu/LDC2011T08

ISBN: 1-58563-582-0

ISLRN: 494-554-511-556-5

DOI: 10.35111/6mma-3a80

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T08

Rights Holder: Portions © 2000 American Broadcasting Corporation, © 2000, 2003 Cable News Network, LP, LLP, © 2000 National Broadcasting Company, © 2000 New York Times, © 2000 Public Radio International, © 2000 The Associated Press, © 2005, 2006, 2011 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T08

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
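The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of what such a request looks like, the snippet below builds GetRecord URLs from the OaiIdentifier listed in this record. The base URL is a placeholder assumption (the actual repository endpoint is not given here), and `get_record_url` is a hypothetical helper; `verb`, `identifier` and `metadataPrefix` are the standard OAI-PMH request parameters, with `oai_dc` being the prefix for simple Dublin Core and `olac` the prefix conventionally used for OLAC metadata.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not the real repository address.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2011T08"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ':'.
    return base_url + "?" + urlencode(params)


olac_url = get_record_url(OAI_IDENTIFIER, "olac")      # OLAC format
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")      # simple DC format
```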

Search Info
Citation: Hachey, Benjamin; Grover, Claire; Tobin, Richard. 2011. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T08
Up-to-date as of: Wed Oct 29 7:01:16 EDT 2025
