OLAC Record: Machine Reading Phase 1 IC Training Data

OLAC Record
oai:www.ldc.upenn.edu:LDC2020T04

Metadata

Title: Machine Reading Phase 1 IC Training Data

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Simpson, Heather, et al. Machine Reading Phase 1 IC Training Data LDC2020T04. Web Download. Philadelphia: Linguistic Data Consortium, 2020

Contributor: Simpson, Heather

Strassel, Stephanie

Wright, Jonathan

Griffitt, Kira

Date (W3CDTF): 2020

Date Issued (W3CDTF): 2020-02-17

Description: *Introduction*

Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program.

The Machine Reading (MR) program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the IC (Core Domain) task. The IC Use Cases tested the core domain by extracting information about Entities (people, organizations, geopolitical entities or "GPEs") and their involvement in four types of Relations, as described in newswire text: Attack Relations (e.g. bombings), Biographical Relations (e.g. being a citizen of a country), Affiliation Relations (e.g. being a leader of an organization), and Family Relations (e.g. having a spouse). This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.

*Data*

This release contains 248 source documents (108,960 words) from English newswire stories in English Gigaword Fourth Edition (LDC2009T13). Roughly half of those documents (116) were annotated for IC/Core Use Cases. Annotation was non-exhaustive, but an attempt was made to provide instances of all relations and their arguments where explicitly stated in a single sentence, as well as some non-explicit relations, which were marked with an "Inferred" tag by the annotator.

Annotations are in GUI XML (traditional annotation) and RDF XML (formal knowledge representation) formats. A second set of GUI XML is provided with additional, unofficial annotations.

All source and annotation files are presented as UTF-8 encoded XML files with associated DTDs, schemas or ontologies.

*Acknowledgments*

The Linguistic Data Consortium gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, AFRL, or the US government.

*Samples*

Please view the following samples:

* Source
* RDF XML
* GUI XML
* GUI XML Extended

*Updates*

None at this time.

Extent: Corpus size: 12971 KB

Identifier: LDC2020T04

https://catalog.ldc.upenn.edu/LDC2020T04

ISBN: 1-58563-916-8

ISLRN: 013-884-229-405-9

DOI: 10.35111/tj3x-ce20

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2020T04

Rights Holder: Portions © 1994-1997, 2001-2006 Agence France Presse, © 2002 An Nahar, © 1995-1998, 2000-2001, 2005-2006 The Associated Press, © 1996-1998, 2004, 2006 Los Angeles Times-Washington Post News Service, Inc., © 1994-2002, 2004-2006 New York Times, © 1994 Reuters America, Inc., © 1995-2006 Xinhua News Agency, © 2009, 2020 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2020T04

DateStamp: 2021-01-01

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Simpson, Heather; Strassel, Stephanie; Wright, Jonathan; Griffitt, Kira. 2020. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2020T04
Up-to-date as of: Wed Oct 29 7:01:59 EDT 2025
