OLAC Record: SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T01

Metadata

Title: SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Recasens, Marta, et al. SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages LDC2011T01. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Recasens, Marta

Marquez, Lluis

Sapena, Emili

Martí, M. Antònia

Taulé, Mariona

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-01-24

Description: *Introduction* SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages, Linguistic Data Consortium (LDC) catalog number LDC2011T01 and ISBN 1-58563-572-3, is a subset of OntoNotes Release 2.0 (LDC2008T04) that was used in SemEval-2010 Task 1, Coreference Resolution in Multiple Languages. OntoNotes Release 2.0 consists of roughly 500,000 words of English broadcast and newswire data annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology, and coreference). This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task. SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and compare automatic coreference resolution systems for six languages (Catalan, Dutch, English, German, Italian and Spanish) in four evaluation settings using four metrics. Further information about Task 1 can be found on the task description website. The task organizers included researchers from Universitat de Barcelona (Spain), Universitat Politècnica de Catalunya (Spain), University of Essex (United Kingdom), Università di Trento (Italy), Hogeschool Gent (Belgium), University of Tübingen (Germany) and Stanford University (USA).

*Data* The data is divided into three sets: the development set (*/data/en.devel.txt), which contains 39 documents, 741 sentences and 17,044 tokens; the training set (*/data/en.train.txt), which contains 229 documents, 3,648 sentences and 79,060 tokens; and the test set (*/data/en.test.txt), which contains 85 documents, 1,141 sentences and 24,206 tokens. The complete material for training systems is the union of the development and training sets. Details of the SemEval task formatting applied to the data can be found in the documentation file, en.info.txt.

*Scorer* The official scorer is available from the task download page.

*Updates* An update to this corpus was issued on March 30, 2012. It fixed a bug that caused one annotation error in every document. All data downloaded after that date is the corrected release. Contact ldc@ldc.upenn.edu with any questions.

*Samples* For an example of the data in this publication, please review this text file excerpt.
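As a quick consistency check on the split sizes quoted in the description, the following sketch adds them up. The counts are copied from the record; the `Split` class and variable names are illustrative only and are not part of the release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Size of one data set, as quoted in the record's description."""
    documents: int
    sentences: int
    tokens: int

    def __add__(self, other: "Split") -> "Split":
        # Combine two sets by summing each count.
        return Split(
            self.documents + other.documents,
            self.sentences + other.sentences,
            self.tokens + other.tokens,
        )


# Counts taken verbatim from the description of LDC2011T01.
SPLITS = {
    "devel": Split(documents=39, sentences=741, tokens=17_044),
    "train": Split(documents=229, sentences=3_648, tokens=79_060),
    "test": Split(documents=85, sentences=1_141, tokens=24_206),
}

# Complete training material: development plus training sets.
full_training = SPLITS["devel"] + SPLITS["train"]

# Whole release: all three sets.
release_total = full_training + SPLITS["test"]

print(full_training)
print(release_total)
```

The combined training material comes to 268 documents, 4,389 sentences and 96,104 tokens, and the three sets together hold 120,310 tokens, which matches the "approximately 120,000 words" stated for the release.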

Extent: Corpus size: 8458 KB

Identifier: LDC2011T01

https://catalog.ldc.upenn.edu/LDC2011T01

ISBN: 1-58563-572-3

ISLRN: 365-198-419-802-6

DOI: 10.35111/bmpd-n944

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T01

Rights Holder: Portions © 2000-2001 American Broadcasting Company, © 2000-2001 Cable News Network, LP, LLP, © 1989 Dow Jones & Company, Inc., © 2000-2001 National Broadcasting Company, Inc., © 2000-2001 Public Radio International, © 1995, 2005, 2006, 2007, 2008, 2011 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T01

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
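The two GetRecord entries in this record correspond to OAI-PMH requests that differ only in their `metadataPrefix` argument. A minimal sketch of how such a request URL is assembled follows. The `verb`, `identifier` and `metadataPrefix` arguments come from the OAI-PMH protocol, and `olac` and `oai_dc` are the usual prefixes for the OLAC and simple DC formats. The base URL is a placeholder assumption, since this record does not state the repository endpoint, and `get_record_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Placeholder only: the actual OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"

# Identifier copied from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2011T01"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")      # OLAC format
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")      # simple DC format

print(olac_url)
print(dc_url)
```

Substituting the repository's real endpoint for `BASE_URL` should yield URLs that can be fetched with any HTTP client; the response is an XML document containing the record in the requested format.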

Search Info
Citation: Recasens, Marta; Marquez, Lluis; Sapena, Emili; Martí, M. Antònia; Taulé, Mariona. 2011. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T01
Up-to-date as of: Wed Oct 29 7:01:14 EDT 2025
