OLAC Record: Phrase Detectives Corpus Version 2

OLAC Record
oai:www.ldc.upenn.edu:LDC2019T10

Metadata

Title: Phrase Detectives Corpus Version 2

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Chamberlain, Jon, et al. Phrase Detectives Corpus Version 2 LDC2019T10. Web Download. Philadelphia: Linguistic Data Consortium, 2019

Contributor: Chamberlain, Jon

Paun, Silviu

Yu, Juntao

Kruschwitz, Udo

Poesio, Massimo

Date (W3CDTF): 2019

Date Issued (W3CDTF): 2019-07-15

Description: *Introduction* Phrase Detectives Corpus Version 2 was developed by the School of Computer Science and Electronic Engineering at the University of Essex and consists of approximately 407,000 tokens across 537 documents anaphorically annotated by the Phrase Detectives Game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference. This release is a new version of the Phrase Detectives Corpus (LDC2017T08). It adds significantly more annotated tokens to the data set and supplies, for each markable, a substantial number of judgments expressed by the players and a silver label annotation based on a probabilistic aggregation method for anaphoric information.

The use of GWAPs for creating language resources is growing. In general, they employ non-monetary incentives, such as entertainment, to motivate participation and can be successful for large-scale persistent annotation efforts. Two projects that collect linguistic resources via Phrase Detectives and other similar language-oriented GWAPs are DALI (Disagreements and Language Interpretation), led by Queen Mary University of London and the University of Essex, and the LDC NIEUW (Novel Incentives and Workflows in Linguistic Data Annotation) project through its game site Lingo Boingo, in collaboration with Queen Mary University of London, the University of Essex and other partners.

*Data* The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. The annotation is a simplified form of the coding scheme used in The ARRAU Corpus of Anaphoric Information (LDC2013T22). Players were asked to classify markables as referring or non-referring. Referring noun phrases could be classified either as discourse-new or discourse-old (referring to the same entity as a previous mention). Two types of non-referring expressions are identified: expletives and predicative NPs (called 'properties'). Discourse-old markables include so-called split antecedent plurals, as in "Mary met John. They had dinner together."

All player judgments are stored in MAS-XML format; there are 20 judgments per markable on average, with up to 90 in one case. A silver label extracted from those judgments using the MPA probabilistic annotation method (Paun et al., 2018) is also provided. Wikipedia articles are presented as HTML, and all other source files are presented as plain text. All text is encoded as UTF-8. Annotations are released in three formats: (1) MAS-XML (the format in the first release), (2) a CoNLL-style format based on the CoNLL 2011 and 2012 shared tasks on coreference and (3) CRAC 2018 format.

*Samples* Please view the following samples:
* Source
* CoNLL
* CRAC
* MAS-XML

*Updates* None at this time.
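
The classification scheme described above, and the idea of deriving one label per markable from many player judgments, can be illustrated with a small sketch. It is only an illustration: it uses a plain majority vote, not the MPA method behind the corpus's silver labels, and the label names and the `majority_label` and `is_referring` helpers are hypothetical rather than taken from the corpus files.

```python
from collections import Counter
from enum import Enum


class Judgment(Enum):
    """Hypothetical names for the four categories named in the description."""
    DISCOURSE_NEW = "discourse-new"   # referring, first mention of an entity
    DISCOURSE_OLD = "discourse-old"   # referring, same entity as an earlier mention
    EXPLETIVE = "expletive"           # non-referring
    PROPERTY = "property"             # non-referring predicative NP


def is_referring(label):
    """True for the two referring categories, False for the non-referring ones."""
    return label in (Judgment.DISCOURSE_NEW, Judgment.DISCOURSE_OLD)


def majority_label(judgments):
    """Return the most frequent label among the player judgments.

    Ties go to the label that was seen first. This is a simple majority
    vote used for illustration only; it is NOT the MPA method.
    """
    if not judgments:
        raise ValueError("no judgments for this markable")
    return Counter(judgments).most_common(1)[0][0]


# Three invented player judgments for one markable.
votes = [Judgment.DISCOURSE_OLD, Judgment.DISCOURSE_NEW, Judgment.DISCOURSE_OLD]
print(majority_label(votes).value)
```

A probabilistic aggregation method would typically weight players differently instead of counting every vote equally, which is the main simplification made here.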

Extent: Corpus size: 511728 KB

Identifier: LDC2019T10

https://catalog.ldc.upenn.edu/LDC2019T10

ISBN: 1-58563-893-5

ISLRN: 666-328-454-074-3

DOI: 10.35111/0ypb-ya31

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2019T10

Rights Holder: Portions © 2019 University of Essex, © 2019 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2019T10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
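
An OAI-PMH GetRecord request is an HTTP query carrying `verb`, `identifier` and `metadataPrefix` arguments. The sketch below builds such a URL for this record. The base URL is a placeholder, since the record does not state the repository endpoint, and `get_record_url` is a hypothetical helper; `oai_dc` is the prefix the protocol defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the actual repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="oai_dc", base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


url = get_record_url("oai:www.ldc.upenn.edu:LDC2019T10")
print(url)
```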

Search Info
Citation: Chamberlain, Jon; Paun, Silviu; Yu, Juntao; Kruschwitz, Udo; Poesio, Massimo. 2019. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2019T10
Up-to-date as of: Wed Oct 29 7:01:55 EDT 2025
