OLAC Record: ACE 2007 Multilingual Training Corpus

OLAC Record
oai:www.ldc.upenn.edu:LDC2014T18

Metadata

Title: ACE 2007 Multilingual Training Corpus

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Chen, Song, et al. ACE 2007 Multilingual Training Corpus LDC2014T18. Web Download. Philadelphia: Linguistic Data Consortium, 2014

Contributor: Chen, Song

Maeda, Kazuaki

Walker, Christopher

Strassel, Stephanie

Date (W3CDTF): 2014

Date Issued (W3CDTF): 2014-09-15

Description:

*Introduction*

ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation: Arabic and Spanish newswire data and Arabic weblogs, annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources, including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English, and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The LDC Catalog contains a series of publications from the ACE project and from researchers building on that work. Among them are:

* ACE-2 Version 1.0 (LDC2003T11)
* TIDES Extraction (ACE) 2003 Multilingual Training Data (LDC2004T09)
* ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07)
* ACE 2004 Multilingual Training Corpus (LDC2005T09)
* ACE 2005 Multilingual Training Corpus (LDC2006T06)
* ACE 2005 English SpatialML Annotations (LDC2008T03)
* ACE 2005 Mandarin SpatialML Annotations (LDC2010T09)
* ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (LDC2010T18)
* ACE 2005 English SpatialML Annotations Version 2 (LDC2011T02)
* Datasets for Generic Relation Extraction (reACE) (LDC2011T08)

*Data*

The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005.

Data selection was semi-automatic. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types.

One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P), and TIMEX2 normalization and additional quality control (NORM). NW denotes newswire and WL denotes weblogs.

Arabic
         Words                        Files
         1P       2P       NORM       1P    2P    NORM
NW       58,015   58,015   58,015     257   257   257
WL       40,338   40,338   40,338     121   121   121
Total    98,353   98,353   98,353     378   378   378

Spanish
         Words                           Files
         1P        2P        NORM        1P    2P    NORM
NW       100,401   100,401   100,401     352   352   352
Total    100,401   100,401   100,401     352   352   352

For a given document, there is a source .sgm file together with the .ag.xml and .apf.xml annotation files in each of the three directories "1p", "2p" and "timex2norm". In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files, and plain text .tab files). Note that in many cases two annotation stages produced identical output for a given source text, because no changes were made in the later stage.

All files are presented in UTF-8.

*Samples*

Please view the following samples:

* SGML Sample
* AG XML Sample
* APF XML Sample
* Tab Delimited Sample

*Updates*

None at this time.
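The per-document layout described under Data (each of the directories "1p", "2p" and "timex2norm" holding a .sgm source file plus .ag.xml, .apf.xml and .tab annotation files) can be sketched as a small completeness check. This is a minimal illustration only: the stage directory names and file suffixes come from the description above, while the parent path, the document ID and the assumption that all four files share the document ID as their base name are hypothetical and should be checked against the actual release.

```python
from pathlib import PurePosixPath

# Stage directory names and file suffixes as named in the corpus description.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def expected_files(stage_parent: str, doc_id: str) -> dict[str, list[PurePosixPath]]:
    """Return the files expected for one document, keyed by annotation stage.

    `stage_parent` is the directory that contains the three stage directories.
    Its real location in the release is an assumption of this sketch.
    """
    parent = PurePosixPath(stage_parent)
    return {
        stage: [parent / stage / f"{doc_id}{suffix}" for suffix in SUFFIXES]
        for stage in STAGES
    }


def missing_files(stage_parent: str, doc_id: str, present: set[str]) -> list[str]:
    """List expected paths that do not appear in `present` (a set of path strings)."""
    return [
        str(path)
        for paths in expected_files(stage_parent, doc_id).values()
        for path in paths
        if str(path) not in present
    ]


if __name__ == "__main__":
    # "data/Arabic/nw" and "DOC_0001" are placeholders, not real corpus names.
    for stage, paths in expected_files("data/Arabic/nw", "DOC_0001").items():
        print(stage, [p.name for p in paths])
```

Because the source text is identical across the three stages, a check like this is mainly useful for confirming that every document has a full set of annotation files before comparing the 1P, 2P and NORM versions.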

Extent: Corpus size: 312408 KB

Identifier: LDC2014T18

https://catalog.ldc.upenn.edu/LDC2014T18

ISBN: 1-58563-688-6

ISLRN: 600-375-253-846-9

DOI: 10.35111/ygjb-7f15

Language: Spanish

Standard Arabic

Language (ISO639): spa

arb

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2014T18

Rights Holder: Portions © 2000, 2005 Agence France Presse, © 2000 Al Hayat, © 2000 An Nahar, © 2005 The Associated Press, © 2005 Xinhua News Agency, © 2005-2007, 2014 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2014T18

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
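
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP GET carrying the `verb`, `identifier` and `metadataPrefix` arguments defined by the OAI-PMH protocol. The base URL below is a placeholder, not the archive's real endpoint, and the prefix `olac` for the OLAC format is an assumption to verify with the repository's ListMetadataFormats response; `oai_dc` is the prefix the protocol defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2014T18"  # OaiIdentifier from this record


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    print(get_record_url(BASE_URL, RECORD_ID, "oai_dc"))  # simple DC format
    print(get_record_url(BASE_URL, RECORD_ID, "olac"))    # OLAC format (assumed prefix)
```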

Search Info
Citation: Chen, Song; Maeda, Kazuaki; Walker, Christopher; Strassel, Stephanie. 2014. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_ES country_SA dcmi_Text iso639_arb iso639_spa olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2014T18
Up-to-date as of: Wed Oct 29 7:01:28 EDT 2025
