OLAC Record: AQUAINT-2 Information-Retrieval Text Research Collection

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T25

Metadata

Title: AQUAINT-2 Information-Retrieval Text Research Collection

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Vorhees, Ellen, and David Graff. AQUAINT-2 Information-Retrieval Text Research Collection LDC2008T25. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: Vorhees, Ellen

Graff, David

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-12-19

Description: *Introduction: * AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute for Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31). The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity with scenarios or tasks. The scenario provides a context in which questions will be asked and answered, and the task reflects the overall assignment. The program is committed to solve a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres. AQUAINT technology is advancing the development of components and functions that allows users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that will present answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic and social and world inferencing, redundancy, deception, and missing or contradictory information. In order to allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and the inferencing of the social context of the information search. *Data* AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006. For each source, all of the usable data collected by LDC was processed into a consistent XML format in which the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute. The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names: afp_eng Agence France Presse apw_eng Associated Press cna_eng Central News Agency (Taiwan) English Service ltw_eng Los Angeles Times - Washington Post News Service nyt_eng New York Times xin_eng Xinhua News Agency (Beijing) English Service *Samples* For an example of the data in this publication, please examine this image of the XML.

Extent: Corpus size: 2495610 KB

Identifier: LDC2008T25

https://catalog.ldc.upenn.edu/LDC2008T25

ISBN: 1-58563-494-8

ISLRN: 541-800-982-573-5

DOI: 10.35111/af48-z779

Language: English

Language (ISO639): eng

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T25

Rights Holder: Portions © 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, © 2004-2006, 2007, 2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T25

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Vorhees, Ellen; Graff, David. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T25
Up-to-date as of: Wed Oct 29 7:01:05 EDT 2025

Metadata
Title:		AQUAINT-2 Information-Retrieval Text Research Collection
Access Rights:		Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:		Vorhees, Ellen, and David Graff. AQUAINT-2 Information-Retrieval Text Research Collection LDC2008T25. Web Download. Philadelphia: Linguistic Data Consortium, 2008
Contributor:		Vorhees, Ellen
Contributor:		Graff, David
Date (W3CDTF):		2008
Date Issued (W3CDTF):		2008-12-19
Description:		Introduction: AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute for Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31). The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity with scenarios or tasks. The scenario provides a context in which questions will be asked and answered, and the task reflects the overall assignment. The program is committed to solve a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres. AQUAINT technology is advancing the development of components and functions that allows users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that will present answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic and social and world inferencing, redundancy, deception, and missing or contradictory information. In order to allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and the inferencing of the social context of the information search. Data AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006. For each source, all of the usable data collected by LDC was processed into a consistent XML format in which the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute. The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names: afp_eng Agence France Presse apw_eng Associated Press cna_eng Central News Agency (Taiwan) English Service ltw_eng Los Angeles Times - Washington Post News Service nyt_eng New York Times xin_eng Xinhua News Agency (Beijing) English Service Samples For an example of the data in this publication, please examine this image of the XML.
Extent:		Corpus size: 2495610 KB
Identifier:		LDC2008T25
		https://catalog.ldc.upenn.edu/LDC2008T25
		ISBN: 1-58563-494-8
		ISLRN: 541-800-982-573-5
		DOI: 10.35111/af48-z779
Language:		English
Language (ISO639):		eng
License:		LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:		Distribution: Web Download
Publisher:		Linguistic Data Consortium
Publisher (URI):		https://www.ldc.upenn.edu
Relation (URI):		https://catalog.ldc.upenn.edu/docs/LDC2008T25
Rights Holder:		Portions © 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, © 2004-2006, 2007, 2008 Trustees of the University of Pennsylvania
Type (DCMI):		Text
Type (OLAC):		primary_text
OLAC Info
Archive:		The LDC Corpus Catalog
Description:		http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:		OAI-PMH request for OLAC format
GetRecord:		Pre-generated XML file
OAI Info
OaiIdentifier:		oai:www.ldc.upenn.edu:LDC2008T25
DateStamp:		2020-11-30
GetRecord:		OAI-PMH request for simple DC format
Search Info
Citation:		Vorhees, Ellen; Graff, David. 2008. Linguistic Data Consortium.
Terms:		area_Europe country_GB dcmi_Text iso639_eng olac_primary_text