OLAC Record
oai:www.ldc.upenn.edu:LDC2008T16

Metadata
Title:North American News Text, General Release
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Graff, David. North American News Text, General Release LDC2008T16. Web Download. Philadelphia: Linguistic Data Consortium, 2008
Contributor:Graff, David
Date (W3CDTF):2008
Date Issued (W3CDTF):2008-08-19
Description:*Introduction*
North American News Text, General Release is a collection of English news text from the Los Angeles Times, Washington Post, New York Times and Reuters. This data is a subset of the data contained in the North American News Text Corpus (LDC95T21) released in 1995 and is reissued to complement the release of the Brown Laboratory for Linguistic Information Processing (BLLIP) North American News Text sets (LDC2008T13, LDC2008T14), which consist of Penn Treebank-style parsing of the North American News Text Corpus text.
North American News Text is reissued in two versions: North American News Text, Complete, LDC2008T15, the members-only original version, now available as a 2008 Membership Year corpus; and North American News Text, General Release LDC2008T16 (which does not include text from the Wall Street Journal), available to nonmembers for the first time. The directory structure of each of these publications has been restructured to be identical to the directory structure of the BLLIP releases.
*Data*
The table below contains a breakdown of the sources, epochs and word counts for the data in the North American News Text releases:
Source                                         Dates                        # Words (millions)
Los Angeles Times & Washington Post            May 1994 - August 1997       52
New York Times News & Syndicate                July 1994 - December 1996    173
Reuters News Service (General and Financial)   April 1994 - December 1996   85
Wall Street Journal (not in General Release)   July 1994 - December 1996    40
The New York Times and the Los Angeles Times/Washington Post services include a range of other newspaper sources in their syndicated newswires.
The Los Angeles Times/Washington Post material in this corpus includes some news text from the following sources:
* Newsday
* The Baltimore Sun
* The Hartford Courant
The New York Times material in this corpus contains some data from the following sources, although New York Times articles predominate:
* Bloomberg Business News
* The Boston Globe
* Los Angeles Daily News
* Fort Worth Star-Telegram
* Newsweek
* Cox News Service
* The Arizona Republic
* Seattle Post-Intelligencer
* San Francisco Examiner
* Houston Chronicle
* San Francisco Chronicle
* Economist Newspaper Ltd.
* Hearst Newspapers
The text content of each data file (following uncompression with the GNU-unzip utility) consists of plain ASCII character data with SGML tags to indicate article boundaries and organization of information within each article. There are differences among the five primary newswire sources in terms of the number and types of SGML tags used in the text, but the following tag structure is common to all data sets:
-- start of a new article
... -- some variety of "header" tags appears here
-- start of the text content of the article
-- all paragraph boundaries are marked by this tag
... -- text data as it is provided by the newswire service
-- end of text content of the article
... -- some variety of "trailer" tags appears here
-- end of article
In general, the differences in format among the various newswire sources will be found in the SGML tags that appear between the article-start tag and the text-start tag, and in those that appear between the text-end tag and the article-end tag. The actual text content of articles (the region between the text-start and text-end tags) is consistent in format across sources, except for some uses of the SGML "&..;" notation to represent special characters in the data. For example, "&MD;" is used in the "latwp" material to represent the "em-dash" character, which is typically used to separate the "dateline" from the opening sentence in the first paragraph of each article. There may also be differences in how quotation marks are rendered.
As this re-release is intended to complement the BLLIP North American News Text releases, the directory structure of this corpus is identical to that of the BLLIP publications.
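The article layout described above (an article wrapper, a text region, paragraph markers, and "&..;" entities) could be read along the following lines. This is a minimal sketch, not the corpus's official tooling. The tag names DOC, TEXT and p are assumptions, since this record does not preserve the actual tag names, and the helper names are hypothetical. Only the "&MD;" entity comes from the description.

```python
import gzip
import re
from typing import Iterator, List

# ASSUMPTION: DOC, TEXT and p are illustrative tag names. This record does not
# preserve the real names, so check them against the corpus documentation.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL | re.IGNORECASE)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL | re.IGNORECASE)
PARA_SPLIT_RE = re.compile(r"<p>", re.IGNORECASE)

# "&MD;" is the only entity the description documents (em-dash in "latwp").
# It is mapped to "--" here so the output stays plain ASCII.
ENTITY_MAP = {"&MD;": "--"}


def replace_entities(text: str) -> str:
    """Replace the known "&..;" entities with plain-text equivalents."""
    for entity, replacement in ENTITY_MAP.items():
        text = text.replace(entity, replacement)
    return text


def iter_articles(sgml: str) -> Iterator[List[str]]:
    """Yield each article as a list of whitespace-normalized paragraphs."""
    for doc in DOC_RE.finditer(sgml):
        body = TEXT_RE.search(doc.group(1))
        if body is None:
            continue  # article without a text region: skip it
        paragraphs = []
        for chunk in PARA_SPLIT_RE.split(body.group(1)):
            cleaned = " ".join(replace_entities(chunk).split())
            if cleaned:
                paragraphs.append(cleaned)
        yield paragraphs


def read_gzip_ascii(path: str) -> str:
    """Read one gzip-compressed data file as ASCII text."""
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as handle:
        return handle.read()
```

A call such as `list(iter_articles(read_gzip_ascii(path)))` would return one paragraph list per article. Header and trailer tags are ignored on purpose, because the description says they vary by source.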
Extent:Corpus size: 1258291 KB
Identifier:LDC2008T16
https://catalog.ldc.upenn.edu/LDC2008T16
ISBN: 1-58563-484-0
ISLRN: 637-707-612-417-4
DOI: 10.35111/axv3-5419
Language:English
Language (ISO639):eng
License:North American News Text, General Release: https://catalog.ldc.upenn.edu/license/north-american-news-text-general-release.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2008T16
Rights Holder: Portions © 1994-1997 Los Angeles Times-Washington Post News Service, Inc., © 1994-1996 New York Times, © 1994-1996 Reuters America, Inc. © 1995-1997, 2008 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2008T16
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
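The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is a URL carrying the `verb`, `identifier` and `metadataPrefix` arguments defined by OAI-PMH. The base URL below is a hypothetical placeholder, since this record does not state the endpoint. `oai_dc` is the Dublin Core prefix defined by OAI-PMH, and `olac` as the prefix for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# HYPOTHETICAL endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core request for this record's OAI identifier.
url = get_record_url("oai:www.ldc.upenn.edu:LDC2008T16", "oai_dc")
```

Passing `"olac"` instead of `"oai_dc"` would, under the assumption stated above, request the OLAC-format record.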

Search Info

Citation: Graff, David. 2008. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T16
Up-to-date as of: Mon Mar 25 7:20:19 EDT 2024