OLAC Record
oai:www.ldc.upenn.edu:LDC2006T02

Metadata
Title:Arabic Gigaword Second Edition
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Graff, David, et al. Arabic Gigaword Second Edition LDC2006T02. DVD. Philadelphia: Linguistic Data Consortium, 2006
Contributor:Graff, David
Chen, Ke
Kong, Junbo
Maeda, Kazuaki
Date (W3CDTF):2006
Date Issued (W3CDTF):2006-01-19
Description:*Introduction*
Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2006T02 and ISBN 1-58563-371-2. It is a comprehensive archive of newswire text data acquired from Arabic news sources by the LDC at the University of Pennsylvania. Arabic Gigaword Second Edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.
Five distinct sources of Arabic newswire are represented here:
  Agence France Presse (afp_arb; formerly afa)
  Al Hayat News Agency (hyt_arb; formerly alh)
  An Nahar News Agency (nhr_arb; formerly ann)
  Ummah Press (umh_arb)
  Xinhua News Agency (xin_arb; formerly xia)
The seven-character codes in parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character. The language code "arb" denotes Standard Arabic in the ISO 639-3 standard. In the first edition of the Arabic Gigaword corpus, a simpler three-character code was used to identify both the source and the language. The new convention makes it easier to distinguish data sets by source and language when a single newswire provider distributes data in multiple languages. Ummah Press is a new source added in the Second Edition.
The following table shows the data that appear for the first time in the Second Edition.
  Source                 Period            Documents
  Agence France Presse   2003.01-2004.12     143,766
  Al Hayat News Agency   2002.01-2003.12      64,308
  An Nahar News Agency   2003.01-2004.01      16,316
  Ummah Press            2003.01-2004.12       4,641
  Xinhua News Agency     2003.06-2004.12     106,236
*Data*
There are 423 files, totaling approximately 1.4 GB in compressed form (5,359 MB uncompressed, containing 481,906 K-words in 1,591,987 documents).
The table below lists, for each source: the number of files (#Files); the total compressed file size (Gzip-MB); the total uncompressed file size (Totl-MB, approximately 5.3 gigabytes overall); the number of space-separated tokens in the text, in thousands and excluding SGML tags (K-wrds); and the number of documents (#DOCs).
  Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
  AFP_ARB      128      355     1429  123594   660621
  HYT_ARB      119      524     1861  169100   369555
  NHR_ARB      109      457     1649  151078   344084
  UMH_ARB       24        4       13    1201     4645
  XIN_ARB       43      103      407   36933   213082
  TOTAL        423     1443     5359  481906  1591987
All text files in this corpus have been converted to UTF-8 character encoding. Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation marks or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
Each data file name consists of the seven-character prefix, an underscore character ("_"), and a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Each file therefore contains all the usable data received by LDC for the given month from the given news source.
All text data are presented in SGML form, using a very simple, minimal markup structure.
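The file naming convention and encoding described above can be sketched in Python as follows. This is a minimal illustration, not part of the corpus distribution: the names parse_file_name, CorpusFile, and read_lines are hypothetical, the pattern assumes lower-case ASCII letters in the prefix (as in the source codes listed in the introduction), and the file name in the usage note is only an example that follows the stated convention.

```python
import gzip
import re
from typing import Iterator, NamedTuple

# Assumed from the description: three-character source ID, "_",
# three-character language code, "_", six-digit YYYYMM date, ".gz".
FILE_NAME = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


class CorpusFile(NamedTuple):
    """Parts of a corpus file name (hypothetical helper type)."""
    source: str
    language: str
    year: int
    month: int


def parse_file_name(name: str) -> CorpusFile:
    """Split a file name such as 'afp_arb_200301.gz' into its parts."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized corpus file name: {name!r}")
    source, language, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"invalid month in file name: {name!r}")
    return CorpusFile(source, language, int(year), month_number)


def read_lines(path: str) -> Iterator[str]:
    """Yield the lines of a gzip-compressed UTF-8 file, without newlines."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For instance, parse_file_name("afp_arb_200301.gz") yields the source "afp", the language "arb", the year 2003, and the month 1. Opening the file in text mode with an explicit UTF-8 encoding lets the standard library handle the mixture of single-byte and multi-byte characters.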
The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file. Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).
All sources have received uniform treatment in terms of quality control, and their documents (DOCs) have been categorized into three distinct "types":
  story: this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
  multi: this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
  other: these DOCs clearly do not fall into either of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on
The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data.
As described in the introduction section, a new naming scheme for file names and document IDs is used in the Second Edition.
All of the documents in the first edition of the Arabic Gigaword corpus can be mapped to the same documents in this edition by changing the prefix of DOC IDs and file names as shown below. Upper-case letters are used for the DOC IDs; lower-case letters are used for the file and directory names. The underscore character that connects the seven-character prefix and the date is included in the following table.
  Old   New
  AFA   AFP_ARB_
  ALH   HYT_ARB_
  ANN   NHR_ARB_
  XIA   XIN_ARB_
*Samples*
For an example of the data in this corpus, please examine this screenshot, which is an image of the text from a single file.
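The prefix mapping between editions can be sketched in Python as below. This is an illustration under stated assumptions, not an LDC tool: the function name to_second_edition is hypothetical, every new prefix is taken to end in an underscore (as the text states for the table), and the sample DOC ID and file name in the usage note are invented to show the shape of the conversion, since the record does not give the exact ID format.

```python
# Old-to-new prefixes for DOC IDs (upper case), following the mapping table.
# Assumption: each new prefix ends with the connecting underscore.
OLD_TO_NEW = {
    "AFA": "AFP_ARB_",
    "ALH": "HYT_ARB_",
    "ANN": "NHR_ARB_",
    "XIA": "XIN_ARB_",
}


def to_second_edition(name: str) -> str:
    """Rewrite a first-edition DOC ID or file name with the new prefix.

    Upper-case prefixes are treated as DOC IDs and lower-case prefixes as
    file or directory names; the remainder of the name is left unchanged.
    """
    old_prefix = name[:3]
    new_prefix = OLD_TO_NEW.get(old_prefix.upper())
    if new_prefix is None:
        raise ValueError(f"unknown first-edition prefix in {name!r}")
    if old_prefix.isupper():
        return new_prefix + name[3:]
    if old_prefix.islower():
        return new_prefix.lower() + name[3:]
    raise ValueError(f"mixed-case prefix in {name!r}")
```

With these assumptions, a hypothetical DOC ID "AFA19940513.0001" becomes "AFP_ARB_19940513.0001", and a hypothetical file name "afa199405.gz" becomes "afp_arb_199405.gz". Ummah Press has no entry because it did not appear in the first edition.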
Extent:Corpus size: 1153433 KB
Identifier:LDC2006T02
https://catalog.ldc.upenn.edu/LDC2006T02
ISBN: 1-58563-371-2
ISLRN: 299-814-033-635-4
OAI: oai:www.ldc.upenn.edu:LDC2006T02
Language:Standard Arabic
Language (ISO639):arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: DVD
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2006T02
Rights Holder:Portions © 1994-2004 Agence France Presse, © 1994-2003 Al Hayat News Agency, © 1995-2004 An Nahar News Agency, © 2001-2004 Xinhua News Agency, © 2003-2004 Ummah Press, © 2005-2006 Trustees of the University of Pennsylvania
Type (DCMI):Text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2006T02
DateStamp:  2014-07-17
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. 2006. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T02
Up-to-date as of: Tue Sep 23 0:17:12 EDT 2014