OLAC Record: Arabic Gigaword Fourth Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2009T30

Metadata

Title: Arabic Gigaword Fourth Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Parker, Robert, et al. Arabic Gigaword Fourth Edition LDC2009T30. Web Download. Philadelphia: Linguistic Data Consortium, 2009

Contributor: Parker, Robert

Graff, David

Chen, Ke

Kong, Junbo

Maeda, Kazuaki

Date (W3CDTF): 2009

Date Issued (W3CDTF): 2009-12-17

Description: *Introduction*

Arabic Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T30 and ISBN 1-58563-532-4, is a comprehensive archive of Arabic newswire text that has been acquired over several years at LDC. Arabic Gigaword Fourth Edition includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.

Nine distinct international sources of Arabic newswire are represented here:

* Al-Ahram (ahr_arb)
* Asharq Al-Awsat (aaw_arb)
* Agence France Presse (afp_arb)
* Assabah (asb_arb)
* Al Hayat (hyt_arb)
* An Nahar (nhr_arb)
* Al-Quds Al-Arabi (qds_arb)
* Ummah Press (umh_arb)
* Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the seven-character prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character.

These news services all use Modern Standard Arabic (MSA), so there should be a fairly limited scope for orthographic and lexical variation due to regional Arabic dialects. However, to the extent that regional dialects might have an influence on MSA usage, the following should be noted:

* Al-Ahram is based in Cairo, Egypt.
* Asharq Al-Awsat is based in London, England, UK.
* An Nahar is based in Beirut, Lebanon.
* Al Hayat was originally a Lebanese news service, but it has been based in London during the entire period represented in this archive.
* Assabah is based in Tunisia.
* The Xinhua and Agence France Presse (AFP) services are international in scope (Xinhua is based in Beijing, AFP in Paris), and the regional distribution of Arabic reporters and editors for these services is not known.
* The content provided by Ummah Press comes from diverse sources throughout the Arabic-speaking world.
* Al-Quds Al-Arabi is based in London, England, UK.

*New in the Fourth Edition*

* New Sources: This release is the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering the period from November 2006 through December 2008.
* New Data for Existing Sources: This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 is included.

The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text, in thousands (K-wrds); and the number of documents per source (#DOCs).

Source    #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
aaw_arb       26      114      386   36694    87506
afp_arb      176      530     1979  184631   930656
ahr_arb       26      114      131   42265   107187
asb_arb       52       45      149   14322    32794
hyt_arb      166      663     2224  209318   448335
nhr_arb      157      784     2662  253559   557151
qds_arb       26       62      198   18996    49352
umh_arb       68      9.3       31    2995    11350
xin_arb       91      245      890   85689   492664
Totals       788     5018     8650  848469  2716995

*Samples*

For an example of the data contained in this corpus, please examine this jpeg image of the text content.

*Sponsorship*

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
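
The seven-character prefix convention described in the Introduction lends itself to a small lookup. The following is a minimal sketch, not part of the LDC distribution: the function name `identify_source` and the sample file names (including the date-like suffix and `.gz` extension) are hypothetical, and only the nine prefixes and source names are taken from the description above.

```python
from pathlib import PurePosixPath

# Prefix-to-source mapping, copied from the list in the Description field.
SOURCES = {
    "ahr_arb": "Al-Ahram",
    "aaw_arb": "Asharq Al-Awsat",
    "afp_arb": "Agence France Presse",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "qds_arb": "Al-Quds Al-Arabi",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}


def identify_source(file_name: str) -> str:
    """Return the news source for a file, based on its 7-character prefix.

    Any leading directory components are ignored, since the same code
    is used for both the directory name and the file-name prefix.
    """
    base_name = PurePosixPath(file_name).name
    prefix = base_name[:7]  # three-character source ID + "_" + "arb"
    if prefix not in SOURCES:
        raise ValueError(f"unrecognized source prefix: {prefix!r}")
    return SOURCES[prefix]


# Hypothetical file name; only the "afp_arb" prefix is documented above.
print(identify_source("afp_arb/afp_arb_200701.gz"))  # Agence France Presse
```

Keeping the mapping in a plain dictionary means an unknown prefix fails loudly instead of being silently attributed to the wrong source.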

Extent: Corpus size: 2621440 KB

Identifier: LDC2009T30

https://catalog.ldc.upenn.edu/LDC2009T30

ISBN: 1-58563-532-4

ISLRN: 766-411-032-967-0

DOI: 10.35111/v9fm-zn61

Language: Standard Arabic

Arabic

Language (ISO639): arb

ara

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2009T30

Rights Holder: Portions © 1994-2008 Agence France Presse, © 2006-2008 Al-Ahram, © 2006-2008 Al-Quds Al-Arabi, © 2006-2008 Asharq Al-Awsat, © 2004-2008 Assabah, © 1994-2003, 2005-2008 Al Hayat, © 1995-2008 An Nahar, © 2003-2008 Ummah Press, © 2001-2008 Xinhua News Agency, © 2003, 2006, 2007, 2009 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2009T30

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Parker, Robert; Graff, David; Chen, Ke; Kong, Junbo; Maeda, Kazuaki. 2009. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_ara iso639_arb olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2009T30
Up-to-date as of: Wed Oct 29 7:01:10 EDT 2025
