OLAC Record
oai:www.ldc.upenn.edu:LDC2004T02

Metadata
Title:Arabic Treebank: Part 2 v 2.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Maamouri, Mohamed, et al. Arabic Treebank: Part 2 v 2.0 LDC2004T02. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Maamouri, Mohamed
Bies, Ann
Buckwalter, Tim
Jin, Hubert
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-01-30
Description:*Introduction* Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004T02 and ISBN 1-58563-282-1. This publication is the second part of a 1,000,000-word Arabic Treebank corpus designed to support language research and the development of language technology for Modern Standard Arabic. Part one was released in 2003 as Arabic Treebank: Part 1 v 2.0, with source data extracted from Agence France Presse stories. The current corpus, Arabic Treebank: Part 2 v 2.0, consists of stories from Al-Hayat distributed by Ummah.
*Data* This corpus includes 501 stories from the Ummah Arabic News Text, stored as 501 files with one story per file. The files contain a total of 144,199 words (counting non-Arabic tokens such as numbers and punctuation). New features of annotation include complete vocalization (including case endings), lemma IDs, and more specific POS tags for verbs and particles. The corpus contains 125,698 Arabic-only word tokens (prior to the separation of clitics), of which 124,740 (99.24%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 958 (0.76%) were items that the morphological parser failed to analyze correctly.
*Samples* Please view the following samples:
* SGML
* Treebank
* Treebank - XML
* POS
*Updates* There are no updates available at this time.
Extent:Corpus size: 1153433 KB
Identifier:LDC2004T02
https://catalog.ldc.upenn.edu/LDC2004T02
ISBN: 1-58563-282-1
ISLRN: 530-268-392-589-4
Language:Standard Arabic
Language (ISO639):arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T02
Rights Holder:Portions © 2001-2002 Ummah Press, © 2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
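
The token counts and percentages quoted in the Description can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check; the variable names are illustrative and are not part of the LDC record.

```python
# Figures copied from the Description field of this record.
total_arabic_tokens = 125_698  # Arabic-only word tokens, before clitic separation
analyzed = 124_740             # tokens given an acceptable analysis and POS tag
failed = 958                   # tokens the morphological parser failed to analyze

# The two subsets should account for every Arabic-only token.
assert analyzed + failed == total_arabic_tokens

# Percentages rounded to two decimals, as they are reported in the record.
analyzed_pct = round(100 * analyzed / total_arabic_tokens, 2)
failed_pct = round(100 * failed / total_arabic_tokens, 2)

print(f"analyzed: {analyzed_pct}%  failed: {failed_pct}%")
```

The two counts sum to the stated total, and the rounded shares match the 99.24% and 0.76% given in the record.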

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T02
DateStamp:  2019-01-03
GetRecord:  OAI-PMH request for simple DC format
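
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a rough sketch of how such a request URL is assembled: the `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, and `olac` and `oai_dc` are the usual prefixes for the OLAC and simple DC formats. The base URL below is a hypothetical placeholder, since this record does not state the repository endpoint, and `get_record_url` is an illustrative helper name.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real OAI-PMH endpoint is not given in this record.
BASE_URL = "https://example.org/oai"

# OAI identifier taken from the OaiIdentifier field above.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2004T02"


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# OLAC format and simple Dublin Core format, matching the two GetRecord entries.
olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

`urlencode` percent-encodes the colons in the identifier, which is what a GetRecord request expects for reserved characters in parameter values.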

Search Info

Citation: Maamouri, Mohamed; Bies, Ann; Buckwalter, Tim; Jin, Hubert. 2004. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T02
Up-to-date as of: Sun Sep 1 18:16:39 EDT 2019