OLAC Record
oai:www.ldc.upenn.edu:LDC2003T06

Metadata
Title:Arabic Treebank: Part 1 v 2.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Maamouri, Mohamed, et al. Arabic Treebank: Part 1 v 2.0 LDC2003T06. Web Download. Philadelphia: Linguistic Data Consortium, 2003.
Contributor:Maamouri, Mohamed
Bies, Ann
Jin, Hubert
Buckwalter, Tim
Date (W3CDTF):2003
Date Issued (W3CDTF):2003-02-03
Description:*Introduction*
Arabic Treebank: Part 1 v 2.0 was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2003T06 and ISBN 1-58563-261-9. This publication is part one of a one-million-word Arabic Treebank corpus, designed to support language research and the development of language technology for Modern Standard Arabic.
*Data*
The Penn Arabic Treebank, which is part of the DARPA TIDES project, started in the fall of 2001 with the objective of performing human and computer annotation of a large machine-readable Arabic text corpus (for project background, please see POStest.html). As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project therefore consists of two distinct phases:
* Part-of-Speech (POS) tagging - divides the text into lexical tokens and gives relevant information about each token, such as lexical category, inflectional features, and a gloss
* Arabic Treebanking (ArabicTB) - characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.
Both tasks started in November 2001 with an initial pilot consisting of 734 files representing roughly 166K words of written Modern Standard Arabic newswire from the Agence France Presse corpus. The aim of this publication is to provide a description of a written Modern Standard Arabic text corpus. The source data consists of Agence France Presse (AFP) newswire from July through November 2000. This publication includes 734 stories representing 140,265 words (168,123 tokens after clitic segmentation in the Treebank).
*Updates*
There are no updates available at this time.
Extent:Corpus size: 271360 KB
Identifier:LDC2003T06
https://catalog.ldc.upenn.edu/LDC2003T06
ISBN: 1-58563-261-9
ISLRN: 333-321-196-670-5
OAI: oai:www.ldc.upenn.edu:LDC2003T06
Language:Standard Arabic
Language (ISO639):arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2003T06
Rights Holder:Portions © 2000 Agence France-Presse, © 2002 Trustees of the University of Pennsylvania
Type (DCMI):Text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2003T06
DateStamp:  2014-07-17
GetRecord:  OAI-PMH request for simple DC format
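
The GetRecord entries in this record stand for OAI-PMH requests whose URLs are not shown here. As a minimal sketch of how such a request is usually formed, the Python below builds GetRecord URLs for the OLAC format and for simple Dublin Core. The base URL is a placeholder and an assumption, since this record does not state the repository endpoint; the helper name get_record_url is also hypothetical. The verb and the identifier and metadataPrefix parameters come from the OAI-PMH protocol, with "olac" and "oai_dc" as the usual prefixes for the two formats.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint. The record does not give the real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2003T06"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# One request per format listed in the record.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")    # OLAC format
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")    # simple DC format

print(olac_url)
print(dc_url)
```

Swapping BASE_URL for the actual repository endpoint is the only change needed to turn the sketch into a real request; the colons in the identifier are percent-encoded by urlencode.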

Search Info

Citation: Maamouri, Mohamed; Bies, Ann; Jin, Hubert; Buckwalter, Tim. 2003. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T06
Up-to-date as of: Mon Nov 24 0:32:18 EST 2014