OLAC Record
oai:www.ldc.upenn.edu:LDC2004T23

Metadata
Title:Prague Arabic Dependency Treebank 1.0
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Hajič, Jan, et al. Prague Arabic Dependency Treebank 1.0 LDC2004T23. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Hajič, Jan
Smrz, Otakar
Zemanek, Petr
Pajas, Petr
Snaidauf, Jan
Beska, Emanuel
Kracmar, Jakub
Hassanova, Kamila
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-11-19
Description:*Introduction*
Prague Arabic Dependency Treebank (PADT) 1.0 was developed by the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague. It consists of approximately 212,500 tokens of Modern Standard Arabic with multi-level linguistic annotations, and it also provides a variety of unique software implementations designed for general use in Natural Language Processing (NLP). The PADT project can be summarized as an open-ended effort to annotate Arabic language resources at multiple levels, in line with the theory of Functional Generative Description. The project is a younger sibling of the Prague Dependency Treebank for Czech and is maintained in cooperation with the Linguistic Data Consortium (LDC), which releases non-annotated corpora of Arabic newswire and has developed an independent Arabic Treebank.
*Data*
The corpus of PADT 1.0 consists of morphologically and analytically annotated newswire texts of Modern Standard Arabic, drawn from the Arabic Gigaword (LDC2003T12) and the plain data of Arabic Treebank: Part 1 v 2.0 (LDC2003T06) and Arabic Treebank: Part 2 v 2.0 (LDC2004T02). The PADT 1.0 distribution comprises over 113,500 tokens of data annotated analytically and provided with disambiguated morphological information. In addition, the release includes complete MorphoTrees annotations covering more than 148,000 tokens, 49,000 of which have received analytical processing. The contents are further divided into data sets as indicated in the table. In the table, tokens represent the number of syntactic units that are annotated [A] analytically and [M] within MorphoTrees. The next columns give approximate ratios of tokens per paragraph and tokens per document, distinguishing the two types of annotation. The selected documents in each set may cover only a couple of days of the specified period.
Data Set  [A] Tokens  [M] Tokens  Tokens/Para  Tokens/Doc  Original Data Provider  News Period     Related Corpora
AFP       13,000      N/A         34.6 [N/A]   260 [N/A]   Agence France Presse    July 2000       Penn ATB Part 1
UMH       38,500      N/A         43.6 [N/A]   290 [N/A]   Ummah Press Service     Spring 2002     Penn ATB Part 2
XIN       13,500      N/A         31.2 [N/A]   155 [N/A]   Xinhua News Agency      May 2003        Arabic Gigaword
ALH       10,000      73,500      47.0 [47.8]  405 [405]   Al Hayat News Agency    September 2001  Arabic Gigaword
ANN       12,500      25,500      60.3 [50.3]  740 [630]   An Nahar News Agency    November 2002   Arabic Gigaword
XIA       26,500      49,500      29.7 [25.9]  235 [205]   Xinhua News Agency      May 2003        Arabic Gigaword
*Samples*
For examples of the data in this corpus, please view this paragraph morphology tree (GIF) and this new analytical rendering style (GIF).
*Support*
PADT 1.0 was supported by the Ministry of Education of the Czech Republic, projects LN00A063 and MSM113200006, and by the Grant Agency of the Czech Republic, project 405/02/0823.
*Updates*
Updates or bug fixes may be available in the LDC catalog entry for this corpus, or at the PADT website. Your questions and suggestions are welcome at padt (at) ckl (dot) mff (dot) cuni (dot) cz.
Extent:Corpus size: 124928 KB
Identifier:LDC2004T23
https://catalog.ldc.upenn.edu/LDC2004T23
ISBN: 1-58563-319-4
ISLRN: 034-001-778-929-8
DOI: 10.35111/pn7r-7q63
Language:Standard Arabic
Language (ISO639):arb
License:Prague Arabic Dependency Treebank 1.0: https://catalog.ldc.upenn.edu/license/prague-arabic-dependency-treebank-1.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T23
Rights Holder:Portions © 2000 Agence France Presse, © 2001 Al Hayat, © 2002 An Nahar, © 2002 Ummah Press Service, © 2003 Xinhua News Agency, © 2000-2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T23
DateStamp:  2022-03-25
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Hajič, Jan; Smrz, Otakar; Zemanek, Petr; Pajas, Petr; Snaidauf, Jan; Beska, Emanuel; Kracmar, Jakub; Hassanova, Kamila. 2004. Prague Arabic Dependency Treebank 1.0. Linguistic Data Consortium.
Terms: area_Asia country_SA dcmi_Text iso639_arb olac_primary_text
http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T23
Up-to-date as of: Fri Oct 21 2:45:07 EDT 2022