OLAC Record

Title: NEMLAR Written Corpus
Access Rights: Rights available for: nonCommercialUse, commercialUse
Date Available (W3CDTF): 2006-08-11
Date Issued (W3CDTF): 2006-08-11
Date Modified (W3CDTF): 2007-02-22
Description: This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
• Political news: 48,000 words
• Political debate: 30,000 words
• Islamic text (preaching and others): 29,000 words
• Phrases of common words: 8,500 words
• Text from broadcast news: 5,500 words
• Business: 20,000 words
• Arabic literature: 30,000 words
• General news: 100,000 words
• Interviews: 56,000 words
• Scientific press: 50,000 words
• Sports press: 50,000 words
• Dictionary entries explanation: 52,000 words
• Legal domain text: 21,000 words

The data spans the period from the late 1990s to 2005.

The corpus is provided in 4 different versions:
• Raw text
• Fully vowelized text
• Text with Arabic lexical analysis
• Text with Arabic POS-tags

Diacritics, lexical analysis and POS-tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, the linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or manually selects another one (in the 4% minority of cases).

The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.
ISLRN: 050-693-158-326-9
Identifier (URI): http://catalog.elra.info/en-us/repository/browse/ELRA-W0042/
Language (ISO639): ara
Medium: Not specified
Publisher: ELRA (European Language Resources Association)
Type (DCMI): Text
Type (OLAC): primary_text


Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0042
DateStamp:  2006-08-11
GetRecord:  OAI-PMH request for simple DC format
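
The GetRecord requests listed in this record follow the OAI-PMH protocol, in which a request is a base URL plus the query parameters verb, identifier and metadataPrefix. The Python sketch below shows how such request URLs might be assembled for the OaiIdentifier above. The base URL is a placeholder, since this record does not state the repository endpoint, and the helper name get_record_url is hypothetical. The prefix oai_dc is the standard OAI-PMH prefix for simple DC; olac is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint. The record does not give the real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:catalogue.elra.info:ELRA-W0042"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Request for the OLAC format (prefix "olac" is an assumption).
olac_url = get_record_url(OAI_ID, "olac")
# Request for the simple DC format (standard prefix "oai_dc").
dc_url = get_record_url(OAI_ID, "oai_dc")

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by urlencode, which is the safe choice for a query string; replacing BASE_URL with the archive's actual endpoint is left to the reader.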

Search Info

Citation: n.a. 2006. ELRA (European Language Resources Association).
Terms: dcmi_Text iso639_ara olac_primary_text

Up-to-date as of: Wed Nov 17 9:12:41 EST 2021