OLAC Record: NEMLAR Written Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-W0042

Metadata

Title: NEMLAR Written Corpus

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2006-08-11

Date Issued (W3CDTF): 2006-08-11

Date Modified (W3CDTF): 2007-02-22

Description: This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
• Political news: 48,000 words
• Political debate: 30,000 words
• Islamic text (Preaching and others): 29,000 words
• Phrases of common words: 8,500 words
• Text from broadcast news: 5,500 words
• Business: 20,000 words
• Arabic literature: 30,000 words
• General news: 100,000 words
• Interviews: 56,000 words
• Scientific press: 50,000 words
• Sports press: 50,000 words
• Dictionary entries explanation: 52,000 words
• Legal domain text: 21,000 words
The data spans the period from the late 1990s to 2005.
The corpus is provided in four versions:
• Raw text
• Fully vowelized text
• Text with Arabic lexical analysis
• Text with Arabic POS-tags
Diacritics, lexical analysis and POS-tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or selects another one manually (in the 4% minority of the cases).
The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner, and a validation report is provided.
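The category sizes listed in the description can be checked against the stated total of about 500,000 words. The following is a minimal sketch in Python; the dictionary name and the derived figures (largest category, its share) are illustrative and not part of the catalogue record, and the per-category counts are taken as given even though they are presumably rounded.

```python
# Word counts per category, copied from the description above.
category_words = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}

# Sum of all listed categories, to compare with the stated corpus size.
total = sum(category_words.values())

# Largest category and its share of the listed total.
largest = max(category_words, key=category_words.get)
largest_share = category_words[largest] / total

print(f"{len(category_words)} categories, {total:,} words in total")
print(f"Largest category: {largest} ({largest_share:.0%})")
```

The listed figures add up to the stated total, so the 13 categories appear to account for the whole corpus.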

Identifier: ELRA-W0042

ISLRN: 050-693-158-326-9

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0042/

Language: Arabic

Language (ISO639): ara

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0042

DateStamp: 2006-08-11

GetRecord: OAI-PMH request for simple DC format
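The GetRecord entries in this record correspond to OAI-PMH requests, which are plain HTTP requests carrying a verb, a record identifier and a metadata prefix as query parameters. The following Python sketch builds such request URLs for this record. The base URL is a hypothetical placeholder, since the record does not state the repository endpoint, and the prefixes "olac" and "oai_dc" are assumed to be the ones behind "OLAC format" and "simple DC format".

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0042"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple Dublin Core.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Replacing BASE_URL with the real endpoint of the repository would be needed before the URLs return a record.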

Search Info

Citation: n.a. 2006. ELRA (European Language Resources Association).

Terms: dcmi_Text iso639_ara olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0042
Up-to-date as of: Wed Oct 1 0:55:40 EDT 2025
