OLAC Record

Title:Normalized Arabic Fragments for Inestimable Stemming (NAFIS)
Access Rights: Rights available for: nonCommercialUse, commercialUse
Date Available (W3CDTF):2018-10-02
Date Issued (W3CDTF):2018-10-02
Description:Normalized Arabic Fragments for Inestimable Stemming (NAFIS) is an Arabic stemming gold standard corpus composed of a collection of sentences, selected to be representative of Arabic stemming tasks and manually annotated. NAFIS is:
Comprehensive: The content of NAFIS can be generalized to the Arabic language as a whole. For stemming, comprehensiveness means the corpus must contain all possible affix combinations. To that end, linguists made an inventory of all Arabic affix combinations. Here, an affix is a prefix-suffix pair that can be agglutinated to a specific word type (noun, verb or particle). Arabic affixes consist of 12 atomic prefixes and 11 atomic suffixes. Combining them yields about 94 prefixes and 73 suffixes (the terms affix, prefix and suffix are used instead of clitic, proclitic and enclitic because they are widely used in the literature). For example, the prefix “وَال” (and the) is composed of two atomic prefixes: “وَ” (the conjunction “and”) and “ال” (the definite article “the”).
Compiled: Linguists gathered a set of sentences containing all the affixes listed earlier, to satisfy the comprehensiveness criterion. The sentences come from various sources (poems, the Holy Quran, books, and periodicals) and are of diverse kinds (proverbs and dicta, article commentary, religious text, literature, historical fiction). For instance, the sentence "عليكم بالجد فإنه أساس النجاح" is part of the corpus and contains four affix combinations: 1. [-كم]: the empty prefix combined with the suffix pronoun 'you'; 2. [بال-]: two atomic prefixes ("ب", the preposition 'with', and “ال”, the definite article 'the') and the empty suffix; 3. [ف-ه]: the prefix “ف” (the conjunction “then”) and the suffix “ه” (the pronoun “his”); 4. [ال-]: “ال”, the definite article 'the', and the empty suffix.
NAFIS is represented according to the TEI standard.
Each sentence is enclosed within its own tag. A sentence is a set of segments representing words. Since a word can have several stemming solutions, each alternative is included within a separate tag, which contains the prefix, base (root and stem) and suffix morphemes. All alternatives are ordered randomly except the first one, which is the correct solution given the sentence context. The corpus has the following characteristics:
•37 sentences
•The average sentence length is 5.05 words, with the longest being 10 words
•Declarative, interrogative, imperative and exclamatory sentences account for 37.84%, 32.43%, 16.22% and 13.51% respectively
•154 tokens, with an average of 5.95 stemming solutions each
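The prefix/base/suffix structure described above can be illustrated with a minimal Python sketch. It is not the corpus format or any NAFIS tooling: the class name Solution, the helper names, and in particular the base strings are assumptions made here for illustration. Only the sentence and its four affix combinations come from the description.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    """One stemming alternative (hypothetical model): prefix + base + suffix."""
    prefix: str
    base: str
    suffix: str

    def surface(self) -> str:
        # Concatenating the three morphemes gives back the written word.
        return self.prefix + self.base + self.suffix


# Example sentence quoted in the description.
SENTENCE = "عليكم بالجد فإنه أساس النجاح"

# The affixes are those listed in the description; the bases are filled in
# here for illustration only and are NOT taken from the corpus annotation.
SEGMENTED = [
    Solution("", "علي", "كم"),    # [-كم]  empty prefix + suffix pronoun
    Solution("بال", "جد", ""),    # [بال-] two atomic prefixes + empty suffix
    Solution("ف", "إن", "ه"),     # [ف-ه]  conjunction prefix + pronoun suffix
    Solution("", "أساس", ""),     # no affix on this word
    Solution("ال", "نجاح", ""),   # [ال-]  definite article + empty suffix
]


def nfc(text: str) -> str:
    """Normalize to NFC so composed and decomposed letters compare equal."""
    return unicodedata.normalize("NFC", text)


def affix_pairs(solutions):
    """Return the (prefix, suffix) pairs in which at least one side is non-empty."""
    return [(s.prefix, s.suffix) for s in solutions if s.prefix or s.suffix]


rebuilt = " ".join(s.surface() for s in SEGMENTED)
print(nfc(rebuilt) == nfc(SENTENCE))  # do the segments rebuild the sentence?
print(len(affix_pairs(SEGMENTED)))    # number of affix combinations found
```

In the corpus itself each word carries several such alternatives, with the contextually correct one listed first; a list of Solution objects per word, in that order, would be one way to mirror that.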
ISLRN: 305-450-745-774-1
Identifier (URI):http://catalog.elra.info/en-us/repository/browse/ELRA-W0127/
Language (ISO639):ara
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0127
DateStamp:  2018-10-02
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: n.a. 2018. ELRA (European Language Resources Association).
Terms: dcmi_Text iso639_ara olac_primary_text

Up-to-date as of: Wed Nov 17 9:15:23 EST 2021