Title:PAROLE English lexicon
Abstract:The PAROLE English lexicon consists of 22 000 morphological units extracted from the CRL-LKB and COBUILD dictionaries: 12998 are common nouns, 40 proper nouns, 4195 verbs, 3208 adjectives, 606 adverbs, 71 adpositions, 2 articles, 21 conjunctions, 25 determiners and 53 pronouns.
Access Rights:Rights available for: Research Use, Commercial Use
Date Available (W3CDTF):2001-04-06
Date Issued (W3CDTF):2004-09-14
Date Modified (W3CDTF):2016-11-15
Description:Monolingual Lexicons
The English PAROLE Lexicon was compiled by two partners, Sheffield University and the Corpus Linguistic Group (CLG) at Birmingham University. It was compiled from two existing resources: CRL-LKB and the COBUILD dictionary database. Both have restricted availability and contain extensive syntactic, semantic and morphological information. The lexicon contains 22,000 morphological units, of which 12998 are common nouns, 40 proper nouns, 4195 verbs, 3208 adjectives, 606 adverbs, 71 adpositions, 2 articles, 21 conjunctions, 25 determiners and 53 pronouns.
The English PAROLE lexicon comprises the following information:
- morphological encoding for all nouns, verbs, adverbs, adjectives and function words;
- syntactic encoding of all verbs, nouns, adjectives and adverbs.
The organizational procedure was as follows:
1. Selection: lemmata were mostly selected on the basis of frequency in the COBUILD corpus. Most proper nouns were deselected, and some verbs were added because of the decision to encode deverbal nominalisations and compound information.
2. Coverage: the headword list was checked against the resources to make sure there was adequate coverage of syntactic and morphological information.
3. Composition: the nominal lemmata were checked for derivations and compounds. These were extracted and analyzed into their constituent parts, and compounds were checked for lexicalisation. Components were flagged with their base forms and grammatical class.
4. Conversion: morphosyntactic information was either transferred directly from the existing resources or, in the case of inflectional information and subcategorisation patterns, extracted and converted into the PAROLE format by purpose-written programs.
5. Cross-reference: all components contained in nominal derivations and compounds were cross-referenced with their base PoS. Integrity checks were made and the lexicon was parsed using nsgmls.
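As a quick arithmetic check on the breakdown above, the per-category counts can be tallied as in the following sketch. The dictionary name and the percentage calculation are illustrative only and are not part of the PAROLE distribution. The listed categories add up to 21,219 units, somewhat below the rounded figure of 22,000; the record does not explain the difference.

```python
# Part-of-speech counts exactly as listed in the description above.
pos_counts = {
    "common nouns": 12998,
    "proper nouns": 40,
    "verbs": 4195,
    "adjectives": 3208,
    "adverbs": 606,
    "adpositions": 71,
    "articles": 2,
    "conjunctions": 21,
    "determiners": 25,
    "pronouns": 53,
}

# Total number of units across the listed categories.
total = sum(pos_counts.values())

# Share of each category as a percentage of the listed total.
shares = {pos: 100 * n / total for pos, n in pos_counts.items()}

print(f"listed total: {total}")
for pos, n in sorted(pos_counts.items(), key=lambda item: -item[1]):
    print(f"{pos:<14}{n:>6}  {shares[pos]:5.1f}%")
```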
*** Introduction to the PAROLE project
The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages. The corpora and lexica were built according to the same design and composition principles in the period 1996-1998.
PAROLE Corpora: Harmonisation of corpus composition (the selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and for classification by publication medium. No texts produced before 1970 were allowed. As for publication medium, the corpus had to include specific proportions of texts from the categories "Book", "Newspaper", "Periodical" and "Miscellaneous", within a settled range. The harmonisation effort also applied to the textual and linguistic encoding of the corpora. For the markup of text structure and primary data, every corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, which implies encoding of text structure and textual features down to paragraph level, with the additional constraint that all legacy data was kept. As for linguistic annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features. The checking of the tags was split in two: 50,000 words had to be checked at maximum granularity and 200,000 for part of speech (PoS) only. The languages involved in the PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.
PAROLE Lexica: The lexica (20,000 entries per language) were built to conform to a model based on the EAGLES guidelines and GENELEX results. This model underlies a common lexical tool adapted from the EUREKA-GENELEX project. The software tool was extended to support the PAROLE model and the conversion and management of the resulting resources. The languages involved in the PAROLE lexica are: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.
Language (ISO639):eng
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text


