OLAC Record
oai:catalogue.elra.info:ELRA-W0020

Metadata
Title:PAROLE French Corpus
Abstract:The PAROLE French corpus contains the following data: Miscellaneous (data provided by ELRA: CRATER, MLCC Multilingual and Parallel Corpora): 2 025 964 words; Books (CNRS Editions): 3 267 409 words; Periodicals (CNRS Info, Hermès): 942 963 words; Newspapers (Le Monde, provided by ELRA): 13 856 763 words; Total: 20 093 099 words.
Access Rights:Rights available for: Research Use
Date Available (W3CDTF):2000-03-06
Date Issued (W3CDTF):2004-09-14
Date Modified (W3CDTF):2016-11-15
Description:Written Corpora
The PAROLE French corpus contains the following data: Miscellaneous (data provided by ELRA: CRATER, MLCC Multilingual and Parallel Corpora): 2 025 964 words; Books (CNRS Editions): 3 267 409 words; Periodicals (CNRS Info, Hermès): 942 963 words; Newspapers (Le Monde, provided by ELRA): 13 856 763 words; Total: 20 093 099 words.
1. Newspapers: 14 million words were extracted from complete issues of Le Monde from the years 1987, 1989, 1991, 1993 and 1995. 241,484 words, from 7 issues of Le Monde of September 1987, were extracted and POS-tagged automatically. Each article includes a complete "header" item, following the guidelines of the TEI (Text Encoding Initiative). The original Le Monde markup was converted into classification features, so that articles can be extracted by topic.
2. Periodicals:
- Hermès: Issues 15 to 22 were used (134 articles, one Word file per article). The data were converted from Word to RTF (Rich Text Format) and then, via a translator, from RTF to HTML. The conversion from HTML to the PAROLE format was done with flex programs. The result for each article is one "header" file, which contains information on the author and the article id, and one "body" file, which contains the article itself. A Perl script creates the final file from the "header" and "body" files.
- CNRS-Infos: The data come from the CNRS-Infos Web site (http://www.cnrs.fr/Cnrspresse/cnrsinfo.html). Each file was processed as follows: cleaning the HTML header, extracting a summary, cleaning the HTML markup, translating to the PAROLE format, and creating the "header" and "body" files (see Hermès). As with the Hermès files, a Perl script creates the final file from the "header" and "body" files.
3. Books: All books were provided on CD-ROM as XPress files, each book having its own structure. Each book was therefore handled separately. XPress allows conversion to a format called "XPress markup".
This format makes it possible to identify the different structures of the book (provided the XPress file was laid out well, which is not always the case). The structure of each book had to be worked out in order to write the Perl script that translates it to the PAROLE format. Conformance to the PAROLE format was verified with the "nsgmls" tool. The errors found during verification were corrected manually.
*** Introduction to the PAROLE project
The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages. Language corpora and lexica were built according to the same design and composition principles in the period 1996-1998.
PAROLE Corpora: Harmonisation of corpus composition (the selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and for classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specific proportions of texts from the categories "Book", "Newspaper", "Periodical" and "Miscellaneous", within a settled range. The harmonisation effort also applied to the textual and linguistic encoding of the language corpora involved. With respect to the markup of text structure and primary data, every corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, which implies encoding text structure and textual features up to paragraph level, with the additional constraint that all legacy data be kept. As for linguistic corpus annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features.
The checking of the tags was split in two: 50,000 words had to be checked at maximum granularity and 200,000 for part-of-speech (PoS) only. The languages involved in the PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.
PAROLE Lexica: The lexica (20,000 entries per language) were built in conformance with a model based on EAGLES guidelines and GENELEX results, underlying a common lexical tool adapted from the EUREKA-GENELEX project. This software tool was extended to support the PAROLE model and the conversion and management of the resulting resources. The languages involved in the PAROLE lexica are: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.
Identifier:ELRA-W0020
http://catalog.elra.info/product_info.php?products_id=565
Language:French
Language (ISO639):fra
Medium:Downloadable
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text
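
The word counts given in the abstract can be cross-checked against the stated total. The following is a minimal Python sketch; the figures are copied from the record, while the names WORD_COUNTS, STATED_TOTAL and category_shares are illustrative and not part of the record or of any ELRA tool.

```python
# Word counts per publication medium, copied from the record's abstract.
WORD_COUNTS = {
    "Miscellaneous": 2_025_964,
    "Books": 3_267_409,
    "Periodicals": 942_963,
    "Newspapers": 13_856_763,
}
# Total as stated in the record.
STATED_TOTAL = 20_093_099


def category_shares(counts):
    """Return each category's share of the summed word count (0.0 to 1.0)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(WORD_COUNTS.values())
shares = category_shares(WORD_COUNTS)

print("Sum matches stated total:", total == STATED_TOTAL)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The four category counts add up exactly to the stated total of 20 093 099 words, and Newspapers is the largest category.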

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0020
DateStamp:  2000-03-06
GetRecord:  OAI-PMH request for simple DC format
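
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough Python sketch, such a request URL can be assembled from the standard OAI-PMH parameters (verb, identifier, metadataPrefix). The base URL below is a placeholder, since the record does not state the archive's OAI-PMH endpoint, and the "olac" metadata prefix is an assumption; "oai_dc" is the prefix the OAI-PMH protocol defines for simple Dublin Core.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not state the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"
# OaiIdentifier as given in this record.
IDENTIFIER = "oai:catalogue.elra.info:ELRA-W0020"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core format (prefix defined by the OAI-PMH protocol).
dc_url = get_record_url(BASE_URL, IDENTIFIER, "oai_dc")
# OLAC format (the prefix name "olac" is assumed here).
olac_url = get_record_url(BASE_URL, IDENTIFIER, "olac")

print(dc_url)
print(olac_url)
```

Replacing BASE_URL with the archive's actual endpoint would be required before sending either request.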

Search Info

Citation: n.a. 2004. ELRA (European Language Resources Association).
Terms: area_Europe country_FR dcmi_Text iso639_fra olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0020
Up-to-date as of: Sun Jun 17 0:44:47 EDT 2018