OLAC Record: CESART Evaluation Package

OLAC Record
oai:catalogue.elra.info:ELRA-E0019

Metadata

Title: CESART Evaluation Package

Access Rights: Rights available for: evaluationUse

Date Available (W3CDTF): 2007-06-28

Date Issued (W3CDTF): 2007-06-28

Date Modified (W3CDTF): 2017-06-26

Description: The CESART Evaluation Package was produced within the French national project CESART (evaluation of terminology extraction tools), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESART project carried out a campaign for the evaluation of terminology extraction tools. It extends the evaluation campaign of terminology resource acquisition tools for written corpora (ARC A3) that was carried out within the AUPELF campaigns (Actions de Recherche Concertées, 1996-1999).
This package includes the material that was used or produced during the CESART evaluation campaign: resources, protocols, scoring tools, results of the campaign, etc. The aim of these evaluation packages is to enable external players to evaluate their own systems.
The campaign is divided into two actions:
1) Term extraction for the building of a terminology reference, whose applications are the enrichment of the reference and the free indexing of documents.
2) Extraction of semantic relations (synonymy) from a list of “focal” terms.
The CESART evaluation package contains the following data and tools:
Three domain-specific corpora in French were built: one medical corpus, one educational corpus, and one political corpus. The first two were used as test corpora, while the third (the political corpus) was used as a masking corpus. The corpora are encoded in UTF-8 and XML. They are available in two versions, one for DOS and one for UNIX.
1) The medical corpus consists of web pages extracted from Santé Canada (http://www.hc-sc.gc.ca/index_f.html).
2) The educational corpus contains articles extracted from the SPIRAL magazine, which specialises in pedagogy and research in education.
3) The political corpus is composed of texts extracted from the Official Journal of the European Union.
The table below gives some statistics on the corpora used for the evaluation:
Corpus (specialised) | Medicine (test corpus) | Education (test corpus) | Politics (masking corpus)
Number of documents  | 7,514                  | 149                     | 1,477
Number of segments   | 255,161                | 12,109                  | 9,024
Number of words      | 9,000,000              | 535,000                 | 240,000
Two reference lists were built, each from a terminology database in a specialised domain. The list of medical terms, based on the terminology provided by the CISMeF team (www.chu-rouen.fr/terminologiecismef), is available from the IST/Inserm (http://mesh.inserm.fr/mesh). This list contains 22,861 entries. For the educational domain, the reference list is based on the Motbis thesaurus (http://www.thesaurus.motbis.cndp.fr/site/) and consists of 36,081 entries.
A description of the project (in French) is available at the following address: http://www.technolangue.net/article.php3?id_article=200

Identifier: ELRA-E0019

ISLRN: 154-799-255-123-0

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-E0019/

Language: French

Language (ISO639): fra

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-E0019

DateStamp: 2007-06-28

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: n.a. 2007. ELRA (European Language Resources Association).
Terms: area_Europe country_FR dcmi_Text iso639_fra olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-E0019
Up-to-date as of: Wed Oct 1 0:55:49 EDT 2025