OLAC Record
oai:catalogue.elra.info:ELRA-M0052

Metadata
Title:EnToFrNE - a Parallel English-French Lexicon of Named Entities
Access Rights: Rights available for: nonCommercialUse, commercialUse
Date Available (W3CDTF):2019-09-10
Date Issued (W3CDTF):2019-09-10
Date Modified (W3CDTF):2019-04-24
Description:In any text document, some terms represent specific entities that are more informative than other terms and have a unique context. These are known as named entities: terms that refer to real-world objects such as people, places and organizations. They are often denoted by proper names and can be abstract or have a physical existence. Examples of named entities include United States of America, Paris, Google, Mercedes Benz, Microsoft Windows, or anything else that can be named. Certain natural terms such as biological species and substances, which are sometimes considered named entities, are not included in the lexicon.

The lexicon consists of 1,167,263 parallel named entities in English and French.

Classification

Named entities in the lexicon are tagged. The tags used are: PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. Each named entity belongs to at least one of these classes. The classes comprise:
PERSON: humans, gods, saints, fictional characters;
ORGANIZATION: political organizations, companies, schools, rock bands, sports teams;
LOCATION: geographical terms, fictional places, cosmic terms;
PRODUCT: industrial products, software products, weapons, art works, documents, concepts, standards, laws, formats, anthems, algorithms, journals, coats of arms, platforms, websites;
MISC: events, languages, peoples, tribes, alliances, orders, scientific discoveries, theories, titles, currencies, holidays, dynasties, positions, projects, historical periods, battles, competitions, diseases, breeds, programs, sets of locations, awards, musical genres, missions, artistic directions, sets of organizations, networks.

There are 1,167,263 entries in the lexicon, and each one is assigned at least one tag. The distribution of tags is as follows:
PERSON: 387,676
ORGANIZATION: 107,865
LOCATION: 309,533
PRODUCT: 149,137
MISC: 247,655

The total number of tags, 1,201,866, is slightly higher than the number of entries because some named entities belong to more than one class. For example, Tom Sawyer is tagged as both PRODUCT (the title of the novel) and PERSON (the character from the novel).

Evaluation

To evaluate the tagging, two common information retrieval metrics have been used: precision and recall. Precision is the percentage of assigned tags that are correct. Recall is the percentage of all relevant tags that the algorithm assigned correctly. An alternative to two separate measures is the F-measure, which combines precision and recall into a single performance measure. The variant known as the F1-score is simply the harmonic mean of precision and recall.

For the evaluation, a random sample of 1,000 entries was extracted from the lexicon. The entries in the sample were tagged manually and then compared to the tagging performed by the algorithm. The precision of tagging ranges from 0.94 for ORGANIZATION to 0.99 for PERSON. The recall is slightly lower, ranging from 0.83 for PRODUCT and MISC to 0.97 for PERSON. The higher precision values show that the tagging algorithm was tuned to tag named entities correctly rather than to extract more named entities for the lexicon.

Formats

The lexicon comes in two formats: csv and xml.

The first row in the csv file is a title row, and tab is used as the field separator. The column titles are: en, fr, PERSON, ORGANIZATION, LOCATION, PRODUCT and MISC. The following rows contain the data: the English name, the French name and five digits, each 0 or 1, indicating which classes the named entity belongs to.

The structure of the xml file is similar. The column names from the csv file are used as element names.
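
The csv layout and the F1-score described above can be illustrated with a short Python sketch. It is not part of the resource and makes several assumptions: the function names (read_lexicon, count_tags, f1_score) are hypothetical, the two sample rows are made up for illustration (only the Tom Sawyer tagging is taken from the description), and fields are assumed to be unquoted. The file name and encoding of the distributed csv file are not stated in this record.

```python
import csv
import io
from collections import Counter

# Column titles as listed in the description of the csv format.
TAGS = ["PERSON", "ORGANIZATION", "LOCATION", "PRODUCT", "MISC"]


def read_lexicon(handle):
    """Yield (en, fr, tags) tuples from a tab-separated file with a title row."""
    # Assumption: fields are not quoted, so quote characters are kept literally.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # A tag applies to the entry when its column holds the digit 1.
        tags = [tag for tag in TAGS if row[tag] == "1"]
        yield row["en"], row["fr"], tags


def count_tags(entries):
    """Count how many entries carry each tag."""
    counts = Counter()
    for _en, _fr, tags in entries:
        counts.update(tags)
    return counts


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample in the described layout, not real lexicon data.
SAMPLE = (
    "en\tfr\tPERSON\tORGANIZATION\tLOCATION\tPRODUCT\tMISC\n"
    "Paris\tParis\t0\t0\t1\t0\t0\n"
    "Tom Sawyer\tTom Sawyer\t1\t0\t0\t1\t0\n"
)

entries = list(read_lexicon(io.StringIO(SAMPLE)))
counts = count_tags(entries)
print(entries)
print(dict(counts))
# F1 for the PERSON class, from the reported precision 0.99 and recall 0.97.
print(round(f1_score(0.99, 0.97), 4))
```

To read an actual file instead of the sample, a handle opened with open(path, newline="", encoding="utf-8") could be passed to read_lexicon; the path and the UTF-8 encoding are assumptions here.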
Identifier:ELRA-M0052
ISLRN: 233-270-965-120-8
Identifier (URI):https://catalog.elra.info/en-us/repository/browse/ELRA-M0052/
Language:English
French
Language (ISO639):eng
fra
Medium:Not specified
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):lexicon

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-M0052
DateStamp:  2019-09-10
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: n.a. 2019. ELRA (European Language Resources Association).
Terms: area_Europe country_FR country_GB dcmi_Text iso639_eng iso639_fra olac_lexicon


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-M0052
Up-to-date as of: Fri Apr 19 6:29:20 EDT 2024