OLAC Record
oai:catalogue.elra.info:ELRA-W0023

Metadata
Title:MLCC Multilingual and Parallel Corpora
Abstract:The first set contains articles from 6 European newspapers: Het Financieele Dagblad (Dutch, 8.5 million words), The Financial Times (English, 30 million words), Le Monde (French, 10 million words), Handelsblatt (German, 33 million words), Il Sole 24 Ore (Italian, 1.88 million words), Expansion (Spanish, 10 million words). The second set consists of a parallel corpus of translated data in the nine official European languages (1992-1994), divided into 2 sub-corpora: written questions (10.2 million words) and parliamentary debates (5 to 8 million words per language).
Access Rights:Rights available for: Research Use
Coverage:1986-1994
Date Available (W3CDTF):1996-09-01
Date Issued (W3CDTF):2004-11-04
Date Modified (W3CDTF):2012-05-23
Description:Written Corpora
The MLCC text corpus has two main components: one set that allows comparable studies to be carried out in different languages, and one set that serves as the basis for translation studies.
The first set is referred to as the Polylingual Document Collection, a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:
Dutch - Het Financieele Dagblad - 1992-1993 (Samples)
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad, editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.
English - The Financial Times - 1993 (Samples)
The corpus contains articles from the British financial newspaper The Financial Times, editions from the year 1993. The corpus contains around 30 million words.
French - Le Monde - 1992-1993 (Samples)
A corpus of articles from the French newspaper Le Monde, consisting of two years' worth (1992-1993) of articles on financial subjects, approximately 10 million words.
German - Handelsblatt - 1986-1988 (Samples)
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.
Italian - Il Sole 24 Ore - 1992-1993 (Samples)
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML markup was done by the University of Edinburgh.
Spanish - Expansion - 1994 (Samples)
This subcorpus contains articles from the Spanish financial newspaper Expansion, editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.
The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish.
The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:
Official Journal of the European Commission, C Series: Written Questions 1993
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and the corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).
Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994
This parallel corpus consists of the records of Parliamentary sittings published as an annex to the Official Journal of the European Community, Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members at the meeting as well as written input provided to the meeting. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.
Identifier:ELRA-W0023
http://catalog.elra.info/product_info.php?products_id=764
Language:Dutch, Flemish
English
German
French
Italian
Spanish, Castilian
Language (ISO639):nld
eng
deu
fra
ita
spa
Publisher:ELRA (European Language Resources Association)
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  ELRA Catalogue of Language Resources
Description:  http://www.language-archives.org/archive/catalogue.elra.info
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:catalogue.elra.info:ELRA-W0023
DateStamp:  1996-09-01
GetRecord:  OAI-PMH request for simple DC format
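The GetRecord entries above refer to OAI-PMH requests for this record. As a minimal sketch of how such a request URL is typically formed, the Python snippet below builds one from the identifier listed in this record. The base URL is a placeholder (this record does not state the repository endpoint), the helper name get_record_url is hypothetical, and the metadata prefixes "oai_dc" (simple DC) and "olac" (OLAC format) are assumed to be the ones the repository accepts.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",                 # OAI-PMH verb for fetching one record
        "identifier": identifier,            # OAI identifier of the record
        "metadataPrefix": metadata_prefix,   # requested metadata format
    })
    return f"{base_url}?{query}"


# Identifier taken from the OaiIdentifier field of this record.
OAI_ID = "oai:catalogue.elra.info:ELRA-W0023"

dc_url = get_record_url(OAI_ID, "oai_dc")   # simple Dublin Core (assumed prefix)
olac_url = get_record_url(OAI_ID, "olac")   # OLAC format (assumed prefix)

print(dc_url)
print(olac_url)
```

Replacing BASE_URL with the archive's actual endpoint would give the URLs that the two GetRecord entries correspond to; urlencode takes care of percent-encoding the colons in the identifier.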

Search Info

Citation: n.a. 2004. ELRA (European Language Resources Association).
Terms: area_Europe country_DE country_ES country_FR country_GB country_IT country_NL dcmi_Text iso639_deu iso639_eng iso639_fra iso639_ita iso639_nld iso639_spa olac_primary_text


http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0023
Up-to-date as of: Fri May 5 1:18:55 EDT 2017