OLAC Record
oai:www.ldc.upenn.edu:LDC94T5

Metadata
Title:ECI Multilingual Text
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Linguistic Data Consortium. ECI Multilingual Text LDC94T5. Web Download. Philadelphia: Linguistic Data Consortium, 1994
Contributor:Linguistic Data Consortium
Date (W3CDTF):1994
Description:The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI), has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora, each with two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). The CD-ROM is in High Sierra format (ISO 9660), readable on UNIX, MS-DOS and Apple systems at least. The amount of material per language varies, from about 36 million words (German) to about 5 thousand words (Bulgarian). The majority of sources are journalistic in nature (newspapers, magazines, broadcasts); additional sources include dictionaries (Albanian, Gaelic, Turkish, Japanese/English), literature, technical reports, and proceedings or publications of international organizations. The table on the next page lists the languages included, the subcorpus numbers for each language (in parentheses) and the amount of data per language in thousands of lexical words.
Language: (Subcorpus #) Kwords, ...; Total
German: (70) 34291, (09) 191, (65) 20, (28) 187, (29) 59, (30) 76, (47) 24, (59) 50, (71) 21, (70A) 999; Total 35918
French: (31) 4775, (04) 4121, (28) 187, (29) 59, (30) 76, (47) 24, (51) 6, (59) 50, (71) 21, (32) 1667; Total 10986
Spanish: (31) 4500, (13) 830, (14) 1041, (15) 447, (47) 24, (32) 1667, (59) 50, (71) 21; Total 8580
English: (31) 4222, (36) 1141, (74) 95, (28) 187, (47) 24, (51) 6, (56) 97, (59) 50, (71) 21, (32) 1667; Total 7510
Dutch: (03) 5500, (02) 600, (47) 24, (71) 21; Total 6145
Czech: (44) 4726; Total 4726
Italian: (11) 3518, (42) 303, (58) 13, (29) 59, (30) 76, (47) 24, (71) 21; Total 4014
Chinese: (78) 2895; Total 2895
Greek: (10) 2515, (47) 24, (59) 50, (71) 21; Total 2610
Norwegian: (41) 2226; Total 2226
Swedish: (37) 1718; Total 1718
Serb/Croat/Slov: (24) 700, (56) 289; Total 989
Tibetan: (76) 834; Total 834
Portuguese: (60) 675, (47) 24, (71) 21; Total 720
Malay: (80) 563; Total 563
Russian: (73) 364; Total 364
Japanese: (57) 203; Total 203
Turkish: (20) 173, (20A) 110; Total 283
Albanian: (82) 205; Total 205
Gaelic: (55) 141; Total 141
Estonian: (39) 100; Total 100
Uzbek: (81) 88; Total 88
Latin: (74) 75; Total 75
Danish: (47) 24, (71) 21; Total 45
Lithuanian: (89) 20; Total 20
Bulgarian: (84) 5; Total 5
Total: 91969
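Each per-language total in the table is the sum of the subcorpus sizes listed for that language. A minimal Python sketch of that check, using three rows transcribed by hand from the table; the dictionary layout and the helper name `language_total` are illustrative assumptions and not part of the corpus documentation.

```python
# Subcorpus sizes in thousands of lexical words (Kwords), transcribed from
# the table above. The layout and helper name are illustrative only.
KWORDS = {
    "German": {"70": 34291, "09": 191, "65": 20, "28": 187, "29": 59,
               "30": 76, "47": 24, "59": 50, "71": 21, "70A": 999},
    "Dutch": {"03": 5500, "02": 600, "47": 24, "71": 21},
    "Turkish": {"20": 173, "20A": 110},
}


def language_total(language: str) -> int:
    """Sum the subcorpus sizes (Kwords) listed for one language."""
    return sum(KWORDS[language].values())


for name in KWORDS:
    print(name, language_total(name))
```

Subcorpora such as 47 and 71 are listed under several languages with the same size, presumably because they are among the parallel corpora mentioned in the Description, so they contribute to more than one language total.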
Extent:Corpus size: 373760 KB
Identifier:LDC94T5
https://catalog.ldc.upenn.edu/LDC94T5
ISBN: 1-58563-033-0
ISLRN: 511-168-567-582-5
DOI: 10.35111/h2vd-p896
Language:Swedish
Slovenian
Russian
Portuguese
Norwegian Bokmål
Norwegian Nynorsk
Lithuanian
Latin
Japanese
Scottish Gaelic
French
Estonian
English
Modern Greek (1453-)
German
Danish
Bulgarian
Tosk Albanian
Spanish
Serbian
Mandarin Chinese
Italian
Dutch
Czech
Croatian
Albanian
Uzbek
Malay (macrolanguage)
Language (ISO639):swe
slv
rus
por
nob
nno
lit
lat
jpn
gla
fra
est
eng
ell
deu
dan
bul
als
spa
srp
cmn
ita
nld
ces
hrv
sqi
uzb
msa
License:ECI/MCI Agreement: https://catalog.ldc.upenn.edu/license/eci-slash-mci-user-agreement.pdf
Le Monde Material User Agreement: https://catalog.ldc.upenn.edu/license/le-monde-material-user-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC94T5
Rights Holder:Portions © 1994 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
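
The Description field states that the alphabetic corpora use three 8-bit character sets: ISO 8859-1, -5 and -7. The sketch below illustrates why the character set has to be known before a file is read: the same byte decodes to a different character under each one. The sample byte is arbitrary, the helper name `decode_all` is hypothetical, and this record does not say which subcorpus uses which character set.

```python
# One byte value decodes to a different character in each character set
# named in the Description, so the right codec must be chosen per file.
RAW = bytes([0xE1])  # arbitrary sample byte, not taken from the corpus

# Python codec names for ISO 8859-1 (Latin-1), -5 (Cyrillic) and -7 (Greek).
CODECS = ["iso8859_1", "iso8859_5", "iso8859_7"]


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each of the three ISO 8859 parts."""
    return {codec: raw.decode(codec) for codec in CODECS}


decoded = decode_all(RAW)
for codec, text in decoded.items():
    print(codec, repr(text))
```

Reading a corpus file would then look like `open(path, encoding="iso8859_7")`, with the codec chosen from the documentation shipped with the corpus rather than guessed.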

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC94T5
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
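
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with three query parameters: `verb`, `identifier` and `metadataPrefix`. The base URL below is a placeholder, since this record does not give the repository endpoint, and the function name `get_record_url` is hypothetical. `oai_dc` is the standard OAI-PMH prefix for simple Dublin Core; using `olac` for the OLAC format is an assumption about this repository.

```python
from urllib.parse import urlencode

# Placeholder: the real repository base URL is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one item."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Simple Dublin Core request for this record's OAI identifier.
print(get_record_url("oai:www.ldc.upenn.edu:LDC94T5", "oai_dc"))
```

The colons in the identifier are percent-encoded by `urlencode`, which is what an OAI-PMH endpoint expects for reserved characters in query values.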

Search Info

Citation: Linguistic Data Consortium. 1994. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_AL country_BG country_CN country_CZ country_DE country_DK country_ES country_FR country_GB country_GR country_HR country_IT country_JP country_LT country_NL country_PT country_RS country_RU country_SE country_SI country_VA dcmi_Text iso639_als iso639_bul iso639_ces iso639_cmn iso639_dan iso639_deu iso639_ell iso639_eng iso639_est iso639_fra iso639_gla iso639_hrv iso639_ita iso639_jpn iso639_lat iso639_lit iso639_msa iso639_nld iso639_nno iso639_nob iso639_por iso639_rus iso639_slv iso639_spa iso639_sqi iso639_srp iso639_swe iso639_uzb olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC94T5
Up-to-date as of: Tue Feb 13 6:32:05 EST 2024