OLAC Record
oai:lindat.mff.cuni.cz:11372/LRT-2610

Metadata
Title:ParaCrawl Corpus version 1.0
Bibliographic Citation:http://hdl.handle.net/11372/LRT-2610
Creator:Koehn, Philipp
Heafield, Kenneth
Forcada, Mikel L.
Esplà-Gomis, Miquel
Ortiz-Rojas, Sergio
Sánchez, Gema Ramírez
Cartagena, Víctor M. Sánchez
Haddow, Barry
Bañón, Marta
Střelec, Marek
Samiotou, Anna
Kamran, Amir
Date (W3CDTF):2018-02-12T07:41:46Z
Date Available:2018-02-12T07:41:46Z
Description:The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that has much higher coverage of the selected websites than CommonCrawl does. Because the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus is filtered using one of these metrics. For more details and to download the raw data, visit: http://paracrawl.eu/releases.html
Identifier (URI):http://hdl.handle.net/11372/LRT-2610
Language:English
German
French
Spanish
Italian
Portuguese
Dutch
Polish
Czech
Romanian
Finnish
Latvian
Russian
Estonian
Language (ISO639):eng
deu
fra
spa
ita
por
nld
pol
ces
ron
fin
lav
rus
est
Publisher:ParaCrawl
Rights:Public Domain Dedication (CC Zero)
http://creativecommons.org/publicdomain/zero/1.0/
Subject:ParaCrawl
parallel corpus
CommonCrawl
machine translation
text corpora
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11372/LRT-2610
DateStamp:  2021-06-29
GetRecord:  OAI-PMH request for simple DC format
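
The GetRecord entries above are OAI-PMH requests for this record in two metadata formats. As a rough sketch of how such request URLs might be assembled: the base URL below is a hypothetical placeholder (the archive's real endpoint is not given in this record), and the helper name get_record_url is illustrative only. The verb, identifier, and metadataPrefix arguments are standard OAI-PMH parameters; "olac" and "oai_dc" are the usual prefixes for the OLAC and simple DC formats.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not the archive's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai/request"

# Identifier taken from the OaiIdentifier field of this record.
RECORD_ID = "oai:lindat.mff.cuni.cz:11372/LRT-2610"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# One request per metadata format listed in this record.
olac_url = get_record_url(RECORD_ID, "olac")    # OLAC format
dc_url = get_record_url(RECORD_ID, "oai_dc")    # simple Dublin Core format

print(olac_url)
print(dc_url)
```

The sketch only constructs the URLs; fetching them and parsing the returned XML is left out, since the response depends on the actual endpoint.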

Search Info

Citation: Koehn, Philipp; Heafield, Kenneth; Forcada, Mikel L.; Esplà-Gomis, Miquel; Ortiz-Rojas, Sergio; Sánchez, Gema Ramírez; Cartagena, Víctor M. Sánchez; Haddow, Barry; Bañón, Marta; Střelec, Marek; Samiotou, Anna; Kamran, Amir. 2018. ParaCrawl.
Terms: area_Europe country_CZ country_DE country_ES country_FI country_FR country_GB country_IT country_NL country_PL country_PT country_RO country_RU dcmi_Text iso639_ces iso639_deu iso639_eng iso639_est iso639_fin iso639_fra iso639_ita iso639_lav iso639_nld iso639_pol iso639_por iso639_ron iso639_rus iso639_spa olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11372/LRT-2610
Up-to-date as of: Thu Oct 5 0:40:51 EDT 2023