OLAC Record
oai:lindat.mff.cuni.cz:11372/LRT-2206

Metadata
Title:C4Corpus (CC BY-NC-SA part)
Bibliographic Citation:http://hdl.handle.net/11372/LRT-2206
Creator:Gurevych, Iryna
Habernal, Ivan
Zayed, Omnia
Date (W3CDTF):2017-06-07T13:08:21Z
Date Available:2017-06-07T13:08:21Z
Description:A large web corpus (over 10 billion tokens) in 50+ languages, licensed under the Creative Commons license family. It was extracted from CommonCrawl, the largest publicly available general Web crawl to date, with about 2 billion crawled URLs.
Identifier (URI):http://hdl.handle.net/11372/LRT-2206
Language:Afrikaans
Arabic
Bengali
Bulgarian
Czech
Danish
German
Modern Greek (1453-)
English
Estonian
Persian
Finnish
French
Gujarati
Hebrew
Hindi
Croatian
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Malayalam
Marathi
Macedonian
Nepali (macrolanguage)
Dutch
Norwegian
Polish
Portuguese
Romanian
Russian
Slovak
Slovenian
Somali
Spanish
Albanian
Swahili (macrolanguage)
Swedish
Tamil
Telugu
Tagalog
Thai
Turkish
Ukrainian
Undetermined
Urdu
Vietnamese
Chinese
Language (ISO639):afr
ara
ben
bul
ces
dan
deu
ell
eng
est
fas
fin
fra
guj
heb
hin
hrv
hun
ind
ita
jpn
kor
lav
lit
mal
mar
mkd
nep
nld
nor
pol
por
ron
rus
slk
slv
som
spa
sqi
swa
swe
tam
tel
tgl
tha
tur
ukr
und
urd
vie
zho
Publisher:Technische Universität Darmstadt
Rights:Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
http://creativecommons.org/licenses/by-nc-sa/4.0/
Subject:CommonCrawl
Creative Commons
Web corpus
Amazon Web Services
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11372/LRT-2206
DateStamp:  2017-06-07
GetRecord:  OAI-PMH request for simple DC format
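
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple DC formats. A minimal sketch of how such request URLs are typically built follows. It assumes the standard OAI-PMH query parameters (verb, identifier, metadataPrefix) and the conventional prefixes "olac" and "oai_dc". The base URL is a hypothetical placeholder, since this record does not state the repository's OAI-PMH endpoint, and build_get_record_url is an illustrative helper name, not part of any library.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the record does not give the repository's
# OAI-PMH base URL, so substitute the real endpoint before use.
BASE_URL = "https://example.org/oai"

# OAI identifier taken from the record above.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11372/LRT-2206"


def build_get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL (illustrative helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "olac" for the OLAC format, "oai_dc" for simple DC.
olac_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = build_get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

The identifier is percent-encoded by urlencode, so the colons and the slash in it do not interfere with the query string.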

Search Info

Citation: Gurevych, Iryna; Habernal, Ivan; Zayed, Omnia. 2017. Technische Universität Darmstadt.
Terms: area_Africa area_Asia area_Europe country_BD country_BG country_CZ country_DE country_DK country_ES country_FI country_FR country_GB country_GR country_HR country_HU country_ID country_IL country_IN country_IT country_JP country_KR country_LT country_MK country_NL country_NO country_PH country_PK country_PL country_PT country_RO country_RU country_SE country_SI country_SK country_SO country_TH country_TR country_UA country_VN country_ZA dcmi_Text iso639_afr iso639_ara iso639_ben iso639_bul iso639_ces iso639_dan iso639_deu iso639_ell iso639_eng iso639_est iso639_fas iso639_fin iso639_fra iso639_guj iso639_heb iso639_hin iso639_hrv iso639_hun iso639_ind iso639_ita iso639_jpn iso639_kor iso639_lav iso639_lit iso639_mal iso639_mar iso639_mkd iso639_nep iso639_nld iso639_nor iso639_pol iso639_por iso639_ron iso639_rus iso639_slk iso639_slv iso639_som iso639_spa iso639_sqi iso639_swa iso639_swe iso639_tam iso639_tel iso639_tgl iso639_tha iso639_tur iso639_ukr iso639_und iso639_urd iso639_vie iso639_zho olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11372/LRT-2206
Up-to-date as of: Sun Nov 26 2:08:41 EST 2017