OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2735

Metadata
Title:Plaintext Wikipedia dump 2018
Bibliographic Citation:http://hdl.handle.net/11234/1-2735
Creator:Rosa, Rudolf
Date (W3CDTF):2018-05-09T09:25:05Z
Date Available:2018-05-09T09:25:05Z
Description:Wikipedia plain text data obtained from Wikipedia dumps with WikiExtractor in February 2018. The data come from all Wikipedias for which dumps could be downloaded at [https://dumps.wikimedia.org/]. This amounts to 297 Wikipedias, usually corresponding to individual languages and identified by their ISO codes. Several special Wikipedias are included, most notably "simple" (Simple English Wikipedia) and "incubator" (tiny hatching Wikipedias in various languages). For a list of all the Wikipedias, see [https://meta.wikimedia.org/wiki/List_of_Wikipedias]. The script that can be used to get a new version of the data is included, but note that Wikipedia limits the download speed when many dumps are downloaded, so it takes a few days to download all of them (one or a few can be downloaded quickly). Also, the format of the dumps changes from time to time, so the script will probably stop working eventually. The WikiExtractor tool [http://medialab.di.unipi.it/wiki/Wikipedia_Extractor] used to extract text from the Wikipedia dumps is not mine; I only modified it slightly to produce plaintext outputs [https://github.com/ptakopysk/wikiextractor].
Identifier (URI):http://hdl.handle.net/11234/1-2735
Language:Abkhazian
Achinese
Adyghe
Afrikaans
Akan
Tosk Albanian
Amharic
Old English (ca. 450-1100)
Arabic
Official Aramaic (700-300 BCE)
Aragonese
Egyptian Arabic
Assamese
Asturian
Atikamekw
Avaric
Aymara
South Azerbaijani
Azerbaijani
Bashkir
Bambara
Bavarian
Central Bikol
Belarusian
Bengali
Bislama
Banjar
Tibetan
Bosnian
Bishnupriya
Breton
Buginese
Bulgarian
Russia Buriat
Catalan
Min Dong Chinese
Cebuano
Czech
Chamorro
Chechen
Cherokee
Church Slavic
Chuvash
Cheyenne
Central Kurdish
Cornish
Corsican
Cree
Crimean Tatar
Kashubian
Welsh
Danish
German
Dinka
Dimli (individual language)
Dhivehi
Lower Sorbian
Dzongkha
Modern Greek (1453-)
English
Esperanto
Estonian
Basque
Ewe
Extremaduran
Faroese
Persian
Fijian
Finnish
French
Arpitan
Northern Frisian
Western Frisian
Fulah
Friulian
Gagauz
Gan Chinese
Scottish Gaelic
Irish
Galician
Gilaki
Manx
Goan Konkani
Gothic
Guarani
Gujarati
Hakka Chinese
Haitian
Hausa
Hawaiian
Serbo-Croatian
Hebrew
Herero
Fiji Hindi
Hindi
Hiri Motu
Croatian
Upper Sorbian
Hungarian
Armenian
Igbo
Ido
Inuktitut
Interlingue
Iloko
Interlingua (International Auxiliary Language Association)
Indonesian
Inupiaq
Icelandic
Italian
Jamaican Creole English
Javanese
Lojban
Japanese
Kara-Kalpak
Kabyle
Kalaallisut
Kannada
Kashmiri
Georgian
Kanuri
Kazakh
Kabardian
Kabiyè
Central Khmer
Kikuyu
Kinyarwanda
Kirghiz
Komi-Permyak
Komi
Kongo
Korean
Karachay-Balkar
Kölsch
Kurdish
Ladino
Lao
Latin
Latvian
Lak
Lezghian
Ligurian
Limburgan
Lingala
Lithuanian
Lombard
Northern Luri
Latgalian
Luxembourgish
Ganda
Literary Chinese
Marshallese
Maithili
Malayalam
Marathi
Moksha
Eastern Mari
Minangkabau
Macedonian
Malagasy
Maltese
Mongolian
Maori
Western Mari
Malay (macrolanguage)
Creek
Mirandese
Burmese
Erzya
Mazanderani
Min Nan Chinese
Neapolitan
Nauru
Navajo
Ndonga
Low German
Nepali (macrolanguage)
Newari
Dutch
Norwegian Nynorsk
Norwegian
Novial
Pedi
Nyanja
Occitan (post 1500)
Livvi
Oriya (macrolanguage)
Oromo
Ossetian
Pangasinan
Pampanga
Panjabi
Papiamento
Picard
Pennsylvania German
Pfaelzisch
Pitcairn-Norfolk
Pali
Piemontese
Western Panjabi
Pontic
Polish
Portuguese
Pushto
Quechua
Vlax Romani
Romansh
Romanian
Rusyn
Rundi
Macedo-Romanian
Russian
Sango
Yakut
Sanskrit
Sicilian
Scots
Samogitian
Sinhala
Slovak
Slovenian
Northern Sami
Samoan
Shona
Sindhi
Somali
Southern Sotho
Spanish
Albanian
Sardinian
Sranan Tongo
Serbian
Swati
Saterfriesisch
Sundanese
Swahili (macrolanguage)
Swedish
Silesian
Tahitian
Tamil
Tatar
Tulu
Telugu
Tama (Colombia)
Tetum
Tajik
Tagalog
Thai
Tigrinya
Tonga (Tonga Islands)
Tok Pisin
Tswana
Tsonga
Turkmen
Tumbuka
Turkish
Twi
Tuvinian
Udmurt
Uighur
Ukrainian
Urdu
Uzbek
Venetian
Venda
Veps
Vietnamese
Vlaams
Volapük
Võro
Waray (Philippines)
Walloon
Wolof
Wu Chinese
Kalmyk
Xhosa
Mingrelian
Yiddish
Yoruba
Yue Chinese
Zeeuws
Zhuang
Chinese
Zulu
Language (ISO639):abk
ace
ady
afr
aka
als
amh
ang
ara
arc
arg
arz
asm
ast
atj
ava
aym
azb
aze
bak
bam
bar
bcl
bel
ben
bis
bjn
bod
bos
bpy
bre
bug
bul
bxr
cat
cdo
ceb
ces
cha
che
chr
chu
chv
chy
ckb
cor
cos
cre
crh
csb
cym
dan
deu
din
diq
div
dsb
dzo
ell
eng
epo
est
eus
ewe
ext
fao
fas
fij
fin
fra
frp
frr
fry
ful
fur
gag
gan
gla
gle
glg
glk
glv
gom
got
grn
guj
hak
hat
hau
haw
hbs
heb
her
hif
hin
hmo
hrv
hsb
hun
hye
ibo
ido
iku
ile
ilo
ina
ind
ipk
isl
ita
jam
jav
jbo
jpn
kaa
kab
kal
kan
kas
kat
kau
kaz
kbd
kbp
khm
kik
kin
kir
koi
kom
kon
kor
krc
ksh
kur
lad
lao
lat
lav
lbe
lez
lij
lim
lin
lit
lmo
lrc
ltg
ltz
lug
lzh
mah
mai
mal
mar
mdf
mhr
min
mkd
mlg
mlt
mon
mri
mrj
msa
mus
mwl
mya
myv
mzn
nan
nap
nau
nav
ndo
nds
nep
new
nld
nno
nor
nov
nso
nya
oci
olo
ori
orm
oss
pag
pam
pan
pap
pcd
pdc
pfl
pih
pli
pms
pnb
pnt
pol
por
pus
que
rmy
roh
ron
rue
run
rup
rus
sag
sah
san
scn
sco
sgs
sin
slk
slv
sme
smo
sna
snd
som
sot
spa
sqi
srd
srn
srp
ssw
stq
sun
swa
swe
szl
tah
tam
tat
tcy
tel
ten
tet
tgk
tgl
tha
tir
ton
tpi
tsn
tso
tuk
tum
tur
twi
tyv
udm
uig
ukr
urd
uzb
vec
ven
vep
vie
vls
vol
vro
war
wln
wol
wuu
xal
xho
xmf
yid
yor
yue
zea
zha
zho
zul
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
http://creativecommons.org/licenses/by-sa/3.0/
Subject:Wikipedia
text corpora
monolingual corpus
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2735
DateStamp:  2018-07-02
GetRecord:  OAI-PMH request for simple DC format
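
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch of how such a request URL is typically assembled, the snippet below builds the query string from the record's OaiIdentifier. The base URL is a hypothetical placeholder, since this record does not state the repository's OAI-PMH endpoint, and the metadata prefixes "oai_dc" (simple DC) and "olac" (OLAC format) are assumed to be the ones the repository exposes.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the repository's OAI-PMH base URL.
BASE_URL = "https://example.org/oai/request"

# Identifier copied from the OaiIdentifier field of this record.
OAI_ID = "oai:lindat.mff.cuni.cz:11234/1-2735"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "oai_dc" for simple DC, "olac" for the OLAC format.
dc_url = get_record_url(BASE_URL, OAI_ID, "oai_dc")
olac_url = get_record_url(BASE_URL, OAI_ID, "olac")
print(dc_url)
print(olac_url)
```

The identifier is percent-encoded by urlencode, so the colons and the slash in it are safe to pass as a query parameter.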

Search Info

Citation: Rosa, Rudolf. 2018. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Africa area_Americas area_Asia area_Europe area_Pacific country_AL country_AM country_AT country_BA country_BD country_BE country_BG country_BI country_BT country_BW country_BY country_CA country_CD country_CF country_CH country_CN country_CO country_CW country_CZ country_DE country_DK country_DZ country_EE country_EG country_ES country_ET country_FI country_FJ country_FR country_GB country_GE country_GH country_GL country_GR country_GU country_HR country_HT country_HU country_ID country_IE country_IL country_IM country_IN country_IQ country_IR country_IS country_IT country_JM country_JP country_KE country_KG country_KH country_KR country_KZ country_LA country_LK country_LS country_LT country_LU country_LV country_MD country_MH country_MK country_ML country_MM country_MT country_MV country_MW country_NA country_NF country_NG country_NL country_NO country_NP country_NR country_NZ country_PF country_PG country_PH country_PK country_PL country_PT country_RO country_RS country_RU country_RW country_SE country_SI country_SK country_SN country_SO country_SR country_SZ country_TG country_TH country_TJ country_TM country_TO country_TR country_UA country_UG country_US country_UZ country_VA country_VN country_VU country_WS country_ZA country_ZW dcmi_Text iso639_abk iso639_ace iso639_ady iso639_afr iso639_aka iso639_als iso639_amh iso639_ang iso639_ara iso639_arc iso639_arg iso639_arz iso639_asm iso639_ast iso639_atj iso639_ava iso639_aym iso639_azb iso639_aze iso639_bak iso639_bam iso639_bar iso639_bcl iso639_bel iso639_ben iso639_bis iso639_bjn iso639_bod iso639_bos iso639_bpy iso639_bre iso639_bug iso639_bul iso639_bxr iso639_cat iso639_cdo iso639_ceb iso639_ces iso639_cha iso639_che iso639_chr iso639_chu iso639_chv iso639_chy iso639_ckb iso639_cor iso639_cos iso639_cre iso639_crh iso639_csb iso639_cym iso639_dan iso639_deu iso639_din iso639_diq iso639_div iso639_dsb iso639_dzo iso639_ell iso639_eng iso639_epo iso639_est iso639_eus iso639_ewe iso639_ext 
iso639_fao iso639_fas iso639_fij iso639_fin iso639_fra iso639_frp iso639_frr iso639_fry iso639_ful iso639_fur iso639_gag iso639_gan iso639_gla iso639_gle iso639_glg iso639_glk iso639_glv iso639_gom iso639_got iso639_grn iso639_guj iso639_hak iso639_hat iso639_hau iso639_haw iso639_hbs iso639_heb iso639_her iso639_hif iso639_hin iso639_hmo iso639_hrv iso639_hsb iso639_hun iso639_hye iso639_ibo iso639_ido iso639_iku iso639_ile iso639_ilo iso639_ina iso639_ind iso639_ipk iso639_isl iso639_ita iso639_jam iso639_jav iso639_jbo iso639_jpn iso639_kaa iso639_kab iso639_kal iso639_kan iso639_kas iso639_kat iso639_kau iso639_kaz iso639_kbd iso639_kbp iso639_khm iso639_kik iso639_kin iso639_kir iso639_koi iso639_kom iso639_kon iso639_kor iso639_krc iso639_ksh iso639_kur iso639_lad iso639_lao iso639_lat iso639_lav iso639_lbe iso639_lez iso639_lij iso639_lim iso639_lin iso639_lit iso639_lmo iso639_lrc iso639_ltg iso639_ltz iso639_lug iso639_lzh iso639_mah iso639_mai iso639_mal iso639_mar iso639_mdf iso639_mhr iso639_min iso639_mkd iso639_mlg iso639_mlt iso639_mon iso639_mri iso639_mrj iso639_msa iso639_mus iso639_mwl iso639_mya iso639_myv iso639_mzn iso639_nan iso639_nap iso639_nau iso639_nav iso639_ndo iso639_nds iso639_nep iso639_new iso639_nld iso639_nno iso639_nor iso639_nov iso639_nso iso639_nya iso639_oci iso639_olo iso639_ori iso639_orm iso639_oss iso639_pag iso639_pam iso639_pan iso639_pap iso639_pcd iso639_pdc iso639_pfl iso639_pih iso639_pli iso639_pms iso639_pnb iso639_pnt iso639_pol iso639_por iso639_pus iso639_que iso639_rmy iso639_roh iso639_ron iso639_rue iso639_run iso639_rup iso639_rus iso639_sag iso639_sah iso639_san iso639_scn iso639_sco iso639_sgs iso639_sin iso639_slk iso639_slv iso639_sme iso639_smo iso639_sna iso639_snd iso639_som iso639_sot iso639_spa iso639_sqi iso639_srd iso639_srn iso639_srp iso639_ssw iso639_stq iso639_sun iso639_swa iso639_swe iso639_szl iso639_tah iso639_tam iso639_tat iso639_tcy iso639_tel iso639_ten iso639_tet iso639_tgk 
iso639_tgl iso639_tha iso639_tir iso639_ton iso639_tpi iso639_tsn iso639_tso iso639_tuk iso639_tum iso639_tur iso639_twi iso639_tyv iso639_udm iso639_uig iso639_ukr iso639_urd iso639_uzb iso639_vec iso639_ven iso639_vep iso639_vie iso639_vls iso639_vol iso639_vro iso639_war iso639_wln iso639_wol iso639_wuu iso639_xal iso639_xho iso639_xmf iso639_yid iso639_yor iso639_yue iso639_zea iso639_zha iso639_zho iso639_zul olac_primary_text
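
The Terms field is a flat, whitespace-separated list in which each term carries a prefix (area, country, dcmi, iso639, olac) followed by an underscore. A minimal sketch of grouping such terms by prefix is shown below; the function name group_terms is illustrative only, and the sample string is a short excerpt of terms that appear in this record rather than the full list.

```python
from collections import defaultdict


def group_terms(terms_line):
    """Group whitespace-separated search terms by the prefix before the first underscore."""
    groups = defaultdict(list)
    for term in terms_line.split():
        # Split only at the first underscore so "olac_primary_text" keeps "primary_text" whole.
        prefix, _, value = term.partition("_")
        groups[prefix].append(value)
    return dict(groups)


# Short excerpt of the Terms field of this record (not the complete list).
sample = (
    "area_Africa area_Europe country_CZ country_DE "
    "dcmi_Text iso639_ces iso639_deu olac_primary_text"
)
grouped = group_terms(sample)
print(grouped["iso639"])   # ['ces', 'deu']
print(grouped["country"])  # ['CZ', 'DE']
```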


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2735
Up-to-date as of: Thu Sep 13 1:30:00 EDT 2018