OLAC Record: Czech Web Corpus 2017 (csTenTen17)

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-4835

Metadata

Title: Czech Web Corpus 2017 (csTenTen17)

Bibliographic Citation: http://hdl.handle.net/11234/1-4835

Creator: Suchomel, Vít

Date (W3CDTF): 2022-09-15T14:18:30Z

Date Available: 2022-09-15T14:18:30Z

Description: The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). The data was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS-annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: General web, Wikipedia. Time span of crawling: May, October and November 2017, October and November 2016, October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: Plain text, vertical (one token per line), gzip compressed. The vertical contains the following structures: documents (<doc>, usually corresponding to web pages), paragraphs (<p>), sentences (<s>) and word join markers (<g/>, a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually <h1> to <h6> elements in the original HTML data). Block elements in the case of an HTML source, or double blank lines in the case of other source formats, were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
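
The vertical format described above can be read with a small streaming parser. The following is a minimal sketch, not the tooling used to build the corpus. It assumes the structure tags are written as <doc>, <p>, <s> and <g/>, each alone on its own line, and that attribute values are double-quoted. The column names in COLUMNS, the function names and the sample data (including its tag values) are invented for illustration and are not taken from the real corpus or tagset.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Structure lines: <doc ...>, </doc>, <p ...>, </p>, <s>, </s>, <g/>.
# Token lines contain tabs, so they never match this pattern.
TAG_RE = re.compile(r'^<(/?)(doc|p|s|g)(\s[^\t]*?)?(/?)>$')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical names for the four positional attributes listed in the record.
COLUMNS = ("word", "tag", "lempos", "gender_lemma")


def read_vertical(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per document: its metadata and its sentences."""
    doc = None
    sentence = None
    heading = False   # True while inside a paragraph with heading="1"
    glue = False      # True if a <g/> marker precedes the next token
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        match = TAG_RE.match(line)
        if match is None:
            # Token line: tab-separated positional attributes.
            if sentence is not None:
                token = dict(zip(COLUMNS, line.split("\t")))
                token["glued"] = glue
                sentence["tokens"].append(token)
            glue = False
            continue
        closing, name, attrs = match.group(1) == "/", match.group(2), match.group(3) or ""
        if name == "doc" and not closing:
            doc = {"meta": dict(ATTR_RE.findall(attrs)), "sentences": []}
        elif name == "doc":
            if doc is not None:
                yield doc
            doc = None
        elif name == "p":
            heading = (not closing) and dict(ATTR_RE.findall(attrs)).get("heading") == "1"
        elif name == "s" and not closing:
            sentence = {"heading": heading, "tokens": []}
        elif name == "s":
            if doc is not None and sentence is not None:
                doc["sentences"].append(sentence)
            sentence = None
        elif name == "g":
            glue = True


def detokenize(tokens: List[Dict]) -> str:
    """Rebuild the surface text, omitting the space before glued tokens."""
    parts = []
    for tok in tokens:
        if parts and not tok["glued"]:
            parts.append(" ")
        parts.append(tok["word"])
    return "".join(parts)


# Invented sample; the tag and lemma values are placeholders.
SAMPLE = (
    '<doc src="web" title="Example" url="http://example.cz/" crawl_date="2017-05-01">\n'
    '<p heading="1">\n'
    '<s>\n'
    'Ahoj\tX\tahoj-x\tahoj\n'
    '<g/>\n'
    '!\tZ\t!-x\t!\n'
    '</s>\n'
    '</p>\n'
    '</doc>\n'
)

docs = list(read_vertical(SAMPLE.splitlines()))
```

For the gzip-compressed distribution, the same function could be fed from gzip.open(path, "rt", encoding="utf-8"), where path is whatever file name the download provides and UTF-8 is an assumption about the encoding. Token lines with fewer than four columns (for example, tokens without a gender-respecting lemma) simply produce fewer keys.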

Identifier (URI): http://hdl.handle.net/11234/1-4835

Language: Czech

Language (ISO639): ces

Publisher: Masaryk University, NLP Centre

Publisher: Lexical Computing CZ s.r.o.

Rights: NLP Centre Web Corpus License

Rights: https://lindat.mff.cuni.cz/repository/xmlui/page/license-NLPC-WeC

Subject: Web corpus

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-4835

DateStamp: 2023-01-10

GetRecord: OAI-PMH request for simple DC format
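
The GetRecord entries above refer to OAI-PMH requests whose target URLs are not preserved in this record. As a rough sketch of how such a request is formed, the snippet below builds a GetRecord URL from the OAI identifier given here. The base URL is a hypothetical placeholder, not the archive's actual endpoint, and the function name is invented. The metadata prefixes oai_dc (simple Dublin Core) and olac are the conventional ones, but whether a given repository serves them should be checked against that repository.

```python
from urllib.parse import urlencode


def get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint; replace with the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai/request"
IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-4835"

dc_url = get_record_url(BASE_URL, IDENTIFIER, "oai_dc")
olac_url = get_record_url(BASE_URL, IDENTIFIER, "olac")
```

The identifier is percent-encoded by urlencode, so the colons and the slash in it do not interfere with the query string.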

Search Info
Citation: Suchomel, Vít. 2022. Masaryk University, NLP Centre.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-4835
Up-to-date as of: Mon Jun 16 1:07:55 EDT 2025
