OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2144

Metadata
Title:Automatically generated spelling correction corpus for Czech (Czech-SEC-AG)
Bibliographic Citation:http://hdl.handle.net/11234/1-2144
Creator:Hajič, Jan
Náplava, Jakub
Straka, Milan
Date (W3CDTF):2017-05-03T08:09:00Z
Date Available:2017-05-03T08:09:00Z
Description:Automatically generated spelling correction corpus for Czech (Czesl-SEC-AG) is a corpus containing text with automatically generated spelling errors. To create the spelling errors, a character error model is used that contains probabilities of character substitution, insertion and deletion, and of swapping two adjacent characters. In addition, the probabilities of changing character casing are considered. The original clean text on which the spelling errors were generated is PDT3.0 (http://hdl.handle.net/11858/00-097C-0000-0023-1AAF-3). The original train/dev/test sentence split of the PDT3.0 corpus is preserved in this dataset. Besides the data with artificial spelling errors, we also publish the texts from which the character error model was created. These are the original manual transcript of the audiobook Švejk and its corrected version, prepared by the authors of Korektor (http://ufal.mff.cuni.cz/korektor). As with the CzeSL Grammatical Error Correction Dataset (CzeSL-GEC: http://hdl.handle.net/11234/1-2143), these data are processed into four sets based on the difficulty of the errors present.
Identifier (URI):http://hdl.handle.net/11234/1-2144
Language:Czech
Language (ISO639):ces
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
http://creativecommons.org/licenses/by-nc-sa/3.0/
Subject:spelling correction
natural language correction
Type:corpus
Type (DCMI):Text
Type (OLAC):primary_text
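
Illustration: the character error model named in the Description field can be sketched roughly as below. This is a minimal sketch and not the authors' implementation. The probability values, the `ALPHABET` string, and the names `ERROR_PROBS` and `add_spelling_errors` are all hypothetical; the real model was estimated from the Švejk transcripts and may condition on the specific characters involved, which this sketch does not do.

```python
import random

# Hypothetical per-character operation probabilities (assumed values,
# not taken from the published error model).
ERROR_PROBS = {
    "substitute": 0.02,
    "insert": 0.01,
    "delete": 0.01,
    "swap": 0.01,
    "case": 0.005,
}

# Assumed replacement alphabet: lowercase Czech letters.
ALPHABET = "aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž"

OPERATIONS = ("substitute", "insert", "delete", "swap", "case")


def add_spelling_errors(text, probs=ERROR_PROBS, rng=None):
    """Return a copy of `text` with randomly injected character errors."""
    if rng is None:
        rng = random.Random()
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Draw one number and pick the operation whose cumulative
        # probability interval contains it (or no operation at all).
        r = rng.random()
        threshold = 0.0
        op = None
        for name in OPERATIONS:
            threshold += probs.get(name, 0.0)
            if r < threshold:
                op = name
                break
        if op == "substitute":
            out.append(rng.choice(ALPHABET))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(ALPHABET))
        elif op == "delete":
            pass  # drop the character
        elif op == "swap" and i + 1 < len(text):
            # Emit the next character first, then this one, and skip ahead.
            out.append(text[i + 1])
            out.append(ch)
            i += 1
        elif op == "case":
            out.append(ch.swapcase())
        else:
            out.append(ch)  # no error (or swap requested on the last character)
        i += 1
    return "".join(out)
```

Calling `add_spelling_errors("Dobrý den", rng=random.Random(0))` returns a noisy copy of the input; passing a seeded `random.Random` makes the result repeatable.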

OLAC Info

Archive:  LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2144
DateStamp:  2018-07-02
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Hajič, Jan; Náplava, Jakub; Straka, Milan. 2017. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text


http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2144
Up-to-date as of: Sun Sep 1 18:24:30 EDT 2019