Bibliographic Citation:http://hdl.handle.net/11234/1-2615
Creator:Straka, Milan; Mediankin, Nikita; Kocmi, Tom; Žabokrtský, Zdeněk; Hudeček, Vojtěch; Hajič, Jan
Date (W3CDTF):2020-01-10T09:44:46Z
Date Available:2020-01-10T09:44:46Z
Description:This entry contains the SumeCzech dataset and the RougeRAW metric used for evaluation. Both the dataset and the metric are described in the paper "SumeCzech: Large Czech News-Based Summarization Dataset" by Milan Straka et al. The dataset is distributed as a set of Python scripts that download the raw HTML pages from CommonCrawl and then process them into the required format. The MPL 2.0 license applies to the scripts downloading the dataset and to the RougeRAW implementation. Note: sumeczech-1.0-update-230225.zip is the updated release of the SumeCzech download script, including the original RougeRAW evaluation metric. The download script was modified to use the updated CommonCrawl download URL and to support Python 3.10 and Python 3.11. The downloaded dataset itself is exactly the same as before. The original archive sumeczech-1.0.zip was renamed to sumeczech-1.0-obsolete-180213.zip and is kept for reference.
Identifier (URI):http://hdl.handle.net/11234/1-2615
Language (ISO639):ces
Publisher:Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)
Rights:Mozilla Public License 2.0
Type (DCMI):Text
Type (OLAC):primary_text


Archive:  LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Description:  http://www.language-archives.org/archive/lindat.mff.cuni.cz
OaiIdentifier:  oai:lindat.mff.cuni.cz:11234/1-2615
DateStamp:  2023-02-27
Citation: Straka, Milan; Mediankin, Nikita; Kocmi, Tom; Žabokrtský, Zdeněk; Hudeček, Vojtěch; Hajič, Jan. 2020. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
