OLAC Record: WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-1713

Metadata

Title: WordSim353-cs: Evaluation Dataset for Lexical Similarity and Relatedness, based on WordSim353

Bibliographic Citation: http://hdl.handle.net/11234/1-1713

Creator: Cinková, Silvie

Straková, Jana

Hajič, Jakub

Hajič, Jan

Hajič, Jan, jr.

Janoušková, Jolana

Straka, Milan

Urešová, Miroslava

Date (W3CDTF): 2016-10-10T15:11:23Z

Date Available: 2016-10-10T15:11:23Z

Description: Czech translation of WordSim353. Czech translations of the English WordSim353 word pairs were obtained from four translators. All translation variants were scored by 25 Czech annotators according to the lexical similarity/relatedness annotation instructions for WordSim353 annotators. The resulting data set consists of two annotation files: "WordSim353-cs.csv" and "WordSim-cs-Multi.csv". Both files are encoded in UTF-8 and have a header row; text is enclosed in double quotes, and columns are separated by commas. The rows are numbered. The WordSim-cs-Multi data set has rows numbered from 1 to 634, whereas the row indices in the WordSim353-cs data set reflect the corresponding row numbers in the WordSim-cs-Multi data set. The WordSim353-cs file contains a one-to-one mapping selection of 353 Czech equivalent pairs whose judgments proved most similar to the judgments of their corresponding English originals (compared by the absolute value of the difference between the means over all annotators in each language counterpart). In one case ("psychology-cognition"), two Czech equivalent pairs had identical means as well as confidence intervals, so we randomly selected one. The "WordSim-cs-Multi.csv" file contains human judgments for all translation variants. In both data sets, we preserved all 25 individual scores. In the WordSim353-cs data set, we added a column with the Czech means and a column with the original English means, with the 95% confidence intervals for each mean in separate columns (computed by the CI function in the Rmisc R package). The WordSim-cs-Multi data set contains only the Czech means and confidence intervals. For convenient lexical search, both data sets provide separate columns with the individual Czech and English words, the entire word pairs, and finally an English-Czech quadruple.
The data set also contains an xls table with the four translations and a preliminary selection of the best variants performed by an adjudicator.
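
The selection rule described above (per English pair, keep the Czech variant whose mean is closest to the English mean) can be sketched as follows. This is only an illustration under stated assumptions: the column names (row_id, en_pair, cs_pair, cs_mean, en_mean) and all sample values are invented here and are not taken from the real files, whose actual headers should be checked first.

```python
import csv
import io

# Hypothetical excerpt in the described format (header row, double-quoted
# text, comma-separated). Column names and values are invented for
# illustration only.
SAMPLE = '''"row_id","en_pair","cs_pair","cs_mean","en_mean"
"1","tiger-cat","tygr-kočka","6.80","7.35"
"2","tiger-cat","tygr-kocour","7.20","7.35"
"3","book-paper","kniha-papír","7.10","7.46"
"4","book-paper","kniha-noviny","5.00","7.46"
'''


def select_closest(rows):
    """Per English pair, keep the row with the smallest |cs_mean - en_mean|."""
    best = {}
    for row in rows:
        gap = abs(float(row["cs_mean"]) - float(row["en_mean"]))
        key = row["en_pair"]
        # Strict comparison: on an exact tie the first variant seen is kept.
        if key not in best or gap < best[key][0]:
            best[key] = (gap, row)
    return {key: row for key, (gap, row) in best.items()}


# The csv module's default dialect already matches the described layout.
reader = csv.DictReader(io.StringIO(SAMPLE))
selected = select_closest(reader)

for en_pair, row in selected.items():
    print(row["row_id"], en_pair, row["cs_pair"])
```

To read a real file instead of the in-memory sample, one would open it with open(path, encoding="utf-8", newline="") and pass the file object to csv.DictReader. Note that the sketch resolves ties by keeping the first variant, whereas the description says the single tie in the data was resolved by random selection.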

Identifier (URI): http://hdl.handle.net/11234/1-1713

Language: Czech

English

Language (ISO639): ces

eng

Publisher: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL)

Rights: Creative Commons - Attribution 4.0 International (CC BY 4.0)

http://creativecommons.org/licenses/by/4.0/

Subject: lexical semantics

similarity

relatedness

evaluation

distributional semantics

Czech language

English language

Subject (ISO639): ces

eng

Type: lexicalConceptualResource

Type (DCMI): Text

Type (OLAC): lexicon

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-1713

DateStamp: 2021-06-29

GetRecord: OAI-PMH request for simple DC format
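
For readers who want to form such a GetRecord request themselves, here is a minimal sketch of how the query string is assembled from the OaiIdentifier above. The base URL is a placeholder assumption (the record does not state the endpoint), and get_record_url is a hypothetical helper name; "oai_dc" and "olac" are the metadata prefixes conventionally used for simple DC and OLAC formats.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint. Substitute the repository's real
# OAI-PMH base URL, which is not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


dc_url = get_record_url("oai:lindat.mff.cuni.cz:11234/1-1713", "oai_dc")
olac_url = get_record_url("oai:lindat.mff.cuni.cz:11234/1-1713", "olac")
print(dc_url)
print(olac_url)
```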

Search Info
Citation: Cinková, Silvie; Straková, Jana; Hajič, Jakub; Hajič, Jan; Hajič, Jan, jr.; Janoušková, Jolana; Straka, Milan; Urešová, Miroslava. 2016. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (UFAL).
Terms: area_Europe country_CZ country_GB dcmi_Text iso639_ces iso639_eng olac_lexicon

Inferred Metadata
Country: Czech Republic United Kingdom
Area: Europe

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-1713
Up-to-date as of: Mon Jun 16 1:04:59 EDT 2025
