OLAC Record
oai:scholarspace.manoa.hawaii.edu:10125/41982

Metadata
Title:DUCKS in a Row: Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon
Bibliographic Citation:Benjamin, Martin, Mansour Lakouraj, Sina, Aberer, Karl; 2017-03-05; Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/41982.
Contributor (speaker):Benjamin, Martin
Mansour Lakouraj, Sina
Aberer, Karl
Creator:Benjamin, Martin
Mansour Lakouraj, Sina
Aberer, Karl
Date (W3CDTF):2017-03-05
Description:This paper introduces DUCKS, Data Unified Conceptual Knowledge Sets, as a tool for aligning lexical data across any number of languages. A starting point in producing a multilingual dictionary is to merge bilingual datasets through the overlapping words in a common pivot language. An essential problem in maintaining accuracy across languages is determining the matching senses of a polysemous pivot term, e.g. a term in Language-A meaning “spicy” might well be paired to a term in Language-C meaning “sweltering” because they are both connected to English “hot”. DUCKS addresses this problem through a game-like interface that invites experts and interested members of the public to participate in the sense disambiguation of linguistic datasets. DUCKS starts with the 100,000 concepts defined in the Princeton WordNet, for 200,000 English lemmas, and English will be expanded through a version of the game that matches senses from Wiktionary. In the basic case, we start with a dataset between Language-A and English. When a user selects a term in Language-A, we show all the contextual information about that item in a graphic block on the left of their screen, and all the senses of the designated English term on the right. The user slides the block to the definition that best matches the meaning in Language-A. If two or more English senses apply, duplicate blocks are available. The user may also select “no definition applies” when relevant. If English is absent from the dataset, the user must first type an equivalent term in English or another language that has already been aligned. We then fetch the possible senses, and play proceeds as above. A match is considered valid when a threshold number of players has made the same selection. DUCKS does not address semantic drift, which is resolved in other games the project has developed.
In addition to integrating Language-A with all other languages in the system that share similar concepts for accurate multilingual exchange, concepts without English equivalents can be discovered that may be unique to that language. Data has been aligned among several dozen languages to date, beginning with languages with open data previously linked to WordNet. A large challenge now is that many existing datasets for less-resourced languages are closed data; it is hoped that DUCKS will inspire their proprietors toward joining the multilingual lexicon.
Identifier (URI):http://hdl.handle.net/10125/41982
Table Of Contents:41982.mp3
Type (DCMI):Sound
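
Illustrative note: The Description above outlines two ideas that a short example may make more concrete: aligning terms by a disambiguated sense rather than by the bare pivot word, and accepting a match only once a threshold number of players agree. The Python sketch below is not the DUCKS implementation. The sense labels (`hot.spicy`, `hot.warm`), the term names, the vote data, and the threshold value of 3 are all hypothetical; the record states only that "a threshold number" of matching selections is required.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: each (language, term) was paired with English "hot"
# in a bilingual dataset; the list holds the sense each player selected.
PLAYER_VOTES = {
    ("lang_a", "term_spicy"): ["hot.spicy", "hot.spicy", "hot.spicy", "hot.warm"],
    ("lang_b", "term_piquant"): ["hot.spicy", "hot.spicy", "hot.spicy"],
    ("lang_c", "term_sweltering"): ["hot.warm", "hot.warm", "hot.warm"],
    ("lang_b", "term_unsure"): ["hot.warm", "hot.spicy"],
}

THRESHOLD = 3  # assumed value, not taken from the source


def validated_senses(votes, threshold=THRESHOLD):
    """Return the senses that at least `threshold` players selected."""
    counts = Counter(votes)
    return {sense for sense, n in counts.items() if n >= threshold}


def align_by_sense(player_votes, threshold=THRESHOLD):
    """Group terms by validated sense instead of by the pivot lemma alone."""
    by_sense = defaultdict(set)
    for (lang, term), votes in player_votes.items():
        for sense in validated_senses(votes, threshold):
            by_sense[sense].add((lang, term))
    return dict(by_sense)


aligned = align_by_sense(PLAYER_VOTES)
for sense, terms in sorted(aligned.items()):
    print(sense, sorted(terms))
```

In this toy run the "spicy" and "sweltering" terms land in separate groups even though both were linked to English "hot", and the term whose votes never reach the threshold is left unaligned.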

OLAC Info

Archive:  Language Documentation and Conservation
Description:  http://www.language-archives.org/archive/ldc.scholarspace.manoa.hawaii.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:scholarspace.manoa.hawaii.edu:10125/41982
DateStamp:  2017-05-11
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Benjamin, Martin; Mansour Lakouraj, Sina; Aberer, Karl. 2017. Language Documentation and Conservation.
Terms: dcmi_Sound


http://www.language-archives.org/item.php/oai:scholarspace.manoa.hawaii.edu:10125/41982
Up-to-date as of: Sat Apr 20 18:40:25 EDT 2024