OLAC Record: 2006 CoNLL Shared Task

OLAC Record
oai:www.ldc.upenn.edu:LDC2015T11

Metadata

Title: 2006 CoNLL Shared Task - Ten Languages

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Bulgarian Academy of Sciences, et al. 2006 CoNLL Shared Task - Ten Languages LDC2015T11. Web Download. Philadelphia: Linguistic Data Consortium, 2015

Contributor: Bulgarian Academy of Sciences

Eberhard-Karls-Universität

Copenhagen Business School

Danish Society for Language and Literature

University of Groningen

Universität Potsdam

Universität des Saarlandes

Universität Stuttgart

Eberhard-Karls-Universität Tübingen

University of Southern Denmark

SINTEF Telcom & Informatics

Jožef Stefan Institute

Charles University

The Fran Ramovš Institute for the Slovenian Language

University of Barcelona

Uppsala University

Växjö University

Middle East Technical University

Date (W3CDTF): 2015

Date Issued (W3CDTF): 2015-06-15

Description:

*Introduction*

2006 CoNLL Shared Task - Ten Languages consists of dependency treebanks in ten languages used as part of the CoNLL 2006 shared task on multilingual dependency parsing. The languages covered in this release are Bulgarian, Danish, Dutch, German, Japanese, Portuguese, Slovene, Spanish, Swedish and Turkish.

LDC also released the following 2006 & 2007 CoNLL Shared Task corpora:

* 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish (LDC2018T06)
* 2007 CoNLL Shared Task - Greek, Hungarian & Italian (LDC2018T07)
* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)

This corpus is cross-listed and jointly released with ELRA as ELRA-W0086.

The Conference on Computational Natural Language Learning (CoNLL) is accompanied every year by a shared task intended to promote natural language processing applications and evaluate them in a standard setting. In 2006, the shared task was devoted to the parsing of syntactic dependencies using corpora from up to thirteen languages. The task aimed to define and extend the then-current state of the art in dependency parsing, a technology that complemented previous tasks by producing a different kind of syntactic description of input text. More information about the 2006 shared task is available on the CoNLL-X web page.

LDC has released data sets from other CoNLL shared tasks. 2008 CoNLL Shared Task Data contains the English material used in the 2008 shared task, which focused on English, employed a unified dependency-based formalism, and merged the tasks of syntactic dependency parsing, identifying semantic arguments, and labeling them with semantic roles. 2009 CoNLL Shared Task Data Parts 1 and 2 consist of the English, Catalan, Chinese, Czech, German and Spanish resources used in the 2009 task, which included a comparison of time and space complexity based on participants' input and a learning curve comparison for languages with large datasets.

LDC has also released the following CoNLL Shared Task data sets:

* 2006 CoNLL Shared Task - Arabic & Czech (LDC2015T12)
* 2008 CoNLL Shared Task Data (LDC2009T12)
* 2009 CoNLL Shared Task Part 1 (LDC2012T03)
* 2009 CoNLL Shared Task Part 2 (LDC2012T04)
* 2015-2016 CoNLL Shared Task (LDC2017T13)

*Data*

The source data in the treebanks in this release consists principally of various texts (e.g., textbooks, news, literature) annotated in dependency format. In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that other units in the sentence are connected to the verb by directed links, or dependencies. This is a one-to-one correspondence: for every element in the sentence there is one node in the sentence structure that corresponds to that element. In constituency or phrase structure grammars, on the other hand, clauses are divided into noun phrases and verb phrases, and in each sentence one or more nodes may correspond to one element. The Penn Treebank (LDC99T42) is an example of a constituency or phrase structure approach.

All of the data sets in this release are dependency treebanks. The individual data sets are:

* BulTreeBank (Bulgarian)
* The Danish Dependency Treebank (Danish)
* The Alpino Treebank (Dutch)
* The TIGER Corpus (German)
* Treebank Tuba-J/S (Japanese)
* Floresta Sinta(c)tica (Portuguese)
* Slovene Dependency Treebank, SDT V0.1 (Slovene)
* Cast3LB (Spanish)
* Talbanken05 (Swedish)
* METU-Sabanci Turkish Treebank (Turkish)

*Samples*

Japanese and Bulgarian samples are available for this release.

*Updates*

None at this time.
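
The one-to-one correspondence described under *Data* can be illustrated with a small sketch. It assumes the tab-separated, ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). This record does not itself state the file layout of the release, and the toy English sentence, its labels, and the function names below are hypothetical and not taken from the corpus.

```python
# Hypothetical toy sentence in an assumed 10-column CoNLL-X layout:
# ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL.
# Illustrative only; it is not data from the release.
SAMPLE = "\n".join([
    "1\tThe\tthe\tDET\tDET\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\tNOUN\t_\t3\tsubj\t_\t_",
    "3\tbarks\tbark\tVERB\tVERB\t_\t0\tROOT\t_\t_",
])


def read_heads(block):
    """Return {token_id: (form, head_id, deprel)} for one sentence."""
    tokens = {}
    for line in block.splitlines():
        if not line.strip():
            continue  # a blank line would separate sentences
        cols = line.split("\t")
        # Column 1 is the token ID, column 7 the head ID, column 8 the label.
        tokens[int(cols[0])] = (cols[1], int(cols[6]), cols[7])
    return tokens


def roots(tokens):
    """Token IDs whose head is 0, i.e. attached to the artificial root."""
    return [tid for tid, (_, head, _) in tokens.items() if head == 0]


tokens = read_heads(SAMPLE)
for tid, (form, head, rel) in sorted(tokens.items()):
    head_form = "ROOT" if head == 0 else tokens[head][0]
    print(f"{form} --{rel}--> {head_form}")
```

Each token contributes exactly one entry and names exactly one head, so the structure has one node per word. In this toy sentence the verb is the only token attached to the root, matching the verb-centered view described above.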

Extent: Corpus size: 88568 KB

Identifier: LDC2015T11

https://catalog.ldc.upenn.edu/LDC2015T11

ISBN: 1-58563-717-3

ISLRN: 578-227-532-044-0

DOI: 10.35111/n6q5-tg41

Language: Bulgarian

Danish

Dutch

German

Japanese

Portuguese

Slovenian

Spanish

Swedish

Turkish

Language (ISO639): bul

dan

nld

deu

jpn

por

slv

spa

swe

tur

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2015T11

Rights Holder: Portions © 2002-2005 Gosse Bouma, © 2002-2004 Mattias Buch-Kromann, © 2006 Eberhard-Karls Universitaet Tuebingen, Seminar fuer Sprachwissenschaft, Abt. Computerlinguistik, © 2006 Jan Einarsson, © 2002-2005 Geert Kloosterman, © 2002-2005 Robert Malouf, © 2006 Joakim Nivre, © 2006 Technical University of Catalonia, © 2006 Technical University of Valencia, © 2002-2004 The Department of International Language Studies and Computational Linguistics at the Copenhagen Business School, © 1998 The Society for Danish Language and Literature, © 2006 University of Alicante, © 2006 University of Barcelona, © 2002-2005 University of Groningen, © 2002-2005 Leonoor van der Beek, © 2002-2005 Gertjan van Noord, © 2015 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2015T11

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
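
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is a URL carrying the `verb`, `identifier`, and `metadataPrefix` arguments defined by OAI-PMH. The base URL below is a placeholder, since this record does not state the repository endpoint, and the function name is hypothetical. The prefixes `olac` and `oai_dc` are assumed to correspond to the OLAC and simple DC formats named above.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the repository's base URL.
BASE_URL = "https://example.org/oai"

# Identifier as listed under OaiIdentifier in this record.
RECORD_ID = "oai:www.ldc.upenn.edu:LDC2015T11"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url(RECORD_ID, "olac")   # assumed OLAC format prefix
dc_url = get_record_url(RECORD_ID, "oai_dc")   # simple Dublin Core prefix
print(olac_url)
print(dc_url)
```

Only the query arguments come from the protocol; substituting the archive's real base URL is left to the reader.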

Search Info
Citation: Bulgarian Academy of Sciences; Eberhard-Karls-Universität; Copenhagen Business School; Danish Society for Language and Literature; University of Groningen; Universität Potsdam; Universität des Saarlandes; Universität Stuttgart; Eberhard-Karls-Universität Tübingen; University of Southern Denmark; SINTEF Telcom & Informatics; Jožef Stefan Institute; Charles University; The Fran Ramovš Institute for the Slovenian Language; University of Barcelona; Uppsala University; Växjö University; Middle East Technical University. 2015. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_BG country_DE country_DK country_ES country_JP country_NL country_PT country_SE country_SI country_TR dcmi_Text iso639_bul iso639_dan iso639_deu iso639_jpn iso639_nld iso639_por iso639_slv iso639_spa iso639_swe iso639_tur olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2015T11
Up-to-date as of: Wed Oct 29 7:01:32 EDT 2025
