OLAC Record: Prague Dependency Treebank 2.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2006T01

Metadata

Title: Prague Dependency Treebank 2.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Hajič, Jan, et al. Prague Dependency Treebank 2.0 LDC2006T01. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Contributor: Hajič, Jan

Panevová, Jarmila

Hajičová, Eva

Sgall, Petr

Pajas, Petr

Štěpánek, Jan

Havelka, Jiří

Mikulová, Marie

Žabokrtský, Zdeněk

Ševčíková-Razímová, Magda

Urešová, Zdeňka

Date (W3CDTF): 2006

Date Issued (W3CDTF): 2006-07-21

Description: *Introduction*

The Prague Dependency Treebank 2.0 (PDT 2.0) was developed by Charles University and contains approximately 2 million words of Czech text with interlinked morphological, syntactic, and complex semantic annotation. In addition, certain properties of sentence information structure and coreference relations are annotated at the semantic level. PDT 2.0 follows Prague Dependency Treebank 1.0 (LDC2001T10) and is based on the long-standing Praguian linguistic tradition, adapted to current computational linguistics research needs. The corpus itself uses the latest annotation technology. Software tools for corpus search, annotation, and language analysis are included, along with extensive documentation in English.

*Data*

The data in this corpus come from four sources:

* Lidové Noviny (daily newspaper), 1991, 1994, 1995
* Mladá Fronta Dnes (daily newspaper), 1992
* Českomoravský Profit (business weekly), 1994
* Vesmír (scientific journal), 1992, 1993

The texts in electronic form were provided by the Institute of the Czech National Corpus.

The data in PDT 2.0 are annotated on three layers: the morphological layer (m-layer), the analytical layer (a-layer), and the tectogrammatical layer (t-layer). The following table shows the amount of data, in K-words (thousands of words), by annotation layer and source. The layers are cumulative: everything annotated at the a-layer is also annotated at the m-layer, and everything annotated at the t-layer is also annotated at the other two layers.

Layer     Lidové Noviny   Mladá Fronta Dnes   Českomoravský Profit   Vesmír   Total
m-layer   1,235           373                 171                    178      1,957
a-layer   920             234                 171                    178      1,504
t-layer   640             119                 74                     0        833

The primary data format for PDT 2.0 is an XML-based format called PML. An SGML-based format called CSTS was the primary format of PDT 1.0; it is now used only as an intermediate format in older NLP tools (such as taggers and parsers).

As usual, the data are divided into three groups: training data, development test data, and evaluation test data. The training data cover approximately 80%, the development test data 10%, and the evaluation test data 10% of the whole data set; these proportions hold for all three layers of annotation.

*Samples*

For an example of the data in this corpus, please view these samples.

*Updates*

None at this time.
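The layer table and the approximate 80/10/10 split in the Description can be checked with a little arithmetic. The following is a minimal sketch, not part of the PDT 2.0 distribution: the figures are copied from the table, while the function names and the split labels (`train`, `dtest`, `etest`) are hypothetical names chosen for illustration. Because the record gives the proportions only as approximate, the computed split sizes are estimates.

```python
# K-words per source, copied from the table in the Description.
# Column order: Lidové Noviny, Mladá Fronta Dnes, Českomoravský Profit, Vesmír.
LAYERS = {
    "m-layer": ((1235, 373, 171, 178), 1957),
    "a-layer": ((920, 234, 171, 178), 1504),
    "t-layer": ((640, 119, 74, 0), 833),
}

# Approximate proportions stated in the record; the key names are hypothetical.
SPLIT = {"train": 0.8, "dtest": 0.1, "etest": 0.1}


def layers_are_nested(layers):
    """Return True if no source has more data at a deeper layer than at a shallower one."""
    order = ["m-layer", "a-layer", "t-layer"]
    for outer, inner in zip(order, order[1:]):
        outer_counts, inner_counts = layers[outer][0], layers[inner][0]
        if any(i > o for o, i in zip(outer_counts, inner_counts)):
            return False
    return True


def total_discrepancy(layers):
    """Return, per layer, the sum of the per-source figures minus the printed total."""
    return {name: sum(counts) - total for name, (counts, total) in layers.items()}


def approximate_split(total_kwords, split=SPLIT):
    """Estimate K-words per data group from the approximate proportions."""
    return {name: round(total_kwords * share) for name, share in split.items()}


if __name__ == "__main__":
    print(layers_are_nested(LAYERS))
    print(total_discrepancy(LAYERS))
    for layer, (_, total) in LAYERS.items():
        print(layer, approximate_split(total))
```

The per-source figures for the a-layer sum to one K-word less than the printed total, presumably because the table entries are rounded; the sketch reports such differences rather than treating them as errors.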

Extent: Corpus size: 515072 KB

Identifier: LDC2006T01

https://catalog.ldc.upenn.edu/LDC2006T01

ISBN: 1-58563-370-4

ISLRN: 942-053-729-014-3

DOI: 10.35111/e6p0-9s32

Language: Czech

Language (ISO639): ces

License: Prague Dependency Treebank 2.0: https://catalog.ldc.upenn.edu/license/prague-dependency-treebank-2.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2006T01

Rights Holder: Portions © 1991, 1994, 1995 Lidové noviny daily newspapers, © 1992 Mladá fronta Dnes daily newspapers, © 1994 Českomoravský Profit business weekly, © 1992-1993 Vesmír scientific magazine, Academia Publishers, © 1996-2005 Institute of Formal and Applied Linguistics and Center for Computational Linguistics, Faculty of Mathematics and Physics, Charles University, © 2006 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): lexicon

primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2006T01

DateStamp: 2021-04-16

GetRecord: OAI-PMH request for simple DC format
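
The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a rough sketch of how such request URLs are formed: the `verb`, `identifier`, and `metadataPrefix` parameters come from the OAI-PMH protocol, and `oai_dc` is its standard prefix for simple Dublin Core. The `olac` prefix is assumed here to be the one used for OLAC format, and the base URL is a placeholder, since this record does not state the repository's OAI-PMH endpoint.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2006T01"


def get_record_url(metadata_prefix, identifier=OAI_IDENTIFIER, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one metadata format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url("olac")   # OLAC format (prefix assumed)
dc_url = get_record_url("oai_dc")   # simple Dublin Core format

if __name__ == "__main__":
    print(olac_url)
    print(dc_url)
```

Replacing `BASE_URL` with the repository's actual endpoint would yield URLs that can be fetched with any HTTP client; the response is an XML document containing the record.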

Search Info
Citation: Hajič, Jan; Panevová, Jarmila; Hajičová, Eva; Sgall, Petr; Pajas, Petr; Štěpánek, Jan; Havelka, Jiří; Mikulová, Marie; Žabokrtský, Zdeněk; Ševčíková-Razímová, Magda; Urešová, Zdeňka. 2006. Linguistic Data Consortium.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_lexicon olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T01
Up-to-date as of: Wed Oct 29 7:00:18 EDT 2025
