OLAC Record: Chinese Treebank 4.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2004T05

Metadata

Title: Chinese Treebank 4.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Palmer, Martha, et al. Chinese Treebank 4.0 LDC2004T05. Web Download. Philadelphia: Linguistic Data Consortium, 2004

Contributor: Palmer, Martha

Chiou, Fu-Dong

Xue, Nianwen

Lee, Tsan-Kuang

Date (W3CDTF): 2004

Date Issued (W3CDTF): 2004-03-15

Description: *Introduction*

Chinese Treebank 4.0 was developed by the Linguistic Data Consortium (LDC) and contains approximately 400,000 words of Chinese newswire text annotated in the manner of the Penn English Treebank. The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11). More information about the project is available on the Chinese Treebank website.

*Data*

The content used in this corpus comes from the following newswire sources:

Articles   Source
698        Xinhua (1994-1998)
55         Information Services Department of HKSAR (1997)
80         Sinorama magazine, Taiwan (1996-1998 & 2000-2001)

Here is the breakdown of the content:

Words      Hanzi      Sentences   Files
404,156    664,633    15,162      838

All files are GB encoded. The format of Chinese Treebank 4.0 is the same as that of the Penn English Treebank.

All files have been annotated at least twice. The first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). The corpus also provides seven files intended to serve as the gold standard annotation.

The corpus provides four versions of files: bracketed, raw, segmented, and part-of-speech tagged. The raw, segmented, and part-of-speech tagged versions are generated from the bracketed version and so do not reflect the previous annotation stages.

*Samples*

Please view these samples:

* Raw XML
* Segmented XML
* POS Tagged XML
* Bracketed XML

*Updates*

None at this time.

*Sponsorship*

This corpus was funded in part through the DARPA-TIDES grant number N66001-00-1-8915.
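The description notes that the files are GB encoded, use Penn Treebank-style bracketing, and that the segmented and part-of-speech tagged versions are generated from the bracketed version. A minimal Python sketch of that idea follows. It rests on several assumptions that are not stated in this record: the `gb18030` codec is used because it is a superset of GB2312 and GBK (the record says only "GB"), the sample tree is invented for illustration and is not taken from the corpus, the `word_TAG` output form is a guess at one possible tagged layout, and the function names are hypothetical. The distributed files may also wrap trees in additional markup that this sketch does not handle.

```python
from typing import List, Tuple, Union

# A tree is either a token (str) or a list: [label, child, child, ...].
Tree = Union[str, List["Tree"]]


def decode_gb(raw: bytes) -> str:
    # Assumption: gb18030 is chosen as a superset of GB2312 and GBK.
    return raw.decode("gb18030")


def parse_bracketed(text: str) -> Tree:
    """Parse one Penn Treebank-style bracketed tree into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        if tokens[pos] != "(":
            tok = tokens[pos]
            pos += 1
            return tok
        pos += 1  # skip "("
        node: List[Tree] = []
        while tokens[pos] != ")":
            node.append(parse())
        pos += 1  # skip ")"
        return node

    tree = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first tree")
    return tree


def leaves(tree: Tree) -> List[Tuple[str, str]]:
    """Collect (tag, word) pairs from preterminal nodes of the form [tag, word]."""
    if isinstance(tree, str):
        return []
    if len(tree) == 2 and isinstance(tree[0], str) and isinstance(tree[1], str):
        return [(tree[0], tree[1])]
    pairs: List[Tuple[str, str]] = []
    for child in tree:
        pairs.extend(leaves(child))
    return pairs


# Invented sample, encoded to bytes to imitate reading a GB-encoded file.
sample_bytes = "(IP (NP (NN 经济)) (VP (VV 发展)))".encode("gb18030")
tree = parse_bracketed(decode_gb(sample_bytes))

words = [word for _, word in leaves(tree)]                       # segmented view
tagged = ["{}_{}".format(word, tag) for tag, word in leaves(tree)]  # tagged view
```

Deriving the segmented and tagged views from the leaves of the bracketed tree mirrors the statement that those versions are generated from the bracketed files, which is why they cannot show earlier annotation stages.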

Extent: Corpus size: 26624 KB

Identifier: LDC2004T05

https://catalog.ldc.upenn.edu/LDC2004T05

ISBN: 1-58563-287-2

ISLRN: 191-685-030-898-8

DOI: 10.35111/0qv2-1916

Language: Mandarin Chinese

Language (ISO639): cmn

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2004T05

Rights Holder: Portions © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-1998, 2000-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2004T05

DateStamp: 2024-03-06

GetRecord: OAI-PMH request for simple DC format
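The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. As a rough sketch, such a request is an HTTP GET with `verb`, `identifier`, and `metadataPrefix` parameters. The base URL below is a placeholder, since this record does not state the endpoint, and the prefix `olac` for the OLAC format is an assumption based on common OLAC practice (`oai_dc` is the standard OAI-PMH prefix for simple Dublin Core). The function name is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the record does not give the OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return base_url + "?" + query


OAI_ID = "oai:www.ldc.upenn.edu:LDC2004T05"
olac_url = get_record_url(OAI_ID, "olac")    # assumed prefix for OLAC format
dc_url = get_record_url(OAI_ID, "oai_dc")    # simple Dublin Core
```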

Search Info
Citation: Palmer, Martha; Chiou, Fu-Dong; Xue, Nianwen; Lee, Tsan-Kuang. 2004. Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T05
Up-to-date as of: Wed Oct 29 7:00:21 EDT 2025
