OLAC Record
oai:www.ldc.upenn.edu:LDC99T42

Metadata
Title:Treebank-3
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Marcus, Mitchell P., et al. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium, 1999
Contributor:Marcus, Mitchell P.
Santorini, Beatrice
Marcinkiewicz, Mary Ann
Taylor, Ann
Date (W3CDTF):1999
Description:*Introduction*
This release contains the following Treebank-2 material:
* One million words of 1989 Wall Street Journal material annotated in Treebank II style.
* A small sample of ATIS-3 material annotated in Treebank II style.
* A fully tagged version of the Brown Corpus.
and the following new material:
* Switchboard tagged, dysfluency-annotated, and parsed text
* Brown parsed text
The Treebank bracketing style is designed to allow the extraction of simple predicate/argument structure. Over one million words of text are provided with this bracketing applied.
*Data*
The Penn Treebank (PTB) project selected 2,499 stories from a three-year Wall Street Journal (WSJ) collection of 98,732 stories for syntactic annotation. These 2,499 stories have been distributed in both the Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42) releases of PTB. Treebank-2 includes the raw text for each story. Three "map" files, which provide the relation between the 2,499 PTB filenames and the corresponding WSJ DOCNO strings in TIPSTER, are available in a compressed file (pennTB_tipster_wsj_map.tar.gz) as an additional download for users who have licensed Treebank-2.
*Samples*
Please view the following samples:
* Part-of-Speech Tags
* Dysfluency Annotation
* Dysfluency Annotation & Part-of-Speech Tags
* Dysfluency Annotation, Part-of-Speech Tags & Turns Joined
* Syntactic Annotation
* Syntactic Annotation & Part-of-Speech Tags
*Updates*
After publication, it was discovered that not all of the PostScript (*.ps) files had been converted to PDFs and that some of the converted PDFs contained errors. For PDF copies of the documentation files, please go to addenda for a list of the files available. As of October 5, 2016, 252 wsj files from Treebank-2 that were previously missing were added. As of February 2017, 2,499 "raw" wsj files were added from Treebank-2 (LDC95T7). Corpus downloads after these dates will include these missing files.
Extent:Corpus size: 264192 KB
Identifier:LDC99T42
https://catalog.ldc.upenn.edu/LDC99T42
ISBN: 1-58563-163-9
ISLRN: 141-282-691-413-2
DOI: 10.35111/gq1x-j780
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC99T42
Rights Holder:Portions © 1987-1989 Dow Jones & Company, Inc., © 1993-1995, 1999 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC99T42
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
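
The GetRecord entries in this record refer to OAI-PMH requests for the same identifier in two metadata formats. As a minimal sketch of how such a request URL is assembled, the Python below builds the query string from the standard OAI-PMH arguments (verb, identifier, metadataPrefix). The base URL is a hypothetical placeholder, since this record does not state the repository endpoint, and build_get_record_url is an illustrative helper, not part of any library. The prefix oai_dc is the standard one for simple Dublin Core; olac is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def build_get_record_url(base_url: str, identifier: str, metadata_prefix: str) -> str:
    """Assemble an OAI-PMH GetRecord request URL (illustrative helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Hypothetical endpoint: the real base URL is not given in this record.
BASE_URL = "https://example.org/oai"
# Identifier taken from the OaiIdentifier field of this record.
IDENTIFIER = "oai:www.ldc.upenn.edu:LDC99T42"

olac_url = build_get_record_url(BASE_URL, IDENTIFIER, "olac")   # OLAC format (assumed prefix)
dc_url = build_get_record_url(BASE_URL, IDENTIFIER, "oai_dc")   # simple DC format

print(olac_url)
print(dc_url)
```

The colons in the identifier are percent-encoded by urlencode, which is the expected form for a query parameter value.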

Search Info

Citation: Marcus, Mitchell P.; Santorini, Beatrice; Marcinkiewicz, Mary Ann; Taylor, Ann. 1999. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC99T42
Up-to-date as of: Mon Mar 25 7:20:07 EDT 2024