OLAC Record
oai:www.ldc.upenn.edu:LDC93T1

Metadata
Title:ACL/DCI
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Linguistic Data Consortium. ACL/DCI LDC93T1. Web Download. Philadelphia: Linguistic Data Consortium, 1993
Contributor:Linguistic Data Consortium
Date (W3CDTF):1993
Description:ACL Data Collection Initiative contains text from the Wall Street Journal, the Collins English Dictionary, scientific abstracts provided by the U.S. Department of Energy and a variety of grammatically tagged and parsed materials from the Treebank project at the University of Pennsylvania. The total amount of uncompressed text is 620 Mbytes. The many formats of the original texts have been mapped into a markup language consistent with the SGML standard (ISO 8879). The format of the material from the Wall Street Journal uses a labelled bracketing, expressed in the style of SGML, although no formal SGML DTD is provided. The tag set has been modified by turning the Dow Jones header categories into tags and by creating ad hoc tags such as "". The original datelines are presented as separate text units; the text is divided and tagged into paragraphs and sentences, with each sentence presented on a single line. Nothing has been done to modify the typographical methods used to subdivide headlines and stories into sections, nor are any of the text features within sentences (quotes, ellipsis, etc.) normalized. The Collins English Dictionary is present in two forms. One form was approximately parsed into fielded records as an exercise in learning a language called "FIT", by a student working under the direction of Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990. The original digital image of the typographer's tape from which the database version was prepared had serious flaws that were not detected and corrected until later; the corrected version, a clean typographer's tape, is presented in a separate directory. A properly analyzed database version will be provided in the future. The documentation includes notes developed during the new attempt to analyze the tape from scratch. The Department of Energy abstracts reside in files of approximately one megabyte each. The original 950 separators have been replaced with newlines, and space padding between articles has been removed. An acronym dictionary that was extracted from the database as an indication of the material's topic areas has been included in a separate directory. Provisional material from the Penn Treebank project is divided into two subdirectories on this disk. The subdirectory "postext" contains text with part-of-speech annotations; "parstext" contains text with syntactic bracketing.
Identifier:LDC93T1
https://catalog.ldc.upenn.edu/LDC93T1
ISBN: 1-58563-000-4
ISLRN: 663-248-563-590-7
DOI: 10.35111/vdfv-av77
Language:English
Language (ISO639):eng
License:ACL/DCI Agreement: https://catalog.ldc.upenn.edu/license/acl-slash-dci-license-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC93T1
Rights Holder:Portions © 1987-1989 Dow Jones & Company, Inc., © 1979 William Collins Sons Co., Ltd., © 1990, 1991, 1993 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC93T1
DateStamp:  2021-06-16
GetRecord:  OAI-PMH request for simple DC format
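
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch only, such a request URL can be assembled from the OaiIdentifier and a metadata prefix. The base URL below is a placeholder, since this record does not state the repository endpoint, and the prefixes "oai_dc" (simple DC) and "olac" (OLAC format) are assumed to be the ones the repository exposes.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the record does not give the real base URL.
BASE_URL = "https://example.org/oai"

# Identifier taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC93T1"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",              # OAI-PMH verb for fetching a single record
        "identifier": identifier,         # percent-encoded by urlencode
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Assumed prefixes: "oai_dc" for simple DC, "olac" for the OLAC format.
simple_dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")
olac_url = get_record_url(OAI_IDENTIFIER, "olac")

print(simple_dc_url)
print(olac_url)
```

Replacing BASE_URL with the repository's actual endpoint would be needed before either URL could be fetched.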

Search Info

Citation: Linguistic Data Consortium. 1993. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC93T1
Up-to-date as of: Mon Mar 25 7:19:50 EDT 2024