OLAC Record
oai:www.ldc.upenn.edu:LDC94T4A

Metadata
Title:UN Parallel Text (Complete)
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Graff, David. UN Parallel Text (Complete) LDC94T4A. Web Download. Philadelphia: Linguistic Data Consortium, 1994
Contributor:Graff, David
Date (W3CDTF):1994
Description:*Introduction*
UN Parallel Text (Complete) contains English, French and Spanish official documents provided to the Linguistic Data Consortium (LDC) by the United Nations (UN) for use in research on machine translation technology. The documents are from archives maintained by the UN Office of Conference Services in New York and span the period 1988-1993.
The following individual releases by language are also available from LDC:
* LDC94T4B-1 UN Parallel Text (English)
* LDC94T4B-2 UN Parallel Text (French)
* LDC94T4B-3 UN Parallel Text (Spanish)
*Data*
All parallel files in this corpus are English-based: for every file in the English directory, there is a corresponding file in the French directory, the Spanish directory, or both. Tables are included to help determine which parallels are present. The documents are arranged in a parallel directory structure for each language, so that corresponding translations of a document can be found directly from the directory paths and file names.
The total content per language, by number of files and number of words (millions), is summarized below (values are approximate):
* English: 22,000 files, 59 million words
* French: 20,000 files, 58 million words
* Spanish: 14,400 files, 48 million words
* French/Spanish parallel data: 12,700 files, 38 million words (per language)
An SGML (Standard Generalized Markup Language) tagging structure was applied to the text. It preserves all typographic and meta-information present in the UN archival files. A working DTD (Document Type Definition) is provided for users who work with SGML. For users who do not, a simple script is included for use with the sed (stream editor) utility to filter out SGML-specific material and meta-information, leaving only the plain text. The character set is 8-bit ISO 8859-1 (Latin-1), in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table.
Parallel samples of the three languages in this publication are listed below.
* LDC1994T04 English Sample
* LDC1994T04 French Sample
* LDC1994T04 Spanish Sample
Based on the combined usage of title strings and document numbers, parallel sets amounting to over 60% of the data in the archive (a total of 56,684 files in 21,986 parallel sets) were identified. Parallel sets in the remaining 40% were not identified, due in part to the fact that this data set contains only English-based parallel sets. Parallel sets that include only French and Spanish versions are not part of this release.
Parallel sets identified by this automatic method include errors. In a number of cases (over 700 in the corpus as a whole), the members of a parallel set show a serious discrepancy in quantity of text. Some of these sets (and perhaps some less obvious cases) constitute a complete mismatch. The reftable files in the tables directory indicate the relative consistency among members of a parallel set in terms of overall size. From these tables, the least likely candidates for parallelism can be identified.
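The points in the Description about the parallel directory layout, the Latin-1 encoding, and filtering out SGML markup can be illustrated with a minimal Python sketch. It is not the sed script or the reftable format shipped with the corpus. The directory names (`english`, `french`), the file name `doc001.sgm`, and the helper names are hypothetical, and the regular expression removes only tags, not the meta-information that the distributed script also filters.

```python
import re
from pathlib import PurePosixPath

# Matches anything that looks like an SGML tag, e.g. <p> or </doc>.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(raw: bytes) -> str:
    """Decode ISO 8859-1 bytes and drop SGML-style tags (tags only)."""
    text = raw.decode("iso-8859-1")
    return TAG_RE.sub("", text)


def parallel_path(path: str, source_dir: str = "english",
                  target_dir: str = "french") -> str:
    """Swap the language directory in a path, keeping the rest unchanged.

    The directory names are assumptions; the real corpus may use others.
    Raises ValueError if source_dir is not a component of the path.
    """
    parts = list(PurePosixPath(path).parts)
    parts[parts.index(source_dir)] = target_dir
    return str(PurePosixPath(*parts))


def size_ratio(size_a: int, size_b: int) -> float:
    """Ratio of the smaller size to the larger; values near 0 suggest a mismatch."""
    larger = max(size_a, size_b)
    if larger == 0:
        return 1.0
    return min(size_a, size_b) / larger
```

For example, `parallel_path("english/1990/doc001.sgm")` gives the path where a French counterpart would be expected under this assumed layout, and a low `size_ratio` for the two files flags the pair as an unlikely parallel.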
Identifier:LDC94T4A
https://catalog.ldc.upenn.edu/LDC94T4A
ISBN: 1-58563-038-1
ISLRN: 804-587-727-227-7
DOI: 10.35111/zedp-vb74
Language:French
English
Spanish
Language (ISO639):fra
eng
spa
License:UN Parallel Text Agreement: https://catalog.ldc.upenn.edu/license/un-parallel-text-license.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC94T4A
Rights Holder:Portions © 1988-1993 United Nations, © 1994 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC94T4A
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format
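The GetRecord entries above correspond to OAI-PMH requests for this record. As a rough sketch, such a request URL can be assembled from the `verb`, `identifier`, and `metadataPrefix` arguments defined by the OAI-PMH protocol. The base URL below is a placeholder, not the archive's actual endpoint, and the function name is hypothetical. `oai_dc` is the standard prefix for simple Dublin Core; `olac` is assumed here to be the prefix for the OLAC format.

```python
from urllib.parse import urlencode


def build_getrecord_url(base_url: str, identifier: str,
                        metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL for one record."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# Placeholder endpoint; substitute the repository's real OAI-PMH base URL.
url = build_getrecord_url("https://example.org/oai",
                          "oai:www.ldc.upenn.edu:LDC94T4A")
```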

Search Info

Citation: Graff, David. 1994. Linguistic Data Consortium.
Terms: area_Europe country_ES country_FR country_GB dcmi_Text iso639_eng iso639_fra iso639_spa olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC94T4A
Up-to-date as of: Mon Mar 25 7:19:52 EDT 2024