OLAC Record
oai:www.ldc.upenn.edu:LDC2003T18

Metadata
Title:Multiple-Translation Arabic (MTA) Part 1
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Walker, Kevin, et al. Multiple-Translation Arabic (MTA) Part 1 LDC2003T18. Web Download. Philadelphia: Linguistic Data Consortium, 2003
Contributor:Walker, Kevin
Bamba, Moussa
Miller, David
Ma, Xiaoyi
Cieri, Christopher
Doddington, George
Date (W3CDTF):2003
Date Issued (W3CDTF):2003-10-15
Description:*Introduction*
Multiple-Translation Arabic (MTA) Part 1 was produced by the Linguistic Data Consortium (LDC), catalog number LDC2003T18 and ISBN 1-58563-276-7.
To support the development of automatic means for evaluating translation quality, the LDC was sponsored to solicit ten sets of human translations for a single set of Arabic source materials. The LDC was also asked to produce translations from various commercial off-the-shelf (COTS) systems, including commercial Machine Translation (MT) systems as well as MT systems available on the Internet. There are a total of two sets of COTS outputs and one output set from a TIDES 2002 MT Evaluation participant, which is representative of state-of-the-art research systems.
To see whether automatic evaluation systems such as BLEU track human assessment, the LDC has also performed human assessment on the two COTS outputs and the TIDES research system. The corpus includes the assessment results for one of the two COTS systems, the assessment results for the TIDES research system, and the specifications used for conducting the assessments.
*Data*
Source Data Selection
Two sources of journalistic Arabic text were selected to provide the Arabic material:
- Xinhua News Service: 66 news stories (files: artb_500 - artb_565)
- AFP News Service: 75 news stories (files: artb_S01 - artb_S06, artb_001 - artb_069)
(total: 141 stories)
There are 141 source files and 1,792 translation files (12 of the 13 systems produced translations for all 141 source files, while one system produced translations for only 100 of the 141 Arabic stories).
The Xinhua data was drawn from the Xinhua News Agency's Arabic newswire feed in October 2001. The AFP data was drawn from the LDC's Arabic Newswire Part 1. The story selection from the two newswire collections was controlled by story length: all selected stories contain between 700 and 1,500 Arabic characters.
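The file counts given under Source Data Selection can be checked with a short sketch. It assumes the stated file ranges are contiguous and that the 13 systems are the ten human translation sets, the two COTS systems, and the one TIDES research system; the function and constant names are illustrative and are not part of the corpus.

```python
# Counts taken from the record description.
TOTAL_SYSTEMS = 13            # assumed: 10 human sets + 2 COTS + 1 TIDES research system
PARTIAL_SYSTEM_STORIES = 100  # one system translated only 100 of the stories


def xinhua_ids():
    """Story IDs artb_500 - artb_565 (assumes a contiguous range)."""
    return [f"artb_{n}" for n in range(500, 566)]


def afp_ids():
    """Story IDs artb_S01 - artb_S06 and artb_001 - artb_069 (assumes contiguous ranges)."""
    first_batch = [f"artb_S{n:02d}" for n in range(1, 7)]
    rest = [f"artb_{n:03d}" for n in range(1, 70)]
    return first_batch + rest


def translation_file_count(source_count, systems, partial_count):
    """All systems but one cover every story; the remaining one covers partial_count."""
    return (systems - 1) * source_count + partial_count


source_count = len(xinhua_ids()) + len(afp_ids())
print(source_count)
print(translation_file_count(source_count, TOTAL_SYSTEMS, PARTIAL_SYSTEM_STORIES))
```

Under these assumptions the sketch reproduces the 141 source files and 1,792 translation files stated above.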
The overall count of Arabic words (excluding markup) is shown in the following table by source:
AFP       12,674
Xinhua    11,155
          ------
Total     23,829
For the Arabic data, there are approximately 23K words, while for the English translations, there are 366K words in total and 163K unique words.
Source Data Preparation for Human Translation
The original source files used CP-1256 encoding for the Arabic characters, and SGML tags for marking sentence and paragraph boundaries and other information about each story. The source files were later converted to UTF-8 encoding. To make things easier for the translators, nearly all SGML tags were removed or replaced by "plain text" markers.
Human Translation Procedure and Quality Assessment
Each initially selected translation team received the translation guidelines and a sample pair of source and translation (excluded from the final release) for review. After the team confirmed that they understood the task requirements and were willing to participate in the project, 75 AFP news stories were sent to them as a first installment of data. In accordance with the guidelines, each translation team was asked to return the first six AFP stories for quality checking. This was to ensure that the translation team had indeed understood and was following the guidelines, and that the translation quality was acceptable. The LDC sent translations back to the translation team whenever deviations from the guidelines or quality issues were detected. Subsequent translation submissions were continuously monitored for conformance and quality. Once the full set of translations was complete, a final pass of reformatting and validation was carried out to ensure alignability of segments and to convert the translated texts into SGML format. Each translation team was also asked to fill out and return a questionnaire describing their procedures and professional background.
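The encoding conversion mentioned under Source Data Preparation can be illustrated with a minimal sketch. This is not the LDC's actual tooling, which the record does not describe; the function name is hypothetical, and the sketch relies only on Python's built-in "cp1256" and "utf-8" codecs.

```python
def cp1256_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows-1256 (CP-1256) bytes as UTF-8.

    Hypothetical helper for illustration only.
    """
    text = raw.decode("cp1256")   # interpret the legacy Arabic bytes
    return text.encode("utf-8")   # write them back out as UTF-8


# Usage: a short Arabic word round-trips through the conversion.
sample = "خبر".encode("cp1256")
converted = cp1256_to_utf8(sample)
print(converted.decode("utf-8"))
```

Arabic letters occupy one byte each in CP-1256 and two bytes each in UTF-8, so converted files grow in size even though the text is unchanged.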
Machine Translation Procedure
Complete sets of automatic MT translations were also produced by submitting the 141 stories to each of the two publicly available MT systems. Starting from the original SGML text format, special alterations were made to the files on an as-needed basis, so that they would be accepted and handled correctly by the various systems. The systems also differed in the input and retrieval methods required to submit the source data for translation and to save the translated text in alignable form.
Human Assessment Procedure
The goal of this effort is to evaluate the quality of translations from TIDES research systems, human translation teams, and commercial off-the-shelf (COTS) systems. Translations are evaluated on the basis of adequacy and fluency. Adequacy refers to the degree to which the translation communicates information present in the original source language text. Fluency refers to the degree to which the translation is well-formed according to the grammar of the target language.
*Updates*
There are no updates available at this time.
Extent:Corpus size: 10035 KB
Identifier:LDC2003T18
https://catalog.ldc.upenn.edu/LDC2003T18
ISBN: 1-58563-276-7
ISLRN: 610-045-411-801-3
OAI: oai:www.ldc.upenn.edu:LDC2003T18
Language:English
Standard Arabic
Language (ISO639):eng
arb
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2003T18
Rights Holder:Portions © 2001-2002 Xinhua News Agency, © 1998-2000 Agence France-Presse, © 2003 Trustees of the University of Pennsylvania
Type (DCMI):Text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2003T18
DateStamp:  2014-07-17
GetRecord:  OAI-PMH request for simple DC format
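The GetRecord entries above refer to OAI-PMH requests for this record in two metadata formats. A minimal sketch of how such request URLs are formed is shown here. The base URL is a placeholder, since this record does not state the endpoint address, and the "olac" metadata prefix is an assumption about how the OLAC format is named; "oai_dc" is the standard prefix for simple Dublin Core.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real OAI-PMH base URL is not given in this record.
BASE_URL = "https://example.org/oai"

OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2003T18"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record and one format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# "olac" is an assumed prefix name; "oai_dc" is simple Dublin Core.
print(get_record_url(OAI_IDENTIFIER, "olac"))
print(get_record_url(OAI_IDENTIFIER, "oai_dc"))
```

The identifier is percent-encoded by urlencode, so the colons in the OAI identifier appear as %3A in the resulting URL.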

Search Info

Citation: Walker, Kevin; Bamba, Moussa; Miller, David; Ma, Xiaoyi; Cieri, Christopher; Doddington, George. 2003. Linguistic Data Consortium.
Terms: area_Asia area_Europe country_GB country_SA dcmi_Text iso639_arb iso639_eng


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2003T18
Up-to-date as of: Fri Oct 24 0:18:07 EDT 2014