OLAC Record
oai:www.ldc.upenn.edu:LDC2025T14

Metadata
Title:BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Tracey, Jennifer, et al. BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations LDC2025T14. Web Download. Philadelphia: Linguistic Data Consortium, 2025
Contributor:Tracey, Jennifer
Chen, Song
Delgado, Dana
Strassel, Stephanie
Date (W3CDTF):2025
Date Issued (W3CDTF):2025-10-15
Description:*Introduction*
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations was developed by the Linguistic Data Consortium (LDC) and consists of transcripts and their corresponding English translations for 116 hours of conversational telephone speech between native speakers of the Arabic dialect spoken in Egypt.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The telephone data was transcribed, translated and annotated for various tasks including word alignment, treebanking, and co-reference.
*Data*
The source audio recordings consist of 274 telephone conversations taken from LDC's multilingual CALLFRIEND and CALLHOME series developed to support speech identification and language identification technology development.
Transcribers were required to produce a verbatim transcript of all speech within a file using the CODA orthographic approach; diacritics were not included. Some transcripts contain redactions for potential personally identifying information. Further information about the transcription methodology is contained in the transcription guidelines accompanying this release. All speech data was transcribed.
The goal of the BOLT translation task was to translate the Arabic transcripts into fluent English while preserving the meaning present in the original Arabic text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. Further information about the translation methodology is contained in the translation guidelines accompanying this release. 99% of the transcripts were translated into English.
The data volume in this corpus is as follows:
partition  doc count  su count  src ntoken  eng nword   hours
dev               29     9,663      63,401     83,206    6.27
eval             103    39,478     237,623    311,564   23.94
train            203   134,365     760,536    965,468   78.27
total            335   183,506   1,061,560  1,360,238  108.48
Transcripts and translations are presented in XML format, UTF-8 encoded.
*Samples*
* Egyptian Arabic Transcription Sample (XML)
* Egyptian Arabic Translation Sample (XML)
*Acknowledgement*
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
*Updates*
No updates at this time.
Extent:Corpus size: 52472 KB
Identifier:LDC2025T14
https://catalog.ldc.upenn.edu/LDC2025T14
ISLRN: 615-498-437-695-7
DOI: 10.35111/gkpt-d139
Language:English
Egyptian Arabic
Language (ISO639):eng
arz
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2025T14
Rights Holder:Portions © 1996, 1997, 1999, 2002, 2014, 2019, 2025 Trustees of the University of Pennsylvania
Subject:Egyptian Arabic language
Subject (ISO639):arz
Type (DCMI):Text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2025T14
DateStamp:  2025-10-15
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Tracey, Jennifer; Chen, Song; Delgado, Dana; Strassel, Stephanie. 2025. Linguistic Data Consortium.
Terms: area_Africa area_Europe country_EG country_GB dcmi_Text iso639_arz iso639_eng

Inferred Metadata

Country: Egypt
Area: Africa


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2025T14
Up-to-date as of: Thu Oct 16 0:36:28 EDT 2025