2026-06-02T19:33:29Zhttp://www.language-archives.org/cgi-bin/olaca3.pl

oai:www.ldc.upenn.edu:LDC2000S852021-07-01

Du Bois, John W.Chafe, Wallace L.Meyer, CharlesThompson, Sandra A.20002000-01-01*Introduction* The Santa Barbara Corpus of Spoken American English is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more. *Data* Part I contains 14 speech files of 15 to 30 minutes each, from the Santa Barbara Corpus of Spoken American English. Collected by: University of California, Santa Barbara Center for the Study of Discourse, Director John W. Du Bois (UCSB), Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB). The Santa Barbara Corpus of Spoken American English is part of the International Corpus of English (Charles W. Meyer, Director), representing the American Component. Each speech file is accompanied by a transcript in which phrases are time stamped with respect to the audio recording. Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. *Samples* For an example of the data in this corpus, please examine these samples of the recordings and transcripts: * Speech * Transcripts *Updates* There are no updates at this time.Corpus size: 1677721 KBDistribution: Web DownloadLDC2000S85https://catalog.ldc.upenn.edu/LDC2000S85ISBN: 1-58563-164-7ISLRN: 407-731-819-668-4DOI: 10.35111/s2q7-gq73Du Bois, John W., et al. Santa Barbara Corpus of Spoken American English Part I LDC2000S85. Web Download. 
Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S85Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSanta Barbara Corpus of Spoken American English Part ISound

oai:www.ldc.upenn.edu:LDC2000S862020-11-30

Linguistic Data Consortium2000*Introduction* This publication contains the evaluation test material used in the 1998 DARPA/NIST Continuous Speech Recognition Broadcast News HUB4 English Benchmark Test administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium (LDC), catalog number LDC2000S86, ISBN 1-58563-172-8. *Data* The test material is contained in two SPHERE-formatted waveform files. The file h4e_98_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from 1996. The file h4e_98_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from 1998. Each file should be separately recognized per the HUB4 English Evaluation Specification. *Additional Licensing Instructions* This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.Corpus size: 345088 KBDistribution: Web DownloadLDC2000S86https://catalog.ldc.upenn.edu/LDC2000S86ISBN: 1-58563-172-8ISLRN: 786-335-176-662-7DOI: 10.35111/j4qt-7y88Linguistic Data Consortium. 1998 HUB4 Broadcast News Evaluation English Test Material LDC2000S86. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S86Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingUSC Marketplace Agreement: https://catalog.ldc.upenn.edu/license/usc-marketplace-speech.pdf1998 HUB4 Broadcast News Evaluation English Test MaterialSoundPortions © 1996 American Broadcasting Company, © 1996 Cable News Network, LP, LLLP, © 1996 Public Radio International, © 1996 The University of Southern California, USC Radio and Marketplace, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000S872020-11-30

Schmidt-Nielsen, AstridMarsh, ElaineTardelli, JohnGatewood, PaulKreamer, ElizabethTremain, ThomasCieri, ChristopherWright, Jonathan2000*Introduction* Speech in Noisy Environments (SPINE) Training Audio Corpus was developed for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. Corresponding transcripts, Speech in Noisy Environments (SPINE) Training Transcripts, are available as LDC2000T49. These corpora supported the 2000 Speech in Noisy Environments (SPINE1) evaluation. The 2000 Speech in Noisy Environments Evaluation (SPINE1) was a first attempt to assess the state of the art and practice in speech recognition technology in noisy military environments and to exchange information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. It was intended to be of interest to all university, industrial and commercial speech system developers working on the problem of robust speech recognition. The evaluation gave participants the opportunity to participate in a flexible evaluation, suited to development needs and abilities. The SPINE1 evaluation focused on the task of transcribing speech produced in noisy environments with emphasis on noisy military environments. The evaluation was designed to promote research progress in this area, to provide the opportunity for participants to try out new ideas for developing robust speech recognition systems that were of both scientific and practical interest, and to measure the performance of this technology. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* The evaluation task was to transcribe speech produced in noisy environments. The training and test speech data used for this evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions. 
The speech data consists of conversations between two communicators working on a collaborative battleship-like task in which they seek and shoot at targets (ARCON Communicability Exercise, ACE). Participants could talk freely, but the total vocabulary used was fairly limited. Each person was seated in a sound chamber in which a previously recorded military background noise environment was accurately reproduced. The participants used handsets and transmission channels that were resident to the particular environment. The training data includes 10 of twenty available talker pairs with 14 five-minute conversations per talker pair (about 720 minutes total), which includes four noise scenarios. *Samples* For an example of the data contained in this corpus, please listen to this audio sample. *Updates* There are no updates at this time.Corpus size: 2411724 KBDistribution: Web DownloadLDC2000S87https://catalog.ldc.upenn.edu/LDC2000S87ISBN: 1-58563-173-6ISLRN: 884-501-625-360-2DOI: 10.35111/ka1n-0c43Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE) Training Audio LDC2000S87. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S87Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE) Training AudioSound

oai:www.ldc.upenn.edu:LDC2000S882020-11-30

Linguistic Data Consortium2000*Introduction* This publication contains the English evaluation test material used in the 1999 NIST Broadcast News Transcription Evaluation administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium (LDC), catalog number LDC2000S88, ISBN 1-58563-176-0. *Data* The test material is contained in two SPHERE-formatted waveform files. The file bn99en_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from last year's set2 epoch. The file bn99en_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from the summer of 1998. Each file should be separately recognized per the Broadcast News English Evaluation Specification. Additional test material for each set is also included. Test materials include evaluation map files (bn99en_1.uem), automatically generated segmentation files (bn99en_1.seg), transcripts from the evaluation (bn99en_1.utf) and the utf.dtd used to validate the transcripts, reference STM files (bn99en_1.stm), and transcript orthography mapping files (en981118.glm). For more complete information, see the 1998 HUB4 Website. *Updates* There are no updates at this time. Note that the waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information. *Additional Licensing Instructions* This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.Corpus size: 352339 KBDistribution: Web DownloadLDC2000S88https://catalog.ldc.upenn.edu/LDC2000S88ISBN: 1-58563-176-0ISLRN: 691-755-940-811-0DOI: 10.35111/r4e7-nb71Linguistic Data Consortium. 1999 HUB4 Broadcast News Evaluation English Test Material LDC2000S88. Web Download. 
Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S88Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining1999 HUB4 Broadcast News Evaluation English Test MaterialSoundPortions Copyright 1998 PRI-Public Radio International Portions Copyright 1997-1998 ABC News Portions Copyright 1998 NBC News Portions Copyright 1997-1998 Cable News Network, Inc. All Rights Reserved

oai:www.ldc.upenn.edu:LDC2000S892020-11-30

Graff, David2000*Introduction* Voice of America (VOA) Czech Broadcast News Audio was developed by the Linguistic Data Consortium (LDC). Corresponding transcripts are contained in Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53), the documentation for which is included with this release. *Data* Between February 9 and May 28, 1999, LDC collected approximately 30 hours of Czech broadcast audio from the Voice of America news service. The 62 data files presented in this corpus represent the audio of the daily broadcasts of 30-minute news programs. Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was provided in the transcription texts to isolate the regions where these interruptions occurred. The 62 audio files in this corpus are single-channel, 16 KHz, 16-bit linear SPHERE files. *Samples* For an example of the data in this corpus, please review this audio sample. *Updates* There are no updates at this time.Corpus size: 3355443 KBDistribution: Web DownloadLDC2000S89https://catalog.ldc.upenn.edu/LDC2000S89ISBN: 1-58563-179-5ISLRN: 748-783-667-076-9DOI: 10.35111/5tcz-x844Graff, David. Voice of America (VOA) Czech Broadcast News Audio LDC2000S89. Web Download. 
Philadelphia: Linguistic Data Consortium, 2000CzechLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S89Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfVoice of America (VOA) Czech Broadcast News AudioSoundPortions © 2000 Trustees of the University of Pennsylvania
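The audio files in this record are described as single-channel, 16 KHz, 16-bit linear SPHERE files. As a minimal sketch of how those properties might be checked, the Python below parses the plain-text header found at the start of a NIST SPHERE file. It assumes the usual header layout (a `NIST_1A` magic line, a header-size line, then `name -type value` fields ending at `end_head`). The function name `parse_sphere_header` and the synthetic header bytes are illustrative only and are not part of the LDC release.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    first, second, _ = raw.split(b"\n", 2)
    if first.strip() != b"NIST_1A":
        raise ValueError("missing NIST_1A magic line; not a SPHERE header")
    header_size = int(second.strip())  # total header length in bytes

    fields = {}
    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field such as -s3
            fields[name] = value
    return fields


# Synthetic header with hypothetical values matching the description above.
demo_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

demo_fields = parse_sphere_header(demo_header)
```

For a real file, the first 1024 bytes read in binary mode would normally be enough to pass to the function, though the header-size line should be treated as authoritative.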

oai:www.ldc.upenn.edu:LDC2000S922020-11-30

Garofolo, John S.Graff, David2000*Introduction* TDT2 (Topic Detection and Tracking) Careful Transcription Audio was developed by the Linguistic Data Consortium (LDC) and contains English broadcast news audio recordings collected by LDC in 1998. Corresponding transcripts are available in TDT2 Careful Transcription Text LDC2000T44. Topic Detection and Tracking refers to automatic techniques for finding topically-related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection) and track the reoccurrence of old or new events (tracking). *Data* This publication contains 1998 broadcasts from the following sources: ABC News, Cable News Network, Public Radio International and Voice of America. *Samples* For an example of the data in this corpus, please review this audio sample. *Updates* There are no updates at this time.Distribution: Web DownloadLDC2000S92https://catalog.ldc.upenn.edu/LDC2000S92ISBN: 1-58563-167-1ISLRN: 163-189-179-812-6DOI: 10.35111/rf63-4396Garofolo, John S., and David Graff. TDT2 Careful Transcription Audio LDC2000S92. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S92Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfTDT2 Careful Transcription AudioSoundPortions © 1998 American Broadcasting Company, Cable News Network, LP, LLLP, Public Radio International, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000S962021-02-17

Schmidt-Nielsen, AstridMarsh, ElaineTardelli, JohnGatewood, PaulKreamer, ElizabethTremain, ThomasCieri, ChristopherWright, Jonathan2000*Introduction* Speech in Noisy Environments (SPINE) Evaluation Audio was developed for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. The corresponding transcripts, Speech in Noisy Environments (SPINE) Audio Transcripts, are available as LDC2000T54. These corpora supported the 2000 Speech in Noisy Environments (SPINE1) evaluation. The 2000 Speech in Noisy Environments Evaluation (SPINE1) was a first attempt to assess the state of the art and practice in speech recognition technology in noisy military environments and to exchange information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. It was intended to be of interest to all university, industrial and commercial speech system developers working on the problem of robust speech recognition. The evaluation gave participants the opportunity to participate in a flexible evaluation, suited to development needs and abilities. The SPINE1 evaluation focused on the task of transcribing speech produced in noisy environments with emphasis on noisy military environments. The evaluation was designed to promote research progress in this area, to provide the opportunity for participants to try out new ideas for developing robust speech recognition systems that were of both scientific and practical interest, and to measure the performance of this technology. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* The evaluation task was to transcribe speech produced in noisy environments. The training and test speech data used for this evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions. 
The speech data consists of conversations between two communicators working on a collaborative battleship-like task in which they seek and shoot at targets (ARCON Communicability Exercise, ACE). Participants could talk freely, but the total vocabulary used was fairly limited. Each person was seated in a sound chamber in which a previously recorded military background noise environment was accurately reproduced. The participants used handsets and transmission channels that were resident to the particular environment. The evaluation data includes 20 talker-pairs, with six five-minute conversations per talker-pair (about 600 minutes total), from a set of four scenarios. It is contained in 120 files, one conversation in each file, for an approximate total of nine hours and 22 minutes (2.2 Gigabytes) of audio data. *Samples* For an example of the speech data in this corpus, please examine this audio sample. For an example of a corresponding transcript, please click here. *Updates* There are no updates at this time.Corpus size: 2 KBDistribution: Web DownloadLDC2000S96https://catalog.ldc.upenn.edu/LDC2000S96ISBN: 1-58563-188-4ISLRN: 940-433-236-519-4DOI: 10.35111/701b-nw95Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE) Evaluation Audio LDC2000S96. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000S96Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE) Evaluation AudioSound

oai:www.ldc.upenn.edu:LDC2000T432020-11-30

Charniak, EugeneBlaheta, DonGe, NiyuHall, KeithHale, JohnJohnson, Mark2000*Introduction* Brown Laboratory for Linguistic Information Processing (BLLIP) 1987-89 WSJ Corpus Release 1 contains a complete, Treebank-style part-of-speech (POS) tagged and parsed version of the three-year Wall Street Journal (WSJ) collection from ACL/DCI (LDC93T1), approximately 30 million words. The annotation was performed using statistically-based methods developed by BLLIP researchers Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale and Mark Johnson. This corpus both overlaps and supplements the million-word Penn Treebank (PTB) collection of parsed and POS-tagged WSJ texts. *Data* The PTB project selected 2,499 stories from a three-year WSJ collection of 98,732 stories for syntactic annotation. These 2,499 stories are distributed in Treebank-2 (LDC95T7) and Treebank-3 (LDC99T42), both of which include the raw text for each story. *Updates* There are no updates at this time.Corpus size: 1048576 KBDistribution: Web DownloadLDC2000T43https://catalog.ldc.upenn.edu/LDC2000T43ISBN: 1-58563-165-5ISLRN: 233-420-716-637-7DOI: 10.35111/fwew-da58Charniak, Eugene, et al. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T43Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingBLLIP 1987-89 WSJ Corpus Release 1 License Agreement: https://catalog.ldc.upenn.edu/license/bllip-1987-89-wsj-corpus-release-1-license-agreement.pdfBLLIP 1987-89 WSJ Corpus Release 1TextPortions © 1987-1989 Dow Jones & Company, Inc., © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T442020-11-30

Strassel, StephanieMartey, Nii2000*Introduction* TDT2 (Topic Detection and Tracking) Careful Transcription was developed by the Linguistic Data Consortium (LDC) and contains transcripts of English broadcast news audio recordings collected by LDC in 1998. The corresponding audio data is available in TDT2 Careful Transcription Audio LDC2000S92. Topic Detection and Tracking refers to automatic techniques for finding topically-related material in streams of data such as newswire and broadcast news. This corpus was created to support three TDT2 tasks: to find topically homogeneous sections (segmentation), to detect the occurrence of new events (detection) and to track the reoccurrence of old or new events (tracking). *Data* The broadcast data was collected from the following sources: ABC News, Cable News Network, Public Radio International and Voice of America. Please look at this sample transcript. *Updates* There are no updates at this time.Corpus size: 1058830 KBDistribution: Web DownloadLDC2000T44https://catalog.ldc.upenn.edu/LDC2000T44ISBN: 1-58563-166-3ISLRN: 387-208-758-013-8DOI: 10.35111/0ywn-fh57Strassel, Stephanie, and Nii Martey. TDT2 Careful Transcription Text LDC2000T44. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T44Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfTDT2 Careful Transcription TextTextPortions © 1998 American Broadcasting Company, Cable News Network, LP, LLLP, Public Radio International, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T452020-11-30

Cole, AndyWalker, Kevin2000*Introduction* This corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The collection includes articles from the date ranges listed below. Please click here to see an example of the newswire. Not all dates in each interval are represented by files or articles:
1994 Jun. 2 to Dec. 31: 87 files, 8.6 MB
1995 Jan. 1 to Dec. 31: 179 files, 16.9 MB
1996 Jan. 1 to Mar. 29: 83 files, 10.6 MB
1997 Jul. 28 to Dec. 31: 245 files, 48.9 MB
1998 Jan. 2 to Dec. 31: 285 files, 64.2 MB
1999 Jan. 3 to Dec. 31: 216 files, 56.7 MB
2000 Jan. 3 to Mar. 20: 56 files, 13.6 MB
Total: 1,151 files, 219.5 MB
*Data* The articles provided here have been collected by means of a continuous feed from the news provider over a modem connection. Incoming data from the modem was spooled directly to a "raw collection" file on a daily basis and the raw files were then processed to produce the format for release by the LDC. There are approximately 143,137 articles in this corpus. It is probable that there are duplicate articles in this corpus. We have taken steps to remove articles that were corrupted by failures or noise in modem transmission. The kinds of corruption that we were able to eliminate include truncated articles (a valid end-of-article sequence is not observed before a valid start-of-article) and invalid character codes within the text segment of articles. Some corruption may have occurred that did not produce these symptoms (e.g. service interruptions that might cause partial loss of data within or across articles or corruptions that garble the content but happen not to produce any invalid character codes). At present we have no means for detecting these more subtle problems in the data, but we expect that they are relatively infrequent. The format chosen for release consists of SGML tagging (since this gives a fairly simple and self-explanatory presentation of the data) and the KSC-5601 Korean character encoding. 
*Updates* There are no updates at this time.Corpus size: 221184 KBDistribution: Web DownloadLDC2000T45https://catalog.ldc.upenn.edu/LDC2000T45ISBN: 1-58563-168-XISLRN: 210-777-697-418-7DOI: 10.35111/4wep-9z24Cole, Andy, and Kevin Walker. Korean Newswire LDC2000T45. Web Download. Philadelphia: Linguistic Data Consortium, 2000KoreanLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T45Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfKorean NewswireTextPortions Copyright 1994-2000, Korean Press Agency, All Rights Reserved
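The record above states that the Korean Newswire articles use the KSC-5601 Korean character encoding. As a minimal sketch, and assuming the files use the common EUC-KR byte encoding of that character set, the Python below decodes raw bytes with the standard library's `euc_kr` codec. The helper name `decode_ksc5601` and the sample string are hypothetical; the actual byte layout of the release should be confirmed against its documentation.

```python
def decode_ksc5601(raw: bytes) -> str:
    """Decode bytes assumed to be KS C 5601 text in EUC-KR form.

    Undecodable bytes are replaced rather than raising, since the record
    notes that some modem corruption may remain in the data.
    """
    return raw.decode("euc_kr", errors="replace")


# Hypothetical sample: encode a short Korean string, then decode it again.
sample_bytes = "한국 뉴스".encode("euc_kr")
sample_text = decode_ksc5601(sample_bytes)
```

Using `errors="replace"` is a design choice for exploratory reading; a stricter pipeline might prefer `errors="strict"` so that damaged articles are flagged instead of silently patched.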

oai:www.ldc.upenn.edu:LDC2000T462020-11-30

Ma, Xiaoyi20002000-01-15*Introduction* Hong Kong News Parallel Text was developed by the Linguistic Data Consortium (LDC) and consists of parallel Chinese - English news articles from the Information Services Department of Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China. LDC wishes to thank the Hong Kong Special Administrative Region of the People's Republic of China for granting the LDC permission to distribute this data to the research community. *Data* This corpus contains 18,147 aligned article pairs released by HKSAR from July 1, 1997 to April 30, 2000. Automatic article alignment was done at the LDC. The data directory contains 36,294 articles. Each article is a separate file, thus there are 18,147 article pairs. The files are named using the convention yyyymmdd_nnn.[ce] where * yyyy = year * mm = month * dd = day * nnn = article date sequence number * c = Cantonese, and e = English. The example.c and example.e files contain a corresponding sample news article from the corpus. The articles were collected by an automated system from the internet. Incoming data was spooled directly to a raw collection file and the raw files were then processed to produce the following format for release by the LDC. Table.txt maps the Chinese files (*.c) to the corresponding English files (*.e). The Chinese files are encoded in BIG5 with user-defined characters by HKSAR. Click here for details. *Copying and Distribution* Permission has been granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region provided that this copyright notice and the following permission notice are distributed with all copies. Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news articles from the Hong Kong Special Administrative Region Government website for research, education and technology development. 
*Updates* There are no updates at this time. *Additional Licensing Instructions* This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.Corpus size: 68608 KBDistribution: Web DownloadLDC2000T46https://catalog.ldc.upenn.edu/LDC2000T46ISBN: 1-58563-169-8ISLRN: 820-981-482-765-8DOI: 10.35111/5n10-kt36Ma, Xiaoyi. Hong Kong News Parallel Text LDC2000T46. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishChineseLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T46Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingHong Kong News Parallel TextTextPortions © 1997-2000 The Government of the Hong Kong Special Administrative Region, © 2000 Trustees of the University of Pennsylvania
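The yyyymmdd_nnn.[ce] naming convention described in this record can be sketched as a small parser. The Python below is an illustration under stated assumptions: the names `ArticleName` and `parse_article_name` and the example file name are hypothetical, and the sequence number is accepted with any number of digits because the record does not state a fixed width. Pairing of Chinese and English files is not derived from the names here, since the record says that mapping is given by Table.txt.

```python
import re
from dataclasses import dataclass
from datetime import date

# yyyymmdd, underscore, sequence number, then a .c or .e extension
_NAME = re.compile(r"^(\d{4})(\d{2})(\d{2})_(\d+)\.([ce])$")


@dataclass(frozen=True)
class ArticleName:
    day: date       # release date taken from yyyymmdd
    sequence: int   # article sequence number for that date (nnn)
    side: str       # "c" or "e", as in the file extension


def parse_article_name(name: str) -> ArticleName:
    """Split a file name such as '19970701_001.c' into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a yyyymmdd_nnn.[ce] name: {name!r}")
    year, month, day, seq, side = match.groups()
    # date() raises ValueError for impossible dates such as month 13
    return ArticleName(date(int(year), int(month), int(day)), int(seq), side)


# Hypothetical file name following the stated convention.
parsed = parse_article_name("19970701_001.c")
```

Names that do not follow the convention, such as example.c, raise `ValueError`, which makes it easy to separate sample files from dated articles when walking the data directory.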

oai:www.ldc.upenn.edu:LDC2000T472020-11-30

Ma, Xiaoyi20002000-07-15*Introduction* Hong Kong Laws Parallel Text was developed by the Linguistic Data Consortium (LDC) and consists of processed and sentence-aligned Chinese-English documents from the Department of Justice of the Hong Kong Special Administrative Region (HKSAR) of the People's Republic of China. LDC wishes to thank the Hong Kong Special Administrative Region of the People's Republic of China for granting the LDC permission to distribute this data to the research community. *Data* This corpus is organized into 19 parallel file pairs for a total of 38 files. Each parallel file pair is named hklaws.nn.[ec] where: * nn = sequence number and * the file extensions, c = Cantonese and e = English. Each file holds up to 2,000 sequentially numbered sentences tagged with a sentence index and sequence number as described below, for a total of 37,807 sentence indices across all 19 file pairs. The sentence numbering spans the file pairs such that the initial sentence index (in files hklaws.01.e and hklaws.01.c) is 1, and the last sentence index (in files hklaws.19.e and hklaws.19.c) is 37807. The sentence numbering establishes the sentence parallelism: two sentences having the same index and sequence number are purported to be parallel in content. Each sentence index may contain one or more sequentially numbered sentences, with corresponding files in English and Chinese containing the corresponding sets of sentences. The initial sequence number of each sentence is 1. The sentence sequence number plus the sentence index number is sufficient to uniquely identify parallel sentences. There are 313,659 sentences in the corpus. Each sentence is of the form:...... ...... where # represents a one to five digit sentence index or sequence number. Automatic sentence alignment was done at the LDC. The example.c and example.e files contain sample corresponding Chinese and English Law files from the corpus. 
The Chinese files are encoded in BIG5 with user-defined characters by HKSAR. See http://www.info.gov.hk/gccs for details. *Copying and Distribution* Permission has been granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region, provided this copyright notice and permission notice are distributed with all copies. Permission has been given to the Linguistic Data Consortium to reproduce the laws, press releases, and/or news articles from the Hong Kong Special Administrative Region Government website for research, education, and technology development. *Updates* There are no updates at this time. *Additional Licensing Instructions* This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.Corpus size: 2764 KBDistribution: Web DownloadLDC2000T47https://catalog.ldc.upenn.edu/LDC2000T47ISBN: 1-58563-170-1ISLRN: 596-847-245-337-1DOI: 10.35111/zfbe-bt19Ma, Xiaoyi. Hong Kong Laws Parallel Text LDC2000T47. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishChineseLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T47Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingHong Kong Laws Parallel TextTextPortions © 1999 The Government of the Hong Kong Special Administrative Region, © 2000 Trustees of the University of Pennsylvania
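The alignment rule in this record, where a sentence index plus a sequence number identifies a parallel sentence, can be illustrated with a short sketch. The Python below assumes the sentences have already been extracted into dictionaries keyed by `(index, sequence)`; the function name `pair_sentences` and the sample sentences are hypothetical, and the sentence markup itself is not parsed because its exact form is not recoverable from this record. The final line is a plain arithmetic check that, if the 2,000-per-file limit is read as applying to sentence indices, 37,807 indices need at least 19 file pairs.

```python
import math


def pair_sentences(english: dict, chinese: dict) -> list:
    """Pair sentences whose (index, sequence) keys occur in both languages."""
    shared_keys = sorted(english.keys() & chinese.keys())
    return [(key, english[key], chinese[key]) for key in shared_keys]


# Hypothetical, already-parsed input keyed by (sentence index, sequence number).
english = {
    (1, 1): "Short title.",
    (1, 2): "Second sentence.",
    (2, 1): "Interpretation.",
}
chinese = {
    (1, 1): "簡稱。",
    (2, 1): "釋義。",
}

pairs = pair_sentences(english, chinese)

# Lower bound on file pairs for 37,807 indices at up to 2,000 per file.
min_file_pairs = math.ceil(37807 / 2000)
```

Keys present on only one side, such as `(1, 2)` in the sample, are dropped rather than paired with a neighbor, which keeps the output limited to sentences the numbering marks as parallel.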

oai:www.ldc.upenn.edu:LDC2000T492020-11-30

Schmidt-Nielsen, AstridMarsh, ElaineCieri, ChristopherStrassel, StephanieRennert, Kara2000*Introduction* Speech in Noisy Environments (SPINE) Training Transcripts was developed for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. The corresponding audio data, Speech in Noisy Environments (SPINE) Training Audio, is available as LDC2000S87. These corpora supported the 2000 Speech in Noisy Environments evaluation. For an example transcript, please click here. The 2000 Speech in Noisy Environments Evaluation (SPINE1) was a first attempt to assess the state of the art and practice in speech recognition technology in noisy military environments and to exchange information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. It was intended to be of interest to all university, industrial and commercial speech system developers working on the problem of robust speech recognition. The evaluation gave participants the opportunity to participate in a flexible evaluation, suited to development needs and abilities. The SPINE1 evaluation focused on the task of transcribing speech produced in noisy environments with the emphasis on speech produced in noisy military environments. The evaluation was designed to promote research progress in this area, to provide the opportunity for participants to try out new ideas for developing robust speech recognition systems that were of both scientific and practical interest, and to measure the performance of this technology. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* The evaluation task was to transcribe speech produced in noisy environments. The training and test speech data used for this evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions. 
The speech data consists of conversations between two communicators working on a collaborative, battleship-like task in which they seek and shoot at targets (ARCON Communicability Exercise, ACE). Participants could talk freely, but the total vocabulary used was fairly limited. Each person was seated in a sound chamber in which a previously recorded military background noise environment was accurately reproduced. The participants used handsets and transmission channels that were resident to the particular environment. The training data includes 10 of 20 available talker pairs with 14 five-minute conversations per talker pair (about 720 minutes total) available, which includes four noise scenarios. *Updates* There are no updates at this time.Corpus size: 2560 KBDistribution: Web DownloadLDC2000T49https://catalog.ldc.upenn.edu/LDC2000T49ISBN: 1-58563-174-4ISLRN: 176-611-193-688-0DOI: 10.35111/zh7f-8r93Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE) Training Transcripts LDC2000T49. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T49Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE) Training TranscriptsText

oai:www.ldc.upenn.edu:LDC2000T502020-11-30

Ma, Xiaoyi2000*Introduction* Hong Kong Hansards Parallel Text was developed by the Linguistic Data Consortium (LDC) and contains excerpts from the Official Record of Proceedings of the Legislative Council of the Hong Kong Special Administrative Region (HKSAR) from October 1995 to April 2000. LDC thanks the Hong Kong Special Administrative Region of the People's Republic of China for granting permission to distribute this data to the research community. The Legislative Council normally meets every Wednesday afternoon in the Chamber of the Legislative Council Building. Business includes: discussion of subsidiary legislation, papers, reports, addresses, statements, questions, the three readings of bills, motions and debates. From time to time, the Chief Executive attends a special Council meeting to brief Members on policy issues and to answer questions from Members. All Council meetings are open to the public. The proceedings of the meetings are recorded verbatim in the Official Record of Proceedings of the Legislative Council (Hansard). The record of proceedings is in the original language delivered by the speakers (Floor Version). They are then translated into English and Chinese versions separately. *Data* This corpus contains excerpts from the official record of meetings from October 1995 to April 2000. There are 11.9 million English words and 18.15 million Chinese characters in this release. Chinese text is presented in the traditional script and encoded as BIG5. There are 388 files in the data/ subdirectory of this corpus, half (194 files) in English in the data/english/ subdirectory and half (194 files) in Chinese in the data/chinese/ subdirectory. Data file names are in the form YYYYMMDD_[ce].doc, where YYYYMMDD indicates the date of the meeting, c=Chinese and e=English. As an example of the text in this corpus, the Chinese sample is part of the Chinese language record of the meeting held on May 24, 1997. The parallel English file is in the English sample. 
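The naming convention above lends itself to a small helper for pairing the parallel files. The following is a minimal sketch, not part of the corpus distribution: the function names parse_name and pair_by_date are hypothetical, and the example file names are constructed from the stated YYYYMMDD_[ce].doc pattern rather than taken from the release.

```python
import re
from collections import defaultdict
from datetime import date, datetime

# Pattern for names of the form YYYYMMDD_[ce].doc, as described in the record.
NAME_RE = re.compile(r"^(\d{8})_([ce])\.doc$")


def parse_name(filename):
    """Return (meeting_date, language), or None if the name does not fit the pattern."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    try:
        # Reject strings of eight digits that are not real calendar dates.
        meeting_date = datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None
    language = "chinese" if match.group(2) == "c" else "english"
    return meeting_date, language


def pair_by_date(filenames):
    """Group file names by meeting date, keeping only dates that have both languages."""
    by_date = defaultdict(dict)
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None:
            meeting_date, language = parsed
            by_date[meeting_date][language] = name
    return {d: langs for d, langs in by_date.items() if len(langs) == 2}
```

Under these assumptions, pair_by_date applied to the listings of data/english/ and data/chinese/ would return one entry per meeting date, each holding the English and the Chinese file name for that meeting.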
*Copying and Distribution* Permission has been granted to the Linguistic Data Consortium to make and distribute copies of the laws, press releases and news of Hong Kong Special Administrative Region provided this copyright notice and permission notice are distributed with all copies. Permission has been given to reproduce the laws, press releases, and/or news articles from the Hong Kong Special Administrative Region Government website for research, education, and technology development. *Updates* There are no updates at this time. *Additional Licensing Instructions* This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.Corpus size: 108544 KBDistribution: Web DownloadLDC2000T50https://catalog.ldc.upenn.edu/LDC2000T50ISBN: 1-58563-175-2ISLRN: 272-276-125-586-5DOI: 10.35111/0dcb-s792Ma, Xiaoyi. Hong Kong Hansards Parallel Text LDC2000T50. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishChineseLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T50Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingHong Kong Hansards Parallel TextTextPortions © 1995-2000 The Government of the Hong Kong Special Administrative Region, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T512020-11-30

Rogers, Willie2000*Introduction* TREC Spanish was developed by the Linguistic Data Consortium and consists of Spanish newswire data from Agence France Presse and El Norte that was used in the TREC (Text REtrieval Conference) Spanish tasks sponsored by NIST (National Institute of Standards and Technology), specifically, TREC-3, TREC-4 and TREC-5. *Data* The El Norte material (250 megabytes) was used in TREC-3 and TREC-4; the Agence France Presse documents (300 megabytes) were used in TREC-5. The text has been formatted to include TREC document IDs. Further information about TREC-5 is available from the NIST TREC-5 website. *Updates* There are no updates at this time.Corpus size: 332800 KBDistribution: Web DownloadLDC2000T51https://catalog.ldc.upenn.edu/LDC2000T51ISBN: 1-58563-177-9ISLRN: 445-901-162-731-2DOI: 10.35111/krhs-dd17Rogers, Willie. TREC Spanish LDC2000T51. Web Download. Philadelphia: Linguistic Data Consortium, 2000SpanishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T51Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingTREC Spanish Agreement: https://catalog.ldc.upenn.edu/license/trec-spanish-license-agreement.pdfTREC SpanishTextPortions © 1994 Agence France Presse, © 1994 INFOSEL, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T522020-11-30

Rogers, Willie2000This publication contains the TREC ("Text REtrieval Conference") Mandarin Corpus used for the Chinese task in TRECs 5-6 and consists of approximately 170 megabytes of articles drawn from the People's Daily newspaper and the Xinhua newswire, formatted to include TREC document IDs. The text is Mandarin Chinese and is encoded using the GB encoding scheme. The topics (questions) and relevance judgments (right answers) are not included in this publication but can be downloaded from the Data/Non-English section of the TREC web site. The Mandarin Chinese text data is from the Xinhua News Agency and the People's Daily News Service (both from mainland China). Click here to see the appearance of a sample file from Xinhua Newswire and People's Daily. This collection of text was originally gathered by the Linguistic Data Consortium (LDC), and then adapted by the National Institute of Standards and Technology (NIST) for use in the TREC Mandarin evaluation program.Corpus size: 174894897 KBDistribution: Web DownloadLDC2000T52https://catalog.ldc.upenn.edu/LDC2000T52ISBN: 1-58563-178-7ISLRN: 964-663-671-938-8DOI: 10.35111/rn8c-7105Rogers, Willie. TREC Mandarin LDC2000T52. Web Download. Philadelphia: Linguistic Data Consortium, 2000Mandarin ChineseLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T52Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingTREC Mandarin Agreement: https://catalog.ldc.upenn.edu/license/trec-mandarin.pdfTREC MandarinTextPortions © 1991, 1993 People's Daily, © 1994, 1995 Xinhua News Agency, © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T532020-11-30

J, PsutkaV, RadovaL, MullerJ, MatousekP, Ircing2000*Introduction* Voice of America (VOA) Czech Broadcast News Transcripts was developed by the University of West Bohemia. The transcripts in this release correspond to Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89). Support for this work was provided by the Ministry of Education of the Czech Republic (Grant No. VS97159); by the Ministry of Education of the Czech Republic (Project ME293); and by the NSF Language Engineering Workshop at the Johns Hopkins University, Baltimore, MD USA (NSF Grant No. IIS-9820687). *Data* Between February 9 and May 28, 1999, the Linguistic Data Consortium (LDC) collected approximately 30 hours of Czech broadcast audio from the Voice of America news service. The 62 data files presented in this corpus represent the transcripts of the daily broadcasts of 30-minute news programs. The transcriptions were created by native Czech speakers, Pavel Ircing, Jindrich Matousek, Ludek Muller, and Vlasta Radova, working at the Department of Cybernetics, University of West Bohemia in Pilsen under the direction of Josef Psutka. They used transcription software provided by LDC (the "Transcriber" package), developed by Edouard Geoffrois and Claude Barras at DGA, France, with assistance from Zhibiao Wu at LDC. The version of Transcriber used for this project produced a text file format which is no longer supported by the software; also, the format does not resemble any previous transcription format published by LDC. Therefore, the files in this release have been converted into an SGML format that has been used for other broadcast news transcription corpora, specifically, the "Universal Transcription Format" (UTF -- not to be confused with the "Unicode Transformation Formats") defined by the speech group at NIST (National Institute of Standards and Technology). 
A description of that format is provided in the "utf.ps" (Postscript) and "utf.pdf" (Adobe Acrobat) files, and the formal SGML definition is provided in "utf.dtd," all in the release "doc" directory. The transcription text is rendered using the ISO 8859-2 character set. Information relating this character set to the Unicode standard is available at this site and from the Unicode Consortium. Due to technical limitations in the hardware at LDC that was used to receive the VOA broadcasts via a satellite downlink, a number of files contain brief portions where the audio signal was interrupted. These interruptions typically yielded regions of complete silence that lasted less than two seconds and were scattered sparsely throughout an affected audio file. Additional markup was provided in the transcription texts to isolate the regions where these interruptions occurred. Please click on LDC2000T53.sample to view an example transcript. *Updates* There are no updates at this time.Distribution: Web DownloadLDC2000T53https://catalog.ldc.upenn.edu/LDC2000T53ISBN: 1-58563-180-9ISLRN: 152-783-757-211-5DOI: 10.35111/zsbe-6d67J, Psutka, et al. Voice of America (VOA) Czech Broadcast News Transcripts LDC2000T53. Web Download. Philadelphia: Linguistic Data Consortium, 2000CzechLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T53Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfVoice of America (VOA) Czech Broadcast News TranscriptsTextPortions © 2000 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2000T542020-11-30

Schmidt-Nielsen, AstridMarsh, ElaineCieri, ChristopherStrassel, StephanieRennert, Kara20002002-06-17*Introduction* Speech in Noisy Environments (SPINE) Evaluation Transcripts was developed for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. The corresponding audio, Speech in Noisy Environments (SPINE) Evaluation Audio, is available as LDC2000S96. These corpora supported the 2000 Speech in Noisy Environments evaluation. For an example transcript, please click here. The 2000 Speech in Noisy Environments Evaluation (SPINE1) was a first attempt to assess the state of the art and practice in speech recognition technology in noisy military environments and to exchange information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. It was intended to be of interest to all university, industrial and commercial speech system developers working on the problem of robust speech recognition. The evaluation gave participants the opportunity to participate in a flexible evaluation, suited to development needs and abilities. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* The SPINE1 evaluation focused on the task of transcribing speech produced in noisy environments with the emphasis on speech produced in noisy military environments. The evaluation was designed to promote research progress in this area, to provide the opportunity for participants to try out new ideas for developing robust speech recognition systems that were of both scientific and practical interest, and to measure the performance of this technology. The evaluation task was to transcribe speech produced in noisy environments. The training and test speech data used for this evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions. 
The speech data consists of conversations between two communicators working on a collaborative battleship-like task in which they seek and shoot at targets (ARCON Communicability Exercise, ACE). Participants could talk freely, but the total vocabulary used was fairly limited. Each person was seated in a sound chamber in which a previously recorded military background noise environment was accurately reproduced. The participants used handsets and transmission channels that were resident to the particular environment. The evaluation data includes 20 talker-pairs, with six five-minute conversations per talker-pair (about 600 minutes total), from a set of four scenarios. *Updates* August 13, 2001: A tagging error was discovered in several files, in which the incorrect tag "[{noise}]" appeared in place of the correct tag, "[/noise]." There were 433 occurrences of this error across all files, and all were converted to the correct tag. Also, a single occurrence of two instances of "[noise/]" on the same line was corrected to "[/noise]" in the second instance. The corpus has been corrected.Corpus size: 240 KBDistribution: Web DownloadLDC2000T54https://catalog.ldc.upenn.edu/LDC2000T54ISBN: 1-58563-189-2ISLRN: 742-218-645-985-8DOI: 10.35111/k42c-jh17Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE) Evaluation Transcripts LDC2000T54. Web Download. Philadelphia: Linguistic Data Consortium, 2000EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2000T54Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE) Evaluation TranscriptsText

oai:www.ldc.upenn.edu:LDC2001S042020-11-30

Schmidt-Nielsen, AstridMarsh, ElaineTardelli, JohnGatewood, PaulKreamer, ElizabethTremain, ThomasCieri, ChristopherStrassel, StephanieMartey, NiiGraff, DavidTofan, Cristina2001*Introduction* Speech in Noisy Environments (SPINE2) Part 1 Audio was used as part of the training set for the Second Speech in Noisy Environments Evaluation (SPINE2). SPINE2 provided a continuing forum for assessing the state of the art and practice in speech recognition technology for noisy military environments and for exchanging information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. The evaluation provided researchers, potential sponsors, and customers with a quantitative means to appreciate the strengths and weaknesses of the technologies. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* This publication contains the Speech in Noisy Environments 2 (SPINE2) Clean and Vocoded Training Audio Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. The transcripts for this publication are available as Speech in Noisy Environments (SPINE2) Training Transcripts LDC2001T05. For an example transcript, please click here. These corpora supported the 2001 Speech in Noisy Environments evaluation. The training data comprises two talker pairs (four speakers total) with 32 conversations (sessions) per talker pair (64 conversations total). The audio for each session is presented in three forms: * Unprocessed: the signal recorded at the participant's microphone * Bitstream: the compressed "channel" data produced by the vocoder's analysis stage for transmission from sender to receiver * Processed: the signal produced by the vocoder's synthesis stage, given the bitstream data as input. 
There are a total of 64 clean audio files and 64 vocoded files, one "game" each, for a rough total of seven hours of audio data, 1.6Gb (including the unprocessed, the processed, and the bitstream files), 20,850 total tokens (730 unique tokens). *Samples* Please view this unprocessed audio sample and processed audio sample. *Updates* There are no updates at this time.Corpus size: 1589513 KBDistribution: Web DownloadLDC2001S04https://catalog.ldc.upenn.edu/LDC2001S04ISBN: 1-58563-206-6ISLRN: 275-012-111-000-7DOI: 10.35111/dr8g-r615Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE2) Part 1 Audio LDC2001S04. Web Download. Philadelphia: Linguistic Data Consortium, 2001EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2001S04Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE2) Part 1 AudioSoundPortions © 2001 Trustees of the University of Pennsylvania

oai:www.ldc.upenn.edu:LDC2001S062020-11-30

Schmidt-Nielsen, AstridMarsh, ElaineTardelli, JohnGatewood, PaulKreamer, ElizabethTremain, ThomasCieri, ChristopherStrassel, StephanieMartey, NiiGraff, DavidTofan, Cristina2001*Introduction* Speech in Noisy Environments (SPINE2) Part 2 Audio was used as the development set for the Second Speech in Noisy Environments Evaluation (SPINE2). SPINE2 provided a continuing forum for assessing the state of the art and practice in speech recognition technology for noisy military environments and for exchanging information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. The evaluation provided researchers, potential sponsors, and customers with a quantitative means to appreciate the strengths and weaknesses of the technologies. This work was sponsored in part by National Science Foundation Grant No. IIS-9982201. *Data* This release contains the Speech in Noisy Environments 2 (SPINE2) Clean and Vocoded Development Audio Corpus created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by Arcon Corp. The transcripts for this publication are available as Speech in Noisy Environments (SPINE2) Development Transcripts LDC2001T07. For an example transcript, please click here. These corpora supported the 2001 Speech in Noisy Environments evaluation. The development data comprises two talker pairs (four speakers total) with 16 conversations (sessions) per talker pair (32 conversations total). The audio for each session is presented in three forms: * Unprocessed: the signal recorded at the participant's microphone * Bitstream: the compressed "channel" data produced by the vocoder's analysis stage for transmission from sender to receiver * Processed: the signal produced by the vocoder's synthesis stage, given the bitstream data as input. 
There are a total of 32 clean audio files and 32 vocoded files, one "game" each, for a rough total of three and a half hours (207 minutes) of audio data, 811Mb (including the unprocessed, the processed, and the bitstream files), 9,700 total tokens (600 unique tokens). *Samples* Please view this unprocessed sample and processed sample. *Updates* There are no updates at this time.Corpus size: 792538 KBDistribution: Web DownloadLDC2001S06https://catalog.ldc.upenn.edu/LDC2001S06ISBN: 1-58563-208-2ISLRN: 912-790-987-794-7DOI: 10.35111/q1mw-5s87Schmidt-Nielsen, Astrid, et al. Speech in Noisy Environments (SPINE2) Part 2 Audio LDC2001S06. Web Download. Philadelphia: Linguistic Data Consortium, 2001EnglishLinguistic Data Consortiumhttps://www.ldc.upenn.eduhttps://catalog.ldc.upenn.edu/docs/LDC2001S06Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtainingLDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdfSpeech in Noisy Environments (SPINE2) Part 2 AudioSoundPortions © 2001 Trustees of the University of Pennsylvania