OLAC Record
oai:www.ldc.upenn.edu:LDC2004T04

Metadata
Title:ICSI Meeting Transcripts
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Janin, Adam, et al. ICSI Meeting Transcripts LDC2004T04. Web Download. Philadelphia: Linguistic Data Consortium, 2004
Contributor:Janin, Adam
Edwards, Jane
Ellis, Dan
Gelbart, David
Morgan, Nelson
Peskin, Barbara
Pfau, Thilo
Shriberg, Elizabeth
Stolcke, Andreas
Wooters, Chuck
Date (W3CDTF):2004
Date Issued (W3CDTF):2004-01-30
Description:*Introduction* ICSI Meeting Transcripts was produced by the Linguistic Data Consortium (LDC) and contains word-level transcripts for approximately 72 hours of English meeting recordings, totaling about 795,000 words. The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute (ICSI) in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway; they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. The corresponding speech files for these transcripts are available in ICSI Meeting Speech (LDC2004S02).
*Data* This corpus consists of 75 word-level transcripts (one transcript file per meeting), time-synchronized to digitized audio recordings. There are approximately 795,000 words and 13,000 unique words in the transcripts. The meetings were recorded with close-talking and far-field microphones. The transcripts were based mostly on the close-talking microphones, either separately or blended together in a so-called "mixed" channel. The focus of the transcripts was on capturing the flow of audible events, especially the words that were spoken and who spoke them. In addition to recording the meetings themselves, the participants were asked to read digit strings, similar to those found in TIDIGITS (LDC93S10), at the start or end of the meeting. This small-vocabulary read-speech component of the recordings, which uses the same meeting room, speakers, and microphones, provides a valuable supplement to the natural conversational data, allowing a factorization of the speech challenges offered by the corpus. For all but a dozen of the meetings included in the corpus, at least some of the participants read digit strings; for the great majority of meetings, all participants did. The digit readings are included as part of the wave files for the meeting as a whole and are fully transcribed as part of the associated transcripts. The transcripts are provided in an XML format developed for this corpus, which we call MRT files (for Meeting Room Transcript). The format is detailed in the associated documentation. Transcripts were prepared with the Channeltrans interface, an extension of the Transcriber interface. There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to 10 participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.
*Sponsorship* The collection and preparation of this corpus was made possible in large part through funding from DARPA (both through the Communicator project and through a ROAR "seedling"), the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.
*Updates* There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.
Extent:Corpus size: 22528 KB
Identifier:LDC2004T04
https://catalog.ldc.upenn.edu/LDC2004T04
ISBN: 1-58563-286-4
ISLRN: 295-380-961-299-0
DOI: 10.35111/0sq8-s977
Language:English
Language (ISO639):eng
License:LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2004T04
Rights Holder:Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2004T04
DateStamp:  2024-03-01
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Janin, Adam; Edwards, Jane; Ellis, Dan; Gelbart, David; Morgan, Nelson; Peskin, Barbara; Pfau, Thilo; Shriberg, Elizabeth; Stolcke, Andreas; Wooters, Chuck. 2004. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2004T04
Up-to-date as of: Mon Mar 25 7:19:43 EDT 2024