OLAC Record: 1996 CSR HUB4 Language Model

OLAC Record
oai:www.ldc.upenn.edu:LDC98T31

Metadata

Title: 1996 CSR HUB4 Language Model

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: MacIntyre, Robert. 1996 CSR HUB4 Language Model LDC98T31. Web Download. Philadelphia: Linguistic Data Consortium, 1998

Contributor: MacIntyre, Robert

Date (W3CDTF): 1998

Description: *Introduction* This corpus contains data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR HUB4 Evaluation. *Data* The LDC obtained the bulk of the data from broadcast news CD-ROMs produced by Primary Source Media, Inc. This portion covers the period from January 1992 to April 1996 and contains approximately one gigabyte of uncompressed data. This release also includes about 36 megabytes of material received on floppy disks, covering the period from late May through June 1996, in a somewhat different format from the bulk of the data. The text data are presented in two forms: (1) a relatively unprocessed ("raw" or "sentence-tagged") form and (2) a fully processed ("conditioned," "verbalized-punctuation") form. The "raw" form includes the header and footer information accompanying the articles, such as network, show name, headline, copyright, credits and so forth; the text and ancillary data are presented in a fairly consistent (though simple) SGML format. The "processed" form contains only the text content of the articles, together with SGML tags to mark the boundaries of articles, paragraphs and sentences; the text content has been modified by replacing numeric strings (dates, dollar amounts, quantities) with orthographic strings (e.g. "nineteen ninety six"), replacing abbreviations ("Inc.," "Ltd.," "Corp.," etc.) with corresponding full-word forms and replacing punctuation characters with corresponding word tokens (e.g. "," becomes "COMMA"). This release also includes an archive of the tools used to create the "processed" form of the data. *Updates* There are no updates at this time. *Additional Licensing Instructions* This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
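
The conditioning described above (numbers to words, abbreviations to full forms, punctuation to word tokens) can be illustrated with a minimal Python sketch. This is not the tool set shipped with the corpus. The function names, the abbreviation expansions, the "PERIOD" token and the year-only number handling are all assumptions made for illustration; only the "," to "COMMA" mapping and the "nineteen ninety six" style come from the description.

```python
import re

# Hypothetical lookup tables. Only "," -> "COMMA" is stated in the record;
# the other entries are assumed for the sake of the example.
ABBREVIATIONS = {"Inc.": "Incorporated", "Ltd.": "Limited", "Corp.": "Corporation"}
PUNCTUATION = {",": "COMMA", ".": "PERIOD"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def verbalize_year(text):
    """Read a four-digit year as two pairs, e.g. 1996 -> nineteen ninety six."""
    high, low = divmod(int(text), 100)
    if low == 0:
        return f"{two_digits(high)} hundred"
    if low < 10:
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def condition(sentence):
    """Apply the three substitutions to one whitespace-tokenized sentence."""
    out = []
    for token in sentence.split():
        trailing = []
        # Peel trailing punctuation, but keep the period of a known abbreviation.
        while token and token[-1] in PUNCTUATION and token not in ABBREVIATIONS:
            trailing.insert(0, PUNCTUATION[token[-1]])
            token = token[:-1]
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"1[0-9]{3}", token):  # only years 1000-1999 are handled
            out.append(verbalize_year(token))
        elif token:
            out.append(token)
        out.extend(trailing)
    return " ".join(out)


print(condition("Acme Corp., founded in 1996."))
```

Dollar amounts, quantities and full dates, which the description also mentions, would need their own rules and are left out of this sketch.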

Identifier: LDC98T31

https://catalog.ldc.upenn.edu/LDC98T31

ISBN: 1-58563-122-1

ISLRN: 905-430-625-113-0

DOI: 10.35111/jvpt-9682

Language: English

Language (ISO639): eng

License: 1996 CSR Hub-4 Language Model Agreement: https://catalog.ldc.upenn.edu/license/1996-csr-hub-4-language-model.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC98T31

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC98T31

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format
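
As a rough illustration of what the GetRecord entries refer to, the sketch below builds an OAI-PMH GetRecord request URL for this record's OaiIdentifier. The base URL is a placeholder, not the archive's real endpoint, and the helper name is hypothetical. The `oai_dc` prefix is the one OAI-PMH defines for simple Dublin Core; using `olac` for the OLAC format is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the repository's actual OAI-PMH base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build a GetRecord request URL from the three required arguments."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


oai_id = "oai:www.ldc.upenn.edu:LDC98T31"
print(get_record_url(oai_id, "oai_dc"))  # simple DC format
print(get_record_url(oai_id, "olac"))    # OLAC format (prefix assumed)
```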

Search Info
Citation: MacIntyre, Robert. 1998. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC98T31
Up-to-date as of: Wed Oct 29 7:00:48 EDT 2025
