OLAC Record
oai:www.ldc.upenn.edu:LDC2010T06

Metadata
Title:Chinese Web 5-gram Version 1
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Liu, Fang, Meng Yang, and Dekang Lin. Chinese Web 5-gram Version 1 LDC2010T06. Web Download. Philadelphia: Linguistic Data Consortium, 2010
Contributor:Liu, Fang
Yang, Meng
Lin, Dekang
Date (W3CDTF):2010
Date Issued (W3CDTF):2010-04-19
Description:*Introduction*
Chinese Web 5-gram Version 1, Linguistic Data Consortium (LDC) catalog number LDC2010T06 and ISBN 1-58563-539-1, was created by researchers at Google Inc. It consists of Chinese word n-grams and their observed frequency counts, generated from over 800 billion tokens of text. The length of the n-grams ranges from unigrams (single words) to 5-grams. This data should be useful for statistical language modeling (e.g., segmentation, machine translation) as well as for other uses. Included with this publication is a simple segmenter, written in Perl, that uses the same algorithm used to generate the data.

*Data Collection*
N-gram counts were generated from approximately 883 billion word tokens of text from publicly accessible web pages. This data set contains only n-grams that appeared at least 40 times in the processed sentences; less frequent n-grams were discarded. While the aim was to identify and collect only Chinese-language pages, some text from other languages is incidentally included in the final data. Data collection took place in March 2008; no text created on or after April 1, 2008 was used to develop this corpus.

*Preprocessing*
The input character encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized by an automatic tool, and all continuous Chinese character sequences were processed by the segmenter. The following types of tokens are considered valid:
* A Chinese word containing only Chinese characters.
* Numbers, e.g., 198, 2,200, 2.3, etc.
* Single Latin tokens, such as Google, &ab, etc.

*Extent of Data*
* File sizes: approx. 30 GB compressed (gzip'ed) text files
* Number of tokens: 882,996,532,572
* Number of sentences: 102,048,435,515
* Number of unigrams: 1,616,150
* Number of bigrams: 281,107,315
* Number of trigrams: 1,024,642,142
* Number of fourgrams: 1,348,990,533
* Number of fivegrams: 1,256,043,325

*Sample*
A sample screen shot is available from the LDC catalog entry.
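The record describes the data as gzip'ed text files of n-grams with observed frequency counts but does not spell out the on-disk layout. The Python sketch below shows how such count files could be streamed, assuming the common "n-gram<TAB>count" one-record-per-line layout used by comparable web n-gram releases; that layout and the file name 5gm-0000.gz are assumptions for illustration, not details confirmed by this record.

#!/usr/bin/env python3
"""Minimal sketch: stream n-gram counts from a gzip'ed count file.

Assumes one record per line, "token1 token2 ... tokenN<TAB>count",
in UTF-8. The file name used below is hypothetical.
"""
import gzip
from typing import Iterator, Tuple


def iter_ngrams(path: str) -> Iterator[Tuple[Tuple[str, ...], int]]:
    """Yield (ngram_tokens, count) pairs from one gzip'ed count file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if "\t" not in line:
                continue  # skip blank or malformed lines
            ngram, count = line.rsplit("\t", 1)
            yield tuple(ngram.split(" ")), int(count)


def total_count(path: str) -> int:
    """Sum the observed frequencies in a single count file."""
    return sum(count for _, count in iter_ngrams(path))


if __name__ == "__main__":
    # Hypothetical file name; substitute an actual file from the release.
    for tokens, count in iter_ngrams("5gm-0000.gz"):
        print(tokens, count)
        break

Because the files are large, streaming them line by line as above avoids loading a full shard into memory; aggregate statistics (such as total_count) can be computed in a single pass per file.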
Extent:Corpus size: 31,047,151 KB
Identifier:LDC2010T06
https://catalog.ldc.upenn.edu/LDC2010T06
ISBN: 1-58563-539-1
ISLRN: 958-238-545-740-0
DOI: 10.35111/647p-yt29
Language:Mandarin Chinese
Language (ISO639):cmn
License:Chinese Web 5-gram Version 1 Agreement: https://catalog.ldc.upenn.edu/license/chinese-web-5-gram-version-1.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2010T06
Rights Holder:Portions © 2008 Google Inc., © 2010 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2010T06
DateStamp:  2020-11-30
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Liu, Fang; Yang, Meng; Lin, Dekang. 2010. Chinese Web 5-gram Version 1. Philadelphia: Linguistic Data Consortium.
Terms: area_Asia country_CN dcmi_Text iso639_cmn olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2010T06
Up-to-date as of: Mon Mar 25 7:20:25 EDT 2024