OLAC Record
oai:www.ldc.upenn.edu:LDC2006T13

Metadata
Title:Web 1T 5-gram Version 1
Access Rights:Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
Bibliographic Citation:Brants, Thorsten, and Alex Franz. Web 1T 5-gram Version 1 LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium, 2006
Contributor:Brants, Thorsten
Franz, Alex
Date (W3CDTF):2006
Date Issued (W3CDTF):2006-09-19
Description:*Introduction*
Web 1T 5-gram Version 1 was contributed by Google Inc. and contains English word n-grams and their observed frequency counts for approximately 1 trillion tokens. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
*Data*
The n-gram counts were generated from text taken from publicly accessible Web pages. The input encoding of documents was automatically detected, and all text was converted to UTF-8. The data was tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
* Hyphenated words are usually separated, and hyphenated numbers usually form one token.
* Sequences of numbers separated by slashes (e.g., in dates) form one token.
* Sequences that look like URLs or email addresses form one token.
The data is distributed as 24 GB of compressed (gzip'ed) text files containing the following:
Tokens 1,024,908,267,229
Sentences 95,119,665,584
Unigrams 13,588,391
Bigrams 314,843,401
Trigrams 977,069,902
Fourgrams 1,313,818,354
Fivegrams 1,176,470,663
*Samples*
For examples of the 3-gram and 4-gram data in this corpus, please review the text samples (TXT).
*Updates*
None at this time.
Extent:Corpus size: 20971520 KB
Identifier:LDC2006T13
https://catalog.ldc.upenn.edu/LDC2006T13
ISBN: 1-58563-397-6
ISLRN: 831-344-220-094-6
DOI: 10.35111/cqpa-a498
Language:English
Language (ISO639):eng
License:Web 1T 5-gram Version 1 Agreement: https://catalog.ldc.upenn.edu/license/web-1t-5-gram-version-1.pdf
Medium:Distribution: Web Download
Publisher:Linguistic Data Consortium
Publisher (URI):https://www.ldc.upenn.edu
Relation (URI):https://catalog.ldc.upenn.edu/docs/LDC2006T13
Rights Holder:Portions © 2006 Google Inc., © 2006 Trustees of the University of Pennsylvania
Type (DCMI):Text
Type (OLAC):primary_text
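
The Description above says the n-gram counts are expected to be useful for statistical language modeling. As a minimal sketch of what that could look like, the Python below estimates a conditional word probability as a ratio of two counts (an unsmoothed maximum-likelihood estimate). The counts, the `TOY_COUNTS` table, and the `mle_probability` helper are all invented for illustration; they are not taken from the corpus, and the record does not prescribe any particular estimation method.

```python
from collections import Counter

# Toy counts, invented for illustration only; NOT taken from the corpus.
TOY_COUNTS = Counter({
    ("the", "quick"): 200,
    ("the", "quick", "brown"): 120,
    ("the", "quick", "red"): 30,
})


def mle_probability(counts, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    context_count = counts.get(context, 0)
    if context_count == 0:
        # Unseen context: no estimate is possible without smoothing.
        return 0.0
    return counts.get(context + (word,), 0) / context_count


if __name__ == "__main__":
    print(mle_probability(TOY_COUNTS, ("the", "quick"), "brown"))
```

A practical language model would normally add smoothing or back-off to handle n-grams that do not appear in the counts; this sketch simply returns 0.0 in that case.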

OLAC Info

Archive:  The LDC Corpus Catalog
Description:  http://www.language-archives.org/archive/www.ldc.upenn.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:www.ldc.upenn.edu:LDC2006T13
DateStamp:  2021-02-26
GetRecord:  OAI-PMH request for simple DC format
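
The GetRecord entries above refer to OAI-PMH requests for this record. As a rough sketch, such a request is an HTTP query carrying the `verb`, `identifier`, and `metadataPrefix` arguments defined by OAI-PMH. The base URL below is a placeholder, since this record does not state the actual endpoint, and the `get_record_url` helper is hypothetical. The prefixes `olac` and `oai_dc` are assumed to correspond to the "OLAC format" and "simple DC format" requests listed here.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption); the record does not give the real base URL.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix, base_url=BASE_URL):
    """Build an OAI-PMH GetRecord request URL for one record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    # urlencode percent-encodes reserved characters such as ':'.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    oai_id = "oai:www.ldc.upenn.edu:LDC2006T13"
    print(get_record_url(oai_id, "olac"))    # assumed prefix for OLAC format
    print(get_record_url(oai_id, "oai_dc"))  # assumed prefix for simple DC
```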

Search Info

Citation: Brants, Thorsten; Franz, Alex. 2006. Linguistic Data Consortium.
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text


http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2006T13
Up-to-date as of: Mon Mar 25 7:20:13 EDT 2024