OLAC Record: French Gigaword Third Edition

OLAC Record
oai:www.ldc.upenn.edu:LDC2011T10

Metadata

Title: French Gigaword Third Edition

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Graff, David, Ângelo Mendonça, and Denise DiPersio. French Gigaword Third Edition LDC2011T10. Web Download. Philadelphia: Linguistic Data Consortium, 2011

Contributor: Graff, David

Mendonça, Ângelo

DiPersio, Denise

Date (W3CDTF): 2011

Date Issued (W3CDTF): 2011-09-15

Description: *Introduction*
French Gigaword Third Edition is a comprehensive archive of newswire text data acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This third edition updates French Gigaword Second Edition (LDC2009T28) and adds material collected from January 1, 2009 through December 31, 2010.
The two distinct international sources of French newswire in this edition, and the time spans of collection covered for each, are as follows:
* Agence France-Presse (afp_fre) May 1994 - Dec. 2010
* Associated Press French Service (apw_fre) Nov. 1994 - Dec. 2010
The seven-character codes in parentheses consist of a three-character source name abbreviation and the three-character language code (fre), separated by an underscore (_) character. The three-letter language code conforms to the ISO 639-2/B standard.
*Data*
Each data file name consists of the seven-character prefix plus another underscore character, followed by a 6-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a .gz file extension, indicating that the file contents have been compressed using the GNU gzip compression utility (RFC 1952). Each file therefore contains all the usable data received by LDC for the given month from the given news source.
All text data are presented in SGML form, using a very simple, minimal markup structure. All text consists of printable ASCII, white space, and printable code points in the Latin-1 Supplement character table, as defined by the Unicode Standard (ISO 10646), for the accented characters used in French. The Latin-1 Supplement (accented) characters are presented in UTF-8 encoding. The file dtd/gigaword_f.dtd provides the formal Document Type Declaration for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.
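The file naming scheme and compression described above can be handled with a few lines of standard-library Python. The following is only a minimal sketch, not part of the LDC release: it assumes the 6-digit date is ordered year-then-month (YYYYMM) and that file names are lowercase, and the names CorpusFile, parse_file_name, and iter_lines are hypothetical helpers introduced for illustration.

```python
import gzip
import re
from typing import Iterator, NamedTuple

# Naming scheme from the description: 3-character source, underscore,
# 3-character language code, underscore, 6-digit date, ".gz".
# Assumption: the date is YYYYMM and names are lowercase.
FILE_NAME_RE = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


class CorpusFile(NamedTuple):
    """Fields recovered from one data file name (hypothetical helper type)."""
    source: str
    language: str
    year: int
    month: int


def parse_file_name(name: str) -> CorpusFile:
    """Split a name such as 'afp_fre_199405.gz' into its components."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, language, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return CorpusFile(source, language, int(year), month_number)


def iter_lines(path: str) -> Iterator[str]:
    """Stream a gzip-compressed file line by line, decoding it as UTF-8.

    Streaming avoids loading a whole month of uncompressed text at once.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, parse_file_name("afp_fre_199405.gz") yields the source afp, the language fre, the year 1994, and the month 5; the example file name is constructed from the naming rule above rather than taken from a file listing.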
The SGML structure for this release differs notably from the markup strategy used in early (pre-Gigaword) LDC publications of newswire data; the differences are intended to facilitate bulk processing of the present corpus. The major differences are:
* Early corpora usually organized the data as one file per day, or limited the average file size to one megabyte (MB). Typical compressed file sizes in the current corpus range from about 0.1 MB to about 10 MB; this equates to a range of about 0.5 to 30 MB per file when the data are uncompressed. In general, these files are not intended for use with interactive text editors or word processing software (though many such programs are likely to work reasonably well with these files). Rather, it is expected that the files will be used as input to programs that are geared to dealing with data in such quantities, for filtering, conditioning, indexing, statistical summary, etc.
* Early corpora tended to use different markup outlines (different tag sets) depending on the data source; the structural properties of each source were generally preserved to the extent possible (even though many elements of the delivered structure may have been meaningless for research use). The present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs). The dateline is a brief string typically found at the beginning of the first paragraph in each news story, giving the location the report is coming from, and sometimes the news service and/or date. Since this content is not part of the initial sentence, we separate it from the first paragraph (this was not done prior to the Gigaword corpora).
For all of the documents in this corpus, we have applied a rudimentary (and approximate) categorization of DOC units into four distinct types.
The classification is indicated by the type=string attribute that is included in each opening DOC tag. The four types are:
* story: This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
* multi: This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event. It is typically applied to DOCs that contain "summaries of today's news", "news briefs in ..." (some general area like finance or sports), and so on.
* advis: (short for advisory) These are DOCs which the news service addresses to news editors. They are not intended for publication to the end users (the populations who read the news).
* other: This represents DOCs that clearly do not fall into any of the above types. In general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they are not really stories). These are things like lists of sports scores, stock prices, temperatures around the world, and so on.
The overall totals for each source are summarized below. Note that the Totl-MB numbers show the amount of data when the files are uncompressed (i.e. approximately 5.7 gigabytes in total); the Gzip-MB column shows totals for compressed file sizes; the K-wrds numbers are simply the number of white space-separated tokens (of all types) after all SGML tags are eliminated.
Source   #Files  Gzip-MB  Totl-MB  K-wrds    #DOCs
afp_fre     195     1503     4255  641381  2356888
apw_fre     194      489     1446  221470   801075
TOTAL       389     1992     5701  862851  3157963
*Sample*
Please view this sample.
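The type attribute on opening DOC tags and the K-wrds definition (white space-separated tokens after SGML tags are removed) lend themselves to a simple streaming tally. The sketch below is an illustration under stated assumptions, not the procedure LDC used to produce the totals: it assumes each tag sits entirely on one line, that the type value is a plain alphabetic string (optionally double-quoted), and that no literal angle brackets occur in the text itself. The function names count_doc_types and count_tokens are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: an opening DOC tag fits on one line and carries a
# type attribute whose value is alphabetic, with or without quotes.
DOC_OPEN_RE = re.compile(r'<DOC\b[^>]*\btype\s*=\s*"?([A-Za-z]+)"?[^>]*>')

# Assumption: every "<...>" run is markup, never text content.
TAG_RE = re.compile(r"<[^>]+>")


def count_doc_types(lines: Iterable[str]) -> Counter:
    """Tally DOC units by the value of their type attribute."""
    counts: Counter = Counter()
    for line in lines:
        for match in DOC_OPEN_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts


def count_tokens(lines: Iterable[str]) -> int:
    """Count white space-separated tokens once SGML tags are stripped."""
    total = 0
    for line in lines:
        # Replace each tag with a space so adjacent words stay separate.
        total += len(TAG_RE.sub(" ", line).split())
    return total


if __name__ == "__main__":
    # Synthetic sample, not taken from the corpus.
    sample = [
        '<DOC type="story">',
        "Bonjour le monde.",
        "</DOC>",
        '<DOC type="advis">',
        "Note aux rédactions",
        "</DOC>",
    ]
    print(count_doc_types(sample))
    print(count_tokens(sample))
```

Both functions accept any iterable of lines, so they can consume a decompressed file stream directly. SGML entity references, if present, are counted as part of the token they appear in, so results may differ slightly from the published K-wrds figures.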

Extent: Corpus size: 2037582 KB

Identifier: LDC2011T10

https://catalog.ldc.upenn.edu/LDC2011T10

ISBN: 1-58563-593-6

ISLRN: 447-232-270-158-3

DOI: 10.35111/2fnv-vm59

Language: French

Language (ISO639): fra

License: LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2011T10

Rights Holder: Portions © 1994-2010 Agence France-Presse, © 1994-2010 The Associated Press, © 2006, 2009, 2011 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2011T10

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Graff, David; Mendonça, Ângelo; DiPersio, Denise. 2011. Linguistic Data Consortium.
Terms: area_Europe country_FR dcmi_Text iso639_fra olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2011T10
Up-to-date as of: Wed Oct 29 7:01:17 EDT 2025
