OLAC Record: Wolverhampton Business English Corpus

OLAC Record
oai:catalogue.elra.info:ELRA-W0028

Metadata

Title: Wolverhampton Business English Corpus

Access Rights: Rights available for: nonCommercialUse, commercialUse

Date Available (W3CDTF): 2001-08-14

Date Issued (W3CDTF): 2001-08-14

Date Modified (W3CDTF): 2004-05-12

Description: The WBE was created by the Computational Linguistics Group at the University of Wolverhampton with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335).
A survey of electronic language resources in the business domain carried out at Wolverhampton revealed that there are very few business corpora in existence, and almost none of them are widely accessible. There is significant demand for a business corpus from both the NLP and pedagogic (language, business communication, and linguistics teachers and students) communities.
The Wolverhampton Corpus of Written Business English is:
- A synchronic corpus, including only texts available on the web during a 6-month period in 1999-2000.
- A monolingual English corpus: it comprises only texts written in English, but no restriction was applied as regards the variety of English used. On the contrary, the WBE deliberately tried to capture a wide range of varieties of English by including documents from websites in Britain, USA, Pakistan, Netherlands, Belgium, Switzerland, Hong Kong, etc.
- A written corpus: it contains only written materials. However, a few of the documents are transcripts of speeches.
- A business corpus: the texts were selected manually, and care was taken to ensure that all the texts were from the business domain.
The corpus consists of 10,186,259 words from 23 different Web sites. The data can contribute to a wide range of NLP tasks, including information retrieval, information extraction, summarisation, etc.
The WBE was built using materials solely from the Web. However, this does not mean that the corpus gives access only to a restricted range of categories of texts. On the contrary, the amount of information available online allowed us to select from a wide variety of categories. These range from product descriptions, company press releases, and annual financial reports, to business journalism, academic research papers, political speeches and government reports. The texts have been grouped according to the source site.
The corpus is distributed in three formats:
- The first is the original encoding of the text. The majority of the texts are in HTML and plain text format. There are a few in PDF format or Microsoft Word DOC format.
- The second format is plain text. Files that were not already in plain text format were converted automatically and then checked manually.
- The corpus is also provided as SGML encoded files, using the Corpus Encoding Standard (http://www.cs.vassar.edu/CES/). The header of each file provides information about the title of the file, length in words, etc. The paragraph and sentence boundaries, and part of speech tags for each word, are marked using SGML tags.
All the available files were converted to 8-bit format using ISO 8859-1. Characters with codes from 127 to 255 (also known as Extended ASCII) were manually checked in order to ensure the correct representation of the characters. The corpus was checked for spelling errors, but special care was taken to ensure that any variant spellings specific to the business domain were not wrongly corrected.
Validation was carried out by an external validator. It consisted of checking text files, tools, tagging and documentation.
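The description mentions a manual check of characters with codes in the 127 to 255 range. As a minimal sketch of how a user might locate such characters before inspecting them, the following Python snippet counts them in a byte string decoded as ISO 8859-1. The function name `extended_chars` and the sample text are illustrative assumptions; this is not the procedure the corpus builders used.

```python
from collections import Counter


def extended_chars(data: bytes) -> Counter:
    """Count characters whose codes fall in the range 127-255.

    Hypothetical helper: the bytes are decoded as ISO 8859-1, where
    every byte value maps to exactly one character.
    """
    text = data.decode("iso-8859-1")
    return Counter(ch for ch in text if 127 <= ord(ch) <= 255)


if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    sample = "Caf\u00e9 prices rose by \u00a35".encode("iso-8859-1")
    for ch, n in sorted(extended_chars(sample).items()):
        print(f"0x{ord(ch):02X} {ch!r} occurs {n} time(s)")
```

In practice the bytes would come from a corpus file opened in binary mode, and each reported character would then be reviewed in context.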

Identifier: ELRA-W0028

ISLRN: 327-829-185-850-4

Identifier (URI): https://catalog.elra.info/en-us/repository/browse/ELRA-W0028/

Language: English

Language (ISO639): eng

Medium: Not specified

Publisher: ELRA (European Language Resources Association)

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: ELRA Catalogue of Language Resources

Description: http://www.language-archives.org/archive/catalogue.elra.info

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:catalogue.elra.info:ELRA-W0028

DateStamp: 2001-08-14

GetRecord: OAI-PMH request for simple DC format
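
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP query with the arguments `verb`, `identifier`, and `metadataPrefix`. The base URL below is a placeholder assumption (the record does not state the endpoint), and `get_record_url` is a hypothetical helper. `oai_dc` is the standard prefix for simple Dublin Core; the OLAC format is assumed to use the prefix `olac`.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint, not given in this record.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:catalogue.elra.info:ELRA-W0028"
    print(get_record_url(oai_id, "oai_dc"))  # simple DC format
    print(get_record_url(oai_id, "olac"))    # OLAC format (assumed prefix)
```

Replacing the placeholder with the archive's actual OAI-PMH base URL would be required before the request could be sent.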

Search Info
Citation: n.a. 2001. ELRA (European Language Resources Association).
Terms: area_Europe country_GB dcmi_Text iso639_eng olac_primary_text

http://www.language-archives.org/item.php/oai:catalogue.elra.info:ELRA-W0028
Up-to-date as of: Wed Oct 1 0:55:01 EDT 2025
