OLAC Record: Hungarian-English Parallel Text, Version 1.0

OLAC Record
oai:www.ldc.upenn.edu:LDC2008T01

Metadata

Title: Hungarian-English Parallel Text, Version 1.0

Access Rights: Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining

Bibliographic Citation: Varga, Dániel, et al. Hungarian-English Parallel Text, Version 1.0 LDC2008T01. Web Download. Philadelphia: Linguistic Data Consortium, 2008

Contributor: Varga, Dániel; Németh, László; Halácsy, Péter; Kornai, András; et al.

Date (W3CDTF): 2008

Date Issued (W3CDTF): 2008-01-22

Description:

*Introduction*

Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus") is a sentence-aligned Hungarian-English parallel corpus consisting of approximately two million sentence pairs. The corpus contains additional language resources for the Hungarian text, including a monolingual corpus, morphological toolset and aligner. Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (BUTE) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics. Additional information about this release is available from the corpus website maintained by BUTE.

*File formats, character encoding*

This publication is issued on CD as a tarred zip file. Commonly available utilities such as GNU Zip or StuffIt will readily extract this publication from its compressed form.

Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the raw texts cannot be fully reconstructed from them. Some .bi files were shuffled (sorted alphabetically).

Alignment "ladder" (.lad) files preserve the whole of both input texts in their original order, including segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. A segment will generally consist of exactly one sentence on both sides, but can also consist of zero sentences or of more than one sentence. When a segment holds more than one sentence, the special separating token " ~~~ " is placed between sentences.

The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded. hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.

*Samples*

For an example of the data contained in this corpus, please examine this sample screen capture of bilingual text.
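As an illustration of the .lad layout described under "File formats, character encoding", the following is a minimal Python sketch for reading one line of such a file. It is not part of the corpus toolset; the function names (parse_lad_line, split_segment) are hypothetical, and the sketch assumes that each line holds exactly one tab between the Hungarian and English columns and that the line is read as raw bytes so that each side can be decoded with its own encoding.

```python
from typing import List, Tuple

# Token placed between sentences when a segment holds more than one sentence.
SENTENCE_SEPARATOR = "~~~"


def split_segment(segment: str) -> List[str]:
    """Split one column into sentences; an empty column yields an empty list."""
    return [s.strip() for s in segment.split(SENTENCE_SEPARATOR) if s.strip()]


def parse_lad_line(raw: bytes) -> Tuple[List[str], List[str]]:
    """Decode one raw .lad line into (Hungarian sentences, English sentences).

    Assumption: the first column is ISO Latin-2 (Hungarian) and the second
    column is ISO Latin-1 (English), as stated in the record description.
    """
    line = raw.rstrip(b"\r\n")
    hu_raw, _, en_raw = line.partition(b"\t")
    hungarian = hu_raw.decode("iso-8859-2")
    english = en_raw.decode("iso-8859-1")
    return split_segment(hungarian), split_segment(english)


if __name__ == "__main__":
    # Hand-made sample line, not taken from the corpus.
    sample = (
        "Jó napot. ~~~ Ők jönnek.".encode("iso-8859-2")
        + b"\t"
        + "Good day. ~~~ They are coming.".encode("iso-8859-1")
        + b"\n"
    )
    print(parse_lad_line(sample))
```

Reading the file in binary mode (for example, iterating over open(path, "rb")) keeps the two encodings separate; opening it as ISO Latin-2 text would also work for viewing, as the description notes, but could misread Latin-1 characters on the English side that fall outside the shared subset.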

Extent: Corpus size: 1992294 KB

Identifier: LDC2008T01

https://catalog.ldc.upenn.edu/LDC2008T01

ISBN: 1-58563-461-1

ISLRN: 694-868-944-045-4

DOI: 10.35111/khb2-hh45

Language: Hungarian

Language (ISO639): hun

License: Hungarian-English Parallel Text, Version 1 Agreement: https://catalog.ldc.upenn.edu/license/hungarian-english-parallel-text-version-1.pdf

Medium: Distribution: Web Download

Publisher: Linguistic Data Consortium

Publisher (URI): https://www.ldc.upenn.edu

Relation (URI): https://catalog.ldc.upenn.edu/docs/LDC2008T01

Rights Holder: Portions © 2005 Budapest University of Technology and Economics, © 2005 Hungarian Academy of Sciences Institute of Linguistics, © 2005 Diplomacy and Trade Magazine, © 1996, 2008 Trustees of the University of Pennsylvania

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: The LDC Corpus Catalog

Description: http://www.language-archives.org/archive/www.ldc.upenn.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:www.ldc.upenn.edu:LDC2008T01

DateStamp: 2020-11-30

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Varga, Dániel; Németh, László; Halácsy, Péter; Kornai, András; et al. 2008. Linguistic Data Consortium.
Terms: area_Europe country_HU dcmi_Text iso639_hun olac_primary_text

http://www.language-archives.org/item.php/oai:www.ldc.upenn.edu:LDC2008T01
Up-to-date as of: Wed Oct 29 7:01:01 EDT 2025
