OLAC Record: FERNET-C5

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-3776

Metadata

Title: FERNET-C5

Bibliographic Citation: http://hdl.handle.net/11234/1-3776

Creator: Lehečka, Jan

Švec, Jan

Date (W3CDTF): 2021-09-20T12:33:51Z

Date Available: 2021-09-20T12:33:51Z

Description: FERNET-C5 is a monolingual BERT language representation model trained from scratch on the Czech Colossal Clean Crawled Corpus (C5), a Czech counterpart of the English C4 dataset. The training data contained almost 13 billion words (93 GB of text). The model has the same architecture as the original BERT model, i.e. 12 transformer blocks, 12 attention heads and a hidden size of 768. In contrast to Google’s BERT models, we used SentencePiece tokenization instead of Google’s internal WordPiece tokenization. More details can be found in README.txt. A more detailed description is given in the paper (https://arxiv.org/abs/2107.10042). The same models are also released at https://huggingface.co/fav-kky/FERNET-C5

Identifier (URI): http://hdl.handle.net/11234/1-3776

Language: Czech

Language (ISO639): ces

Publisher: University of West Bohemia, Department of Cybernetics

Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

http://creativecommons.org/licenses/by-nc-sa/4.0/

Subject: Czech

BERT

Czech language

Subject (ISO639): ces

Type: languageDescription

Type (DCMI): Text

Type (OLAC): language_description
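
The architecture figures in the Description (12 blocks, 12 attention heads, hidden size 768) can be checked with a little arithmetic. The following is only a sketch: the class name `EncoderShape` is hypothetical, and the feed-forward width of 4 x hidden size is an assumption taken from the original BERT base configuration, not something this record states. Embedding parameters are left out because the record does not give the vocabulary size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """Hypothetical helper holding the figures quoted in the Description."""
    num_layers: int = 12      # "12 ... blocks"
    num_heads: int = 12       # "12 attention heads"
    hidden_size: int = 768    # "hidden size of 768"
    # Assumption (not in the record): feed-forward width = 4 * hidden_size,
    # as in the original BERT base configuration.
    ffn_multiplier: int = 4

    @property
    def head_dim(self) -> int:
        """Width of one attention head; hidden_size must split evenly."""
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def encoder_parameters(self) -> int:
        """Weights and biases in the encoder stack only (no embeddings)."""
        h = self.hidden_size
        ffn = self.ffn_multiplier * h
        attention = 4 * (h * h + h)                    # Q, K, V and output projections
        feed_forward = (h * ffn + ffn) + (ffn * h + h)  # two dense layers
        layer_norms = 2 * 2 * h                        # two LayerNorms, scale + bias
        return self.num_layers * (attention + feed_forward + layer_norms)


shape = EncoderShape()
print(shape.head_dim)              # 64
print(shape.encoder_parameters())  # 85054464
```

Under these assumptions each attention head is 64 units wide and the encoder stack holds roughly 85 million parameters; the total for the released model would be larger once the embedding tables are included.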

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-3776

DateStamp: 2021-09-20

GetRecord: OAI-PMH request for simple DC format
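
The two GetRecord entries (OLAC format and simple DC format) correspond to OAI-PMH requests that differ only in their `metadataPrefix`. A minimal sketch of how such request URLs could be built is shown here. The base URL is an assumption and is not given in this record; the function name `get_record_url` is hypothetical. `oai_dc` is the prefix that OAI-PMH reserves for simple Dublin Core, and `olac` is assumed to be the prefix the repository uses for OLAC records.

```python
from urllib.parse import urlencode

# Assumption: the repository's OAI-PMH endpoint. This URL does not appear in
# the record; replace it with the base URL advertised by the archive.
BASE_URL = "https://lindat.mff.cuni.cz/repository/oai/request"

# Taken from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-3776"


def get_record_url(metadata_prefix: str,
                   identifier: str = OAI_IDENTIFIER,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL for one metadata format."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


olac_url = get_record_url("olac")     # OLAC format
dc_url = get_record_url("oai_dc")     # simple Dublin Core format
print(olac_url)
print(dc_url)
```

`urlencode` percent-encodes the colons and the slash in the identifier, which keeps the query string valid; the server decodes them back to the original identifier.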

Search Info
Citation: Lehečka, Jan; Švec, Jan. 2021. University of West Bohemia, Department of Cybernetics.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_language_description

Inferred Metadata
Country: Czech Republic
Area: Europe

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-3776
Up-to-date as of: Mon Jun 16 1:05:48 EDT 2025
