OLAC Record: ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)

OLAC Record
oai:lindat.mff.cuni.cz:11234/1-2580

Metadata

Title: ORTOFON v1: balanced corpus of informal spoken Czech with multi-tier transcription (transcriptions)

Bibliographic Citation: http://hdl.handle.net/11234/1-2580

Creator: Kopřivová, Marie

Komrsková, Zuzana

Lukeš, David

Poukarová, Petra

Škarpová, Marie

Date (W3CDTF): 2018-01-02T12:21:53Z

Date Available: 2018-01-02T12:21:53Z

Description: ORTOFON v1 is designed as a representation of authentic spoken Czech used in informal situations (private environment, spontaneity, unpreparedness, etc.) throughout the Czech Republic. The corpus is composed of 332 recordings from 2012–2017 and contains 1 014 786 orthographic words (i.e. a total of 1 236 508 tokens including punctuation); a total of 624 different speakers appear in the probes. ORTOFON v1 is fully balanced with respect to the basic sociolinguistic speaker categories (gender, age group, level of education, and region of childhood residence). The transcription is linked to the corresponding audio track. Unlike the ORAL-series corpora, the transcription was carried out on two main tiers, orthographic and phonetic, supplemented by an additional metalanguage tier. ORTOFON v1 is lemmatized and morphologically tagged. The (anonymized) corpus is provided in a (semi-XML) vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available via the KonText query engine to registered users of the CNC at http://www.korpus.cz. Please note: this item includes only the transcriptions; the audio (and the transcripts in their original format) is available under a more restrictive non-CC license at http://hdl.handle.net/11234/1-2579
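
The vertical format mentioned in the description generally puts one token per line, with tab-separated attributes, and puts XML-like structural tags on lines of their own. The following is a minimal reading sketch under that assumption. The column names (word, lemma, tag), the function name iter_tokens, and the sample text are hypothetical placeholders, not the documented ORTOFON v1 layout, and the tag test is a simple heuristic.

```python
def iter_tokens(text, columns=("word", "lemma", "tag")):
    """Yield one dict per token line of vertical-format text.

    Assumption: lines that start with "<" and end with ">" are structural
    tags; every other non-empty line is a token with tab-separated
    attributes. The column names are placeholders.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<") and line.endswith(">"):
            continue  # structural tag, not a token
        values = line.split("\t")
        # zip() stops at the shorter sequence, so extra columns are ignored
        yield dict(zip(columns, values))


# Invented sample for illustration only; not taken from the corpus.
SAMPLE = (
    '<doc id="example">\n'
    "<sp>\n"
    "ahoj\tahoj\tTAG1\n"
    ",\t,\tTAG2\n"
    "</sp>\n"
    "</doc>\n"
)

tokens = list(iter_tokens(SAMPLE))
print(len(tokens), tokens[0]["word"])
```

The real attribute order and any additional tiers should be taken from the corpus documentation before relying on the column names.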

Identifier (URI): http://hdl.handle.net/11234/1-2580

Is Replaced By (URI): http://hdl.handle.net/11234/1-5687

Language: Czech

Language (ISO639): ces

Publisher: Charles University, Faculty of Arts, Institute of the Czech National Corpus

Rights: Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

http://creativecommons.org/licenses/by-nc-sa/4.0/

Subject: balanced corpus

spoken language

informal language

Czech

Type: corpus

Type (DCMI): Text

Type (OLAC): primary_text

OLAC Info

Archive: LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University

Description: http://www.language-archives.org/archive/lindat.mff.cuni.cz

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:lindat.mff.cuni.cz:11234/1-2580

DateStamp: 2024-10-10

GetRecord: OAI-PMH request for simple DC format
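
The GetRecord entries above refer to OAI-PMH requests. As a rough sketch, such a request is an HTTP GET with the arguments verb, identifier, and metadataPrefix. The base URL below is a placeholder assumption (the actual endpoint is not given in this record), and the function name get_record_url is hypothetical. The prefixes "olac" and "oai_dc" are the conventional ones for OLAC and simple Dublin Core output.

```python
from urllib.parse import urlencode

# Placeholder endpoint; replace with the archive's real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Identifier copied from the OaiIdentifier field of this record.
OAI_IDENTIFIER = "oai:lindat.mff.cuni.cz:11234/1-2580"


def get_record_url(base_url, identifier, metadata_prefix):
    """Build an OAI-PMH GetRecord request URL with percent-encoded arguments."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


olac_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "olac")
dc_url = get_record_url(BASE_URL, OAI_IDENTIFIER, "oai_dc")
print(olac_url)
print(dc_url)
```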

Search Info
Citation: Kopřivová, Marie; Komrsková, Zuzana; Lukeš, David; Poukarová, Petra; Škarpová, Marie. 2018. Charles University, Faculty of Arts, Institute of the Czech National Corpus.
Terms: area_Europe country_CZ dcmi_Text iso639_ces olac_primary_text

http://www.language-archives.org/item.php/oai:lindat.mff.cuni.cz:11234/1-2580
Up-to-date as of: Mon Jun 16 1:05:14 EDT 2025
