OLAC Record: Corpora, collections, data – Reusing outputs of language documentation

OLAC Record
oai:scholarspace.manoa.hawaii.edu:10125/25304

Metadata

Title: Corpora, collections, data – Reusing outputs of language documentation

Bibliographic Citation: Thieberger, Nick; 2015-02-27; Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/25304.

Contributor (speaker): Thieberger, Nick

Creator: Thieberger, Nick

Date (W3CDTF): 2015-03-12

Description: With the success of new methods in language documentation comes the creation of collections of records in an increasing number of small languages. Looking back over the past decade of such work reveals a heterogeneity in the form of collections that reflects the context in which each linguist has been trained and the relative focus they put on creating records. The Australian Centre of Excellence in the Dynamics of Language is a new seven-year documentation program that will include a data management and archiving ‘thread’, and so needs to consider what form its primary research material should take. The distinction between primary and secondary data is laid out in Himmelmann (2012); this paper will explore the range of types of primary data that can be considered part of a corpus, and how a corpus is distinct from a collection. A corpus for these purposes is structured and often allows some interoperability with other corpora for searching comparable phenomena, as in the CorpAfroAs [1] project, for example. Collections are more idiosyncratic and typically require some work on the part of the researcher to make use of them. Data types representing the outputs of funded research range from unannotated primary data through to elaborately annotated interlinear glossed text and media. What is the ideal form of the material that would allow it to be interpreted, accessed and re-used, and how can current and future researchers collaborate in constructing corpora that will be accessed in this way in future?

While linguists have adopted several standard formats for their research outputs, typically based on the schema provided by the tools used (ELAN, FieldWorks, Praat and so on), and perhaps using conventions like the Leipzig Glossing Rules, there is no agreement about the internal structure of a documentary corpus, nor about the methods used to expose elements of the corpus for discovery and analysis. We have so far built a repository (PARADISEC [2]) which provides citability and preservation of primary records. We have also built a system for presenting corpus material as interlinear text with media (EOPAS [3]). We are exploring what gaps there are in the workflow from creation to reuse of research material, in order to build new tools that provide richer sources of information about the world’s languages.

[1] http://corpafroas.huma-num.fr/
[2] http://paradisec.org.au
[3] http://eopas.org

Reference: Himmelmann, Nikolaus P. 2012. Linguistic Data Types and the Interface between Language Documentation and Description. Language Documentation & Conservation 6. 187-207.

Identifier (URI): http://hdl.handle.net/10125/25304

Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported

Table Of Contents: 25304.mp3

Table Of Contents: 25304.pdf

OLAC Info

Archive: Language Documentation and Conservation

Description: http://www.language-archives.org/archive/ldc.scholarspace.manoa.hawaii.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:scholarspace.manoa.hawaii.edu:10125/25304

DateStamp: 2024-09-08

GetRecord: OAI-PMH request for simple DC format

Search Info

Citation: Thieberger, Nick. 2015. Language Documentation and Conservation.

http://www.language-archives.org/item.php/oai:scholarspace.manoa.hawaii.edu:10125/25304
Up-to-date as of: Thu Sep 25 0:32:04 EDT 2025
