OLAC Record: Towards a more general model of interlinear text

OLAC Record
oai:scholarspace.manoa.hawaii.edu:10125/26180

Metadata

Title: Towards a more general model of interlinear text

Bibliographic Citation: Arkhipov, Alexandre; 2013-03-01; Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/26180.

Contributor (speaker): Arkhipov, Alexandre

Creator: Arkhipov, Alexandre

Date (W3CDTF): 2013-03-01

Description: Interlinear glossed text (IGT) is a complex object whose structural complexity depends on factors such as origin, intended use, and the languages involved. Developing tools and workflows for integrated linguistic analysis environments calls for particular attention to aspects that in many common cases can be disregarded as insignificant; collaboration on ELAN–FLEx integration was a particular motivation for this paper. IGT is often conceived of as a tree: the root node corresponds to the whole text, which is subdivided into smaller units (sentences, words, morphemes). Each unit has a number of associated annotations, generally one per information type, such as sentence translation, part-of-speech label, or morpheme gloss. However, an IGT can easily amount to a large set of trees. Unresolved ambiguities of all kinds are one reason. Each pair of alternative analyses (e.g. two concurrent parses of a word) implies two distinct trees, identical except for the node in question and all its descendants. The more ambiguities arise, the more underlying trees must be posited. Still, all trees in such a tree family stem from a single analyzed object (transcript, original orthographic representation). Since storing an entire tree for each combination of relevant alternatives is utterly inefficient, a more compact storage model is needed. Turning to the media dimension, an accurate transcript of spontaneous discourse is most often unsuitable for grammatical analysis without some preprocessing (normalization) that deals with speech errors, incomprehensible fragments, etc. to produce a grammatically correct and coherent text, whereas the “raw” transcript feeds phonological and possibly discourse analysis. We thus get two distinct texts, interconnected but giving rise to independent (families of) analysis trees; only one of them is linked directly to the media timeline.
In some scenarios, more than one media-based timeline emerges, and these timelines need to be interlinked (cf. the BOLD framework: sound annotations to sound events; retelling experiments, e.g. pear stories; sign languages translated from/into spoken languages). The reference axis may not be a timeline proper (text, a path through a complex graphic image). Further complicating factors include multi-speaker and multilingual settings, collaboration, and versioning. The overall structure (an XML sketch will be presented) might grow unreasonably complex for any specialized analysis component to handle. It may thus be efficient to use an intermediate repository, e.g. a unified underlying RDF representation [Nakhimovsky et al. 2012], into which all changes made in specific tools are merged.
References:
Bow, Cathy, Baden Hughes and Steven Bird. 2003. Towards a General Model of Interlinear Text.
Nakhimovsky, Alexander, Jeff Good and Tom Myers. 2012. Interoperability of Language Documentation Tools and Materials for Local Communities. Digital Humanities 2012.
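The description above notes that every unresolved ambiguity multiplies the number of analysis trees, and that storing each tree in full is inefficient. The following Python sketch illustrates one possible compact scheme and is not the model proposed in the paper: each unit keeps a list of alternative subdivisions, so shared material is stored once and full trees are expanded only on demand. The class `Unit`, the functions `count_trees` and `expand`, and the placeholder forms ("w1", "a", "b", ...) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from itertools import product
from math import prod


@dataclass
class Unit:
    """A text unit (text, sentence, word, morpheme) with its annotations."""
    form: str
    annotations: dict[str, str] = field(default_factory=dict)
    # Each alternative is one complete way of subdividing this unit.
    # A unit without alternatives is a leaf.
    alternatives: list[list["Unit"]] = field(default_factory=list)


def count_trees(unit: Unit) -> int:
    """Count the fully disambiguated trees encoded under this unit."""
    if not unit.alternatives:
        return 1
    # Alternatives add up; the children inside one alternative multiply.
    return sum(prod(count_trees(child) for child in alt)
               for alt in unit.alternatives)


def expand(unit: Unit):
    """Lazily yield each disambiguated tree as a nested (form, children) tuple."""
    if not unit.alternatives:
        yield (unit.form, ())
        return
    for alt in unit.alternatives:
        for children in product(*(expand(child) for child in alt)):
            yield (unit.form, children)


def morphs(*forms: str) -> list[Unit]:
    """Build leaf units for a sequence of placeholder morpheme forms."""
    return [Unit(f) for f in forms]


# Two words, each with two competing parses (placeholder forms, not real data).
word1 = Unit("w1", alternatives=[morphs("a", "b"), morphs("ab")])
word2 = Unit("w2", alternatives=[morphs("c", "d"), morphs("cd")])
sentence = Unit("s", alternatives=[[word1, word2]])
text = Unit("t", alternatives=[[sentence]])

print(count_trees(text))  # number of distinct trees in the family
```

In this toy case two independent two-way ambiguities yield four trees, while each word and each parse is stored only once. With more ambiguous words the number of trees grows multiplicatively, but the stored structure grows only additively.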

Identifier (URI): http://hdl.handle.net/10125/26180

Language: English

Language (ISO639): eng

Rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported

Table Of Contents: 26180.pdf

OLAC Info

Archive: Language Documentation and Conservation

Description: http://www.language-archives.org/archive/ldc.scholarspace.manoa.hawaii.edu

GetRecord: OAI-PMH request for OLAC format

GetRecord: Pre-generated XML file

OAI Info

OaiIdentifier: oai:scholarspace.manoa.hawaii.edu:10125/26180

DateStamp: 2024-09-12

GetRecord: OAI-PMH request for simple DC format

Search Info
Citation: Arkhipov, Alexandre. 2013. Language Documentation and Conservation.
Terms: area_Europe country_GB iso639_eng

http://www.language-archives.org/item.php/oai:scholarspace.manoa.hawaii.edu:10125/26180
Up-to-date as of: Thu Sep 25 0:31:46 EDT 2025
