The Open Language Archives Community

Archiving and linguistic resources or How to keep your data from becoming endangered

Organized by Jeff Good (MPI Leipzig) and Heidi Johnson (University of Texas, Austin and AILLA) for the annual meeting of the Linguistic Society of America, 2005, Oakland, California

Materials from the tutorial, in order of presentation

  1. [ pdf | ppt ] Peter K. Austin, Robert Munro, and David Nathan: Archives, linguists, and language speakers
  2. [ pdf | ppt ] Gary Simons: The Open Language Archives Community: Building a worldwide library of digital language resources
  3. [ pdf | ppt ] Helen Dry: School of Best Practices in Digital Language Documentation
  4. Helen Agüera: Archival projects and the NEH (to be provided)
  5. [ pdf ] Gary Holton: Ethical practices in language documentation and archiving
  6. [ pdf | ppt ] Mark Kaiser: Digitizing the audio archive of linguistic fieldwork at the Berkeley Language Center
  7. [ pdf | ppt ] Heidi Johnson: Language documentation and archiving
  8. [ pdf | ppt ] Jeff Good: Archiving and linguistics databases
  9. [ pdf | ppt ] Nick Thieberger: Archiving and the flow of field work


Over the last several decades, the rise of new technologies has drastically changed the nature of linguistic resources. Whereas formerly primary documentary evidence for a language consisted of notebooks and analog recordings, today linguists doing documentary work face a bewildering array of media, tools, and standards when creating materials.

The production of textual materials, for example, is complicated by the plethora of choices in computer software. Typical word processors are designed for the business world where documents do not generally need to be preserved for centuries, as is the case with endangered language documentation. Furthermore, much linguistic documentation does not consist of unstructured prose, which is what word processors are designed for, but, rather, highly structured information, like entries in a lexicon or morphological paradigms. In the ideal case, such data would be entered into a database where its structure could be properly encoded. There are a variety of database programs which can be used for this purpose, but most of them have the same limitations as word processors—they produce proprietary, short-lived formats that are difficult to migrate forward as technologies and methodologies evolve.

Another layer of complexity is that different digital resources need to be properly "linked" in order for them to have maximal value. Documentation may begin with audio or video recordings, but often a transcription and text analysis will also be created to accompany a recording. Ideally, recordings should not be inextricably linked to any particular analysis. However, any researcher making use of a recording would almost certainly want to be aware of an accompanying transcription or analysis. Indicating the relationships among related resources is a problem of metadata creation; metadata here means archival data about resources that facilitates access to them.

A final issue engendered by the rise of new technologies for documentation is how to manage "legacy" data—that is, data produced before the rise of digital tools—in order to make it accessible and to preserve it for future generations. Some important questions in this area include: What is the best way to digitize field notes and analog recordings? And, who has the rights to material produced before the academic community became more sensitive to the relationship between communities of speakers and data from their languages?

At present, it is difficult for linguists to locate specific recommendations for creating archivable resources. The purpose of this tutorial was to create a forum wherein linguists who have created, or are planning to create, documentary linguistic resources could access a range of talks on current accepted standards of best practice for resource production and conservation. The tutorial aimed for breadth, rather than depth, of coverage in order to address the needs of as many individuals as possible. By hearing from a number of experts from different areas, attendees of the tutorial—or readers of these materials—should be able to identify appropriate individuals whom they can contact in order to get answers to their specific questions.

Abstracts of presentations

Peter K. Austin, Endangered Languages Academic Programme, School of Oriental and African Studies (SOAS)
Robert Munro, Endangered Languages Archive, SOAS
David Nathan, Endangered Languages Archive, SOAS
Archives, linguists, and language speakers

This talk begins by addressing the following broad questions:

1. What is language archiving?
2. What kinds of language archives are there?
3. Who uses language archives?
4. Why should linguists work together with a language archive?

It then examines the following two questions in more detail:

5. What can a language archive (e.g., SOAS-ELAR) offer you?

6. What can you archive and what do you have to do to archive your materials?

All types of materials relating to endangered languages can be deposited. The archivist will ask you to provide cataloguing metadata and to keep in contact in order to ensure that rights information is kept up to date. They may also require you to put your data in a particular format. Typically, a catalogue of deposited materials will appear on the world wide web. Items will be accessible to others, subject to the wishes of the information providers.

ELAR (and other archives) also offer other services:

Gary Simons, SIL International
The Open Language Archives Community: Building a worldwide library of digital language resources

New ways of documenting and describing language via digital media coupled with new ways of distributing the results via the World-Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. The Open Language Archives Community (OLAC) is an international partnership of almost 30 projects and institutions who are addressing these issues by (1) developing consensus on best current practice for the digital archiving of language resources, and (2) developing a network of interoperating repositories and services for housing and accessing such resources.

This talk presents the OLAC vision for creating a virtual library of the language resources that are housed all over the world by its member archives. It then describes the infrastructure that has been built in order to achieve this objective. Special attention is given to explaining the various mechanisms that make it possible for a project or institution to become a participating archive and to demonstrating the global search portal that allows any Web user to present a single search query to all participating archives at once.

Helen Aristar-Dry, Eastern Michigan University and LinguistList
The E-MELD School of Best Practices

The Electronic Metastructure for Endangered Languages Data (E-MELD) Project is a five-year collaborative project designed to build digital infrastructure for the long-term preservation of linguistic documentation in "best practice" format. Best practice recommendations are designed to ensure that digital language resources will remain accessible and intelligible to future generations. A goal of the E-MELD project is to create a comprehensive but user-friendly website which offers information about creating such resources; this is the E-MELD School of Best Practices in Digital Language Documentation, which was demonstrated at this symposium. The site includes:

Helen Agüera, National Endowment for the Humanities (NEH), Acting Deputy Director, Division of Preservation and Access
Archival projects and the NEH

Helen Agüera discusses NEH's support for projects related to linguistic archives. She describes the range of preservation and access activities funded by the Endowment through the Division of Preservation and Access. These activities include: the arrangement and description of a collection of linguistic materials that needs to be brought under intellectual control; the digital reformatting of sound and moving image collections for preserving and enhancing access to linguistic materials; and the creation of online archives that integrate multiple collections from widely dispersed sources or repositories to facilitate comparative studies and broad educational use.

Agüera summarizes the characteristics of successful projects, touching on what aspects of a proposal NEH evaluators consider essential for endorsing a project--from information about the language or languages represented in a project to details about the proposed methodology and adherence to (or departure from) established standards and best practices. She gives special attention to questions concerning the long-term preservation of digital objects.

Finally, Agüera also discusses NEH's partnership with the National Science Foundation, "Documenting Endangered Languages," and the role linguistic archives can play in this effort to develop and advance knowledge concerning endangered languages.

Gary Holton, Alaska Native Language Center (ANLC)
Ethical practices in language documentation and archiving

This paper presents some ethical guidelines for language documentation and archiving, drawing on experiences at the Alaska Native Language Center archive and other primary language archives. A clearly defined approach to intellectual property rights is crucial in order for a language archive to meet its dual obligations of preservation of and access to language documentation materials. This point is perhaps most obvious with respect to access: proper access cannot be achieved unless legal intellectual property responsibilities are met. But ethical issues are also crucial to preservation efforts. This is because a lack of clear ethical guidelines may actually impede or inhibit the collection of documentary material, leading to the potential loss of irreplaceable data. Creators/authors of endangered language material are reluctant to deposit materials with a language archive without assurances as to the maintenance of intellectual property rights. On the other hand, archives have traditionally been reluctant to accept materials without full legal rights or ownership. Here we suggest ethical guidelines by which language archives can work in collaboration with creators of documentary materials to ensure preservation of materials while respecting restrictions on access to materials imposed by the creators and by language communities.

Mark Kaiser, Berkeley Language Center (BLC)
Digitizing the Audio Archive of Linguistic Field Work

The Berkeley Language Center manages three main and several minor archives of audio recordings. This presentation focuses on our efforts to digitally preserve and provide access to the Audio Archive of Linguistic Field Work, which consists of nearly 1,400 hours of field recordings of Native American languages. We discuss legal issues regarding copyright and the rights of consultants and Native American communities, as well as ethical issues surrounding the preservation and distribution of materials deposited at the BLC long before use of the Internet. We also address technical issues of archiving and delivery (bit depth and sampling rates, file formats, backup), and finally, our efforts to anticipate and comply with metadata standards.
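
To give a sense of why bit depth and sampling rate matter for backup planning, the following is a rough sketch of the storage arithmetic for uncompressed PCM audio. The two settings compared (44.1 kHz, 16-bit, mono and 96 kHz, 24-bit, stereo) are illustrative assumptions only and are not taken from the BLC's actual specifications; 1,400 hours is used as an upper bound for the "nearly 1,400 hours" mentioned above.

```python
def pcm_bytes(hours: int, sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Size in bytes of uncompressed PCM audio (headers and metadata ignored)."""
    seconds = hours * 3600
    bytes_per_sample = bit_depth // 8
    return seconds * sample_rate_hz * bytes_per_sample * channels


if __name__ == "__main__":
    # Both settings below are assumed for illustration, not the archive's own.
    modest = pcm_bytes(1400, 44_100, 16, 1)
    generous = pcm_bytes(1400, 96_000, 24, 2)
    print(f"44.1 kHz / 16-bit / mono:   {modest / 1e9:.1f} GB")
    print(f"96 kHz / 24-bit / stereo:   {generous / 1e12:.2f} TB")
```

Under these assumptions the same collection needs roughly 0.44 TB at the modest setting and about 2.9 TB at the generous one, before any backup copies are counted.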

Heidi Johnson, Archive of the Indigenous Languages of Latin America (AILLA)
Preparing documentary materials for archiving

The Archive of the Indigenous Languages of Latin America (AILLA) at the University of Texas at Austin is a digital repository of multimedia resources. Our primary mission is the digitization and preservation of "legacy" materials; that is, recordings and texts produced in analog media over the past half-century. We have received a wide variety of materials, with and without metadata (catalog information). This diversity is unavoidable when dealing with materials produced long ago, often by someone other than the person who sends us the package, but it is not particularly desirable.

In the course of her duties as manager of AILLA, and from her own experiences as a field linguist, Johnson has developed a set of guidelines for corpus management that she hopes will be useful for documentary linguists in the creation of an orderly, archive-ready, language documentation corpus. This talk presents those guidelines with examples from AILLA's materials. In the brief time allowed she touches on the essential elements: documenting consent, labelling, digital formats, and metadata. More detailed information on all of these topics and more is available on the web.

Jeff Good, the Max Planck Institute for Evolutionary Anthropology
Databases and archiving

Many types of linguistic data are highly amenable to being stored in databases, including, for example, lexical and typological data. Unfortunately, many commonly used database programs produce, by default, resources in proprietary formats that are not suitable for archiving.

However, if appropriate measures are taken, it is possible to use almost any kind of database software and still create resources which can be archived over the long term. To do this, it is important to maintain a distinction between at least two different formats which the data in a database can take on: archival and working.

An archival format for a database is one which is expected to be readable in the long term. Typically, it will take on the form of a text file, perhaps one which uses XML to annotate the data. A working format is typically optimized for data entry and searching. There is nothing intrinsically wrong with making use of a working format as long as it is regularly exported to an archival format.
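
As a minimal sketch of the working/archival distinction, the following exports a small lexicon held in a working database to plain XML text. It assumes SQLite as the working format, and the table name `entries`, its columns, the XML element names, and the two sample entries are all hypothetical placeholders (the sample words are invented, not drawn from any language or from the talk).

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_lexicon_to_xml(conn: sqlite3.Connection) -> str:
    """Serialize every row of the hypothetical `entries` table as XML text."""
    root = ET.Element("lexicon")
    rows = conn.execute(
        "SELECT headword, pos, gloss FROM entries ORDER BY headword"
    )
    for headword, pos, gloss in rows:
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "headword").text = headword
        ET.SubElement(entry, "pos").text = pos
        ET.SubElement(entry, "gloss").text = gloss
    # A Unicode string that can be written out as a UTF-8 text file.
    return ET.tostring(root, encoding="unicode")


def build_demo_database() -> sqlite3.Connection:
    """Create an in-memory working database with invented sample entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entries (headword TEXT, pos TEXT, gloss TEXT)")
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?)",
        [("kata", "n", "house"), ("amu", "v", "to eat")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(export_lexicon_to_xml(build_demo_database()))
```

Running an export of this kind on a regular schedule keeps a plain-text copy available even if the working database software later becomes unreadable.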

In addition to covering basic distinctions in database formats, this talk discusses the advantages and disadvantages of using common database software with respect to creating archivable resources.

Nick Thieberger, the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC)
Archiving and the work flow of field work

Archiving is not something we do at the end of our fieldwork; it is part of everyday work. Recent technological advances have pointed to the importance of planning data management and workflow for ethnographic recording. Recordings should always be of high quality, but it is in the context of small and endangered cultures and languages that the quality of recording takes on new significance (quality here refers both to the content and the form of the recording). If we are the only recorders of the last remaining speakers or performers then we are providing historical documents that will be of use not only to other researchers, but primarily to those recorded and their descendants. So, right from the moment of recording, we must be concerned with making good documents which will be placed into a suitable repository for storage and discovery.

In this session we discuss a workflow that builds in development of archival data. We show that making the initial recordings and their digital representation citable by means of a persistent identifier allows further work to be located with reference to that primary data. Further description of the data with standard metadata terms allows its discovery in the long term.
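
One way to picture this is a small catalogue in which every later product points back to the primary recording through its identifier. The sketch below is only illustrative: the record fields loosely echo common metadata terms (identifier, title, type, relation), and the identifier scheme `example-archive:...`, the class, and the function names are hypothetical rather than any archive's actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResourceRecord:
    identifier: str            # persistent identifier (hypothetical scheme)
    title: str
    resource_type: str         # e.g. "Sound" or "Text"
    relation: List[str] = field(default_factory=list)  # identifiers this item is based on


def find_derived(records: List[ResourceRecord], primary_id: str) -> List[ResourceRecord]:
    """Return every record that cites the given primary identifier."""
    return [r for r in records if primary_id in r.relation]


# Invented example catalogue: one recording and two works that reference it.
recording = ResourceRecord("example-archive:XY1-001-A", "Narrative recording", "Sound")
transcript = ResourceRecord(
    "example-archive:XY1-001-A-TR", "Transcription of narrative", "Text",
    relation=[recording.identifier],
)
analysis = ResourceRecord(
    "example-archive:XY1-001-A-GL", "Interlinear analysis", "Text",
    relation=[recording.identifier],
)
catalogue = [recording, transcript, analysis]

if __name__ == "__main__":
    for item in find_derived(catalogue, recording.identifier):
        print(item.identifier, "-", item.title)
```

Because the recording itself carries no reference to any analysis, new transcriptions or analyses can be added later without altering the primary record.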

Jeff Good and Heidi Johnson