Requirements on the Infrastructure for Open Language Archiving

Gary Simons and Steven Bird
Draft: 7 December 2000


About this document. This document has been prepared in conjunction with the workshop on Web-Based Language Documentation and Description, held in Philadelphia on 12-15 December 2000. It lists the requirements on language archiving infrastructure that informed the design of the proposed Open Language Archives Community. It also explains how each of the requirements is addressed by the resulting OLAC design.


Table of Contents

1. INTRODUCTION
2. REQUIREMENTS
  2.1. User Requirements
  2.2. Creator Requirements
  2.3. Archivist Requirements
  2.4. Developer Requirements
  2.5. Sponsor Requirements
 


1. INTRODUCTION

Recent years have witnessed dramatic advances in digital storage and digital publication technologies, making it possible to house virtually unlimited quantities of linguistic data online, and to disseminate this data in digital form on CD-ROM/DVD and over the web for negligible cost. The development of XML and Unicode greatly facilitates the interchange and reuse of structured multimodal and multilingual data and the development of interoperating software tools. These developments are having a pervasive influence on the way primary linguistic data are gathered, stored, analyzed and disseminated, as part of projects to document and describe languages, and they present major new challenges for modeling, creating, archiving and accessing this data.

So far, these challenges are being addressed by the "language documentation community" in a fragmentary manner. Given the scarcity of resources and the scale of the challenges, the best approach would seem to be one in which the whole community collaborated on designing and constructing a shared infrastructure.

In moving towards this shared infrastructure, we envision three stages of increasingly substantive agreement on the nature of the infrastructure:

  1. Requirements. What properties would the ideal digital infrastructure possess?
  2. State of the Art. What is the current state of the art, and how well does it meet these requirements?
  3. Best Practices. In light of the above, what are the recommended best practices for modeling, creating, archiving and accessing language documentation?

The present document focuses on the first of these, the requirements.

2. REQUIREMENTS

We can identify at least five special interest groups that would want to levy requirements on the enterprise:

users: The people who want to access language materials which have been stored away in archives.
creators: The people who create the language materials that get archived.
archivists: The people who manage the process of acquiring, maintaining, and accessing the information resources stored in archives.
developers: The people who create data models, tools and formats for storing and manipulating digital language documentation.
sponsors: The organizations that fund the creation of information resources and their maintenance in archives.

In this document we attempt to enumerate the requirements for each of these groups with respect to the total infrastructure required to support digital language documentation and description.

We have stated these requirements at a high level. Each requirement could itself be expanded into a set of more detailed requirements; however, this can be left to a later stage.

We have stated each requirement both positively and negatively. That is, the first column describes the desired state of things, while the second column describes the situation we want to avoid. In the third column of each table we make note of how the proposed Open Language Archives Community (see the white paper) would meet that particular requirement.

2.1. User Requirements

While online language archives hold the promise of unparalleled access to information, they also present the specter of unparalleled chaos as information resources pop up in every corner of the world-wide web. The following statements describe the state of the world that users of online language archives would like to see, as opposed to the contrasting chaotic state that might be more likely to be realized in the absence of deliberate efforts to prevent it.

No. | What users want | What users don't want | How OLAC meets the requirement
1. | There is a single site on the Web where any user can go to discover what language information resources are available, regardless of where they may be archived. | The only way to discover language resources on the Web is to visit all the individual archives or to hope that the resources one is interested in have been indexed in an intuitive way by one's favorite general-purpose search engine. | Linguist List (www.linguistlist.org) will host a combined catalog of all participating archives.
2. | All language resources (regardless of where they may be archived) are catalogued with a consistent set of metadata descriptions, so that the user can ascertain all the basic facts about a resource without having to download it. | The only way to get a good idea about what a resource contains, who is responsible for it, or what its terms of availability are is to retrieve it. | Every holding in the combined catalog is described using the OLAC metadata set. Since that metadata set includes all the elements of the Dublin Core, it offers enough breadth to handle all the basic facts about a resource.
3. | Uniform metadata descriptions can be used to perform focussed searching of language resources by metadata categories regardless of where on the Web they may actually be archived. | There is no way to reliably search on metadata categories for language resources since there is no standardized framework for describing resources. | The search interface to the combined catalog at Linguist List will allow searching by particular metadata elements, either singly or in combination.
4. | All language resources (regardless of where they may be archived) are tagged in a consistent way to identify the languages they relate to, so that a single search for a particular language will retrieve all relevant resources on the Web. | The only way to find resources in or about a particular language is to depend on keyword searching. This will fail when the language a resource is in is not identified by a keyword, or when different submitters supply different names for the same language. | The OLAC metadata set uses the SIL language codes to give a unique identifier to each living (and recently extinct) language of the world.
5. | When a user discovers the existence of a resource, full information is available on how to obtain the resource, and on any restrictions concerning the format of the resource and rights to its use. | The user requests the resource, and after taking shipment of it discovers that it is in a proprietary format and is thus not really useful, or after some delay discovers that the owner of the resource is not prepared for this particular user to have the resource, or that the owner places previously undisclosed restrictions on the use of the resource. | The OLAC metadata set has special elements to document the openness of the resource (both in terms of format and of rights) in a consistent way for all holdings. The generic rights element can be used to supply specific details.
6. | For a resource in a digital format, the archived information documents how it is electronically encoded. | There is no obvious way to find out what format a binary file is in, or in the case of a text file, what the encoded characters or the markup tags represent. | For digital resources, the OLAC metadata set has special elements for documenting the file format and the character encoding.
7. | When users obtain a digital resource, they see the same thing that the submitter originally saw. | The user cannot properly view the resource for lack of fonts, stylesheets, or the right rendering software. | When a digital resource requires the use of other resources that are not packaged within it, the relation.required metadata element must be used to document this fact; its value is the URI of a resource that has also been archived.
8. | For any given resource, it is possible to find the software tools that are appropriate for querying it or for converting it to another format. | Users cannot do anything with resources they download since they are not in a format they can use. | Linguist List will also host a catalog of software tools that will use the same metadata element for format that the language archive holdings use. The search interface for language archives will offer a link to look up available tools for the format of the language resource.
9. | For any given resource, there is a unique and durable method for citing it. | Users may refer to a resource in a variety of ways, and none of these references is guaranteed to work indefinitely. | Following the standards of the Open Archives Initiative, every archived resource has a persistent and globally unique identifier which is a URI beginning with "oai:". The gateway for the OLAC community will be able to resolve a unique identifier to a metadata record for any holding of a participating archive.
10. | When a user seeks to obtain a resource discovered via its metadata, the resource can actually be obtained in a timely fashion, and when it arrives it is found to meet expectations of content as indicated by the metadata. | The user finds the metadata for resources that look promising, but then finds the archive to be uncooperative in terms of actually providing the resource, or finds that the actual resource is of inferior quality or does not live up to its metadata description. | The gateway for the OLAC community will host a "peer review" service which will allow users to post their evaluations, both good and bad, concerning the content and service of participating archives. The archives and other users will be able to post their responses.
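
The kind of catalog-wide searching described in requirements 2 to 4 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record fields are not the OLAC element names, the archives and holdings are invented, the three-part identifier shape after "oai:" is assumed, and the codes "AAA" and "BBB" are placeholders standing in for SIL language codes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One catalog entry; field names are illustrative, not OLAC element names."""
    identifier: str   # assumed shape: "oai:<archive>:<local id>"
    title: str
    language: str     # placeholder standing in for an SIL language code
    file_format: str
    archive: str


# Invented holdings from two hypothetical participating archives.
CATALOG = [
    Record("oai:archive-a:0001", "Word list", "AAA", "text/xml", "archive-a"),
    Record("oai:archive-b:0042", "Field recordings", "AAA", "audio/wav", "archive-b"),
    Record("oai:archive-b:0043", "Grammar sketch", "BBB", "text/xml", "archive-b"),
]


def search(catalog, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in catalog
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# A single language code retrieves holdings from both archives.
by_language = search(CATALOG, language="AAA")
# Metadata elements can also be combined in one query.
xml_only = search(CATALOG, language="AAA", file_format="text/xml")

print([r.identifier for r in by_language])
print([r.identifier for r in xml_only])
```

Because every record carries the same fields and the same language codes, the search does not need to know which archive a holding came from, which is the point of requirements 2 to 4.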

2.2. Creator Requirements

While inexpensive and powerful computers have dramatically lowered the bar for would-be creators of digital language documentation, only limited software support is presently available for the special needs of this user community. The following statements describe the state of the world that the creators of digital language documentation would ideally like to see, in contrast to the frustrating state which is all too familiar to this community.

No. | What creators of language documentation want | What creators of language documentation don't want | How OLAC meets the requirement
1. | For each of the descriptive and analytical practices widely used in language documentation, inexpensive software is available which provides a suitable user interface and which ensures data integrity. | Unsuitable general-purpose software tools, such as word processors and spreadsheets, must be purchased at a significant cost. There is no built-in support for the particular task and data consistency must be checked manually. | When such software tools exist, the OLAC gateway will make it possible for the whole language documentation community to find them. However, developing such tools falls outside the scope of OLAC.
2. | The language documentation created with the software is in a form suitable for immediate archiving, so that archiving digital data is essentially no more difficult than making a particular kind of backup. | Non-trivial reformatting is required before the language documentation can be archived, with the result that users often don't get around to archiving their work. | Software developers will build data export facilities that follow the best-practice recommendations developed by OLAC.
3. | Incomplete or uncertain information representing the state of the investigator's current knowledge about the language under study can be archived. | Language documentation can only be archived when it meets certain quality requirements. | The OLAC standards specify requirements to which the metadata for a holding must conform, but do not put constraints on the content itself.
4. | It is straightforward to convert existing language documentation that is in one format into a different format so that it can be reused when generating new documentation or new descriptive works. | Creators of language data must manually reformat and restructure the existing archived language documentation before they can use it in creating new documentation. | The OLAC metadata set has a special category for data conversion tools and records the formats each tool can handle. The query interface for the combined catalog can use this information to link a data resource with appropriate conversion tools.
5. | There are easy-to-use and up-to-date guidelines on what hardware, software, and formats to use for a given data gathering situation and budget. | There is a bewildering array of options available to would-be creators of digital language documentation, and there is no way to get good advice on what choices to make. | Would-be creators of digital language documentation can come to the OLAC gateway to search for all such advice, especially to find what is recommended as best practice.
6. | The creator of the language documentation has moral obligations to the speakers and the language community. When archiving the data, it is possible to select from a range of contracts with the digital archive which establish appropriate access and usage rights in perpetuity. | The descriptivist is unable to constrain access or usage once the material is archived, and a subsequent undesired use results in the descriptivist being denied access to the language community. | OLAC has best-practice recommendations concerning these issues (see 2.3.5) and participating archives follow them. The OLAC peer review mechanism gives added incentive for participating archives to follow the guidelines.
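
Requirement 4 relies on tool metadata that records which formats a conversion tool can handle, so that a query interface can link a resource to suitable tools. A minimal sketch of that lookup follows. The tool names, the format labels, and the `tools_for` function are all hypothetical; the OLAC metadata set is not assumed to use these field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionTool:
    """Hypothetical tool description; not the OLAC tool metadata itself."""
    name: str
    reads: frozenset    # formats the tool accepts as input
    writes: frozenset   # formats the tool can produce


# Invented tools and format labels, for illustration only.
TOOLS = [
    ConversionTool("tool-one", frozenset({"format-x"}), frozenset({"format-y"})),
    ConversionTool("tool-two", frozenset({"format-x", "format-y"}), frozenset({"format-z"})),
    ConversionTool("tool-three", frozenset({"format-z"}), frozenset({"format-x"})),
]


def tools_for(resource_format, target_format=None, tools=TOOLS):
    """List tools that read resource_format and, if requested, write target_format."""
    return [
        tool.name for tool in tools
        if resource_format in tool.reads
        and (target_format is None or target_format in tool.writes)
    ]


print(tools_for("format-x"))              # every tool that can read the resource
print(tools_for("format-x", "format-z"))  # only tools that also produce the target
```

The lookup works only because resources and tools describe formats with the same vocabulary, which is why the tool catalog is meant to reuse the format element used for archive holdings.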

2.3. Archivist Requirements

While digital archiving holds the promise to archivists of offering unparalleled access to the information they curate, the range of issues involved in doing the job well is so wide, and for many archivists so technical, that it is virtually impossible for any one archive to master them all. The following statements describe the state of the world that archivists would like to see, as opposed to the contrasting state that might be more likely to be realized in the absence of deliberate efforts to prevent it.

No. | What digital language archivists want | What digital language archivists don't want | How OLAC meets the requirement
1. | There is a set of best-practice guidelines in use by the language archiving community that can be followed to ensure that the digital data stored in the archive will be encoded in such a way as to be maximally useful. | Each archive must study the issues of character encoding, data markup, and file formats and develop its own standards. | OLAC provides a process for developing best-practice guidelines and a single gateway through which to find them.
2. | There is a repository where the archivist can find an off-the-shelf software system for implementing the archive catalog. Such a system would allow for entering and maintaining catalog records, supporting a public-access catalog viewer on the Internet, and sharing of metadata records with services that provide union catalogs of multiple archives. | Every archive is on its own to find or develop the software needed to build and disseminate its catalog. | The OLAC metadata set for tools will include a category for archive infrastructure software so that archivists can use the OLAC gateway to find software that will help them implement a participating archive.
3. | There is a repository of off-the-shelf software tools that archivists can use to test items they accession for conformance to the best-practice encoding guidelines. | Each archive must hunt for tools that will help it maintain encoding standards in its collection. Failing that, it must either develop its own tools or do without. | A best-practice recommendation would instruct archives in how to test submissions for conformance to best-practice encoding, and software tools for doing this are accessible through the gateway.
4. | There is a standard that all archives can follow to devise and maintain a public identifier for each archived item that is guaranteed to be globally unique and globally persistent (even when the URL for where an item is stored may change). | Every archive must work out its own system for building unique identifiers and for guaranteeing persistence. The resulting identifier has meaning only within the context of the particular archive, but not on a global scale. | The Open Archives Initiative has already developed a standard for building such public identifiers and OLAC will follow it.
5. | There are community-wide best-practice guidelines all archives can follow in matters of ensuring informed consent, upholding intellectual property rights, and addressing other ethical issues, as these relate to contributing researchers, language communities, funding agencies and institutional review boards. | Each archive must study these issues on its own and develop its own guidelines. In some cases, institutional paranoia prevents any dissemination of digital holdings. | The archives participating in OLAC develop best-practice recommendations concerning these legal and ethical issues.
6. | There is a straightforward way to provide finding aids to local and remote users, and to other archives, by exporting metadata records from an in-house database to an external, widely used format. | Maintaining finding aids in a plethora of formats overburdens the resources of the institution. | The OLAC (and Dublin Core) metadata sets coupled with the OAI metadata harvesting protocol address this.
7. | Linguists can deliver digital data and documents in a form that is immediately archivable and disseminable. | New materials require considerable manual processing before they are in a format that will be accessible in future, once current versions of the software used to create the data no longer exist. | The OLAC best-practice guidelines will tell linguists what form they should deliver their data in. Software developers will follow these guidelines to make software that exports these forms.
8. | An archive can make its whole collection known to the user community. | Only holdings in digital format can be shared via the Internet. | The OAI framework (on which OLAC is built) allows a metadata record to describe a holding that is not digital.
9. | Technical support (both software and documentation) is available for converting old archive holdings into digital form. The solution is economical, making it easier for archives to bid for funds to digitize their holdings. | Each archive must study these issues on its own and develop its own solutions. Pitfalls with newly acquired technology lead to costly delays and possible abandonment of the process. | The archives participating in OLAC develop best-practice recommendations concerning the digitization of legacy holdings.
10. | Archives can get useful and timely feedback and evaluation from remote users concerning archive services. | Archives learn indirectly that remote users are dissatisfied with the archive services, and are unable to respond effectively. | The peer review mechanism provided by the OLAC gateway addresses this.
11. | Tools exist to convert parochial 8-bit character codings to Unicode, and to convert markup into the best-practice formats. | Legacy data requires expensive and time-consuming manual processing. | Archivists can use the OLAC gateway to find the appropriate software tools.

2.4. Developer Requirements

Developers often work in relative isolation from the wider community, serving the needs of a particular descriptivist. While the descriptivist understands the linguistic domain, he or she typically has only limited understanding of data modeling and software development. The specification for the software tool may be too vague, or else too specific, thereby limiting the software in unnecessary ways. The following statements describe the state of the world that the developers of software for language documentation would ideally like to see, in contrast to the unproductive state they often find themselves in.

No. | What developers want | What developers don't want | How OLAC meets the requirement
1. | Explicit, widely-accepted data models exist for the different types of documentation. | Each programming task requires a developer to investigate the full range of cases likely to be encountered, depending on a descriptivist who does not know which aspects of representation are likely to cause problems for a computational model. | Developers can use the OLAC gateway to find the best-practice recommendations of the language archiving community.
2. | Re-usable low-level components exist for standard kinds of media display and data creation. Public application programming interfaces (APIs) make it easy to develop tools on top of these components. | Every component of the system must be assembled from scratch. | When such software components exist, the OLAC gateway will make it possible for developers to find them. However, developing such software falls outside the scope of OLAC.
3. | Standard formats exist for data storage and interchange, and come with standard APIs. | For each data type it is necessary to craft a new format. For each pair of programs needing to exchange data it is necessary to create a format conversion tool. | Such standard formats are specified in the OLAC best-practice recommendations. An API that supports such a standard would be a tool whose metadata specified that format.
4. | Suitable web delivery mechanisms exist, including stylesheets, conversion tools, indexing methods, query languages, streaming media technologies, and so on, and these encompass the needs for security and privacy. | Language materials cannot be disseminated on the web, since the delivery and rendering methods are non-existent or ineffective. | To the extent that such software tools exist, the OLAC gateway will make it possible for developers to find them. However, developing such software falls outside the scope of OLAC.
5. | Developers can easily discover and obtain any existing data models, formats and tools that support a particular kind of language documentation. | The state of the art is undocumented, and developers waste resources reinventing pieces of the common infrastructure. | The OLAC gateway will offer a search facility based on these kinds of metadata elements.

2.5. Sponsor Requirements

While government funding agencies and non-governmental organizations have long sponsored language documentation projects, they face new challenges as they develop priorities and programs concerning digital language documentation. Funded initiatives need to be cost effective, and the resources they create need to be disseminated and reused by the community. The following statements describe the state of the world that the sponsors of language documentation work would ideally like to see, in contrast to an opposite state in which they are not able to meet their goals.

No. | What sponsors want | What sponsors don't want | How OLAC meets the requirement
1. | An institution that sponsors the creation of language resources may want to host those resources on its own Internet site. | Any resource that is to be made available to the public in a uniform way through a common catalog must be hosted at a single site. | The OAI framework (on which OLAC is built) permits each sponsoring institution to serve as its own data provider.
2. | The resources a sponsor has helped to develop are being widely used by the community. | The funding was in vain, since the community cannot discover that the resources exist, or if they happen to find them, they cannot use them for lack of documentation, proper encoding practice, fonts, software, and the like. | Once cataloged in a participating archive, a resource is available to the entire language documentation community through the single OLAC gateway.
3. | Software tools which were developed under a funded project have been thoroughly documented and distributed with an open source license, and have been successfully adapted for new projects. | Expensive programmer time was partially wasted, since the software generated by the project was never made available, and new funded projects had to repeat the same work. | Software resources developed by projects can similarly be made available to the entire community by depositing them in a participating archive.
4. | There is an online peer-review process for language archives and documentation projects, concerning the quality and availability of the materials. | It is difficult to determine the extent to which the language documentation community values and uses the materials provided by language archives and documentation projects. | The peer review mechanism provided by the OLAC gateway addresses this.
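
Requirement 1 depends on the division of labor in the OAI framework: each institution acts as a data provider that exposes its own metadata, and a separate service gathers those records into one catalog. The sketch below imitates that arrangement with plain functions in place of network requests. The provider names, records, and the `harvest` function are invented for illustration; a real harvester would instead issue requests under the OAI metadata harvesting protocol.

```python
def harvest(providers):
    """Merge records from every data provider into one catalog keyed by identifier."""
    catalog = {}
    for provider_name, list_records in providers.items():
        for record in list_records():
            # Remember where each record came from; the resource itself stays
            # on the provider's own site.
            catalog[record["identifier"]] = {**record, "provider": provider_name}
    return catalog


# Stand-ins for two independently hosted data providers (hypothetical).
def sponsor_site_records():
    return [{"identifier": "oai:sponsor-site:1", "title": "Text collection"}]


def university_archive_records():
    return [
        {"identifier": "oai:university-archive:7", "title": "Lexicon"},
        {"identifier": "oai:university-archive:8", "title": "Audio corpus"},
    ]


combined = harvest({
    "sponsor-site": sponsor_site_records,
    "university-archive": university_archive_records,
})
print(sorted(combined))
```

Only metadata moves to the combined catalog in this arrangement, so a sponsor can keep hosting its resources on its own site and still have them discovered through a single gateway.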