A query facility for selective harvesting of OLAC metadata

Date issued: 2003-07-29
Status of document: Draft Implementation Note. This is only a preliminary draft that is still under development; it has not yet been presented to the whole community for review.
This version: http://www.language-archives.org/NOTE/query-20030729.html
Latest version: http://www.language-archives.org/NOTE/query.html
Previous version: http://www.language-archives.org/NOTE/query-20021102.html
Abstract:

Documents a verb, Query, supported by the CGI interface to the OLAC Aggregator. The purpose of the verb is to support selective harvesting of OLAC metadata, such as would be needed in order to offer a specialized service based on OLAC metadata. The request returns a ListRecords response; its parameters support the construction of an SQL query to specify the subset of records to harvest.

Editors: Gary Simons, SIL International (mailto:gary_simons@sil.org)
Changes since previous version:

Updated to reflect changes from version 0.4 of the OLAC metadata standard to version 1.0.

Copyright © 2003 Gary Simons (SIL International). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. The query interface
  3. Expressing the selection criterion
  4. Implementation
References

1. Introduction

A key feature of the openness of the Open Archives Initiative Protocol for Metadata Harvesting [OAI-PMH], on which OLAC is based, is that any site on the web is free to become a service provider. That is, it may harvest metadata from the participating data providers and offer a service based on the harvested metadata. In practice, however, a complete harvester is complicated to implement and operate, with the result that few sites rise to the challenge of becoming a service provider.

The Open Language Archives Community is seeking to change this. It has taken the following steps to make it easy for the members of its community to offer services based on OLAC metadata:

  1. The OLAC Aggregator [OLACA] is a service that harvests metadata from all OLAC data providers and in turn serves as a single data provider for all OLAC metadata.

  2. It is planned that the OLAC Aggregator will support a special OLAC Display format [OLAC-Display] that resolves coded attribute values to display labels and presents a reader-friendly view of OLAC metadata.

  3. The OLAC Aggregator supports a query interface (described in this document) that makes it possible for a would-be service provider to harvest only the metadata records of interest.

  4. The OLAC Aggregator cooperates with the virtual service provider [Viser] so that the results of a selective harvesting query to OLACA can be rendered as an HTML page that presents a service to an end user.

This document describes the query interface and illustrates how it can be used.

2. The query interface

In addition to the six verbs of the OAI harvesting protocol [OAI-PMH], the OLAC Aggregator supports a seventh, Query. The Query verb takes the following arguments:

elements

A required argument that specifies the number of metadata elements that are referred to in the selection criterion.

sql

A required argument that specifies the selection criterion expressed as the content of a WHERE clause in MySQL syntax.

count

An optional argument that specifies the number of metadata records to return in a single response. If this argument is not specified, a default value of 20 is assumed.

resumptionToken

An exclusive argument with a value that is the flow control token [OAI-FC] returned by a previous Query request that issued a partial response. It is exclusive in that when it is used, it is the only argument in addition to verb.

The result of a Query request is a ListRecords response [OAI-LR]. The metadata records are returned in order of their OAI identifiers. In the current implementation, the records are returned in OLAC format; it is intended that they will be returned in OLAC Display format [OLAC-Display] when it is implemented. If more records match the selection criterion than the number indicated by the count parameter, a resumption token is returned at the end of the response as described in [OAI-FC].
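The argument rules and the flow-control loop described above can be sketched in Python. This is a minimal illustration, not part of the Aggregator: the function names `query_args` and `next_token` are hypothetical, and the sketch assumes that the Query response uses the standard OAI-PMH 2.0 namespace, as a ListRecords response does.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses (assumed to apply to Query responses too).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def query_args(elements=None, sql=None, count=None, resumption_token=None):
    """Return the argument dict for one Query request (hypothetical helper)."""
    if resumption_token is not None:
        # resumptionToken is exclusive: only verb may accompany it.
        if elements is not None or sql is not None or count is not None:
            raise ValueError("resumptionToken must be the only argument besides verb")
        return {"verb": "Query", "resumptionToken": resumption_token}
    if elements is None or sql is None:
        raise ValueError("elements and sql are both required")
    args = {"verb": "Query", "elements": str(elements), "sql": sql}
    if count is not None:  # when omitted, the Aggregator assumes 20
        args["count"] = str(count)
    return args


def next_token(response_xml):
    """Return the resumption token found in a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{OAI_NS}resumptionToken")
    if node is None or not (node.text or "").strip():
        return None  # no token, or an empty one, marks the final response
    return node.text.strip()
```

A harvester would call `query_args` with `elements` and `sql` for the first request, then keep issuing requests built from `next_token` until it returns `None`.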

3. Expressing the selection criterion

The selection criterion is expressed as a where_definition in MySQL syntax [MySQL]. The query has access to each element in a metadata record and to all the parts of an element, which are named as follows:

TagName  

The generic identifier for the element's XML tag.

Content  

The value of the element's content.

Code  

The value of the element's olac:code attribute.

Lang  

The value of the element's xml:lang attribute.

Type  

The value of the element's xsi:type attribute.

The first step in designing a query is to identify how many elements in a metadata record must be consulted in order to test the criterion. This number is the value of the elements argument. In the query, the elements are referred to as e1 through en, where n is the value of elements. Thus, the content of the first element is referred to as e1.Content, while the generic identifier of the second element is referred to as e2.TagName. A selection criterion may also make use of the OaiIdentifier column, for example to limit a search to the holdings of a particular archive.
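To make the naming convention concrete, the following Python sketch derives the value of the elements argument from a criterion by finding the highest-numbered alias it mentions. The helper name `elements_needed` is hypothetical, and the criteria in the usage lines are made-up illustrations (the tag names and the `oai:example.org` identifier prefix are assumptions, not values taken from the OLAC database).

```python
import re

# The five parts of an element that a criterion may refer to.
PARTS = {"TagName", "Content", "Code", "Lang", "Type"}

# Matches references such as e1.Content or e12.TagName.
ALIAS_REF = re.compile(r"\be([1-9]\d*)\.(\w+)")


def elements_needed(criterion):
    """Return the value of `elements` implied by a selection criterion.

    Naive scan: it does not skip quoted string literals, so a literal that
    happens to contain text like "e3.Code" would be counted as well.
    """
    highest = 0
    for number, part in ALIAS_REF.findall(criterion):
        if part not in PARTS:
            raise ValueError(f"e{number}.{part} is not a known element part")
        highest = max(highest, int(number))
    return highest


# Illustrative criteria (hypothetical values):
one = "e1.Content='text/xml'"
two = ("e1.TagName='subject' and e2.TagName='language' "
       "and OaiIdentifier like 'oai:example.org:%'")
print(elements_needed(one), elements_needed(two))
```

The second criterion consults two elements of the same record, so it needs elements=2 even though it also tests the OaiIdentifier column.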

The following are some sample queries.

The above examples illustrate that testing for the element tag itself is often redundant when a code or content value is only associated with one metadata element. For instance, in the last example, 'text/xml' is the Content value for the <format> element.

In order to pass the criterion expression as an argument in a URL, it must be URL encoded. The key changes to make are:

Thus, for instance, the third-to-last and last sample queries listed above translate into the following requests to the OLAC Aggregator (which you may test by clicking on the links):
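As an illustration of the encoding step (and not a reproduction of the sample requests referred to above), the following Python sketch builds a complete request URL with the standard library. The base URL is the one listed under [OLACA]; the helper name `query_url` is hypothetical, and the criterion is the 'text/xml' test on Content mentioned earlier in this section.

```python
from urllib.parse import urlencode

# Address of the OLAC Aggregator as given in the [OLACA] reference.
BASE_URL = "http://www.language-archives.org/cgi-bin/olaca3.pl"


def query_url(elements, sql, count=None):
    """Build a Query request URL with every argument URL encoded."""
    params = {"verb": "Query", "elements": elements, "sql": sql}
    if count is not None:
        params["count"] = count
    # urlencode percent-encodes reserved characters such as = ' and /,
    # and writes spaces as '+', which is equivalent to %20 in a query string.
    return BASE_URL + "?" + urlencode(params)


url = query_url(1, "e1.Content='text/xml'")
print(url)
# http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Query&elements=1&sql=e1.Content%3D%27text%2Fxml%27
```

Letting a library function do the encoding avoids having to remember which characters need escaping when a criterion grows more complex.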

4. Implementation

The Query request is implemented by the serve_Query subroutine in the Aggregator.pm module and the getTable_Query subroutine in the DB.pm module. These are part of the [OLAC-Suite] release. More insight into how the query expression functions can be gained by consulting the schema of the MySQL harvesting database [OLAC-Schema].

The Query request builds an SQL query like the following:

select OaiIdentifier, DateStamp, a.Item_ID {, e1.*}
from ARCHIVED_ITEM as a {, METADATA_ELEM as e1}
where {a.Item_ID=e1.Item_ID and} ( URL-unencoded-sql-argument )
order by OaiIdentifier

The three code fragments in curly braces are repeated the same number of times as the value of the elements argument. The table alias is incremented with each repetition (e.g., e1, e2, and so on). The value of the sql argument is URL-decoded and then placed within parentheses in order to ensure the correct precedence of operators with respect to the rest of the WHERE clause.
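The expansion of the template can be sketched in Python as follows. This is only an illustration of the repetition rule described above, not the Perl code in DB.pm; the function name `build_sql` and the sample criterion are hypothetical.

```python
def build_sql(elements, criterion):
    """Expand the query template for the given number of elements.

    `criterion` is the already URL-decoded value of the sql argument.
    """
    aliases = [f"e{i}" for i in range(1, elements + 1)]
    # Each braced fragment of the template is repeated once per alias.
    select_cols = "".join(f", {e}.*" for e in aliases)
    from_tables = "".join(f", METADATA_ELEM as {e}" for e in aliases)
    joins = "".join(f"a.Item_ID={e}.Item_ID and " for e in aliases)
    return (
        f"select OaiIdentifier, DateStamp, a.Item_ID{select_cols}\n"
        f"from ARCHIVED_ITEM as a{from_tables}\n"
        f"where {joins}( {criterion} )\n"
        "order by OaiIdentifier"
    )


print(build_sql(2, "e1.Code='x' or e2.Code='y'"))
```

With a criterion that contains `or`, the parentheses matter: without them, `and` would bind more tightly than `or`, and the join conditions would constrain only the first alternative.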


References

[MySQL] MySQL Language Reference (especially section 6.3).
<http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html>
[OAI-FC] "Flow Control," section 3.5 of The Open Archives Initiative Protocol for Metadata Harvesting, Version 2.0 (2002-06-14).
<http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#FlowControl>
[OAI-LR] "ListRecords," section 4.5 of The Open Archives Initiative Protocol for Metadata Harvesting, Version 2.0 (2002-06-14).
<http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#ListRecords>
[OAI-PMH] The Open Archives Initiative Protocol for Metadata Harvesting, Version 2.0 (2002-06-14).
<http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm>
[OLAC-Display] Specifications for an OLAC metadata display format and an OLAC-to-OAI_DC crosswalk.
<http://www.language-archives.org/NOTE/olac_display.html>
[OLAC-Schema] Relational database schema for OLAC metadata harvester.
<http://www.language-archives.org/tools/olac_schema.sql>
[OLAC-Suite] OLAC Suite: A suite of OLAC harvesting tools implemented in MySQL + Perl.
<http://sourceforge.net/project/showfiles.php?group_id=6577>
[OLACA] OLAC Aggregator Service.
<http://www.language-archives.org/cgi-bin/olaca3.pl>
[Viser] Viser: A virtual service provider for displaying selected OLAC metadata.
<http://www.language-archives.org/NOTE/viser.html>