LINGUIST Codes for Ancient and Constructed Languages

Anthony Aristar, Wayne State University and LINGUIST List

DRAFT 2002-02-19

Statement of Purpose

The workgroup's aim is to produce a supplementary set of language codes that will, in conjunction with the Ethnologue's set, constitute a complete set of codes for all languages of which there is any historical or current record. To this end, this document describes a proposal for supplementing Ethnologue's set with codes for ancient and constructed languages. It discusses the criteria for the Ethnologue codes, and how these might, with modifications, be extended to ancient and constructed languages. It then lists a proposed set of criteria by which codes should be assigned to ancient and constructed languages.

The Ethnologue Codes

The most complete set of language codes in existence is the Ethnologue system, produced and maintained by the Summer Institute of Linguistics. This system assigns a 3-letter code to every distinct natural language in existence. As noted by Constable and Simons 2000[1], the Ethnologue attempts to:

  1. Consistently apply an operational definition of language so that all entities for which an identifier is assigned are of a comparable nature,
  2. Encompass all of the languages of the world,
  3. Clearly document the speech variety that each identifier denotes,
  4. Maintain and update the system on an on-going basis,
  5. Make the system freely and readily accessible to the public over the Internet.

Every language description, what is more, always includes information on:

  1. The countries the language is spoken in
  2. The alternate names that refer to the language
  3. The number of speakers of the language
  4. The classification of the language

It is essential to note that the notion of mutual unintelligibility, which underlies the operational definition in criterion (1), is fundamental. Varieties of language are only assigned a code if they are mutually unintelligible with varieties of any language to which a code has already been assigned. Simply stated, a dialect that is merely linguistically idiosyncratic does not normally merit its own code unless speakers of other dialects cannot understand it.

With reference to criterion (2), it should be noted that the Ethnologue system is intended to encompass only those languages of the world in current use. Thus the Ge'ez (Ethnologue code GEE) and Sanskrit (Ethnologue code SKT) languages both appear in Ethnologue, even though they have not been spoken by native speakers for many centuries, simply because they are in common liturgical use today. Akkadian (LINGUIST code XAKK), on the other hand, does not appear in the Ethnologue, simply because it has not been used in any function for almost 2000 years.

Most ancient languages - and many languages which have become extinct over the last 500 years - are thus absent from Ethnologue, along, of course, with all constructed languages except Esperanto (Ethnologue code ESP), which has a small number of native speakers.

There is one clear inconsistency in this regard in Ethnologue. Languages which are recently extinct often appear there, even when they have no current function. The Romance language Dalmatian, for example - spoken on the coast of modern Croatia until the late 19th century - has the Ethnologue code DLM. On the other hand the language Abipon (LINGUIST code XABI), a South American Indian language which became extinct at around the same time, does not appear in Ethnologue.

There are also two shortcomings in Ethnologue's system, and these have to do with the notions of provenance and conflict. Every language in Ethnologue is documented to a greater or lesser degree. But we usually do not have a clear idea of the evidence upon which it was decided to assign the language a unique code. Nor does the system allow for conflicting language classifications. For example, there is disagreement amongst scholars as to the classification of Low German dialects. This is not indicated in Ethnologue.

In mitigation of both these points, it might be noted that Ethnologue intends to include provenance information in the future, and that Ethnologue is not designed to provide a complete classification of the languages to which it assigns codes. The classification it does indicate is merely intended to be of service to those interested in such information.

Criteria for Assigning Codes to Ancient and Constructed Languages

Since OLAC is designed to allow the categorization of any language, a complete set of language codes must be available, a set which includes codes for both ancient and constructed languages. We propose here that the set of supplementary codes designed by LINGUIST for this purpose should become the OLAC standard, and that the union of these two code sets should be called the Universal Language Codes or ULC.

This set of supplementary codes should conform as closely as is reasonable to the standards set by Ethnologue. However, producing a useful set of codes for ancient and constructed languages entails considerable loosening of the Ethnologue standards. Most importantly, the criterion of mutual intelligibility is problematic here for three reasons. First, in some cases the criterion of mutual intelligibility has to be abandoned simply because the language was culturally distinct, and scholars treat it as a separate entity. A case in point is Anglo-Norman, which was in reality an aberrant dialect of Old French. However, since it evolved independently, and has a literature distinct from that of Old French, which scholars treat separately, it must be assigned a distinct code so that work on it can be discriminated from work on Old French generally.

Mutual intelligibility also breaks down in another way. Ancient languages often have a diachronic dimension that can usually be ignored with modern languages. Old Latin gave rise to Classical Latin, which in turn gave rise to Late Latin, which in turn gave rise to Vulgar Latin or Proto-Romance. It is likely that no two adjacent stages of this complex process would have been mutually incomprehensible, had there been any speakers who could speak the two versions. How many codes do we assign here on the basis of mutual intelligibility?

There is also the issue of ancient languages in scripts which have as yet not been deciphered (e.g. Minoan, the language(s) of Linear A, LINGUIST code XMIO), or which cannot be understood, though their texts are written in scripts which can be read (e.g. Eteocretan, LINGUIST code XECR).

In conclusion then, we propose the following overriding standard for assigning codes to ancient languages:

  1. Codes should be assigned to ancient languages which are treated distinctly by the scholarly community.

In addition, we propose that the following should serve as criteria for the assigning of codes:

  1. The standard of mutual intelligibility should apply as far as possible. That is, all apparently mutually intelligible ancient languages spoken at approximately the same period should be assigned one code, unless this conflicts with scholarly usage. In cases where the level of mutual intelligibility cannot be clearly ascertained, however, separate codes should be assigned.
  2. Codes should be assigned to undeciphered scripts, and to uninterpretable ancient languages in known scripts.
  3. The system should be as complete as possible. Ancient languages should not be excluded simply because they are obscure. When scholars need to refer to a language, a distinct code will be assigned to it, even when very little is known about it. For example, almost nothing is known about Noric (LINGUIST code XNOR) except that it was spoken in the Roman province of Noricum and was Continental Celtic. It must nevertheless be assigned a code, for Celtic scholars sometimes refer to it.
  4. All alternate names of ancient languages should be listed, even those which are deprecated by scholars, though deprecated names should be clearly indicated.
  5. In order to allow as much integration with Ethnologue codes as possible, the primary geographic categorization of ancient languages should be by the modern countries in which they once existed.
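
The overriding standard and criterion 1 above can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the function name needs_separate_code and its parameters are hypothetical and not part of the LINGUIST proposal, and the judgments it takes as input (scholarly treatment, intelligibility) would in practice be made by scholars rather than computed.

```python
from typing import Optional


def needs_separate_code(treated_distinctly: bool,
                        mutually_intelligible: Optional[bool]) -> bool:
    """Decide whether an ancient variety gets its own code, relative to a
    roughly contemporaneous variety that already has one.

    treated_distinctly: scholars treat the variety as a separate language.
    mutually_intelligible: True/False if known, None if it cannot be
        clearly ascertained.
    (Hypothetical helper; names are assumptions made for illustration.)
    """
    # Overriding standard: scholarly usage wins over intelligibility.
    if treated_distinctly:
        return True
    # Criterion 1: unclear intelligibility leads to separate codes.
    if mutually_intelligible is None:
        return True
    # Otherwise mutually intelligible varieties share one code.
    return not mutually_intelligible


# Anglo-Norman relative to Old French: intelligible, but treated separately.
print(needs_separate_code(treated_distinctly=True, mutually_intelligible=True))
```

The ordering of the checks mirrors the text: scholarly treatment is tested first because it is described as overriding, and an unknown level of intelligibility defaults to a separate code.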

With regard to constructed languages, the following standards should apply:

  1. Constructed languages cannot be treated by the criterion of mutual intelligibility, since they are almost never actually spoken, and are as much cultural objects as linguistic. In some cases there exist variants of originally identical constructed languages which have begun evolving independently. Esperanto (Ethnologue code ESP) and Ido (LINGUIST code CIDO) are instances of this phenomenon. These should be assigned distinct codes.
  2. No attempt should be made to assign constructed languages to geographical regions, since they do not exist in the real world.

Demarcation between Ethnologue and LINGUIST Codes

Since we do not believe that any purpose is served by multiplying language codes, we propose that LINGUIST codes should merely fill the gaps in the Ethnologue system. Thus, liturgical languages should remain part of the Ethnologue system, even if ancient; and all extinct languages already in Ethnologue should remain there.

For the future, however, a clear line of demarcation should exist, so that scholars will know which organization is responsible for assigning a code they need to a language. We propose that an arbitrary division should be decided on, thus:

  1. All languages which require codes and which became extinct before 1800 should become the responsibility of LINGUIST. All languages still in use after 1800 will be in the purview of Ethnologue.
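
The demarcation rule for natural languages can be sketched as a single function. This is a minimal illustration, not an official procedure: the name responsible_body, its parameters, and the example years are assumptions made here. The years are rough stand-ins for "almost 2000 years ago" and "late 19th century" as described in the text.

```python
from typing import Optional

CUTOFF_YEAR = 1800  # the arbitrary dividing line proposed in the text


def responsible_body(extinct_year: Optional[int],
                     already_in_ethnologue: bool = False,
                     liturgical_use: bool = False) -> str:
    """Return which organization would assign a natural language's code.

    extinct_year: year the language became extinct, or None if it is
        still in use. (Hypothetical helper for illustration only.)
    """
    # Existing Ethnologue entries and liturgical languages stay where they are.
    if already_in_ethnologue or liturgical_use:
        return "Ethnologue"
    # Languages extinct before the cutoff fall to LINGUIST.
    if extinct_year is not None and extinct_year < CUTOFF_YEAR:
        return "LINGUIST"
    return "Ethnologue"


# Illustrative inputs; the years are approximations, not sourced dates.
print(responsible_body(100))                               # e.g. Akkadian
print(responsible_body(None, liturgical_use=True))         # e.g. Sanskrit
print(responsible_body(1890, already_in_ethnologue=True))  # e.g. Dalmatian
```

Constructed languages are deliberately left out of this sketch, since the 1800 rule is stated in terms of extinction and does not apply to them.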

Form of Supplementary Codes

To ensure that the LINGUIST codes are always clearly distinguishable from those of Ethnologue, and so that codes can be assigned by one organization without reference to the other, we further propose a distinction between Ethnologue and LINGUIST codes:

  1. All codes assigned by LINGUIST will contain four letters, with ancient languages having the prefix X (e.g. XAKK Akkadian) and all constructed languages the prefix C (e.g. CKLN Klingon).

Documentation of Supplementary Codes

To ensure that the basis upon which codes are assigned is always clear, we suggest that:

  1. The documentation of all codes assigned by the supplementary LINGUIST system should ultimately include information on their provenance.
  2. The documentation of each code should ultimately include information as to conflicting views of the language and the conflicting subgroupings to which it may be assigned.
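
The form rule proposed under Form of Supplementary Codes (three-letter Ethnologue codes, four-letter LINGUIST codes beginning with X or C) makes the origin of a code recoverable from its shape alone. The following Python sketch illustrates this. The function name classify_code and the returned labels are hypothetical, and the restriction to uppercase ASCII letters is an assumption based on the examples in the text.

```python
import re

# Patterns follow the forms described in the text; uppercase ASCII
# letters are assumed, as in the examples (GEE, XAKK, CKLN).
ETHNOLOGUE_RE = re.compile(r"[A-Z]{3}")
LINGUIST_RE = re.compile(r"[XC][A-Z]{3}")


def classify_code(code: str) -> str:
    """Classify a language code by its form alone (illustrative helper)."""
    if ETHNOLOGUE_RE.fullmatch(code):
        return "ethnologue"
    if LINGUIST_RE.fullmatch(code):
        # X marks an ancient language, C a constructed one.
        return "linguist-ancient" if code[0] == "X" else "linguist-constructed"
    return "invalid"


for code in ("SKT", "XAKK", "CKLN", "AKKD"):
    print(code, classify_code(code))
```

Because the two kinds of code differ in length, a three-letter code and a four-letter code can never collide, which is what lets each organization assign codes without consulting the other.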

A list of all current LINGUIST ancient language codes can be found at the URL:

A list of all current LINGUIST constructed codes can be found at the URL:


[1] Constable, Peter and Gary Simons. 2000. "Language identification and IT: addressing problems of linguistic diversity on a global scale." SIL Electronic Working Papers (2000-001).