Knowledge-Based Systems (Linguistics)

If the construction of corpora is the main route by which statistical methods are applied to problems in linguistics research, there is clearly a need to also examine the knowledge-based approach, which in practical terms means attaching encoded or categorised data and constructing ontological classification systems to assist with a whole host of research questions relating to grammatical, syntactic, semantic, phonological, morphological, lexical and diachronic (i.e. historical) data. This is not to say that the two approaches are necessarily discrete. On the contrary, as has already been shown elsewhere in reference to a study by Paul Baker (see Corpus Linguistics), research is often a blend of methods and approaches and an accumulation of information through phased querying.

As an encoding scheme that is widely used across a range of disciplines, the Text Encoding Initiative (TEI) is something of a rarity: its breadth of adoption does credit to the thinking behind it, and also highlights the widespread need for a standardised textual markup framework. Unsurprisingly, linguistics is one of the subject areas that has benefited from the TEI guidelines, which allow the description of a variety of language features, including the separation of elements of spoken discourse and the segmentation of text into sentences, phrases, words, morphemes and graphemes. Beyond these more formal aspects of language analysis, the framework also provides for the description of speaker identities, the context of textual sources, temporal information, methodological principles, and a great deal of other broadly applicable information encapsulated by the definitions available within it.
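For illustration, a minimal TEI-style fragment might mark up a segmented utterance as follows (the speaker ID and the particular segmentation are invented for the example):

```xml
<!-- Hypothetical fragment: speaker ID and segmentation invented -->
<u who="#speaker1">
  <s>
    <w>walk<m type="suffix">ed</m></w>
    <w>home</w>
  </s>
</u>
```

Here `<u>` marks an utterance attributed to a speaker, `<s>` a sentence, `<w>` a word and `<m>` a morpheme – the kinds of element the guidelines provide for separating spoken discourse and segmenting text.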

In response to needs expressed by researchers in natural language processing and language engineering, the TEI framework has been refined and extended to facilitate a deeper level of encoding. The Corpus Encoding Standard (CES) uses the TEI modular DTD and the TEI customisation mechanisms to describe elements that are specifically appropriate to corpus encoding. An XML-compliant version, XCES, has been in development since 2000 and, like its predecessor, is particularly suited to corpora that shed light on problems of language processing and engineering.

A more recent initiative (2003) to develop methods of describing linguistic resources has been proposed by the GOLD (General Ontology for Linguistic Description) community, whose objectives are:

  • to promote best practice as suggested by the E-MELD project
  • to encourage data interoperability through the use of ontologies
  • to encourage the re-use of software
  • to facilitate search across disparate data sets
  • to create a forum for data providers and consumers

GOLD works closely with those involved with the E-MELD initiative (which promotes best practices in digital language documentation) and OLAC (Open Language Archives Community), which is also engaged in defining and describing linguistic resources using Dublin Core and other metadata frameworks. The GOLD ontology attempts to map information from disparate sources, about different languages and from different theoretical perspectives, onto a common semantic resource. This resource, referred to as a set of ‘descriptive profiles’, was originally based on the detailed and very useful glossary of terms provided by SIL (initially known as the Summer Institute of Linguistics), and has been substantially extended by members of the GOLD community.

In the wider context of knowledge-based systems, the use of externally defined ontologies, taxonomies and thesauri is also widespread, and has been mentioned elsewhere in the context of USAS (UCREL Semantic Annotation System) and WordNet (see Corpus Linguistics).

Table 3 - The USAS top level categories (taken from UCREL website)

The latter is an online lexical reference system in which nouns, verbs, adjectives and adverbs are organised into synonym sets (‘synsets’) that additionally link with other sets in various ways; these are also incorporated into the Suggested Upper Merged Ontology (SUMO), which describes itself as ‘the largest formal public ontology in existence today’.
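The underlying data structure – synonym sets connected by typed links – can be sketched very simply. The synset identifiers below follow WordNet's naming style, but the sets, relations and links are purely illustrative:

```python
# Toy model of WordNet-style synsets: words grouped into synonym
# sets, with typed links (e.g. antonymy) between the sets.
# All identifiers and links here are invented for illustration.
synsets = {
    "good.a.01": {"good", "fine"},
    "bad.a.01": {"bad", "poor"},
    "quality.n.01": {"quality", "character"},
}

links = [
    ("good.a.01", "antonym", "bad.a.01"),
    ("good.a.01", "attribute", "quality.n.01"),
]

def related(synset_id, relation):
    """Follow links of one type outward from a synset."""
    return [dst for src, rel, dst in links
            if src == synset_id and rel == relation]

print(related("good.a.01", "antonym"))   # -> ['bad.a.01']
```

Real WordNet distinguishes many more relation types (hypernymy, meronymy, entailment and so on), but the principle – a graph whose nodes are synsets rather than individual words – is the same.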

The USAS category system (see table 3) currently has 21 top-level fields, each assigned a capital letter, which subdivide into 232 category labels (designated by numbers) and in turn allow further hierarchically structured mapping of discrete concepts. Antonymy of conceptual classifications can be indicated by plus or minus markers within the tags (e.g. A5.1+ = good; A5.1- = bad), and multiple possible semantic domains can be combined using slash tags (e.g. sportswear may come under both clothing and sport – B5/K5.1).
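This tag structure is regular enough to decompose mechanically. The sketch below infers a tag grammar from the examples above (letter, dotted numeric path, optional polarity markers, slash-separated domains); it is not taken from the USAS specification itself:

```python
import re

def parse_usas_tag(tag):
    """Decompose a USAS-style tag into its component domains.

    Inferred format (illustrative, not the official grammar):
    domains separated by '/', each a capital letter, a dotted
    numeric path, and an optional run of '+' or '-' markers.
    """
    domains = []
    for part in tag.split("/"):
        m = re.fullmatch(r"([A-Z])([\d.]*)(\++|-+)?", part)
        if not m:
            raise ValueError(f"unrecognised tag component: {part!r}")
        letter, numbers, polarity = m.groups()
        domains.append({
            "field": letter,                 # top-level field, e.g. 'A'
            "subcategory": numbers or None,  # e.g. '5.1'
            "polarity": polarity,            # '+', '-', or None
        })
    return domains

print(parse_usas_tag("A5.1+"))    # single domain, positive polarity
print(parse_usas_tag("B5/K5.1"))  # two domains: clothing and sport
```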

One further example of a knowledge-based system (and one that refers specifically to historical word forms) is the Historical Thesaurus of English.

Table 4 – Synonyms for Grandfather Listed in the Historical Thesaurus of English (taken from Kay (2005))

The data in the thesaurus (650,000 meanings in 26 major semantic fields) is taken from the Oxford English Dictionary and includes the first – and where relevant the last – recorded dates of usage for each word, along with a total of 29 fields of metadata, including broad categorical style and status descriptions and part-of-speech information. Data in this format is demonstrably useful to researchers in a number of areas, but for those involved in the diachronic study of linguistics it is a hugely valuable resource, not only for gauging the relative prominence or insignificance of lexical items from the period defined as Old English (c.700–1150 A.D.) through to the present day, but also for disambiguating historical terminology for semantically identical words (see table 4).
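The first/last-recorded dates make simple diachronic queries possible, such as asking which synonyms were attested at a given point in time. The sketch below is a minimal stand-in for such a record – the field names are invented, the real records carry 29 metadata fields, and the dates shown are illustrative only, not taken from the thesaurus:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ThesaurusEntry:
    """Simplified, hypothetical stand-in for a Historical Thesaurus
    record; field names invented for illustration."""
    word: str
    sense: str
    first_recorded: int            # year of first recorded usage
    last_recorded: Optional[int]   # None if still current
    part_of_speech: str

def current_in(entries, year):
    """Return the entries attested as being in use in the given year."""
    return [e for e in entries
            if e.first_recorded <= year
            and (e.last_recorded is None or e.last_recorded >= year)]

# Synonyms for 'grandfather' in the spirit of table 4 (dates illustrative)
entries = [
    ThesaurusEntry("ealdefæder", "grandfather", 800, 1150, "n"),
    ThesaurusEntry("grandsire", "grandfather", 1300, 1900, "n"),
    ThesaurusEntry("grandfather", "grandfather", 1400, None, "n"),
]
print([e.word for e in current_in(entries, 1500)])
```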

A serious problem for researchers is that the further back in time one goes, the more variously and inconsistently words tend to be spelt. Increasingly, however, probabilistic analysis and tagging methods, based on accurate sets of manually created sample data, allow large corpora to be automatically tagged with likely variant-spelling definitions using tools such as VARD (Variant Detector Tool), developed at Lancaster University and described by Dawn Archer at the Methods Network workshop on Historical Text Mining.
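To give a flavour of the task, the toy sketch below matches a historical spelling to its most similar modern lexicon entry using a string-similarity ratio. This is deliberately naive – VARD's actual method combines letter-replacement rules and trained probabilistic evidence, and the lexicon here is a made-up fragment:

```python
from difflib import SequenceMatcher

def best_modern_form(variant, lexicon, threshold=0.75):
    """Naive variant-spelling normaliser: return the modern lexicon
    entry most similar to the historical form, or None if nothing is
    similar enough. A toy stand-in for tools such as VARD, not a
    description of their actual algorithm."""
    scored = [(SequenceMatcher(None, variant.lower(), w).ratio(), w)
              for w in lexicon]
    score, word = max(scored)
    return word if score >= threshold else None

lexicon = ["love", "said", "have", "year"]
print(best_modern_form("loue", lexicon))  # early modern 'u' for 'v' -> 'love'
print(best_modern_form("sayd", lexicon))  # -> 'said'
```

In practice a rule such as "interchange u and v" is far more reliable than raw string similarity, which is one reason real normalisers are trained on manually verified sample data rather than relying on edit distance alone.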
