Data Management (Library and Information Studies)

Though closely entwined with the previous section, focusing on ‘data management’ issues affords opportunities to concentrate more specifically on data standards, ontologies, thesauri, data types and data processing methods. The Open Archives Initiative (OAI – acronymically similar but conceptually distinct from OAIS) is an influential community of individuals and organisations interested in promoting efficient methods for broadly disseminating information using interoperable standards. It is formally supported by the Andrew W. Mellon Foundation, the Coalition for Networked Information, the Digital Library Federation and the National Science Foundation, but non-U.S. organisations also take an active part, notably the Joint Information Systems Committee (JISC) and UKOLN (UK Office for Library Networking), both of which were represented at a meeting in April 2006 which addressed the theme, ‘Augmenting Interoperability Across Scholarly Repositories’.

One of the principal concepts promoted by OAI is the use of OAI-PMH (the Protocol for Metadata Harvesting), a method by which XML-based descriptive metadata records can be automatically acquired by OAI-PMH compliant systems. Typically, records take advantage of well-established formats such as Dublin Core or MARCXML. The OAI site lists around thirty different tools relating to the protocol that have been developed by members of the OAI community, examples of which include:

  • Archimede – open-source software for institutional repositories. It features full-text searching, multiplatform support, a Web user interface, and more. Archimede fully supports version 2.0 of the OAI-PMH.
  • OAI-perl Library – a library of Perl classes that allows the rapid deployment of an OAI-compatible interface to an existing web server/database.
  • ZMARCO – an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) 2.0 compliant data provider. The 'Z' in ZMARCO stands for Z39.50; 'MARC' stands for MAchine-Readable Cataloging; and the 'O' stands for OAI, as in the Open Archives Initiative. ZMARCO allows MARC records that are already available through a Z39.50 server to be made available via the OAI-PMH with relatively little effort.
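In practice the protocol amounts to little more than HTTP requests that return XML, which can be suggested with a short Python sketch using only the standard library. The repository endpoint and the sample Dublin Core record below are invented for illustration; a real harvester would fetch the URL over HTTP and handle features such as resumption tokens.

```python
# Sketch of an OAI-PMH exchange using only the Python standard library.
# The endpoint URL and sample record are hypothetical.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

def list_records_url(metadata_prefix="oai_dc", **kwargs):
    """Build a ListRecords request URL as defined by OAI-PMH 2.0."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    params.update(kwargs)
    return BASE_URL + "?" + urlencode(params)

# A minimal (abbreviated) oai_dc record as it might appear in a response.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Thesis</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def extract_titles(response_xml):
    """Pull dc:title values out of a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text
            for el in root.iter("{http://purl.org/dc/elements/1.1/}title")]

print(list_records_url())
print(extract_titles(SAMPLE_RESPONSE))
```

The harvester needs no knowledge of the repository's internal database: the verb, the metadata prefix and the returned XML are the whole interface, which is precisely what makes the protocol so easy for the tools listed above to implement.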

Making data available using the above methods is a well-established process and accords with the recommendations given for content providers wishing to make their systems interoperable with the JISC Information Environment (JISC-IE).

Whilst methods of standardising metadata and the use of protocols might appear to stretch definitions of ‘tool’ usage, it is difficult not to refer to these initiatives when reflecting on LIS research territories, central as they are to the concerns of data management. A similarly critical area of research focuses on the development and application of ontologies, a term that overlaps to some extent with the related activities of glossary, controlled-vocabulary, thesaurus and taxonomy creation. The purpose of an ontology is to map all of the objects or concepts relating to a field of knowledge into a systematic arrangement that displays the relationships existing between them and their properties in relation to the whole domain. The application of ontologies allows systems to operate more knowledgeably by allowing related semantic concepts to be aggregated for search and retrieval purposes.
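The retrieval benefit can be suggested with a deliberately minimal Python sketch; the two-level ‘animal’/‘horse’ vocabulary and the sample documents are invented. A query for a broad concept is expanded to its narrower terms before matching, so a document that never mentions ‘animal’ can still satisfy a search for it.

```python
# Toy illustration of ontology-assisted retrieval: a query for a broad
# concept is expanded to its narrower terms. The vocabulary is invented.
SUBCLASS_OF = {
    "horse": "animal",
    "dog": "animal",
    "oak": "plant",
}

def narrower_terms(concept):
    """Direct subclasses of `concept` (one level, kept simple for brevity)."""
    return {child for child, parent in SUBCLASS_OF.items()
            if parent == concept}

def search(query, documents):
    """Match documents mentioning the query concept or any narrower term."""
    terms = {query} | narrower_terms(query)
    return [doc for doc in documents
            if any(t in doc.lower() for t in terms)]

docs = ["A study of horse anatomy", "Oak woodland ecology"]
print(search("animal", docs))  # the horse paper matches a query for 'animal'
```

A real ontology would of course record many relationship types beyond subclassing and would follow them transitively, but the principle is the same: the system ‘knows’ that a horse is an animal and retrieves accordingly.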

Data modelling exercises of this type vary in size depending on the scope of the domain in question. So-called ‘upper’ ontologies map downwards from very high-level general concepts; examples include SUMO (Suggested Upper Merged Ontology), which is limited to meta, generic, abstract and philosophical concepts, and Cyc, a manually entered knowledgebase of more than a million human assertions formalising common-sense notions such as ‘the Earth orbits the Sun’. At a more specific level, ontologies exist for much narrower domains (e.g. atomic elements, biological viruses), but they also vary in their complexity and in the extent to which they can strictly be defined as ontologies. The Dublin Core schema is an example of a simple ontology, whilst WordNet, a large lexical database of English with words grouped into conceptually related sets of synonyms, might alternatively be described as a semantic lexicon or a combination of taxonomy and controlled vocabulary.

Another influential ontology model, widely used in the museums and cultural heritage sector, is the CIDOC-CRM. Full reference will be made to this in a forthcoming Methods Network working paper (see Digital Tools for Museums and Cultural Heritage Research), but to summarise, the CRM provides a common and extensible semantic framework that was originally developed for cultural information but is potentially of conceptual use to any domain of activity. The FRBR model (Functional Requirements for Bibliographic Records), developed by IFLA (the International Federation of Library Associations and Institutions) in 1998, took a fresh look at what functions bibliographic records perform and then systematically tried to map the bibliographic realm by defining entities, relationships and attributes.

Fig. 3 The FRBR Group 1 bibliographic entities

The FRBR also considered what the users of catalogues ‘do’, and what constitutes the product of artistic endeavour.

More recently, a working group has been convened to look into the harmonisation of the CIDOC-CRM and FRBR and has concluded that there are mutually beneficial components in both models. The FRBR enriches the CIDOC-CRM with its notions of the stages of intellectual creation (see fig. 3), and conversely, the FRBR is able to appropriate a general model of historical events from the CIDOC-CRM.

Discussion of data management issues inevitably requires reference to XML (and its related tools and techniques), and versions of the CIDOC-CRM model have been encoded in RDFS (Resource Description Framework Schema). This is based on XML and is an extension of RDF, a standard and universally machine-readable way of expressing information about web resources (e.g. title, author, modification date, contents, etc.). RDF Schema provides a framework to describe application-specific classes and properties and bears some resemblance to the use of classes and sub-classes in object-oriented programming, enabling hierarchical descriptions of objects to be encoded, e.g. ‘horse’ can be defined as a sub-class of ‘animal’.
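The ‘horse’/‘animal’ example can be written out as a small RDF/XML fragment and inspected with the Python standard library. The class names and base URI below are illustrative only; the RDF and RDFS namespace URIs are the standard ones.

```python
# The 'horse is a sub-class of animal' example from the text, expressed
# as an RDF Schema fragment and inspected with xml.etree. The class
# names and base URI are invented for illustration.
import xml.etree.ElementTree as ET

RDFS_FRAGMENT = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xml:base="http://www.example.org/animals#">
  <rdfs:Class rdf:ID="animal"/>
  <rdfs:Class rdf:ID="horse">
    <rdfs:subClassOf rdf:resource="#animal"/>
  </rdfs:Class>
</rdf:RDF>"""

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"

def subclass_pairs(rdf_xml):
    """Return (subclass, superclass) pairs declared in the fragment."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for cls in root.findall("{%s}Class" % RDFS_NS):
        child = cls.get("{%s}ID" % RDF_NS)
        for sub in cls.findall("{%s}subClassOf" % RDFS_NS):
            parent = sub.get("{%s}resource" % RDF_NS).lstrip("#")
            pairs.append((child, parent))
    return pairs

print(subclass_pairs(RDFS_FRAGMENT))  # [('horse', 'animal')]
```

Nothing in the fragment is specific to animals: swapping in bibliographic or museum classes gives exactly the kind of machine-readable hierarchy that encodings of the CIDOC-CRM rely on.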

Fig. 4 Example of RDFS class (taken from W3 Schools tutorial)
This clearly has applications to the construction of ontologies, and the standard way of doing this with XML is by using OWL (Web Ontology Language), a W3C-recommended language that allows machine-readable semantic content to be processed and exchanged between different operating systems and application languages. OWL and RDF share the same look and feel, but the former is defined with a larger vocabulary and a stronger syntax, giving it greater machine interpretability. It includes three sublanguages that sequentially increase in complexity and are nested in terms of functionality:

  • OWL Lite
  • OWL DL (includes OWL Lite)
  • OWL Full (includes OWL DL)

The use of the various applications of XML for data management tasks is too broad a field to summarise adequately in this paper, but the following approaches are indicative of the ways in which encoding can assist with the handling of data. METS (Metadata Encoding and Transmission Standard) is a schema for descriptive, administrative and structural metadata about objects within a digital library; it requires file inventory information to be included (listing all files associated with the digital object), together with a structural map outlining the hierarchical structure that links the object with its content files and metadata. EAD (Encoded Archival Description) is based on the TEI (Text Encoding Initiative) Guidelines and is aimed at library and museum professionals who need to make finding aids such as inventories, indexes and registries machine-readable. The enthusiastic global adoption of TEI across multiple academic disciplines makes this encoding method an important element in information management strategies. Not only are the TEI Guidelines a standard way of expressing information about textual data, but they have also been extended for use in more specific realms, e.g. CES (Corpus Encoding Standard – and its XML instantiation, XCES) for describing corpora, and MEP (Model Editions Partnership) for creating editions of historical documents.
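The relationship METS establishes between the file inventory and the structural map can be sketched in Python. The abbreviated METS sample and file names below are invented, and a real document carries far more metadata; the point is simply that every pointer in the structMap should resolve to a file listed in the fileSec.

```python
# Abbreviated sketch of the two METS components discussed in the text:
# a file inventory (fileSec) and a structural map (structMap). The
# object and file names are invented for illustration.
import xml.etree.ElementTree as ET

METS_SAMPLE = """<?xml version="1.0"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <fileSec>
    <fileGrp USE="master">
      <file ID="F1"><FLocat xlink:href="page001.tif"/></file>
      <file ID="F2"><FLocat xlink:href="page002.tif"/></file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="book">
      <div TYPE="page"><fptr FILEID="F1"/></div>
      <div TYPE="page"><fptr FILEID="F2"/></div>
    </div>
  </structMap>
</mets>"""

METS_NS = "{http://www.loc.gov/METS/}"

def check_structmap(mets_xml):
    """Verify every structMap file pointer resolves to an inventoried file."""
    root = ET.fromstring(mets_xml)
    inventory = {f.get("ID") for f in root.iter(METS_NS + "file")}
    pointers = {p.get("FILEID") for p in root.iter(METS_NS + "fptr")}
    return pointers <= inventory

print(check_structmap(METS_SAMPLE))  # True
```

It is this pairing of a complete inventory with an explicit hierarchy that lets a digital library reassemble an object (here, a two-page book) from its constituent files.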
