Data Preparation (Electronic Texts)

Using XML (Extensible Markup Language) encoding to add descriptive value to digitised textual material is perhaps as commonplace an activity throughout humanities computing as word processing is across all subject disciplines. In the context of a paper on electronic texts, however, it is worth rehearsing at least a few of the reasons why the practice is so widespread, if only to establish a foundation for addressing more applied XML techniques later on. One implementation of XML that has gained a remarkable level of international acceptance is the Text Encoding Initiative (TEI), a set of guidelines originally developed for the SGML (Standard Generalized Markup Language) standard and updated for compatibility with XML when the latter superseded SGML as the widely used markup scheme of choice. The TEI guidelines allow those working with texts wide latitude in how lax or precise they wish to be in describing their chosen resource, but the best returns on effort are generally had by using as much of the intrinsic detail of the system as possible and by applying it rigorously throughout the data. Jerome McGann, in his influential essay ‘The Rationale of Hypertext’, argues vividly in favour of applying electronic methods to literary studies and sets out some of the advantages they offer over more traditional print-based techniques:
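
For readers unfamiliar with the practice, the sketch below shows roughly what such encoding looks like and how it can be processed: a small TEI-style stanza (lg and l are genuine TEI elements for line groups and verse lines, though the fragment itself is an invented illustration, not drawn from any edition) parsed with Python's standard library.

```python
# Minimal sketch: a TEI-style verse fragment parsed with the standard
# library. The element names (lg, l) and attributes (type, n) are real
# TEI conventions; the content is invented for illustration.
import xml.etree.ElementTree as ET

tei_fragment = """
<lg type="stanza">
  <l n="1">Shall I compare thee to a summer's day?</l>
  <l n="2">Thou art more lovely and more temperate:</l>
</lg>
"""

root = ET.fromstring(tei_fragment)

# the markup makes line numbers and line boundaries machine-addressable
lines = [(l.get("n"), l.text) for l in root.findall("l")]
for n, text in lines:
    print(n, text)
```

Once the descriptive detail is in the markup rather than implicit in the layout, it becomes available to any downstream process: indexing, linking, transformation, or analysis.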

  • Internal links and data relationships within the resource can be minutely and accurately defined
  • Previously distributed elements (unavoidable in printed formats of the work) can be made simultaneously present to each other
  • A larger quantity of material can be addressed, and complex navigational routes through this mass supported
  • Open-ended and collaborative editions of works can be envisioned, rather than committing to laborious and expensive reprinting of paper-based copies

Whilst his remarks embrace such tools generally, the ‘cement’ that binds the information together and enables linking elements to be embedded is most often some implementation of XML (often TEI-compliant). XML is thus central to any strategy that the supporters of digital editions will promote to establish these resources as the most logical and sensible way to compile this type of complex scholarly work.

Whilst the XML/TEI model is pervasive and powerful in the ways it can be implemented, the intrinsic complexity of written data, particularly poetry and certain forms of prose literature, demands a semantic and syntactic flexibility that regular implementations of markup languages can struggle to accommodate, based as they are on tree-like structures and strictly nested components. The widely acknowledged problem of accommodating overlapping hierarchies in XML/TEI has engendered a number of potential solutions and is the focus of a TEI Special Interest Group (SIG), which maintains a wiki listing systems that provide examples of extended or alternative models.
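
The difficulty is easy to demonstrate: the moment two descriptive hierarchies cross, say a speech that runs across a verse-line boundary, the markup ceases to be well-formed XML. A minimal sketch, using invented element names in the TEI spirit:

```python
# Sketch of the overlap problem: a 'said' element that crosses a
# verse-line boundary cannot be expressed with ordinary nested tags.
# Any conformant XML parser rejects the overlapping form.
import xml.etree.ElementTree as ET

overlapping = (
    "<sp><l>He said <said>farewell</l> and <l>left</said> at dawn.</l></sp>"
)

try:
    ET.fromstring(overlapping)
    parsed = True
except ET.ParseError:
    parsed = False

print(parsed)  # the overlapping markup fails to parse
```

Within plain XML the encoder must fall back on workarounds such as splitting one hierarchy into fragments or marking boundaries with empty milestone elements, which is precisely the pressure that motivates the alternative schemes discussed below.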

MECS (Multi-Element Code System) is one of these alternative schemes, developed by Claus Huitfeldt for the Wittgenstein Archives in Bergen, Norway, in response to the complex multi-hierarchical nature of that dataset. The system relies on a slightly different tag format which enables non-nested passages to be marked up. A further development, TexMECS, devised by Huitfeldt together with Michael Sperberg-McQueen (incidentally one of the principal architects of the TEI), defines the following features:

1) Empty elements marked by sole-tags
2) Normal elements with start and end tags
3) Interrupted elements with start, suspend, resume and end tags
4) Elements with children whose order has no significance and which can therefore be reordered
5) Virtual elements, which have a generic identifier and attributes, and which share children with another element in the document
6) Self-overlapping elements, which use a simple co-indexing scheme: tags are co-indexed by a tilde and a suffix of numbers and letters
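
In TexMECS notation, start-tags take the form <name| and end-tags |name>, so an element may close after another has opened without any nesting violation. The toy scanner below, an illustration rather than a TexMECS parser, extracts the tag-event sequence from an invented fragment in which a said element overlaps two l elements:

```python
# Toy scanner for TexMECS-style tags: start-tags <name|, end-tags
# |name>. The fragment is invented; a real TexMECS parser handles far
# more (sole-tags, suspend/resume, virtual elements, co-indexing).
import re

fragment = "<l|He said <said|farewell|l> <l|and left|said> at dawn.|l>"

tags = re.findall(r"<(\w+)\||\|(\w+)>", fragment)
events = [("start", s) if s else ("end", e) for s, e in tags]
print(events)
```

The resulting event stream, start l, start said, end l, start l, end said, end l, is exactly the interleaving that well-formed XML forbids.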

Every TexMECS document should be translatable into a GODDAG (Generalized Ordered-Descendant Directed Acyclic Graph) structure, a further proposal by Sperberg-McQueen and Huitfeldt that uses graph theory to navigate around the problem of overlap, providing individual entities with additional nodes as points of relation within a tree-like structure. The graph resembles a tree but differs in that multiple parent nodes can contain the same child.
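
The distinctive property, shared children, can be sketched with a toy structure (the node labels are invented, and only the multiple-parent relation is modelled, not GODDAG's ordering constraints):

```python
# Toy GODDAG-style structure: unlike a tree, a node may have more than
# one parent. Here the word "farewell" is a child both of a verse line
# and of a speech that overlaps it.
class Node:
    def __init__(self, label):
        self.label = label
        self.children = []
        self.parents = []

    def add_child(self, child):
        self.children.append(child)
        child.parents.append(self)

root = Node("text")
line = Node("l")
speech = Node("said")
word = Node("farewell")

root.add_child(line)
root.add_child(speech)
line.add_child(word)    # the same leaf is shared ...
speech.add_child(word)  # ... by two parent elements

print([p.label for p in word.parents])  # two parents: impossible in a tree
```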

The ARCHway project, based at the University of Kentucky, has also looked at the problem of overlapping markup and has developed a package called the Edition Production Technology (EPT) to deal with the tagging problems associated with the image-based resources on which the project focuses. The method implemented in EPT, Concurrent Markup Hierarchies (CMH), manages the encoding through Extended XPath, an extension of regular XPath which also takes advantage of the node method defined by the GODDAG structure. XPath is a W3C-recommended method of modelling XML documents as trees of nodes so that individual parts of the document can be addressed effectively, which in turn allows other processes, such as XSLT (Extensible Stylesheet Language Transformations), to carry out transformations on the data contained at that node. Another notable feature of EPT is that it is built on the Eclipse platform, whose plug-in architecture offers extension points where new code can be added to extend the existing functionality, presenting opportunities for collaborative work at a range of different levels.
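
A minimal sketch of this kind of node addressing, using the limited XPath subset supported by Python's standard library rather than full XPath or the Extended XPath described above (the fragment and its attributes are invented):

```python
# Sketch of XPath-style addressing: every node in the document tree can
# be reached by a path expression, optionally filtered by a predicate.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<div>"
    "<p n='1'>First paragraph.</p>"
    "<p n='2'>Second <hi>emphasised</hi> paragraph.</p>"
    "</div>"
)

# address every p element anywhere below the root
all_p = doc.findall(".//p")
# address only the p whose n attribute equals 2
second = doc.find(".//p[@n='2']")

print(len(all_p), second.get("n"))
```

It is this addressability that lets a stylesheet or transformation act on one precisely identified part of a document while leaving the rest untouched.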

The difficulty of presenting effectively marked-up text reflects the intellectual challenge of analysing the source material. One of the most useful and positive scholarly aspects of applying markup to a humanities text is simply that it yields a better understanding of the text's complexities and its structure at a very fine level of detail. Another major challenge faced specifically by those preparing digital editions is how to collate the variations of any given text so that a detailed picture of all of the deviations between the different versions becomes apparent. Although this sounds simple in principle, carrying out the task involves considerable work, particularly where there are numerous variants to juggle and where it is difficult to locate and correlate precisely the amendments from one manuscript to another. It was in response to this problem that Peter Robinson first developed the COLLATE program in 1989, when he was working on an edition of two Old Norse poems which existed in forty-four separate manuscripts. He devised a Macintosh font to transcribe the characters of these manuscripts and found it easier to edit the closest existing transcription than to begin the next version from scratch. This principle is akin to editing from a ‘base text’ and informed the subsequent design of the system. The current version of the program can collate 2,000 simultaneous variant texts and is informing the design of EDITION, the next-generation tool that Robinson is developing.
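
The base-text principle can be sketched with the standard library's difflib, which here stands in for, and does not reconstruct, COLLATE's own alignment algorithm: each witness is aligned against a base text and the points of variation reported (the witnesses below are invented):

```python
# Sketch of base-text collation: align each witness against the base
# text word by word and record every point of variation.
# difflib is a stand-in, not Robinson's COLLATE algorithm.
import difflib

base = "the quick brown fox jumps over the lazy dog".split()
witnesses = {
    "MS A": "the quick red fox jumps over the lazy dog".split(),
    "MS B": "the quick brown fox leaps over a lazy dog".split(),
}

variants = {}
for name, witness in witnesses.items():
    matcher = difflib.SequenceMatcher(a=base, b=witness)
    variants[name] = [
        (" ".join(base[i1:i2]), " ".join(witness[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

for name, vs in variants.items():
    print(name, vs)  # (base reading, witness reading) pairs
```

With the alignments in hand, swapping in a different base text is simply a matter of re-running the comparison, which is the recompilation facility both COLLATE and JUXTA provide.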

IATH (the Institute for Advanced Technology in the Humanities) at the University of Virginia has recently announced a new tool called JUXTA, which also carries out collation tasks and was developed principally with nineteenth- and twentieth-century textual material in mind. As well as providing functions similar to COLLATE's, in that variant witnesses can be compared against a base text which can be replaced with an alternative and recollated at any time, it features analytical visualisation tools that include a ‘heat’ map of all textual variants and a histogram of collations. The latter displays the density of all variation from the base text, which is particularly useful for long texts.
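
The idea behind such a density display can be sketched in a few lines by counting variant words per fixed-size block of the base text (invented witness data; difflib again stands in for JUXTA's own collation logic):

```python
# Sketch of a variation-density summary: how many words differ from the
# base text within each fixed-size block. A histogram of these counts
# shows at a glance where a long text is most heavily revised.
import difflib

base = ("it was the best of times it was the worst of times "
        "it was the age of wisdom it was the age of foolishness").split()
witness = ("it was the finest of times it was the worst of days "
           "it was an age of wisdom it was the era of folly").split()

changed = set()
m = difflib.SequenceMatcher(a=base, b=witness)
for op, i1, i2, j1, j2 in m.get_opcodes():
    if op != "equal":
        changed.update(range(i1, i2))  # base-text positions that vary

block = 6  # words per block
density = [sum(1 for i in changed if b <= i < b + block)
           for b in range(0, len(base), block)]
print(density)
```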

Whether the above tools are in fact ‘data preparation’ or ‘query and analysis’ tools is a moot point, but to close this section it is worth mentioning one more influential tool that has a collation module amongst its range of functions. TuStep (TUebingen System of TExt Processing programs) was developed by Wilhelm Ott at the University of Tuebingen in the late 1960s and has been used in a huge number of research projects. Its longevity bears witness to an ongoing need for the functionality it delivers, as well as to the development work that has clearly gone into keeping the system relevant to communities of users engaging with SGML and XML.
