Data Capture (Electronic Texts)

As detailed in one of the other working papers in this series (Tools and Methods for Historical Research), methods for the bulk acquisition of data from printed material into digital formats have now developed to the point where very large-scale archives comprising scores of millions of items are not only a possibility but are becoming a reality. The JISC-funded Eighteenth Century Parliamentary Papers project (http://www.bopcris.ac.uk/18c/), based at the University of Southampton, has installed scanning equipment capable of a throughput of one million items a year, achieved by using vacuum-enabled page-turning equipment and laser-guided edge-detection sensors to identify the borders of pages. In conjunction with automated Optical Character Recognition (OCR) software, the project intends to digitise all the original printed parliamentary materials from the eighteenth century and to make them freely available on the Web.

The levels of automation used on digitisation projects vary widely according to their respective objectives, and there is much debate about which techniques are most appropriate or effective, both for sustaining large amounts of digital archival material far into the future and for delivering resources in the most suitable format right now. The aim of an initiative such as the Carnegie Mellon Universal Library (http://tera-3.ul.cs.cmu.edu/) is to amass a very substantial number of texts that will act as a globally accessible digital repository of written language. The motivation for doing this is expressed in terms of philanthropy and preservation, and as such the catalogue records do not necessarily provide the sort of data that might be useful to someone engaged in advanced literary research. What they do provide, however, is a very useful summary of an object's original and digital properties, together with links to various versions of the text itself, presented either as an image of the original page or transcribed into plain or HTML text.

Fig. 2 A catalogue record from the Carnegie Mellon Universal Library

The acquisition of material for very large initiatives such as this can only realistically be achieved by using ‘rough’ or ‘dirty’ OCR techniques. In projects such as the Making of America (http://www.hti.umich.edu/m/moagrp/) and American Memory (http://memory.loc.gov/ammem/index.html), the search and retrieval functions are based on unchecked OCR-generated text which sits behind images representing the original pages of the object. For the rapid generation of very large archives with limited searching mechanisms this is clearly very effective. The obvious drawback is that unchecked, automatically generated text can only provide approximate retrieval accuracy: words the OCR engine misrecognises are simply invisible to a search.
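To make the trade-off concrete, here is a minimal Python sketch of how search over a ‘dirty OCR’ archive might work. It assumes a hypothetical layout in which each page image sits alongside a plain-text file of its raw OCR output; none of the file names or functions below are taken from the projects mentioned above.

```python
import re
from pathlib import Path

def search_dirty_ocr(archive_dir, query):
    """Search uncorrected ('dirty') OCR text files and return the page
    images whose hidden text layer contains the query term.

    Assumes each page is stored as page-NNNN.jpg alongside a
    page-NNNN.txt file of raw OCR output (a hypothetical layout)."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for txt in sorted(Path(archive_dir).glob("page-*.txt")):
        raw = txt.read_text(encoding="utf-8", errors="ignore")
        if pattern.search(raw):
            # The user is shown the page image, not the noisy OCR text.
            hits.append(txt.with_suffix(".jpg"))
    return hits

# Example: find candidate pages mentioning 'parliament'; misrecognised
# spellings such as 'parliarnent' will simply be missed, which is the
# retrieval penalty of unchecked OCR.
# print(search_dirty_ocr("archive/", "parliament"))
```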

An alternative approach is to key all relevant text manually, and this is the strategy adopted by the Text Creation Partnership (http://www.lib.umich.edu/tcp/), a collaboration based at the University of Michigan that is working with images of texts created by three different projects: Early English Books Online (EEBO - http://eebo.chadwyck.com/home), the Evans Early American Imprint Collection (EVANS - http://ets.umdl.umich.edu/e/evans/), and Eighteenth Century Collections Online (ECCO - http://www.gale.com/EighteenthCentury/). The text of the selected material from these archives is being carefully encoded using XML tags to define various features such as titles, headings, notes, stage directions, captions, acts and verses etc.
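As a rough illustration of what such structural encoding makes possible, the following Python sketch builds a small encoded fragment and then queries a single feature of it; the element names are purely illustrative and do not reproduce any project's actual schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical structural encoding of one short passage of a play.
play = ET.Element("text")
body = ET.SubElement(play, "body")
act = ET.SubElement(body, "div", type="act", n="1")
ET.SubElement(act, "head").text = "Actus Primus"
scene = ET.SubElement(act, "div", type="scene", n="1")
ET.SubElement(scene, "stage").text = "Enter two Gentlemen."
speech = ET.SubElement(scene, "sp")
ET.SubElement(speech, "speaker").text = "1. Gent."
ET.SubElement(speech, "l").text = "You do not meet a man but frowns."

# Because the structure is explicit, a search can be confined to one
# feature of the text, e.g. retrieve only the stage directions:
stage_directions = [s.text for s in play.iter("stage")]
print(stage_directions)
```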

Adopting what might be considered an intermediate approach between these two positions, the Nineteenth Century Serials Edition project, a collaboration between Birkbeck College, the British Library and King’s College London, has teamed up with a commercial company, Olive Software (http://www.uk.olivesoftware.com/conference/Project_Report_A4...), to implement a system that attempts to apply XML encoding automatically to the text contained within scanned images. The software, ActivePaper Archive, uses a combination of intelligent segmentation and an array of algorithms attuned to the difficult range of layouts found in historic newspapers to provide a searchable textbase. The poor-quality printed texts are enhanced using fuzzy logic and probabilistic matching processes, producing significantly more accurate results than standard OCR scanning.
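The details of ActivePaper Archive are proprietary, but the general idea behind probabilistic matching can be sketched in a few lines of Python: noisy OCR tokens are compared against a word list and replaced when a sufficiently close match is found. The lexicon and threshold below are invented purely for illustration.

```python
import difflib

# A toy lexicon; a production system would draw on period dictionaries
# and word-frequency data rather than a hand-picked list.
lexicon = ["parliament", "gazette", "correspondent", "advertisement"]

def correct_token(token, cutoff=0.8):
    """Replace a noisy OCR token with its closest lexicon entry,
    if the match is close enough; otherwise keep the token as read."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token

noisy_line = "Parliarnent gazctte advertisernent"
print(" ".join(correct_token(t) for t in noisy_line.split()))
# -> "parliament gazette advertisement"
```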

Historians refer to the recognition of the importance of the original context of information as a ‘source-oriented’ approach, and it is equally valuable for those involved with literary research to be able to see and analyse the original object in sufficient detail. A sufficiently high-resolution image might, for instance, mean that a scholar can understand some of the physical properties of the object, such as the type of paper used, the method of binding, or the printing technique employed in the production of a book or manuscript. Projects such as DIAMM, the Digital Image Archive of Medieval Music (http://www.diamm.ac.uk/), and the Codices Electronici Sangallenses (CESG) Virtual Library (http://www.cesg.unifr.ch/en/index.htm) are both good examples of online resources that enable advanced research through high-resolution data capture. In the context of text editing, a highly desirable feature is the ability to link a specific part of the image to the corresponding unit of text in a manually or automatically created transcription. Implementing this for an entire page is relatively trivial, but at the level of a single word the manual intervention required to identify the precise part of the image corresponding to every word can be very onerous. One potential solution to this problem is to apply OCR in reverse, whereby the software starts with the disambiguated word and attempts to find a pattern on the image file that corresponds with that text.
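A minimal sketch of this ‘reverse’ linking is given below, assuming the open-source Tesseract engine accessed via the pytesseract library (an assumption for illustration, not a tool named in the paper): the known word from the transcription is compared against the word-level bounding boxes reported by the OCR engine, and the closest match yields the image region.

```python
import difflib
import pytesseract
from PIL import Image

def locate_word(image_path, word):
    """Given a page image and a word known from the transcription,
    return the bounding box (left, top, width, height) of the OCR
    token that most closely resembles it, or None if nothing is close."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    best, best_score = None, 0.0
    for i, token in enumerate(data["text"]):
        if not token.strip():
            continue
        score = difflib.SequenceMatcher(None, token.lower(), word.lower()).ratio()
        if score > best_score:
            best_score = score
            best = (data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i])
    return best if best_score >= 0.7 else None

# e.g. locate_word("page-0012.jpg", "parliament") would return a
# (left, top, width, height) tuple locating the word on the page image.
```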

Link to the full working paper

Re: Data Capture (Electronic Texts)

I am a little surprised that there's no reference here to the pre-eminent standard for deployment of those "XML tags to define various features such as titles, headings, notes, stage directions, captions, acts and verses etc" -- i.e. the Text Encoding Initiative's recommendations (see http://www.tei-c.org).

It's worth pointing out also that the most recent version of the TEI's recommendations includes provision for mixed image/text digital editions of various kinds, based on the experience of several of the projects mentioned here, along with many others.
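For readers unfamiliar with those provisions, the following Python snippet parses a toy fragment in the style of the TEI facsimile module, in which a zone on a page image is linked from the transcription via the facs attribute; the fragment is invented for illustration and is not drawn from any of the editions mentioned.

```python
import xml.etree.ElementTree as ET

# Toy TEI fragment: the <zone> gives pixel coordinates on the page image,
# and the transcribed word points back to it through facs="#z1".
sample = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="page1">
      <graphic url="page1.jpg"/>
      <zone xml:id="z1" ulx="120" uly="340" lrx="410" lry="370"/>
    </surface>
  </facsimile>
  <text>
    <body><p><w facs="#z1">Parliament</w> assembled on Monday.</p></body>
  </text>
</TEI>
"""
ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(sample)
word = root.find(".//tei:w", ns)
zone = root.find(".//tei:zone", ns)
print(word.text, "->", word.get("facs"), zone.attrib)
```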
