Tools and Web Resources (History)

To address the question of what is currently available to historians, one must inevitably turn first to the web for an overview of initiatives that have resulted from recent (and not-so-recent) attempts to use digital tools to break new ground in historical research. One early and very influential project, widely cited as a benchmark for presenting complex historical research material on the web, is the ‘Valley of the Shadow’, a hypermedia archive about northern and southern communities during the American Civil War. It contains more than 100,000 items taken from newspapers, letters, diaries, photographs, maps, church records, population census data, agricultural census data and military records. Hosted by the Virginia Center for Digital History, this project might usefully represent the medium of the web itself as one of the most ubiquitous and widely accepted tools, complementing email as the other major communication tool that presumably all historians now accept as an intrinsic part of their activities.

A number of repository/gateway/portal/listings sites exist to aid historians in finding relevant resources (see below for a representative sample of these sites – all sites active 1 March 2007).

With such a plethora of very effective portal-type tools for finding what amounts to an enormous array of resources for historical research, it is useful to have a recent AHRC-funded report entitled ‘Peer review and evaluation of digital resources for the arts and humanities’ (http://www.history.ac.uk/digit/peer/Survey_report2006.pdf), which is based on a total of 777 responses to a survey circulated to a number of (mostly history-related) mailing lists. Question 6, which received 442 responses, asked respondents to:

Name the three digital resources which you use most often in your own research (such as TLG, EEBO, RHS Bibliography, Old Bailey Proceedings, Oxford DNB; please do not include online journals, library catalogues or search engines). Why are these resources of such value?

The top ten resources listed were as follows:

Table 1 Taken from Peer Review and Evaluation of Digital Resources for the Arts and Humanities (pg.5)

Whilst these ‘resources’ are slightly to one side of the main thrust of this paper, the principal purpose of which is to consider ‘tools’, it is nonetheless illuminating to understand which sites are perceived as particularly valuable. Relevant content will almost always be the principal reason that users keep returning to specific websites, but less important and nonetheless significant enticements will be:

  • The functionality that is available (e.g. search and sort options)
  • The flexibility of the data (e.g. variant spelling detection; text hyperlink to original manuscript image)
  • The robustness and durability of the data (e.g. the technological platform; the encoding standards employed)

Taking some of the above resources as examples, it may be useful to consider some of the tools associated with their development, functionality or implementation to understand what elements other historians might usefully aim to employ in their own ICT-related resource development projects.

The Oxford Dictionary of National Biography uses a very flexible search interface. It offers a ‘type and hope’ simple search field for finding a specific person or word in the full text of the DNB, but also five other search screens incorporating many more fields (sometimes linking to further ‘advanced’ options, e.g. name search). These allow the user to access all manner of gender-related, professional, geographical, chronological, religious and financial information, in addition to image and bibliographical data. One of the comments in the survey pointed out the specific usefulness of this site in creating prosopographical statistics, and it is clear that the online DNB fulfils the criterion often quoted for establishing the legitimacy of a technological resource, i.e. that it allows researchers to access information that would have been impossible or very difficult to discover before that resource became available.

Obviously, one of the critical tasks in assembling a resource of this magnitude (55,800 lives; 63 million words; 10,300 portraits) is the rendering of the original text into a machine-readable format. This is less of a problem for text that has incrementally been transferred into digital formats over the years (which may have been the case with the DNB). It becomes a huge technological challenge, however, when major repositories of primary source material exist only in printed formats and require either manual text transcription or some form of scanning procedure (image and/or optical character recognition) to transform them into digitally accessible archives.

The Eighteenth Century Parliamentary Papers Project, one of the initiatives funded by the Joint Information Systems Committee (JISC) digitization programme, will use very fast bulk scanning equipment that is capable of scanning 600 pages per hour, facilitated by vacuum-enabled page-turning technology and laser-guided edge detection sensors. The size, weight and cost of this equipment (see fig.1) clearly require a long-term institutional commitment and will exceed the strategic requirements of many organisations and projects. It is nonetheless symptomatic of a new approach to resource creation that tools now exist that can process around one million scanned pages a year, using techniques that are sensitive to the fragile nature of the primary sources and mindful of the technical standards intrinsic to sustainable resource creation. At a recent Methods Network workshop, one participant reported that the Open Content Alliance is using this kind of hardware in a multiple-machine facility that perhaps begins to approach the sort of capabilities available to commercial entities such as Google and Microsoft, who are separately in the process of creating colossal digital repositories of books numbering millions of items.

Fig. 1 University of Southampton, BOPCRIS
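
The two throughput figures quoted for the bulk scanner (600 pages per hour and around one million pages a year) can be related with a little arithmetic. The sketch below is only a rough sanity check: the eight-hour shift length is an assumption introduced for illustration and does not come from the project itself.

```python
# Figures quoted in the text above
PAGES_PER_HOUR = 600
TARGET_PAGES_PER_YEAR = 1_000_000

# Assumption for illustration only (not stated by the project)
HOURS_PER_SHIFT = 8

# Scanning hours needed to reach the annual target at the quoted rate
hours_needed = TARGET_PAGES_PER_YEAR / PAGES_PER_HOUR

# The same workload expressed as eight-hour shifts
shifts_needed = hours_needed / HOURS_PER_SHIFT

print(f"{hours_needed:.0f} scanning hours, roughly {shifts_needed:.0f} shifts")
```

On these assumptions, a single machine running at the quoted rate would need somewhat over two hundred full shifts a year, which suggests that the two figures are broadly consistent with one heavily used scanner.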

Whilst different in scale to the commercial initiatives referred to above, Early English Books Online (EEBO) is an important archival record of virtually all books produced in Great Britain, Ireland and British North America before 1700, and also features prominently in the AHRC Peer Review Project (see Table 1). The relevance of this archive to historians extends to an enormous range of subject matter that encompasses royal, governmental and provincial official public documentation as well as a prodigious amount of social historical material relating to all classes of society. Whilst the image scanning of book pages in this project is of enormous significance, it is the related Text Creation Partnership (TCP) initiative that is of further interest in the present context. EEBO-TCP is managed by the Universities of Michigan and Oxford and financially supported by over seventy other institutions to produce manually keyed and SGML/XML encoded text editions of a significant portion of the EEBO corpus. The aim is to create full text editions of about 20% of the works represented by images in the archive, thereby enabling search functionality that goes well beyond keyword-type linguistic analysis processes (e.g. concordance, collocation and clustering), highly useful as those are for historians as well as those involved with diachronic linguistics. As a result of the textual encoding, however, searches can also be formulated that, for instance, only search for proper names, or only search in the marginalia, or perhaps only search for non-English terms that appear in stage directions.
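
To illustrate how structural encoding makes this kind of scoped searching possible, the following is a minimal sketch in Python using the standard library's XML parser. The sample fragment and its element names (`name`, `note`, `stage`, `foreign`) are simplified assumptions loosely modelled on TEI-style markup. They are not the actual EEBO-TCP schema, and `search_in` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the tag names are illustrative only
SAMPLE = """
<text>
  <p>The <name>London</name> merchants met at <name>Dover</name>.</p>
  <note place="margin">See the statute concerning Dover.</note>
  <stage>Exit <foreign>omnes</foreign></stage>
  <p>A writ of <foreign>habeas corpus</foreign> was sought.</p>
</text>
"""

def search_in(root, path, term):
    """Return the text of each element matched by `path` that contains `term`."""
    term = term.lower()
    hits = []
    for element in root.findall(path):
        content = "".join(element.itertext())  # text including nested elements
        if term in content.lower():
            hits.append(content.strip())
    return hits

root = ET.fromstring(SAMPLE.strip())

# Search only within encoded proper names
proper_names = search_in(root, ".//name", "dover")

# Search only within marginal notes
marginalia = search_in(root, ".//note[@place='margin']", "dover")

# List non-English terms that occur inside stage directions
# (an empty search term matches every such element)
stage_foreign = search_in(root, ".//stage/foreign", "")
```

A plain keyword search for ‘dover’ would return both the running text and the marginal note without distinction; here the markup is what allows each query to be restricted to one kind of textual feature.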

The decision by the Text Creation Partnership to rely on laborious methods of manual keying and encoding clearly has a critical effect on what it is possible to achieve given certain funding criteria. It contrasts with the above-mentioned Eighteenth Century Parliamentary Papers project, which declares on its website that Abbyy OCR (optical character recognition) software will be used in conjunction with the Agora Content Management System to automate as much of the data management as possible. Given that there is no single recommended solution for major projects like these, it is a useful reminder that the best strategic methodologies and the most appropriate tools need to be defined according to the nature of the work being undertaken. This might encompass cases where the preferred methodology is a hybrid solution that combines automation with feedback from human experts, improving the chances of more accurate scanning as the project progresses and as the system learns from its mistakes.

The Gamera project is an open source initiative to supply a programming framework for building recognition systems for ‘difficult’ historical documents. It is envisaged as a tool that programmers will use in conjunction with subject experts, and it supports an iterative test-and-refine model, a graphical user interface and a modular plug-in architecture that is capable of performing five separate document recognition tasks:

  • pre-processing
  • document segmentation and analysis
  • symbol segmentation and classification
  • syntactical or structural analysis
  • output

Gamera stores output in an XML-based file format, which enhances its interoperability with other systems and is also in keeping with its open source ethos.
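
The five tasks listed above can be pictured as a chain of interchangeable stages. The sketch below is a toy illustration of that modular idea in plain Python, working on strings rather than images. It does not use Gamera's actual API, and every function name, the ‘~’ noise marker and the output element names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Toy "page": each string is one scanned line; '~' stands in for scanner noise
PAGE = ["An~no  Dom~ini", "16~42"]

def preprocess(lines):
    # Stage 1: remove noise and normalise whitespace
    return [" ".join(line.replace("~", "").split()) for line in lines]

def segment_document(lines):
    # Stage 2: split the page into regions (here, one region per line)
    return [{"text": line} for line in lines]

def classify_symbols(regions):
    # Stage 3: split each region into symbols (here, words) and label them
    for region in regions:
        region["symbols"] = [
            {"value": word, "kind": "number" if word.isdigit() else "word"}
            for word in region["text"].split()
        ]
    return regions

def analyse_structure(regions):
    # Stage 4: infer a structural role from the labelled symbols
    for region in regions:
        kinds = {symbol["kind"] for symbol in region["symbols"]}
        region["role"] = "date" if kinds == {"number"} else "text"
    return regions

def to_xml(regions):
    # Stage 5: serialise the result as XML
    root = ET.Element("page")
    for region in regions:
        element = ET.SubElement(root, "region", role=region["role"])
        element.text = region["text"]
    return ET.tostring(root, encoding="unicode")

# Each stage is a plug-in: any entry can be replaced or refined on its own
PIPELINE = [preprocess, segment_document, classify_symbols,
            analyse_structure, to_xml]

def run(page, stages=PIPELINE):
    data = page
    for stage in stages:
        data = stage(data)
    return data

result = run(PAGE)
print(result)
```

Because each stage only depends on the output of the one before it, a subject expert's feedback can be used to refine a single stage (for example, the symbol classifier) and the pipeline re-run without touching the rest, which is the essence of a test-and-refine approach.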
