Data Mining (History)

Having considered issues to do with the presentation and structuring of data, the former principally in the form of web resources and the latter as database and textual information systems, the next stage of the information cycle involves the querying and analysis of that data. One deceptively simple experimental project, H-Bot (http://chnm.gmu.edu/tools/h-bot/), is an example of a ‘question answering’ (QA) system: it accepts natural-language questions and attempts (in one of its modes) to return exact answers by using the Google index to find statistically likely matches on the keywords that the user has entered. The system performs reasonably well on certain types of factual question, and its creators argue that further development work on the underlying rule sets would improve its capabilities. Designing QA systems, according to Dan Cohen, ‘exercises almost all of the computational muscles’, involving as it does search methodologies, document classification, question interpretation (natural language processing), and statistical and linguistic text analysis.
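
The underlying approach can be sketched in a few lines of Python (an illustration of the general technique rather than H-Bot’s actual implementation): reduce the question to content-bearing keywords, gather candidate text snippets from a search service, and count which candidate answers co-occur most often with those keywords. The stopword list and the year-shaped answer pattern below are simplifying assumptions, and the snippets would in practice come back from a search engine’s API.

    import re
    from collections import Counter

    STOPWORDS = {"when", "what", "who", "where", "did", "the", "was", "is",
                 "a", "an", "of", "in", "on", "and", "to"}

    def extract_keywords(question):
        """Strip question words and stopwords, keeping content-bearing terms."""
        tokens = re.findall(r"[a-z0-9]+", question.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def likely_answer(question, snippets, answer_pattern=r"\b1[0-9]{3}\b"):
        """Count candidate answers (here: four-digit years) in snippets that
        mention the question keywords, and return the most frequent one."""
        keywords = set(extract_keywords(question))
        counts = Counter()
        for snippet in snippets:
            text = snippet.lower()
            # Only score snippets that mention at least one question keyword.
            if any(k in text for k in keywords):
                counts.update(re.findall(answer_pattern, text))
        return counts.most_common(1)[0][0] if counts else None

    # In practice these snippets would be returned by a search API.
    snippets = [
        "Abraham Lincoln was born on February 12, 1809, in Kentucky.",
        "Lincoln, born in 1809, became the 16th president of the United States.",
        "The year 1865 saw the end of the American Civil War.",
    ]
    print(likely_answer("When was Abraham Lincoln born?", snippets))  # -> 1809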

Cohen has also used a similar approach with another data mining system, Syllabus Finder, which uses a keywords-in-context (KWIC) approach to return highly relevant results relating to syllabi. A relevant keyword set is first determined by word-frequency analysis of a known set of syllabus documents; the search term entered by the user is then optimised by the addition of these terms, and the bundled query interrogates Google’s API service (as well as a locally stored database), which returns search results in a SOAP envelope (an XML schema used for server-to-server communications). Additional statistical and expression-matching analysis is then carried out on this dataset, resulting in highly relevant and specific output that can identify college or university names and assigned book titles. These and other tools created at CHNM are complex in their conception but are designed with simple interfaces, and they have the feel of small-group developmental research systems constructed to address particular user-defined needs.
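
A hedged sketch of this kind of query bundling and expression matching is given below; the ‘syllabus vocabulary’, the institution-name pattern and the book-title pattern are invented for illustration and are far simpler than the tool’s actual rule set.

    import re

    # Stand-in for the high-frequency syllabus vocabulary that the real tool
    # derives from word-frequency analysis of a known corpus of syllabi.
    SYLLABUS_TERMS = ["syllabus", "course", "readings", "required texts",
                      "office hours", "assignment"]

    def build_query(user_terms):
        """Bundle the user's search terms with syllabus-specific vocabulary so a
        general-purpose search API is steered towards syllabus-like documents."""
        quoted = " ".join('"%s"' % t for t in SYLLABUS_TERMS[:4])
        return " ".join(user_terms) + " " + quoted

    def extract_metadata(document):
        """Rough expression matching for institution names and assigned books."""
        institutions = re.findall(r"\b[A-Z][A-Za-z]+ (?:University|College)\b", document)
        # Assumes books are cited as 'Author, Title (Year)'; real syllabi need
        # far looser patterns than this.
        books = re.findall(r"[A-Z][a-z]+, ([A-Z][^,(]{3,60}) \(\d{4}\)", document)
        return {"institutions": institutions, "books": books}

    doc = ("History 210: The Early Republic. Offered at Example University. "
           "Required texts: Wood, The Radicalism of the American Revolution (1991).")
    print(build_query(["early", "republic", "history"]))
    print(extract_metadata(doc))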

A more widely collaborative data mining initiative, currently focused on information retrieval in the area of literary studies, is the NORA project, the objective of which is ‘to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries.’ (http://www.noraproject.org/description.php) The software is being developed by a consortium of organisations involving subject specialists and computing scientists, and the demo version of the system reflects a similarly distributed architecture. The user interface runs as a Java Webstart application delivered to the user from one server; the relational database management system (Tamarind) runs on another server; and the D2K data-mining framework currently runs on a third. The D2K (Data-to-Knowledge) framework supports other projects, T2K (Text-to-Knowledge) and M2K (Music-to-Knowledge), all of which are being developed to play a significant new role in how the Humanities Computing community works together to build, use and share tools.

Fig. 2 Screenshot from the ‘Social Network Demo’ available from the Nora Project

The principle is that ‘modules’ of well-defined, reusable code can be implemented as single or nested components to carry out a wide variety of functions – which in the case of the T2K framework includes a rich set of natural-language pre-processing tools for lemmatization, tokenization, part-of-speech tagging, data cleaning and named-entity extraction.
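
In Python, the module-composition idea might be sketched as follows; this is not the D2K or T2K API, merely an illustration of nesting small, reusable components into a larger pre-processing pipeline.

    import re

    def tokenize(doc):
        """Split the raw text into word tokens."""
        doc["tokens"] = re.findall(r"[A-Za-z']+", doc["text"])
        return doc

    def clean(doc):
        """Simple data cleaning: lowercase and drop one-character tokens."""
        doc["tokens"] = [t.lower() for t in doc["tokens"] if len(t) > 1]
        return doc

    def lemmatize(doc):
        """Crude lemmatization by suffix stripping; a real module would use a lexicon."""
        doc["lemmas"] = [re.sub(r"(ing|ed|s)$", "", t) for t in doc["tokens"]]
        return doc

    def pipeline(*modules):
        """Nest single modules into a reusable composite module."""
        def run(doc):
            for module in modules:
                doc = module(doc)
            return doc
        return run

    preprocess = pipeline(tokenize, clean, lemmatize)
    print(preprocess({"text": "The walls of the city were rebuilt in 1667."}))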

In the context of the NORA project, the concept is to build an application on the D2K framework which retrieves ‘only what’s needed’ from large digital repositories (of XML-encoded text) and places that information into a database, thereby cutting down on the processing overhead of having to deal with large amounts of redundant or null data. Once the data has been collected, an additional function of NORA is to provide visualizations of its analysis, giving researchers clearer insights into a range of issues including social networks, content overviews and document classification (see Fig. 2).
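
A minimal sketch of the ‘only what’s needed’ principle, assuming a hypothetical TEI-like markup and an in-memory SQLite database, might look like this: only the attributes required for analysis are extracted, and the bulk of each document is ignored.

    import sqlite3
    import xml.etree.ElementTree as ET

    # Hypothetical TEI-like fragment; real repository documents are far larger.
    xml_doc = """
    <text>
      <div type="letter" who="Austen, Jane" when="1814-03-02">
        <p>My dear Cassandra, ...</p>
      </div>
      <div type="letter" who="Austen, Cassandra" when="1814-03-09">
        <p>My dear Jane, ...</p>
      </div>
    </text>
    """

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE letters (author TEXT, sent TEXT)")

    # Pull out only the attributes the analysis needs, ignoring the text body.
    root = ET.fromstring(xml_doc)
    for div in root.iter("div"):
        if div.get("type") == "letter":
            conn.execute("INSERT INTO letters VALUES (?, ?)",
                         (div.get("who"), div.get("when")))

    # The slimmed-down table is what visualisation or network analysis would query.
    print(conn.execute("SELECT author, COUNT(*) FROM letters GROUP BY author").fetchall())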

A project based at Sheffield University is also looking at the problem of mining information intelligently from distributed data sources, and is specifically addressing how to deal with the sort of fuzzy and ambiguous data that often characterises historical information. The ARMADILLO project is based on twelve datasets that contain information about eighteenth-century London, and the objective is to apply probabilistic disambiguation methods (using pre-determined algorithms) to the heterogeneous data, which is then mapped to a fairly simple ontology consisting of the following entities:

  • Name
  • Role
  • Place (i.e. place name which may be subject to a change of location)
  • Location (associated with a map reference)
  • Time (either point in time or time period)
  • Resource

Users would then be able to turn these categories ‘on’ or ‘off’ depending on the nature of their search and the likelihood of returning meaningful results, allowing for much finer control of the data retrieval process. The stated aim of this project is to apply as much ‘knowledge’ as possible to the initial processing of the information, so that the likelihood of error introduced by synonyms, homonyms, variant historical spellings, inconsistencies arising from human error, and so forth, is minimised when records are retrieved for analysis.
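
The general idea can be illustrated with a short sketch; the sample records, similarity measure and threshold are invented, far cruder than ARMADILLO’s own disambiguation methods, and only a few of the ontology categories are shown. Variant spellings are scored with a string-similarity measure, and the search is restricted to whichever categories the user has switched ‘on’.

    from difflib import SequenceMatcher

    # Invented sample records mapped to a subset of the ontology described above.
    records = [
        {"Name": "Jno. Smyth",    "Role": "victualler", "Place": "Cheapside",    "Time": "1745"},
        {"Name": "John Smith",    "Role": "victualler", "Place": "Cheapside",    "Time": "1747"},
        {"Name": "Joan Smithson", "Role": "milliner",   "Place": "Spitalfields", "Time": "1745"},
    ]

    def similarity(a, b):
        """Crude score for variant spellings; ARMADILLO's probabilistic
        disambiguation rules would be considerably more sophisticated."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def search(query, active_fields, threshold=0.6):
        """Match the query only against the categories the user has turned 'on'."""
        hits = []
        for record in records:
            score = max(similarity(query, record[f]) for f in active_fields)
            if score >= threshold:
                hits.append((round(score, 2), record))
        return sorted(hits, key=lambda h: h[0], reverse=True)

    # With only the Name category switched on, variant spellings still surface.
    for score, record in search("John Smith", active_fields=["Name"]):
        print(score, record["Name"], record["Time"])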
