Information Retrieval

The term Information Retrieval (IR) covers a multitude of processes and techniques but is often used in contexts where there is significant input from ‘computing’ (rather than ‘information’) science. Eidenberger presents evidence (based on papers relating to image and video retrieval deposited with the IEEE digital library) suggesting that the number of papers addressing IR research submitted to scientific journals has been steadily increasing over the last twenty years. The field also has a number of dedicated conferences, e.g. ECIR (European Conference on Information Retrieval), ISMIR (International Conference on Music Information Retrieval) and CIVR (ACM International Conference on Image and Video Retrieval), so it is clear that research in this area is thriving and offers attractive challenges to computer science.

Corporate-scale search engines such as Google and Yahoo inevitably dominate the wider information retrieval landscape, and for many users these tools are their first (and sometimes their last!) resort. There are valid concerns that where some kind of authorisation or an acknowledgement of terms and conditions is required to access data, web crawlers are unable to index the material, some of which is held within HE sector institutional digital repositories, resulting in a great mass of data that is ‘hidden’ from general Internet browsing. With the advent of OAI-PMH compliant repositories, however, the major search engines are increasingly able to include these ‘deep web’ resources in their search results. Whereas other search engines rely on OAI-PMH repositories choosing to register, Yahoo has set up an agreement with OAIster, a search service targeted specifically at OAI repositories, to share access to a wide range of academic material hosted by libraries and institutions all over the world. A study carried out in 2006 examined the 3.3 million unique web resources described in the Dublin Core metadata available from OAI-PMH repositories and reported the following coverage by three search engines (a resource may be indexed by more than one):

  • 65% - Yahoo
  • 44% - Google
  • 7% - MSN
  • 21% - Material not indexed by the above search engines
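The OAI-PMH exchange that makes such harvesting possible is plain XML over HTTP. The sketch below parses Dublin Core records out of a ListRecords response; the repository identifier and titles in the sample are invented for illustration, and a real harvester would fetch the XML from a repository's OAI-PMH endpoint and handle resumption tokens.

```python
import xml.etree.ElementTree as ET

# A minimal OAI-PMH ListRecords response with one oai_dc record.
# The identifier and title are invented stand-ins.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.repo:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Working Paper</dc:title>
          <dc:creator>A. Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def extract_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.findall(".//oai:record", NS):
        ident = record.findtext(".//oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((ident, title))
    return results

print(extract_records(SAMPLE_RESPONSE))
```

It is this uniform record structure that lets a harvester such as OAIster aggregate metadata from thousands of repositories without knowing anything about their internal organisation.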

Integration of the retrieval power of major search engine technology has been a feature of some academic projects, an example of which is Syllabus Finder, developed by Dan Cohen at the CHNM (Center for History and New Media). Keywords characteristic of syllabi are identified by word frequency analysis on a set of documents that are known to be relevant to the topic. These keywords are then bundled up with the user's desired search term and submitted to the Google Custom Search Engine, which then delivers highly relevant results. An augmented or intelligent search process such as this might begin to define the broad area of research known as ‘data mining’ (or, where the data is text, ‘text mining’).
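The keyword-bundling idea can be illustrated with a toy sketch. The documents and stopword list below are invented stand-ins, and Syllabus Finder's actual implementation differed; the point is only the general technique of deriving query terms from a known-relevant corpus.

```python
from collections import Counter
import re

# Toy stand-ins for a set of documents known to be relevant (syllabi).
documents = [
    "This course syllabus lists weekly readings and assignments.",
    "Syllabus: readings, grading policy, and course schedule.",
    "Course schedule with assignments, readings and exams.",
]

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"this", "and", "with", "the", "a", "of"}

def top_keywords(docs, n=3):
    """Rank words by raw frequency across the document set."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in re.findall(r"[a-z]+", doc.lower())
                      if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

def build_query(user_term, docs):
    """Bundle corpus-derived keywords with the user's search term."""
    return " ".join([user_term] + top_keywords(docs))

print(build_query("medieval history", documents))
```

The resulting query carries the vocabulary of the target genre alongside the user's topic, which is what biases a general-purpose search engine towards syllabus-like pages.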

Witten et al. describe the use of the GATE (General Architecture for Text Engineering) development environment in combination with the Greenstone Digital Library System and conclude that the added text mining functionality that GATE provides can be embedded very successfully. Greenstone features a modular architecture which accommodates plug-ins for various functions, including a number of text mining subsystems. One example is a process that extracts acronyms and their definitions from the full text of a collection. Another extracts key phrases contained in the text of documents and adds them as metadata. A third ‘computes a hierarchy of all phrases contained in the text of the documents and allows the user to browse it, optionally in conjunction with a standard thesaurus’. GATE, however, provides richer functionality incorporating many tasks associated with linguistics research, e.g. part-of-speech tagging, tokenization and sentence splitting, all functions which can assist the semantic tagger, a central feature of the system. It also provides access to resources such as lexicons and ontologies and comes with its own lightweight information extraction system called ANNIE, a component which, amongst other tasks, is able to detect person and organization names, geographical locations, dates, times and money amounts. Witten et al. conclude that by customizing a digital library with enhanced text mining capabilities, users can experience the type of advantages that the ‘semantic web’ promises.
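Acronym extraction of the kind Greenstone supports can be sketched with a simple heuristic: find a parenthesised uppercase token, then walk backwards through the preceding words matching initials. This is an illustrative simplification, not Greenstone's actual algorithm.

```python
import re

# Find "Long Form (ACRO)" patterns: a run of words immediately followed
# by a parenthesised uppercase token of two or more letters.
PATTERN = re.compile(r"((?:[A-Za-z][\w-]*\s+)+)\(([A-Z]{2,})\)")

def align(tokens, acronym):
    """Walk backwards matching acronym letters to word initials,
    skipping short filler words such as 'on' or 'for'.
    Returns the index where the definition starts, or None."""
    i = len(acronym) - 1
    for pos in range(len(tokens) - 1, -1, -1):
        token = tokens[pos]
        if token[0].upper() == acronym[i]:
            i -= 1
            if i < 0:
                return pos
        elif len(token) > 3:  # a long non-matching word breaks the span
            return None
    return None

def extract_acronyms(text):
    """Return a {acronym: definition} mapping found in the text."""
    found = {}
    for words, acronym in PATTERN.findall(text):
        tokens = words.split()
        start = align(tokens, acronym)
        if start is not None:
            found[acronym] = " ".join(tokens[start:])
    return found

sample = ("Relevant venues include the European Conference on Information "
          "Retrieval (ECIR); Syllabus Finder was built at the Center for "
          "History and New Media (CHNM).")
print(extract_acronyms(sample))
```

In a digital library setting the extracted pairs would then be attached to documents as metadata, exactly as the key-phrase plug-in does, making the definitions searchable and browsable.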

The U.S.-based NORA project applies data mining techniques to a collection of literary texts and uses the D2K (Data to Knowledge) framework developed at the University of Illinois’ National Center for Supercomputing Applications (NCSA).
‘Data to Knowledge is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection, with data and information visualization tools.’

Fuller references to the architecture of this system can be found elsewhere, but it may be useful to highlight the visualization feature that is a component part of this system (see fig.5). This paper, for reasons of brevity and to avoid encroaching on territory covered by other working papers, has largely conceptualized the ‘library’ as a collection of texts that can be managed, manipulated and visualized using text-specific tools. NORA and a host of other tools offer users alternative ways of visualizing the results of data analysis processes.

Fig.5 Screenshot from NORA social network demo


There is a detailed chapter on Information Retrieval in DigiCULT Technology Watch No. 3; it is a few years old now but still very relevant.
