Corpus Linguistics (Linguistics)

In many cases, discussion of tools development within the context of linguistics takes for granted that those tools will have suitable datasets to interact with, process and/or analyze. The key activity underpinning much of this work is corpus construction. The aggregation of machine-readable text from a variety of sources - transcriptions of spoken language, literature, specialist publications, mass-circulation periodicals and many others - has given rise to the sub-discipline of corpus linguistics, a field of research that goes back to the early 1960s with the creation of the first modern, electronically readable corpus, the Brown Corpus of Standard American English. Forty years later, there is a substantial collection of corpora available in a wide range of specialist and non-specialist areas, and many of these collections of text are freely available, or available via enquiry or subscription, for researchers to analyze directly or to use as reference corpora when comparisons are required with a larger body of language.

At the Methods Network expert seminar on linguistics in 2005, copious examples of using corpora for research were cited, mostly in the context of using word frequencies to illuminate aspects of particular texts. One of the significant tools mentioned in this forum was Wordsmith, developed by Mike Scott, which contains all the functionality one might anticipate from a corpus analysis tool, including wordlist creation, concordancing, word clustering, collocation and lemmatization. This software has been in development since 1996 (the current iteration is version 4.0) and is in wide use by a large number of organisations and projects, including the Oxford University Press, which uses it for its own lexicographical work when building English and foreign-language dictionaries - thereby taking advantage of Wordsmith's support for Unicode.

The way that a programme such as Wordsmith might be used - in conjunction with additional subsequent tools and methods - is exemplified by Paul Baker in a paper that analyses linguistic elements in transcriptions from a debate in the House of Commons about fox hunting. By using the keyword list function in Wordsmith, Baker created two separate lists that related to speeches made by the pro- and anti-hunt lobbies. By comparing the two lists and the occurrence of keywords in both, Baker was able to examine which words were more 'key' (i.e. more relevant) to one text rather than the other, and was then able to discover the context of these words by using concordancing and collocational analysis tools. He cites an interestingly disproportionate use of the word 'criminal' by the pro-hunt lobby and draws out conclusions as to why this mode of speech might suit that particular agenda. He then describes comparing both lists against the FLOB Corpus (the Freiburg-Lancaster-Oslo/Bergen Corpus: 1 million words of 1990s British English), in order to discover whether both sides of the debate were using words that occur more often than one might expect in 'standard' British English usage. The most significant word highlighted by this stage of analysis turned out to be the word 'cruelty' - used with almost equal frequency by both lobbies and therefore not picked up by a comparison between the two sub-texts alone.
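The keyword comparison Baker describes rests on a 'keyness' statistic computed for each word across the two sub-corpora; tools such as Wordsmith typically offer log-likelihood or chi-square for this. The following is a minimal Python sketch of log-likelihood keyness, with invented toy word lists standing in for the actual debate transcripts:

```python
import math
from collections import Counter

def keyness(word, freq_a, freq_b, size_a, size_b):
    """Log-likelihood keyness of `word` between two corpora: how far its
    observed frequencies depart from what equal usage would predict."""
    a, b = freq_a[word], freq_b[word]
    # Expected frequencies under the null hypothesis that the word is
    # used at the same rate in both corpora.
    e_a = size_a * (a + b) / (size_a + size_b)
    e_b = size_b * (a + b) / (size_a + size_b)
    ll = 0.0
    for obs, exp in ((a, e_a), (b, e_b)):
        if obs > 0:
            ll += obs * math.log(obs / exp)
    return 2 * ll

# Toy stand-ins for the pro- and anti-hunt speeches (not real data).
pro = Counter("the criminal law makes hunting a criminal act".split())
anti = Counter("the cruelty of hunting is the real issue".split())
n_pro, n_anti = sum(pro.values()), sum(anti.values())

score = keyness("criminal", pro, anti, n_pro, n_anti)
```

A word used disproportionately by one side ('criminal' here) scores higher than a word both sides use at similar rates ('the'), which is exactly why 'cruelty', used equally by both lobbies, is invisible to this comparison.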

In a further refinement of his analysis, which also usefully illustrates another common exploitation of corpus information, he then used a semantic tagger - in this case the UCREL Semantic Analysis System (USAS) - to examine how words with similar meanings might aggregate to become significantly disproportionate in their usage. Certain low-frequency words might be represented by a number of synonyms which, if collated together, might increase their 'keyness' to the point where they become a significant analytical component of the text. Antonyms are also pertinent to the analysis, as words posed in direct opposition to each other are still conceptually connected, even if they represent two sides of the same argument.

Table 2 - Semantically Linked Word Concordance (from presentation by Paul Baker)

As can be seen in Table 2, this can be accommodated by the USAS system, which not only categorises the word 'reasonable' as being usefully related to the concept of being 'rational', but also characterises those words as having an antonymic relationship with words such as 'absurd' and 'illogical'.
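The aggregation step that a semantic tagger performs can be sketched very simply: map each word to a semantic category and count categories rather than words. The mini-lexicon below is hypothetical (real USAS tags are codes such as 'A5.1+' drawn from a large curated lexicon), but it shows how individually low-frequency synonyms combine into a category frequency large enough to register as key:

```python
from collections import Counter

# Hypothetical mini-lexicon; the real USAS lexicon is far larger and uses
# coded tags rather than these illustrative labels.
sem_lexicon = {
    "reasonable": "RATIONAL", "rational": "RATIONAL", "sensible": "RATIONAL",
    "absurd": "RATIONAL-", "illogical": "RATIONAL-", "ridiculous": "RATIONAL-",
}

def tag_frequencies(tokens):
    """Collapse individual word counts into semantic-category counts."""
    cats = Counter()
    for tok in tokens:
        tag = sem_lexicon.get(tok.lower())
        if tag:
            cats[tag] += 1
    return cats

tokens = "a reasonable and sensible ban versus an absurd illogical ban".split()
cats = tag_frequencies(tokens)
# Each synonym occurs only once, but each category aggregates to 2.
```

Note how the antonym pair ends up in two opposed but related categories, mirroring the 'rational' versus 'absurd'/'illogical' relationship in Table 2.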

The corpus that Baker refers to in his paper contains just 129,798 words and is, relatively speaking, tiny in comparison to the types of corpora that researchers use to analyze more general usage of language. It is clear however that in this case, the corpus has been designed to address very specific research questions and it is logical that the size of the corpus that one constructs should be appropriate for, and representative of, the type of information that one is seeking to analyze. In relation to a limited study of textual information where the object of the exercise is to drill down into texts to reveal specific instances of word usage via frequency lists, concordances and citations, the easy-to-use and freely available tool TextSTAT is a very useful starting point for constructing ‘home-made’ corpora. It even features an advanced query editing function which allows two terms (with wildcard functionality) to be entered with specifiable minimum and maximum word distances between them, thereby allowing collocational analysis of the texts.
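The advanced query function described above - two terms, wildcards, and a specifiable word distance between them - can be approximated in a few lines of Python. This is a simplified sketch of the general technique, not TextSTAT's actual implementation:

```python
import re

def proximity_hits(tokens, pat1, pat2, min_dist=1, max_dist=5):
    """Return (i, j) token positions where a token matching pat2 follows a
    token matching pat1 within [min_dist, max_dist] words.
    A '*' in a pattern acts as a wildcard."""
    rx1 = re.compile(pat1.replace("*", ".*") + "$", re.IGNORECASE)
    rx2 = re.compile(pat2.replace("*", ".*") + "$", re.IGNORECASE)
    hits = []
    for i, tok in enumerate(tokens):
        if rx1.match(tok):
            for j in range(i + min_dist, min(i + max_dist + 1, len(tokens))):
                if rx2.match(tokens[j]):
                    hits.append((i, j))
    return hits

# Invented example sentence for illustration.
text = "the hunt ban was a criminal ban on hunting with dogs".split()
hits = proximity_hits(text, "hunt*", "ban", 1, 3)
```

Here 'hunt*' matches both 'hunt' and 'hunting', and only the 'ban' falling within the specified distance window is reported, which is the basis of simple collocational analysis over a home-made corpus.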

Towards the other end of the scale are what are known as 'mega-corpora', also called second-generation corpora to distinguish them from the collections put together in the 1960s, 70s and 80s. The British National Corpus (BNC) is a notable example of this type of undertaking and is 'designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written.' This monumental resource of 100 million words provides researchers with a wealth of information extracted from a wide variety of sources:

'The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text.'

The spoken component comes from equally diverse sources, and all of this information is annotated according to Text Encoding Initiative (TEI) guidelines, which provide contextual information about the content in the form of metadata, as well as allowing data about the structural properties of the text to be included, such as part-of-speech analysis - carried out by the CLAWS tagging system (the Constituent Likelihood Automatic Word-tagging System) developed at the University of Lancaster.
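A schematic, simplified fragment may help to visualise this kind of annotation. The element and attribute names below follow the general shape of TEI-style word-level markup with CLAWS-like tags, but the exact BNC schema differs in detail:

```xml
<!-- Illustrative only: real BNC markup carries further attributes
     (headword, simplified POS class, etc.) on each <w> element. -->
<s n="1">
  <w c5="AT0">The</w>
  <w c5="NN1">hunt</w>
  <w c5="VBD">was</w>
  <w c5="AJ0">cruel</w>
</s>
```

Each word is wrapped in its own element, so both the metadata about the text and the grammatical analysis of every token travel with the corpus itself rather than in a separate file.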

The BNC can be queried using the original query system, SARA, which provides users with word and phrase search functions, concordancing and collocation features, but does not allow searches by part of speech or output in the form of charts. The updated query system, XAIRA, is a fully functional, general-purpose XML search engine with full Unicode support that can be used on any corpus of well-formed XML documents, though it has principally been designed for use with the BNC-Baby and BNC-Sampler corpora. A new edition of the BNC features the whole corpus in XML format, which replaces the original SGML annotation and allows for greater interoperability with most software and closer alignment with current development methods, e.g. the use of DTDs (Document Type Definitions) and XSLT (Extensible Stylesheet Language Transformations).
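One practical benefit of the move to XML is that a part-of-speech search - the kind SARA could not perform - becomes a one-line XPath-style query in any XML-aware tool. A small Python sketch over an invented, simplified corpus fragment (the real BNC-XML element names and attributes differ):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for an XML-annotated corpus; element and attribute
# names are illustrative, not the actual BNC-XML schema.
doc = ET.fromstring(
    '<text>'
    '<s><w pos="ART">The</w><w pos="NOUN">hunt</w><w pos="VERB">began</w></s>'
    '<s><w pos="NOUN">Cruelty</w><w pos="VERB">matters</w></s>'
    '</text>'
)

# Search by part of speech: every word tagged as a noun, in any sentence.
nouns = [w.text for w in doc.findall('.//w[@pos="NOUN"]')]
```

Because the query language is generic XML rather than a corpus-specific format, the same approach works on any well-formed XML corpus, which is precisely the interoperability gain described above.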

The BNC supports a wide range of activities, including reference book publishing, linguistic research, artificial intelligence, natural language processing, English language teaching and, in at least one instance, a work of Internet Art: Wordcount uses relative word frequencies to rank 86,800 tokens from the BNC in descending order, beginning with the most popular word in the corpus (which happens to be 'the') down to the least used (which, for interest, is 'conquistador'). At the scholarly level, tools developed to work with the BNC include the VIEW system (Variation in English Words and Phrases), developed by Mark Davies at Brigham Young University, which offers users an enhanced search interface onto the BNC corpus, allowing - amongst other things - the specification of search terms in and across specific registers (i.e. spoken, academic, poetry, medical, etc.). In addition to the very fast searching mechanism facilitated by the use of linked database tables and SQL queries, it also provides the possibility of conducting searches using synonyms and semantically related terms, the latter made possible in conjunction with WordNet, a semantically organized lexicon of English, freely available and hosted at Princeton.
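The database-backed approach behind VIEW (and, implicitly, behind Wordcount's frequency ranking) can be illustrated with an in-memory SQLite table. The schema and the frequency figures here are invented for illustration; they are not VIEW's actual tables or real BNC counts:

```python
import sqlite3

# Illustrative frequency table; real systems index tens of millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE freq (word TEXT PRIMARY KEY, count INTEGER)")
conn.executemany(
    "INSERT INTO freq VALUES (?, ?)",
    [("the", 6000000), ("of", 2900000),
     ("and", 2600000), ("conquistador", 2)],
)

# Wordcount-style ranking: every word ordered by descending frequency.
ranked = [row[0] for row in
          conn.execute("SELECT word FROM freq ORDER BY count DESC")]
```

Once the corpus is reduced to indexed tables like this, ranking, register filtering and synonym expansion all become ordinary SQL queries, which is what makes VIEW's searches so fast.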

Other mega-corpora include COBUILD (a.k.a. the Bank of English), which is defined as a 'monitor corpus' in that it is designed to continue growing in size to reflect the condition of the English language as new words appear and others fall out of favour. As of December 2004, the size of this corpus was reported to have reached 524 million words. Another significant project is the International Corpus of English (ICE), a collection of initiatives to document the English language as it is spoken in different countries around the world. ICE-GB (the British component of ICE) is distributed with the retrieval software ICECUP (International Corpus of English Corpus Utility Program), which facilitates the querying of parsed corpora. One further important corpus currently in development is the American National Corpus (ANC), which aims to emulate the BNC in terms of its size and scope and will presumably become as influential a research tool as its British counterpart. Links to many other corpora, including non-English, parsed, historical, subject-specific, spoken and specialised examples, are available online.
