Stochastic Methods (Linguistics)

The early development of stochastic (mathematical and statistical) methods of analysing information has had a profound influence on more recent initiatives to deal with the prodigious amount of information now available to researchers. It is plausibly developments outside the field of linguistics, however, that have done as much to put stochastic techniques firmly back on the agenda after they fell out of favour in the 1960s. With the exponential growth of data available over the World Wide Web and the increasing availability of corpora and treebanks (parsed corpora), it makes sense that such methods are now a standard way of handling this volume of information, and that a great deal of attention is being paid to probability-based models and statistical analysis. An additional external influence on this trend is, of course, the explosion in computer processing and storage capacity since the 1980s, which has allowed data to be analysed and exploited outside high-specification machine rooms and away from dedicated servers.

Advances made within the field also need to be acknowledged, however. One very specific example is the work based on the ‘hidden Markov model’, which uses likelihood and probability to infer unknown (hidden) values from related visible parameters. Developed in the late 1960s and early 1970s, this had particular relevance to speech recognition techniques and went on to have broad cross-disciplinary relevance, particularly in the field of bioinformatics. More generally, though, some of the rapid advances in linguistics research over the last two decades can be ascribed to the fact that, in many cases, experimental practice can be measured against real-world data, and benchmarks can be established that indicate whether the process under scrutiny is actually capable of delivering useful results.
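To make the idea of discovering hidden values from visible ones more concrete, the following is a minimal sketch of Viterbi decoding, one common way of finding the most likely sequence of hidden states in a hidden Markov model. The two-tag tagset, the two-word vocabulary and every probability in it are invented purely for illustration and are not drawn from any real corpus or system.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that preceded s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Choose the predecessor that maximises the path probability.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


# Hypothetical toy model: hidden POS tags, visible words.
STATES = ("NOUN", "VERB")
START_P = {"NOUN": 0.8, "VERB": 0.2}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
EMIT_P = {
    "NOUN": {"dogs": 0.6, "bark": 0.1},
    "VERB": {"dogs": 0.05, "bark": 0.7},
}

if __name__ == "__main__":
    tags, prob = viterbi(["dogs", "bark"], STATES, START_P, TRANS_P, EMIT_P)
    print(tags, prob)
```

In this toy setting the words are the visible parameters and the tags are the unknown values. In a real tagger or speech recognizer the probabilities would be estimated from a corpus, and log probabilities would normally be used so that long sequences do not underflow.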

In the case of part-of-speech (POS) tagging, for instance, a subset of the automatically tagged output can be compared against a manually tagged portion of the same data, and the automated process can then be evaluated to see how accurately it has managed to mimic the very labour-intensive but (theoretically) 100% accurate reference source. The figures obtained from these comparisons are extremely useful in defining the capabilities of currently available tools and methods, and they give researchers tangible goals and challenges to aim for in subsequent development phases. In the context of the arts and humanities, this is an unusual working model, and it reinforces the slightly anomalous status of linguistics research in comparison with the more orthodox interpretative and critical approaches normally associated with other arts and humanities subject areas. By way of example, table 1 refers to the different levels of word accuracy rate that might reasonably be expected from current speech recognition systems within four different contexts, word accuracy being defined as the relatively simple measure of how many words the system misses or confuses when trying to automatically transcribe speech from a variety of sources.

Table 1 – Automatic Speech Recognition Capabilities

The figures in table 1 require qualification on a number of points, particularly with regard to the parameters built into the various systems that provided the performance metrics. Writing in 2004, Hajič (Hajič, J., ‘Linguistics Meets Exact Sciences’, in Schreibman, S., Siemens, R., Unsworth, J. (eds), A Companion to Digital Humanities, pp. 79-87, 2004) states that the latest speech recognizers could handle vocabularies of 100,000 or so words and that there were examples of research systems containing one-million-word vocabularies. Accuracy rates depend very much on the domain of speech being analysed and on the relative sizes of the reference vocabularies available to the respective systems. As such, the above table is mostly illustrative in its intent.
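The two kinds of benchmark discussed above, tagging accuracy against a manually tagged reference and word accuracy for a speech transcript, can be sketched in a few lines. This is a rough illustration only: the function names and sample data are hypothetical, and the word accuracy calculation assumes the conventional approach of aligning the two word sequences by edit distance, which also penalises spurious inserted words in addition to the missed and confused words mentioned in the text.

```python
def tagging_accuracy(gold_tags, predicted_tags):
    """Proportion of tokens whose automatic tag matches the manual tag."""
    if not gold_tags or len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be non-empty and of equal length")
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)


def word_accuracy(reference, hypothesis):
    """Word accuracy: 1 - (word-level edit distance / reference length)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # dist[i][j] = fewest substitutions, deletions and insertions needed
    # to turn the first i reference words into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # every reference word missed
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # every hypothesis word spurious
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # missed word (deletion)
                dist[i][j - 1] + 1,         # spurious word (insertion)
                dist[i - 1][j - 1] + cost,  # match or confused word
            )
    return 1.0 - dist[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # Hypothetical data: one of five tags is wrong.
    gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    auto = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
    print(tagging_accuracy(gold, auto))

    # Hypothetical transcript: one word confused, one word missed.
    print(word_accuracy("the cat sat on the mat", "the cat sat in mat"))
```

Because inserted words count as errors under this assumption, word accuracy computed this way can fall below zero for very poor transcripts. Published figures may use slightly different definitions, so comparisons are only meaningful when the same measure is applied throughout.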
