Developer Tools and Environments (Linguistics)

Even a brief survey of linguistic tools reveals that a great deal of computational work is under way in many areas of the discipline, much of it carried out by practitioners who are familiar with several programming languages and comfortable working with complex mathematical models. For those looking for an accessible route into this kind of activity, the programming language most widely cited as particularly suited to developing linguistics resources is Perl. As well as offering a wealth of pattern-matching and string-handling constructs that complement this kind of research, its structure is ‘comparatively transparent and logical’, making it an attractive choice for those new to programming. (Online exercises are available on pages maintained by Paul Bennett at the University of Manchester.)
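To give a flavour of the pattern-matching and string-handling work referred to above, here is a minimal sketch. It is written in Python rather than Perl, purely for illustration, and uses only the standard library. The sample text and the function names `word_frequencies` and `words_with_suffix` are assumptions made for this example and do not come from any of the resources mentioned here.

```python
import re
from collections import Counter

# Hypothetical sample text; any plain-text corpus could be substituted.
TEXT = "The cat sat on the mat. The dog sat on the log."


def word_frequencies(text):
    """Count lower-cased word tokens matched by a simple regular expression."""
    # Runs of letters, optionally containing one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


def words_with_suffix(text, suffix):
    """Return the sorted set of words that end in the given suffix."""
    pattern = re.compile(r"\b\w+" + re.escape(suffix) + r"\b", re.IGNORECASE)
    return sorted({match.lower() for match in pattern.findall(text)})


if __name__ == "__main__":
    print(word_frequencies(TEXT).most_common(3))
    print(words_with_suffix(TEXT, "at"))
```

A frequency count and a suffix search are among the simplest corpus queries. The same regular expressions would carry over to Perl with little change.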

Web pages maintained by Dan Melamed, Assistant Professor of Computing Science at New York University, feature links to almost 300 different linguistics software tools - written mostly in Perl by him, his colleagues and his students - all of which are available under the GNU General Public License (GPL). The site assumes, as do others, that those wishing to carry out development work in this area will use a UNIX platform.

Another widely used and very influential development environment is the General Architecture for Text Engineering (GATE), supported by the Natural Language Processing Group at the University of Sheffield. Styled as the ‘Eclipse of Natural Language Engineering’, it consists of three main elements:

  • An architecture describing how language processing systems are made up of components
  • A framework (or class library, or SDK), written in Java and tested on Linux, Windows and Solaris
  • A graphical development environment built on the framework

Extensive documentation is available on the website, including information about support for a wide range of languages using the Java Multilingual Unicode Text Toolkit (JMUTT).

For researchers and developers working with open source natural language processing software, the OpenNLP website acts as a central reference point for project listings and also hosts the OpenMaxent NLP machine learning package. A variety of Java-based NLP tools use this resource for processes such as sentence detection, tokenization, POS tagging, chunking and parsing, named-entity detection, and coreference, and there are links to dozens of other tools, APIs and models written in a variety of languages including Perl, Python and C++.
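The first two processes listed above, sentence detection and tokenization, can be sketched in a few lines of Python. This is a naive, rule-based stand-in offered only to show what these steps produce. It does not use OpenNLP or its trained models, and the function names `split_sentences`, `tokenize` and `process` are hypothetical.

```python
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # A word may contain one internal apostrophe; any other
    # non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


def process(text):
    """Run sentence detection, then tokenize each detected sentence."""
    return [tokenize(sentence) for sentence in split_sentences(text)]


if __name__ == "__main__":
    for tokens in process("Hello world. How are you? Fine!"):
        print(tokens)
```

Rules like these break on abbreviations such as "Dr." or "e.g.", which is one reason statistical packages are used for these tasks in practice.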
