Lemmatisation

Refers to techniques used to group a set of forms of a word (a lexeme) together under a single headword, or lemma – the form of the word that would be listed in a dictionary, glossary or index. This enables different inflected forms of the same word to be analysed as a single item, which is useful when compiling frequency and distribution information.

In English, the lemma is normally the singular form of a noun, or the infinitive of a verb.

For example, the words ‘am’, ‘was’, ‘are’, ‘is’, ‘were’ and ‘been’ all belong to the same lexeme, which is represented by the lemma ‘be’. Therefore ‘be’ is said to be the lemmatised form.
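As a minimal sketch of how grouping forms under a lemma supports frequency counts, the Python below uses a small hand-written lookup table. The table and the names `LEMMA_TABLE`, `lemmatise` and `lemma_frequencies` are illustrative assumptions, not part of any real lemmatisation package.

```python
from collections import Counter

# Toy lookup table (an assumption for illustration, not a real lexicon):
# each inflected form maps to its lemma.
LEMMA_TABLE = {
    "am": "be", "is": "be", "are": "be",
    "was": "be", "were": "be", "been": "be",
}

def lemmatise(token):
    """Return the lemma for a token, or the lower-cased token if it is not in the table."""
    word = token.lower()
    return LEMMA_TABLE.get(word, word)

def lemma_frequencies(tokens):
    """Count occurrences per lemma rather than per surface form."""
    return Counter(lemmatise(t) for t in tokens)

tokens = "I am here you are there she was away they were late".split()
frequencies = lemma_frequencies(tokens)
print(frequencies["be"])  # the four different forms are counted as one item
```

Counting by surface form would report ‘am’, ‘are’, ‘was’ and ‘were’ once each; counting by lemma reports a single entry for ‘be’.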

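Specialised lemmatisation software can be used for this process. To do this accurately, the software must take the surrounding context into account and identify each word's part of speech, because the same form can belong to different lexemes (for example, ‘saw’ as a noun, or as the past tense of ‘see’).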
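Why the part of speech matters can be sketched with a toy table keyed on both the word form and its tag. The tag names, the table `LEMMAS_BY_POS` and the function `lemmatise_with_pos` are hypothetical; real software would derive the tags from context rather than receive them by hand.

```python
# Toy table (an illustrative assumption): the same surface form maps to
# different lemmas depending on its part of speech.
LEMMAS_BY_POS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

def lemmatise_with_pos(token, pos):
    """Look up the lemma for a (form, part-of-speech) pair.

    Falls back to the lower-cased token when the pair is not in the table.
    """
    word = token.lower()
    return LEMMAS_BY_POS.get((word, pos), word)

# A sentence that has already been tagged by hand for this example.
tagged = [("She", "PRON"), ("saw", "VERB"), ("the", "DET"), ("saw", "NOUN")]
lemmas = [lemmatise_with_pos(token, pos) for token, pos in tagged]
print(lemmas)
```

Without the tag, both occurrences of ‘saw’ would have to be given the same lemma, and one of them would be wrong.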

Stemming software is similar, but it only reduces a word to its stem, typically by stripping affixes, and does not take context or parts of speech into account, which can reduce accuracy. For example, ‘good’ is the lemma for ‘better’, but a stemmer would not recognise this, because the two forms do not share a stem.
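The difference can be sketched by comparing a deliberately naive suffix-stripping stemmer with a lemma lookup. Both are assumptions made for illustration: `naive_stem` is not a published stemming algorithm, and `IRREGULAR_LEMMAS` is a hypothetical stand-in for a real lexicon.

```python
# Suffixes the toy stemmer strips, checked in this order (an assumption).
SUFFIXES = ("ing", "ed", "er", "es", "s")

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lexicon entries that link forms to their lemma.
IRREGULAR_LEMMAS = {"better": "good", "walking": "walk", "walked": "walk"}

def lookup_lemma(token):
    """Return the lemma from the toy lexicon, or the lower-cased token if unknown."""
    word = token.lower()
    return IRREGULAR_LEMMAS.get(word, word)

for form in ("walking", "walked", "better"):
    print(form, naive_stem(form), lookup_lemma(form))
```

For regular forms such as ‘walking’ and ‘walked’ the two approaches agree, but the stemmer turns ‘better’ into a truncated string, while the lookup returns ‘good’.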

Related methods include: Indexing.
