Corpus linguistics, text mining, textual analysis, data mining

Multiple Text Datasets
Corpus linguistics
Text Mining
Textual Analysis
Data Mining

It seemed to me that we used some of these terms interchangeably during the workshop (and I think Ian made a similar comment). Now, there is definitely some overlap, and the Wikipedia article on text mining (as of 02.08.2007) even starts with: "Text mining, sometimes alternately referred to as text data mining", but I think some clarification might be useful.

What do you think are the main differences?
Why do we bother with 'text mining' when we could simply use 'data mining'?
Is 'data mining' evil because that is what companies do to exploit the customer? Or would it scare away proper historians who think that they deal with text and the rest is for social or economic historians?


data mining vs text mining

Hi, whilst most of the terms you mentioned have some overlap, I think "data mining" and "text mining" are different. The difference lies in the nature of the source: data is heavily structured information (mostly figures in tables), whereas text is unstructured information in the form of, well, text on a page. It would be interesting to discuss what influence that might have on the outcome of mining technologies. Does the definition of data mining as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" really apply to text mining?

Data Mining / Text Mining / Computational Linguistics

There are three distinct disciplines:

* Data mining: Discovering new, interesting information from large sources of data. For example, Association Rule Mining can discover which products consumers buy together. The other two typical sub-disciplines are Classification (learning how to classify an object from a series of examples) and Clustering (discovering groups of objects which are more similar to each other than to objects outside of the group).

* Text Mining: Discovering new information from large sources of textual data. This includes the extraction of semantics from the text, whereas with data mining this information is encoded directly into the data. In a table, I know by definition that a 4 in the legs column means that the animal has four legs. "A horse has four legs" says exactly the same thing, but the machine has a much harder time resolving it to the same fact.

* Computational Linguistics: This is what Dawn and Paul were talking about: the analysis of language through the use of statistics on text. The /machine/ does not learn anything or derive new facts, it simply provides a way for a human to view patterns.
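
To make the horse example above concrete, here is a minimal sketch of the gap between the two cases. The table row, the number-word lookup and the sentence pattern are all assumptions made up for illustration; this is a toy, not how a real text mining system works.

```python
import re

# Structured data: the meaning of the value is fixed by the column name.
row = {"animal": "horse", "legs": 4}

# Hypothetical lookup table for number words; a real system needs far more.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "six": 6, "eight": 8}

# Toy pattern that only matches sentences shaped like "A <animal> has <n> legs".
PATTERN = re.compile(r"^An? (\w+) has (\w+) legs\.?$", re.IGNORECASE)

def extract_legs_fact(sentence):
    """Return {"animal": ..., "legs": ...}, or None if the pattern fails."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    animal, count = match.group(1).lower(), match.group(2).lower()
    if count.isdigit():
        legs = int(count)
    elif count in NUMBER_WORDS:
        legs = NUMBER_WORDS[count]
    else:
        return None
    return {"animal": animal, "legs": legs}

# The sentence resolves to the same fact as the table row...
print(extract_legs_fact("A horse has four legs") == row)
# ...but a slight rewording already defeats the toy pattern.
print(extract_legs_fact("Horses walk on four legs"))
```

The second call returns nothing useful even though a human reads the same fact in it, which is the "much harder time" mentioned above.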

As I showed, you can treat text as data and use data mining applications to process it. This does not make it text mining. Even if I'd used part of speech tags, this is still not text mining as I don't care what the /meaning/ of the word is, just its existence.
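
As a rough sketch of what "treating text as data" can look like (not the actual demonstration from the workshop), the snippet below reduces each document to the set of words it contains and counts how often two words appear in the same document, in the spirit of association rule mining. The documents and the function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up example documents, used only for illustration.
documents = [
    "the horse ate the hay",
    "the horse ran to the barn",
    "the cow ate the hay in the barn",
]

def word_set(text):
    """Reduce a document to the set of words it contains (existence only)."""
    return set(text.lower().split())

def pair_support(docs):
    """Count, for every word pair, how many documents contain both words."""
    counts = Counter()
    for doc in docs:
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        for pair in combinations(sorted(word_set(doc)), 2):
            counts[pair] += 1
    return counts

support = pair_support(documents)
print(support[("ate", "hay")])    # documents containing both "ate" and "hay"
print(support[("hay", "horse")])  # documents containing both "hay" and "horse"
```

Nothing here looks at what "horse" or "hay" means; the words are just items in a basket, which is why this stays on the data mining side of the line.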

Data Mining / Text Mining / Computational Linguistics

Thank you for these very helpful definitions/clarifications, most welcome!

While we were up in Glasgow discussing Text Mining, my colleague Neil made the new Methods Network Working Papers available as wiki articles via the Digital Arts & Humanities Wiki. These working papers are intended to assist arts and humanities researchers with the task of acquiring knowledge about ICT tools and methods. The paper on linguistics in particular deals with several topics relating to the workshop:

Corpus Linguistics
Data Mining
Knowledge Based Systems
Querying and Analysis of Data

We hope that you will find these (and the other) articles useful and we would be especially grateful for any comments and criticisms - via direct changes to the wiki, using the comment function or via email to Neil or me.


data mining vs text mining

I agree that 'data mining' does indeed sound like (relational) databases. Still, with the border between traditional databases and XML-encoded text becoming more and more blurry, this kind of distinction between 'data mining' and 'text mining' might not always be that easy...

The two Wikipedia definitions at least stress that data mining is much more about sophisticated analysis - I would like to leave the question of how 'nontrivial' text mining is to the specialists. Regarding your question, I think that the definition you quoted should also apply to text mining, at least if one replaced the last word ('data') with 'text'. Anyway, I think you are right in suggesting that it might be more useful to approach this from a somewhat more practical angle and look at outcomes and tools. Are there data mining tools that could be applied to do the job of text mining tools or enhance our understanding of texts?
