Universiteit Leiden


What is text & data mining?

Data Mining is the computational process of discovering and extracting knowledge from structured data. Text Mining is the computational process of discovering and extracting knowledge from unstructured data.

Text Mining may be viewed as a specific form of Data Mining, in which algorithms first transform unstructured textual data into structured data that can then be analysed more systematically. For this reason the combined term TDM (Text & Data Mining) is often used.
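As a rough illustration of that transformation step, the Python sketch below turns two free-text sentences into a small table of word counts. The sample sentences, the function name to_term_counts and the simple word-splitting rule are assumptions made for this example only; real TDM tools use far more elaborate processing.

```python
import re
from collections import Counter


def to_term_counts(text):
    """Turn free text into a structured record: lowercase word -> frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample sentences, invented for illustration only.
documents = [
    "Text mining extracts knowledge from text.",
    "Data mining extracts knowledge from structured data.",
]

# One row per document, one column per term: a small term-document table.
rows = [to_term_counts(doc) for doc in documents]
vocabulary = sorted(set().union(*rows))
table = [[row[term] for term in vocabulary] for row in rows]

print(vocabulary)
for line in table:
    print(line)
```

Once the text is in this tabular form, it can be compared, sorted and counted like any other structured data.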

The term TDM is also increasingly used to designate the Text & Data Mining of scholarly content, such as journal articles, book chapters or conference proceedings. TDM may entail the following activities:

  • Information retrieval (to gather relevant texts)

  • Information extraction (to identify and extract entities, facts and relationships between them)

  • Data mining (to find associations among the pieces of information extracted from text)
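The three activities above can be sketched as one small Python pipeline. This is a minimal sketch under stated assumptions, not a description of any particular TDM tool: the mini-corpus, the fixed ENTITIES list and the function names retrieve, extract and mine are all hypothetical. Retrieval is reduced to a keyword filter, extraction to a lookup against a known word list, and mining to counting which entities appear together in a sentence.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus, invented for illustration only.
corpus = {
    "doc1": "Aspirin reduces fever. Aspirin may irritate the stomach.",
    "doc2": "Ibuprofen reduces fever and inflammation.",
    "doc3": "The library opens at nine.",
}

# Assumed entity vocabulary; real systems typically use trained recognisers.
ENTITIES = {"aspirin", "ibuprofen", "fever", "inflammation", "stomach"}


def retrieve(corpus, keyword):
    """Information retrieval: keep documents that mention the keyword."""
    return {name: text for name, text in corpus.items() if keyword in text.lower()}


def extract(text):
    """Information extraction: list the known entities found in each sentence."""
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    return [
        sorted(ENTITIES & set(re.findall(r"[a-z]+", s.lower())))
        for s in sentences
    ]


def mine(entity_lists):
    """Data mining: count how often two entities share a sentence."""
    pairs = Counter()
    for entities in entity_lists:
        pairs.update(combinations(entities, 2))
    return pairs


relevant = retrieve(corpus, "fever")
extracted = [ents for text in relevant.values() for ents in extract(text)]
associations = mine(extracted)

for pair, count in sorted(associations.items()):
    print(pair, count)
```

Each stage passes a more structured result to the next: whole documents, then lists of entities per sentence, then counted pairs of entities.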

TDM can be applied in all parts of the research process. What can be achieved, and how, depends on the licensing, format and location of the text to be mined.

Because of the ever-growing availability of digital data, including so-called Big Data, Data Science and Digital Humanities are rapidly growing fields. In September 2014 Leiden University opened the Leiden Centre of Data Science, which focuses on the development of statistical and computational methods for scientific data.

More information on techniques and applications of TDM can be found in Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge: Cambridge University Press, 2007.

Leiden University Libraries has an important collection of publications on TDM (catalogue request).