Universiteit Leiden

Text & data mining

Text and Data Mining (TDM) is increasingly applied across scientific disciplines to extract structured data (in databases) from unstructured data (text), thereby offering the opportunity to acquire new information and knowledge.

The Centre for Digital Scholarship (CDS) offers support to researchers wishing to apply TDM techniques.

Our various services include:

  • Preparation and disclosure of digital and digitized library collections for TDM research
  • Support for data cleaning and data enrichment
  • Support for data analysis and data visualisation
  • Support for data curation and data preservation

Our website offers you information on TDM in general, a shortlist of tools and software that can be used for TDM, information about licences and the conditions they may impose on TDM, and resources, tutorials, blogs and other publications about TDM.

Text & data mining explained

Text Mining may be viewed as a specific form of Data Mining, in which algorithms first transform unstructured textual data into structured data, which can then be analysed more systematically. For this reason the combined term TDM (Text & Data Mining) is often used.

The term TDM is also increasingly used to designate the Text & Data Mining of scholarly content, such as journal articles, book chapters or conference proceedings. TDM may entail the following activities:

  • Information retrieval (to gather relevant texts)

  • Information extraction (to identify and extract entities, facts and relationships between them)

  • Data mining (to find associations among the pieces of information extracted from text)

TDM is applied in all parts of the research process. Exactly how and what can be achieved depends on the licensing, format and location of the text to be mined.
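The extraction and mining activities listed above can be illustrated with a small, self-contained Python sketch. Note that the regex-based "entity" pattern below is deliberately naive and purely illustrative (it also catches non-person phrases such as "Nobel Prize"); real pipelines would use dedicated NLP tooling:

```python
import re
from collections import Counter
from itertools import combinations

text = (
    "Marie Curie worked with Pierre Curie in Paris. "
    "Pierre Curie and Henri Becquerel shared the 1903 Nobel Prize. "
    "Marie Curie later won a second Nobel Prize."
)

# Information extraction: find capitalized two-word sequences (a naive
# stand-in for named-entity recognition).
entity_pattern = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")

# Data mining: count which "entities" co-occur within the same sentence.
cooccurrence = Counter()
for sentence in re.split(r"(?<=[.!?])\s+", text):
    entities = sorted(set(entity_pattern.findall(sentence)))
    for pair in combinations(entities, 2):
        cooccurrence[pair] += 1

print(cooccurrence.most_common(3))
```

The structured output (pair counts) can then be analysed or visualised like any other dataset.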

Due to the ever-growing availability of digital data (so-called Big Data), Data Science and Digital Humanities are rapidly growing fields. In September 2014 Leiden University opened the Leiden Centre of Data Science, which focuses on the development of statistical and computational methods for scientific data.

More information on techniques and applications of TDM can be found in Ronen Feldman and James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge: Cambridge University Press, 2007.

Leiden University Libraries has an important collection of publications on TDM (catalogue request).

Publishers' policies

Springer
Gale Cengage
Ebsco
Oxford University Press (Oxford Journals)
Elsevier
CrossRef Text and Data Mining services  

General Directories (including TDM software & tools)

  • DiRT (Digital Research Tools)
    The DiRT Directory is a registry of digital research tools for scholarly use.
    Resources range from content management systems to music OCR, statistical analysis packages to mindmapping software.
    The DiRT directory is supported by the Andrew W. Mellon Foundation.

  • PORT (Postgraduate Online Research Training)
    PORT is the public research training platform from the School of Advanced Study of the University of London. It contains a variety of training resources tailored toward postgraduate study in the Humanities. Most of the training resources are free.
    The resource on Quantitative Methods covers topics such as semantic data, text mining, visualisation, linked data and cloud computing. For each topic a series of case studies is provided alongside a tool audit. Free login required.

  • TaPOR
    Text Analysis Portal for Research

Tools & software

  • Textpresso
    Information extraction and processing package for biological and biomedical literature.
    Textpresso is part of WormBase at the California Institute of Technology and is supported by a grant from the National Human Genome Research Institute at the US National Institutes of Health.

  • GATE (General Architecture for Text Engineering)
    Developed by the University of Sheffield

  • Ontotext
    Provides tools for text mining, semantic annotation, data integration, and semantic curation

  • WMatrix
    Leiden University campus license 

Parsers

  • PDFMiner (Python PDF parser and analyzer)
    Tool for extracting information from PDF documents.
    Includes a PDF converter that can transform PDF files into other text formats.

  • Stanford parser
    Statistical parser

  • Alpino
    Dependency parser for Dutch, developed in the context of the PIONIER Project Algorithms for Linguistic Processing.

A selection of the most commonly used open source and Leiden University licensed tools & software.

Popular programming languages used for TDM

  • Python
    Widely used general-purpose programming language. Has a large standard library providing tools for data analysis and data modelling. An introduction to the basic concepts and features of the Python language and system can be found in the Python Tutorial.

  • Perl
    Includes powerful tools for processing text that make it ideal for working with HTML, XML, and all other mark-up and natural languages.

  • Sharing code on GitHub
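As a first taste of what Python's standard library offers for text work, the sketch below counts word frequencies in a short string (the example text is arbitrary):

```python
import re
from collections import Counter

text = "To be, or not to be, that is the question."

# Tokenize: lowercase the text and keep runs of letters/apostrophes.
tokens = re.findall(r"[a-z']+", text.lower())

# Count word frequencies with the standard-library Counter.
freq = Counter(tokens)
print(freq.most_common(2))  # [('to', 2), ('be', 2)]
```

Frequency counts like these are often the first step towards more elaborate analyses such as collocation detection or topic modelling.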


Quantitative data analysis software
 

  • R
    For Statistical Computing and graphics

  • Mallet
    Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text

  • WinStats

  • SPSS


Qualitative data analysis software

  • ATLAS.ti
    Tool for data analysis and management.
    Tutorial by the University Library of the University of Illinois at Urbana-Champaign.
    Leiden University campus license


Data cleaning
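As an illustration of typical cleaning steps applied to OCR'd or PDF-extracted text (the steps chosen here are illustrative, not a fixed recipe), a minimal standard-library sketch:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Basic cleaning for OCR'd or PDF-extracted text (illustrative steps)."""
    text = unicodedata.normalize("NFC", raw)   # normalise Unicode composition
    text = text.replace("\u00ad", "")          # drop soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)     # rejoin words hyphenated across lines
    text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
    return text.strip()

print(clean_text("infor-\nmation   re\u00adtrieval\n"))  # "information retrieval"
```

Cleaning of this kind usually precedes any analysis, since OCR artefacts and layout residue would otherwise distort word counts and extraction results.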


OCR


Visualization

  • Textexture
    For visualizing text as a network

  • Gephi 
    For visualization of network analysis
    Introduction to Network Visualization with Gephi by Martin Grandjean, University of Lausanne

  • QGIS

  • Tableau public
    Free-to-use version of the commercial data analysis and visualisation software Tableau Desktop. Makes interactive charts, graphs and maps from your data.

  • OpenHeatMap
    Data can be used to make static and interactive animated maps. 
    By using spreadsheets from Excel or Google Docs you can map any dataset that is linked to an array of locations such as IP addresses, street addresses and longitude and latitude coordinates.

  • Google Fusion Tables
    For making charts, maps and network graphs.
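Tools such as Gephi can import plain CSV edge lists (with Source/Target header columns, which Gephi recognises in edge tables). A small sketch producing such a file from co-occurrence data (the entity pairs and weights are invented for illustration):

```python
import csv

# Invented co-occurrence pairs with weights, e.g. from a text-mining step.
edges = [
    ("Marie Curie", "Pierre Curie", 3),
    ("Pierre Curie", "Henri Becquerel", 1),
]

# Write a CSV edge list that Gephi can import as an edge table.
with open("edges.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
```

The resulting edges.csv can be loaded via Gephi's import spreadsheet function and laid out with any of its network layouts.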

TDM may involve Intellectual Property Rights (IPR).

In 2012 JISC published the report Value and Benefits of Text Mining on the benefits, barriers and risks associated with text mining in the UK, stressing the importance of the reform of IPR laws.
Among researchers, research organisations and librarians, an important discussion is ongoing regarding national IPR and the TDM licensing policies of some of the larger publishers. More information on this discussion can be found at:

  • The Hague Declaration

  • LIBER (Ligue des Bibliothèques Européennes de Recherche – Association of European Research Libraries)

  • IFLA (International Federation of Library Associations and Institutions)

  • ALPSP (Association of Learned and Professional Society Publishers)

  • The Content Mine

More information on IPR can be found at the site of our Copyright Information Office.
For advice, please contact: auteursrecht@library.leidenuniv.nl

For advice and support on TDM licences, please contact our License Manager: e-resources@library.leidenuniv.nl

 

Digital collections

  • Delpher
    More than 1 million Dutch books, newspapers and journals

  • Google Books

  • HathiTrust Digital Library
    More than 13.5 million volumes
    Login required

  • HathiTrust Research Center
    The HathiTrust Research Center (HTRC) enables computational access for nonprofit and educational users to published works in the public domain.

  • The New York Public Library Digital Collections
    Login required

  • The University of Oxford Text Archive
    The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching, and learning. The OTA also gives advice on the creation and use of these resources, and it is involved in the development of standards and infrastructure for electronic language resources. 

  • Early English Books Online Text Creation Partnership (EEBO-TCP)
    The EEBO-TCP corpus covers the period from 1473 to 1700 and is estimated to comprise more than two million pages and nearly a billion words. Having previously been available only to academic institutions which subscribe to ProQuest’s Early English Books Online resource, over 25,000 texts from the first phase of EEBO-TCP were made freely available as open data in the public domain as of January 2015.

APIs for scholarly resources

APIs, short for application programming interfaces, are tools used to share content and data between software applications. Many scholarly publishers, databases, and products offer APIs to allow users with programming skills to more powerfully extract data to serve a variety of research purposes.
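As an illustration, the CrossRef REST API (see the CrossRef Text and Data Mining services listed above) can be queried without authentication via its /works endpoint. The response body below is a heavily abbreviated, invented example of the JSON shape such endpoints return:

```python
import json
from urllib.parse import urlencode

# Build a query URL for the CrossRef REST API /works endpoint.
base = "https://api.crossref.org/works"
params = {"query": "text mining", "rows": 2}
url = f"{base}?{urlencode(params)}"
print(url)

# Parsing an abbreviated, invented response of the kind the endpoint returns
# (the DOI below is a placeholder, not a real record):
response_body = '{"message": {"items": [{"DOI": "10.1000/example", "title": ["A TDM study"]}]}}'
items = json.loads(response_body)["message"]["items"]
titles = [item["title"][0] for item in items]
print(titles)  # ['A TDM study']
```

In a live setting the URL would be fetched with a standard HTTP client and the JSON parsed the same way; most scholarly APIs follow a comparable query-then-parse pattern.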

Catalogue of APIs for scholarly research (MIT libraries)

This page offers a selection of tutorials, the most current blogs and key publications.

Tutorials

Blogs

Publications
