In short, data management can be defined as the creation, storage, maintenance, disclosure, archiving and sustainable preservation of research data. Increasingly, the so-called FAIR principles are referred to as the ultimate goal: data should be made 'Findable, Accessible, Interoperable and Re-usable'.
By carefully planning how you will deal with your data, even before you start collecting or creating them, you will find it easier in the end to make the data FAIR, not only for yourself, but also for others.
Good data management is important for:
- Making research data findable and accessible, also in the long term;
- Guaranteeing the safety and confidentiality of data;
- Ensuring the quality of research;
- Data re-use specifically and progression of research in general;
- Increasing the visibility, and impact, of research;
- Compliance with requirements from funders, institutions and publishers.
Effective data management starts with thorough preparation. Carefully planning how data need to be stored and shared can save much time during and after the actual research project.
While writing a research proposal, it is useful to make a good estimation of the data management facilities that will be required. It can be advantageous, for instance, to investigate whether or not the research can be conducted on the basis of data that are already available. It is important, furthermore, to make an estimation of the nature and the extent of the data to be produced. When it is clear that the study will generate large quantities of data, this usually demands specific data management facilities. This is similarly the case when studies make use of personal data, or data which are otherwise sensitive in nature. Studies which involve international or interdisciplinary collaboration often need to ensure that their shared data remain findable and accessible for all participants. Evidently, such specific facilities can have considerable implications for the overall costs of the research project.
The Centre for Digital Scholarship (CDS) can provide support, firstly, during the initial planning of the research project. It can offer advice on how to comply with the various data management requirements formulated by Leiden University, by funders such as NWO or H2020, or by publishers. The CDS can explain how sensitive data can be protected. Tailor-made advice may be given in collaboration with Leiden University’s Data Protection Officer. The CDS can give advice on the various facilities and services that can be used during research projects, as well as on the archives that can secure the long-term preservation of data. The decisions that are taken at the beginning of the project often determine whether the data will be findable, accessible, interoperable and reusable (or FAIR) at the end of the project.
Policies and requirements
Leiden University adopted a regulation for Data Management in April 2016. The main general requirements are:
- All research projects must have a data management plan before the start of the project.
- During the research, data must be stored securely, which means that the integrity, availability and, if required, confidentiality of the data are guaranteed.
- After the project, research data must be managed in such a way that they are findable, accessible, assessable, re-usable and sustainable.
- Data must be archived according to international guidelines for at least 10 years.
The regulation is in line with most research funders' requirements.
Data management costs incurred during a research project can be included in a proposal’s budget. These may be costs related to temporary storage, to the anonymisation or the transcription of data, or to the curation of data before sustainable archiving. The National Coordination Point for Research Data Management has published a guide with activities and ways to calculate the costs.
Organising your research data well is an investment of time that pays for itself: when you finish your research, you will find your data easily and fully intact, and you will be able to understand, archive and share them with your colleagues and journal editors without difficulty.
Most data will be stored on the University network, or J-drive. The ISSC makes a back-up every night.
There may also be exceptions, for instance in the case of data from field work, patient data from hospitals, or data that are generated by an instrument. For those data a suitable plan must be devised: how often should the data be transferred to the network, which data can be placed on the network, and which data are not suitable for network storage; when should you make a back-up yourself?
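If you transfer data to the network or make back-ups yourself, it can help to check that the copy is identical to the original. The sketch below is only an illustration, not a procedure prescribed by the University: it records a SHA-256 checksum for every file in a folder and lists the files whose copy is missing or different. The function names are hypothetical and the folder layout is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of one file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to the folder) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(original, copy):
    """List files that are missing from the copy or whose contents differ."""
    return sorted(
        name for name, checksum in original.items() if copy.get(name) != checksum
    )
```

You could run `changed_files(build_manifest(source_folder), build_manifest(backup_folder))` after each transfer; an empty list suggests that every original file arrived intact.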
A virtual research environment may be a good solution for groups of researchers working together from different locations outside the University network.
File naming, structure and version control
With logical and unambiguous file naming you ensure that your data will be intelligible to your colleagues, or after a period of time even to yourself. Do take care to follow standard procedures and workflows in your specific field of research. This does not only apply to file names: the structure of your folders and the fields in your spreadsheets also deserve solid naming.
By using version control you will be able to tell apart the data from different stages in your research, and you will prevent unnecessary duplication or overwriting of data.
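As an illustration of the file naming and versioning advice above, here is a minimal sketch of one possible convention: project, short description, ISO date and a two-digit version number. The pattern and the function names are assumptions made for this example; the conventions of your own field take precedence.

```python
import re
from datetime import date

# Matches a version suffix such as "_v02.csv" at the end of a file name.
VERSION_PATTERN = re.compile(r"_v(\d+)\.[^.]+$")


def slug(text):
    """Lower-case the text and replace runs of other characters by a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_file_name(project, description, day, version, extension):
    """Compose a name of the form project_description_YYYY-MM-DD_vNN.ext."""
    ext = extension.lstrip(".").lower()
    return f"{slug(project)}_{slug(description)}_{day.isoformat()}_v{version:02d}.{ext}"


def next_version(existing_names):
    """Return the highest version number found in the names, plus one."""
    versions = [
        int(match.group(1))
        for name in existing_names
        if (match := VERSION_PATTERN.search(name))
    ]
    return max(versions, default=0) + 1


name = build_file_name("Soil Survey", "Interview notes", date(2024, 3, 5), 2, ".CSV")
print(name)  # soil-survey_interview-notes_2024-03-05_v02.csv
```

Writing each new stage to a file with the next version number, rather than saving over the previous one, keeps earlier stages available. For code and text files a dedicated version control system may be a better fit than numbered copies.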
Metadata and documentation
Metadata and documentation make your data findable and intelligible for other people. The methods applied vary per subject area: this may be a database, an XML file according to international standards, or even a readme.txt. It is important to make use of the expertise in your field and to stick to standard working procedures.
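For the simplest of the options mentioned above, a readme.txt, a small script can help you to record the same items for every data set. The sketch below is only one possible layout: the headings and the function names are assumptions for this example, not a metadata standard, and a standard from your own field should be preferred where one exists.

```python
from pathlib import Path


def render_readme(title, creator, description, collection_method, file_notes):
    """Return the text of a plain readme with one fixed heading per item."""
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        f"Description: {description}",
        f"Method of collection: {collection_method}",
        "",
        "Files:",
    ]
    # file_notes maps each file name to a one-line explanation of its contents.
    for file_name, note in sorted(file_notes.items()):
        lines.append(f"- {file_name}: {note}")
    return "\n".join(lines) + "\n"


def save_readme(folder, text):
    """Write the readme next to the data files and return its path."""
    path = Path(folder) / "readme.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

Keeping the readme in the same folder as the data means that the documentation travels with the files when they are copied or archived.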
Access to data
Access restrictions may be applied to some data: there may be privacy issues concerning the information that is stored, or limitations due to commercial interests or pending patents. You will need to take these restrictions into account when you choose your place for storage and when you decide whether or not to give other people access to your data. Anonymisation and/or encryption may be helpful and sometimes even necessary.
After the research project has finished, the main results can generally be disseminated. Research outputs often consist of articles or monographs, but they may also include data sets. According to Leiden University’s data management regulation, the research data that have emerged from research projects must be preserved for a period of at least 10 years. But precisely which data need to be preserved?
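As a first step towards protecting personal data, direct identifiers such as names or participant numbers can be replaced by codes. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library; the function name and the code length are assumptions for this example. Note that this is pseudonymisation rather than anonymisation: anyone who holds the key can recompute the codes, and other fields may still identify a person, so the result may well still count as personal data.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key, length=16):
    """Return a stable code for an identifier, derived with a secret key.

    The same identifier and key always give the same code, so records of
    one person can still be linked without storing the identifier itself.
    """
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key for illustration only; a real key should be generated
# randomly and stored separately from the data.
example_key = b"replace-with-a-randomly-generated-key"
code = pseudonymise("participant-0042", example_key)
```

Store the key apart from the pseudonymised data, and remove or generalise other identifying fields where needed.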
Selection criteria for research data
Data can often be highly valuable for other researchers, or for society at large. It is not always necessary, however, to preserve all data. In the case of large data sets, preservation may be very costly. In some cases, the costs for replication (if possible) may be lower than the costs of preservation. Alternatively, it may also be the case that the models or the algorithms which have produced the data sets are ultimately more important than the data sets by themselves. While assessing the need to preserve data, the following criteria may be used:
- Are the data unique? It is often impossible to replicate observational data, for instance.
- Are the costs of replication disproportionately high?
- Is there a formal obligation to preserve data for the longer term? Such requirements may have been stipulated by funders or publishers.
See also the hand-out ‘Selection of Data for Archiving’.
Journals and funders increasingly encourage researchers to provide open access to research data. Such forms of openness can produce a variety of benefits. When data are publicly accessible, they can be cited, and such citations give credit for the research that has been done. Open access to research data also encourages the reuse of these resources by other researchers. It enables peers to replicate specific analyses, or to validate the claims that have been made about these data in publications.
A growing number of journals ask their authors to provide access to the data that underlie a publication, either as supplementary materials or via a data repository. This is the case, for instance, for Science and Nature. The peer review processes that are organised by journals may occasionally include an inspection of the data. To avoid unpleasant surprises during the publication phase, it is very important to bear in mind throughout the entire research process that some publishers may demand that research data are openly accessible. Data can be published most efficiently if they are stored in the correct format and have been documented well, using appropriate metadata formats.
Where to publish data?
To ensure that data can be reused responsibly and productively, it is best to store these data in a trusted data repository. The option to archive data as supplementary materials, attached to a publication, ought to be avoided whenever possible. Data repositories have taken various measures to make sure that data remain findable and accessible. The data which are stored in data repositories can in most cases be cited through a persistent identifier (such as the DOI). Such archives consequently enhance your visibility as a researcher. Various studies have shown that data which are archived in data repositories receive more citations. The Research Data Services catalogue provides an overview of the most relevant data management facilities. The site also indicates whether or not these various services adhere to the requirements that are formulated in Leiden University’s data management policy. A similar catalogue of data management services can be found at www.re3data.org.
Data as a publication in itself
Even when a study produces data which do not support the overall objectives of this study, it can still be useful to publish these research data separately. In this way, it can become possible to avoid duplications of research efforts. Alternatively, these data may effectively be reused in another study.
You can draw attention to specific data sets by describing these in a data journal such as GigaScience or Scientific Data. These journals essentially publish metadata about data sets. Among other aspects, these metadata describe the way in which the data were produced, and they suggest potential applications of these data. Such descriptions of potential uses heighten the chance that these data sets will be reused.
Leiden University and most research funders require a data management plan (DMP) before the start of a new research project.
In your DMP you gather all the information about the data in the project. You are asked to provide information on the type of data, the method of collection, the format and the documentation of the data. It also includes sections on the facilities that are used, on legal or ethical reasons to share or not to share data, and on the way data are shared and preserved in the long term.
The CDS offers a workshop "How to write a data management plan".
You may use one of several templates when writing a DMP.
When you work with personal data, the General Data Protection Regulation (GDPR) requires you to record what happens to your data. In the data processing register you will explain which personal data you collect, who will have access, how you will protect the data, and how long you are planning to store the data. The university will support you in working in a privacy-proof way: on the staff website you will find all the information you need to comply with the GDPR.
Tools & tips for working securely online
The staff website also provides this useful overview to help you work securely.
We have set up a catalogue of data management facilities for researchers.
Research Data Services
This site aims to help researchers make a reasoned choice when planning for the management and the storage of their data. Additionally, the information that was accumulated should help to identify potential gaps or other shortcomings within the facilities which have been described.
You can find the catalogue at: https://digitalscholarship.nl/rds/
In our training sessions we often refer to the following handouts and best practices.
- Back-up strategies
- File naming and folder structure
- Versioning and authenticity
- Anonymisation
- Metadata
- Selection of research data
- Sensitive data protection
- FAIR data
Other useful references are:
- Expert tour guide on data management by CESSDA ERIC (Consortium of European Social Science Data Archives European Infrastructure Consortium).
- What is pseudonymous, de-identified or anonymous data? See A Visual Guide to Practical Data De-Identification (from the Future of Privacy Forum)
- Publishing and Sharing Sensitive Data decision tree by the Australian National Data Service