Many initiatives in clinical research, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative, have a hard time getting funds for building and maintaining the software needed to ensure health science is of the highest possible quality. In this project, we aim to build software tools that make it easier to do better research, especially for managing and working with data. We will first build these tools to support the DD2 initiative.
We will share these tools widely and freely, so that as many research groups as possible can use them for their own projects. Not only will we build these tools to help researchers manage and share their data, we will also create beginner-friendly documentation and training material so that more researchers can use our tools, no matter their skill level. We believe that with these tools, science on health and well-being can become better, ultimately helping people with diabetes and society in general.
In clinical and health research, especially for small- to mid-sized research groups, funding for building modern, open source software infrastructures for managing and using data has been limited. This gap has naturally led to organizational challenges in managing existing and incoming data for many research initiatives, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative, a national research collaboration and database initiative established in 2010 with continual enrollment of persons with type 2 diabetes. The aim of our project is to close this gap by creating and implementing an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders with the data, documentation, and findings within the DD2 study. This will improve and extend the existing DD2 research infrastructure into an open, national, state-of-the-art research infrastructure that provides easy and transparent access to this resource for researchers, clinicians, and stakeholders, thus enabling excellent data science driven research. Furthermore, we will create this framework in such a way that other research groups and companies, who are unable to adequately invest in building infrastructures of this type, can relatively easily implement it and modify it for their own purposes. By building this framework, we have the potential to help research groups and companies across Denmark (and globally) quickly adopt modern, scalable, and efficient approaches to working with data. Within the DD2 setting, open, transparent, and easy access to this constantly growing resource has the potential to greatly improve the interest in, use of, and scientific impact of this resource, leading to substantial scientific and medical advancements, individualised treatment, and improved human health, not only for persons with type 2 diabetes but for the population overall.
In clinical research, software and data infrastructure development is undervalued and, aside from this funding call, underfunded, particularly for small- to mid-sized research organizations. Clinical and health researchers largely lack formal training, support, and awareness in research software engineering (RSE) and in building and managing data infrastructures. As a result, the overall software and computational ecosystem, as well as the technical capacity to maintain it, lags behind that of multiple other scientific domains (e.g., bioinformatics). Particularly with the recent rise of data science and the greater focus on analytical reproducibility, this issue has become increasingly apparent as data grow ever larger and more complex and the skills required to work with them become more technical. Indeed, investing in and implementing scalable and modern data infrastructures and RSE processes, built on open source software, has the potential to greatly improve the quality of science, produce more transparent and streamlined workflows, support reproducible research, and generally lead to better science in less time (1).
Funding for participant recruitment and data acquisition has historically been (and still is) easier to obtain than funding for building open source software and infrastructures that support and enhance science, particularly for managing and using data. This imbalance has naturally led to organizational challenges in managing existing and incoming data for many research initiatives within clinical research, including the Danish Centre for Strategic Research in Type 2 Diabetes (DD2) initiative (2,3).
DD2 is a national type 2 diabetes (T2D) research collaboration and database initiative that was established in 2010, with ongoing enrollment by hospital physicians and general practitioners (GPs). Although T2D is a single diagnosis, it comprises several phenotypes with diverse prognoses and risks of complications, which opens the possibility of treatments tailored to each phenotype. The overarching aim of DD2 is to improve and individualise the treatment of persons with T2D. Figure 1 shows the datasets within DD2 (4–7). DD2 has received extensive funding from the Danish Council for Strategic Research and the Novo Nordisk Foundation, as well as a Steno National Collaborative Grant for deep phenotyping. Continuously recruiting more participants, adding new data, and expanding data access to researchers throughout Denmark and abroad has the potential to further increase the value of DD2. However, this comes with higher costs and greater resource needs for maintaining, extending, and improving the existing DD2 research infrastructure.
Building modern data infrastructures has slowly been taking greater priority among funding and research agencies globally. For instance, the UK Biobank (8,9) is a large-scale biomedical database with highly detailed data on ~500,000 participants. It is regularly expanded with additional data, is globally accessible to approved researchers, and serves as a role model for building a modern research infrastructure.
While the UK Biobank is a source of inspiration on the state of the art, the underlying infrastructure itself is not openly accessible and reusable. The same applies to a similar Danish initiative, the "Single path to access Danish health data" project (10), in which the Danish government and the individual regions are collaborating to map out all Danish health data. Another state-of-the-art initiative, led by the University of Chicago, USA, is Gen3 (11), which contains modular open source services that can form the basis for a data infrastructure (12,13) and powers several research platforms, including at the National Institutes of Health (14). However, we are unaware of any similar current national efforts that are open source, reusable, and suitable for the Danish and EU legal context.
Our primary aim is to create and implement an efficient, scalable, and open source data infrastructure framework that connects data collectors, researchers, clinicians, and other stakeholders with the data, documentation, and findings within the DD2 study. This will improve and extend the existing DD2 infrastructure into an open, national, state-of-the-art research infrastructure that provides easy and transparent access to this resource for researchers, thus enabling excellent data science driven research. Our secondary aim is to create this framework in such a way that other research groups and companies, who are unable to adequately invest in building similar infrastructures, can relatively easily implement it and modify it as needed for their own purposes.
Our first step is to build the data infrastructure framework; the second is to implement it within DD2. We describe the framework itself first and then how we will apply it to DD2.
For this project, the data infrastructure framework is defined as 1) a set of software programs, 2) a defined and fixed set of conventions on the structure and format of the filesystem and URL paths, and 3) a defined structure to the data and associated documentation, all of which are linked together as modular components. The framework will serve as an open source starting template for setting up data infrastructures that make use of modern tools and processes.
This framework encompasses four target users and three layers, with a complete schematic shown in Figure 2. The three layers are the web portal frontend, the database and documentation backend, and the API (Application Programming Interface) that interacts with both. The four users and their associated use cases are:
Throughout this application, we will refer to these four users and three layers as we expand on and describe the framework.
To ensure the development of this framework is efficient and focused, it will adhere to key principles that are supported by strong philosophical and scientific rationale:
In order to maximise the potential for re-use and to minimise the technical debt and expertise needed to use, maintain, and modify the framework, we will use software and tools underlying the framework that fit these principles:
Based on the above principles, we have chosen the following software and conventions to form the framework’s foundation:
This interface is what all users interact with, with essentially three "permission" levels available:
All content would be rendered directly as plain HTML text to ease the use of existing webpage translation services (e.g., Google Translate), so that content written in another language (e.g., Danish) would still be readable to non-native speakers. This would also reduce the maintenance needed for the documentation.
Modern web and computational infrastructures are built on web APIs. Any modern online resource or interface makes use of an API, such as those from Google, Gen3, or the UK Biobank. An API is a mechanism by which different programs communicate with one another: a set of instructions and conventions that allow a user's software and a server to exchange information easily. APIs are by nature transparent and, if well documented, help ensure that the linked data are FAIR while being kept safe and secure.
In this case, the API sits between the user and the web server that stores the underlying database and documentation. The API would be a combination of a predefined set of instructions sent to the web server to run certain commands, and a set of explicit conventions and rules on how files and folders are structured and named. Taken together, this API would allow other software, such as R packages, to be built that interact with the backend and automate the tasks done by the users.
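As a rough illustration of this idea, the sketch below shows how an R package built on top of the API might request the publicly accessible data dictionary. The host URL, endpoint path, and query parameter are hypothetical placeholders, not the final API design.

```r
# Minimal sketch of a hypothetical API call from R (httr + jsonlite).
library(httr)
library(jsonlite)

# Hypothetical host; the real framework would define its own URL conventions.
base_url <- "https://api.example-dd2-framework.dk"

# Request the data dictionary for a given (assumed) data version.
res <- GET(
  url = paste0(base_url, "/v1/data-dictionary"),
  query = list(data_version = "2.1.0")
)
stop_for_status(res)

# Parse the JSON response into a data frame of variable metadata.
dictionary <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
head(dictionary)
```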
Given the heterogeneity in the sources of data input, the backend will need to be composed of multiple components: raw data files as plain text, cleaning and processing programming scripts, a formal database structure (e.g. SQL), a VC system to track changes to the raw data and processing scripts, a data version numbering system, a changelog describing the changes, and a data dictionary linked to the variables contained in the database. Versioning of the raw data and scripts is done for recordkeeping, auditing, and transparency, and it also allows comparison of the data used by past and current projects.
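One way these components could be laid out on the filesystem is sketched below. The folder names are illustrative assumptions, not the framework's final conventions; base R is used so the skeleton can be created reproducibly.

```r
# Minimal sketch of an assumed backend folder convention.
backend_dirs <- c(
  "data-raw",   # raw data files stored as plain text, never edited by hand
  "scripts",    # cleaning and processing scripts, tracked with the VC system
  "database",   # the formal database (e.g., SQL dumps or a database file)
  "docs"        # data dictionary (JSON), changelog, and other documentation
)

# Create the skeleton; safe to re-run, existing folders are left untouched.
for (dir in backend_dirs) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
}
```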
A major challenge in building the backend is the heterogeneity of the data input. The key is to establish and enforce a standardized Common Data Model (CDM) for all incoming data at the point of entry. For the framework, the exact contents of the database are not important: as long as incoming data follow the CDM, they can be programmatically merged into the final formal database. This flexibility is necessary because the database contents depend heavily on the research topic and aims of the study that will use the framework.
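To make this concrete, the sketch below shows one way a CDM could be enforced at the point of entry. The required columns and the example record are assumptions for illustration only, not the actual DD2 CDM.

```r
# Minimal sketch of enforcing an assumed Common Data Model on incoming data.
required_columns <- c("participant_id", "collection_date", "variable_name", "value")

check_cdm <- function(data) {
  missing <- setdiff(required_columns, names(data))
  if (length(missing) > 0) {
    stop("Incoming data is missing required CDM columns: ",
         paste(missing, collapse = ", "))
  }
  invisible(data)
}

# Example: incoming data that follows the (assumed) CDM passes the check.
incoming <- data.frame(
  participant_id = "DD2-0001",
  collection_date = as.Date("2024-05-01"),
  variable_name = "hba1c",
  value = 48
)
check_cdm(incoming)
```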
The backend documentation is largely either generated automatically or written manually. For instance, the list of projects and findings would be generated from the submitted projects and input from User 2 (researchers), while the changelog would be updated either through automated additions or, optionally, manually by User 4. The data dictionary would be stored as a JSON file with the documentation text itself written in Markdown. This data dictionary would be publicly accessible and could be updated by anyone (with approval from User 4), potentially through a "Merge Request" mechanism, which automatically links any addition or correction back to the main documentation and requests that it be merged in.
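The sketch below shows what a single data dictionary entry might look like when written out as JSON with a Markdown description. The field names, file path, and variable are hypothetical, not the framework's final schema.

```r
# Minimal sketch of an assumed data dictionary entry, written as JSON.
library(jsonlite)

dictionary_entry <- list(
  variable = "hba1c",
  label = "Glycated haemoglobin (HbA1c)",
  units = "mmol/mol",
  # Description is stored as Markdown text.
  description = "**HbA1c** measured at enrolment. See the *lab protocol* for details.",
  added_in_version = "1.2.0"
)

# Store the entry as human-readable JSON alongside the rest of the documentation.
write_json(dictionary_entry, "docs/hba1c.json", pretty = TRUE, auto_unbox = TRUE)
```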
Depending on the source of the data, there may already be established data input processes. Substantial amounts of biomedical data, especially in Denmark, come from routinely collected clinical data, such as from outpatient clinics. For these sources, the data input pipeline would involve redirecting the data through the API and storage format so that it continues on to the backend.
Sources of data that do not have well-established input processes, such as hospitals, medical laboratories, and so on, would use the data input portal. This portal would only accept data in a pre-defined format and would include documentation, and potentially automation scripts, on how to pre-process the data prior to uploading it.
Once data is submitted through the portal, it would be sent in an encrypted, legally compliant format to the server and stored in the way defined by the API and CDM. Any new or updated data that is uploaded would trigger generic automated data cleaning, processing, and quality control checks of that data. Any project-specific automated processing would need to adhere to the API's conventions. If any issues are found, or if the data is entirely new to the database, they are written to a log and User 4 receives a notification to deal with them. Once there are no outstanding issues, an automated script would take a snapshot of the data with the VC system, update the data's version number (based on Semantic Versioning), add an entry to the changelog, and update the formal database.
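A minimal sketch of those final automated steps is shown below, assuming Git (via the gert R package) is the VC system and that the new version number and changelog path are placeholders chosen for illustration.

```r
# Minimal sketch of the assumed snapshot-and-release step after checks pass.
library(gert)  # R bindings to Git

# Assumed semantic version bump, e.g. a minor bump because new variables were added.
new_version <- "1.3.0"

# Append a human-readable changelog entry (path is an assumption).
cat(
  sprintf("## %s (%s)\n- Added new data uploaded via the data input portal.\n\n",
          new_version, Sys.Date()),
  file = "docs/CHANGELOG.md", append = TRUE
)

# Snapshot the raw data, scripts, and documentation with the VC system.
git_add(".")
git_commit(message = paste("Update data to version", new_version))
git_tag_create(name = paste0("v", new_version), message = "Automated data release")
```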
Researchers and other users who want to request access to the data would first need their identity verified and then be approved for authorized access. After approval, they would interact with the frontend by two routes:
When User 4 approves a data request project, this will trigger an API request that automatically extracts the requested subset of data, bundles and encrypts it, and sends it to the researcher's secure server. The framework will contain sufficiently generic methods for automating this data transfer process.
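As a rough sketch of what such an automated extraction could look like, the example below queries an assumed SQLite database, writes the approved subset, and encrypts it with GnuPG. The database file, table, query, project number, and recipient address are all hypothetical placeholders, and the real framework would use whichever database and encryption tooling is legally approved.

```r
# Minimal sketch of an assumed extract-bundle-encrypt step for an approved project.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "database/dd2.sqlite")

# Extract only the variables and records approved for this project.
subset <- dbGetQuery(con, "
  SELECT participant_id, hba1c, collection_date
  FROM clinical_measures
  WHERE collection_date >= '2020-01-01'
")
dbDisconnect(con)

# Bundle and encrypt for the approved researcher (requires GnuPG installed).
write.csv(subset, "project-042-extract.csv", row.names = FALSE)
system2("gpg", c("--encrypt", "--recipient", "researcher@example.org",
                 "project-042-extract.csv"))

# The encrypted file would then be transferred to the researcher's secure server
# using an approved transfer mechanism.
```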
The framework assumes that this user would interact with the portal through at least three routes:
These users would largely interact with the web portal to manage and oversee ongoing projects, approve access for new researchers, and approve projects requesting access to data. Approving new researchers would grant them the access permissions to enter the User 2 portal.
The framework itself does not contain any personal data. When the framework is deployed as an infrastructure for a database, only aggregate statistics, and not individual-level personal data, would be publicly accessible. Any personal data would be stored on a secure server that is decided and controlled by User 4, who would be responsible for complying with relevant legal requirements.
For transfers of personal data, whether from data collection centers, data generated by researchers, or data for approved projects, we would use well-established and compliant encrypted data transfer processes. Key authentication principles, such as two-factor authentication and OAuth (an open standard for access authentication), will be central to the framework for controlling who can update or transfer the data. The endpoint of the data transfer is handled by the legal teams of the relevant institutions.
To align with the goals of openness, transparency, and the FAIR principles, the complete development of the framework will take place openly on GitHub. From there, we will link to and promote it through various outlets, including publications, conferences, and social media. The framework and all its components will be licensed under permissive licenses, such as the MIT License for software and the Creative Commons Attribution License for non-software content.
Integral to this framework is ensuring it is sustainable over the long-term by:
Usage of this framework depends on the quality of its documentation and training material. A key concept we will rely on heavily is Documentation Driven Development, in which the framework's development is guided and informed by the development of its documentation, making documentation a high priority. We will also create and run short workshops and tutorials that teach researchers how to use the framework.
While the proposed framework is software-based, storing and deploying the infrastructure requires server space and IT support. The framework itself takes up little space and can optimise computational resources, but the underlying DD2 data requires considerable server space.
To have a meaningful impact on improving research infrastructure, the minimum skills and knowledge necessary are:
The biggest potential challenge in applying the framework to DD2 is getting the database backend into the appropriate structure to fit within the framework. Given the current state of the DD2 data, considerable time and effort will be needed to organize it. Our initial steps will be to:
Currently, User 3 can request data by filling out a Word application form and emailing it to the chair of the advisory board, Kurt Højlund, and the programme leader, Jens Steen Nielsen. Applications are reviewed by the steering committee. Once a project is approved, the data manager at the Department of Clinical Epidemiology (KEA) at Aarhus University Hospital manually extracts the requested data and transfers the data subset to the applicant's secure server, doing this separately for each individual research project. If requested, KEA may also perform analyses on the data. Researchers must already have valid authorized access to the secure servers on an existing "forskermaskine" (researcher machine) or, for large-scale data, an HPC facility such as Computerome 2 or GenomeDK.
The costs of storing the original data are covered by DD2, while applicants cover the costs related to storing the transferred data. We will not charge for data access. As per legal requirements, researchers can only use the data for the intended purposes listed in the application. After project completion, the researchers must delete or close access to the data and inform DD2 as legally required. Any newly generated data must be returned to DD2 by uploading via the User 1 portal.
Because the framework will be built with modularity in mind, where each component can be used alone or together with the others, nearly all of the components could be deliverables (each User by each layer). Each deliverable would be to prototype a minimum viable product (MVP) in order to begin testing it, identifying bugs, getting feedback, and establishing maintenance procedures. See Figure 3 for the Gantt chart.
The framework will be developed at SDCA with Professor Annelli Sandbæk (applicant) as the lead PI responsible for reaching the overall goals of the project and the defined milestones, together with two postdocs, Luke Johnston, MSc, PhD, and Alisa Kjærgaard, MD, PhD. A project group headed by the PI will be established that includes central persons from SDCA and DD2. The deliverables will be planned and carried out in close collaboration with the DD2 project manager and the current data manager. Completing the proposed project requires hiring data and research software engineering personnel, which will be the first step in the project process. The DD2 advisory group will also act as the advisory group for this project. This group is chaired by Kurt Højlund, MD, PhD, head of research at Steno Diabetes Center Odense, and contains representatives from affiliated research projects and other DD2 stakeholders.
We are at a key point in time within clinical research where it is increasingly being recognized that open source software and computational infrastructure are critical and necessary components of high-quality, reproducible, rigorous, and transparent science. Funding agencies and research institutions globally are putting greater effort into modernizing many of their infrastructures using the many software technologies that have arisen in the last decade. By building this framework, we have the potential to help research groups and companies across Denmark (and globally) quickly adopt modern, scalable, and efficient approaches to working with data.
Within the DD2 setting, open, transparent, and easy access to this constantly growing resource has the potential to greatly improve the interest in, use of, and scientific impact of this resource. Incorporating data newly generated from the DD2 resource back into DD2 will enable other researchers to test or use it to advance their own work. This would lead to substantial scientific and medical advancements, ultimately leading to individualised treatment and improved human health for individuals with type 2 diabetes, and very likely the population overall.