The Evolution of Open-Source Initiatives and New Standards Development for the Data Submission of the Future


August 14, 2023

Written by Angelo Tinazzi


In the first part of this post, I discussed the ongoing revolution, or maybe I should say evolution, we are living through with open-source initiatives and new standards development.

A good example to start with is the R-pilot initiative¹ by the R Consortium, which has already concluded two pilot projects and has two more in the pipeline. The overarching goal of this initiative is to show that we can build data packages with software other than SAS and still secure acceptance from the FDA. It also aims to show that submission data packages can include tools that streamline the reviewer’s work. This objective was realized in the second pilot, where the data submission package included an R Shiny application. The pilot demonstrated that, with guidance provided in the analysis data reviewer guide, reviewers can install and use the tool in their local environment. These pilots also gave some idea of potential challenges and, as discussed in an FDA ad-hoc webinar, potential limitations.²

Among the emerging standards currently in development, the Analysis Results Standard³ (not to be confused with the Analysis Results Metadata part of the Define-XML standard) holds a pivotal role in the pursuit of the 2041 goal. This standard is probably one of the most important missing elements in today’s CDISC foundational standards. Its implementation would facilitate the automation of analytical outputs, including tables and figures. It would also contribute to the storage, accessibility, processing, and, ultimately, the reproducibility of results. The underlying conceptual model can essentially be divided into two key components:

First, the part that defines, through a set of metadata, the elements needed to fully describe an analytical output. This includes the source datasets and any applied filters, just as in the Define-XML Analysis Results Metadata (ARM), but with more metadata than ARM provides. These supplementary metadata make it possible to fully generate the analytical outputs.
Second, the part that contains the analytical data: the results themselves are saved in an electronic format, for example a SAS dataset. This makes our results reusable, because they no longer simply sit in a PDF, which is, by the way, what we still create today.
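The two components can be pictured with a minimal Python sketch. This is only an illustration of the idea, not the CDISC model: the class names `AnalysisSpec` and `AnalysisResult`, their fields, the helper `run_mean`, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnalysisSpec:
    """Metadata part: what is needed to describe (and regenerate) an output."""
    analysis_id: str
    source_dataset: str   # name of the source dataset (hypothetical field)
    variable: str         # analysis variable
    filter_expr: str      # human-readable description of the record filter
    method: str           # statistic to compute


@dataclass
class AnalysisResult:
    """Results part: a computed value stored as data rather than PDF text."""
    analysis_id: str
    group: str
    statistic: str
    value: float


def run_mean(spec, rows, keep, group_var):
    """Compute a per-group mean over the rows selected by `keep`."""
    sums = {}
    for row in rows:
        if keep(row):
            total, n = sums.get(row[group_var], (0.0, 0))
            sums[row[group_var]] = (total + row[spec.variable], n + 1)
    return [
        AnalysisResult(spec.analysis_id, group, spec.method, total / n)
        for group, (total, n) in sorted(sums.items())
    ]


# Made-up subject-level records, for illustration only.
adsl = [
    {"USUBJID": "01", "TRT01A": "Placebo", "SAFFL": "Y", "AGE": 60},
    {"USUBJID": "02", "TRT01A": "Placebo", "SAFFL": "Y", "AGE": 70},
    {"USUBJID": "03", "TRT01A": "Drug", "SAFFL": "Y", "AGE": 50},
    {"USUBJID": "04", "TRT01A": "Drug", "SAFFL": "N", "AGE": 90},
]

spec = AnalysisSpec("AN01", "ADSL", "AGE", "SAFFL = 'Y'", "mean")
results = run_mean(spec, adsl, lambda r: r["SAFFL"] == "Y", "TRT01A")

for res in results:
    print(res.group, res.statistic, res.value)
```

In this sketch the filter is passed as a Python function next to its textual description; a real implementation would presumably derive the selection from the metadata itself, which is what makes the output reproducible from metadata alone.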

As I mentioned earlier, one of the existing obstacles is the outdated data exchange format, SAS XPT, which persists in our current data submission practices.⁴ While this format has served its purpose well thus far, it’s now time to consider its retirement, because its limitations no longer align with our 2041 vision. To cite a few: its numeric limitations, which stem from the way it stores numbers; its character limitations; and the lack of UTF-8 encoding, which makes it inefficient for handling characters from various languages. Its metadata capabilities are also limited, and it’s constrained in terms of string and column capacities. As a result, variable names exceeding 8 characters, labels exceeding 40 characters, and variables with lengths greater than 200 are not feasible within this format.
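To make the three size limits above concrete, here is a small Python sketch that flags variables a format with these limits could not hold. The function name `xpt_violations`, the dictionary layout, and the sample variables are assumptions made for illustration; only the thresholds (8, 40, 200) come from the text.

```python
# Limits quoted in the text above.
XPT_MAX_NAME = 8      # characters in a variable name
XPT_MAX_LABEL = 40    # characters in a variable label
XPT_MAX_LENGTH = 200  # length of a character variable


def xpt_violations(variables):
    """Return human-readable messages for every limit a variable exceeds."""
    problems = []
    for var in variables:
        name = var["name"]
        if len(name) > XPT_MAX_NAME:
            problems.append(f"{name}: name exceeds {XPT_MAX_NAME} characters")
        if len(var.get("label", "")) > XPT_MAX_LABEL:
            problems.append(f"{name}: label exceeds {XPT_MAX_LABEL} characters")
        if var.get("length", 0) > XPT_MAX_LENGTH:
            problems.append(f"{name}: length exceeds {XPT_MAX_LENGTH}")
    return problems


# Hypothetical variable metadata: the second entry breaks all three limits.
variables = [
    {"name": "USUBJID", "label": "Unique Subject Identifier", "length": 20},
    {
        "name": "AETERMLONG",
        "label": "Reported Term for the Adverse Event (Verbatim Text)",
        "length": 500,
    },
]

problems = xpt_violations(variables)
for message in problems:
    print(message)
```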

To address this challenge, the CDISC Dataset-JSON⁵ project has been initiated, aiming to establish a new data exchange format built on JSON, a fully machine-readable structure. Once again, the project intends to conduct a pilot in collaboration with the FDA. Should the FDA succeed in adopting and effectively using this new format, we could, possibly within the next five years, be submitting data in alternative formats. The CDISC foundational models themselves, primarily SDTM and ADaM, could also change for the better as a result, because the limitations imposed by the SAS XPT format would no longer constrain them.
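As a rough illustration of why a JSON-based exchange format removes these constraints, the Python sketch below serializes a tiny dataset with a long variable name, a long label, and non-ASCII text, then reads it back unchanged. The layout used here (`name`, `columns`, `rows`) is a simplified assumption for this example and should not be read as the actual Dataset-JSON specification; the variable names and values are made up.

```python
import json

# Hypothetical, simplified layout: column metadata plus rows of values.
dataset = {
    "name": "AE",
    "columns": [
        {"name": "USUBJID", "label": "Unique Subject Identifier", "type": "string"},
        {
            "name": "AEREPORTEDTERM",  # longer than 8 characters
            "label": "Reported Term for the Adverse Event (Verbatim Text)",
            "type": "string",
        },
    ],
    "rows": [
        ["01-001", "Céphalée sévère"],  # accented characters kept as-is
    ],
}

# ensure_ascii=False writes the characters directly instead of \u escapes.
text = json.dumps(dataset, ensure_ascii=False, indent=2)
restored = json.loads(text)

print(text)
```

Nothing in the round trip depends on name, label, or value length, which is the property the paragraph above is pointing at.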

For those wondering about the earlier CDISC Dataset-XML project, launched around a decade ago: unfortunately, it did not end well. The pilot submission to the FDA encountered difficulties, partly because of the size of the XML datasets, which were even larger than their SAS XPT equivalents. With JSON, an industry-standard format, there is real potential to resolve these issues, increasing the likelihood of a successful outcome this time.

We also need to step away from conventional normative standards whose specifications sit entirely within PDF documents. It’s imperative to enable electronic access to standards and their integration within our organizations. This could be achieved through the CDISC Library and its API; the library serves as the authoritative metadata repository for all CDISC standards, acting as a single source of truth. While we haven’t fully realized this goal, certain standards are already accessible within the library. The advantage is that this access is open to anyone.

There are also several ongoing open-source initiatives, or collaborations, under the umbrella of COSA, the CDISC Open-Source Alliance.⁶

To mention just a few, consider Admiral, an R package designed to support the creation of ADaM datasets. This is probably the best example of an open-source initiative: more than 500 individuals took part in the inaugural webinar and the training that followed. Many of these participants also engaged in a hackathon.⁷

Likewise, for SDTM, a more recent initiative has emerged: the oak R package. This initiative aims to provide not just a package to support SDTM mapping, but also a framework to handle metadata and a repository of standard specifications.

The SAS Clinical Standards Toolkit is a recent inclusion in the COSA project. Although not a new tool, it stands as one of SAS’s efforts to support CDISC initiatives, aimed mainly at producing Define-XML files from specific metadata. Importantly, it has now been released as an open-source tool.

Among them all, the OpenStudyBuilder stands out as perhaps the most ambitious open-source project. Its goal is to create an open-source tool that, once fully realized, will enable consistency throughout the entire data lifecycle, from protocol development and CRF design to dataset creation, analysis, and reporting. Several use cases have already demonstrated the power of a metadata-driven approach through this tool. It allows study protocols to be generated from metadata, which can then be reused to develop the corresponding CRFs. The core concept revolves around the principle of “Write Once, Read Many,” as someone aptly put it.

Lastly, we have CORE, the CDISC Open Rules Engine project. It addresses a concern of many years: the execution of the conformance rules created by CDISC for its foundational standards was largely in the hands of a few vendors. The essence of the CORE project lies in establishing a single authoritative source for all conformance rules. These rules will be published in a machine-readable format accessible to all, including vendors, so that vendors working from the CORE repository implement each rule consistently. This initiative demands substantial resources for rule development and testing, and CDISC is actively inviting volunteers to participate in this endeavor.
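To give a feel for what a machine-readable rule means in practice, here is a minimal Python sketch in which a rule is plain data and a generic function applies it. The rule layout, the identifier `DEMO.0001`, and the function `check` are hypothetical and do not reflect the actual CORE rule format or engine; the sample records are made up.

```python
import json

# A hypothetical rule expressed as data rather than as vendor-specific code.
RULE_JSON = """
{
  "id": "DEMO.0001",
  "description": "AESEV must be MILD, MODERATE, or SEVERE when populated",
  "dataset": "AE",
  "variable": "AESEV",
  "allowed_values": ["MILD", "MODERATE", "SEVERE"]
}
"""


def check(rule, datasets):
    """Apply one allowed-values rule and return the offending records."""
    issues = []
    records = datasets.get(rule["dataset"], [])
    for row_number, record in enumerate(records, start=1):
        value = record.get(rule["variable"], "")
        # Empty values are ignored; the rule only applies when populated.
        if value and value not in rule["allowed_values"]:
            issues.append({"rule": rule["id"], "row": row_number, "value": value})
    return issues


rule = json.loads(RULE_JSON)
data = {"AE": [{"AESEV": "MILD"}, {"AESEV": "Severe"}, {"AESEV": ""}]}
issues = check(rule, data)

for issue in issues:
    print(issue)
```

Because the rule lives in a shared, readable definition, any tool that reads the same definition should reach the same verdict, which is the consistency argument made above.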

Additionally, there are several initiatives poised to play pivotal roles in enhancing the utilization and integration of Real-World Data (RWD) and advancing digitalization.

In the area of RWD, multiple projects are currently underway to evaluate and potentially enhance standards to better accommodate specific RWD studies, such as observational studies. A noteworthy instance is the upcoming launch of the SDTM Implementation Guide for Real-World Data Project. There is also the CDISC for Observational Studies initiative, with a draft version expected to be released this quarter. Both efforts are geared toward understanding which adjustments to the standards are necessary and whether flexibility might be required for this category of data and studies.⁸

It’s also worth mentioning the HL7 FHIR to CDISC joint mapping IG,⁹ which is focused on evaluating the extent to which data can be directly extracted from electronic health records into CDISC standards, mainly SDTM.

It’s crucial for us to remain attentive to the ongoing evolution, as highlighted by the Novartis CEO’s remarks in a recent broadcast on CNBC.¹⁰ We cannot afford to miss the opportunity to journey into the future. Staying aligned, keeping up to date with the latest advancements, and collaborating continuously are essential. This involves active participation in the various ongoing initiatives that I’ve highlighted throughout this article.
