The 5Vs Of Collecting Clinical Data

By Steve Chartier, Society for Clinical Data Management; Patrick Nadolny, global head, clinical data management, Sanofi; and Richard Young, Society for Clinical Data Management

The evolution of clinical research and supporting regulations, as well as major advances in technology, has fundamentally changed what clinical data is. As we define the future beyond traditional electronic data capture (EDC), we need to rethink our approaches and understand how the "5 Vs" of data are reshaping clinical data management (CDM).

First and foremost, not all data is created equal. Therefore, our data strategies must be commensurate with the risks, complexity, and value of the data collected. Additionally, data security and personal data protection are key elements that must be strategically anticipated. If the true value of this data is to be realized, it must be collected and captured in a consistent and timely manner that considers all five "V" dimensions: volume, variety, velocity, veracity, and value.

Volume
In 2012, Tufts1 estimated that, on average, Phase 3 studies collected close to 1 million data points. Today, we measure mHealth data points in the billions. This dramatic increase demands the adoption of new strategies to improve the collection, processing, and archiving of data at this new scale. CDM must reimagine its practices to efficiently move from managing a few data points per case report form (CRF) to managing tens of thousands of data points generated per patient per week.

Figure 1 shows the expected volume of actigraphy data generated by wearables (in blue) compared to data generated from site visits (in orange), which is barely visible on the figure by comparison. The protocol requires 260 patients to be treated for six months. The enrollment period is estimated to last six months. With wearable devices set to transmit data every minute, wearables would generate a pulse reading more than 68 million times throughout the study, with a spike at almost 375,000 readings per day. In comparison, pulse would only be recorded 3,380 times through site visits, assuming patients visited every two weeks, with at most 260 readings in a week across patients.
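These figures follow from simple arithmetic, sketched below. Patient counts and transmission intervals are taken from the protocol example above; the six-month periods are approximated as 26 weeks.

```python
# Back-of-the-envelope volume estimates for the Figure 1 scenario:
# 260 patients, ~26 weeks of treatment each, one pulse reading per minute.

patients = 260
treatment_days = 26 * 7            # ~182 days of treatment per patient
readings_per_day = 24 * 60         # one reading per minute

# Peak daily volume, once all 260 patients are enrolled and wearing devices:
peak_daily = patients * readings_per_day
print(peak_daily)                  # 374,400 -> "almost 375,000 readings per day"

# Total wearable readings across the study:
total_wearable = patients * treatment_days * readings_per_day
print(total_wearable)              # 68,140,800 -> "more than 68 million"

# Site-visit readings: one pulse per visit, every two weeks over 26 weeks:
visits_per_patient = 26 // 2       # 13 visits per patient
total_site = patients * visits_per_patient
print(total_site)                  # 3,380
```

The roughly 20,000-fold gap between the two totals is the volume problem in a nutshell.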

With this incredible increase in data volume, CDM must be diligent and secure, using quality by design (QbD) to define what truly needs to be collected to support the protocol hypothesis vs. all data that can be generated by new technologies. Not all data generated by devices may be useful for statistical or further exploratory analysis. In the case of wearables, CDM may consider retaining the 68 million pulse readings as e-Source data while only retrieving data summaries at regular intervals (e.g., every hour or day). Data collected may only include key data characteristics (e.g., min, max, average, standard deviation, number of observations generated, etc.), aggregated (e.g., by hour) to better support downstream activities such as safety monitoring, data review, and statistical analysis.

Fig 1. Daily quantity of actigraphy knowledge from wearable vs. weekly e-CRF pulse knowledge


With greater than 200 new well being apps added to app shops daily,2 it’s not shocking that sponsors are more and more utilizing digital well being applied sciences in medical analysis and leveraging apps to gather quite a lot of knowledge, together with reported outcomes and different real-world knowledge (RWD). However, most experiments with digital well being have been confined to Phase 4 trials, reflecting the perceived danger of incorporating digital measures into pivotal trials till they’re validated and strain examined.

This is unlucky, as these applied sciences can enhance the effectivity of medical analysis in some ways. Solutions for figuring out websites, focusing on and recruiting the best sufferers, amassing reported outcomes, gaining digital consent, screening sufferers remotely, and conducting decentralized trials have all confirmed to be efficient and helpful. First and foremost, they profit sufferers by eradicating enrollment limitations and enabling breakthrough medical advances, particularly for uncommon illnesses. As clearly seen throughout the COVID-19 pandemic, patient-centric options akin to telemedicine and residential nursing additionally profit sponsors by decreasing on-site actions, optimizing website choice, rushing up enrollment, easing knowledge assortment, and supporting fast decision-making by instant entry to knowledge.

To handle this rising quantity and number of knowledge, we should develop new medical knowledge science (CDS) ideas as an evolution from conventional CDM together with, new knowledge assortment instruments, and knowledge overview and knowledge analytics methods. As an instance, patient-centric knowledge collected as eSource is nearly unimaginable to change as soon as they’ve been generated. This implies that suggestions on the information high quality and integrity of this number of eSources must be supplied on the time of information technology. After knowledge is generated, CDM will hardly ever have the ability to ship a question to request its correction. So, remaining knowledge anomalies will seemingly should be tagged and defined. However, will knowledge tagging be sufficient to ship dependable knowledge to achieve sound conclusions required for regulatory approval? Beyond knowledge tagging, the U.Okay. Medicines & Healthcare merchandise Regulatory Agency (MHRA) launched the idea of “knowledge exclusion.”3 This implies that “unreliable knowledge” with a possible of impacting the reliability of the trial outcomes could possibly be excluded primarily based on a “legitimate scientific justification, that the information will not be consultant of the amount measured.”3

Additionally, in accordance with good recordkeeping and to permit the inspection and reconstruction of the unique knowledge, “all knowledge (even when excluded) must be retained with the unique knowledge and be out there for overview in a format that permits the validity of the choice to exclude the information to be confirmed.”3 Even if not broadly used but, knowledge tagging and exclusion might turn out to be an ordinary follow inside CDM to help the generalization of eSource and DCTs.

Furthermore, CDM is tasked with integrating each structured and unstructured knowledge from a variety of sources and remodeling them into helpful info. Integrating, managing, and deciphering new knowledge varieties akin to genomic, video, RWD, and sequenced info from sensors and wearables requires new knowledge methods and questions the centricity of conventional EDC techniques. The key questions are the place and the way, each logically and bodily, ought to these disparate knowledge sources be orchestrated to shorten the hole between knowledge technology to knowledge interpretation?

Additionally, though not new, the implementation of audit path overview (ATR) is gaining momentum in supporting examine monitoring. This can also be fueled by the extra frequent deal with audit trails from GCP Inspectors. Those can present important insights on how the information is being collected, resulting in the identification of course of enhancements or lack of know-how of the protocol directions, as much as the uncommon circumstances of manipulation from knowledge originators.4


To help real-time knowledge entry, we have to perceive, prioritize, and synchronize knowledge transactions into applicable knowledge storage at elevated quantity, velocity, and frequency. Data from wearables for example, may be generated 24 hours a day, seven days per week. Taking the instance in Figure 1, as much as 375,000 pulse readings could possibly be generated in a day assuming knowledge is transmitted each minute. This would develop to 22.5 million pulses with knowledge transmitted each second. In a world the place real-time knowledge is predicted, it’s not shocking that connectivity has turn out to be a core part of software program growth.

Application programming interfaces (API), used for web-based techniques, working techniques, database techniques, pc {hardware}, mHealth, and software program libraries, are enabling automated connectivity in new methods. This is transferring the main target from knowledge switch to knowledge integration. The integration of a excessive quantity and number of knowledge at excessive velocity is technically attainable, however is it crucial? So, CDM ought to consider the professionals and cons for each knowledge integration. We additionally must stretch our considering and expectations, as a result of APIs don’t simply join researchers, they supply a platform for automation.

Regardless of the information acquisition and integration expertise getting used, we have to synchronize the information circulate velocity to our wants throughout all knowledge streams. As a affected person’s knowledge is extremely associated to at least one one other, we have to overview and correlate a number of knowledge sources concurrently. As an instance, it might not make sense to reconcile two knowledge sources extracted months aside.

Fig 2a. Data switch frequency vs. knowledge reconciliation/consolidation

Referring to the straightforward theoretical in Fig. 2a, the information could possibly be synchronized and due to this fact optimally reconciled solely each two weeks. As proven in Fig. 2b, altering the eCOA switch frequency from each different day to 3 instances per week would no less than allow weekly optimum knowledge reconciliations and cut back the information refresh workload.

Fig. 2b. Data switch frequency vs. knowledge reconciliation/consolidation  

The want for knowledge synchronization can be true for all different data-driven actions, together with ongoing knowledge and security evaluations, danger assessments, and many others. Synchronizing knowledge flows would stop rework ensuing from knowledge refresh misalignments. Additionally, transferring ahead, synchronizing knowledge velocity will probably be extra often pushed by distant working practices. How we combine knowledge cleansing into site-driven workflows is taking part in a important function in our potential to be agile. As an instance, performing supply knowledge verification (SDV) throughout the COVID-19 pandemic pressured us to have a look at various options to remotely synchronize knowledge, paperwork, and processes with the sudden lack of bodily entry to websites. These new distant practices will in future require CDS to discover new knowledge overview capabilities with websites, past easy query-based clarifications.


We can affiliate veracity with the important thing attributes of information integrity and significantly ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate, Complete, Consistent, Enduring, and Available). Veracity additionally may be related to a few of the attributes of information high quality akin to knowledge conformity, credibility, and reliability. In this context, CDS wants to ascertain proactive measures to safe the authenticity and safety of the information. This is changing into important on this planet of e-Source and RWD, the place knowledge can hardly ever be corrected and the place anonymization is more and more difficult and important.

First, we should not let perfection turn out to be the enemy of the great, particularly the place “good” is match for objective and ok. If veracity maps a journey towards fit-for-purpose, we should assess how far we pursue perfection for every knowledge sort. Directionally, the idea of high quality tolerance limits (QTLs) is an efficient instance of a fit-for-purpose and measurable high quality framework that can be utilized throughout knowledge streams. Additionally, with the adoption of risk-based approaches, not all knowledge could also be topic to the identical degree of scrutiny. Different high quality targets could also be acceptable throughout completely different knowledge varieties and sources. CDS might want to not solely handle knowledge but additionally decide and implement fit-for-purpose knowledge high quality requirements. In this setting, we will outline a constructive and a unfavorable objective for knowledge veracity. Positively, we will purpose to ship knowledge veracity to not exceed a set tolerance restrict (e.g., not exceed x% of lacking knowledge). We can also assign a unfavorable goal, the place we try to take away any points (e.g., tackle all lacking knowledge) that might alter the top evaluation. It is usually the case that this latter objective (“unfavorable”) will probably be simpler to outline in our cleansing technique, as defining a high quality goal requires historic info and could also be perceived as subjective. However, making an attempt to remove all knowledge points could also be neither attainable nor desired for non-critical knowledge. So, CDS should discover ways to arrange measurable and goal high quality targets by actually representing our knowledge veracity aims.

Additionally, It is not attainable to make use of handbook processes primarily based on listings or affected person profiles to substantiate knowledge veracity from such a big quantity of disparate knowledge coming at such excessive velocity. It is critical to implement completely different methods transferring past knowledge filtering and trending to methods primarily based on storytelling visualizations and statistical and machine studying (ML) fashions, in addition to leveraging clever automations. Interrogating such knowledge might require completely different expertise experience, akin to NoSQL (Not solely Structured Query Language) or semantic automation.

Eventually, we additionally might want to safe the veracity of information on techniques that we don’t straight management, akin to EHRs with their disparate and sophisticated knowledge constructions, like genomic knowledge, medical imaging, unstructured knowledge and paperwork, metadata, and sequenced knowledge.


Importantly, CDM wants to maximise the relative worth of anyone knowledge level on this ocean of information. In the present CDM context, we worth high quality knowledge enabling the dependable interpretation of the trial’s outcomes. In the context of CDS, the worth of information goes past integrity and high quality to make sure its interpretability. To leverage the total potential of the information now we have, we should look past its unique objective. During a medical trial, we acquire knowledge to validate the speculation of the protocol and, in the end, receive market authorizations. Once databases have been locked, most pharmaceutical firms will solely reuse them for regulatory functions (e.g., annual security updates, built-in efficacy and security summaries, market authorization in different nations, and many others.).

Figure 3. The knowledge worth extraction journey

However, to unleash the total worth of medical trial knowledge, sponsors should proactively anticipate what will probably be wanted sooner or later. It implies that we have to search affected person authorization up entrance, by unambiguous knowledgeable consent kinds, to make use of their knowledge for functions aside from the scope of the protocol. Some firms are starting to reuse medical and well being knowledge in new methods, influencing others to significantly contemplate it.

Examples embody:

Creating artificial arms both from previous medical trials or from RWD
Engaging and retaining sufferers by feeding them study-wide knowledge summaries throughout examine conduct
Creating machine studying coaching knowledge units to enhance operational processes akin to automating question detection or enhancing the reliability and accuracy of endpoint assessments
Extracting real-world proof from real-world knowledge to realize insights on the right way to enhance customary of care or higher perceive drug effectiveness in a real-world setting

At the top of the day, by the appliance of confirmed knowledge methods, we will leverage rising applied sciences to extract the total worth of information for all stake holders (i.e., sufferers, websites, sponsors, regulators, caregivers, and payers) and defeat knowledge silos, the enemy of our worth extraction journey.

Figure 4. The 5Vs knowledge journey from assortment to worth technology


Tufts, November 2012, Clinical Trial Complexity, out there at
IQVIA, November 2017, The Growing Value of Digital Health: Evidence and Impact on Human Health and the Healthcare System. Available at
MHRA, March 2018, ‘GXP’ Data Integrity Guidance and Definition, Available at
SCDM and eCF, Audit Trail Review, An Industry Position Paper on using Audit Trail Review as a key software to make sure knowledge integrity Available at 

About The Authors:

Steve Chartier is a contributing creator for the Society for Clinical Data Management (SCDM) innovation committee, in addition to govt director of information engineering at LogixHealth. Chartier has 35 years of expertise expertise with virtually 20 years throughout life sciences and CROs. Chartier is a obsessed with knowledge technique and analytics and is an knowledgeable in constructing and nurturing expertise organizations to ship state-of-the-art merchandise knowledge services and products to allow data-driven selections.

Patrick Nadolny has virtually 30 years of business expertise throughout pharmaceutical, system and biologics in addition to expertise answer growth. He is a practical chief specializing in expertise, innovation, strategic planning, change administration, and the setup of recent capabilities. Nadolny is the worldwide head of medical knowledge administration at Sanofi. In addition to his SCDM board member function, he leads the SCDM innovation committee, which launched many papers on the evolution of medical knowledge administration towards medical knowledge science.

Richard Young is an SCDM committee member and the vice chairman, Strategy Vault CDMS, at Veeva Systems.

Recommended For You