Why Be Normal? Part 4: Rollin' in the Deep Data Lake, a National Institutes of Health case study

Much excitement in healthcare today revolves around the unlocked potential of population-level ‘big data’, which can be leveraged to inform diagnosis and treatment best practices, enable earlier intervention or proactive mitigation of disease, and support the development of new and innovative medical procedures and interventions.

A luminary teaching and research organization, the National Institutes of Health (NIH) is an excellent example, conducting numerous research initiatives such as disease prevention and treatment, oncology, precision medicine, and neurotechnology, to name a few. Data normalization plays a significant role in supporting these initiatives by creating structured and consistent datasets to that can be leveraged by data mining technologies that enable the processing and analysis of the significant volumes of information required to perform this type of population-level research – a task that would insurmountable if not automated.

Cleaning and Screening

At NIH data analysis begins with the screening and selection of research patients who meet the specific protocol requirements for various ongoing clinical studies. This involves collection and evaluation of patients’ complete medical records including family and medical history, EMR data, imaging records, and much more to identify relevant genetic characteristics, demographic factors, disease profiles, and health statuses, etc.

The NIH screens thousands of patient candidates, and as such have sophisticated methods to collate, standardize, and analyze the aforementioned data. First, patient identifiers are normalized upon ingestion and a temporary ‘pre-admit’ MRN is assigned to ensure consistency across diverse data sets and facilitate longitudinal analysis of the complete patient record by NIH systems and researchers throughout the screening process. This also ensures candidate data is kept separate from approved, active research datasets until the patient has been officially selected – at which time the MRN is normalized once again to an active ‘admitted’ profile.

The Role of Diagnostic Imaging

Imaging data is a key component to the research performed at the NIH. As such, researchers collect patients’ previous imaging from various outside facilities. Key imaging attributes on imported studies, such as study and series descriptors, are normalized according to NIH policies to enable automated data analysis and simplify the screening process for researchers by ensuring exams hang reliably within their diagnostic viewers for quick and efficient review.

As well, the NIH often performs additional advanced imaging exams throughout clinical research projects. To ensure this newly-generated data accurately correlates with the patient’s prior imaging history, can be easily inspected by advanced data analysis technologies, and enables efficient review workflows for researchers the NIH also enforces normalization policies at the modality level.

Because the NIH works with a large number of trainees and fellows the aforementioned normalization of diagnostic imaging exams provides the added benefit of creating a consistent educational environment for teachers and students, supporting structured and replicable teaching workflows.

Leveraging and Developing Artificial Intelligence and Machine Learning

Artificial Intelligence (AI) and machine learning (ML) may be new entrants into mainstream clinical settings, but to the NIH they have been in use for many years already and represent foundational technologies that play a significant role in enabling automatic identification and comparison key data elements across hundreds or thousands of relevant research studies, which would not be possible if done manually.

However, there are some data elements that cannot yet be fully analyzed by these technologies. For those where bias or variability between reported values or criteria can exist, manual intervention is still required to make appropriate adjustments based on interpretation of the context within which the values were taken. This is the tricky thing about data normalization – it’s often context sensitive and the level of reasoning required to consider the ultimate research question or goal when evaluating data relevance and normalization needs cannot yet be reliably accomplished by machines.

For this reason, the NIH continues to support the development and refinement of AI and ML algorithms by leveraging their enormous collection of clean and consistent data to build diverse training environments and collaborating with luminary AI and ML technology organizations to support the further development of advanced context-aware data analysis and clinical use cases.

Anonymization/De-identification

Another critical aspect of the research performed at the NIH is the anonymization and/or de-identification of personally identifiable information (PII). In order to adhere to patient privacy and human subjects research protection regulations, at times, the NIH wishes to, or is required to de-identify research data by removing or obfuscating identifiable data that could otherwise reveal a patient’s identity.

This might be done to allow the NIH researchers to conduct secondary analyses without additional human subjects’ protections approvals (per 45 C.F.R. 46) or in order to share data with other research collaborators. The NIH accomplishes this through standard de-identification or coding techniques and normalization policies with the goal of scrubbing data to remove identifiers.

However, the NIH employs a specialized technique that ensures a link persists between the patient’s anonymized/de-identified and identifiable data through a process defined as an ‘Honest Broker’. This ensures researchers have the ability to follow-up with patients for care-related issues and/or can continue to follow outside patients for additional or future research purposes with appropriate safeguards and approvals.

The real value of data normalization

From tracking tumor response to new treatment programs to developing statistical models for a population’s health characteristics and risk profiles, data curation on the scale achieved by the NIH not only requires the collection of vast amounts of scattered information for analysis, but also mechanisms to transform and normalize incoming data to create well-defined and comparable datasets. For the NIH, data normalization enables the extraction of valuable and predictive clinical insights that can be used to improve both clinical outcomes and population-wide health.

If you enjoyed this post subscribe to our blog to be notified when new articles are published.

Reader Interactions

Leave a Reply