Big data takes on stroke by the numbers

Evolving, adaptive and growing, big data offers more direction for stroke treatment.



The continuous evolution and growing sophistication of data to treat stroke was the focus of Friday’s “Data, Big Data, Biggest Data and Stroke” session.

Jeffrey Saver, MD, Geffen School of Medicine at UCLA, presented the next generation of Common Data Elements (CDEs) for Stroke from the NIH National Institute of Neurological Disorders and Stroke (NINDS). He described the mission of the CDEs, explained how they apply to stroke and detailed the updates in the recently released Version 2.0.

“Through the update of the Stroke CDEs to V2.0, the initiative strives to maintain the utility of CDEs as a valuable clinical research resource. NINDS encourages the use of CDEs to standardize research data collection across studies,” he said. 

Saver presented the V2.0 elements, which include instruments for formal assessments of different modalities and new conditional random fields. Instruments from V1.0 that were no longer as relevant were removed. The update also includes revised individual descriptions and aspects of the existing data elements, as well as elements of lifestyle modification therapies for case report forms.

Caitlyn Meinzer, PhD, Medical University of South Carolina in Johns Island, explored the use of adaptive trials as a path toward personalized medicine. She described three types of Multi-Arm, Multi-Stage (MAMS) trial platforms: umbrella, where you might have multiple treatments for one disease; basket, where you have multiple diseases and one treatment; and minesweeper, where you have one treatment and one disease, but you are systematically adapting to multiple subgroups.

She said the minesweeper trial is exactly like the 1990s game. “You start off with a grid or risk surface where you don’t know anything about where the safe spaces and traps are. As you slowly click around, you get increasing amounts of information telling you treatment may be increasingly effective or ineffective,” she said. “Hopefully, as you progress, you are able to safely flag all those areas where treatment is futile or harmful without actually exposing patients needlessly.”

Minesweeper is the basic underlying concept for the proposed StrokeNet Thrombectomy Endovascular Platform (STEP). “We are using big data to estimate group level variability across many factors. We do this through hierarchical models and adaptive decision rules to respond to big data to allocate patients to the treatment we currently believe best fits. Finally, we use master protocols and data registries to manage big data in a cohesive and seamless manner.

“For the STEP trial for stroke, we are targeting 50,000 patients,” she said.

This platform illustrates the beauty of big data, she said. “If we want to create a risk surface that accounts for all of the risk factors (time, penumbra, vessel size, deficit, core, age and eloquence) so we can come up with a personalized approach to treatment, you end up needing a great number of subjects to cover the entire risk surface.”
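A rough count shows why covering the risk surface takes so many subjects. The sketch below is only an illustration: it assumes each of the seven factors is split into the same number of levels (a hypothetical simplification, not the STEP design) and divides the 50,000-patient target evenly across the resulting cells.

```python
# The seven risk factors named in the session.
FACTORS = ["time", "penumbra", "vessel size", "deficit", "core", "age", "eloquence"]

# Enrollment target quoted for the STEP trial.
TARGET_PATIENTS = 50_000


def cells_in_risk_surface(levels_per_factor, n_factors=len(FACTORS)):
    """Number of distinct cells if every factor has the same number of levels.

    Equal levels per factor is an assumption made purely for illustration.
    """
    return levels_per_factor ** n_factors


for levels in (2, 3, 4):
    cells = cells_in_risk_surface(levels)
    per_cell = TARGET_PATIENTS / cells
    print(f"{levels} levels per factor: {cells} cells, about {per_cell:.1f} patients per cell")
```

Even a coarse three-level split of each factor yields 2,187 cells, so 50,000 patients would average roughly 23 per cell. This crude count likely overstates the requirement, because hierarchical models of the kind Meinzer described share information across related subgroups rather than treating each cell in isolation.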

Where the older approaches would have offered incremental expansion or megatrials, the newer approach uses an adaptive mechanism where you can iteratively change the allocation ratio up or down as you learn more in the trial, she said.
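As a minimal sketch of what changing the allocation ratio as you learn can look like, the example below uses Thompson sampling with two arms and a binary outcome. This is a generic, swapped-in technique chosen for illustration, not the decision rule proposed for STEP; the success rates, patient count and function name are all hypothetical.

```python
import random


def adaptive_allocation(true_rates, n_patients, seed=0):
    """Thompson-sampling sketch of response-adaptive allocation.

    Each simulated patient goes to the arm whose sampled success rate is
    highest, so allocation drifts toward the arm that appears to work better.
    `true_rates` are hypothetical success probabilities used only to simulate
    outcomes; a real trial would not know them.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    assignments = [0] * n_arms

    for _ in range(n_patients):
        # Draw one plausible success rate per arm from its Beta posterior
        # (uniform Beta(1, 1) prior updated with the outcomes seen so far).
        draws = [
            rng.betavariate(1 + successes[arm], 1 + failures[arm])
            for arm in range(n_arms)
        ]
        arm = draws.index(max(draws))
        assignments[arm] += 1

        # Simulate this patient's binary outcome and update the counts.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    return assignments


counts = adaptive_allocation(true_rates=[0.25, 0.55], n_patients=1000)
print("patients per arm:", counts)
```

Early on the two arms receive similar numbers of patients. As outcomes accumulate, the draws for the weaker arm win less often and the allocation shifts toward the stronger one, which is the up-or-down adjustment described above.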

As a reality check on big data, Marco V. Perez, MD, from Stanford Medicine in California, described big shifts thanks to big data, drawing on his research on machine learning and other techniques.

“The paradigm shift is we are now able to collect many more data points per sample. A very good example of this is genomic data,” he said. “Today, when we do a genomic analysis, we are able to collect many genomic data points per person.”

There are other sources of big data, including electronic medical records, continuous monitoring and wearables, he said.

“Wearables are going to create a lot of big data. A quarter of the population will have some kind of wearable in the U.S. by 2022. With wearables, you can measure data continuously, such as heart rate data for months to years.”

How do we go about analyzing this data? He identified different approaches, including neural networks, data clustering and data reduction.

“Neural networks are really valuable when you are looking at complex and nonlinear relationships. These do really well when you have large datasets and multiple covariates; they handle missing data quite well. It just uses brute computational force to figure what the relationships are.”

He identified big data’s primary challenges. “When you are extracting a large amount of data, often you don’t have time to go through and carefully make sure that your data is well classified. You do not have time to curate all of your data. In other words, garbage in, garbage out.”

He said you can overcome this misclassification by increasing the sample size and sampling frequency. With algorithms, you can then perform a form of signal averaging to determine whether a signal is present amid the noise and, if so, where it lies.
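The effect of averaging can be sketched with a small simulation. It assumes the errors behave like random, unbiased noise around a true value, which is a simplifying assumption of this example; systematic mislabeling would not average out this way. The signal value, noise level and function names are hypothetical.

```python
import random
import statistics


def averaged_estimate(true_signal, noise_sd, n_samples, rng):
    """Average n noisy readings of the same underlying signal."""
    readings = [true_signal + rng.gauss(0, noise_sd) for _ in range(n_samples)]
    return statistics.fmean(readings)


def spread_of_estimates(n_samples, trials=200, seed=1):
    """Standard deviation of the averaged estimate across repeated trials.

    The signal (1.0) is small relative to the noise (standard deviation 5.0),
    so a single reading says very little about it.
    """
    rng = random.Random(seed)
    estimates = [
        averaged_estimate(true_signal=1.0, noise_sd=5.0, n_samples=n_samples, rng=rng)
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)


small = spread_of_estimates(n_samples=10)
large = spread_of_estimates(n_samples=1000)
print(f"spread with 10 samples:   {small:.2f}")
print(f"spread with 1000 samples: {large:.2f}")
```

Under these assumptions the spread of the averaged estimate shrinks roughly with the square root of the number of samples, so 100 times more data gives an estimate about 10 times tighter.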

Other criticisms of big data are that users may not really know or trust what the algorithms are doing with the data, that it can be resource intensive and that it can require a great deal of technical expertise.

“The advantages of big data and machine learning algorithms are that they do really well at finding patterns and trends, identifying interactions between data elements that traditional regression analysis has a tough time with, and it handles awkward data well.”

Perez recommended using big data for large numbers of covariates, especially if you have more covariates than you have samples and if you are looking for patterns in complex data.

He discouraged the use of big data when regression/classic modeling is sufficient or when you are trying to understand physiology or epidemiology.

“Datasets are getting much bigger. Traditional statistical tools are not sufficient anymore. We do need new approaches to analyze massive datasets with lots of covariates,” he said. “But we must be cautious because there are pitfalls in analyzing big data with machine learning approaches.”