Ask our experts: Working with real world data sets: What types of questions can your data answer?

The promise of real-world evidence (RWE) comes with a steep learning curve. As regulators and industry continue to engage in the shared learning process to explore the use cases best suited to RWE, a number of questions are bound to arise. Fortunately, Aetion’s experts—who participate daily in principled database epidemiology and the exploration of real-world data sets’ fitness for purpose—are well equipped to answer questions from our readers on all things real-world data (RWD) and evidence.

We chatted with Emily Rubinstein, M.P.H., Senior Director of Global Data at Aetion, to address queries about data quality, access, and fit. Emily, who was formerly the Director of Real-World Data and Analytics at Pfizer, has held data leadership roles and engages daily with global real-world data sets. She is a graduate of Columbia University’s Mailman School of Public Health and Tufts University.

Q: We’ve seen increased interest from regulators in using RWD. How do they decide whether or not a data set is trustworthy?
Regulators prefer data that they can trace back to the patient level; that level of transparency is key. They want to know that the patient is being accurately captured in the larger health care ecosystem. Oftentimes, the real-world data we use was originally collected for a purpose other than RWD analyses, so a given data set will work for some questions and not for others. It has a lot to do with the question you’re asking, and how the specific data entries answer that question. For example, is there a lot of missing data (or missingness) for a specific endpoint because it isn’t commonly used by doctors? Is the field perhaps not required for billing purposes, so that it will not accurately capture the endpoint needed? Either could make the data unfit for regulatory use.

Also, I would add that data feasibility—the chance to pressure test a data set to see what it does and doesn’t include before using it in an analysis—is key to knowing a good data source. During the feasibility test you can assess for confounders and refine your question as needed.
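
As an illustration of one small piece of a feasibility check, the sketch below computes the share of records in which each field is missing and flags fields above a cutoff. It is a minimal example under stated assumptions, not Aetion’s method: the field names, the sample records, the list of tokens treated as missing, and the 20% cutoff are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Mapping

# Hypothetical tokens treated as "missing"; real sources encode this in many ways.
MISSING_TOKENS = {"", "na", "unknown"}


def is_missing(value: Any) -> bool:
    """Treat None and blank/placeholder strings as missing."""
    if value is None:
        return True
    return isinstance(value, str) and value.strip().lower() in MISSING_TOKENS


def missingness_by_field(
    records: Iterable[Mapping[str, Any]], fields: List[str]
) -> Dict[str, float]:
    """Return, for each field, the fraction of records where it is missing."""
    rows = list(records)
    if not rows:
        return {field: 0.0 for field in fields}
    return {
        field: sum(is_missing(row.get(field)) for row in rows) / len(rows)
        for field in fields
    }


def fields_above_cutoff(rates: Mapping[str, float], cutoff: float) -> List[str]:
    """List fields whose missingness exceeds the (assumed) cutoff."""
    return sorted(field for field, rate in rates.items() if rate > cutoff)


# Hypothetical claims-style records; codes and values are placeholders.
sample_records = [
    {"patient_id": "p1", "diagnosis_code": "D1", "smoking_status": None},
    {"patient_id": "p2", "diagnosis_code": "D2", "smoking_status": "never"},
    {"patient_id": "p3", "diagnosis_code": "D1", "smoking_status": ""},
    {"patient_id": "p4", "diagnosis_code": "D3"},  # field absent entirely
]

rates = missingness_by_field(sample_records, ["diagnosis_code", "smoking_status"])
flagged = fields_above_cutoff(rates, cutoff=0.20)
print(rates, flagged)
```

A field that billing does not require, such as the hypothetical `smoking_status` here, would surface in `flagged`, which is a prompt to refine the question or look for another source rather than an automatic disqualification.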

Remember, there is no perfect real-world data source. It’s about asking the right question for the data that exists at the time.

Q: Data structure can vary significantly depending on the cultures and populations from which the data was collected. Can you provide some examples of those factors that can shape data structure?
The bottom line is that the common standard of care in each country or region must be understood in order to identify what the data can tell you.

In France, for instance, it’s relatively easy for people to access a prescription drug once it’s approved. However, it can be difficult to get a doctor’s appointment.

That’s very different from, say, Germany. Germany is stricter than other countries about prescribing drugs that haven’t proven to be more effective than other less expensive options.

Japan is unique because insurance claims are submitted monthly, rather than on an encounter-by-encounter basis. Unlike in the U.S., Japanese data is also not linkable, since the practice of linking claims data with data collected in hospital electronic health records (EHRs) is prohibited.

When you’re looking across geographies, you have to keep these factors in mind as you’re designing and conducting analyses on the data.

Q: How can data be standardized across geographies to make analyses using global data sets more efficient? What are some of the common challenges to this standardization?
There are many attempts to standardize data for real-world evidence. There are often questions about how these standardization approaches treat the regional (and other) idiosyncrasies of data generation, which will persist even with the best of harmonization techniques. For example, different countries can have unique medical coding systems: Japan’s own ICD system and the U.K.’s transition to SNOMED CT codes may not map directly to how the U.S. uses ICD codes.

For that reason, it is more important to embrace the idiosyncrasies than to avoid them. Sometimes that “outlier” is actually telling us something. So if you design the question to be inclusive or work around the idiosyncrasies of the data, you’re going to get the best answer.

Q: What are the primary differences between data collected in an academic vs. community medical center setting?
Academic hospitals tend to be in urban areas, and doctors who work there usually have a research component to their appointments. And so their treatment tends to be much more on the cutting edge; the newest treatment is often accessed first in academic medical centers.

Community medical centers will see filtered incident guidelines, meaning that patients may receive care that is closer to established treatment pathways than they would in an academic center. That isn’t to say that community medical centers aren’t reading all the latest articles, but their remit is often different from that of an academic medical center doctor. So, the observed uptake of a new product can look very different depending on whether your data source is an academic medical network or a community data network.

Rare diseases also show up more often in academic medical centers, which can alter the questions you can answer with the data because that population may not be represented in community medical centers’ data sets.

Q: When regulators reject a real-world data submission, how can you determine what contributed to their decision?
The underlying question to ask yourself is: did you ask the right question for the data you have? Regulators may reject a data source that they’ve accepted previously because they don’t like the way the research question was asked.

When Aetion seeks to answer research questions, it usually becomes clear during the analysis whether or not a data source is fit to answer the question at hand. Maybe, for example, the subpopulation we’re looking at isn’t as robust as it could be in another data source, or the data could be older than we thought it would be. When this happens, we either tweak the question or search for alternate data sources that could answer our question more fully.

Q: What happens when you want to use disparate data sets in an analysis?
When combining real-world data sets, it is important to be able to “look” or search across the data sets. For example, looking at two claims data sources is easier than looking across a claims data source and survey data combined.

When you look across data sets you also have to think about why the data was collected in the first place. If you were to look across two databases of data that was captured for billing reasons, you’re essentially comparing apples to apples.

But if you’re looking across billing data and survey data, for example, you start comparing apples to oranges, and you have to think more carefully about the work you’re doing. There’s a reason why people say you can’t compare apples to oranges. Thinking through the science behind the data is what makes it possible to combine different types of data, and it helps you consider what is needed to recognize insights across data sets.

We welcome the opportunity to keep this conversation going. We invite your questions about data for real-world studies—please forward your queries for our experts at
