When discussing the merits of a particular healthcare quality measure, the concepts of “validity” and “reliability” are often used interchangeably (if used at all). This is understandable: they are technical terms (i.e., jargon) and they both refer to how “good” a measure is, so it’s no surprise that the details of what falls under each category can get a little muddy.

However, understanding the differences between these terms and what they truly represent is critical not just for those involved in measure development, but for anyone who uses a quality measure or wants to interpret its results. As with other statistical concepts, the proper use of a metric requires an understanding of what that metric does – and doesn’t – represent. For example, when someone on a hospital staff starts talking about the average length of stay or time to thrombolytics, you probably know that averages can be influenced by outlying data points, and therefore you know to ask some more probing questions about sample size and the spread of the data from which those averages are calculated. The same is true (or should be true) for validity and reliability.

As it relates to quality measures, validity reflects how accurately a measure represents whatever it is trying to capture. Healthcare “quality” is not often directly measurable, so you’re typically forced to measure surrogates you hope reflect the true underlying quality. For example, measuring the time from “door-to-balloon” for an AMI patient is intended to reflect the level of quality of AMI care. That metric is valid as a measure of quality of AMI care as long as shorter times can be considered to consistently represent better quality.

We can demonstrate validity by correlating the measure with other, similar measures (convergent validity), by showing that it discriminates between entities we know to be different by some other metric (known-groups validity), or by having content experts weigh in on its merits (face validity). Threats to validity can be either conceptual or practical in nature. That is, conceptually there may be reasons why “door-to-balloon” time is or is not a good surrogate for AMI care quality; perhaps there are legitimate reasons to delay angioplasty that are not reflected in the measure, for example. On a practical level, we need to be sure that we accurately and consistently acquire the data used to calculate the measure. Are the timestamps in the electronic health record accurate? Can we be sure certain data fields are always populated?
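To make the first two of these checks concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the hospital-level numbers are invented, the variable names are hypothetical, and the use of 30-day AMI mortality as the "similar measure" is just one plausible choice. Real validity testing would use far more entities and a formal statistical test rather than a simple comparison of numbers.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical hospital-level data (illustrative only, not real results).
door_to_balloon_min = [55, 62, 70, 78, 85, 95]        # median minutes per hospital
ami_mortality_pct = [5.1, 5.6, 5.4, 6.3, 6.8, 7.2]    # 30-day mortality, percent

# Convergent validity: does the measure move together with a related measure?
convergent_r = pearson_r(door_to_balloon_min, ami_mortality_pct)

# Known-groups validity: hospitals assumed (by some other metric) to deliver
# better AMI care should show shorter times than the comparison group.
reference_group_times = [55, 62, 70]     # hypothetical "known better" hospitals
comparison_group_times = [78, 85, 95]    # hypothetical comparison hospitals
known_groups_gap = mean(comparison_group_times) - mean(reference_group_times)

print(f"convergent r = {convergent_r:.2f}")
print(f"known-groups gap = {known_groups_gap:.1f} minutes")
```

A strong positive correlation and a clear gap between the groups would be consistent with validity in this toy example, but neither says anything about the conceptual threats described above.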

Reliability, on the other hand, relates to the ability of the measure to correctly discriminate differing levels of quality (or changes in quality) between entities. That is, if two hospitals or physicians actually differ in their care quality, how likely is it that the measure will detect that difference?

To establish the reliability of a measure, we test things like whether data pulled by multiple abstractors from the same chart agree (inter-rater reliability) or whether observed variability is due mostly to differences in actual performance rather than random fluctuation (signal-to-noise analysis). Challenges to reliability occur when too much of the variation in measured performance is due to reasons other than differences in the underlying quality of care delivery. Small sample sizes, unintended biases, and uncontrolled factors may make it difficult to reliably and consistently differentiate entities that truly differ in the quality of care they deliver.
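Both of these tests can be sketched in a few lines of Python. The sketch below rests on several assumptions: the abstractor answers and hospital counts are invented, the function names are hypothetical, Cohen's kappa is used as one common agreement statistic, and the signal-to-noise estimate is a deliberately simplified method-of-moments version for a pass/fail measure. Formal measure testing often fits a statistical model (for example, a beta-binomial or hierarchical model) instead.

```python
from collections import Counter
from statistics import mean, variance


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length rating lists")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, given each rater's category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)


def signal_to_noise_reliability(results):
    """Per-entity reliability = between-entity variance / (between + sampling variance).

    `results` maps entity name -> (numerator, denominator) for a pass/fail
    measure and must contain at least two entities.
    """
    rates = {k: num / den for k, (num, den) in results.items()}
    # Sampling (noise) variance of each entity's observed rate.
    noise = {k: rates[k] * (1 - rates[k]) / results[k][1] for k in results}
    # Signal: observed variance across entities minus average noise, floored at 0.
    between = max(variance(list(rates.values())) - mean(list(noise.values())), 0.0)
    reliability = {}
    for k in results:
        total = between + noise[k]
        reliability[k] = between / total if total > 0 else 0.0
    return reliability


# Hypothetical: two abstractors recording whether a data element is documented.
abstractor_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]
abstractor_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "yes"]
kappa = cohens_kappa(abstractor_1, abstractor_2)

# Hypothetical hospitals: (patients meeting the measure, eligible patients).
hospitals = {"A": (45, 50), "B": (160, 200), "C": (700, 1000), "D": (18, 20)}
reliability = signal_to_noise_reliability(hospitals)

print(f"kappa = {kappa:.2f}")
for name, value in reliability.items():
    print(f"hospital {name}: reliability = {value:.2f}")
```

In this toy data, hospitals A and D have the same observed rate, but D's estimate rests on only 20 patients and so comes out less reliable, which illustrates the small-sample problem described above.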

Why does it matter? Because, while related, the concepts of validity and reliability are NOT the same, and a good measure is one that is BOTH valid AND reliable. Having one without the other is not sufficient. When you’re developing a measure, it is important to consider how it will be used, where the data will come from, whether the inclusion and exclusion criteria will systematically exclude a certain group of people, and so on. Those who end up using the measure will need to have confidence that it reflects true quality (validity) and that when they improve the quality of care they deliver, the improvement will show up in measure performance (reliability). If they don’t, the measure becomes a burden rather than something that can be used to motivate and demonstrate quality improvement.

Once a measure has been developed and put into use, properly interpreting the relative performance of measured entities requires considering aspects of both validity and reliability. That is, when looking at the relative performance of providers or facilities on a certain measure, try to think about what the potential threats to validity and reliability might be. Does the measure use data that seem like they would be difficult to obtain consistently? Are there situations where there might be legitimate reasons why someone would perform poorly on this measure other than their underlying quality? Do the inclusion or exclusion criteria allow for someone to potentially “game” the measure or “cherry pick” the best patients? In many cases, it’s likely that empirical testing was performed during development to demonstrate adequate validity and reliability, but it’s difficult (if not impossible) to account for all possible scenarios and situations that could occur, so you need to be vigilant when interpreting measures.

When you read or hear about critiques and challenges to measures, it can be helpful to think about whether those critiques relate to the validity or reliability (or both) of the measure. For example, when someone challenges the use of patient outcome measures like 30-day readmission or mortality, are they challenging the validity of the measure itself (e.g., “too much can happen to a patient after discharge that’s out of a hospital’s control, and therefore the measure doesn’t reflect the quality of care it provides”)? Or are they unconvinced of its reliability (e.g., “Even with risk-adjustment, comparisons between these facilities are not appropriate or fair”)? Viewing the critiques in this light allows you to understand the nature of the criticisms and can help you evaluate whether they have merit and what (if anything) should be done about them. A challenge to the appropriateness of a measure as a surrogate for underlying quality is a vastly different issue from the question of whether the available data are adequate to make appropriate comparisons. A valid measure isn’t necessarily a reliable one, and vice versa.

The Future of Validity and Reliability of Quality Measures

Going forward, the ever-expanding amount and availability of data will allow for more empirical testing of quality measures than ever before. Additionally, greater exploration and understanding of factors that influence population health – for example, the role of social determinants of health – will allow for the identification and specification of more appropriate surrogates for care quality, and for fuller risk adjustment of those surrogates when making comparisons. However, incorporating new knowledge into measures takes time, both in the conception of the measures themselves and in the collection and analysis of data related to that new knowledge. Validity and reliability will always be important considerations, and will always be at the heart of important discussions regarding the appropriateness and fairness of quality measures. A thorough understanding of these concepts is essential for anyone hoping to develop, test, use, or interpret quality measures.