What Your Scores Mean: Limits of PHQ-9 and Mental Health Measurement

By Elizabeth Curley, Ph.D., MSW, ICI Director of Research & Policy

A recent brief report in JAMA Psychiatry raises uncomfortable questions about one of the most widely used tools in mental health care: the Patient Health Questionnaire (PHQ).

If you’ve been prescribed antidepressants through primary care, this tool has likely shaped your diagnosis, treatment decisions, and monitoring. The study found that approximately half of the general population sample and fewer than one-fifth of clinical participants interpreted the PHQ as intended.

The PHQ-9 asks how often you’ve been bothered by nine depression symptoms over the past two weeks. Response options range from “not at all” (0) to “nearly every day” (3). The key instruction, “bothered by,” is meant to capture not just how frequently symptoms occur, but how much distress they cause (i.e., severity).
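To make the scoring concrete, here is a minimal sketch in Python of how a PHQ-9 total is formed and how the two readings of “bothered by” can diverge for someone who oversleeps nearly every day but is not troubled by it. The item values for this respondent are invented for illustration, the assumption that the instructed reading yields a 0 on the sleep item is ours, and the cutoff of 10 is the commonly cited screening threshold rather than a figure from the study.

```python
# Minimal sketch of PHQ-9 scoring. All respondent data below are hypothetical.
N_ITEMS = 9
VALID_RESPONSES = (0, 1, 2, 3)  # 0 = "not at all" ... 3 = "nearly every day"
SLEEP_ITEM = 2                  # zero-based index of the sleep item (third question)
COMMON_CUTOFF = 10              # commonly cited screening threshold (assumption here)


def phq9_total(item_scores):
    """Sum nine item scores (each 0-3) into a total between 0 and 27."""
    if len(item_scores) != N_ITEMS:
        raise ValueError("the PHQ-9 has exactly nine items")
    if any(score not in VALID_RESPONSES for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores)


# Reading 1: the respondent answers by frequency alone (oversleeps nearly every day).
frequency_reading = [2, 1, 3, 2, 1, 1, 1, 0, 0]

# Reading 2: the respondent answers by how much the symptom bothers them (not at all).
distress_reading = list(frequency_reading)
distress_reading[SLEEP_ITEM] = 0

for label, scores in (("frequency", frequency_reading), ("distress", distress_reading)):
    total = phq9_total(scores)
    result = "positive screen" if total >= COMMON_CUTOFF else "negative screen"
    print(label, total, result)
```

Under these assumed values, the same person lands on either side of the cutoff depending only on how a single item is read.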

Yet most people answer based mainly on frequency. In the JAMA Psychiatry study, participants evaluated a hypothetical scenario of oversleeping every day but feeling comfortable with it; only 54.7% of the general sample and 15.5% of the clinical sample responded according to instructions. Similar issues emerged when people reflected on their own past answers and future intentions.

These interpretation issues reduce the tool’s validity, especially in real-world contexts.

This issue isn’t completely new; a 2015 study that interviewed participants about their responses to the PHQ reported something similar. Researchers found that participants tend to skip the instructions on self-report measures or, even when they try to follow them, misinterpret them entirely.

The real issue is that frequency and severity can be at odds with each other: a frequent symptom that barely bothers someone scores differently from a less common but deeply disturbing one. When these differences get blurred (especially for diagnoses that require impairment in daily functioning or distress), the clinical utility of the measure is undermined, which can skew who gets diagnosed and given medication.

This is especially concerning given recent estimates that primary care providers make up the large majority of prescribers (73.8%) for antidepressants and anxiolytics, often acting as the front door for anxiety and depression care. One study found that 89.4% of eligible patients were screened with the PHQ-9, and clinicians have described the PHQ-9 as useful for diagnosing, confirming severity, and documenting care.

In short, the PHQ turns screening into a gateway to prescribing. At its best, the PHQ is meant to inform screening and severity assessment, and to support diagnosis for access to treatment and monitoring.

However, diagnosis is not always clearly recorded or communicated to patients, and some antidepressant prescribing occurs without a documented diagnosis. In one analyzed transcript of a doctor-patient encounter, the doctor pulled out the PHQ-9 after the patient resisted treatment or the label of “depression,” and verbally modified the response options while administering it.

So if it is not being applied within the proper context, and it isn’t fully understood or reliable when it is implemented, what is its purpose?

The PHQ-9’s origin helps explain how it is used. Pfizer provided funding for its development during the era of Zoloft, with a commercial interest in making depression easier to identify and treat in primary care. While the developers maintain that they were scientifically independent, the financial incentive to expand treatment of depression is evident.

Measuring abstract concepts, like happiness or rumination, requires a specific research process of measurement development and psychometric validation. This process uses statistics to check that we are capturing the concepts we intend; a 10% reduction in an unverifiable feeling doesn’t tell us much. Some variation in reliability, accuracy, and validity in use is therefore expected for all measures - if they are even psychometrically tested to begin with.

However, the PHQ-9’s reliability sits atop broader diagnostic challenges. DSM-5 field trials showed minimal reliability for Major Depressive Disorder (MDD), with kappa values around 0.28 in some analyses. Although there is some debate over exact thresholds, kappa scores above 0.61 are commonly considered “good” or better, so 0.28 falls far short.
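For readers unfamiliar with kappa, the statistic compares the agreement two raters actually reach with the agreement they would reach by chance: kappa = (observed - expected) / (1 - expected). The sketch below computes Cohen’s kappa for two hypothetical clinicians. The ratings are invented, and the DSM-5 field trials used their own design and estimator, so this only illustrates how a raw agreement that sounds acceptable can correspond to a low kappa.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases where the two labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rates, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 20 patients (not data from any study).
clinician_a = ["MDD"] * 10 + ["no MDD"] * 10
clinician_b = ["MDD"] * 6 + ["no MDD"] * 4 + ["MDD"] * 4 + ["no MDD"] * 6

raw_agreement = sum(a == b for a, b in zip(clinician_a, clinician_b)) / len(clinician_a)
print("raw agreement:", raw_agreement)
print("kappa:", round(cohens_kappa(clinician_a, clinician_b), 2))
```

In this invented example the clinicians agree on 12 of 20 patients, yet half of that agreement would be expected by chance, which pulls kappa down to roughly 0.2.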

There are 227 possible symptom combinations that qualify for an MDD diagnosis, highlighting its heterogeneity as a clinical label. Somewhat unsurprisingly, false positive rates for brief screeners (like the PHQ-9) often exceed 50%.
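Both figures in this paragraph can be checked with a little arithmetic, sketched below in Python. The count of 227 follows from the DSM-5 symptom rule (at least five of nine symptoms, including at least one of the two core symptoms, depressed mood or loss of interest). The second function reads “false positive rate” as the share of positive screens that are not true cases, which is our interpretation; the sensitivity, specificity, and prevalence values are illustrative assumptions, not estimates from any study.

```python
from itertools import combinations

N_SYMPTOMS = 9
CORE_SYMPTOMS = {0, 1}  # depressed mood, loss of interest or pleasure
MIN_SYMPTOMS = 5


def count_qualifying_combinations():
    """Count symptom sets with >= 5 of 9 symptoms and at least one core symptom."""
    total = 0
    for size in range(MIN_SYMPTOMS, N_SYMPTOMS + 1):
        for combo in combinations(range(N_SYMPTOMS), size):
            if CORE_SYMPTOMS & set(combo):
                total += 1
    return total


def false_positive_share(sensitivity, specificity, prevalence):
    """Share of positive screens that are false positives (1 minus PPV)."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return false_positives / (true_positives + false_positives)


print(count_qualifying_combinations())  # 227

# Illustrative values only: 85% sensitivity, 85% specificity, 10% prevalence.
print(round(false_positive_share(0.85, 0.85, 0.10), 2))
```

With these assumed inputs, most positive screens are false positives even though the screener is right 85% of the time in each group, because people without depression far outnumber those with it.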

Real-world application can compound these issues. Studies of clinical sessions show flexible use of the tool and instances where the measure is misapplied. In one transcript, rather than relaying the PHQ-9 verbatim, the doctor deviates from the wording so that the response options are selectively offered to upgrade the severity of the patient’s symptoms. This is then used explicitly to justify a treatment recommendation that the patient previously resisted.

There is no consensus across countries. Recent guidelines state that a positive screen for depressive symptoms (such as through the PHQ) should be followed up with a secondary diagnostic assessment (Ferenchick et al., 2019). However, some available guidance states that the “PHQ 9 tool is used to screen or diagnose depression, measure the severity of symptoms, and measure a patient’s response to treatment.”

Given the real-world challenges that research has demonstrated, this representation may stretch the measure beyond what its strengths support.

The PHQ is not useless, but it is imperfect. Its limitations highlight why questionnaire scores should not (and ultimately, cannot) replace thoughtful clinical assessment. Measurements can be supportive, but it is important to know their context and understand their shortcomings.

Reliable depression scales exist, including some developed without pharmaceutical funding. To do right by patients, the path forward involves revisiting the research and choosing the best tool for the specific need - not relying on what has become commonplace.

If patients are encouraged to ask questions like “how is this score going to be used?” or “what else is being considered in this decision?” clinicians have more opportunities to explain the strengths and limitations of the tools they use. In turn, patients gain the context needed to make informed choices about treatment, whether that means starting medication, pursuing therapy, combining approaches, or taking time to explore other factors.

Thoughtful assessment should always be available, and asking questions shouldn’t be considered a confrontation - that’s the only way true informed choice can exist.