Chapter 9: What the Evidence Actually Shows
A genuinely skeptical reading of the psychotherapy evidence base — the kind of reading an epidemiologist would conduct rather than a therapy advocate — reveals a field that has oversold its results.
The Dodo Bird Verdict. The persistent finding that all bona fide psychotherapies produce roughly equivalent outcomes (Luborsky et al., 2002; Wampold, 2001) should be deeply troubling to the field. If CBT, psychodynamic therapy, humanistic therapy, and interpersonal therapy all produce similar results despite radically different theories of what causes suffering and how to fix it, then either:
- All therapies work for the same reason (common factors — the relationship, expectation, attention), and the specific techniques are largely therapeutic theater, or
- Our outcome measures are too blunt to detect real differences, or
- Something else is happening that none of the theories adequately explain.
Each of these possibilities is uncomfortable. Option 1 means we have spent decades and billions of dollars developing, manualizing, and RCT-testing specific techniques that don’t matter. Option 2 means our evidence base is built on instruments too crude to tell us what we need to know. Option 3 means the field doesn’t understand its own mechanism of action.
The therapist effect dwarfs the technique effect. Baldwin and Imel (2013) found that the identity of the therapist accounts for 5-9% of outcome variance, while the specific therapeutic modality accounts for roughly 1%. This means who your therapist is matters five to nine times more than what kind of therapy they practice. And yet the field continues to organize itself around modalities (CBT training, psychodynamic training, EMDR training) rather than around whatever it is that makes some individual therapists effective and others ineffective. We have built the entire professional infrastructure around the variable that matters least.
Publication bias has inflated the evidence. Cuijpers et al. (2010) demonstrated that when controlling for publication bias, the effect sizes for psychotherapy for depression shrink substantially. Positive results are published; null results disappear into file drawers. The evidence base the public sees is systematically more optimistic than the evidence base that actually exists. This is not fraud — it is the predictable consequence of incentive structures that reward positive findings — but the result is that patients and policymakers are making decisions based on an inflated picture of therapy’s effectiveness.
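For readers who want to see the mechanics of the file drawer, a minimal simulation makes the inflation concrete. Every parameter below (a true effect of d = 0.30, 20 patients per arm, a significance cutoff) is an illustrative assumption, not an estimate taken from Cuijpers et al. or the therapy literature:

```python
import random
import statistics

def simulate_publication_bias(true_d=0.3, n_per_arm=20, n_studies=2000, seed=1):
    """Simulate many small two-arm trials of a therapy with a modest true
    effect, then compare the average effect size across ALL studies with
    the average across only the 'published' (statistically significant)
    ones. All parameters are illustrative assumptions, not estimates
    from the therapy literature."""
    random.seed(seed)
    all_effects, published_effects = [], []
    for _ in range(n_studies):
        treated = [random.gauss(true_d, 1.0) for _ in range(n_per_arm)]
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        pooled_sd = ((statistics.variance(treated)
                      + statistics.variance(control)) / 2) ** 0.5
        d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
        all_effects.append(d)
        # Crude publication filter: with 20 per arm, a two-sided t-test
        # reaches p < .05 only when the observed d exceeds roughly 0.64.
        if d > 0.64:
            published_effects.append(d)
    return statistics.mean(all_effects), statistics.mean(published_effects)

mean_all, mean_published = simulate_publication_bias()
print(f"mean effect, all studies: {mean_all:.2f}; "
      f"mean effect, published only: {mean_published:.2f}")
```

Every simulated trial here is honest; no one cheats. Yet the published-only average lands well above the true effect of 0.30, because significance filtering keeps only the lucky overestimates. Selective publication alone produces the inflation.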
Long-term outcomes are rarely measured. The vast majority of therapy RCTs measure outcomes at end-of-treatment or at short-term follow-up (3-6 months). The few studies that follow patients for 2+ years show high relapse rates across virtually all conditions. The field has not demonstrated that therapy produces lasting change for most patients. It has demonstrated that therapy produces temporary improvement that often erodes.
Chapter 10: The Replication Crisis Comes for Therapy
Psychology’s replication crisis — the discovery that many landmark findings in the field cannot be reliably reproduced — has not spared psychotherapy research.
The Open Science Collaboration’s 2015 attempt to replicate 100 psychology studies found that only 36% of the replications produced statistically significant results, compared with 97% of the original studies. While not all of these were therapy studies, the methodological problems that drive replication failure (small samples, researcher degrees of freedom, p-hacking, publication bias) are endemic in the therapy literature.
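Low statistical power by itself predicts low replication rates. A minimal sketch shows why; the effect size, sample size, and test threshold below are hypothetical choices for illustration, not figures from the Open Science Collaboration. When samples are small, a "significant" original is often a lucky draw, and an honest rerun of the same design succeeds only at the study's true power:

```python
import random

def replication_rate(true_d=0.3, n_per_arm=30, n_pairs=4000, seed=7):
    """Run pairs of identically designed small trials. The 'original'
    counts only if it reaches significance; the replication is an
    independent rerun of the same design. Parameters are illustrative
    assumptions, not data from the Open Science Collaboration."""
    random.seed(seed)
    se = (2 / n_per_arm) ** 0.5      # standard error of the mean difference
    crit = 2.0 * se                  # approximate critical value for p < .05

    def mean_difference():
        treated = [random.gauss(true_d, 1.0) for _ in range(n_per_arm)]
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        return sum(treated) / n_per_arm - sum(control) / n_per_arm

    published = replicated = 0
    for _ in range(n_pairs):
        if mean_difference() > crit:      # original study is significant
            published += 1
            if mean_difference() > crit:  # replication attempt succeeds
                replicated += 1
    return replicated / published

rate = replication_rate()
print(f"share of significant originals that replicate: {rate:.0%}")
```

Under these assumptions the replication rate lands near the design's statistical power rather than anywhere near 95%, with no p-hacking or fraud required. Underpowered designs plus significance filtering are sufficient to manufacture a replication crisis.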
Specific concerns:
- Allegiance effects. Researchers who develop a therapy and then test it consistently find larger effect sizes than independent researchers testing the same therapy (Munder et al., 2013). This suggests that researcher expectations and subtle methodological choices inflate results.
- Waitlist control inflation. Comparing therapy to a waitlist control inflates effect sizes because being on a waitlist may actually worsen symptoms (the “nocebo” effect of knowing you are not receiving treatment). The field’s reliance on waitlist controls has systematically overstated therapy’s effectiveness.
- Therapist selection bias in RCTs. Therapy trials typically use carefully selected, well-trained, supervised therapists delivering treatment with high fidelity. This bears little resemblance to the average community mental health clinician delivering treatment under high caseload pressure with minimal supervision. Efficacy (does it work under ideal conditions?) does not equal effectiveness (does it work in the real world?).
- Outcome measure sensitivity. Self-report symptom measures (PHQ-9, GAD-7, BDI) are the standard. These instruments can detect symptom changes but say little about functional recovery — whether the person can hold a job, maintain relationships, parent effectively, or find meaning in their life. A patient whose PHQ-9 drops from 18 to 10 has “responded to treatment” by research standards but may still be significantly impaired in daily life.
Chapter 11: The Uncomfortable History
A skeptic must also reckon with the history of what the mental health profession has gotten wrong — not as ancient history, but as evidence of the field’s capacity for institutional error at enormous human cost.
- Lobotomy. Between 1936 and the mid-1950s, approximately 40,000-50,000 Americans received lobotomies. Walter Freeman performed his transorbital version with an ice-pick-like instrument driven through the eye socket, often without surgical training or adequate anesthesia. The procedure was celebrated: Egas Moniz, who devised the original operation, received the 1949 Nobel Prize in Physiology or Medicine for it. It destroyed lives. This was not fringe medicine. It was mainstream psychiatry, endorsed by leading institutions.
- Homosexuality as mental illness. The DSM classified homosexuality as a mental disorder until 1973. Thousands of Americans were subjected to conversion therapy, institutionalization, and social stigma based on the profession’s diagnostic judgment. The removal of homosexuality from the DSM was driven by political activism, not by new scientific evidence — the science had never supported the classification.
- The recovered memory epidemic. In the 1980s and 1990s, a significant faction of the therapy profession promoted techniques for “recovering” repressed memories of childhood sexual abuse. Many of these memories were false, produced by suggestive therapeutic techniques. Families were destroyed. People were imprisoned. The field eventually acknowledged the problem, but not before causing immeasurable harm.
- The DARE program and scared-straight interventions. Both were confidently promoted by mental health and criminal justice professionals. Both were subsequently shown by rigorous evaluation to be ineffective or actively harmful.
- Overmedicalization of normal distress. The successive expansions of the DSM, from 106 diagnoses in DSM-I (1952) to roughly 300 in DSM-5 (2013), have progressively pathologized ordinary human experiences: grief, shyness, childhood energy, post-traumatic adjustment. Allen Frances, who chaired the DSM-IV task force, called DSM-5 a “bonanza for the pharmaceutical industry” and warned that it would turn normal human suffering into mental disorder on an unprecedented scale.
These are not anomalies. They are evidence of a pattern: the mental health profession, operating with confidence and institutional authority, regularly causes significant harm based on theories that are later recognized as wrong. A skeptic asks: which of today’s confident assertions will be tomorrow’s lobotomy?