Evidence Based Medicine: An annotated bibliography

Cite this article as:
Morgenstern, J. Evidence Based Medicine: An annotated bibliography, First10EM, January 10, 2022. Available at:
https://doi.org/10.51684/FIRS.124519

I read a lot, but I am not very organized. Over the years, I have read thousands of papers about evidence based medicine and methodology. I frequently find myself wanting to share interesting papers with students, or cite them in my blog posts, but I forget where to find the paper (and sadly, this seems to be happening more often with age). So I have created a living evidence based medicine annotated bibliography. This is really just a list of interesting or important EBM papers, with a few key notes. I think it will be helpful to others trying to understand evidence based medicine. It will be updated with time, as I sort through the many hard drives worth of PDFs that fill my office, so feel free to check back in. If there are papers that you think deserve to be on this list, or that I might just enjoy, please feel free to share them in the comments section at the end.


Table of Contents:

General EBM

Bias

Statistics

Meta-analyses

Replication

Peer Review

Conflict of Interest

Publication Bias

Stopping Trials Early

Composite Outcomes

Other


General Evidence Based Medicine

Sackett DL, Rosenberg WM, Gray JA, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn’t. BMJ. 1996 Jan 13;312(7023):71-2. doi: 10.1136/bmj.312.7023.71. PMID: 8555924

  • This is one of THE classic papers in EBM, and should be core reading material for anyone with an interest.
  • “Evidence based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.”
  • “Good doctors use both individual clinical expertise and the best available evidence, and neither alone is enough”
  • “Evidence based medicine is not “cookbook” medicine. Because it requires a bottom up approach that integrates the best external evidence with individual clinical expertise and patients’ choice, it cannot result in slavish, cookbook approaches to patient care.”

Djulbegovic B, Guyatt GH. Progress in evidence-based medicine: a quarter century on. Lancet. 2017 Jul 22;390(10092):415-423. doi: 10.1016/S0140-6736(16)31592-6. Epub 2017 Feb 17. PMID: 28215660

  • This is a great overall summary of the history of EBM, co-authored by Gordon Guyatt, who coined the term. Worth a read.
  • “EBM has disseminated three major tenets: an increasingly sophisticated hierarchy of evidence, the need for systematic summaries of the best evidence to guide care, and the requirement for considering patient values in important clinical decisions.”
  • “Central to the epistemology of EBM is that what is justifiable or reasonable to believe depends on the trustworthiness of the evidence, and the extent to which we believe that evidence is determined by credible processes. Although EBM acknowledges a role for all empirical observations, it contends that controlled clinical observations provide more trustworthy evidence than do uncontrolled observations, biological experiments, or individual clinician’s experiences.”
  • “Evidence is, however, necessary but not sufficient for effective decision making, which has to address the consequences of importance to the decision maker within the given environment and context. Thus, the third epistemological principle of EBM is that clinical decision making requires consideration of patients’ values and preferences.”
  • They cite estimates that due to biased research, “50% of research effort is wasted at each stage of generation and reporting of research, resulting in more than 85% of total research wasted.”
  • Biased research results in tremendous harm to patients. Examples:
    • Unnecessary bone marrow transplants in breast cancer
    • Harmful antiarrhythmic prescriptions
    • Hormone replacement therapy
  • “Researchers have increasingly differentiated between explanatory (also known as mechanistic or proof-of-concept efficacy) trials that address the question “can intervention work in the ideal setting?” versus pragmatic (also known as practical, effectiveness) trials that address the question “does it work in real-world settings?” and “is it worth it and should it be paid for? (efficiency)”.”

Greenhalgh T, Howick J, Maskrey N; Evidence Based Medicine Renaissance Group. Evidence based medicine: a movement in crisis? BMJ. 2014 Jun 13;348:g3725. doi: 10.1136/bmj.g3725. PMID: 24927763 [full text]

  • There are some major issues in the current state of EBM
    • Corporations have taken over the terminology of EBM and set the research agenda, but their goal is sales not truth, resulting in the use of many tactics (over-powering, surrogate outcomes, etc) designed to bias the results of studies
    • The sheer volume of evidence is overwhelming, driven by a research culture that values quantity over quality
    • With much of the “low hanging fruit” picked, research has largely moved into areas of marginal gains
    • EBM has been mistranslated into algorithmic care and quality metrics, which are actually counter to the ethos of EBM
  • As solutions, the authors tell us that EBM must be individualized to the patient, based on judgement not rules, and used to build a strong interpersonal relationship between clinician and patient.
  • “Real evidence based medicine is as much about when to ignore or override guidelines as how to follow them”

Every-Palmer S, Howick J. How evidence-based medicine is failing due to biased trials and selective publication. J Eval Clin Pract. 2014 Dec;20(6):908-14. doi: 10.1111/jep.12147. Epub 2014 May 12. PMID: 24819404

  • For the most part, there is not resounding evidence that evidence based medicine has dramatically improved care for patients since the concept was introduced 40 years ago. The authors discuss the many ways in which EBM has been ineffectively implemented, which probably contribute to that problem. 
  • The biggest issue is that we allow pharmaceutical companies to use “evidence based medicine” as marketing
  • Needed steps include enforcing trial registration (https://www.alltrials.net/), investment in independent research, having independent bodies set research priorities, and prioritizing methodologic quality over ‘positive’ results.

Dickersin K, Straus SE, Bero LA. Evidence based medicine: increasing, not dictating, choice. BMJ. 2007; 334(suppl_1):s10. doi: 10.1136/bmj.39062.639444.94

  • “It is curious, even shocking, that the adjective “evidence based” is needed. The public must wonder on what basis medical decisions are made otherwise. Is it intuition? Magic?”
  • This article rebuts that common claim that EBM is about limiting choices. To anyone who practices EBM, it is very clear that it increases choices for patients and physicians. It just makes those choices safer, and more scientific.

Kiessling A, Lewitt M, Henriksson P. Case-based training of evidence-based clinical practice in primary care and decreased mortality in patients with coronary heart disease. Ann Fam Med. 2011 May-Jun;9(3):211-8. doi: 10.1370/afm.1248. PMID: 21555748

  • People often claim there is no evidence for evidence based medicine. (Personally, I think that is a silly argument, as we have seen tremendous benefit from the scientific method in medicine, dating back to the original controlled trials of scurvy or blood letting.)
  • This imperfect study randomized primary care doctors either to active, case-based seminars on EBM guidelines for lipid lowering or to a control group. At the end of 10 years, there was an astonishing 22% absolute reduction in mortality in the group with active implementation of evidence based guidelines.

Shuval K, Linn S, Brezis M, Shadmi E, Green ML, Reis S. Association between primary care physicians’ evidence-based medicine knowledge and quality of care. Int J Qual Health Care. 2010 Feb;22(1):16-23. doi: 10.1093/intqhc/mzp054. Epub 2009 Dec 1. PMID: 19951965

  • This observational study found that physicians with more knowledge about EBM (critical appraisal and information retrieval) also performed better on diabetes quality of care measures. In other words, EBM knowledge correlates with better quality of care.

Bias

The various sources of research bias are so important that they get their own section on First10EM. 


Statistics

P Values

The many misconceptions about p values require a blog post of their own, but for a few key citations:

Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008 Jul;45(3):135-40. doi: 10.1053/j.seminhematol.2008.04.003. PMID: 18582619

  • This article goes over a number of important misconceptions about p values.
  • Most importantly, “The operational meaning of a p value less than 0.05 was merely that one should repeat the experiment.”
  • There is a ton to learn in this paper. My key learning points:
    • A p value of 0.05 does not mean there is a 5% chance of the null hypothesis being true
    • A p value above 0.05 doesn’t mean there is no difference between the groups
    • A statistically significant difference is different from a clinically significant difference
    • A p value of 0.06 is not really different or “conflicting” with a p value of 0.04
    • Just because two studies have the same p value does not mean they provide the same degree of evidence against the null hypothesis
    • P <0.05 is not the same as p = 0.05. In the era of computers, we should report the exact p value (p=X)
    • Scientific conclusions and policies should not be based on p values
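
The first misconception above is easy to demonstrate with a quick simulation (my own sketch, not from Goodman’s paper; it assumes numpy and scipy are installed): when two “treatments” are truly identical, p values are spread across the whole range from 0 to 1, and roughly 5% of comparisons fall below 0.05 by chance alone, which is why a p value below 0.05 cannot be read as “a 5% chance the null hypothesis is true”.

```python
# Minimal sketch (assumes numpy and scipy): simulate many trials comparing two
# IDENTICAL treatments and see how often we get a "statistically significant" result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_trials, n_per_arm, event_rate = 2_000, 500, 0.10

false_positives = 0
for _ in range(n_trials):
    # Both arms are drawn from the same distribution: the null is true by construction.
    a = rng.binomial(1, event_rate, n_per_arm)
    b = rng.binomial(1, event_rate, n_per_arm)
    table = [[a.sum(), n_per_arm - a.sum()],
             [b.sum(), n_per_arm - b.sum()]]
    _, p = stats.fisher_exact(table)
    if p < 0.05:
        false_positives += 1

# Expect roughly 5% "significant" trials despite zero true effect
# (slightly fewer, because Fisher's exact test is conservative).
print(f"'Significant' trials: {false_positives / n_trials:.1%}")
```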

Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician. 2016; 70(2):129-133. [full article]

  • The American Statistical Association felt it was necessary to make a statement on p-values, because they are so widely misused. They came out with 6 principles:
    • P-values can indicate how incompatible the data are with a specified statistical model
    • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone
    • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
    • Proper inference requires full reporting and transparency
    • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
    • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis
  • Probably the most important, and most misunderstood, issue addressed here: the p-value has nothing to do with reality. It is only a comparison to a statistical model of the null hypothesis, which is never something we care about in medicine. Unless you are a statistician, you are almost certainly better off ignoring p-values and focusing on critical appraisal more broadly.

Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p < 0.05”. The American Statistician. 2019; 73(sup1):1-19. [full text]

  • A fantastic overall paper that repeats many of these warnings about the p value and provides its own annotated bibliography.
  • “Regardless of whether it was ever useful, a declaration of “statistical significance” has today become meaningless. Made broadly known by Fisher’s use of the phrase (1925), Edgeworth’s (1885) original intention for statistical significance was simply as a tool to indicate when a result warrants further scrutiny” 
  • I like their suggestion for an alternative approach to statistical analysis: “Accept uncertainty. Be thoughtful, open, and modest.” Remember “ATOM.” (The paper goes into this in more detail.)

Benjamin, D.J., Berger, J.O., Johannesson, M. et al. Redefine statistical significance. Nat Hum Behav 2, 6–10 (2018). https://doi.org/10.1038/s41562-017-0189-z

  • These authors suggest redefining statistical significance as p<0.005
  • I think this would help in medicine, where we are awash in false positive results. But treating the p-value like a threshold still results in a fundamental misunderstanding of the measure.
  • John Ioannidis has a good commentary on this idea here: Ioannidis JPA. The Proposal to Lower P Value Thresholds to .005. JAMA. 2018;319(14):1429–1430. doi:10.1001/jama.2018.1536

P Hacking

Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of p-hacking in science. PLoS Biol. 2015 Mar 13;13(3):e1002106. doi: 10.1371/journal.pbio.1002106. PMID: 25768323 [full text]

  • P hacking (AKA inflation bias or selective reporting) occurs when researchers try out several statistical analyses, or compare multiple sets of data, and then selectively report those that are statistically significant
    • Examples include “conducting analyses midway through experiments to decide whether to continue collecting data, recording many response variables and deciding which to report post analysis, deciding whether to include or drop outliers post analyses, excluding, combining, or splitting treatment groups postanalysis, including or excluding covariates postanalysis, and stopping data exploration if an analysis yields a significant p-value.”
  • If you collect the p values of all the published research in a field, you can find evidence of publication bias and p hacking. Ie, an overabundance of p values just below 0.05 is good evidence of p-hacking.
  • “Our study provides two lines of empirical evidence that p-hacking is widespread in the scientific literature” – this included medicine.
  • Some recommendations to prevent p-hacking
    • Need to clearly label or register analyses as prespecified
    • Perform analyses blinded
    • Methods and results must be assessed independently of each other. *** I think this is a big one. Journals should be deciding whether to publish a study based on just the methods, without even seeing the results.***
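
The core mechanism is easy to see in a toy simulation (my own sketch, assuming numpy and scipy, not code from the paper): measure several unrelated outcomes on data with no true effect, report only the smallest p value, and the false positive rate climbs far above 5%.

```python
# Sketch of "outcome shopping" p-hacking (assumes numpy and scipy): measure several
# independent outcomes with no true effect and keep only the best-looking p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_arm, n_outcomes = 5_000, 100, 10

significant_after_hacking = 0
for _ in range(n_sims):
    p_values = []
    for _ in range(n_outcomes):
        # The null is true for every outcome: both arms come from the same distribution.
        treatment = rng.normal(0, 1, n_per_arm)
        control = rng.normal(0, 1, n_per_arm)
        p_values.append(stats.ttest_ind(treatment, control).pvalue)
    if min(p_values) < 0.05:          # report only the "best" outcome
        significant_after_hacking += 1

# With 10 independent looks, the chance of at least one p < 0.05 is about 1 - 0.95**10 ≈ 40%.
print(f"At least one 'significant' outcome: {significant_after_hacking / n_sims:.0%}")
```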

Belas N, Bengart P, Vogt B. P-hacking in Clinical Trials: A Meta-Analytical Approach. https://doi.org/10.24352/UB.OVGU-2018-573

  • Looking at a dataset of 1177 clinical trials submitted to the FDA for drug applications, these researchers found an inordinate number of primary outcomes with p values reported at just under the 0.05 and 0.01 threshold. There were many more results at these p values than would be expected by chance, which suggests significant p-hacking in this literature.

Confidence intervals

McCormack J, Vandermeer B, Allan GM. How confidence intervals become confusion intervals. BMC Med Res Methodol. 2013 Oct 31;13:134. doi: 10.1186/1471-2288-13-134. PMID: 24172248

  • We make many of the same mistakes with confidence intervals as we do with p values. (We chose the 95% confidence intervals for the same arbitrary reason as we chose p<0.05.)
  • The authors provide great examples of the silly need we have in medicine to dichotomize research results. Despite multiple trials clearly having equivalent results, the conclusions are drastically different based on whether the confidence intervals barely crossed 1
  • “medical authors feel the need to make black and white conclusions when their data almost never allows for such dichotomous statements”
  • “We encourage authors to avoid statements like “X has no effect on mortality” as they are likely to be both untrue and misleading”
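
As a rough illustration of the dichotomization problem (my own sketch with made-up numbers, assuming numpy, not data from the paper): two hypothetical trials with nearly identical effect estimates can earn opposite “conclusions” simply because one confidence interval barely includes 1 and the other barely excludes it.

```python
# Sketch (assumes numpy): two hypothetical trials with almost identical results,
# one "significant" and one "not significant" only because of where the CI falls.
import numpy as np

def odds_ratio_ci(events_tx, n_tx, events_ctrl, n_ctrl):
    """Odds ratio with a simple Wald 95% confidence interval."""
    a, b = events_tx, n_tx - events_tx
    c, d = events_ctrl, n_ctrl - events_ctrl
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    return np.exp(log_or), np.exp(log_or - 1.96 * se), np.exp(log_or + 1.96 * se)

# Hypothetical numbers chosen only to make the point.
for name, arms in [("Trial A", (80, 1000, 105, 1000)),
                   ("Trial B", (78, 1000, 105, 1000))]:
    or_, lo, hi = odds_ratio_ci(*arms)
    verdict = "significant" if hi < 1 or lo > 1 else "not significant"
    print(f"{name}: OR {or_:.2f} (95% CI {lo:.2f}-{hi:.2f}) -> {verdict}")
```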

Subgroup Analysis

Wallach JD, Sullivan PG, Trepanowski JF, Sainani KL, Steyerberg EW, Ioannidis JP. Evaluation of Evidence of Statistical Support and Corroboration of Subgroup Claims in Randomized Clinical Trials. JAMA Intern Med. 2017 Apr 1;177(4):554-560. doi: 10.1001/jamainternmed.2016.9125. PMID: 28192563

  • Subgroup claims are very common in studies, but the statistics are dubious, and they very rarely pan out in future research
  • Here they look at 64 RCTs that made 117 subgroup claims
    • 40% were stated to be positive
    • But only 28% had been prespecified and only 2% statistically adjusted for multiple testing
    • Most importantly, only 5 (10%) had any subsequent corroboration attempts – and all 5 were negative

Sleight P. Debate: Subgroup analyses in clinical trials: fun to look at – but don’t believe them! Curr Control Trials Cardiovasc Med. 2000;1(1):25-27. doi: 10.1186/cvm-1-1-025. PMID: 11714402

  • The title says it all: Subgroup analyses in clinical trials: fun to look at – but don’t believe them!
  • As a whole in medicine, we underestimate the possibility that positive trials are positive due to chance alone, based on our acceptance of the p<0.05 cutoff. Subgroups just make chance findings much more common
  • In ISIS-2 (the trial showing benefit of ASA in MI), there were two subgroups based on astrological signs (Gemini and Libra) in which aspirin was actually harmful. Out of 16 countries, streptokinase didn’t work in 2 countries. When looking for negative effects, the subgroups appear patently ridiculous – but we accept the opposite (positive) claims all the time.
  • Essentially, subgroups take a large randomized trial, and break it down into multiple much smaller (and often not properly randomized) trials. If we considered these smaller trials on their own, we wouldn’t believe the results. But we tend to look at the quality of the overall study and inappropriately apply those methods to the small subgroups.

Sun X, Briel M, Busse JW, You JJ, et al. The influence of study characteristics on reporting of subgroup analyses in randomised controlled trials: systematic review. BMJ. 2011 Mar 28;342:d1569. doi: 10.1136/bmj.d1569. PMID: 21444636

  • Industry funded trials are more likely to report subgroups when the primary outcome is negative. (They don’t when the primary outcome is positive). They were also less likely to prespecify these subgroups, and less likely to appropriately adjust their statistics for making multiple comparisons. (Ie, they are p-hacking to try to sell their product.)

Yusuf S, Wittes J, Probstfield J, Tyroler HA. Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. JAMA. 1991 Jul 3;266(1):93-8. PMID: 2046134

  • If you want one paper to summarize subgroups, this is an excellent choice.
  • “The more questions asked of a set of data, the more likely it will yield some statistically significant difference even if the treatments are in fact equivalent.”
  • “Suppose 1000 patients with a mortality rate of 10% are allocated at random to two equally efficacious treatments. If the data are then divided at random into 10 equally sized subgroups and the relative risk calculated for each, the chance is roughly 99% that, in at least one subgroup, the relative risk will be at least 2; the chance is over 80% of observing a relative risk of at least 3; the chance is 5% of observing a relative risk of at least 10.”
  • The many negatives of subgroups does not mean there is no role. Patients are far more complex than the “average patient”, and so we should expect treatments to work differently in different patients. (Males versus females is a good example).
  • However, selection criteria for trials are so strict that the patients look much more homogenous than real patients, and so trials are enrolled in such a way that we should expect subgroups to be negative.
  • Subgroups are inappropriately powered (we power trials to be just big enough for the primary outcome, so any subdivision will leave the trial under-powered). Thus, in an overall positive trial, one needs to be very cautious of any negative subgroups, as these subgroups were not appropriately powered.
  • A simple approach to adjusting for multiple comparisons is to divide the p value you consider significant by the number of comparisons being made. Thus, in a trial that makes 20 comparisons, p<0.05 becomes p<0.0025. (Unfortunately, researchers often make many more comparisons that are unreported, so this simple approach will be inadequate.)
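
The Yusuf thought experiment quoted above is easy to check empirically. This is a minimal sketch of my own (assuming numpy; the exact percentage will wobble a little from run to run): 1000 patients with roughly 10% mortality, two equally effective “treatments”, and the data split at random into 10 equally sized subgroups.

```python
# Sketch of the Yusuf et al. thought experiment (assumes numpy): 1000 patients,
# ~10% mortality, two equally effective "treatments", 10 random subgroups.
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_patients, n_subgroups, mortality = 2_000, 1000, 10, 0.10

hits = 0
for _ in range(n_sims):
    deaths = rng.binomial(1, mortality, n_patients)
    arm = rng.permutation(np.repeat([0, 1], n_patients // 2))                     # random allocation
    sub = rng.permutation(np.repeat(np.arange(n_subgroups), n_patients // n_subgroups))
    for g in range(n_subgroups):
        m = sub == g
        r0 = deaths[m & (arm == 0)].mean()
        r1 = deaths[m & (arm == 1)].mean()
        lo, hi = sorted((r0, r1))
        # Relative risk of at least 2 in either direction (0 vs >0 deaths counts as "infinite").
        if hi > 0 and (lo == 0 or hi / lo >= 2):
            hits += 1
            break

# This should land near the quoted ~99% chance of finding at least one subgroup with RR >= 2.
print(f"Simulations with at least one subgroup RR >= 2: {hits / n_sims:.0%}")
```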

Fragility Index

Walsh M, Srinathan SK, McAuley DF, Mrkobrada M, Levine O, Ribic C, Molnar AO, Dattani ND, Burke A, Guyatt G, Thabane L, Walter SD, Pogue J, Devereaux PJ. The statistical significance of randomized controlled trial results is frequently fragile: a case for a Fragility Index. J Clin Epidemiol. 2014 Jun;67(6):622-8. doi: 10.1016/j.jclinepi.2013.10.019 PMID: 24508144

  • There are some very strong opinions about the fragility index. Personally, I think it is a valuable concept. (It is really just another way of discussing the p value, much like the NNT. That is helpful, because p values are incredibly counter-intuitive for most people.) However, keep in mind that a p value of 0.06 is really not different from a p-value of 0.04, so the fragility index is working with an artificial threshold, and will become a less important discussion point when we stop putting p=0.05 on a pedestal.
  • This paper defines the fragility index – “The minimum number of patients whose status would have to change from a nonevent to an event required to turn a statistically significant result to a nonsignificant result could be used as an index of the fragility of the result (ie, a Fragility Index), with smaller numbers indicating a more fragile result.” – and has a good discussion of its strengths and limitations.
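
For concreteness, here is a minimal sketch of how a fragility index is typically calculated from a 2x2 table (my own code, assuming scipy is installed; the function name fragility_index and the example numbers are purely illustrative): flip patients in the arm with fewer events from “no event” to “event”, one at a time, until Fisher’s exact p value rises above 0.05.

```python
# Sketch of a Fragility Index calculation (assumes scipy): flip non-events to
# events in the arm with fewer events until the result is no longer "significant".
from scipy.stats import fisher_exact

def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Return the fragility index for a 2x2 trial result (None if not significant)."""
    _, p = fisher_exact([[events_a, n_a - events_a],
                         [events_b, n_b - events_b]])
    if p >= alpha:
        return None  # not statistically significant to begin with
    # Convention: add events to the arm with fewer events.
    if events_a <= events_b:
        low_e, low_n, oth_e, oth_n = events_a, n_a, events_b, n_b
    else:
        low_e, low_n, oth_e, oth_n = events_b, n_b, events_a, n_a
    for flipped in range(1, low_n - low_e + 1):
        _, p = fisher_exact([[low_e + flipped, low_n - low_e - flipped],
                             [oth_e, oth_n - oth_e]])
        if p >= alpha:
            return flipped
    return None

# Hypothetical example: 20/200 vs 38/200 events is "statistically significant" but fragile.
print(fragility_index(20, 200, 38, 200))
```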

Non-inferiority trials

Non-inferiority trials have their own blog post that goes into more details, and which can be found here.

Ricci S. What does ‘non-inferior to’ really mean? A clinician thinking out loud. Cerebrovasc Dis. 2010;29(6):607-8. doi: 10.1159/000312869 PMID: 20413972

  • Despite their name, non-inferiority trials don’t actually attempt to demonstrate that one treatment is non-inferior to another. At least, not in the way we commonly use those words. The goal of a non-inferiority trial is not to demonstrate that two treatments are equivalent, but rather to show that one treatment is not much worse than another.
  • We are only supposed to accept the ‘not much worse’ treatment if there are other reasons that might balance out the margin of inferiority, such as a lower cost, easier administration, or decreased toxicity. 
  • The big problem with non-inferiority trials is that there is a massive amount of subjectivity in deciding what an appropriate margin of ‘non-inferiority’ is. Would you accept an alternative to aspirin for MI that was 1% less effective? 5%? 10%? Who decides what is appropriate? (When companies run these trials, there is a lot of leeway to shape trial results to suit self-interest.)
  • Another important question that must be considered about all non-inferiority trials: why did they not perform a superiority trial? In order for us to accept the claim of non-inferiority, there is supposed to be some other reason to favour the new treatment, such as fewer adverse events. Why not just perform a normal RCT, and demonstrate that the new therapy is actually better? 
  • There is a question about whether non-inferiority trials are even ethical. By definition, they are assuming and aiming to demonstrate that the new therapy is marginally worse than the old therapy. Even if the margin of inferiority is small, it can result in tremendous harm when multiplied over thousands of patients. (What we really care about is equivalence, not non-inferiority.) 
  • When interpreting non-inferiority trials, the key is to remember they are trying to sell you an inferior product. “Thus, when reading papers or protocols based on non-inferiority, the right question we have to ask is ‘How much worse is it?’ This should be immediately followed by another question: ‘Are my patients keen to be offered a less effective treatment if it carries a different, clear-cut advantage?’”
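
To make the margin problem concrete, here is a sketch of how a non-inferiority verdict is typically reached (my own code with made-up numbers and a simple Wald confidence interval; the function name noninferiority_check is just for illustration): the new treatment is declared “non-inferior” if the confidence interval for the difference in success rates excludes a loss larger than the prespecified margin, so the entire conclusion hinges on who picked that margin.

```python
# Sketch of a non-inferiority comparison (assumes numpy). The verdict depends
# entirely on the prespecified margin, which is a subjective choice.
import numpy as np

def noninferiority_check(success_new, n_new, success_std, n_std, margin):
    """'Non-inferior' if the 95% CI for the difference in success rates
    (new minus standard) lies entirely above -margin."""
    p_new, p_std = success_new / n_new, success_std / n_std
    diff = p_new - p_std
    se = np.sqrt(p_new * (1 - p_new) / n_new + p_std * (1 - p_std) / n_std)
    lower, upper = diff - 1.96 * se, diff + 1.96 * se
    verdict = "non-inferior" if lower > -margin else "non-inferiority not shown"
    return diff, lower, upper, verdict

# Made-up data: the new drug looks about 2% worse than standard care.
# With a generous 5% margin it "passes"; with a stricter 2% margin it does not.
for margin in (0.05, 0.02):
    diff, lo, hi, verdict = noninferiority_check(1720, 2000, 1760, 2000, margin)
    print(f"margin {margin:.0%}: difference {diff:+.1%} "
          f"(95% CI {lo:+.1%} to {hi:+.1%}) -> {verdict}")
```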

Garattini S, Bertele’ V. Non-inferiority trials are unethical because they disregard patients’ interests. Lancet. 2007 Dec 1;370(9602):1875-7. doi: 10.1016/S0140-6736(07)61604-3. PMID: 17959239

  • An interesting essay that claims that non-inferiority trials are essentially always unethical. They are intrinsically designed such that we will accept new therapies that may be worse (sometimes significantly, depending on the margins) than standard care.
  • “We believe that non-inferiority studies have no ethical justification, since they do not offer any possible advantage to present and future patients, and they disregard patients’ interests in favour of commercial ones.”
  • “Few patients would agree to participate if this message were clear in the informed consent form: as we said before, why should patients accept a treatment that, at best, is not worse, but could actually be less effective or less safe than available treatments?”
  • The vast majority of non-inferiority trials are designed for commercial interests, not those of patients. 
  • One example they cite: The COMPASS trial, in which the thrombolytic saruplase was judged to be equivalent to streptokinase in MI, despite having a 50% higher mortality. 

Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. JAMA. 2006 Mar 8;295(10):1147-51. doi: 10.1001/jama.295.10.1147. PMID: 16522835

  • A review of 162 noninferiority and equivalence trials found significant deviations from accepted good research practice
    • 80% of trials did not provide a justification for the non-inferiority margin being used
    • 28% did not account for the non-inferiority margin in the sample size calculation

Aberegg SK, Hersh AM, Samore MH. Empirical Consequences of Current Recommendations for the Design and Interpretation of Noninferiority Trials. J Gen Intern Med. 2018 Jan;33(1):88-96. doi: 10.1007/s11606-017-4161-4. Epub 2017 Sep 5. PMID: 28875400

  • A review of 182 noninferiority trials in top rated journals found numerous problems, including the fact that about 12% of the time the experimental therapy was statistically worse than active control, but the CONSORT recommended conclusion for the trial was “noninferior”. (Aberegg 2018)
  • This same study finds that an astonishing 77% of published non-inferiority trials make the claim of non-inferiority or superiority, as compared to only 2% that conclude that the novel therapy is inferior. If non-inferiority trials essentially never conclude that a treatment is inferior, that sounds a lot like there is significant bias, or there is a fundamental flaw in this trial design. (Prasad 2017)

Flacco ME, Manzoli L, Boccia S, Capasso L, Aleksovska K, Rosso A, Scaioli G, De Vito C, Siliquini R, Villari P, Ioannidis JP. Head-to-head randomized trials are mostly industry sponsored and almost always favor the industry sponsor. J Clin Epidemiol. 2015 Jul;68(7):811-20. doi: 10.1016/j.jclinepi.2014.12.016. Epub 2015 Feb 7. PMID: 25748073

  • In head to head RCTs, industry funding is strongly associated with a favourable outcome for the sponsor
  • In non-inferiority/equivalence designs, 97% of trials reported favourable outcomes for the sponsor of the trial.

**I think it is a huge red flag for the non-inferiority trial design that these trials essentially never conclude inferiority. That seems to suggest massive bias.**

Other papers with important stats concepts

Fatovich DM, Phillips M. The probability of probability and research truths. Emerg Med Australas. 2017 Apr;29(2):242-244. doi: 10.1111/1742-6723.12740. Epub 2017 Feb 15. PMID: 28201852

  • Again, we accept a very low bar of significance in medicine. We design studies with a 25% chance of being wrong from the outset (alpha plus beta error), whereas in physics they are only willing to accept a 1 in 3.5 million chance of being wrong. 
  • Consider a study that compared aspirin to aspirin in 1000 patients. Although we know that there cannot be a difference, if you performed 10 subgroup analyses there is a 5% chance one of the groups will look 10 times better than its identical twin, and a 99% chance one of the two will appear twice as good
  • “Methodology and bias are much more important than statistics and p-values”
  • “When trials are stopped early, because interim analyses suggest large beneficial treatment effects, this is typically a misleading overestimate”

Meta-analyses

Pereira TV, Ioannidis JP. Statistically significant meta-analyses of clinical trials have modest credibility and inflated effects. J Clin Epidemiol. 2011 Oct;64(10):1060-9. doi: 10.1016/j.jclinepi.2010.12.012. Epub 2011 Mar 31. PMID: 21454050

  • It is estimated that up to 37% of meta-analyses that report a significant effect size are false positives!

Clarke M. The true meaning of DICE: don’t ignore chance effects. J R Soc Med. 2021 Dec;114(12):575-577. doi: 10.1177/01410768211064102. PMID: 34935558

  • Chance findings affect every perfectly designed RCT: “the traditional threshold of p = 0.05 will lead to ‘statistically significant’ differences with almost the same frequency as people rolling 11 with a pair of dice. The problem becomes even worse if multiple analyses are done and the one with the most striking difference, or lowest p-value, is elevated to become a key result of the trial.”
  • This paper describes the “DICE trials”, which are incredible demonstrations of problems with medical science.
  • In DICE1, participants decided whether patients lived or died by rolling a normal 6 sided die. Obviously, the results should be the same in both the ‘treatment’ and the ‘control’ groups, but using standard techniques used in many meta-analyses (such as retrospectively eliminating some trials that were negative), they ended up with a conclusion that the ‘intervention’ (there wasn’t one) “showed a statistically significant decrease in the odds of death of 39% (95% CI: 60% decrease to 8% decrease, p = 0.02).”
  • DICE 2 and 3 show similar things with computer simulated models. Essentially, there are lots of statistically positive meta-analyses, even when the data is purely random.
  • They hint at a form of bias that I have not seen described before. The decision to perform a meta-analysis is not random. They are often undertaken after positive or exciting results. This simulated data shows that even a single positive study significantly increases the risk of a meta-analysis being false positive by random chance.
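
The DICE phenomenon is easy to reproduce with simulated null trials (my own sketch, assuming numpy and scipy, not the authors’ code): pool a set of “trials” in which the treatment does nothing, then quietly drop the ones where treatment looked harmful, and the pooled estimate usually shifts toward an apparently convincing benefit. The exact numbers will vary from run to run.

```python
# Sketch of the DICE problem (assumes numpy and scipy): meta-analyse purely random
# "trials" and then quietly drop the ones where treatment looked harmful.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulate_null_trial(n_per_arm=200, event_rate=0.2):
    """One 'trial' in which treatment and control are identical by construction."""
    a = rng.binomial(n_per_arm, event_rate)      # events in the treatment arm
    c = rng.binomial(n_per_arm, event_rate)      # events in the control arm
    b, d = n_per_arm - a, n_per_arm - c
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    return log_or, se

def pool(trials):
    """Fixed-effect, inverse-variance pooled odds ratio and two-sided p value."""
    log_ors, ses = np.array(trials).T
    w = 1 / ses ** 2
    pooled = np.sum(w * log_ors) / np.sum(w)
    z = pooled / np.sqrt(1 / np.sum(w))
    return np.exp(pooled), 2 * stats.norm.sf(abs(z))

trials = [simulate_null_trial() for _ in range(20)]
print("All 20 null trials:        OR %.2f, p = %.3f" % pool(trials))

# Post hoc "clean-up": exclude every trial in which the treatment looked harmful.
favourable = [t for t in trials if t[0] < 0]
print("After post hoc exclusions: OR %.2f, p = %.3f" % pool(favourable))
```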

Kataoka Y, Banno M, Tsujimoto Y, Ariie T, Taito S, Suzuki T, Oide S, Furukawa TA. Retracted randomized controlled trials were cited and not corrected in systematic reviews and clinical practice guidelines. J Clin Epidemiol. 2022 Oct;150:90-97. doi: 10.1016/j.jclinepi.2022.06.015. Epub 2022 Jun 30. PMID: 35779825

  • There are many reasons to be cautious in your interpretation of systematic reviews and clinical practice guidelines. This paper tackles the issue of how these documents handle papers that have been retracted.
  • They identified 587 systematic reviews or guidelines that cited a study that had been retracted. 252 of these reviews were published after the retraction, meaning the authors of the review should have known the paper they were citing had been retracted. Not one of these reviews / guidelines corrected themselves after publication.
  • 335 were published before the retraction. This is a more difficult situation, as I don’t expect researchers to constantly fact check prior publications – but journals probably do have some responsibility. 11 (5%) of these publications corrected or retracted their results based on the retraction. 
  • Bad, or even fraudulent, science can make its way into systematic reviews. If you are thinking about changing practice, I think you should always read the base literature.

Reproducibility / Replication

Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016 May 26;533(7604):452-4. doi: 10.1038/533452a. PMID: 27225100 [full text]

  • More than 70% of researchers have tried to replicate a prior published study, and failed to reproduce the results.
  • Only a tiny minority published the results of their negative replication studies
  • Overall, reproducibility of published results is very poor: 40% in psychology and only 10% in cancer biology
  • Contributing factors include pressure to publish, selective reporting, low statistical power, and poor oversight and training

Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005 Jul 13;294(2):218-28. doi: 10.1001/jama.294.2.218. PMID: 16014596

  • A look at reproducibility of studies published in top tier medical journals. Of 49 highly cited articles, 45 claimed the intervention was effective. This claim was only subsequently confirmed in 44%, while it was specifically refuted in 16% and weaker effects were found in another 16%. In other words, even the most cited trials in the biggest medical journals are not routinely being replicated.

Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012 Mar 28;483(7391):531-3. doi: 10.1038/483531a. PMID: 22460880

  • Of 53 ‘landmark’ oncology papers, study findings were only subsequently confirmed in 6 (11%). This paper discusses that issue and some of the necessary fixes.

Serra-Garcia M, Gneezy U. Nonreplicable publications are cited more than replicable ones. Sci Adv. 2021 May 21;7(21):eabd1705. doi: 10.1126/sciadv.abd1705. PMID: 34020944 [full text]

  • One would expect scientific publications to be focused on the truth, but much like the mainstream media, scientists also seem to favour excitement over accuracy.
  • Studies that are successfully replicated (and therefore are more likely to be true) were cited on average 153 fewer times than papers that failed to replicate. These don’t appear to be negative citations (acknowledging the original study might be wrong), but instead seem to represent the proliferation of a potentially disproved hypothesis.

Peer Review

One major problem with peer review is that it occurs after data has been collected. If, based on peer review, a trial is not published, it will contribute to publication bias. The clear solution, which is not discussed in any of these papers, is that studies should be peer reviewed and accepted for publication before any data is collected, because peer review and publication really should focus on methods and be independent of the results obtained.

Goldbeck-Wood S. Evidence on peer review-scientific quality control or smokescreen? BMJ. 1999 Jan 2;318(7175):44-5. doi: 10.1136/bmj.318.7175.44. PMID: 9872890 [full text]

  • This editorial points out a few issues with peer review, the most glaring of which: there is actually no evidence that it improves the quality of published manuscripts. There are also known issues with bias in peer review based on language, speciality, nationality, and perhaps gender.

Conflict of interest

This might be the most important section of the bibliography. It is certain to grow significantly with time, as conflict of interest is a massive issue in current medical research, undermining confidence in much of our evidence.

Smith R. Medical journals are an extension of the marketing arm of pharmaceutical companies. PLoS Med. 2005 May;2(5):e138. doi: 10.1371/journal.pmed.0020138. Epub 2005 May 17. PMID: 15916457 [full text]

  • Many of the other articles cited here suggest that conflict of interest is the biggest problem in modern evidence based medicine. (You shouldn’t let a company trying to sell you something perform their own science). 
  • This paper reviews some of the many ways that pharmaceutical companies cheat, undermining evidence based medicine:
    • Test the drug against a treatment known to be inferior
    • Test the drug against too low a dose of a competitor drug
    • Test your drug against too high a dose of a competitor drug (making your drug seem less toxic)
    • Conduct trials that are too small to show differences from competitor drugs (cheating non-inferiority designs)
    • Using multiple end points in the trial and publishing only those that give favourable results
    • Perform multicentre trials and select for publication results from centres that are favourable
    • Conduct subgroup analyses and select for publication those that are favourable
    • Present results that are most likely to impress – for example, reduction in relative rather than absolute risk

Heres S, Davis J, Maino K, Jetzinger E, Kissling W, Leucht S. Why olanzapine beats risperidone, risperidone beats quetiapine, and quetiapine beats olanzapine: an exploratory analysis of head-to-head comparison studies of second-generation antipsychotics. Am J Psychiatry. 2006 Feb;163(2):185-94. doi: 10.1176/appi.ajp.163.2.185. PMID: 16449469

  • A classic paper demonstrating the problems with financial conflicts of interest. In head to head studies, 90% of the time the results favoured the drug being made by the company paying for the trial. As a result, every drug has been shown to be better than every other drug. (Ie, companies cheat using inappropriate doses, selection criteria, curtailed follow-up, lead in periods, selective reporting, adjustments, or other techniques to distort science.)

Yaphe J, Edman R, Knishkowy B, Herman J. The association between funding by commercial interests and study outcome in randomized controlled drug trials. Fam Pract. 2001 Dec;18(6):565-8. doi: 10.1093/fampra/18.6.565. PMID: 11739337

  • There is a strong association between industry funding and positive outcomes for the treatment being promoted by the company funding the trial.
  • This study looked at all RCTs in the 5 largest medical journals, and compared outcomes depending on whether there was commercial involvement in the study.
  • Commercial involvement was very common. 68% of studies had commercial involvement. This came in the form of direct financial funding (40%), personnel (33%), and supply of drugs (21%).
  • 34% of RCTs without industry involvement reported negative results, as compared to only 13% of those with pharmaceutical company support (p<0.0001, odds ratio 3.54).
  • In other words, there is empirical evidence of bias. Because it is impossible to completely determine where this bias comes into play, the only real option is to downgrade your confidence in any study that was funded by the manufacturer of the product being sold. (Which is also just common sense).

Lexchin J, Bero LA, Djulbegovic B, Clark O. Pharmaceutical industry sponsorship and research outcome and quality: systematic review. BMJ. 2003 May 31;326(7400):1167-70. doi: 10.1136/bmj.326.7400.1167. PMID: 12775614

  • This is a systematic review of all the studies looking at the influence of drug company funding on research. Obviously, the results clearly show the bias that occurs when you let companies test the products that they want to sell for billions of dollars.
  • Studies funded by industry were less likely to be published, and when published were more likely to be delayed or published in formats that are hard to find, such as in abstract form at a conference. That is, industry funding promotes publication bias, and industry is obviously more likely to leave negative studies unpublished, which explains the next point.
  • Clinical trials and meta-analyses funded by industry were significantly more likely to report positive results that favour the company funding the research. (The overall odds ratio was 4; so industry funding increases the odds of a trial being reported as positive by a factor of 4. Therefore, it is reasonable to decrease your confidence in any industry funded research by a similar factor.)
  • Industry funded research was generally thought to be of higher methodologic quality, but that is not surprising given the publication bias and the many other sources of bias that can be used to subtly shape results. 

Jureidini J, McHenry LB. The illusion of evidence based medicine. BMJ. 2022 Mar 16;376:o702. doi: 10.1136/bmj.o702. PMID: 35296456

  • A pretty scathing (but in my mind spot on) review of the corruption of science that occurs when we allow industry involvement.
  • “The release into the public domain of previously confidential pharmaceutical industry documents has given the medical community valuable insight into the degree to which industry sponsored clinical trials are misrepresented. Until this problem is corrected, evidence based medicine will remain an illusion.”
  • “Scientific progress is thwarted by the ownership of data and knowledge because industry suppresses negative trial results, fails to report adverse events, and does not share raw data with the academic research community. Patients die because of the adverse impact of commercial interests on the research agenda, universities, and regulators.”
  • “What confidence do we have in a system in which drug companies are permitted to “mark their own homework” rather than having their products tested by independent experts as part of a public regulatory system?”
  • “Our proposals for reforms include: liberation of regulators from drug company funding; taxation imposed on pharmaceutical companies to allow public funding of independent trials; and, perhaps most importantly, anonymised individual patient level trial data posted, along with study protocols, on suitably accessible websites so that third parties, self-nominated or commissioned by health technology agencies, could rigorously evaluate the methodology and trial results.”

Baraldi JH, Picozzo SA, Arnold JC, Volarich K, Gionfriddo MR, Piper BJ. A cross-sectional examination of conflict-of-interest disclosures of physician-authors publishing in high-impact US medical journals. BMJ Open. 2022 Apr 11;12(4):e057598. doi: 10.1136/bmjopen-2021-057598. PMID: 35410932

  • Both JAMA and the New England Journal have clear rules on conflict of interest: you are supposed to report them. Unfortunately, these rules are not followed.
  • This study looked at 31 RCTs from each journal. The 118 total authors received a total of 7.5 million dollars from industry over the 3 year study period.
  • Of the 106 authors who received industry payments, 90% failed to disclose at least some of that money. 51 (48%) left more than half of their received funds undisclosed. (This is likely an under-estimate, because this only accounts for funds openly disclosed on the Open Payments system.)
  • Even if these disclosures worked (they don’t), you can’t trust them because the authors failed to disclose significant conflicts. This is also a pathetic failure of these two supposed top tier journals, as all these conflicts are openly reported, and could be checked in seconds with a simple search.

Taheri C, Kirubarajan A, Li X, et al. Discrepancies in self-reported financial conflicts of interest disclosures by physicians: a systematic review. BMJ Open 2021;11:e045306. doi: 10.1136/bmjopen-2020-045306

  • This is a systematic review and meta-analysis looking at discrepancies in financial conflict of interest reporting. They found 40 studies that looked at this issue (which each individually looked at larger numbers of clinical guidelines, published papers, or academic meetings.)
  • Discrepancies between reported conflicts and those identified through objective payment databases were very common. It depends exactly how you sort the data (one author can be on multiple papers so it is unclear how many times you should count them), but between 80 and 90 percent of reported financial conflicts of interest were discrepant when compared to an objective database!
  • Studies with discrepancies were much more likely to report positive outcomes (odds ratio 3.21)
  • Bottom line: don’t trust declared fCOIs. They are almost always wrong. We need to get rid of this system where we allow people with financial conflicts to play such a prominent role in research.

Publication bias and trial registries

Turner EH, Matthews AM, Linardatos E, Tell RA, Rosenthal R. Selective publication of antidepressant trials and its influence on apparent efficacy. N Engl J Med. 2008 Jan 17;358(3):252-60. doi: 10.1056/NEJMsa065779. PMID: 18199864

  • Just one of many many examples, these authors looked at publication bias in the antidepressant literature. Of 38 trials with positive results, 37 were published. Conversely, of 36 trials with negative results, only 3 were published. Thus, a review of the published literature makes it look like 94% of trials were positive, when in fact it was only 50%. 

Riveros C, Dechartres A, Perrodeau E, Haneef R, Boutron I, Ravaud P. Timing and completeness of trial results posted at ClinicalTrials.gov and published in journals. PLoS Med. 2013 Dec;10(12):e1001566; discussion e1001566. doi: 10.1371/journal.pmed.1001566. Epub 2013 Dec 3. PMID: 24311990

  • Trial registries are supposed to limit publication bias, but half of the registered trials looked at here were not published.
  • Comparing what gets into published manuscripts to what is reported on clinicaltrials.gov, the reporting in manuscripts is very incomplete:
    • Flow of patients (exclusion / inclusion) only makes it into 48% of publications, but is found 68% of the time in the registry
    • The primary outcome was reported in 68% of publications and 78% of registry entries
    • Adverse events were only reported in 45% of publications, but 75% of registry entries
    • Serious adverse events were even worse, with 63% of publications listing them, but 99% of registries
  • In other words, publication bias extends well beyond just the publication of manuscripts. The selective publication of individual trial components significantly skews the scientific literature. When appraising papers, it is really important to search registries like clinicaltrials.gov

Hannink G, Gooszen HG, Rovers MM. Comparison of registered and published primary outcomes in randomized clinical trials of surgical interventions. Ann Surg. 2013 May;257(5):818-23. doi: 10.1097/SLA.0b013e3182864fa3. PMID: 23407296

  • Unfortunately, even when trials are registered, researchers cheat. (This is also highly related to the topic of p-hacking).
  • This study looked at 327 surgical trials. 109 (33%) were not registered. 48 (22%) were registered after the study was completed, which is no better than not registering the trial at all.
  • Of the 152 trials that were registered before the end of the trial, 75 (49%) had discrepancies between the registered protocol and the published outcomes. The most common discrepancy was a change in primary outcomes. 
  • This is why I always try to check clinicaltrials.gov myself when reviewing papers, but this is a pathetic summary of the state of medical science. Editors and peer reviewers are allowing researchers to change their outcomes and blatantly p-hack even when the trials are pre-registered. 

Jones CW, Keil LG, Holland WC, Caughey MC, Platts-Mills TF. Comparison of registered and published outcomes in randomized controlled trials: a systematic review. BMC Med. 2015 Nov 18;13:282. doi: 10.1186/s12916-015-0520-3. PMID: 26581191

  • This is a systematic review looking at discrepancies with trial registries. They found 27 studies that looked at this issue, with a median of 33% of included studies having discrepancies between the trial registry and the publication.
  • Again, this is a massive issue. A scientific publication is supposed to describe the research you did. There is no role for creativity. Changing the outcomes is very bad research practice, and undermines the trustworthiness of the results. 

Gopal AD, Wallach JD, Aminawung JA, Gonsalves G, Dal-Ré R, Miller JE, Ross JS. Adherence to the International Committee of Medical Journal Editors’ (ICMJE) prospective registration policy and implications for outcome integrity: a cross-sectional analysis of trials published in high-impact specialty society journals. Trials. 2018 Aug 23;19(1):448. doi: 10.1186/s13063-018-2825-y. PMID: 30134950

  • Despite rules to the contrary, a significant number of trials published in major medical journals are either completely unregistered, or registered retrospectively (after data collection is complete, which undermines the entire value of the registry).
  • Unregistered trials were much more likely to report positive results (89% vs. 64%). 

Haslberger M, Gestrich S, Strech D. Reporting of retrospective registration in clinical trial publications: a cross-sectional study of German trials. BMJ Open. 2023 Apr 18;13(4):e069553. doi: 10.1136/bmjopen-2022-069553. PMID: 37072362

  • Although trial registries are clearly good in theory, we have pretty good evidence that they are mostly failing in practice.
  • This study looked at a German trial registry, and found that more than half of the trials were registered retrospectively (completely eliminating the value of registration). Less than 5% mention this retrospective registration in the published manuscript.
  • The one positive finding in this study is that retrospective registration is trending down with time, from 100% in the 1990s, to only about 25% in 2017. Unfortunately, 25% is still far too high, and there are still many other problems with these registries (such as the fact that journals apparently never look at them).

Stopping trials early

Early termination of trials, which is occasionally necessary for the safety of participants, is dramatically overdone and skews the scientific literature.

Montori VM, Devereaux PJ, Adhikari NK, et al. Randomized trials stopped early for benefit: a systematic review. JAMA. 2005 Nov 2;294(17):2203-9. doi: 10.1001/jama.294.17.2203. PMID: 16264162

  • The number of trials stopped early for benefit more than doubled from 1990 to 2005
  • Trials stopped early for benefit yield implausibly large treatment effects (the median relative risk was 0.53)
  • In general, stopping trials early for benefit will systematically overestimate treatment effects. Trials with fewer events yielded greater treatment effects (odds ratio, 28; 95% confidence interval, 11-73). 
  • One hundred thirty-five (94%) of the 143 RCTs did not report at least 1 of the following: the planned sample size (n=28), the interim analysis after which the trial was stopped (n=45), whether a stopping rule informed the decision (n=48), or an adjusted analysis accounting for interim monitoring and truncation (n=129).
  • Probably because the results are overly dramatic, trials that are stopped early tend to be published in more prestigious journals
  • “These findings suggest clinicians should view the results of such trials with skepticism.”
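
The overestimation is easy to demonstrate in a simulation (my own sketch, assuming numpy and scipy, not taken from the paper): give the treatment a genuinely modest benefit, perform repeated interim analyses, stop whenever a look crosses p < 0.05 in the direction of benefit, and the effect estimates at the moment of stopping are systematically exaggerated compared to the truth.

```python
# Sketch of why stopping early for benefit inflates effect estimates (assumes
# numpy and scipy): true relative risk 0.85, interim looks every 200 patients per arm.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
p_control, p_treatment = 0.20, 0.17          # true relative risk = 0.85
max_per_arm, look_every, n_sims = 2000, 200, 1000

stopped_rrs, final_rrs = [], []
for _ in range(n_sims):
    ctrl = rng.binomial(1, p_control, max_per_arm)
    trt = rng.binomial(1, p_treatment, max_per_arm)
    for n in range(look_every, max_per_arm + 1, look_every):
        table = [[trt[:n].sum(), n - trt[:n].sum()],
                 [ctrl[:n].sum(), n - ctrl[:n].sum()]]
        _, p, _, _ = stats.chi2_contingency(table)
        rr = trt[:n].mean() / ctrl[:n].mean()
        if p < 0.05 and rr < 1:              # "stop early for benefit"
            stopped_rrs.append(rr)
            break
    final_rrs.append(trt.mean() / ctrl.mean())

print("True relative risk:                      0.85")
print(f"Mean RR in trials stopped early:         {np.mean(stopped_rrs):.2f}")
print(f"Mean RR if all trials run to completion: {np.mean(final_rrs):.2f}")
```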

Mueller PS, Montori VM, Bassler D, Koenig BA, Guyatt GH. Ethical issues in stopping randomized trials early because of apparent benefit. Ann Intern Med. 2007 Jun 19;146(12):878-81. doi: 10.7326/0003-4819-146-12-200706190-00009. PMID: 17577007

  • These authors argue that there are often significant ethical concerns that arise from stopping trials early. 
  • Trials stopped early dramatically overestimate treatment effects, and also leave us with significant uncertainty about important outcomes that may not have been the primary outcome (especially safety outcomes).
  • One point that I think is absolutely crucial, but that these authors fail to raise: Stopping trials early increases uncertainty. Often this necessitates further trials, which is an absolute contradiction. You stop the trial early because you think adding a few hundred more patients is unethical, but then you require thousands of patients to be randomized in a future trial. Even worse, if there isn’t a follow up study, potentially millions of patients will receive a treatment that may actually be net harmful, because the trial was never finished to completion. 

Composite outcomes

Dash K, Goodacre S, Sutton L. Composite Outcomes in Clinical Prediction Modeling: Are We Trying to Predict Apples and Oranges? Ann Emerg Med. 2022 Jul;80(1):12-19. doi: 10.1016/j.annemergmed.2022.01.046. Epub 2022 Mar 24. PMID: 35339284

  • This paper provides a nice discussion of both the benefits and problems with composite outcomes.
  • Some potential benefits:
    • Increased statistical efficiency
    • The ability to increase event rates when individual event rates are low
    • Improved research efficiency
    • You might notice that the benefits are all about getting research done cheaper or faster, but not about validity or scientific soundness.
  • Some potential problems:
    • “The construction of composite outcomes often lacks logic and is susceptible to post hoc choosing, or “cherry picking”, of favorable combinations of outcomes”. (See also section on p-hacking).
    • Benefit might be driven by the less important part of the composite. (For example, we see many studies claiming a decrease in major adverse cardiac events where there is no change in death or MI, and the only change is in revascularization; see the toy example after this list.)
    • Often make the assumption of uniform directionality. Ie, effects observed on separate components of a composite outcome may not be in the same direction.
      • Particularly bad if the qualitative value of the outcomes is different. Ie, a treatment that reduces symptoms but increases mortality might look good on composite outcomes, because the symptom signal outweighs the mortality signal.
      • Can also underestimate benefits if the composite combines an outcome with no effect with one that has a real effect.
    • Outcomes are often not patient oriented, or irrelevant outcomes are combined with important outcomes.
    • There is something called competing hazards bias, in which one outcome can influence the others. For example, if you die, you can’t possibly later develop coronary artery disease. 
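
As a toy illustration of the “driven by the least important component” problem flagged above (entirely made-up numbers, not from the paper): a composite of death, MI, and revascularization can show an impressive absolute reduction even when death and MI are identical between arms.

```python
# Toy example (made-up numbers): a MACE composite that looks positive even though
# the patient-important components (death, MI) are identical between arms.
n_per_arm = 1000
control =   {"death": 30, "mi": 50, "revascularization": 120}
treatment = {"death": 30, "mi": 50, "revascularization": 70}

for arm, counts in (("control", control), ("treatment", treatment)):
    composite = sum(counts.values())         # assumes no patient has two events
    print(f"{arm:9s}: death {counts['death']/n_per_arm:.1%}, "
          f"MI {counts['mi']/n_per_arm:.1%}, composite {composite/n_per_arm:.1%}")
```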

Other papers

These papers are also important, but didn’t seem to fit into any of the above categories.

Alsheikh-Ali AA, Qureshi W, Al-Mallah MH, Ioannidis JP. Public availability of published research data in high-impact journals. PLoS One. 2011;6(9):e24357. doi: 10.1371/journal.pone.0024357. Epub 2011 Sep 7. PMID: 21915316 [full text]

  • For scientific findings to be trustworthy, the data needs to be accessible. This allows other researchers to analyze the data and confirm the results, and also allows for replication efforts.
  • Looking at the 50 journals with the highest impact factor, 88% had rules or instructions to authors related to sharing of data. However, these rules were of variable quality, and probably not enforceable.
  • Most importantly, almost half of the studies published in these journals failed to follow the journal’s own rules on data sharing.

Lee TC, Senecal J, Hsu JM, McDonald EG. Ongoing Citations of a Retracted Study Involving Cardiovascular Disease, Drug Therapy, and Mortality in COVID-19. JAMA Intern Med. 2021 Aug 2:e214112. doi: 10.1001/jamainternmed.2021.4112. Epub ahead of print. PMID: 34338721

  • Retractions don’t work. Retracted studies will get cited months (and other papers say years) after being retracted, and they still get included in secondary analyses like meta-analyses

Kennedy AG. Evaluating the Effectiveness of Diagnostic Tests. JAMA. 2022 Mar 18. doi: 10.1001/jama.2022.4463. Epub ahead of print. PMID: 35302590

  • When evaluating clinical tests, there are 3 major considerations (and we often forget the last 2):
  • Accuracy: A test must be accurate (often measured by sensitivity and specificity, although I think there are better measures). Accuracy alone is not enough to warrant a test. Many accurate tests actually lead to patient harm. Accuracy is a necessary but not sufficient criterion. 
  • Clinical Utility: The test must have a measurable net positive effect on a patient’s clinical outcomes (many of our tests, like stress tests and BNP fail at this level)
  • Patient benefit: This last criterion is very questionable, and probably should just be wrapped up into clinical utility. The authors try to distinguish a benefit to patients, even when a test does not directly influence clinical treatment decisions or prognosis. The example would be identifying a cancer that has no treatment, and therefore cannot impact “clinical decisions”, but might impact a patient’s life choices. This is a bit of a slippery slope. I firmly believe that tests that will not change a patient’s management should not be ordered. However, I believe that important life decisions, such as choices about end of life care, finances, and general well being, are firmly within the clinical realm, and represent potential clinical benefit. Therefore, in my mind, clinical utility and patient benefit are probably best thought of as the same thing.

Other great EBM resources

An introduction to the concepts of EBM: a bibliography of key resources

Not enough for you? Don’t worry, there are many more papers to be added to this list as soon as I have time. Leave any recommendations below.
