WELCOME
Welcome
to a brief (and hopefully fun) online tutorial on the basics of
critical appraisal of therapeutic interventions. The emphasis is
on evaluating efficacy claims of primary studies using superiority
designs.
WHY THIS MATTERS
Here's
the short story on the huge problem we are trying to help solve: Delfini Flier.
For more details
on why critical appraisal matters so much, go to
our Critical
Appraisal Facts page.
RESOURCES AT OUR SITE: ABOUT SCIENCE & MORE THAN SCIENCE
There is a wealth
of freely available resources throughout our whole web site.
The broad focus of our work is healthcare information, clinical
improvement and patient communications. We specialize,
however, in the evaluation of medical science as
that area is particularly neglected and the need is so great. At
this page you will find basic instruction in critical appraisal
of the medical literature with an emphasis on efficacy
of therapies as evaluated through primary superiority
studies (original research). We strongly recommend you
read this first: Why
Critical Appraisal Matters.
CAVEATS
There are exceptions to everything we say. Think
of everything on our site as being a generality. There is no perfection
in this work. Much judgment is required.
MORE ON RESOURCES
Many instructional
materials that go more deeply can be found at the Delfini
Library of Tools & Educational Materials. For a
satellite view, see the Overview
at Possibly
Our Most Important Reading at the Library.
Time saving tips are available at Delfini
Library of Tools & Educational Materials—see
Pearls: Basics of Evaluating Evidence in Superiority Trials
for Therapies.
We list and link relevant additional reading below specific sections. Sometimes we link the same document in several places to keep key reading with the right context.
Also, we link to PDFs here for easier online reading; however, for templates we have Word documents available at the Delfini
Library to enable you to easily fill in forms as you like. |
Let's
Go On A Little Journey! Into the Land of Critical Appraisal of the
Medical Literature We Go!

|
| Delfini
Definition of Evidence-based Medicine (EBM)
"Evidence-based medicine is the use of the scientific
method and application of valid and useful science
to inform health care provision, practice, evaluation and decisions."
Science helps
us by improving our ability to predict probability of outcomes.
We believe it
is fine to make decisions on factors other than science. Know the
science first—and do not confuse opinion or other factors
with science.
More on Evidence-based
Medicine for consumers and their clinicians. |
| Delfini's
Hallmarks of Evidence-based Practice
1) When seeking
information on a topic, a systematic search is
conducted for science and science-based information using evidence-based
searching and filtering techniques.
2) All sources
of information to guide medical decision-making are critically
appraised, using science-based principles, for validity
and usefulness.
3) Any conclusions
drawn from the science are carefully crafted to be as valid as
possible.
4) Methods
used and reporting are transparent so that the
work can be evaluated for quality, replicated and updated.
5) Clinical
information sources are updated when significant
new information becomes available and such information is periodically
sought.
We then apply
evidence-based medicine to clinical quality improvement. To learn
more about the Delfini approach to clinical quality improvement
& value, read about the Delfini
Evidence- and Value-based Clinical Quality Improvement Model. |
Key
Terns (Oops! That would be Terms... Oops, this is a Gull...No! This
is the Delfini Flier!!!)
Go to our page
on Why Critical
Appraisal Matters to see a short list of key
terms (to the left on that page). |
|
Five "A"s of Evidence-based Medicine: Process Steps
- Ask:
Question framing drives your whole project. Consider PICO:
Population/condition, intervention, comparator, outcome.
- Acquire:
Use filtering techniques to narrow your yield. Match study type
to clinical question. For efficacy of therapies, look for randomized
controlled trials (RCTs). See the Searching & Sources
Tool at Delfini Library
of Tools & Educational Materials.
- Appraise:
Critically appraise all scientific sources and references for
validity (closeness to truth) and clinical usefulness (see Key
Terms).
- Apply:
See the Overview at Welcome.
Other clinical quality improvement tools are available at the
Delfini Library of Tools &
Educational Materials.
- "A"s
Again: Update! See the Searching & Sources
Tool at Delfini Library
of Tools & Educational Materials for advice on
updating.
Modified from
Leung 01 PMID: 12597509. |
The
3 Steps of Critical Appraisal (To Avoid Getting Burnt!)
- Match study design
to clinical question.
- Assess validity
of the study.
- Assess usefulness
of the results.
|
|
| Step
1. Match Study Design to Clinical Question |
The
2 General Study Types

Basic
Basics
- You need
at least two groups to compare for differences.
- These groups
should be from the same pool of people at the same time.
- The groups
should be similar (except for what you are studying). What if
these groups started out different from each other (as illustrated)
and this was a study of headache? Ow! There's a little bias going
on.
Medical research
is about comparing the difference in outcomes between groups.
For efficacy of therapies, look for valid randomized controlled
trials (RCTs). |
For Further Reading
From the Delfini
Library of Tools & Educational Materials, a short primer, Problems
with Case Series. |
| 
Research studies
are observations or experiments.
- In observations,
you observe what happens naturally. Observations are highly prone
to bias no matter what.
- Experiments,
of which randomized controlled trials (RCTs)
are the ideal, are the best study type for studying efficacy of
therapies, unless you have all-or-none results,
which are very rare.
|
Observational studies have many opportunities for flaws that the RCT design corrects.

|
The
Quick Method to Identify Experiments
Did the patient
(or his or her physician) CHOOSE the treatment?
If yes, this is an observational study. Choice often is associated
with other factors which could affect the results. |
For Further Reading
From the Delfini
Library of Tools & Educational Materials—
- A short primer on problems
with Observational Studies. Observational studies
have many important uses, but they are highly prone to bias in evaluations
of therapies.
- Also see our 1-pager on "Real World" Data
|

Randomization
helps to create equal groups by distributing prognostic variables
across groups. You want NO DIFFERENCE between groups
except what you are studying. |
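To see why randomization tends to balance groups, here is a minimal sketch in Python. Everything in it is hypothetical: the pool of 10,000 patients, the 30 percent who carry some prognostic factor, and the helper names `randomize` and `share_with_factor` are made up for illustration. It uses simple 1:1 randomization by shuffling, which is only one of several randomization methods.

```python
import random

def randomize(patients, seed=0):
    """Simple 1:1 randomization: shuffle the pool, then split it in half."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    shuffled = list(patients)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def share_with_factor(group):
    """Fraction of a group that carries the prognostic factor."""
    return sum(group) / len(group)

# Hypothetical pool: True means the patient carries some prognostic factor
# (for example, frequent headaches at baseline). 30 percent of the pool does.
pool = [i < 3000 for i in range(10000)]

treatment, control = randomize(pool)
difference = abs(share_with_factor(treatment) - share_with_factor(control))

print(f"treatment group with factor: {share_with_factor(treatment):.3f}")
print(f"control group with factor:   {share_with_factor(control):.3f}")
print(f"difference between groups:   {difference:.3f}")
```

With groups this large, the share of patients carrying the factor should come out nearly the same in both groups. With small groups, chance imbalances are more likely, which is one reason to check the baseline characteristics a study reports.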
| Step
2. Assess Validity of the Study |
The
2 General Validity Types
- External
validity is the truth of the study outside of
the study context (i.e., "real life" application).
Considerations include similarity of studied patients to real
life patients and circumstances of care. If these are different,
outcomes may no longer be predictable.
- Internal
validity is the truth of the study within its
own context. To assess this, we perform critical
appraisal looking for anything which may invalidate the
study.
|
Preparation: Internal Validity Section 1-Pager
This section is summarized in the Delfini
Library of Tools & Educational Materials in Pearls: Basics of Evaluating Evidence in Superiority Trials
for Therapies. You may wish to print it out now for notetaking as you read and for later review. We strongly recommend using this tool—or other similar tool that you like—to help you be mindful of what to look for in a study to assess its validity. |
Internal
Validity: Key Questions
- What can
explain the results other than "truth"?
- We look for
bias, confounding and chance as possible explanations
for our results. If we can rule out these, then it may be reasonable
to conclude we have found "truth." (In general, we assess
the likelihood of chance outcomes as part of assessing results.)
- Another way
to state this is that there are 4 possible explanations for an association
between an intervention and an outcome: bias, confounding, chance or
cause and effect. Key point: not all associations are causal.
- We are alert
to whether anything may favor the intervention.
- We need transparent
details of what was done (such as details of randomization) to
be able to evaluate likely success.
- Problems
or no details go on a list of "threats to validity."
- One or more
threats can render study results uncertain.
- We assign
a grade to our evaluation. See Delfini
Library of Tools & Educational Materials for
more advice about grading and about wording conclusions from studies or see below at Further Reading for this section.
|
The
4 Phases of A Clinical Trial
|
There
are four phases of a clinical trial.
- Selection:
Who is studied, how were they selected, how were they assigned
to their groups?
- Performance:
What is being studied and what is it being compared to?
- Data
Collection & Attrition: What information is collected,
how is it collected and what is missing?
- Assessment:
How is the difference in outcomes between the groups evaluated?
|
Bias
Bias is anything
that "systematically" leads away from truth.
This just means something led away from truth other than through
random chance.
We look for
bias in all 4 phases of a study. Bias may happen because of a problem
affecting all groups equally, such as
biased data collection. Or bias may occur because of a difference
between groups other than what is being studied. Any difference
between groups other than what is being studied is automatically
a bias.
Bias
in studies tends to favor the intervention under investigation.
Certain kinds of bias have been shown to distort research
results up to a relative 50 percent or more—for
each flaw.
See Delfini
Library of Tools & Educational Materials sections
for critical appraisal tools and primers or see Further Reading below this section. |
Confounding

|
Confounding
is a special kind of bias in which we are confused
into thinking that one variable is responsible for an outcome when
it is actually another variable that is responsible.
People who make
healthy choices might be more likely than others to take vitamins.
Therefore, taking vitamins is "linked"
to healthy lifestyle.
So are vitamins
responsible for reduced risk of coronary heart disease? Or are other
choices that go along with having a healthy lifestyle? |
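Here is a small worked sketch of the vitamin example in Python. The counts are entirely hypothetical (they come from no real study) and were chosen so that vitamins do nothing at all within each lifestyle group. The point is to show how a crude comparison can still make vitamins look protective when lifestyle is ignored.

```python
# Hypothetical counts: (people, coronary heart disease events) for each
# lifestyle stratum and vitamin status. None of these come from a real study.
data = {
    "healthy lifestyle": {"vitamins": (800, 40), "no vitamins": (200, 10)},
    "unhealthy lifestyle": {"vitamins": (200, 30), "no vitamins": (800, 120)},
}

def risk(people, events):
    """Proportion of people who had the event."""
    return events / people

def crude_risk(status):
    """Risk when lifestyle is ignored and the strata are pooled together."""
    people = sum(data[stratum][status][0] for stratum in data)
    events = sum(data[stratum][status][1] for stratum in data)
    return risk(people, events)

print(f"crude: vitamins {crude_risk('vitamins'):.0%}, "
      f"no vitamins {crude_risk('no vitamins'):.0%}")

for stratum, groups in data.items():
    with_vitamins = risk(*groups["vitamins"])
    without_vitamins = risk(*groups["no vitamins"])
    print(f"{stratum}: vitamins {with_vitamins:.0%}, "
          f"no vitamins {without_vitamins:.0%}")
```

In these made-up numbers the pooled comparison shows a lower risk among vitamin takers, yet within each lifestyle group the risk is identical with and without vitamins. Lifestyle, not vitamins, is doing the work.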
For Further Reading
From the Delfini
Library of Tools & Educational Materials, see—
|
| Step
3. Assess Usefulness of the Results |
| Chance
Are the results
just due to tricksy random chance??? We look to p-values
(probability values) to determine whether results are
likely to be a chance effect.
(Technically, a p-value is the probability of seeing a difference at least as large as the one observed in the study sample, by chance alone, if there were truly no difference in the larger population. There are some complex issues here which you can dig into more deeply at our glossary entry for p-values if you wish, but strict application is not very realistic or maybe even possible. Consequently, we think of the p-value as an indicator.)
That said, keeping this practical and simple (though not exact), we are looking for statistically significant
results which we use as a clue as to the likelihood of results being due
to chance.
Some key points (oversimplified):
- A p-value
of <0.05 can roughly and practically be thought of as meaning that there is a less than 1-in-20 chance that the findings
are due to chance. (When you think about it, that's pretty high!
I'd like that lotto ticket!)
- P-values only have to do with measurements of chance. They cannot measure bias or tell you whether the results are true.
- Performing
multiple assessments (multiple outcomes or multiple
analyses) increases the likelihood of a chance finding.
As a rough upper bound, the risk adds up across the outcomes or
analyses: 20 outcomes, each tested at p<0.05, could mean as high as a
20-in-20 risk of at least one chance finding (this is 100 percent!).
- Stopping
clinical trials early, regardless of utilizing stopping
rules, puts results at very high risk of being due to
chance.
- We want certain
things determined in advance of a study (a priori) to reduce the likelihood that findings are
due to chance: research questions, populations for analysis and outcome
measures.
- Findings
of non-significance could be chance effects—meaning
there might have been too few people studied to realize an outcome.
Look at the confidence intervals to see if there is a
possibility of a meaningful result!
|
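To make the last point concrete, here is a minimal sketch that computes a 95 percent confidence interval for the difference in risk between two groups. The trial numbers are hypothetical, the function name `risk_difference_ci` is made up for this example, and the simple Wald (normal approximation) interval used here is only one of several methods, chosen for brevity.

```python
import math

def risk_difference_ci(events_control, n_control, events_treated, n_treated, z=1.96):
    """Wald confidence interval for the absolute risk difference.

    Returns (difference, lower bound, upper bound) as proportions.
    z=1.96 corresponds to a 95 percent interval.
    """
    p_control = events_control / n_control
    p_treated = events_treated / n_treated
    difference = p_control - p_treated
    standard_error = math.sqrt(
        p_control * (1 - p_control) / n_control
        + p_treated * (1 - p_treated) / n_treated
    )
    return difference, difference - z * standard_error, difference + z * standard_error

# Hypothetical small trial: 15 of 100 die on control, 10 of 100 die on treatment.
difference, lower, upper = risk_difference_ci(15, 100, 10, 100)
print(f"risk difference: {difference:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

In this made-up trial the interval runs from about 4 percentage points in favor of control to about 14 percentage points in favor of treatment. It includes zero, so the result is not statistically significant, but it also includes differences that could be clinically meaningful. The study was simply too small to tell.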
20-Sided
Die
The
20-sided die is here as a reminder that, mathematically, we can
ensure we get chance effects just by looking at enough variables.
This is why database research is hypothesis-generating
only.
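The arithmetic behind the die can be sketched in a few lines of Python. This assumes the comparisons are independent of each other, which real outcomes often are not, so treat the numbers as a rough guide. Simply adding up the risks (20 × 0.05 = 100 percent) gives an upper bound; under the independence assumption the risk of at least one chance finding is lower, but still large. The function name is made up for this example.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one chance 'significant' finding among
    n_tests independent comparisons, each tested at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 5, 20, 100):
    risk = chance_of_false_positive(n_tests)
    print(f"{n_tests:>3} comparisons: {risk:.0%} risk of at least one chance finding")
```

With 20 independent comparisons the risk comes out to roughly 64 percent, and with 100 it is close to certain.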

|
For Further Reading
From the Delfini
Library of Tools & Educational Materials—
From our Web Links, compute confidence
intervals |
Clinical
Significance
Meaningful clinical
benefit has two main considerations: outcome area + size
of the outcome.
Meaningful
areas of outcomes for study =
- Morbidity
- Mortality
- Symptom Relief
- Functioning
(emotional or physical)
- Health-related
Quality of Life
Outcomes that
are not one of these areas (such as values from lab tests) are called
intermediate or proxy markers
and should only be considered if we have a strong
causal chain of proof that affecting that marker leads
to a beneficial effect in one of the five areas listed above. |
A Little About Analysis
Analysis is a very big topic. And many people worry overly about needing to have a deep understanding of statistics in order to evaluate the reliability of studies. Frequently a general understanding of a few basic statistics as described below and an understanding of confidence intervals is sufficient, provided the study has been analyzed for potential distortion from bias, confounding or chance.
Missing data points are an important validity consideration. The real issue concerns whether those who provide study data and those who do not differ from the randomized population in prognostic variables which could affect study outcomes. If they do not differ in these ways, it is reasonable to conclude that the absence of data did not distort the study outcomes.
That said, there are many ways populations may be analyzed. We prefer to see several methods used because we think sensitivity analyses are helpful to provide clues to reliability. So it is useful to see completer analyses (meaning the population analyzed was restricted to only those who completed the study) or per protocol analyses (people for whom the protocol was strictly followed), but the method preferred by us and many others for evaluating efficacy of a therapy in a superiority trial is what is called intention-to-treat analysis (ITT).
ITT analysis requires that patients are analyzed in the group to which they were randomized, regardless of actual intervention received and regardless of study completion. This in turn requires the use of some reasonable method for assigning missing data points (data imputation). You can read more about ITT analysis here.
ITT is not appropriate for non-inferiority and equivalence trials if the data imputation methods favor no difference between groups.
Safety populations for study should be restricted to only those who actually received the study medication and should be analyzed by treatment actually received, not by group as randomized. Including patients who did not receive the drug can inappropriately understate adverse effects. |
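Here is a minimal sketch of how the choice of analysis population can change a result. The ten patient records are hypothetical, and so is the imputation rule: a missing outcome is counted as "did not improve". That is just one simple way to assign missing data points, used here for illustration and not as a recommendation.

```python
# Hypothetical trial records: (arm as randomized, completed study?, improved?).
# "improved" is None when the outcome is missing because the patient dropped out.
records = [
    ("drug", True, True), ("drug", True, True), ("drug", True, True),
    ("drug", False, None), ("drug", False, None),
    ("placebo", True, True), ("placebo", True, False),
    ("placebo", True, False), ("placebo", True, False),
    ("placebo", False, None),
]

def success_rate(arm, population):
    """Share of patients who improved, under the chosen analysis population."""
    rows = [r for r in records if r[0] == arm]
    if population == "completer":
        rows = [r for r in rows if r[1]]  # drop everyone who did not finish
    # For "itt" every randomized patient stays in. Assumption for this sketch:
    # a missing outcome is imputed as "did not improve".
    improved = sum(1 for r in rows if r[2] is True)
    return improved / len(rows)

for population in ("completer", "itt"):
    drug = success_rate("drug", population)
    placebo = success_rate("placebo", population)
    print(f"{population}: drug {drug:.0%}, placebo {placebo:.0%}, "
          f"difference {drug - placebo:.0%}")
```

In this made-up data the completer analysis shows a much larger difference between groups than the ITT analysis, because the dropouts in the drug group quietly disappear from the completer denominator. Seeing both side by side is the kind of sensitivity analysis that gives clues to reliability.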
For Further Reading
From the Delfini
Library of Tools & Educational Materials—
|
Size
of Results
I repeat: Meaningful
clinical benefit has two main considerations: outcome area
+ size of the outcome.
Two key issues
include—
- What is the
kind of measure used to assess the size of the
difference in outcomes between groups (remember! this is the whole
point of the research study! the difference in the outcomes between
the groups!) We will address the kind of measure forthwith after
explaining that another key issue is—
- What is the
associated time period within which the outcome
occurred. This is the time period of the study. A key point
is that we can only say the outcome occurred "within" the study time
period, unless the study has been able to pinpoint the timing more directly.
|

“Size
matters,” according to a palate-weary Sunset panel, after
blind-tasting more than 50 chickens, which I learn in my search
for the perfect roast chicken. So too in research results.
|
Measures
of Outcomes
This is a sparkly
and fancy term (as many terms in medical science unnecessarily are)
for measures used to assess the size of the difference between the groups studied.
Terms which express this difference include estimates of effect and point
estimates. Here are some ways in which these are expressed.
(More, such as Odds Ratios, are detailed at Delfini
Library of Tools & Educational Materials.) |
For Further Reading
From the Delfini
Library of Tools & Educational Materials—
 
Keep reading OR sit back and watch and listen to our video tutorial on Measures of Outcomes (12 min) or do both! |
Risk
With & Without Treatment (My Personal Favorite!)

|
Study outcomes
at their most basic. Simple! What happened to everyone! (Not always
easy to find in trial information, shockingly, but what I, as a
critical appraiser—and as a patient—most
want to know.) Don't give me that fancy stuff. |
Absolute
& Relative Risk Reduction
- Absolute
measures are the percentage point differences between the percent
outcomes in each group. In other words, if 15 percent of people
die in the control group and 10 percent of people die in the study
group, the Absolute Risk Reduction (ARR) is 5
percentage points.
- Relative
measures, such as Relative Risk Reduction (RRR),
reflect the relative difference in size between groups. 10 is
one-third smaller than 15, so the RRR is 33 percent. Now I can
sell more of everything because 33 sounds bigger than 5 even though
it is the same result. Relative measures can (rarely)
equal absolute measures, but they are never smaller. They are
almost always bigger. Very bigger!
- Relative
begs the question, "Relative to what?"
Only knowing the relative measure is like learning there is a
90 percent-off sale and not being told 90 percent-off of what—you
know you need to know the base price. So too in medical research.
|
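The example above can be worked through in a few lines of Python. The first pair of risks (15 percent and 10 percent) comes from the text. The second pair (1.5 percent and 1.0 percent) is a hypothetical added to show why "Relative to what?" matters. The function names are made up for this sketch, and exact fractions are used to avoid rounding surprises.

```python
from fractions import Fraction

def absolute_risk_reduction(control_risk, treated_risk):
    """Difference between the two risks, as a proportion."""
    return control_risk - treated_risk

def relative_risk_reduction(control_risk, treated_risk):
    """The same difference, expressed relative to the control group risk."""
    return (control_risk - treated_risk) / control_risk

examples = {
    "from the text (15% vs 10%)": (Fraction(15, 100), Fraction(10, 100)),
    "hypothetical (1.5% vs 1.0%)": (Fraction(15, 1000), Fraction(10, 1000)),
}

for label, (control_risk, treated_risk) in examples.items():
    arr = absolute_risk_reduction(control_risk, treated_risk)
    rrr = relative_risk_reduction(control_risk, treated_risk)
    print(f"{label}: ARR {float(arr) * 100:.1f} percentage points, "
          f"RRR {float(rrr) * 100:.0f} percent")
```

Both trials have the same RRR of about 33 percent, yet the absolute benefit is ten times smaller in the second one. The relative number alone cannot tell you which situation you are in.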
Number-Needed-to-Treat
& Number-in-100
Number-needed-to-treat
(NNT) is the reciprocal of the ARR. You take the ARR out of being
a percentage: in the example above, 5 goes into 100, 20 times. You
need to treat 20 people to benefit 1 person.
This is potentially
confusing to patients (and probably to all of us) since we may be
biased toward the larger number.
We like the approach (thank
you Dr. Tim Young of St. John Medical Center & OMNI Medical
Group in Tulsa, Oklahoma) of simply taking the ARR number and expressing
that as the number-out-of-a-hundred: in the above example, 5 out
of 100 patients.
We like simplified
things such as natural frequencies. |
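Both conversions are simple enough to sketch in Python. The function names are made up for this example, and rounding the NNT up to a whole person is a common convention that we are assuming here, not something stated above. Exact fractions are used so that an ARR of 5 percentage points gives exactly 20.

```python
import math
from fractions import Fraction

def number_needed_to_treat(arr):
    """Reciprocal of the ARR (as a proportion), rounded up to whole people."""
    return math.ceil(1 / arr)

def number_in_100(arr):
    """The ARR expressed as how many patients out of 100 benefit."""
    return arr * 100

arr = Fraction(5, 100)  # the 5 percentage point ARR from the example above
print(f"NNT: treat {number_needed_to_treat(arr)} people to benefit 1")
print(f"Number-in-100: {number_in_100(arr)} out of 100 patients benefit")
```

The two numbers carry the same information. The number-in-100 simply keeps the benefit, rather than the number treated, in front of the reader.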
For Further Reading
See Delfini Library of Tools
& Educational Materials sections for critical appraisal
tools and primers including 1-pagers on Analyzing
Results including guidance on Confidence Intervals, Intention-to-Treat analysis and Time-to-Event analysis, plus cross-over design, oncology outcomes, evaluating registries, and critical appraisal of screening and diagnostic tests. There is much more, including help regarding evidence synthesis, suggestions for decision support and communication aids, health care economic analysis tools, evidence-based QI tools, performance measurement tools and more...
Critical Appraisal
Example
An example of a critical appraisal report is The DelfiniClick™:
Radiofrequency for the Treatment of Gastro-esophageal Reflux
Disease
Example: full
story; appraisal
only |
Safety
It is challenging
to assess safety for a number of reasons:
- Outcomes
are infrequent, hard to find
and frequently due to chance.
- Reporting
may be selective.
- Duration
of follow-up may be too short.
- We frequently
need to resort to weaker evidence, but should
be cautious when so doing.
- BEWARE
of statements of non-significance when it may be an issue
of too few people studied: review the confidence intervals!
|
 |
For Further Reading
From the Delfini Library of Tools
& Educational Materials, our 1-pager on Safety. |
Grading
Evidence grading
is an assignment of a summary code or statement to represent the
reliability and/or clinical usefulness of—well—some
evidence! This may be an outcome, a conclusion, a study, a recommendation—whatever.
Here are some
key points and tips.
- There are
a trillion systems (or somewhat fewer, but many!). In choosing
a system, consider complexity and criteria.
- In seeing
a grade, look up the criteria. Many criteria are flawed and allow
overstatement of quality.
- Our system
is simple and quick. A is outstanding. We celebrate B's—about
3 percent of the time. We don't want a C because that is passing,
so we quickly leap downward into the bowels of the alphabet and
grasp at U (for uncertain) about 90 percent of the time. About
7 percent of the time, we are perched at the border of B and U
for a grade of BU.
See Delfini
Library of Tools & Educational Materials for more
information on Evidence Grading. |
For Further Reading
From the Delfini
Library of Tools & Educational Materials, see—
|
Secondary
Studies & Secondary Sources
Garbage in,
garbage out, as they say.
- The biggest
problem with most (and we do mean the majority!) of secondary
studies and sources is that invalid studies are used!
- A second
big problem is that a systematic approach is not used
(see above EBM Hallmarks).
- A third big
problem is lack of transparency.
Buyer
beware cannot be stated biggly enough! |
 |
For Further Reading
From the Delfini
Library of Tools & Educational Materials, see—
- Secondary Studies (1-pager)
- Secondary Studies (long tool): another important tool—our critical appraisal checklist for secondary studies (overviews/narrative reviews, systematic reviews and meta-analyses) including examples of good answers and examples of poor answers
- Secondary Sources (1-pager)
- Secondary Sources (long tool): another important tool—our critical appraisal checklist for secondary studies (overviews/narrative reviews, systematic reviews and meta-analyses) including examples of good answers and examples of poor answers
- Audit advice for secondary studies and secondary sources (1-pager)
|
A
Few Bottom Lines
- One bottom
line is that we like to see several valid studies done in slightly
different ways by groups with different interests (potential conflicts)
with consistent results to feel we have arrived at "truth!"
- A very big
bottom line is USE TOOLS! Find tools you like
and that will not lead you into trouble. For example, avoid Jadad
scoring at all costs! Use our tools at the Delfini
Library if you like them. If not, find something
you like that is likely to lead you to a valid assessment of a
study.
- We emphasize
use of tools because they help remind you of what to assess and
help remind you of what is a threat to validity because of what
is missing.
|