Contents

About
PMID Numbers: We frequently use a PMID number in place of a citation. Where a PMID number is available, enter it into the PubMed search box to retrieve the citation and listing.

New (11/10/07): Confidence-Intervals, Power & Meaningful Clinical Benefit: Advice to Readers on How to Stop Worrying about Power and Start Using Confidence Intervals & Using Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings

Other Key Links
- Evidence Essentials
- On the Same Page™

Variation/Quality of Care
- The Volume of Inappropriate Care in the US
- Variations in Volume & Intensity of Hospital Care
- Variations in Experts' Recommendations
- Variations in Clinicians' Estimates of Pretest Probabilities
- Underuse of Proven Interventions
- Class Effect: Caution Urged
- Guidelines & Effectiveness of Implementation
- Oregon Preferred Drug List
- Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic

Reporting
- Untrustable Abstracts & P-Values
- CONSORT Statement on Harms
- When There is No Evidence
- TREND: Reporting Standards for Non-randomized Studies
- Poorly Written Papers
- Media Heyday: Aspirin and (Potentially) Reduced Risk of Breast Cancer

Evidence & Patients
- Better Evidence Choices: Carotid Endarterectomy (On the Same Page™)
- Screening & Decision Aids
- 01/05/07: Bandolier on Patients and Risk Information (On the Same Page™)
- Strategies for Increasing Adherence
- Discussing Information and Decisions with Patients

EBM Humor
- EBM Secret Truths Uncovered!!!
- A Field Guide to Experts!

Special Feature: The Validity Detective

Quality of Evidence

Observational Studies
- More on the Problem with Drawing Cause-Effect Conclusions from Observational Studies
- Cause & Effect Conclusions from Observations
- Bias in Observational Studies: More on HRT in Menopause

Primary Studies & General Concepts
- Understanding Number Needed to Treat (NNT)
- Concealment of Allocation
- Blinding and RCTs
- Blinding in Surgery Trials
- The Importance of Blinded Assessors in RCTs
- Attrition Bias: Intention-to-Treat Basics
- Intention-to-Treat Analysis: Censoring
- Intention-to-Treat Analysis: Misreporting and Migraine
- Missing Data Points: Difference or No Difference
- Quality of Studies: Lower Quality = Greater Effect Size
- More: Overestimation of Effect Size in Studies of Low Quality
- External Validity: Case of the Carotid Stent
- Quality of Studies: VIGOR
- Confidence-Intervals, Power & Meaningful Clinical Benefit

Diagnostic Studies
- Bias in Diagnostic Studies

Secondary Studies
- Systematic Reviews: Quality & Searching Tips
- Systematic Reviews: Untrustable "Trustable" Sources?
- Delfini Letter to BMJ: Corticosteroids are Not Proven for Treatment of Bell's Palsy
- Delfini Letter to NEJM: Bell's Palsy - REDUX!
- Substandard Evidence: POEMS and Diabetes. Readers, Beware!
- Quality of Systematic Reviews: Misleading “POEM” on Hormone Therapy

Secondary Sources
- Quality of Clinical Guidelines
- Poor Quality of Guidelines (Case Study): The Evidence on Well-Child Care Recommendations

Quality Improvement
- Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic

If a Click link does not work, please email us to let us know. For many of them, you may be able to go to PubMed (www.PubMed.gov) and do a quick search using the citation.

DelfiniClick™

Confidence-Intervals,
Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start
Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically
Significant Findings
(Special thanks to Brian Alper, MD, MSPH and Ted Ganiats,
MD for their help in understanding this issue.)
Problems with Non-Statistically Significant
Findings
Research outcomes that are not statistically significant (also referred to as “non-significant findings”) raise the question, "Is there TRULY no difference, or were there not enough people to show a difference if there is one?" (Failing to detect a difference that truly exists is known as a beta, or Type II, error.)
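Power calculations, performed before a study begins, help investigators determine how many people they should enroll to have a good chance of detecting a statistically significant difference if there is one. A power of 80% or more is conventional: if a difference of the assumed size truly exists, the study has at least an 80% chance of detecting it, which accepts up to a 20% chance of a false negative. Power calculations are generally performed only for the primary outcome, and they rest on many assumptions (for example, the expected event rates and the size of the difference worth detecting).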
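To make the idea concrete, here is a minimal Python sketch of a sample-size (power) calculation for a trial comparing two event rates. It assumes the common normal-approximation formula for two independent proportions, a two-sided alpha of 0.05, and hypothetical mortality rates of 10% versus 5%; real trials may use other formulas, corrections, or software, so treat it as an illustration only.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(p_control: float, p_treatment: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed in EACH group to detect the difference
    between two event rates (two-sided test, normal approximation,
    unpooled variance, no continuity correction)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.84 when power = 0.80
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    difference = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Hypothetical assumptions: 10% mortality with usual care, 5% with the intervention.
print(n_per_group(0.10, 0.05))              # 432 per group at 80% power
print(n_per_group(0.10, 0.05, power=0.90))  # more people are needed for 90% power
```

Changing any single assumption (the expected rates, alpha, or power) changes the answer, which is one reason the assumptions behind a power calculation deserve scrutiny.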
Good News About Power!
The good news for readers is that you don’t need to worry about power, because you can use confidence intervals to judge whether findings are inconclusive.
Here’s what they are, and here’s how
it’s done:
About Confidence Intervals (CIs)
The results of a valid study represent an approximation of truth. Other values might approximate the truth equally well. (What if the study had been done on Friday instead of on Tuesday, for example? Maybe the difference in outcomes would be an absolute 4 percent and not 5 percent.) In recognition of this, confidence intervals give a range of results that are statistically plausible given the study data. For a valid study, we can be 95% confident that the true answer lies within the 95% CI; strictly speaking, 95% of intervals calculated this way would contain the true value. (As with all allowances for chance findings, 95 percent is conventional.) You can apply confidence intervals to any measure of outcomes, such as an odds ratio or absolute risk reduction (ARR).
This
is how confidence intervals are reported:
Example: ARR = 5%; 95% CI (3% to 7%)
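As an illustration of where such numbers come from, here is a minimal Python sketch that computes an ARR and its 95% CI from raw counts. It assumes the simple Wald (normal-approximation) method, which is only one of several methods studies use, and the trial counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def arr_with_ci(events_control, n_control, events_treat, n_treat, level=0.95):
    """Absolute risk reduction (control risk minus treatment risk) with a
    simple Wald (normal-approximation) confidence interval, as fractions."""
    risk_c = events_control / n_control
    risk_t = events_treat / n_treat
    arr = risk_c - risk_t
    se = sqrt(risk_c * (1 - risk_c) / n_control + risk_t * (1 - risk_t) / n_treat)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    return arr, arr - z * se, arr + z * se


# Hypothetical trial: 100/1000 deaths with usual care, 50/1000 with the intervention.
arr, low, high = arr_with_ci(100, 1000, 50, 1000)
print(f"ARR = {arr:.1%}; 95% CI ({low:.1%} to {high:.1%})")
# → ARR = 5.0%; 95% CI (2.7% to 7.3%)
```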
How to Use Confidence Intervals to Determine
Statistical Significance
Absolute Risk Reduction and Relative Risk Reduction
For measures of difference, which are usually reported as percentages, if the range includes zero, the outcomes are not statistically significant.
Relative Risk (also called Risk Ratio) and Odds Ratio
For measures reported as ratios, if the range includes 1, the outcomes are not statistically significant.
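The rule can be written as a tiny Python check. This is a sketch that assumes the CI was computed by a method consistent with the significance test, so that a 95% CI excluding the null value corresponds to P < 0.05; the example intervals are illustrative.

```python
def excludes_null(lower: float, upper: float, null_value: float) -> bool:
    """Return True if the CI does not contain the 'no effect' value.

    Use null_value=0 for differences (ARR, RRR) and null_value=1 for
    ratios (relative risk, odds ratio).
    """
    return not (lower <= null_value <= upper)


print(excludes_null(0.03, 0.07, null_value=0))   # ARR CI (3% to 7%): True
print(excludes_null(-0.01, 0.05, null_value=0))  # ARR CI (-1% to 5%): False
print(excludes_null(0.30, 0.81, null_value=1))   # ratio CI (0.30 to 0.81): True
print(excludes_null(0.31, 1.15, null_value=1))   # ratio CI (0.31 to 1.15): False
```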
How to Use Confidence Intervals to Determine
Conclusiveness of Non-significant Findings
When a finding is not statistically significant (also referred to as a non-significant or NS finding), you don’t know whether there truly is no difference or whether there were not enough people to show a difference if there is one.
You can look to the CIs to help you with this situation.
But first you want to decide what you would consider to be your
minimum requirement for a clinically significant outcome (difference
between outcomes in the intervention and comparison groups). This
is a judgment call.
Let’s assume we are looking at a study whose primary outcome is absolute reduction in mortality. One might reasonably conclude that an outcome of 1 percent or more is, indeed, a clinically meaningful benefit.
[Below is a text explanation. Pictures tell this best, however; see the accompanying PDF for a graphical version. Note that the PDF starts with how to determine clinical significance of statistically significant outcomes and then demonstrates how to determine conclusiveness of non-significant findings.]
Example: Clinical Significance Goal
≥1% absolute reduction in mortality
For Non-Significant Findings:
Example 1
- ARR = 2%; 95% CI (-1% to 5%)
- The upper boundary tells you it is possible that the true result WOULD meet your requirements for clinical significance. Thus, from that perspective, this trial is inconclusive about NO DIFFERENCE BETWEEN GROUPS: you do not know whether the trial was insufficiently powered (a false negative due to too few people to show a statistically significant difference if there is one).
Example 2
- ARR = 0%; 95% CI (-0.5% to 0.5%)
- The upper boundary does not reach your goal. Therefore, this can be considered sufficient evidence that there is no difference between the groups that you would consider clinically significant.
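The same reasoning can be sketched in Python. It is a simplification that assumes benefit is expressed as a positive ARR and that only the benefit side of the interval matters; the function name is our own, not a standard one.

```python
def ns_finding_is_conclusive(ci_upper: float, goal: float) -> bool:
    """For a non-significant ARR, return True only if even the most
    favorable plausible result (the CI's upper bound) falls short of the
    minimum clinically meaningful benefit."""
    return ci_upper < goal


GOAL = 0.01  # minimum meaningful benefit: 1% absolute reduction in mortality

print(ns_finding_is_conclusive(0.05, GOAL))   # Example 1, CI (-1% to 5%): False (inconclusive)
print(ns_finding_is_conclusive(0.005, GOAL))  # Example 2, CI (-0.5% to 0.5%): True
```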
How to Use Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings
You can also use confidence intervals to determine whether a result from a valid study is of meaningful clinical benefit.
Requirements
for Meaningful Clinical Benefit
Remember that outcomes of clinical significance are those which
benefit patients in some way in the areas of morbidity, mortality,
symptom relief, physical or emotional functioning or health-related
quality of life. Intermediate markers are assumed to benefit patients in these areas, but they may not. Thus, a direct causal chain of benefit must be proved to avoid waste and potential patient harms occurring as unintended consequences. Meaningful clinical benefit combines benefit in a clinically significant area with a large enough result.
As with evaluating the conclusiveness of a non-significant finding, you apply judgment to set your minimum requirement for meaningful clinical significance. Using the same example, in which you chose a 1 percent absolute reduction in mortality as the minimum meaningful clinical benefit:
Example: Clinical Significance Goal
≥1% absolute reduction in mortality
For Statistically Significant Findings:
Example 1
- ARR = 2%; 95% CI (0.5% to 3.5%)
- The lower boundary tells you it is possible that the true result will NOT meet your requirements for clinical significance. Thus, from that perspective, this trial is inconclusive.
Example 2
- ARR = 2%; 95% CI (1% to 3%)
- The lower boundary reaches your goal for clinical significance. Therefore, this can be considered sufficient evidence of benefit.
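In Python, the check for statistically significant findings looks at the other end of the interval. Again this is a sketch that assumes benefit is a positive ARR; the function name is hypothetical.

```python
def benefit_is_established(ci_lower: float, goal: float) -> bool:
    """For a statistically significant ARR, return True only if even the
    least favorable plausible result (the CI's lower bound) reaches the
    minimum clinically meaningful benefit."""
    return ci_lower >= goal


GOAL = 0.01  # minimum meaningful benefit: 1% absolute reduction in mortality

print(benefit_is_established(0.005, GOAL))  # Example 1, CI (0.5% to 3.5%): False (inconclusive)
print(benefit_is_established(0.01, GOAL))   # Example 2, CI (1% to 3%): True
```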
Again, pictures probably tell this best; see the accompanying PDF.
The Authors Did Not Report CIs?
If you can create a 2 x 2 table from the study data, you can compute CIs yourself using the confidence interval calculator of the University of British Columbia, Department of Health Care and Epidemiology, which can also be found in the Delfini WebLinks under "confidence interval calculations."
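For readers who prefer to compute ratio measures directly, here is a hedged Python sketch that starts from a 2 x 2 table. It uses the common large-sample (log-scale) method for relative risk and odds ratio; we do not know which method the calculator above uses, so results may differ slightly. The counts are hypothetical, and every cell must be greater than zero.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def ratio_cis(a, b, c, d, level=0.95):
    """Relative risk and odds ratio with large-sample (log-scale) CIs.

    2 x 2 table layout (counts; all cells must be > 0):
                    event   no event
        treatment     a        b
        control       c        d
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)

    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    def interval(estimate, se):
        # Build the CI on the log scale, then transform back.
        return exp(log(estimate) - z * se), exp(log(estimate) + z * se)

    return {"RR": (rr, *interval(rr, se_log_rr)),
            "OR": (odds_ratio, *interval(odds_ratio, se_log_or))}


# Hypothetical trial: 50/1000 deaths with the intervention, 100/1000 with control.
for name, (est, low, high) in ratio_cis(50, 950, 100, 900).items():
    print(f"{name} = {est:.2f}; 95% CI ({low:.2f} to {high:.2f})")
```

Because both intervals here exclude 1, the result would be statistically significant by the rule given earlier.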
Evaluate Definitions for Outcomes
And remember, ensure you agree with the authors’ definitions of the outcomes, especially if they use a term like “improved,” “success,” or “failure.” Is a three-point change on a 200-point scale really a meaningful clinical difference that should define success? You get to be the judge.

Overestimation of Effect Size in Studies of Low Quality
In a previous DelfiniClick, we summarized an article by Moher and colleagues (1) in which the authors randomly selected 11 meta-analyses involving 127 RCTs that evaluated the efficacy of interventions used for circulatory and digestive diseases, mental health, pregnancy and childbirth. Moher and colleagues concluded that:
- Low-quality trials, compared with high-quality trials (score >2), were associated with a 34% relative increase in the estimate of benefit.
- Trials that used inadequate allocation concealment, compared with those that used adequate methods, were associated with a 37% relative increase in the estimate of benefit.
Below
we summarize another study that confirms and expands Moher’s
findings. In a study similar to Moher’s, Kjaergard and colleagues
(2) evaluated the effects of methodologic quality on estimated intervention
effects in randomized trials.
The
study evaluated 23 large and 167 small randomized trials and a total
of 136,164 participants. Methodologic quality was defined as the
confidence that the trial’s design, conduct, analysis, and
presentation minimized or avoided biases in the trial’s intervention
comparisons (3). The reported methodologic quality was assessed
using four separate components and a composite quality scale.
The quality score was ranked as low (≤2 points) or high (≥3 points), as suggested by Moher et al. (1). The four components were 1) generation of allocation sequence; 2) concealment of allocation; 3) double-blinding; and 4) reporting of loss to follow-up.
RESULTS
OF KJAERGARD ET AL’S REVIEW (all reported exaggerations
are relative increases):
Generation
of Allocation Sequence
The odds ratios generated by all trials (large and small) with inadequate generation of the allocation sequence were on average significantly exaggerated by 51% compared with all trials reporting adequate generation of allocation sequence (ratio of odds ratios (95% CI) = 0.49 (0.30–0.81), P < 0.001).
Concealment
of Allocation
All trials with inadequate allocation concealment exaggerated intervention benefits by 40% compared with all trials reporting adequate allocation concealment, although this difference was not statistically significant (ratio of odds ratios (95% CI) = 0.60 (0.31–1.15), P = 0.12). Odds ratios were significantly exaggerated by 52% in small trials with inadequate versus adequate allocation concealment (ratio of odds ratios (95% CI) = 0.48 (0.25–0.92), P = 0.027).
Double
Blinding
The odds ratios generated by all trials without double blinding
were significantly exaggerated by 44% compared with all double-blind
trials (ratio of odds ratios (95% CI) = 0.56 (0.33–0.98),
P = 0.041).
Reporting
of Loss-to-Followup
The analyses showed no significant association between reported
follow-up and estimated intervention effects (ratio of odds ratios
(95% CI) = 1.50 (0.80–2.78), P = 0.2).
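The percentages in these results match (1 - ratio of odds ratios) x 100, so a short Python sketch can reproduce them from the reported figures and also check whether each 95% CI excludes 1. That this is how the percentages were derived is our inference from the numbers, not a method stated by the authors. Loss to follow-up is left out because its ratio is above 1 and was not significant.

```python
def exaggeration_percent(ror: float) -> int:
    """Relative exaggeration of benefit implied by a ratio of odds ratios.

    Assumes odds ratios below 1 indicate benefit, so an ROR below 1 means
    the flawed trials showed more benefit than the adequate ones.
    """
    return round((1 - ror) * 100)


def ci_excludes_one(lower: float, upper: float) -> bool:
    return not (lower <= 1 <= upper)


# (ratio of odds ratios, lower 95% CI, upper 95% CI) as reported above
reported = {
    "inadequate sequence generation, all trials": (0.49, 0.30, 0.81),
    "inadequate allocation concealment, all trials": (0.60, 0.31, 1.15),
    "inadequate allocation concealment, small trials": (0.48, 0.25, 0.92),
    "no double blinding, all trials": (0.56, 0.33, 0.98),
}

for label, (ror, low, high) in reported.items():
    print(f"{label}: {exaggeration_percent(ror)}% exaggeration; "
          f"CI excludes 1: {ci_excludes_one(low, high)}")
```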
Kjaergard
and Colleagues’ Conclusions
- Adequate
generation of the allocation sequence and adequate allocation
concealment should be required for adequate randomization.
Unlike
previous investigators (1,3,4, 5), the authors found that trials
with inadequate generation of allocation sequence exaggerate
intervention effects significantly.
- Trials
with inadequate allocation concealment also generate exaggerated
results.
This
is in accordance with previous evidence (1,3,5). The authors
found that despite the considerable overlap between generation
of allocation sequence and allocation concealment, both factors
may independently affect the estimated intervention effect.
- Trials
without double blinding exaggerate results.
This
study supports Schulz and colleagues’ finding of a significant
association between intervention effects and double blinding
and extends the evidence by including trials from several therapeutic
areas.
- There
was no association between reported follow-up and intervention
effect.
Delfini
Comment
It
is useful to know quantitatively how various threats to validity
affect results when doing critical appraisal of a study. The study
by Kjaergard and colleagues summarized above expands the findings
of Schulz, Moher, Juni and others.
Previous
studies have questioned the reliability of reported losses to follow-up
(5, 6). In accordance with Schulz and colleagues’ results
(5), the authors found no association between intervention effects
and reported follow-up.
Delfini Note: We have found that losses to follow-up may substantially change P values when a sensitivity analysis is done. We consider loss of ≥5% with differential loss, or ≥10% without differential loss, to be an important threat to validity.
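One common form of sensitivity analysis is a "worst case" reanalysis in which every patient lost from the intervention group is assumed to have had the bad outcome and every patient lost from the control group is assumed not to. The Python sketch below applies that assumption to a hypothetical trial with 5% loss in each arm, using a pooled two-proportion z-test; it illustrates the idea and is not necessarily the method Delfini uses.

```python
from math import erfc, sqrt


def two_proportion_p_value(events_1, n_1, events_2, n_2):
    """Two-sided P value from a pooled two-proportion z-test."""
    p_1, p_2 = events_1 / n_1, events_2 / n_2
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / se
    return erfc(abs(z) / sqrt(2))


# Hypothetical trial: 200 randomized per arm, outcome is a bad event.
# 10 patients (5%) lost from each arm, leaving 190 analyzed per arm.
events_treat, events_ctrl, analyzed, lost = 60, 85, 190, 10

complete_case = two_proportion_p_value(events_treat, analyzed, events_ctrl, analyzed)

# Worst case for the intervention: all lost intervention patients had the
# event; no lost control patients did.
worst_case = two_proportion_p_value(events_treat + lost, analyzed + lost,
                                    events_ctrl, analyzed + lost)

print(f"complete-case P = {complete_case:.3f}")
print(f"worst-case P    = {worst_case:.3f}")
```

In this made-up example the complete-case result is statistically significant, but the worst-case result is not, even though only 5% of each arm is missing.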
In
agreement with the findings of Moher and associates (1,3) and Juni
and colleagues (7), the authors found that trials with a low quality
score on the scale developed by Jadad and colleagues (8) significantly
exaggerate intervention benefits.
Kjaergard
and colleagues conclude that assessment of methodologic quality
should focus on generation of allocation sequence, allocation concealment,
and double blinding. Delfini feels this is not sufficient –
but appreciates this study as one that further demonstrates the
importance of effective approaches to some of these methodologic
areas.
References
1. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352:609-13. [PMID: 9746022]
2. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med. 2001;135:982-9.
3. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess. 1999;3:i-iv, 1-98. [PMID: 10374081]
4. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Control Clin Trials. 1990;11:339-52. [PMID: 1963128]
5. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-12. [PMID: 7823387]
6. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Control Clin Trials. 1989;10:31-56. [PMID: 2702836]
7. Jüni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282:1054-60. [PMID: 10493204]
8. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1-12. [PMID: 8721797]

Missing Data Points: Difference or No Difference. Does It Matter?
We continue to study the "evidence on the evidence," meaning we are continually on the lookout for information that may shed light on how certain kinds of bias affect reported outcomes, or that helps in handling different biases. Missing data points affect the majority of studies, but it is currently unclear how big an issue this is, especially when there is no differential loss between groups.
We spoke recently about this issue with John M. Lachin, Sc.D., Professor of Biostatistics and Epidemiology, and of Statistics, The George Washington University, and author. (And then we did some "hard thinking," as David Eddy would say.) Even without differential loss between the groups overall, a differential loss could occur in prognostic variables, and readers rarely have access to data about changes in prognostic characteristics after baseline reporting. So we continue to take a conservative approach: we consider loss of around five percent or more with differential loss, or around ten percent or more without differential loss, to be a source of bias.
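A small worked example shows how equal losses can still bias results. The Python sketch below uses made-up expected values (no randomness) for a trial in which the intervention has no true effect: each arm loses the same number of patients, but the intervention arm loses high-risk patients while the control arm loses low-risk ones.

```python
# Hypothetical trial in which the intervention has NO true effect.
# Each arm starts with 500 high-risk patients (30% mortality)
# and 500 low-risk patients (10% mortality).
HIGH_RISK, LOW_RISK = 0.30, 0.10


def observed_risk(high_risk_analyzed, low_risk_analyzed):
    """Expected event rate among the patients who remain in the analysis."""
    expected_events = high_risk_analyzed * HIGH_RISK + low_risk_analyzed * LOW_RISK
    return expected_events / (high_risk_analyzed + low_risk_analyzed)


# Both arms lose 50 patients (5%), so overall loss is NOT differential...
treatment = observed_risk(high_risk_analyzed=450, low_risk_analyzed=500)  # lost high-risk
control = observed_risk(high_risk_analyzed=500, low_risk_analyzed=450)    # lost low-risk

# ...but the losses differ in prognosis, creating an apparent benefit.
apparent_arr = control - treatment
print(f"Apparent ARR = {apparent_arr:.2%} despite no true effect")
```

Here the apparent ARR is about 1.05 percentage points, large enough to look clinically meaningful by the 1% goal used earlier on this page, yet it is entirely an artifact of who was lost.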
For those who are tough and hardy and really want to mull on this, our updated white paper on "missingness" is available in Word and PDF formats. We welcome further thoughts (or evidence) on this area.

The Importance of Blinded Assessors in RCTs
We have previously summarized the problems associated with lack of blinding in surgical (and other) studies (see "Blinding in Surgery Trials" in a previous DelfiniClick). The major problem with unblinded studies is that outcomes in the intervention group are likely to be falsely inflated by the biases that lack of blinding introduces.
Recently a group of orthopedists identified and
reviewed thirty-two randomized, controlled trials published in The
Journal of Bone and Joint Surgery between 2003 and 2004 to evaluate
the effect of blinded assessment vs non-blinded assessment on reported
outcomes [1].
Results
- Sixteen of the thirty-two randomized controlled trials did not report blinding of outcome assessors when blinding would have been possible.
- Among the studies with continuous outcome measures, unblinded outcomes assessment was associated with significantly larger treatment effects than blinded outcomes assessment (standardized mean difference, 0.76 compared with 0.25; p = 0.01).
- In the studies with dichotomous outcomes, unblinded outcomes assessments were associated with significantly greater treatment effects than blinded outcomes assessments (odds ratio, 0.13 compared with 0.42; p < 0.001).
- This translates into a relative risk reduction of 38% for blinded outcome assessments compared with 71% for unblinded outcome assessments (a difference of 33 percentage points).
Conclusion
Unblinded outcomes assessment dramatically inflates the reported effectiveness of treatments.
Delfini
Commentary
This is yet another study pointing out the importance of blinding. Based on this and other similar studies, we conclude that studies, or results of studies, without blinded assessors are grade U or at best grade B-U (see our evidence-grading scale).
1. Poolman RW, Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar F, Bhandari M. Reporting of outcomes in orthopaedic randomized trials: does blinding of outcome assessors matter? J Bone Joint Surg Am. 2007 Mar;89(3):550-8. [PMID: 17332104]
Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic
Example
provided by Rami Rihani, PharmD, Director of Pharmacy
Delfini
Introduction
Measuring clinical improvements is complex. One of the most important,
frequently misunderstood issues is that cause and effect relationships
can only be drawn with reasonable certainty from valid experiments
(RCTs). However, if we have valid evidence from RCTs that an intervention
leads to improved clinical outcomes, it is then reasonable to use
process measures to evaluate the success of our evidence-based clinical
improvement project.
Generally, we advise people not to measure health status outcomes, but instead to use a process measure to evaluate how successfully the intervention was applied. In other words, we advise people to measure the success of implementation of the clinical improvement. For example, if we are trying to ensure patients get a beta-blocker post-MI, we would recommend looking to see if prescriptions increased for hospitalized MI patients, not measuring whether patient survival improved. This is because observational data, such as information extracted from databases, can be highly prone to confounding. If health status outcomes are measured, we advise making sure that everyone using the data understands that conclusions drawn from observational data can be misleading. In the above example, if patient survival decreased, there could be many explanations.
However,
if a health status outcome is measured, and if the before/after
change is dramatic, it is reasonable to hypothesize that our project
has been successful. For example…
Problem
Many diabetics have difficulty achieving an HbA1c below 7.0%. Diabetics are frequently told their HbA1c levels are too high, but medication changes are not aggressively pursued.
Evidence-based
QI Project: A quality improvement group at Dreyer Medical Clinic
developed a disease management initiative using PharmDs to actively
titrate dosages of insulin and other drugs based on the Intermountain
Health Care (IHC) diabetes management protocol. The process is as
follows:
- Primary care physician (PCP) refers patient to the diabetes management program;
- PharmD aggressively titrates medication based on IHC protocol;
- PharmD monitors for safety and efficacy of medication interventions in collaboration with the PCP.
Outcomes
| Outcome (n=1049) | Prior to Enrollment | Most Recent Follow-up |
| % at HbA1c < 7% | 18% | 48.5% |
| % at LDL < 100 mg/dL | 30% | 58% |
Delfini
Commentary
There was a significant improvement in the percent of patients achieving
goal HbA1c and LDL associated with this project.
It
is reasonable to believe that the clinical improvement project was
successful. Using outcomes data from the UK Prospective Diabetes
Study 35 (1), the QI team estimates that since inception, the disease
management initiative resulted in the prevention of:
- four diabetes-related deaths and
- nine microvascular events (defined as renal failure, death from renal failure, retinal photocoagulation, or vitreous hemorrhage).
1. Stratton IM, Adler AI, et al. Association of glycaemia with macrovascular and microvascular complications of type 2 diabetes (UKPDS 35): prospective observational study. BMJ. 2000;321:405-12.
Blinding
In Surgical Trials — It is Through Blinding We Become Able To
See
Blinding is an important consideration when evaluating
a study. Without blinding, the likelihood of bias increases. Bias
occurs when patients in one group experience care or exposures not
experienced by patients in the other group(s), and the differences
in care affect the study outcomes. Lack of blinding may be a major
source of this type of bias in that unblinded clinicians who are
frequently “rooting for the intervention” may behave
differently than blinded clinicians towards patients whom they know
to be receiving the study drug or intervention being studied. The
result is likely to be that in unblinded studies, patients may receive
different or additional care. Unblinded subjects may be more likely
to drop out of a study or seek care in ways that differ from blinded
subjects. Unblinded assessors may also be “rooting for the
intervention” and assess outcomes differently from blinded
assessors.
How much difference does blinding make? Jüni
et al. reviewed four studies that compared double blinded versus
non-blinded RCTs and attempted to quantify the amount of distortion
(bias) caused by lack of double blinding [1]. Overall, the overestimation
of effect was about 14%. The largest study reviewed by Jüni assessed
the methodological quality of 229 controlled trials from 33 meta-analyses
and then analyzed, using multiple logistic regression models, the
associations between those assessments and estimated treatment effects
[2]. Trials that were not double-blind yielded on average 17% greater
effect, 95% CI (4% to 29%), than blinded studies (P = .01).
Lack
of double blinding is frequently found in surgical trials and
results in uncertain evidence because of the problems stated above.
A case study helps to illustrate this. A recent multicenter RCT, the Spine Patient Outcomes Research Trial (SPORT) [3], was a non-blinded trial that serves as an interesting case study of the blinding issues that arise when a surgical intervention is compared with a non-surgical intervention and blinding is not attempted. The trial included patients with persistent (at least 6 weeks) disk-related pain and neurologic symptoms (sciatica) who were randomized to undergo diskectomy or receive usual care (not standardized, but frequently including patient education, anti-inflammatory medication, and physical therapy, alone or in combination). The study had a number of problems, including lack of power, poor control of non-study interventions, a high proportion of patients who crossed over between treatment strategies (43% of those randomized to surgery did not undergo surgery by 2 years, and 42% of those randomized to conservative care did undergo surgery), and lack of blinding. Missing data ranged from 24% to 27%, and no true intention-to-treat analysis was performed.
Of great interest was an editorial that dealt with the problem
of non-blinding in surgical studies. The editorialist, Flum, makes
the following points [4]:
- While
the technique of sham intervention is well accepted in studies
of medications using inactive pills (placebos), simulated acupuncture,
and nontherapeutic conversation in place of therapeutic psychiatric
interventions, it has only occasionally been applied to surgical
trials. This is unfortunate because the use of sham controls
has been critical in understanding just how much patient expectation
influences outcomes after an operation.
- A sham-controlled trial would be particularly relevant for spine surgery, since the most commonly occurring and relevant outcomes are subjective.
- Patients choosing surgical options may have high expectations. They may have a higher level of emotional “investment” in surgical care than in usual care, based on the commitment that comes from deciding to have an operation and getting through recovery. After the patient has accepted the risks of surgical intervention, the desire for improvement may drive perceptions about improvement.
- Patients who opt for surgery may also differ from patients who decline surgery in their beliefs regarding the benefits of invasive interventions.
- The surgeon’s expectations and direction are likely to play an important role in patient improvement.
- Given the proliferation of operative procedures for the treatment of subjective complaints like back pain, the need for sham-controlled trials has never been greater.
Flum goes on to present multiple examples of the power of suggestion and the problem of doing non-blinded trials in the field of surgery. Observational studies have often reported procedural success, but sham-controlled trials for the same conditions demonstrate how much of that success is due to the placebo effect.
- Example 1, Internal Mammary Artery Ligation: After multiple observational studies suggested that ligation of the internal mammary artery was helpful in patients with coronary disease, Cobb et al randomized patients to operative arterial ligation or a sham procedure. Both groups improved after the intervention, but there were similar, if not greater, improvements in subjective measures such as exercise tolerance and nitroglycerin use in the sham surgical group.
- Examples 2 and 3, Osteoarthritic Knee Surgery and Osteoarthritic Knee Joint Irrigation: After multiple case series reported that patients with osteoarthritis of the knee improve after arthroscopic surgery, Moseley et al demonstrated just how much of that effect is related to the hopes, expectations, and beliefs of the patient. The investigators randomized 180 patients to undergo arthroscopy with debridement, arthroscopy with lavage, or sham arthroscopy. The power of expectation was strong: patients were unable to determine whether they had been assigned to the treatment or sham groups, and all groups improved. At 2 years after randomization, all patients reported comparable pain scores and functional scores. Another sham-controlled study in patients with knee osteoarthritis demonstrated that patients benefit equally from irrigation of the joint and from sham irrigation.
- Example 4, Parkinson’s Disease: Researchers found similar improvements in quality of life after direct brain injections of embryonic neurons or placebo in patients with advanced Parkinson’s disease.
- Example 5, Transmyocardial Laser Revascularization in HF: Heart failure patients undergoing transmyocardial laser revascularization or sham procedures had equal improvements in subjective outcomes.
- Example 6, Hernia: After hernia repair, there was equal improvement in pain control after cryoablation of nerves or sham interventions.
- Examples 7-9, Laparoscopic Interventions: Multiple case series have reported benefit on subjective outcomes such as pain control, function, and readiness for discharge with laparoscopic cholecystectomy, colon resection, and appendectomy compared with conventional approaches. Bias arises when the clinical care team influences patient and discharge expectations through coaching, communication, and management. Randomized trials of these three procedures that blinded both the patients and the discharging clinicians to the treatment received, by placing large, side-to-side abdominal wall dressings, demonstrated little or no difference in patients reaching discharge criteria. A reasonable conclusion is that when the clinician’s expectations and “coaching” were removed by placing a large bandage on the abdominal wall, the subjective benefits disappeared. Flum concludes that studies not addressing both patient and clinician expectation on subjective outcomes do not inform the clinical community about the true role of the intervention.
Delfini
Commentary
Blinding of subjects and everyone
working with the subjects or study data to the assigned intervention
(double-blinding) decreases the likelihood of bias. Bias may be
more likely to occur when evaluating subjective outcomes such as
pain, satisfaction, and function in non-blinded studies, but it
has also been reported with objective outcomes such as mortality.
When dealing with subjective outcomes, as Flum points out, it is
critical to distinguish the effect of the intervention from the
effect of the patient’s expectation of the intervention. The
only way to distinguish the effect of a patient’s positive
expectations of an operation from the intervention itself is to
blind patients to the treatment they receive and randomize them
to receive the intervention of interest or to receive a sham intervention
(placebo). Yet we frequently hear, “But blinding is not possible
in surgical studies.” Another common argument is that
subjecting people to anesthesia and sham surgery is not ethical.
However, conducting clinical trials employing methods that result
in avoidable fatal flaws is also problematic. Flum’s position
is that when the risk of a placebo does not exceed a threshold of
acceptable research risk and if the knowledge to be gained is substantial,
a sham-controlled trial is needed and is ethical. He reasons that
ethical justification of placebo-controlled trials is based on the
following considerations:
- Invasive
procedures are associated with risks.
-
There are great harms created by conducting studies that are of
uncertain validity.
- Establishing
community standards based on uncertain evidence is more likely
to result in more harm than good.
-
Sham-controlled trials are justified when uncertainty exists among
clinicians and patients about the merits of an intervention.
The SPORT trial draws attention to the problem of
non-blinding in surgical trials. This was a very expensive, labor-intensive
study that provides no useful efficacy data. Research subjects were
undoubtedly told this study would provide answers regarding the
relative efficacy of surgery vs conservative care for lumbar spine
disease. The authors of the SPORT trial state that a sham-controlled
trial was impractical and unethical, possibly — according
to Flum —
because the risk of the sham would include general anesthesia (to
truly blind the patients). He would argue that in this case blinding,
even though it would require anesthesia, was the only way that valid, useful
evidence could have been created. Even though we graded the study
U (uncertain validity and usefulness) and would not use the results
to inform decisions about efficacy or effectiveness because of the
threats to validity, the study does report information regarding
risks of surgery that may be of great value to patients.
-----------
1 Jüni P, Altman DG, Egger M. Systematic reviews in health care: assessing the quality of controlled clinical trials. BMJ. 2001;323:42-46. PMID: 11440947
2 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-412. PMID: 7823387
3 Weinstein JN, Tosteson TD, Lurie JD, et al. Surgical vs nonoperative treatment for lumbar disk herniation: the Spine Patient Outcomes Research Trial (SPORT): a randomized trial. JAMA. 2006;296:2441-2450. PMID: 17119141
4 Flum DR. Interpreting surgical trials with subjective outcomes: avoiding UnSPORTsmanlike conduct. JAMA. 2006;296(20):2483-2485. PMID: 17119146
Blinding and RCTs
A recent article by Boutron I, Estellat C, Guittet L, Dechartres A, Sackett
DL, et al. (Methods of blinding in reports of randomized controlled
trials assessing pharmacologic treatments: a systematic review.
PLoS Med 2006;3(10):e425. DOI: 10.1371/journal.pmed.0030425) provides
a great deal of useful information about blinding in research studies
and proposes a way of classifying it. The authors evaluated blinding in
RCTs of pharmacologic treatments published in 2004 in high-impact-factor
journals. The following are some key points from the article:
• The authors identified 819 reports with
about 60% describing the method of blinding. The classification
identified three main methods of blinding:
(1) methods to provide identical treatments in both arms,
(2) methods to avoid unblinding during the trial, and
(3) methods of blinded outcome assessment.
• ESTABLISHING BLINDING OF PATIENTS AND PROVIDERS: Of the 819 reports, 472 [58%]
described the method of blinding, 236 [29%] gave no detail, and
111 [13%] gave some data on blinding (e.g., reporting that treatments
were similar or that double dummies were used, with no description of
the method). The methods of blinding identified varied in complexity.
The authors reported use of a centralized preparation of similar
capsules, tablets, or embedded treatments in hard gelatin capsules
(193/336 [57%]), similar syringes (37/336 [11%]), or similar bottles
(38/336 [11%]). Use of a double dummy procedure was described in
79 articles (23%). Other methods consisted of a sham intervention
performed by an unblinded health care provider who was not actively
involved in the care of patients and had no other contact with patients
or other caregivers and outcome assessors (17/336 [5%]). To mask
the specific taste of the active treatments, in ten articles researchers
used a specific flavor such as peppermint or sugar to coat treatments.
For treatments administered by care providers, authors reported
use of centrally prepared opaque coverings to adequately
conceal intravenous treatments with different appearances (14/336
[4%]).
• AVOIDING UNBLINDING OF PATIENTS AND PROVIDERS:
Only 28/819 [3%] reported methods to avoid unblinding. Methods
to blind dosage adaptation relied on use of a centralized adapted
dosage or provision of sham results of complementary investigations
for treatments necessitating dosage adaptation. Methods to avoid
unblinding because of side effects relied mainly on centralized
assessment of side effects, partial information to patients about
side effects, use of active placebo or systematic prevention of
adverse effects in both arms.
• BLINDING ASSESSORS: These methods depend
on the main outcomes and are particularly important when blinding
cannot be established and maintained by the methods described above.
A total of 112 articles [14%] described these methods, which relied
mainly on a centralized assessment of the main outcome. Blinding
of outcome assessors is presumably achieved if neither patients
nor those involved in the trial have any means to discover which
arm a patient is in, for example because the placebo and active
drugs are indistinguishable and allocation is via a central randomization
service. Of the 112 reports in which specific measures to blind the
outcome assessor were reported, 96 (86%) concerned trials in which
patients were reported as blinded or in which double blinding or
triple blinding was reported. These results suggest that, although
blinding was performed at an earlier stage, the investigators nevertheless
decided to use a specific method of blinding the outcome assessor.
• AUTHORS' COMMENTS AND CONCLUSIONS:
• Although blinding is essential to avoid bias, the reporting
of blinding is generally quite poor and reviews of trials that test
the success of blinding methods indicate that a high proportion
of trials are unblinded.
• The study results might be explained in
part by the insufficient coverage of blinding in the Consolidated
Standards for Reporting Trials (CONSORT) statements. For example,
three items of the CONSORT statements are dedicated to the description
of the randomization procedure, whereas only one item is dedicated
to the blinding issue. The CONSORT statements mainly focus on reporting
who is blinded and less on the reporting of details on the method
of blinding, and this information is essential to appraise the success
of blinding.
• Some evidence suggests that although participants
are reported as blinded, the success of blinding might be questionable.
For instance, in a study assessing zinc treatment for the common
cold, the blinding procedure failed because the taste and aftertaste
of zinc were distinctive. And yet, tools used to assess the quality
of trials included in meta-analyses and systematic reviews focus
on the reporting of the blinding status for each participant and
rarely provide information on the methods of blinding and the adequacy
of the blinding method.
• There is a need to strengthen the reporting
guidelines related to blinding issues, emphasizing adequate reporting
of the method of blinding.
Delfini
Commentary
Lack of blinding appears to be a major source of bias in RCTs. Just
as well-done randomization and concealment of allocation to the
study groups decrease the likelihood of selection bias, blinding
of subjects and everyone working with the subjects or study data
to the assigned intervention (double-blinding) decreases the likelihood
of performance bias. Performance bias occurs when patients in one
group experience care or exposures not experienced by patients in
the other group(s) and the differences in care affect the study
outcomes. Lack of blinding may affect outcomes in that:
- Unblinded
subjects may report outcomes differently from blinded subjects,
have different thresholds for leaving a study, and seek (and possibly
receive) additional care in different ways.
-
Unblinded clinicians may behave differently towards patients than
blinded clinicians.
-
Using unblinded assessors may result in systematic differences
in outcomes assessment (assessment bias).
A number of studies have shown that lack of blinding is associated
with inflated treatment effects.
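The mechanism of assessment bias can be sketched with a small simulation. In this hypothetical Python example (the function name `improvement_rates`, the normally distributed latent score, and the size of the assessor's "nudge" are all assumptions for illustration), the treatment has no effect at all; an assessor who knows the allocation and gives treated patients the benefit of the doubt on borderline cases is enough to produce an apparent benefit.

```python
import random


def improvement_rates(n, nudge, assessor_blinded, seed=1):
    """Return (rate_treated, rate_control) of patients judged 'improved'.

    Hypothetical model, for illustration only: there is NO true treatment
    effect. Every patient has a latent score drawn from N(0, 1), and the
    assessor records 'improved' when the score they perceive exceeds 0.
    An unblinded assessor perceives treated patients' scores as `nudge`
    units better than they really are.
    """
    rng = random.Random(seed)
    rates = {}
    for arm in ("treated", "control"):
        improved = 0
        for _ in range(n):
            score = rng.gauss(0, 1)  # same distribution in both arms
            if arm == "treated" and not assessor_blinded:
                score += nudge  # expectation shifts borderline judgments
            if score > 0:
                improved += 1
        rates[arm] = improved / n
    return rates["treated"], rates["control"]


for blinded in (False, True):
    treated, control = improvement_rates(20000, nudge=0.2, assessor_blinded=blinded)
    print(f"assessor blinded={blinded}: risk difference {treated - control:+.3f}")
```

With a nudge of 0.2 standard deviations, the unblinded assessment is expected to show roughly 58% versus 50% "improved", an 8-percentage-point difference from a treatment that does nothing, while the blinded assessment shows essentially no difference.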
In some cases blinding may not be possible. For
example, side effects or taste may result in unblinding. The important
point is that even if blinding is not possible, the investigators
do not get “extra” validity points for doing the best
they could (i.e., the study should not be “upgraded”).
More on the Problem with Drawing Cause-Effect Conclusions from Observational
Studies
Our last teaching
engagement was in Framingham, Massachusetts, which reminded us of the
value of observational studies in developing risk stratification
models. The Framingham Study began in 1948 as the first prospective
study of cardiovascular disease and is important because, through
observation, it has identified cardiovascular disease (CVD) risk
factors that are associated with morbidity and mortality.
But there is
good evidence that drawing cause and effect conclusions from observational
studies is unreliable. Cause and effect conclusions should be based
on randomized controlled trials (RCTs) where bias, confounding and
chance have been ruled out as possible explanations for the observed
association between the intervention and the outcome. Because there
are so many observational studies published each week and because
we keep seeing health professionals inappropriately basing treatment
decisions on them, it is worthwhile summarizing an excellent review
of the literature on this topic.
The study and
literature review can be found in the reference:
Deeks JJ, Dinnes
J, D'Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman
DG. Evaluating non-randomised intervention studies. Health Technology
Assessment 2003; Vol. 7: No. 27.
Some key points from this article:
-
Comparisons of results of randomized and non-randomized studies
across multiple interventions in multiple studies demonstrate
that, in the majority of cases, the results of observational studies are not
consistent with the results of RCTs
- This
study, using meta-epidemiological techniques, demonstrates that:
-
None of the study results can be adequately adjusted for bias
in observational studies using historic and concurrent controls
-
Logistic regression on average increases bias when applied
to observational studies
Conclusions
- Non-randomized
studies may still give seriously misleading results even when
the treated and control groups appear similar in key prognostic
factors
-
Standard methods of case-mix adjustment do not guarantee removal
of bias
-
Omission of important confounding factors can explain failure
of adjustment as a substitute for randomization
-
There is no known method for reliably adjusting for confounding
factors in observational studies
Delfini
Commentary
Extreme
caution is urged when considering results of observational studies
in interventions for screening, prevention and therapy. Cause and
effect conclusions should only be drawn from RCTs.
One reason for
this is that there may be major differences in the characteristics
(prognostic factors) of individuals who choose a therapy compared
to people who do not choose that therapy. A classic example is hormone
replacement therapy after myocardial infarction (MI) in women. Most
observational studies reported that roughly twice as many women
who did not choose to take hormone replacement therapy (HRT) had
a recurrent MI compared to women who chose to take HRT. This led
people to believe — incorrectly — that HRT caused this
benefit. Later, well-done RCTs were conducted and no such benefit
was found. Why? The most likely reason is that the observational
studies were highly prone to biases resulting from differences between
the groups which could not be eliminated even with statistical adjustments
in which researchers try to balance confounders between the groups,
such as adjusting for smoking.
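A small simulation shows how self-selection alone can produce the kind of result seen in the observational HRT studies. The Python sketch below uses made-up numbers, not data from those studies: half the women are assumed to be "health-conscious" (an unmeasured trait), health-conscious women are assumed to have a lower risk of recurrent MI, and HRT is assumed to have no effect at all. The same hypothetical cohort is analyzed once with women choosing HRT themselves and once with HRT assigned at random; the function name `recurrent_mi_risk_ratio` is illustrative.

```python
import random


def recurrent_mi_risk_ratio(n, randomized, seed=42):
    """Risk ratio (HRT users vs non-users) in a hypothetical cohort.

    Illustrative assumptions, not data from the HRT studies:
      * half the women are 'health-conscious' (an unmeasured trait);
      * health-conscious women have a 5% recurrent-MI risk, others 20%;
      * HRT itself has NO effect on risk;
      * when women choose for themselves, 80% of health-conscious women
        and 20% of the others take HRT; when randomized, a coin decides.
    """
    rng = random.Random(seed)
    events = {True: 0, False: 0}  # recurrent MIs, keyed by HRT use
    counts = {True: 0, False: 0}  # women, keyed by HRT use
    for _ in range(n):
        health_conscious = rng.random() < 0.5
        if randomized:
            takes_hrt = rng.random() < 0.5
        else:
            takes_hrt = rng.random() < (0.8 if health_conscious else 0.2)
        risk = 0.05 if health_conscious else 0.20  # HRT plays no role here
        counts[takes_hrt] += 1
        if rng.random() < risk:
            events[takes_hrt] += 1
    return (events[True] / counts[True]) / (events[False] / counts[False])


observational = recurrent_mi_risk_ratio(100_000, randomized=False)
randomized = recurrent_mi_risk_ratio(100_000, randomized=True)
print(f"observational risk ratio (HRT vs no HRT): {observational:.2f}")
print(f"randomized risk ratio (HRT vs no HRT):    {randomized:.2f}")
```

Under these assumptions the observational comparison suggests that HRT roughly halves the risk (an expected risk ratio of about 0.47), while the randomized comparison shows no difference, because randomization distributes health-conscious women evenly between the arms whether or not anyone has measured that trait.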
Another reason
is that in observational studies, investigators do not “control”
all elements of the study as they do in RCTs. The end result is
that in observational studies other aspects affecting the groups
are almost certain to be different in important ways which are likely
to explain or affect the study results.
Key
Point
Any difference between groups — except for what is being studied
(e.g., HRT use) —
is a bias.
In the case
of HRT after MI, selection bias was present in that women who chose
to take HRT were probably more likely to be “health-conscious,”
exercise, watch their diets, etc., making them different from the
women who did not take HRT. It is also likely that there were other
differences in how the two groups experienced their health care
because in observational studies there is no formal protocol and
so there will be differences in many ways that could affect observed
outcomes, such as other therapies used, how outcomes are assessed,
frequency of follow-up, and so on.
Even with statistical
adjustments for differences between the groups in known and potential
prognostic characteristics, bias cannot be reliably eliminated
because whatever is actually responsible for the outcome (i.e.,
the confounder) is what would have to be adjusted. This would entail
having advance knowledge of cause and effect (but that is why the
study is being conducted). Plus, statistical adjustment has its own limitations.
How could every single factor that made the HRT users different
be adjusted? Humans differ in a practically unlimited number of
characteristics and exposures.
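The limits of adjustment can be illustrated with a hypothetical cohort that has one measured and one unmeasured confounder. In the Python sketch below (all numbers and the names `simulate_cohort` and `standardized_risk_ratio` are illustrative assumptions), health-consciousness is unmeasured, smoking is measured, both affect recurrent MI risk, and HRT has no effect. Direct standardization by smoking moves the risk ratio toward 1 but leaves much of the bias in place; only stratifying on the unmeasured trait, which real investigators cannot do, recovers the true null result.

```python
import random
from collections import defaultdict


def simulate_cohort(n, seed=7):
    """Hypothetical cohort rows: (health_conscious, smoker, takes_hrt, had_mi).

    Illustrative assumptions only: health-consciousness is unmeasured,
    smoking is measured, and HRT has NO effect on recurrent MI.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        hc = rng.random() < 0.5
        smoker = rng.random() < (0.1 if hc else 0.5)
        hrt = rng.random() < (0.8 if hc else 0.2)
        risk = (0.05 if hc else 0.15) + (0.10 if smoker else 0.0)
        rows.append((hc, smoker, hrt, rng.random() < risk))
    return rows


def standardized_risk_ratio(rows, stratum):
    """Risk ratio for HRT users vs non-users, directly standardized to the
    whole cohort's distribution of stratum(row)."""
    cells = defaultdict(lambda: [0, 0])  # (stratum, hrt) -> [events, count]
    totals = defaultdict(int)            # stratum -> count
    for row in rows:
        s = stratum(row)
        cell = cells[(s, row[2])]
        cell[0] += int(row[3])
        cell[1] += 1
        totals[s] += 1
    n = len(rows)
    risk = {True: 0.0, False: 0.0}
    for s, count in totals.items():
        for hrt in (True, False):
            events, m = cells[(s, hrt)]
            risk[hrt] += (count / n) * (events / m)  # weight by stratum size
    return risk[True] / risk[False]


rows = simulate_cohort(100_000)
crude = standardized_risk_ratio(rows, lambda r: None)           # one stratum = no adjustment
smoking_adjusted = standardized_risk_ratio(rows, lambda r: r[1])
oracle = standardized_risk_ratio(rows, lambda r: (r[0], r[1]))  # needs the unmeasured trait
print(f"crude {crude:.2f}, smoking-adjusted {smoking_adjusted:.2f}, oracle {oracle:.2f}")
```

With these assumptions the expected values are roughly 0.51 (crude), 0.67 (adjusted for smoking), and 1.0 (adjusted for the unmeasured trait as well): adjusting for the confounder that was measured helps, but it cannot remove bias from the one that was not.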
Comparisons
of RCTs and observational studies of the same interventions have
repeatedly demonstrated that even with the most meticulous statistical
adjustments, bias cannot be reliably eliminated from observational
studies. The key message is that without randomization and assurance
that interventions and assessments are the same for both study and
comparison groups, one cannot reliably draw conclusions about cause
and effect relationships. Associations between interventions and
outcomes in observational studies are very likely to be due to bias
or confounding. Therefore, observational studies are useful only
for generating hypotheses when considering questions of preventive,
screening or therapeutic interventions.
Database
Studies
Some groups have tried to demonstrate improved health outcomes (e.g.,
death, stroke) through studies of their databases. It should
be remembered that this type of study is an observational study
and prone to bias and confounding for the reasons explained above,
plus it is highly prone to chance findings of statistical significance.
Therefore, database studies may be useful for suggesting areas for
further study, but they should not be thought of as valid studies
from which cause and effect relationships can be concluded.
Untrustable P-values & Abstracts
One of the first things we teach our EBM learners
is that although abstracts can be useful to get a sense of what
an article is about and can at times be used to exclude studies
from further review, abstracts cannot reliably be used to determine
if a study is valid.
Validity must be determined by examining the methods
of the study (assuming it is the right study type). A little-known
problem with abstracts is that the information provided in the abstract
cannot be documented in the body of the paper up to 68% of the time
in some of the top-tier medical journals [Pitkin RM, et al. Accuracy
of data in abstracts of published research articles. JAMA. 1999;281:1110-1111.
PMID: 10188662; reviewing JAMA, NEJM, The
Lancet, Annals of Internal Medicine, BMJ and the Canadian Medical Association
Journal]. In this DelfiniClick we report another problem with abstracts:
bias.
Peter C Gøtzsche, in a BMJ article (Believability
of relative risks and odds ratios in abstracts: cross sectional
study. BMJ 2006;333:231-234. PMID: 16854948), reviews previous publications
documenting biased reporting of results and of conclusions,
and he presents additional evidence of bias in reporting P values.
We do not have the expertise to evaluate all the
points made in his paper; however, we present his comments and findings
here for you to evaluate and draw your own conclusions. Although
we believe the assumptions upon which Gøtzsche
bases his conclusions can be challenged,
the following should be of interest to anyone interested in critical
appraisal of the medical literature.
Gøtzsche’s
Comments
-
Significant results in abstracts should generally be disbelieved
-
Ongoing research has shown that more than 200 statistical tests
are sometimes specified in trial protocols. If you compare a treatment
with itself—that is, the null hypothesis of no difference
is known to be true—the chance that one or more of 200 tests
will be statistically significant at the 5% level is 99.996% if
we assume the tests are independent
- Thus,
the investigators or sponsor can be fairly confident that “something
interesting will turn up.”
-
Due allowance for multiple testing is rarely made, and it is generally
not possible to discern reliably between primary and secondary
outcomes
-
Recent studies that compared protocols with trial reports have
shown selective publication of outcomes, depending on the obtained
P values, and that at least one primary outcome was changed, introduced,
or omitted in 62% of the trials.
-
The scope for bias is also large in observational studies. Many
studies are underpowered and do not give any power calculations.
-
Furthermore, a survey found that 92% of articles adjusted for
confounders and reported a median of seven confounders but most
did not specify whether they were pre-declared.
-
Fourteen per cent of these articles reported more than 100 effect
estimates, and subgroup analyses appeared in 57% of studies and
were generally believed.
-
The preponderance of significant results could be reduced if the
following actions were taken.
- First,
if we need a conventional significance level at all, which
is doubtful, it should be set at P < 0.001
-
Second, analysis of data and writing of manuscripts should
be done blind, hiding the nature of the interventions, exposures,
or disease status, as applicable, until all authors have approved
the two versions of the text
-
Third, journal editors should scrutinize abstracts more closely
and demand that research protocols and raw data—both for
randomized trials and for observational studies—be submitted
with the manuscript.
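The 99.996% figure in Gøtzsche's comments follows from simple arithmetic: if each of 200 independent tests has a 5% chance of a false-positive result, the chance that none is significant is 0.95^200, so the chance of at least one "significant" result is 1 - 0.95^200. The Python sketch below (function names are illustrative, and independence of the tests is an assumption, as in Gøtzsche's calculation) reproduces that number, checks it by simulation, and applies the same arithmetic to his suggested P < 0.001 threshold.

```python
import random


def prob_any_significant(k, alpha):
    """Chance that at least one of k independent tests is 'significant'
    at level alpha when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k


def simulated_rate(k, alpha, trials=2000, seed=3):
    """Monte Carlo check, assuming that under a true null hypothesis
    P values are uniformly distributed on [0, 1], so each test is
    'significant' with probability alpha."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < alpha for _ in range(k)) for _ in range(trials))
    return hits / trials


print(f"{prob_any_significant(200, 0.05):.3%}")   # prints 99.996%
print(f"{prob_any_significant(200, 0.001):.1%}")  # prints 18.1%
print(f"simulated, alpha=0.05:  {simulated_rate(200, 0.05):.3f}")
print(f"simulated, alpha=0.001: {simulated_rate(200, 0.001):.3f}")
```

By the same arithmetic, even at P < 0.001, 200 independent tests of true null hypotheses would yield at least one "significant" result about 18% of the time, so a stricter threshold reduces, but does not remove, the need to know how many tests were run.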
In
short, yet another reminder to read the methods section of papers
and not rely on results or conclusions presented in abstracts.
Gøtzsche’s Findings in Brief
-
The first result in the abstract was statistically significant
in 70% of the trials, 84% of cohort studies and 84% of case-control
studies. Although many of these results were derived from subgroup
or secondary analyses, or biased selection of results, they were
presented without reservations in 98% of the trials