| Quality
of Evidence:
Primary Studies & General Concepts
newest
04/24/08:
Early
Discontinuation of Clinical Trials: Oncology Medication Studies—Recent
Developments and Concern »
Contents
- Quality of Studies: Lower Quality = Greater Effect Size
- More: Overestimation of Effect Size in Studies of Low Quality
- Concealment of Allocation
- Blinding and RCTs
- Blinding in Surgery Trials
- The Importance of Blinded Assessors in RCTs
- Attrition Bias: Intention-to-Treat Basics
- Intention-to-Treat Analysis & Censoring: Rofecoxib Example
- Intention-to-Treat Analysis: Misreporting and Migraine
- Missing Data Points: Difference or No Difference
- Quality of Studies: VIGOR
- Confidence Intervals, Power & Meaningful Clinical Benefit
- Getting “Had” by P-values: Confidence Intervals vs P-values in Evaluating Safety Results: Low-molecular-weight Heparin (LMWH) Example
- Understanding Number Needed to Treat (NNT)
- newest
Early Discontinuation of Clinical Trials: Oncology Medication
Studies—Recent Developments and Concern »
|
| Quality
of Studies: Lower Quality = Greater Effect Size
The
quality of studies in systematic reviews and meta-analyses has repeatedly
been shown to affect the amount of benefit reported. This DelfiniClick
is a quick reminder that just because a study is an RCT does not
mean it will provide you with a reliable estimate of effect size.
A nice illustration of this point is provided in a classic article
by Moher D et al. (Does quality of reports of randomised trials
affect estimates of intervention efficacy reported in meta-analyses?
Lancet 1998; 352: 609–13).
In
this study, the authors randomly selected 11 meta-analyses that
involved 127 RCTs on the efficacy of interventions used for circulatory
and digestive diseases, mental health, pregnancy and childbirth.
The authors evaluated each RCT by examining the description of randomization,
allocation concealment, blinding, drop outs and withdrawals.
The
results are in line with other authors’ findings regarding
quality of methods and amount of benefit (effect size) reported
as relative measures below:
-
The quality of trials was low overall.
- Low-quality
trials compared with high quality trials (score >2) were associated
with an increased estimate of benefit of 34%.
-
Trials that used inadequate allocation concealment, compared with
those that used adequate methods, were also associated with an
increased estimate of benefit (37%).
-
The average treatment benefit was 39% for all trials, 52% for
low-quality trials, and 29% for high-quality trials.
The
authors conclude that studies of low methodological quality in which
the estimate of quality is incorporated into the meta-analyses can
alter the interpretation of the benefit of the intervention.
We
continue to see this problem in systematic reviews and clinical
guidelines and suggest that when evaluating secondary studies readers
pay close attention to the quality of included studies.
|
| Overestimation
of Effect Size in Studies of Low Quality
In
a previous DelfiniClick, we summarized an article by Moher and colleagues
(1) in which the authors randomly selected 11 meta-analyses involving
127 RCTs which evaluated the efficacy of interventions used for
circulatory and digestive diseases, mental health, pregnancy and
childbirth. Moher and colleagues concluded that:
-
Low-quality trials compared with high quality trials (score >2),
were associated with a relative increased estimate of benefit
(34%).
- Trials
that used inadequate allocation concealment, compared with those
that used adequate methods, were associated with a relative increased
estimate of benefit (37%).
Below
we summarize another study that confirms and expands Moher’s
findings. In a study similar to Moher’s, Kjaergard and colleagues
(2) evaluated the effects of methodologic quality on estimated intervention
effects in randomized trials.
The
study evaluated 23 large and 167 small randomized trials and a total
of 136,164 participants. Methodologic quality was defined as the
confidence that the trial’s design, conduct, analysis, and
presentation minimized or avoided biases in the trial’s intervention
comparisons (3). The reported methodologic quality was assessed
using four separate components and a composite quality scale.
The
quality score was ranked as low (≤2 points) or high (≥3
points), as suggested by Moher et al. (1). The four components were
1) generation of allocation sequence; 2) concealment of allocation;
3) double-blinding; and 4) reporting of loss to follow-up.
RESULTS
OF KJAERGARD ET AL’S REVIEW (all reported exaggerations
are relative increases):
Generation
of Allocation Sequence
The odds ratios generated by all trials (large and small) with inadequate
generation of the allocation sequence were on average significantly
exaggerated by 51% compared with all trials reporting adequate generation
of allocation sequence (ratio of odds ratios (95% CI) = 0.49 (0.30–0.81),
P <0.001).
Concealment
of Allocation
All trials with inadequate allocation concealment exaggerated intervention
benefits by 40% compared with all trials reporting adequate allocation
concealment, although this difference was not statistically significant
(ratio of odds ratios (95% CI) = 0.60 (0.31–1.15),
P = 0.12). Odds ratios were significantly exaggerated by 52% in small
trials with inadequate versus adequate allocation concealment (ratio
of odds ratios (95% CI) 0.48 (0.25–0.92), P = 0.027).
Double
Blinding
The odds ratios generated by all trials without double blinding
were significantly exaggerated by 44% compared with all double-blind
trials (ratio of odds ratios (95% CI) = 0.56 (0.33–0.98),
P = 0.041).
Reporting
of Loss-to-Followup
The analyses showed no significant association between reported
follow-up and estimated intervention effects (ratio of odds ratios
(95% CI) = 1.50 (0.80–2.78), P = 0.2).
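The percentages quoted in these results follow from the ratios of odds ratios (ROR). An ROR below 1 means that trials with the inadequate feature reported more favorable odds ratios, and the relative exaggeration is (1 − ROR) × 100. The following is a minimal Python sketch of that arithmetic, using the point estimates and confidence intervals reported above. The function and variable names are illustrative only and do not come from Kjaergard et al.

```python
def exaggeration_pct(ror):
    """Relative exaggeration of benefit (%) implied by a ratio of odds ratios."""
    return round((1 - ror) * 100)


def ci_excludes_one(lower, upper):
    """True when the 95% CI excludes 1 (significant at the 5% level)."""
    return not (lower <= 1 <= upper)


# (ROR, CI lower bound, CI upper bound) as reported in the summary above
results = {
    "allocation sequence, all trials": (0.49, 0.30, 0.81),
    "allocation concealment, all trials": (0.60, 0.31, 1.15),
    "allocation concealment, small trials": (0.48, 0.25, 0.92),
    "double blinding, all trials": (0.56, 0.33, 0.98),
    "loss to follow-up, all trials": (1.50, 0.80, 2.78),
}

for name, (ror, lower, upper) in results.items():
    print(f"{name}: {exaggeration_pct(ror)}% exaggeration, "
          f"significant={ci_excludes_one(lower, upper)}")
```

A negative value (as for loss to follow-up) means the point estimate ran in the opposite direction. In that case, and for allocation concealment across all trials, the confidence interval includes 1, which matches the nonsignificant P values reported.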
Kjaergard
and Colleagues’ Conclusions
- Adequate
generation of the allocation sequence and adequate allocation
concealment should be required for adequate randomization.
Unlike
previous investigators (1,3,4, 5), the authors found that trials
with inadequate generation of allocation sequence exaggerate intervention
effects significantly.
- Trials
with inadequate allocation concealment also generate exaggerated
results.
This
is in accordance with previous evidence (1,3,5). The authors found
that despite the considerable overlap between generation of allocation
sequence and allocation concealment, both factors may independently
affect the estimated intervention effect.
- Trials
without double blinding exaggerate results.
This
study supports Schulz and colleagues’ finding of a significant
association between intervention effects and double blinding and
extends the evidence by including trials from several therapeutic
areas.
- There
was no association between reported follow-up and intervention
effect.
Delfini
Comment
It is useful to know quantitatively how various threats
to validity affect results when doing critical appraisal of a study.
The study by Kjaergard and colleagues summarized above expands the
findings of Schulz, Moher, Juni and others.
Previous
studies have questioned the reliability of reported losses to follow-up
(5, 6). In accordance with Schulz and colleagues’ results
(5), the authors found no association between intervention effects
and reported follow-up.
Delfini Note: We have found that losses to follow-up may significantly
affect P values when sensitivity analysis is done. We consider loss
of ≥5% with differential loss or ≥10% without differential
loss to be an important threat to validity.
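As a rough illustration, the thresholds in the note above can be written as a simple check. This is only a sketch under stated assumptions: it treats the percentages as overall loss among all randomized subjects, and it leaves the judgment of whether loss is "differential" to the reader, because the note gives no numeric cutoff for that. The function name is hypothetical.

```python
def attrition_is_important_threat(randomized, lost, differential):
    """Apply the loss-to-follow-up thresholds stated in the note above.

    randomized   -- number of subjects randomized
    lost         -- number of subjects lost to follow-up
    differential -- True if the reader judges loss to differ between groups
                    (assumption: no numeric cutoff is given in the text)
    """
    if randomized <= 0:
        raise ValueError("randomized must be positive")
    loss_fraction = lost / randomized
    # 5% threshold with differential loss, 10% without
    threshold = 0.05 if differential else 0.10
    return loss_fraction >= threshold


# Hypothetical trial: 200 randomized, 12 lost (6%)
print(attrition_is_important_threat(200, 12, differential=True))
print(attrition_is_important_threat(200, 12, differential=False))
```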
In
agreement with the findings of Moher and associates (1,3) and Juni
and colleagues (7), the authors found that trials with a low quality
score on the scale developed by Jadad and colleagues (8) significantly
exaggerate intervention benefits.
Kjaergard
and colleagues conclude that assessment of methodologic quality
should focus on generation of allocation sequence, allocation concealment,
and double blinding. Delfini feels this is not sufficient –
but appreciates this study as one that further demonstrates the
importance of effective approaches to some of these methodologic
areas.
References
1. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al.
Does quality of reports of randomised trials affect estimates of
intervention efficacy reported in meta-analyses? Lancet. 1998;352:609-13.
[PMID: 9746022]
2.
Kjaergard LL, Villumsen J, Gluud C. Reported Methodologic Quality
and Discrepancies between Large and Small Randomized Trials in Meta-Analyses.
Ann Intern Med. 2001;135:982-989.
3.
Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al.
Assessing the quality of reports of randomised trials: implications
for the conduct of meta-analyses. Health Technol Assess. 1999;3:i-iv,
1-98. [PMID: 10374081]
4.
Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An
empirical study of the possible relation of treatment differences
to quality scores in controlled randomized clinical trials. Control
Clin Trials. 1990;11:339-52. [PMID: 1963128]
5.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of
bias. Dimensions of methodological quality associated with estimates
of treatment effects in controlled trials. JAMA. 1995;273:408-12.
[PMID: 7823387]
6.
Gøtzsche PC. Methodology and overt and hidden bias in reports
of 196 double-blind trials of nonsteroidal antiinflammatory drugs
in rheumatoid arthritis. Control Clin Trials. 1989;10:31-56. [PMID:
2702836]
7.
Juni P, Witschi A, Bloch R, Egger M. The hazards of scoring the
quality of clinical trials for meta-analysis. JAMA. 1999;282:1054-60.
[PMID: 10493204]
8.
Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan
DJ, et al. Assessing the quality of reports of randomized clinical
trials: is blinding necessary? Control Clin Trials. 1996;17:1-12.
[PMID: 8721797] |
| Concealment
of Allocation
In
1996, the CONSORT statement encouraged the reporting of concealment
of allocation. Concealment of allocation is the process of actually
assigning each patient to a study group without breaking
blinding. Hewitt et al., in a recent issue of BMJ, reviewed the prevalence
of adequate concealment of allocation in 4 journals—BMJ, Lancet,
JAMA and NEJM (Hewitt C et al. BMJ 2005;330:1057-1058. PMID: 15760970).
They scored the allocation as adequate (i.e., subject recruiter
was different person from the person executing the allocation sequence),
inadequate or unclear. Sealed envelopes were considered inadequate
unless performed by an independent third party.
Results
Studies included: 234
Adequate concealment: 132 (56%)
Inadequate concealment: 41 (18%)
Unclear concealment: 61 (26%)
Delfini
Commentary
The authors point out that previous studies have found an association
between inadequate concealment and the reporting of significant
results. Of interest is that studies included in this review with
inadequate concealment tended to report significant results more often
(OR 1.8; 95% CI, 0.8 to 3.7), although the confidence interval includes 1.
This
is another study suggesting that the critical appraisal of RCTs
is “critical” and that lower quality studies are more
likely to report significant benefit than are higher quality studies.
|
| Blinding
and RCTs
A recent
article, Boutron I, Estellat C, Guittet L, Dechartres A, Sackett
DL, et al. (2006) Methods of blinding in reports of randomized controlled
trials assessing pharmacologic treatments: A systematic review.
PLoS Med 3(10): e425. DOI: 10.1371/journal.pmed.0030425, provides
a great deal of useful information about and a way of classifying
blinding in research studies. The authors evaluated blinding in
RCTs of pharmacologic treatment published in 2004 in high impact-factor
journals. The following are some key points from the article:
•
The authors identified 819 reports with about 60% describing the
method of blinding. The classification identified three main methods
of blinding:
(1) methods to provide identical treatments in both arms,
(2) methods to avoid unblinding during the trial, and
(3) methods of blinded outcome assessment.
• ESTABLISHING BLINDING OF PATIENTS AND PROVIDERS: 472 [58%]
described the method of blinding, but 236 [29%] gave no detail and
111 [13%] gave only some data on blinding (i.e., reporting that treatments
were similar or the use of double dummies with no description of
the method). The methods of blinding identified varied in complexity.
The authors reported use of a centralized preparation of similar
capsules, tablets, or embedded treatments in hard gelatin capsules
(193/336 [57%]), similar syringes (37/336 [11%]), or similar bottles
(38/336 [11%]). Use of a double dummy procedure was described in
79 articles (23%). Other methods consisted of a sham intervention
performed by an unblinded health care provider who was not actively
involved in the care of patients and had no other contact with patients
or other caregivers and outcome assessors (17/336 [5%]). To mask
the specific taste of the active treatments, in ten articles researchers
used a specific flavor such as peppermint or sugar to coat treatments.
For treatments administered by care providers, authors reported
use of a centralized preparation of opaque coverage to adequately
conceal intravenous treatments with different appearances (14/336
[4%]).
•
AVOIDING UNBLINDING OF PATIENTS AND PROVIDERS: Only 28/819 [3%]
reported methods to avoid unblinding. Methods to blind dosage adaptation
relied on use of a centralized adapted dosage or provision of sham
results of complementary investigations for treatments necessitating
dosage adaptation. Methods to avoid unblinding because of side effects
relied mainly on centralized assessment of side effects, partial
information to patients about side effects, use of active placebo
or systematic prevention of adverse effects in both arms.
•
BLINDING ASSESSORS: These methods depend on the main outcomes and
are particularly important when blinding cannot be established and
maintained by the methods described above. A total of 112 articles
[14%] described these methods, which relied mainly on a centralized
assessment of the main outcome. Blinding of outcome assessors is
presumably achieved if neither patients nor those involved in the
trial have any means to discover which arm a patient is in, for
example because the placebo and active drugs are indistinguishable
and allocation is via a central randomization service. 96 reports
(86%) of the 112 reports in which specific measures to blind the
outcome assessor were reported concern trials in which patients
were reported as blinded or in which double blinding or triple blinding
was reported. These results suggest that, although blinding was
performed at an earlier stage, the investigators nevertheless decided
to perform a specific method of blinding the outcome assessor.
•
AUTHORS' COMMENTS AND CONCLUSIONS:
• Although blinding is essential to avoid bias, the reporting
of blinding is generally quite poor and reviews of trials that test
the success of blinding methods indicate that a high proportion
of trials are unblinded.
•
The study results might be explained in part by the insufficient
coverage of blinding in the Consolidated Standards for Reporting
Trials (CONSORT) statements. For example, three items of the CONSORT
statements are dedicated to the description of the randomization
procedure, whereas only one item is dedicated to the blinding issue.
The CONSORT statements mainly focus on reporting who is blinded
and less on the reporting of details on the method of blinding,
and this information is essential to appraise the success of blinding.
•
Some evidence suggests that although participants are reported as
blinded, the success of blinding might be questionable. For instance,
in a study assessing zinc treatment for the common cold, the blinding
procedure failed, because the taste and aftertaste of zinc were distinctive.
And yet, tools used to assess the quality of trials included in
meta-analyses and systematic reviews focus on the reporting of the
blinding status for each participant and rarely provide information
on the methods of blinding and the adequacy of the blinding method.
•
There is a need to strengthen the reporting guidelines related to
blinding issues, emphasizing adequate reporting of the method of
blinding.
Delfini
Commentary
Lack of blinding appears to be a major source of bias in RCTs. Just
as well-done randomization and concealment of allocation to the
study groups decreases the likelihood of selection bias, blinding
of subjects and everyone working with the subjects or study data
to the assigned intervention (double-blinding) decreases the likelihood
of performance bias. Performance bias occurs when patients in one
group experience care or exposures not experienced by patients in
the other group(s) and the differences in care affect the study
outcomes. Lack of blinding may affect outcomes in that:
- Unblinded
subjects may report outcomes differently from blinded subjects,
have different thresholds for leaving a study, and seek (and possibly
receive) additional care in different ways.
-
Unblinded clinicians may behave differently towards patients than
blinded clinicians.
-
Using unblinded assessors may result in systematic differences
in outcomes assessment (assessment bias).
A
number of studies have shown that lack of blinding is associated
with inflated treatment effects.
In
some cases blinding may not be possible. For example, side effects
or taste may result in unblinding. The important point is that even
if blinding is not possible, the investigators do not get “extra”
validity points for doing the best they could (i.e., the study should
not be “upgraded”). |
|
Blinding
In Surgical Trials — It is Through Blinding We Become Able To
See
Blinding is an important consideration when evaluating a study. Without blinding,
the likelihood of bias increases. Bias occurs when patients in one
group experience care or exposures not experienced by patients in
the other group(s), and the differences in care affect the study
outcomes. Lack of blinding may be a major source of this type of
bias in that unblinded clinicians who are frequently “rooting
for the intervention” may behave differently than blinded
clinicians towards patients whom they know to be receiving the study
drug or intervention being studied. The result is likely to be that
in unblinded studies, patients may receive different or additional
care. Unblinded subjects may be more likely to drop out of a study
or seek care in ways that differ from blinded subjects. Unblinded
assessors may also be “rooting for the intervention”
and assess outcomes differently from blinded assessors.
How
much difference does blinding make? Jüni et al. reviewed four
studies that compared double blinded versus non-blinded RCTs and
attempted to quantify the amount of distortion (bias) caused by
lack of double blinding [1]. Overall, the overestimation of effect
was about 14%. The largest study reviewed by Jüni et al. assessed the methodological
quality of 229 controlled trials from 33 meta-analyses and then
analyzed, using multiple logistic regression models, the associations
between those assessments and estimated treatment effects [2]. Trials
that were not double-blind yielded on average 17% greater effect,
95% CI (4% to 29%), than blinded studies (P = .01).
Lack
of double blinding is frequently found in surgical trials and
results in uncertain evidence because of the problems stated above.
A case study helps to illustrate this. A recent multicenter RCT,
the Spine Patient Outcomes Research Trial (SPORT) [3], was a non-blinded
trial that shows the
issues that arise when a surgical intervention is compared to
a non-surgical intervention and blinding is not attempted. The
trial included patients with persistent (at least 6 weeks) disk-related
pain and neurologic symptoms (sciatica) who were randomized to
undergo diskectomy or receive usual care (not standardized but
frequently including patient education, anti-inflammatory medication,
and physical therapy, alone or in combination). There were a number
of problems with this study including lack of power, poor control
of non-study interventions, a high proportion of patients who
crossed over between treatment strategies (43% of those randomized to surgery
had not undergone surgery by 2 years, and 42% of those randomized to conservative
care did receive surgery), and lack of blinding. The degree of
missing data was 24%-27% without a true intention-to-treat analysis.
Of great interest was an editorial that dealt with the problem
of non-blinding in surgical studies. The editorialist, Flum, makes
the following points [4]:
- While
the technique of sham intervention is well accepted in studies
of medications using inactive pills (placebos), simulated acupuncture,
and nontherapeutic conversation in place of therapeutic psychiatric
interventions, it has only occasionally been applied to surgical
trials. This is unfortunate because the use of sham controls
has been critical in understanding just how much patient expectation
influences outcomes after an operation.
-
A sham-controlled trial would be particularly relevant for spine
surgery since the most commonly occurring and relevant outcomes
are subjective.
- Patients
choosing surgical options may have high expectations. These may
include a higher level of emotional “investment”
in surgical care compared with usual care based on the level
of commitment resulting from a decision to have an operation
and get through recovery. After the patient has accepted the
risks of surgical intervention, the desire for improvement may
drive perceptions about improvement.
-
Patients who opt for surgery may also differ from patients who
decline surgery in their beliefs regarding the benefits of invasive
interventions.
-
The surgeon’s expectations and direction are likely to
play an important role in patient improvement.
- Given
the proliferation of operative procedures for the treatment
of subjective complaints like back pain, the need for sham controlled
trials has never been greater.
Flum
goes on to present multiple examples of the power of suggestion
and the problem of doing non-blinded trials in the field of surgery.
Observational studies have often reported procedural success, but
sham-controlled trials for the same conditions demonstrate how much
of that success is due to the placebo effect.
-
Example 1 — Ligation of Internal Mammary:
After multiple observational studies suggesting that ligation
of the internal mammary artery was helpful in patients with coronary
disease, Cobb et al randomized patients to operative arterial
ligation or a sham procedure. Both groups improved after the intervention,
but there were similar, if not greater, improvements in subjective
measures such as exercise tolerance and nitroglycerin use in the
sham surgical group.
- Example
2 — Osteoarthritic Knee Surgery —
and 3 — Osteoarthritic Knee Joint Irrigation:
After multiple case series reported that patients with osteoarthritis
of the knee improve after arthroscopic surgery, Moseley et al
demonstrated just how much of that effect is related to the hopes,
expectations, and beliefs of the patient. The investigators randomized
180 patients to undergo arthroscopy with debridement, arthroscopy
with lavage, or sham arthroscopy. The power of expectation was
strong and patients were unable to determine if they had been
assigned to the treatment or sham groups— and all groups
improved. At 2 years after randomization, all patients reported
comparable pain scores and functional scores. Another sham-controlled
study in patients with knee osteoarthritis demonstrated that patients
benefit equally from irrigation of the joint and from sham irrigation.
- Example
4 — Parkinson’s Disease: Researchers
found similar improvements in quality of life after direct brain
injections of embryonic neurons or placebo in patients with advanced
Parkinson’s disease.
- Example
5 — Transmyocardial Laser Revascularization in HF:
Heart failure patients undergoing transmyocardial laser revascularization
or sham procedures had equal improvements in subjective outcomes.
- Example
6 — Hernia: After hernia repair, there
was equal improvement in pain control after cryoablation of nerves
or sham interventions.
- Examples
7-9 — Laparoscopic Interventions: Multiple
case series have reported benefit on subjective outcomes such
as pain control, function, and readiness for discharge with laparoscopic
cholecystectomy, colon resection, and appendectomy compared with
conventional approaches. Bias arises when the clinical care team
influences patient and discharge expectations through coaching,
communication, and management. Randomized trials of these three
procedures that included blinding of both the patients and the
discharging clinicians to the treatment that patients received
by placing large, side-to-side abdominal wall dressings demonstrate
little or no difference in patients reaching discharge criteria.
A reasonable conclusion is that when the clinician’s expectations
and “coaching” were removed by placing a large bandage
on the abdominal wall, the subjective benefits disappeared. Flum
concludes that studies not addressing both patient and clinician
expectation on subjective outcomes do not inform the clinical
community about the true role of the intervention.
Delfini
Commentary
Blinding of subjects and everyone working with the subjects
or study data to the assigned intervention (double-blinding) decreases
the likelihood of bias. Bias may be more likely to occur when evaluating
subjective outcomes such as pain, satisfaction, and function in
non-blinded studies, but it has also been reported with objective
outcomes such as mortality. When dealing with subjective outcomes,
as Flum points out, it is critical to distinguish the effect of
the intervention from the effect of the patient’s expectation
of the intervention. The only way to distinguish the effect of a
patient’s positive expectations of an operation from the intervention
itself is to blind patients to the treatment they receive and randomize
them to receive the intervention of interest or to receive a sham
intervention (placebo). Yet we frequently hear, “But blinding
is not possible in surgical studies.” Frequently the argument
is raised that subjecting people to anesthesia and sham surgery
is not ethical. However, conducting clinical trials employing methods
that result in avoidable fatal flaws is also problematic. Flum’s
position is that when the risk of a placebo does not exceed a threshold
of acceptable research risk and if the knowledge to be gained is
substantial, a sham-controlled trial is needed and is ethical. He
reasons that ethical justification of placebo-controlled trials
is based on the following considerations:
- Invasive
procedures are associated with risks.
-
There are great harms created by conducting studies that are of
uncertain validity.
- Establishing
community standards based on uncertain evidence is more likely
to result in more harm than good.
-
Sham-controlled trials are justified when uncertainty exists among
clinicians and patients about the merits of an intervention.
The
SPORT trial draws attention to the problem of non-blinding in surgical
trials. This was a very expensive, labor-intensive study that provides
no useful efficacy data. Research subjects were undoubtedly told
this study would provide answers regarding the relative efficacy
of surgery vs conservative care for lumbar spine disease. The authors
of the SPORT trial state that a sham-controlled trial was impractical
and unethical, possibly — according to Flum — because
the risk of the sham would include general anesthesia (to truly
blind the patients). He would argue that in this case blinding which
would require anesthesia is the only way that valid, useful evidence
could have been created. Even though we graded the study U (uncertain
validity and usefulness) and would not use the results to inform
decisions about efficacy or effectiveness because of the threats
to validity, the study does report information regarding risks of
surgery that may be of great value to patients.
-----------
1
Jüni P, Altman DG and Egger M. Systematic reviews in health
care: Assessing the quality of controlled clinical trials. BMJ.
2001;323:42-46. PMID: 11440947
2
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of
bias. Dimensions of methodological quality associated with estimates
of treatment effects in controlled trials. JAMA. 1995;273:408-12.
PMID: 7823387.
3 Weinstein
JN, Tosteson TD, Lurie JD, et al. Surgical vs nonoperative treatment
for lumbar disk herniation: the Spine Patient Outcomes Research
Trial (SPORT): a randomized trial. JAMA. 2006;296:2441-2450. PMID:
17119141
4 Flum
DR. Interpreting Surgical Trials With Subjective Outcomes: Avoiding
UnSPORTsmanlike Conduct. JAMA. 2006;296(20):2483-2484.
PMID: 17119146
|
| The
Importance of Blinded Assessors in RCTs
We
have previously summarized the problems associated with lack of
blinding in surgical (and other) studies — see Blinding
in Surgery Trials in a previous DelfiniClick™.
The major problem with unblinded studies is that the outcomes in
the intervention group are likely to be falsely inflated because
of the biases introduced by lack of blinding.
Recently
a group of orthopedists identified and reviewed thirty-two randomized,
controlled trials published in The Journal of Bone and Joint Surgery
between 2003 and 2004 to evaluate the effect of blinded assessment
vs non-blinded assessment on reported outcomes [1].
Results
-
Sixteen of the thirty-two randomized controlled trials did not
report blinding of outcome assessors when blinding would have
been possible.
- Among
the studies with continuous outcome measures, unblinded outcomes
assessment was associated with significantly larger treatment
effects than blinded outcomes assessment (standardized mean difference,
0.76 compared with 0.25; p = 0.01).
- In
the studies with dichotomous outcomes, unblinded outcomes assessments
were associated with significantly greater treatment effects than
blinded outcomes assessments (odds ratio, 0.13 compared with 0.42;
p < 0.001).
-
This translates into a relative risk reduction of 38% for blinded
outcome assessments compared with 71% for unblinded outcome assessments
(a difference of 33 percentage points).
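Readers sometimes ask why an odds ratio of 0.42 does not correspond to a 58% risk reduction. The reason is that an odds ratio only approximates a relative risk when events are rare. A standard conversion is RR = OR / (1 − p0 + p0 × OR), where p0 is the event risk in the control group. The sketch below illustrates this relationship. The control risk of 0.5 is purely an assumption for illustration. The summary above does not state the control-group risk or how the 38% and 71% figures were derived, so this example is not expected to reproduce them.

```python
def odds_ratio_to_relative_risk(odds_ratio, control_risk):
    """Convert an odds ratio to a relative risk for a given control-group risk."""
    if not 0 < control_risk < 1:
        raise ValueError("control_risk must be between 0 and 1")
    return odds_ratio / (1 - control_risk + control_risk * odds_ratio)


def relative_risk_reduction_pct(odds_ratio, control_risk):
    """Relative risk reduction (%) implied by an odds ratio and control risk."""
    return 100 * (1 - odds_ratio_to_relative_risk(odds_ratio, control_risk))


ASSUMED_CONTROL_RISK = 0.5  # hypothetical value, not taken from the study

for label, odds_ratio in [("blinded", 0.42), ("unblinded", 0.13)]:
    rrr = relative_risk_reduction_pct(odds_ratio, ASSUMED_CONTROL_RISK)
    print(f"{label}: OR {odds_ratio} -> RRR {rrr:.1f}% "
          f"(naive 1 - OR would give {100 * (1 - odds_ratio):.0f}%)")
```

Whatever control risk is assumed, the relative risk reduction is smaller than 1 − OR when the odds ratio is below 1, which is the direction seen in the reported figures.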
Conclusion
Unblinded outcomes assessment dramatically inflates the reported
benefit or effectiveness of treatments.
Delfini
Commentary
This is yet another study pointing out the importance of blinding.
Based on this and other similar studies it is our conclusion that
studies or the results of studies without blinded assessors are
grade U or at best grade B-U (see evidence-grading scale here).
1.
Poolman RW, Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar
F, Bhandari M. Reporting of outcomes in orthopaedic randomized trials:
does blinding of outcome assessors matter? J Bone Joint Surg Am.
2007 Mar;89(3):550-8.
PMID: 17332104.
|
| Attrition
Bias: Intention-to-Treat Basics
In
general, we approach critical appraisal of RCTs by evaluating the
four major components of a trial— study population (including
how established), the intervention, the follow-up and the assessment.
There is very little controversy about the process of randomizing
in order to distribute known and unknown confounders as equally
as possible between the groups. There also appears to be general
understanding that the only difference between the two groups should
be what is being studied. However, what seems to receive much less
attention is the considerable potential for bias that occurs when
data is missing from subjects because they do not complete a study
or are lost to follow-up, and investigators use models to deal with
that missing data. The only way to prevent this bias is to have
data on all randomized subjects. This is frequently not possible.
And bias creeps in.
Intent-to-treat
designs that provide primary outcome data on all randomized patients
are the ideal. All patients randomized are included in the analysis
— and patients are analyzed in the same groups to which they
were randomized. Unfortunately we are rarely provided with all of
this information, and we must struggle to impute the missing data ourselves; that is,
we must do our own sensitivity analysis and recalculate p-values
based on various assumptions (e.g., a worst-case scenario in which all missing
subjects fail), when that is possible. All too often, papers
do not report sufficient data to perform these calculations, or
the variables do not lend themselves to this type of analysis because
they cannot be made binomial, and we are left with the authors’
frequently inadequate analysis. In that case we have to assign a low
study grade, because we remain too uncertain to draw cause-and-effect
conclusions from the data.
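The worst-case sensitivity analysis described above can be sketched in a few lines. This is only an illustration under stated assumptions: the trial counts are hypothetical, the outcome is a binary "failure", and the p-value comes from a simple pooled two-proportion z-test rather than whatever test a given paper used.

```python
import math


def two_proportion_p_value(fail_a, n_a, fail_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (fail_a / n_a - fail_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def worst_case_p_value(fail_t, done_t, missing_t, fail_c, done_c, missing_c):
    """Worst case for the treatment: every missing treatment subject is
    counted as a failure, every missing control subject as a success."""
    return two_proportion_p_value(
        fail_t + missing_t, done_t + missing_t,
        fail_c, done_c + missing_c,
    )


# Hypothetical trial: 100 randomized per arm.
# Treatment: 90 completers, 18 failures, 10 missing.
# Control:   95 completers, 38 failures,  5 missing.
p_completers = two_proportion_p_value(18, 90, 38, 95)
p_worst_case = worst_case_p_value(18, 90, 10, 38, 95, 5)

print(f"completers only: p = {p_completers:.4f}")
print(f"worst case:      p = {p_worst_case:.4f}")
```

In this made-up example the completers-only comparison is statistically significant but the worst-case imputation is not, which is the kind of result that would lower our confidence in the reported finding.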
We
see many studies where the analysis is accomplished using Kaplan-Meier
estimates and other models to deal with excluded patient data. As
John Lachin has pointed out, this type of “efficacy subset”
analysis has the potential for Type I errors (study findings=significant
difference between groups; truth=no significant difference) as large
as 50 percent or higher [1]. Lachin and others have shown that the
statistical methods used when data is censored (meaning not included
in analysis either through patient discontinuation or data being
removed), frequently assume that —
- Missing
data is missing at random to some degree;
-
It is reasonable to impute missing data using assumptions from
non-missing data; and,
-
The bias from efficacy subset analysis is not a major factor.
We
want to see data on all patients randomized. When patients are lost
to follow-up or do not complete a study, we want to see intent-to-treat
analyses with clear statements about how the missing data is imputed.
We agree with Lachin’s suggestion that the intent-to-treat
design is likely to be more powerful (than statistical modeling),
and especially powerful when an effective treatment slows progression
of a disease during its administration—i.e., when a patient
benefits long after the patient becomes noncompliant or the treatment
is terminated. Lachin concludes that, “The bottom line is
that the only incontrovertibly unbiased study is one in which all
randomized patients are evaluated and included in the analysis,
assuming that other features of the study are also unbiased. This
is the essence of the intent-to-treat philosophy. Any analysis which
involves post hoc exclusions of information is potentially biased
and potentially misleading.”
We
also agree with an editorial comment made by Colin Begg who states
that, “The properly conducted randomized trial, where the
primary endpoint and the statistical method are specified in advance,
and all randomized patients contribute to the analysis in an intent-to-treat
fashion, provides a structure that severely limits our opportunity
to obscure the facts in favor of our theories.” Begg concludes
by supporting Lachin’s assessment: “He is absolutely
correct in his view that the recent heavy emphasis on the development
of missing data methodologies in statistical academic circles has
led to a culture in which poorly designed studies with lots of missing
data are perceived to be increasingly more acceptable, on the flimsy
notion that sophisticated statistical modeling can overcome poor
quality data. Mundane though it may sound, I strongly support his
[Lachin’s] assertion that `…the best way to deal with
the problem (of missing data) is to have as little missing data
as possible…’ Attention to the development of practical
strategies for obtaining outcome data from patients who withdraw
from trials, notably short-term trials with longitudinal repeated
measures outcomes, is more likely to lead to improvement in the
quality of clinical trials than the further development of statistical
techniques that impute the missing data. [2]”
It
would be difficult to express our concern more eloquently than what
is stated above. The two examples below amplify this.
Example
1: A group of rheumatologists were uncomfortable with Kaplan-Meier
statistical methods for analysis of outcomes in rheumatology studies.
Their concern was that, even though Kaplan-Meier methods are frequently
used to analyze cancer data, very little research has been done
to validate the use of Kaplan-Meier methods for drug studies (i.e.,
endpoints such as stopping medication because of side-effects or
lack of efficacy). They tested three assumptions upon which Kaplan-Meier
survival analysis depends:
1.
Patients recruited early in the study should have the same drug
survival (i.e. time to determination of lack of efficacy or onset
of side-effects) as those recruited later;
2. Patients receiving their first drug later in the study should
have the same drug survival characteristics as those receiving it
earlier; and,
3. Drug survival characteristics should be independent of the time
that a patient has been in the study before receiving the disease
modifying drug.
To
examine the above assumptions, the authors plotted survival curves
for the different groups (i.e. subjects recruited early vs those
recruited later) and showed that, in each case, the drug survival
characteristics were statistically different between the two groups
(p<0.01). They conclude, as did Lachin, that it is not possible
to prove that survival analysis is always invalid (even though they
did show in this case the Kaplan-Meier analysis was invalid). However,
this group feels that the onus of proof is on those who advocate
for drug survival analysis—i.e., using statistical modeling
rather than presenting all the data so that the reader can do an
ITT analysis or sensitivity analysis[3].
Example
2: A similar situation occurred when a group of geriatricians became
concerned that many different, and sometimes inappropriate, statistical
techniques are used to analyze the results of randomized controlled
trials of falls prevention programs for elderly people. To evaluate
this, they used raw data from two randomized controlled trials of
a home exercise program to compare the number of falls in the exercise
and control groups using two different survival analysis models
(Andersen-Gill and marginal Cox regression) and a negative binomial
regression model for each trial.
In
one trial, the three different statistical techniques gave similar
results for the efficacy of the intervention but, in the second
trial, underlying assumptions were violated for the two Cox regression
models. Negative binomial regression models were easier to use and
more reliable.
Proportional
Hazards and Cox Regression Models: The authors point out that
proportional hazards or Cox regression models can test
whether several factors (for example, intervention group, baseline
prognostic factors) are independently related to the rate of a specific
event (e.g., a fall). However, using survival probabilities to analyze
time to fall events assumes that, at any time, participants who
are censored before the end of the trial have the same risk of falling
as those who complete the trial. A further assumption of proportional hazards
models is that the ratio of the risks of the events in the two groups
is constant over time and that the ratio is the same for different
subgroups of the data, such as age and sex groups. This is known
as the proportionality of hazards assumption. No particular distribution
is assumed for the event times, that is, the time from the trial
start date for the individual to the outcome of interest (in this
case, a fall event). This contrasts with, for example, death following
cardiac surgery, where one would assume a greater frequency of deaths
close to the surgical event.
Andersen-Gill
and marginal Cox proportional hazards regression: These models are
used in survival analyses when there are multiple events per person
in a trial. The Andersen-Gill extension of the proportional hazards
regression model and the marginal proportional hazards regression
model are both statistical techniques used for analyzing recurring
event data.
Negative
Binomial Regression: The negative binomial regression model can
also be used to compare recurrent event rates in different groups.
It allows investigation of the treatment effect and confounding
variables, and adjusts for variable follow-up times by using time
at risk.
In
the first study of falls in the elderly, all three statistical approaches
indicated that falls were significantly reduced by 40% (Andersen-Gill
Cox model), 44% (marginal Cox model) and 39% (negative binomial
regression model) in the exercise group compared with those in the
control group. The tests for the proportionality of hazards for
both types of survival regression models indicated that these models
“worked” for the recurring falls problem.
In
the second study, there was evidence that the proportional hazards
assumption was violated in the Andersen-Gill and marginal Cox regression
models (proportional hazards test). The authors point out that survival
analysis is not valid if participants who are censored do not have
the same rate of outcome (risk of falling) as those who continue
in the trial. The authors cite evidence
that those not completing a falls prevention trial are at higher
risk of falling; if fewer participants withdraw from one group than from
the other, this may point to a study-related cause for the difference in
discontinuation, and results may be biased.
Summary
Unfortunately, readers are in a very difficult position when evaluating
the quality of studies that use survival analyses and statistical
modeling because the assumptions used in the models are almost never
given and the missing data points are frequently quite large. Delfini
uses a conservative approach. We look for information about the
model, percent of subjects whose data are missing from analysis,
differential loss between the groups, censored information and reasons
for loss to follow-up. We have been unable to find any good evidence-based
criteria to help guide us in considering cut-offs for validity.
We use the following in evaluating how loss of subjects’ data
affects the validity of the study. While the suggestions below are
not evidence-based, they are conservative in comparison to some
EBM suggestions we have seen, and we have run some calculations
trying to help guide our choices. So caveat emptor!
Delfini
Non-evidence-based Advice on Reaction to Missing Data Points from
Non-completers and Those Lost to Follow-up:
| Minimal threat | < 5% and no differential loss* |
| Possible threat | >= 5% but < 10% and no differential loss* |
| Acceptable for efficacy | >= 5%, but sensitivity analysis conducted, by authors or reviewers, which applied worst-case scenario, or otherwise reasonable sensitivity analysis, and analysis continued to agree with authors’ findings about statistical significance |
| Threat | >= 5% with differential loss*, or >= 10% without differential loss, and without worst-case sensitivity analysis, or otherwise reasonable sensitivity analysis, conducted by authors or reviewers |
| *Differential loss | For small to medium study (e.g., less than 300 total randomized), differential loss must be low to non-existent (e.g., 2% or less difference in missing data points between groups). For large study (e.g., more than 300 total randomized), differential loss must be minimal (e.g., 5% or less difference in missing data points between groups). |
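For readers who want to apply the thresholds in the table above consistently, here is a minimal sketch of them as a function. The function name and its parameters are our own illustrative choices. Two details are assumptions because the table does not settle them: a study of exactly 300 is treated as "large", and a loss under 5% accompanied by differential loss is reported as not covered.

```python
def classify_missing_data_threat(pct_missing, differential_pct,
                                 total_randomized,
                                 sensitivity_analysis_holds=False):
    """Apply the (non-evidence-based) thresholds from the table.

    pct_missing: percent of randomized subjects with missing data
    differential_pct: absolute difference between groups in percent missing
    sensitivity_analysis_holds: True if a worst-case (or otherwise
        reasonable) sensitivity analysis still agrees with the authors
    """
    # Allowed between-group difference depends on study size.
    # Assumption: exactly 300 randomized counts as a large study.
    allowed = 2.0 if total_randomized < 300 else 5.0
    differential_loss = differential_pct > allowed

    if pct_missing < 5:
        if differential_loss:
            return "not covered by the table"  # assumption, see lead-in
        return "minimal threat"
    if sensitivity_analysis_holds:
        return "acceptable for efficacy"
    if pct_missing < 10 and not differential_loss:
        return "possible threat"
    return "threat"


print(classify_missing_data_threat(3, 1, 200))
print(classify_missing_data_threat(7, 4, 200))
print(classify_missing_data_threat(12, 0, 1000, sensitivity_analysis_holds=True))
```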
1.
Lachin JM. Statistical considerations in the intent-to-treat principle.
Control Clin Trials 2000;21:167–189. PMID: 11018568
2.
Utley M. et al. Potential bias in Kaplan-Meier survival analysis
applied to rheumatology drug studies. Rheumatology 2000;39:1-6.
3.
Robertson, MC et al. Statistical Analysis of Efficacy in Falls Prevention.
Journal of Gerontology 2005;60:530–534. |
| Intention-to-Treat
Analysis & Censoring: Rofecoxib Example
In
a recent DelfiniClick, we voiced concern about models used for analysis
of study outcomes, especially when information about assumptions
used is not reported. In the July 13, 2006 issue of the NEJM (published
early on-line), there is a very informative example of what can
happen when authors claim to analyze data using the intention-to-treat
(ITT) principle, but do not actually do an ITT analysis.
Case
Study
The NEJM published a correction to an original study of cardiovascular
events associated with rofecoxib versus placebo[1]. This correction
illustrates how Kaplan-Meier curves can be misleading to readers
and how they differ with various censoring assumptions. In this
case, by censoring data that occurred 14+ days after subjects discontinued
the study, the Kaplan-Meier curves for thrombotic events did not
separate until 18 months. The following is part of the correction
published by NEJM:
“…Statements regarding an increase in risk after 18
months should be removed from the Abstract (the sentence ‘The
increased relative risk became apparent after 18 months of treatment;
during the first 18 months, the event rates were similar in the
two groups’ should be deleted…”
The
reason for the correction appears to be an analysis of data released
by Merck to the FDA on May 11, 2006. These data provide information
about events in the subgroup of participants whose data were censored
if they had an event more than 14 days after early discontinuation
of the study medication.
Twelve
thrombotic events that occurred more than 14 days after the study
drug was stopped but within 36 months after randomization were noted.
Eight of the “new” events were in the rofecoxib group,
and these events had a definite effect on the published survival
curve for rofecoxib (Fig. 2 of the original article). When including
the new data, the separation of the rofecoxib and placebo curves
begins earlier than 18 months.
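To make the effect of censoring rules concrete, here is a small sketch of a Kaplan-Meier estimator applied to the same arm under two rules. The six subjects and their follow-up times are entirely hypothetical (they are not the rofecoxib data), and the estimator is a bare-bones version without confidence bands.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each time where an event occurs.

    times: follow-up time per subject
    events: True if the subject had the event at that time,
            False if the subject was censored at that time
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Group all subjects sharing this time point.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= deaths + censored
    return curve


# Hypothetical arm, times in months. Two subjects stop the drug early
# and have events several months after discontinuation.
# Rule A: censor them shortly after discontinuation (events not counted).
censored_rule = kaplan_meier(
    [6.5, 8.5, 20, 36, 36, 36],
    [False, False, True, False, False, False],
)
# Rule B: follow everyone and count the later events (ITT-style).
follow_all = kaplan_meier(
    [10, 12, 20, 36, 36, 36],
    [True, True, True, False, False, False],
)

print("censored rule:", censored_rule)
print("follow all:   ", follow_all)
```

Under the censoring rule the curve does not drop until month 20; when the post-discontinuation events are counted it starts to fall at month 10 and ends lower. The subjects are identical in both runs; only the censoring assumption differs.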
The
point of all this is that it is difficult to determine the validity
of a study when assumptions used in censoring of data are not reported.
With insufficient information about loss to follow-up, we cannot
do our own sensitivity analyses for imputing missing data with our
goal being to “test” the P-value reported by the authors.
To
reiterate from our previous DelfiniClick:
- Intent-to-treat
designs that provide primary outcome data on all randomized patients
are the ideal. All patients randomized are included in the analysis.
The same patients randomized at the beginning of the RCT are analyzed
in the same groups to which they were randomized.
-
Authors should use a CONSORT diagram to report what happened to
various patients during the course of the study – plus they
should provide detailed information about missing data points
including timing.
- Sensitivity
analyses are welcomed, especially those that subject the intervention
to the toughest trial. If p-values remain statistically significant
after such a test, we can be more confident about anticipated
outcomes in an otherwise valid study.
1.
Correction to: Cardiovascular events associated with rofecoxib in
a colorectal adenoma chemoprevention trial. N Engl J Med 2006;355:221.
2.
Bresalier RS, Sandler RS, Quan H, et al. Cardiovascular events associated
with rofecoxib in a colorectal adenoma chemoprevention trial. N
Engl J Med 2005;352:1092-102. |
| Intention-to-Treat
Analysis: Misreporting and Migraine
Intention-to-treat
analysis (ITT) is an important consideration in randomized, controlled
trials. And determining whether an analysis meets the definition
of ITT analysis or not is incredibly easy. Yet many authors mislabel
their analyses as ITT when they are not and report their results
in a biased way. An article in BMJ dealing with migraine illustrates
some important points about ITT analysis and reminds us that authors
continue to report outcomes in ways that are highly likely
to be biased.
Read
our case study here. |
| Missing
Data Points: Difference or No Difference — Does it Matter?
We
continue to study the "evidence on the evidence" —
meaning we are continually on the look out for information which
may shed light on the impact on reported outcomes of certain kinds
of bias, for example, or information that provides help in how to
handle different biases. Missing data points is an issue affecting
the majority of studies, but currently there is not clarity on how
big an issue this is, especially when there is not a differential
loss between groups.
We
spoke recently about this issue with John M. Lachin, Sc.D., Professor
of Biostatistics and Epidemiology, and of Statistics, The George
Washington University, and author. (And then we did some "hard
thinking" as David Eddy would say.) Even without differential
loss between the groups overall, a differential loss could occur
in prognostic variables — and readers are rarely going to
have access to data about changes in prognostic characteristics
post-baseline reporting. So we continue to offer our conservative
approach that loss of around five percent with differential loss
is a bias as well as loss of around ten percent or more without
differential loss.
For
those who are tough and hardy and really want to mull on this, here's
our updated white paper on "missingness" [Word]
or [PDF]. We welcome
further thoughts (or evidence) on this area. |
| Quality
of Studies: VIGOR
Why is it that Vioxx made the front page of the
New York Times in December of 2005 when it was withdrawn from the market in
2004? Reason: it was discovered that the authors “removed”
3 patients with CV events from the data in the days preceding final
hardcopy submission of the VIGOR study to the NEJM. Here are some
key points made by the NEJM in an editorial entitled, Expression
of Concern: Bombardier et al., “Comparison of Upper Gastrointestinal
Toxicity of Rofecoxib and Naproxen in Patients with Rheumatoid Arthritis,”
N Engl J Med 2000;343:1520-8, published on the web 12/8/04 and in
hard copy, N Engl J Med. 2005.353:25:
- The
VIGOR study was designed primarily to compare gastrointestinal
events in patients with rheumatoid arthritis randomly assigned
to treatment with rofecoxib (Vioxx) or naproxen (Naprosyn), but
data on cardiovascular events were also
monitored.
- Three
myocardial infarctions, all in the rofecoxib group, were not included
in the
data submitted to the Journal in hardcopy.
- Until
the end of November 2005, the NEJM believed that these were late
events that were not known to the authors in time to be included
in the article published in the Journal on November 23, 2000.
- It
now appears, however, from a memorandum dated July 5, 2000, that
was obtained by subpoena in the Vioxx litigation and made available
to the NEJM, that at least two of the authors knew about the three
additional myocardial infarctions at least two weeks before the
authors submitted the paper version of their manuscript.
- Lack
of inclusion of the three events resulted in an understatement
of the difference in risk of myocardial infarction between the
rofecoxib and naproxen groups.
- The
NEJM determined from a computer diskette that some of these data
were deleted from the VIGOR manuscript two days before it was
initially submitted to the Journal on May 18, 2000.
- Taken
together, these inaccuracies and deletions call into question
the integrity of the data on adverse cardiovascular events in
this article.
Merck's position is that the additional heart attacks
became known after the publication's "cutoff" date for
data to be analyzed and were therefore not reported in the Journal
article. To our knowledge, NEJM has not responded to Merck's point.
In any event, without the 3 missing subjects the
relative risk of myocardial infarction was 4.25 for rofecoxib
versus naproxen, 95% CI (1.39 to 17.37). This is based on 17 MIs
out of 2315 person years of exposure for rofecoxib and 4 MIs out
of 2336 person years for naproxen.
Adding in the 3 missing subjects (new total of 20
MIs in the rofecoxib group) increases the relative risk to 5.00,
95% CI (1.68 to 20.13). This demonstrates how losing just a few
subjects even in a large study can change results dramatically.
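The arithmetic behind this comparison can be sketched as a crude rate ratio using the event counts and person-years quoted above. This is an illustration only: it assumes person-years are unchanged when the three events are added, it does not reproduce the confidence intervals, and a crude ratio lands close to, but not exactly on, the quoted point estimates.

```python
def rate_ratio(events_a, person_years_a, events_b, person_years_b):
    """Crude incidence rate ratio of group A versus group B."""
    return (events_a / person_years_a) / (events_b / person_years_b)


# Counts quoted in the text: rofecoxib vs naproxen.
rr_as_published = rate_ratio(17, 2315, 4, 2336)
# Same person-years (an assumption), three additional MIs on rofecoxib.
rr_with_missing = rate_ratio(20, 2315, 4, 2336)

print(f"as published:          {rr_as_published:.2f}")
print(f"with 3 missing events: {rr_with_missing:.2f}")
```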
For readers, the important point is to look carefully
to be sure that all randomized patients were accounted for. We believe
that if the loss of subjects is greater than 5% without an acceptable
ITT analysis there is uncertainty regarding the validity of the
results.
|
Confidence-Intervals,
Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start
Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically
Significant Findings
(Special thanks to Brian Alper, MD, MSPH and Ted Ganiats,
MD for their help in understanding this issue.)
Problems
with Non-Statistically Significant Findings
Research outcomes which are not statistically significant (also
referred to as “non-significant findings”) raise the
question, "Is there TRULY no difference, or were there not
enough people to show a difference if there is one?" (This
is known as beta- or Type II error.)
Power
calculations are performed prior to a study to help investigators determine
the number of people they should enroll in the study to try and
detect a statistically significant difference if there is one. A
power of >= 80% is conventional and provides some leeway for
chance. Power calculations are generally performed only for the
primary outcome. They entail a lot of assumptions.
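To show how many assumptions go into a power calculation, here is a minimal sketch of one common textbook approximation for comparing two proportions. The event rates are hypothetical, and real protocols often use other formulas or add allowances for dropouts, so treat the numbers as illustrative only.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect the difference
    between two event rates with a two-sided test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed (hypothetical) event rates: 10% on control, 5% on treatment.
n_80 = sample_size_per_group(0.10, 0.05, power=0.80)
n_90 = sample_size_per_group(0.10, 0.05, power=0.90)
print(n_80, n_90)
```

Every input here (the expected control rate, the expected treatment effect, alpha and power) is chosen before the study, which is why a power calculation is only as good as its assumptions.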
Good
News About Power!
The good news for readers is that you don’t need to worry
about power since you can evaluate inconclusiveness of findings
through using confidence intervals.
Here’s
what they are, and here’s how it’s done:
About
Confidence Intervals (CIs)
The results of a valid study represent an approximation of truth.
There might be other possible values that could equally approximate
truth. (What if the study had been done on Friday instead of on
Tuesday, for example? Maybe the difference in outcomes would be
an absolute 4 percent and not 5 percent.) In recognition of this,
confidence intervals are calculations of equally statistically plausible
results generating a range within which there is a 95% chance that
the true answer lies for a valid study. (As with all allowances
for chance findings, 95 percent is conventional.) You can apply
confidence intervals to any measure of outcomes such as an odds
ratio or absolute risk reduction (ARR).
This
is how confidence intervals are reported:
Example: ARR = 5%; 95% CI (3% to 7%)
How
to Use Confidence Intervals to Determine Statistical Significance
Absolute
Risk Reduction and Relative Risk Reduction
For measures reported as percentages, if the range includes zero,
the outcomes are not statistically significant.
Relative
Risk (aka Risk Ratio) and Odds Ratio
For measures reported as ratios, if the range includes 1, the
outcomes are not statistically significant.
How
to Use Confidence Intervals to Determine Conclusiveness of Non-significant
Findings
And if something is not statistically significant (also referred
to as non-significant or NS findings), you don’t know if there
truly is no difference, or whether there were not enough people
to show a difference if there is one.
You
can look to the CIs to help you with this situation. But first you
want to decide what you would consider to be your minimum requirement
for a clinically significant outcome (difference between outcomes
in the intervention and comparison groups). This is a judgment call.
Let’s
assume we are looking at a study, the primary outcome for which
is absolute reduction in mortality. One might reasonably conclude
that an outcome of 1 percent or more is, indeed, a clinically meaningful
benefit.
[Below
is a text explanation. Pictures tell this best, however. Click
here
to view a PDF of what this looks like graphically. Note
that the PDF starts out first with how to determine clinical significance
of statistically significant outcomes and then demonstrates how
to determine conclusiveness of non-significant findings.]
Example:
Clinical Significance Goal
>=1% absolute reduction in mortality
For
Non-Significant Findings:
Example
1
- ARR
= 2%; 95% CI (-1% to 5%)
- The
upper boundary tells you it is possible that the true result
WOULD meet your requirements for clinical significance –
thus, from that perspective this trial is inconclusive about
NO DIFFERENCE BETWEEN GROUPS - you do not know if the trial
was insufficiently powered (false negative due to insufficient
number of people to show a statistically significant difference
if there is one)
Example
2
- ARR
= 0%; 95% CI (-0.5% to 0.5%)
-
The upper boundary does not reach your goal – therefore,
this can be considered sufficient evidence that there is no
difference between the groups that you would consider clinically
significant
How
to Use Confidence Intervals to Evaluate Clinical Benefit of Statistically
Significant Findings
Again, you can also use confidence intervals to determine whether
a result from a valid study is of meaningful clinical benefit.
Requirements
for Meaningful Clinical Benefit
Remember that outcomes of clinical significance are those which
benefit patients in some way in the areas of morbidity, mortality,
symptom relief, physical or emotional functioning or health-related
quality of life. Intermediate markers are assumed to benefit patients
in these areas, but they may not - thus, a direct causal chain of
benefit must be proved to avoid waste and potential patient harms
occurring as unintended consequences. Meaningful clinical benefit
is a combination of benefits in a clinically significant area along
with the size of the results.
As
with evaluating the conclusiveness of a non-significant finding,
you apply judgment to set your minimum requirement for meaningful
clinical significance. Using the same example of your choosing 1
percent absolute reduction in mortality as meaningful clinical benefit:
Example:
Clinical Significance Goal
>=1% absolute reduction in mortality
For
Statistically Significant Findings:
Example
1
- ARR
= 2%; 95% CI (0.5% to 3.5%)
- The
lower boundary tells you it is possible that the true result
will NOT meet your requirements for clinical significance –
thus, from that perspective this trial is inconclusive
Example
2
- ARR
= 2%; 95% CI (1% to 3%)
-
The lower boundary reaches your goals for clinical significance
– therefore, this can be considered sufficient evidence
of benefit
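The reasoning in the four examples above (two non-significant, two statistically significant) can be sketched as a single function. The function name and the wording of the returned labels are our own; the logic assumes an absolute risk reduction expressed in percent, where a positive value favors the intervention.

```python
def interpret_arr_ci(ci_low, ci_high, goal):
    """Interpret a 95% CI for absolute risk reduction (in percent)
    against a minimum clinically meaningful benefit ("goal")."""
    if ci_low <= 0 <= ci_high:
        # Not statistically significant: look at the upper boundary.
        if ci_high >= goal:
            return "non-significant, inconclusive"
        return "non-significant, no clinically meaningful difference"
    if ci_low > 0:
        # Statistically significant benefit: look at the lower boundary.
        if ci_low >= goal:
            return "significant, sufficient evidence of benefit"
        return "significant, clinical benefit inconclusive"
    return "significant, in the direction of harm"


goal = 1.0  # >= 1% absolute reduction in mortality
for ci in [(-1, 5), (-0.5, 0.5), (0.5, 3.5), (1, 3)]:
    print(ci, "->", interpret_arr_ci(ci[0], ci[1], goal))
```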
Again,
pictures probably tell this best. Click here
to view the PDF.
The
Authors Did Not Report CIs?
If you can create a 2 x 2 table from the study data, you can compute
them yourself using the confidence interval calculator of the University
of British Columbia, Department of Health Care and Epidemiology
»
which can also be found in the Delfini
WebLinks »
under "confidence interval calculations."
Evaluate
Definitions for Outcomes
And remember, ensure you agree with the authors’ definitions
of the outcomes, especially if they are using a term like “improved,”
“success,” or “failure” – is a three-point
change on a 200 point scale really a meaningful clinical difference
that should define success? You get to be the judge. |
| Getting
“Had” by P-values: Confidence Intervals vs P-values in
Evaluating Safety Results: Low-molecular-weight Heparin (LMWH) Example
In one of our DelfiniClicks
we have pointed out that confidence intervals (CIs) can be very
useful when examining results of randomized controlled trials (Confidence-Intervals,
Power & Meaningful Clinical Benefit
»). The first
step in examining safety results is to decide what you consider
to be a range for clinically significant outcomes (i.e., the difference
between outcomes in the intervention and comparison group). This
is a judgment call. Then examine the 95% CI to see if a clinically
significant difference is included in the confidence interval. If
it is, the study has not excluded the possibility of a clinically
significant harm even if the authors state there is no difference
(usually stated as “no difference” based on a non-significant
p-value.) It is important to remember that a non-significant p-value
can be very misleading in this situation.
This can be illustrated by an interesting conversation
we recently had with an orthopedic surgeon who felt he couldn’t
trust the medical literature to guide him because it gave him “misleading
information.” He based his conclusion on a study he read (he
wasn’t sure which study it was) regarding bleeding in orthopedic
surgery. After talking with him, we searched for studies that may
have led to his conclusion and found the following study which illustrates
why CIs are preferable to p-values in evaluating safety results
and possibly why he was misled.
Case
Study: An orthopedic surgeon reads an article comparing
outcomes, including bleeding rates, between fondaparinux and enoxaparin
in orthopedic surgery and sees the following statement by the
authors in the Abstract section of the paper:
“The two groups did not differ in frequency of death or
clinically relevant bleeding.” [1]
He
looks at the Results section of the paper and
reads the following: “The number of patients who had major
bleeding did not differ between groups (p=0.11).” He knows
that if the p-value is greater than 0.05, the differences are
not considered statistically significant, and he concludes that
there is no difference in bleeding between the groups. His confidence
is shaken when he switches to fondaparinux and his patients experience
increased postoperative bleeding.
Let’s evaluate this study’s bleeding
rates using confidence intervals. One might reasonably conclude
that an outcome of 1 percent or more difference between the groups
is, indeed, a clinically meaningful difference in bleeding:
- The
actual rates for major bleeding were 47/ 1140 (4.1%) in the fondaparinux
group vs 32/ 1133 (2.8%) in the enoxaparin group, up to day 11,
a difference of 1.3%, p=0.11.
- But
CIs provide more information: The absolute risk increase (ARI) with
fondaparinux was 1.3%, but the 95% CI was (-0.3%, 2.9%), and
since the true difference could be as great as 2.9% (i.e., clinically
relevant), the authors’ conclusions are misleading.
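Readers can check this kind of result themselves from the reported event counts. The sketch below uses a simple Wald approximation for the confidence interval of a risk difference; that choice is ours, and other methods give slightly different bounds, so small discrepancies from published intervals are to be expected.

```python
import math


def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk difference (A minus B) with an approximate 95% Wald CI."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se


# Major bleeding counts quoted in the text: fondaparinux vs enoxaparin.
ari, low, high = risk_difference_ci(47, 1140, 32, 1133)

clinically_relevant = 0.01  # 1% absolute difference, a judgment call
harm_not_excluded = high >= clinically_relevant

print(f"ARI = {ari:.1%}, 95% CI ({low:.1%} to {high:.1%})")
print("clinically relevant harm not excluded:", harm_not_excluded)
```

The interval crosses zero, which is consistent with the non-significant p-value, yet its upper boundary sits well above the 1% threshold. That is the point of the example: "not statistically significant" is not the same as "no difference".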
The
Cochrane Handbook summarizes this problem nicely:
"A
common mistake when there is inconclusive evidence is to confuse
‘no evidence of an effect’ with ‘evidence of
no effect.’ When there is inconclusive evidence, it is wrong
to claim that it shows that an intervention has ‘no effect’
or is ‘no different’ from the control intervention.
It is safer to report the data, with a confidence interval, as
being compatible with either a reduction or an increase in the
outcome. When there is a ‘positive’ but statistically
non-significant trend, authors commonly describe this as ‘promising,’
whereas a ‘negative’ effect of the same magnitude
is not commonly described as a ‘warning sign.’ Authors
should be careful not to do this." [2]
Comments:
Following the Lassen study referenced above, others confirmed the
increased bleeding rate leading to re-operation and other significant
bleeding with fondaparinux vs enoxaparin. [3]
When investigators provide p-values but not confidence
intervals, readers can quickly calculate the 95% CIs if the outcomes
are dichotomous and the investigators report the actual rates of
events, as in the example above, by using the calculator available
at:
http://www.graphpad.com/quickcalcs/NNT1.cfm
Also, see our web links for other sources (search
“confidence intervals”):
http://www.delfini.org/delfiniWebSources.htm
References:
- Lassen
MR, Bauer KA, Eriksson BI, Turpie AG. Postoperative fondaparinux
versus preoperative enoxaparin for prevention of venous thromboembolism
in elective hip replacement surgery: a randomised double-blind
comparison. Lancet. 2002;359:1715- 20. [PMID: 12049858]
- Higgins
JPT, Green S, editors. 9.7 Common errors in reaching conclusions.
Cochrane Handbook for Systematic Reviews of Interventions 4.2.6
[updated September 2006]. http://www.cochrane.org/resources/handbook/hbook.htm
(accessed 22nd January 2008).
- Vormfelde
SV. Comment on: Lancet. 2002 May 18;359(9319):1710-1. Lancet.
2002 Nov 23;360(9346):1701. [PMID: 12457831]
|
Understanding
Number Needed to Treat (NNT)
We have found that many health care professionals do not understand
the steps in calculating NNT. Bandolier's website offers
a classic article on NNT. We heartily recommend reviewing
this article if you have any questions or uncertainties about what
NNT means or how to calculate and use NNT information.
http://www.jr2.ox.ac.uk/bandolier/booth/painpag/NNTstuff/numeric.htm |
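For readers who want to see the arithmetic, here is a minimal sketch of the usual calculation: NNT is the reciprocal of the absolute risk reduction (control event rate minus treated event rate), conventionally rounded up to a whole number of patients. The function name and the counts are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
import math


def number_needed_to_treat(control_events, control_n, treated_events, treated_n):
    """Return the NNT, rounded up, from event counts in each group.

    NNT = 1 / ARR, where ARR = control event rate - treated event rate.
    Exact fractions are used to avoid floating-point rounding surprises.
    """
    control_rate = Fraction(control_events, control_n)
    treated_rate = Fraction(treated_events, treated_n)
    arr = control_rate - treated_rate
    if arr <= 0:
        # No reduction in events with treatment, so an NNT for benefit
        # is not defined for these data.
        raise ValueError("No absolute risk reduction; NNT is not defined.")
    return math.ceil(1 / arr)


# Hypothetical example: 10% of control patients and 6% of treated patients
# have the event, so ARR = 4% and NNT = 1 / 0.04 = 25.
print(number_needed_to_treat(100, 1000, 60, 1000))
```

An NNT should always be read alongside the time frame and the specific outcome it refers to, since the same treatment will have different NNTs for different outcomes and follow-up periods.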
| Early
Discontinuation of Clinical Trials: Oncology Medication Studies—Recent
Developments and Concern: 04/28/08
With
the trend for more rapid approval of oncology drugs has come concern
regarding the validity of reported results because of methodological
problems. Validity and usefulness of reported results from oncology
(and other) studies are clearly threatened by lack of randomization,
blinding, the use of surrogate outcomes and other methodological
problems. Trotta et al. have extended this concern in a recent study
that highlights an additional problem with oncology studies: stopping
oncology trials early [1. Trotta F, Apolone G, Garattini S, Tafuri
G. Stopping a trial early in oncology: for patients or for industry?
Ann Oncol. 2008 Apr 9 [Epub ahead of print] PMID: 18304961]. The
aim of the study was to assess the use of interim analyses in randomized
controlled trials (RCTs) testing new anticancer drugs, focusing
on oncological clinical trials stopped early for benefit. A second
aim was to estimate how often trials stopped prematurely as a result
of an interim analysis are used for registration, i.e., approval
by the European Medicines Agency (EMEA), the European equivalent of
FDA approval. The authors searched Medline, hand-searched
The Lancet, The New England Journal of Medicine, and The Journal
of Clinical Oncology, and evaluated all clinical trials
stopped early for benefit that were published in the last 11 years. The
focus was on trials of anticancer drugs that included an interim analysis.
Results
and Authors’ Conclusions
Twenty-five RCTs were analyzed. In 95% of studies, at the interim
analysis, efficacy was evaluated using the same end point as planned
for the final analysis. The authors found a consistent increase
(>50%) in prematurely stopped trials in oncology during the last
3 years. As a consequence of early stopping after the interim analysis,
approximately 3,300 patients/events across all studies were spared
potential harms of continued therapy. This may appear to be clearly
beneficial, but as the authors point out, stopping a trial early
does not guarantee that other patients will receive the apparent
benefit of stopping, assuming one exists, unless study findings
are immediately publicly disseminated. The authors found long delays
between study termination and published reports (approximately 2
years). If the trials had continued for these further 2 years, more
efficacy and safety data could have been gathered. Delays in reporting
results further lengthen the time needed for translating trial findings
into practice.
Surprisingly,
only a very small percentage of trials (approximately 4%) were stopped
early because of harms, i.e., serious adverse events. Therefore,
toxicity does not represent the main factor leading to early termination
of trials. Of the 25 trials, six had no data and safety monitoring
board (DSMB) and five had enrolled less than 40% of the planned
sample size. Even so, 11 were used to support licensing applications
on the basis of what could have been exaggerated chance events.
Thus, more than 78% of the oncology RCTs published in the last 3
years were used for registration purposes. The authors argue that
only untruncated trials can provide a full level of evidence which
might be useful for informing clinical practice decisions without
further confirmative trials. They concluded that early termination
may be done for ethical reasons such as minimizing the number of
people given an unsafe, ineffective, or clearly inferior treatment.
However, interim analyses may also have drawbacks, since stopping
trials early for apparent benefit will systematically overestimate
treatment effects [2. Pocock SJ. When (not) to stop a clinical trial
for benefit. JAMA 2005; 294: 2228–2230. PMID: 16264167] and
raises new concerns about what they describe as “market-driven
intent.” Some additional key points made by the authors:
- Repeated
interim analyses at short intervals raise concern about data reliability:
this strategy risks the appearance of seeking the statistical
significance necessary to stop a trial;
- Repeated
analyses on the same data pool often lead to statistically significant
results only by chance;
- If
a trial is evaluating the long-term efficacy of a treatment for
conditions such as cancer, short-term benefits — no matter
how significant statistically — may not justify early stopping.
Data on disease recurrence and progression, drug resistance, metastasis,
or adverse events could easily be missed. Early stopping may reduce
the likelihood of detecting a difference in overall survival (the
only relevant endpoint in this setting).
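The point about repeated analyses producing significant results by chance can be illustrated with a small simulation. The sketch below is not the authors' method; it assumes a two-arm trial with a normally distributed outcome, no true treatment effect, and an unadjusted test at the nominal 0.05 level (|z| > 1.96) at each equally spaced look. All names and parameters are hypothetical.

```python
import math
import random


def trial_crosses_threshold(looks, n_per_arm, rng, z_crit=1.96):
    """Simulate one trial with NO true effect.

    Returns True if any of the equally spaced interim or final looks
    shows |z| > z_crit, i.e. a false-positive "stop for benefit or harm".
    """
    if looks < 1 or looks > n_per_arm:
        raise ValueError("looks must be between 1 and n_per_arm")
    sum_a = sum_b = 0.0
    n = 0
    for i in range(looks):
        target = n_per_arm * (i + 1) // looks  # patients per arm at this look
        while n < target:
            sum_a += rng.gauss(0, 1)  # arm A outcome, true mean 0
            sum_b += rng.gauss(0, 1)  # arm B outcome, true mean 0
            n += 1
        # z statistic for the difference in means (variance 1 per observation)
        z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
        if abs(z) > z_crit:
            return True
    return False


def false_positive_rate(looks, n_per_arm=100, n_trials=2000, seed=0):
    """Fraction of simulated null trials that cross the threshold at any look."""
    rng = random.Random(seed)
    hits = sum(trial_crosses_threshold(looks, n_per_arm, rng) for _ in range(n_trials))
    return hits / n_trials


print("1 look: ", false_positive_rate(1))
print("5 looks:", false_positive_rate(5))
```

With a single final analysis the false-positive rate should sit near the nominal 5%, while five unadjusted looks push it well above that (statistical theory puts it in the neighborhood of 14%). This is why group-sequential designs use stricter stopping boundaries at interim analyses.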
The
authors conclude that:
…a
decision whether to stop a clinical trial before its completion
requires a complex of ethical, statistical, and practical considerations,
indicating that results of RCTs stopped early for benefit should
be viewed with criticism and need to be further confirmed. The
main effect of such decisions is mainly to move forward to an
earlier-than-ideal point along the drug approval path; this could
jeopardise consumers’ health, leading
to unsafe and ineffective drugs being marketed and prescribed.
Even if well designed, truncated studies should not become routine.
We believe that only untruncated trials can provide a full level
of evidence which can be translated into clinical practice without
further confirmative trials.
Lancet
Comment
In a Lancet editorial on April 19, 2008, the editorialist states
that early stopping of RCTs should require proof beyond reasonable
doubt that equipoise no longer exists. Data and safety monitoring
boards must balance the decision to stop, which favors immediate
stakeholders (participants, investigators, sponsors, manufacturers,
patients’ advocates, and editors), against continuing the study
to obtain more accurate estimates not only of effectiveness but
also of longer-term safety. The editorialist adds that, in judging
whether or not to stop a trial early for benefit, the plausibility
of the findings and their clinical significance are as important
as statistical boundaries.
Delfini
Comments
Overall we are concerned about the FDA’s loosening of standards
for accepting oncology study data as valid when it comes from studies
that many would judge to be fatally f | |