|
| Volume
— Quality
of Evidence:
Primary Studies & General Concepts
newest
12/05/2011:
Attrition Bias & A Biostatistician Weighs In: Dr. Steve Simon on "Why is a 20% dropout rate bad?"
11/28/2011:
Confidence Intervals: Overlapping Confidence Intervals—A Clarification
|
| Quality
of Studies: Lower Quality = Greater Effect Size
The quality of
studies in systematic reviews and meta-analyses has repeatedly
been shown to affect the amount of benefit reported. This DelfiniClick
is a quick reminder that just because a study is a RCT does
not mean it will provide you with a reliable estimate of effect
size. A nice illustration of this point is provided in a classic
article by Moher D et al. (Does quality of reports of randomised
trials affect estimates of intervention efficacy reported in
meta-analyses? Lancet 1998; 352: 609–13)[1].
In this study,
the authors randomly selected 11 meta-analyses that involved
127 RCTs on the efficacy of interventions used for circulatory
and digestive diseases, mental health, pregnancy and childbirth.
The authors evaluated each RCT by examining the description
of randomization, allocation concealment, blinding, drop outs
and withdrawals.
The results are
in line with other authors’ findings regarding quality
of methods and amount of benefit (effect size) reported as relative
measures below:
- The quality
of trials was low overall.
- Low-quality
trials compared with high quality trials (score >2) were
associated with an increased estimate of benefit of 34%.
- Trials that
used inadequate allocation concealment, compared with those
that used adequate methods, were also associated with an increased
estimate of benefit (37%).
- The average
treatment benefit was 39% for all trials, 52% for low-quality
trials, and 29% for high-quality trials.
The authors conclude
that studies of low methodological quality in which the estimate
of quality is incorporated into the meta-analyses can alter the
interpretation of the benefit of the intervention.
We continue to
see this problem in systematic reviews and clinical guidelines
and suggest that when evaluating secondary studies readers pay
close attention to the quality of included studies.
[1] Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, Tugwell P, Klassen TP. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998 Aug 22;352(9128):609-13. PubMed PMID: 9746022. |
| Overestimation
of Effect Size in Studies of Low Quality
Updated 10/11/2011 for Loss Levels
In a previous DelfiniClick,
we summarized an article by Moher and colleagues (1) in which
the authors randomly selected 11 meta-analyses involving 127
RCTs which evaluated the efficacy of interventions used for
circulatory and digestive diseases, mental health, pregnancy
and childbirth. Moher and colleagues concluded that:
- Low-quality
trials compared with high quality trials (score >2), were
associated with a relative increased estimate of benefit (34%).
- Trials that
used inadequate allocation concealment, compared with those
that used adequate methods, were associated with a relative
increased estimate of benefit (37%).
Below we summarize
another study that confirms and expands Moher’s findings.
In a study similar to Moher’s, Kjaergard and colleagues
(2) evaluated the effects of methodologic quality on estimated
intervention effects in randomized trials.
The study evaluated
23 large and 167 small randomized trials and a total of 136,164
participants. Methodologic quality was defined as the confidence
that the trial’s design, conduct, analysis, and presentation
minimized or avoided biases in the trial’s intervention
comparisons (3). The reported methodologic quality was assessed
using four separate components and a composite quality scale.
The quality score
was ranked as low (≤2 points) or high (≥3 points),
as suggested by Moher et al. (1). The four components were 1)
generation of allocation sequence; 2) concealment of allocation;
3) double-blinding; and 4) reporting of loss-to-follow-up.
RESULTS
OF KJAERGARD ET AL’S REVIEW (all reported exaggerations
are relative increases):
Generation
of Allocation Sequence
The odds ratios generated by all trials (large and small) with
inadequate generation of the allocation sequence were on average
significantly exaggerated by 51% compared with all trials reporting
adequate generation of allocation sequence (ratio of odds ratios
(95% CI) = 0.49 (0.30–0.81), P < 0.001).
Concealment
of Allocation
All trials with inadequate allocation concealment exaggerated
intervention benefits by 40% compared with all trials reporting
adequate allocation concealment, although this difference was not
statistically significant (ratio of odds ratios (95% CI)
= 0.60 (0.31–1.15), P = 0.12). Odds ratios were significantly
exaggerated by 52% in small trials with inadequate versus adequate
allocation concealment (ratio of odds ratios (95% CI) = 0.48 (0.25–0.92),
P = 0.027).
Double
Blinding
The odds ratios generated by all trials without double blinding
were significantly exaggerated by 44% compared with all double-blind
trials (ratio of odds ratios (95% CI) = 0.56 (0.33–0.98),
P = 0.041).
Reporting
of Loss-to-Followup
The analyses showed no significant association between reported
follow-up and estimated intervention effects (ratio of odds
ratios (95% CI) = 1.50 (0.80–2.78), P = 0.2).
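The percentages reported above appear to follow directly from the ratios of odds ratios (ROR): an ROR of 0.49 corresponds to an exaggeration of 1 − 0.49 = 51%. The following is a minimal Python sketch of that arithmetic using the figures quoted above. The function names are ours, chosen for illustration, and are not taken from Kjaergard et al.

```python
def percent_exaggeration(ror):
    """Relative exaggeration of benefit implied by a ratio of odds ratios below 1."""
    return round((1 - ror) * 100)


def ci_excludes_one(lower, upper):
    """True when a 95% CI for a ratio lies entirely on one side of 1."""
    return lower > 1 or upper < 1


# (ROR, lower 95% limit, upper 95% limit) as quoted in the summary above
results = {
    "allocation sequence, all trials": (0.49, 0.30, 0.81),
    "allocation concealment, all trials": (0.60, 0.31, 1.15),
    "allocation concealment, small trials": (0.48, 0.25, 0.92),
    "double blinding, all trials": (0.56, 0.33, 0.98),
}

for name, (ror, lower, upper) in results.items():
    flag = "CI excludes 1" if ci_excludes_one(lower, upper) else "CI includes 1"
    print(f"{name}: {percent_exaggeration(ror)}% exaggeration ({flag})")
```

Run on these figures, the sketch reproduces the 51%, 40%, 52% and 44% values and flags the all-trials allocation concealment comparison as the one whose confidence interval includes 1.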
Kjaergard
and Colleagues’ Conclusions
- Adequate generation
of the allocation sequence and adequate allocation concealment
should be required for adequate randomization.
Unlike previous investigators (1,3,4,5), the authors found
that trials with inadequate generation of allocation sequence
exaggerate intervention effects significantly.
- Trials with
inadequate allocation concealment also generate exaggerated
results.
This is in accordance with previous evidence (1,3,5). The
authors found that despite the considerable overlap between
generation of allocation sequence and allocation concealment,
both factors may independently affect the estimated intervention
effect.
- Trials without
double blinding exaggerate results.
This study supports Schulz and colleagues’ finding of
a significant association between intervention effects and
double blinding and extends the evidence by including trials
from several therapeutic areas.
- There was no
association between reported follow-up and intervention effect.
Delfini
Comment
It is useful to know quantitatively how various threats
to validity affect results when doing critical appraisal of
a study. The study by Kjaergard and colleagues summarized above
expands the findings of Schulz, Moher, Juni and others.
Previous studies
have questioned the reliability of reported losses to follow-up
(5, 6). In accordance with Schulz and colleagues’ results
(5), the authors found no association between intervention effects
and reported follow-up.
Delfini Note: We have found that losses to follow-up may significantly
affect P values. See our personal judgment on loss levels.
In agreement with
the findings of Moher and associates (1,3) and Juni and colleagues
(7), the authors found that trials with a low quality score
on the scale developed by Jadad and colleagues (8) significantly
exaggerate intervention benefits.
Kjaergard and colleagues
conclude that assessment of methodologic quality should focus
on generation of allocation sequence, allocation concealment,
and double blinding. Delfini feels this is not sufficient –
but appreciates this study as one that further demonstrates
the importance of effective approaches to some of these methodologic
areas.
References
1. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et
al. Does quality of reports of randomised trials affect estimates
of intervention efficacy reported in meta-analyses? Lancet.
1998;352:609-13. PMID: 9746022
2. Kjaergard LL,
Villumsen J, Gluud C. Reported Methodologic Quality and
Discrepancies between Large and Small Randomized Trials in Meta-Analyses.
Ann Intern Med. 2001;135:982-989. PMID 11730399
3. Moher D, Cook
DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing
the quality of reports of randomised trials: implications for
the conduct of meta-analyses. Health Technol Assess. 1999;3:i-iv,
1-98. PMID: 10374081
4. Emerson JD,
Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical
study of the possible relation of treatment differences to quality
scores in controlled randomized clinical trials. Control Clin
Trials. 1990;11:339-52. PMID: 1963128
5. Schulz KF, Chalmers
I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions
of methodological quality associated with estimates of treatment
effects in controlled trials. JAMA. 1995;273:408-12. PMID: 7823387
6. Gøtzsche
PC. Methodology and overt and hidden bias in reports of 196
double-blind trials of nonsteroidal antiinflammatory drugs in
rheumatoid arthritis. Control Clin Trials. 1989;10:31-56. PMID:
2702836
7. Juni P, Witschi
A, Bloch R, Egger M. The hazards of scoring the quality of clinical
trials for meta-analysis. JAMA. 1999;282:1054-60. PMID: 10493204
8. Jadad AR, Moore
RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al.
Assessing the quality of reports of randomized clinical trials:
is blinding necessary? Control Clin Trials. 1996;17:1-12. PMID:
8721797 |
Key Clinical Questions and PICO
04/08/2011
Optimal clinical decision-making depends on accurately predicting outcomes from various choices. When considering therapeutic interventions (or no intervention), clinicians and patients need to know the reliability of the evidence upon which the predictions are made. For those who obtain and summarize the evidence used to inform decisions, a useful approach is to consider the PICO checklist. PICO (population/intervention/comparison/outcome) is a reminder of some important considerations that can inform searching and can help with deciding whether to include evidence, synthesizing the evidence and presenting it to various end-users.[1] PICO can also help ensure attention to transparency and the synthesis of useful information.
Key Clinical Questions
Key clinical questions create the focus for the work and, once created, drive the project. The key question assists me in all aspects of the project (ASK, ACQUIRE, APPRAISE, APPLY). PICO is a useful framework for constructing key questions, but should be applied thoughtfully. For example, if I am interested in the evidence regarding prevention of venous thrombosis in hip replacement surgery, I would want to include the population and study design and perhaps key outcomes in my searches, but I would not want to limit the question to any specific interventions in case there are some useful interventions of which I am not aware. So the question might be, “What is the evidence that thromboembolism or deep vein thrombosis (DVT) prophylaxis with various agents reduces mortality and clinically significant morbidity in hip replacement surgery?” In this case, I was somewhat specific about P, less specific about O and not specific about I and C. I could be even more specific about P if I specified patients at average risk for VTE or only patients at increased risk. If I were interested in the evidence about the effect of glycemic control on important outcomes in type 2 diabetes, I might pose the question as, “What is the effect of tight glycemic control on various outcomes?” and type in the terms “type 2 diabetes” AND “tight glycemic control,” which would not exclude studies reporting outcomes of which I was unaware.
IMPORTANT POINT: When actually conducting a search, we generally use the "condition" rather than the "population." Population terms frequently do not activate the MeSH headings in PubMed, and it is the MeSH mapping that produces a more robust search that includes key synonyms.
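The idea of leaving some PICO elements unspecified can be sketched in a few lines of Python. This is only a toy string builder for illustration: the function name `build_query` and its parameters are our own assumptions, it does not call PubMed, and a real search would still depend on how PubMed maps the terms to MeSH headings.

```python
def build_query(condition, intervention=None, comparison=None, outcome=None):
    """Join only the PICO elements that were specified, using AND.

    Elements left as None are skipped, so the search is not narrowed
    by interventions or outcomes the searcher has not thought of.
    """
    terms = [condition, intervention, comparison, outcome]
    return " AND ".join(f'"{term}"' for term in terms if term)


# The diabetes example from the text: condition plus intervention, no outcome limit
print(build_query("type 2 diabetes", intervention="tight glycemic control"))
# "type 2 diabetes" AND "tight glycemic control"
```

Leaving `outcome` empty mirrors the point made above: the question stays open to outcomes the searcher did not anticipate.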
PICO can also assist in determining whether to include studies or exclude them from a review. For example, if I am limiting my review to drugs and not devices, I might want to exclude mechanical compression devices. On the other hand, I might want to broaden the question to include both types of intervention and not limit to any intervention if I found important evidence of a synergistic effect of adding compression to drug prophylaxis in preventing VTE in hip replacement surgery and realized there may be other interventions of value also.
Finally, key questions are also helpful in reminding us of some important considerations when summarizing evidence and creating decision support for patients or clinicians. For example, when I have finished an evidence review, I would want to clearly describe the populations (and settings), interventions, the quality of the evidence, benefits and safety issues, surrogate outcomes and patient-important outcomes when summarizing the evidence. In summary, PICO can be very helpful in designing key questions (ASK, ACQUIRE), critically appraising the evidence (APPRAISE) and synthesizing the evidence (APPLY).
1. Guyatt GH, Oxman AD, Kunz R, Atkins D, Brozek J, Vist G, Alderson P, Glasziou P, Falck-Ytter Y, Schünemann HJ. GRADE guidelines: 2. Framing the question and deciding on important outcomes. J Clin Epidemiol. 2011 Apr;64(4):395-400. Epub 2010 Dec 30. PubMed PMID: 21194891.
|
Must
Clinical Trials be Randomized? A Look at Minimization Methods
07/20/2010
In clinical
trials, any difference between groups, except for what is being
studied, could explain or distort the study results. In randomized
clinical trials (RCTs), the purpose of randomization is to attempt
to distribute people for study into study groups in such a way
that prognostic variables are evenly distributed. Thus, the
goal of the randomization process in RCTs is to generate study
groups with similar known and unknown prognostic variables so
that the groups being compared have similar baseline characteristics.
Randomization is very likely to achieve balanced groups, especially
in large trials. Adequate simple or unrestricted randomization
is achieved by generating random number sequences and concealing
the randomization process from everyone involved in the study.
Minimization is a non-random method of allocating
patients to study groups. Since it is not random, is it necessarily
bad? Possibly not.
With minimization the goal is to ensure that
several pre-specified patient factors and the number of subjects
are balanced in the study groups. The allocation of each subject
is identified, and that information is used to increase the
likelihood that subjects are allocated to the group which it
is thought will result in balanced prespecified patient factors.
This can be accomplished by models that identify the number
of patients in each group with the pre-specified factors and
increase the likelihood or ensure that the next subject will
be allocated to the group with fewer patients with the pre-specified
factor. Numerous methods for accomplishing minimization have
been described. Minimization may effectively distribute known
prognostic variables, and many authors consider it methodologically
equivalent to randomization without minimization. One potential
threat to validity is whether or not the knowledge of impending
allocation assignment by individuals involved in the study could
affect the allocation process. Benefits, drawbacks and extensive
methodological detail are available in a review by Scott et
al. who conclude that minimization is a highly effective allocation
method [1].
1. Scott NW, McPherson GC, Ramsay CR, Campbell
MK. The method of minimization for allocation to clinical trials:
a review. Control Clin Trials. 2002 Dec;23(6):662-74. Review.
PubMed PMID: 12505244
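To make the allocation logic described above concrete, here is a simplified Python sketch of one way minimization could work. It is only one of the numerous possible methods and is not taken from Scott et al. The class name, the arm labels, the example factors and the 0.8 probability of choosing the better-balancing arm are all illustrative assumptions.

```python
import random
from collections import defaultdict


class Minimizer:
    """Toy minimization: favor the arm with fewer similar patients so far."""

    def __init__(self, arms=("A", "B"), p_best=0.8, seed=None):
        self.arms = list(arms)
        self.p_best = p_best  # assumed probability of picking the better-balancing arm
        self.rng = random.Random(seed)
        # counts[arm][(factor, level)] = patients already in that arm with that level
        self.counts = {arm: defaultdict(int) for arm in self.arms}

    def imbalance(self, arm, patient):
        """Sum, over the patient's factors, of matching patients already in the arm."""
        return sum(self.counts[arm][(f, lvl)] for f, lvl in patient.items())

    def allocate(self, patient):
        scores = {arm: self.imbalance(arm, patient) for arm in self.arms}
        lowest = min(scores.values())
        best = [arm for arm in self.arms if scores[arm] == lowest]
        if len(best) == len(self.arms):
            arm = self.rng.choice(self.arms)   # all arms tied: choose at random
        elif self.rng.random() < self.p_best:
            arm = self.rng.choice(best)        # usually pick the better-balancing arm
        else:
            arm = self.rng.choice([a for a in self.arms if a not in best])
        for f, lvl in patient.items():
            self.counts[arm][(f, lvl)] += 1
        return arm


# Hypothetical patients described by two pre-specified factors
trial = Minimizer(seed=42)
for patient in [
    {"sex": "F", "age": "<65"},
    {"sex": "F", "age": ">=65"},
    {"sex": "M", "age": "<65"},
    {"sex": "F", "age": "<65"},
]:
    print(patient, "->", trial.allocate(patient))
```

Setting `p_best=1.0` makes the allocation fully determined whenever the arms are not tied, which is where the concern about foreknowledge of the next assignment is greatest. A value below 1 keeps a random element in every allocation.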
|
| Concealment
of Allocation
In 1996, the CONSORT statement encouraged the reporting of concealment
of allocation. Concealment of allocation is the process for
actually assigning to the patient the group they will be in
without breaking blinding. Hewitt et al. in a recent issue of
BMJ reviewed the prevalence of adequate concealment of allocation
in 4 journals—BMJ, Lancet, JAMA and NEJM (Hewitt C et
al. BMJ 2005;330:1057-1058. PMID: 15760970). They scored the
allocation as adequate (i.e., subject recruiter was different
person from the person executing the allocation sequence), inadequate
or unclear. Sealed envelopes were considered inadequate unless
performed by an independent third party.
Results
Studies included: 234
Adequate concealment: 132 (56%)
Inadequate concealment: 41 (18%)
Unclear concealment: 61 (26%)
Delfini Commentary
The authors point out that previous studies have found an association
between inadequate concealment and the reporting of significant
results. Of interest is that studies included in this review
with inadequate concealment tended to be more likely to show a significant result
(OR 1.8, 95% CI 0.8 to 3.7), although the confidence interval includes 1.
This is another
study suggesting that the critical appraisal of RCTs is “critical”
and that lower quality studies are more likely to report significant
benefit than are higher quality studies.
|
| Blinding
and RCTs
A recent article,
Boutron I, Estellat C, Guittet L, Dechartres A, Sackett DL,
et al. (2006) Methods of blinding in reports of randomized controlled
trials assessing pharmacologic treatments: A systematic review.
PLoS Med 3(10): e425. DOI: 10.1371/journal.pmed.0030425, provides
a great deal of useful information about and a way of classifying
blinding in research studies. The authors evaluated blinding
in RCTs of pharmacologic treatment published in 2004 in high
impact-factor journals. The following are some key points from
the article:
• The authors
identified 819 reports with about 60% describing the method
of blinding. The classification identified three main methods
of blinding:
(1) methods to provide identical treatments in both arms,
(2) methods to avoid unblinding during the trial, and
(3) methods of blinded outcome assessment.
• ESTABLISHING BLINDING OF PATIENTS AND PROVIDERS: 472
[58%] described the method of blinding, but 236 [29%] gave no
detail and 111 [13%] gave only some information on blinding (i.e., reporting
that treatments were similar or the use of double dummies with
no description of the method). The methods of blinding identified
varied in complexity. The authors reported use of a centralized
preparation of similar capsules, tablets, or embedded treatments
in hard gelatin capsules (193/336 [57%]), similar syringes (37/336
[11%]), or similar bottles (38/336 [11%]). Use of a double dummy
procedure was described in 79 articles (23%). Other methods
consisted of a sham intervention performed by an unblinded health
care provider who was not actively involved in the care of patients
and had no other contact with patients or other caregivers and
outcome assessors (17/336 [5%]). To mask the specific taste
of the active treatments, in ten articles researchers used a
specific flavor such as peppermint or sugar to coat treatments.
For treatments administered by care providers, authors reported
use of a centralized preparation of opaque coverage to adequately
conceal intravenous treatments with different appearances (14/336
[4%]).
• AVOIDING
UNBLINDING OF PATIENTS AND PROVIDERS: Only 28/819 [3%] reported
methods to avoid unblinding. Methods to blind dosage adaptation
relied on use of a centralized adapted dosage or provision of
sham results of complementary investigations for treatments
necessitating dosage adaptation. Methods to avoid unblinding
because of side effects relied mainly on centralized assessment
of side effects, partial information to patients about side
effects, use of active placebo or systematic prevention of adverse
effects in both arms.
• BLINDING
ASSESSORS: These methods depend on the main outcomes and are
particularly important when blinding cannot be established and
maintained by the methods described above. A total of 112 articles
[14%] described these methods, which relied mainly on a centralized
assessment of the main outcome. Blinding of outcome assessors
is presumably achieved if neither patients nor those involved
in the trial have any means to discover which arm a patient
is in, for example because the placebo and active drugs are
indistinguishable and allocation is via a central randomization
service. 96 reports (86%) of the 112 reports in which specific
measures to blind the outcome assessor were reported concern
trials in which patients were reported as blinded or in which
double blinding or triple blinding was reported. These results
suggest that, although blinding was performed at an earlier
stage, the investigators nevertheless decided to perform a specific
method of blinding the outcome assessor.
• AUTHORS’
COMMENTS AND CONCLUSIONS:
• Although blinding is essential to avoid bias, the reporting
of blinding is generally quite poor and reviews of trials that
test the success of blinding methods indicate that a high proportion
of trials are unblinded.
• The study
results might be explained in part by the insufficient coverage
of blinding in the Consolidated Standards for Reporting Trials
(CONSORT) statements. For example, three items of the CONSORT
statements are dedicated to the description of the randomization
procedure, whereas only one item is dedicated to the blinding
issue. The CONSORT statements mainly focus on reporting who
is blinded and less on the reporting of details on the method
of blinding, and this information is essential to appraise the
success of blinding.
• Some evidence
suggests that although participants are reported as blinded,
the success of blinding might be questionable. For instance,
in a study assessing zinc treatment for the common cold, the
blinding procedure failed, because the taste and aftertaste
of zinc was distinctive. And yet, tools used to assess the quality
of trials included in meta-analyses and systematic reviews focus
on the reporting of the blinding status for each participant
and rarely provide information on the methods of blinding and
the adequacy of the blinding method.
• There is
a need to strengthen the reporting guidelines related to blinding
issues, emphasizing adequate reporting of the method of blinding.
Delfini
Commentary
Lack of blinding appears to be a major source of bias in RCTs.
Just as well-done randomization and concealment of allocation
to the study groups decreases the likelihood of selection bias,
blinding of subjects and everyone working with the subjects
or study data to the assigned intervention (double-blinding)
decreases the likelihood of performance bias. Performance bias
occurs when patients in one group experience care or exposures
not experienced by patients in the other group(s) and the differences
in care affect the study outcomes. Lack of blinding may affect
outcomes in that:
- Unblinded subjects
may report outcomes differently from blinded subjects, have
different thresholds for leaving a study, and seek (and possibly
receive) additional care in different ways.
- Unblinded clinicians
may behave differently towards patients than blinded clinicians.
- Using unblinded
assessors may result in systematic differences in outcomes
assessment (assessment bias).
A number of studies
have shown that lack of blinding is associated with inflated
treatment effects.
In some cases blinding
may not be possible. For example, side effects or taste may
result in unblinding. The important point is that even if blinding
is not possible, the investigators do not get “extra”
validity points for doing the best they could (i.e., the study
should not be “upgraded”). |
Blinding and Objective Outcomes
03/28/2011
We provide some general references on blinding at Recommended Reading. A frequent question (or assumption) that we hear concerns lack of blinding and objective outcomes such as mortality. There appears to be a consensus that lack of blinding can distort subjective outcomes. However, there also appears to be a belief that lack of blinding is not likely to distort hard outcomes. We are not so sure.
In reviewing the literature on blinding, we find only one reference that actually attempts to address this question. Wood et al. found little evidence of bias in trials with objective outcomes.[1] Yet, as we know, absence of evidence is not evidence of absence. Therefore, anything that contradicts these findings raises the specter that we are not “distortion-free” when it comes to lack of blinding and hard outcomes.
The RECORD trial is an interesting case in point. Caregivers were not blinded, but adjudication was. However, Psaty and Prentice point out that lack of blinding might have affected which cases were submitted for adjudication, potentially causing a meaningful change in outcomes.[2] We wrote a letter in response that pressed even further for the importance of blinding.[3] You can read more about this particular case in the DelfiniClick that immediately follows below.
A classic study is Chalmers’ review of the effect of randomization and concealment of allocation on the objective outcome, mortality, in 145 trials of interventions for acute myocardial infarction.[4] Although this study did not focus on blinding beyond the concealment phase of studies, it may help shine some light on this area. Chalmers showed (and others confirmed later) that lack of effective allocation concealment is associated with changes in study results. It is also possible that lack of blinding of patients and investigators in studies with objective outcome measures can affect patient management and patient experiences, thereby distorting results.
In Salpeter et al., a meta-analysis of hormone replacement therapy, mortality was an outcome of interest.[5] The trials were analyzed by mean age of women in the trials (creating one of several serious threats to validity) to create a “younger women” and an “older women” analysis set. No benefit was shown in the “older women” trials, but benefit was shown in the “younger women” set. Interestingly, many of the studies in the younger women group were open-label, but none were open-label in the older women group. Although clearly not proof, this is intriguing and potentially suggestive of a distorting effect of non-blinding in studies with objective outcome measures.
To us, irrespective of any hard evidence of the impact of lack of blinding on hard outcomes, the fact that a distortion is possible is of concern. If it is true that clinicians’ interventions can have an impact on mortality, then it is entirely possible that knowing which treatment a patient is receiving could have an impact on mortality outcomes. We know that the placebo effect is real. Knowledge of a patient’s treatment could bring that effect into play and/or change the behavior of clinicians, investigators, patients or others involved in clinical trials, and that could affect a hard outcome such as mortality.
As critical appraisers we want to know—
Who was blinded (including an express statement about blinded assessment)?
How was blinding managed?
Was the blinding likely to have been successful?
1. Wood L, et al. Empirical evidence of bias in treatment effect estimates in controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ. 2008 Mar 15;336(7644):601-5. Epub 2008 Mar 3. PubMed PMID: 18316340.
2. Psaty BM, Prentice RL. Minimizing bias in randomized trials: the importance of blinding. JAMA. 2010 Aug 18;304(7):793-4. PubMed PMID: 20716744. [See below for DelfiniClick on this study.]
3. Strite SA, Stuart ME. Importance of blinding in randomized trials. JAMA. 2010 Nov 17;304(19):2127-8; author reply 2128. PubMed PMID: 21081725.
4. Chalmers TC et al. Bias in Treatment Assignment in Controlled Clinical Trials. N Engl J Med 1983;309:1358-61. PMID: 6633598.
5. Salpeter SR, et al. Mortality associated with hormone replacement therapy in younger and older women. J Gen Intern Med July 2004;19:791-804. PMID: 15209595 |
Open-Label
Trials and Importance of Blinding (Even with Hard Outcomes)
One of our
heroes is Dr. Bruce Psaty, a brilliant and dedicated University
of Washington researcher Sheri worked with years ago during
her stint at the Group Health Cooperative Center for Health
Studies (now retitled, Group Health Research Institute). Bruce
does some really interesting and important work, and frequently
his efforts add to our collection of cases for critical appraisal
training.
In a recent issue of JAMA, he and Dr. Ross Prentice,
a statistician and leader at Fred Hutchinson Cancer Research
Center, address, “Minimizing Bias in Randomized Trials:
The Importance of Blinding.”[1] They explore the “prospective
randomized open trial with blinded endpoints,” and examine
other evidence supporting the importance of investigator-blinding
in clinical trials. In their commentary, they examine the RECORD
trial (Rosiglitazone Evaluated for Cardiac Outcomes and Regulation
of Glycemia in Diabetes) which was an open-label trial with
blinded assessment. They report that it was determined that
event rates for myocardial infarction in the control group were
unexpectedly low, and they summarize some findings from an independent
review by the FDA which identified myriad problems with case
report forms created prior to any blind assessment. The FDA
review resulted in a re-analysis, using the available readjudicated
case information, with the end result that the outcome of non-significance
for risk of MI in the original study report changed to a statistically
significant difference, the results of which were reported to
be “remarkably close to results” reported in the
original meta-analysis that raised concerns about rosiglitazone
and cardiovascular risk.[2]
In our letter to JAMA,[3] we express that Drs.
Psaty and Prentice add to evidence on the importance of blinding,
and we raise some points to carry this further, including an
example specific to the commentary, that addresses potential
for unbalancing study groups.
We want to expand upon this to make two basic
key points:
1. As a general principle, nondifferential errors
between treatment groups can, in fact, systematically bias summary
measures. Example: Inaccurate measuring instruments equally
applied. What if a question on a survey instrument fails to
capture an outcome of interest? It might show no difference
between groups, when a true difference actually exists.
2. Nondifferential errors may be nondifferential
in appearance only. Missing data are a case in point. Missing
data points are frequent problems in clinical trials. Some reviewers
are unconcerned by missing data provided that the percent of
missing data is balanced between groups. We disagree. Just because
data may be missing in equal measure doesn’t mean that
the people for whom the data are missing are similar between
groups. If they are different, the groups may become unbalanced
in key prognostic variables that could distort the results.
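Point 1 can be illustrated with a short calculation. The Python sketch below applies the same imperfect outcome measurement to both groups and compares the true and observed relative risks. All numbers (true risks, sensitivity, specificity) are hypothetical values chosen for illustration and do not come from the RECORD trial or the JAMA commentary.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Expected proportion classified as having the outcome by an imperfect instrument."""
    return sensitivity * true_risk + (1 - specificity) * (1 - true_risk)


# Hypothetical true event risks in each group
true_treated, true_control = 0.10, 0.20

# The same measurement error is applied to both groups (nondifferential)
sens, spec = 0.70, 0.95

obs_treated = observed_risk(true_treated, sens, spec)
obs_control = observed_risk(true_control, sens, spec)

true_rr = true_treated / true_control
obs_rr = obs_treated / obs_control

print(f"True relative risk:     {true_rr:.2f}")
print(f"Observed relative risk: {obs_rr:.2f}")
print(f"True risk difference:     {true_treated - true_control:.3f}")
print(f"Observed risk difference: {obs_treated - obs_control:.3f}")
```

With these assumed values the relative risk moves from 0.50 to about 0.64, and the risk difference shrinks from 10 to 6.5 percentage points, even though the measurement error is identical in the two groups. A true difference is thus understated, and with a poor enough instrument it could disappear.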
In our letter, we also point out that unblinded
investigators may treat patients differently, which is a performance
bias. Patients with differing care experiences could have dramatically
different outcomes, including myocardial infarction, in keeping
with the RECORD study example.
We are grateful to Drs. Psaty and Prentice for
their work and agree that they have put a greater spotlight
on “likely important departures from the crucial equal
outcome ascertainment requirement under open-label trial designs.”[1]
We hope from their work and our letter that people will increasingly
see the important role blinding plays in clinical trial design
and execution.
References
1. Psaty BM, Prentice RL. Minimizing bias in randomized trials:
the importance of blinding. JAMA. 2010 Aug 18;304(7):793-4.
PubMed PMID: 20716744.
2. Nissen SE, Wolski K. Effect of rosiglitazone
on the risk of myocardial infarction and death from cardiovascular
causes. N Engl J Med. 2007 Jun 14;356(24):2457-71. Epub 2007
May 21. Erratum in: N Engl J Med. 2007 Jul 5;357(1):100. PubMed
PMID: 17517853.
3. Strite SA, Stuart ME. Importance of Blinding
in Randomized Trials: To the Editor. JAMA. 2010 Nov 17;304(19):2127-8.
|
Blinding In Surgical Trials — It is Through Blinding
We Become Able To See
11/17/2010
Blinding
is an important consideration when evaluating a study. Without
blinding, the likelihood of bias increases. Bias occurs when
patients in one group experience care or exposures not experienced
by patients in the other group(s), and the differences in care
affect the study outcomes. Lack of blinding may be a major source
of this type of bias in that unblinded clinicians who are frequently
“rooting for the intervention” may behave differently
than blinded clinicians towards patients whom they know to be
receiving the study drug or intervention being studied. The
result is likely to be that in unblinded studies, patients may
receive different or additional care. Unblinded subjects may
be more likely to drop out of a study or seek care in ways that
differ from blinded subjects. Unblinded assessors may also be
“rooting for the intervention” and assess outcomes
differently from blinded assessors.
How much difference
does blinding make? Jüni et al. reviewed four studies that
compared double blinded versus non-blinded RCTs and attempted
to quantify the amount of distortion (bias) caused by lack of
double blinding [1]. Overall, the overestimation of effect was
about 14%. The largest study reviewed by Jüni et al. assessed the methodological
quality of 229 controlled trials from 33 meta-analyses and then
analyzed, using multiple logistic regression models, the associations
between those assessments and estimated treatment effects [2].
Trials that were not double-blind yielded on average 17% greater
effect, 95% CI (4% to 29%), than blinded studies (P = .01).
Lack of double
blinding is frequently found in surgical trials and results
in uncertain evidence because of the problems stated above.
A case study helps to illustrate this. A recent multicenter
RCT, the Spine Patient Outcomes Research Trial (SPORT)[3]
was a non-blinded trial that serves as an interesting case
study of the blinding issues that arise when a surgical intervention
is compared to a non-surgical intervention, and blinding is
not attempted. The trial included patients with persistent
(at least 6 weeks) disk-related pain and neurologic symptoms
(sciatica) who were randomized to undergo diskectomy or receive
usual care (not standardized but frequently including patient
education, anti-inflammatory medication, and physical therapy,
alone or in combination). There were a number of problems
with this study including lack of power, poor control of non-study
interventions, a high proportion of patients who crossed over
between treatment strategies (43% of those randomized to surgery
had not undergone surgery by 2 years, and 42% of those randomized
to conservative care did receive surgery) and lack of blinding. The degree
of missing data was 24%-27% without a true intention-to-treat
analysis. Of great interest was an editorial that dealt with
the problem of non-blinding in surgical studies. The editorialist,
Flum, makes the following points [4]:
- While the
technique of sham intervention is well accepted in studies
of medications using inactive pills (placebos), simulated
acupuncture, and nontherapeutic conversation in place of
therapeutic psychiatric interventions, it has only occasionally
been applied to surgical trials. This is unfortunate because
the use of sham controls has been critical in understanding
just how much patient expectation influences outcomes after
an operation.
- A sham-controlled
trial would be particularly relevant for spine surgery since
the most commonly occurring and relevant outcomes are subjective.
- Patients choosing
surgical options may have high expectations. They may include
a higher level of emotional “investment” in
surgical care compared with usual care based on the level
of commitment resulting from a decision to have an operation
and get through recovery. After the patient has accepted
the risks of surgical intervention, the desire for improvement
may drive perceptions about improvement.
- Patients
who opt for surgery may also differ from patients who decline
surgery in their beliefs regarding the benefits of invasive
interventions.
- The surgeon’s
expectations and direction are likely to play an important
role in patient improvement.
- Given the
proliferation of operative procedures for the treatment
of subjective complaints like back pain, the need for sham
controlled trials has never been greater.
Flum goes on to
present multiple examples of the power of suggestion and the
problem of doing non-blinded trials in the field of surgery.
Observational trials have often reported procedural success,
but sham-controlled trials for the same conditions demonstrate
how much of that success is due to the placebo effect.
- Example 1 —
Ligation of Internal Mammary: After multiple
observational studies suggesting that ligation of the internal
mammary artery was helpful in patients with coronary disease,
Cobb et al randomized patients to operative arterial ligation
or a sham procedure. Both groups improved after the intervention,
but there were similar, if not greater, improvements in subjective
measures such as exercise tolerance and nitroglycerin use
in the sham surgical group.
- Example 2 —
Osteoarthritic Knee Surgery — and 3
— Osteoarthritic Knee Joint Irrigation:
After multiple case series reported that patients with osteoarthritis
of the knee improve after arthroscopic surgery, Moseley et
al demonstrated just how much of that effect is related to
the hopes, expectations, and beliefs of the patient. The investigators
randomized 180 patients to undergo arthroscopy with debridement,
arthroscopy with lavage, or sham arthroscopy. The power of
expectation was strong and patients were unable to determine
if they had been assigned to the treatment or sham groups—
and all groups improved. At 2 years after randomization, all
patients reported comparable pain scores and functional scores.
Another sham-controlled study in patients with knee osteoarthritis
demonstrated that patients benefit equally from irrigation
of the joint and from sham irrigation.
- Example 4 —
Parkinson’s Disease: Researchers found similar
improvements in quality of life after direct brain injections
of embryonic neurons or placebo in patients with advanced
Parkinson’s disease.
- Example 5 —
Transmyocardial Laser Revascularization in HF:
Heart failure patients undergoing transmyocardial laser revascularization
or sham procedures had equal improvements in subjective outcomes.
- Example 6 —
Hernia: After hernia repair, there was equal
improvement in pain control after cryoablation of nerves or
sham interventions.
- Examples 7-9
— Laparoscopic Interventions: Multiple
case series have reported benefit on subjective outcomes such
as pain control, function, and readiness for discharge with
laparoscopic cholecystectomy, colon resection, and appendectomy
compared with conventional approaches. Bias arises when the
clinical care team influences patient and discharge expectations
through coaching, communication, and management. Randomized
trials of these three procedures that included blinding of
both the patients and the discharging clinicians to the treatment
that patients received by placing large, side-to-side abdominal
wall dressings demonstrate little or no difference in patients
reaching discharge criteria. A reasonable conclusion is that
when the clinician’s expectations and “coaching”
were removed by placing a large bandage on the abdominal wall,
the subjective benefits disappeared. Flum concludes that studies
not addressing both patient and clinician expectation on subjective
outcomes do not inform the clinical community about the true
role of the intervention.
Delfini
Commentary
Blinding of subjects and everyone working with the
subjects or study data to the assigned intervention (double-blinding)
decreases the likelihood of bias. Bias may be more likely to
occur when evaluating subjective outcomes such as pain, satisfaction,
and function in non-blinded studies, but it has also been reported
with objective outcomes such as mortality. When dealing with
subjective outcomes, as Flum points out, it is critical to distinguish
the effect of the intervention from the effect of the patient’s
expectation of the intervention. The only way to distinguish
the effect of a patient’s positive expectations of an
operation from the intervention itself is to blind patients
to the treatment they receive and randomize them to receive
the intervention of interest or to receive a sham intervention
(placebo). Yet we frequently hear, “But blinding is not
possible in surgical studies.” Frequently the argument
is raised that subjecting people to anesthesia and sham surgery
is not ethical. However, conducting clinical trials employing
methods that result in avoidable fatal flaws is also problematic.
Flum’s position is that when the risk of a placebo does
not exceed a threshold of acceptable research risk and if the
knowledge to be gained is substantial, a sham-controlled trial
is needed and is ethical. He reasons that ethical justification
of placebo-controlled trials is based on the following considerations:
- Invasive procedures
are associated with risks.
- There are great
harms created by conducting studies that are of uncertain
validity.
- Establishing
community standards based on uncertain evidence is more likely
to result in more harm than good.
- Sham-controlled
trials are justified when uncertainty exists among clinicians
and patients about the merits of an intervention.
The SPORT trial
draws attention to the problem of non-blinding in surgical trials.
This was a very expensive, labor-intensive study that provides
no useful efficacy data. Research subjects were undoubtedly
told this study would provide answers regarding the relative
efficacy of surgery vs conservative care for lumbar spine disease.
The authors of the SPORT trial state that a sham-controlled
trial was impractical and unethical, possibly — according
to Flum — because the risk of the sham would include general
anesthesia (to truly blind the patients). He would argue that
in this case blinding which would require anesthesia is the
only way that valid, useful evidence could have been created.
Even though we graded the study U (uncertain validity and usefulness)
and would not use the results to inform decisions about efficacy
or effectiveness because of the threats to validity, the study
does report information regarding risks of surgery that may
be of great value to patients.
-----------
1 Jüni P,
Altman DG and Egger M. Systematic reviews in health care: Assessing
the quality of controlled clinical trials. BMJ. 2001;323;42-46.
PMID: 11440947
2 Schulz KF, Chalmers
I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions
of methodological quality associated with estimates of treatment
effects in controlled trials. JAMA 1995;273:408-12. PMID:
7823387.
3 Weinstein JN,
Tosteson TD, Lurie JD, et al. Surgical vs nonoperative treatment
for lumbar disk herniation: the Spine Patient Outcomes Research
Trial (SPORT): a randomized trial. JAMA. 2006;296:2441-2450.
PMID: 17119141
4 Flum DR. Interpreting
Surgical Trials With Subjective Outcomes: Avoiding UnSPORTsmanlike
Conduct. JAMA. 2006 Nov 22/29;296(20):2483-2485.
PMID: 17119146
|
| The Importance of Blinded Assessors in RCTs
We have previously summarized the problems associated with lack
of blinding in surgical (and other) studies — see Blinding
in Surgery Trials in a previous DelfiniClick™.
The major problem with unblinded studies is that the outcomes
in the intervention group are likely to be falsely inflated
because of the biases introduced by lack of blinding.
Recently a group
of orthopedists identified and reviewed thirty-two randomized,
controlled trials published in The Journal of Bone and Joint
Surgery between 2003 and 2004 to evaluate the effect of blinded
assessment vs non-blinded assessment on reported outcomes [1].
Results
- Sixteen of
the thirty-two randomized controlled trials did not report
blinding of outcome assessors when blinding would have been
possible.
- Among the studies
with continuous outcome measures, unblinded outcomes assessment
was associated with significantly larger treatment effects
than blinded outcomes assessment (standardized mean difference,
0.76 compared with 0.25; p = 0.01).
- In the studies
with dichotomous outcomes, unblinded outcomes assessments
were associated with significantly greater treatment effects
than blinded outcomes assessments (odds ratio, 0.13 compared
with 0.42; p < 0.001).
- This translates
into a relative risk reduction of 38% for blinded outcome
assessments compared with 71% for unblinded outcome assessments
(a difference of 33%).
Conclusion
Unblinded outcomes assessment dramatically inflates the reported
benefit of treatments.
Delfini
Commentary
This is yet another study pointing out the importance of blinding.
Based on this and other similar studies it is our conclusion
that studies or the results of studies without blinded assessors
are grade U or at best grade B-U (see evidence-grading scale
here).
1. Poolman RW,
Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar F, Bhandari
M. Reporting of outcomes in orthopaedic randomized trials: does
blinding of outcome assessors matter? J Bone Joint Surg Am.
2007 Mar;89(3):550-8.
PMID: 17332104.
|
| Testing
the Success of Blinding
03/23/09
Blinding in clinical
trials of medical interventions is important. Researchers have
reported that lack of blinding is likely to overestimate benefit
by up to a relative 72%. [1-4] Optimal reporting of blinding
entails who was blinded, how the blinding was performed and
whether the blind was likely to have been successfully maintained.
To assess the latter,
investigators, at times, attempt to test the success of blinding
following a clinical trial by asking clinicians and/or patients
to identify which arm they believed they were assigned to. However,
the results of this attempt may be misleading due to chance
and there is a strong possibility of confounding due to pre-trial
hunches about efficacy as described by Sackett in a letter to
the BMJ, "Why not test success of blinding?" PMID:
15130997.[5]
To illustrate Sackett's
point with a brief scenario, let us say that a new agent is
approved and interest about the agent is running high. A clinician
participating in a new clinical trial of that agent who is already
predisposed to believe the drug works is likely to guess all
treatment successes were a result of patients being assigned
to this arm. If an agent actually is effective, then it will
be likely to appear that blinding was not successful even if
it was.
Sackett describes
the reverse scenario here: http://www.bmj.com/cgi/content/full/328/7448/1136-a
1. Kjaergard LL,
Villumsen J, Gluud C. Reported Methodologic Quality and
Discrepancies between Large and Small Randomized Trials in
Meta-Analyses. Ann Intern Med. 2001;135:982-989. PMID: 11730399
2. Poolman RW,
Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar F,
Bhandari M. Reporting of outcomes in orthopaedic randomized
trials: does blinding of outcome assessors matter? J Bone
Joint Surg Am. 2007 Mar;89(3):550-8. PMID: 17332104.
3. Schulz KF, Chalmers
I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions
of methodological quality associated with estimates of treatment
effects in controlled trials. JAMA. 1995;273:408-12. PMID:
7823387
4. Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001 Jul 7;323(7303):42-6. Review. PubMed PMID: 11440947; PubMed Central PMCID: PMC1120670
5. Sackett DL. "Why not test success of blinding?"
(letter to the BMJ).
PMID: 15130997
|
Attrition
Bias: Intention-to-Treat Basics
Updated 10/11/2011 Loss Table
In
general, we approach critical appraisal of RCTs by evaluating
the four major components of a trial— study population
(including how established), the intervention, the follow-up
and the assessment. There is very little controversy about the
process of randomizing in order to distribute known and unknown
confounders as equally as possible between the groups. There
also appears to be general understanding that the only difference
between the two groups should be what is being studied. However,
what seems to receive much less attention is the considerable
potential for bias that occurs when data is missing from subjects
because they do not complete a study or are lost to follow-up,
and investigators use models to deal with that missing data.
The only way to prevent this bias is to have data on all randomized
subjects. This is frequently not possible. And bias creeps in.
Intent-to-treat
designs that provide primary outcome data on all randomized
patients are the ideal. All patients randomized are included
in the analysis — and patients are analyzed in the same
groups to which they were randomized. Unfortunately we are rarely
provided with all of this information, and we must struggle
to impute the missing data—i.e., we must do our own sensitivity
analysis and recalculate p-values based on various assumptions
(e.g., worst case scenario, all missing subjects fail, etc.)
— when possible! All too often, papers do not report sufficient
data to perform these calculations, or the variables do not
lend themselves to this type of analysis because they cannot
be made binomial, and we are left with the authors’ frequently
inadequate analysis. In such cases we have to assign a low study
grade because we remain too uncertain to draw cause-and-effect
conclusions from the data.
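As an illustration of the kind of recalculation described above, the following is a minimal sketch of a worst-case sensitivity analysis for a binary outcome. The trial counts are invented, the helper names are our own, and the two-sided Fisher exact test is one reasonable choice of test rather than the only one. In the worst case, every missing subject in the treatment arm is counted as a failure and every missing subject in the control arm as a success.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def worst_case_table(fail_t, completed_t, missing_t,
                     fail_c, completed_c, missing_c):
    """Treatment arm: missing = failures. Control arm: missing = successes."""
    return (fail_t + missing_t, completed_t - fail_t,
            fail_c, completed_c - fail_c + missing_c)


# Hypothetical trial: 100 randomized per arm, 90 completers per arm,
# 18 failures on treatment and 36 failures on control among completers.
p_completers = fisher_exact_two_sided(18, 72, 36, 54)
p_worst_case = fisher_exact_two_sided(*worst_case_table(18, 90, 10, 36, 90, 10))

print(f"completers only: p = {p_completers:.4f}")
print(f"worst case:      p = {p_worst_case:.4f}")
```

With these invented counts, a result that is statistically significant among completers loses significance once the 10% of missing subjects per arm are imputed under the worst-case assumption.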
We see many studies
where the analysis is accomplished using Kaplan-Meier estimates
and other models to deal with excluded patient data. As John
Lachin has pointed out, this type of “efficacy subset”
analysis has the potential for Type I errors (study findings=significant
difference between groups; truth=no significant difference)
as large as 50 percent or higher [1]. Lachin and others have
shown that the statistical methods used when data is censored
(meaning not included in analysis either through patient discontinuation
or data being removed), frequently assume that —
- Missing data
is missing at random to some degree;
- It is reasonable
to impute missing data using assumptions from non-missing
data; and,
- The bias from
efficacy subset analysis is not a major factor.
We want to see
data on all patients randomized. When patients are lost to follow-up
or do not complete a study, we want to see intent-to-treat analyses
with clear statements about how the missing data is imputed.
We agree with Lachin’s suggestion that the intent-to-treat
design is likely to be more powerful (than statistical modeling),
and especially powerful when an effective treatment slows progression
of a disease during its administration—i.e., when a patient
benefits long after the patient becomes noncompliant or the
treatment is terminated. Lachin concludes that, “The
bottom line is that the only incontrovertibly unbiased study
is one in which all randomized patients are evaluated and included
in the analysis, assuming that other features of the study are
also unbiased. This is the essence of the intent-to-treat philosophy.
Any analysis which involves post hoc exclusions of information
is potentially biased and potentially misleading.”
We also agree with
an editorial comment made by Colin Begg who states that, “The
properly conducted randomized trial, where the primary endpoint
and the statistical method are specified in advance, and all
randomized patients contribute to the analysis in an intent-to-treat
fashion, provides a structure that severely limits our opportunity
to obscure the facts in favor of our theories.” Begg concludes
by supporting Lachin’s assessment: “He is absolutely
correct in his view that the recent heavy emphasis on the development
of missing data methodologies in statistical academic circles
has led to a culture in which poorly designed studies with lots
of missing data are perceived to be increasingly more acceptable,
on the flimsy notion that sophisticated statistical modeling
can overcome poor quality data. Mundane though it may sound,
I strongly support his [Lachin’s] assertion that `…the
best way to deal with the problem (of missing data) is to have
as little missing data as possible…’ Attention to
the development of practical strategies for obtaining outcome
data from patients who withdraw from trials, notably short-term
trials with longitudinal repeated measures outcomes, is more
likely to lead to improvement in the quality of clinical trials
than the further development of statistical techniques that
impute the missing data. [2]”
It would be difficult
to express our concern more eloquently than what is stated above.
The two examples below amplify this.
Example 1: A group
of rheumatologists were uncomfortable with Kaplan-Meier statistical
methods for analysis of outcomes in rheumatology studies. Their
concern was that, even though Kaplan-Meier methods are frequently
used to analyze cancer data, very little research has been done
to validate the use of Kaplan-Meier methods for drug studies
(i.e., endpoints such as stopping medication because of side-effects
or lack of efficacy). They tested three assumptions upon which
Kaplan-Meier survival analysis depends:
1. Patients recruited
early in the study should have the same drug survival (i.e.
time to determination of lack of efficacy or onset of side-effects)
as those recruited later;
2. Patients receiving their first drug later in the study should
have the same drug survival characteristics as those receiving
it earlier; and,
3. Drug survival characteristics should be independent of the
time that a patient has been in the study before receiving the
disease modifying drug.
To examine the
above assumptions, the authors plotted survival curves for the
different groups (i.e. subjects recruited early vs those recruited
later) and showed that, in each case, the drug survival characteristics
were statistically different between the two groups (p<0.01).
They conclude, as did Lachin, that it is not possible to prove
that survival analysis is always invalid (even though they did
show in this case the Kaplan-Meier analysis was invalid). However,
this group feels that the onus of proof is on those who advocate
for drug survival analysis—i.e., using statistical modeling
rather than presenting all the data so that the reader can do
an ITT analysis or sensitivity analysis[3].
Example 2: A similar
situation occurred when a group of geriatricians became concerned
that many different, and sometimes inappropriate, statistical
techniques are used to analyze the results of randomized controlled
trials of falls prevention programs for elderly people. To evaluate
this, they used raw data from two randomized controlled trials
of a home exercise program to compare the number of falls in
the exercise and control groups using two different survival
analysis models (Andersen-Gill and marginal Cox regression)
and a negative binomial regression model for each trial.
In one trial, the
three different statistical techniques gave similar results
for the efficacy of the intervention but, in the second trial,
underlying assumptions were violated for the two Cox regression
models. Negative binomial regression models were easier to use
and more reliable.
Proportional Hazards
and Cox Regression Models: The authors point out that proportional
hazards or Cox regression models can test
whether several factors (for example, intervention group, baseline
prognostic factors) are independently related to the rate of
a specific event (e.g., a fall). However, using survival probabilities
to analyze time to fall events assumes that, at any time, participants
who are censored before the end of the trial have the same risk
of falling as those who complete the trial. An assumption of
proportional hazards models is that the ratio of the risks of
the events in the two groups is constant over time and that
the ratio is the same for different subgroups of the data, such
as age and sex groups. This is known as the proportionality
of hazards assumption. No particular distribution is assumed
for the event times, that is, the time from the trial start
date for the individual to the outcome of interest (in this
case, a fall event). This contrasts with a situation such as death
following cardiac surgery, where one would assume a greater
frequency of deaths close to the surgical event.
Andersen-Gill and
marginal Cox proportional hazards regression: These models are
used in survival analyses when there are multiple events per
person in a trial. The Andersen-Gill extension of the proportional
hazards regression model and the marginal proportional hazards
regression model are both statistical techniques used for analyzing
recurring event data.
Negative Binomial
Regression: The negative binomial regression model can also
be used to compare recurrent event rates in different groups.
It allows investigation of the treatment effect and confounding
variables, and adjusts for variable follow-up times by using
time at risk.
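To show what "time at risk" contributes, here is a minimal sketch using invented participants. It computes a crude falls-per-person-year rate ratio, which is not negative binomial regression itself but illustrates the same adjustment for unequal follow-up. The function name and all data are hypothetical.

```python
def falls_per_person_year(participants):
    """participants: list of (falls, years_at_risk) tuples."""
    total_falls = sum(falls for falls, _ in participants)
    total_years = sum(years for _, years in participants)
    return total_falls / total_years


# Hypothetical data: the exercise group was followed for half as long.
exercise = [(1, 0.5), (0, 0.5), (1, 0.5), (0, 0.5)]  # 2 falls over 2.0 years
control = [(1, 1.0), (1, 1.0), (2, 1.0), (0, 1.0)]   # 4 falls over 4.0 years

# Ignoring follow-up time: falls per participant.
naive_ratio = (2 / len(exercise)) / (4 / len(control))

# Accounting for follow-up time: falls per person-year.
rate_ratio = falls_per_person_year(exercise) / falls_per_person_year(control)

print(f"naive ratio (falls per person):     {naive_ratio:.2f}")
print(f"rate ratio (falls per person-year): {rate_ratio:.2f}")
```

In this made-up example the per-person comparison suggests that exercise halves the number of falls, while the per-person-year comparison shows identical fall rates; the apparent benefit came only from shorter follow-up.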
In the first study
of falls in the elderly, all three statistical approaches indicated
that falls were significantly reduced by 40% (Andersen-Gill
Cox model), 44% (marginal Cox model) and 39% (negative binomial
regression model) in the exercise group compared with those
in the control group. The tests for the proportionality of hazards
for both types of survival regression models indicated that
these models “worked” for the recurring falls problem.
In the second study,
there was evidence that the proportional hazards assumption
was violated in the Andersen-Gill and marginal Cox regression
models (proportional hazards test). The authors point out that
survival analysis is not valid if participants who are censored
do not have the same rate of outcome (risk of falling) as those
who continue in the trial. The authors note, citing a supporting
reference, that those who do not complete a falls prevention
trial are at higher risk of falling and that, if fewer participants
withdraw from one group than from another, this may point to a
study-related cause for the difference in discontinuation, and
results may be biased.
Summary
Unfortunately, readers are in a very difficult position when
evaluating the quality of studies that use survival analyses
and statistical modeling because the assumptions used in the
models are almost never given and the missing data points are
frequently quite large. Delfini uses a conservative approach.
We look for information about the model, percent of subjects
whose data are missing from analysis, differential loss between
the groups, censored information and reasons for loss to follow-up.
We have been unable to find any good evidence-based criteria
to help guide us in considering cut-offs for validity. We use
the following in evaluating how loss of subjects’ data
affects the validity of the study. While the suggestions below
are not evidence-based, they are conservative in comparison
to some EBM suggestions we have seen, and we have run some calculations
trying to help guide our choices. So caveat emptor!
Delfini
Non-evidence-based Advice on Reaction to Missing Data Points
from Non-completers and Those Lost to Follow-up:
Are there enough patient drop-outs or missing data points that the study’s validity is threatened? Missing data is one issue. Another key issue is whether an imbalance between groups has resulted or whether there is an imbalance within groups as randomized as compared to completers. If there is no imbalance between and within various study subgroups (as randomized compared to completers, patients lost to follow-up, etc.), then this may present no threat, except in instances in which statistical significance is not reached, because it may be that not enough people remained to show a statistically significant difference if one exists. Baseline comparison of various subgroups is advised where possible.
Consider the following (consider percents to be approximations) ―
- Likely minimal threat: <5% and no differential loss* (e.g., 1% vs 4%) (based on known instances where loss greater than 5% results in large P-value changes)
- Possible threat: >=5% but <10% and no differential loss*
- Threat in most cases: >=5% with differential loss*, or >=10% without differential loss, and without worst-case sensitivity analysis, or otherwise reasonable sensitivity analysis, conducted by authors or reviewers, in which statistical significance is confirmed
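The rule of thumb above can be sketched as a small function. This is only an illustration: the function name and return strings are our own, "differential loss" is left to the reviewer's judgment as a yes/no input, and the two cases the list does not address (loss under 5% with differential loss, and a higher-loss study whose significance is confirmed by sensitivity analysis) are flagged rather than guessed at.

```python
def missing_data_threat(pct_missing, differential_loss,
                        sensitivity_confirms_significance=False):
    """Apply the non-evidence-based rule of thumb to one study.

    pct_missing: approximate percent of randomized subjects missing.
    differential_loss: reviewer's judgment that loss differs between groups.
    sensitivity_confirms_significance: a worst-case (or otherwise reasonable)
        sensitivity analysis was done and significance held up.
    """
    if pct_missing < 0 or pct_missing > 100:
        raise ValueError("pct_missing must be between 0 and 100")

    if pct_missing < 5:
        if differential_loss:
            return "not addressed by the rule of thumb"
        return "likely minimal threat"

    if differential_loss or pct_missing >= 10:
        if sensitivity_confirms_significance:
            # Assumption: the list only defines this category when no
            # confirming sensitivity analysis exists.
            return "not addressed by the rule of thumb"
        return "threat in most cases"

    return "possible threat"


print(missing_data_threat(3, differential_loss=False))
print(missing_data_threat(7, differential_loss=False))
print(missing_data_threat(12, differential_loss=False))
```

Because the percents are approximations, results near a cut-off should be read as prompts for closer review rather than as verdicts.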
|
1. Lachin JM. Statistical
considerations in the intent-to-treat principle. Control Clin
Trials 2000;21:167–189. PMID: 11018568
2. Utley M. et
al. Potential bias in Kaplan-Meier survival analysis applied
to rheumatology drug studies. Rheumatology 2000;39:1-6.
3. Robertson,
MC et al. Statistical Analysis of Efficacy in Falls Prevention.
Journal of Gerontology 2005;60:530–534. |
Intention-to-Treat
Analysis & the Effects of Various Methods of Handling Missing
Subjects: The Case of the Compelling Rationale
08/04/08
The goals of Intention-to-Treat
Analysis (ITT) are to preserve the benefits of randomization
and mitigate bias from missing data. Not doing so is equivalent
to changing a study design from a randomized controlled trial
(RCT), which is an experiment, into a study with many features
of a cohort design, thus resulting in many of the problems inherent
in observational studies. For example, removal or attrition
of patients after randomization (eg, through disqualification,
a decision to not include in the analysis, discontinuations,
missingness, etc.) may systematically introduce bias, or bias
may be introduced through various aspects related to the interventions
used.
In ITT analysis,
all patients are included in the analysis through an assignment
of a value for those missing final data points. For background
on this, get basic information above
and in our EBM
tips, plus the table of contents on this page for further
reading.
The purpose of
this Click is to provide some resistance to the concept of a
“compelling rationale” for excluding patients from
analysis. Sometimes researchers come up with a seemingly compelling
rationale for removing patients from analysis; but, as several
EBM experts suggest, “sample size slippages” put
the study on a slippery slope.
Examples
Patients
Excluded Pre-Treatment
Some researchers consider it reasonable to exclude patients
who die before a treatment or before the treatment could take
effect since clearly the treatment was not responsible. If
groups are balanced, such a move should be considered to be
unnecessary because differences unrelated to treatment should
occur equally in each group, except by chance. One
wouldn’t think to do so in a placebo group, and yet,
to keep from introducing a bias by treating groups differently,
except for the intervention or exposure under study, this
would need to be done in the placebo group. The rationale
is the same.
Case in point:
imagine a study comparing surgery to medical treatment. As
pointed out by Hollis and Campbell, if patients assigned to
surgery but not medical therapy were removed because of dying
prior to the intervention, this would create a falsely low
mortality rate in the surgical group.[1] Schulz and Grimes
clarify that this is unnecessary if the study is successfully
randomized, as randomization balances non-attributable deaths.
[2]
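A short worked example, using invented numbers, shows how the falsely low mortality rate arises. Assume 200 patients are randomized to each arm, 10 in each arm die before treatment could begin, 20 in each arm die afterward, and the treatments are in truth equally effective. The function name and all counts are hypothetical.

```python
def mortality(deaths_before, deaths_after, randomized, exclude_early_deaths):
    """Mortality in one arm, with or without post-randomization exclusion."""
    if exclude_early_deaths:
        # Early deaths are dropped from both numerator and denominator.
        return deaths_after / (randomized - deaths_before)
    return (deaths_before + deaths_after) / randomized


# Intention-to-treat: every randomized patient counts in both arms.
surgery_itt = mortality(10, 20, 200, exclude_early_deaths=False)
medical_itt = mortality(10, 20, 200, exclude_early_deaths=False)

# Pre-operative deaths removed from the surgical arm only.
surgery_excluded = mortality(10, 20, 200, exclude_early_deaths=True)

print(f"ITT: surgery {surgery_itt:.1%}, medical {medical_itt:.1%}")
print(f"With exclusions: surgery {surgery_excluded:.1%}, medical {medical_itt:.1%}")
```

Under these assumptions both arms have 15% mortality when analyzed as randomized, but removing pre-operative deaths from the surgical arm alone makes surgery look better at about 10.5%, even though nothing about the treatments differs.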
Patients
Determined Ineligible Post-randomization
Some investigators remove patients from analysis who are found
post-randomization to be, in fact, ineligible for the study. Why
would this be a problem if uniformly applied to both groups?
Schulz and Grimes argue that discovery of ineligibility is
“probably not random.” They point out that there
is the potential for a) greater attention paid to those not
responsive to treatment or having side effects; b) systematic
removal of subjects’ data; and, c) physicians to withdraw
patients if they “think” they were randomized
to wrong group. They state that there is a possible reduction
of bias if this is done fully blinded and equally between
groups, but stress that it is best not done at all, pointing
out that such problems should even out if the groups are truly
balanced in the first place due to effective randomization.
Excluding
Patients Post-randomization Who Don’t Pick Up Medication
Frequently, we see that investigators have defined their intention-to-treat
population as being all patients who filled a study prescription
— and then claim to have performed ITT analysis. Firstly,
this should not be called an ITT-analysis — it is more
correctly a modified ITT. Secondly, a problem with excluding
patients after randomization who have not picked up their
prescription is that it allows choice to enter into the experiment,
and choice may be related to differences in the characteristics
(prognostic factors) of individuals who choose to pick up
their medications as compared to those who do not.
Also, there is
always a possibility that some patients are systematically
discouraged from picking up their medication. If there is
a differential loss in those not picking up their medication,
a systematic bias is possible and is worrisome. If there is
no differential loss, including those who did not pick up
a study medication in the analysis should not be an issue
if groups were created through true randomization.
Excluding
Protocol Deviations
Schulz and Grimes present a case study of a trial of placebo
versus prophylactic antibiotics for IUD insertion in which
25% of the patients in the treatment group were found not to be compliant.
Why not exclude them from the analysis? In response, they
ask: what if those 25% were in better health
or would tolerate an IUD insertion more easily? Excluding them would leave the
treatment group systematically biased toward those
more susceptible to infection.
A Final
Example
One of our favorite musings on ITT analysis is presented by
Gerard E. Dallal, PhD on his website at http://www.jerrydallal.com/LHSP/itt.htm
Dallal reports
that Paul Meier (of Kaplan-Meier fame), then of the University
of Chicago, offered an example involving a subject in a heart
disease study where there is a question of whether his death
should be counted against the intervention or set aside. The
subject disappeared after falling off his boat. He had been
observed carrying two six-packs of beer on board before setting
off alone. Meier argues that most researchers would set this
event aside as unrelated to the treatment, while intention-to-treat
would require the death be counted against the treatment.
But suppose, Meier continues, that the beer is eventually
recovered and every can is unopened.
“Intention-to-treat
does the right thing in any case. By treating all events the
same way, deaths unrelated to treatment should be equally
likely to occur in all groups and the worst that can happen
is that the treatment effects will be watered down by the
occasional, randomly occurring outcome unrelated to treatment.
If we pick and choose which events should count, we risk introducing
bias into our estimates of treatment effects.” [3]
Key Points
- If groups are
balanced, most adjustments should be considered to be unnecessary.
- Randomization
is the best means of creating balanced groups.
- The effect of
removing patients from an analysis is a potential derandomization,
potentially leaving groups with differing prognostic variables.
- Investigators
should more appropriately deal with these issues in a sensitivity
analysis which can be reported as a secondary analysis.
References
1. Hollis S, Campbell F. What is meant by intention to treat
analysis? Survey of published randomised controlled trials.
BMJ. Vol 319. Sept 1999: 670-674. http://bmj.com/cgi/content/full/319/7211/670?maxtoshow=?eaf
NOTE: Delfini agrees that differential loss is important to
note, but even equivalent loss of greater than five percent
could be a threat to validity.
2. Schulz KF, Grimes
DA. Sample size slippages in randomised trials: exclusions and
the lost and wayward. The Lancet. Vol 359. March 2, 2002: 781-785.
PMID: 11888606
NOTE: Delfini stresses that the approach taken for missing values
should not give an advantage to the intervention.
3. Gerard E. Dallal,
PhD: http://www.jerrydallal.com/LHSP/itt.htm
accessed on 08/01/2008 |
Intention-to-Treat & Imputing Missing Variables: Last-Observation-Carried-Forward (LOCF)—When We Might Draw Reasonable Conclusions
09/23/2011
The principles of intention-to-treat (ITT) analysis require analyzing all patients in the groups to which they were assigned, regardless of whether they received their assigned intervention and regardless of whether they completed the trial. For those who do not complete the study or for whom data on endpoints are missing, a value is assigned; this is referred to as “data imputation.” As anything that systematically leads away from truth is a bias, imputing data necessarily introduces a bias. However, ITT analysis with imputation is generally considered the preferred analysis method because it is thought to help preserve the benefits of randomization and deal with problems of missing data.
Imputing outcomes for missing data points is either done to try and approximate what might have been true or is used as a method to test the strength of the results—meaning if I put the intervention through a tough challenge, such as assigning failure to those missing in the intervention group and success to those missing in the comparison group, is any difference favoring the intervention still statistically significant?
This DelfiniClick™ is focused on "last-observation-carried-forward" (LOCF) which is frequently used to assign missing variables. LOCF simply means, for example, that if I lost a patient at month 6 in a 12-month trial, I assign the 12-month value for my data point from what I observed in month 6. A number of authors consider this a method prone to bias for various reasons [1-6] not the least of which is that it is not robust and may not be a reasonable predictor of outcomes.
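The mechanics of LOCF can be sketched in a few lines of Python. This is only an illustration: the function name `locf`, the visit schedule, and the scores are all hypothetical, and a missing visit is represented by `None`.

```python
from typing import List, Optional, Sequence

def locf(visits: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing visit (None) with the most recent observed value."""
    filled = []
    last_seen = None
    for value in visits:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled

# Hypothetical patient measured at months 0, 3 and 6, then lost to follow-up.
observed = [50.0, 47.0, 45.0, None, None]   # months 0, 3, 6, 9, 12
imputed = locf(observed)
print(imputed)  # [50.0, 47.0, 45.0, 45.0, 45.0]
```

The month-12 value used in the analysis is simply the month-6 observation, whatever may have happened to the patient in between.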
However, as many researchers use LOCF for data imputation, it is worth exploring whether there are circumstances that allow us to draw reasonable conclusions from otherwise valid studies when LOCF is employed. Although using LOCF in progressive conditions clearly distorts results, we might be able to get at some reasonable answers despite its use because we know the direction or trend line without effective treatment.
Scenario 1: Ideal Study Circumstances & Drug Does Not Work
Assumptions
- Ineffective agent versus placebo
- Study is of a progressive condition in which overall improvement could not be expected to happen without some kind of effective intervention
- Randomization is successful
- Concealment of allocation was performed successfully
- Blinding is successful and was maintained
- Missing data between groups is equal and timing of missing data is similar
- Study is otherwise valid
Imagine a graph that plots results between the groups over various time points—see below. We would expect the lines to be roughly the same. The resulting bias would be that the rate and lower boundary of the reported outcome would be higher than what would actually be true. However, in considering the difference in outcomes between groups, we would have a truthful answer: no difference between the groups.

Scenario 2: Ideal Study Circumstances & Drug Does Work
Assumptions
- Effective agent versus placebo
- Study is of a progressive condition in which overall improvement could not be expected to happen without some kind of effective intervention
- Randomization is successful
- Concealment of allocation was performed successfully
- Blinding is successful and was maintained
- Missing data between groups is equal and timing of missing data is similar
- Study is otherwise valid
Imagine a graph that plots results between the groups over various time points—see below. We would expect the lines to diverge. The resulting bias would be that the rate and lower boundary of the reported outcome would be higher than what would actually be true in the placebo group. Conversely, the rate and the upper boundary of the reported outcome would be lower than what would actually be true in the active agent group. So the bias would favor placebo and be conservative against the intervention. However, in considering the difference in outcomes between groups, we would have a truthful answer: a difference between the groups.
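A small numerical sketch may help show why LOCF is conservative in this scenario. Everything here is invented for illustration: scores follow straight lines (higher is better), placebo patients worsen by 2 points a month, active-agent patients improve by 1 point a month, and half of each group drops out at month 6 of a 12-month trial.

```python
def value_at(start: float, slope: float, month: int) -> float:
    """Outcome score on a straight-line trajectory (higher is better)."""
    return start + slope * month

def group_mean(start, slope, n_complete, n_dropout, dropout_month, final_month, use_locf):
    """Mean final score; dropouts get either their LOCF value or their true final value."""
    completer_score = value_at(start, slope, final_month)
    if use_locf:
        dropout_score = value_at(start, slope, dropout_month)  # last observed value
    else:
        dropout_score = completer_score  # what full follow-up would have shown
    total = n_complete * completer_score + n_dropout * dropout_score
    return total / (n_complete + n_dropout)

# Hypothetical trial: equal dropout and equal dropout timing in both groups.
common = dict(n_complete=50, n_dropout=50, dropout_month=6, final_month=12)
placebo = dict(start=60, slope=-2.0)  # progressive worsening
active = dict(start=60, slope=1.0)    # improvement on an effective agent

true_diff = (group_mean(**active, **common, use_locf=False)
             - group_mean(**placebo, **common, use_locf=False))
locf_diff = (group_mean(**active, **common, use_locf=True)
             - group_mean(**placebo, **common, use_locf=True))
print(true_diff, locf_diff)  # 36.0 27.0
```

Under these assumptions LOCF flatters placebo and penalizes the active agent, so the reported difference shrinks from 36 to 27 points but keeps the correct direction.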

Scenario 3: Uncertain Study Circumstances & Unknown if Drug Works
Assumptions
- Agent of unknown efficacy versus placebo
- Study is of a progressive condition in which overall improvement could not be expected to happen without some kind of effective intervention
- Randomization appears successful: random method used to assign people to their groups plus a review of the table of baseline characteristics is suggestive that the groups are balanced
- Concealment of allocation appears to have been performed successfully: call-in-center was used
- Blinding appears to have been well attended to and drug side effects or other circumstances would not seem to break an effective blind
- Missing data between groups is roughly similar, but timing of missing data is unknown
- Study is otherwise valid insofar as we can tell
If the lines do diverge it seems reasonable to conclude one of three things: 1) we have a chance effect; 2) a systematic bias explains the reported improvement in the active agent group; or 3) the agent actually works.

Chance is a possibility, though not so likely with a prespecified outcome. If the reporting were actually graphed out over time rather than just reported as a summary measure, and we saw consistency in the data points, we would conclude it would be unlikely to be a chance effect.
Another possibility could be differences in care or co-interventions. Effective concealment of allocation and effective blinding would be likely to enable us to rule out such differences being due to bias from knowing the group to which a person was assigned. Therefore, any such resulting differences would be reasonably likely to be a result of some action of the agent.
Actions of the agent would generally be either benefit or harm. If the agent caused a harm that resulted in a greater number of people in the active agent group receiving a co-intervention, that intervention would have to be effective or synergistic with the active agent, in order to see a reported benefit—which is probably not very likely. (And it is possible that this kind of situation would result in failure of successful blinding—in that instance, we would look for anything that may have resulted in improvement to patients other than the agent.)
If the agent is truly working, it is unlikely that subjects would be receiving a co-intervention. That scenario would be more likely to result if the patient were on placebo or the drug did not work. In the latter instance, probably an equal number of subjects in both groups would be getting a co-intervention and the likelihood would be no or little difference between the groups.
Conclusion Using LOCF in Progressive Illness
We strongly prefer that LOCF not be utilized for data imputation for reasons studied by various authors [1-6], but, in the case of a progressive illness, for example, with unlikely spontaneous improvement, it may be reasonable to trust claims of efficacy under the right study conditions, with a recognition that the estimates of effect will likely be distorted.
-
Using LOCF in progressive illnesses has the disadvantage of likely upgrading of an estimate of effect where there is actually no effect and downgrading estimates for true effectiveness.
-
However, our ability to discern potentially efficacious treatment is aided by expected trending. For example, in a study with a placebo group with progressive disease and an intervention group with improving disease, LOCF would be conservative because it would impute better-than-actual observations in the placebo group and worse-than-actual observations in the intervention group.
-
Reporting by various time points strengthens confidence that outcomes are not due to chance.
Conclusion Using LOCF in Non-progressive Illness
Using LOCF in non-progressive illness is possibly more problematic as we do not have the assistance of an expected trend for either group. Consequently, we have fewer clues to aid us in drawing any conclusion.
References [Delfini LOCF Summary Notes]
- Carpenter J, Kenward K. Guidelines for handling missing data in Social Science Research. www.missingdata.org.uk [Strongly recommends avoiding LOCF.]
- Gadbury GL, Coffey CS, Allison DB. Modern statistical methods for handling missing repeated measurements in obesity trial data: beyond LOCF. Obes Rev. 2003 Aug;4(3):165-84. PubMed PMID: 12916818. [Reports on some simulations of LOCF producing bias for all three general categories of missing data. “Both multiple imputation and mixed effects models appear to produce unbiased estimates of a treatment effect for all types of missing data.”]
- O'Brien PC, Zhang D, Bailey KR. Semi-parametric and non-parametric methods for clinical trials with incomplete data. Stat Med. 2005 Feb 15;24(3):341-58. Erratum in: Stat Med. 2005 Nov 15;24(21):3385. PubMed PMID: 15546952. [LOCF should not be used.]
- Shih W. Problems in dealing with missing data and informative censoring in clinical trials. Curr Control Trials Cardiovasc Med. 2002 Jan 8;3(1):4. PubMed PMID: 11985668; PubMed Central PMCID: PMC134466. [Discusses various biases with use of LOCF.]
- Wood AM, White IR, Thompson SG. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials. 2004;1(4):368-76. Review. PubMed PMID: 16269265. [LOCF is crude and rarely appropriate.]
- Woolley SB, Cardoni AA, Goethe JW. Last-observation-carried-forward imputation method in clinical efficacy trials: review of 352 antidepressant studies. Pharmacotherapy. 2009 Dec;29(12):1408-16. Review. PubMed PMID: 19946800. [Cautions depending on the pattern of missing data and emphasizes need for explicitly describing this in published reports along with the likely effect of dropouts and how they reached their conclusions. Recommends mixed-effects modeling as it is “less likely to introduce substantial bias.”]
|
| Intention-to-Treat
Analysis & Censoring: Rofecoxib Example
In a recent DelfiniClick, we voiced concern about models used for
analysis of study outcomes, especially when information about
assumptions used is not reported. In the July 13, 2006 issue
of the NEJM (published early on-line), there is a very informative
example of what can happen when authors claim to analyze data
using the intention-to-treat (ITT) principle, but do not actually
do an ITT analysis.
Case Study
The NEJM published a correction to an original study of cardiovascular
events associated with rofecoxib versus placebo[1]. This correction
illustrates how Kaplan-Meier curves can be misleading to readers
and how they differ with various censoring assumptions. In this
case, because events that occurred more than 14 days after subjects
discontinued the study drug were censored, the Kaplan-Meier curves for thrombotic
events did not separate until 18 months. The following is part
of the correction published by NEJM:
“…Statements
regarding an increase in risk after 18 months should be removed
from the Abstract (the sentence ‘The increased relative
risk became apparent after 18 months of treatment; during
the first 18 months, the event rates were similar in the two
groups’ should be deleted…”
The reason for
the correction appears to be an analysis of data released by
Merck to the FDA on May 11, 2006. These data provide information
about events in the subgroup of participants whose data were
censored if they had an event more than 14 days after early
discontinuation of the study medication.
Twelve thrombotic
events that occurred more than 14 days after the study drug
was stopped but within 36 months after randomization were noted.
Eight of the “new” events were in the rofecoxib
group, and these events had a definite effect on the published
survival curve for rofecoxib (Fig. 2 of the original article).
When including the new data, the separation of the rofecoxib
and placebo curves begins earlier than 18 months.
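The effect of a censoring rule on a Kaplan-Meier curve can be sketched with a toy data set. The subjects below are entirely hypothetical (they are not the rofecoxib trial data), and the helper names `to_records` and `kaplan_meier` are our own. The sketch compares a rule that stops counting events 14 days after the drug is stopped with one that counts every event during follow-up.

```python
from typing import List, Optional, Tuple

# Hypothetical subject: (day drug was stopped, day of event or None, last day of follow-up)
Subject = Tuple[int, Optional[int], int]

def to_records(subjects: List[Subject], window_days: Optional[int]) -> List[Tuple[int, bool]]:
    """Turn subjects into (time, had_event) pairs.

    With a window, follow-up ends window_days after the drug is stopped and
    later events are discarded. With window_days=None, all events up to the
    last day of follow-up are counted.
    """
    records = []
    for stop_day, event_day, last_day in subjects:
        cutoff = last_day if window_days is None else min(last_day, stop_day + window_days)
        if event_day is not None and event_day <= cutoff:
            records.append((event_day, True))
        else:
            records.append((cutoff, False))
    return records

def kaplan_meier(records: List[Tuple[int, bool]]) -> List[Tuple[int, float]]:
    """Return (event time, estimated event-free proportion just after that time)."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, had_event in records if had_event}):
        at_risk = sum(1 for time, _ in records if time >= t)
        events = sum(1 for time, had_event in records if time == t and had_event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve

subjects = [
    (1000, None, 1000),  # stayed on drug, no event
    (1000, 900, 1000),   # event while on drug
    (200, 300, 1000),    # event 100 days after stopping
    (150, 400, 1000),    # event 250 days after stopping
    (1000, None, 1000),
    (1000, None, 1000),
]

censored_curve = kaplan_meier(to_records(subjects, window_days=14))
full_curve = kaplan_meier(to_records(subjects, window_days=None))
print("14-day window:", censored_curve)
print("All follow-up:", full_curve)
```

In this toy example the 14-day rule hides the two events that followed discontinuation, so the first drop in the curve appears at day 900 instead of day 300 and the final event-free estimate is higher.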
The point of all
this is that it is difficult to determine the validity of a
study when assumptions used in censoring of data are not reported.
With insufficient information about loss to follow-up, we cannot
do our own sensitivity analyses for imputing missing data with
our goal being to “test” the P-value reported by
the authors.
To reiterate
from our previous DelfiniClick:
- Intent-to-treat
designs that provide primary outcome data on all randomized
patients are the ideal. All patients randomized are included
in the analysis. The same patients randomized at the beginning
of the RCT are analyzed in the same groups to which they were
randomized.
- Authors should
use a CONSORT diagram to report what happened to various patients
during the course of the study – plus they should provide
detailed information about missing data points including timing.
- Sensitivity
analyses are welcomed, especially those that subject the intervention
to the toughest trial. If p-values remain statistically significant
after such a test, we can be more confident about anticipated
outcomes in an otherwise valid study.
1. Correction to:
Cardiovascular events associated with rofecoxib in a colorectal
adenoma chemoprevention trial. N Engl J Med 2006;355:221.
2. Bresalier RS,
Sandler RS, Quan H, et al. Cardiovascular events associated
with rofecoxib in a colorectal adenoma chemoprevention trial.
N Engl J Med 2005;352:1092-102. |
| Intention-to-Treat
Analysis: Misreporting and Migraine
Intention-to-treat
analysis (ITT) is an important consideration in randomized,
controlled trials. And determining whether an analysis meets
the definition of ITT analysis or not is incredibly easy. Yet
many authors mislabel their analyses as ITT when they are not
and report their results in a biased way. An article in BMJ
dealing with migraine illustrates some important points about
ITT analysis and reminds us that authors continue to
report outcomes in ways that are highly likely to be biased.
Case Study
As described in
the CONSORT STATEMENT (http://www.consort-statement.org/),
among other things, ITT analysis “prevents bias caused
by the loss of participants, which may disrupt the baseline
equivalence established by random assignment and which may reflect
non-adherence to the protocol.”
ITT analysis is
defined as follows in the CONSORT STATEMENT:
“A strategy for analyzing data in which all participants
are included in the group to which they were assigned, whether
or not they completed the intervention given to the group.”
An easy way to
tell if an ITT analysis has been done is to look at the number
randomized in each group and see if that number is the same
number that is analyzed. Number in should be the same number
out — in each group as originally randomized.
And, as you can
see, determining whether an analysis meets the definition of
ITT analysis or not is incredibly easy. Yet many authors mislabel
their analyses as ITT when they are not. In one review of published
articles, authors who said they had performed an ITT analysis
had not actually done so 47% of the time (Kruse R, Alper B,
et al. Intention-to-treat analysis: Who is in? Who is out? J Fam Pract
2002 Nov; Vol 51, No. 11).
An article in BMJ
dealing with migraine illustrates some important points about
ITT analysis and reminds us that authors continue to
report outcomes in ways that are highly likely to be biased.
In the Schrader
study, 30 patients with migraine were randomized to receive
lisinopril and 30 were randomized to placebo. The authors, however,
only reported on 55 patients in their so-labeled
“intention-to-treat analysis” because of poor compliance.
This is not an intention-to-treat analysis.
The following is
reported by the authors:
| Schrader
H, Stovner, LJ, Helde G, Sand T, Bovim G. Prophylactic
treatment of migraine with angiotensin converting enzyme inhibitor
(lisinopril): randomised, placebo controlled, crossover
study. BMJ 2001;322:1-5 — article
available at — http://bmj.bmjjournals.com/cgi/content/full/322/7277/19. |
Results
In the 47 participants with complete data, hours with headache,
days with headache, days with migraine, and headache severity
index were significantly reduced by 20% (95% confidence
interval 5% to 36%), 17% (5% to 30%), 21% (9% to 34%), and
20% (3% to 37%), respectively, with lisinopril compared
with placebo. Days with migraine were reduced by at least
50% in 14 participants for active treatment versus placebo
and 17 patients for active treatment versus run-in period.
Days with migraine were fewer by at least 50% in 14 participants
for active treatment versus placebo. Intention to treat
analysis of data from 55 patients supported the differences
in favour of lisinopril for the primary end points. In the
intention to treat analysis in 55 patients, significant
differences were retained for the primary efficacy end points:
|
Intention to Treat Analysis: 55 Participants with Means (SD)
| | Lisinopril | Placebo | Mean % reduction (95% CI) |
| Headache hours | 138 (130) | 162 (134) | 15 (0 to 30) |
| Headache days | 20.7 (14) | 24.7 (11) | 16 (5 to 27) |
| Migraine days | 14.6 (10) | 18.7 (9) | 22 |
| Conclusion:
The angiotensin converting enzyme inhibitor, lisinopril,
has a clinically important prophylactic effect in migraine.
|
The authors have
done as their primary analysis an “optimal compliance
analysis.” They also state they have done an ITT analysis
but they have not.
It is fine to do
non-ITT analyses – “as treated,” and “completer”
analysis are two common ones you will frequently see. But the
ITT analysis must be the primary analysis. Others are considered
secondary (and should be labeled and treated as such).
And so how does
one handle loss to follow-up? There are various methods, but
there is an important principle which should guide us —
the method should put the burden of proof on the intervention.
This is the opposite of our court system – “guilty
until proven innocent,” in effect. So what you do is assign
an outcome to those lost to follow-up that puts the intervention
through the toughest test. “Worst-case basis” is
one method; “last-observed result” is another.
If you put the
intervention through the hardest test, and you still have positive
results (assuming the study is otherwise valid), you can feel
much more confident about the reported outcomes truly being
valid. If the missing subjects in the above-mentioned migraine
article are handled this way, there is no statistically
significant difference between lisinopril and placebo.
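A worst-case sensitivity analysis of this kind can be sketched as follows. The numbers are hypothetical and are not the migraine trial data (which came from a crossover design with continuous outcomes); we assume a simple two-group trial with a yes/no outcome, and we use a pooled two-proportion z-test purely as a convenient significance test.

```python
from math import erfc, sqrt

def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (success_a / n_a - success_b / n_b) / std_error
    return erfc(abs(z) / sqrt(2))

# Hypothetical trial: 100 patients randomized to each group.
randomized = 100
treat_success, treat_observed = 54, 90      # 10 treated patients missing
control_success, control_observed = 40, 92  # 8 control patients missing

# Completer analysis: ignore the patients with missing outcomes.
p_completers = two_proportion_p_value(
    treat_success, treat_observed, control_success, control_observed
)

# Worst case for the intervention: missing treated patients count as
# failures, missing control patients count as successes.
control_missing = randomized - control_observed
p_worst_case = two_proportion_p_value(
    treat_success, randomized, control_success + control_missing, randomized
)

print(f"Completers only: p = {p_completers:.3f}")
print(f"Worst case:      p = {p_worst_case:.3f}")
```

With these invented counts, the completer analysis gives a p-value below 0.05, while the worst-case analysis of all randomized patients does not, which is the pattern described above.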
We are frequently
asked what is an acceptable percent loss to follow-up. It depends
on whether the loss to follow-up will affect the results or
not. We have seen what we consider to be important changes even
with small numbers lost to follow-up. We recommend that you
do sensitivity analyses (“what if”s) to see what
the effect might be if you had the data. Without doing an ITT
analysis, we are very uncomfortable about the results if five
percent or more of subjects have missing data for analyzing
endpoints -- and even less than five percent might have impact.
For those who would
like more information, the following article is an excellent
one on the subject and is very helpful for understanding issues
pertaining to ITT analysis and randomization as well:
Schulz
KF, Grimes DA
Sample size slippages in randomised trials: exclusions and the
lost and wayward.
The Lancet. Vol 359. March 2, 2002: 781-785
PMID: 11888606
Other reading
on ITT analysis is available here.
Very special thanks
to Murat Akalin, MD, MPH, UCSD, for selecting
a great article for case study, participating in this review,
doing the ITT analysis and encouraging us to write this. |
| Missing
Data Points: Difference or No Difference — Does it Matter?
We continue to
study the "evidence on the evidence" — meaning
we are continually on the lookout for information which may
shed light on the impact on reported outcomes of certain kinds
of bias, for example, or information that provides help in how
to handle different biases. Missing data points is an issue
affecting the majority of studies, but currently there is not
clarity on how big an issue this is, especially when there is
not a differential loss between groups.
We spoke recently
about this issue with John M. Lachin, Sc.D., Professor of Biostatistics
and Epidemiology, and of Statistics, The George Washington University,
and author. (And then we did some "hard thinking"
as David Eddy would say.) Even without differential loss between
the groups overall, a differential loss could occur in prognostic
variables — and readers are rarely going to have access
to data about changes in prognostic characteristics post-baseline
reporting. So we continue to offer our conservative approach:
we treat a loss of around five percent with differential loss as a
bias, as well as a loss of around ten percent or more without differential
loss.
For those who are
tough and hardy and really want to mull on this, here's our
updated white paper on "missingness" [Word]
or [PDF]. We
welcome further thoughts (or evidence) on this area.
Update:
Attrition
Bias Caution: Non-differential Loss Between Groups Can Threaten
Validity
01/16/2011
Read our BMJ
Rapid Response Letter to a critical appraisal and
quiz that we thought missed an important point about non-differential
dropouts, our rationale, and our recommendations for future reporting.
|
Attrition Bias and Baseline Characteristic Testing (Esp for Non-Dichotomous Variables)
05/19/2011
Not having complete information on all study subjects is a common problem in research. The key issue is whether subjects for whom data are missing are similar to those for whom data are available. In other words, might reported outcomes be distorted by an imbalance in the groups for which we have information? As Schulz and Grimes state, “Any erosion…over the course of the trial from those initially unbiased groups produces bias, unless, of course, that erosion is random…” [1]. As of this date, we are not aware of a preferred way to handle this problematic area, and the effect of various levels of attrition remains unclear [2, 3].
We have previously summarized our position on performing sensitivity analyses when variables are dichotomous. Non-dichotomous data pose unique challenges. We think it is reasonable to perform a sensitivity analysis comparing subjects for whom data are available with those for whom they are not. Others have recommended this approach. Dumville et al. state [2], “Attrition can introduce bias if the characteristics of people lost to follow-up differ between the randomised groups. In terms of bias, this loss is important only if the differing characteristic is correlated with the trial’s outcome measures.…we suggest it is informative to present baseline characteristics for the participants for whom data have been analysed and those who are lost to follow-up separately. This would provide a clearer picture of the subsample not included in an analysis and may help indicate potential attrition bias.”
Other suggestions regarding missing data through censoring have been provided to us by John M. Lachin, Sc.D., Professor of Biostatistics and Epidemiology, and of Statistics, The George Washington University (personal communication):
- Evaluate censoring by examining both administrative censoring and censoring due to loss-to-follow-up. Administrative censoring (censoring of subjects who enter a study late) may not result in significant bias. Censoring because of loss-to-follow-up or discontinuing is more likely to pose a threat to validity.
- Compare characteristics of losses (e.g., withdrawing consent, adverse events, loss to follow-up, protocol violations) versus completers (including administratively censored) within groups.
- Compare characteristics of losses (not administratively censored) between groups.
- Adjust group effect for factors in which groups differ.
There are some caveats that should be raised regarding this kind of sensitivity analysis. There may be other resulting imbalances between groups that are not measurable. Also, a finding of no differences in characteristics of the groups could be due to insufficient power to reveal true differences. And importantly, differences found could be due to chance.
However, if the groups appear to be similar, we think it is reasonable to conclude that such sensitivity analyses may be suggestive that the groups remained balanced despite the number of discontinuations. If the groups remained balanced, then, depending on details of the study, the discontinuations may not have created any meaningful distortion of results.
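One simple way to compare a baseline characteristic between completers and those lost to follow-up is a standardized difference in means. This is our own choice of descriptive measure rather than one named by the authors cited here, the ages below are invented, and the caveats above about power and chance still apply. How large a difference matters remains a judgment call.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

def standardized_difference(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Difference in means divided by the pooled standard deviation.

    Assumes each group has at least two values and some variation.
    """
    pooled_sd = sqrt((stdev(group_a) ** 2 + stdev(group_b) ** 2) / 2)
    return (mean(group_a) - mean(group_b)) / pooled_sd

# Hypothetical baseline ages within one randomized group.
completers_age = [58, 61, 63, 59, 62, 60, 64, 57]
dropouts_age = [66, 70, 68, 72, 65, 69]

smd = standardized_difference(completers_age, dropouts_age)
print(f"Standardized difference in baseline age: {smd:.2f}")
```

In this made-up example the dropouts are clearly older than the completers, which would raise concern if age is related to the outcome. The same comparison would be repeated within each randomized group and for each prognostic variable reported at baseline.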
References
1. Schulz KF, Grimes DA. Sample size slippages in randomised trials: exclusions and the lost and wayward. Lancet. 2002 Mar 2;359(9308):781-5. PubMed PMID: 11888606.
2. Dumville JC, Torgerson DJ, Hewitt CE. Reporting attrition in randomized controlled trials. BMJ. 2006 Apr 22;332(7547):969-71. Review. PubMed PMID: 16627519; PubMed Central PMCID: PMC1444839.
3. Hewitt CE, Kumaravel B, Dumville JC, Torgerson DJ; Trial attrition study group. Assessing the impact of attrition in randomized controlled trials. J Clin Epidemiol. 2010 Nov;63(11):1264-70. Epub 2010 Jun 22. PubMed PMID: 20573482. |
Attrition Bias & A Biostatistician Weighs In: Dr. Steve Simon on "Why is a 20% dropout rate bad?"
12/05/2011
We have written numerous times about attrition bias. Large numbers of patients dropping out of studies or unable to complete participation in studies tends to be one of the biggest barriers in passing critical appraisal screenings. This area is also one of the least understood in evaluating impact on outcomes, with a paucity of helpful evidence.
Biostatistician Steve Simon addresses dropout rates in this month’s newsletter in his helpful entry titled “Why is a 20% dropout rate bad?” Steve provides us with some math to tell us that “If both the proportion of dropouts is small and the difference in prognosis between dropouts and completers is small, you are truly worry free.”
He also gives us help with differential loss: “The tricky case is when only one [proportion of dropouts] is small. You should be okay as long as the other one isn't horribly bad. So a small dropout rate is okay even with unequal prognosis between completers and dropouts as long as the inequality is not too extreme. Similarly, if the difference in prognosis is small, then any dropout rate that is not terribly bad (less than 30% is what I'd say), should leave you in good shape.”
He gives us a rule of thumb to go by: “Now it is possible to construct settings where a 10% dropout rate leads to disaster or where you'd be safe even with a 90% dropout rate, but these scenarios are unrealistic. My rule is don't worry about a dropout rate less than 10% except in extraordinary settings. A dropout rate of 30% or higher though, is troublesome unless you have pretty good inside information that the difference in prognosis between dropouts and completers is trivially small.”
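One way to see why both quantities matter is a weighted-average identity: if only completers are analyzed, the distortion equals the dropout proportion multiplied by the difference in outcome between completers and dropouts. The sketch below is our own illustration with invented success rates and may not match the calculation in Simon's newsletter.

```python
def completer_bias(dropout_rate: float, completers: float, dropouts: float) -> float:
    """Bias of a completers-only estimate relative to the full-group value."""
    full_group = (1 - dropout_rate) * completers + dropout_rate * dropouts
    return completers - full_group

# Hypothetical success rates among completers and (unobserved) dropouts.
small_both = completer_bias(0.05, completers=0.60, dropouts=0.55)
big_dropout = completer_bias(0.30, completers=0.60, dropouts=0.55)
big_gap = completer_bias(0.05, completers=0.60, dropouts=0.20)
big_both = completer_bias(0.30, completers=0.60, dropouts=0.20)

for label, bias in [
    ("5% dropout, small gap", small_both),
    ("30% dropout, small gap", big_dropout),
    ("5% dropout, large gap", big_gap),
    ("30% dropout, large gap", big_both),
]:
    print(f"{label}: bias = {bias:.4f}")
```

When both the dropout rate and the gap are small the bias is negligible, when only one is large it stays modest, and it becomes substantial only when both are large.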
We are happy that, in the face of no truly helpful evidence on this topic, we are pretty much on the same page. Bottom line for us is that authors could help us all out tremendously by assessing comparability between baseline characteristics at randomization and for those analyzed. We want to see that there are no major changes in prognostic variables between these comparisons.
You can read Steve’s full entry here and even sign-up to be on his mailing list:
http://www.pmean.com/news/201111.html#1 |
| Quality
of Studies: VIGOR
Why is it that Vioxx made the front page of
the New York Times in December of 2005 when it had been withdrawn from the
market in 2004? Reason: it was discovered that the authors “removed”
3 patients with CV events from the data in the days preceding
final hardcopy submission of the VIGOR study to the NEJM. Here
are some key points made by the NEJM in an editorial entitled,
Expression of Concern: Bombardier et al., “Comparison
of Upper Gastrointestinal Toxicity of Rofecoxib and Naproxen
in Patients with Rheumatoid Arthritis,” N Engl J Med 2000;343:1520-8,
published on the web 12/8/04 and in hard copy, N Engl J Med.
2005.353:25:
- The VIGOR study
was designed primarily to compare gastrointestinal events
in patients with rheumatoid arthritis randomly assigned to
treatment with rofecoxib (Vioxx) or naproxen (Naprosyn), but
data on cardiovascular events were also
monitored.
- Three myocardial
infarctions, all in the rofecoxib group, were not included
in the
data submitted to the Journal in hardcopy.
- Until the end
of November 2005, the NEJM believed that these were late events
that were not known to the authors in time to be included
in the article published in the Journal on November 23, 2000.
- It now appears,
however, from a memorandum dated July 5, 2000, that was obtained
by subpoena in the Vioxx litigation and made available to
the NEJM, that at least two of the authors knew about the
three additional myocardial infarctions at least two weeks
before the authors submitted the paper version of their manuscript.
- Lack of inclusion
of the three events resulted in an understatement of the difference
in risk of myocardial infarction between the rofecoxib and
naproxen groups.
- The NEJM determined
from a computer diskette that some of these data were deleted
from the VIGOR manuscript two days before it was initially
submitted to the Journal on May 18, 2000.
- Taken together,
these inaccuracies and deletions call into question the integrity
of the data on adverse cardiovascular events in this article.
Merck's position
is that the additional heart attacks became known after the
publication's "cutoff" date for data to be analyzed
and were therefore not reported in the Journal article. To our
knowledge, NEJM has not responded to Merck's point.
In any event, without
the 3 missing subjects the relative risk of myocardial infarction
was 4.25 for rofecoxib versus naproxen, 95% CI (1.39 to
17.37). This is based on 17 MIs out of 2315 person-years of
exposure for rofecoxib and 4 MIs out of 2336 person-years for
naproxen.
Adding in the 3
missing subjects (new total of 20 MIs in the rofecoxib group)
increases the relative risk to 5.00, 95% CI (1.68 to 20.13).
This demonstrates how losing just a few subjects even in a large
study can change results dramatically.
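As a rough check, the event counts and person-years quoted above can be turned into crude rate ratios. This is a minimal sketch of our own (the function name is hypothetical), and it does not attempt the confidence intervals. The crude ratios come out near 4.3 and 5.0, close to the 4.25 and 5.00 reported; we assume the small gap reflects rounding of the person-years or a different estimation method.

```python
def rate_ratio(events_a, person_years_a, events_b, person_years_b):
    """Crude ratio of two event rates (events per person-year)."""
    return (events_a / person_years_a) / (events_b / person_years_b)

# Rofecoxib versus naproxen, using the figures quoted in the text.
as_published = rate_ratio(17, 2315, 4, 2336)   # without the 3 missing MIs
with_missing = rate_ratio(20, 2315, 4, 2336)   # with the 3 missing MIs added

print(f"without missing MIs: {as_published:.2f}")
print(f"with missing MIs:    {with_missing:.2f}")
```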
For readers, the
important point is to look carefully to be sure that all randomized
patients were accounted for. We believe that if the loss of
subjects is greater than 5% without an acceptable ITT analysis
there is uncertainty regarding the validity of the results.
|
Avoiding
Overestimates of Benefit: Composite Endpoints in Cardiovascular
Trials
04/20/09
Composite endpoints
represent the grouping together of individual endpoints to serve
as a single outcome measure. They are frequently used in clinical
trials to reduce requirements for sample size. In other words,
composite endpoints, by adding together individual outcomes,
increase the overall event rates and, thus, the statistical
power of a study to demonstrate a statistically significant and clinically
meaningful difference between groups if one exists. Composite
endpoints also enable researchers to conduct studies of smaller
size and still reach what may be clinically meaningful outcomes.
It has been pointed out, however, that the trade-offs for this
increased power may include difficulties for readers in correctly
interpreting results.
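To illustrate why higher event rates reduce sample size requirements, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The event rates are hypothetical and chosen only for illustration; both scenarios assume the same 25% relative reduction, a two-sided alpha of 0.05 and 80% power.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two proportions
    (normal approximation, two-sided test, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)

# Hypothetical event rates, each with a 25% relative reduction.
single = n_per_group(0.04, 0.03)      # one endpoint, e.g., death alone
composite = n_per_group(0.12, 0.09)   # composite of several endpoints

print(f"single endpoint:    {single} patients per group")
print(f"composite endpoint: {composite} patients per group")
```

With these assumed rates, the composite endpoint needs roughly a third of the patients that the single endpoint needs.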
Several investigators
[1,2] have pointed out that composite endpoints may be misleading
if the investigators —
- Include individual
outcomes that have differing importance to patients;
- Include individual
outcomes that have differing rates of occurrence; or,
- Do not include
rates for individual outcomes.
For example, in
cardiovascular trials the composite endpoint of cardiovascular
mortality, myocardial infarction and revascularization procedures
is frequently encountered. A reader who looks only at the composite
result is very likely to conclude that the effect on the most
meaningful outcomes is greater than it actually is. If the reader
does not realize that the apparent effect is driven largely by
revascularization (which is frequently driven by subjective
symptoms and subjective decision-making to perform the procedure)
rather than by objective outcomes such as myocardial infarction
and death, then the reported composite endpoint is likely to
result in erroneous (falsely inflated) conclusions by the reader.
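The following sketch shows how a composite result can be driven by its least important component. All event counts are hypothetical and are not taken from any trial; for simplicity we assume no patient has more than one event.

```python
# Hypothetical event counts per 1,000 patients in each arm (illustration only).
# Simplifying assumption: no patient has more than one event.
n_per_arm = 1000
events = {
    "cardiovascular death": {"treated": 20, "control": 21},
    "myocardial infarction": {"treated": 30, "control": 32},
    "revascularization": {"treated": 50, "control": 90},
}

def relative_risk(treated, control, n):
    """Relative risk for two arms of equal size n."""
    return (treated / n) / (control / n)

component_rr = {
    name: relative_risk(c["treated"], c["control"], n_per_arm)
    for name, c in events.items()
}
composite_rr = relative_risk(
    sum(c["treated"] for c in events.values()),
    sum(c["control"] for c in events.values()),
    n_per_arm,
)

for name, rr in component_rr.items():
    print(f"{name}: RR = {rr:.2f}")
print(f"composite: RR = {composite_rr:.2f}")
```

In this made-up example the composite suggests about a 30% relative reduction, while death and myocardial infarction are nearly unchanged; almost all of the apparent benefit comes from revascularization.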
Lim and colleagues
[3] found in a review of 304 cardiovascular trials published
in 14 leading journals between January 2000 and January 2007
that 73% of trials reported composite primary outcomes. The total
number of individual events and the total number of events represented
by the composite outcome differed in 79% of trials. P values
for composite outcomes less than 0.05 were more frequently reported
than P values of 0.05 or greater. Additionally, death as an
individual endpoint made a relatively small contribution to
estimates of effect summarized by the trials’ composite
endpoints, whereas revascularization made a greater contribution.
Lim et al. recommend that authors report results for each individual
endpoint in addition to the composite endpoint so that readers
can ascertain the contribution of each individual endpoint.
Readers should
bear in mind that safety outcomes when reported as single events
can be made to appear “insignificant” since P values
are frequently greater than 0.05. If investigators report efficacy
results as composite outcomes it may be reasonable to expect
safety results to also be reported as composites.
Bottom Lines for
Recent Cardiovascular Studies (That Also Apply to Trials in
Other Areas):
1. Composite outcomes increase event rates and statistical power.
2. Composite outcomes in cardiovascular trials are frequent
and often comprise 3 to 4 individual end points.
3. Individual events frequently vary in clinical significance.
4. Meaningful differences between the total number of individual
events in a trial and those reported for the composite outcomes
are very common.
5. When studies include composite outcomes comprised of individual
outcomes of varying importance and frequency, interpreting results
becomes difficult for readers.
6. Interpretation becomes easier if authors include individual
outcomes along with the composite measures.
References
1. Freemantle N,
Calvert M, Wood J, Eastaugh J, Griffin C. Composite outcomes
in randomized trials: greater precision but with greater uncertainty?
JAMA. 2003;289:2554-9. [PMID: 12759327].
2. Ferreira-González I, Busse JW, Heels-Ansdell D, Montori
VM, Akl EA, Bryant DM, Alonso-Coello P, Alonso J, Worster A,
Upadhye S, Jaeschke R, Schünemann HJ, Permanyer-Miralda
G, Pacheco-Huergo V, Domingo-Salvany A, Wu P, Mills EJ, Guyatt
GH. Problems with use of composite end points in cardiovascular
trials: systematic review of randomised controlled trials. BMJ.
2007 Apr 14;334(7597):786. Epub 2007 Apr 2. [PMID: 17403713].
3. Lim E, Brown A, Helmy A, Mussa S, Altman DG. Composite outcomes
in cardiovascular research: a survey of randomized trials. Ann
Intern Med. 2008 Nov 4;149(9):612-7. [PMID: 18981486]
|
| Confidence-Intervals,
Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start
Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically
Significant Findings
(Special thanks to Brian Alper, MD, MSPH and Ted Ganiats,
MD for their help in understanding this issue.)
Problems
with Non-Statistically Significant Findings
Research outcomes which are not statistically significant (also
referred to as “non-significant findings”) raise
the question, "Is there TRULY no difference, or were there
not enough people to show a difference if there is one?"
(This is known as beta- or Type II error.)
Power calculations
performed prior to a study help investigators determine
the number of people they should enroll in the study to try
and detect a statistically significant difference if there is
one. A power of >= 80% is conventional and provides some
leeway for chance. Power calculations are generally performed
only for the primary outcome. They entail a lot of assumptions.
Good News
About Power!
The good news for readers is that you don’t need to worry
about power since you can evaluate inconclusiveness of findings
through using confidence intervals.
Here’s what
they are, and here’s how it’s done:
About Confidence
Intervals (CIs)
The results of a valid study represent an approximation of truth.
There might be other possible values that could also approximate
truth. (What if the study had been done on Friday instead of
on Tuesday, for example? Maybe the difference in outcomes would
be an absolute 4 percent and not 5 percent.) In recognition
of this, a confidence interval is a calculated range of statistically
plausible results for a valid study. A 95% confidence interval is
constructed so that, if the study were repeated many times, about
95% of the intervals calculated this way would contain the true
answer. (As with all allowances for chance findings, 95 percent
is conventional.)
You can apply confidence intervals to any measure of outcomes
such as an odds ratio or absolute risk reduction (ARR).
This is how confidence
intervals are reported:
Example: ARR
= 5%; 95% CI (3% to 7%)
How to
Use Confidence Intervals to Determine Statistical Significance
Absolute
Risk Reduction and Relative Risk Reduction
For measures reported as percentages, if the range includes
zero, the outcomes are not statistically significant.
Relative
Risk (aka Risk Ratio) and Odds Ratio
For measures reported as ratios, if the range includes 1,
the outcomes are not statistically significant.
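The two rules above can be sketched as a single check. The function name is our own, and the ratio interval in the last line is hypothetical.

```python
def is_statistically_significant(ci_lower, ci_upper, measure):
    """Apply the rule above to a 95% CI.

    measure: 'difference' for ARR or RRR (no effect = 0),
             'ratio' for relative risk or odds ratio (no effect = 1).
    """
    no_effect = {"difference": 0.0, "ratio": 1.0}[measure]
    return not (ci_lower <= no_effect <= ci_upper)

print(is_statistically_significant(3, 7, "difference"))    # ARR = 5%; 95% CI (3% to 7%)
print(is_statistically_significant(-1, 5, "difference"))   # interval includes zero
print(is_statistically_significant(0.8, 1.3, "ratio"))     # hypothetical ratio CI including 1
```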
How to
Use Confidence Intervals to Determine Conclusiveness of Non-significant
Findings
And if something is not statistically significant (also referred
to as non-significant or NS findings), you don’t know
if there truly is no difference, or whether there were not enough
people to show a difference if there is one.
You can look to
the CIs to help you with this situation. But first you want
to decide what you would consider to be your minimum requirement
for a clinically significant outcome (difference between outcomes
in the intervention and comparison groups). This is a judgment
call.
Let’s assume
we are looking at a study, the primary outcome for which is
absolute reduction in mortality. One might reasonably conclude
that an outcome of 1 percent or more is, indeed, a clinically
meaningful benefit.
[Below is a text
explanation. Pictures tell this best, however. Click
here to
view a PDF of what this looks like graphically. Note
that the PDF starts out first with how to determine clinical
significance of statistically significant outcomes and then
demonstrates how to determine conclusiveness of non-significant
findings.]
Example:
Clinical Significance Goal
>=1% absolute reduction in mortality
For Non-Significant
Findings:
Example
1
- ARR = 2%;
95% CI (-1% to 5%)
- The upper
boundary tells you it is possible that the true result WOULD
meet your requirements for clinical significance –
thus, from that perspective this trial is inconclusive about
NO DIFFERENCE BETWEEN GROUPS - you do not know if the trial
was insufficiently powered (false negative due to insufficient
number of people to show a statistically significant difference
if there is one)
Example
2
- ARR = 0%;
95% CI (-0.5% to 0.5%)
- The upper
boundary does not reach your goal – therefore, this
can be considered sufficient evidence that there is no difference
between the groups that you would consider clinically significant
How to
Use Confidence Intervals to Evaluate Clinical Benefit of Statistically
Significant Findings
Again, you can also use confidence intervals to determine whether
a result from a valid study is of meaningful clinical benefit.
Requirements
for Meaningful Clinical Benefit
Remember that outcomes of clinical significance are those which
benefit patients in some way in the areas of morbidity, mortality,
symptom relief, physical or emotional functioning or health-related
quality of life. Intermediate markers are assumed to benefit
patients in these areas, but they may not - thus, a direct causal
chain of benefit must be proved to avoid waste and potential
patient harms occurring as unintended consequences. Meaningful
clinical benefit is a combination of benefits in a clinically
significant area along with the size of the results.
As with evaluating
the conclusiveness of a non-significant finding, you apply judgment
to set your minimum requirement for meaningful clinical significance.
Using the same example of your choosing 1 percent absolute reduction
in mortality as meaningful clinical benefit:
Example:
Clinical Significance Goal
>=1% absolute reduction in mortality
For Statistically
Significant Findings:
Example
1
- ARR = 2%;
95% CI (0.5% to 3.5%)
- The lower
boundary tells you it is possible that the true result will
NOT meet your requirements for clinical significance –
thus, from that perspective this trial is inconclusive
Example
2
- ARR = 2%;
95% CI (1% to 3%)
- The lower
boundary reaches your goals for clinical significance –
therefore, this can be considered sufficient evidence of
benefit
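The four examples in this section (two non-significant, two statistically significant) can be worked through with one small function. This is a sketch of the reasoning only; the function name and the wording of the returned labels are ours, and the clinical significance goal remains a judgment call.

```python
def appraise_arr(ci_lower, ci_upper, goal):
    """Classify a 95% CI for absolute risk reduction (percentage points)
    against a minimum clinically meaningful benefit `goal`."""
    if ci_lower <= 0 <= ci_upper:                       # not statistically significant
        if ci_upper >= goal:
            return "non-significant, inconclusive"      # meaningful benefit not ruled out
        return "non-significant, no meaningful difference"
    if ci_upper < 0:
        return "significant in the direction of harm"
    if ci_lower >= goal:
        return "significant, meaningful benefit"
    return "significant, inconclusive benefit"          # true effect may fall short of goal

goal = 1.0  # >= 1% absolute reduction in mortality
for lower, upper in [(-1, 5), (-0.5, 0.5), (0.5, 3.5), (1, 3)]:
    print(f"95% CI ({lower}% to {upper}%): {appraise_arr(lower, upper, goal)}")
```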
Again, pictures
probably tell this best. Click here
to view the PDF.
The Authors
Did Not Report CIs?
If you can create a 2 x 2 table from the study data, you can
compute them yourself using the confidence interval calculator
of the University of British Columbia, Department of Health
Care and Epidemiology,
which can also be found in the Delfini
WebLinks
under "confidence interval calculations."
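If you prefer to compute an interval by hand, here is a minimal sketch for relative risk from a 2 x 2 table using the common log-based approximation. We do not know which method the calculator above uses, so results may differ slightly; the counts in the example are hypothetical.

```python
from math import exp, log, sqrt

def relative_risk_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Relative risk with an approximate 95% CI (log method).

    Assumes both groups have at least one event.
    """
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log_rr = sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical 2 x 2 table: 30/1000 events with treatment, 50/1000 with control.
rr, lower, upper = relative_risk_ci(30, 1000, 50, 1000)
print(f"RR = {rr:.2f}; 95% CI ({lower:.2f} to {upper:.2f})")
```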
Evaluate
Definitions for Outcomes
And remember, ensure you agree with the authors’ definitions
of the outcomes, especially if they are using a term like “improved,”
“success,” or “failure” – is a
three-point change on a 200 point scale really a meaningful
clinical difference that should define success? You get to be
the judge. |
Confidence Intervals: Overlapping Confidence Intervals—A Clarification
11/28/2011
Confidence intervals are useful in studies that compare the difference in outcomes between two interventions, because they provide a range of values (representing the estimate of effect) within which the true difference between the two interventions is likely to be found—assuming that the study is valid.
However, a common error is to draw conclusions from whether the 95% confidence intervals of the two groups overlap. The error is to conclude that, because the intervals overlap, the means of the two groups are not statistically significantly different from each other. The error frequently occurs when investigators do not calculate the confidence interval for the difference between the groups. For example, two groups of patients with diabetes received two different drug regimens and hemoglobin A1c measurements were assessed. Results are presented in the table below.
Table 1. Example of Overlapping 95% CIs With Statistical Differences
Group | Hemoglobin A1c with 95% CIs | P-Value for Difference in Means (a)
#1 receiving drug A | 7.4, 95% CI (7.0 to 7.8) | P=0.0376
#2 receiving drug B | 8.0, 95% CI (7.6 to 8.4) |
a: For a detailed mathematical explanation about the problems of variability that occur when comparing two means and details about calculating the P-value see Austin et al. [2]
As pointed out by Altman, “In comparative studies, confidence intervals should be reported for the differences between groups, not for the results of each group separately.”[1]
In theory, two treatment groups can have a statistically significant difference in mean effects at the 5% level of significance, with an overlap of as much as 29% between the corresponding 95% CIs. [2,3,4] Calculations illustrating 6 cases of statistically significant differences in groups with overlapping 95% CIs are shown in Table 2.
Table 2. Percent of Overlapping of 95% CIs and P-Values For Differences Between Groups [2]
Percent Overlap | 0% | 5% | 10% | 15% | 20% | 25%
P-Value | .0056 | .0085 | .0126 | .0185 | .0266 | .0376 |
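The P-values in Table 2 can be reproduced with a short calculation. This is a sketch under two assumptions of ours: both groups have the same standard error, and the comparison uses a normal (z) test. In Table 1 the intervals overlap from 7.6 to 7.8, which is 0.2 of an interval width of 0.8, or 25%.

```python
from math import sqrt
from statistics import NormalDist

def p_value_for_overlap(overlap_fraction, z_crit=1.959964):
    """Two-sided P-value for the difference between two means whose 95% CIs
    overlap by the given fraction of one interval's width.

    Assumes equal standard errors in both groups and a normal (z) test.
    """
    # In standard-error units each CI is 2 * z_crit wide, so the means are
    # 2 * z_crit * (1 - overlap) apart; the SE of the difference is sqrt(2).
    z = 2 * z_crit * (1 - overlap_fraction) / sqrt(2)
    return 2 * (1 - NormalDist().cdf(z))

for pct in (0, 5, 10, 15, 20, 25):
    print(f"{pct}% overlap: P = {p_value_for_overlap(pct / 100):.4f}")
```

Under the same assumptions, the P-value stays below 0.05 until the overlap reaches about 29%, which matches the limit cited above.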
References
1. Altman DG. Statistics and ethics in medical research. In: Statistics in practice. London: British Medical Association; 1982. Chapter VI.
2. Austin P, Hux J. A brief note on overlapping confidence intervals. Journal of Vascular Surgery. 2002; 36, 1, 194-195.
3. Payton ME, Greenstone MH, Schenker N. Overlapping confidence intervals or standard error intervals: What do they mean in terms of statistical significance? Journal of Insect Science. 2003; 3, 34.
4. Odueyungbo A, Thabane L, Markle-Reid M. Tips on overlapping confidence intervals and univariate linear models. Nurse Res. 2009;16(4):73-83. Review. PubMed PMID: 19653548. |
Primary and Secondary Outcomes: Significance Issues
08/08/2011
I am a fan of statistician Steve Simon. You can sign up for his newsletter here: http://www.pmean.com/. I recently wrote him to ask his opinion about significant secondary outcomes when the primary is not statistically significant. Here's the essence of my letter to him; his response follows. At the end of the day, I think it ends up being like many critical appraisal conundrums: "It depends."
From Sheri Strite to Steve Simon: Excerpts
Assume that my examples represent studies done otherwise “perfectly” to be at low risk of bias with biological plausibility and alpha spending applied for analyzing the secondary outcomes. Let us set aside the fact that the authors should make a more logical set of choices of primary outcomes to hedge their bets and have a greater likelihood of a positive outcome. (In other words, for the sake of my question, I am letting them shoot themselves, and their agent, in the foot.) Let us say we have biological plausibility, etc., etc. Let us say for reasons having to do with science fiction (and trying to keep my question completely statistical and logical) that these will be the only studies ever on their topic using this class of agents, and I need an option for patient care. So the answer for me can’t be, “Wait for confirmatory studies.”
I have heard off and on that, if a primary outcome is not statistically significant, you should just discount any statistically significant secondary outcomes. I have never been able to find or to conceptualize why this should be so. I found the following written by you.
“Designating primary outcome variables
When you need to examine many different outcome measures in a single research study, you still may be able to keep a narrow focus by specifying a small number of your outcome measures as primary variables. Typically, a researcher might specify 3-5 variables as primary. The fewer primary outcome variables, the better. You would then label as secondary those variables not identified as primary outcome variables.
“When you designate a small number of primary variables, you are making an implicit decision. The success or failure of your intervention will be judged almost entirely by the primary variables. If you find that none of the primary variables are statistically significant, then you will conclude that the intervention was not successful. You would still discuss any significant findings among your secondary outcome variables, but these findings would be considered tentative and would require replication.”
But I am not getting the why. And is this necessarily so? Read on. I’d be grateful if I could give you a couple of short scenarios.
Please keep in mind that my goals are as a reviewer of evidence (generally on efficacy or safety of therapies) and not as a researcher, so a helpful answer to me would be more in the nature of what I can use, if clinically meaningful, and not how I might redesign my study to make more sense. I am working with what’s out there and not creating new knowledge.
Background
Probably 99 percent of the time, the studies I review have a single primary outcome. The other 1 percent has 2. So I never see 3 to 5. But then I always see a multiplicity of outcomes defined as secondary, all of which seems somewhat arbitrary to me.
Scenario
I read a study comparing Drug X to placebo for prevention of cardiovascular events in type 1 diabetics.
The primary outcome is overall mortality.
Let us say that the researchers chose 4 secondary outcomes:
— death from stroke
—stroke
—death from MI
—MI
Let us assume that Drug X really works. Let us say that we have non-significant findings for overall mortality, which I realize could be a simple case of lack of power.
Let’s say that stroke and MI were statistically significant, favoring Drug X over placebo. Is it really true that you believe I should consider these findings tentative? I find it hard to think why that should be. They are related and the lack of significant mortality outcome could again be a power issue.
If I am correct that I can consider these clinically useful findings provided the effect size meets my minimum, then what about a scenario in which a researcher chose a really unlikely primary outcome (even a goofy one), but reasonable secondary outcomes? Setting aside the fact that such a choice would give me pause about the rigor of the study—setting this aside just to focus on statistical logic—what if in an otherwise valid study—
Drug Y versus Placebo
Clinical Question: Is Drug Y effective in weight reduction over placebo in women between the ages of 20 through 30?
Primary outcome:
—Death, not statistically significant
Secondary outcomes:
—Weight loss of > 10 pounds, statistically significant and clinically meaningful
—Clinically meaningful change in BMI, statistically significant and clinically meaningful
It seems to me that secondary outcomes should be able to be used with as much confidence as primary outcomes given certain factors such as attention to chance effects, relatedness of several outcomes, etc.
If I am wrong about this, can you enlighten me or steer me to some helpful resources?
Most gratefully yours, Sheri
And here is Steve's response:
http://www.pmean.com/news/201105.html#2 |
| Getting
“Had” by P-values: Confidence Intervals vs P-values
in Evaluating Safety Results: Low-molecular-weight Heparin (LMWH)
Example
In one of our DelfiniClicks
we have pointed out that confidence intervals (CIs) can be very
useful when examining results of randomized controlled trials
(Confidence-Intervals, Power
& Meaningful Clinical Benefit). The first step
in examining safety results is to decide what you consider to
be a range for clinically significant outcomes (i.e., the difference
between outcomes in the intervention and comparison group).
This is a judgment call. Then examine the 95% CI to see if a
clinically significant difference is included in the confidence
interval. If it is, the study has not excluded the possibility
of a clinically significant harm even if the authors state there
is no difference (usually stated as “no difference”
based on a non-significant p-value). It is important to remember
that a non-significant p-value can be very misleading in this
situation.
This can be illustrated
by an interesting conversation we recently had with an orthopedic
surgeon who felt he couldn’t trust the medical literature
to guide him because it gave him “misleading information.”
He based his conclusion on a study he read (he wasn’t
sure which study it was) regarding bleeding in orthopedic surgery.
After talking with him, we searched for studies that may have
led to his conclusion and found the following study which illustrates
why CIs are preferable to p-values in evaluating safety results
and possibly why he was misled.
Case
Study: An orthopedic surgeon reads an article comparing
outcomes, including bleeding rates, between fondaparinux and
enoxaparin in orthopedic surgery and sees the following statement
by the authors in the Abstract section of
the paper: “The two groups did not differ in frequency
of death or clinically relevant bleeding.” [1]
He looks at the
Results section of the paper and reads the
following: “The number of patients who had major bleeding
did not differ between groups (p=0.11).” He knows that
if the p-value is greater than 0.05, the differences are not
considered statistically significant, and he concludes that
there is no difference in bleeding between the groups. His
confidence is shaken when he switches to fondaparinux and
his patients experience increased postoperative bleeding.
Let’s evaluate
this study’s bleeding rates using confidence intervals.
One might reasonably conclude that an outcome of 1 percent or
more difference between the groups is, indeed, a clinically
meaningful difference in bleeding:
- The actual rates
for major bleeding were 47/1140 (4.1%) in the fondaparinux
group vs 32/1133 (2.8%) in the enoxaparin group, up to day
11, a difference of 1.3%, p=0.11.
- But CIs provide
more information: The absolute risk increase (ARI) with fondaparinux
was 1.3%, but the 95% CI was (-0.3% to 2.9%). The interval includes
zero, so the difference is not statistically significant, but since the
true difference could be as great as 2.9% (i.e., clinically
relevant) the authors’ conclusions are misleading.
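The interval can be approximated directly from the reported counts. The sketch below uses a simple Wald interval for a difference in proportions, which is our choice and not necessarily the method behind the figures above. It gives roughly -0.2% to 2.8%: the interval spans zero, consistent with p=0.11, and its upper bound is close to the 2.9% quoted above.

```python
from math import sqrt

def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Difference in event rates (a minus b) with a Wald 95% CI."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

# Major bleeding: fondaparinux 47/1140 versus enoxaparin 32/1133.
ari, lower, upper = risk_difference_ci(47, 1140, 32, 1133)
print(f"ARI = {ari:.1%}; 95% CI ({lower:.1%} to {upper:.1%})")

# A difference of 1 percentage point or more was taken as clinically meaningful.
print("Meaningful harm not excluded:", upper >= 0.01)
```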
The Cochrane Handbook
summarizes this problem nicely:
"A common
mistake when there is inconclusive evidence is to confuse
‘no evidence of an effect’ with ‘evidence
of no effect.’ When there is inconclusive evidence,
it is wrong to claim that it shows that an intervention has
‘no effect’ or is ‘no different’ from
the control intervention. It is safer to report the data,
with a confidence interval, as being compatible with either
a reduction or an increase in the outcome. When there is a
‘positive’ but statistically non-significant trend,
authors commonly describe this as ‘promising,’
whereas a ‘negative’ effect of the same magnitude
is not commonly described as a ‘warning sign.’
Authors should be careful not to do this." [2]
Comments:
Following the Lassen study referenced above, others confirmed
the increased bleeding rate leading to re-operation and other
significant bleeding with fondaparinux vs enoxaparin. [3]
Click here
for our
primer on confidence intervals.
When investigators
provide p-values but not confidence intervals, readers can quickly
calculate the 95% CIs if the outcomes are dichotomous and the
investigators report the actual rates of events, as in the example
above, by using the calculator available at:
http://www.graphpad.com/quickcalcs/NNT1.cfm
Also, see our web
links for other sources (search “confidence intervals”):
http://www.delfini.org/delfiniWebSources.htm
References:
1. Lassen MR, Bauer
KA, Eriksson BI, Turpie AG. Postoperative fondaparinux versus
preoperative enoxaparin for prevention of venous thromboembolism
in elective hip replacement surgery: a randomised double-blind
comparison. Lancet. 2002;359:1715-20. [PMID: 12049858]
2. Higgins JPT,
Green S, editors. 9.7 Common errors in reaching conclusions.
Cochrane Handbook for Systematic Reviews of Interventions
4.2.6 [updated September 2006]. http://www.cochrane.org/resources/handbook/hbook.htm
(accessed 22nd January 2008).
3. Vormfelde SV.
Comment on: Lancet. 2002 May 18;359(9319):1710-1. Lancet.
2002 Nov 23;360(9346):1701. [PMID: 12457831]
|
|
A Cautionary Tale of Harms versus
Benefits: Misleading Findings Due to Potentially Inadequate
Data Capture (Aprotinin Example)
05/22/08
Assessing safety
of interventions is frequently challenging for many reasons,
and it is made even more so when data is missing. It is easy
to draw conclusions about the clinical usefulness of new interventions
from studies that have only limited outcome measures without
noticing what is missing.
Aprotinin is a
recent example of a drug which was approved by the FDA on the
basis of reduced bleeding in coronary artery bypass graft (CABG)
surgery and which was quickly adopted by surgeons, but with
what now appears to be outcomes of greater harms than benefits.
There now appears to be increased mortality in patients receiving
aprotinin even though there is a decreased need for blood transfusions.
Aprotinin received
FDA approval in 1993 for use in CABG surgery to decrease blood
loss. However, observational studies in 2006 and 2007 reported
increased mortality with aprotinin [1,2]. A 2007 Cochrane Review
[3] of 211 RCTs reported that patients receiving aprotinin were
less likely to have red blood cell transfusions than were those
receiving the lysine analogues tranexamic acid (TXA) and epsilon
aminocaproic acid (EACA). When the pooled estimates from the
head-to-head trials of the two lysine analogues were combined
and compared to aprotinin alone, aprotinin appeared superior
in reducing the need for red blood cell transfusions: RR 0.83
(95% CI 0.69 to 0.99). The Cochrane review concluded that aprotinin
may be superior to the lysine analogues TXA and EACA in reducing
blood loss and the need for transfusion of red cells in patients
undergoing cardiac surgery. The Cochrane review, however, was
limited by inclusion of what appear to be studies with limited
or no mortality reporting.
In contrast, in
May 2008, the Blood Conservation Using Antifibrinolytics in
a Randomized Trial (BART) study [4], which compared massive postoperative
bleeding rates in patients treated with aprotinin versus those
treated with the lysine analogues tranexamic acid and aminocaproic
acid in patients undergoing high-risk cardiac surgery, reported
decreased massive bleeding but increased mortality in patients
receiving aprotinin. The trial was terminated early because
of a higher rate of death at 30 days in patients receiving aprotinin.
- 74 patients
(9.5%) in the aprotinin group had massive bleeding, as compared
with 93 (12.1%) in the tranexamic acid group and 94 (12.1%)
in the aminocaproic acid group (relative risk in the aprotinin
group for both comparisons, 0.79; 95% confidence interval
[CI], 0.59 to 1.05).
- At 30 days,
the rate of death from any cause was 6.0% in the aprotinin
group, as compared with 3.9% in the tranexamic acid group
(relative risk, 1.55; 95% CI, 0.99 to 2.42) and 4.0% in the
aminocaproic acid group (relative risk, 1.52; 95% CI, 0.98
to 2.36).
- The relative
risk of death in the aprotinin group, as compared with that
in both groups receiving lysine analogues, was 1.53 (95% CI,
1.06 to 2.22).
The authors concluded
that —
In summary, despite
the possibility of a modest reduction in the risk of massive
bleeding, the strong and consistent negative mortality trend
associated with aprotinin as compared with lysine analogues
precludes its use in patients undergoing high-risk cardiac
surgery.
Delfini
Comments
Given a potential
relative risk of roughly as high as 2 (meaning that those receiving
aprotinin may have as high as a roughly 2 times greater likelihood
of death than those receiving lysine analogues), it is likely
that aprotinin will no longer be used in high-risk and perhaps
all cardiac surgery based on the BART study because of what
appears to be increased mortality with aprotinin not seen with
the lysine analogues.
And so what possibly
explains this conflict in findings? While it is possible that
the results in the BART study are due to chance, that seems
unlikely given a) the previously observed findings in the 2006
and 2007 observational studies, and b) the consistency of results
in comparing aprotinin against each agent.
- The Cochrane
review of 113 studies, many of low quality, failed to detect
the increased mortality with aprotinin. It is not clear why
the systematic review did not detect the increased mortality
trend, but it may be explained by the Cochrane group’s
inclusion of studies not evaluating or incompletely reporting
mortality data.
- A lesson
from this is that pooling of data in secondary studies
may fail to identify important safety issues if the studies
are small or if outcomes are infrequent or insufficiently
reported.
- The aprotinin
story appears to be an example of how a large, well-designed
and well-conducted RCT that paid close attention to adverse events
identified a meaningful increase in mortality that a meta-analysis
of many small RCTs of variable quality did not detect. Small,
low-quality RCTs and meta-analyses of small, low-quality RCTs
may distort results because of various deficiencies and biases,
including absence of safety findings due to small sample size
or incomplete reporting of outcomes.
And so what can
a diligent reader do? Our advice is to consider carefully whether
primary and secondary outcomes in clinical trials are sufficient
to provide evidence regarding benefits and risks.
If outcome measures are few and are all from small studies or
meta-analyses of small studies, it is possible that clinically
important harms will not be detected. Uncertainty is reduced
when large RCTs confirming the results of earlier, smaller studies
become available or, as in the case of aprotinin, when
a large RCT identifies meaningful adverse events.
References
1. Mangano DT,
Tudor IC, Dietzel C. The risk associated with aprotinin in cardiac
surgery. N Engl J Med 2006;354:353-65.
2. Mangano DT,
Miao Y, Vuylsteke A, et al. Mortality associated with aprotinin
during 5 years following
coronary artery bypass graft surgery. JAMA 2007;297:471-9.
3. Henry DA, Carless
PA, Moxey AJ, et al. Anti-fibrinolytic use for minimising perioperative
allogeneic blood transfusion. Cochrane Database Syst Rev 2007;4:CD001886.
4. Fergusson DA,
Hébert PC, Mazer CD, et al. A comparison of aprotinin
and lysine analogues in high-risk cardiac surgery. N Engl J
Med 2008;358:2319-31. |
| Understanding
Number Needed to Treat (NNT)
We have found that it is very common for health care professionals
to not understand the steps in calculating NNT. Bandolier has
available on its website a classic article on NNT. We heartily
recommend reviewing this article if you have any questions or
uncertainties about what NNT means or how to calculate and use
NNT information.
http://www.jr2.ox.ac.uk/bandolier/booth/painpag/NNTstuff/numeric.htm |
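For readers who want the steps spelled out, here is a minimal sketch of the usual NNT calculation: take the absolute risk reduction (control event rate minus treated event rate), invert it, and round up to a whole patient. The event counts are hypothetical and chosen only for illustration; the function name is our own and does not come from the Bandolier article.

```python
from fractions import Fraction
import math

def number_needed_to_treat(control_events, control_total, treated_events, treated_total):
    """Return (absolute risk reduction, NNT rounded up to a whole patient)."""
    control_rate = Fraction(control_events, control_total)
    treated_rate = Fraction(treated_events, treated_total)
    arr = control_rate - treated_rate  # absolute risk reduction
    if arr <= 0:
        raise ValueError("No absolute risk reduction, so NNT is not defined")
    return arr, math.ceil(1 / arr)

# Hypothetical trial: 20 of 100 control patients and 15 of 100 treated patients have the event.
arr, nnt = number_needed_to_treat(20, 100, 15, 100)
print(f"ARR = {float(arr):.1%}, NNT = {nnt}")
```

Exact fractions are used so that the rounding step is not thrown off by floating-point error. An NNT applies only to the time period and baseline risk of the study it comes from.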
Early
Discontinuation of Clinical Trials: Oncology Medication Studies—Recent
Developments and Concern
04/28/08
With the trend
for more rapid approval of oncology drugs has come concern regarding
the validity of reported results because of methodological problems.
Validity and usefulness of reported results from oncology (and
other) studies are clearly threatened by lack of randomization,
blinding, the use of surrogate outcomes and other methodological
problems. Trotta et al. have extended this concern in a recent
study that highlights an additional problem with oncology studies: stopping
oncology trials early [1. Trotta F, Apolone G, Garattini S,
Tafuri G. Stopping a trial early in oncology: for patients or
for industry? Ann Oncol. 2008 Apr 9 [Epub ahead of print] PMID:
18304961]. The aim of the study was to assess the use of interim
analyses in randomized controlled trials (RCTs) testing new
anticancer drugs, focusing on oncological clinical trials stopped
early for benefit. A second aim was to estimate how often trials
prematurely stopped as a result of an interim analysis are used
for registration, i.e., approval by the European Medicines Agency
(EMEA), the European equivalent of FDA approval. The authors
searched Medline, hand-searched The Lancet, The
New England Journal of Medicine, and The Journal of Clinical
Oncology, and evaluated all clinical trials stopped
early for benefit that were published in the last 11 years. The focus
was on trials of anticancer drugs that contained an interim analysis.
Results
and Authors’ Conclusions
Twenty-five RCTs were analyzed. In 95% of studies, at the interim
analysis, efficacy was evaluated using the same end point as
planned for the final analysis. The authors found a consistent
increase (>50%) in prematurely stopped trials in oncology
during the last 3 years. As a consequence of early stopping
after the interim analysis, approximately 3,300 patients/events
across all studies were spared potential harms of continued
therapy. This may appear to be clearly beneficial, but as the
authors point out, stopping a trial early does not guarantee
that other patients will receive the apparent benefit of stopping,
assuming one exists, unless study findings are immediately publicly
disseminated. The authors found long delays between study termination
and published reports (approximately 2 years). If the trials
had continued for these further 2 years, more efficacy and safety
data could have been gathered. Delays in reporting results further
lengthen the time needed for translating trial findings into
practice.
Surprisingly, there
was a very small percentage of trials (approximately 4%) stopped
early because of harms, i.e. serious adverse events. Therefore,
toxicity does not represent the main factor leading to early
termination of trials. Of the 25 trials, six had no data and
safety monitoring board (DSMB) and five had enrolled less than
40% of the planned sample size. Even so, 11 were used to support
licensing applications on the basis of what could have been
exaggerated chance events. Thus, more than 78% of the oncology
RCTs published in the last 3 years were used for registration
purposes. The authors argue that only untruncated trials can
provide a full level of evidence which might be useful for informing
clinical practice decisions without further confirmative trials.
They concluded that early termination may be done for ethical
reasons such as minimizing the number of people given an unsafe,
ineffective, or clearly inferior treatment. However, interim
analyses may also have drawbacks, since stopping trials early
for apparent benefit will systematically overestimate treatment
effects [2. Pocock SJ. When (not) to stop a clinical trial for
benefit. JAMA 2005; 294: 2228–2230. PMID: 16264167], and early
stopping raises new concerns about what they describe as “market-driven
intent.” Some additional key points made by the authors:
- Repeated interim
analyses at short intervals raise concern about data reliability:
this strategy risks the appearance of seeking the statistical
significance necessary to stop a trial;
- Repeated analyses
on the same data pool often lead to statistically significant
results only by chance;
- If a trial is
evaluating the long-term efficacy of a treatment for conditions
such as cancer, short-term benefits — no matter how
significant statistically — may not justify early stopping.
Data on disease recurrence and progression, drug resistance,
metastasis, or adverse events could easily be missed. Early
stopping may reduce the likelihood of detecting a difference
in overall survival (the only relevant endpoint in this setting).
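The point about repeated analyses producing significant results by chance can be illustrated with a small simulation. This is a sketch under simplifying assumptions of our own, not the method used by Trotta et al.: there is no true treatment effect, data arrive in equally sized batches, and each interim look applies an unadjusted two-sided threshold of 1.96 (nominal p < 0.05). The function name and parameters are illustrative.

```python
import math
import random

def false_positive_rate(n_looks, n_trials=20000, z_crit=1.96, seed=0):
    """Fraction of simulated no-effect trials that cross z_crit at any look."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_trials):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each look adds one equally sized batch of data; with no true
            # effect its standardized contribution is a standard normal draw.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                crossed += 1
                break  # the trial would be stopped at this look
    return crossed / n_trials

single_look = false_positive_rate(1)
five_looks = false_positive_rate(5)
print(f"One final analysis:        {single_look:.3f}")
print(f"Five unadjusted analyses:  {five_looks:.3f}")
```

With a single analysis the false-positive rate stays near the nominal 5%, while five unadjusted looks push it well above 10%. Formal stopping boundaries exist to control this inflation, which is one reason the absence of a data and safety monitoring board in some of the trials is a concern.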
The authors conclude
that:
…a decision
whether to stop a clinical trial before its completion requires
a complex of ethical, statistical, and practical considerations,
indicating that results of RCTs stopped early for benefit
should be viewed with criticism and need to be further confirmed.
The main effect of such decisions is mainly to move forward
to an earlier-than-ideal point along the drug approval path;
this could jeopardise consumers’ health, leading
to unsafe and ineffective drugs being marketed and prescribed.
Even if well designed, truncated studies should not become
routine. We believe that only untruncated trials can provide
a full level of evidence which can be translated into clinical
practice without further confirmative trials.
Lancet
Comment
In a Lancet editorial on April 19, 2008, the editorialist states
that early stopping of RCTs should require proof beyond reasonable
doubt that equipoise no longer exists. Data safety and monitoring
boards must balance the decision to stop, which favors immediate
stakeholders (participants, investigators, sponsors, manufacturers,
patients’ advocates, and editors), against continuing the
study to obtain more accurate estimates not only of effectiveness
but also of longer-term safety. In judging whether or
not to stop a trial early for benefit, the plausibility of the
findings and their clinical significance are as important as
statistical boundaries.
Delfini
Comments
Overall we are concerned about the FDA’s loosening of
standards for accepting oncology study data as valid when it
comes from studies that many would judge to be fatally flawed.
These studies are likely to overstate
clinical advantages because of falsely inflated results.
We are seeing more oncology medications with FDA approval based
on observational studies.
The trend towards early stopping of studies in many instances
represents yet another step towards acceptance of low quality
oncology studies.
- We believe that
—
- Oncologists
may not be aware of the threats to validity in many of
the newest oncology medication studies and may develop unwarranted
enthusiasm for unproven, possibly harmful new agents.
- Patients should
receive complete information about the risks of distorted
study results when low quality studies are used to inform
decisions that entail unproven benefits and significant
potential risks.
- We agree that
in most studies, the benefits of longer follow-up with more
accurate assessment of outcomes including more complete assessments
of adverse events will provide a greater likelihood of deriving
valid, useful information for informing clinical decisions.
References
- Trotta F, Apolone
G, Garattini S, Tafuri G. Stopping a trial early in oncology:
for patients or for industry? Ann Oncol. 2008 Apr 9 [Epub
ahead of print] PMID: 18304961.
- Pocock SJ. When
(not) to stop a clinical trial for benefit. JAMA 2005; 294:
2228–2230. PMID: 16264167.
|
Advanced
Concepts: Can Useful Information Be Obtained From Studies With
Significant Threats To Validity? A Case Study of Missing Data
Points in Venous Thromboembolism (VTE) Prevention Studies &
A Case Study of How Evidence from One Study Might Support Conclusions
from a Flawed Study
09/02/09
We approach
critical appraisal of the medical literature by applying critical
appraisal concepts coupled with critical thinking. This requires
a movement from the general to the very particular circumstances
before us. Paraphrasing (is it Voltaire?), along
with one of our favorite medical leaders, Dr. Tim Young:
“It is important to keep perfection from being the enemy
of the good.” Ultimately, the goal of doing critical appraisal
work is not to “pass” or “fail” studies,
but to help predict, for example, the effect of an intervention
on outcomes of interest, based on what we can glean as possible
explanations for the observed outcomes.
Understanding
critical appraisal concepts is necessary to conceive of possible
explanations. Critical appraisal is not and should not be a
mere recording of tick marks on a checklist or earning points
on a quality scale. Despite attempts of various groups to do
so, we maintain that the reliability and usefulness of a study
cannot be “scored” through such a system. Moher
has pointed out a number of shortcomings of “scales”
designed to quantitate the likelihood of freedom from bias in
clinical trials.[1]
It requires
reflective thought to determine why we might see a particular
outcome — and this is wholly dependent upon a variety
of factors including the population, the circumstances of care,
threats to validity in applying critical appraisal concepts
and more. It is also important to keep in mind that failing
a critical appraisal screening does NOT mean something doesn’t
work. Furthermore, it is important to understand that sometimes
— despite “failing” a critical appraisal for
research design, execution or reporting — a study will,
in fact, give us evidence that is reasonable to rely upon. Our
Venous Thromboembolism (VTE) Prevention story is a case in point.
Recently
we assisted Kaiser Permanente Hawaii in developing a standardized,
evidence-based VTE prophylaxis guideline for the known high-risk
total knee and hip replacement population.
Our key
questions were as follows:
- What
is the evidence that thromboembolism or DVT prophylaxis with
various agents reduces mortality and clinically significant
morbidity in hip and knee replacement surgery?
- What
is the evidence regarding timing (starting and duration) of
anticoagulant prophylaxis for appropriate agents when used
for prevention of thromboembolism in hip and knee replacement
surgery?
- What
is the evidence regarding bleeding from thromboembolism prophylaxis
with the various appropriate agents?
There are
several interesting lessons this project taught us about applying
general critical appraisal concepts to individual trials and
keeping one’s eye focused on the true goal behind the
concepts. Firstly, in much of the literature on VTE and DVT
prophylaxis, the rates for missing venogram data are very high
— frequently this is as high as 40 percent. Delfini’s
stance on missing data is that even a small drop-out rate or
percent of missing data can threaten validity.[2,3] But it is
the reason for the missing data rates that is truly what matters.
A fundamental issue in critical appraisal of clinical trials
is that there should be no difference between the groups being
studied other than the intervention, since any other difference may
account for the difference in outcomes.
As stated
above, in examining multiple studies of VTE prophylaxis in THR
and TKR surgery, we found a high percentage of studies had missing
venogram information. It appears that patients and their clinicians
frequently chose to omit the final venogram despite a study
protocol requiring a venogram for assessing DVT rates. From
a clinical standpoint and a patient perspective, this makes
perfect sense. For example, most patients in the study will
be asymptomatic and there are risks associated with the procedure.
In addition, undergoing a venogram is inconvenient (e.g., creating
a delay in hospital discharge).
So the key
question becomes — do the groups differ with respect to
the missing data? Success of concealed allocation, blinding
and comparable rates of missing data are all validity detection
clues to help ensure it is unlikely that the groups were different
or treated differently. In our review of the data, we think
that it may be reasonable to conclude that a decision to have
a final venogram was independent of anything about the interventions
and prognostic variables in the two groups and unlikely to be
the factor responsible for different DVT rates in the groups.
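One simple way to check whether missing-data rates are comparable between groups is a two-proportion z-test. The sketch below uses hypothetical counts (they are not from any of the VTE studies we reviewed), and the function name is illustrative. Similar rates are only one clue: they do not show that the reasons for missing venograms were the same in both groups.

```python
import math

def compare_missing_rates(missing_a, n_a, missing_b, n_b):
    """Two-sided two-proportion z-test on the share of patients with missing data."""
    rate_a = missing_a / n_a
    rate_b = missing_b / n_b
    pooled = (missing_a + missing_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Hypothetical example: 80 of 200 venograms missing in one group, 76 of 200 in the other.
rate_a, rate_b, z, p_value = compare_missing_rates(80, 200, 76, 200)
print(f"Missing: {rate_a:.0%} vs {rate_b:.0%}, z = {z:.2f}, p = {p_value:.2f}")
```

In this made-up case the rates (40% versus 38%) are statistically indistinguishable, which would be consistent with, though not proof of, the decision to skip the venogram being unrelated to group assignment.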
A different,
yet interesting, challenge with the Westrich study revolved
around the scientific evidence on compression devices.[4] This
study reported the overall incidence of deep vein thrombosis
(DVT) in total knee replacement (TKR) surgery with
mechanical compression plus enoxaparin versus mechanical compression
plus aspirin (ASA). Our original grading of this study (partly
due to problems in study reporting) was Grade U: Uncertain Validity.
Delfini almost never utilizes a Grade U study for questions
of efficacy. [NB: Following discussions with the author, clarifying
certain questions, the study was upgraded to Grade BU: Possible
to uncertain usefulness using the Delfini
Evidence Grading System.] However, upon careful review and
reasoning, and armed with evidence from another Grade BU study,
Haas, which studied aspirin alone for VTE prophylaxis, our team
was able to deduce that the results of Westrich were likely
to be valid and useful.[5]
Here is
our summation of the Westrich results:
Westrich 06 (Grade BU) reported overall DVT rates in TKR surgery
of 14.1% in the mechanical compression and enoxaparin group
versus 17.8% in the mechanical compression and ASA group;
ARR 1.36%; 95% CI (-6.83% to 9.55%);
p=0.27. Rates in both groups are significantly lower than the
41% to 85% DVT incidence rates reported in the literature for
no VTE prophylaxis and the reported distal DVT rate of 47% (Haas
90) for aspirin alone.
- Mechanical
compression was initiated in the recovery room; 325 mg of
enteric-coated aspirin twice daily was started the night prior
to surgery; enoxaparin was started ~48 hours after removal
of the epidural catheter.
Here is
our reasoning as to why the Westrich results are likely to be
reliable:
- The Haas
study provided information about the rates of DVT likely to
be expected through use of aspirin (reported DVT rate of 47%).
DVT rates in the Westrich study groups (14.1% and 17.8%) were
dramatically better than what one would expect from aspirin
alone. Even after taking into account differences in the subjects
and other care in the two studies, the differences in DVT rates
between the two studies remain extremely large.
- In Westrich,
mechanical compression was used on both lower extremities.
Therefore, the difference between the two groups was likely
to be enoxaparin versus ASA.
- In Westrich,
the incidence rate of DVT in both groups was less than would
be expected based on the DVT rates reported in the Haas study
in which the intervention was aspirin versus mechanical compression.
Therefore, we considered that it was reasonable to conclude
that mechanical devices provide significant benefit in preventing
DVT since that would appear to explain the much lower incidence
rates of DVT in both Westrich study groups.
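The size of the gap behind this reasoning can be made concrete with simple arithmetic on the rates quoted above. This is an informal cross-study comparison, not a randomized one, and the Haas figure is the reported distal DVT rate, so the numbers should be read as rough context only. Variable names are illustrative.

```python
# DVT rates (percent) as quoted in the summary above.
haas_aspirin_alone = 47.0
westrich_groups = {
    "compression + enoxaparin": 14.1,
    "compression + aspirin": 17.8,
}

gaps = {}
for group, rate in westrich_groups.items():
    difference = haas_aspirin_alone - rate        # percentage points
    relative = difference / haas_aspirin_alone    # fraction of the Haas rate
    gaps[group] = (difference, relative)
    print(f"{group}: {difference:.1f} points lower ({relative:.0%} of the aspirin-alone rate)")
```

Both Westrich groups sit roughly 30 percentage points below the aspirin-alone rate, a gap large enough that differences in patients and co-interventions between the two studies are unlikely to explain it entirely.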
At times
it makes sense to grade individual study conclusions. Documentation
of reasons is always important and required as good evidence-based
practice.
Bottom
Line: It is important to understand critical appraisal
concepts, and it is important to critically appraise studies.
The goal, however, is getting close to truth. Doing so requires
critical thinking as well about the unique circumstances of
each study and each study topic.
References
1. Moher D, Jadad AR, Nichol G, Penman M, Tugwell P, Walsh S.
Assessing the quality of randomized controlled trials: an annotated
bibliography of scales and checklists. Control Clin Trials.
1995 Feb;16(1):62-73. PMID: 7743790
2. Strite
SA, Stuart ME, Urban S. Process steps and suggestions for creating
drug monographs and drug class reviews in an evidence-based
formulary system. Formulary. April 2008;43:135–145.
3. Delfini
Group White Paper — Missing Data: Considerations
4. Westrich
GH, Bottner F, Windsor RE, Laskin RS, Haas SB, Sculco TP. VenaFlow
plus Lovenox vs VenaFlow plus aspirin for thromboembolic disease
prophylaxis in total knee arthroplasty. J Arthroplasty. 2006
Sep;21(6 Suppl 2):139-43. PMID: 16950076
5. Haas
SB, Insall JN, Scuderi GR, et al. Pneumatic sequential compression
boots compared with aspirin prophylaxis of deep-vein thrombosis
after total knee arthroplasty. J Bone Joint Surg Am 1990; 72:27–31.
PMID: 2404020
|
|
|