DelfiniClick

Quality of Evidence:
Primary Studies & General Concepts

Contents      

  • Quality of Studies: Lower Quality = Greater Effect Size »   
    • More: Overestimation of Effect Size in Studies of Low Quality »
  • Concealment of Allocation »
  • Blinding and RCTs »
  • Blinding in Surgery Trials »
  • The Importance of Blinded Assessors in RCTs »
  • Attrition Bias: Intention-to-Treat Basics »
  • Intention-to-Treat Analysis & Censoring: Rofecoxib Example »
  • Intention-to-Treat Analysis: Misreporting and Migraine »
  • Missing Data Points: Difference or No Difference »
  • Quality of Studies: VIGOR »
  • Confidence-Intervals, Power & Meaningful Clinical Benefit »
  • Getting “Had” by P-values: Confidence Intervals vs P-values in Evaluating Safety Results: Low-molecular-weight Heparin (LMWH) Example »
  • Understanding Number Needed to Treat (NNT) »
  • newest Early Discontinuation of Clinical Trials: Oncology Medication Studies—Recent Developments and Concern »
Quality of Studies: Lower Quality = Greater Effect Size

The quality of studies in systematic reviews and meta-analyses has repeatedly been shown to affect the amount of benefit reported. This DelfiniClick is a quick reminder that just because a study is an RCT does not mean it will provide you with a reliable estimate of effect size. A nice illustration of this point is provided in a classic article by Moher D et al. (Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998; 352: 609–13).

In this study, the authors randomly selected 11 meta-analyses that involved 127 RCTs on the efficacy of interventions used for circulatory and digestive diseases, mental health, pregnancy and childbirth. The authors evaluated each RCT by examining the description of randomization, allocation concealment, blinding, drop outs and withdrawals.

The results, reported below as relative measures, are in line with other authors’ findings on how the quality of methods relates to the amount of benefit (effect size) reported:

  • The quality of trials was low overall.
  • Low-quality trials, compared with high-quality trials (score >2), were associated with an increased estimate of benefit of 34%.
  • Trials that used inadequate allocation concealment, compared with those that used adequate methods, were also associated with an increased estimate of benefit (37%).
  • The average treatment benefit was 39% for all trials, 52% for low-quality trials, and 29% for high-quality trials.

The authors conclude that studies of low methodological quality in which the estimate of quality is incorporated into the meta-analyses can alter the interpretation of the benefit of the intervention.

We continue to see this problem in systematic reviews and clinical guidelines and suggest that when evaluating secondary studies readers pay close attention to the quality of included studies.

Overestimation of Effect Size in Studies of Low Quality

In a previous DelfiniClick, we summarized an article by Moher and colleagues (1) in which the authors randomly selected 11 meta-analyses involving 127 RCTs that evaluated the efficacy of interventions used for circulatory and digestive diseases, mental health, pregnancy and childbirth. Moher and colleagues concluded that:

  • Low-quality trials, compared with high-quality trials (score >2), were associated with a relative increased estimate of benefit (34%).
  • Trials that used inadequate allocation concealment, compared with those that used adequate methods, were associated with a relative increased estimate of benefit (37%).

Below we summarize another study that confirms and expands Moher’s findings. In a study similar to Moher’s, Kjaergard and colleagues (2) evaluated the effects of methodologic quality on estimated intervention effects in randomized trials.

The study evaluated 23 large and 167 small randomized trials and a total of 136,164 participants. Methodologic quality was defined as the confidence that the trial’s design, conduct, analysis, and presentation minimized or avoided biases in the trial’s intervention comparisons (3). The reported methodologic quality was assessed using four separate components and a composite quality scale.

The quality score was ranked as low (≤2 points) or high (≥3 points), as suggested by Moher et al. (1). The four components were 1) generation of allocation sequence; 2) concealment of allocation; 3) double-blinding; and 4) reporting of loss-to-follow-up.

RESULTS OF KJAERGARD ET AL’S REVIEW (all reported exaggerations are relative increases):

Generation of Allocation Sequence
The odds ratios generated by all trials (large and small) with inadequate generation of the allocation sequence were on average significantly exaggerated by 51% compared with all trials reporting adequate generation of allocation sequence (ratio of odds ratios (95% CI) = 0.49 (0.30–0.81), P <0.001).

Concealment of Allocation
All trials with inadequate allocation concealment exaggerated intervention benefits by 40% compared with all trials reporting adequate allocation concealment (ratio of odds ratios (95% CI) = 0.60 (0.31–1.15), P = 0.12). Odds ratios were significantly exaggerated by 52% in small trials with inadequate versus adequate allocation concealment (ratio of odds ratios (95% CI) = 0.48 (0.25–0.92), P = 0.027).

Double Blinding
The odds ratios generated by all trials without double blinding were significantly exaggerated by 44% compared with all double-blind trials (ratio of odds ratios (95% CI) = 0.56 (0.33–0.98), P = 0.041).

Reporting of Loss-to-Followup
The analyses showed no significant association between reported follow-up and estimated intervention effects (ratio of odds ratios (95% CI) = 1.50 (0.80–2.78), P = 0.2).
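
The percentages in this summary follow directly from the ratios of odds ratios: a ratio of 0.49 corresponds to a relative exaggeration of 1 − 0.49 = 51%, and a 95% confidence interval that includes 1 means the exaggeration was not statistically significant. The short sketch below restates that arithmetic with the figures quoted in this summary. It is an illustration only, not the authors' analysis; the function names are our own, and it assumes, as in this review, that a ratio below 1 means the lower-quality trials showed the larger benefit.

```python
def percent_exaggeration(ratio_of_odds_ratios):
    """Relative exaggeration (%) implied by a ratio of odds ratios below 1."""
    return round((1 - ratio_of_odds_ratios) * 100)


def ci_excludes_one(lower, upper):
    """True when the 95% CI does not contain 1, the no-difference value."""
    return not (lower <= 1 <= upper)


# (ratio of odds ratios, CI lower bound, CI upper bound) as quoted in the text
reported = {
    "allocation sequence (all trials)": (0.49, 0.30, 0.81),
    "allocation concealment (all trials)": (0.60, 0.31, 1.15),
    "allocation concealment (small trials)": (0.48, 0.25, 0.92),
    "double blinding (all trials)": (0.56, 0.33, 0.98),
}

for label, (ratio, lower, upper) in reported.items():
    verdict = "significant" if ci_excludes_one(lower, upper) else "not significant"
    print(f"{label}: exaggerated by {percent_exaggeration(ratio)}%, {verdict}")
```

The same interval check applied to the loss-to-follow-up result, 1.50 (0.80–2.78), shows an interval that includes 1, consistent with the reported lack of a significant association.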

Kjaergard and Colleagues’ Conclusions

  1. Adequate generation of the allocation sequence and adequate allocation concealment should be required for adequate randomization.
    Unlike previous investigators (1,3,4, 5), the authors found that trials with inadequate generation of allocation sequence exaggerate intervention effects significantly.
  2. Trials with inadequate allocation concealment also generate exaggerated results.
    This is in accordance with previous evidence (1,3,5). The authors found that despite the considerable overlap between generation of allocation sequence and allocation concealment, both factors may independently affect the estimated intervention effect.
  3. Trials without double blinding exaggerate results.
    This study supports Schulz and colleagues’ finding of a significant association between intervention effects and double blinding and extends the evidence by including trials from several therapeutic areas.
  4. There was no association between reported follow-up and intervention effect.

Delfini Comment
It is useful to know quantitatively how various threats to validity affect results when doing critical appraisal of a study. The study by Kjaergard and colleagues summarized above expands the findings of Schulz, Moher, Juni and others.

Previous studies have questioned the reliability of reported losses to follow-up (5, 6). In accordance with Schulz and colleagues’ results (5), the authors found no association between intervention effects and reported follow-up.
Delfini Note: We have found that losses to follow-up may significantly affect P values when sensitivity analysis is done. We consider loss of ≥5% with differential loss or ≥10% without differential loss to be an important threat to validity.
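
The rule of thumb in the note above can be written as a small check. This is a minimal sketch with hypothetical names; the note does not give a numeric definition of "differential loss," so the judgment about whether loss differs between groups is left to the appraiser and passed in as a flag.

```python
def attrition_is_important_threat(randomized, lost, differential_loss):
    """Apply the >=5% (differential) / >=10% (non-differential) rule of thumb.

    randomized        -- number of subjects randomized
    lost              -- number lost to follow-up or not completing the study
    differential_loss -- appraiser's judgment that loss differs between groups
                         (no numeric cut-off is assumed here)
    """
    if randomized <= 0 or not 0 <= lost <= randomized:
        raise ValueError("counts must satisfy 0 <= lost <= randomized, randomized > 0")
    threshold = 0.05 if differential_loss else 0.10
    return lost / randomized >= threshold


# Hypothetical trial: 200 randomized, 14 lost (7%)
print(attrition_is_important_threat(200, 14, differential_loss=True))
print(attrition_is_important_threat(200, 14, differential_loss=False))
```

In this hypothetical trial, a 7% loss is flagged only if the loss is judged to be differential.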

In agreement with the findings of Moher and associates (1,3) and Juni and colleagues (7), the authors found that trials with a low quality score on the scale developed by Jadad and colleagues (8) significantly exaggerate intervention benefits.

Kjaergard and colleagues conclude that assessment of methodologic quality should focus on generation of allocation sequence, allocation concealment, and double blinding. Delfini considers this insufficient, but appreciates this study as one that further demonstrates the importance of effective approaches to some of these methodologic areas.

References
1. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352:609-13. [PMID: 9746022]

2. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med. 2001;135:982-9.

3. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess. 1999;3:i-iv, 1-98. [PMID: 10374081]

4. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Control Clin Trials. 1990;11:339-52. [PMID: 1963128]

5. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-12. [PMID: 7823387]

6. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Control Clin Trials. 1989;10:31-56. [PMID: 2702836]

7. Juni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282:1054-60. [PMID: 10493204]

8. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1-12. [PMID: 8721797]

Concealment of Allocation

In 1996, the CONSORT statement encouraged the reporting of concealment of allocation. Concealment of allocation is the process by which patients are actually assigned to their study group without the upcoming assignment being revealed to those recruiting them. Hewitt et al., in a recent issue of BMJ, reviewed the prevalence of adequate concealment of allocation in four journals: BMJ, Lancet, JAMA and NEJM (Hewitt C et al. BMJ 2005;330:1057-1058. PMID: 15760970). They scored the allocation as adequate (i.e., the subject recruiter was a different person from the person executing the allocation sequence), inadequate or unclear. Sealed envelopes were considered inadequate unless performed by an independent third party.

Results
Studies included: 234
Adequate concealment: 132 (56%)
Inadequate concealment: 41 (18%)
Unclear concealment: 61 (26%)


Delfini Commentary
The authors point out that previous studies have found an association between inadequate concealment and the reporting of significant results. Of interest is that studies included in this review with inadequate concealment tended to report a significant result more often, although the confidence interval includes 1, so this difference was not itself statistically significant (OR 1.8; 95% CI, 0.8 to 3.7).
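
The confidence interval quoted above (0.8 to 3.7) includes 1, which is worth keeping in mind when reading the odds ratio of 1.8. When a paper reports a ratio and its 95% CI but no P value, an approximate P value can be recovered with a standard normal approximation on the log scale. The sketch below shows that calculation; it is not taken from Hewitt et al., the function name is our own, and it assumes the interval is symmetric on the log scale.

```python
import math


def p_value_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided P value for a ratio (OR, RR) from its 95% CI.

    Assumes the CI was built as exp(log(estimate) +/- z_crit * SE).
    """
    standard_error = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / standard_error
    # Two-sided tail area of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))


p = p_value_from_ratio_ci(1.8, 0.8, 3.7)
print(f"approximate two-sided P = {p:.2f}")
```

For the figures above this gives a P value of roughly 0.13, in keeping with an interval that crosses 1.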

This is another study suggesting that the critical appraisal of RCTs is “critical” and that lower quality studies are more likely to report significant benefit than are higher quality studies.

Blinding and RCTs

A recent article provides a great deal of useful information about blinding in research studies, along with a way of classifying it: Boutron I, Estellat C, Guittet L, Dechartres A, Sackett DL, et al. (2006) Methods of blinding in reports of randomized controlled trials assessing pharmacologic treatments: A systematic review. PLoS Med 3(10): e425. DOI: 10.1371/journal.pmed.0030425. The authors evaluated blinding in RCTs of pharmacologic treatment published in 2004 in high impact-factor journals. The following are some key points from the article:

• The authors identified 819 reports with about 60% describing the method of blinding. The classification identified three main methods of blinding:
(1) methods to provide identical treatments in both arms,
(2) methods to avoid unblinding during the trial, and
(3) methods of blinded outcome assessment.


• ESTABLISHING BLINDING OF PATIENTS AND PROVIDERS: Of the 819 reports, 472 [58%] described the method of blinding, 236 [29%] gave no detail, and 111 [13%] gave only some data on blinding (i.e., reporting that treatments were similar or that double dummies were used, with no description of the method). The methods of blinding identified varied in complexity. The authors reported use of a centralized preparation of similar capsules, tablets, or embedded treatments in hard gelatin capsules (193/336 [57%]), similar syringes (37/336 [11%]), or similar bottles (38/336 [11%]). Use of a double dummy procedure was described in 79 articles (23%). Other methods consisted of a sham intervention performed by an unblinded health care provider who was not actively involved in the care of patients and had no other contact with patients or other caregivers and outcome assessors (17/336 [5%]). To mask the specific taste of the active treatments, researchers in ten articles used a specific flavor such as peppermint or sugar to coat treatments. For treatments administered by care providers, authors reported use of a centralized preparation of opaque coverage to adequately conceal intravenous treatments with different appearances (14/336 [4%]).

• AVOIDING UNBLINDING OF PATIENTS AND PROVIDERS: Only 28/819 [3%] reported methods to avoid unblinding. Methods to blind dosage adaptation relied on use of a centralized adapted dosage or provision of sham results of complementary investigations for treatments necessitating dosage adaptation. Methods to avoid unblinding because of side effects relied mainly on centralized assessment of side effects, partial information to patients about side effects, use of active placebo or systematic prevention of adverse effects in both arms.

• BLINDING ASSESSORS: These methods depend on the main outcomes and are particularly important when blinding cannot be established and maintained by the methods described above. A total of 112 articles [14%] described these methods, which relied mainly on a centralized assessment of the main outcome. Blinding of outcome assessors is presumably achieved if neither patients nor those involved in the trial have any means to discover which arm a patient is in, for example because the placebo and active drugs are indistinguishable and allocation is via a central randomization service. Of the 112 reports in which specific measures to blind the outcome assessor were reported, 96 (86%) concerned trials in which patients were reported as blinded or in which double blinding or triple blinding was reported. These results suggest that, although blinding was performed at an earlier stage, the investigators nevertheless decided to perform a specific method of blinding the outcome assessor.

• AUTHORS’ COMMENTS AND CONCLUSIONS:
• Although blinding is essential to avoid bias, the reporting of blinding is generally quite poor and reviews of trials that test the success of blinding methods indicate that a high proportion of trials are unblinded.

• The study results might be explained in part by the insufficient coverage of blinding in the Consolidated Standards for Reporting Trials (CONSORT) statements. For example, three items of the CONSORT statements are dedicated to the description of the randomization procedure, whereas only one item is dedicated to the blinding issue. The CONSORT statements mainly focus on reporting who is blinded and less on the reporting of details on the method of blinding, and this information is essential to appraise the success of blinding.

• Some evidence suggests that although participants are reported as blinded, the success of blinding might be questionable. For instance, in a study assessing zinc treatment for the common cold, the blinding procedure failed, because the taste and aftertaste of zinc was distinctive. And yet, tools used to assess the quality of trials included in meta-analyses and systematic reviews focus on the reporting of the blinding status for each participant and rarely provide information on the methods of blinding and the adequacy of the blinding method.

• There is a need to strengthen the reporting guidelines related to blinding issues, emphasizing adequate reporting of the method of blinding.

Delfini Commentary
Lack of blinding appears to be a major source of bias in RCTs. Just as well-done randomization and concealment of allocation to the study groups decrease the likelihood of selection bias, blinding of subjects and everyone working with the subjects or study data to the assigned intervention (double-blinding) decreases the likelihood of performance bias. Performance bias occurs when patients in one group experience care or exposures not experienced by patients in the other group(s) and the differences in care affect the study outcomes. Lack of blinding may affect outcomes in that:

  • Unblinded subjects may report outcomes differently from blinded subjects, have different thresholds for leaving a study, and seek (and possibly receive) additional care in different ways.
  • Unblinded clinicians may behave differently towards patients than blinded clinicians.
  • Using unblinded assessors may result in systematic differences in outcomes assessment (assessment bias).

A number of studies have shown that lack of blinding is associated with inflated treatment effects.

In some cases blinding may not be possible. For example, side effects or taste may result in unblinding. The important point is that even if blinding is not possible, the investigators do not get “extra” validity points for doing the best they could (i.e., the study should not be “upgraded”).

Blinding In Surgical Trials — It is Through Blinding We Become Able To See

Blinding is an important consideration when evaluating a study. Without blinding, the likelihood of bias increases. Bias occurs when patients in one group experience care or exposures not experienced by patients in the other group(s), and the differences in care affect the study outcomes. Lack of blinding may be a major source of this type of bias in that unblinded clinicians, who are frequently “rooting for the intervention,” may behave differently than blinded clinicians towards patients whom they know to be receiving the study drug or intervention being studied. The result is likely to be that in unblinded studies, patients may receive different or additional care. Unblinded subjects may be more likely to drop out of a study or seek care in ways that differ from blinded subjects. Unblinded assessors may also be “rooting for the intervention” and assess outcomes differently from blinded assessors.

How much difference does blinding make? Jüni et al. reviewed four studies that compared double-blinded versus non-blinded RCTs and attempted to quantify the amount of distortion (bias) caused by lack of double blinding [1]. Overall, the overestimation of effect was about 14%. The largest study reviewed by Jüni assessed the methodological quality of 229 controlled trials from 33 meta-analyses and then analyzed, using multiple logistic regression models, the associations between those assessments and estimated treatment effects [2]. Trials that were not double-blind yielded on average a 17% greater effect, 95% CI (4% to 29%), than blinded studies (P = .01).

Lack of double blinding is frequently found in surgical trials and results in uncertain evidence because of the problems stated above. A recent multicenter RCT, the Spine Patient Outcomes Research Trial (SPORT) [3], serves as an interesting case study of the blinding issues that arise when a surgical intervention is compared to a non-surgical intervention and blinding is not attempted. The trial included patients with persistent (at least 6 weeks) disk-related pain and neurologic symptoms (sciatica) who were randomized to undergo diskectomy or receive usual care (not standardized but frequently including patient education, anti-inflammatory medication, and physical therapy, alone or in combination). There were a number of problems with this study, including lack of power, poor control of non-study interventions, a high proportion of patients who crossed over between treatment strategies (43% of patients randomized to surgery had not undergone surgery by 2 years, and 42% of those randomized to conservative care did receive surgery) and lack of blinding. The degree of missing data was 24%-27% without a true intention-to-treat analysis. Of great interest was an editorial that dealt with the problem of non-blinding in surgical studies. The editorialist, Flum, makes the following points [4]:

    • While the technique of sham intervention is well accepted in studies of medications using inactive pills (placebos), simulated acupuncture, and nontherapeutic conversation in place of therapeutic psychiatric interventions, it has only occasionally been applied to surgical trials. This is unfortunate because the use of sham controls has been critical in understanding just how much patient expectation influences outcomes after an operation.
    • A sham-controlled trial would be particularly relevant for spine surgery since the most commonly occurring and relevant outcomes are subjective.
    • Patients choosing surgical options may have high expectations. These may include a higher level of emotional “investment” in surgical care compared with usual care, based on the level of commitment resulting from a decision to have an operation and get through recovery. After the patient has accepted the risks of surgical intervention, the desire for improvement may drive perceptions about improvement.
    • Patients who opt for surgery may also differ from patients who decline surgery in their beliefs regarding the benefits of invasive interventions.
    • The surgeon’s expectations and direction are likely to play an important role in patient improvement.
    • Given the proliferation of operative procedures for the treatment of subjective complaints like back pain, the need for sham controlled trials has never been greater.

Flum goes on to present multiple examples of the power of suggestion and the problem of doing non-blinded trials in the field of surgery. Observational trials have often reported procedural success, but sham-controlled trials for the same conditions demonstrate how much of that success is due to the placebo effect.

  • Example 1 (Ligation of the Internal Mammary Artery): After multiple observational studies suggested that ligation of the internal mammary artery was helpful in patients with coronary disease, Cobb et al randomized patients to operative arterial ligation or a sham procedure. Both groups improved after the intervention, and improvements in subjective measures such as exercise tolerance and nitroglycerin use were similar, if not greater, in the sham surgical group.
  • Examples 2 and 3 (Osteoarthritic Knee Surgery and Osteoarthritic Knee Joint Irrigation): After multiple case series reported that patients with osteoarthritis of the knee improve after arthroscopic surgery, Moseley et al demonstrated just how much of that effect is related to the hopes, expectations, and beliefs of the patient. The investigators randomized 180 patients to undergo arthroscopy with debridement, arthroscopy with lavage, or sham arthroscopy. The power of expectation was strong: patients were unable to determine if they had been assigned to the treatment or sham groups, and all groups improved. At 2 years after randomization, all patients reported comparable pain scores and functional scores. Another sham-controlled study in patients with knee osteoarthritis demonstrated that patients benefit equally from irrigation of the joint and from sham irrigation.
  • Example 4 (Parkinson’s Disease): Researchers found similar improvements in quality of life after direct brain injections of embryonic neurons or placebo in patients with advanced Parkinson’s disease.
  • Example 5 (Transmyocardial Laser Revascularization in HF): Heart failure patients undergoing transmyocardial laser revascularization or sham procedures had equal improvements in subjective outcomes.
  • Example 6 (Hernia): After hernia repair, there was equal improvement in pain control after cryoablation of nerves or sham interventions.
  • Examples 7-9 (Laparoscopic Interventions): Multiple case series have reported benefit on subjective outcomes such as pain control, function, and readiness for discharge with laparoscopic cholecystectomy, colon resection, and appendectomy compared with conventional approaches. Bias arises when the clinical care team influences patient and discharge expectations through coaching, communication, and management. Randomized trials of these three procedures that blinded both the patients and the discharging clinicians to the treatment received, by placing large, side-to-side abdominal wall dressings, demonstrate little or no difference in patients reaching discharge criteria. A reasonable conclusion is that when the clinician’s expectations and “coaching” were removed by placing a large bandage on the abdominal wall, the subjective benefits disappeared.

Flum concludes that studies not addressing both patient and clinician expectation on subjective outcomes do not inform the clinical community about the true role of the intervention.

Delfini Commentary
Blinding of subjects and everyone working with the subjects or study data to the assigned intervention (double-blinding) decreases the likelihood of bias. Bias may be more likely to occur when evaluating subjective outcomes such as pain, satisfaction, and function in non-blinded studies, but it has also been reported with objective outcomes such as mortality. When dealing with subjective outcomes, as Flum points out, it is critical to distinguish the effect of the intervention from the effect of the patient’s expectation of the intervention. The only way to distinguish the effect of a patient’s positive expectations of an operation from the intervention itself is to blind patients to the treatment they receive and randomize them to receive the intervention of interest or to receive a sham intervention (placebo). Yet we frequently hear, “But blinding is not possible in surgical studies.” Frequently the argument is raised that subjecting people to anesthesia and sham surgery is not ethical. However, conducting clinical trials employing methods that result in avoidable fatal flaws is also problematic. Flum’s position is that when the risk of a placebo does not exceed a threshold of acceptable research risk and if the knowledge to be gained is substantial, a sham-controlled trial is needed and is ethical. He reasons that ethical justification of placebo-controlled trials is based on the following considerations:

  • Invasive procedures are associated with risks.
  • There are great harms created by conducting studies that are of uncertain validity.
  • Establishing community standards based on uncertain evidence is more likely to result in more harm than good.
  • Sham-controlled trials are justified when uncertainty exists among clinicians and patients about the merits of an intervention.

The SPORT trial draws attention to the problem of non-blinding in surgical trials. This was a very expensive, labor-intensive study that provides no useful efficacy data. Research subjects were undoubtedly told this study would provide answers regarding the relative efficacy of surgery vs conservative care for lumbar spine disease. The authors of the SPORT trial state that a sham-controlled trial was impractical and unethical, possibly (according to Flum) because the risk of the sham would include general anesthesia to truly blind the patients. He would argue that in this case blinding, which would require anesthesia, is the only way that valid, useful evidence could have been created. Even though we graded the study U (uncertain validity and usefulness) and would not use the results to inform decisions about efficacy or effectiveness because of the threats to validity, the study does report information regarding risks of surgery that may be of great value to patients.

-----------

1 Jüni P, Altman DG and Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323:42-46. PMID: 11440947

2 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12. PMID: 7823387.

3 Weinstein JN, Tosteson TD, Lurie JD, et al. Surgical vs nonoperative treatment for lumbar disk herniation: the Spine Patient Outcomes Research Trial (SPORT): a randomized trial. JAMA. 2006;296:2441-2450. PMID: 17119141

4 Flum DR. Interpreting surgical trials with subjective outcomes: avoiding UnSPORTsmanlike conduct. JAMA. 2006;296(20):2483-2485. PMID: 17119146

The Importance of Blinded Assessors in RCTs

We have previously summarized the problems associated with lack of blinding in surgical (and other) studies — see Blinding in Surgery Trials in a previous DelfiniClick™. The major problem with unblinded studies is that the outcomes in the intervention group are likely to be falsely inflated because of the biases introduced by lack of blinding.

Recently a group of orthopedists identified and reviewed thirty-two randomized, controlled trials published in The Journal of Bone and Joint Surgery between 2003 and 2004 to evaluate the effect of blinded assessment vs non-blinded assessment on reported outcomes [1].

Results

  1. Sixteen of the thirty-two randomized controlled trials did not report blinding of outcome assessors when blinding would have been possible.
  2. Among the studies with continuous outcome measures, unblinded outcomes assessment was associated with significantly larger treatment effects than blinded outcomes assessment (standardized mean difference, 0.76 compared with 0.25; p = 0.01).
  3. In the studies with dichotomous outcomes, unblinded outcomes assessments were associated with significantly greater treatment effects than blinded outcomes assessments (odds ratio, 0.13 compared with 0.42; p < 0.001).
  4. This translates into a relative risk reduction of 38% for blinded outcome assessments compared with 71% for unblinded outcome assessments (a difference of 33 percentage points).
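
An odds ratio is not a relative risk, which is why the relative risk reductions in item 4 are smaller than 1 minus the odds ratios in item 3. Converting one to the other requires the event rate in the comparison group. The sketch below uses the standard algebraic conversion RR = OR / (1 − p0 + p0 × OR), where p0 is the comparison-group risk. The p0 values are hypothetical, chosen only for illustration; they are not taken from Poolman et al., so the output is not expected to reproduce the 38% and 71% figures.

```python
def odds_ratio_to_relative_risk(odds_ratio, baseline_risk):
    """Convert an odds ratio to a relative risk for a given comparison-group risk."""
    if not 0 <= baseline_risk < 1:
        raise ValueError("baseline_risk must be in [0, 1)")
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


# Odds ratios quoted in item 3; baseline risks below are hypothetical
for baseline_risk in (0.05, 0.30, 0.60):
    for odds_ratio in (0.42, 0.13):
        rr = odds_ratio_to_relative_risk(odds_ratio, baseline_risk)
        print(f"baseline risk {baseline_risk:.0%}, OR {odds_ratio}: "
              f"relative risk reduction = {1 - rr:.0%}")
```

The higher the baseline risk, the further the relative risk sits from the odds ratio, so the same odds ratio corresponds to a smaller relative risk reduction when events are common.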

Conclusion
Unblinded outcomes assessment dramatically inflates the reported benefit of treatments.

Delfini Commentary
This is yet another study pointing out the importance of blinding. Based on this and other similar studies, it is our conclusion that studies, or the results of studies, without blinded assessors are grade U or at best grade B-U (see the Delfini evidence-grading scale).

1. Poolman RW, Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar F, Bhandari M. Reporting of outcomes in orthopaedic randomized trials: does blinding of outcome assessors matter? J Bone Joint Surg Am. 2007 Mar;89(3):550-8. PMID: 17332104.

Attrition Bias: Intention-to-Treat Basics

In general, we approach critical appraisal of RCTs by evaluating the four major components of a trial: the study population (including how it was established), the intervention, the follow-up and the assessment. There is very little controversy about the process of randomizing in order to distribute known and unknown confounders as equally as possible between the groups. There also appears to be general understanding that the only difference between the two groups should be what is being studied. However, what seems to receive much less attention is the considerable potential for bias that occurs when data is missing from subjects because they do not complete a study or are lost to follow-up, and investigators use models to deal with that missing data. The only way to prevent this bias is to have data on all randomized subjects. This is frequently not possible. And bias creeps in.

Intent-to-treat designs that provide primary outcome data on all randomized patients are the ideal. All patients randomized are included in the analysis, and patients are analyzed in the same groups to which they were randomized. Unfortunately, we are rarely provided with all of this information, and we must struggle to impute the missing data; i.e., when possible, we must do our own sensitivity analysis and recalculate p-values based on various assumptions (e.g., worst-case scenario, all missing subjects fail, etc.). All too often, papers do not report sufficient data to perform these calculations, or the variables do not lend themselves to this type of analysis because they cannot be made binomial, and we are left with the authors’ frequently inadequate analysis. We then have to assign a low study grade, because we remain too uncertain to draw cause-and-effect conclusions from the data.
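The worst-case recalculation described above can be sketched in a few lines of Python. This is only an illustration: the counts are hypothetical, the function name is ours, and a pooled two-proportion z-test stands in for whatever test a given study actually used.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (events_a + events_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (events_a / n_a - events_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: 100 randomized per group, outcome = treatment failure.
# Treatment: 90 completers, 18 failures, 10 with missing outcome data.
# Control:   92 completers, 35 failures,  8 with missing outcome data.
p_completers = two_proportion_p_value(18, 90, 35, 92)

# Worst case for the intervention: every missing treatment subject is
# counted as a failure and every missing control subject as a success.
p_worst_case = two_proportion_p_value(18 + 10, 100, 35, 100)

print(f"completers only: p = {p_completers:.3f}")
print(f"worst case:      p = {p_worst_case:.3f}")
```

With these made-up numbers the completers-only comparison is statistically significant but the worst-case comparison is not, so the reported finding would not survive this sensitivity analysis.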

We see many studies where the analysis is accomplished using Kaplan-Meier estimates and other models to deal with excluded patient data. As John Lachin has pointed out, this type of “efficacy subset” analysis has the potential for Type I errors (study findings=significant difference between groups; truth=no significant difference) as large as 50 percent or higher [1]. Lachin and others have shown that the statistical methods used when data is censored (meaning not included in analysis either through patient discontinuation or data being removed), frequently assume that —

  • Missing data is missing at random to some degree;
  • It is reasonable to impute missing data using assumptions from non-missing data; and,
  • The bias from efficacy subset analysis is not a major factor.

We want to see data on all patients randomized. When patients are lost to follow-up or do not complete a study, we want to see intent-to-treat analyses with clear statements about how the missing data is imputed. We agree with Lachin’s suggestion that the intent-to-treat design is likely to be more powerful (than statistical modeling), and especially powerful when an effective treatment slows progression of a disease during its administration, i.e., when a patient benefits long after the patient becomes noncompliant or the treatment is terminated. Lachin concludes that, “The bottom line is that the only incontrovertibly unbiased study is one in which all randomized patients are evaluated and included in the analysis, assuming that other features of the study are also unbiased. This is the essence of the intent-to-treat philosophy. Any analysis which involves post hoc exclusions of information is potentially biased and potentially misleading.”

We also agree with an editorial comment made by Colin Begg who states that, “The properly conducted randomized trial, where the primary endpoint and the statistical method are specified in advance, and all randomized patients contribute to the analysis in an intent-to-treat fashion, provides a structure that severely limits our opportunity to obscure the facts in favor of our theories.” Begg concludes by supporting Lachin’s assessment: “He is absolutely correct in his view that the recent heavy emphasis on the development of missing data methodologies in statistical academic circles has led to a culture in which poorly designed studies with lots of missing data are perceived to be increasingly more acceptable, on the flimsy notion that sophisticated statistical modeling can overcome poor quality data. Mundane though it may sound, I strongly support his [Lachin’s] assertion that `…the best way to deal with the problem (of missing data) is to have as little missing data as possible…’ Attention to the development of practical strategies for obtaining outcome data from patients who withdraw from trials, notably short-term trials with longitudinal repeated measures outcomes, is more likely to lead to improvement in the quality of clinical trials than the further development of statistical techniques that impute the missing data. [2]”

It would be difficult to express our concern more eloquently than what is stated above. The two examples below amplify this.

Example 1: A group of rheumatologists were uncomfortable with Kaplan-Meier statistical methods for analysis of outcomes in rheumatology studies. Their concern was that, even though Kaplan-Meier methods are frequently used to analyze cancer data, very little research has been done to validate the use of Kaplan-Meier methods for drug studies (i.e., endpoints such as stopping medication because of side-effects or lack of efficacy). They tested three assumptions upon which Kaplan-Meier survival analysis depends:

1. Patients recruited early in the study should have the same drug survival (i.e. time to determination of lack of efficacy or onset of side-effects) as those recruited later;
2. Patients receiving their first drug later in the study should have the same drug survival characteristics as those receiving it earlier; and,
3. Drug survival characteristics should be independent of the time that a patient has been in the study before receiving the disease modifying drug.

To examine the above assumptions, the authors plotted survival curves for the different groups (i.e. subjects recruited early vs those recruited later) and showed that, in each case, the drug survival characteristics were statistically different between the two groups (p<0.01). They conclude, as did Lachin, that it is not possible to prove that survival analysis is always invalid (even though they did show in this case the Kaplan-Meier analysis was invalid). However, this group feels that the onus of proof is on those who advocate for drug survival analysis—i.e., using statistical modeling rather than presenting all the data so that the reader can do an ITT analysis or sensitivity analysis[3].

Example 2: A similar situation occurred when a group of geriatricians became concerned that many different, and sometimes inappropriate, statistical techniques are used to analyze the results of randomized controlled trials of falls prevention programs for elderly people. To evaluate this, they used raw data from two randomized controlled trials of a home exercise program to compare the number of falls in the exercise and control groups using two different survival analysis models (Andersen-Gill and marginal Cox regression) and a negative binomial regression model for each trial.

In one trial, the three different statistical techniques gave similar results for the efficacy of the intervention but, in the second trial, underlying assumptions were violated for the two Cox regression models. Negative binomial regression models were easier to use and more reliable.

Proportional Hazards and Cox Regression Models: The authors point out that proportional hazards or Cox regression models can test whether several factors (for example, intervention group, baseline prognostic factors) are independently related to the rate of a specific event (e.g., a fall). However, using survival probabilities to analyze time to fall events assumes that, at any time, participants who are censored before the end of the trial have the same risk of falling as those who complete the trial. A further assumption of proportional hazards models is that the ratio of the risks of the events in the two groups is constant over time and that the ratio is the same for different subgroups of the data, such as age and sex groups. This is known as the proportionality of hazards assumption. No particular distribution is assumed for the event times, that is, the time from the trial start date for the individual to the outcome of interest (in this case, a fall event). This differs from an outcome such as death following cardiac surgery, where one would assume a greater frequency of deaths close to the surgical event.

Andersen-Gill and marginal Cox proportional hazards regression: These models are used in survival analyses when there are multiple events per person in a trial. The Andersen-Gill extension of the proportional hazards regression model and the marginal proportional hazards regression model are both statistical techniques used for analyzing recurring event data.

Negative Binomial Regression: The negative binomial regression model can also be used to compare recurrent event rates in different groups. It allows investigation of the treatment effect and confounding variables, and adjusts for variable follow-up times by using time at risk.

In the first study of falls in the elderly, all three statistical approaches indicated that falls were significantly reduced by 40% (Andersen-Gill Cox model), 44% (marginal Cox model) and 39% (negative binomial regression model) in the exercise group compared with those in the control group. The tests for the proportionality of hazards for both types of survival regression models indicated that these models “worked” for the recurring falls problem.

In the second study, there was evidence that the proportional hazards assumption was violated in the Andersen-Gill and marginal Cox regression models (proportional hazards test). The authors point out that survival analysis is not valid if participants who are censored do not have the same rate of outcome (risk of falling) as those who continue in the trial. Citing a reference, they note that those who do not complete a falls prevention trial are at higher risk of falling and that, if fewer participants withdraw from one group than from the other, this may point to a study-related cause for the difference in discontinuation, and results may be biased.

Summary
Unfortunately, readers are in a very difficult position when evaluating the quality of studies that use survival analyses and statistical modeling because the assumptions used in the models are almost never given and the missing data points are frequently quite large. Delfini uses a conservative approach. We look for information about the model, percent of subjects whose data are missing from analysis, differential loss between the groups, censored information and reasons for loss to follow-up. We have been unable to find any good evidence-based criteria to help guide us in considering cut-offs for validity. We use the following in evaluating how loss of subjects’ data affects the validity of the study. While the suggestions below are not evidence-based, they are conservative in comparison to some EBM suggestions we have seen, and we have run some calculations trying to help guide our choices. So caveat emptor!

Delfini Non-evidence-based Advice on Reaction to Missing Data Points from Non-completers and Those Lost to Follow-up:

  • Minimal threat: < 5% and no differential loss*
  • Possible threat: >= 5% but < 10% and no differential loss*
  • Acceptable for efficacy: >= 5%, but sensitivity analysis conducted, by authors or reviewers, which applied worst-case scenario, or otherwise reasonable sensitivity analysis, and analysis continued to agree with authors’ findings about statistical significance
  • Threat: >= 5% with differential loss*, or >= 10% without differential loss, and without worst-case sensitivity analysis, or otherwise reasonable sensitivity analysis, conducted by authors or reviewers

*Differential loss:

For small to medium study (e.g., less than 300 total randomized), differential loss must be low to non-existent (e.g., 2% or less difference in missing data points between groups)

For large study (e.g., more than 300 total randomized), differential loss must be minimal (e.g., 5% or less difference in missing data points between groups)
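For readers who like to see the rules of thumb above written out, here is a minimal sketch in Python. The function names are ours, not Delfini's. Two choices are assumptions because the advice above does not address them: a study of exactly 300 subjects is treated as large, and loss under 5% combined with differential loss is returned as not covered.

```python
def has_differential_loss(missing_pct_a, missing_pct_b, total_randomized):
    """True if the gap in missing data between groups exceeds the cut-off."""
    # 2 percentage points for studies under 300 subjects, 5 points otherwise.
    # Treating exactly 300 as "large" is an assumption.
    limit = 2.0 if total_randomized < 300 else 5.0
    return abs(missing_pct_a - missing_pct_b) > limit


def missing_data_threat(overall_missing_pct, differential_loss,
                        sensitivity_analysis_agrees=False):
    """Map missing-data figures to the categories listed above."""
    if overall_missing_pct >= 5 and sensitivity_analysis_agrees:
        return "acceptable for efficacy"
    if differential_loss:
        if overall_missing_pct >= 5:
            return "threat"
        # Under 5% with differential loss is not listed above (assumption).
        return "not covered by the advice"
    if overall_missing_pct < 5:
        return "minimal threat"
    if overall_missing_pct < 10:
        return "possible threat"
    return "threat"


# Hypothetical study: 250 randomized, 6% vs 3% missing, 7% missing overall.
diff = has_differential_loss(6.0, 3.0, 250)
print(diff, missing_data_threat(7.0, diff))
```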

1. Lachin JM. Statistical considerations in the intent-to-treat principle. Control Clin Trials 2000;21:167–189. PMID: 11018568

2. Utley M. et al. Potential bias in Kaplan-Meier survival analysis applied to rheumatology drug studies. Rheumatology 2000;39:1-6.

3. Robertson, MC et al. Statistical Analysis of Efficacy in Falls Prevention. Journal of Gerontology 2005;60:530–534.

Intention-to-Treat Analysis & Censoring: Rofecoxib Example

In a recent DelfiniClick, we voiced concern about models used for analysis of study outcomes, especially when information about assumptions used is not reported. In the July 13, 2006 issue of the NEJM (published early on-line), there is a very informative example of what can happen when authors claim to analyze data using the intention-to-treat (ITT) principle, but do not actually do an ITT analysis.

Case Study
The NEJM published a correction to an original study of cardiovascular events associated with rofecoxib versus placebo[1]. This correction illustrates how Kaplan-Meier curves can be misleading to readers and how they differ with various censoring assumptions. In this case, by censoring data that occurred 14+ days after subjects discontinued the study, the Kaplan-Meier curves for thrombotic events did not separate until 18 months. The following is part of the correction published by NEJM:

“…Statements regarding an increase in risk after 18 months should be removed from the Abstract (the sentence ‘The increased relative risk became apparent after 18 months of treatment; during the first 18 months, the event rates were similar in the two groups’ should be deleted…”

The reason for the correction appears to be an analysis of data released by Merck to the FDA on May 11, 2006. These data provide information about events in the subgroup of participants whose data were censored if they had an event more than 14 days after early discontinuation of the study medication.

Twelve thrombotic events that occurred more than 14 days after the study drug was stopped but within 36 months after randomization were noted. Eight of the “new” events were in the rofecoxib group, and these events had a definite effect on the published survival curve for rofecoxib (Fig. 2 of the original article). When including the new data, the separation of the rofecoxib and placebo curves begins earlier than 18 months.

The point of all this is that it is difficult to determine the validity of a study when assumptions used in censoring of data are not reported. With insufficient information about loss to follow-up, we cannot do our own sensitivity analyses for imputing missing data with our goal being to “test” the P-value reported by the authors.

To reiterate from our previous DelfiniClick:

  • Intent-to-treat designs that provide primary outcome data on all randomized patients are the ideal. All patients randomized are included in the analysis. The same patients randomized at the beginning of the RCT are analyzed in the same groups to which they were randomized.
  • Authors should use a CONSORT diagram to report what happened to various patients during the course of the study – plus they should provide detailed information about missing data points including timing.
  • Sensitivity analyses are welcomed, especially those that subject the intervention to the toughest trial. If p-values remain statistically significant after such a test, we can be more confident about anticipated outcomes in an otherwise valid study.

1. Correction to: Cardiovascular events associated with rofecoxib in a colorectal adenoma chemoprevention trial. N Engl J Med 2006;355:221.

2. Bresalier RS, Sandler RS, Quan H, et al. Cardiovascular events associated with rofecoxib in a colorectal adenoma chemoprevention trial. N Engl J Med 2005;352:1092-102.

Intention-to-Treat Analysis: Misreporting and Migraine

Intention-to-treat analysis (ITT) is an important consideration in randomized, controlled trials. And determining whether an analysis meets the definition of ITT analysis or not is incredibly easy. Yet many authors mislabel their analyses as ITT when they are not and report their results in a biased way. An article in BMJ dealing with migraine illustrates some important points about ITT analysis and reminds us that authors continue to report outcomes in ways that are highly likely to be biased.

Read our case study here.

Missing Data Points: Difference or No Difference — Does it Matter?

We continue to study the "evidence on the evidence," meaning we are continually on the lookout for information which may shed light on the impact of certain kinds of bias on reported outcomes, for example, or information that helps in handling different biases. Missing data points are an issue affecting the majority of studies, but currently it is not clear how big an issue this is, especially when there is no differential loss between groups.

We spoke recently about this issue with John M. Lachin, Sc.D., Professor of Biostatistics and Epidemiology, and of Statistics, The George Washington University, and author. (And then we did some "hard thinking" as David Eddy would say.) Even without differential loss between the groups overall, a differential loss could occur in prognostic variables — and readers are rarely going to have access to data about changes in prognostic characteristics post-baseline reporting. So we continue to offer our conservative approach that loss of around five percent with differential loss is a bias as well as loss of around ten percent or more without differential loss.

For those who are tough and hardy and really want to mull on this, here's our updated white paper on "missingness" [Word] or [PDF]. We welcome further thoughts (or evidence) on this area.

Quality of Studies: VIGOR

Why is it that Vioxx made the front page of the New York Times in December of 2005 when it was withdrawn from the market in 2004? Reason: it was discovered that the authors “removed” 3 patients with CV events from the data in the days preceding final hardcopy submission of the VIGOR study to the NEJM. Here are some key points made by the NEJM in an editorial entitled, Expression of Concern: Bombardier et al., “Comparison of Upper Gastrointestinal Toxicity of Rofecoxib and Naproxen in Patients with Rheumatoid Arthritis,” N Engl J Med 2000;343:1520-8, published on the web 12/8/05 and in hard copy, N Engl J Med. 2005.353:25:

  • The VIGOR study was designed primarily to compare gastrointestinal events in patients with rheumatoid arthritis randomly assigned to treatment with rofecoxib (Vioxx) or naproxen (Naprosyn), but data on cardiovascular events were also monitored.
  • Three myocardial infarctions, all in the rofecoxib group, were not included in the data submitted to the Journal in hardcopy.
  • Until the end of November 2005, the NEJM believed that these were late events that were not known to the authors in time to be included in the article published in the Journal on November 23, 2000.
  • It now appears, however, from a memorandum dated July 5, 2000, that was obtained by subpoena in the Vioxx litigation and made available to the NEJM, that at least two of the authors knew about the three additional myocardial infarctions at least two weeks before the authors submitted the paper version of their manuscript.
  • Lack of inclusion of the three events resulted in an understatement of the difference in risk of myocardial infarction between the rofecoxib and naproxen groups.
  • The NEJM determined from a computer diskette that some of these data were deleted from the VIGOR manuscript two days before it was initially submitted to the Journal on May 18, 2000.
  • Taken together, these inaccuracies and deletions call into question the integrity of the data on adverse cardiovascular events in this article.

Merck's position is that the additional heart attacks became known after the publication's "cutoff" date for data to be analyzed and were therefore not reported in the Journal article. To our knowledge, NEJM has not responded to Merck's point.

In any event, without the 3 missing subjects the relative risk of myocardial infarction was 4.25 for rofecoxib versus naproxen, 95% CI (1.39 to 17.37). This is based on 17 MIs out of 2315 person-years of exposure for rofecoxib and 4 MIs out of 2336 person-years for naproxen.

Adding in the 3 missing subjects (new total of 20 MIs in the rofecoxib group) increases the relative risk to 5.00, 95% CI (1.68 to 20.13). This demonstrates how losing just a few subjects even in a large study can change results dramatically.
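As a rough check on the arithmetic, the sketch below computes a simple person-time rate ratio from the counts quoted above. It is an illustration only: the function name is ours, and this crude calculation is an assumption about method, not the one used for the published figures. The quoted 4.25 and 5.00 equal the plain ratios of event counts (17/4 and 20/4); dividing each count by its person-years first gives values that are close but not identical. The confidence intervals are not reproduced here.

```python
def rate_ratio(events_a, person_years_a, events_b, person_years_b):
    """Crude incidence rate ratio: (events per person-year in A) / (same in B)."""
    return (events_a / person_years_a) / (events_b / person_years_b)


# Counts as quoted in the text: rofecoxib vs naproxen.
without_missing = rate_ratio(17, 2315, 4, 2336)
with_missing = rate_ratio(20, 2315, 4, 2336)

print(f"17 vs 4 MIs: rate ratio = {without_missing:.2f}")
print(f"20 vs 4 MIs: rate ratio = {with_missing:.2f}")
```

Either way, three additional events in one arm move the point estimate from a little over 4 to about 5.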

For readers, the important point is to look carefully to be sure that all randomized patients were accounted for. We believe that if the loss of subjects is greater than 5% without an acceptable ITT analysis there is uncertainty regarding the validity of the results.

Confidence-Intervals, Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings
(Special thanks to Brian Alper, MD, MSPH and Ted Ganiats, MD for their help in understanding this issue.)

Problems with Non-Statistically Significant Findings
Research outcomes which are not statistically significant (also referred to as “non-significant findings”) raise the question, "Is there TRULY no difference, or were there not enough people to show a difference if there is one?" (This is known as beta- or Type II error.)

Power calculations performed prior to a study help investigators determine the number of people they should enroll in the study to try to detect a statistically significant difference if there is one. A power of >= 80% is conventional and provides some leeway for chance. Power calculations are generally performed only for the primary outcome. They entail a lot of assumptions.
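To give a feel for what such a calculation involves, here is a minimal sketch for comparing two proportions. It assumes the standard normal-approximation formula, a two-sided alpha of 0.05, equal group sizes, and hypothetical event rates (10% in the control group, 5% in the treatment group); the function name is ours. Real trials often use more elaborate methods.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical assumptions: 10% event rate on control, 5% on treatment.
n = sample_size_per_group(0.10, 0.05)
print(f"subjects needed per group: {n}")
```

Changing any of the assumed inputs changes the answer considerably, which is one reason power calculations deserve a skeptical reading.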

Good News About Power!
The good news for readers is that you don’t need to worry about power since you can evaluate inconclusiveness of findings through using confidence intervals.

Here’s what they are, and here’s how it’s done:

About Confidence Intervals (CIs)
The results of a valid study represent an approximation of truth. There might be other possible values that could equally approximate truth. (What if the study had been done on Friday instead of on Tuesday, for example? Maybe the difference in outcomes would be an absolute 4 percent and not 5 percent.) In recognition of this, confidence intervals are calculations of equally statistically plausible results generating a range within which there is a 95% chance that the true answer lies for a valid study. (As with all allowances for chance findings, 95 percent is conventional.) You can apply confidence intervals to any measure of outcomes such as an odds ratio or absolute risk reduction (ARR).

This is how confidence intervals are reported:

Example: ARR = 5%; 95% CI (3% to 7%)

How to Use Confidence Intervals to Determine Statistical Significance

Absolute Risk Reduction and Relative Risk Reduction
For measures reported as percentages, if the range includes zero, the outcomes are not statistically significant.

Relative Risk (aka Risk Ratio) and Odds Ratio
For measures reported as ratios, if the range includes 1, the outcomes are not statistically significant.
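The two rules above amount to one check: does the interval contain the "no difference" value (0 for differences, 1 for ratios)? A minimal sketch, with a function name of our own choosing and intervals taken from examples on this page:

```python
def is_statistically_significant(ci_lower, ci_upper, null_value):
    """A result is non-significant if the CI includes the null value."""
    return not (ci_lower <= null_value <= ci_upper)


# Differences (ARR, RRR): the null value is 0.
print(is_statistically_significant(3.0, 7.0, null_value=0))     # ARR 5% (3% to 7%)
print(is_statistically_significant(-1.0, 5.0, null_value=0))    # ARR 2% (-1% to 5%)

# Ratios (relative risk, odds ratio): the null value is 1.
print(is_statistically_significant(1.39, 17.37, null_value=1))  # RR 4.25
```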

How to Use Confidence Intervals to Determine Conclusiveness of Non-significant Findings
And if something is not statistically significant (also referred to as non-significant or NS findings), you don’t know if there truly is no difference, or whether there were not enough people to show a difference if there is one.

You can look to the CIs to help you with this situation. But first you want to decide what you would consider to be your minimum requirement for a clinically significant outcome (difference between outcomes in the intervention and comparison groups). This is a judgment call.

Let’s assume we are looking at a study, the primary outcome for which is absolute reduction in mortality. One might reasonably conclude that an outcome of 1 percent or more is, indeed, a clinically meaningful benefit.

[Below is a text explanation. Pictures tell this best, however. Click here to view a PDF of what this looks like graphically. Note that the PDF starts out first with how to determine clinical significance of statistically significant outcomes and then demonstrates how to determine conclusiveness of non-significant findings.]

Example: Clinical Significance Goal
>=1% absolute reduction in mortality

For Non-Significant Findings:

Example 1

  • ARR = 2%; 95% CI (-1% to 5%)
  • The upper boundary tells you it is possible that the true result WOULD meet your requirements for clinical significance – thus, from that perspective this trial is inconclusive about NO DIFFERENCE BETWEEN GROUPS - you do not know if the trial was insufficiently powered (false negative due to insufficient number of people to show a statistically significant difference if there is one)

Example 2

  • ARR = 0%; 95% CI (-0.5% to 0.5%)
  • The upper boundary does not reach your goal – therefore, this can be considered sufficient evidence that there is no difference between the groups that you would consider clinically significant

How to Use Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings
Again, you can also use confidence intervals to determine whether a result from a valid study is of meaningful clinical benefit.

Requirements for Meaningful Clinical Benefit
Remember that outcomes of clinical significance are those which benefit patients in some way in the areas of morbidity, mortality, symptom relief, physical or emotional functioning or health-related quality of life. Intermediate markers are assumed to benefit patients in these areas, but they may not - thus, a direct causal chain of benefit must be proved to avoid waste and potential patient harms occurring as unintended consequences. Meaningful clinical benefit is a combination of benefits in a clinically significant area along with the size of the results.

As with evaluating the conclusiveness of a non-significant finding, you apply judgment to set your minimum requirement for meaningful clinical significance. Using the same example of your choosing 1 percent absolute reduction in mortality as meaningful clinical benefit:

Example: Clinical Significance Goal
>=1% absolute reduction in mortality

For Statistically Significant Findings:

Example 1

  • ARR = 2%; 95% CI (0.5% to 3.5%)
  • The lower boundary tells you it is possible that the true result will NOT meet your requirements for clinical significance – thus, from that perspective this trial is inconclusive

Example 2

  • ARR = 2%; 95% CI (1% to 3%)
  • The lower boundary reaches your goals for clinical significance – therefore, this can be considered sufficient evidence of benefit
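The four worked examples in this section (two non-significant, two statistically significant) follow one decision rule, sketched below. The function name and the wording of the returned labels are ours; the threshold is the reader's own judgment call, here the 1% absolute reduction in mortality used in the examples.

```python
def interpret_arr_interval(ci_lower, ci_upper, min_benefit):
    """Interpret a 95% CI for ARR (in percentage points) against a threshold."""
    if ci_lower <= 0 <= ci_upper:
        # Non-significant: look at the upper boundary.
        if ci_upper >= min_benefit:
            return "non-significant, inconclusive"
        return "non-significant, no clinically meaningful difference"
    if ci_upper < 0:
        return "significant, favors the comparison group"
    # Statistically significant benefit: look at the lower boundary.
    if ci_lower >= min_benefit:
        return "significant, sufficient evidence of benefit"
    return "significant, inconclusive about clinical benefit"


examples = [(-1.0, 5.0), (-0.5, 0.5), (0.5, 3.5), (1.0, 3.0)]
for lower, upper in examples:
    print((lower, upper), "->", interpret_arr_interval(lower, upper, min_benefit=1.0))
```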

Again, pictures probably tell this best. Click here to view the PDF.

The Authors Did Not Report CIs?
If you can create a 2 x 2 table from the study data, you can compute them yourself using the confidence interval calculator of the University of British Columbia, Department of Health Care and Epidemiology » which can also be found in the Delfini WebLinks » under "confidence interval calculations."
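If no calculator is at hand, a 2 x 2 table is also enough to compute an interval directly. The sketch below computes a relative risk with a 95% CI using the common log-based (Katz) approximation. The counts are hypothetical, the function name is ours, and online calculators may use a different method, so small differences are to be expected.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_ratio_ci(events_t, n_t, events_c, n_c, level=0.95):
    """Relative risk and its CI from 2 x 2 counts (log method)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se_log_rr = sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical table: 30/200 events on treatment, 50/200 on control.
rr, lower, upper = risk_ratio_ci(30, 200, 50, 200)
print(f"RR = {rr:.2f}; 95% CI ({lower:.2f} to {upper:.2f})")
```

Because this interval does not include 1, the hypothetical result would be statistically significant.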

Evaluate Definitions for Outcomes
And remember, ensure you agree with the authors’ definitions of the outcomes, especially if they are using a term like “improved,” “success,” or “failure” – is a three-point change on a 200 point scale really a meaningful clinical difference that should define success? You get to be the judge.

Getting “Had” by P-values: Confidence Intervals vs P-values in Evaluating Safety Results: Low-molecular-weight Heparin (LMWH) Example

In one of our DelfiniClicks we have pointed out that confidence intervals (CIs) can be very useful when examining results of randomized controlled trials (Confidence-Intervals, Power & Meaningful Clinical Benefit »). The first step in examining safety results is to decide what you consider to be a range for clinically significant outcomes (i.e., the difference between outcomes in the intervention and comparison group). This is a judgment call. Then examine the 95% CI to see if a clinically significant difference is included in the confidence interval. If it is, the study has not excluded the possibility of a clinically significant harm even if the authors state there is no difference (usually stated as “no difference” based on a non-significant p-value.) It is important to remember that a non-significant p-value can be very misleading in this situation.

This can be illustrated by an interesting conversation we recently had with an orthopedic surgeon who felt he couldn’t trust the medical literature to guide him because it gave him “misleading information.” He based his conclusion on a study he read (he wasn’t sure which study it was) regarding bleeding in orthopedic surgery. After talking with him, we searched for studies that may have led to his conclusion and found the following study which illustrates why CIs are preferable to p-values in evaluating safety results and possibly why he was misled.

Case Study: An orthopedic surgeon reads an article comparing outcomes, including bleeding rates, between fondaparinux and enoxaparin in orthopedic surgery and sees the following statement by the authors in the Abstract section of the paper: “The two groups did not differ in frequency of death or clinically relevant bleeding.” [1]

He looks at the Results section of the paper and reads the following: “The number of patients who had major bleeding did not differ between groups (p=0.11).” He knows that if the p-value is greater than 0.05, the differences are not considered statistically significant, and he concludes that there is no difference in bleeding between the groups. His confidence is shaken when he switches to fondaparinux and his patients experience increased postoperative bleeding.

Let’s evaluate this study’s bleeding rates using confidence intervals. One might reasonably conclude that an outcome of 1 percent or more difference between the groups is, indeed, a clinically meaningful difference in bleeding:

  • The actual rates for major bleeding were 47/ 1140 (4.1%) in the fondaparinux group vs 32/ 1133 (2.8%) in the enoxaparin group, up to day 11, a difference of 1.3%, p=0.11.
  • But CIs provide more information: The absolute risk increase (ARI) with fondaparinux was 1.3%, but the 95% CI was (-0.3% to 2.9%), and since the true difference could be as great as 2.9% (i.e., clinically relevant), the authors’ conclusions are misleading.
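The interval can be approximated from the reported counts alone. The sketch below uses the simple Wald method for a difference in proportions, which is an assumption on our part; other methods give slightly different bounds. The function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_a, n_a, events_b, n_b, level=0.95):
    """Difference in event rates (A minus B) with a Wald confidence interval."""
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff, diff - z * se, diff + z * se


# Major bleeding: fondaparinux 47/1140 vs enoxaparin 32/1133.
diff, lower, upper = risk_difference_ci(47, 1140, 32, 1133)
print(f"ARI = {diff:.1%}; 95% CI ({lower:.1%} to {upper:.1%})")
```

By this method the interval spans zero, consistent with the non-significant p-value, while its upper boundary sits well above the 1% threshold chosen above, so a clinically meaningful increase in bleeding has not been excluded.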

The Cochrane Handbook summarizes this problem nicely:

"A common mistake when there is inconclusive evidence is to confuse ‘no evidence of an effect’ with ‘evidence of no effect.’ When there is inconclusive evidence, it is wrong to claim that it shows that an intervention has ‘no effect’ or is ‘no different’ from the control intervention. It is safer to report the data, with a confidence interval, as being compatible with either a reduction or an increase in the outcome. When there is a ‘positive’ but statistically non-significant trend, authors commonly describe this as ‘promising,’ whereas a ‘negative’ effect of the same magnitude is not commonly described as a ‘warning sign.’ Authors should be careful not to do this." [2]

Comments:
Following the Lassen study referenced above, others confirmed the increased bleeding rate leading to re-operation and other significant bleeding with fondaparinux vs enoxaparin. [3]

When investigators provide p-values but not confidence intervals, readers can quickly calculate the 95% CIs if the outcomes are dichotomous and the investigators report the actual rates of events, as in the example above, by using the calculator available at:
http://www.graphpad.com/quickcalcs/NNT1.cfm

Also, see our web links for other sources (search “confidence intervals”):
http://www.delfini.org/delfiniWebSources.htm

References:

  1. Lassen MR, Bauer KA, Eriksson BI, Turpie AG. Postoperative fondaparinux versus preoperative enoxaparin for prevention of venous thromboembolism in elective hip replacement surgery: a randomised double-blind comparison. Lancet. 2002;359:1715-20. [PMID: 12049858]
  2. Higgins JPT, Green S, editors. 9.7 Common errors in reaching conclusions. Cochrane Handbook for Systematic Reviews of Interventions 4.2.6 [updated September 2006]. http://www.cochrane.org/resources/handbook/hbook.htm (accessed 22nd January 2008).
  3. Vormfelde SV. Comment on: Lancet. 2002 May 18;359(9319):1710-1. Lancet. 2002 Nov 23;360(9346):1701. PMID 12457831.

Understanding Number Needed to Treat (NNT)

We have found that health care professionals very commonly do not understand the steps in calculating NNT. Bandolier offers a classic article on NNT on its website. We heartily recommend reviewing this article if you have any questions or uncertainties about what NNT means or how to calculate and use NNT information.

http://www.jr2.ox.ac.uk/bandolier/booth/painpag/NNTstuff/numeric.htm
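
As a quick reminder of the arithmetic, NNT is the reciprocal of the absolute risk reduction (ARR), where ARR is the control event rate minus the treated event rate. The sketch below is a minimal illustration with hypothetical event rates; the function name is ours, and rounding up to the next whole patient is a common convention rather than a rule taken from the article above.

```python
import math

def number_needed_to_treat(control_event_rate, treated_event_rate):
    """Return NNT = 1 / ARR, rounded up to a whole number of patients."""
    arr = control_event_rate - treated_event_rate  # absolute risk reduction
    if arr <= 0:
        raise ValueError("No absolute risk reduction; NNT is undefined here.")
    # round() guards against floating-point noise (e.g. 10.000000000000002)
    return math.ceil(round(1 / arr, 6))

# Hypothetical rates: 20% of control patients and 15% of treated patients
# have the event, so ARR = 5% and NNT = 1 / 0.05 = 20
print(number_needed_to_treat(0.20, 0.15))
```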

Early Discontinuation of Clinical Trials: Oncology Medication Studies—Recent Developments and Concern: 04/28/08

With the trend toward more rapid approval of oncology drugs has come concern about the validity of reported results because of methodological problems. The validity and usefulness of reported results from oncology (and other) studies are clearly threatened by lack of randomization and blinding, the use of surrogate outcomes and other methodological problems. Trotta et al. have extended this concern in a recent study that highlights an additional problem with oncology studies: stopping oncology trials early [1. Trotta F, Apolone G, Garattini S, Tafuri G. Stopping a trial early in oncology: for patients or for industry? Ann Oncol. 2008 Apr 9 [Epub ahead of print] PMID: 18304961].

The aim of the study was to assess the use of interim analyses in randomized controlled trials (RCTs) testing new anticancer drugs, focusing on oncological clinical trials stopped early for benefit. A second aim was to estimate how often trials prematurely stopped as a result of an interim analysis are used for registration, i.e., approval by the European Medicines Agency (EMEA), the European equivalent of FDA approval. The authors searched Medline, hand-searched The Lancet, The New England Journal of Medicine, and The Journal of Clinical Oncology, and evaluated all clinical trials stopped early for benefit that were published in the last 11 years. The focus was on trials of anticancer drugs that included an interim analysis.

Results and Authors’ Conclusions
Twenty-five RCTs were analyzed. In 95% of studies, at the interim analysis, efficacy was evaluated using the same end point as planned for the final analysis. The authors found a consistent increase (>50%) in prematurely stopped trials in oncology during the last 3 years. As a consequence of early stopping after the interim analysis, approximately 3,300 patients/events across all studies were spared potential harms of continued therapy. This may appear to be clearly beneficial, but as the authors point out, stopping a trial early does not guarantee that other patients will receive the apparent benefit of stopping, assuming one exists, unless study findings are immediately publicly disseminated. The authors found long delays (approximately 2 years) between study termination and published reports. If the trials had continued for these further 2 years, more efficacy and safety data could have been gathered. Delays in reporting results further lengthen the time needed for translating trial findings into practice.

Surprisingly, only a very small percentage of trials (approximately 4%) were stopped early because of harms, i.e., serious adverse events. Toxicity is therefore not the main factor leading to early termination of trials. Of the 25 trials, six had no data and safety monitoring board (DSMB) and five had enrolled less than 40% of the planned sample size. Even so, 11 were used to support licensing applications on the basis of what could have been exaggerated chance events. Thus, more than 78% of the early-stopped oncology RCTs published in the last 3 years were used for registration purposes. The authors argue that only untruncated trials can provide a full level of evidence which might be useful for informing clinical practice decisions without further confirmative trials. They concluded that early termination may be done for ethical reasons such as minimizing the number of people given an unsafe, ineffective, or clearly inferior treatment. However, interim analyses may also have drawbacks, since stopping trials early for apparent benefit will systematically overestimate treatment effects [2. Pocock SJ. When (not) to stop a clinical trial for benefit. JAMA 2005; 294: 2228–2230. PMID: 16264167] and raises new concerns about what the authors describe as “market-driven intent.” Some additional key points made by the authors:

  • Repeated interim analyses at short intervals raise concern about data reliability: this strategy risks the appearance of seeking the statistical significance necessary to stop a trial;
  • Repeated analyses on the same data pool often lead to statistically significant results only by chance;
  • If a trial is evaluating the long-term efficacy of a treatment for conditions such as cancer, short-term benefits — no matter how significant statistically — may not justify early stopping. Data on disease recurrence and progression, drug resistance, metastasis, or adverse events could easily be missed. Early stopping may reduce the likelihood of detecting a difference in overall survival (the only relevant endpoint in this setting).
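
The point about repeated analyses producing significant results by chance can be illustrated with a small simulation. The sketch below is a simplified illustration under our own assumptions, not the method of any cited paper: each simulated trial has no true treatment effect, outcome differences are drawn from a standard normal distribution with known variance, and the trial is "stopped for benefit" the first time an unadjusted two-sided test crosses the conventional 1.96 threshold. The function name and parameter values are hypothetical.

```python
import math
import random

def false_positive_rate(looks, per_look=20, trials=4000, z_crit=1.96, seed=1):
    """Fraction of no-effect trials that cross |z| > z_crit at any look."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(looks):
            # Accrue a new batch of outcome differences with NO true effect
            total += sum(rng.gauss(0.0, 1.0) for _ in range(per_look))
            n += per_look
            # z statistic for the accumulated data (variance assumed known)
            if abs(total / math.sqrt(n)) > z_crit:
                crossed += 1
                break  # the trial would be declared positive and stopped
    return crossed / trials

for looks in (1, 5, 10):
    print(f"{looks:2d} look(s): false-positive rate = "
          f"{false_positive_rate(looks):.3f}")
```

With a single analysis the false-positive rate should land near the nominal 5%; with several unadjusted looks at the accumulating data it comes out noticeably higher, which is why formal stopping boundaries are stricter than a plain p < 0.05 at each look.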

The authors conclude that:

…a decision whether to stop a clinical trial before its completion requires a complex of ethical, statistical, and practical considerations, indicating that results of RCTs stopped early for benefit should be viewed with criticism and need to be further confirmed. The main effect of such decisions is mainly to move forward to an earlier-than-ideal point along the drug approval path; this could jeopardise consumers’ health, leading to unsafe and ineffective drugs being marketed and prescribed. Even if well designed, truncated studies should not become routine. We believe that only untruncated trials can provide a full level of evidence which can be translated into clinical practice without further confirmative trials.

Lancet Comment
In a Lancet editorial on April 19, 2008, the editorialist states that early stopping of RCTs should require proof beyond reasonable doubt that equipoise no longer exists. Data safety and monitoring boards must balance the decision to stop, which favors immediate stakeholders (participants, investigators, sponsors, manufacturers, patients’ advocates, and editors), against continuing the study to obtain more accurate estimates not only of effectiveness but also of longer-term safety. The editorialist adds that, in judging whether or not to stop a trial early for benefit, the plausibility of the findings and their clinical significance are as important as statistical boundaries.

Delfini Comments
Overall we are concerned about the FDA’s loosening of standards for accepting oncology study data as valid when it comes from studies that many would judge to be fatally f