DelfiniClick

A cool click for information nibblets. Check here periodically for information and clicks worth making...
For evidence-based medicine basics, see Evidence Essentials. For other news and items specific to practitioner and patient information aids, see our newsletter, On the Same Page™.

Contents      

About PMID Numbers: We frequently utilize a PMID number in place of a citation. Where PMID numbers are available, enter that number into the PubMed search box to retrieve that citation and listing.

New 11/10/07: Confidence-Intervals, Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings »

Other Key Links:
Evidence Essentials »
On the Same Page »

Variation/Quality of Care
The Volume of Inappropriate Care in the US »
Variations in Volume & Intensity of Hospital Care »
Variations in Experts' Recommendations »
Variations in Clinicians' Estimates of Pretest Probabilities »
Underuse of Proven Interventions »
Class Effect: Caution Urged »
Guidelines & Effectiveness of Implementation »
Oregon Preferred Drug List »
Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic »

Reporting
Untrustable Abstracts & P-Values »
CONSORT Statement on Harms »
When There is No Evidence »
TREND: Reporting Standards for Non-randomized Studies »

Poorly Written Papers »

Media Heyday: Aspirin and (Potentially) Reduced Risk of Breast Cancer »

Evidence & Patients
Better Evidence Choices: Carotid Endarterectomy @ On the Same Page (use your BACK button to return to DelfiniClick) »
Screening & Decision Aids »
01/05/07: Bandolier on Patients and Risk Information @ On the Same Page (use your BACK button to return to DelfiniClick) »
Strategies for Increasing Adherence »
Discussing Information and Decisions with Patients »

EBM Humor
EBM Secret Truths Uncovered!!! »
A Field Guide to Experts! »

Special Feature: The Validity Detective

Quality of Evidence

Observational Studies
More on the Problem with Drawing Cause-Effect Conclusions from Observational Studies »
Cause & Effect Conclusions from Observations »
Bias in Observational Studies: More on HRT in Menopause »

Primary Studies & General Concepts
Understanding Number Needed to Treat (NNT) »
Concealment of Allocation »
Blinding and RCTs »
Blinding in Surgery Trials »
The Importance of Blinded Assessors in RCTs »
Attrition Bias: Intention-to-Treat Basics »
Intention-to-Treat Analysis: Censoring »
Intention-to-Treat Analysis: Misreporting and Migraine »
Missing Data Points: Difference or No Difference »
Quality of Studies: Lower Quality = Greater Effect Size »
   More: Overestimation of Effect Size in Studies of Low Quality »
External Validity: Case of the Carotid Stent »
Quality of Studies: VIGOR »
Confidence-Intervals, Power & Meaningful Clinical Benefit »

Diagnostic Studies
Bias in Diagnostic Studies »

Secondary Studies
Systematic Reviews: Quality & Searching Tips »
Systematic Reviews: Untrustable "Trustable" Sources? »
Delfini Letter to BMJ: Corticosteroids are Not Proven for Treatment of Bell's Palsy »
Delfini Letter to NEJM: Bell's Palsy - REDUX! »
Substandard Evidence: POEMS and Diabetes — Readers, Beware! »
Quality of Systematic Reviews: Misleading “POEM” on Hormone Therapy »

Secondary Sources
Quality of Clinical Guidelines »
Poor Quality of Guidelines: Case Study — The Evidence on Well-Child Care Recommendations »

Quality Improvement
Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic »

For Clicks that do not work, please email us to let us know. For many of them, you may be able to go to PubMed and do a quick search using the citation: www.PubMed.gov.


Confidence-Intervals, Power & Meaningful Clinical Benefit:
Advice to Readers on How to Stop Worrying about Power and Start Using Confidence Intervals &
Using Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings
(Special thanks to Brian Alper, MD, MSPH and Ted Ganiats, MD for their help in understanding this issue.)

Problems with Non-Statistically Significant Findings
Research outcomes which are not statistically significant (also referred to as “non-significant findings”) raise the question, "Is there TRULY no difference, or were there not enough people to show a difference if there is one?" (This is known as beta- or Type II error.)

Power calculations, performed prior to a study, help investigators determine the number of people they should enroll in order to detect a statistically significant difference if there is one. A power of >= 80% is conventional and provides some leeway for chance. Power calculations are generally performed only for the primary outcome, and they entail a lot of assumptions.
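To make the idea of a power calculation concrete, here is a minimal sketch using a common normal-approximation formula for comparing two proportions. The function name, the two-sided alpha of 0.05, and the example event rates are our assumptions for illustration; real trials often use more refined methods and adjust for expected dropout.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate participants needed per group to detect p_control vs p_treated.

    Uses the normal approximation for a two-sided comparison of two
    proportions. All inputs are assumptions the investigators must supply.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    difference = p_control - p_treated
    return math.ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Hypothetical example: 10% mortality in the control group, 5% hoped for with treatment.
n_80 = sample_size_per_group(0.10, 0.05, power=0.80)
n_90 = sample_size_per_group(0.10, 0.05, power=0.90)
print(n_80, n_90)
```

Note how sensitive the answer is to the assumed event rates: halving the expected difference roughly quadruples the required enrollment, which is one reason power calculations "entail a lot of assumptions."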

Good News About Power!
The good news for readers is that you don’t need to worry about power since you can evaluate inconclusiveness of findings through using confidence intervals.

Here’s what they are, and here’s how it’s done:

About Confidence Intervals (CIs)
The results of a valid study represent an approximation of truth. There might be other possible values that could equally approximate truth. (What if the study had been done on Friday instead of on Tuesday, for example? Maybe the difference in outcomes would be an absolute 4 percent and not 5 percent.) In recognition of this, a confidence interval is a calculated range of statistically plausible results within which, for a valid study, there is a 95% chance that the true answer lies. (As with all allowances for chance findings, 95 percent is conventional.) You can apply confidence intervals to any measure of outcomes, such as an odds ratio or absolute risk reduction (ARR).

This is how confidence intervals are reported:

Example: ARR = 5%; 95% CI (3% to 7%)

How to Use Confidence Intervals to Determine Statistical Significance

Absolute Risk Reduction and Relative Risk Reduction
For measures reported as percentages, if the range includes zero, the outcomes are not statistically significant.

Relative Risk (aka Risk Ratio) and Odds Ratio
For measures reported as ratios, if the range includes 1, the outcomes are not statistically significant.
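The two rules above can be sketched as a single check. This is only an illustration; the function name and the "difference"/"ratio" labels are ours, not standard terminology.

```python
def is_statistically_significant(ci_lower, ci_upper, measure_type):
    """Apply the rule of thumb: a CI that includes the 'no effect' value is non-significant.

    measure_type: "difference" for measures such as ARR or RRR (no effect = 0),
                  "ratio" for measures such as relative risk or odds ratio (no effect = 1).
    """
    if measure_type == "difference":
        no_effect = 0.0
    elif measure_type == "ratio":
        no_effect = 1.0
    else:
        raise ValueError("measure_type must be 'difference' or 'ratio'")
    return not (ci_lower <= no_effect <= ci_upper)


# ARR = 5%; 95% CI (3% to 7%): the range excludes zero.
print(is_statistically_significant(3, 7, "difference"))
# A hypothetical odds ratio with 95% CI (0.80 to 1.25): the range includes 1.
print(is_statistically_significant(0.80, 1.25, "ratio"))
```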

How to Use Confidence Intervals to Determine Conclusiveness of Non-significant Findings
And if something is not statistically significant (also referred to as non-significant or NS findings), you don’t know if there truly is no difference, or whether there were not enough people to show a difference if there is one.

You can look to the CIs to help you with this situation. But first you want to decide what you would consider to be your minimum requirement for a clinically significant outcome (difference between outcomes in the intervention and comparison groups). This is a judgment call.

Let’s assume we are looking at a study, the primary outcome for which is absolute reduction in mortality. One might reasonably conclude that an outcome of 1 percent or more is, indeed, a clinically meaningful benefit.

[Below is a text explanation. Pictures tell this best, however. Click here to view a PDF of what this looks like graphically. Note that the PDF starts out first with how to determine clinical significance of statistically significant outcomes and then demonstrates how to determine conclusiveness of non-significant findings.]

Example: Clinical Significance Goal
>=1% absolute reduction in mortality

For Non-Significant Findings:

Example 1

  • ARR = 2%; 95% CI (-1% to 5%)
  • The upper boundary tells you it is possible that the true result WOULD meet your requirements for clinical significance. From that perspective, this trial is inconclusive about NO DIFFERENCE BETWEEN GROUPS: you do not know whether the trial was insufficiently powered (a false negative due to too few people to show a statistically significant difference if there is one).

Example 2

  • ARR = 0%; 95% CI (-0.5% to 0.5%)
  • The upper boundary does not reach your goal – therefore, this can be considered sufficient evidence that there is no difference between the groups that you would consider clinically significant
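The reasoning in Examples 1 and 2 can be written as a small check. This is a sketch under our own assumptions: the function name is hypothetical, values are in percentage points of ARR, a larger ARR means more benefit, and the CI is assumed to include zero (a non-significant finding).

```python
def interpret_nonsignificant(ci_upper, clinical_goal):
    """Interpret a non-significant ARR by comparing the CI upper boundary with your goal.

    ci_upper and clinical_goal are in the same units (here, percentage points of ARR).
    """
    if ci_upper >= clinical_goal:
        # A clinically meaningful benefit is still statistically plausible.
        return "inconclusive"
    # Even the most favorable plausible result falls short of the goal.
    return "no clinically meaningful difference"


goal = 1.0  # >= 1% absolute reduction in mortality (a judgment call)
print(interpret_nonsignificant(5.0, goal))   # Example 1: 95% CI (-1% to 5%)
print(interpret_nonsignificant(0.5, goal))   # Example 2: 95% CI (-0.5% to 0.5%)
```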

How to Use Confidence Intervals to Evaluate Clinical Benefit of Statistically Significant Findings
Again, you can also use confidence intervals to determine whether a result from a valid study is of meaningful clinical benefit.

Requirements for Meaningful Clinical Benefit
Remember that outcomes of clinical significance are those which benefit patients in some way in the areas of morbidity, mortality, symptom relief, physical or emotional functioning or health-related quality of life. Intermediate markers are assumed to benefit patients in these areas, but they may not - thus, a direct causal chain of benefit must be proved to avoid waste and potential patient harms occurring as unintended consequences. Meaningful clinical benefit is a combination of benefits in a clinically significant area along with the size of the results.

As with evaluating the conclusiveness of a non-significant finding, you apply judgment to set your minimum requirement for meaningful clinical significance. Using the same example of your choosing 1 percent absolute reduction in mortality as meaningful clinical benefit:

Example: Clinical Significance Goal
>=1% absolute reduction in mortality

For Statistically Significant Findings:

Example 1

  • ARR = 2%; 95% CI (0.5% to 3.5%)
  • The lower boundary tells you it is possible that the true result will NOT meet your requirements for clinical significance – thus, from that perspective this trial is inconclusive

Example 2

  • ARR = 2%; 95% CI (1% to 3%)
  • The lower boundary reaches your goals for clinical significance – therefore, this can be considered sufficient evidence of benefit
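For statistically significant findings the same comparison is made against the lower boundary. A minimal sketch, with a hypothetical function name and values in percentage points of ARR, assuming the CI already excludes zero:

```python
def interpret_significant(ci_lower, clinical_goal):
    """Interpret a statistically significant ARR by comparing the CI lower boundary with your goal."""
    if ci_lower >= clinical_goal:
        # Even the least favorable plausible result meets the goal.
        return "sufficient evidence of meaningful benefit"
    # The true result might be smaller than what you consider meaningful.
    return "inconclusive"


goal = 1.0  # >= 1% absolute reduction in mortality (a judgment call)
print(interpret_significant(0.5, goal))  # Example 1: 95% CI (0.5% to 3.5%)
print(interpret_significant(1.0, goal))  # Example 2: 95% CI (1% to 3%)
```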

Again, pictures probably tell this best. Click here to view the PDF.

The Authors Did Not Report CIs?
If you can create a 2 x 2 table from the study data, you can compute them yourself using the confidence interval calculator of the University of British Columbia, Department of Health Care and Epidemiology » which can also be found in the Delfini WebLinks » under "confidence interval calculations."
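If you prefer to see the arithmetic, here is a minimal sketch of one common way to get a confidence interval for the ARR from a 2 x 2 table, the simple normal-approximation (Wald) interval. We do not know which method the calculator above uses, so treat this as an illustration only; the function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def arr_with_ci(events_control, n_control, events_treated, n_treated, confidence=0.95):
    """Return (ARR, lower, upper) as proportions using a Wald interval.

    The Wald interval can be unreliable with small samples or very rare events.
    """
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    arr = risk_control - risk_treated
    standard_error = math.sqrt(
        risk_control * (1 - risk_control) / n_control
        + risk_treated * (1 - risk_treated) / n_treated
    )
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return arr, arr - z * standard_error, arr + z * standard_error


# Hypothetical trial: 100 deaths among 1000 controls, 50 deaths among 1000 treated.
arr, lower, upper = arr_with_ci(100, 1000, 50, 1000)
print(f"ARR = {arr:.1%}; 95% CI ({lower:.1%} to {upper:.1%})")
```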

Evaluate Definitions for Outcomes
And remember, ensure you agree with the authors’ definitions of the outcomes, especially if they are using a term like “improved,” “success,” or “failure” – is a three-point change on a 200 point scale really a meaningful clinical difference that should define success? You get to be the judge.

Overestimation of Effect Size in Studies of Low Quality

In a previous DelfiniClick, we summarized an article by Moher and colleagues (1) in which the authors randomly selected 11 meta-analyses involving 127 RCTs which evaluated the efficacy of interventions used for circulatory and digestive diseases, mental health, pregnancy and childbirth. Moher and colleagues concluded that -

  • Low-quality trials, compared with high-quality trials (score >2), were associated with a relative increased estimate of benefit (34%).
  • Trials that used inadequate allocation concealment, compared with those that used adequate methods, were associated with a relative increased estimate of benefit (37%).

Below we summarize another study that confirms and expands Moher’s findings. In a study similar to Moher’s, Kjaergard and colleagues (2) evaluated the effects of methodologic quality on estimated intervention effects in randomized trials.

The study evaluated 23 large and 167 small randomized trials and a total of 136,164 participants. Methodologic quality was defined as the confidence that the trial’s design, conduct, analysis, and presentation minimized or avoided biases in the trial’s intervention comparisons (3). The reported methodologic quality was assessed using four separate components and a composite quality scale.

The quality score was ranked as low (≤2 points) or high (≥3 points), as suggested by Moher et al. (1). The four components were 1) generation of allocation sequence; 2) concealment of allocation; 3) double-blinding; and 4) reporting of loss-to-follow-up.

RESULTS OF KJAERGARD ET AL’S REVIEW (all reported exaggerations are relative increases):

Generation of Allocation Sequence
The odds ratios generated by all trials (large and small) with inadequate generation of the allocation sequence were on average significantly exaggerated by 51% compared with all trials reporting adequate generation of allocation sequence (ratio of odds ratios (95% CI) = 0.49 (0.30–0.81), P < 0.001).

Concealment of Allocation
All trials with inadequate allocation concealment exaggerated intervention benefits by 40% compared with all trials reporting adequate allocation concealment (ratio of odds ratios (95% CI) = 0.60 (0.31–1.15), P = 0.12). Odds ratios were significantly exaggerated by 52% in small trials with inadequate versus adequate allocation concealment (ratio of odds ratios (95% CI) = 0.48 (0.25–0.92), P = 0.027).

Double Blinding
The odds ratios generated by all trials without double blinding were significantly exaggerated by 44% compared with all double-blind trials (ratio of odds ratios (95% CI) = 0.56 (0.33–0.98), P = 0.041).

Reporting of Loss-to-Follow-up
The analyses showed no significant association between reported follow-up and estimated intervention effects (ratio of odds ratios (95% CI) = 1.50 (0.80–2.78), P = 0.2).
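The percentages quoted above appear to follow directly from the ratios of odds ratios: a ratio of 0.49 corresponds to a 51% relative exaggeration, that is, (1 - 0.49) x 100. The sketch below assumes that reading, which matches every figure reported here; the function name is ours.

```python
def percent_exaggeration(ratio_of_odds_ratios):
    """Convert a ratio of odds ratios into a relative exaggeration of benefit, in percent.

    Assumes a ratio below 1 means the lower-quality trials reported more benefit.
    """
    return (1 - ratio_of_odds_ratios) * 100


reported = {
    "inadequate sequence generation": 0.49,
    "inadequate allocation concealment (all trials)": 0.60,
    "inadequate allocation concealment (small trials)": 0.48,
    "no double blinding": 0.56,
}
for component, ratio in reported.items():
    print(f"{component}: {percent_exaggeration(ratio):.0f}% exaggeration")
```

Remember that the point estimate alone is not the whole story: the 40% figure for allocation concealment across all trials has a confidence interval that includes 1 (0.31–1.15), so it is not statistically significant.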

Kjaergard and Colleagues’ Conclusions

  1. Adequate generation of the allocation sequence and adequate allocation concealment should be required for adequate randomization.

    Unlike previous investigators (1, 3, 4, 5), the authors found that trials with inadequate generation of allocation sequence exaggerate intervention effects significantly.

  2. Trials with inadequate allocation concealment also generate exaggerated results.

    This is in accordance with previous evidence (1,3,5). The authors found that despite the considerable overlap between generation of allocation sequence and allocation concealment, both factors may independently affect the estimated intervention effect.

  3. Trials without double blinding exaggerate results.

    This study supports Schulz and colleagues’ finding of a significant association between intervention effects and double blinding and extends the evidence by including trials from several therapeutic areas.

  4. There was no association between reported follow-up and intervention effect.

Delfini Comment
It is useful to know quantitatively how various threats to validity affect results when doing critical appraisal of a study. The study by Kjaergard and colleagues summarized above expands the findings of Schulz, Moher, Juni and others.

Previous studies have questioned the reliability of reported losses to follow-up (5, 6). In accordance with Schulz and colleagues’ results (5), the authors found no association between intervention effects and reported follow-up.
Delfini Note: We have found that losses to follow-up may significantly affect P values when sensitivity analysis is done. We consider loss of ≥5% with differential loss or ≥10% without differential loss to be an important threat to validity.
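Our rule of thumb can be expressed as a simple check. This is only a sketch: the function name is hypothetical, and deciding whether loss is "differential" between groups remains a judgment the reader has to make before calling it.

```python
def attrition_threatens_validity(overall_loss, differential_loss):
    """Flag loss to follow-up as an important threat to validity.

    overall_loss: fraction of randomized subjects lost (0.06 means 6%).
    differential_loss: True if the reader judges loss to differ between groups.
    """
    threshold = 0.05 if differential_loss else 0.10
    return overall_loss >= threshold


print(attrition_threatens_validity(0.06, differential_loss=True))
print(attrition_threatens_validity(0.06, differential_loss=False))
print(attrition_threatens_validity(0.24, differential_loss=False))
```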

In agreement with the findings of Moher and associates (1,3) and Juni and colleagues (7), the authors found that trials with a low quality score on the scale developed by Jadad and colleagues (8) significantly exaggerate intervention benefits.

Kjaergard and colleagues conclude that assessment of methodologic quality should focus on generation of allocation sequence, allocation concealment, and double blinding. Delfini feels this is not sufficient – but appreciates this study as one that further demonstrates the importance of effective approaches to some of these methodologic areas.

References
1. Moher D, Pham B, Jones A, Cook DJ, Jadad AR, Moher M, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet. 1998;352:609-13. [PMID: 9746022]

2. Kjaergard LL, Villumsen J, Gluud C. Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Ann Intern Med. 2001;135:982-9.

3. Moher D, Cook DJ, Jadad AR, Tugwell P, Moher M, Jones A, et al. Assessing the quality of reports of randomised trials: implications for the conduct of meta-analyses. Health Technol Assess. 1999;3:i-iv, 1-98. [PMID: 10374081]

4. Emerson JD, Burdick E, Hoaglin DC, Mosteller F, Chalmers TC. An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Control Clin Trials. 1990;11:339-52. [PMID: 1963128]

5. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-12. [PMID: 7823387]

6. Gøtzsche PC. Methodology and overt and hidden bias in reports of 196 double-blind trials of nonsteroidal antiinflammatory drugs in rheumatoid arthritis. Control Clin Trials. 1989;10:31-56. [PMID: 2702836]

7. Juni P, Witschi A, Bloch R, Egger M. The hazards of scoring the quality of clinical trials for meta-analysis. JAMA. 1999;282:1054-60. [PMID: 10493204]

8. Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials. 1996;17:1-12. [PMID: 8721797]

Missing Data Points: Difference or No Difference — Does it Matter?

We continue to study the "evidence on the evidence," meaning we are continually on the lookout for information that may shed light on how certain kinds of bias affect reported outcomes, or that helps in handling different biases. Missing data points are an issue affecting the majority of studies, but it is currently unclear how big an issue this is, especially when there is no differential loss between groups.

We spoke recently about this issue with John M. Lachin, Sc.D., Professor of Biostatistics and Epidemiology, and of Statistics, The George Washington University, and author. (And then we did some "hard thinking" as David Eddy would say.) Even without differential loss between the groups overall, a differential loss could occur in prognostic variables — and readers are rarely going to have access to data about changes in prognostic characteristics post-baseline reporting. So we continue to offer our conservative approach that loss of around five percent with differential loss is a bias as well as loss of around ten percent or more without differential loss.

For those who are tough and hardy and really want to mull on this, here's our updated white paper on "missingness" [Word] or [PDF]. We welcome further thoughts (or evidence) on this area.

The Importance of Blinded Assessors in RCTs

We have previously summarized the problems associated with lack of blinding in surgical (and other) studies — see Blinding in Surgery Trials in a previous DelfiniClick. The major problem with unblinded studies is that the outcomes in the intervention group are likely to be falsely inflated because of the biases introduced by lack of blinding.

Recently a group of orthopedists identified and reviewed thirty-two randomized, controlled trials published in The Journal of Bone and Joint Surgery between 2003 and 2004 to evaluate the effect of blinded assessment vs non-blinded assessment on reported outcomes [1].

Results

  1. Sixteen of the thirty-two randomized controlled trials did not report blinding of outcome assessors when blinding would have been possible.
  2. Among the studies with continuous outcome measures, unblinded outcomes assessment was associated with significantly larger treatment effects than blinded outcomes assessment (standardized mean difference, 0.76 compared with 0.25; p = 0.01).
  3. In the studies with dichotomous outcomes, unblinded outcomes assessments were associated with significantly greater treatment effects than blinded outcomes assessments (odds ratio, 0.13 compared with 0.42; p < 0.001).
  4. This translates into a relative risk reduction of 38% for blinded outcome assessments compared with 71% for unblinded outcome assessments (a difference of 33%).

Conclusion
Unblinded outcomes assessment dramatically inflates the reported benefit of treatments.

Delfini Commentary
This is yet another study pointing out the importance of blinding. Based on this and other similar studies it is our conclusion that studies or the results of studies without blinded assessors are grade U or at best grade B-U (see evidence-grading scale here).

1. Poolman RW, Struijs PA, Krips R, Sierevelt IN, Marti RK, Farrokhyar F, Bhandari M. Reporting of outcomes in orthopaedic randomized trials: does blinding of outcome assessors matter? J Bone Joint Surg Am. 2007 Mar;89(3):550-8. PMID: 17332104.



Successful Evidence-based QI Project: Diabetes Management at Dreyer Medical Clinic
Example provided by Rami Rihani, PharmD, Director of Pharmacy

Delfini Introduction
Measuring clinical improvements is complex. One of the most important, frequently misunderstood issues is that cause and effect relationships can only be drawn with reasonable certainty from valid experiments (RCTs). However, if we have valid evidence from RCTs that an intervention leads to improved clinical outcomes, it is then reasonable to use process measures to evaluate the success of our evidence-based clinical improvement project.

Generally, we advise people to measure — not health status outcomes — but to perform a process measurement to evaluate the success of application of the intervention. In other words, we advise people to measure the success of implementation of the clinical improvement. For example, if we are trying to ensure patients get a beta-blocker post-MI, we would recommend looking to see if prescriptions increased for hospitalized MI patients — not to measure whether patient survival was improved. This is because observational data, such as information extracted from databases, can be highly prone to confounding. If health status outcomes are measured, then we advise people to ensure that there is a sufficient understanding of all those utilizing the data that conclusions drawn from observational data can be misleading. In the above example, if patient survival decreased, there could be many explanations.

However, if a health status outcome is measured, and if the before/after change is dramatic, it is reasonable to hypothesize that our project has been successful. For example…

Problem
Many diabetics have difficulty achieving an HbA1c <7.0%. Frequently diabetics are told their HbA1cs are too high, but active medication change is not aggressively pursued.

Evidence-based QI Project: A quality improvement group at Dreyer Medical Clinic developed a disease management initiative using PharmDs to actively titrate dosages of insulin and other drugs based on the Intermountain Health Care (IHC) diabetes management protocol. The process is as follows:

  • Primary care physician (PCP) refers patient to the diabetes management program;
  • PharmD aggressively titrates medication based on IHC protocol;
  • PharmD monitors for safety and efficacy of medication interventions in collaboration with the PCP

Outcomes

Outcome (n=1049)      Prior to Enrollment    Most Recent Follow-up
% at HbA1c < 7%       18%                    48.5%
% at LDL < 100        30%                    58%

Delfini Commentary
There was a significant improvement in the percent of patients achieving goal HbA1c and LDL associated with this project.

It is reasonable to believe that the clinical improvement project was successful. Using outcomes data from the UK Prospective Diabetes Study 35 (1), the QI team estimates that since inception, the disease management initiative resulted in the prevention of —

  • four diabetes related deaths and
  • nine microvascular events (defined as renal failure, death from renal failure, retinal photocoagulation, or vitreous hemorrhage)

1. Stratton IM, Adler AI, et al. Association of glycaemia with macrovascular and microvascular complications of type 2 diabetes (UKPDS 35): prospective observational study. BMJ. 2000;321:405-12.


Blinding In Surgical Trials — It is Through Blinding We Become Able To See

Blinding is an important consideration when evaluating a study. Without blinding, the likelihood of bias increases. Bias occurs when patients in one group experience care or exposures not experienced by patients in the other group(s), and the differences in care affect the study outcomes. Lack of blinding may be a major source of this type of bias: unblinded clinicians, who are frequently “rooting for the intervention,” may behave differently than blinded clinicians towards patients whom they know to be receiving the drug or intervention being studied. The likely result is that, in unblinded studies, patients receive different or additional care. Unblinded subjects may be more likely to drop out of a study or seek care in ways that differ from blinded subjects. Unblinded assessors may also be “rooting for the intervention” and assess outcomes differently from blinded assessors.

How much difference does blinding make? Jüni et al. reviewed four studies that compared double-blinded versus non-blinded RCTs and attempted to quantify the amount of distortion (bias) caused by lack of double blinding [1]. Overall, the overestimation of effect was about 14%. The largest study reviewed by Jüni assessed the methodological quality of 229 controlled trials from 33 meta-analyses and then analyzed, using multiple logistic regression models, the associations between those assessments and estimated treatment effects [2]. Trials that were not double-blind yielded on average a 17% greater effect, 95% CI (4% to 29%), than blinded studies (P = .01).

Lack of double blinding is frequently found in surgical trials and results in uncertain evidence because of the problems stated above. A recent multicenter RCT, the Spine Patient Outcomes Research Trial (SPORT) [3], was a non-blinded trial that serves as an interesting case study of the blinding issues that arise when a surgical intervention is compared to a non-surgical intervention and blinding is not attempted. The trial included patients with persistent (at least 6 weeks) disk-related pain and neurologic symptoms (sciatica) who were randomized to undergo diskectomy or receive usual care (not standardized, but frequently including patient education, anti-inflammatory medication, and physical therapy, alone or in combination). There were a number of problems with this study, including lack of power, poor control of non-study interventions, a high proportion of patients who crossed over between treatment strategies (43% of those randomized to surgery had not undergone surgery by 2 years, and 42% of those randomized to conservative care did receive surgery), and lack of blinding. The degree of missing data was 24%-27%, without a true intention-to-treat analysis. Of great interest was an editorial that dealt with the problem of non-blinding in surgical studies. The editorialist, Flum, makes the following points [4]:

    • While the technique of sham intervention is well accepted in studies of medications using inactive pills (placebos), simulated acupuncture, and nontherapeutic conversation in place of therapeutic psychiatric interventions, it has only occasionally been applied to surgical trials. This is unfortunate because the use of sham controls has been critical in understanding just how much patient expectation influences outcomes after an operation.
    • A sham-controlled trial would be particularly relevant for spine surgery since the most commonly occurring and relevant outcomes are subjective.
    • Patients choosing surgical options may have high expectations. These may include a higher level of emotional “investment” in surgical care compared with usual care, based on the level of commitment resulting from a decision to have an operation and get through recovery. After the patient has accepted the risks of surgical intervention, the desire for improvement may drive perceptions about improvement.
    • Patients who opt for surgery may also differ from patients who decline surgery in their beliefs regarding the benefits of invasive interventions.
    • The surgeon’s expectations and direction are likely to play an important role in patient improvement.
    • Given the proliferation of operative procedures for the treatment of subjective complaints like back pain, the need for sham controlled trials has never been greater.

Flum goes on to present multiple examples of the power of suggestion and the problem of doing non-blinded trials in the field of surgery. Observational trials have often reported procedural success, but sham-controlled trials for the same conditions demonstrate how much of that success is due to the placebo effect.

  • Example 1 — Ligation of Internal Mammary: After multiple observational studies suggesting that ligation of the internal mammary artery was helpful in patients with coronary disease, Cobb et al randomized patients to operative arterial ligation or a sham procedure. Both groups improved after the intervention, but there were similar, if not greater, improvements in subjective measures such as exercise tolerance and nitroglycerin use in the sham surgical group.
  • Examples 2 and 3 Osteoarthritic Knee Surgery and Osteoarthritic Knee Joint Irrigation: After multiple case series reported that patients with osteoarthritis of the knee improve after arthroscopic surgery, Moseley et al demonstrated just how much of that effect is related to the hopes, expectations, and beliefs of the patient. The investigators randomized 180 patients to undergo arthroscopy with debridement, arthroscopy with lavage, or sham arthroscopy. The power of expectation was strong: patients were unable to determine whether they had been assigned to the treatment or sham groups, and all groups improved. At 2 years after randomization, all patients reported comparable pain scores and functional scores. Another sham-controlled study in patients with knee osteoarthritis demonstrated that patients benefit equally from irrigation of the joint and from sham irrigation.
  • Example 4 Parkinson’s Disease: Researchers found similar improvements in quality of life after direct brain injections of embryonic neurons or placebo in patients with advanced Parkinson’s disease.
  • Example 5 Transmyocardial Laser Revascularization in HF: Heart failure patients undergoing transmyocardial laser revascularization or sham procedures had equal improvements in subjective outcomes.
  • Example 6 Hernia: After hernia repair, there was equal improvement in pain control after cryoablation of nerves or sham interventions.
  • Examples 7-9 Laparoscopic Interventions: Multiple case series have reported benefit on subjective outcomes such as pain control, function, and readiness for discharge with laparoscopic cholecystectomy, colon resection, and appendectomy compared with conventional approaches. Bias arises when the clinical care team influences patient and discharge expectations through coaching, communication, and management. Randomized trials of these three procedures that blinded both the patients and the discharging clinicians to the treatment received, by placing large, side-to-side abdominal wall dressings, demonstrate little or no difference in patients reaching discharge criteria. A reasonable conclusion is that when the clinician’s expectations and “coaching” were removed by placing a large bandage on the abdominal wall, the subjective benefits disappeared. Flum concludes that studies not addressing both patient and clinician expectation on subjective outcomes do not inform the clinical community about the true role of the intervention.

Delfini Commentary
Blinding of subjects and everyone working with the subjects or study data to the assigned intervention (double-blinding) decreases the likelihood of bias. Bias may be more likely to occur when evaluating subjective outcomes such as pain, satisfaction, and function in non-blinded studies, but it has also been reported with objective outcomes such as mortality. When dealing with subjective outcomes, as Flum points out, it is critical to distinguish the effect of the intervention from the effect of the patient’s expectation of the intervention. The only way to distinguish the effect of a patient’s positive expectations of an operation from the intervention itself is to blind patients to the treatment they receive and randomize them to receive the intervention of interest or to receive a sham intervention (placebo). Yet we frequently hear, “But blinding is not possible in surgical studies.” Frequently the argument is raised that subjecting people to anesthesia and sham surgery is not ethical. However, conducting clinical trials employing methods that result in avoidable fatal flaws is also problematic. Flum’s position is that when the risk of a placebo does not exceed a threshold of acceptable research risk and if the knowledge to be gained is substantial, a sham-controlled trial is needed and is ethical. He reasons that ethical justification of placebo-controlled trials is based on the following considerations:

  • Invasive procedures are associated with risks.
  • There are great harms created by conducting studies that are of uncertain validity.
  • Establishing community standards based on uncertain evidence is more likely to result in more harm than good.
  • Sham-controlled trials are justified when uncertainty exists among clinicians and patients about the merits of an intervention.

The SPORT trial draws attention to the problem of non-blinding in surgical trials. This was a very expensive, labor-intensive study that provides no useful efficacy data. Research subjects were undoubtedly told this study would provide answers regarding the relative efficacy of surgery vs conservative care for lumbar spine disease. The authors of the SPORT trial state that a sham-controlled trial was impractical and unethical, possibly, according to Flum, because the risk of the sham would include general anesthesia (to truly blind the patients). He would argue that in this case blinding, which would require anesthesia, is the only way that valid, useful evidence could have been created. Even though we graded the study U (uncertain validity and usefulness) and would not use the results to inform decisions about efficacy or effectiveness because of the threats to validity, the study does report information regarding risks of surgery that may be of great value to patients.

-----------

1 Jüni P, Altman DG, Egger M. Systematic reviews in health care: Assessing the quality of controlled clinical trials. BMJ. 2001;323:42-46. PMID: 11440947

2 Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA. 1995;273:408-12. PMID: 7823387

3 Weinstein JN, Tosteson TD, Lurie JD, et al. Surgical vs nonoperative treatment for lumbar disk herniation: the Spine Patient Outcomes Research Trial (SPORT): a randomized trial. JAMA. 2006;296:2441-2450. PMID: 17119141

4 Flum DR. Interpreting Surgical Trials With Subjective Outcomes: Avoiding UnSPORTsmanlike Conduct. JAMA. 2006;296:2483-2485. PMID: 17119146

Return to Top.

Blinding and RCTs

A recent article, Boutron I, Estellat C, Guittet L, Dechartres A, Sackett DL, et al. (2006) Methods of blinding in reports of randomized controlled trials assessing pharmacologic treatments: A systematic review. PLoS Med 3(10): e425. DOI: 10.1371/journal.pmed.0030425, provides a great deal of useful information about blinding in research studies and a way of classifying it. The authors evaluated blinding in RCTs of pharmacologic treatment published in 2004 in high impact-factor journals. The following are some key points from the article:

• The authors identified 819 reports with about 60% describing the method of blinding. The classification identified three main methods of blinding:
(1) methods to provide identical treatments in both arms,
(2) methods to avoid unblinding during the trial, and
(3) methods of blinded outcome assessment.


• ESTABLISHING BLINDING OF PATIENTS AND PROVIDERS: 472 [58%] described the method of blinding, but 236 [29%] gave no detail and 111 [13%] gave only some data on blinding (i.e., reporting that treatments were similar or the use of double dummies with no description of the method). The methods of blinding identified varied in complexity. The authors reported use of a centralized preparation of similar capsules, tablets, or embedded treatments in hard gelatin capsules (193/336 [57%]), similar syringes (37/336 [11%]), or similar bottles (38/336 [11%]). Use of a double dummy procedure was described in 79 articles (23%). Other methods consisted of a sham intervention performed by an unblinded health care provider who was not actively involved in the care of patients and had no other contact with patients or other caregivers and outcome assessors (17/336 [5%]). To mask the specific taste of the active treatments, in ten articles researchers used a specific flavor such as peppermint or sugar to coat treatments. For treatments administered by care providers, authors reported use of a centralized preparation of opaque coverage to adequately conceal intravenous treatments with different appearances (14/336 [4%]).

• AVOIDING UNBLINDING OF PATIENTS AND PROVIDERS: Only 28/819 [3%] reported methods to avoid unblinding. Methods to blind dosage adaptation relied on use of a centralized adapted dosage or provision of sham results of complementary investigations for treatments necessitating dosage adaptation. Methods to avoid unblinding because of side effects relied mainly on centralized assessment of side effects, partial information to patients about side effects, use of active placebo, or systematic prevention of adverse effects in both arms.

• BLINDING ASSESSORS: These methods depend on the main outcomes and are particularly important when blinding cannot be established and maintained by the methods described above. A total of 112 articles [14%] described these methods, which relied mainly on a centralized assessment of the main outcome. Blinding of outcome assessors is presumably achieved if neither patients nor those involved in the trial have any means to discover which arm a patient is in, for example because the placebo and active drugs are indistinguishable and allocation is via a central randomization service. Of the 112 reports that described specific measures to blind the outcome assessor, 96 (86%) concern trials in which patients were reported as blinded or in which double blinding or triple blinding was reported. These results suggest that, although blinding was performed at an earlier stage, the investigators nevertheless decided to use a specific method of blinding the outcome assessor.

• AUTHORS' COMMENTS AND CONCLUSIONS:
• Although blinding is essential to avoid bias, the reporting of blinding is generally quite poor and reviews of trials that test the success of blinding methods indicate that a high proportion of trials are unblinded.

• The study results might be explained in part by the insufficient coverage of blinding in the Consolidated Standards for Reporting Trials (CONSORT) statements. For example, three items of the CONSORT statements are dedicated to the description of the randomization procedure, whereas only one item is dedicated to the blinding issue. The CONSORT statements mainly focus on reporting who is blinded and less on the reporting of details on the method of blinding, and this information is essential to appraise the success of blinding.

• Some evidence suggests that although participants are reported as blinded, the success of blinding might be questionable. For instance, in a study assessing zinc treatment for the common cold, the blinding procedure failed, because the taste and aftertaste of zinc was distinctive. And yet, tools used to assess the quality of trials included in meta-analyses and systematic reviews focus on the reporting of the blinding status for each participant and rarely provide information on the methods of blinding and the adequacy of the blinding method.

• There is a need to strengthen the reporting guidelines related to blinding issues, emphasizing adequate reporting of the method of blinding.

Delfini Commentary
Lack of blinding appears to be a major source of bias in RCTs. Just as well-done randomization and concealment of allocation to the study groups decreases the likelihood of selection bias, blinding of subjects and everyone working with the subjects or study data to the assigned intervention (double-blinding) decreases the likelihood of performance bias. Performance bias occurs when patients in one group experience care or exposures not experienced by patients in the other group(s) and the differences in care affect the study outcomes. Lack of blinding may affect outcomes in that:

  • Unblinded subjects may report outcomes differently from blinded subjects, have different thresholds for leaving a study, or seek (and possibly receive) additional care in different ways.
  • Unblinded clinicians may behave differently towards patients than blinded clinicians.
  • Using unblinded assessors may result in systematic differences in outcomes assessment (assessment bias).

A number of studies have shown that lack of blinding is associated with inflated treatment effects.

In some cases blinding may not be possible. For example, side effects or taste may result in unblinding. The important point is that even if blinding is not possible, the investigators do not get “extra” validity points for doing the best they could (i.e., the study should not be “upgraded”).

Return to Top.
More on The Problem with Drawing Cause-Effect Conclusions from Observational Studies

Our last teaching engagement was in Framingham, Massachusetts and reminds us of the value of observational studies to assist us in developing risk stratification models. The Framingham Study began in 1948 as the first prospective study of cardiovascular disease and is important because through observations it has identified cardiovascular disease (CVD) risk factors which can be associated with morbidity and mortality.

But there is good evidence that drawing cause and effect conclusions from observational studies is unreliable. Cause and effect conclusions should be based on randomized controlled trials (RCTs) where bias, confounding and chance have been ruled out as possible explanations for the observed association between the intervention and the outcome. Because so many observational studies are published each week and because we keep seeing health professionals inappropriately basing treatment decisions on them, it is worthwhile summarizing an excellent review of the literature on this topic.

The study and literature review can be found in the reference:

Deeks JJ, Dinnes J, D'Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman DG. Evaluating non-randomised intervention studies. Health Technology Assessment 2003; Vol. 7: No. 27.

Some key points from this article:

  • Comparisons of the results of randomized and non-randomized studies across multiple interventions in multiple studies demonstrate that, in the majority of cases, the results of observational studies are not consistent with the results of RCTs.
  • This study, using meta-epidemiological techniques, demonstrates that —
    • None of the study results can be adequately adjusted for bias in observational studies using historic and concurrent controls
    • Logistic regression on average increases bias when applied to observational studies

Conclusions

  • Non-randomized studies may still give seriously misleading results even when those treated and control groups appear similar in key prognostic factors
  • Standard methods of case-mix adjustment do not guarantee removal of bias
  • Omission of important confounding factors can explain failure of adjustment as a substitute for randomization
  • There is no known method for reliably adjusting for confounding factors in observational studies

Delfini Commentary
Extreme caution is urged when considering results of observational studies in interventions for screening, prevention and therapy. Cause and effect conclusions should only be drawn from RCTs.

One reason for this is that there may be major differences in the characteristics (prognostic factors) of individuals who choose a therapy compared to people who do not choose that therapy. A classic example is hormone replacement therapy after myocardial infarction (MI) in women. Most observational studies reported that roughly twice as many women who did not choose to take hormone replacement therapy (HRT) had a recurrent MI compared to women who chose to take HRT. This led people to believe — incorrectly — that HRT caused this benefit. Later, well-done RCTs were conducted and no such benefit was found. Why? The most likely reason is that the observational studies were highly prone to biases resulting from differences between the groups which could not be eliminated even with statistical adjustments in which researchers try to balance confounders between the groups, such as adjusting for smoking.

Another reason is that in observations, investigators do not “control” all elements of the study as they do in RCTs. The end result is that in observational studies other aspects affecting the groups are almost certain to be different in important ways which are likely to explain or affect the study results.

Key Point
Any difference between groups, except for what is being studied (e.g., HRT use), is a bias.

In the case of HRT after MI, selection bias was present in that women who chose to take HRT were probably more likely to be “health-conscious,” exercise, watch their diets, etc., making them different from the women who did not take HRT. It is also likely that there were other differences in how the two groups experienced their health care, because in observational studies there is no formal protocol. There will therefore be many differences that could have affected observed outcomes, such as other therapies used, how outcomes are assessed, frequency of follow-up, and so on.

Even with statistical adjustments for differences between potential and known prognostic characteristics of the groups, bias cannot be reliably eliminated because whatever is actually responsible for the outcome (i.e., the confounder) is what would have to be adjusted. This would entail having advance knowledge of cause and effect (but that is why the study is being conducted). Plus statistical adjustment has limitations. How could every single factor that made the HRT users different be adjusted? Humans embody an infinite number of variables such as characteristics and exposures.

Comparisons of RCTs and observational studies of the same interventions have repeatedly demonstrated that even with the most meticulous statistical adjustments, bias cannot be reliably eliminated from observational studies. The key message is that without randomization and assurance that interventions and assessments are the same for both study and comparison groups, one cannot reliably draw conclusions about cause and effect relationships. Associations between interventions and outcomes in observational studies are very likely to be due to bias or confounding. Therefore, observational studies are only useful for hypothesis-generating when considering questions of preventive, screening or therapeutic interventions.
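To make the HRT example concrete, here is a minimal sketch in Python using entirely hypothetical numbers. The shares and risks below are assumptions chosen for illustration, not data from any study, and the function and parameter names are ours. The sketch assumes that HRT has no effect at all on recurrent MI, but that health-conscious women are both more likely to choose HRT and less likely to have a recurrent MI.

```python
def crude_and_stratified_risk_ratios(
    share_health_conscious=0.5,  # hypothetical share of health-conscious women
    p_hrt_if_conscious=0.7,      # hypothetical chance they choose HRT
    p_hrt_if_not=0.3,            # hypothetical chance the others choose HRT
    risk_if_conscious=0.05,      # hypothetical recurrent-MI risk, HRT or not
    risk_if_not=0.15,            # hypothetical recurrent-MI risk, HRT or not
):
    """Compare HRT users with non-users when HRT truly does nothing."""
    # Proportion of the whole population in each (group, HRT choice) cell.
    users_c = share_health_conscious * p_hrt_if_conscious
    users_n = (1 - share_health_conscious) * p_hrt_if_not
    nonusers_c = share_health_conscious * (1 - p_hrt_if_conscious)
    nonusers_n = (1 - share_health_conscious) * (1 - p_hrt_if_not)

    # Crude risks ignore health-consciousness, as a naive comparison would.
    risk_users = (users_c * risk_if_conscious + users_n * risk_if_not) / (
        users_c + users_n
    )
    risk_nonusers = (
        nonusers_c * risk_if_conscious + nonusers_n * risk_if_not
    ) / (nonusers_c + nonusers_n)

    return {
        "crude_rr": risk_users / risk_nonusers,
        # Within each group, users and non-users share the same risk.
        "rr_health_conscious": risk_if_conscious / risk_if_conscious,
        "rr_not_health_conscious": risk_if_not / risk_if_not,
    }


for name, value in crude_and_stratified_risk_ratios().items():
    print(f"{name}: {value:.2f}")
```

Under these assumed numbers the crude comparison makes HRT look protective (a risk ratio of about 0.67) even though the risk ratio within each group is exactly 1. Stratifying removes the bias here only because the sketch pretends that health-consciousness is known and measured; in a real observational study the factor responsible is usually neither.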

Database Studies
Some groups have tried to demonstrate improved health outcomes (e.g., death, stroke, etc.) through studies of their databases. It should be remembered that this type of study is an observational study and prone to bias and confounding for the reasons explained above, plus it is highly prone to chance findings of statistical significance. Therefore, database studies may be useful for suggesting areas for further study, but they should not be thought of as valid studies from which cause and effect relationships can be concluded.

Return to Top.

Untrustable P-values & Abstracts

One of the first things we teach our EBM learners is that although abstracts can be useful to get a sense of what an article is about and can at times be used to exclude studies from further review, abstracts cannot reliably be used to determine if a study is valid.

Validity must be determined by examining the methods of the study (assuming it is the right study type). A little-known problem with abstracts is that the information provided in the abstract cannot be documented in the body of the paper up to 68% of the time in some of the top-tier medical journals [Pitkin R, et al. Accuracy of Data in Abstracts of Published Research Articles. JAMA. 1999;281:1110-1111. PMID: 10188662; reviewing JAMA, NEJM, The Lancet, Annals of Internal Medicine, BMJ and the Canadian Medical Association Journal]. In this DelfiniClick we report another problem with abstracts: the problem of bias.

Peter C Gøtzsche in a BMJ article (Believability of relative risks and odds ratios in abstracts: cross sectional study. BMJ 2006;333;231-234; PMID: 16854948) reviews previous publications reporting biased results-reporting and biased reporting of conclusions, and he presents additional evidence of bias in reporting P values.

We do not have the expertise to evaluate all the points made in his paper; however, we present his comments and findings here for you to evaluate and draw your own conclusions. Although we believe the assumptions upon which Gøtzsche bases his conclusions can be challenged, the following should be of interest to anyone interested in critical appraisal of the medical literature.

Gøtzsche’s Comments

  • Significant results in abstracts should generally be disbelieved
  • Ongoing research has shown that more than 200 statistical tests are sometimes specified in trial protocols. If you compare a treatment with itself—that is, the null hypothesis of no difference is known to be true—the chance that one or more of 200 tests will be statistically significant at the 5% level is 99.996% if we assume the tests are independent
  • Thus, the investigators or sponsor can be fairly confident that “something interesting will turn up.”
  • Due allowance for multiple testing is rarely made, and it is generally not possible to discern reliably between primary and secondary outcomes
  • Recent studies that compared protocols with trial reports have shown selective publication of outcomes, depending on the obtained P values, and that at least one primary outcome was changed, introduced, or omitted in 62% of the trials.
  • The scope for bias is also large in observational studies. Many studies are underpowered and do not give any power calculations.
  • Furthermore, a survey found that 92% of articles adjusted for confounders and reported a median of seven confounders but most did not specify whether they were pre-declared.
  • Fourteen per cent of these articles reported more than 100 effect estimates, and subgroup analyses appeared in 57% of studies and were generally believed.
  • The preponderance of significant results could be reduced if the following actions were taken.
    • First, if we need a conventional significance level at all, which is doubtful, it should be set at P < 0.001
    • Second, analysis of data and writing of manuscripts should be done blind, hiding the nature of the interventions, exposures, or disease status, as applicable, until all authors have approved the two versions of the text
    • Third, journal editors should scrutinize abstracts more closely and demand that research protocols and raw data—both for randomized trials and for observational studies—be submitted with the manuscript.

In short, yet another reminder to read the methods section of papers and not rely on results or conclusions presented in abstracts.
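The 99.996% figure in Gøtzsche's comments follows from a simple calculation: if every null hypothesis is true and the tests are independent, the chance of at least one "significant" result among n tests at level alpha is 1 - (1 - alpha)^n. A minimal sketch in Python (the function name is ours, and independence of the tests is an assumption, as Gøtzsche notes):

```python
def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n_tests independent tests is
    'significant' at level alpha when no real difference exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Compare the conventional 5% level with the stricter P < 0.001 level.
for alpha in (0.05, 0.001):
    for n_tests in (1, 10, 200):
        rate = familywise_error_rate(n_tests, alpha)
        print(f"alpha={alpha}, tests={n_tests}: {rate:.3%}")
```

With 200 tests at the 5% level this reproduces the 99.996% quoted above. At the P < 0.001 level he proposes, the same 200 tests would still yield at least one chance finding roughly 18% of the time, which is why allowance for multiple testing matters even with a stricter threshold.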

Gøtzsche’s Findings in Brief

  • The first result in the abstract was statistically significant in 70% of the trials, 84% of cohort studies and 84% of case-control studies. Although many of these results were derived from subgroup or secondary analyses, or biased selection of results, they were presented without reservations in 98% of the trials