East: Example 3
Beta-Blocker Heart Attack Study
In the late 1970s, the National Heart, Lung and Blood Institute sponsored the Beta-Blocker Heart Attack Trial (BHAT). This was a double-blind, randomized, multicenter trial to test the drug propranolol, a beta-blocker, against a placebo among patients who had previously experienced acute myocardial infarction and were receiving optimal therapy for their condition. Based on earlier studies, the investigators believed that propranolol would increase median survival from 10.84 years with placebo to 14.06 years.
- The Traditional Single-Look Design
- Group-Sequential Designs
- Interim Monitoring of BHAT
- Early Termination of BHAT
1. The Traditional Single-Look Design
The investigators wanted their new study to have 90% power to detect this improvement in survival on the basis of a two-sided significance test conducted at the alpha = 0.05 level. Using the traditional single-look design for survival studies, they determined that in order to achieve 90% power they would have to follow all patients accrued to the study until 629 deaths were observed and then perform a logrank significance test on the data. If the logrank statistic exceeded 1.96 they would conclude that propranolol prolonged survival relative to placebo. How long would such a study take? Based on past experience they expected to recruit 1,700 patients per year onto the study. The question then was, how long should the patient accrual continue? The larger the number of patients enrolled, the sooner the 629 deaths would be observed and the sooner the study could be terminated. On the other hand the cost of the study would increase directly in proportion to the number of patients enrolled. The essence of the trade-off between patient accrual and total study duration is available from the following graph provided by East.
For example, if exactly 1,700 patients were enrolled, the accrual period would last for one year and there would be an additional 8.5 years of patient follow-up until 629 deaths were observed. This would imply a total study duration of 9.5 years. However, the investigators did not want to extend the study beyond 4.5 years. Eventually they settled on 4020 patients, thereby committing to a study whose expected duration was 4.21 years.
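As a rough cross-check on the number of deaths quoted above, the sketch below applies the widely used Schoenfeld approximation for the logrank test. It is not East's own calculation: the function name `schoenfeld_deaths` is hypothetical, and the code assumes exponential survival and equal allocation to the two arms, neither of which is stated in the text. Under these assumptions the result comes out slightly below the 629 deaths reported by East, which presumably uses a somewhat different formula.

```python
import math
from statistics import NormalDist


def schoenfeld_deaths(median_control, median_treatment, alpha=0.05, power=0.90):
    """Approximate deaths needed for a two-sided logrank test (1:1 allocation assumed)."""
    # Assumption: exponential survival, so hazard = ln(2) / median and the
    # hazard ratio (treatment / control) equals median_control / median_treatment.
    log_hr = math.log(median_control / median_treatment)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power
    return 4 * (z_alpha + z_beta) ** 2 / log_hr ** 2


deaths = schoenfeld_deaths(10.84, 14.06)
print(f"approximate deaths required: {deaths:.0f}")
```

The required number of deaths depends only on the hazard ratio, the significance level and the power. The accrual rate and follow-up time determine how long it takes to observe that many deaths, which is the trade-off described above.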
2. Group-Sequential Designs
In a traditional single-look clinical trial the patients would have been followed for survival until the 629 deaths were obtained and the logrank test would then be performed to determine if propranolol prolonged survival relative to placebo. The criterion for success would be whether or not the standardized logrank test statistic exceeded 1.96. We have just seen that such a study would be expected to last 4.21 years. The investigators of the BHAT study felt, however, that they could do better with a group-sequential design. In a group-sequential clinical trial the survival data are monitored at several interim time-points by an independent data and safety monitoring board (DSMB), and the trial is terminated early if the superiority of the new treatment can be established statistically at any one of the interim looks. Since the data will be tested repeatedly in a group-sequential study, the burden of proof must be more stringent at each of the interim looks than would be the case if there were no interim monitoring at all. Otherwise there is the danger that chance fluctuations in the data will be misinterpreted as demonstrating a real underlying effect. This increased stringency is accomplished by establishing a stopping boundary at each interim look. The DSMB for the BHAT study decided to take up to seven looks at the data, six interim and one final. If the logrank statistic crossed the stopping boundary at any of the interim looks the study would be terminated early. Two types of stopping boundaries were contemplated: the O'Brien-Fleming boundaries and the Pocock boundaries.
2.1 The O'Brien-Fleming Stopping Boundaries
The chart below, produced by East, displays the O'Brien-Fleming stopping boundaries for the BHAT study.
These boundaries assess the necessary penalty at each interim look to compensate for the fact that the data might be tested for statistical significance up to seven times after equal increments of information (deaths). The y-axis denotes the standardized value of the test statistic used to make decisions, and the x-axis denotes number of deaths. Notice that the x-axis extends to 649 deaths rather than the 629 deaths required for the traditional one-look design. If the study continued to the end without crossing an intermediate boundary, it would actually last longer than the one-look study since we must commit up-front to observing 20 more deaths. Notice too that these boundaries are shaped like a funnel with the large end at the left and the small end at the right. At the left end of the funnel where the first look at the data takes place after 93 deaths, the standardized value of the test statistic required to stop the study and declare that propranolol prolongs survival significantly better than placebo is 5.45.
Conversely if the test statistic is less than -5.45, we would declare that the placebo is significantly better than the propranolol. If the test statistic lies between -5.45 and 5.45, the study continues. This reflects the severe penalty to be paid at the first of the seven potential looks at the data. The traditional single-look design requires the test statistic only to exceed 1.96 in order to declare statistical significance. As you proceed to the right with further looks, the funnel narrows and a lower absolute value of the test statistic is required to exit. Nevertheless there continues to be a penalty associated with the multiple looks right up to the end. At the very last look the upper and lower stopping boundaries are 2.06 and -2.06, respectively, as compared to 1.96 and -1.96 for the traditional one-look design.
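The funnel shape described above can be reproduced with a few lines of arithmetic. The sketch below uses the classical O'Brien-Fleming form, in which the boundary at look k of K equally spaced looks is the final boundary multiplied by sqrt(K / k). The final value 2.06 and the 649 maximum deaths are taken from the text; the function name `obf_boundaries` is hypothetical, and this is an illustration rather than East's own computation.

```python
import math

K = 7               # maximum number of looks (from the text)
FINAL_BOUND = 2.06  # boundary at the last look (from the text)
MAX_DEATHS = 649    # maximum number of deaths for this design (from the text)


def obf_boundaries(k_looks, final_bound):
    """Classical O'Brien-Fleming shape: bound_k = final_bound * sqrt(K / k)."""
    return [final_bound * math.sqrt(k_looks / k) for k in range(1, k_looks + 1)]


bounds = obf_boundaries(K, FINAL_BOUND)
for k, b in enumerate(bounds, start=1):
    planned_deaths = round(MAX_DEATHS * k / K)  # equal increments of deaths
    print(f"look {k}: about {planned_deaths:3d} deaths, stop if |Z| >= {b:.2f}")
```

The first look comes out at about 93 deaths with a boundary of roughly 5.45, and the last at 649 deaths with 2.06, in line with the values quoted above.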
2.2 The Pocock Stopping Boundaries
An alternative pair of upper and lower stopping boundaries, proposed by Pocock, are displayed below.
These boundaries are flat, with the same value, 2.49, required to declare statistical significance at each of the seven looks. Since 2.49 is larger than 1.96, the corresponding single-look criterion, there is once again a penalty paid to compensate for the multiple looks at the data. The penalty is, however, constant for each look, and substantially smaller than the penalty exacted by the O'Brien-Fleming boundaries at the early looks. Thus it is easier to cross the Pocock boundaries at the early stages of the clinical trial than it is to cross the O'Brien-Fleming boundaries. Notice, however, that the Pocock boundaries extend out to 779 deaths: 150 more than are needed for the single-look design and 130 more than are needed for the O'Brien-Fleming design. Thus one must make a much larger up-front commitment if one wishes to use these boundaries for early stopping.
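To see why a constant of 2.49 rather than 1.96 is needed, one can estimate by simulation how often the test statistic crosses a flat boundary at any of seven equally spaced looks when there is truly no treatment effect. The sketch below is a Monte Carlo illustration under a normal approximation to the logrank statistic, not East's method; the function name `crossing_probability`, the number of simulations and the seed are arbitrary choices.

```python
import math
import random


def crossing_probability(bound, looks=7, n_sims=20000, seed=1):
    """Estimate P(|Z_k| >= bound at some look) under the null hypothesis."""
    rng = random.Random(seed)
    crossed = 0
    for _ in range(n_sims):
        score = 0.0
        for k in range(1, looks + 1):
            score += rng.gauss(0.0, 1.0)  # independent, equal information increments
            z = score / math.sqrt(k)      # standardized statistic at look k
            if abs(z) >= bound:
                crossed += 1
                break                     # the trial stops at the first crossing
    return crossed / n_sims


naive = crossing_probability(1.96)
pocock = crossing_probability(2.49)
print(f"unadjusted 1.96 at every look: {naive:.3f}")
print(f"Pocock 2.49 at every look:     {pocock:.3f}")
```

Testing at 1.96 on every look should inflate the overall false-positive rate well above 0.05, whereas the flat value 2.49 should bring it back to approximately the nominal level.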
2.3 Comparing the Three Designs in East
The interactive form-fill-in East user interface makes it very easy to compare the properties of the single look design and the two seven-look designs. One simply enters the design parameters into a worksheet and then examines the maximum and expected study durations for the three types of designs. The design worksheet is displayed at the top of the next column.
With the 1-look design the study will last 4.21 years under the alternative hypothesis that propranolol prolongs the median survival from 10.84 years to 14.06 years, and 3.86 years under the null hypothesis of no survival benefit. Since the 7-look O'Brien-Fleming design permits early stopping, the expected study duration is reduced to 3.36 years, a saving of more than 10 months if the alternative hypothesis is true. The up-front commitment for this expected benefit is that the maximum study duration might be 4.31 years if a stopping boundary is not crossed earlier. The 7-look Pocock design is expected to last only 3.17 years under the alternative hypothesis, a saving of more than one year relative to the 1-look design. But again, there is a large up-front commitment. This time the maximum study duration could be as high as 5.01 years. The actual probability of going all the way out to seven looks and thus prolonging the study to the end is not negligible. The exit probabilities at each of the seven looks are displayed below for the Pocock stopping boundaries under the alternative hypothesis. Notice that the probability of crossing for the first time at look seven is 0.156.

By displaying various design options side by side on the design worksheet in this manner, one can quickly evaluate the strengths and weaknesses of each one and select the design which is best suited to the needs of the investigators. In the present case the investigators decided to adopt the 7-look O'Brien-Fleming design.
2.4 Simulating the Selected Design
Any study designed by East can be simulated under any choice of treatment differences. For clinical trials with survival endpoints it is convenient to express the treatment difference as the negative of the log hazard ratio of treatment to control. Thus, in the present case, the magnitude of the treatment difference under the alternative hypothesis is -log(10.84/14.06) = 0.26. If 0.26 is indeed the true treatment difference the study will have 90% power to reject the null hypothesis that treatment difference is zero. But what if we have over-estimated the true treatment difference and in fact the negative of the log hazard ratio is only 0.2? East can simulate the 7-look O'Brien-Fleming design under this assumption. The results of 1000 such simulations are displayed below.

These simulations show that the null hypothesis of no treatment difference is rejected 710 times in 1000 simulations. Thus the power of the study would be only 71% rather than 90% if the negative of the log hazard ratio were 0.2 instead of 0.26.
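The loss of power can be approximated outside East with a simple simulation. The sketch below is not East's simulation engine: it assumes a normal approximation in which the logrank score statistic accumulates information of about one quarter of the number of deaths (equal allocation assumed), and it uses the classical O'Brien-Fleming boundaries scaled from the final value 2.06 quoted earlier. The function name `simulated_power`, the number of simulations and the seed are arbitrary.

```python
import math
import random

K, FINAL_BOUND, MAX_DEATHS = 7, 2.06, 649  # design values quoted in the text


def simulated_power(delta, n_sims=20000, seed=1):
    """Fraction of simulated trials that cross an O'Brien-Fleming boundary."""
    rng = random.Random(seed)
    info_step = MAX_DEATHS / K / 4  # assumed information per look: deaths / 4
    rejections = 0
    for _ in range(n_sims):
        score = 0.0
        for k in range(1, K + 1):
            # Score increment: mean = delta * information, variance = information.
            score += rng.gauss(delta * info_step, math.sqrt(info_step))
            z = score / math.sqrt(k * info_step)
            if abs(z) >= FINAL_BOUND * math.sqrt(K / k):
                rejections += 1
                break  # the trial stops at the first crossing
    return rejections / n_sims


power_design = simulated_power(0.26)  # treatment difference assumed at design
power_lower = simulated_power(0.20)   # smaller true treatment difference
print(f"delta = 0.26: estimated power {power_design:.2f}")
print(f"delta = 0.20: estimated power {power_lower:.2f}")
```

Under these assumptions the estimates should land near the 90% and 71% figures discussed above, although the exact values depend on the approximation and the random seed.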
3. Interim Monitoring of BHAT
The DSMB met on six occasions after activating the study. The summary results for these six interim monitoring time-points are tabulated below.
| Look Number | Monitoring Date | Months Since Start | Cumulative Deaths | Logrank Statistic |
|---|---|---|---|---|
| 1 | May, 1979 | 11 | 56 | 1.68 |
| 2 | Oct, 1979 | 16 | 77 | 2.24 |
| 3 | March, 1980 | 21 | 126 | 2.37 |
| 4 | Oct, 1980 | 28 | 177 | 2.30 |
| 5 | April, 1981 | 34 | 247 | 2.34 |
| 6 | Oct, 1981 | 40 | 318 | 2.82 |
We may enter these statistics into the interim monitoring worksheet in East as shown below.
The value of the test statistic and the corresponding stopping boundary have been tabulated for each look. Notice that the interim monitoring occurred roughly once every six months. But the number of deaths observed at each of the six interim monitoring time points was much smaller than what had been planned at the design stage. Thus the stopping boundaries had to be re-computed using the flexible Lan and DeMets alpha-spending function methodology. The original boundaries, the re-computed boundaries and the path traced out by the test statistic are displayed below.
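The gap between the planned and the observed numbers of deaths can be made concrete with a short calculation. The sketch below compares the cumulative deaths from the monitoring table with the equally spaced schedule implied by 649 maximum deaths over seven looks, and reports the information fraction at each look as observed deaths divided by 649. The variable names are illustrative only.

```python
MAX_DEATHS = 649  # maximum deaths under the 7-look O'Brien-Fleming design
K = 7             # planned number of equally spaced looks
observed_deaths = [56, 77, 126, 177, 247, 318]  # cumulative deaths from the table

# Information fraction at each look: observed deaths / maximum deaths.
info_fractions = [d / MAX_DEATHS for d in observed_deaths]

for look, (deaths, fraction) in enumerate(zip(observed_deaths, info_fractions), start=1):
    planned = MAX_DEATHS * look / K  # deaths expected at this look by design
    print(f"look {look}: planned about {planned:3.0f} deaths, "
          f"observed {deaths:3d}, information fraction {fraction:.2f}")
```

At the sixth look the design anticipated roughly 556 deaths, but only 318 had occurred, an information fraction of about 0.49. This is why boundaries computed for equally spaced looks no longer apply directly.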

Although the stopping boundary has not yet been crossed, the test statistic is very close to the upper stopping boundary. Only 318 deaths have been observed so far, amounting to 49% of the total information. In fact only 0.003 of the allowable alpha = 0.05 has been spent. This is depicted below in the alpha spending function graph corresponding to the O'Brien-Fleming stopping boundaries used in this study.
 
It is thus very likely that the upper stopping boundary will be crossed by the time all 649 deaths have been observed. This can be seen from the conditional power chart provided by East and displayed above.
The conditional power chart reveals that, under the alternative hypothesis that the negative log hazard ratio is 0.26, the study has 99% power to eventually cross the upper stopping boundary.
4. Early Termination of BHAT
Suppose that the value of the test statistic at look 6 had been 3.0 instead of 2.82. In that case the upper stopping boundary would have been crossed and the study would have terminated. East would then compute the p-value, confidence interval and median unbiased estimate of the negative log hazard ratio, adjusted for the six interim looks. The results are shown below.

The adjusted p-value is 0.001547, implying the difference between propranolol and placebo, measured in terms of the negative log hazard ratio, is significantly greater than zero. One can state with 90% confidence that the negative log hazard ratio is between 0.15 and 0.52, with a median unbiased estimate of 0.33.
Note: The actual BHAT trial did in fact close at the sixth interim look, even though the stopping boundary with the x-axis measuring the number of events (deaths) was not crossed. This is because the investigators used equally spaced numbers of deaths as their metric for computing the information fraction at each interim look. In 1981, tools for monitoring after unequally spaced numbers of deaths were not available to the investigators. Based on this approximation, the stopping boundary was crossed at the sixth look (see DeMets et al., Controlled Clinical Trials, 5, 362-72, 1984 for details).