
CLINICAL TRIALS FOR PERSONALIZED, MARKER-BASED TREATMENT STRATEGIES

Dissertation submitted for the doctoral degree

of the Faculty of Mathematics and Physics

of the Albert-Ludwigs-Universität Freiburg im Breisgau

submitted by

HONG SUN

February 2016


Dean: Prof. Dr. Dietmar Kröner

First referee: Prof. Dr. Martin Schumacher, Institute of Medical Biometry and Statistics, Medical Center – University of Freiburg, Stefan-Meier-Straße 26, 79104 Freiburg, Germany

Second referee: Prof. Dr. Werner Brannath, Department of Mathematics/Computer Science, University of Bremen, Linzerstraße 4, 28359 Bremen, Germany

Date of doctoral examination: 30 May 2016


Acknowledgements

About ten years ago, I took a flight and left China for the first time in my life, bound for Europe to study statistics. Now I am proud to celebrate my ten-year anniversary with Europe with this ‘booklet’! I wish to express my sincere gratitude to everyone who has contributed to this thesis, directly or indirectly!

First and foremost, I would like to thank my PhD promoter Prof. Dr. Martin Schumacher for his valuable input and guidance on this thesis and all his efforts for my PhD, as well as for the successful initiation and coordination of the Marie Curie Initial Training Network MEDIASRES. Great appreciation goes to the supervisor of my MEDIASRES project, Prof. Dr. Werner Vach, who gave me the chance to join the MEDIASRES network and work with him on this interesting project, and who acted as my research father: thank you for the great supervision, the enormous help and support, the large amount of time, effort and discussion devoted to my projects, and all the invaluable knowledge of and passion for clinical trials and personalized medicine. Many thanks to my co-supervisor Dr. Frank Bretz for his priceless contributions and suggestions on my projects, for the good arrangement of my secondment at Novartis, and for the nice time we worked together. I am also grateful to Prof. Dr. Werner Brannath for acting as referee and providing constructive comments, and to Susanne Crowe for correcting the English writing of this thesis.

Further, I would like to extend my acknowledgements to all who participated in and contributed to the MEDIASRES network: to the organizing team for the good organization and management of this network, to all the brilliant supervisors for the fruitful meetings and professional training, and especially to every fellow: Sung Won, Federico, Anna W, Markus, Corine, Alexia, Ketil, Soheila, Mia, Susanne, Matteo, Leyla and Anna B. It was a great honor and so much fun to get to know all of you and to spend three years apart but grow up together with you! Thank you very much to the staff of IMBI for the seminars and talks on various topics, the tasty coffee and cakes, the helpful advice and tips on living in Germany and the free German lessons, and to my colleagues in FDM for the swell working time and nice office atmosphere!

It would have been impossible for me to spend so many years away from home without the kind help, support, encouragement and comfort of the great international friends I met in Belgium, Switzerland and Germany: Amparo, Pryseley, Nana, Emanuela, Tanya, Carolina, Harison, Lulu, Shu-fang, Qiyu, Susanne, Xiao Huang, Songjie, Stefan, Ye Zhang, and many, many others. I cherish every precious moment we spent together, and you are always in my heart even though we live very far apart!

Last but most importantly, I would like to express my special thanks to my family: my parents Jianli Sun and Ping Hong, and my sister Wan Sun, for their endless and unconditional support, encouragement, comfort, confidence, patience, tolerance and love!

February 2016, Freiburg


Summary

The increasing progress in developing biological and molecular targeted agents, especially in oncology, promises the development of personalized medicine, in which the optimal treatment options are chosen based on characteristics of the patient and his/her disease. The main aims of this thesis are to translate the needs of personalized medicine into well-defined statistical problems, to develop and compare statistical approaches to solve these problems, and to develop recommendations for the statistical analysis of such clinical trials in the future. Randomized controlled trials are the gold standard for measuring an intervention's impact, and any new intervention selected on the basis of a single marker or multiple markers should be validated in a randomized controlled trial. Our motivation came from various innovative biomarker designs and statistical analysis strategies proposed recently in the literature. In this thesis, we investigate the topic of clinical trial design and statistical analysis strategies in randomized controlled trials using personalized, marker-based treatment strategies. We focus on several situations in which either a single marker or multiple markers are involved in the randomized controlled trial. We aim to build new methodologies and compare them with currently widely used methods for subgroup and interaction analyses and with multiple testing approaches for clinical trials using information from a single marker and/or multiple markers.

We first briefly summarize five well-known clinical trial designs for predictive biomarkers considered directly or indirectly in this thesis, namely the randomize-all design, the biomarker-by-treatment interaction design, the targeted or selection design, the biomarker-strategy design and the individual-profile design. For the simple situation in which only one single pre-specified marker is involved in a randomized controlled trial, we review several statistical analysis strategies proposed in the literature, categorized by the choice and sequence of the subgroups tested, i.e., the marker-positive subgroup, the marker-negative subgroup and/or the overall population. We discuss four different statistical approaches (the fixed-sequence, marker sequential test, fallback and treatment-by-biomarker interaction approaches) for the randomize-all design or the biomarker-by-treatment interaction design.

In particular, we consider treatment selection in confirmatory clinical trials, where we aim at establishing the treatment effect in the overall population or in a targeted subgroup, e.g., the marker-positive subgroup. Five existing multiple testing procedures, from the family of feedback procedures and procedures with weighting strategies based on the closed testing principle, are compared through simulation studies: two parametric procedures, the Song-Chi and the weighted parametric procedures, and three non-parametric procedures, the weighted Bonferroni test, the weighted Holm procedure and the fallback procedure. Rejection regions and powers are considered for all five procedures in different scenarios.
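As an illustration of how the non-parametric procedures operate, the following sketch (our own, not code from the thesis) implements the fallback and weighted Holm procedures for the two hypotheses of interest, the overall effect and the subgroup effect. The weight split `w_overall = 0.8` and `alpha = 0.05` are arbitrary example values.

```python
def fallback(p_overall, p_sub, w_overall=0.8, alpha=0.05):
    """Fallback procedure for two ordered hypotheses, overall first.

    The overall hypothesis is tested at level w_overall*alpha; if it is
    rejected, its level is carried forward and the subgroup hypothesis
    is tested at the full alpha, otherwise at (1 - w_overall)*alpha.
    Returns (reject_overall, reject_subgroup)."""
    reject_overall = p_overall <= w_overall * alpha
    a_sub = alpha if reject_overall else (1.0 - w_overall) * alpha
    return reject_overall, p_sub <= a_sub


def weighted_holm(p_overall, p_sub, w_overall=0.8, alpha=0.05):
    """Weighted Holm procedure for the same two hypotheses.

    Shortcut of the closed test: reject the hypothesis with the smaller
    weighted p-value if it passes its weighted Bonferroni threshold,
    then test the remaining hypothesis at the full level alpha."""
    w_sub = 1.0 - w_overall
    if p_overall / w_overall <= p_sub / w_sub:
        ordered, overall_first = [(p_overall, w_overall), (p_sub, w_sub)], True
    else:
        ordered, overall_first = [(p_sub, w_sub), (p_overall, w_overall)], False
    reject_first = ordered[0][0] <= ordered[0][1] * alpha
    reject_second = reject_first and ordered[1][0] <= alpha
    return (reject_first, reject_second) if overall_first else (reject_second, reject_first)
```

For example, with p-values 0.045 (overall) and 0.008 (subgroup), the fallback procedure rejects only the subgroup hypothesis, since the unspent level (1 − 0.8) · 0.05 = 0.01 is still available to it.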



The results show that the weighted parametric procedure attains the highest power among all procedures with a weighting strategy under the same setting of weights, since it takes into account the correlation between the two test statistics. Due to its consistency constraint, the Song-Chi procedure attains lower power than the weighted parametric procedure in general, sometimes even lower than the non-parametric procedures. It also performs poorly when the treatment effect also exists in the marker-negative subgroup, so one should be cautious when applying the Song-Chi procedure.

After the treatment effect has been established in the overall population without considering the marker information at the beginning of the planning phase, a post hoc subgroup analysis is sometimes initiated or required for certain reasons, e.g., regulatory requirements or health technology assessment, despite the lack of power in small subgroups. We propose a framework to assess and compute the long-term effect of different strategies for performing subgroup analysis in this special situation. We consider two performance measures: the average post-study treatment effect for patients in all studies (E) and the fraction of patients with a negative treatment effect in the positive studies (P). Nine existing decision rules, including performing the overall test without subgroup analysis, simply comparing the estimate with zero, and significance testing with different choices of significance level, are applied under different assumptions about subgroup-specific and individual treatment effects. Optimistic, moderate and pessimistic scenarios are assumed for the true treatment effect. We demonstrate that there are decision rules for subgroup analysis which decrease P and increase E simultaneously compared to the situation of no subgroup analysis. These rules are much more liberal than the usual significance testing, since with the latter there is a high risk of decreasing E.
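The two measures can be illustrated with a toy Monte Carlo experiment. The sketch below (our own simplification, not the thesis's framework) draws normally distributed subgroup effects, applies an overall test followed by one of three subgroup decision rules, and returns E and P as defined verbally above; all distributional choices, sample sizes and test levels are invented for illustration.

```python
import random
from statistics import NormalDist

def simulate_E_P(rule, n_studies=2000, K=4, n_per_sub=100,
                 mu=0.2, tau=0.3, alpha=0.10, seed=1):
    """Toy Monte Carlo for the two measures sketched above:
    E = average post-study treatment effect over patients in all studies,
    P = fraction of patients with a negative true effect, among the
        patients in positive studies recommended the new treatment.
    Subgroup effects are drawn from N(mu, tau^2); each subgroup estimate
    has standard error 1/sqrt(n_per_sub). All of this is illustrative."""
    rng, z = random.Random(seed), NormalDist()
    se = 1.0 / n_per_sub ** 0.5
    total_effect, n_total = 0.0, 0
    neg_patients, pos_patients = 0, 0
    for _ in range(n_studies):
        theta = [rng.gauss(mu, tau) for _ in range(K)]  # true subgroup effects
        est = [rng.gauss(t, se) for t in theta]         # observed estimates
        pooled = sum(est) / K
        study_positive = pooled / (se / K ** 0.5) > z.inv_cdf(0.975)
        n_total += K * n_per_sub
        if not study_positive:
            continue                                    # all patients stay on standard
        pos_patients += K * n_per_sub
        for t, e in zip(theta, est):
            if rule == "none":                          # no subgroup analysis
                recommend = True
            elif rule == "estimate":                    # recommend if estimate > 0
                recommend = e > 0
            else:                                       # subgroup significance test
                recommend = e / se > z.inv_cdf(1 - alpha)
            if recommend:
                total_effect += t * n_per_sub
                if t < 0:
                    neg_patients += n_per_sub
    E = total_effect / n_total
    P = neg_patients / pos_patients if pos_patients else 0.0
    return E, P
```

In this toy setting, even the liberal rule "recommend when the estimate exceeds zero" lowers P and raises E relative to no subgroup analysis, matching the qualitative message of the framework.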

When multiple markers are involved in RCTs, one new treatment may not be sufficient for treating all patients with different biological or genetic characteristics, and several treatments may be involved in the same study. A situation with multiple markers and multiple treatments is considered in the last part of this thesis, where there already exists a highly stratified treatment strategy that depends on a marker pattern and divides the whole population into small subgroups. We aim to demonstrate a treatment effect for a subset of the subpopulations, instead of for each single subpopulation. We present a framework to compare the new approach (testing all possible subsets formed by joining subgroups and selecting the subset with the minimal p-value) with simpler ones such as subgroup analysis, performing an overall test only, and combinations of both. In this framework we consider measures conceptually similar to those in the previously proposed framework: the impact, i.e., the expected average change in outcome when patients are treated as recommended, and the inferiority rate, i.e., the fraction of patients recommended to switch to a worse treatment, as well as an additional measure, the success rate, i.e., the probability of identifying one significant subset. In our simulation studies, we focus on a substantial variation of the subgroup-specific treatment effects, as we expect no benefit from the new strategy in some subpopulations.
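A minimal sketch of the exhaustive search over joined subgroups may clarify the "minimal p-value" approach. The inverse-variance pooling and the one-sided normal test below are our own illustrative assumptions, and the returned p-value is deliberately left unadjusted for the selection over subsets, which is precisely the multiplicity problem studied in this part of the thesis.

```python
from itertools import combinations
from math import sqrt, erf

def norm_sf(z):
    """One-sided p-value for a standard normal test statistic."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def best_subset(estimates, std_errors):
    """Search all non-empty subsets of subgroups, pool the subgroup
    effect estimates by inverse-variance weighting (an illustrative
    choice), and return the subset with the smallest one-sided p-value
    together with that (unadjusted) p-value."""
    best_s, best_p = None, 1.0
    groups = range(len(estimates))
    for r in range(1, len(estimates) + 1):
        for s in combinations(groups, r):
            w = [1.0 / std_errors[g] ** 2 for g in s]
            pooled = sum(wi * estimates[g] for wi, g in zip(w, s)) / sum(w)
            p = norm_sf(pooled * sqrt(sum(w)))  # z = pooled / SE, SE = 1/sqrt(sum w)
            if p < best_p:
                best_s, best_p = s, p
    return best_s, best_p
```

For instance, with estimates (0.5, 0.4, 0.0) and equal standard errors of 0.1, the search joins the first two subgroups and excludes the third, because adding the null subgroup dilutes the pooled effect more than it sharpens the standard error.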

We believe our work provides at least some insights into statistical issues of personalized treatment strategies in randomized clinical trials, and it may shed light on further research on this topic. The outputs of this thesis are relevant for all clinicians and statisticians involved in the planning and analysis of studies on personalized treatment strategies, in both industrial and academic settings.


Contents

Summary v

Table of Contents vii

List of Tables xi

List of Figures xiii

List of Abbreviations xvii

1 Introduction 1

1.1 Personalized medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Randomized clinical trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Subgroup analysis in RCTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Aims and structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 State of the art of clinical trial design and statistical analysis in personalized medicine 7

2.1 Biomarker definition and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Biomarker designs for randomized controlled trials . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Randomize-all or all-comer design . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Interaction or biomarker-stratified design . . . . . . . . . . . . . . . . . . . . . 12

2.2.3 Targeted or selection design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.4 Biomarker-strategy design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.5 Individual profile or marker-based and stratified design . . . . . . . . . . . . . . 15

2.3 Multiplicity issues in RCTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3.1 Definition and general concept . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3.2 Closure principle and closed testing procedures . . . . . . . . . . . . . . . . . . 20




2.3.3 Classifications of multiple testing procedures . . . . . . . . . . . . . . . . . . . 21

2.4 Statistical analysis strategies for randomize-all design . . . . . . . . . . . . . . . . . . . 22

2.4.1 Fixed-sequence approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4.2 Marker sequential test approach . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4.3 Fallback approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.4.4 Treatment-by-biomarker interaction approach . . . . . . . . . . . . . . . . . . . 26

3 Motivating case studies 29

3.1 CAPRIE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2 STarT Back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 FOCUS4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4 A comparison of multiple testing procedures for testing both the overall and one subgroup-specific effect in confirmatory clinical trials 33

4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.1 Notation and hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.2 Feedback procedures and extensions . . . . . . . . . . . . . . . . . . . . . . . . 34

4.2.3 Procedures with weighted FWER-controlling methods . . . . . . . . . . . . . . 37

4.2.4 Comparison of five procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5 A framework to assess the value of subgroup analyses when the overall treatment effect is significant 57

5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.2.2 Performance measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.2.3 Subgroup decision rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.2.4 Assumptions on subgroup specific treatment effects and individual treatment effects 62

5.2.5 Assumptions on distribution of true study effects . . . . . . . . . . . . . . . . . 63

5.2.6 Scenarios for outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73



6 Comparing a highly stratified treatment strategy with the standard treatment in randomized clinical trials 75

6.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.2.2 Analytic framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.2.3 Five approaches for subset selection . . . . . . . . . . . . . . . . . . . . . . . . 77

6.2.4 Quality and performance measures . . . . . . . . . . . . . . . . . . . . . . . . 79

6.3 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.4 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.4.1 Design of simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

7 Concluding remarks and further research 91

References 97

A Proofs and computation details of Chapter 6 109

A.1 Proof of FWER control for M4 and M5 . . . . . . . . . . . . . . . . . . . . . . . . . . 109

A.2 Proof of positivity of subgroup effect estimates . . . . . . . . . . . . . . . . . . . . . 110

A.3 Technical implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

B R code for Chapter 6 111

C Tables of Chapter 4 121


List of Tables

2.1 Type I and Type II errors in multiple hypotheses testing. . . . . . . . . . . . . . . . . . 18

2.2 Classification of multiple testing procedures that can be used in confirmatory clinical trials. . 21

4.1 The closed testing presentation of feedback procedures. . . . . . . . . . . . . . . . . . 35

4.2 The closed testing presentation of an example of feedback procedure. . . . . . . . . . . 36

4.3 The closed testing presentation of Song-Chi procedure. . . . . . . . . . . . . . . . . . . 37

4.4 The closed testing presentation of weighted Bonferroni test for two hypotheses. . . . . . 38

4.5 The closed testing presentation of fallback procedure for two hypotheses. . . . . . . . . 39

4.6 The closed testing presentation of weighted Holm procedure for two hypotheses. . . . . 40

4.7 The closed testing presentation of weighted parametric procedure for two hypotheses. . 40

4.8 Summary of the rejection rules of local test statistics of each stage for two hypotheses using five methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.9 Significance levels α1 (α2) of 5 procedures for equal group size with different choices of weights and α∗1 = 0.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.10 Significance levels of Song-Chi procedure for α∗1 = 0.1,1 or 10α1 with different k and wo. 45

5.1 Summary of the families of subgroup decision rules. . . . . . . . . . . . . . . . . . . . 60

5.2 Overview of the three scenarios for the distributions of the true study effect. . . . . . . 63

5.3 Selected results shown in Figure 5.3. % denotes the relative decrease compared to φN. . 67

5.4 Selected results shown in Figures 5.4 and 5.5. % denotes the relative decrease compared to φN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.1 Overview of the five subset selection approaches. . . . . . . . . . . . . . . . . . . . . . 78

6.2 The subsets S∗ selected by the five approaches in the example of a hypothetical study with results as shown in Figure 6.1. The subgroups included in S∗ are marked by a cross. . . 81




6.3 Success rate dependent on P and τ for equal group size and unequal group size. The results correspond to Figures 6.3 and 6.6. . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.4 Impact and inferiority rate depending on P and τ for equal group sizes. The results correspond to Figures 6.4 and 6.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.5 Impact and inferiority rate depending on P and τ for unequal group sizes. The results correspond to Figures 6.7 and 6.8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

C.1 Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 1. . . 122

C.2 Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 2. . . 123

C.3 Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 3. . . 124

C.4 Results of three powers for the Song-Chi procedure with k = 0.5 and α∗1 = 0.1,1,10α compared to the weighted parametric procedure. . . . . . . . . . . . . . . . . . . . . . 125


List of Figures

2.1 Distinction between quantitative interaction (left) and qualitative interaction (right). . . . 8

2.2 Examples of treatment effects for no biomarker (A), purely prognostic marker (B), purely predictive marker (C), and biomarker that is both predictive and prognostic (D). . . . . . 9

2.3 Examples of Kaplan-Meier curves for purely prognostic marker (A), purely predictive marker (B), and marker that is both predictive and prognostic (C). . . . . . . . . . . . . 10

2.4 Randomize-all design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 Interaction or biomarker-stratified design . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.6 Targeted or selection design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.7 Biomarker-strategy design with standard control . . . . . . . . . . . . . . . . . . . . . 14

2.8 Biomarker-strategy design with randomized control . . . . . . . . . . . . . . . . . . . . 15

2.9 Individual profile or marker-based and stratified design . . . . . . . . . . . . . . . . . . 16

2.10 Umbrella trial design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.11 Schematic diagram of the closure principle for H1 and H2 and their intersection H12 . . 21

2.12 Fixed-sequence approach with testing B− in the second stage (FS-1) . . . . . . . . . . . 23

2.13 Fixed-sequence approach with testing overall population in the second stage (FS-2) . . . 23

2.14 Marker sequential test approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.15 Fallback approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.16 Treatment-by-biomarker interaction approach . . . . . . . . . . . . . . . . . . . . . . . 26

3.1 RRR and 95% CI by disease subgroups in CAPRIE trial. . . . . . . . . . . . . . . . . . 30

3.2 Trial schema for STarT Back. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Trial schema for FOCUS4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1 Rejection region of the intersection hypothesis H12 for k = 0.5 and wo = 0.5,0.8. . . . . 42

4.2 Rejection region of the intersection hypothesis H12 for k = 0.25,0.75 and wo = 0.5,0.8. 44

4.3 Rejection region of the intersection hypothesis H12 for k = 0.5 and wo = 0.5,0.8. . . . . 45




4.4 Comparison of three powers using five procedures with k = 0.5 for scenario 1 . . . . . . 48

4.5 Scatter plots of power comparisons using five procedures with k = 0.5 for scenario 1 . . 48

4.6 Comparison of three powers using five procedures with k = 0.5 for scenario 2 . . . . . . 49

4.7 Comparison of three powers using five procedures with k = 0.5 for scenario 3 . . . . . . 50

4.8 Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 1 . . . . . . . . . . . . . . . . . . . . . 51

4.9 Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 2 . . . . . . . . . . . . . . . . . . . . . 51

4.10 Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 3 . . . . . . . . . . . . . . . . . . . . . 52

4.11 Comparison of three powers using five procedures with k = 0.25 for scenario 2 . . . . . 52

4.12 Comparison of three powers using five procedures with k = 0.75 for scenario 2 . . . . . 53

5.1 Decision rules applied to the CAPRIE trial . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 Histogram of all individual TEs θsgi (white) and those from studies declaring superiority in case of no subgroup analysis (blue) for three scenarios with τ = 0.5 and R2 = 0 . . . 65

5.3 Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.4 Plot of E vs. P for different numbers of subgroups (K, 3 plots in the upper part) and using different overall sample sizes (3 plots in the lower part) for the moderate scenario and τ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.5 Plot of E vs. P for different numbers of subgroups (K, 3 plots in the upper part) and using different overall sample sizes (3 plots in the lower part) for the moderate scenario and τ = 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.6 Additional scenarios in section 5.2.6 using the moderate scenario and τ = 0.5 for risk difference with πSsgi = 0.1 (1), risk difference with πSsgi = 0.2 (2), odds ratio (3b) and effect size (4b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.7 Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R2 using odds ratios (scenario 3a in section 5.2.6) . . . . . . . . . . 70

5.8 Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R2 using effect sizes (scenario 4a in section 5.2.6) . . . . . . . . . . 71

6.1 Subgroup-specific treatment effect estimates with point-wise 95% confidence intervals in a hypothetical study with 6 subgroups. . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.2 The values of θg (black dots) for g = 1, ..., 8 for two choices of P and τ. The blue arrows illustrate the half range τ of the non-zero effects. The red line indicates the average of the maximum and minimum values of the non-zero effects. . . . . . . . . . . . . . . . 82

6.3 Success rate depending on number of subgroups K for different choices of P and τ for equal subgroup sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83



6.4 Impact vs. inferiority rate for P = 0.25 and different choices of τ for equal subgroup sizes. . 84

6.5 Impact vs. inferiority rate for P = 0 and different choices of τ for equal subgroup sizes. . 85

6.6 Success rate depending on number of subgroups K for different choices of P and τ with unequal group sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.7 Impact vs. inferiority rate for P = 0.25 and different choices of τ for unequal subgroup sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.8 Impact vs. inferiority rate for P = 0 and different choices of τ for unequal subgroup sizes. . 88


List of Abbreviations and Symbols

4A: adaptive alpha allocation approach
AKT: v-akt murine thymoma viral oncogene homolog
B+: biomarker positive
B−: biomarker negative
BRAF: B-Raf proto-oncogene, serine/threonine kinase
CI: confidence interval
DNA: deoxyribonucleic acid
EGFR: epidermal growth factor receptor
EMA: European Medicines Agency
ERCC: excision repair cross-complementing
EXP/Exp: experimental treatment
FB: fallback procedure
FDA: Food and Drug Administration
FS: fixed-sequence
FWER: family-wise error rate
HER2: human epidermal growth factor receptor 2
HTA: health technology assessment
ICH: International Conference on Harmonisation
IQWiG: Institute for Quality and Efficiency in Health Care
ITT: intention-to-treat
IR: inferiority rate
IS: ischemic stroke
KRAS: Kirsten rat sarcoma viral oncogene homolog
MEK: mitogen-activated protein kinase/extracellular signal-regulated kinase kinase
MI: myocardial infarction
MaST: marker sequential test
MRCT: multi-regional clinical trials
MTP: multiple testing procedure
NCI: National Cancer Institute
NICE: National Institute for Health and Clinical Excellence
No.: number



NRAS: neuroblastoma RAS viral (v-ras) oncogene homolog
OR: odds ratio
OS: overall survival
PAD: peripheral arterial disease
PFS: progression-free survival
PH: proportional hazards
PIK3CA: phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha
R: randomization
RCT: randomized controlled trial
RMDQ: Roland and Morris Disability Questionnaire
RRR: relative risk reduction
SC: Song-Chi procedure
STD/Std: standard treatment
TE: treatment effect
TRT: treatment
US: United States
WB: weighted Bonferroni test
WH: weighted Holm procedure
WHO: World Health Organization
WP: weighted parametric procedure


Chapter 1

Introduction

1.1 Personalized medicine

The term “personalized medicine” dates back more than a century: one of the earliest applications, reported by Reuben Ottenberg in 1907, was the first known blood compatibility test for transfusion, which used blood typing and cross-matching between donors and patients to prevent hemolytic transfusion reactions [6]. Personalized medicine has been referred to by many terms in the literature, including “precision medicine”, “stratified medicine”, “targeted medicine” and “individualized medicine”. Nowadays, it is often described as providing “the right patient with the right drug at the right dose at the right time” [114].

There is still no formal definition of personalized medicine, although many institutions have given different definitions from a similar perspective. For instance, the Personalized Medicine Coalition defined it as ‘the use of new methods of molecular analysis to better manage a patient’s disease or predisposition to disease’, and the US National Cancer Institute as ‘a form of medicine that uses information about a person’s genes, proteins, and environment to prevent, diagnose, and treat disease’. We prefer to adopt the definition from the Academy of Medical Sciences: ‘it is a medical model that proposes the customization of health-care, with decisions and practices being tailored to the individual patient by use of genetic or other information; it is an evolving field of medicine in which treatments are tailored to the individual patient. More broadly, personalized medicine may be considered as the tailoring of medical treatment to the individual characteristics, needs, and preferences of a patient during all stages of care, including prevention, diagnosis, treatment, and follow-up’ [1]. The ultimate goal of personalized medicine is to make the optimal treatment decision based on all of the information about a patient, including not only genetic information but also age, gender, other clinical factors, current disease status, patient preference and other information.

The development of personalized medicine is a long process. It demonstrates the potential to enhance the efficacy of therapies that have already been proven in stratified subgroups, to identify heterogeneity in the patient population, to avoid unnecessary toxicity caused by treatments and to minimize the occurrence of adverse events. Methods for exploring very large quantities of data, integration with biological knowledge and innovative study designs are critical [93]. The current applications of personalized medicine reach far beyond the identification of the optimal drug and dosage for a subgroup of patients; they may include diagnostic testing, withholding treatment, preventive interventions, or targeted therapy options for individual patients. The development of effective personalized medicine comprises five key elements: 1. obtaining genetic or genomic information on individual patients [98, 103]; 2. identifying one or more biomarkers [8, 18]; 3. developing new or selecting available therapies [28, 65]; 4. measuring the relationship between biomarkers and clinical outcomes with retrospective exploratory analysis [19, 78, 79, 100]; 5. verifying the relationship in a prospective randomized clinical trial [20, 21, 23, 59]. In this thesis, we assume that the biomarkers and treatments are already available and we focus on finding the most suitable way to match the treatments to specific subpopulations of patients. We therefore mainly address the last two elements, i.e., investigating the relationship between biomarkers and clinical outcome, as well as validating our evaluation in clinical trials.

1.2 Randomized clinical trials

According to the WHO’s definition, “For the purposes of registration, a clinical trial is any research study that prospectively assigns human participants or groups of humans to one or more health-related interventions to evaluate the effects on health outcomes” [120]. If a study does not involve an intervention, it is by definition not considered a ‘clinical trial’; clinical trials are therefore also referred to as interventional trials by many institutions, where interventions include but are not restricted to drugs, cells and other biological products, surgical procedures, radiologic procedures, devices, behavioral treatments, process-of-care changes, preventive care, etc. [120].

In drug development, clinical trials are often described as consisting of four temporal phases, i.e., Phases I–IV. Phase I introduces an investigational new drug into humans, normally in a small group of 20–100 healthy volunteers in general settings, or 5 to 15 or more patients in oncology trials. This phase is designed to assess the safety (pharmacovigilance), tolerability, pharmacokinetics, and pharmacodynamics of a drug. Once a dose or range of doses has been determined in phase I, the next goal is to evaluate whether the new drug has any biological activity or effect in further phases [26]. Phase II is usually considered the initial stage for exploring therapeutic efficacy as the primary objective in patients. Phase II trials are performed on larger groups (100–300) and are designed to investigate whether the new drug is effective and to further evaluate its safety. They are often designed as single-arm trials or small-scale randomized trials, so phase II trials alone are not sufficient for approval of an intervention, except in certain special situations, e.g., in rare diseases. With the primary objective to demonstrate or confirm therapeutic benefit, phase III is a necessary and crucial stage in drug development. Phase III studies are designed to confirm the preliminary evidence accumulated in phase II that a drug is safe and effective to use in the intended population, and are intended to provide an adequate basis for marketing approval. Known as confirmatory trials, phase III trials aim at assessing the definitive efficacy, and even effectiveness, of the experimental drug compared with the current standard treatment. They are conducted as randomized controlled trials (RCTs), typically multi-center or multi-national, on large patient groups of 300–3,000 or more, depending upon the disease or medical condition studied and the expected treatment differences. As the last stage of drug development, phase IV begins only after drug approval. Phase IV studies are known as post-marketing surveillance trials, which aim at gathering additional information about a drug’s safety, efficacy, or optimal use. They are not considered necessary for approval but may be required by regulatory authorities due to the importance of the drug’s use. The safety surveillance is designed to detect any rare or long-term adverse effects over a much larger patient population and a longer time period than the phase I–III clinical trials. Harmful effects discovered by phase IV trials may result in a drug being withdrawn from the market, or restricted to certain uses [24, 62].

An RCT is defined as a study that measures an intervention’s effect by randomly assigning individuals or groups of individuals to an intervention or a control group. RCTs are normally performed in phase III, as well as in some phase II trials. Well-designed RCTs are considered the gold standard for measuring an intervention’s impact [86]. Normally, RCTs compare the experimental treatment with the current standard care to assess efficacy or effectiveness and safety. Sometimes a placebo is also involved, with or without the standard control; this type of RCT is usually referred to as a ‘placebo-controlled trial’. RCTs can be conducted to test superiority, non-inferiority or equivalence with respect to the primary or secondary endpoints, and are classified by design as parallel-group, crossover, cluster or factorial designs. The key property of an RCT is randomization, which has the advantages of balancing known and unknown prognostic factors and eliminating any bias from purposeful treatment assignment. Apart from simple randomization methods such as tossing a coin, many more complex randomization methods have been proposed in the past decades and are used today, e.g., minimization, stratification and permuted-block randomization. In practice, RCTs typically combine different randomization methods with other techniques such as blinding or multi-center conduct [76].
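As an illustration of the last point, stratified randomization with permuted blocks can be sketched in a few lines. This is a generic sketch only; the function names, the block size of four and the EXP/STD arm labels are our own choices and are not taken from any specific trial software.

```python
import random

def permuted_blocks(n_blocks, block_size=4, arms=("EXP", "STD"), rng=None):
    """Concatenate shuffled blocks, each containing an equal number of
    assignments per arm, so the allocation never drifts far from 1:1."""
    rng = rng or random.Random()
    per_arm = block_size // len(arms)
    sequence = []
    for _ in range(n_blocks):
        block = list(arms) * per_arm
        rng.shuffle(block)
        sequence.extend(block)
    return sequence

def stratified_assignment(patients, seed=42):
    """Assign patients (a list of (patient_id, stratum) pairs) using an
    independent permuted-block sequence within each stratum, e.g. per
    biomarker status or per center."""
    rng = random.Random(seed)
    sequences = {}
    assignments = {}
    for pid, stratum in patients:
        if stratum not in sequences:
            # one lazily created block sequence per stratum
            sequences[stratum] = iter(permuted_blocks(1000, rng=rng))
        assignments[pid] = next(sequences[stratum])
    return assignments
```

Within each stratum, the imbalance between arms can never exceed half a block, which is precisely the balancing property referred to above.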

1.3 Subgroup analysis in RCTs

Subgroup analysis is a great temptation as well as a big challenge in clinical trials, especially for personalized treatment strategies. In every patient population, it is very likely that some subpopulations of patients will have a better response to a new treatment and others a worse response. If the population is considered as a whole, a significant improvement may not be detected because of poor or even negative responses in certain subgroups, although the new treatment may be very effective for other subgroups. Therefore, it is always useful and important to know whether, and in which, subpopulation the new treatment works better or does more harm than in the others, regardless of whether the treatment showed effectiveness in the overall population.


Two main types of subgroup analysis are used in clinical trials: exploratory subgroup analysis and confirmatory subgroup analysis [63]. Confirmatory subgroup analysis is performed in confirmatory clinical trials with a small set of prospectively well-defined subpopulations of patients that are more likely to benefit from the treatment. Analysis of these subgroups needs to fulfill the standard requirements for confirmatory trials and may also result in a modified regulatory claim. Exploratory subgroup analysis is commonly employed in phase II or III trials and relies on a post-hoc subgroup search, which can be performed in trials with positive efficacy in the overall patient population in order to identify subgroups with limited treatment effect, or in a negative trial with retrospective analyses aiming to detect subgroups with a positive effect. A certain subgroup with acceptable toxicity can be selected if safety issues are detected in the complementary subgroups or in the overall population [12, 31, 50]. In principle, confirmatory subgroup analyses should be pre-planned and exploratory subgroup analyses should be interpreted cautiously [63].

In general, subgroup analysis poses severe statistical challenges, such as multiplicity in testing multiple hypotheses and insufficient power in certain subgroups. Therefore, many concerns have been raised regarding inappropriate reporting, misuse and over-interpretation of subgroup analyses in clinical trials. For instance, the authors of [126] remarked: “Trials are at least a practical way of making some solid progress, and it would be unfortunate if desire for the perfect (i.e., knowledge of exactly who will benefit from treatment) were to become the enemy of the possible (i.e., knowledge of the direction and approximate size of the effects of treatment of wide categories of patient).” Accordingly, many statisticians have tried to develop suitable approaches for both exploratory and confirmatory subgroup analyses over the last decades [7, 92, 116, 127]. Recently, health technology assessment institutions and regulatory agencies have also increased their attention to subgroup analyses, in order to make appropriate decisions or recommendations about treatment benefit for relevant patient groups in both positive and negative clinical trials [3, 17, 38, 41, 51, 52, 90].

1.4 Aims and structure

The main aims of this thesis are to translate the needs of personalized medicine into well-defined statistical problems, to develop and compare clinical trial designs and statistical approaches to solve these problems, and to develop recommendations for future personalized, marker-based treatment strategies. In RCTs with a single marker involved in treatment selection, we aim to compare existing parametric and non-parametric multiple testing procedures (MTPs) for establishing the treatment effect in the overall population or in one subgroup specified by marker status, and to recommend suitable MTPs for different scenarios. Once the treatment effect has been established in the overall population, we aim to build new methods and frameworks for further subgroup and interaction analyses, for studies with subgroups categorized by a single marker or by multiple markers in more general settings. When multiple markers are involved in an RCT, one new treatment may not be sufficient for treating all patients with different biological or genetic characteristics, and several treatments may be involved in the same study. In this situation, we aim to compare the highly stratified marker-based treatment strategy, where patients from different subgroups are treated with different treatments, with the standard treatment strategy for selecting treatments in RCTs with several subgroups defined by multiple markers.

The structure of this thesis is organized as follows. Chapter 2 first briefly introduces the definition of a biomarker and reviews several existing clinical trial designs involving biomarkers, such as the randomize-all design, biomarker-by-treatment interaction design, targeted or selection design, biomarker-strategy design and individual-profile design. Several statistical analysis approaches for the randomize-all design are also discussed in Chapter 2. We introduce three well-known studies as motivating cases in Chapter 3, namely the CAPRIE, STarT Back and FOCUS4 trials, which are also discussed in other chapters.

Chapters 4 to 6 present the main findings of this thesis. In Chapter 4, we consider the simplest situation, where only one single marker with two subgroups is considered relevant for treatment selection in confirmatory clinical trials. We compare several subgroup analysis approaches and MTPs from two families, i.e., fallback-type procedures and procedures with a weighting strategy. Five MTPs, two parametric and three non-parametric, are compared in terms of rejection regions and power for establishing a treatment effect in either the marker-positive subgroup or the overall population.

Similarly, we consider a situation with a single marker in Chapter 5, but here the marker was not considered in the planning phase, and a post hoc subgroup analysis is initiated or required for certain reasons after a significant treatment effect has been shown in the overall population. We propose a framework to assess the added value of subgroup analyses in this special situation, by introducing new measures and comparing different subgroup analysis methods under different assumptions. The framework can also be extended to more general situations with multiple markers.

The situation with multiple markers and multiple subgroups is considered in Chapter 6, where we discuss how to compare the marker-based treatment strategy with the standard treatment strategy using different approaches, including the two-arm approach, traditional subgroup analysis and subset analyses. We propose a framework using measures similar to those in Chapter 5, in order to show explicitly the differences between the five methods considered.

Last but not least, we summarize our findings and give an outlook on future research in Chapter 7. Some computational details and tables of numerical results are listed in the Appendices.


Chapter 2

State of the art of clinical trial design and statistical analysis in personalized medicine

2.1 Biomarker definition and examples

The term “biomarker” has been defined by many institutions from different areas. The most widely accepted definition is the one given by the National Institutes of Health Biomarkers Definitions Working Group in 1998: ‘A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention’ [11].

It is necessary to distinguish between disease-related and drug-related biomarkers. In oncology, clinical biomarkers are divided into two types, prognostic biomarkers and predictive biomarkers, according to whether the biomarker is disease-related or drug-related. Prognostic biomarkers are disease-related: they predict the likely course of disease in a defined clinical population, irrespective of which treatment the patients receive. For example, lymph node involvement is a prognostic biomarker in solid tumors, as it predicts a poor outcome even though treatment may prolong the survival of patients with or without evidence of nodal involvement. In contrast, predictive biomarkers are drug-related: they predict the likely response of a patient to a specific treatment in terms of efficacy and/or safety [19, 129]. For example, HER2 amplification is a predictive biomarker for benefit from trastuzumab therapy and perhaps also from doxorubicin and taxanes [104]. A predictive biomarker can also be used to identify poor candidates for a certain treatment; e.g., advanced colorectal cancer patients whose tumors have KRAS mutations appear to be poor candidates for treatment with EGFR antibodies [70]. However, many biomarkers are both prognostic and predictive, e.g., hormone-receptor status in breast cancer [19].


Figure 2.1: Distinction between quantitative interaction (left) and qualitative interaction (right). The points indicate the treatment effects in group 1 or group 2, and the solid lines connect the treatment effects in the same subgroups treated by treatment A or B.

Ballman [8] discussed the statistical considerations for determining whether a biomarker is potentially predictive or prognostic. A formal test for an interaction between the biomarker and the treatment group was suggested: for a continuous outcome, a regression model containing at least the treatment group, the biomarker, and the treatment-by-biomarker interaction can be fitted; a binomial or simple logistic regression model with the same covariates can be applied to a binary outcome. In the case of a time-to-event outcome, e.g., overall survival or progression-free survival (PFS), a simple Cox proportional hazards (PH) model can be used. Heterogeneous treatment effects in the biomarker-positive and biomarker-negative subgroups manifest themselves as a statistically significant interaction between treatment and biomarker. Two types of interaction should be taken into account in the choice of design and analysis of RCTs, i.e., quantitative and qualitative interactions. Figure 2.1 depicts the distinction between a quantitative interaction on the left-hand side and a qualitative interaction on the right-hand side. A qualitative interaction occurs when one treatment is superior for some subsets of patients and the alternative treatment is superior for other subsets. A quantitative interaction arises when the treatment effects vary among subsets in magnitude but not in direction [46]. In the case of a predictive biomarker, the p-value of the treatment-by-biomarker interaction in the model (whether linear regression, logistic regression or Cox PH model) should be less than the predetermined significance level, since a significant treatment-by-biomarker interaction indicates that the treatment effect differs by biomarker value. Both quantitative and qualitative interactions can arise for predictive biomarkers. In contrast, if the biomarker is purely prognostic, the test for interaction may not be significant even if the study is sufficiently powered for an interaction, but the biomarker may be statistically associated with the outcome with or without the treatment-by-biomarker interaction in the model. In the case of heterogeneous treatment effects, only quantitative interactions can appear for prognostic biomarkers.
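As a minimal numerical sketch of such an interaction test for a continuous outcome, consider the following toy implementation with plain least squares and a normal approximation to the t distribution. This is our own illustration, not code from [8]; the function and variable names are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def interaction_test(y, treat, marker):
    """Fit y ~ treat + marker + treat:marker by ordinary least squares
    and return the interaction estimate together with a two-sided
    p-value (normal approximation to the t distribution)."""
    y = np.asarray(y, dtype=float)
    treat = np.asarray(treat, dtype=float)
    marker = np.asarray(marker, dtype=float)
    X = np.column_stack([
        np.ones_like(y),   # intercept
        treat,             # treatment indicator (0/1)
        marker,            # biomarker status (0/1)
        treat * marker,    # treatment-by-biomarker interaction
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                 # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of beta
    z = beta[3] / sqrt(cov[3, 3])                # interaction z-statistic
    p_value = erfc(abs(z) / sqrt(2))             # two-sided p-value
    return beta[3], p_value
```

Simulating data with a true interaction (e.g., an effect only in marker-positive patients) yields a small p-value, whereas a purely prognostic marker, which only shifts the outcome through its main effect, generally does not.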

Figure 2.2: Examples of treatment effects for no biomarker (A), purely prognostic marker (B), purely predictive marker (C), and biomarker that is both predictive and prognostic (D). Figure reproduced from Buyse and Michiels [18]. The points indicate the treatment effects in the biomarker-positive subgroup and the diamonds those in the biomarker-negative subgroup. The solid and dashed lines connect the treatment effects in the same subgroups treated by the experimental (Exp) or standard (Std) treatments.

Figure 2.2 depicts four situations: no biomarker (Figure 2.2A), a purely prognostic biomarker (Figure 2.2B), a purely predictive biomarker (Figure 2.2C) and a biomarker that is both predictive and prognostic (Figure 2.2D). Accordingly, Figure 2.3 displays the corresponding behaviors in terms of Kaplan-Meier curves. In Figure 2.3A, the biomarker-positive patients have better survival than the biomarker-negative patients regardless of treatment group, so we can claim that the biomarker is prognostic. The fact that the treatment effect is the same for biomarker-negative and biomarker-positive patients, e.g., the hazard ratio for the treatment effect is the same in both groups, shows that the biomarker is not predictive. The biomarker in Figure 2.3B is purely predictive, since there is a treatment effect only for biomarker-positive patients and none for biomarker-negative patients: the treatment effect differs in quality between the two groups, a qualitative interaction. Moreover, untreated patients have the same survival in both the biomarker-positive and biomarker-negative groups, so we can conclude that the biomarker is not prognostic. Figure 2.3C shows an idealized example of a biomarker that is both predictive and prognostic, combining the properties of Figures 2.3A and 2.3B. It is predictive because there is a clear difference in treatment effect between the two biomarker groups, i.e., a larger effect in biomarker-positive patients. Moreover, the biomarker-positive patients have improved survival compared with biomarker-negative patients independently of treatment group, so the biomarker is also prognostic. As in Figure 2.2D, Figure 2.3C is an example of a quantitative interaction.


Figure 2.3: Examples of Kaplan-Meier curves for a purely prognostic marker (A), a purely predictive marker (B), and a marker that is both predictive and prognostic (C). Figure reproduced from Ballman [8].

2.2 Biomarker designs for randomized controlled trials

Due to the rapid development of personalized medicine and molecular biomarkers, many clinical trial designs involving biomarkers have been proposed in recent years, and several narrative reviews are available [19, 44, 56, 82, 89, 129]. Most of the designs consider the general setting of testing a previously identified single biomarker with discrete categories, as current studies mainly aim at identifying subgroups on the basis of a single marker; however, the multiple-marker situation may occasionally be of interest [41]. Most biomarker designs assume that the cut-point has already been determined for a continuous biomarker, classifying patients as biomarker-positive (B+) or biomarker-negative (B−). Hoering et al. [56] provided one of the first discussions of three different biomarker RCT designs, i.e., the randomize-all design, the targeted design and the biomarker-strategy design, and compared the powers of these three designs in different scenarios. Freidlin et al. [44] described three designs, enrichment designs, biomarker-strategy designs and biomarker-stratified designs, and presented in-depth comparisons of the pros and cons of each design. Buyse et al. [19] extended the discussion on biomarker designs and summarized ten different designs commonly used in phase II and III clinical trials, integrating prognostic, predictive and surrogate markers for either prospective or retrospective identification and validation.

In this section, we briefly summarize five designs widely discussed for RCTs involving biomarkers [19, 44, 56, 82, 129]: the randomize-all design, the biomarker-stratified design, the targeted or selection design, the biomarker-strategy design and the individual-profile design.

2.2.1 Randomize-all or all-comer design

When there are no compelling biologic or early trial data for a candidate predictive biomarker to predict the effect of a new treatment at the initiation of a definitive phase III trial, or when we are not sure whether the new treatment will be effective only in certain subgroups, it is generally reasonable to include all patients as eligible for randomization, i.e., to conduct the randomize-all design, and to plan for a subgroup analysis based on the biomarker [83]. In this design, shown in Figure 2.4, all patients are randomized to either the experimental treatment (EXP) or the standard treatment (STD) regardless of biomarker status; all patients, or a subset of them, are then tested afterwards to determine their biomarker status. A potential problem with the delayed biomarker test is that the biomarker status may not be available for all patients, e.g., because some patients refuse consent or tissue is no longer available. In this case, it is important to verify that the subset of patients whose biomarker status is assessed is reasonably representative of the total population [19].

Figure 2.4: Randomize-all design


Most phase III RCTs adopt the randomize-all design or the biomarker-stratified design introduced below, which is a special version of the all-comer design [109]. Subgroup analysis methods for trials using all-comer designs are discussed in Chapter 5.

2.2.2 Interaction or biomarker-stratified design

In the prospective setting, if the biomarker has been investigated in previous observational studies or in early-phase trials, it is desirable and essential to randomize the patients to the different treatments with stratification by the companion biomarker status. If we hypothesize that the treatment is most efficacious in marker-positive patients, but it is unclear whether the therapy is beneficial for biomarker-negative patients as well, it is wise to include all eligible patients and choose the interaction or biomarker-stratified design. In this design, patients are first tested for biomarker status and then randomized to either EXP or STD (see Figure 2.5). The benefits of stratification include balancing the treatment groups with respect to biomarker status, while at the same time ensuring that the biomarker status is known for all patients [19].

Figure 2.5: Interaction or biomarker-stratified design

The biomarker-stratified design provides a sound basis for decision-making about the efficacy and benefit/risk of the experimental treatment, as it is suited both to testing the overall benefit regardless of marker status and to exploring the biomarker-positive and biomarker-negative subpopulations. For predictive biomarkers, this design can assess whether the biomarker is useful in selecting the best among two or more treatments for a given patient [44, 129]. However, a statistically significant interaction between biomarker and treatment effect does not automatically guarantee the predictiveness of the biomarker for treatment selection, as the treatment effect may be positive in both subpopulations [3, 72]. Likewise, this design may or may not be useful in treatment selection for prognostic biomarkers, since one treatment could be better for all patients while outcomes still depend on the value of the prognostic biomarker. A statistical test in subsets defined by the biomarker would be more useful for selecting treatment. Alternatively, we can aim at proving an effect in the marker-positive and in the marker-negative subpopulation. This then allows us to determine, with appropriate power, whether or not the treatment is effective overall and in the subgroup of biomarker-positive patients [56]. The biomarker-stratified design allows for several hierarchical statistical testing procedures; statistical analysis and testing strategies are discussed in Section 2.4.
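One of the simplest such hierarchical procedures tests the overall population first and moves on to the biomarker-positive subgroup only after success, which controls the familywise error rate at level α without any alpha adjustment. The following is a generic sketch of this fixed-sequence idea, not one of the specific procedures discussed in Section 2.4.

```python
def fixed_sequence_test(p_overall, p_subgroup, alpha=0.05):
    """Fixed-sequence (hierarchical) testing: the overall hypothesis is
    tested at the full level alpha; the subgroup hypothesis is tested,
    again at level alpha, only if the overall test rejects. Because the
    hypotheses are tested in a pre-specified order, the familywise
    error rate is controlled at alpha without adjustment."""
    reject_overall = p_overall <= alpha
    reject_subgroup = reject_overall and p_subgroup <= alpha
    return {"overall": reject_overall, "subgroup": reject_subgroup}
```

A subgroup p-value of 0.01 thus leads to no claim at all if the overall test fails; this is the price of spending the full α on the overall hypothesis first.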

2.2.3 Targeted or selection design

In some settings, sufficiently convincing evidence is available to suggest that the potential treatment benefit is limited to a certain biomarker-defined patient subgroup, or it is not feasible to use a biomarker-stratified design, which requires random assignment of all patients. The best strategy will then often be to target the subset of patients who are predicted to benefit most, and the targeted or selection design is the most appropriate. As in the interaction or biomarker-stratified design, a diagnostic test is performed to assess biomarker status before randomization, but only patients with the predictive characteristic, i.e., biomarker-positive patients, enter the RCT; the remaining patients are not eligible, as shown in Figure 2.6.

Figure 2.6: Targeted or selection design

A targeted design has been shown to be a good design if the underlying pathways and biology are understood well enough, or a biomarker is proven to be truly predictive of treatment efficacy, so that it is clear that the therapy under investigation can only work for a specific subset of patients [100]. The targeted or selection design generally requires a smaller number of randomized patients than the randomize-all design to determine the effectiveness of a new treatment in the targeted subpopulation. However, as shown in Hoering et al. [56], this design is not suitable for a prognostic biomarker or for a biomarker without a well-established cut-point (e.g., biomarker-positive vs. biomarker-negative), as no insight is gained into the efficacy of the new treatment in the complementary subpopulation, and a large number of patients still need to be assessed for their marker status, especially if the biomarker prevalence is low. This design can be aimed at establishing the worth of the new agent. For example, the clinical development of trastuzumab in breast cancer was restricted to patients with HER2/neu-amplified tumors, based both on biological considerations and on the lack of tumor response in advanced tumors without HER2/neu amplification [81, 96].
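The sample-size advantage of the targeted design, and the screening cost noted above, can be quantified with the usual two-sample normal approximation, under the assumption that a standardized effect δ exists only in the biomarker-positive subgroup with prevalence p, so that the randomize-all design must detect the diluted effect p·δ. This is a back-of-the-envelope sketch in the spirit of Hoering et al. [56], not their exact calculation; 1.96 and 0.84 are the normal quantiles for two-sided α = 0.05 and 80% power.

```python
from math import ceil

def n_per_arm(delta, sigma=1.0, z_alpha=1.96, z_beta=0.84):
    """Two-sample normal approximation: patients per arm needed to
    detect a mean difference `delta` with standard deviation `sigma`."""
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

def compare_designs(delta, prevalence, sigma=1.0):
    """Randomized patients needed per arm when the effect `delta`
    exists only in the biomarker-positive subgroup (prevalence p)."""
    targeted = n_per_arm(delta, sigma)                    # only B+ randomized
    randomize_all = n_per_arm(prevalence * delta, sigma)  # diluted effect
    screened_for_targeted = ceil(targeted / prevalence)   # B+ must be found
    return {"targeted": targeted,
            "randomize_all": randomize_all,
            "screened_for_targeted": screened_for_targeted}
```

With δ = 0.5 and a prevalence of 25%, for instance, the targeted design randomizes 63 patients per arm where the randomize-all design would need about 1004; however, roughly 252 patients per arm have to be screened to find the eligible biomarker-positive patients.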

2.2.4 Biomarker-strategy design

The biomarker-strategy design mainly addresses the clinical utility of predictive biomarkers: it aims to compare the standard treatment with a biomarker strategy that adapts the treatment to the biomarker status. In this design, patients are randomly assigned to a biomarker-based strategy arm or a non-biomarker-based strategy arm. Patients in the biomarker-based strategy arm receive treatments according to their biomarker status, e.g., EXP in the biomarker-positive subgroup and STD in the biomarker-negative subgroup, while patients in the non-biomarker-based strategy arm either all receive STD (Figure 2.7) or are randomized to EXP or STD regardless of biomarker status (Figure 2.8). The excision repair cross-complementing 1 (ERCC1) trial is a good example of a phase III trial using the biomarker-strategy design. ERCC1 gene expression has been suggested as a predictive biomarker associated with cisplatin resistance in non-small cell lung cancer. In the ERCC1 trial, patients were randomly assigned with allocation ratio 2:1 to the biomarker-strategy arm or to the control arm, which received cisplatin plus docetaxel. In the biomarker-strategy arm, patients with high ERCC1 expression were treated with gemcitabine plus docetaxel and those with low ERCC1 with cisplatin plus docetaxel [25].

Figure 2.7: Biomarker-strategy design with standard control

The biomarker-strategy design seems to address the relevant question by comparing the newbiomarker-based treatment strategy to the standard-approach which does not consider the biomarker,however, it was shown to be inefficient and statistically problematic. Two main concerns are discussedwith the biomarker-strategy design: first, the statistical power for comparing the strategy arm with thestandard arm is much lower compared to other designs, e.g. enrichment or biomarker-stratified design,due to the fact that a certain proportion of patients would receive the same treatment on either arm [44, 56].Second, the difference between the two randomized arms is expected to be small, especially if the preva-lence of a positive biomarker is low, even if a bigger difference was detected between the randomized

Page 33: clinical trials for personalized, marker-based treatment strategies

2.2. BIOMARKER DESIGNS FOR RANDOMIZED CONTROLLED TRIALS 15

Figure 2.8: Biomarker-strategy design with randomized control

arms, it could be due to a better efficacy of the experimental treatment regardless of the biomarker status [37]. Thus a positive study cannot distinguish between a successful treatment selection strategy and an experimental treatment that is simply more effective than the standard treatment.

2.2.5 Individual profile or marker-based and stratified design

When one or more predictive biomarkers are known or assumed to exist, the purpose of the trial is not to formally validate these biomarkers, but rather to use them to optimize treatment selection. In this situation, the individual profile or marker-based and stratified design summarized in Ziegler et al. [129] may be the best choice; it can be considered an extension of the biomarker-strategy design with standard control (Figure 2.7) involving a single biomarker with more than two categories or multiple biomarkers [44]. This design includes a large number of different profiles, possibly derived from multiple biomarkers or genetic characteristics of each patient, and leads to the selection of one out of a large number of different treatments. It can easily be planned and understood as a strategic trial comparing the conventional treatment selection to an individualized decision rule (Figure 2.9). For example, an individualized therapy might combine several mono-therapies, each selected based on the presence or absence of a specific DNA variant.

The recent development of ‘umbrella’ trials in oncology research can be considered an extension of the individual profile design, as these trials also address the setting in which not a single marker but multiple molecular markers are involved [71, 84, 85, 107, 111]. However, analogous to the main difference between the randomize-all design and the biomarker-stratified design, in umbrella trials the biomarker status is assessed before treatments are assigned to each profile, so the biomarker status is known for all patients (Figure 2.10).

The individual profile or marker-based and stratified design reflects the paradigm of individualized treatment and personalized medicine. However, the complexity of this design as well as the practical issues involved pose new challenges to the regulatory system. Several statistical approaches for this design are discussed in Chapter 6.


Figure 2.9: Individual profile or marker-based and stratified design

Figure 2.10: Umbrella trial design


2.3 Multiplicity issues in RCTs

Multiplicity issues have drawn considerable attention in the conduct of clinical trials for broad classes of treatments, including drugs, therapies and medical devices, for several decades. In personalized medicine, the complexity of the different trial designs aimed at biomarker identification or subgroup selection raises statistical as well as regulatory concerns, especially when multiple biomarkers, objectives or endpoints are involved in the studies [38, 63].

We first introduce some definitions and principles in this section, then describe some multiple testing procedures (MTPs) that are commonly used in clinical trials, and discuss them in detail in the following chapters.

2.3.1 Definition and general concept

Multiplicity issues arise when multiple hypotheses must be tested simultaneously: if the same significance level α is applied to each hypothesis, the overall Type I error rate exceeds α [124]. The simplest way to reduce multiplicity concerns is to reduce the number of hypotheses being tested. However, it is almost impossible to test only a single hypothesis and thereby avoid multiplicity problems in large-scale RCTs; thus a multiplicity adjustment is required to preserve the overall error rate at the nominal level, and multiple testing techniques controlling the overall Type I error rate across all hypotheses must be implemented [34, 36].
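The inflation can be made concrete with a two-line calculation: assuming m independent tests of true null hypotheses, each at level α, the probability of at least one false rejection is 1 − (1 − α)^m.

```python
# Overall Type I error rate when m independent true null hypotheses are
# each tested at the unadjusted level alpha = 0.05.
alpha = 0.05
for m in (1, 2, 5, 10):
    print(f"m = {m:2d}: P(at least one false rejection) = {1 - (1 - alpha) ** m:.3f}")
```

Already with m = 5 the unadjusted overall error rate is about 0.23, far above the nominal 0.05.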

Union-intersection and intersection-union testing approaches

Multiple testing problems can be formulated according to two principles: the union-intersection and the intersection-union testing approach. The union-intersection testing approach, derived from Roy [97], can informally be referred to as an at-least-one testing approach. Assume we have multiple hypotheses to be tested; let m ≥ 1 denote the number of hypotheses corresponding to the multiple objectives in a clinical trial, and H1, . . . ,Hm

denote the null hypotheses. In the union-intersection framework, the global hypothesis HI is defined as the intersection of all elementary hypotheses and is given by

HI = H1 ∩ ··· ∩ Hm,

and it is rejected if at least one of the null hypotheses is rejected. In a clinical trial, if the main objective is formulated by several primary analyses or several hypotheses in different populations, this objective is met when at least one analysis or hypothesis provides a significant result. In this situation, we have the classical multiplicity problem, i.e., the probability to reject HI is typically larger than α if all single hypotheses are tested at level α.

Another class of multiple testing problems is formulated by the intersection-union test [10], which can be considered an all-or-none approach. In contrast to the union-intersection test, the global null


hypothesis HU of the intersection-union test is defined as the union of the individual hypotheses,

HU = H1 ∪ ··· ∪ Hm,

which is rejected only if all null hypotheses are rejected. No multiplicity adjustment is needed in this approach, because each elementary null hypothesis is tested at the full α level and the global null hypothesis is only rejected when all elementary hypotheses are rejected.
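The two decision rules can be contrasted in a minimal sketch (function names are ours, not standard terminology):

```python
def reject_global_ui(pvals, alpha=0.05):
    # Union-intersection: the global intersection hypothesis is rejected
    # as soon as at least one elementary p-value is significant.
    return any(p <= alpha for p in pvals)

def reject_global_iu(pvals, alpha=0.05):
    # Intersection-union: the global union hypothesis is rejected only if
    # every elementary p-value is significant at the full level alpha.
    return all(p <= alpha for p in pvals)

pvals = [0.012, 0.240, 0.031]
print(reject_global_ui(pvals))  # True: one significant result suffices
print(reject_global_iu(pvals))  # False: 0.240 is not significant
```

Only the union-intersection rule creates a multiplicity problem: applied with unadjusted elementary tests, `reject_global_ui` exceeds level α, which is why the adjustments discussed below are needed.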

Family-wise error rate and adjusted p-values

For any testing problem in a study, two types of errors should be controlled, i.e., Type I and Type II errors. A Type I error, a false positive decision, occurs if an effect is declared when none exists. A Type II error, a false negative decision, occurs if we fail to declare a truly existing effect. The overall Type I error rate is the probability of rejecting at least one true hypothesis. As shown in Table 2.1 for m hypotheses, R is the total number of rejected hypotheses, V is the number of Type I errors and T is the number of Type II errors.

Table 2.1: Type I and Type II errors in multiple hypothesis testing.

Hypotheses   Not rejected   Rejected   Total
True         U              V          m0
False        T              S          m − m0
Total        W              R          m

In clinical trials, carrying out the individual tests at the unadjusted significance level and ignoring multiplicity in the union-intersection testing approach leads to an inflated probability of making incorrect conclusions about new treatments. As the multiple hypotheses of interest in a clinical trial are considered as a family, the overall Type I error rate computed for this family of hypotheses is called the family-wise error rate (FWER) and is given by

FWER = P(V > 0).

It is the most common error rate required to be controlled in clinical trials [15, 55]. There is a distinction between weak and strong FWER control. Weak control of the FWER means that the FWER is controlled under the assumption that all null hypotheses are simultaneously true, i.e., it is required that

P(V > 0 | HI) ≤ α.

Strong control of the FWER means that the error rate is bounded by α under


all configurations of true and false null hypotheses, which requires that

max_{I ⊆ {1,...,m}} P(V > 0 | ⋂_{i∈I} Hi) ≤ α,

[34–36]. In order to protect against false claims, controlling the FWER in the strong sense is mandated in confirmatory clinical trials [38]. All MTPs introduced in this thesis provide strong control of the FWER.
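A short Monte-Carlo check (a sketch; the function name and simulation settings are ours) illustrates FWER control under the global null: with m independent uniform p-values, unadjusted testing inflates P(V > 0), while the Bonferroni threshold α/m keeps it near α.

```python
import random

def simulate_fwer(m, alpha, adjust=True, n_sim=20000, seed=1):
    """Estimate P(V > 0) under the global null with m independent tests."""
    random.seed(seed)
    threshold = alpha / m if adjust else alpha   # Bonferroni vs. no adjustment
    hits = sum(
        any(random.random() <= threshold for _ in range(m))
        for _ in range(n_sim)
    )
    return hits / n_sim

print(simulate_fwer(5, 0.05, adjust=False))  # near 1 - 0.95**5, about 0.226
print(simulate_fwer(5, 0.05, adjust=True))   # near the nominal 0.05
```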

The p-value is a useful tool for comparing test results and making decisions in hypothesis testing. Since the computation of p-values is a common exercise in univariate hypothesis tests, it is also desirable to compute adjusted p-values in MTPs so that they can be compared directly with the significance level [15]. An adjusted p-value, denoted by qi, is defined as the smallest significance level at which a given MTP still rejects the elementary hypothesis; it is directly comparable with the significance level α [117]. For the FWER, the mathematical definition of an adjusted p-value is similar to that of an ordinary p-value and is given by

qi = inf{α ∈ (0,1) | Hi is rejected at FWER level α},

if such an α exists, and qi = 1 otherwise. If the MTP controls the FWER at level α, the corresponding elementary null hypothesis Hi can be rejected whenever qi ≤ α. Adjusted p-values capture the multiplicity adjustment by construction and incorporate the structure of the underlying decisions [36].
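As a concrete illustration (a sketch; the Holm procedure itself is introduced in the next subsection), the adjusted p-values of the step-down Holm procedure can be computed directly, and comparing qi with α reproduces its rejections:

```python
def holm_adjusted(pvals):
    """Holm step-down adjusted p-values (strong FWER control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):            # rank 0 = smallest p-value
        q = min(1.0, (m - rank) * pvals[i])     # step-down Bonferroni factor
        running_max = max(running_max, q)       # enforce monotonicity of qi
        adjusted[i] = running_max
    return adjusted

print([round(q, 4) for q in holm_adjusted([0.01, 0.04, 0.03])])  # [0.03, 0.06, 0.06]
```

At α = 0.05 only the first hypothesis is rejected, exactly as the Holm procedure would decide.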

Single step and stepwise procedures

Multiple comparison procedures can be classified into two types: single-step and stepwise procedures. Single-step procedures are characterized by the fact that the decision to reject any null hypothesis does not depend on the other hypotheses; thus the order of the hypotheses is not important, and they can be considered as multiple tests performed simultaneously. The Bonferroni test is a well-known single-step procedure.

In contrast, stepwise procedures are performed sequentially, where the rejection or non-rejection of a null hypothesis may take the decisions on other hypotheses into account. Stepwise procedures are further divided into step-down and step-up procedures. Step-down procedures start with the most significant p-value and continue through the sequence until a hypothesis is retained or all hypotheses are rejected. If a hypothesis is retained, testing stops and the remaining hypotheses are retained by implication. The Holm procedure is a stepwise extension of the Bonferroni test using the closure principle in combination with a step-down testing approach [57]. Step-up procedures test the hypotheses in the opposite direction and carry out the individual tests from the least significant to the most significant. Once a step-up procedure rejects a hypothesis, it rejects the remaining hypotheses by implication. The Hochberg procedure is an example of an extension of the Bonferroni test using a step-up testing approach [54].

Single-step procedures are generally less powerful than their stepwise extensions. In stepwise procedures, some hypotheses can be rejected or retained by implication; any hypothesis rejected by a single-step procedure will also be rejected by its stepwise extension, but not vice versa, so more hypotheses can be rejected at the same FWER [15, 34, 36].

2.3.2 Closure principle and closed testing procedures

The closure principle, proposed by Marcus et al. [80], is a key principle for constructing multiple tests and has been used to construct a variety of stepwise procedures. Multiple testing procedures based on the closure principle are called closed testing procedures. Marcus et al. [80] showed that closed testing procedures control the FWER strongly at level α. In the general case of testing m hypotheses, a closed testing procedure is performed as follows:

1. Define a set of elementary hypotheses H = {H1, . . . ,Hm}.

2. Construct the closure set of all intersection hypotheses HI = ⋂_{i∈I} Hi for each non-empty index set I ⊆ {1, . . . ,m}.

3. Test each intersection hypothesis HI with a suitable local α-level test, yielding a p-value pI.

4. Reject Hi if all intersection hypotheses HI with i ∈ I are rejected at their local significance level α .

The adjusted p-value for Hi from a closed testing procedure is computed as

qi = max_{I : i∈I} pI , i = 1, . . . ,m.

Take a simple example with only two hypotheses, i.e., H = {H1,H2}. Figure 2.11 shows a schematic diagram of the closed testing procedure as proposed by [15]. The intersection hypothesis H12 is shown at the top, and the two elementary hypotheses H1 and H2 at the bottom of the diagram. Testing follows a ‘top-down’ fashion: H12 is tested first at level α. If H12 is not rejected, no further testing is performed. Otherwise, H1 and H2 are each tested at level α. Finally, the elementary hypothesis H1 is rejected if H12 and H1 are both locally rejected; a similar decision rule holds for H2.
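The four steps above can be sketched for small m using Bonferroni local tests (an illustrative choice; with it the closed procedure coincides with the Holm procedure):

```python
from itertools import combinations

def closed_test(pvals, alpha=0.05):
    """Closed testing: reject H_i iff every intersection H_I with i in I is
    rejected by its local test (here a Bonferroni min-p test at level alpha)."""
    m = len(pvals)
    rejected = set()
    for i in range(m):
        local_ok = all(
            min(pvals[j] for j in I) <= alpha / len(I)
            for k in range(1, m + 1)
            for I in combinations(range(m), k)
            if i in I
        )
        if local_ok:
            rejected.add(i)
    return rejected

print(closed_test([0.010, 0.030, 0.040]))  # {0}: only the smallest p-value survives
```

The exhaustive loop over all 2^m − 1 intersections is only feasible for small m; practical procedures exploit shortcuts such as the step-down structure of the Holm test.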

The closure principle is a flexible construction method that can capture the differences in the relationships between the various study objectives, i.e., the different elementary hypotheses. Many common MTPs are in fact closed testing procedures [15, 36], for instance the Holm procedure, the fixed-sequence procedure and the fallback procedures.


Figure 2.11: Schematic diagram of the closure principle for H1 and H2 and their intersection H12

2.3.3 Classifications of multiple testing procedures

A large number of multiple testing procedures have been used in clinical trials. Besides the classification into single-step vs. stepwise procedures, MTPs can also be classified by logical relationship or by distributional information. Dmitrienko et al. [35] discussed two types of logical relationships, i.e., pre-specified and data-driven hypothesis ordering. With pre-specified hypothesis ordering, the order in which the null hypotheses are tested is fixed in advance according to their clinical importance or other criteria; the fixed-sequence and fallback procedures are typical examples. On the contrary, with data-driven hypothesis ordering no pre-determined order exists, and the null hypotheses are ordered and tested by the significance of their test statistics; the Holm and Hochberg procedures are good examples.

Single-step and stepwise procedures can also be classified as parametric or non-parametric. Nonparametric MTPs control the FWER without any distributional assumptions, i.e., the p-values rely only on the local tests, whereas parametric procedures take the correlation of the test statistics into account and assume that the test statistics follow a certain distribution. Table 2.2 presents some well-known multiple testing procedures according to these two classifications. Some of them are discussed further in Chapter 4.

Table 2.2: Classification of multiple testing procedures that can be used in confirmatory clinical trials.

Distributional    Single-step           Data-driven              Pre-specified
information                             hypothesis ordering      hypothesis ordering
Nonparametric     Bonferroni            Holm                     Fixed-sequence
                  Weighted Bonferroni   Weighted Holm            Fallback
Parametric        Dunnett               Weighted parametric      Parametric fallback,
                                                                 feedback


2.4 Statistical analysis strategies for randomize-all design

Although many phase III RCTs with randomize-all or biomarker-stratified designs (Sections 2.2.1 and 2.2.2) involve one or more baseline genomic or clinical classifiers, most investigations and guidelines only consider the simplest but most common situation where subgroups are identified based on a single classifier [41, 113]. The aim is typically to establish efficacy claims of the test treatment or drug in the overall population and/or in pre-specified subgroup(s). By definition of the trial design, a separate statistical analysis plan is usually carried out to evaluate the biomarker effect apart from the treatment efficacy. However, since most clinical trials are designed to be powered for the primary outcome in the overall population only, such subgroup tests tend to be underpowered compared to the overall population because of the smaller number of patients included. Hence multiplicity issues need to be taken into account in the statistical analysis of this special setting, integrating the treatment and the biomarker evaluation.

Various statistical analysis plans incorporating different multiple testing procedures have been proposed in the literature [43, 103]. Freidlin et al. [45], Matsui et al. [83] and Simon [100] reviewed and compared several types of analysis approaches categorized by the choice and sequence of subgroup tests, i.e., in the biomarker-positive subgroup (B+), the biomarker-negative subgroup (B−) and the overall population. If we denote by θ+, θ− and θo the treatment effects tested in B+, B− and the overall population, the null hypotheses considered in the statistical analysis plans are that there is no treatment effect in the targeted population, the complementary population and the overall population, denoted by H+, H− and Ho, respectively.

In this section, we summarize and discuss the advantages and drawbacks of four approaches, the fixed-sequence, fallback, marker sequential test and treatment-by-biomarker interaction approaches, in terms of the probability of asserting treatment efficacy for either the overall patient population or a biomarker-positive subpopulation, as well as potentially suitable statistical methods and testing procedures for each approach.

2.4.1 Fixed-sequence approaches

In a randomize-all trial involving a well-defined predictive biomarker, if previous studies indicate that the treatment effect differs between the biomarker-positive and biomarker-negative subgroups, it is reasonable to address the treatment effect in one or both subgroups in the statistical analysis plan [45]. In fixed-sequence (FS) approaches, treatment efficacy is first tested in the biomarker-positive subgroup, since the treatment is expected to be effective there. After showing effectiveness in the biomarker-positive subgroup, treatment efficacy can be tested in either the biomarker-negative patients or the overall population. Since the hypothesis testing order is pre-specified, fixed-sequence procedures are widely considered in this situation. In the first stage, treatment efficacy is tested in the biomarker-positive group at the pre-specified significance level α. If the first test is significant, treatment efficacy is tested in either the subset of the remaining patients (FS-1, Figure 2.12) or the overall population (FS-2, Figure 2.13) at the same significance level α [82, 83]. Since


both hypotheses are tested at the full α level using the fixed-sequence procedure, the FWER is well controlled.

Figure 2.12: Fixed-sequence approach with testing B− in the second stage (FS-1)

Figure 2.13: Fixed-sequence approach with testing overall population in the second stage (FS-2)

The hypothesis tests of the FS-1 approach are performed in the following steps:

• Step 1. Perform the test of H+ at level α. If H+ is rejected, proceed to Step 2; otherwise stop testing and H− is automatically accepted.

• Step 2. Perform the test of H− at level α. If H− is rejected, conclude θ+ > 0 and θ− > 0; otherwise conclude θ+ > 0 only.

The hypothesis tests of the FS-2 approach are performed in the following steps:

• Step 1. Perform test of H+ at α level. If H+ is rejected, proceed to Step 2; otherwise stop testingand Ho is automatically accepted.


• Step 2. Perform test of Ho at α level. If Ho is rejected, conclude θo > 0; otherwise conclude θ+ > 0only.
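Both variants share the same hierarchical decision rule; a minimal sketch (the function and the claim labels are ours) is:

```python
def fixed_sequence(p_pos, p_second, alpha=0.05):
    """FS-1/FS-2: H+ first at full alpha; the second hypothesis (H- in FS-1,
    Ho in FS-2) is tested, again at the full alpha, only if H+ is rejected."""
    if p_pos > alpha:
        return []                          # stop: no efficacy claim at all
    if p_second > alpha:
        return ["theta+ > 0"]              # claim in B+ only
    return ["theta+ > 0", "second hypothesis rejected"]

print(fixed_sequence(0.012, 0.210))  # ['theta+ > 0']
print(fixed_sequence(0.080, 0.001))  # []: testing stops after Step 1
```

The second example shows the characteristic risk of the fixed sequence: a highly significant second-stage result is lost whenever the first-stage test fails.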

2.4.2 Marker sequential test approach

Freidlin et al. [43] implemented the hybrid design discussed in Freidlin et al. [45] and presented the marker sequential test (MaST) approach. As a useful alternative to the fixed-sequence approaches, MaST is appropriate for settings where we assume that the treatment will not be effective in the biomarker-negative patients unless it is effective in the biomarker-positive patients.

Figure 2.14: Marker sequential test approach

As shown in Figure 2.14, treatment efficacy is first tested in the biomarker-positive subgroup at a reduced level α1 < α. If this test is significant, the biomarker-negative subgroup is tested at the full significance level α; otherwise, the overall population is tested at the remaining significance level α2 = α − α1. This approach controls the FWER at level α under the global null hypothesis of no treatment effect in either the biomarker-positive or the biomarker-negative subgroup.

The hypothesis tests of the MaST approach are performed in the following steps:

• Step 1. Perform test of H+ at pre-specified α1 level, α1 ≤ α . If H+ is rejected, proceed to Step 2;otherwise proceed to Step 3.

• Step 2. Perform test of H− at α level. If H− is rejected, conclude θo > 0; otherwise conclude θ+ > 0 only.


• Step 3. Perform test of Ho at α2 level, α2 = α − α1. If Ho is rejected, conclude θo > 0; otherwise stop testing and accept all null hypotheses.

Compared to the fixed-sequence approaches, MaST has higher power when the treatment effect is homogeneous across the biomarker subgroups, and it preserves high power in settings where the treatment benefit is limited to the biomarker-positive subgroup [43, 83].
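The three steps translate into a short decision rule (a sketch; the split α1 = 0.04 of α = 0.05 is purely an example):

```python
def mast(p_pos, p_neg, p_overall, alpha=0.05, alpha1=0.04):
    # Step 1: biomarker-positive subgroup at the reduced level alpha1.
    if p_pos <= alpha1:
        # Step 2: biomarker-negative subgroup at the full level alpha.
        return "theta_o > 0" if p_neg <= alpha else "theta+ > 0 only"
    # Step 3: overall population at the remaining level alpha2 = alpha - alpha1.
    return "theta_o > 0" if p_overall <= alpha - alpha1 else "no claim"

print(mast(0.030, 0.020, 0.500))  # 'theta_o > 0'
print(mast(0.200, 0.300, 0.005))  # 'theta_o > 0' via the overall test at alpha2
```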

2.4.3 Fallback approach

When there is limited evidence that the new treatment is effective only in the targeted subgroup, or when the treatment efficacy may be homogeneous across the subgroups, it is generally reasonable to assess treatment efficacy in the overall population and to prepare the subset analysis as a fallback option [83]. The fallback approach tests treatment efficacy in the overall population at a reduced significance level α1 in the first stage; if this test is significant, treatment efficacy in the biomarker-positive subgroup is then tested at the full significance level α in the second stage. Although this approach also pre-specifies the testing order, if the treatment efficacy is not significant in the overall population, the biomarker-positive subgroup can still be tested at the unused significance level α2 = α − α1 in the second stage, which preserves the FWER at level α.

Figure 2.15: Fallback approach

The hypothesis tests of the fallback approach are performed in the following steps:

• Step 1. Perform test of Ho at pre-specified α1 level, α1 ≤ α. If Ho is rejected, conclude θo > 0 and additionally test H+ at the full α level; otherwise proceed to Step 2.

• Step 2. Perform test of H+ at α2 level, α2 = α − α1. If H+ is rejected, conclude θ+ > 0 only; otherwise stop testing and accept all null hypotheses.
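The fallback rule, including the carry-forward of unused α described above, can be sketched as follows (α1 = 0.04 is a hypothetical split):

```python
def fallback(p_overall, p_pos, alpha=0.05, alpha1=0.04):
    # Step 1: overall population at the reduced level alpha1.
    if p_overall <= alpha1:
        claims = ["theta_o > 0"]
        if p_pos <= alpha:            # unused alpha: H+ is now tested at full level
            claims.append("theta+ > 0")
        return claims
    # Step 2: H+ at the remaining level alpha2 = alpha - alpha1.
    return ["theta+ > 0 only"] if p_pos <= alpha - alpha1 else []

print(fallback(0.030, 0.002))  # ['theta_o > 0', 'theta+ > 0']
print(fallback(0.200, 0.005))  # ['theta+ > 0 only']: the fallback option
```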


This approach can suffer from a serious lack of power under a qualitative interaction between treatment and biomarker: in that case, the probability of asserting a treatment effect in the overall population can be high even though the treatment effect differs strongly between the two subgroups. However, the power can be improved by taking the correlation between the tests in the overall population and the biomarker-positive subgroup into account [116]; thus parametric testing procedures such as the parametric fallback, Song-Chi and feedback procedures can also be considered in the fallback analysis approach.

2.4.4 Treatment-by-biomarker interaction approach

If there is substantial uncertainty about a difference in treatment effects between the biomarker-positive and biomarker-negative subgroups, the treatment-by-biomarker interaction approach can be a reasonable choice, as it provides a high probability of asserting treatment efficacy for the right patient population both under homogeneous treatment effects and under a qualitative interaction across biomarker-based subgroups, by performing a preliminary test of the interaction between treatment and biomarker status [83]. To control the family-wise Type I error rate, this approach performs a one-sided interaction test in the first stage at significance level αINT to detect a larger treatment effect in the biomarker-positive subgroup [99]. In the second stage, treatment efficacy is tested in the biomarker-positive subgroup at significance level α3 if the interaction test is significant, or in the overall population at a reduced significance level α4 otherwise, where αINT, α3 and α4 are chosen to control the FWER at level α for testing no treatment effect in the overall population and the biomarker-positive subgroup, based on the asymptotic distribution of the test statistics.

Figure 2.16: Treatment-by-biomarker interaction approach


The hypothesis tests of the treatment-by-biomarker interaction approach are performed in the following steps:

• Step 1. Perform the treatment-by-biomarker interaction test of HINT at pre-specified αINT level, αINT ≤ α. If HINT is rejected, proceed to Step 2; otherwise proceed to Step 3.

• Step 2. Perform test of H+ at α3 level, α3 ≤ α . If H+ is rejected, conclude θ+ > 0; otherwise stoptesting and accept all null hypotheses.

• Step 3. Perform test of Ho at α4 level, α4 = α−α3. If Ho is rejected, conclude θo > 0; otherwisestop testing and accept all null hypotheses.
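The three-step rule can be sketched as follows (the levels αINT = 0.1, α3 = 0.04 and α4 = 0.01 are purely illustrative; in practice they are calibrated from the joint distribution of the test statistics):

```python
def interaction_approach(p_int, p_pos, p_overall,
                         alpha_int=0.10, alpha3=0.04, alpha4=0.01):
    # Step 1: one-sided treatment-by-biomarker interaction test.
    if p_int <= alpha_int:
        # Step 2: interaction detected -> efficacy claim only for B+.
        return "theta+ > 0" if p_pos <= alpha3 else "no claim"
    # Step 3: no interaction detected -> test the overall population.
    return "theta_o > 0" if p_overall <= alpha4 else "no claim"

print(interaction_approach(0.02, 0.010, 0.500))  # 'theta+ > 0'
print(interaction_approach(0.40, 0.010, 0.005))  # 'theta_o > 0'
```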

The treatment-by-biomarker interaction approach has been widely discussed in the literature as a clinical validation of the predictive biomarker. Although this approach performs well both under homogeneous treatment effects and under qualitative interactions, it suffers from a serious lack of power and requires larger sample sizes than the other approaches [83]. A comparison of a method similar to this approach with other methods of subgroup analysis can be found in Chapter 5.


Chapter 3

Motivating case studies

3.1 CAPRIE

The CAPRIE trial was a randomized, triple-blind clinical trial in 19,185 patients with atherothrombosis (diagnosed by the disease manifestations myocardial infarction [MI], stroke, or symptomatic peripheral arterial disease [PAD]) who were randomly assigned to treatment with either clopidogrel or aspirin using block randomization. The main objective was to test the superiority of clopidogrel vs. aspirin in the secondary prevention of vascular events. The primary endpoint was the first occurrence of an event in the outcome cluster of MI, ischemic stroke (IS) or vascular death in the total cohort of patients with atherothrombosis. The primary analysis defined in the protocol was the intention-to-treat (ITT) analysis of the unadjusted primary endpoint based on all randomized patients, with a two-sided 5% significance level and 90% power to detect an overall relative risk reduction (RRR) of 11.6% [9, 22, 48].

The ITT analysis showed a statistically significant (P = 0.043) RRR of 8.7% in favor of clopidogrel (95% confidence interval [CI] 0.3 to 16.5), which was consistent with the pre-specified threshold (11.6% RRR) of the study design. In an additional analysis, the CAPRIE investigators separately examined the effect of the treatments on the primary outcome in each of the three strata. This analysis suggested a nominal difference in RRR among the disease groups, and borderline statistical significance (P = 0.042) of the alleged heterogeneity was detected by the interaction between treatment effect and subgroups (Figure 3.1).


Figure 3.1: RRR and 95% CI by disease subgroups in CAPRIE trial.Figure reproduced from CAPRIE Steering Committee [22] with permission of The Lancet

3.2 STarT Back

The STarT Back trial is an RCT comparing stratified primary care management according to the patient's prognosis with current best care for patients with low back pain. Patients were randomly assigned to two treatment arms, a targeted and a non-targeted treatment arm, using block randomization with an allocation ratio of 2:1. In the non-targeted treatment arm, all patients were treated with current best care only, whereas patients in the targeted treatment arm were classified by the STarT Back Screening Tool into three risk groups, i.e., low-, medium- and high-risk subgroups, and treated with best current care plus a clinical session and/or a further physiotherapy-led treatment session, as shown in Figure 3.2. The primary objective was to establish whether stratifying patients using a novel tool, in combination with targeted treatments, is better than best current care at reducing long-term disability from low back pain. Other objectives were also considered, such as testing the differences within each of the three risk groups, e.g., whether clinical outcomes are non-inferior in the low-risk patients, and superior in the medium- and high-risk groups, compared to current best care.

In total, 851 patients were randomly assigned to the targeted arm (568) and the non-targeted arm (283) with an allocation ratio of 2:1. The targeted arm comprised three risk groups: low risk (26%), medium risk (46%) and high risk (28%). The primary outcome, the treatment effect measured by the Roland and Morris Disability Questionnaire (RMDQ) score, showed a significant improvement in the targeted treatment arm both at 4 months follow-up, with an effect size of 0.32 (95% CI: 0.19-0.45), and at 12 months follow-up, with an effect size of 0.19 (95% CI: 0.04-0.33). A range of secondary outcome measures, including physical and emotional functioning, pain intensity, quality of life, days off work, global improvement ratings and treatment satisfaction, also favored the stratified management approach [49, 53].

Figure 3.2: Trial schema for STarT Back.Figure reproduced from Hay et al. [49] with permission of BioMed Central

3.3 FOCUS4

Due to the pressing need for more efficient trial designs for biomarker-stratified clinical trials, Kaplan et al. [69] suggested a new approach to trial design that links the evaluation of novel treatments with the evaluation of biomarkers within a confirmatory phase II/III trial, which can be considered a type of umbrella


trial as introduced in Section 2.2.5 (Figure 2.10). The FOCUS4 trial is a multi-arm multi-stage trial using this approach in patients suffering from inoperable advanced or metastatic colorectal cancer, stratified by multiple biomarkers and treated with corresponding targeted treatments. The primary objective is to test progression-free survival (PFS) and overall survival (OS) in the phase II and phase III stages. Eligible patients who have undergone standard chemotherapy for 16 weeks are first offered a biomarker screening test and allocated into five cohorts according to the order of the biomarkers: BRAF, PIK3CA, KRAS

or NRAS, EGFR and unclassified. Patients in each cohort were randomly assigned to either the placebo or the targeted treatment arm using minimization with an allocation ratio of 1:2. In the targeted treatment arm, five different pre-specified interventions are given to the different cohorts according to the biomarkers, i.e., a specific BRAF-mutated kinase inhibitor in combination with panitumumab and/or a MEK inhibitor, dual PI3K/mTOR

inhibitor mono-therapy, dual-pathway inhibition using AKT and MEK inhibitor, HER1, HER2, and HER3

inhibitors and Capecitabine respectively (3.3). Each cohort includes four stages: two phase II trials withprimary outcome PFS, two phase III trials with primary outcome PFS and OS [111].

The FOCUS4 design was developed to provide a more efficient framework for trials with biomarkers as predictors of response to the new agents (or combinations) and to include promising, but not yet validated, biomarkers and their corresponding targeted agents. To the knowledge of the investigators of this study, it is the first test of a protocol that assigns all patients with metastatic colorectal cancer to one of a number of parallel population-enriched, biomarker-stratified randomized trials [68, 69].

Figure 3.3: Trial schema for FOCUS4. Figure reproduced from Kaplan et al. [69].


Chapter 4

A comparison of multiple testing procedures for testing both the overall and one subgroup specific effect in confirmatory clinical trials

4.1 Background

Both exploratory and confirmatory clinical trials are crucial in drug development. Unlike exploratory trials, a confirmatory trial is an adequately controlled trial in which the hypotheses are stated in advance and then evaluated [63]. As a rule, confirmatory trials are necessary to provide firm evidence of efficacy or safety. Thus, the key hypothesis of interest in such trials follows directly from the primary objective, is always pre-defined, and is subsequently tested once the trial is complete.

Subgroup analyses in confirmatory clinical trial designs have recently drawn considerable attention in situations where the aim is to establish efficacy claims of the tested drug in the overall population or in one pre-specified subgroup defined by a baseline genomic or clinical classifier. Such a subgroup tends to be underpowered compared to the overall population because of the smaller number of patients included [3, 32, 52]. To deal with this problem, many multiple testing procedures have been introduced in the literature over the last decade, including the fallback procedures [4, 121] and other multiple testing procedures [74, 105]. In essence, these procedures allocate a larger share of the significance level to the overall population and a smaller share to the subgroup, reflecting their ranked clinical importance: the overall population is of primary interest and the subgroup is of secondary interest.

Zhao et al. [128] showed that some procedures can be systematically expressed in terms of closed testing procedures [80] using suitable functions that determine how to propagate local significance levels to a pre-specified subgroup based on the result from the overall population. They termed this family of procedures 'feedback procedures'. As the feedback procedures follow the closure principle, they meet the minimum requirement for confirmatory trials addressing multiple study populations, namely strong control of the family-wise error rate (FWER): the probability of erroneously rejecting at least one true null hypothesis is controlled at a pre-specified significance level α under any configuration of true and false null hypotheses.

In this chapter, we aim to investigate the feedback procedures in more detail and compare them with procedures based on weighting strategies, e.g., the weighted Bonferroni test, the weighted Holm procedure and the weighted parametric procedure, as well as extensions of these procedures.

4.2 Methods

4.2.1 Notation and hypotheses

Let N denote the total sample size in a study. The patient population is formed by the targeted subgroup with proportion k and the complementary subgroup with proportion 1 − k. The sample sizes in the two groups are n+ = Nk and n− = N(1 − k), respectively.

In our setting of testing both the overall and a subgroup-specific treatment effect in confirmatory clinical trials, two hypotheses are tested either sequentially or simultaneously, i.e., the treatment effect is tested in both the overall population and a pre-specified targeted subgroup. We consider the two null hypotheses Ho and H+ defined in Chapter 2, Section 2.4, i.e., testing treatment efficacy in the overall population and in the sensitive subgroup, respectively. They are given by

H0o : θo ≤ 0 versus H1o : θo > 0

and

H0+ : θ+ ≤ 0 versus H1+ : θ+ > 0

4.2.2 Feedback procedures and extensions

To compare an experimental treatment with a standard treatment in a confirmatory clinical trial with two hypotheses, Zhao et al. [128] proposed the term 'feedback procedures' for a family of multiple testing procedures that extends the fallback procedures [60, 121, 122]. All multiple testing procedures belonging to this family share a special relationship between the two test statistics: one test statistic provides feedback to the other by determining, through a so-called 'α-spending function', the critical value of the latter. Being parametric multiple testing procedures, the decision rules of feedback procedures take into account the joint distribution of the test statistics associated with the two hypotheses [2, 35].


4.2.2.1 General rejection rules

In the setting of Section 4.2.1, we can test the treatment effects in both the overall population and the targeted subgroup, i.e., the two hypotheses Ho and H+, using feedback procedures. The overall hypothesis is tested at a pre-specified level α1 (0 ≤ α1 ≤ α) and provides feedback for the subgroup test. The critical value for the subgroup test is calculated from an α-spending function f(po) using the overall p-value po, where f should be pre-specified up to a free parameter determined by the joint distribution of the test statistics. With a one-sided FWER α, e.g., α = 0.025, the rejection rules for Ho and H+ are given by

• Reject Ho if (1) po ≤ α1, or (2) α1 < po ≤ α and p+ ≤ α f(po).

• Reject H+ if (1) po ≤ α1 and p+ ≤ α, or (2) α1 < po and p+ ≤ α f(po).
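The two rejection rules can be sketched in a few lines of code. The following is a minimal illustration; the function names and the example constant spending function (with the arbitrary value 0.5) are assumptions for illustration, not taken from Zhao et al. [128]:

```python
# Sketch of the feedback rejection rules; `f` is any valid alpha-spending
# function (pre-specified up to a free parameter, as described in the text).
def feedback_decisions(po, pp, f, alpha=0.025, alpha1=0.0125):
    """Return (reject H_o, reject H_+) for the overall/subgroup p-values."""
    reject_o = po <= alpha1 or (alpha1 < po <= alpha and pp <= alpha * f(po))
    reject_p = (po <= alpha1 and pp <= alpha) or (po > alpha1 and pp <= alpha * f(po))
    return reject_o, reject_p

# Illustrative constant alpha-spending function (the value 0.5 is arbitrary
# here; in practice it is determined by the joint null distribution):
constant_f = lambda po: 1.0 if po < 0.0125 else 0.5
```

For instance, with po = 0.02 and p+ = 0.01, the subgroup is tested at α·f(po) = 0.0125, so both hypotheses are rejected, whereas po = p+ = 0.02 rejects neither.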

The α-spending function f(po) determines the significance level for testing the null hypothesis H+. It has two basic properties: (1) f(po) = 1 if 0 ≤ po < α1, and 0 ≤ f(po) ≤ 1 if α1 ≤ po ≤ 1; (2) P(po > α1, p+ ≤ α f(po)) = α − α1 under the global null hypothesis Ho ⋂ H+.

As the feedback procedure fulfills the closure principle [80], it can be presented as a closed testing procedure, hence the FWER is controlled in the strong sense. The closed testing presentation of feedback procedures is displayed in Table 4.1.

Table 4.1: The closed testing presentation of feedback procedures.

Intersection     Rejection rule (Local test)
H12 = Ho ⋂ H+    po ≤ α1, or p+ ≤ α f(po)
H1 = Ho          po ≤ α
H2 = H+          p+ ≤ α

The closed testing procedure rejects a null hypothesis if all intersection hypotheses containing this particular null hypothesis are rejected by their local tests. So to reject either H1 or H2, the intersection hypothesis Ho ⋂ H+, denoted by H12, must be tested and rejected first. In other words, if H12 is not rejected, neither H1 nor H2 can be rejected and the closed testing procedure stops. From the second property of f(po), it can be shown that the intersection null hypothesis Ho ⋂ H+ is tested at level α, as the following derivation demonstrates. Both tests in the second stage are carried out at the full α level, thus the feedback procedures are α-exhaustive.

P(po ≤ α1 or p+ ≤ α f(po))
= P(po ≤ α1) + P(po > α1, p+ ≤ α f(po))
= α1 + (α − α1)
= α


Several parametric multiple testing procedures proposed in the literature belong to the family of feedback procedures, including the parametric fallback procedure [60], its extension developed in Alosh and Huque [4], and the 4A procedure [74]. These special cases of feedback procedures have different α-spending functions, but the functions can be categorized into two types, namely constant and non-constant functions. For instance, the 4A procedure can be considered a special case of a feedback procedure with the non-constant function

f(x) = min(c/x², α1), α1 ≤ x ≤ 1,

where c = c(α1) is selected to satisfy Property 2, and the parametric fallback procedure proposed in [5] can be considered a feedback procedure with the constant α-spending function

f(x) = c(α1), α1 ≤ x ≤ 1,

where c(α1) is chosen to satisfy Property 2.

Table 4.2 shows the closed testing presentation of this example, where α1 is a pre-specified reduced significance level, i.e., 0 ≤ α1 ≤ α, and α2 = c(α1) is calculated from the bivariate normal distribution. We will investigate this procedure in detail and compare it with the other procedures considered in the next sections.

Table 4.2: The closed testing presentation of an example of feedback procedure.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ α1 or p+ ≤ α2
II     H1 = Ho          po ≤ α
II     H2 = H+          p+ ≤ α
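Property 2 pins the constant spending level down numerically. The sketch below solves P(po > α1, p+ ≤ α2) = α − α1 for α2, under the assumption that, under the global null, (Zo, Z+) is standard bivariate normal with correlation √k (which follows from Zo = √k Z+ + √(1−k) Z−); the function name is illustrative:

```python
# Numerical sketch for the constant alpha-spending function of Table 4.2.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def constant_spending_alpha2(alpha, alpha1, k):
    """Solve P(p_o > alpha1, p_+ <= alpha2) = alpha - alpha1 for alpha2."""
    rho = np.sqrt(k)
    bvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    zo = norm.ppf(1 - alpha1)                    # p_o > alpha1  <=>  Z_o < zo

    def excess(alpha2):
        zp = norm.ppf(1 - alpha2)                # p_+ <= alpha2 <=>  Z_+ >= zp
        prob = norm.cdf(zo) - bvn.cdf([zo, zp])  # P(Z_o < zo, Z_+ >= zp)
        return prob - (alpha - alpha1)

    return brentq(excess, 1e-10, alpha)          # solution lies in (0, alpha)
```

With positive correlation the resulting α2 lies strictly between α1 and α; for ρ → 1 it approaches α, recovering the fixed sequence limit.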

4.2.2.2 Song-Chi procedure

Song and Chi [105] proposed an extension of the simple feedback procedure described in Section 4.2.2.1 by introducing an extra parameter α∗1, which is sometimes suggested for ethical or regulatory considerations. It represents a consistency constraint on the efficacy in the overall population when testing the subgroup-specific effect, ensuring that the treatment effects in both populations point in the same direction. The procedure can be described as follows:

• Reject Ho if (1) po ≤ α1, or (2) α1 < po ≤ α∗1 and p+ ≤ α2 and po ≤ α .

• Reject H+ if (1) po ≤ α1 and p+ ≤ α , or (2) α1 < po ≤ α∗1 and p+ ≤ α2 and p+ ≤ α .

The Song-Chi procedure fulfills the closure principle; its closed testing presentation is shown in Table 4.3. The first-stage test for the intersection hypothesis H12 = Ho ⋂ H+ can be explained as follows:

• If po ≤ α1, H12 is rejected.


Table 4.3: The closed testing presentation of Song-Chi procedure.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ α1, or α1 < po ≤ α∗1 and p+ ≤ α2
II     H1 = Ho          po ≤ α
II     H2 = H+          p+ ≤ α

• If po > α∗1 , H12 is not rejected.

• If α1 < po ≤ α∗1 , H12 is rejected if and only if p+ ≤ α2 in the test in targeted subgroup.

where po and p+ are the p-values calculated from the test statistics of the two populations. Two parameters have to be pre-specified before testing, namely α1 and α∗1. α1 is the reduced α level at which the overall population is tested within the intersection hypothesis H12. If the treatment effect in the overall population is of primary interest, α1 should be chosen close to α. The consistency constraint α∗1 should be chosen in the range α ≤ α∗1 ≤ 1. When α∗1 = 1, the Song-Chi procedure reduces to the procedure in Table 4.2. Let Zo, Z+ and Z− be the standardized test statistics for the overall population, the targeted subgroup and the complementary subgroup, respectively, and let k be the proportion of the sample size in the targeted subgroup. As a parametric procedure, Song-Chi takes into account the correlation between the test statistic of the overall population Zo′ and that of the targeted subgroup Z+, where Zo′ = √k Z+ + √(1−k) Z− is computed from the two subgroups. For simplicity, a simplified version was proposed under a given alternative hypothesis with Z+ ∼ N(µ, 1) and Z− ∼ N(µ′, 1), where µ and µ′ are related to k. The significance level α2 of the subgroup test is determined by

∫_{z_{α∗1}}^{z_{α1}} Φ((z_{α2} − √k z1)/√(1−k)) φ(z1) dz1 = α∗1 − α
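Since Φ((z_{α2} − √k z1)/√(1−k)) is the conditional probability P(p+ > α2 | Zo = z1), the defining condition can equivalently be written as P(α1 < po ≤ α∗1, p+ ≤ α2) = α − α1 under the global null, which can be solved numerically for α2. A sketch (the function name is illustrative):

```python
# Numerical sketch of solving for the Song-Chi subgroup level alpha2.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.optimize import brentq

def song_chi_alpha2(alpha, alpha1, alpha1_star, k):
    """Subgroup significance level alpha2 of the Song-Chi procedure."""
    lo, hi = norm.ppf(1 - alpha1_star), norm.ppf(1 - alpha1)  # z-quantiles
    target = alpha - alpha1

    def joint(alpha2):
        zp = norm.ppf(1 - alpha2)
        # P(Z_+ >= zp | Z_o = z1), integrated over z1 in (z_{a1*}, z_{a1})
        integrand = lambda z1: norm.sf((zp - np.sqrt(k) * z1)
                                       / np.sqrt(1 - k)) * norm.pdf(z1)
        return quad(integrand, lo, hi)[0]

    return brentq(lambda a2: joint(a2) - target, 1e-8, 0.5)

print(round(song_chi_alpha2(0.025, 0.0125, 0.1, 0.5), 4))  # compare Table 4.9: 0.0264
```

The printed value can be checked against the Song-Chi column of Table 4.9 for k = 0.5, wo = 0.5 and α∗1 = 0.1.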

The adjusted p-values, denoted by qo and q+, can be calculated for the hypotheses Ho and H+ as

qo = max(q, po), q+ = max(q, p+),

where q, the adjusted p-value for testing H12 at the first stage, is constructed as

q = po                                                              if po ≤ α1,
q = α∗1 − ∫_{z_{α∗1}}^{z_{α1}} Φ((z2 − √k z1)/√(1−k)) φ(z1) dz1      if α1 < po ≤ α∗1,
q = po                                                              if po > α∗1.

4.2.3 Procedures with weighted FWER-controlling methods

The weighted FWER-controlling methods described by Westfall et al. [119] are useful for testing a family of null hypotheses Hi with or without a pre-specified ordering, e.g., various ranked patient outcomes or several dose levels being tested in clinical trials. The weights may be assigned according to the importance of the various hypotheses, regardless of whether larger or smaller treatment effects are anticipated [123]. Closed testing procedures based on weighted Bonferroni tests have recently drawn much attention for analyzing multiple outcomes; they can be used in either single-step or stepwise procedures [14, 33, 58].

Following the classification of multiple testing procedures in Section 2.3.3, we investigate four weighted procedures based on the closure principle: a single-step procedure (the weighted Bonferroni test), a stepwise procedure with pre-specified hypothesis ordering (the fallback procedure), a step-down procedure with data-driven hypothesis ordering (the weighted Holm procedure), and a parametric procedure (the weighted parametric procedure).

4.2.3.1 Weighted Bonferroni test

The weighted Bonferroni test, discussed in Rosenthal and Rubin [95], is the simplest weighted multiple testing procedure. For m elementary hypotheses H1, ..., Hm, the weighted Bonferroni test first divides the overall significance level α into m portions with pre-specified weights wi, i = 1, ..., m, where ∑wi = 1. Each elementary hypothesis is then tested and Hi is rejected if pi ≤ wiα. The weighted Bonferroni test is a single-step procedure and rather conservative if the number of hypotheses is large or the test statistics are strongly correlated.

For the two-hypotheses situation, i.e., testing the overall population and the targeted subgroup as described in Section 4.2.1, we reject Ho if po ≤ woα and H+ if p+ ≤ w+α, where wo and w+ are the pre-specified weights of the two hypotheses and wo + w+ = 1. Since there is only one stage and no α-propagation, the weighted Bonferroni test can be expressed in closed testing presentation by using the same rejection rules in both stages, as shown in Table 4.4 below.

Table 4.4: The closed testing presentation of weighted Bonferroni test for two hypotheses.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ woα or p+ ≤ w+α
II     H1 = Ho          po ≤ woα
II     H2 = H+          p+ ≤ w+α

4.2.3.2 Fallback procedure

As an extension of the fixed sequence procedure and a simple gatekeeping procedure, the fallback procedure introduced by Wiens [121] addresses the major drawback of the fixed sequence procedure by allowing one to test all hypotheses in the pre-specified sequence even if previous hypotheses have not been rejected. It can be considered a procedure combining the weighted Bonferroni test and the fixed sequence procedure.

The fallback procedure is a stepwise procedure, which tests the hypotheses in a pre-specified order with pre-specified weights. It starts with the first hypothesis H1 in the order at level α1 = αw1, rejecting H1 if p1 ≤ α1 and retaining it otherwise. It then continues with the next hypotheses Hi, i = 2, ..., m, at level αi = αi−1 + αwi if Hi−1 was rejected, and at αi = αwi if Hi−1 was retained. A hypothesis Hi can thus be tested at a level exceeding its initially allocated significance level αwi, as long as the previous null hypothesis was rejected, with the Type I error being accumulated; this process is called α-propagation. The fallback procedure is more powerful than the weighted Bonferroni test with the same set of pre-specified weights. If w1 = 1 and wi = 0 for i > 1, it reduces to the fixed sequence procedure. Because the hypothesis ordering is pre-specified [121], the procedure may have advantages in testing more important hypotheses but disadvantages in testing less important ones [122].

The fallback procedure can be extended to a closed testing procedure. The closed testing fallback procedure under the setting of Section 4.2.1 is displayed in Table 4.5. The pre-specified order is Ho first, followed by H+. In the first stage, we perform two tests simultaneously for the intersection hypothesis H12 = Ho ⋂ H+ with the same decision rules as the weighted Bonferroni test; in the second stage, we test H+ at the full α level but Ho at the same significance level as in the first stage, due to the pre-specified order. This shows that the closed fallback procedure is not α-exhaustive, since not all intersection hypotheses are tested at the full α level [122, 123].

Table 4.5: The closed testing presentation of fallback procedure for two hypotheses.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ woα or p+ ≤ w+α
II     H1 = Ho          po ≤ woα
II     H2 = H+          p+ ≤ α

4.2.3.3 Weighted Holm procedure

Holm [57] extended the weighted Bonferroni test into a step-down procedure. A step-down procedure starts with the most significant p-value and continues through the hypotheses sequentially, until all hypotheses are rejected or one hypothesis is retained. The weighted Holm procedure first defines the weighted p-values p̃i = pi/wi and orders them as p̃(1) ≤ ··· ≤ p̃(m), where p̃(j) = p̃ij and ij denotes the index of the j-th ordered weighted p-value. Then define the sets Sj = {ij, ..., im}, j = 1, ..., m, and let Hw(j) denote the hypothesis corresponding to p̃(j). The weighted Holm procedure rejects Hw(j) if p̃(i) ≤ α/∑h∈Si wh for all i = 1, ..., j. It reduces to the ordinary Holm procedure when the weights are equal.
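The step-down rule above can be implemented compactly for general m (the function name is illustrative):

```python
# Sketch of the weighted Holm step-down procedure.
import numpy as np

def weighted_holm(pvals, weights, alpha=0.025):
    """Boolean rejection indicators of the weighted Holm procedure."""
    p, w = np.asarray(pvals, float), np.asarray(weights, float)
    order = np.argsort(p / w)               # ascending weighted p-values
    reject = np.zeros(len(p), dtype=bool)
    for step, idx in enumerate(order):
        remaining = w[order[step:]].sum()   # sum of weights over the set S_j
        if p[idx] / w[idx] <= alpha / remaining:
            reject[idx] = True
        else:
            break                           # retain this and all later hypotheses
    return reject
```

With wo = w+ = 0.5 and p-values (0.0135, 0.0115), both hypotheses are rejected: the smaller weighted p-value 0.023 passes the first threshold α = 0.025, and the second is then compared against α/w = 0.05.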

An extension of the weighted Holm procedure based on the closed testing principle can be performed in two stages; it controls the FWER strongly at level α [36]. Table 4.6 shows the closed testing weighted Holm procedure for testing the overall population and the targeted subgroup as described in Section 4.2.1. In the first stage, we test the two local hypotheses simultaneously in the intersection hypothesis H12 = Ho ⋂ H+ with the same decision rules as the weighted Bonferroni test. If the intersection hypothesis is rejected, the significance levels of the second stage are propagated to the full α level.


Table 4.6: The closed testing presentation of weighted Holm procedure for two hypotheses.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ woα or p+ ≤ w+α
II     H1 = Ho          po ≤ α
II     H2 = H+          p+ ≤ α

4.2.3.4 Weighted parametric procedure

The weighted parametric procedure introduced in Bretz et al. [16] is a stepwise procedure using a weighted parametric test, and can be considered a parametric extension of the weighted Holm procedure. In an intersection hypothesis HJ, after assigning a weight wj(J), j ∈ J, with ∑j∈J wj(J) = 1 to each local hypothesis, a weighted min-p test can be defined if the joint distribution of the p-values pj is known [117, 118]. HJ is rejected when pj ≤ cJ wj(J) α for at least one j ∈ J, where cJ is the largest constant satisfying

P_HJ ( ⋃_{j∈J} { pj ≤ cJ wj(J) α } ) ≤ α,

and cJ can be computed from the joint null distribution of the p-values. When cJ = 1, the weighted parametric procedure reduces to the weighted Bonferroni test. However, because the joint distribution of the local tests of each intersection hypothesis is taken into account, the weighted parametric procedure is uniformly more powerful than the corresponding weighted Bonferroni test [36]. After an intersection hypothesis is rejected, the procedure continues to the next stages.

For the case of only two hypotheses, we test the hypotheses of Section 4.2.1 simultaneously using the weighted parametric test. We use a closed testing procedure in order to control the Type I error in the strong sense. The rejection rules of the weighted parametric procedure in closed testing presentation are listed in Table 4.7. Assume the pre-specified weights of the tests for the overall and targeted populations are wo and w+, with wo + w+ = 1. In fact, the only difference between the weighted parametric procedure and the weighted Holm procedure lies in the intersection hypothesis H12 = Ho ⋂ H+, where c is determined by the joint distribution of the two test statistics and is usually larger than 1.

Table 4.7: The closed testing presentation of weighted parametric procedure for two hypotheses.

Stage  Intersection     Rejection rule (Local test)
I      H12 = Ho ⋂ H+    po ≤ cwoα or p+ ≤ cw+α
II     H1 = Ho          po ≤ α
II     H2 = H+          p+ ≤ α
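For the two-hypothesis case, c can be computed by solving the size condition numerically, again assuming corr(Zo, Z+) = √k under the joint null; a sketch (the function name is illustrative):

```python
# Numerical sketch: the critical constant c of Table 4.7.
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def parametric_c(alpha, wo, k):
    """Largest c with P(p_o <= c*wo*alpha or p_+ <= c*(1-wo)*alpha) = alpha."""
    rho = np.sqrt(k)
    bvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

    def size(c):
        zo = norm.ppf(1 - c * wo * alpha)        # critical value for Z_o
        zp = norm.ppf(1 - c * (1 - wo) * alpha)  # critical value for Z_+
        return 1 - bvn.cdf([zo, zp])             # P(reject H12) under the null

    # c = 1 gives the (conservative) Bonferroni size; c = 1/max(w) overshoots.
    return brentq(lambda c: size(c) - alpha, 1.0, 1.0 / max(wo, 1 - wo))

c = parametric_c(0.025, 0.5, 0.5)
print(round(c * 0.5 * 0.025, 4))  # compare Table 4.9 (WP, k = 0.5, wo = 0.5): 0.0147
```

The resulting local level cwoα can be checked against the weighted parametric column of Table 4.9.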


4.2.4 Comparison of five procedures

All five procedures considered so far test the two hypotheses Ho (overall population) and H+ (targeted subgroup) following the closed testing principle. The closed testing principle requires that the intersection hypothesis H12 be rejected first; the local hypotheses H1 and H2 are then tested, with or without a propagation of the significance level. Table 4.8 shows the rejection rules of the local tests in each stage for the five approaches considered.

Both the Song-Chi procedure and the weighted parametric procedure are parametric procedures: they take the correlation of the test statistics into account and assume that the test statistics follow a certain distribution, here approximately a bivariate normal distribution. The correlation depends on the sample size of the targeted subgroup relative to the overall population. The main difference between the two procedures lies in the pre-specified parameters, which lead to different ways of calculating the rejection regions for the intersection hypothesis in the first stage. In the weighted parametric procedure, the weights of the two elementary tests are pre-specified and the critical value c is calculated from the bivariate normal distribution, so the rejection criteria of the tests in both the overall population and the targeted subgroup are not fixed. In contrast, the Song-Chi procedure fixes and pre-specifies the test level of the overall population (α1); only the test level of the targeted subgroup is calculated from the bivariate normal distribution, via a formula incorporating a pre-specified consistency constraint α∗1. Once H12 is rejected, the rejection rules in the second stage are the same for both the Song-Chi and the weighted parametric procedure, since all test statistics are propagated to the full α level.

The three non-parametric procedures considered, i.e., the weighted Bonferroni test, the weighted Holm procedure and the fallback procedure, do not take the correlation between the test statistics of the two hypotheses into account; with the same pre-specified weights they have exactly the same rejection rules and rejection regions in the first stage, testing the intersection hypothesis H12. The main difference among these three procedures lies in the second stage. The weighted Bonferroni test is a single-step procedure; it can be expressed in closed testing presentation with the same significance levels in both stages and without α-propagation in the second stage. The latter two procedures are stepwise procedures and are more powerful in general. Due to the pre-specified order of the hypotheses, the fallback procedure can only propagate the significance level to the later hypothesis in the second stage if the earlier hypothesis was rejected in the first stage, but not vice versa. The weighted Holm procedure, by contrast, is data-driven [34, 35] and more flexible regarding α-propagation than the fallback procedure: it propagates the significance level in both directions, regardless of which hypothesis is rejected first. The weighted Holm procedure can be considered a special case of the weighted parametric procedure with critical value c = 1.

Table 4.8: Summary of the rejection rules of the local tests at each stage for two hypotheses using the five methods.

Procedures           H12                                     H1        H2
Song-Chi             po ≤ α1, or α1 < po ≤ α∗1 and p+ ≤ α2   po ≤ α    p+ ≤ α
Weighted parametric  po ≤ cwoα or p+ ≤ cw+α                  po ≤ α    p+ ≤ α
Weighted Bonferroni  po ≤ woα or p+ ≤ w+α                    po ≤ woα  p+ ≤ w+α
Fallback             po ≤ woα or p+ ≤ w+α                    po ≤ woα  p+ ≤ α
Weighted Holm        po ≤ woα or p+ ≤ w+α                    po ≤ α    p+ ≤ α

Figure 4.1 shows the rejection regions of the intersection hypothesis in the first stage for each procedure with equal subgroup sizes and wo = 0.5, 0.8. All three nonparametric procedures, i.e., the weighted Bonferroni test, the weighted Holm and the fallback procedure, have the same rejection rules if the pre-specified weights are the same, and thus also the same rejection regions (red areas in Figure 4.1). The blue area is wider than the red area, which indicates that the rejection regions of the weighted parametric procedure are wider than those of the weighted Bonferroni test and the fallback procedure. For the Song-Chi procedure, we pre-specify α∗1 = 0.1 and set α1 = woα for comparability with the other procedures. We can see from Figure 4.1 that the rejection region of H+ is narrow, as expected, and that the black area of H+ stops at po = 0.1. Table 4.9 shows selected numerical significance levels of the two local hypotheses (α1, α2) for the five procedures depicted in Figure 4.1, i.e., the values with which po and p+ are compared. For instance, when wo = w+ = 0.5, the significance levels for the three nonparametric procedures are α1 = α2 = 0.0125, for the weighted parametric procedure α1 = α2 = 0.0147, and for the Song-Chi procedure α1 = 0.0125 and α2 = 0.0264 with α∗1 = 0.1.

Figure 4.1: Rejection region of the intersection hypothesis H12 for k = 0.5 and wo = 0.5, 0.8. Note that the X- and Y-axes are truncated at 0.2.

To investigate the influence of the targeted sample size, we considered two additional scenarios with unequal group sizes, i.e., proportions of the targeted subgroup relative to the overall population of k = 0.25 and k = 0.75. Figure 4.2 shows the corresponding rejection regions of the intersection hypothesis H12. The rejection regions of the nonparametric procedures (weighted Bonferroni test, weighted Holm and fallback procedures) are the same for given wo and w+ regardless of how k changes, since they are only influenced by the weights of the two null hypotheses and not by the proportions of the subgroups. However, as parametric procedures, the rejection regions of the Song-Chi and weighted parametric procedures vary with k, because they take the correlation between the two test statistics into account, despite their different ways of incorporating the correlation and computing critical values. For the weighted parametric procedure, the rejection regions of both hypotheses become larger when the proportion of the subgroup is larger, as can be seen from Figure 4.2: the blue areas are much wider for k = 0.75 than for k = 0.25, e.g., with wo = 0.5, α1 = α2 = 0.0135 for k = 0.25 and α1 = α2 = 0.0167 for k = 0.75. On the contrary, for the Song-Chi procedure the rejection region of Ho is fixed for a given pre-specified wo, since α1 is computed from wo, but the rejection region of H+ gets narrower when k or wo becomes larger, e.g., with wo = 0.5 and α1 = 0.0125, we obtain from the formula α2 = 0.0398, 0.0264, 0.0226 for k = 0.25, 0.5, 0.75, respectively. Numerical results of the rejection regions for all five procedures with different pre-specified α1 and proportions k can be found in Table 4.9.

Table 4.9: Significance levels α1 (α2) of the five procedures for different subgroup proportions k and weights wo, with α∗1 = 0.1.

k     wo    NP               WP               SC
0.25  0.2   0.005 (0.02)     0.0053 (0.0210)  0.005 (0.0599)
      0.33  0.0083 (0.0167)  0.0089 (0.0178)  0.0083 (0.0515)
      0.5   0.0125 (0.0125)  0.0135 (0.0135)  0.0125 (0.0398)
      0.67  0.0167 (0.0083)  0.0178 (0.0089)  0.0167 (0.0272)
      0.8   0.02 (0.005)     0.0210 (0.0053)  0.02 (0.0165)
0.5   0.2   0.005 (0.02)     0.0056 (0.0222)  0.005 (0.0363)
      0.33  0.0083 (0.0167)  0.0096 (0.0193)  0.0083 (0.0324)
      0.5   0.0125 (0.0125)  0.0147 (0.0147)  0.0125 (0.0264)
      0.67  0.0167 (0.0083)  0.0193 (0.0096)  0.0167 (0.0193)
      0.8   0.02 (0.005)     0.0222 (0.0056)  0.02 (0.0128)
0.75  0.2   0.005 (0.02)     0.006 (0.0238)   0.005 (0.0271)
      0.33  0.0083 (0.0167)  0.0107 (0.0215)  0.0083 (0.0255)
      0.5   0.0125 (0.0125)  0.0167 (0.0167)  0.0125 (0.0226)
      0.67  0.0167 (0.0083)  0.0215 (0.0107)  0.0167 (0.0185)
      0.8   0.02 (0.005)     0.0238 (0.006)   0.02 (0.0141)

NP: non-parametric procedures, i.e., weighted Bonferroni, weighted Holm and fallback; WP: weighted parametric procedure; SC: Song-Chi procedure.

Figure 4.2: Rejection region of the intersection hypothesis H12 for k = 0.25, 0.75 and wo = 0.5, 0.8. Note that the X- and Y-axes are truncated at 0.2.

In the Song-Chi procedure, the two pre-specified parameters α1 and α∗1, i.e., the pre-specified significance level of Ho and the consistency constraint, both have an impact on the significance level α2 of H+. We first choose α∗1 = 0.1, the same value as in the example given in Song and Chi [105]. To further investigate how α∗1 influences the performance of the Song-Chi procedure, we also consider two more choices: (1) α∗1 = 1, i.e., the simple feedback procedure introduced in Section 4.2.2.1, and (2) α∗1 equal to a fixed multiple of α1, here α∗1 = 10α1. The rejection regions of the Song-Chi procedure for the two choices of α1 = woα (0.0125, 0.02) and the three choices of α∗1, compared to the weighted parametric procedure, are shown in Figure 4.3. We can see from Figure 4.3 that, for the same pre-specified α1, the smaller α∗1 is, the larger the significance level of H+ becomes. When α∗1 = 1, the consistency constraint vanishes, so the procedure behaves similarly to the other parametric procedures with pre-specified significance level for Ho, e.g., the 4A or parametric fallback procedures. Compared to the weighted parametric procedure, Song-Chi without consistency constraint (α∗1 = 1) has a narrower rejection region for Ho but a wider rejection region for H+. For example, with wo = 0.8 and k = 0.5, the significance levels of the weighted parametric procedure are α1 = 0.0222 and α2 = 0.0056, whereas for Song-Chi α1 = 0.02 and α2 = 0.0128, 0.0087 and 0.0098 for α∗1 = 0.1, 1 and 0.2 (= 10α1), respectively. Numerical results of the rejection regions for the Song-Chi procedure with α∗1 = 0.1, 1 or 10α1 and different pre-specified α1 can be found in Table 4.10.


Figure 4.3: Rejection region of the intersection hypothesis H12 for k = 0.5 and wo = 0.5, 0.8. Note that the X- and Y-axes are truncated at 0.2; WP denotes the weighted parametric procedure; SC01, SC1 and SC10 denote the Song-Chi procedure with α∗1 = 0.1, 1 and 10α1, respectively.

Table 4.10: Significance levels of Song-Chi procedure for α∗1 = 0.1, 1 or 10α1 with different k and wo.

k     wo    α1      α2 (α∗1 = 0.1)  α2 (α∗1 = 1)  α2 (α∗1 = 10α1)
0.25  0.2   0.005   0.0599          0.0213        0.1341
      0.33  0.0083  0.0515          0.0184        0.0617
      0.5   0.0125  0.0398          0.0145        0.0327
      0.67  0.0167  0.0272          0.010         0.018
      0.8   0.02    0.0165          0.006         0.0099
0.5   0.2   0.005   0.0363          0.0226        0.0676
      0.33  0.0083  0.0324          0.0203        0.0367
      0.5   0.0125  0.0264          0.0168        0.0233
      0.67  0.0167  0.0193          0.0126        0.0153
      0.8   0.02    0.0128          0.0087        0.0098
0.75  0.2   0.005   0.0271          0.0241        0.0376
      0.33  0.0083  0.0255          0.0228        0.0269
      0.5   0.0125  0.0226          0.0203        0.0216
      0.67  0.0167  0.0185          0.0168        0.0172
      0.8   0.02    0.0141          0.0130        0.0131

Finally, we look at several examples for a study with k = 0.5 and wo = w+ = 0.5, where we pre-specify α∗1 = 0.1 and obtain the following p-values:

(a) po = 0.0135 and p+ = 0.0115


(b) po = 0.0135 and p+ = 0.0135

(c) po = 0.0135 and p+ = 0.115

(d) po = 0.115 and p+ = 0.0135

For example (a), the weighted Bonferroni test rejects H+ but not Ho; the fallback procedure rejects none of the hypotheses if we specify the testing order as first Ho and then H+, and rejects both in case of the opposite order. The weighted Holm, weighted parametric and Song-Chi procedures reject both Ho and H+. For example (b), the three non-parametric procedures, i.e., the weighted Bonferroni test, the fallback procedure and the weighted Holm procedure, cannot reject any hypothesis, while the weighted parametric and Song-Chi procedures reject both Ho and H+. For example (c), the three non-parametric procedures reject neither of the two hypotheses; the weighted parametric procedure rejects Ho but not H+, and the Song-Chi procedure rejects neither Ho nor H+, due to p+ > α∗1 = 0.1. For example (d), the three non-parametric procedures reject neither of the two hypotheses, while the weighted parametric and Song-Chi procedures reject H+ but not Ho.
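For the weighted Bonferroni test and the weighted Holm procedure, the decisions in examples (a)-(d) are easy to verify directly. The following sketch (the helper names are my own; α = 0.025 and wo = w+ = 0.5 as in the examples) implements the two-hypothesis versions of these two procedures.

```python
ALPHA = 0.025

def weighted_bonferroni(p_o, p_plus, w_o, alpha=ALPHA):
    """Single-step test: each hypothesis at its weighted level."""
    return {"Ho": p_o <= w_o * alpha, "H+": p_plus <= (1 - w_o) * alpha}

def weighted_holm(p_o, p_plus, w_o, alpha=ALPHA):
    """Step-down: reject the smaller weighted p-value first; if rejected,
    the remaining hypothesis is tested at the full level alpha."""
    q = {"Ho": p_o / w_o, "H+": p_plus / (1 - w_o)}
    rej = {"Ho": False, "H+": False}
    first = min(q, key=q.get)
    if q[first] <= alpha:
        rej[first] = True
        other = "H+" if first == "Ho" else "Ho"
        p_other = p_plus if other == "H+" else p_o
        rej[other] = p_other <= alpha
    return rej

# example (a): weighted Bonferroni rejects only H+, weighted Holm rejects both
example_a = (weighted_bonferroni(0.0135, 0.0115, 0.5),
             weighted_holm(0.0135, 0.0115, 0.5))
```

The contrast in example (a) shows why the step-down procedure is more powerful: once H+ falls, Holm retests Ho at the full α = 0.025 rather than at 0.0125.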

4.3 Simulations

The simulated study was planned for only the marker-positive and marker-negative groups. Let Ygi denote the outcome of patient i in subgroup g = 0, 1, with g = 1 indicating the marker-positive subgroup, and let Tgi be the randomized treatment indicator, with T = 0 indicating the standard treatment. The data are generated according to the standard linear model:

Ygi = µg +θgTgi + εgi,

where the εgi are independently, identically normally distributed with mean 0 and variance σ². This model is fitted to the data, and the two null hypotheses are tested using the subgroup contrast θ+ and the overall contrast θo = kθ+ + (1−k)θ−. A positive effect size of 0.2 is assumed for the marker-positive group, so we choose σ = 5 and θA = 1 in the simulations. One-sided tests are used for all procedures with α = 0.025.

To achieve 90% power, a total sample size of N = 1056 is needed for testing Ho with a simple t-test in the case of θ+ = θ−. For the group-specific sample sizes, we consider two scenarios: the equal-size situation, i.e., n+ = n− = N/2 = 528 and k = 0.5, and unequal-size situations, where we assume the proportion of the targeted group to be k = 0.25 or 0.75, so that the sample size of the targeted group is n+ = 264 or 792. Considering different weights for Ho, we pre-specify wo = 0.2, 0.33, 0.5, 0.67, 0.8. In the Song-Chi procedure, we pre-specify α1 = woα and α∗1 = 0.1. For the power comparison, we consider three powers: the power to reject Ho, the power to reject H+, and the power to reject at least one of Ho and H+, denoted as Po, P+, and Pall, respectively.
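The simulation setup above can be sketched as follows; the implementation details (generating the two arms directly rather than fitting the linear model, the function names) are my own simplifications, but the z-statistics correspond to the contrasts θ+ and θo = kθ+ + (1−k)θ−.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, SIGMA, ALPHA = 1056, 5.0, 0.025

def simulate_trial(theta_pos, theta_neg, k=0.5):
    """One simulated trial under Y_gi = mu_g + theta_g*T_gi + eps_gi.
    Returns z-statistics for H+ (targeted subgroup) and Ho (overall
    contrast theta_o = k*theta_+ + (1-k)*theta_-); 1:1 randomization."""
    stats = []
    for frac, theta in ((k, theta_pos), (1 - k, theta_neg)):
        n = int(N * frac)
        new = rng.normal(theta, SIGMA, n // 2)   # experimental arm
        std = rng.normal(0.0, SIGMA, n // 2)     # standard arm
        stats.append((new.mean() - std.mean(), 2 * SIGMA / np.sqrt(n)))
    (d_pos, se_pos), (d_neg, se_neg) = stats
    z_pos = d_pos / se_pos
    z_o = (k * d_pos + (1 - k) * d_neg) / np.hypot(k * se_pos, (1 - k) * se_neg)
    return z_pos, z_o

# power of the weighted Bonferroni test (wo = 0.5) in scenario 1
crit = norm.ppf(1 - 0.5 * ALPHA)  # each hypothesis tested at alpha/2
res = np.array([simulate_trial(1.0, 0.0) for _ in range(2000)])
p_plus, p_o = (res > crit).mean(axis=0)
```

In scenario 1 this sketch shows the pattern described in the results below: P+ is markedly higher than Po, because the overall contrast dilutes the targeted-subgroup effect by the factor k.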

Three scenarios for the treatment effects in the targeted and complementary subgroups are considered:


1. treatment only shows effects in the targeted subgroup but not in the complementary subgroup (θ+ = 1, θ− = 0)

2. treatment shows effects in both subgroups, but the effect is larger in the targeted subgroup (θ+ = 1, θ− = 0.5)

3. treatment shows similar effects in both the targeted and complementary subgroups (θ+ = 1, θ− = 1)

4.4 Results

To compare the performance of the five procedures considered, i.e., the Song-Chi procedure, the weighted parametric procedure, the weighted Holm procedure, the fallback procedure and the weighted Bonferroni test, we compute the three powers Po, P+, and Pall defined in Section 4.3. We first look at the three powers of all procedures in the first scenario described in Section 4.3. Figure 4.4 shows the comparison of the three powers for the five procedures with k = 0.5 and α∗1 = 0.1 for the Song-Chi procedure in scenario 1. As the new treatment shows an effect only in the targeted subgroup but not in the complementary subgroup, the test in the targeted subgroup should provide higher power and the test in the overall population should be less powerful. From the plots in the middle (Po) and on the right-hand side (P+) of Figure 4.4, the weighted parametric procedure obtains the highest powers among all procedures in most cases, and the weighted Bonferroni test obtains the lowest powers in general. We can observe that P+ is higher, due to the larger treatment effect in the targeted subgroup, while Po provides only around half of the power of P+: since the treatment shows no effect in the complementary subgroup, the overall treatment effect is approximately equal to half of the treatment effect in the targeted subgroup. The overall power, i.e., the power of rejecting Ho or H+ (Pall), behaves similarly to Po. For the four procedures with a weighting strategy, i.e., the weighted parametric, weighted Holm, fallback and weighted Bonferroni procedures, Po increases and P+ decreases as wo, i.e., the weight of Ho, increases. When we give more weight to Ho, we consider the hypothesis of the overall population to be more important and the other hypothesis H+ less so; thus we expect to reject Ho more often and H+ less often [118]. Due to the smaller rejection region for Ho and the larger rejection region for H+ in the second stage, the fallback procedure obtains the same Po as the weighted Bonferroni test (the green and yellow dashed lines coincide in the middle plot) and the same P+ as the weighted Holm procedure (the red and green dashed lines coincide in the right-hand plot).

However, the powers of Song-Chi do not follow the same trend as the other four procedures. As wo increases, Po first slightly increases, but only up to wo = 0.5, and then decreases. In addition, P+ is very stable up to around wo = 0.8, and then starts to decrease. Hence, for the Song-Chi procedure, it does not hold that we can increase the power to reject Ho and decrease the power to reject H+ by increasing wo. The reason for this strange behavior lies in the fact that α2 can be larger than α if wo is below 0.5 (see the last column of Table 4.9). However, α2 does not matter anymore in this case, since the significance level


Figure 4.4: Comparison of three powers using five procedures with k = 0.5 for scenario 1

of H+ is controlled at the α level. Compared to the other procedures, the Song-Chi procedure tends to be less powerful, but for large values of wo, it provides the highest power for H+.

Figure 4.5: Scatter plots of power comparisons using five procedures with k = 0.5 for scenario 1

Figure 4.5 shows the same power comparisons as in Figure 4.4 with 2-by-2 scatter plots. The aim of these scatter plots is to check how one power changes relative to another power; hence the plots do not rely on the assumption that the weights are comparable across all methods, which might be questionable. The upper right point would be the optimal point, as there both powers attain their maximum value. Again, we can see similar trends between Pall vs. Po (left-hand side) and P+ vs. Po (right-hand side). In general, the weighted parametric procedure performs slightly better than the other four procedures, as the blue points lie closer to the optimal point than the others with the same weight setting. The fallback procedure and the weighted Bonferroni test show a strong trade-off between Po and P+ as the weight of Ho changes, but the other three procedures, especially Song-Chi, do not show this trend. Numerical results for Figures 4.4 and 4.5 can be found in Table 4.9.


Figure 4.6: Comparison of three powers using five procedures with k = 0.5 for scenario 2

We also considered the two other scenarios of Section 4.3, i.e., when the treatment effect shows not only in the targeted subgroup but also in the complementary subgroup. Scenario 2 refers to the situation where the treatment shows a certain effect in the complementary subgroup, but not as large as in the targeted group, and scenario 3 refers to the extreme case where the treatment shows a similar effect in both groups. The results of power vs. weight of Ho and the 2-by-2 power comparison using the five procedures with k = 0.5 for scenario 2 are shown in Figure 4.6. As in Scenario 1, we can see a trade-off between Po and P+ for the procedures with weighting strategies, i.e., the weighted Bonferroni test, the weighted Holm procedure, the fallback procedure and the weighted parametric procedure. As wo increases, Po increases and P+ decreases, while Pall averages both, so it is rather stable. Due to the larger treatment effect in the complementary subgroup, all procedures obtain higher Po and Pall than in Scenario 1. The weighted parametric procedure obtains the highest powers and performs best among the five procedures, whereas the weighted Bonferroni test obtains the lowest Po and P+. However, the powers of Song-Chi do not show the general trend; they remain stable, but are rather less powerful than the other procedures except the weighted Bonferroni test. Figure 4.7 shows the power comparisons for scenario 3. Since the treatment effects exist equally in both subgroups, Po is very close to the planned power, especially when a higher weight is given to Ho. As wo increases, Po increases, but P+ is rather stable, except for the weighted Bonferroni test. The weighted parametric procedure is still the most powerful among all procedures, but the Song-Chi procedure performs poorly. The three powers of the five procedures for scenarios 2 and 3 are shown in Tables C.2 and C.3, respectively.

We consider two special choices of α∗1, i.e., α∗1 = 1 and α∗1 = 10α1, to study how power changes


Figure 4.7: Comparison of three powers using five procedures with k = 0.5 for scenario 3

according to α∗1. As shown in Figure 4.3, the smaller α∗1, the narrower the rejection region of H+. Figure 4.8 shows the line plot of power vs. wo and the scatter plot of 2-by-2 power comparisons for Pall, Po and P+ of the Song-Chi procedure with α∗1 = 0.1, 1 and 10α1, compared to the weighted parametric procedure. As observed in the plots above, if α∗1 = 0.1, we obtain much lower powers from the Song-Chi procedure than from the weighted parametric procedure, especially when α1 is small. The powers of the Song-Chi procedure become very similar to those of the weighted parametric procedure, both conceptually and quantitatively, when α∗1 = 1. Compared to the weighted parametric procedure, the Song-Chi procedure then obtains slightly lower Po and slightly higher P+, since α1 is fixed and pre-specified as woα. When α∗1 = 10α1, α∗1 increases as wo increases, but the intended relation between wo and the powers disappears, in particular in Scenario 1. Considering the other two scenarios, where the treatment shows effects in both the targeted and complementary subgroups, we can observe from Figures 4.9 and 4.10 that the Song-Chi procedure obtains similar powers for all three choices of α∗1, and the weighted parametric procedure is always more powerful than the Song-Chi procedure.

So far, all scenarios have been discussed for the case of equal group sizes; we also considered unequal subgroup sizes. Figures 4.11 and 4.12 show the plots of the three powers using the five procedures for Scenario 2 with a proportion of the targeted subgroup of k = 0.25 or k = 0.75. Compared to k = 0.5 in Figure 4.6, we can see that all three powers increase as k increases, and the differences among the procedures become smaller. All numerical values for Figures 4.11 and 4.12 can be found in Table C.2.


Figure 4.8: Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 1

Figure 4.9: Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 2


Figure 4.10: Comparison of three powers for Song-Chi with different choices of α∗1 and the weighted parametric procedure with k = 0.5 for scenario 3

Figure 4.11: Comparison of three powers using five procedures with k = 0.25 for scenario 2


Figure 4.12: Comparison of three powers using five procedures with k = 0.75 for scenario 2

4.5 Discussion

Subgroup analysis in confirmatory clinical trials has been widely discussed, especially in the situation of only one marker, i.e., with the overall population and the marker-positive and marker-negative subgroups [4, 105, 106, 128]. In this chapter, we compared two categories of multiple testing procedures, the feedback procedures and the procedures using weighting strategies, which can be applied in this situation. Xie et al. [124] described the main difference between the two families of procedures as the different way of putting weights on the intersection hypotheses, i.e., the procedures with weighting strategies put the weights on the test statistics, while the feedback procedures put the weights on the p-values [124, 125].

For the procedures with weighting strategies, there exist two types: non-parametric and parametric procedures. To compare the rejection regions and powers, we chose four procedures with weighting strategies: three non-parametric procedures, i.e., the weighted Bonferroni test, the weighted Holm and the fallback procedures, as well as one parametric procedure, the weighted parametric procedure. One common advantage of the four procedures with weighting strategies is that they allow us to pre-specify the weights given to the hypotheses according to their importance in the study. For example, we could give a larger weight to the primary endpoint and smaller weights to the secondary endpoints, and thus expect higher power from the analysis of the primary endpoint and lower power from the analysis of the secondary endpoints [16]. The weighted Bonferroni test is a single-step multiple testing procedure; the order in which the hypotheses are tested is not important, and the multiple hypotheses can be considered


as being tested simultaneously in a single step. Thus, it provides the lowest power in general. The weighted Holm, fallback and weighted parametric procedures are step-down procedures: hypotheses are tested sequentially, and if a hypothesis is retained, testing stops and the remaining hypotheses are retained by implication. Step-down procedures are more attractive and powerful than single-step procedures because they can reject more hypotheses without inflating the overall error rate [36]. The weighted parametric procedure takes the correlation between the two test statistics into account, which is related to the size of the subgroup: the larger the subgroup, the higher the correlation between the hypotheses and the higher the power obtained from the parametric procedure. For instance, in the hypothesis setting of Section 4.2.1, the correlation between Ho and H+ depends on the proportion k of patients in the targeted subgroup relative to the overall sample size; the higher k is, the larger the rejection regions are, and thus the higher the power should be. In contrast, the other three non-parametric procedures rely only on univariate p-values; the rejection regions of the weighted parametric procedure are wider than those of the non-parametric procedures, and thus the weighted parametric procedure obtains higher powers (Pall, Po and P+) in general.

The Song-Chi procedure is also a parametric procedure; it incorporates a consistency constraint α∗1, which ensures that the treatment effects in the overall population and the targeted subgroup at least tend in the same direction, as we use one-sided tests. We found that the choice of α∗1 can strongly influence the powers. For example, in scenario 1 in Figure 4.3, when α∗1 is small, e.g., α∗1 = 0.1, the three powers are in general lower than those of the weighted parametric procedure, even lower than those of the non-parametric weighted Holm and fallback procedures; but if α∗1 = 1, the three powers become higher than the ones from the non-parametric procedures and close to the ones from the weighted parametric procedure. However, due to the fact that α1 is fixed and α2 varies in dependence on both α1 and α∗1, Song-Chi does not really allow us to weight between Ho and H+ as the other methods do. We may decrease wo by choosing a smaller α1, but P+ is not guaranteed to increase as intended: because of the strange behavior that α2 may exceed α when α1 becomes very small, α2 cannot play a role in increasing the power of rejecting H+. In general, the Song-Chi procedure is less powerful than the weighted parametric procedure; only if the weight of Ho becomes large, i.e., wo ≥ 0.67, or α∗1 is large enough, e.g., α∗1 ≥ 0.125, does Song-Chi provide similar or even higher powers than the weighted parametric procedure. Similarly, with the choice of α∗1 = 10α1, the powers Pall and P+ change dramatically as wo increases, and only if α1 is large enough does Song-Chi obtain similar or even higher Pall and P+. When considering no restriction, i.e., α∗1 = 1, due to the pre-specified α1, Song-Chi obtains a slightly lower Po but higher P+ for all wo compared to the weighted parametric test, as shown in Figure 4.8.

However, when a treatment effect exists not only in the targeted subgroup but also in the complementary subgroup, the Song-Chi procedure performs worse than the weighted parametric procedure; e.g., the Song-Chi procedure obtains lower powers in Scenarios 2 and 3, as shown in Figures 4.9 and 4.10. This may suggest that the Song-Chi procedure is only suitable for the situation where the treatment effect exists in one subgroup but not the other, or where the difference in treatment effects between the two subgroups is very large. Thus, in the case that the subgroups are not categorized by a predictive biomarker, or the treatment effects may exist in both subgroups, the weighted parametric procedure is recommended. Other feedback


procedures without a consistency constraint may be more flexible and more powerful than Song-Chi in some settings, but the advantage of preventing extremely different overall and subgroup treatment effects is then lost. The role of the consistency constraint is discussed in more detail in Alosh and Huque [5].

The weighted parametric procedure was originally developed for more general settings; thus it can easily be applied to multiple subgroups, i.e. K ≥ 3. The Song-Chi procedure, as well as other feedback procedures, is more geared towards two hypotheses, e.g. the overall and targeted populations, and would thus have limitations when extended to situations involving more subgroups.

4.6 Conclusion

In confirmatory clinical trials, most studies consider only a single marker to be informative. When testing both the overall and one subgroup-specific effect, parametric MTPs are in general more powerful than non-parametric MTPs, due to the fact that the targeted subgroup is a subpopulation of the overall population, so the correlation between the two populations may be strong. The Song-Chi procedure incorporates a consistency constraint to prevent extremely different treatment effects in the overall population and the targeted subgroup, but it performs poorly in most scenarios, especially when treatment effects exist in both the targeted and complementary subgroups. Thus one should be cautious when applying the Song-Chi procedure. The weighted parametric procedure is more flexible and can be recommended in many scenarios. Other feedback procedures, which combine the advantages of both the Song-Chi and the weighted parametric procedures, should be investigated in future research.


Chapter 5

A framework to assess the value of subgroup analyses when the overall treatment effect is significant

5.1 Background

The choice of relevant patient populations is an important step in the benefit assessment of a new drug. It may result in the application of subgroup analyses in a clinical trial, even if the overall effect was significant. A nice illustration of this point is the debate regarding the effect of clopidogrel in the CAPRIE trial [9, 48] introduced in Chapter 3.1.

Interestingly, based on these two analyses, the primary analysis of the ITT population and the post hoc subgroup analysis stratified by disease group, several institutions came to different decisions. In the approval of clopidogrel for the European market, the EMA accepted the primary endpoint analysis of the overall population because neither strong heterogeneity nor a deficient definition of the overall study population was found in the CAPRIE trial [39, 48]. Similarly, the FDA adhered to the ITT analysis for CAPRIE and approved clopidogrel accordingly for all patients with atherothrombotic diseases [42]. In a cost-benefit assessment for the health care system in the UK, NICE concluded with respect to efficacy in accordance with the primary ITT analysis of the overall population of the CAPRIE trial. However, it considered that the balance between clinical effectiveness and cost-effectiveness did not justify a replacement of aspirin by clopidogrel to prevent vascular events [87]. In a benefit assessment for the German system, IQWiG acknowledged the superiority of clopidogrel only in the subgroup of patients with PAD but performed no evaluation of cost-effectiveness [64].

The desire to perform subgroup analysis after reaching a significant overall effect also appears in


multi-regional clinical trials (MRCT), where the overall effect is the global TE in the whole population, regardless of region. Since the same treatment may work differently in patients from different countries and regions, subgroup analysis by country or region is often required by regulatory authorities in MRCTs [91]. However, the situation is slightly different here, as the subgroup analyses are already intended when planning the study and can be taken into account in the sample size calculation [61].

Although subgroup analyses in clinical trials have been a matter of debate and methodological investigation for at least two decades [63, 92, 127], we now face a rather new and different situation. Traditionally, subgroup analyses were discussed in the context of trials failing to demonstrate an overall effect, and the focus was on limiting the risk of generating spurious signals of treatment superiority by testing many (small) groups and focusing too much on the estimates of maximally observed effects. If we apply subgroup analyses after demonstrating an overall effect, we may be more concerned about failing to demonstrate an existing TE in a subgroup due to insufficient power. There may be too much focus on confidence limits or the observed minimal effects, even if the spread of the treatment effect over subgroups may mainly reflect random fluctuation. This new situation has not yet attracted much attention, although an EMA guideline proposal on subgroup analysis in 2010 already mentioned the topic, and the 2013 version of the draft guideline includes a section on it [40, 41].

Our objective is to provide and use a framework to assess and compute the long-term effect of different strategies for performing subgroup analyses when the overall TE is significant. The formal framework and the decision rules considered are presented in Section 5.2. Results are displayed in Section 5.3. Section 5.4 summarizes the main results and discusses potential limitations. A conclusion in Section 5.5 finalizes the chapter.

5.2 Methods

5.2.1 Notations

Our framework is based on considering a sequence of clinical trials, all comparing some standard treatment to a new experimental treatment. The sample size calculation for all studies is based on the same assumed effect θA, resulting in identical sample sizes N. The patient population of each study can be divided into K subgroups of equal size, and we have a decision rule φ̃ to decide on the superiority of the experimental treatment in the whole study population, and a decision rule φ to make such a decision at the subgroup level.

In the sequel, we denote the studies by s = 1, ..., S, the subgroups in each study by g = 1, ..., K, and the individual patients in each subgroup by i = 1, ..., I with I = N/K. For each patient, we assume an individual true TE θsgi, because it is unrealistic for every patient to have the same characteristics apart from the group partition; thus each individual patient may react differently to the same treatment. Further, θsg denotes the true TE in subgroup g of study s, and θs denotes the true TE in study s.


5.2.2 Performance measures

In evaluating the performance of particular subgroup decision rules, we propose two measures. The point of departure for these measures is the general goal of clinical trials to improve the overall outcome of patients by accepting new effective treatments for whole patient groups without putting too many patients at risk of being harmed due to heterogeneity of treatment effects within such groups. However, due to the stochastic nature of the results of clinical trials, we have to balance between these two subgoals. On the one hand, we would like to improve the overall outcome as much as possible by accepting as many effective treatments as possible. On the other hand, we want to minimize or control the risk of recommending that a patient switch to a treatment which is actually inferior, i.e. with θsgi < 0.

To measure the overall improvement, we consider the overall gain. We assume that, if the experimental treatment is declared superior in a subgroup, all patients in that subgroup will receive the new treatment instead of the standard treatment in the future. Hence, based on our simulation, this gain is defined as

E = (1/S) (1/K) (1/I) ∑_{s=1}^{S} ∑_{g=1}^{K} ∑_{i=1}^{I} θ∗sgi,    (5.2.1)

with

θ∗sgi = θsgi if φsg = 1, and θ∗sgi = 0 if φsg = 0,    (5.2.2)

where φsg denotes the result of applying the decision rule φ in subgroup g of study s.

In the case of θ being a difference in survival or success probabilities, E approximates the average gain in survival or success probability over all future patients, if we always follow the recommendations made by the decision rules. If θ is a difference in the expected value of a continuous outcome, e.g. a quality of life score, then E is the average gain in this score over all future patients.

To measure the risk of recommending an inferior treatment, we consider the fraction of patients with a negative true TE in the subgroups with a positive decision, i.e.

P = [ ∑_{s=1}^{S} ∑_{g=1}^{K} ∑_{i=1}^{I} 1(θsgi < 0 ∧ φsg = 1) ] / [ I ∑_{s=1}^{S} ∑_{g=1}^{K} 1(φsg = 1) ]    (5.2.3)

P approximates the probability of recommending an inferior treatment among all patients changing treatment as a consequence of the study results. In general, we aim to accept as many treatments with a positive effect as possible for as many patients as possible. At the same time, we should keep the risk of recommending new treatments to patients who do not benefit from them as low as possible. Consequently, we should aim at maximizing E while minimizing P. Good decision rules should imply a reasonable balance between P and E.

Note that E can also be expressed in terms of θsg, but this does not hold for P, which depends on the variance of θsgi.
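Given simulated individual effects θsgi and subgroup decisions φsg, the two performance measures follow directly from (5.2.1)-(5.2.3). The vectorized sketch below uses assumed array shapes (S, K, I) and (S, K); the function name is my own.

```python
import numpy as np

def gain_and_risk(theta, phi):
    """theta: true individual effects, shape (S, K, I);
    phi: 0/1 subgroup decisions, shape (S, K).
    Returns (E, P) following definitions (5.2.1)-(5.2.3)."""
    S, K, I = theta.shape
    mask = phi[:, :, None].astype(bool)        # broadcast decisions to patients
    E = np.where(mask, theta, 0.0).mean()      # average gain over ALL patients
    n_switch = int(phi.sum()) * I              # patients in accepted subgroups
    if n_switch == 0:
        return E, 0.0                          # no positive decision at all
    P = ((theta < 0) & mask).sum() / n_switch  # harmed fraction among switchers
    return E, P
```

Note that E averages over every patient, selected or not, while P conditions on the patients who actually switch treatment; this is why E can be written in terms of the θsg alone but P cannot.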


5.2.3 Subgroup decision rules

In this framework, we consider five families of subgroup decision rules. All these rules will only be applied if the overall rule φ̃ decides on superiority, i.e. if φ̃ = 1.

In the first family φ<,α with α ∈ (0.0, 1.0), we consider the null hypothesis of usual superiority testing, namely that the new treatment is inferior to the standard treatment: here φsg = 1 if and only if the lower bound of the two-sided 1−α confidence interval for θsg is above 0.

In the second family φ>,α with α ∈ (0.0, 1.0), we take the opposite view: we do not require evidence for the superiority of the new treatment, but deny a subgroup effect only if we have evidence for inferiority, i.e. φsg = 0 if and only if the upper bound of the 1−α confidence interval for θsg is below 0.

For the limiting case α = 1.0, in both families the decision rule reduces to comparing the estimate with zero, i.e. φE with φsg = 1 if θ̂sg ≥ 0. In the second family, for the limiting case α = 0.0, we approach the situation where no subgroup analyses are performed, i.e. we decide on superiority for all subgroups as soon as we decide on overall superiority: φN with φsg = φ̃.

Since there is some tradition in subgroup analysis of only performing subgroup-specific tests in the presence of evidence for heterogeneity of the subgroup-specific TEs [17], we also introduce the family φI,δ,α. Here φsg = 1 if and only if the null hypothesis H0: θs1 = ... = θsK can be rejected at level δ and the lower bound of the 1−α confidence interval for θsg is above 0.

Finally, we consider a family popular in the analysis of multi-regional trials, where the estimate in a subgroup should be at least some fraction of the overall TE estimate [66]: φF,γ with φsg = 1 if and only if θ̂sg ≥ γθ̂s.

Table 5.1: Summary of the families of subgroup decision rules.

  Family                            Notation   Condition for φsg = 1
  1   Superiority testing           φ<,α       the lower bound of the two-sided 1−α confidence interval for θsg is above 0
  2   Inferiority testing           φ>,α       the upper bound of the 1−α confidence interval for θsg is not below 0
  3.1 Comparing estimate with 0     φE         θ̂sg ≥ 0
  3.2 No subgroup analysis          φN         φ̃ = 1
  4   Interaction gatekeeper        φI,δ,α     H0: θs1 = ... = θsK rejected at level δ, and the lower bound of the 1−α confidence interval for θsg is above 0
  5   Fraction of estimate          φF,γ       θ̂sg ≥ γθ̂s

Table 5.1 summarizes the main properties of the families considered. To illustrate the differences among the five families of decision rules, we reconsider the CAPRIE trial. Applying the different decision rules to the CAPRIE trial, quite different decisions can be reached.

Starting with the first family φ<,α with α = 0.05, we conduct the subgroup analysis in the widespread way of performing superiority testing in each subgroup at the 5% level, or equivalently, comparing the


lower bound of the 95% CI with 0. As we can see from 3.1, a treatment effect will be claimed only in the PAD group. If we weaken this criterion by allowing a larger α and consequently a narrower confidence interval, in the Stroke group we observe a CI of [-3.5, 17.0] for α = 0.1 and of [0.04, 14.0] for α = 0.26. Hence, we can now claim a treatment effect in both the PAD and the Stroke groups if α ≥ 0.26. This also holds for the limiting case α = 1.0, i.e. for φE, where we only compare the estimate with 0, as (only) in these two groups the effect estimates are above 0 (cf. Figure 5.1).

Now we can become even more liberal and consider the second family φ>,α, comparing the upper bound of the 1−α CI with 0. We start with α = 1.0, i.e. with φE, and then decrease α. In the MI group, for α = 0.68, the 32% CI is [-7.3, -0.2], i.e. the upper bound for the MI group starts to be negative. Hence, for α < 0.68, we would accept clopidogrel in all three groups.

Since the interaction between treatment effect and disease group is statistically significant (p = 0.042), using the decision rule family φI,δ,α with δ = 0.05 is equivalent to φ<,α. Consequently, for α = 0.05, clopidogrel is only accepted in the PAD group. When using the last family of decision rules, φF,γ with γ = 0.5 as suggested in the Japanese guideline on MRCTs [66], we would reject clopidogrel only in the MI group, as not only the largest estimate, found in the PAD group, but also the estimate of 7.3 observed in the Stroke group is larger than half of the overall estimate, i.e. 8.7/2 = 4.35 [22].

Figure 5.1: Decision rules applied to the CAPRIE trial.
Note that pink circles indicate the selected subgroups using families φE and φF,γ with α = 1 and γ = 0.5.

CHAPTER 5. A FRAMEWORK TO ASSESS THE VALUE OF SUBGROUP ANALYSES

5.2.4 Assumptions on subgroup specific treatment effects and individual treatment effects

Given the study specific TE θs, we assume for the subgroup specific effects

θsg ∼ N̄(θs, σG²)    (5.2.4)

with N̄ denoting a centered normal distribution, ensuring that the average of the subgroup specific effects θ̄s· is identical to θs. Within each subgroup, it would not be reasonable to assume identical TEs for all patients, because the subgroup categorization considered typically reflects only one factor, such as age, gender or a disease characteristic. Hence it is likely that there are also other factors influencing the treatment effect, resulting in additional variation from patient to patient. For the individual TEs we therefore assume

θsgi ∼ N̄(θsg, σGI²)    (5.2.5)

again ensuring that θ̄sg· = θsg.

We generate the observations from the centered normal distribution by generating observations from a normal distribution and then centering them to mean 0. As this decreases the variance, we use in the first step an appropriately increased variance. The centered normal distributions are employed at both the subgroup and the individual patient level:

• θsg = θs + ε*sg with ε*sg = εsg − ε̄s· and εsg ∼ N(0, σG² / (1 − 1/K))

• θsgi = θsg + ε*sgi with ε*sgi = εsgi − ε̄sg· and εsgi ∼ N(0, σGI² / (1 − 1/I))
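The two centering steps can be sketched as a minimal simulation; the variable names, seed, and variance values below are our own illustrative choices, while the inflation factor 1/(1 − 1/n) is the one stated above:

```python
import random

def centered_normal(mean: float, var: float, n: int, rng: random.Random) -> list:
    """Draw n values whose sample mean is exactly `mean`: draw errors from
    N(0, var / (1 - 1/n)), then subtract their sample mean (centering removes
    variance, which the inflated first-step variance compensates for)."""
    eps = [rng.gauss(0.0, (var / (1.0 - 1.0 / n)) ** 0.5) for _ in range(n)]
    mean_eps = sum(eps) / n
    return [mean + e - mean_eps for e in eps]

rng = random.Random(42)
K, I = 4, 50                      # illustrative: subgroups per study, patients per subgroup
sigma2_G, sigma2_GI = 0.01, 0.03  # illustrative variance components
theta_s = 0.2                     # true study-specific effect

theta_sg = centered_normal(theta_s, sigma2_G, K, rng)                    # eq. (5.2.4)
theta_sgi = [centered_normal(tg, sigma2_GI, I, rng) for tg in theta_sg]  # eq. (5.2.5)

print(round(sum(theta_sg) / K, 12))  # exactly theta_s = 0.2 by construction
```

By construction, the average of the subgroup effects reproduces θs exactly, and within each subgroup the individual effects average to θsg exactly.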

To facilitate the interpretation of our results, we re-parameterize the variance components σG² and σGI² in the following way: with σI² = σGI² + σG², we denote the overall variance of the individual TEs in a single study (i.e. the conditional variance given θs). We express the between-subgroup variation σG² as a fraction of the overall variation of the individual TEs, i.e. we define

R² = 1 − σGI²/σI² = σG²/σI²    (5.2.6)

R² can easily be interpreted as an explained variation, i.e. how much of the inter-individual variation in TEs can be explained by between-subgroup variation. Further, we relate the overall individual TE variation to the assumed effect by introducing τ, in order to obtain a quantity which is independent of the scale of the TE:

τ = σI/θA (5.2.7)

The following interpretation of τ may be helpful: if the true effect θs is equal to the assumed effect θA for a single trial, the choice τ = 0.5 implies that two standard deviations (SD) of θsgi equal θA, and hence corresponds to a situation where 2.5% of the patients have a negative TE. Similarly, τ = 1.0 corresponds to a situation where about 15.8% of the patients have a negative TE.
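These fractions follow from the standard normal CDF evaluated at −1/τ. A small check, using only the standard library (the helper name is our own):

```python
from math import erf, sqrt

def frac_negative_te(tau: float) -> float:
    """P(theta_sgi < 0) when theta_s = theta_A and SD(theta_sgi) = tau * theta_A:
    the standard normal CDF evaluated at -1/tau."""
    return 0.5 * (1.0 + erf((-1.0 / tau) / sqrt(2.0)))

print(round(frac_negative_te(0.5), 4))  # 0.0228, i.e. about 2.5%
print(round(frac_negative_te(1.0), 4))  # 0.1587, i.e. about 15.8%
```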

5.2.5 Assumptions on distribution of true study effects

The success rates of clinical trials (in the sense of reaching a significant positive TE for the new treatment) vary substantially among different areas [27, 29, 30, 73]. Hence it is necessary to consider different scenarios for the true TE. We consider three scenarios, described as follows.

In the first one, we assume that on average the true TE is identical to the assumed effect and, although there is variation from study to study, the true TE is negative in only 2.5% of the trials. We call this the optimistic scenario. Assuming a normal distribution of the true TE, it can be expressed as

θs ∼ N(θA, (0.5θA)²)    (5.2.8)

Next we consider a scenario where on average the true TE is half of the assumed effect. Keeping the assumption of normality and the degree of variation of the true TE, this means that the true effect is negative in 15.8% of all trials, and at least as large as the assumed effect in another 15.8%. This moderate scenario is given by

θs ∼ N(0.5θA, (0.5θA)²)    (5.2.9)

Last, we consider a scenario where on average there is no effect, and in only 2.5% of all trials the true effect is at least as large as the assumed effect. This pessimistic scenario is given by

θs ∼ N(0, (0.5θA)²)    (5.2.10)

Table 5.2 summarizes the main properties of the three scenarios considered.

Table 5.2: Overview of the three scenarios for the distribution of the true study effect.

Scenario      Distribution               Implication for distribution of true study effect
Optimistic    θs ∼ N(θA, (0.5θA)²)       negative in only 2.5% of the trials
Moderate      θs ∼ N(0.5θA, (0.5θA)²)    negative in 15.8% of all trials, and at least as
                                         large as the assumed effect in another 15.8%
Pessimistic   θs ∼ N(0, (0.5θA)²)        at least as large as the assumed effect in only
                                         2.5% of all trials
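The tail fractions stated for the three scenarios can be checked by sampling them directly. The sketch below is our own (function name, seed, and the choice θA = 0.2, which matches the binary outcome setting of Section 5.2.6, are illustrative assumptions):

```python
import random

THETA_A = 0.2        # assumed effect; 0.2 matches the binary outcome setting
SD = 0.5 * THETA_A   # all three scenarios share the between-study SD 0.5 * theta_A

MEANS = {"optimistic": THETA_A, "moderate": 0.5 * THETA_A, "pessimistic": 0.0}

def sample_theta_s(scenario: str, n: int = 100_000, seed: int = 1) -> list:
    """Draw n true study effects theta_s under eqs. (5.2.8)-(5.2.10)."""
    rng = random.Random(seed)
    return [rng.gauss(MEANS[scenario], SD) for _ in range(n)]

for name in MEANS:
    draws = sample_theta_s(name)
    neg = sum(d < 0 for d in draws) / len(draws)
    print(f"{name:11s}: fraction of trials with a negative true TE ~ {neg:.3f}")
```

The empirical fractions of trials with a negative true TE come out near 2.5%, 15.8% and 50% for the optimistic, moderate and pessimistic scenario, respectively, as Table 5.2 states.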

5.2.6 Scenarios for outcomes

It is an implicit assumption in our considerations that the values of E and P are mainly determined by the structure of our framework described so far, as long as φ̃ and φ are based on asymptotically most powerful tests. They should not depend on the concrete choice of outcome scales, nuisance parameters of the outcome distribution, or inference procedures. However, to compute E and P, a concrete choice has to be made.

In this section, we present the results for the case of a binary outcome with the risk difference as the effect measure, i.e. θsgi = πEsgi − πSsgi, with πSsgi and πEsgi denoting the probability of a success for patient i in subgroup g in study s when treated with the standard or the experimental treatment, respectively. πSsgi is chosen as the constant value 0.4, implying that there is no association of the individual TEs with the prognosis under the standard treatment. The decision rule φ̃ at the overall level is based on the χ² test, confidence intervals for the risk difference are based on Newcombe's method No. 10 [88], and interaction tests are based on logistic regression. The power calculation for each single study is based on assumed success rates of 0.6 and 0.4 in the two treatment groups, i.e. θA = 0.2. Consequently, a power of 90% requires a sample size of N = 280 for each study. As we assume a normal distribution for θsgi, we cannot avoid that πEsgi lies outside the interval [0, 1] for some patients, and hence we change θsgi to max(0, min(1, πEsgi)) − max(0, min(1, πSsgi)) in computing the values of E and P.
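The truncation max(0, min(1, ·)) applied above can be expressed as a small helper. This is a minimal sketch of ours (the function name and the example values are illustrative), assuming the constant standard-treatment success probability πSsgi = 0.4:

```python
def realized_risk_difference(theta_sgi: float, pi_s: float = 0.4) -> float:
    """Risk difference after truncating both success probabilities to [0, 1],
    mirroring max(0, min(1, pi_E)) - max(0, min(1, pi_S)) from Section 5.2.6."""
    def clip(p: float) -> float:
        return max(0.0, min(1.0, p))
    return clip(pi_s + theta_sgi) - clip(pi_s)

print(round(realized_risk_difference(0.2), 10))    # 0.2: no truncation needed
print(round(realized_risk_difference(0.75), 10))   # 0.6: pi_E = 1.15 is truncated to 1
print(round(realized_risk_difference(-0.55), 10))  # -0.4: pi_E = -0.15 is truncated to 0
```

The truncation only bites for extreme individual TEs; with πSsgi fixed at 0.4, any θsgi in [−0.4, 0.6] is left unchanged.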

We focused on the risk difference in the main analysis, as it gives us a simple interpretation of E. To check our implicit assumption, we considered six further scenarios in a sensitivity analysis, as described below:

1. The first additional scenario is identical to the one described in the chapter, except that we chose πSsgi = 0.1 and assumed θA = 0.1, which requires a sample size of N = 572.

2. The second additional scenario differs from the one considered above only by πSsgi = 0.5 − 0.5·θsgi. So here we allow an association of individual treatment effects with the prognosis under standard treatment.

3. Instead of the risk difference, we consider θ = log(OR) as the effect measure. The power calculation and the test of overall effects are not changed. Two scenarios were considered with respect to the association between prognosis and individual treatment effect:

   (a) πSsgi = 0.4 and πEsgi = logit⁻¹(logit(0.4) + θsgi)

   (b) πSsgi = logit⁻¹(logit(0.5) − 0.5·θsgi) and πEsgi = logit⁻¹(logit(0.5) + 0.5·θsgi)

   Note that in this scenario, the subgroup specific effect θsg and the study effect θs do not refer to the group or study specific log(OR) values, but to the average of the individual log(OR) values.

4. Finally, we consider continuous outcomes and Cohen's d as the effect measure, i.e. θ = (µE − µS)/σ. The t-test was used as the overall test, CIs were t-test based, and a regression model was used to assess interactions. The assumed effect was 0.2; we could assume σ² = 1, implying a sample size of 1052. The data were generated as

   Ysgi = µsgi + εsgi with εsgi ∼ N(0, 1 − σsgi²)    (5.2.11)

   where µsgi = µSsgi if T = 0 and µsgi = µEsgi if T = 1, and σsgi² = σS² if T = 0 and σsgi² = σE² if T = 1.

   Two scenarios were assumed in generating the data with respect to the association between prognosis and individual treatment effects:

   (a) µSsgi = 0 and µEsgi = θsgi, implying σS² = 0 and σE² = σI²

   (b) µSsgi = −0.5θsgi and µEsgi = 0.5θsgi, implying σS² = σI²/4 and σE² = σI²/4

5.3 Results

We start by illustrating the use of E and P applied to ordinary clinical trials with no subgroup analyses. Figure 5.2 shows the histograms of the individual true TEs in the three scenarios introduced in Section 5.2.5, as well as those of the patients in studies resulting in a decision on superiority. In the optimistic scenario, the vast majority of trials results in such a decision, and consequently the value of E is close to the assumed TE of 0.2. The fraction of patients suffering from being suggested an inferior treatment is very small, i.e. within 3%. In the moderate scenario, less than 50% of the trials result in a positive decision; consequently the value of E is less than half of the assumed effect, and P increases to 7%. In the pessimistic scenario, only very few trials result in a positive decision. As in these few trials the true TE tends to be positive, we still have a small overall gain, indicated by a value of 0.02 for E. Also, the control of P still functions to some degree, limiting it to 14%.

[Figure: three histogram panels. Optimistic scenario (E = 0.18, P = 0.03), Moderate scenario (E = 0.08, P = 0.07), Pessimistic scenario (E = 0.02, P = 0.14); y-axis: Frequency, x-axis: individual TE from −.5 to .5.]

Figure 5.2: Histogram of all individual TEs θsgi (white) and those from studies declaring superiority in case of no subgroup analysis (blue) for three scenarios with τ = 0.5 and R² = 0.

Plots of P and E using different decision rules for the optimistic, moderate and pessimistic scenarios with τ = 0.5 and τ = 1 and five values of R² are shown in Figure 5.3. In these plots, we connected the points belonging to the two families of superiority testing φ<,α and inferiority testing φ>,α, such that we have a route from the case of no subgroup analysis over the case of comparing the estimate with 0 to the case of superiority testing at levels of 20% and 5%. Although the magnitude of P and E differs across the three scenarios and depends on τ, the general patterns observed are very similar. There are several important aspects we can read off Figure 5.3. We start with the rather liberal approaches, i.e. the family φ>,α, α ∈ (0.0, 1.0), beginning with α = 0.0, i.e. the case of no subgroup analysis (φN), and moving to α = 1.0, corresponding to the case of only comparing the estimate with 0 (φE). In Figure 5.3, this means starting from the point in the upper right corner and following the lines until the diamonds. For both values of τ and all choices of R² in all scenarios, we can observe a reduction in P and a slight increase or decrease in E. The reduction in P increases as R² and τ increase, reflecting that the overall fraction of patients with a negative TE increases. The decrease in E is largest for R² = 0, as we then perform subgroup analyses without any need. An increase in E can be observed if τ = 1 and for large values of R², i.e. we remove more patients with a negative TE than with a positive TE using subgroup analysis. The %E and %P columns in the middle part of Table 5.3 quantify the changes observed for E and P in Figure 5.3, comparing φE with φN. The decrease in P is less than 0.5% if R² = 0, between 3% and 11% if R² = 0.25, and between 11% and 27% if R² = 0.5. The decrease in E is always less than 0.75%, and in the case of substantial heterogeneity of TEs, i.e. when τ and R² are large, the increase may be up to 9%.

[Figure: six panels of E vs. P, for the optimistic, moderate and pessimistic scenarios with τ = 0.5 (top row) and τ = 1 (bottom row); legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.3: Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R².
The results for the decision rules φ<,α and φ>,α are connected by a line starting with φ>,0.0, i.e. no subgroup analyses, in the upper right corner and ending with φ<,0.05 (marked by a circle). The two points marked in between correspond to φE (marked by a diamond) and φ<,0.2. The cross and plus correspond to φI,0.05,0.05 and φI,0.15,0.05 respectively. The filled and hollow triangles correspond to φF,0.5 and φF,0.75. Note that y- and x-scales vary from plot to plot.


We then continue with the more stringent approaches, i.e. the family φ<,α, α ∈ (0.05, 1.0), starting with α = 1.0, i.e. the case of comparing the estimate with 0, and moving to α = 0.05, i.e. significance testing at the 5% level in each subgroup. In Figure 5.3, this means following the lines from the diamonds to the circles at the bottom; the points in between refer to superiority testing at 20%. We can always observe that P and E both decrease, i.e. we reduce the gain as well as the fraction of patients being recommended an inferior treatment. However, the reductions are now much more pronounced compared to moving from φN to φE only. The %E and %P columns in the right part of Table 5.3 quantify these reductions. The reduction in P is in the range of 9% to 29% if R² = 0, and between 64% and 73% if R² = 0.5. The reduction in E is in the range of 16% to 40% if R² = 0, and between 6% and 26% if R² = 0.5. So whether superiority testing at 5% is a good idea or not depends on the degree of heterogeneity and on how we balance the undesirable reduction in E against the desirable reduction in P.

Table 5.3: Selected results shown in Figure 5.3. % denotes the relative decrease compared to φN.

                          φN               φE                              φ<,0.05
Scenario         R²     E      P       E      %E     P      %P       E      %E     P      %P
Optimistic       0      0.1790 0.0325  0.1788 0.15   0.0324 0.34     0.1495 16.50  0.0232 28.59
(τ = 0.5)        0.25   0.1788 0.0326  0.1782 0.33   0.0310 5.04     0.1507 15.75  0.0149 54.35
                 0.5    0.1792 0.0329  0.1785 0.42   0.0275 16.40    0.1518 15.28  0.0088 73.23
Moderate         0      0.0784 0.0740  0.0781 0.29   0.0737 0.49     0.0562 28.22  0.0562 24.04
(τ = 0.5)        0.25   0.0788 0.0737  0.0783 0.65   0.0697 5.39     0.0593 24.75  0.0371 49.68
                 0.5    0.0785 0.0726  0.0781 0.55   0.0619 14.76    0.0612 22.07  0.0225 69.08
Pessimistic      0      0.0207 0.1386  0.0206 0.50   0.1386 0.001    0.0125 39.51  0.1129 18.52
(τ = 0.5)        0.25   0.0207 0.1401  0.0205 0.74   0.1348 3.81     0.0143 30.95  0.0735 47.51
                 0.5    0.0206 0.1377  0.0206 0.14   0.1218 11.53    0.0152 26.41  0.0482 64.98
Optimistic       0      0.1761 0.1389  0.1760 0.09   0.1387 0.12     0.1466 16.75  0.1211 12.77
(τ = 1)          0.25   0.1765 0.1390  0.1763 0.14   0.1250 10.05    0.1522 13.78  0.0791 43.07
                 0.5    0.1760 0.1383  0.1779 -1.07  0.1019 26.36    0.1580 10.23  0.0465 66.40
Moderate         0      0.0775 0.2037  0.0773 0.26   0.2034 0.15     0.0552 28.79  0.1806 11.34
(τ = 1)          0.25   0.0780 0.2035  0.0783 -0.37  0.1817 10.72    0.0641 17.85  0.1157 43.13
                 0.5    0.0777 0.2044  0.0806 -3.68  0.1501 26.58    0.0697 10.37  0.0692 66.16
Pessimistic      0      0.0205 0.2664  0.0204 0.29   0.2661 0.13     0.0124 39.75  0.2422 9.08
(τ = 1)          0.25   0.0207 0.2681  0.0212 -2.55  0.2388 10.94    0.0168 18.87  0.1531 42.90
                 0.5    0.0204 0.2680  0.0223 -9.11  0.1999 25.41    0.0192 6.01   0.0950 64.55

The next aspect depicted in Figure 5.3 is the value of using interaction tests as gatekeepers, i.e., specific members of the family φI,δ,α. Two instances of this strategy are considered: performing subgroup tests at α = 5% if the interaction test was significant at either δ = 5% or δ = 15%. The results are marked by a cross or plus, respectively. We can immediately see that the cross and plus are never above the line in the corresponding color, indicating that we can always find a significance level such that superiority testing yields a larger value of E and a smaller value of P simultaneously. This suggests that this strategy does not offer a general advantage. Similarly, Figure 5.3 depicts the value of requiring the preservation of a pre-specified fraction of the overall estimate, i.e., specific members of the family φF,γ, represented by the filled and hollow triangles, referring to fractions of γ = 0.5 and γ = 0.75 respectively. Again, these symbols are never above the corresponding lines; hence, we cannot conclude a general advantage compared to superiority testing. However, they tend to be closer to the curves than the symbols referring to the interaction gatekeeper strategy.

So far, we only considered the case of 2 subgroups. Figure 5.4 shows the corresponding plots for the moderate scenario with τ = 0.5, but comparing the cases of 2, 3 or 4 subgroups. We observe patterns similar to Figure 5.3. The reduction in P does not seem to depend on the number of groups, but the degree of reduction in E increases as the number of subgroups increases. For example, for R² = 0.25 and superiority testing at the 5% level, we observe a reduction by 24.8% in the case of 2 groups, by 39.1% in the case of 3 groups, and by 49.6% in the case of 4 groups. Note that the rule of comparing the estimate with 0 is still associated with a moderate loss in E. With an increasing number of subgroups, the interaction gatekeeper approach tends to become much worse than superiority testing, and the discrepancy to superiority testing also increases for the fraction of estimate approach.

[Figure: three panels of E vs. P for the moderate scenario with τ = 0.5, for 2, 3 and 4 subgroups; legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.4: Plot of E vs. P for different numbers of subgroups (K = 2, 3, 4) for the moderate scenario and τ = 0.5.
For the explanation of symbols see Figure 5.3. Note that y- and x-scales vary from plot to plot.

Finally, we consider the effect of increasing the sample size N in Figure 5.5, simulating the situation that we can assess a benefit based on two or four studies which are homogeneous enough to be pooled. Again, we observe the same patterns. Not surprisingly, the increase in power leads to an increase in E, and applying superiority testing at the 5% level in two subgroups implies a level of E similar to no subgroup analysis in a study of half the size. In contrast, the impact on P is less pronounced, but we can observe a slight increase with larger sample sizes. The discrepancy between superiority testing and the interaction gatekeeper approach seems to decrease slightly when increasing the sample size, whereas the discrepancy to the fraction of estimate approach tends to increase slightly. Selected values corresponding to Figures 5.4 and 5.5 can be found in Table 5.4.

[Figure: three panels of E vs. P for the moderate scenario with τ = 0.5, for N = 280, 560 and 1120; legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.5: Plot of E vs. P for different overall sample sizes (N = 280, 560, 1120) for the moderate scenario and τ = 0.5.
For the explanation of symbols see Figure 5.3. Note that y- and x-scales vary from plot to plot.

Table 5.4: Selected results shown in Figures 5.4 and 5.5. % denotes the relative decrease compared to φN.

                          φN               φE                              φ<,0.05
Scenario         R²     E      P       E      %E     P      %P       E      %E     P      %P
K = 2,           0      0.0784 0.0740  0.0781 0.29   0.0737 0.49     0.0562 28.22  0.0562 24.04
N = 280          0.25   0.0788 0.0737  0.0783 0.65   0.0697 5.39     0.0593 24.75  0.0371 49.68
                 0.5    0.0785 0.0726  0.0781 0.55   0.0619 14.76    0.0612 22.07  0.0225 69.08
K = 3,           0      0.0775 0.0732  0.0763 1.53   0.0722 1.36     0.0438 43.54  0.0516 29.42
N = 280          0.25   0.0778 0.0737  0.0763 1.87   0.0679 7.86     0.0474 39.09  0.0339 54.02
                 0.5    0.0781 0.0731  0.0769 1.62   0.0605 17.23    0.0503 35.59  0.0211 71.12
K = 4,           0      0.0781 0.0740  0.0755 3.36   0.0721 2.64     0.0355 54.49  0.0501 32.35
N = 280          0.25   0.0786 0.0738  0.0759 3.31   0.0670 9.22     0.0396 49.58  0.0326 55.78
                 0.5    0.0782 0.0739  0.0759 2.94   0.0606 18.04    0.0428 45.26  0.0201 72.78
K = 2,           0      0.0929 0.0861  0.0928 0.18   0.0859 0.30     0.0773 16.85  0.0657 23.72
N = 560          0.25   0.0931 0.0870  0.0928 0.30   0.0798 8.28     0.0799 14.13  0.0431 50.48
                 0.5    0.0930 0.0863  0.0930 -0.07  0.0675 21.85    0.0813 12.57  0.0254 70.58
K = 2,           0      0.1007 0.1005  0.1006 0.09   0.1003 0.24     0.0920 8.68   0.0810 19.44
N = 1120         0.25   0.1006 0.1008  0.1006 0.04   0.0897 10.97    0.0932 7.43   0.0535 46.92
                 0.5    0.1006 0.1004  0.1012 -0.64  0.0736 26.735   0.0945 6.05   0.0314 68.72

For the other outcomes considered in Section 5.2.6, we obtain results similar to those for the risk difference in Figure 5.3. Selected results are shown in Figures 5.6, 5.7 and 5.8.


[Figure: four panels of E vs. P: risk difference with P0 = 0.1, risk difference with P0 = 0.2, odds ratio with θ = 0.6, effect size with θ = 0.2; legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.6: Additional scenarios of Section 5.2.6 using the moderate scenario and τ = 0.5 for the risk difference with πSsgi = 0.1 (scenario 1), the risk difference with πSsgi = 0.2 (scenario 2), the odds ratio (scenario 3b) and the effect size (scenario 4b).
For the explanation of symbols see Figure 5.3. Note that y- and x-scales vary from plot to plot.

[Figure: six panels of E vs. P for the optimistic, moderate and pessimistic scenarios with τ = 0.5 (top row) and τ = 1 (bottom row); legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.7: Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R², using odds ratios (scenario 3a in Section 5.2.6).
For the explanation of symbols see Figure 5.3. Note that y- and x-scales vary from plot to plot.

[Figure: six panels of E vs. P for the optimistic, moderate and pessimistic scenarios with τ = 0.5 (top row) and τ = 1 (bottom row); legend: R² = 0.0, 0.125, 0.25, 0.375, 0.5.]

Figure 5.8: Plot of E vs. P using different decision rules for three scenarios with τ = 0.5/1 and five different values of R², using effect sizes (scenario 4a in Section 5.2.6).
For the explanation of symbols see Figure 5.3. Note that y- and x-scales vary from plot to plot.

5.4 Discussion

In this chapter, we propose a framework that allows us to compare different subgroup analysis strategies applied to clinical trials which have demonstrated an overall effect.

As expected, using subgroup analysis can help to reduce the number of patients who suffer from an incorrect recommendation. The stricter the criterion we employ for performing subgroup analysis, the smaller the fraction of patients with an incorrect recommendation among all patients with such a recommendation. However, we often have to pay the price of reducing the overall gain E while reducing P. If there is no subgroup variation, we have the maximal reduction in E and the minimal reduction in P. Even in the case of substantial subgroup variation, e.g. if the subgroups explain at least 50% of the overall variation in TEs from patient to patient, strict decision rules like superiority testing at the 5% level can imply a substantial reduction of E, in particular when more than 2 subgroups are considered.

The simple decision rule of requiring a positive TE estimate in a subgroup implies only a small loss in E in the worst case, and if both large subgroup variation and large individual variation exist, we may even obtain a gain in E. Simultaneously, we always reduce P to a non-negligible degree in the case of a certain degree of subgroup variation. So here we have, or are close to, a 'win-win' situation. However, using subgroup analysis based on the family of superiority testing, even with a moderate significance level like 20%, typically implies a non-negligible reduction in E. So any attempt to use some type of superiority testing in subgroup analysis destroys this 'win-win' situation. Consequently, the use of superiority testing can typically only be justified if there is an external decision that a reduction in P outweighs the reduction in E. Interaction tests are often performed as a gatekeeper of subgroup analysis. However, we found in our framework that decision rules using the interaction gatekeeper approach generally do not offer any advantage. Similarly, requiring a certain fraction of the overall effect, a decision rule adopted by the Japanese guideline for global clinical trials, does not offer advantages either.

Although our framework is based on considering single trials, it also sheds some light on the situation of performing benefit assessments based on a meta-analysis. On a qualitative level, subgroup analysis seems to behave very similarly in meta-analyses with respect to the relation between different decision rules regarding the impact on E and P. On a quantitative level, we can reach higher levels of E. Of course, we have to be aware of the fact that the gain in E is probably due to larger sample sizes allowing smaller TEs to be demonstrated, for which the clinical relevance might be questionable.

In interpreting the results, some limitations of our framework have to be taken into account. First, we assume that σI², i.e. the inter-individual variation of the TEs, is independent of the true TE θs of the trial. It seems more realistic to assume that σI² decreases as θs approaches 0, as, if a treatment is not effective on average, it is unlikely that there are large effects at the individual level. However, assuming proportionality between σI² and θs is also too strong. Our choice of assuming no association between σI² and θs can be regarded as a conservative choice, implying an overestimation of P, as we assume more negative individual TEs than are to be expected in reality.

Second, our framework assumes that decisions on making a treatment available for all patients are based on a single trial. This is usually not the case. For example, the FDA and EMA guidelines assume the existence of two pivotal studies [39, 112]. However, the example of the CAPRIE trial illustrates that sometimes decisions are based on one single, large trial. Furthermore, our considerations for the case of pooling two or four studies suggest that many considerations and conclusions will be similar if more than one study is available.

Third, our framework assumes that a proof of superiority implies a decision to make the treatment available for all patients, whereas a failure to do so implies no access to the treatment in the future. The latter may be the case if we consider decisions on drug approval by regulatory agencies [77], but then the first assumption is not fulfilled, as a proven gain in efficacy has to be balanced against the safety profile of the new treatment. If we consider decisions on reimbursement instead of drug approval [90], then the failure to demonstrate superiority also need not imply that the drug is not available for all patients. For example, it may be available for those who are willing to pay for it themselves. In Germany, a negative decision on an additional benefit in an HTA assessment performed after drug approval implies that the drug may still be sold, but only at the price of the comparator, and it is up to the pharmaceutical company to decide whether to accept this or to remove the drug completely from the market.

Fourth, in our framework we neglect that treatments are developed for patient populations of different sizes. Our results can change if there is a correlation between true TEs and the size of the target population. Fifth, for simplicity, we made the strong assumption that all subgroups in a trial are of equal size, which may not be true in most clinical trials. Similarly, we considered only one partition of the study population into subgroups. In most trials, several factors like age, gender or baseline characteristics may define different partitions, adding the additional complexity of patients with contradictory treatment recommendations. Sixth, we assume in our framework that TEs observable in clinical trials will also be observed when the treatment is later applied as part of standard care. Finally, we did not include all approaches to subgroup analyses in our considerations. However, our framework allows other alternative approaches, e.g. Bayesian methods [67], to be investigated.

5.5 Conclusion

Subgroup analysis offers a great temptation to improve the benefit assessment of new drugs by allowing smaller relevant patient groups to be considered. However, there is a risk of overlooking superior new treatments due to insufficient power, which may decrease the overall efficiency of the clinical trial culture in improving the average outcome. The simple rule of comparing effect estimates, instead of confidence intervals, with 0 in subgroup analyses in trials with a significant overall effect may offer a good compromise. It allows us to reduce the fraction of patients suffering from being recommended a worse treatment without diminishing the overall gain achievable by the current clinical trial culture. Stricter decision rules may be justified if there is a priori a high likelihood of substantial subgroup variation, or if the avoidance of incorrect treatment recommendations at the individual level is given more weight than the improvement of the average outcome.


Chapter 6

Comparing a highly stratified treatment strategy with the standard treatment in randomized clinical trials

6.1 Background

Due to the increasing progress in developing biological and molecular targeted agents, marker information has become a critical consideration in both planning and analyzing clinical trials [19, 82, 83]. Currently, many clinical trials are based on a single marker, dividing the patients into two subgroups and testing a new treatment among the patients in the marker positive subgroup (enrichment design [101, 110]) or in all patients (randomize-all design [45, 56]). However, with the increasing availability of multiple markers predictive for the success of different treatments, more and more trials will be designed to test the impact of different drugs given depending on several markers in a single type of disease, particularly in oncology [69, 71, 85, 107]. Partially, such trials are covered by the term “umbrella trial” [84]. Such highly stratified treatment strategies become a challenge for the statistical analysis of these studies [129].

Two instructive examples of studies with a highly stratified treatment strategy are the FOCUS4 trial and the STarT Back trial introduced in Chapter 3. The FOCUS4 trial [68, 69, 111], introduced in Section 3.3, is a typical umbrella trial. This multi-arm, multi-stage Phase II/III trial involves five different biomarkers: BRAF, PIK3CA, KRAS, NRAS and EGFR. Patients in each cohort with one or two particular biomarkers are randomized to receive either targeted therapy or placebo. The STarT Back trial (Section 3.2) is an RCT comparing stratified primary care management with the current best care in patients with back pain [49, 53]. The primary care management is stratified by three disease risk groups, i.e., high, medium and low risk, which are treated either with additional physiotherapy-led treatment of varying intensity and duration or with the best current care. These two examples show that highly stratified treatment strategies are of use in both large and small scale clinical trials.

Clinical trials comparing such strategies with the current standard can be analyzed in two ways: either comparing the strategy with the standard of care in the whole population, or analyzing each subgroup of patients separately. The overall approach may result in suggesting treatments which are not beneficial for the patients in some subgroups, while the subgroup approach may lack power for subgroups of small size. In addition, statistical considerations for most marker-based clinical trials have focused on the situation with two subgroups, where we expect a distinct treatment effect mainly in one group (e.g., the marker-positive subgroup) [4, 105, 106]. In this chapter we focus on studies with more than 2 subgroups, and with no expectations about a distinct treatment effect in a particular subgroup. The latter situation has not been investigated widely so far, even for the two subgroup situation, with one notable exception [94].

To overcome the problems mentioned above with applying only an overall comparison and/or sepa-rate subgroup analyses, we consider in this chapter an intermediate approach aiming to establish a treat-ment effect in a subset of patients built by joining several subgroups. The approach is based on the simpleidea of selecting the subset with minimal p-value when testing the subset-specific treatment effects. Themain objective of this chapter is to investigate the proposed subset approach with respect to its successrate, i.e. the probability of establishing a treatment effect in some patients, as well as the balance betweenestablishing a superior treatment in as many patients as possible and recommending an inferior treatmentto as few patients as possible. We compare this approach with simpler approaches to select a subset,e.g., by simply joining all significant subgroups. In our investigations, we focus on scenarios where thetreatment effects are highly varying across subgroups, and where some subgroups may not benefit at all(i.e. have a treatment effect of zero). This reflects the expectation that some of the suggested therapiesmay offer little benefit in highly stratified treatment strategies, as the evidence base on the suggestions istypically limited in clinical practice.

6.2 Methods

6.2.1 Notation

Let N denote the total sample size for a given study. We assume that the patient population can be divided into K subgroups, g = 1, . . . , K, with sample sizes ng and proportions πg = ng/N. Let S denote a subset of 𝒦 := {1, . . . , K}, interpreted as the union of the corresponding subgroups. The corresponding sample size is nS = ∑g∈S ng, such that πS = nS/N = ∑g∈S πg. Let θg denote the treatment effect in subgroup g, and let θS = ∑g∈S π̃g^S θg denote the treatment effect in subset S, where π̃g^S = πg/πS. Finally, let 𝒫𝒦 denote the set of all non-empty subsets of 𝒦.
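For concreteness, the subset-level quantities θS and πS can be computed from the subgroup-level ones as in the following sketch (the function name and example numbers are ours, not from the thesis):

```python
def subset_effect(S, pi, theta):
    """Return (pi_S, theta_S) for a subset S (iterable of subgroup indices).

    pi[g] and theta[g] are the subgroup proportions and treatment effects;
    theta_S is the pi-weighted average of the subgroup effects within S."""
    pi_S = sum(pi[g] for g in S)
    theta_S = sum(pi[g] * theta[g] for g in S) / pi_S
    return pi_S, theta_S

# Example: K = 4 equally sized subgroups with increasing effects
pi = [0.25, 0.25, 0.25, 0.25]
theta = [0.0, 0.1, 0.2, 0.3]
pi_S, theta_S = subset_effect({2, 3}, pi, theta)  # join subgroups 3 and 4 (0-based)
print(pi_S, theta_S)
```

Joining the two subgroups with the largest effects here yields πS = 0.5 and θS = 0.25, the weighted average of 0.2 and 0.3.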


6.2.2 Analytic framework

Let Ygi denote the outcome in patient i of subgroup g, and Tgi the randomized treatment indicator, with 0 indicating the standard treatment and 1 the new stratified strategy. We consider the model

Ygi = µg + θgTgi + εgi,

where the εgi are independently and identically normally distributed with mean 0 and variance σ². Our interest is in testing the null hypotheses

H^S_0 : θS ≤ 0 versus H^S_1 : θS > 0

for any S ∈ 𝒫𝒦. Standard linear model theory provides us with a one-sided test at level α for each of these hypotheses after fitting the above model once. The corresponding test statistics TS are based on selecting the corresponding contrasts from this model, and hence they are, independent of the choice of S, compared with the upper α-quantile of a t-distribution with N − 2K degrees of freedom.

For each set 𝒮 ⊆ 𝒫𝒦, i.e. any collection of subsets of 𝒦, we consider the multiple testing problem {H^S_0}S∈𝒮. By computing the joint distribution of (TS)S∈𝒮, we obtain a single step multiple testing procedure by comparing each TS with the upper α-quantile of the distribution of max{TS | S ∈ 𝒮} [47]. By applying the closed testing principle [80], the single step procedure can be refined further: we consider for each subset 𝒮′ ⊆ 𝒮 a test based on max{TS | S ∈ 𝒮′} for H^𝒮′_0 := ⋂S∈𝒮′ H^S_0, and we reject H^S_0 if all H^𝒮′_0 with S ∈ 𝒮′ ⊆ 𝒮 are rejected.

In the single step procedure, the multiplicity adjusted p-value pS for each null hypothesis H^S_0 with S ∈ 𝒮 can be determined by comparing TS with the distribution of max{TS | S ∈ 𝒮}. In the closed testing procedure, a p-value can be obtained for each H^𝒮′_0 by comparing the observed value of max{TS | S ∈ 𝒮′} with its distribution under H^𝒮′_0. The p-value pS for H^S_0 is then defined as the maximum p-value over all H^𝒮′_0 with S ∈ 𝒮′ ⊆ 𝒮. Both the single step procedure and the closed testing procedure are implemented with the glht function in the multcomp package in R, using the single-step and the Westfall options, respectively [15].
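The idea of the single step max-T adjustment can be illustrated with a small Monte Carlo sketch. This is our own illustration, not the multcomp implementation used in the thesis: it assumes equal subgroup sizes and replaces the t-distribution with a normal approximation, so that under the global null the subgroup statistics Zg are independent standard normal and TS = ∑g∈S Zg/√|S|.

```python
import math
import random
from itertools import combinations

def max_t_adjusted_p(t_obs, K, n_sim=50_000, seed=1):
    """Single-step adjusted p-value P(max_S T_S >= t_obs) over all non-empty
    subsets S of K equally sized subgroups, using the normal approximation
    T_S = sum(Z_g for g in S) / sqrt(|S|) with independent Z_g ~ N(0, 1)
    under the global null hypothesis."""
    rng = random.Random(seed)
    subsets = [c for r in range(1, K + 1) for c in combinations(range(K), r)]
    scale = {S: math.sqrt(len(S)) for S in subsets}
    hits = 0
    for _ in range(n_sim):
        z = [rng.gauss(0.0, 1.0) for _ in range(K)]
        t_max = max(sum(z[g] for g in S) / scale[S] for S in subsets)
        if t_max >= t_obs:
            hits += 1
    return hits / n_sim

t_obs = 2.5                                    # observed statistic of one subset
p_raw = 0.5 * math.erfc(t_obs / math.sqrt(2))  # unadjusted one-sided p-value
p_adj = max_t_adjusted_p(t_obs, K=4)           # adjusted for all 15 subset tests
print(p_raw, p_adj)                            # the adjusted p-value is larger
```

The adjusted p-value accounts for the positive correlation between overlapping subsets and is therefore smaller than a Bonferroni correction over the 2^K − 1 tests, but larger than the raw p-value.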

6.2.3 Five approaches for subset selection

In this chapter, we consider five different approaches to select a subset S∗ of 𝒦 for which we can claim to have demonstrated a positive treatment effect. All five approaches fit into a general framework, characterized by applying one of the two multiple testing procedures outlined in Section 6.2.2 to a specific set 𝒮 of subsets and then selecting either a single significant subset or a union of significant subsets. Formal definitions of S∗ for each approach are given in Table 6.1; in the following we describe the approaches conceptually.

The first approach, denoted as M1, is the overall test. Here we consider only the subset corresponding to the overall population, i.e., 𝒮1 = {𝒦}, with the overall effect θoverall := θ𝒦. If the null hypothesis H^𝒦_0 : θoverall ≤ 0 is rejected, we select all patients; otherwise no patient is selected. Since we are applying the framework of Section 6.2.2, the test is based on a two-factor model and is not identical to the simple two-sample t-test. Such a two-factor model is reasonable here because the group partition may act as a prognostic factor.

The second approach, denoted as M2, is a subgroup analysis, where 𝒮2 = {{1}, {2}, . . . , {K}}. In this approach, we perform K tests of the null hypotheses H^{g}_0 : θg ≤ 0 for g = 1, . . . , K, and select all subgroups with a significant treatment effect. Multiplicity issues are taken into account by applying the closed testing procedure from Section 6.2.2. Due to the independence of the subgroups, this is equivalent to using the Šidák procedure [115] [15, p. 122].

The third approach, denoted as M3, is a combination of the two previous approaches. We perform tests for the overall population and for each subgroup, i.e., 𝒮3 = {{1}, {2}, . . . , {K}, 𝒦}, and again apply the closed testing procedure to adjust for multiplicity. If H^𝒦_0 can be rejected, all patients are selected; otherwise we select all significant subgroups. M3 can be considered a generalization of the principle of investigating both the overall effect and all subgroup effects in the situation with two groups, i.e., the hybrid design recommended by [45].

The fourth approach, M4, is the subset analysis. Here, we consider 𝒮4 = 𝒫𝒦, i.e., all 2^K − 1 possible subsets which can be built by joining subgroups, and select the subset with the smallest p-value. Because our interest lies in selecting a single subset, the closed testing procedure provides no improvement, and we use the single step procedure from Section 6.2.2 to adjust for multiplicity.

Finally, we consider an approach M5 similar to M4, but with a restriction on the sample size of the subsets. Here, 𝒮5 = {S ∈ 𝒫𝒦 | πS ≥ γ} contains only sufficiently large subsets, i.e., those with a sample size above a pre-specified fraction γ of the total sample size. In this chapter, we consider γ = 50%. Compared to M4, M5 ignores small subsets containing few subgroups or only subgroups with small sample sizes. Since a smaller number of tests is performed, the power is expected to be higher than for M4 if there is a positive treatment effect in at least one subset covering 50% of the total sample size.

Table 6.1: Overview of the five subset selection approaches.

     Set of subsets                  Multiple testing procedure (Section 6.2.2)   Selected subset S∗
M1   𝒮1 = {𝒦}                        Not applicable                               𝒦 if p𝒦 ≤ α; ∅ otherwise
M2   𝒮2 = {{1}, {2}, . . . , {K}}    Closed test                                  ⋃ of all S ∈ 𝒮2 with pS ≤ α
M3   𝒮3 = {{1}, {2}, . . . , {K}, 𝒦} Closed test                                  ⋃ of all S ∈ 𝒮3 with pS ≤ α
M4   𝒮4 = 𝒫𝒦                         Single step                                  argmin{pS | S ∈ 𝒮4} if min{pS | S ∈ 𝒮4} ≤ α; ∅ otherwise
M5   𝒮5 = {S ∈ 𝒫𝒦 | πS ≥ γ}          Single step                                  argmin{pS | S ∈ 𝒮5} if min{pS | S ∈ 𝒮5} ≤ α; ∅ otherwise
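The candidate sets 𝒮4 and 𝒮5 can be enumerated directly; a small sketch (the function name and example values are ours), using the subgroup sizes rather than the proportions to avoid rounding issues in the comparison with γ:

```python
from itertools import combinations

def candidate_subsets(n, gamma=None):
    """All non-empty subsets of the K subgroups, given subgroup sizes n;
    if gamma is given, keep only subsets with n_S >= gamma * N (approach M5)."""
    K, N = len(n), sum(n)
    subsets = [c for r in range(1, K + 1) for c in combinations(range(K), r)]
    if gamma is None:
        return subsets
    return [S for S in subsets if sum(n[g] for g in S) >= gamma * N]

n = [176] * 6                         # six equally sized subgroups, N = 1056
S4 = candidate_subsets(n)             # all 2^6 - 1 = 63 non-empty subsets
S5 = candidate_subsets(n, gamma=0.5)  # only subsets covering >= 50% of patients
print(len(S4), len(S5))               # 63 42
```

With six equal subgroups and γ = 50%, only subsets joining at least three subgroups remain, reducing the number of tests from 63 to 42.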

We conclude this section with a few remarks. In Appendix I we show that both M4 and M5 control the FWER. For M5, we have to make the additional assumption that there is no subgroup covering 50% or more of the whole patient population. Because M2 and M3 use the closed testing procedure, FWER control follows for these two approaches as well. Hence we can claim that we have proven the statistical significance of the selected subset for all approaches considered. Note that for M4 we do not consider the union of all significant subsets, as this implies a risk of accepting subgroups with negative treatment effect estimates. This cannot happen when selecting the subset with minimal p-value; see Appendix II for a proof. Note that this argument does not apply to M5.

Current practice in clinical research focuses on two-sided tests. However, throughout this chapter we consider one-sided tests, as two-sided tests can yield undesirable results if there are subgroups with both relevant positive and relevant negative effects. On the other hand, the focus on positive effects may imply that evidence for negative effects in some subgroups is overlooked. To address this issue, we recommend inspecting the subgroup specific treatment effects θ̂g for all g ∈ S∗ together with their confidence intervals.

6.2.4 Quality and performance measures

Once one of our approaches has been applied, we can ask how good the suggested subset S∗ actually is. Assuming that we know the true values θg, we can define corresponding quality measures. In this chapter we consider two quality measures. First, we consider the impact

I(S∗) = ∑g∈S∗ πg θg = πS∗ θS∗,

which reflects the expected average change in patient outcome when, in future, giving the new treatment to the selected subset S∗ and the current standard treatment to the other patients. Second, we consider the inferiority rate, which is based on the assumption that each patient i in subgroup g has an individual treatment effect θgi, and that the group specific treatment effect θg equals the average of the individual effects. If we assume that these individual effects are normally distributed with standard deviation σI, the fraction of patients with a negative treatment effect, i.e., patients who cannot expect to benefit from receiving the new treatment, is Φ(−θg/σI) in subgroup g. Hence, we can assess the overall rate of suggesting an inferior treatment by

R(S∗) = ∑g∈S∗ πg Φ(−θg/σI).

Note that we want to simultaneously maximize the impact I and minimize the inferiority rate R. The smaller the effect of a selected subgroup, the smaller the increase in impact and the larger the increase in inferiority rate. Considering impact and inferiority rate jointly allows us to study how the different methods balance the wish of assigning a better treatment to as many patients as possible against the risk of actually moving patients to a worse treatment. Regarding the output S∗ of our approaches as a random variable, we can study the performance of each approach by considering the distribution of the impact and the inferiority rate. In particular, we can consider the expected values, to which we refer as expected impact and expected inferiority rate in the sequel. These measures are conceptually similar to E and P in Chapter 5, but without generating the individual treatment effects θgi explicitly [108]; instead, we only use the assumed variance to compute R(S∗) here. In addition, we consider the success rate P(S∗ ≠ ∅), i.e. the probability to establish a treatment effect in a non-empty subset of patients.
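The two quality measures above can be sketched directly (a minimal illustration with made-up numbers; Φ is computed via the complementary error function):

```python
import math

def norm_cdf(x):
    """Standard normal CDF, computed via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def impact(S, pi, theta):
    """I(S) = sum over g in S of pi_g * theta_g."""
    return sum(pi[g] * theta[g] for g in S)

def inferiority_rate(S, pi, theta, sigma_I):
    """R(S) = sum over g in S of pi_g * Phi(-theta_g / sigma_I)."""
    return sum(pi[g] * norm_cdf(-theta[g] / sigma_I) for g in S)

pi = [0.25] * 4                    # four equally sized subgroups
theta = [0.0, 0.05, 0.2, 0.3]      # true subgroup effects
S_star = {1, 2, 3}                 # a hypothetical selected subset (0-based)
print(impact(S_star, pi, theta))                             # impact I(S*)
print(inferiority_rate(S_star, pi, theta, sigma_I=0.05))     # inferiority rate R(S*)
```

In this example, including subgroup 2 (effect 0.05, equal to σI) raises the impact only slightly but contributes about Φ(−1) ≈ 16% inferior recommendations within that subgroup, illustrating the trade-off discussed above.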


6.3 Illustrative example

To illustrate the five different approaches we consider the hypothetical example of a study in which external marker information allowed us to identify 6 different subgroups, with a different new treatment suggested for each subgroup. Within each subgroup, the patients were randomized to either the current standard treatment or the new subgroup specific treatment. In all subgroups we measured the same outcome variable Y. The estimates of the subgroup-specific treatment effects, i.e., the differences of the mean values between the two treatment arms in each subgroup, together with their 95% confidence intervals, are shown in Figure 6.1. The estimated error variance was close to 1 in each subgroup, such that the θ̂g can be interpreted as effect sizes. The sample sizes in the 6 subgroups were 145, 167, 63, 292, 86 and 153.

Figure 6.1: Subgroup specific treatment effect estimates with point-wise 95% confidence intervals in a hypothetical study with 6 subgroups.

Estimates and confidence intervals are based on the model described in Section 6.2.2.

To apply our five approaches, we first fit the model presented in Section 6.2.2. Next, we construct contrast matrices corresponding to the set of subsets considered in each approach and then apply the glht procedure. The implementation of these steps in R is documented in Appendix VI.

The subsets selected by each approach when using α = 0.025 are shown in Table 6.2. When applying the overall test, we obtain a p-value of 0.0007; hence M1 selects the whole population, including two subgroups with effects close to 0. The subgroup analysis (M2) identifies only subgroups 5 and 6, which have rather large effects, with adjusted p-values of p = 0.011 and p = 0.014. M3 again selects the overall population, as the adjusted p-value of 0.0046 is still below 0.025. When considering all possible subsets, we find a minimal adjusted p-value of 0.0016 for the subset {3, 4, 5, 6}, which is below 0.025, and hence M4 selects this subset. As this subset covers more than 50% of the whole population, it is also selected by M5, but the adjusted p-value of 0.0006 is now smaller due to testing a smaller number of subsets.


Table 6.2: The subsets S∗ selected by the five approaches in the example of a hypothetical study with results as shown in Figure 6.1. The subgroups included in S∗ are marked by a cross.

       Subgroup
       1   2   3   4   5   6
M1     x   x   x   x   x   x
M2                     x   x
M3     x   x   x   x   x   x
M4             x   x   x   x
M5             x   x   x   x

6.4 Simulation study

The performance measures introduced in Section 6.2.4 cannot be computed analytically. Hence we conduct a simulation study to get first insights into the behavior of the five approaches introduced.

6.4.1 Design of simulation study

Our starting point is a clinical trial based on an assumed effect θA for θoverall. We aim at a standardized effect size of 0.2, so σ² = 1 and θA = 0.2 were chosen. Two parameters are used to model the between-subgroup heterogeneity. A certain fraction P of the subgroup effects θ1, . . . , θK is set to 0; the remaining subgroup effects are chosen such that θoverall = θA. In addition, we allow them to vary depending on a parameter τ, which equals the relative half range of the θg that are not equal to 0. The relative half range here refers to half the distance between the maximum and the minimum, divided by the average of the maximum and minimum. The larger τ, the larger the heterogeneity among the subgroups. To ensure θoverall = θA, we do not choose the θg randomly, but from a fixed grid of equidistant values. Figure 6.2 illustrates the choices of θg determined by P and τ for a study with 8 subgroups. The technical implementation is outlined in Appendix III.
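One way to implement this construction for equal subgroup sizes is sketched below (our own code, not the implementation of Appendix III): the K1 non-zero effects are placed on an equidistant grid with mean m = θA·K/K1, running from m(1 − τ) to m(1 + τ), so that the overall effect equals θA and the relative half range equals τ.

```python
def subgroup_effects(K, P, tau, theta_A=0.2):
    """Equidistant subgroup effects with overall mean theta_A, assuming equal
    subgroup sizes: a fraction P of the K effects is set to 0; the remaining
    K1 effects lie on an equidistant grid with mean m = theta_A * K / K1 and
    relative half range tau, i.e. from m * (1 - tau) to m * (1 + tau)."""
    K1 = K - round(P * K)          # number of non-zero effects
    m = theta_A * K / K1           # their mean, so that theta_overall = theta_A
    if K1 == 1:
        nonzero = [m]
    else:
        step = 2 * tau * m / (K1 - 1)
        nonzero = [m * (1 - tau) + i * step for i in range(K1)]
    return [0.0] * (K - K1) + nonzero

theta = subgroup_effects(K=8, P=0.25, tau=0.4)
print(theta)                       # two zeros, six effects spread around 0.267
print(sum(theta) / len(theta))     # overall effect theta_A = 0.2
```

For K = 8, P = 0.25 and τ = 0.4 this reproduces the pattern of Figure 6.2: two zero effects and six non-zero effects symmetric around m ≈ 0.267.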

For the subgroup specific sample sizes, we consider two scenarios. In the equal sample size situation we set ng = N/K for all g. In the unequal sample size situation, we assume that ⌊K/2⌋ groups have size (1/2)·N/K and ⌊K/2⌋ groups have size (3/2)·N/K, where ⌊·⌋ denotes rounding down to the next integer. If K is odd, the remaining group has size N/K. The group sizes are randomly allocated to the groups. We combine these two scenarios with three choices for the heterogeneity of the non-zero effects (τ = 0, 0.4, 0.8) and two choices of the fraction of zero effects (P = 0, 0.25). Considering the cases K = 2, 3, 4, 6, 8, with a significance level α = 0.025, we chose N = 1056, such that (1/2)·N/K is always an integer and a simple two-sample t-test would have a power of 90% in the case of no effect variation across the subgroups. Without loss of generality, we can assume that all group specific intercepts µg are equal to 0.
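As a sanity check of the choice N = 1056, the planned power can be reproduced with a normal approximation to the two-sample t-test (our sketch; the z-quantile for α = 0.025 is hard-coded):

```python
import math

def power_two_sample(N, delta=0.2, sigma=1.0):
    """Approximate power of a one-sided two-sample z-test at alpha = 0.025
    with N patients in total (N/2 per arm) and standardized effect delta/sigma."""
    def norm_cdf(x):
        return 0.5 * math.erfc(-x / math.sqrt(2.0))
    z_alpha = 1.959963984540054               # upper 0.025 normal quantile
    se = sigma * math.sqrt(2.0 / (N / 2))     # SE of the mean difference
    return norm_cdf(delta / se - z_alpha)

print(power_two_sample(1056))                 # about 0.90
```

With N = 1056 and a standardized effect of 0.2 this yields approximately 90% power, matching the planning assumption in the text.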

For the computation of the inferiority rate, we choose σI = 0.05 such that in a subgroup with θg = (1/2)θA, 15.8% of patients have a negative effect. For each scenario we performed 2500 simulation runs, allowing us to estimate a success rate of 90% with a standard error of 0.6%.


Figure 6.2: The values of θg (black dots) for g = 1, . . . , 8 for two choices of P and τ. The blue arrows illustrate the half range τ of the non-zero effects. The red line indicates the average of the maximum and minimum values of the non-zero effects.

6.4.2 Results

The results of the simulation study are summarized here using the performance measures introduced in Section 6.2.4. Figure 6.3 shows how the success rates change depending on the number of subgroups in the different scenarios. M1 shows stable success rates (0.89 to 0.91), close to the planned power for all K across all scenarios, because it analyzes the whole population regardless of the number of subgroups. For the other four approaches, we observe decreasing success rates as K increases. Note that the four methods are ordered as M2 < M3 < M4 < M5 with respect to the success rate: the red dashed line always lies at the bottom, and the green and purple lines are placed above the blue and red lines. Method M3 always performs between M1 and M2, reflecting its compromise between testing all patients and testing by subgroup only. Both subset analysis methods M4 and M5 provide higher success rates than M2 and M3, being closer to the success rates of M1. In general, M5 provides higher success rates than M4, which is in accordance with our expectation. The heterogeneity of the patient population has a distinct impact on the merits of M1 compared to the other approaches. In the situation of no heterogeneity, i.e. P = 0 and τ = 0, M1 based on the overall test obtains the highest success rates regardless of the number of subgroups. With increasing heterogeneity, the subset analyses M4 and M5 may beat the overall test M1, at least when the number of groups is limited. In the case of substantial heterogeneity, i.e. P = 0.25 and τ = 0.8, M4 is superior for up to K = 8, and even M2 may beat M1 for moderate values of K. All numerical values of the success rates for the six scenarios can be found in Table 6.3 below.

Figures 6.4 and 6.5 show how impact and inferiority rate are related to each other in different


Figure 6.3: Success rate depending on the number of subgroups K for different choices of P and τ, for equal subgroup sizes.

Table 6.3: Success rate depending on P and τ for equal and unequal group sizes. The results correspond to Figures 6.3 and 6.6.

                      Equal size                              Unequal size
K   P     τ     M1      M2      M3      M4      M5      M1      M2      M3      M4      M5
2   0     0     0.8964  0.7776  0.8504  0.8500  0.8508  0.9004  0.7896  0.8484  0.8488  0.8816
2   0     0.4   0.8964  0.8624  0.8968  0.8968  0.8968  0.9004  0.8548  0.8900  0.8900  0.8936
2   0     0.8   0.8964  0.9656  0.9680  0.9680  0.9680  0.9004  0.9592  0.9580  0.9580  0.9072
2   0.25  0     0.9080  0.8892  0.9300  0.9304  0.9300  0.9040  0.8740  0.8992  0.9112  0.9264
2   0.25  0.4   0.9080  0.9332  0.9524  0.9524  0.9524  0.9040  0.9088  0.9212  0.9208  0.9072
2   0.25  0.8   0.9080  0.9856  0.9860  0.9860  0.9856  0.9040  0.9624  0.9612  0.9612  0.9160
3   0     0     0.9036  0.6584  0.8228  0.8240  0.8652  0.9016  0.6816  0.8304  0.8348  0.8620
3   0     0.4   0.8964  0.7492  0.8528  0.8664  0.8860  0.9040  0.7464  0.8572  0.8604  0.8872
3   0     0.8   0.9084  0.8884  0.9188  0.9352  0.9348  0.9072  0.8960  0.9248  0.9384  0.9440
3   0.25  0     0.8988  0.8272  0.8832  0.9108  0.9264  0.9096  0.8416  0.8992  0.9112  0.9264
3   0.25  0.4   0.8996  0.8948  0.9284  0.9356  0.9408  0.9044  0.8920  0.9208  0.9292  0.9348
3   0.25  0.8   0.9044  0.9740  0.9808  0.9828  0.9712  0.8956  0.9632  0.9720  0.9748  0.9724
4   0     0     0.8908  0.5740  0.7868  0.7960  0.8204  0.8952  0.6156  0.7940  0.8024  0.8380
4   0     0.4   0.9032  0.6380  0.8136  0.8328  0.8632  0.9084  0.6520  0.8100  0.8376  0.8720
4   0     0.8   0.8960  0.7932  0.8604  0.8996  0.9148  0.9040  0.8048  0.8816  0.9048  0.9236
4   0.25  0     0.9028  0.7564  0.8444  0.8992  0.9120  0.9004  0.7824  0.8660  0.8972  0.9208
4   0.25  0.4   0.8988  0.8284  0.8872  0.9240  0.9320  0.8996  0.8464  0.8956  0.9144  0.9252
4   0.25  0.8   0.9024  0.9512  0.9572  0.9712  0.9684  0.9044  0.9432  0.9516  0.9668  0.9668
6   0     0     0.8944  0.4368  0.7456  0.7696  0.8192  0.9020  0.4796  0.7600  0.7748  0.8380
6   0     0.4   0.9072  0.5188  0.7812  0.8076  0.8452  0.9052  0.5224  0.7708  0.8020  0.8508
6   0     0.8   0.8968  0.6636  0.8188  0.8696  0.8908  0.9044  0.6684  0.8164  0.8572  0.8960
6   0.25  0     0.8972  0.6288  0.8048  0.8812  0.9084  0.9064  0.6676  0.8204  0.8680  0.9120
6   0.25  0.4   0.8988  0.7040  0.8308  0.8992  0.9160  0.8996  0.7320  0.8440  0.8916  0.9164
6   0.25  0.8   0.9072  0.8528  0.9092  0.9452  0.9560  0.8956  0.8468  0.8952  0.9376  0.9460
8   0     0     0.9000  0.3636  0.7408  0.7756  0.8124  0.9060  0.3900  0.7380  0.7672  0.8184
8   0     0.4   0.8928  0.4044  0.7352  0.7828  0.8324  0.9032  0.4340  0.7396  0.7772  0.8408
8   0     0.8   0.9064  0.5344  0.7800  0.8500  0.8800  0.9072  0.5636  0.7828  0.8540  0.8892
8   0.25  0     0.9088  0.5196  0.7692  0.8612  0.8872  0.8988  0.5444  0.7656  0.8440  0.8904
8   0.25  0.4   0.8948  0.5652  0.7812  0.8756  0.8984  0.9000  0.6100  0.7984  0.8836  0.9160
8   0.25  0.8   0.8940  0.7528  0.8500  0.9300  0.9360  0.9048  0.7688  0.8652  0.9348  0.9444


scenarios when comparing the five approaches. In order to avoid overcrowded plots, we consider P = 0.25 and P = 0 separately in Figure 6.4 and Figure 6.5, and stratify the results by the number of subgroups. Starting with the larger fraction P = 0.25 in Figure 6.4, we compare impact and inferiority rate among the five methods for τ = 0, 0.4, 0.8 and K = 2, 3, 4, 6, 8. M1 always obtains the highest impact (0.89 to 0.90), but it also yields larger inferiority rates than the other approaches. We observe the lowest inferiority rates and the lowest impact for M2. M3 always plays the role of a compromise between M1 and M2. Interestingly, M1, M2 and M3 seem to be placed on a straight line when the number of subgroups is large enough, e.g. K = 4, 6, 8. However, M4 and M5 do not lie on this line, but are placed closer to the top left corner, the optimal point with impact equal to 1 and inferiority rate equal to 0. The differences between M4 and M5 are not very large, but again we can see a trade-off between impact and inferiority rate: M5 tends to have a larger impact and a larger inferiority rate than M4. We may conclude that, as for the other methods, M4 and M5 seek a balance between maximizing the impact and minimizing the inferiority rate, but at levels closer to the optimum than M1, M2 and M3.

Figure 6.4: Impact vs. inferiority rate for P = 0.25 and different choices of τ for equal subgroup sizes. The results are stratified by the number of subgroups. The five methods are distinguished by different symbols and the choices of τ by different colors. The lines in the plot connect M3 with M1 and M3 with M2.

For P = 0, we observe similar results in Figure 6.5, the only difference being that the methods are more similar to each other. However, the general result about the comparison between the subset analyses and the other methods still holds: M4 and M5 move closer to the optimal corner than M1, M2 and M3 in the case of subgroup heterogeneity (τ = 0.4 or τ = 0.8) and a number of subgroups K ≥ 3.


Figure 6.5: Impact vs. inferiority rate for P = 0 and different choices of τ for equal subgroup sizes. For the explanation of symbols see Figure 6.4.


Table 6.4: Impact and inferiority rate depending on P and τ for equal group sizes. The results correspond to Figures 6.4 and 6.5.

                        Impact                                  Inferiority rate
P     K   τ     M1      M2      M3      M4      M5      M1        M2        M3        M4        M5
0     2   0     0.8964  0.5874  0.7134  0.7696  0.7704  2.84e-05  1.86e-05  2.26e-05  2.43e-05  2.44e-05
0     2   0.4   0.8964  0.6706  0.7284  0.7626  0.7638  0.0037    0.0011    0.0015    0.0019    0.0019
0     2   0.8   0.8964  0.8756  0.8792  0.8830  0.8827  0.0950    0.0070    0.0085    0.0125    0.0122
0     3   0     0.9036  0.3369  0.7021  0.6817  0.7401  2.86e-05  1.07e-05  2.22e-05  2.16e-05  2.34e-05
0     3   0.4   0.8964  0.4318  0.6838  0.7021  0.7589  0.0025    3.58e-04  0.0015    0.0012    0.0013
0     3   0.8   0.9084  0.6298  0.7162  0.7852  0.8589  0.0642    0.0031    0.0221    0.0127    0.0171
0     4   0     0.8908  0.2127  0.6931  0.6197  0.6484  2.82e-05  6.74e-06  2.19e-05  1.96e-05  2.05e-05
0     4   0.4   0.9032  0.2759  0.6766  0.6445  0.6808  0.0019    1.72e-04  0.0013    0.0009    0.0009
0     4   0.8   0.8960  0.4373  0.6389  0.7141  0.7549  0.0478    0.0014    0.0222    0.0101    0.0112
0     6   0     0.8944  0.0961  0.6517  0.5343  0.5843  2.83e-05  3.04e-06  2.06e-05  1.69e-05  1.85e-05
0     6   0.4   0.9072  0.1357  0.6628  0.5669  0.6140  0.0014    6.56e-05  0.0010    0.0006    0.0007
0     6   0.8   0.8968  0.2372  0.6195  0.6431  0.6897  0.0345    0.0004    0.0204    0.0081    0.0092
0     8   0     0.9000  0.0576  0.6598  0.4973  0.5372  2.85e-05  1.82e-06  2.09e-05  1.58e-05  1.70e-05
0     8   0.4   0.8928  0.0755  0.6353  0.5130  0.5651  0.0012    2.82e-05  0.0008    0.0005    0.0005
0     8   0.8   0.9064  0.1320  0.6232  0.5974  0.6450  0.0294    0.0003    0.0187    0.0074    0.0084
0.25  2   0     0.9080  0.7956  0.8626  0.8912  0.8918  0.1126    0.0040    0.0045    0.0060    0.0053
0.25  2   0.4   0.9080  0.8349  0.8680  0.8877  0.8862  0.1145    0.0045    0.0053    0.0070    0.0063
0.25  2   0.8   0.9080  0.9400  0.9412  0.9431  0.9428  0.1611    0.0074    0.0089    0.0124    0.0118
0.25  3   0     0.8988  0.6099  0.7434  0.8369  0.8921  0.1135    0.0020    0.0435    0.0127    0.0147
0.25  3   0.4   0.8996  0.6605  0.7431  0.7913  0.8838  0.1129    0.0028    0.0213    0.0106    0.022
0.25  3   0.8   0.9044  0.8325  0.8559  0.8763  0.9278  0.1534    0.0037    0.0110    0.0110    0.0650
0.25  4   0     0.9028  0.4105  0.6645  0.7799  0.8036  0.1129    0.0014    0.0648    0.0161    0.0172
0.25  4   0.4   0.8988  0.4968  0.6672  0.7688  0.8088  0.1125    0.0013    0.0504    0.0144    0.0152
0.25  4   0.8   0.9024  0.6918  0.7295  0.8221  0.8873  0.1451    0.0013    0.0284    0.0149    0.0650
0.25  6   0     0.8972  0.2179  0.6424  0.7068  0.7501  0.1122    0.0007    0.0709    0.0180    0.0193
0.25  6   0.4   0.8988  0.2885  0.6239  0.7115  0.7541  0.1120    0.0004    0.0594    0.0162    0.0185
0.25  6   0.8   0.9072  0.4585  0.6455  0.7675  0.8314  0.1345    0.0012    0.0504    0.0183    0.0257
0.25  8   0     0.9088  0.1205  0.6336  0.6407  0.6844  0.1136    0.0004    0.0751    0.0196    0.0227
0.25  8   0.4   0.8948  0.1544  0.6143  0.6513  0.7000  0.1119    0.0005    0.0702    0.0201    0.0221
0.25  8   0.8   0.8940  0.2874  0.5837  0.7173  0.7787  0.1280    0.0007    0.06284   0.0206    0.0266

So far, we have only presented the results of the simulation studies with equal subgroup sizes. When considering unequal subgroup sizes, similar results are observed. The corresponding results are shown in Table 6.5 and Figures 6.6 to 6.8.


Figure 6.6: Success rate depending on the number of subgroups K for different choices of P and τ, for unequal group sizes.

Table 6.5: Impact and inferiority rate depending on P and τ for unequal group sizes. The results correspond to Figures 6.7 and 6.8.

                        Impact                                  Inferiority rate
P     K   τ     M1      M2      M3      M4      M5      M1        M2        M3        M4        M5
0     2   0     0.9004  0.6428  0.7364  0.7857  0.8380  2.85e-05  2.04e-05  2.33e-05  2.49e-05  2.65e-05
0     2   0.4   0.9004  0.7013  0.7568  0.7769  0.8601  0.0031    6.33e-04  0.0010    0.0015    0.0017
0     2   0.8   0.9004  0.8412  0.8413  0.8302  0.8937  0.0628    0.0071    0.0094    0.0094    0.0365
0     3   0     0.9016  0.3977  0.7227  0.7076  0.7383  2.86e-05  1.26e-05  2.29e-05  2.24e-05  2.34e-05
0     3   0.4   0.9040  0.4585  0.7079  0.7105  0.7584  0.0022    3.51e-04  0.0013    0.0011    0.0012
0     3   0.8   0.9072  0.6588  0.7393  0.7952  0.8443  0.0578    0.0028    0.0182    0.0105    0.0122
0     4   0     0.8952  0.2805  0.6798  0.6352  0.6762  2.84e-05  8.88e-06  2.15e-05  2.01e-05  2.14e-05
0     4   0.4   0.9084  0.3190  0.6765  0.6570  0.7039  0.0017    1.51e-04  0.0011    0.0008    0.0008
0     4   0.8   0.9040  0.4837  0.6874  0.7335  0.7884  0.0438    0.0013    0.0199    0.0094    0.0108
0     6   0     0.9020  0.1373  0.6638  0.5532  0.6278  2.86e-05  4.35e-06  2.10e-05  1.75e-05  1.99e-05
0     6   0.4   0.9052  0.1704  0.6556  0.5795  0.6524  0.0014    6.28e-05  0.0009    0.0006    0.0007
0     6   0.8   0.9044  0.2795  0.6393  0.6467  0.7223  0.0331    0.0006    0.0191    0.0075    0.0094
0     8   0     0.9060  0.0770  0.6490  0.5098  0.5717  2.87e-05  2.44e-06  2.06e-05  1.61e-05  1.81e-05
0     8   0.4   0.9032  0.1010  0.6491  0.5309  0.6031  0.0012    3.86e-05  0.0008    0.0005    0.0006
0     8   0.8   0.9072  0.1721  0.6222  0.6039  0.6736  0.0283    0.0003    0.0171    0.0067    0.0081
0.25  2   0     0.9040  0.8005  0.8486  0.8696  0.8791  0.1114    0.0028    0.0036    0.0049    0.0859
0.25  2   0.4   0.9040  0.8298  0.8556  0.8634  0.8914  0.1129    0.0031    0.0041    0.0056    0.0867
0.25  2   0.8   0.9040  0.8996  0.8994  0.8939  0.9095  0.1431    0.0062    0.0081    0.0097    0.1053
0.25  3   0     0.9096  0.6461  0.7683  0.8408  0.8761  0.1166    0.0019    0.0300    0.0109    0.0120
0.25  3   0.4   0.9044  0.6904  0.7649  0.7995  0.8707  0.1136    0.0024    0.0172    0.0083    0.0128
0.25  3   0.8   0.8956  0.8211  0.8445  0.8534  0.9141  0.1475    0.0034    0.0088    0.0095    0.0335
0.25  4   0     0.9004  0.4874  0.6926  0.7899  0.8318  0.1126    0.0011    0.0491    0.0128    0.0146
0.25  4   0.4   0.8996  0.5551  0.6923  0.7771  0.8329  0.1104    0.0011    0.0393    0.0116    0.0138
0.25  4   0.8   0.9044  0.7153  0.7596  0.8271  0.9000  0.1418    0.0028    0.0275    0.0153    0.0284
0.25  6   0     0.9064  0.2855  0.6511  0.7080  0.7897  0.1123    0.0012    0.0628    0.0159    0.0198
0.25  6   0.4   0.8996  0.3532  0.6369  0.7181  0.7935  0.1139    0.0008    0.0516    0.0138    0.0201
0.25  6   0.8   0.8956  0.4896  0.6390  0.7634  0.8454  0.1318    0.0010    0.0395    0.0154    0.0303
0.25  8   0     0.8988  0.1663  0.6252  0.6496  0.7211  0.1126    0.0004    0.0708    0.0168    0.0204
0.25  8   0.4   0.9000  0.2086  0.6158  0.6732  0.7466  0.1121    0.0004    0.0643    0.0175    0.0222
0.25  8   0.8   0.9048  0.3395  0.6072  0.7360  0.8060  0.1277    0.0006    0.0560    0.0177    0.0290


88 CHAPTER 6. A HIGHLY STRATIFIED TREATMENT STRATEGY

Figure 6.7: Impact vs. inferiority rate for P = 0.25 and different choices of τ for unequal subgroup sizes. For the explanation of symbols see Figure 6.4.

Figure 6.8: Impact vs. inferiority rate for P = 0 and different choices of τ for unequal subgroup sizes. For the explanation of symbols see Figure 6.4.


6.5 Discussion

Due to increasing progress in the development of biological or psychological markers that allow selecting an optimal treatment from a larger variety of treatments, highly stratified treatment strategies will be developed more often in the future. These new treatment strategies need to be tested and compared with a standard treatment in randomized clinical trials to show their superiority. The recent rise in interest in umbrella trials illustrates this development, although such trials often also include adaptive elements not considered in this chapter. New analysis approaches may be needed due to the drawbacks of current methods in this situation. Subgroup analyses may become unattractive if we cannot recruit a number of patients large enough for every subgroup to be of sufficient size. A simple overall comparison is always valid, but it may overlook the existence of subgroups with no effect. This issue may be highly relevant, as we should expect ineffective treatment suggestions for some subgroups in highly stratified treatment strategies, since it is unlikely that all subgroup-specific suggestions are based on good evidence. For example, if we build eight subgroups by combining three binary molecular markers, each suggesting a different add-on to a standard chemotherapy, we cannot expect to be able to rely on existing data from previous studies for each of the eight combinations of the positive/negative status of the three markers.

To fill this gap between subgroup and overall analysis, we suggest subset analyses, which allow us to demonstrate superiority of a new treatment strategy in at least one subset of subgroups. To obtain a first insight into this approach, we considered three performance measures: success rate, impact and inferiority rate. We studied these measures for different scenarios of subgroup heterogeneity. As expected, the subset approach tends to yield higher success rates than the subgroup approach, in particular when the heterogeneity of treatment effects is limited. When the heterogeneity of treatment effects is large enough, subset analysis can even obtain a higher success rate than the overall test. However, the simple overall test may still have a higher success rate if the heterogeneity is limited or the number of subgroups is large.
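The success-rate comparison of the three analysis strategies can be sketched with a small Monte-Carlo simulation. This is an illustration only: the function name, sample sizes, effect settings and the one-sided 5% level with plain Bonferroni corrections are our own assumptions, not the exact simulation design of Chapter 6.

```python
import itertools
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def success_rates(effects, n=50, sigma=1.0, alpha=0.05, n_sim=2000):
    """Estimate success rates of the overall test, subgroup analysis
    (Bonferroni over K subgroups) and subset analysis (Bonferroni over
    all 2^K - 1 unions of subgroups).

    `effects` holds the true treatment effect in each of the K subgroups;
    each subgroup contributes n patients per arm (illustrative settings)."""
    K = len(effects)
    effects = np.asarray(effects, float)
    subsets = [list(s) for r in range(1, K + 1)
               for s in itertools.combinations(range(K), r)]
    hits = np.zeros(3)
    se_group = sigma * np.sqrt(2.0 / n)            # SE of a subgroup estimate
    for _ in range(n_sim):
        est = rng.normal(effects, se_group)        # subgroup effect estimates
        z_overall = est.mean() / (se_group / np.sqrt(K))
        z_groups = est / se_group
        z_subsets = [est[s].mean() / (se_group / np.sqrt(len(s)))
                     for s in subsets]
        hits[0] += z_overall > norm.ppf(1 - alpha)
        hits[1] += np.any(z_groups > norm.ppf(1 - alpha / K))
        hits[2] += max(z_subsets) > norm.ppf(1 - alpha / len(subsets))
    return dict(zip(["overall", "subgroup", "subset"], hits / n_sim))

# Heterogeneous scenario: two active subgroups, two zero-effect ones
print(success_rates([0.5, 0.4, 0.0, 0.0]))
```

In such a heterogeneous configuration the best subset pools the active subgroups and gains power over single-subgroup tests, which is the qualitative pattern described above.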

When comparing all methods with respect to the balance between impact and inferiority rate, the advantage of subset analysis becomes obvious, as it balances the two at a level closer to the optimum. Restricting the selection of subsets to those with sufficient sample size increases both impact and inferiority rate, illustrating again the trade-off between the two measures. Consequently, to decide whether restricting the subset size is advantageous, we have to specify the costs or utilities for a change in impact or inferiority rate, respectively.

Selecting the subset with the minimum p-value, as done in M4 and M5, is only one possible way to perform subset analysis. Its optimality cannot be claimed from the results of our simulation studies. Alternatively, we may choose among all significant subsets using other criteria, e.g. the maximal estimated impact or the maximal size. However, the minimum p-value approach may still be rather attractive, since it automatically excludes subgroups with negative treatment effect estimates from the selection. In addition, we may consider the union of significant subsets even if the resulting set itself is not significant. Using closed testing procedures, we may also be able to enlarge the


set of significant subsets. Improvements may also result from taking logical dependencies into account when correcting for multiplicity. For example, if we split a subset S into S1 and S2, the null hypothesis H0^S can only be false if at least one of H0^S1 or H0^S2 is false. Finally, we could try to take clinical background knowledge or expectations into consideration, as there are clinically motivated rationales behind the suggested treatments. For instance, we may expect similar effects in subgroups with a similar marker constellation, or that the number of subgroups with no effect is limited. Subset analysis with a size restriction can be considered a first step in this direction.

In this chapter, we focused on p-values with correction for multiple testing, following the established regulatory principle of requiring a ‘proof’ of the superiority of the new treatment in the population we want to offer the treatment to. Although the subset approach considered here already ensures some control of the risk of including subgroups with true negative effects, regulatory agencies may still require subgroup analyses on top of our analysis for the confirmatory efficacy analysis, as well as safety analyses [41]. Note that our approach is not applicable to safety outcomes, as averaging of effects is less justifiable for such outcomes.


Chapter 7: Concluding remarks and further research

Personalized medicine aims to identify the optimal treatments and dosages for patients based on both their diseases and patient-related characteristics. The increasing development of biomarkers and molecularly targeted agents promises applications of personalized medicine in various areas, including diagnosis, prognosis and the selection of targeted therapies. As conducting clinical trials is a crucial step in drug development, trials are also increasingly planned and applied in the development of personalized medications. In this thesis, we investigated clinical trial design and statistical analysis issues for personalized, marker-based treatment strategies. In particular, we considered various methodologies for subgroup and interaction analyses, as well as multiple testing procedures in randomized controlled trials. We compared current statistical approaches and developed new methodologies in order to derive recommendations for future clinical trials in personalized medicine.

As biomarkers, especially predictive biomarkers, play an increasingly important role in personalized medicine, individualized stratified treatment strategies become plausible [78, 79, 125]. Many predictive biomarkers are developed for treatment selection, to identify patients who are more likely to benefit from a particular treatment, and many clinical trial designs incorporating marker information have been proposed in the past decades. Chapter 2 summarized five common clinical trial designs for predictive biomarkers, i.e., the randomized-all design, the biomarker-strategy design, the enrichment or targeted design, the biomarker-stratified or interaction design, and the individual profile design. Every design has its own advantages and drawbacks and should be applied with caution in the appropriate situation. For instance, the randomized-all design ignores the biomarker status in the planning stage, so it is suitable when no biomarker is available or accessible. If a biomarker is accessible at the planning phase, but it is unclear whether the biomarker is predictive or prognostic, the interaction design can be chosen in order to perform subgroup analyses retrospectively. However, when the biomarker has been established as predictive with firm evidence, the enrichment or targeted design is more likely to benefit the targeted subpopulation of patients and to harm fewer patients from other subpopulations. When the evidence for a predictive biomarker is not strong enough, the biomarker-strategy design may be better suited to confirm the prediction of the biomarker, although this design may be less efficient compared to the targeted design [56]. The individual profile design can be considered a special case of the biomarker-strategy design with multiple biomarkers or subgroups compared to the standard control [129]. In this thesis, we have discussed and applied the randomized-all design and the individual profile design incorporating other statistical considerations, e.g. subgroup selection, interaction testing, as well as multiple testing issues.

Our starting point is the simplest situation of subgroup analysis, where only one pre-specified marker is considered relevant for the treatment selection in confirmatory RCTs. Several statistical analysis strategies have been proposed in the literature for this situation, categorized by the choice and sequence of subgroup testing, i.e., in the marker-positive subgroup, the marker-negative subgroup and/or the overall population. We discussed four different approaches in Chapter 2: the fixed-sequence, MaST, fallback and treatment-by-biomarker interaction approaches. Depending on the assumptions about the biomarker involved, the analysis approach should be chosen carefully. Fixed-sequence approaches pre-specify the order of the hypothesis tests and reject them one by one in sequence; they are therefore suitable for studies involving a well-established predictive biomarker. MaST is appropriate when it is important to consider the treatment effect in both the biomarker-positive and biomarker-negative subgroups. When the treatment effect may be homogeneous in the overall population, it is reasonable to apply the fallback approach. The treatment-by-biomarker interaction approach should be chosen when there is limited evidence about a difference in treatment effects between the two subgroups. There are ongoing discussions regarding the various statistical analysis strategies in RCTs involving a single biomarker [37, 45], and this thesis contributes research findings to this topic. Many multiple testing procedures have been proposed or can be applied for the different statistical analysis approaches in the literature [4, 60, 121, 122].

In Chapter 4, we focused on multiple testing procedures aiming at establishing a treatment effect in at least one of two populations, i.e., the overall population and the marker-positive subgroup [4, 5, 74, 105, 106, 128]. We compared feedback procedures with multiple testing procedures based on weighted FWER-controlling strategies and the closed testing principle. Five non-parametric and parametric procedures were considered for testing the treatment effect in the overall population (Ho) and/or the targeted subgroup (H+): the Song-Chi procedure from the family of feedback procedures, and four procedures with weighting strategies, i.e., the weighted Bonferroni, fallback, weighted Holm and weighted parametric procedures. We compared the rejection regions and three powers, defined as the power to reject Ho, the power to reject H+, and the power to reject at least one of Ho and H+, for all five selected procedures through simulation studies in different scenarios.
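For two one-sided p-values po and pp with weights w and 1 − w, the three non-parametric procedures differ only in how α is propagated after a first rejection. A minimal sketch (the function names, the ordering of the fallback test with Ho first, and the closed-testing phrasing of weighted Holm are our own illustration):

```python
def weighted_bonferroni(po, pp, w, alpha=0.05):
    """No alpha-propagation: Ho and H+ are tested at local levels
    w*alpha and (1 - w)*alpha, full stop."""
    return po <= w * alpha, pp <= (1 - w) * alpha

def fallback(po, pp, w, alpha=0.05):
    """Ho is tested first at w*alpha; only if it is rejected does H+
    inherit the full alpha, otherwise H+ keeps its share (1 - w)*alpha."""
    rej_o = po <= w * alpha
    rej_p = pp <= (alpha if rej_o else (1 - w) * alpha)
    return rej_o, rej_p

def weighted_holm(po, pp, w, alpha=0.05):
    """Closed-testing form of weighted Holm: once the intersection of
    Ho and H+ is rejected by a weighted Bonferroni test, each single
    hypothesis is tested at the full alpha."""
    inter = po <= w * alpha or pp <= (1 - w) * alpha
    return inter and po <= alpha, inter and pp <= alpha

# Same p-values, equal weights: Holm rejects at least as much as the
# fallback, which rejects at least as much as weighted Bonferroni.
print(weighted_bonferroni(0.02, 0.04, 0.5))   # (True, False)
print(fallback(0.02, 0.04, 0.5))              # (True, True)
print(weighted_holm(0.02, 0.04, 0.5))         # (True, True)
```

The example shows the α-propagation ordering that drives the power ranking discussed below: Bonferroni propagates nothing, the fallback passes α forward only along its fixed order, and Holm passes the full α in either direction.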

The main differences among the five procedures regarding rejection regions and powers are between non-parametric and parametric procedures, and between feedback procedures and procedures with weighting strategies. On the one hand, all three non-parametric procedures, i.e., the weighted Bonferroni test, the fallback procedure and the weighted Holm procedure, have the same rejection rules and rejection regions for the intersection hypothesis; the differences are due to different α-propagation


rules in the second stage; as a result, the powers obtained from the different procedures differ accordingly. Under the same setting of weights, the weighted Bonferroni test obtains the lowest powers in general, since no α is propagated, whereas the weighted Holm procedure propagates the significance level up to the full α for both hypotheses in the second stage and thus obtains higher powers than the other two non-parametric procedures. By considering the correlation between the two test statistics explicitly, the weighted parametric test has wider rejection regions for both hypotheses than the non-parametric procedures and also obtains the highest powers among all procedures with weighting strategies. On the other hand, due to the consistency constraint in the Song-Chi procedure, H+ can only be rejected as long as po ≤ α1*, which can make its rejection region much narrower than that of other parametric procedures if a conservative restriction is chosen. As a result, the Song-Chi procedure provides less power than the weighted parametric procedure, sometimes even less than the non-parametric procedures. Feedback procedures can be viewed as placing the weights on the p-values instead of the test statistics; they therefore also allow balancing the power of rejecting the hypothesis in the overall population against the power of rejecting the hypothesis in the targeted subgroup by choosing the pre-specified significance level of Ho. However, the Song-Chi procedure loses this property due to the restriction constraint α1*. Furthermore, the Song-Chi procedure performs poorly when the treatment effect also exists in the complementary subgroup; consequently, we should be more cautious when applying it. In further research, it may be interesting to investigate how to choose suitable values of α1 and α1*, as well as other feedback procedures that retain the advantages of the Song-Chi procedure but perform better in more general scenarios.
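The gain of the weighted parametric test over weighted Bonferroni can be sketched numerically: the local levels are inflated by a common factor c ≥ 1 until the procedure exhausts the full α, given the known correlation ρ between the two z-statistics (for an overall and a subgroup statistic, ρ is roughly the square root of the subgroup prevalence). The function name and the root-finding approach are our own illustration.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def weighted_parametric_bounds(w, rho, alpha=0.05):
    """Find c >= 1 so that, under the global null with corr(Zo, Z+) = rho,
    P(Zo > z_{1 - c*w*alpha} or Z+ > z_{1 - c*(1-w)*alpha}) = alpha.
    Returns the two one-sided critical values."""
    cov = [[1.0, rho], [rho, 1.0]]

    def excess_size(c):
        zo = norm.ppf(1 - c * w * alpha)           # critical value for Ho
        zp = norm.ppf(1 - c * (1 - w) * alpha)     # critical value for H+
        p_none = multivariate_normal.cdf([zo, zp], mean=[0.0, 0.0], cov=cov)
        return (1.0 - p_none) - alpha              # actual size minus target

    # Bonferroni (c = 1) is conservative; c = 1/max(w, 1-w) overshoots.
    c = brentq(excess_size, 1.0, 1.0 / max(w, 1 - w))
    return norm.ppf(1 - c * w * alpha), norm.ppf(1 - c * (1 - w) * alpha)

# E.g. equal weights and a subgroup prevalence of 50%, so rho ~ sqrt(0.5):
# the critical values come out below the Bonferroni value of 1.96.
zo, zp = weighted_parametric_bounds(w=0.5, rho=np.sqrt(0.5))
```

The resulting critical values are strictly smaller than the plain weighted Bonferroni values, which is exactly the "wider rejection regions" property described above.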

Next, we addressed the problem of randomized-all clinical trials with a single marker where, however, the marker was not considered at the beginning of the planning phase, but a post hoc subgroup analysis is initiated or required after a significant treatment effect (TE) has been shown in the overall population, e.g. for regulatory requirements or health technology assessment. We propose a framework to assess and compute the long-term effects of different strategies to perform subgroup analysis in this special situation. We consider two performance measures, i.e., the average post-study TE for patients in all studies (E) and the fraction of patients with a negative TE in the positive studies (P). Five families of existing decision rules, i.e., superiority testing, inferiority testing, the limiting cases of no subgroup analysis or comparing the estimate to 0, the interaction gatekeeper, and the fraction-of-estimate rule, are compared under different assumptions on subgroup-specific and individual TEs. The optimal decision rule should provide a good balance between the two measures, maximizing E while minimizing P. Optimistic, moderate and pessimistic scenarios are assumed for the various true TEs across different therapeutic areas.
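The trade-off between E and P can be illustrated with a small simulation. The distributional assumptions and parameter values below are our own, chosen only to mimic the qualitative pattern described in the text, not the scenarios of Chapter 5.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def evaluate(rule, mu=0.2, tau=0.2, K=4, n=100, sigma=1.0, n_sim=4000):
    """Monte-Carlo sketch of the Chapter-5-style comparison: E is the mean
    true TE delivered to patients under the rule's recommendations, P the
    fraction of patients recommended a truly inferior (negative-TE)
    treatment.  All parameter values are illustrative assumptions."""
    se = sigma * np.sqrt(2.0 / n)                  # SE of a subgroup estimate
    E = P = 0.0
    for _ in range(n_sim):
        true_te = rng.normal(mu, tau, K)           # subgroup-specific TEs
        est = rng.normal(true_te, se)              # their estimates
        if est.mean() / (se / np.sqrt(K)) <= norm.ppf(0.95):
            continue                               # study not overall positive
        rec = rule(est, se)                        # per-subgroup recommendation
        E += np.sum(true_te * rec) / K
        P += np.mean((true_te < 0) & rec)
    return E / n_sim, P / n_sim

no_subgroup = lambda est, se: np.ones(len(est), dtype=bool)   # recommend all
estimate_gt0 = lambda est, se: est > 0                        # estimate vs 0
superiority = lambda est, se: est / se > norm.ppf(0.95)       # 5% superiority

for name, rule in [("no subgroup analysis", no_subgroup),
                   ("estimate > 0", estimate_gt0),
                   ("superiority at 5%", superiority)]:
    print(name, evaluate(rule))
```

Per simulated study the three recommendation sets are nested (superiority implies estimate > 0, which implies "recommend all"), so the stricter the rule, the smaller both P and E, mirroring the pattern reported below.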

Although the values of the two measures differ considerably across the three scenarios, the patterns are very similar. Compared to no subgroup analysis using the overall test, performing subgroup analysis reduces the fraction of patients being recommended a worse treatment (P), and the stricter the criterion we apply, e.g., superiority testing at 5%, the larger the reduction in P, but also the larger the reduction in the overall gain (E). On the other hand, the simple decision rule of comparing the estimate of the subgroup effect with 0 provides a "win-win" balance between the two measures, i.e., it results only in a very small loss of


E in some cases, and we can even obtain some gain if both subgroup and individual variations are large. At the same time, it reduces P by a non-negligible amount, especially in the case of a certain degree of subgroup variation. Although interaction tests are often required and recommended before performing subgroup analyses, we found that the interaction-gatekeeper decision rules do not provide advantages in general, as there always exists a significance level of superiority testing which yields a larger E and a smaller P simultaneously. Similarly, the decision rule adopted by the Japanese guideline for multi-regional clinical trials, i.e., requiring the subgroup effect to be at least 50% of the overall TE, offers no advantage. The performance of each decision rule depends strongly on the size of the subgroups. If we reduce the size of each subgroup, e.g., increase the number of subgroups while keeping the total sample size fixed, both E and P decrease for all decision rules except no subgroup analysis. In particular, the reductions in E are very small for comparing the estimate to 0 but dramatically large for subgroup analyses, while the reductions in P are very small for all methods and do not seem to depend on the number of subgroups. Conversely, if we increase the subgroup size by pooling studies assumed to be homogeneous, both E and P increase as the total sample size increases. We obtain substantial gains in E for all decision rules; however, this may be due to the fact that larger sample sizes allow demonstrating smaller TEs, which may not be clinically relevant. In summary, we should always remember that, under the same circumstances, we may need to either reduce the number of subgroups of interest or increase sample sizes to ensure sufficient power.

Despite some limitations, e.g., the strong assumptions on the patient population size and the subgroup sizes made for simplicity, the framework proposed in Chapter 5 allows us to compare subgroup analysis strategies in RCTs in which an overall treatment effect has been declared. Many statisticians have been aware of this problem, but it has been discussed in depth only recently, especially by regulatory agencies [3, 41]. The EMA guideline on confirmatory subgroup analysis [41] shows an increasing interest in performing subgroup analyses in the presence of a significant positive overall effect. Two of the three scenarios discussed in the guideline explicitly consider the situation where a subgroup analysis may be conducted after the overall TE has been demonstrated. The first scenario describes an interest in verifying the treatment efficacy across the subgroups under the condition of an overall statistically persuasive treatment efficacy. It addresses the credibility of biological plausibility and directional consistency, which is widely considered in multi-regional clinical trials [61, 75] or clinical trials covering multiple diseases or multiple tumor types in oncology, e.g., basket trials [84, 102]. The second scenario mainly addresses concerns from the perspective of health technology assessment or drug licensing and labeling. It considers the situation where the clinical data present an overall statistically persuasive efficacy but a borderline or unconvincing benefit/risk assessment, such that it is of interest to identify a post hoc subgroup as part of the confirmatory testing strategy. This is a clear change compared to earlier discussions about subgroup analyses in the literature. Alosh et al. [3] also mentioned the scenario of significant treatment efficacy in the overall population but not in the subgroups.
In addition, they showed an interest in subgroup analysis methods that are not based on significance testing and/or interaction tests; in particular, the approach of simply looking at the estimate, as well as Bayesian analyses, are discussed. The former approach is closely in line with our considerations in Chapter 5, which suggest that it may be reasonable to look at


more liberal approaches. Therefore, we believe that, with some further refinement, e.g., considering more realistic assumptions or alternative approaches such as Bayesian methods, our framework can play an important role in providing evidence for the choice of subgroup analysis methods in the scenarios and from the perspectives mentioned above.

The third topic we addressed in this thesis is a more complex but realistic situation involving multiple biomarkers in RCTs. In Chapter 6, we considered a design similar to the individual profile or marker-based and stratified design (Figure 2.9) and the umbrella trial design (Figure 2.10) described in Section 2.2.5. We assume that a stratified treatment strategy already exists which depends on a marker pattern and divides the whole population into small subgroups. Patients from different subgroups are suggested to receive different treatments according to their marker information. The recent rise in interest in umbrella trials illustrates this development [69, 71, 85, 107], although we did not consider adaptive designs in this chapter, as some well-known trials did. To compare this highly stratified strategy with the standard treatment in a randomized clinical trial demonstrating its superiority, we presented a framework to compare a new approach, subset analysis, with simpler ones: subgroup analysis, overall analysis performing an overall test only, and the combination of both. Subset analysis aims to demonstrate a treatment effect for a subset of patients built by joining several subgroups, instead of each single subgroup. Three measures, i.e., success rate, impact and inferiority rate, are considered in this framework. We first compared the success rate, defined as the probability of identifying at least one non-empty significant subset, among the five approaches with different numbers of subgroups under different assumptions on between-subgroup and individual variations, including scenarios containing zero-effect subgroups and individuals, as we should also expect no benefit from the new strategy in some subpopulations.

As expected, we found subset analysis to be more powerful than subgroup analysis, obtaining a higher success rate regardless of the number of subgroups in all scenarios of our simulations. The overall analysis may be more powerful if the population is homogeneous or there are many subgroups, but when the heterogeneity is large or the number of subgroups is small, subset analyses show clearer advantages, and even subgroup analysis can be more powerful than the overall analysis in some cases. This suggests that, if we have prior knowledge of high heterogeneity across predictive biomarkers in the study, it is better not to perform only an overall analysis for the overall population but to investigate the subgroups further. The other two measures, impact and inferiority rate, are conceptually the same measures as E and P proposed in Chapter 5, but are computed in a different way. Impact is defined as the expected average change in the outcome when patients are treated as recommended; the inferiority rate is defined as the fraction of patients recommended to switch to an inferior treatment. When investigating the relation between impact and inferiority rate, a good approach should maximize impact and minimize the inferiority rate. Overall analysis and subgroup analysis always mark the two extremes, obtaining either the largest or the smallest values of both impact and inferiority rate, and the combination of both lies in the middle, whereas the subset analysis approaches perform better than all three methods by showing a better balance of impact and inferiority rate at a level closer to the optimal point. Subset analysis restricted to subsets of sufficiently large size could


improve both the success rate and the impact, but the inferiority rate increases accordingly, which yields another trade-off.
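The selection step of M4 and M5, picking the subset with the smallest p-value, can be sketched as follows. A plain Bonferroni correction over all 2^K − 1 subsets and a common standard error per subgroup estimate are simplifying assumptions of this sketch, not the exact procedure of Chapter 6.

```python
import itertools
import numpy as np
from scipy.stats import norm

def min_p_subset(est, se_group, alpha=0.05):
    """Among all 2^K - 1 unions of subgroups, return the subset with the
    smallest Bonferroni-adjusted one-sided p-value, or None if nothing
    is significant.  `est` holds subgroup effect estimates that share a
    common standard error `se_group` (illustrative assumption)."""
    K = len(est)
    subsets = [s for r in range(1, K + 1)
               for s in itertools.combinations(range(K), r)]
    best, best_p = None, 1.0
    for s in subsets:
        z = np.mean([est[i] for i in s]) / (se_group / np.sqrt(len(s)))
        p = (1 - norm.cdf(z)) * len(subsets)       # Bonferroni correction
        if p < best_p:
            best, best_p = s, p
    return (best if best_p <= alpha else None), min(best_p, 1.0)

# Strong effects in groups 0 and 1, none in 2 and 3: the min-p rule
# joins the two active subgroups and leaves the negative one out.
sel, p = min_p_subset(np.array([0.6, 0.5, 0.0, -0.1]), se_group=0.2)
```

Because a subgroup with a negative estimate can only dilute the pooled z-statistic, the minimum p-value rule automatically drops such subgroups, which is the property noted in the discussion of Chapter 6.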

The framework proposed in Chapter 6 gives first insights into the new approach of analyzing subsets instead of each single subgroup. There is still room for improvement in future research, especially with respect to increasing the sample size of the selected subset in order to benefit as many patients as possible. First, we could consider alternative criteria, e.g., maximal estimated impact or maximal size, to select a significant subset among all subsets with significant treatment effects. We may also enlarge the selected subset by selecting the union of significant subsets even if this union itself is not significant. Second, we may try to increase the success rate by decreasing the number of tests performed, e.g., by taking logical dependencies into account or by considering prior knowledge such as clinical expectations about similarities in the treatment effects of different subgroups. In this way, we may also address the problem that, with an increasing number of subgroups, testing all 2^K − 1 subsets becomes computationally demanding. With the fast development of molecularly targeted agents, situations involving multiple markers will become more and more common. This framework is only a first step for the investigation of multiple markers, and we expect that improved approaches for this situation will be developed in the future.

Despite their limitations, the two frameworks introduced in Chapter 5 and Chapter 6 both considered and compared several existing methods. In particular, both frameworks focused on the heterogeneity among patients within the same subgroup and considered the treatment effect at the individual patient level. To our knowledge, this idea is not widely considered in the literature on subgroup analyses [13]. We proposed the same performance measures in the two frameworks, despite the different ways of calculation. The ultimate goal is to select a treatment that benefits as many patients as possible and harms as few patients as possible. Consequently, we should aim to maximize the expected average overall gain and to minimize the overall rate of patients being recommended an inferior treatment if the new treatment is declared superior and given to the selected subgroup or subset.

Although there is still much room for improvement, our work provides at least some insights into statistical issues in personalized treatment strategies in randomized clinical trials. It may shed light on further research on this topic, and we believe the results of this thesis are relevant for all clinicians and statisticians involved in the planning and analysis of studies on personalized treatment strategies, in both industrial and academic settings.


Bibliography

[1] Academy of Medical Science (2015). Stratified, personalised or P4 medicine: a new direction for placing the patient at the centre of healthcare and health education (technical report). Retrieved January 6, 2016: https://www.acmedsci.ac.uk/viewFile/564091e072d41.pdf.

[2] Alosh, M., Bretz, F., and Huque, M. F. (2014). Advanced multiplicity adjustment methods in clinical trials. Statistics in Medicine, 33:693–713.

[3] Alosh, M., Fritsch, K., Huque, M., Mahjoob, K., Pennello, G., Rothmann, M., Russek-Cohen, E., Smith, F., Wilson, S., and Yue, L. (2015). Statistical considerations on subgroup analysis in clinical trials. Statistics in Biopharmaceutical Research, 7:286–303.

[4] Alosh, M. and Huque, M. F. (2009). A flexible strategy for testing subgroups and overall population. Statistics in Medicine, 28:3–23.

[5] Alosh, M. and Huque, M. F. (2013). Multiplicity considerations for subgroup analysis subject to consistency constraint. Biometrical Journal, 3:444–462.

[6] American Red Cross. History of blood transfusion. Retrieved December 13, 2015: http://www.redcrossblood.org/learn-about-blood/history-blood-transfusion.

[7] Assmann, S. F., Pocock, S. J., Enos, L. E., and Kasten, L. E. (2000). Subgroup analysis and other (mis)uses of baseline data in clinical trials. The Lancet, 355:1064–1069.

[8] Ballman, K. V. (2015). Biomarker: Predictive or prognostic? Journal of Clinical Oncology, published online: doi:10.1200/JCO.2015.63.3651.

[9] Bender, R., Koch, A., Skipka, G., Kaiser, T., and Lange, S. (2010). No inconsistent trial assessments by NICE and IQWiG: different assessment goals lead to different assessment results regarding subgroup analyses. Journal of Clinical Epidemiology, 63:1305–1307.


[10] Berger, R. L. (1982). Multiparameter hypothesis testing and acceptance sampling. Technometrics, 24:295–300.

[11] Biomarkers Definition Working Group (2001). Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clinical Pharmacology & Therapeutics, 69:89–95.

[12] BioPharmNet. Subgroup analysis. Retrieved December 28, 2015: http://biopharmnet.com/subgroup-analysis/.

[13] Brannath, W. (2016). Methoden zur Quantifizierung von Behandlungseffektvarianzen und deren Verwendung zur Beurteilung von Subgruppenheterogenität [Methods for quantifying treatment effect variances and their use for assessing subgroup heterogeneity]. Retrieved January 28, 2016: http://www.vetmed.fu-berlin.de/einrichtungen/institute/we16/kolloquium/abstracts_2015_2016/2016_01_19_Brannath.pdf.

[14] Bretz, F., Brannath, W., Maurer, W., and Posch, M. (2009). A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine, 28:586–604.

[15] Bretz, F., Hothorn, T., and Westfall, P. (2011a). Multiple Comparisons Using R. Chapman & Hall/CRC.

[16] Bretz, F., Posch, M., Glimm, E., Klinglmueller, F., Maurer, W., and Rohmeyer, K. (2011b). Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes, or parametric tests. Biometrical Journal, 6:894–913.

[17] Brookes, S. T., Whitely, E., Egger, M., Smith, G. D., Mulheran, P. A., and Peters, T. J. (2004). Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology, 57:229–236.

[18] Buyse, M. and Michiels, S. (2010). Biomarkers and surrogate endpoints in clinical trials. In: Oncology Clinical Trials (Kelly, W. K. and Halabi, S., eds.), pages 215–226.

[19] Buyse, M., Michiels, S., Sargent, D. J., Grothey, A., Matheson, A., and de Gramont, A. (2011). Integrating biomarkers in clinical trials. Expert Review, 11(2):171–182.

[20] Buyse, M., Sargent, D. J., Grothey, A., Matheson, A., and de Gramont, A. (2010). Biomarkers and surrogate end points—the challenge of statistical validation. Nature Reviews Clinical Oncology, 7:309–317.

[21] Buyse, M., Vangeneugden, T., Bijnens, L., Renard, D., Burzykowski, T., Geys, H., and Molenberghs, G. (2003). Validation of biomarkers as surrogates for clinical endpoints. In: Biomarkers in Clinical Drug Development (Bloom, J. C. and Dean, R. A., eds.), pages 149–168.

[22] CAPRIE Steering Committee (1996). A randomised, blinded, trial of clopidogrel versus aspirin in patients at risk of ischaemic events (CAPRIE). The Lancet, 348:1329–1339.


[23] Chau, C. H., Rixe, O., McLeod, H., and Figg, W. D. (2008). Validation of analytical methods for biomarkers employed in drug development. Clinical Cancer Research, 14(19):5967–5976.

[24] ClinicalTrials.gov. Glossary of common site terms. Retrieved December 13, 2015: https://clinicaltrials.gov/ct2/about-studies/glossary.

[25] Cobo, M., Isla, D., Massuti, B., Montes, A., Sanchez, J. M., Provencio, M., Viñolas, N., Paz-Ares, L., Lopez-Vivanco, G., Muñoz, M. A., Felip, E., Alberola, V., Camps, C., Domine, M., Sanchez, J. J., Sanchez-Ronco, M., Danenberg, K., Taron, M., Gandara, D., and Rosell, R. (2007). Customizing cisplatin based on quantitative excision repair cross-complementing 1 mRNA expression: a phase III trial in non-small-cell lung cancer. Journal of Clinical Oncology, 25:2747–2754.

[26] DeMets, D., Friedman, L., and Furberg, C. (2010). Fundamentals of Clinical Trials. Springer, 4th edition.

[27] Dent, L. and Raftery, J. (2011). Treatment success in pragmatic randomised controlled trials: a review of trials funded by the UK health technology assessment programme. Trials, 12:Article 109.

[28] Dessì, N., Pascariello, E., and Pes, B. (2013). A comparative analysis of biomarker selection techniques. BioMed Research International, 2013:Article 387673. doi:10.1155/2013/387673.

[29] Djulbegovic, B., Kumar, A., Miladinovic, B., Reljic, T., Galeb, S., Mhaskar, A., Mhaskar, R., Hozo, I., Tu, D., Stanton, H., et al. (2013). Treatment success in cancer: industry compared to publicly sponsored randomized controlled trials. PLoS ONE, 8(3):Article e58711.

[30] Djulbegovic, B., Kumar, A., Soares, H., Hozo, I., Bepler, G., Clarke, M., and Bennett, C. (2008). Treatment success in cancer. Archives of Internal Medicine, 168(6):632–642.

[31] Dmitrienko, A. and Lipkovich, I. (2014). Exploratory subgroup analysis: post-hoc subgroup identification in clinical trials. Retrieved December 28, 2015: http://www.ema.europa.eu/docs/en_GB/document_library/Presentation/2015/03/WC500183524.pdf.

[32] Dmitrienko, A., Muysers, C., Fritsch, A., and Lipkovich, I. (2015). General guidance on exploratory and confirmatory subgroup analysis in late-stage clinical trials. Journal of Biopharmaceutical Statistics, accepted author version.

[33] Dmitrienko, A., Offen, W. W., and Westfall, P. H. (2003). Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Statistics in Medicine, 22:2387–2400.

[34] Dmitrienko, A. and D'Agostino, R. B., Sr. (2013). Traditional multiplicity adjustment methods in clinical trials. Statistics in Medicine, 32:5172–5218.

[35] Dmitrienko, A., D'Agostino, R. B., Sr., and Huque, M. F. (2013). Key multiplicity issues in clinical drug development. Statistics in Medicine, 32:1079–1111.


[36] Dmitrienko, A., Tamhane, A. C., and Bretz, F. (2010). Multiple Testing Problems in Pharmaceutical Statistics. Chapman & Hall/CRC.

[37] Eng, K. H. (2014). Randomized reverse marker strategy design for prospective biomarker validation. Statistics in Medicine, 33:3089–3099.

[38] European Medicines Agency (2000a). Points to consider on multiplicity issues in clinical trials. Released in January 2000: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003640.pdf.

[39] European Medicines Agency (2000b). Points to consider on validity and interpretation of meta-analyses, and one pivotal study. Released in October 2000: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003659.pdf.

[40] European Medicines Agency (2010). Concept paper on the need for a guideline on the use of subgroup analyses in randomised controlled trials. Released on April 22, 2010: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2010/05/WC500090116.pdf.

[41] European Medicines Agency (2014). Draft guideline on the investigation of subgroups in confirmatory clinical trials. Released in January 2014: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2014/02/WC500160523.pdf.

[42] FDA Advisory Committee (1997). 82nd meeting of the Cardiovascular and Renal Drugs Committee. Released on October 24, 1997: http://www.fda.gov/ohrms/dockets/ac/97/transcpt/3338t2.pdf.

[43] Freidlin, B., Korn, E. L., and Gray, R. (2014). Marker sequential test (MaST) design. Clinical Trials, 11:19–27.

[44] Freidlin, B., McShane, L. M., and Korn, E. L. (2010). Randomized clinical trials with biomarkers: design issues. Journal of the National Cancer Institute, 102:152–160.

[45] Freidlin, B., McShane, L. M., and Korn, E. L. (2013). Phase III clinical trials that integrate treatment and biomarker evaluation. Journal of Clinical Oncology, 31:3158–3161.

[46] Gail, M. and Simon, R. (1985). Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, 41(2):361–372.

[47] Genz, A. and Bretz, F. (1999). Numerical computation of the multivariate t probabilities with application to power calculation of multiple contrasts. Journal of Statistical Computation and Simulation, 63:361–378.


[48] Hasford, J., Bramlage, P., Koch, G., Lehmacher, W., Einhaeupl, K., and Rothwell, P. (2010). Inconsistent trial assessments by the National Institute for Health and Clinical Excellence and IQWiG: standards for the performance and interpretation of subgroup analyses are needed. Journal of Clinical Epidemiology, 63:1298–1304.

[49] Hay, E. M., Dunn, K. M., Hill, J. C., Lewis, M., Mason, E. E., Konstantinou, K., Sowden, G., Somerville, S., Vohora, K., Whitehurst, D., and Main, C. J. (2008). A randomised clinical trial of subgrouping and targeted treatment for low back pain compared with best current care. The STarT Back trial study protocol. BMC Musculoskeletal Disorders, 9:58. doi:10.1186/1471-2474-9-58.

[50] Hemmings, R. (2011). Subgroup analyses – scene setting from the EU regulators' perspective. Retrieved December 28, 2015: http://www.ema.europa.eu/docs/en_GB/document_library/Presentation/2011/11/WC500118093.pdf.

[51] Hemmings, R. (2012). Subgroup analyses: important, infuriating and intractable. Retrieved December 28, 2015: https://www.efspi.org/documents/activities/international%20events/3importantinfuriatingandintractablerobhemmings.pdf.

[52] Hemmings, R. (2014). An overview of statistical and regulatory issues in the planning, analysis, and interpretation of subgroup analyses in confirmatory clinical trials. Journal of Biopharmaceutical Statistics, 24:4–18.

[53] Hill, J. C., Whitehurst, D. G. T., Lewis, M., Bryan, S., Dunn, K. M., Foster, N. E., Konstantinou, K., Main, C. J., Mason, E., Somerville, S., Sowden, G., Vohora, K., and Hay, E. M. (2011). Comparison of stratified primary care management for low back pain with current best practice (STarT Back): a randomised controlled trial. The Lancet, published online. doi:10.1016/S0140-6736(11)60937-9.

[54] Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple significance testing. Biometrika, 75:800–802.

[55] Hochberg, Y. and Tamhane, A. C. (2008). Multiple Comparison Procedures. John Wiley & Sons, Inc.

[56] Hoering, A., LeBlanc, M., and Crowley, J. J. (2008). Randomized phase III clinical trial designs for targeted agents. Clinical Cancer Research, 14(14):4358–4367.

[57] Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70.

[58] Hommel, G., Bretz, F., and Maurer, W. (2007). Powerful short-cuts for multiple testing procedures with special reference to gatekeeping strategies. Statistics in Medicine, 26:4063–4073.


[59] Hunter, D. J., Losina, E., Guermazi, A., Burstein, D., Lassere, M. N., and Kraus, V. (2010). A pathway and approach to biomarker validation and qualification for osteoarthritis clinical trials. Current Drug Targets, 11(5):536–545.

[60] Huque, M. F. and Alosh, M. (2008). A flexible fixed-sequence testing method for hierarchically ordered correlated multiple endpoints in clinical trials. Journal of Statistical Planning and Inference, 138:321–335.

[61] Ikeda, K. and Bretz, F. (2010). Sample size and proportion of Japanese patients in multi-regional trials. Pharmaceutical Statistics, 9:207–216.

[62] International Conference on Harmonisation (1999a). General considerations for clinical trials (ICH E8). Statistics in Medicine, 18:1905–1942.

[63] International Conference on Harmonisation (1999b). Statistical principles for clinical trials (ICH E9). Statistics in Medicine, 18:1905–1942.

[64] IQWiG (2006). Final report plan for the evaluation: "Clopidogrel versus acetylsalicylic acid in the secondary prevention of vascular events". Retrieved January 21, 2008: http://www.iqwig.de/download/A04-1A_Abschlussbericht_Clopidogrel_versus_ASS_in_der_Sekundaerprophylaxe.pdf.

[65] Janes, H., Pepe, M. S., McShane, L. M., Sargent, D. J., and Heagerty, P. J. (2015). The fundamental difficulty with evaluating the accuracy of biomarkers for guiding treatment. Journal of the National Cancer Institute, 107(8). doi:10.1093/jnci/djv157.

[66] Japanese Pharmaceuticals and Medical Devices Agency (2007). Basic principles on global clinical trials. Published on September 28, 2007: http://www.pmda.go.jp/operations/notice/2007/file/0928010-e.pdf.

[67] Jones, H., Ohlssen, D., Neuenschwander, B., Racine, A., and Branson, M. (2011). Bayesian models for subgroup analysis in clinical trials. Clinical Trials, 8:129–143.

[68] Kaplan, R. (2015). The FOCUS4 design for biomarker stratified trials. Chinese Clinical Oncology, 4(3):35. doi:10.3978/j.issn.2304-3865.2015.02.03.

[69] Kaplan, R., Maughan, T., Crook, A., Fisher, D., Wilson, R., Brown, L., and Parmar, M. (2013). Evaluating many treatments and biomarkers in oncology: a new design. Journal of Clinical Oncology, 31:4562–4568.

[70] Karapetis, C. S., Khambata-Ford, S., Jonker, D. J., O'Callaghan, C. J., Tu, D., Tebbutt, N. C., et al. (2008). K-ras mutations and benefit from cetuximab in advanced colorectal cancer. New England Journal of Medicine, 359:1757–1765.


[71] Kim, E. S., Herbst, R. S., Wistuba, I. I., et al. (2011). The BATTLE trial: personalizing therapy for lung cancer. Cancer Discovery, 1(1):44–53.

[72] Koch, G. G. and Schwartz, T. A. (2014). An overview of statistical planning to address subgroups in confirmatory clinical trials. Journal of Biopharmaceutical Statistics, 24:72–93.

[73] Kumar, A., Soares, H., Wells, R., Clarke, M., Hozo, L., Bleyer, A., Reaman, G., Chalmers, I., and Djulbegovic, B. (2005). Are experimental treatments for cancer in children superior to established treatments? Observational study of randomised controlled trials by the Children's Oncology Group. British Medical Journal, 331(7528):1295–1298.

[74] Li, J. D. and Mehrotra, D. V. (2008). An efficient method for accommodating potentially underpowered primary endpoints. Statistics in Medicine, 27:5377–5391.

[75] Liu, J.-P., Chow, S.-C., and Hsiao, C.-F. (2013). Design and Analysis of Bridging Studies. Chapman & Hall/CRC.

[76] Machin, D. and Fayers, P. M. (2010). Randomized Clinical Trials: Design, Practice and Reporting. Wiley-Blackwell.

[77] Maggioni, A., Darne, B., Atard, D., Abadie, E., Pitt, B., and Zannad, F. (2007). FDA and CPMP rulings on subgroup analyses. Issues in Drug Development, 107:97–102.

[78] Mandrekar, S. J. and Sargent, D. J. (2009). Clinical trial designs for predictive biomarker validation: theoretical considerations and practical challenges. Journal of Clinical Oncology, 27:4027–4034.

[79] Mandrekar, S. J. and Sargent, D. J. (2010). Predictive biomarker validation in practice: lessons from real trials. Clinical Trials, 7(5):567–573.

[80] Marcus, R., Peritz, E., and Gabriel, K. (1976). On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63:655–660.

[81] Marty, M., Cognetti, F., Maraninchi, D., Snyder, R., Mauriac, L., Tubiana-Hulin, M., Chan, S., Grimes, D., Antón, A., Lluch, A., Kennedy, J., O'Byrne, K., Conte, P., Green, M., Ward, C., Mayne, K., and Extra, J.-M. (2005). Randomized phase II trial of the efficacy and safety of trastuzumab combined with docetaxel in patients with human epidermal growth factor receptor 2-positive metastatic breast cancer administered as first-line treatment: the M77001 study group. Journal of Clinical Oncology, 23:4265–4274.

[82] Matsui, S., Buyse, M., and Simon, R. (2015). Design and Analysis of Clinical Trials for Predictive Medicine. Chapman & Hall/CRC.

[83] Matsui, S., Choai, Y., and Nonaka, T. (2014). Comparison of statistical analysis plans in randomize-all phase III trials with a predictive biomarker. Clinical Cancer Research, 20:2820–2830.


[84] Menis, J., Hasan, B., and Besse, B. (2014). New clinical research strategies in thoracic oncology: clinical trial design, adaptive, basket and umbrella trials, new end-points and new evaluations of response. European Respiratory Review, 23:367–378.

[85] Middleton, G., Crack, L. R., Popat, S., Swanton, C., Hollingsworth, S. J., Buller, R., Walker, I., Carr, T. H., Wherton, D., and Billingham, L. J. (2015). The National Lung Matrix Trial: translating the biology of stratification in advanced non-small-cell lung cancer. Annals of Oncology, 26(12):2464–2469.

[86] National Cancer Institute. NCI dictionary of cancer terms. Retrieved December 13, 2015: http://www.cancer.gov/publications/dictionaries/cancer-terms?cdrid=45858.

[87] National Institute for Health and Clinical Excellence (2005). Clopidogrel and modified-release dipyridamole in the prevention of occlusive vascular events. Retrieved January 21, 2008: www.nice.org.uk/nicemedia/pdf/TA090guidance.pdf.

[88] Newcombe, R. G. (1998). Interval estimation for the difference between independent proportions: comparison of eleven methods. Statistics in Medicine, 17:873–890.

[89] Ondra, T., Dmitrienko, A., Friede, T., Graf, A., Miller, F., Stallard, N., and Posch, M. (2016). Methods for identification and confirmation of targeted subgroups in clinical trials: a systematic review. Journal of Biopharmaceutical Statistics, 26:99–119.

[90] Paget, M., Chuang-Stein, C., Fletcher, C., and Reid, C. (2011). Subgroup analyses of clinical effectiveness to support health technology assessments. Pharmaceutical Statistics, 10:532–538.

[91] Pocock, S., Calvo, G., Marrugat, J., Prasad, K., Tavazzi, L., Wallentin, L., Zannad, F., and Garcia, A. A. (2013). International differences in treatment effect: do they really exist and why? European Heart Journal, 34:1846–1852.

[92] Pocock, S. J., Assmann, S. E., Enos, L. E., and Kasten, L. E. (2002). Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Statistics in Medicine, 21:2917–2930.

[93] Ren, Z., Davidian, M., George, S. L., Goldberg, R. M., Wright, F. A., Tsiatis, A. A., and Kosorok, M. R. (2012). Research methods for clinical trials in personalized medicine: a systematic review. University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series.

[94] Rosenblum, M., Liu, H., and Yen, E.-H. (2014). Optimal tests of treatment effects for the overall population and two subpopulations in randomized trials, using sparse linear programming. Journal of the American Statistical Association, 109:1216–1228.

[95] Rosenthal, R. and Rubin, D. B. (1983). Ensemble-adjusted p values. Psychological Bulletin, 94:540–541.


[96] Ross, J. S. and Fletcher, J. A. (1998). The HER-2/neu oncogene in breast cancer: prognostic factor, predictive factor, and target for therapy. The Oncologist, 3:237–252.

[97] Roy, S. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics, 24:220–238.

[98] Simon, N. and Simon, R. (2013). Adaptive enrichment designs for clinical trials. Biostatistics, 14:613–625.

[99] Simon, R. (2008). The use of genomics in clinical trial design. Clinical Cancer Research, 14:5984–5993.

[100] Simon, R. (2012). Clinical trials for predictive medicine. Statistics in Medicine, 31:3031–3040.

[101] Simon, R. (2013). Genomic Clinical Trials and Predictive Medicine. Cambridge University Press.

[102] Simon, R. (2014). Biomarker based clinical trial design. Chinese Clinical Oncology, 3(3):39. doi:10.3978/j.issn.2304-3865.2014.02.03.

[103] Simon, R. and Wang, S.-J. (2006). Use of genomic signatures in therapeutics development in oncology and other diseases. The Pharmacogenomics Journal, 6:166–173.

[104] Slamon, D., Eiermann, W., Robert, N., Pienkowski, T., Martin, M., Press, M., et al. (2011). Adjuvant trastuzumab in HER2-positive breast cancer. New England Journal of Medicine, 365:1273–1283.

[105] Song, Y. and Chi, G. Y. H. (2007). A method for testing a prespecified subgroup in clinical trials. Statistics in Medicine, 26:3535–3549.

[106] Stallard, N., Hamborg, T., Parsons, N., and Friede, T. (2014). Adaptive designs for confirmatory clinical trials with subgroup selection. Journal of Biopharmaceutical Statistics, 24:168–187.

[107] Steuer, C., Papadimitrakopoulou, V., Herbst, R., Redman, M., Hirsch, F., Mack, P., Ramalingam, S., and Gandara, D. (2015). Innovative clinical trials: the Lung-MAP study. Clinical Pharmacology & Therapeutics, 97(5):488–491.

[108] Sun, H. and Vach, W. (2015). A framework to assess the value of subgroup analyses when the overall treatment effect is significant. Journal of Biopharmaceutical Statistics, published online. doi:10.1080/10543406.2015.1052484.

[109] Tajik, P., Zwinderman, A. H., Mol, B. W., and Bossuyt, P. M. (2013). Trial designs for personalizing cancer care: a systematic review and classification. Clinical Cancer Research, 19:4578–4588.

[110] Temple, R. J. (1994). Special study designs: early escape, enrichment, studies in non-responders. Communications in Statistics - Theory and Methods, 23(2):499–531.


[111] The Medical Research Council (MRC) (2014). FOCUS4. Released in January 2014: http://www.focus4trial.org/.

[112] U.S. Food and Drug Administration (1998). Guidance for industry: providing clinical evidence of effectiveness for human drug and biological products. Published in May 1998: http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm078749.pdf.

[113] U.S. Food and Drug Administration (2012). Guidance for industry: enrichment strategies for clinical trials to support approval of human drugs and biological products. Published in December 2012: http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm332181.pdf.

[114] U.S. Food and Drug Administration (2013). Paving the way for personalized medicine: FDA's role in a new era of medical product development. Retrieved April 26, 2014: http://www.fda.gov/downloads/scienceresearch/specialtopics/personalizedmedicine/ucm372421.pdf.

[115] Šidák, Z. (1967). Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62:626–633.

[116] Wang, S.-J., O'Neill, R. T., and Hung, H. M. J. (2007). Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics, 6:227–244.

[117] Westfall, P. and Young, S. (1993). Resampling-Based Multiple Testing. John Wiley & Sons.

[118] Westfall, P. H., Krishen, A., and Young, S. S. (1998). Using prior information to allocate significance levels for multiple endpoints. Statistics in Medicine, 17:2107–2119.

[119] Westfall, P. H., Kropf, S., and Finos, L. (2004). Weighted FWE-controlling methods in high-dimensional situations. IMS Lecture Notes - Monograph Series, 47:143–154.

[120] WHO. International clinical trials registry platform (ICTRP). Retrieved December 13, 2015: http://www.who.int/ictrp/glossary/en/.

[121] Wiens, B. L. (2003). A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharmaceutical Statistics, 2:211–215.

[122] Wiens, B. L. and Dmitrienko, A. (2005). The fallback procedure for evaluating a single family of hypotheses. Journal of Biopharmaceutical Statistics, 15:929–942.

[123] Wiens, B. L. and Dmitrienko, A. (2010). On selecting a multiple comparison procedure for analysis of a clinical trial: fallback, fixed sequence, and related procedures. Statistics in Biopharmaceutical Research, 2(1):22–32.


[124] Xie, C., Lu, X., and Chen, D.-G. D. (2015). Comparative study of five weighted parametric multiple testing methods for correlated multiple endpoints in clinical trials. In: Clinical Trial Biostatistics and Biopharmaceutical Applications (Young, W. R. and Chen, D.-G. D., eds.), pages 421–432.

[125] Young, W. R. and Chen, D.-G. D. (2015). Clinical Trial Biostatistics and Biopharmaceutical Applications. CRC Press.

[126] Yusuf, S., Collins, R., and Peto, R. (1984). Why do we need some large, simple randomized trials? Statistics in Medicine, 3:409–420.

[127] Yusuf, S., Wittes, J., Probstfield, J., and Tyroler, H. A. (1991). Analysis and interpretation of treatment effects in subgroups of patients in randomized clinical trials. Journal of the American Medical Association, 266:93–98.

[128] Zhao, Y. D., Dmitrienko, A., and Tamura, R. (2010). Design and analysis considerations in clinical trials with a sensitive subpopulation. Statistics in Biopharmaceutical Research, 2:72–83.

[129] Ziegler, A., Koch, A., Krockenberger, K., and Großhennig, A. (2012). Personalized medicine using DNA biomarkers: a review. Human Genetics, 131:1627–1638.


Appendix A

Proofs and computation details of Chapter 6

A.1 Proof of FWER control for M4 and M5

Let $\mathcal{S}, \mathcal{S}'$ be a partition of $\mathcal{P}_K$ with $H_0^S$ true for all $S \in \mathcal{S}$ and $H_0^{S'}$ false for all $S' \in \mathcal{S}'$. As we use a single-step procedure, FWER control requires to demonstrate that $P(T_{S_0} \ge c) \le \alpha$ for any $S_0 \in \mathcal{S}$, with (in the case of M4) $c$ denoting the upper $\alpha$-quantile of the distribution of $\max(T_S \mid S \in \mathcal{P}_K)$ under $\bigcap_{S \in \mathcal{P}_K} H_0^S$. The latter is equivalent to $\bigcap_{g=1,\dots,K} H_0^{\{g\}}$, as $\{g\} \in \mathcal{P}_K$ for all $g = 1,\dots,K$, and any $H_0^S$ is implied by $\bigcap_{g \in S} H_0^{\{g\}}$. As $T_{S_0} \le \max(T_S \mid S \in \mathcal{P}_K)$, we have $P(T_{S_0} \ge c) \le \alpha$ if $\theta_g \le 0$ for all $g = 1,\dots,K$ holds.

Now we can write $T_S$ as
$$T_S = \frac{\hat{\theta}_S}{\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_S)}} = \frac{\hat{\theta}_S}{\sqrt{\hat{\sigma}^2 h_S^2}}$$
with $\hat{\theta}_S = \sum \tilde{\pi}_g \hat{\theta}_g$ following a normal distribution with mean $\theta_S$ and variance $\sigma^2 h_S^2$, where $h_S^2 = \sum \tilde{\pi}_g^2 \, \mathrm{var}(\hat{\theta}_g)/\sigma^2$, with $\hat{\sigma}^2$ stochastically independent of $\hat{\theta}_S$ and following a distribution only depending on $\sigma^2$, and with $h_S$ not depending on $\theta_g$, as $\mathrm{var}(\hat{\theta}_g)/\sigma^2$ depends only on the empirical distribution of the treatment indicators $T_{gi}$ in the subgroups. Consequently, the distribution of $T_{S_0}$ depends only on $\theta_{S_0}$ and $\sigma^2$. For any $\theta_{S_0} \le 0$ we can find a choice of $\theta_g \le 0$ with the corresponding value for $\theta_{S_0}$. Consequently, $P(T_{S_0} \ge c) \le \alpha$.

To extend the proof to M5, we have to ensure that any $H_0^{\{g\}}$ is implied by $\bigcap_{S \in \mathcal{P}_K,\, \pi_S \ge 0.5} H_0^S$. However, if $\bigcap_{S \in \mathcal{P}_K,\, \pi_S \ge 0.5,\, g \in S} H_0^S \ne H_0^{\{g\}}$, then there must be a $g' \ne g$ with $g' \in S$ for all $S$ with $g \in S$ and $\pi_S \ge 0.5$. However, this implies $\pi_{\{1,\dots,K\} \setminus \{g'\}} \le 0.5$, and hence $\pi_{g'} \ge 0.5$, which contradicts the assumption.
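The single-step argument can be spot-checked by simulation. The following sketch is in Python rather than the thesis' R, and all settings are illustrative assumptions (K = 4 equal subgroups, a known variance in place of $\hat{\sigma}^2$, $\alpha = 0.05$), not the Chapter 6 configuration. It computes the critical value $c$ of the max-statistic over all non-empty subsets under the global null and confirms that the familywise rejection rate sits at $\alpha$ while a single $T_S$ (here the full-population statistic) rejects less often.

```python
import numpy as np

rng = np.random.default_rng(1)
K, alpha, n_sim = 4, 0.05, 5000
pi = np.full(K, 1.0 / K)              # equal subgroup fractions (assumption)
var_hat = np.full(K, 0.04)            # known var(theta_hat_g), for simplicity

# All non-empty subsets S of {0, ..., K-1}; the last entry is the full set.
subsets = [[g for g in range(K) if mask >> g & 1] for mask in range(1, 2**K)]

def subset_stats(theta_hat):
    """Pooled z-statistics T_S for every non-empty subset S."""
    out = np.empty(len(subsets))
    for j, idx in enumerate(subsets):
        w = pi[idx] / pi[idx].sum()   # renormalized weights pi~_g
        out[j] = np.sum(w * theta_hat[idx]) / np.sqrt(np.sum(w**2 * var_hat[idx]))
    return out

# Simulate under the global null: theta_g = 0 for all g.
stats = np.array([subset_stats(rng.normal(0.0, np.sqrt(var_hat)))
                  for _ in range(n_sim)])
c = np.quantile(stats.max(axis=1), 1 - alpha)   # critical value of the max-test

print(round(float(np.mean(stats.max(axis=1) >= c)), 3))  # FWER, close to alpha
print(bool(np.mean(stats[:, -1] >= c) <= alpha))         # single T_S rejects less often
```

Because the critical value is taken from the max-statistic, each individual subset test is conservative, which is exactly what the proof exploits.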


A.2 Proof of positivity of subgroup effect estimates

As, in computing the adjusted p-values, $T_S$ is compared with a constant independent of $S$, it suffices to show that for any subset $S$ and any subgroup $g' \notin S$ with $\hat{\theta}_{g'} \le 0$, we have $T_{S \cup \{g'\}} < T_S$.

We have
$$\frac{\hat{\theta}_{S\cup\{g'\}}}{\hat{\theta}_S} = \frac{\sum_{g\in S\cup\{g'\}} \pi_g \hat{\theta}_g / (\pi_S+\pi_{g'})}{\sum_{g\in S} \pi_g \hat{\theta}_g / \pi_S} \le \frac{\sum_{g\in S} \pi_g \hat{\theta}_g}{\sum_{g\in S} \pi_g \hat{\theta}_g} \cdot \frac{\pi_S}{\pi_S+\pi_{g'}} = \frac{\pi_S}{\pi_S+\pi_{g'}}$$
and with $V_g = \mathrm{var}(\hat{\theta}_g)$ and $h_S^2$ as in A.1, we have
$$\frac{h^2_{S\cup\{g'\}}}{h^2_S} = \frac{\sum_{g\in S\cup\{g'\}} \pi_g^2 V_g / (\pi_S+\pi_{g'})^2}{\sum_{g\in S} \pi_g^2 V_g / \pi_S^2} > \frac{\sum_{g\in S} \pi_g^2 V_g}{\sum_{g\in S} \pi_g^2 V_g} \cdot \frac{\pi_S^2}{(\pi_S+\pi_{g'})^2} = \frac{\pi_S^2}{(\pi_S+\pi_{g'})^2}$$
and by $T_S = \hat{\theta}_S / \sqrt{\hat{\sigma}^2 h_S^2}$ we have
$$\frac{T_{S\cup\{g'\}}}{T_S} = \frac{\hat{\theta}_{S\cup\{g'\}} / \hat{\theta}_S}{\sqrt{h^2_{S\cup\{g'\}} / h^2_S}} < \frac{\pi_S/(\pi_S+\pi_{g'})}{\pi_S/(\pi_S+\pi_{g'})} = 1$$
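This monotonicity can be checked numerically. The sketch below is in Python (not the thesis' R); `pooled_T` and all numbers are our illustration, not thesis code. It draws random fractions, variances, and positive estimates for the groups in $S$ (the division by $\hat{\theta}_S$ in the argument above needs $\hat{\theta}_S > 0$, which holds here by construction), forces $\hat{\theta}_{g'} \le 0$, and confirms $T_{S\cup\{g'\}} < T_S$ in every draw.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_T(idx, pi, theta_hat, V):
    """Pooled statistic T_S = theta_hat_S / sqrt(var(theta_hat_S)) for subset idx."""
    w = pi[idx] / pi[idx].sum()       # renormalized weights pi~_g
    return np.sum(w * theta_hat[idx]) / np.sqrt(np.sum(w**2 * V[idx]))

for _ in range(1000):
    pi = rng.dirichlet(np.ones(5))                  # subgroup fractions
    theta_hat = np.abs(rng.normal(0.3, 1.0, 5))     # positive estimates in S
    theta_hat[4] = -np.abs(rng.normal())            # g' has a non-positive estimate
    V = rng.uniform(0.05, 0.5, 5)                   # var(theta_hat_g)
    assert pooled_T([0, 1, 2, 3, 4], pi, theta_hat, V) \
        < pooled_T([0, 1, 2, 3], pi, theta_hat, V)
print("inequality held in all draws")
```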

A.3 Technical implementation

Since $PK$ is not always an integer, we choose $K_0$, the number of subgroups with a zero effect, randomly as $K_0^- = [PK]^-$ or $K_0^+ = [PK]^+$, with $[PK]^-$ denoting the next smallest integer and $[PK]^+$ denoting the next largest integer. $K_0$ is chosen as $K_0^-$ with probability $K_0^+ - PK$ and as $K_0^+$ with probability $PK - K_0^-$. This implies
$$E(K_0) = K_0^+(PK - K_0^-) + K_0^-(K_0^+ - PK) = PK(K_0^+ - K_0^-) = PK$$

We set $\theta_g = 0$ for $g = K-K_0+1, \dots, K$. For the remaining groups $G^+ = \{1, \dots, K-K_0\}$, we require
$$\theta_{G^+} = \frac{\theta_A}{\pi_{G^+}}$$
such that
$$\theta_{\mathrm{overall}} = 0 + \pi_{G^+} \frac{\theta_A}{\pi_{G^+}} = \theta_A$$

Actually, if $K-K_0 > 1$, we choose for $g = 1, \dots, K-K_0$,
$$\theta_g = \theta^* f(g)$$
with $f(g) = (1-\tau) + 2\tau \frac{g-1}{K-K_0-1}$, i.e. values on a grid between $1-\tau$ and $1+\tau$, and $\theta^* = \theta_{G^+} / \sum \tilde{\pi}^{G^+}_g f(g)$, such that $\sum_{g=1}^{K-K_0} \tilde{\pi}^{G^+}_g \theta^* f(g) = \theta_{G^+}$. If $K-K_0 = 1$, we choose $\theta_g = \theta_{G^+}$.
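The two steps above can be sketched in a few lines of Python (the names `draw_K0` and `grid_effects` are ours, not from the thesis code): the randomized rounding makes $E(K_0) = PK$ exact, and the grid construction rescales so that the weighted average effect over the non-null groups recovers $\theta_{G^+}$.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_K0(P, K):
    """Draw K0 in {floor(PK), ceil(PK)} so that E(K0) = P*K."""
    pk = P * K
    lo, hi = int(np.floor(pk)), int(np.ceil(pk))
    if lo == hi:                      # PK already an integer
        return lo
    # P(lo) = hi - pk, P(hi) = pk - lo
    return lo if rng.random() < hi - pk else hi

def grid_effects(theta_Gplus, pi_tilde, tau):
    """Effects theta_g = theta_star * f(g) on a grid between (1 - tau) and
    (1 + tau), rescaled so the pi_tilde-weighted mean equals theta_Gplus."""
    m = len(pi_tilde)
    if m == 1:
        return np.array([theta_Gplus])
    f = (1 - tau) + 2 * tau * np.arange(m) / (m - 1)
    theta_star = theta_Gplus / np.sum(pi_tilde * f)
    return theta_star * f

draws = [draw_K0(0.55, 6) for _ in range(100000)]
print(round(float(np.mean(draws)), 1))             # close to 0.55 * 6 = 3.3

pi_tilde = np.array([0.5, 0.3, 0.2])
theta = grid_effects(0.4, pi_tilde, 0.5)
print(round(float(np.sum(pi_tilde * theta)), 10))  # recovers theta_Gplus = 0.4
```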


Appendix B

R code for Chapter 6

> rm(list=ls(all=TRUE))
> library(multcomp)
> library(gtools)
> options(digits=3)
> ## Import data
> dat <- read.csv(file="~/Project/P_BAD/Write/Submission/SIM/Example/exampledata.csv")

> ### Create dummy variable for combining treatment and group
>
> for (i in 1:length(dat$group)) {
+   if (dat$t[[i]] == 1) {
+     dat$trtgrp[[i]] = dat$group[[i]] + 6
+   } else { dat$trtgrp[[i]] = dat$group[[i]] }
+ }
>
> dat$trtgrp <- as.factor(dat$trtgrp)
> head(dat)

  group effect   n t     eps    mean       y trtgrp
1     1  -0.05 145 0  0.6831 -0.0457  0.6831      1
2     1  -0.05 145 0  0.0329 -0.0457  0.0329      1
3     1  -0.05 145 0 -2.5081 -0.0457 -2.5081      1
4     1  -0.05 145 0 -0.4630 -0.0457 -0.4630      1
5     1  -0.05 145 0 -0.0690 -0.0457 -0.0690      1
6     1  -0.05 145 0 -0.1816 -0.0457 -0.1816      1

> #### Calculating the fraction of each subgroup
> size <- c(145, 167, 63, 292, 86, 153)
> frac <- size / sum(size)

> #### Define the correlation structure



> ## For the overall effect M1
> contrm1 <- function(n, frac) {
+   m <- frac
+   C <- c(-m, m)
+   ## special feature
+   names(C) <- c("Full")
+   C
+ }
> contr1 <- contrm1(6, frac)
> contr1

   Full    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
-0.1600 -0.1843 -0.0695 -0.3223 -0.0949 -0.1689  0.1600  0.1843  0.0695  0.3223  0.0949  0.1689

> ## For subgroup analysis M2
> ## n groups (n rows)
> contrm2 <- function(n, frac) {
+   m <- diag(n) * frac
+   C <- cbind(-m, m)
+   ## special feature
+   rownames(C) <- apply(m, 1, function(l)
+     ifelse(all(l > 0), "Full", paste0('g', paste0(which(l > 0), collapse=''))))
+   C
+ }
> contr2 <- contrm2(6, frac)
> contr2

    [,1]   [,2]    [,3]   [,4]    [,5]   [,6] [,7]  [,8]   [,9] [,10]  [,11] [,12]
g1 -0.16  0.000  0.0000  0.000  0.0000  0.000 0.16 0.000 0.0000 0.000 0.0000 0.000
g2  0.00 -0.184  0.0000  0.000  0.0000  0.000 0.00 0.184 0.0000 0.000 0.0000 0.000
g3  0.00  0.000 -0.0695  0.000  0.0000  0.000 0.00 0.000 0.0695 0.000 0.0000 0.000
g4  0.00  0.000  0.0000 -0.322  0.0000  0.000 0.00 0.000 0.0000 0.322 0.0000 0.000
g5  0.00  0.000  0.0000  0.000 -0.0949  0.000 0.00 0.000 0.0000 0.000 0.0949 0.000
g6  0.00  0.000  0.0000  0.000  0.0000 -0.169 0.00 0.000 0.0000 0.000 0.0000 0.169

> ## For all subgroups and overall effect M3
> ## n groups (n+1 rows); "Full" labels the appended overall-effect row
> contrm3 <- function(n, frac) {
+   m <- diag(n) * frac
+   C <- rbind(cbind(-m, m), c(-frac, frac))
+   rownames(C) <- c(apply(m, 1, function(l)
+     paste0('g', paste0(which(l > 0), collapse=''))), "Full")
+   C
+ }
> contr3 <- contrm3(6, frac)
> contr3

      [,1]   [,2]    [,3]   [,4]    [,5]   [,6] [,7]  [,8]   [,9] [,10]  [,11] [,12]
g1   -0.16  0.000  0.0000  0.000  0.0000  0.000 0.16 0.000 0.0000 0.000 0.0000 0.000
g2    0.00 -0.184  0.0000  0.000  0.0000  0.000 0.00 0.184 0.0000 0.000 0.0000 0.000


g3    0.00  0.000 -0.0695  0.000  0.0000  0.000 0.00 0.000 0.0695 0.000 0.0000 0.000
g4    0.00  0.000  0.0000 -0.322  0.0000  0.000 0.00 0.000 0.0000 0.322 0.0000 0.000
g5    0.00  0.000  0.0000  0.000 -0.0949  0.000 0.00 0.000 0.0000 0.000 0.0949 0.000
g6    0.00  0.000  0.0000  0.000  0.0000 -0.169 0.00 0.000 0.0000 0.000 0.0000 0.169
Full -0.16 -0.184 -0.0695 -0.322 -0.0949 -0.169 0.16 0.184 0.0695 0.322 0.0949 0.169

> ### Subset analysis (all subsets, 2^n - 1 rows)
> contrastMatrix <- function(n, frac) {
+   m <- (permutations(2, n, rep=T) - 1)[-1, ]
+   m1 <- lapply(seq_along(1:nrow(m)), function(i) {
+     l <- m[i, ] * frac
+     l
+   })
+   m2 <- do.call(rbind, m1)
+   C <- cbind(-m2, m2)
+   ## special feature
+   rownames(C) <- apply(m, 1, function(l)
+     ifelse(all(l > 0), "Full", paste0('g', paste0(which(l > 0), collapse=''))))
+   C
+ }
> contr4 <- contrastMatrix(6, frac)
> contr4

      [,1]   [,2]    [,3]   [,4]    [,5]   [,6] [,7]  [,8]   [,9] [,10]  [,11] [,12]
g1   -0.16  0.000  0.0000  0.000  0.0000  0.000 0.16 0.000 0.0000 0.000 0.0000 0.000
g2    0.00 -0.184  0.0000  0.000  0.0000  0.000 0.00 0.184 0.0000 0.000 0.0000 0.000
g3    0.00  0.000 -0.0695  0.000  0.0000  0.000 0.00 0.000 0.0695 0.000 0.0000 0.000
g4    0.00  0.000  0.0000 -0.322  0.0000  0.000 0.00 0.000 0.0000 0.322 0.0000 0.000
g5    0.00  0.000  0.0000  0.000 -0.0949  0.000 0.00 0.000 0.0000 0.000 0.0949 0.000
g6    0.00  0.000  0.0000  0.000  0.0000 -0.169 0.00 0.000 0.0000 0.000 0.0000 0.169
g12  -0.16 -0.184  0.0000  0.000  0.0000  0.000 0.16 0.184 0.0000 0.000 0.0000 0.000
g13  -0.16  0.000 -0.0695  0.000  0.0000  0.000 0.16 0.000 0.0695 0.000 0.0000 0.000
g14  -0.16  0.000  0.0000 -0.322  0.0000  0.000 0.16 0.000 0.0000 0.322 0.0000 0.000
g15  -0.16  0.000  0.0000  0.000 -0.0949  0.000 0.16 0.000 0.0000 0.000 0.0949 0.000
g16  -0.16  0.000  0.0000  0.000  0.0000 -0.169 0.16 0.000 0.0000 0.000 0.0000 0.169
g123 -0.16 -0.184 -0.0695  0.000  0.0000  0.000 0.16 0.184 0.0695 0.000 0.0000 0.000
...  #### 50 lines omitted
Full -0.16 -0.184 -0.0695 -0.322 -0.0949 -0.169 0.16 0.184 0.0695 0.322 0.0949 0.169

> ## Subset analysis with 50% of patients
> contrastMatrix50 <- function(n, frac) {
+   m <- (permutations(2, n, rep=T) - 1)[-1, ]
+   # print(m)
+   aux <- apply(m, 1, function(l) sum(l * frac))
+   print(aux)
+   m <- m[aux >= 0.499, ]
+   # print(m)
+   m1 <- lapply(seq_along(1:nrow(m)), function(i) {
+     l <- m[i, ] * frac

Page 132: clinical trials for personalized, marker-based treatment strategies

114 CHAPTER B. R CODE FOR CHAPTER 6

+ l+ } )+ m2 <− do . c a l l ( rbind , m1)+ C <− cbind (−m2 , m2)+ ## s p e c i a l f e a t u r e+ rownames (C) <− apply (m, 1 , f u n c t i o n ( l )+ i f e l s e ( a l l ( l > 0 ) , " F u l l " , p a s t e 0 ( ’ g ’ , p a s t e 0 ( which ( l >0 ) , c o l l a p s e = ’ ’ ) ) ) )+ p r i n t ( nrow (C ) )+ C+ }> c o n t r 5 <− c o n t r a s t M a t r i x 5 0 ( 6 , f r a c )> c o n t r 5

        [,1]   [,2]    [,3]   [,4]    [,5]   [,6] [,7]  [,8]   [,9] [,10]  [,11] [,12]
g456    0.00  0.000  0.0000 -0.322 -0.0949 -0.169 0.00 0.000 0.0000 0.322 0.0949 0.169
g346    0.00  0.000 -0.0695 -0.322  0.0000 -0.169 0.00 0.000 0.0695 0.322 0.0000 0.169
g3456   0.00  0.000 -0.0695 -0.322 -0.0949 -0.169 0.00 0.000 0.0695 0.322 0.0949 0.169
g24     0.00 -0.184  0.0000 -0.322  0.0000  0.000 0.00 0.184 0.0000 0.322 0.0000 0.000
g246    0.00 -0.184  0.0000 -0.322  0.0000 -0.169 0.00 0.184 0.0000 0.322 0.0000 0.169
g245    0.00 -0.184  0.0000 -0.322 -0.0949  0.000 0.00 0.184 0.0000 0.322 0.0949 0.000
g2456   0.00 -0.184  0.0000 -0.322 -0.0949 -0.169 0.00 0.184 0.0000 0.322 0.0949 0.169
g2356   0.00 -0.184 -0.0695  0.000 -0.0949 -0.169 0.00 0.184 0.0695 0.000 0.0949 0.169
g234    0.00 -0.184 -0.0695 -0.322  0.0000  0.000 0.00 0.184 0.0695 0.322 0.0000 0.000
g2346   0.00 -0.184 -0.0695 -0.322  0.0000 -0.169 0.00 0.184 0.0695 0.322 0.0000 0.169
g2345   0.00 -0.184 -0.0695 -0.322 -0.0949  0.000 0.00 0.184 0.0695 0.322 0.0949 0.000
g23456  0.00 -0.184 -0.0695 -0.322 -0.0949 -0.169 0.00 0.184 0.0695 0.322 0.0949 0.169
g146   -0.16  0.000  0.0000 -0.322  0.0000 -0.169 0.16 0.000 0.0000 0.322 0.0000 0.169
g145   -0.16  0.000  0.0000 -0.322 -0.0949  0.000 0.16 0.000 0.0000 0.322 0.0949 0.000
g1456  -0.16  0.000  0.0000 -0.322 -0.0949 -0.169 0.16 0.000 0.0000 0.322 0.0949 0.169
g134   -0.16  0.000 -0.0695 -0.322  0.0000  0.000 0.16 0.000 0.0695 0.322 0.0000 0.000
g1346  -0.16  0.000 -0.0695 -0.322  0.0000 -0.169 0.16 0.000 0.0695 0.322 0.0000 0.169
g1345  -0.16  0.000 -0.0695 -0.322 -0.0949  0.000 0.16 0.000 0.0695 0.322 0.0949 0.000
g13456 -0.16  0.000 -0.0695 -0.322 -0.0949 -0.169 0.16 0.000 0.0695 0.322 0.0949 0.169
g126   -0.16 -0.184  0.0000  0.000  0.0000 -0.169 0.16 0.184 0.0000 0.000 0.0000 0.169
g1256  -0.16 -0.184  0.0000  0.000 -0.0949 -0.169 0.16 0.184 0.0000 0.000 0.0949 0.169
g124   -0.16 -0.184  0.0000 -0.322  0.0000  0.000 0.16 0.184 0.0000 0.322 0.0000 0.000
g1246  -0.16 -0.184  0.0000 -0.322  0.0000 -0.169 0.16 0.184 0.0000 0.322 0.0000 0.169
g1245  -0.16 -0.184  0.0000 -0.322 -0.0949  0.000 0.16 0.184 0.0000 0.322 0.0949 0.000
g12456 -0.16 -0.184  0.0000 -0.322 -0.0949 -0.169 0.16 0.184 0.0000 0.322 0.0949 0.169
g1236  -0.16 -0.184 -0.0695  0.000  0.0000 -0.169 0.16 0.184 0.0695 0.000 0.0000 0.169
g1235  -0.16 -0.184 -0.0695  0.000 -0.0949  0.000 0.16 0.184 0.0695 0.000 0.0949 0.000
g12356 -0.16 -0.184 -0.0695  0.000 -0.0949 -0.169 0.16 0.184 0.0695 0.000 0.0949 0.169
g1234  -0.16 -0.184 -0.0695 -0.322  0.0000  0.000 0.16 0.184 0.0695 0.322 0.0000 0.000
g12346 -0.16 -0.184 -0.0695 -0.322  0.0000 -0.169 0.16 0.184 0.0695 0.322 0.0000 0.169
g12345 -0.16 -0.184 -0.0695 -0.322 -0.0949  0.000 0.16 0.184 0.0695 0.322 0.0949 0.000
Full   -0.16 -0.184 -0.0695 -0.322 -0.0949 -0.169 0.16 0.184 0.0695 0.322 0.0949 0.169
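The 50% restriction in contrastMatrix50() keeps only subsets whose combined prevalence is at least half the trial population (threshold 0.499). The filtering step can be sketched in isolation; as before, gtools::permutations() and the example prevalence vector frac are assumed.

```r
## Sketch: filter all non-empty subsets of 6 subgroups to those covering
## at least ~50% of patients (threshold 0.499, as in the listing above).
library(gtools)
frac <- c(0.16, 0.184, 0.0695, 0.322, 0.0949, 0.169)  # assumed prevalences
m <- (permutations(2, 6, repeats.allowed = TRUE) - 1)[-1, ]
keep <- as.vector(m %*% frac) >= 0.499   # subset prevalence filter
m50 <- m[keep, ]
nrow(m50)                                # matches the 32 rows of contr5 printed above
```

Of the 63 candidate subsets, 32 pass the filter, which agrees with the 32 adjusted p-values reported for method M5 below.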

> #### Fit model and get adjusted p-values from each method
> # thetaT <- by(dat$t, dat$group, function(x) unique(x))


> aa <- aov(y ~ trtgrp, dat)

> # Overall analysis M1
> fit1 <- glht(aa, linfct = mcp(trtgrp = contr1), alternative = c("greater"))
> sum1 <- summary(fit1)
> sum1

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts

Fit: aov(formula = y ~ trtgrp, data = dat)

Linear Hypotheses:
        Estimate Std. Error t value Pr(>t)

1 <= 0    0.2277     0.0677   3.362 0.000403 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

> pp1 <- sum1$test$pvalues[1]
> names(pp1) <- c("Pov")
> pp1

        Pov
0.000403414

> # Subgroup analysis M2
> fit2 <- glht(aa, linfct = mcp(trtgrp = contr2), alternative = c("greater"))
> sum2 <- summary(fit2, test = adjusted(type = "Westfall"))
> sum2

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts

Fit: aov(formula = y ~ trtgrp, data = dat)

Linear Hypotheses:
        Estimate Std. Error t value Pr(>t)

g1 <= 0  -0.0080     0.0271   -0.30  0.616
g2 <= 0   0.0129     0.0291    0.44  0.549
g3 <= 0   0.0216     0.0179    1.21  0.304
g4 <= 0   0.0709     0.0384    1.84  0.125
g5 <= 0   0.0560     0.0209    2.68  0.022 *
g6 <= 0   0.0743     0.0278    2.67  0.022 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- Westfall method)


> pp2 <- sum2$test$pvalues[1:6]
> names(pp2) <- paste("Psa", names(coef(fit2)), sep = "")
> pp2
Psag1 Psag2 Psag3 Psag4 Psag5 Psag6
0.616 0.549 0.304 0.125 0.022 0.022
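For orientation: the Westfall max-t adjustment used above exploits the estimated correlation between the subgroup contrasts, so its adjusted p-values are never larger than the corresponding Holm (and hence Bonferroni) values. A sketch with hypothetical raw p-values, not taken from the fitted model:

```r
## Sketch: Bonferroni and Holm upper bounds for six one-sided subgroup tests.
## The raw p-values below are hypothetical illustrations only.
praw <- c(g1 = 0.31, g2 = 0.27, g3 = 0.10, g4 = 0.03, g5 = 0.004, g6 = 0.004)
p_bonf <- p.adjust(praw, method = "bonferroni")
p_holm <- p.adjust(praw, method = "holm")
## Holm uniformly improves on Bonferroni; Westfall improves further by
## using the contrast correlations estimated from the model.
stopifnot(all(p_holm <= p_bonf))
```

This is why the Westfall-adjusted values reported by summary(fit2, test = adjusted(type = "Westfall")) can be noticeably smaller than a naive Bonferroni correction of the same six tests.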

> # Combination of overall and subgroup analyses M3
> fit3 <- glht(aa, linfct = mcp(trtgrp = contr3), alternative = c("greater"))
> sum3 <- summary(fit3, test = adjusted(type = "Westfall"))
> sum3

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts

Fit: aov(formula = y ~ trtgrp, data = dat)

Linear Hypotheses:
        Estimate Std. Error t value Pr(>t)

g1 <= 0    -0.0080     0.0271   -0.30  0.6161
g2 <= 0     0.0129     0.0291    0.44  0.5493
g3 <= 0     0.0216     0.0179    1.21  0.3041
g4 <= 0     0.0709     0.0384    1.84  0.1246
g5 <= 0     0.0560     0.0209    2.68  0.0184 *
g6 <= 0     0.0743     0.0278    2.67  0.0192 *
Full <= 0   0.2277     0.0677    3.36  0.0028 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- Westfall method)

> pp3 <- sum3$test$pvalues[1:(6 + 1)]
> names(pp3) <- paste("Pos", names(coef(fit3)), sep = "")
> pp3

  Posg1   Posg2   Posg3   Posg4   Posg5   Posg6 PosFull
0.61611 0.54928 0.30413 0.12458 0.01837 0.01915 0.00285

> # Subset analysis M4
> fit4 <- glht(aa, linfct = mcp(trtgrp = contr4), alternative = c("greater"))
> sum4 <- summary(fit4, test = adjusted(type = "single-step"))
> sum4

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = y ~ trtgrp, data = dat)

Linear Hypotheses:
        Estimate Std. Error t value Pr(>t)

g1 <= 0    -0.0080     0.0271   -0.30  0.9968
g2 <= 0     0.0129     0.0291    0.44  0.9194
g3 <= 0     0.0216     0.0179    1.21  0.6198
g4 <= 0     0.0709     0.0384    1.84  0.2983
g5 <= 0     0.0560     0.0209    2.68  0.0585 .
g6 <= 0     0.0743     0.0278    2.67  0.0605 .
g12 <= 0    0.0049     0.0397    0.12  0.9731
g13 <= 0    0.0135     0.0324    0.42  0.9253
g14 <= 0    0.0629     0.0470    1.34  0.5516
g15 <= 0    0.0480     0.0342    1.40  0.5164
g16 <= 0    0.0663     0.0388    1.71  0.3614
g123 <= 0   0.0265     0.0436    0.61  0.8756
...  #### Here omit 50 lines
Full <= 0   0.2277     0.0677    3.36  0.0089 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

> pp4 <- sum4$test$pvalues[1:nrow(contr4)]
> names(pp4) <- paste("Pss", names(coef(fit4)), sep = "")
> pp4

   Pssg6    Pssg5   Pssg56    Pssg4   Pssg46   Pssg45  Pssg456    Pssg3
0.060478 0.058458 0.002361 0.298262 0.022017 0.033763 0.001596 0.619809
  Pssg36   Pssg35  Pssg356   Pssg34  Pssg346  Pssg345 Pssg3456    Pssg2
0.034108 0.041251 0.001466 0.170146 0.011326 0.017316 0.000749 0.919405
  Pssg26   Pssg25  Pssg256   Pssg24  Pssg246  Pssg245 Pssg2456   Pssg23
0.174632 0.263265 0.016557 0.346520 0.039618 0.061588 0.004037 0.717308
 Pssg236  Pssg235 Pssg2356  Pssg234 Pssg2346 Pssg2345 Pssg23456   Pssg1
0.095377 0.146177 0.008436 0.214772 0.021116 0.033340 0.001991 0.996826
  Pssg16   Pssg15  Pssg156   Pssg14  Pssg146  Pssg145 Pssg1456   Pssg13
0.361374 0.516427 0.046753 0.551602 0.087233 0.132476 0.010652 0.925315
 Pssg136  Pssg135 Pssg1356  Pssg134 Pssg1346 Pssg1345 Pssg13456  Pssg12
0.212854 0.316307 0.024157 0.375183 0.048427 0.075231 0.005578 0.973145
 Pssg126  Pssg125 Pssg1256  Pssg124 Pssg1246 Pssg1245 Pssg12456 Pssg123
0.398213 0.541171 0.077985 0.533949 0.104885 0.155183 0.016574 0.875565
Pssg1236 Pssg1235 Pssg12356 Pssg1234 Pssg12346 Pssg12345   PssFull
0.253710 0.361569  0.042509 0.376733  0.061099  0.092408  0.008858

> min4 <- min(pp4)
> min4
[1] 0.000749
> which.min(pp4)
Pssg3456
      15


> # Subset analysis M5
> fit5 <- glht(aa, linfct = mcp(trtgrp = contr5), alternative = c("greater"))
> sum5 <- summary(fit5, test = adjusted(type = "single-step"))
There were 18 warnings (use warnings() to see them)
> sum5

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: User-defined Contrasts

Fit: aov(formula = y ~ trtgrp, data = dat)

Linear Hypotheses:
        Estimate Std. Error t value Pr(>t)

g456 <= 0     0.2012     0.0518   3.88  <0.01 ***
g346 <= 0     0.1668     0.0507   3.29  <0.01 **
g3456 <= 0    0.2228     0.0548   4.06  <0.01 ***
g24 <= 0      0.0838     0.0482   1.74  0.206
g246 <= 0     0.1581     0.0557   2.84  0.020 *
g245 <= 0     0.1398     0.0525   2.66  0.032 *
g2456 <= 0    0.2141     0.0594   3.60  <0.01 **
g2356 <= 0    0.1648     0.0487   3.38  <0.01 **
g234 <= 0     0.1054     0.0514   2.05  0.120
g2346 <= 0    0.1797     0.0585   3.07  0.010 *
g2345 <= 0    0.1614     0.0555   2.91  0.016 *
g23456 <= 0   0.2357     0.0621   3.80  <0.01 ***
g146 <= 0     0.1372     0.0546   2.51  0.045 *
g145 <= 0     0.1189     0.0515   2.31  0.071 .
g1456 <= 0    0.1932     0.0585   3.30  <0.01 **
g134 <= 0     0.0845     0.0503   1.68  0.225
g1346 <= 0    0.1588     0.0575   2.76  0.025 *
g1345 <= 0    0.1405     0.0545   2.58  0.038 *
g13456 <= 0   0.2148     0.0612   3.51  <0.01 **
g126 <= 0     0.0792     0.0485   1.63  0.242
g1256 <= 0    0.1352     0.0528   2.56  0.040 *
g124 <= 0     0.0758     0.0553   1.37  0.345
g1246 <= 0    0.1501     0.0619   2.42  0.055 .
g1245 <= 0    0.1318     0.0591   2.23  0.084 .
g12456 <= 0   0.2061     0.0653   3.16  <0.01 **
g1236 <= 0    0.1008     0.0517   1.95  0.144
g1235 <= 0    0.0825     0.0483   1.71  0.216
g12356 <= 0   0.1568     0.0557   2.81  0.021 *
g1234 <= 0    0.0974     0.0581   1.68  0.226
g12346 <= 0   0.1717     0.0644   2.66  0.031 *
g12345 <= 0   0.1534     0.0617   2.48  0.048 *
Full <= 0     0.2277     0.0677   3.36  <0.01 **
---


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

> pp5 <- sum5$test$pvalues[1:nrow(contr5)]
> names(pp5) <- paste("P50", names(coef(fit5)), sep = "")
> pp5

 P50g456  P50g346 P50g3456   P50g24  P50g246  P50g245 P50g2456 P50g2356
0.000740 0.005297 0.000271 0.205667 0.020158 0.031620 0.001871 0.004249
 P50g234 P50g2346 P50g2345 P50g23456 P50g146  P50g145 P50g1456  P50g134
0.120122 0.010374 0.016308  0.000832 0.045110 0.070859 0.005194 0.225093
P50g1346 P50g1345 P50g13456 P50g126 P50g1256  P50g124 P50g1246 P50g1245
0.024962 0.037900  0.002247 0.242029 0.039796 0.344904 0.055321 0.084071
P50g12456 P50g1236 P50g1235 P50g12356 P50g1234 P50g12346 P50g12345  P50Full
 0.008107 0.143929 0.216471  0.021215 0.226105  0.031186  0.047790 0.004077

> min5 <- min(pp5)
> min5
[1] 0.0004593971
> which.min(pp5)
P50g3456
       3


Appendix C

Tables of Chapter 4


Table C.1: Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 1.

              SC                     WP                     WH                     FB                     WB
k     Weight  Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+
0.25  0.20    0.1420 0.0528 0.1276   0.2480 0.0448 0.2396   0.2392 0.0428 0.2316   0.2392 0.0172 0.2316   0.2392 0.0172 0.2316
      0.33    0.1420 0.0528 0.1276   0.2316 0.0464 0.2192   0.2240 0.0456 0.2120   0.2240 0.0276 0.2120   0.2240 0.0276 0.2116
      0.50    0.1424 0.0532 0.1276   0.2060 0.0508 0.1896   0.1992 0.0500 0.1832   0.1992 0.0368 0.1832   0.1992 0.0368 0.1800
      0.67    0.1420 0.0528 0.1276   0.1804 0.0560 0.1612   0.1740 0.0544 0.1552   0.1740 0.0480 0.1552   0.1740 0.0480 0.1488
      0.80    0.1288 0.0504 0.1124   0.1504 0.0596 0.1276   0.1456 0.0572 0.1252   0.1456 0.0528 0.1252   0.1456 0.0528 0.1124
0.50  0.20    0.4236 0.2332 0.4116   0.4932 0.2380 0.4864   0.4748 0.2332 0.4692   0.4748 0.1100 0.4692   0.4748 0.1100 0.4676
      0.33    0.4232 0.2328 0.4116   0.4812 0.2444 0.4664   0.4624 0.2408 0.4492   0.4624 0.1476 0.4492   0.4624 0.1476 0.4432
      0.50    0.4236 0.2332 0.4116   0.4568 0.2472 0.4368   0.4340 0.2420 0.4160   0.4340 0.1828 0.4160   0.4340 0.1828 0.4032
      0.67    0.4092 0.2316 0.3940   0.4196 0.2528 0.3960   0.3964 0.2436 0.3756   0.3964 0.2112 0.3756   0.3964 0.2112 0.3500
      0.80    0.3908 0.2276 0.3744   0.3740 0.2576 0.3468   0.3632 0.2520 0.3384   0.3632 0.2388 0.3384   0.3632 0.2388 0.2840
0.75  0.20    0.6952 0.5448 0.6936   0.7164 0.5532 0.7136   0.6928 0.5448 0.6916   0.6928 0.3564 0.6916   0.6928 0.3564 0.6916
      0.33    0.6952 0.5448 0.6936   0.7084 0.5544 0.7020   0.6764 0.5440 0.6720   0.6764 0.4192 0.6720   0.6764 0.4192 0.6688
      0.50    0.6892 0.5444 0.6852   0.6928 0.5596 0.6796   0.6508 0.5420 0.6424   0.6508 0.4712 0.6424   0.6508 0.4712 0.6308
      0.67    0.6772 0.5412 0.6708   0.6604 0.5648 0.6412   0.6236 0.5476 0.6104   0.6236 0.5172 0.6104   0.6236 0.5172 0.5644
      0.80    0.6628 0.5372 0.6548   0.6328 0.5720 0.6108   0.6012 0.5516 0.5840   0.6012 0.5396 0.5840   0.6012 0.5396 0.4980

NP: non-parametric procedures, i.e., weighted Bonferroni (WB), weighted Holm (WH) and fallback (FB); WP: weighted parametric procedure; SC: Song-Chi procedure


Table C.2: Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 2.

              SC                     WP                     WH                     FB                     WB
k     Weight  Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+
0.25  0.20    0.3112 0.2780 0.2040   0.3620 0.2784 0.2500   0.3524 0.2720 0.2444   0.3524 0.2120 0.2444   0.3524 0.2120 0.2316
      0.33    0.3272 0.2940 0.2040   0.3848 0.3116 0.2376   0.3716 0.3020 0.2316   0.3716 0.2632 0.2316   0.3716 0.2632 0.2116
      0.50    0.3452 0.3120 0.2040   0.3996 0.3428 0.2184   0.3912 0.3364 0.2160   0.3912 0.3160 0.2160   0.3912 0.3160 0.1800
      0.67    0.3556 0.3224 0.2040   0.4112 0.3680 0.2072   0.4016 0.3604 0.2032   0.4016 0.3524 0.2032   0.4016 0.3524 0.1488
      0.80    0.3592 0.3320 0.1924   0.4180 0.3880 0.1980   0.4116 0.3824 0.1960   0.4116 0.3788 0.1960   0.4116 0.3788 0.1124
0.50  0.20    0.4948 0.4464 0.4420   0.5492 0.4548 0.4900   0.5344 0.4460 0.4796   0.5344 0.3496 0.4796   0.5344 0.3496 0.4676
      0.33    0.5028 0.4544 0.4420   0.5648 0.4780 0.4792   0.5448 0.4672 0.4668   0.5448 0.4040 0.4668   0.5448 0.4040 0.4432
      0.50    0.5088 0.4604 0.4420   0.5804 0.5096 0.4648   0.5560 0.4940 0.4500   0.5560 0.4604 0.4500   0.5560 0.4604 0.4032
      0.67    0.5108 0.4680 0.4284   0.5788 0.5280 0.4452   0.5604 0.5148 0.4360   0.5604 0.5004 0.4360   0.5604 0.5004 0.3500
      0.80    0.5052 0.4712 0.4112   0.5692 0.5392 0.4236   0.5532 0.5260 0.4164   0.5532 0.5212 0.4164   0.5532 0.5212 0.2840
0.75  0.20    0.7124 0.6576 0.7036   0.7296 0.6644 0.7148   0.7092 0.6492 0.6968   0.7092 0.5148 0.6968   0.7092 0.5148 0.6916
      0.33    0.7172 0.6624 0.7036   0.7416 0.6808 0.7076   0.7112 0.6608 0.6852   0.7112 0.5812 0.6852   0.7112 0.5812 0.6688
      0.50    0.7180 0.6648 0.6960   0.7456 0.6944 0.6968   0.7096 0.6732 0.6716   0.7096 0.6340 0.6716   0.7096 0.6340 0.6308
      0.67    0.7144 0.6660 0.6832   0.7400 0.7088 0.6800   0.7068 0.6856 0.6580   0.7068 0.6688 0.6580   0.7068 0.6688 0.5644
      0.80    0.7092 0.6676 0.6728   0.7324 0.7172 0.6660   0.7084 0.6980 0.6536   0.7084 0.6940 0.6536   0.7084 0.6940 0.4980

NP: non-parametric procedures, i.e., weighted Bonferroni (WB), weighted Holm (WH) and fallback (FB); WP: weighted parametric procedure; SC: Song-Chi procedure


Table C.3: Results of three powers using five procedures with k = 0.25, 0.5, 0.75 for scenario 3.

              SC                     WP                     WH                     FB                     WB
k     Weight  Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+
0.25  0.20    0.6612 0.6568 0.2584   0.7132 0.7064 0.2588   0.7056 0.6992 0.2580   0.7056 0.6792 0.2580   0.7056 0.6792 0.2316
      0.33    0.6988 0.6944 0.2584   0.7664 0.7612 0.2580   0.7596 0.7552 0.2560   0.7596 0.7448 0.2560   0.7596 0.7448 0.2116
      0.50    0.7404 0.7360 0.2584   0.8016 0.7976 0.2556   0.7928 0.7888 0.2552   0.7928 0.7856 0.2552   0.7928 0.7856 0.1800
      0.67    0.7628 0.7584 0.2584   0.8256 0.8228 0.2544   0.8192 0.8164 0.2544   0.8192 0.8152 0.2544   0.8192 0.8152 0.1488
      0.80    0.7736 0.7720 0.2548   0.8408 0.8392 0.2548   0.8352 0.8336 0.2540   0.8352 0.8332 0.2540   0.8352 0.8332 0.1124
0.50  0.20    0.6780 0.6716 0.4956   0.7240 0.7128 0.4972   0.7084 0.6988 0.4920   0.7084 0.6560 0.4920   0.7084 0.6560 0.4676
      0.33    0.7052 0.6988 0.4956   0.7628 0.7536 0.4948   0.7464 0.7384 0.4900   0.7464 0.7148 0.4900   0.7464 0.7148 0.4432
      0.50    0.7292 0.7228 0.4956   0.7920 0.7848 0.4932   0.7768 0.7700 0.4888   0.7768 0.7616 0.4888   0.7768 0.7616 0.4032
      0.67    0.7460 0.7408 0.4904   0.8108 0.8048 0.4920   0.7980 0.7932 0.4892   0.7980 0.7916 0.4892   0.7980 0.7916 0.3500
      0.80    0.7556 0.7520 0.4804   0.8224 0.8188 0.4900   0.8100 0.8076 0.4880   0.8100 0.8076 0.4880   0.8100 0.8076 0.2840
0.75  0.20    0.7520 0.7396 0.7148   0.7692 0.7544 0.7172   0.7540 0.7400 0.7076   0.7540 0.6652 0.7076   0.7540 0.6652 0.6916
      0.33    0.7624 0.7500 0.7148   0.7924 0.7784 0.7144   0.7728 0.7608 0.7028   0.7728 0.7244 0.7028   0.7728 0.7244 0.6688
      0.50    0.7724 0.7600 0.7104   0.8176 0.8048 0.7108   0.7856 0.7792 0.6980   0.7856 0.7636 0.6980   0.7856 0.7636 0.6308
      0.67    0.7780 0.7668 0.7044   0.8304 0.8256 0.7060   0.8044 0.8004 0.6976   0.8044 0.7968 0.6976   0.8044 0.7968 0.5644
      0.80    0.7808 0.7728 0.6980   0.8356 0.8324 0.7048   0.8200 0.8172 0.7012   0.8200 0.8160 0.7012   0.8200 0.8160 0.4980

NP: non-parametric procedures, i.e., weighted Bonferroni (WB), weighted Holm (WH) and fallback (FB); WP: weighted parametric procedure; SC: Song-Chi procedure


Table C.4: Results of three powers for the Song-Chi procedure with k = 0.5 and α*1 = 0.1α, 1α, 10α, compared to the weighted parametric procedure.

                  SC α*1 = 0.1           SC α*1 = 1             SC α*1 = 10*α1         WP
Scenario  Weight  Pall   Po     P+       Pall   Po     P+       Pall   Po     P+       Pall   Po     P+
1         0.20    0.4236 0.2332 0.4116   0.4916 0.2344 0.4884   0.3460 0.2040 0.3280   0.4932 0.2380 0.4864
          0.33    0.4232 0.2328 0.4116   0.4776 0.2336 0.4720   0.4064 0.2268 0.3928   0.4812 0.2444 0.4664
          0.50    0.4236 0.2332 0.4116   0.4620 0.2364 0.4520   0.4372 0.2324 0.4272   0.4568 0.2472 0.4368
          0.67    0.4092 0.2316 0.3940   0.4392 0.2380 0.4240   0.4236 0.2348 0.4084   0.4196 0.2528 0.3960
          0.80    0.3908 0.2276 0.3744   0.4088 0.2344 0.3924   0.4036 0.2320 0.3872   0.3740 0.2576 0.3468
2         0.20    0.4236 0.2332 0.4116   0.4916 0.2344 0.4884   0.3460 0.2040 0.3280   0.4932 0.2380 0.4864
          0.33    0.4232 0.2328 0.4116   0.4776 0.2336 0.4720   0.4064 0.2268 0.3928   0.4812 0.2444 0.4664
          0.50    0.4236 0.2332 0.4116   0.4620 0.2364 0.4520   0.4372 0.2324 0.4272   0.4568 0.2472 0.4368
          0.67    0.4092 0.2316 0.3940   0.4392 0.2380 0.4240   0.4236 0.2348 0.4084   0.4196 0.2528 0.3960
          0.80    0.3908 0.2276 0.3744   0.4088 0.2344 0.3924   0.4036 0.2320 0.3872   0.3740 0.2576 0.3468
3         0.20    0.4236 0.2332 0.4116   0.4916 0.2344 0.4884   0.3460 0.2040 0.3280   0.4932 0.2380 0.4864
          0.33    0.4232 0.2328 0.4116   0.4776 0.2336 0.4720   0.4064 0.2268 0.3928   0.4812 0.2444 0.4664
          0.50    0.4236 0.2332 0.4116   0.4620 0.2364 0.4520   0.4372 0.2324 0.4272   0.4568 0.2472 0.4368
          0.67    0.4092 0.2316 0.3940   0.4392 0.2380 0.4240   0.4236 0.2348 0.4084   0.4196 0.2528 0.3960
          0.80    0.3908 0.2276 0.3744   0.4088 0.2344 0.3924   0.4036 0.2320 0.3872   0.3740 0.2576 0.3468

WP: weighted parametric procedure; SC: Song-Chi procedure