Case studies of dealing with study design issues

About the webinar

As trials increase in complexity and scope, there is a requirement for trial designs to reflect this.

From non-proportional hazards in survival analysis to cluster randomization, we examine how to deal with the study design issues of complex trials.

Extending A Trial’s Design

In this free webinar you will learn about:

  • Dealing with study design issues
  • Practical worked examples of:
      • Non-proportional hazards
      • Cluster randomization
      • Three-armed trials
      • & more

Duration: 60 minutes 


Non-proportional Hazards
Non-proportional hazards and complex survival curves have become of increasing interest, due to being commonly seen in immunotherapy development. This has led to interest in assessing the robustness of standard methods and alternative methods that better adapt to deviations.

In this webinar, we look at methods proposed for complex survival curves and the weighted log-rank test as a candidate model to deal with a delayed survival effect. 

Cluster Randomization
Cluster-randomized designs are often adopted when there is a high risk of contamination if cluster members were randomized individually. Stepped-wedge designs are useful in cases where it is difficult to apply a particular treatment to half of the clusters at the same time.

In this webinar, we introduce cluster randomization and stepped-wedge designs to provide an insight into the requirements of more complex randomization schedules.

Three Armed Trials
Non-inferiority testing is a common hypothesis test in the development of generic medicines and medical devices. The most common design compares the proposed non-inferior treatment to the standard treatment alone, but this leaves it uncertain whether the standard treatment's effect is the same as in previous studies. This “assay sensitivity” problem can be resolved by using a three-arm trial, which includes placebo alongside the new and reference treatments for direct comparison.

In this webinar we show a complete testing approach to this gold standard design and how to find the appropriate allocation and sample size for this study.

Transcript of webinar
*Please note, this is auto generated, some spelling and grammatical differences may occur* 

Hello and welcome to today's webinar, Extending a Trial's Design: Case Studies of Dealing with Study Design Issues, demonstrated on nQuery.
So today's webinar will be looking at some common issues that occur in clinical trials and field trials where standard statistical methods or designs aren't appropriate, how you could extend those standard methods to get a more appropriate statistical analysis for your design, and the implications that would have for your sample size determination.
So, before we get started, just a few frequently asked questions. Firstly, is this webinar being recorded? Yes, it is being recorded, and the slides and the recording will be sent to you after this webinar is complete, later today.
Secondly, if you have any questions about anything in this webinar, please feel free to use the questions tab on the right-hand side of your webinar software. I'll try to answer as many of those as possible at the end of the webinar, and for any that I don't get round to, I'll make sure to e-mail you with an answer, either today or tomorrow.
OK, I think that basically covers that, but if you have any issues with the audio, or any questions, do let me know using that chat window.
So as I said, today's webinar is Extending a Trial's Design: Case Studies of Dealing with Study Design Issues. Before we get started, let me introduce myself. My name is Ronan Fitzpatrick, Head of Statistics here at ..., who develop nQuery. I've been nQuery lead researcher since nQuery ... many years ago at this point. I've given workshops and talks at places like the FDA and JSM, and obviously give a lot of these webinars and demos as well for nQuery.
So in terms of the extensions, the areas, the case studies we're going to cover today, we've split them into three parts. Firstly, we'll be looking at the issue of non-proportional hazards in the context of survival analysis. Then we'll be looking at cluster randomization and newer extensions involving stepped-wedge designs, and then we'll be looking at three-arm trials in the context of non-inferiority testing, with a small amount of time for conclusions, discussion and replies to your questions at the end.
Obviously, this webinar is being presented with nQuery, your number-one solution for optimizing clinical trials. This complete trial design platform will help make your clinical trials faster, less costly and more successful.
In 2019, 90% of organizations with clinical trials approved by the FDA used nQuery. You can see some of the reviews here, and also some of the companies that use nQuery along the bottom there as well.
So, with that out of the way, let's get into the meat of the webinar, and as I said, these are case studies. So, what I'm hoping to do is show that, you know, you have your standard method, let's say your log-rank test, let's say your t-test. What are some of the complications or design issues that could cause you to reconsider whether that design or test is appropriate? And what kind of adjustments are then required when you're talking about, say, something like your sample size calculation?
Which, obviously, is of high interest, because nQuery is one of the leading sample size determination software packages in clinical trials. So, the first thing we're going to look at is non-proportional hazards, which is relevant for survival analysis. Now, I've covered survival analysis in depth in several webinars in the past, so if you're interested in knowing more about sample size for survival analysis, we'd be happy to share the link to the recordings of those on request. So feel free to get in touch about that. In short, survival analysis is about making statistical inferences about the expected duration of time to an event. And, as the name survival analysis implies, that event is often something quite serious, such as mortality or death. But it could also be something like progression to the next stage of cancer, for example. And for this type of analysis, where we are trying to estimate the expected time until you have an event, there are many, many methods proposed. But the two most commonly used are probably the log-rank test and the Cox proportional hazards regression model. Technically, the log-rank test is really just a special case of the Cox model, but we don't need to go into that today.
Now, the important thing to note about survival analysis in the context of sample size determination, when we're designing a study, is that the power is related to the number of events, not the sample size. Why is this? Because individuals who don't have the event will end up being right-censored. And while I wouldn't say they are ignored completely (of course they aren't, they are used to adjust the analysis), they're not the ones driving the statistical hypothesis of how much time it takes to get to an event. Because if you haven't had the event by the end of the study, then you are not contributing an event time that can be used to differentiate your control and treatment groups. And so in the sample size calculations that we'll be using here, both for your standard methods like the log-rank test and for methods that apply to non-proportional hazards, the sample size is really just a determination of how many people I think I will need to get the number of events. It's almost more like a meta-model on top of your events calculation, which is the core of what you need to do. So, if I require 100 events, and I say I need a sample size of 200, then if it ends up that recruitment is slower than expected, or the events are coming in slower than expected, I can take steps to accelerate that, to get more of the events that I actually need. And because of that flexibility in the analysis and design of survival analysis, studies tend to be much higher, and, you know, that will depend on many of the assumptions about what we think is going to happen, and about how we've chosen to design our study in terms of stuff like accrual time, follow-up time, dropout, et cetera.
So, as I mentioned, I've covered many of these issues in previous webinars specifically on survival analysis, but the one that we're interested in today is a very common assumption for survival analysis, for the regression model in particular: that of proportional hazards, which is basically that you have a constant hazard ratio between your two groups. So, basically, no time-varying variation in your treatment effect, or indeed in any of the other effects that you choose to include as covariates. There are lots of other things like accrual patterns, follow-up, censoring and crossing over. I'm not really going to cover these in detail today because they're not the core part of this webinar. But as I say, if you're interested in these, I have covered them in the past, so get in touch if you want to see our core material on survival analysis. So we're going to focus here today just on the issue of non-proportional hazards.
So as mentioned, the proportional hazards model, and implicitly the log-rank test, rely on the proportional hazards assumption. Ignoring this can lead to wrong inferences, and there are very different ways that a violation could occur.
So when we have this non-constant effect size, this non-constant hazard ratio, different things could be happening. You can see here on the right-hand side, we could have a delayed treatment effect. This is very common in the case of immunotherapies, for example, where what they're seeing is that it takes some amount of time, a few weeks or a few months, where you see no difference between your immunotherapy and the proposed control or reference treatment. But then after that period of time you do start to see that significant improvement. And that's obviously very different from the assumption that these two lines will start separating at a constant rate over time under the proportional hazards assumption. And you could show very easily that the Cox model, or the log-rank test, will have reduced power if you have this delayed effect. But there are also other major issues, such as crossing hazards, where the treatment effect basically reverses: your treatment is initially, say, better in the early stages, but by the end is actually worse than the control group. And then you could have a diminishing treatment effect, or an increasing treatment effect, where the treatment is basically more valuable as you go on towards the end of the trial.
The delayed effect is kind of a subset of the increasing treatment effect problem.
So what you basically have here is that there are a number of reasons for this in, say, immunotherapies. The leading hypothesis for why you get a delayed effect is simply that the treatment takes time to activate. There are other hypotheses, such as that this could be explained by immunotherapies being more targeted: there's basically a subgroup where certain genotypes are better served by certain immunotherapies, because a lot of them are developed in the context of being targeted at certain cancers or at people with certain genotypes. So that could have an effect where you see a similar delayed-treatment-effect type of hazard curve, but which is actually due to some people responding and some people not responding. Then it could just be other covariates or stratification effects that are causing this over time. And sometimes things are just complicated: sometimes you just don't get a constant hazard ratio over time.
That sometimes happens, and I think as we've seen a massive improvement in the quality of data for survival analysis over the last 20 years, this issue has been noticed a lot more over time. As you get more times that you follow up with patients, as you get more accurate results, they've found that this NPH problem has become more significant. And because of that, we're seeing a very active debate about how we are going to deal with this, both in the context of pre-clinical research and just out there in the wild, but also in the context of clinical trials in particular. So there have been a lot of different methods proposed, and this is a very active area of research and debate within clinical trials, and phase III clinical trials as well. So none of this is set in stone, far from it, but there are lots of proposed solutions at the moment. There's the usage of weighted linear rank tests, which is basically where events, depending on when they occur during the study, are weighted higher or lower. So, for example, using something like a Tarone-Ware or ...; those are ones that tend to emphasize earlier events more than the log-rank test or .... But, let's say, they're basically equivalent if you only have a single effect for treatment in the Cox model; that's a log-rank test. And then there's the suggestion of taking the weighted linear rank test one step further, where you actually combine the results from multiple linear rank tests, weight those appropriately based on what actually happened, and then use that to pull out a kind of robust result or statistical inference. There's a lot of debate about the max-combo test, and a lot of interesting papers on it right now. And then there's the one we'll be focusing on today, which is a piecewise weighted log-rank test, which is basically an extension of the rank test, for example the log-rank test, but where, instead of having just a single weighting scheme, we also split our trial into distinct time periods, where we explicitly say we expect different hazard ratios to occur and make inferences based on that. So basically weighting different sub-units of time of the study differently.
We'll talk about that in a moment. But, just for the sake of fairness, there are also other model-free approaches that have many proponents, such as the median survival, just using that as your measure of effect. There are Kaplan-Meier (KM) curves, and there are ways to directly compare those. One of the ones which has seen a lot of traction in recent years is the use of the restricted mean survival time, or RMST. Kaplan-Meier curves are a non-parametric way to characterize your survival data, and RMST is one way to compare two Kaplan-Meier curves against each other in a principled fashion.
This is an area that we in nQuery are going to be looking at a lot over the next couple of releases. But we're just going to be focusing today on the piecewise weighted log-rank test, which was included in our last update a couple of months ago.
So, what is the piecewise weighted log-rank test? Well, it's basically a model that has been proposed to deal with the non-proportional hazards problem, and the version that we'll show today is specifically related to the delayed effect. So, remember, the delayed effect is where, for the first time period, there is no difference between the control and treatment. But then, after that, there is something that looks like a constant, proportional-hazards type of thing occurring, as you see here. So, for this first period here, nothing's happening. And then suddenly you get something that looks like your standard survival curve.
And so the piecewise part is just saying that there's a different assumed hazard ratio per time period, namely, in the simplest case, a hazard ratio equal to one until some specified time, and then some constant hazard ratio from that point on. And the weighted part just refers to the different weights that are applied according to how important we consider the hazard ratio from each of those periods.
And so the simple model that we're talking about here, what's referred to by Zhu et al. in 2017 as the APPLE model, is one where we assume a hazard ratio of one from time zero to time T, and then a constant hazard ratio from time T onwards until the end of the study, where we do right-censoring. And for the weights that we'll use, we're going to give no weight to the period where the hazard ratio equals one. So, they said, we're not too worried about the first six months. That's just how it is. What we're really interested in is, you know, after six months, for example, is our treatment going to show its efficacy? That's what we're going to focus on here, and we give that a weight of one post that time.
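The delayed-effect assumption described above can be written down as a survival curve. The following is a minimal sketch, assuming exponential hazards within each period; the function name and the illustrative values (control median of six months, hazard ratio of 2/3, six-month delay) are chosen here for illustration and are not taken from any nQuery API.

```python
import math

def delayed_effect_survival(t, control_rate, hazard_ratio, delay):
    """Treatment-arm survival when HR = 1 up to `delay` and a constant
    `hazard_ratio` applies afterwards (piecewise exponential assumption)."""
    if t <= delay:
        # Before the delay the treatment arm follows the control hazard.
        return math.exp(-control_rate * t)
    # After the delay the hazard is scaled by the hazard ratio.
    return math.exp(-control_rate * delay
                    - control_rate * hazard_ratio * (t - delay))

# Illustrative values only: control median survival of 6 months.
control_rate = math.log(2) / 6
s6 = delayed_effect_survival(6, control_rate, 2 / 3, delay=6)
s12 = delayed_effect_survival(12, control_rate, 2 / 3, delay=6)
print(f"S(6) = {s6:.3f}, S(12) = {s12:.3f}")
```

Up to the delay the two arms share one curve, so any separation between them can only come from events after that point.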
So that's a very simple model. You could use a lot of different weightings there. You could also assume different hazard ratios. But this very easy, simple-to-understand APPLE model is the starting point. And you can also use simulation to evaluate what would happen if we change the weightings or change these hazard ratios, like the null hazard ratio before T, et cetera. But it is important to note that no matter what you're doing in regards to those weights, you are having to make a fairly strong assumption about when you expect the treatment to start being efficacious. So you're having to say, I think, after six months, after 12 months, after three months, after a week, the immunotherapy will kick in, for lack of a better term. And you need to know that a priori, and you need to build that into your model. Obviously, in the context of clinical trials in particular, you can't just make this up as you go along. These models have to be pre-specified.
That's a key part of regulatory practice in terms of designing your study. That said, there's always, of course, wiggle room if things turn out differently from what you expect. But, ideally, you would know, and have specified using the scientific evidence available, when you think this time will occur, when the immunotherapy will show its efficacy over the preexisting treatment. And much like the regression model, you do have to make a fairly strong assumption about what you think the baseline hazard is going to be, because under the baseline hazard in the ... model, the baseline hazard determines what's going to be happening in both groups.
So you're assuming that both groups are basically going to have more or less the same type of hazard profile. You're having to make an assumption about that, and that's a fairly strong assumption. And then you have the random time-lag model from Zhu et al., which extended this to allow the time until the effect kicks in to vary by subject. We're not going to talk about that today, but just be aware there are extensions for that, and that is obviously a response to the criticisms, or some of the restrictions, outlined here. But for today, we'll just focus on the simple one here. And if you're interested in how the test statistic looks, it looks like this here, with this W S, the middle one here, being the simplified APPLE model, which is just a subset of the much wider piecewise weighted log-rank test statistics.
So this is that subset, where the hazard ratio equals one up to time T and is constant from then onwards, with weight zero before time T and weight one post time T.
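The slide itself is not reproduced in this transcript. As a hedged reconstruction, the general weighted log-rank statistic with the zero/one weights just described can be written as follows. The notation is an assumption made here: $t_j$ are the distinct event times, $d_j$ and $d_{1j}$ the events overall and in the treatment group at $t_j$, $n_j$, $n_{1j}$ and $n_{0j}$ the numbers at risk overall, in the treatment group and in the control group, and $t^*$ the delay time. It may differ in detail from the form shown in the webinar.

```latex
Z_w = \frac{\sum_j w(t_j)\left(d_{1j} - d_j \dfrac{n_{1j}}{n_j}\right)}
           {\sqrt{\sum_j w(t_j)^2 \, \dfrac{n_{1j}\, n_{0j}\, d_j\, (n_j - d_j)}{n_j^2\,(n_j - 1)}}},
\qquad
w(t) =
\begin{cases}
0, & t \le t^* \\
1, & t > t^*
\end{cases}
```

Setting $w(t) = 1$ for all $t$ recovers the standard log-rank statistic, which is why the standard test can be viewed as one member of this weighted family.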
This could obviously be extended to three, four or five different piecewise parts as well. But for now, we'll focus on the simple two-part one. So, to illustrate this, I'm actually going to extend a preexisting example from the survival webinars that I've done previously, which was looking at a ... for advanced pancreatic neuroendocrine tumors, from the New England Journal of Medicine. What I'm going to do here is start off with what they actually planned for the study, which is where this particular treatment kicked in immediately with a constant hazard ratio. But then, imagine that maybe this is an immunotherapy instead, or maybe this is what actually ended up happening for ....
Obviously, it didn't, but just hypothetically, instead of the hazard ratio of 0.6666 occurring from the start, let's assume that for the first six months there's no difference in the hazard in each group, and only after that does this hazard ratio of 0.666 kick in.
So, remember, the original is a standard log-rank test analysis.
All I'm going to do is extend this under the assumption that for the first six months there is no difference in our groups, and only subsequently would we find this hazard ratio of 0.66. So we're really taking a preexisting example and extending that. But to be fair, what I'm first going to do is replicate the example, just to make sure that we have the original example done correctly.
So, you know, this is nQuery. If anyone's familiar with nQuery, the sample size calculations happen in this top left panel. There are help cards here to give you the definition of each row, and the name of each row is also given on the far left-hand side. Each column is an individual calculation, with the rows in yellow being those that you can calculate for, such as power and sample size. So let's do our calculation in column one. They had a test significance level of 0.025. That was the one-sided level.
They had an accrual period of 74 weeks, but you'll have noticed that the median survival is given in months. In survival analysis, it's very important that all time-dependent parameters are parameterized in the same time unit. So we're going to take the accrual and follow-up periods in weeks and turn them into months, simply by dividing by four. So 74 divided by four is equal to 18.5, and the maximum length of follow-up is simply the accrual period plus the minimum follow-up period. In this case, the minimum follow-up period was 39 weeks, and 74 plus 39 is equal to 113.
We take 113 and divide it by four, and we get 28.25.
If you want to check the math, feel free to do so, but I have done this example a number of times, so I'm fairly confident in the numbers at this point.
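For anyone who does want to check the math, the conversion is short enough to script. This is a small sketch using the same four-weeks-per-month convention as the walkthrough; the variable names are illustrative.

```python
WEEKS_PER_MONTH = 4  # the simple conversion used in the walkthrough

accrual_weeks = 74
min_follow_up_weeks = 39

# Accrual period in months.
accrual_months = accrual_weeks / WEEKS_PER_MONTH

# Maximum length of follow-up = accrual period + minimum follow-up period.
max_follow_up_months = (accrual_weeks + min_follow_up_weeks) / WEEKS_PER_MONTH

print(accrual_months, max_follow_up_months)
```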
So the other thing we need to do is enter the exponential parameters. This is a model where we're going to assume exponential survival curves, a very standard assumption when we're starting a sample size calculation. For median survival, we're going to have nine months in group one, which is the treatment group, and six months in group two, which is where we get this hazard ratio of 0.66. So we're using the side table here, which appeared automatically when we selected the exponential parameter row. This is just a small tool to convert between different types of survival parameters. We now have our exponential parameters from our median survivals, so we click Transfer to move those to the main table.
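The conversion performed by the side table follows from the exponential survival function: if S(t) = exp(-rate × t), then the median satisfies rate = ln(2) / median. A minimal sketch (the function name is hypothetical, not part of nQuery):

```python
import math

def median_to_rate(median_survival):
    """Exponential event rate implied by a median survival time."""
    return math.log(2) / median_survival

treatment_rate = median_to_rate(9)  # group one, median of 9 months
control_rate = median_to_rate(6)    # group two, median of 6 months
hazard_ratio = treatment_rate / control_rate

print(f"treatment rate = {treatment_rate:.4f}")
print(f"control rate   = {control_rate:.4f}")
print(f"hazard ratio   = {hazard_ratio:.4f}")
```

Under the exponential assumption the hazard ratio is just the ratio of the medians inverted, 6/9, which is where the 0.666 comes from.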
Then we enter our power of 92.6%, and we get 176 per group, which is equal to the 352 given in the researchers' statement. So, 352 patients, you can see that here.
And then 282 events. So remember, we need 282 events. And under these assumptions about how long we'll be recruiting people, and the minimum follow-up, we're saying we would need 352 patients to get that number of events before dropout. If we want to account for dropout, it goes up to 392.
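As a rough cross-check on the 282 events, the widely used Schoenfeld approximation for a 1:1 log-rank design gives the required number of events directly from the hazard ratio, significance level and power. This is a sketch of that approximation only; it is an assumption that it tracks the software's own method closely here, and small rounding differences are to be expected.

```python
import math
from statistics import NormalDist

def schoenfeld_events(hazard_ratio, alpha_one_sided, power):
    """Approximate events needed for a log-rank test with equal allocation."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha_one_sided)
    z_beta = z(power)
    # Factor of 4 corresponds to 1:1 allocation between the two groups.
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2

events = schoenfeld_events(hazard_ratio=2 / 3, alpha_one_sided=0.025, power=0.926)
print(f"approximate required events: {events:.1f}")
```

Note that nothing about accrual or follow-up enters this formula, which is the point made earlier: power is driven by events, and the sample size is a separate judgement about how many patients are needed to observe them.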
OK, so moving on, that's a fairly simple example. So now let's move on to the piecewise weighted log-rank test.
So, as I said, the major thing to take away from this is that now we're going to make an assumption that for the first six months this difference did not exist, and that the exponential parameters were actually the same in both groups. It's equivalent to having a median survival of six months in each group, basically, for the first six months.
But everything else we'll keep the same, so we'll keep 0.025 as the one-sided level, an 18.5-month accrual period, a 28.25-month maximum length of follow-up, and a power of 92.6%.
We'll have equal sample sizes per group, as assumed in the last table.
And so, now, we're saying, OK, what is the time to treatment effect, T one? That's our delay time, or delayed effect, whatever you want to call it. And in this case, we want to say that it's six months.
So, we're saying that for the first six months, let us assume that, instead of immediately having a hazard ratio of 0.666, that is, a reduction of one third in the events in the treatment group compared to the control group, both groups are going to be at the same hazard as the control group, because that's our baseline hazard. We assume it's going to be roughly what we see in our control group; that's the assumption we're making here.
And then we need to calculate the proportion surviving beyond the time to effect, P, which is basically just how many people will still be alive at that point. For simplification, say, assuming everyone had started from the very beginning, how many people would be alive, or not have had the event, by the time we got to six months? Now, thankfully, we have a situation here where, if we assume that the control rate corresponds to a median survival of six months, and the median survival is six months in each group up until six months, then we would expect 50% of people to not have had the event.

I think that should be relatively clear. The median survival is, by definition, the time when 50% of the events have occurred, so if both groups are characterized by a median survival of six months up until six months, then we'd expect 50% to be alive.
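The 50% figure is a special case of a general calculation: under an exponential model, the proportion surviving beyond the delay is exp(-rate × delay). A short sketch, with a hypothetical helper name, that would also cover delays that do not coincide with the median:

```python
import math

def proportion_surviving(median_survival, time_to_effect):
    """Proportion event-free at `time_to_effect` under exponential survival."""
    rate = math.log(2) / median_survival
    return math.exp(-rate * time_to_effect)

p_at_six_months = proportion_surviving(median_survival=6, time_to_effect=6)
p_at_three_months = proportion_surviving(median_survival=6, time_to_effect=3)
print(f"{p_at_six_months:.3f}, {p_at_three_months:.3f}")
```

With a three-month delay instead, for example, the same control median would leave about 71% of subjects event-free when the effect starts.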
That's ignoring the accrual effect, by the way. Obviously, that would make things a lot more difficult to analyze by hand, but accrual is accounted for within the calculation here, within this little formula here.
And then, we have our post-delay hazard ratio, which we're going to keep the same as before at 0.66666, or two over three.
And you'll see here that we'll need 848 people in total.
A very substantial increase, you can see, over the previous normal log-rank test, which makes a lot of sense. If we're missing our effect size for the first six months, that's a very substantial amount of time where we're not seeing an effect.
And therefore, obviously, we're only really dealing with the last, in this case, around 18 months, where we have the effect occur. And so the sample size increases very much, much bigger than what we had before. But, as we'll see in a moment, if this effect existed and we had just used our standard log-rank test, we would have seen a very dramatic reduction in our power compared to the 92% power we see here.
And so that's something to keep in mind, basically.
When looking at this, you can see we need 424 per group.
So we're talking about a 200-and-something increase per group over the previous example. So that's a non-trivial amount; we've more than doubled the sample size. But that's not surprising for this type of effect. This is a very strong delay, a very long amount of time to have this. And this obviously illustrates that an immunotherapy is a more challenging type of treatment to evaluate fully. Maybe it will require larger clinical trials. That's maybe somewhat of a drawback, and indeed so is the idea that it takes six months for it to kick in. If you're at a very severe stage of disease, six months can be a very long time in practical terms.
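One way to see where the larger sample size comes from is to count how many events are expected after the six-month delay, since those are the only ones given weight. The following is a back-of-envelope check under the walkthrough's assumptions (uniform accrual over 18.5 months, study end at 28.25 months, no dropout, exponential hazards). It is not the APPLE formula itself, and the helper name is illustrative.

```python
import math

def mean_survival_over_uniform(rate, lo, hi):
    """Average of exp(-rate * u) when u is uniform on [lo, hi]."""
    return (math.exp(-rate * lo) - math.exp(-rate * hi)) / (rate * (hi - lo))

n_per_group = 424
delay, accrual_end, study_end = 6.0, 18.5, 28.25
control_rate = math.log(2) / 6
treatment_rate_after_delay = control_rate * 2 / 3

# Everyone is followed for at least 9.75 months, so each subject alive at
# 6 months has between 3.75 and 22.25 further months of follow-up.
lo = study_end - accrual_end - delay
hi = study_end - delay
alive_at_delay = math.exp(-control_rate * delay)  # same in both groups

late_control = alive_at_delay * (
    1 - mean_survival_over_uniform(control_rate, lo, hi))
late_treatment = alive_at_delay * (
    1 - mean_survival_over_uniform(treatment_rate_after_delay, lo, hi))

expected_late_events = n_per_group * (late_control + late_treatment)
print(f"expected events after the delay: {expected_late_events:.1f}")
```

Under these assumptions the count of post-delay events comes out close to the number of events the original design required, which suggests the extra patients are there mainly to replace the events "spent" during the period with no treatment effect.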
So, as I said, though, what if we assume the same thing, the same study design, but had continued to use the log-rank test? What would have happened? Well, we can actually evaluate that in nQuery using simulations. So, how do we do this? Basically, what we're going to do is set up a design where there was no difference in the groups for the first six months, then set it so that the hazard rate in each group reverts to what we had in our original design, and then see what would have happened.
So, I'll just set up our simulation design here.
Set a random seed here. And so, in this table, we're going to set up a full set of what we think is going to happen in the study and replicate what is basically being assumed here: six months of no difference, followed by a hazard ratio of 0.666, which is basically equivalent to these two exponential parameters.
So, for the first time period, we'll set the end time to six for column two here in the side table, then 18.5 here, just to ensure that we have the same accrual period as previously, and then 28.25 for the end of the study. And then for the accrual, we're assuming constant accrual up until this point. Let's keep it that way.
So six divided by 18.5 gives you around 0.324, so this is 32.4%, and then 67.6% for the second time period, and then zero accrual for the last one.
So this is just equivalent to saying there was constant accrual from time 0 to 18.5, at which point we stopped accruing new subjects. And then we get into the dropout problem. Let's just set that to zero. We're not assuming any dropout in this particular example.
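The accrual percentages entered per period follow directly from the constant-accrual assumption: each period receives a share of recruitment proportional to how much of the accrual window it covers. A small sketch with illustrative variable names:

```python
period_ends = [6.0, 18.5, 28.25]  # end of each simulation time period (months)
accrual_end = 18.5                # recruitment stops here

accrual_pct = []
start = 0.0
for end in period_ends:
    # Length of this period that overlaps the accrual window.
    overlap = max(0.0, min(end, accrual_end) - start)
    accrual_pct.append(100 * overlap / accrual_end)
    start = end

print([round(p, 1) for p in accrual_pct])
```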
So then, let's set up what we think is going to happen. We're saying that for the first six months there's no difference, and both of the hazard rates are equal to 0.116. So, remember, that's equivalent to a median survival of six months.
So, we're just going to copy and paste this just to speed things up.
And so in this case, we're assuming that group two is the control group, so that's obviously going to be the same for the entire study in the group two row here. But in the group one row, the treatment group, for the first six months it's the same, and for the subsequent two time periods it reverts to what we expected in the original design: this exponential event rate of 0.077, equivalent to a median survival of nine months if we had started from the beginning. Let's just enter that here.
And then let's pull in our sample size of 424 per group.
And see what power we get.
So it'll take just a few moments to run.
You know, obviously, we could replicate the original log-rank test, but that's relatively trivial here. But you can see that if we had used a standard log-rank test, under the same assumptions that we had used to find a sample size using this fairly simple APPLE model, we would only have had a power of 57%: only a 57% chance that we would have a significant p-value, given this effect existed, if we'd used a standard log-rank test instead of the piecewise .... So, obviously, this APPLE model is fairly simple and fairly restrictive, but if this is what actually happened, you're seeing a very, very large increase in power. Let's just decrease the number of simulations to make this simpler.
We could see roughly how much sample size we would have needed to get to an equivalent power to the APPLE method, under this assumption, by plotting it out here.
So this will take just take a few moments to run. So remember, you know, this is if this is a standard log rank test, obviously, you're better off using, you know, a log rank test if you have a constant hazard ratio. That's the assumption of where the most powerful test, the rational model. It's basically the same thing.
But if we have this delayed effect, test model suddenly becomes a lot stronger, because we would have needed. We don't have around 50% power for the same type of results here. And that's, yeah.
Let me just leave this here for a second.
We probably need to increase the sample size significantly to get back up to that result.
OK, I think that's enough on that. Hopefully you've learned something there. As I say, there are a lot of other models proposed for the same problem, like the max-combo test, these weighted linear rank tests, and the restricted mean survival time, or RMST. We'll be looking at all of those in future releases of nQuery. If there are particular solutions that aren't mentioned here, or ones that you think are particularly interesting, do feel free to get in touch, and that will help us prioritize which ones to look at sooner rather than later.
OK, the next section we're going to move on to is cluster randomization.
So, cluster randomization is where the treatment is randomized by cluster, not by subject. Let us say we have 100 people.
And there are 10 schools, so we have 10 students per school, or probably per class in this case.
And we have two options to randomize. We can take the full hundred people and, regardless of which school they're in, put 50 into the control group and 50 into the treatment group. This is a fully randomized design, probably the default in a clinical trial.
But maybe that's difficult. If you were doing a teaching intervention, it's kind of hard to take, say, three students in one class and give them one teaching method, and give the other seven a different teaching method, in the same classroom. It's, we'll say, nearly impossible to do that. So what you have to do is randomize five of the classes of 10 students to the treatment group, and the other five classes to the control group.
And that's where you have cluster randomization.
I think the important thing to note is that, if you do that, the effective unit of inference really becomes the cluster. This is emphasized by the ..., or ... school of approach to inference, which Stephen ... talks about a lot, and this is where these screenshots for cluster randomization are from.
It's important to emphasize that if you move up one level in randomization, that directly affects the causal framework for how you should interpret that study. The causal level is really the cluster level at that point. Now, you can make subject-level inferences using some assumptions, but the cluster is really the level of causal inference in this randomization scheme. Basically, causality follows randomization; that is, more or less, the idea of experimental design.
But at a practical level, if you're going to use a cluster randomized design, you're not getting something that's fully randomized. You're getting something that's hopefully close to it, but not quite. Imagine that five of those are boys' schools and five are girls' schools, but three or four of the boys' schools end up on one side and only one on the other side. That increases the probability of imbalances. So what you need to do is adjust for the fact that you're going to see more self-similarity within clusters than between clusters. In a boys' school in a poor neighborhood, the students are probably going to be more similar to each other than to students in a girls' school in a more socioeconomically privileged area, and therefore you need to adjust for how self-similar those clusters are. And you can also do a direct adjustment for those covariates as well, of course.
But ideally you also need to account for the intrinsic self-similarity effect using a measure such as the intracluster correlation, the coefficient of variation, or the within-cluster variance. You can actually show that most of these are relatively interchangeable, ultimately expressing the same idea from different directions.
As I said, these designs are useful mostly if it's difficult to randomize by subject, but they can also sometimes lead to reduced costs. And if there's, as I say, a contamination effect, where it's difficult to split, say, five people within a cluster to get one treatment and five to get the other because they share the same space, the same hospital rooms, the same schoolrooms, then that contamination effect can undermine a fully randomized design. Therefore you may need to go to cluster randomization, but you do so at the cost of lowering your overall power, or effective sample size. Then there's the problem of selection bias, and a higher chance of ending up with unbalanced designs after randomization, because you're moving multiple people at the same time rather than each person individually.
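To make the loss of effective sample size concrete, here is a minimal sketch using the common design-effect approximation for equal-sized clusters, 1 + (m - 1) x ICC. That formula is a standard textbook approximation rather than something stated in the talk, and the ICC of 0.05 below is a hypothetical value chosen only for illustration.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from randomizing equal-sized clusters instead of subjects."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent subjects that carry the same information."""
    return n_total / design_effect(cluster_size, icc)


# The 100-student example: 10 classes of 10, with a hypothetical ICC of 0.05.
deff = design_effect(cluster_size=10, icc=0.05)
n_eff = effective_sample_size(n_total=100, cluster_size=10, icc=0.05)

print(deff, round(n_eff, 1))
```

Under these assumed values, the 100 students behave like roughly 69 independently randomized students, which is the power cost described above.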
So that's what I've mostly talked about there: a simple cluster randomized design. But there are extensions of this, such as a stepped-wedge design, which is where subjects move from control to treatment over time. Look at this pattern here. What we're saying is that at time point one, after the first month, everyone is in the control group. At time two, group six gets the treatment. And then over time that moves up and up until everyone is getting the treatment.
It's almost like a crossover-type design where everyone starts on control, at baseline of course. This design is often paired with cluster randomization; once again, the unit is the cluster, not the subject. One reason why you might want to do this is treatment scarcity. Imagine you want to teach a new nursing technique. You probably don't have that many people who are capable of teaching that method, because it's a new intervention that you're trying to evaluate. So you only have one team to teach it, and they can't be in five places at once. They need to teach in hospital one for the first month, and only in the second month can they move on to the next cluster, to hospital two, then to hospital three in month three, and so on and so forth. That naturally creates a stepped-wedge design, and it needs to be accounted for in your modeling, which you can see on the right-hand side, and in your statistical analysis, power analysis, et cetera.
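The step pattern described here can be written down as a cluster-by-period matrix. The sketch below is an illustration only: it assumes an all-control baseline period and a fixed number of clusters crossing over at each later period, and the cluster ordering is arbitrary (the slide has the last-numbered group crossing first).

```python
def stepped_wedge(n_clusters, clusters_per_step=1):
    """Build a stepped-wedge allocation matrix.

    Rows are clusters, columns are time periods; 0 = control, 1 = treatment.
    Period 0 is an all-control baseline, and clusters_per_step clusters
    switch to treatment at each subsequent period.
    """
    if n_clusters % clusters_per_step != 0:
        raise ValueError("n_clusters must be a multiple of clusters_per_step")
    n_periods = n_clusters // clusters_per_step + 1
    design = []
    for cluster in range(n_clusters):
        crossover_period = cluster // clusters_per_step + 1
        row = [1 if period >= crossover_period else 0 for period in range(n_periods)]
        design.append(row)
    return design


design = stepped_wedge(n_clusters=6)
for row in design:
    print(row)
```

Each row is one cluster's history, so the within-cluster before and after comparison and the between-cluster comparison at any one period can both be read off the same matrix.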
There is one nice thing from a statistical point of view, though: you are getting this within-cluster analysis. Similar to a crossover trial, you can compare how a cluster was doing before the intervention and after the intervention directly. So you're getting that within-cluster analysis as well as the between-group analysis, which is something. But obviously there are some issues here, where the allocation bias, the expense, and the complex analysis may all mean that this is not viable for your particular study. So cluster randomization and stepped-wedge cluster randomization are really tools usually used in response to real practical problems or realities, which you need to account for in your statistical analysis because that's how the study was run.
And so what we're going to do here is a stepped-wedge design, from a study looking at membrane sweeping to reduce induction of labor, by teaching midwifery teams the NICE guidance.
This is a relatively simple cluster randomization, where they had these 12 weeks, where each week they moved on to a new set of midwife teams in the 10 hospitals that they were looking at, and then 12 measurements per cluster per time period. These measurements were basically 12 births per cluster per time period. There is one small complication here: we're talking about what's known as a cross-sectional design, where there are 12 new subjects being looked at each time.
So we're looking at the same cluster multiple times, but in this case we're looking at new measurements each time. Thankfully, in the case of births, we're not expecting births from last week to be new births this week; something strange would be going on if that occurred. So this is naturally a cross-sectional design, which is where the stepped wedge is probably best understood. And you can see that within clusters we're not expecting a huge amount of self-similarity. So we're saying that, more or less, hospitals are mostly interchangeable with each other here.
And so they assumed a baseline proportion of 0.4 for the endpoint of interest, and they wanted to increase that to 0.5, basically.
Now, technically, they looked over a wide range of different scenarios. We're just gonna look at a specific scenario today.
And so we can just go to the stepped wedge study here.
So, what we're going to do first is replicate something close to what they have there. They had a 5%, or 0.05, two-sided significance level.
A group one proportion of 0.4; group one here is the control proportion. In this case they actually calculated what effect size they could detect given the constraints that they had in terms of access to nurses and births and so on. ... background interference; I live near a major traffic thoroughfare.
So the ICC is 0.02202.
And I'm not going to go into this much today, but basically complete means that for every time unit you step up once. Whereas if you don't have a very specific relation between the number of time periods and the number of clusters, basically K divided by T equals an integer, you can't get a perfect step up. At some point you have to skip one and, say, not move on to the next group for two weeks, or two time periods, instead. So this isn't a complete design; it's actually a balanced design, which means that in this case we need to set this to balanced here, because for complete you have to fulfill this K minus T, divided by T, equals an integer criterion.
And so they had a number of time periods equal to 12.
So they had 12 time periods, 10 clusters, and 12 births per cluster per time period here. So in each cluster, at each time period, 12 births are evaluated. And then they had a power of 80%. Now, I'm just going to manually select the correct solver here to ensure that we're going to calculate, sorry, pi two.
But we want to ensure that this gives us the correct one here: we're going to calculate pi two given pi one, but we want pi two to be greater.
So we enter 80 here, and you'll see that we get around the correct proportion. They said it would be about a 10% increase, or a 0.1 increase on the proportion scale. You can see it's actually closer to 0.114: we get 0.5140, so that's a difference of around 0.114. But the thing is that we didn't fully replicate what actually happened in the study.
Because what actually happened in the study is that they had a pattern that looked like this, where in actuality there was a lead-in period where basically they didn't change anything, and only after 12 weeks of lead-in did they do the intervention, and then follow up for 12 weeks. For practical purposes, we can basically ignore the lead-in period. So imagine that this study really started here at week 13 and then went on to week 39 here. You can see there's not even a very even step pattern. Before they got to group three, there was a week where they didn't do any new intervention; there's a gap here.
So if we want to really replicate what happened, we'd have to enter a design pattern that replicates this, and that is possible to do in nQuery. To do that, all we need to do is replicate what we put in previously, but instead of selecting a balanced design pattern, we need to enter a custom design pattern, which is the last option here in this drop-down for design pattern.
But we also need to account for the fact that there are going to be time periods where nothing was happening, because these white time periods are effectively equivalent to missing data. We're going to assume that the length of time was from week 13 to week 39, and that basically means we're going to assume there are 27 time periods.
That includes the zero time period, of course.
We're including one time period where no one got the intervention. So it's really starting from week 12, technically, this week where nothing is happening for anyone, theoretically.
And then we still have 10 clusters.
And we still have 12 per time period. But what we need to do is, for any time period that's in white in the previous pattern, consider that to be equivalent to missing data.
And so, to enter missing data in nQuery, we enter nine manually into the cell for that time period, for no observation. Obviously, I could type this in manually, but that's probably a bit tedious to watch, so I prepared this earlier. You can see here that zero indicates you were in the control group, one indicates you were in the treatment group, and nine indicates that there was missing data. So the clusters that moved into treatment first have these missing data, where basically the study stopped following that cohort. Whereas for this last one, you can see that once it had finished all of its observations, its 12 weeks of observation, that's when it was finished, and that's when the study ended.
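To illustrate the 0 / 1 / 9 coding just described, here is a small sketch that tallies a custom pattern. Only the three codes come from the demonstration; the 3-cluster, 6-period pattern and the helper name are hypothetical, and this is not the study's actual pattern or the software's input format.

```python
CONTROL, TREATMENT, MISSING = 0, 1, 9

# Hypothetical pattern: rows are clusters, columns are time periods.
pattern = [
    [0, 1, 1, 9, 9, 9],  # crosses over first, follow-up stops early
    [0, 0, 1, 1, 9, 9],
    [0, 0, 9, 1, 1, 1],  # a gap before its crossover
]


def summarize(pattern, subjects_per_cell):
    """Count cluster-periods by code and the number of subjects actually observed."""
    cells = [code for row in pattern for code in row]
    observed = [code for code in cells if code != MISSING]
    return {
        "control_cells": observed.count(CONTROL),
        "treatment_cells": observed.count(TREATMENT),
        "missing_cells": cells.count(MISSING),
        "subjects": len(observed) * subjects_per_cell,
    }


summary = summarize(pattern, subjects_per_cell=12)
print(summary)
```

Cells coded 9 contribute no observations, so the analyzed sample size depends on the number of observed cluster-periods rather than on clusters times periods.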
So, we're just going to put that into nQuery in the side table here.
Oh, and we have the wrong solver here again.
So let's just select solver three here to get the correct result, where the treatment group is better, with a higher proportion of success. And you can see that in that case it now equals exactly the 0.1 effect size that the authors hoped for.
And we could compare this to what would have happened if we'd used a standard study design, where we had no cluster randomization.
So, let's say 0.4, and then power of 80%, and the sample size. What was the sample size in this case?
Well, it's 12 by 10 by 12: 1,440.
Now, if we take 1,440 and divide that by two, we should get 720, I believe.
And in that case, if we had just assumed that this wasn't a cluster randomized problem, we would have been overoptimistic. We would have said, oh, we could detect effect sizes as low as 0.473. That would obviously have been a major mistake on our part.
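The naive calculation can be roughly reproduced with a normal-approximation z-test for two independent proportions. This is a sketch under that assumption (unpooled variance, two-sided 0.05 level), not the exact routine used in the demonstration, and the function names are illustrative.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two independent proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    std_error = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group) ** 0.5
    return NormalDist().cdf(abs(p2 - p1) / std_error - z_alpha)


def detectable_p2(p1, n_per_group, power=0.80, alpha=0.05):
    """Smallest p2 above p1 that reaches the target power, found by bisection."""
    low, high = p1, 1.0
    for _ in range(60):
        mid = (low + high) / 2
        if power_two_proportions(p1, mid, n_per_group, alpha) < power:
            low = mid
        else:
            high = mid
    return high


n_total = 12 * 10 * 12        # births per cell x clusters x time periods
n_per_group = n_total // 2    # pretend subjects were randomized individually
p2 = detectable_p2(p1=0.4, n_per_group=n_per_group)

print(n_total, n_per_group, round(p2, 3))
```

Ignoring the clustering suggests a detectable proportion of about 0.473, well below the 0.5 obtained once the stepped-wedge structure is accounted for, which is the overoptimism described above.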
And what if we had looked at, say, a normal cluster randomized case, where we ignored that effect?
We could have ended up with a very different study. So let's imagine we don't have repeated measures. Let's just say you only did the analysis in each cluster once.
Let's assume we had the same effect size. What would the required sample size have been then?
Now, let us assume the simpler case, where there are just 12 births per cluster, and then look at the power of 80%. Now, we would need 36 clusters in that case. A lot more hospitals would have had to be looked at. But maybe the more equivalent sample size is 144, so 12 by 12: 12 sample size per time period by the 12 time periods.
Over the 12 time periods, we've looked at each cluster 12 times and, for each time period, done 12 individual ones.
And you can see that even if we did 144 births per cluster,
we would have required 11.
It would have required, it looks like, fewer here.
But once again, we would have been overoptimistic, effectively, which again would be highly problematic.
This should be zero point two. Apologies.
Yep. So they get, sorry. That's my mistake.
So you can see here that you'd need 40 per group, and that's a lot more clusters. And this would have been slightly less efficient than using the stepped-wedge design, if we just consider these as 144 individual measurements. So the stepped-wedge design is certainly not a free lunch, but there are important things that it deals with, and it is important to use an analysis that reflects the reality of your study.
There are multiple complications associated with those, which I covered in a previous webinar. If you're interested in covering that more, feel free to get in contact. But basically there are additional complications that come with it. If you're having this effect where it's taking time to get to each cluster, a stepped-wedge design might be for you. And if cluster randomization is a constraint due to contamination or other effects, then you should definitely adjust for that cluster randomization. If you ignore it, you're going to be way too optimistic about what your study is doing, and you will make mistakes in your sample size calculation and end up being underpowered.
So the final thing I want to cover here, relatively briefly, is three-arm trials, because this is mostly just an extension of your standard non-inferiority testing. A non-inferiority test is where you just want to prove that your treatment is no worse than the standard by a pre-specified margin known as the non-inferiority margin. This non-inferiority margin is often selected as a fraction of what's considered to be the active control effect. So basically, how well did the standard treatment do versus placebo? You take a certain percentage of that, M2, as your non-inferiority margin. This is talked about a lot in the FDA guidance on non-inferiority testing, which is included in the references here.
And obviously, non-inferiority testing is very common in the case of generic drugs and medical devices, where usually you're not making a claim that your drug is better than the preexisting drug. It's a generic; it's just the same thing as the preexisting drug, you're hoping. Or for a medical device, a pacemaker, say, you're not expecting a pacemaker to revolutionize the problem. You just hope it has comparable performance to other devices. And so you're just comparing treatment to control, without a placebo. That's a standard non-inferiority test.
But the problem with that simple design is that you haven't really proven within your own individual trial that the reference treatment, or indeed your own treatment, is significantly better than placebo. That's been assumed based on the prior evidence, but it's not been proved by your own study. In fact, you're basically using something equivalent to a historical control. And that means that when you're doing a non-inferiority study, what is often required is a very stringent replication of what occurred in the original superiority trial, to ensure that the conditions are exactly the same, so that you have comparable effect sizes occurring for both the reference treatment, the RLD, and the actual new treatment, the generic, for example.
So, that's a major issue, and you'll see it is a major issue if you go into the guidance on this.
What is one solution? One solution would be, rather than having to replicate the prior study, why not just include a placebo arm directly? So, bring back in the placebo arm.
And then compare the placebo arm to either arm, basically the reference arm or the experimental arm. If you're doing better than the placebo group in your three-arm trial, then obviously you have now shown, or proven, that the treatment is better than placebo. And then when we do the non-inferiority test, we have also shown that the new treatment is equivalent to the reference. And this is why it's often considered to be the gold standard.
The reason it's not done more often is simple: placebo would not be allowable if it's unethical. If we have a reference treatment that already works and this is a serious condition, for example, we can't give people placebo, because then we're exposing people to a high risk. But in certain areas where treatment effects are more uncertain, where there's more uncertainty in general, or the reference treatment is fairly ineffective at this point, placebo becomes more acceptable. So if you can do this ethically, there's no reason why you can't do that.
And theoretically, what you need to do, if you have a placebo group, is first test the hypothesis that your experimental treatment, or your reference, is better than placebo, and then test whether the experimental treatment is non-inferior to the reference group; basically, a hypothesis test that the difference is greater than the non-inferiority margin.
And that's what we call a conjunctive hypothesis: both have to be true to prove your claim. If I prove the first but not the second, then I can't do it. It's basically a gatekeeper procedure, where hypothesis one has to be shown before I'm even willing to test hypothesis two. In this case, ..., before I do H1b.
Then we retain alpha without any multiplicity adjustment; the alpha is retained because of the very strict gatekeeper procedure. But one thing that's been shown by some of the theoretical work in this area is that you can simplify this down to a test of the ratio of differences: the difference between experimental and placebo, E minus P, divided by the difference between reference and placebo.
So, the effect size for the experimental treatment over placebo divided by the effect size of the reference over placebo, and then showing that this is greater than some non-inferiority ratio, which we call theta here.
And so, this is what we call a complete testing procedure, where these two sequential hypotheses can basically be treated as if they were this one complete testing hypothesis here.
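Written out for the means case, the ratio-of-differences hypothesis described here can be sketched as follows. The symbols \mu_E, \mu_R and \mu_P (the experimental, reference and placebo means, with larger values taken to be better) are notation assumed for this illustration; \theta is the non-inferiority ratio from the talk.

```latex
H_0:\ \frac{\mu_E - \mu_P}{\mu_R - \mu_P} \le \theta
\qquad \text{versus} \qquad
H_1:\ \frac{\mu_E - \mu_P}{\mu_R - \mu_P} > \theta

% Assuming \mu_R > \mu_P, the alternative is equivalent to a linear contrast:
H_1:\ \mu_E - \theta\,\mu_R - (1 - \theta)\,\mu_P > 0
```

The linear form is what makes a single Wald-type test possible: the contrast is estimated from the three arm means and compared with its standard error.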
And that can be evaluated in a unified framework, using a Wald-type test, for a variety of different endpoints, from means to proportions, survival, incidence rates, and so on.
And then there are also ways that we can optimize our allocation to maximize our power, given this three-arm trial design. As I mentioned, this has been extended for a variety of different endpoints by different authors over the last 20 years or so.
And so, what I'm going to do very briefly here is a sample size calculation for this three-arm trial, where they actually used differences from placebo. So we're going to say the placebo arm is zero, and then we have a 1.56 increase for both groups, so they are going to be equal under the alternative hypothesis. These authors assumed a non-inferiority ratio of 0.5. So they said, if the experimental arm's benefit over placebo was no worse than 50%, or 0.5, of that of the reference arm, the standard treatment, they would be happy enough to say it's non-inferior. That's a relatively liberal one. That's not what you would usually see in the case of drugs, for example, where this would be more stringent. Here we're talking about more of a particular solution which is a lot less invasive than the previous approach, so, obviously, the improvement in safety can sometimes justify a less stringent non-inferiority margin. They had a standard deviation of 2.5 and power of 80%. And they allocated 0.38, 0.38, and 0.24 to the experimental, reference, and placebo arms; that's the proportion in each of those groups.
So, in nQuery, this is actually relatively easy. We just have our significance level of 0.025, and note that's equivalent to a two-sided confidence interval at the 0.05 level.
Non-inferiority hypotheses are really at the one-sided level, so I have that here, just if you're interested in that.
We said 1.56 for experimental and reference and a placebo arm of zero. These are technically the differences from placebo, but for practical purposes, for the means hypothesis, that doesn't make a difference. The non-inferiority margin is 0.5, so we're saying that if the experimental arm has at least 50% of the reference's effect over placebo, we'd be happy to say it's non-inferior. And then the common variance is 6.25. The standard deviation is what they gave in their statement there, but this is the variance, which is just the standard deviation squared. So, 2.5 squared is equal to 6.25.
We had our allocation ratios of 0.38, 0.38, and 0.24, and the power of 80%.
You see we get a sample size of 352. Technically, these are 134 instead of 133, but that's probably due to very small rounding issues. I verified this against other software, and this is the correct result for these allocation ratios. Perhaps they were using 0.37 or something to get their result.
This is the closest replication here, and basic rounding would explain most of the difference. So it's 352 versus 350 or thereabouts, but I'm pretty confident that's just a very small deviation, and maybe their allocation ratios were slightly different. But, as I mentioned, there is an optimal allocation, which is really just an allocation that maximizes your power for a particular design. In the means case with common variance, it's actually very simple. It's given in the help card here: 0.5 for the experimental arm, the non-inferiority ratio divided by two for the reference arm, which is 0.5 divided by two, or 0.25, and then what happens to be 0.25 for the placebo arm here as well. And then if we enter 80% power, you'll see that the sample size requirement is 326 in total instead. Sorry, this just seems to have gone a bit wobbly for a second here.
Let's just restore this back to what it was previously.
And you can see here that, obviously, the sample size is lower than the previous example of 352. So if we simply wanted to reduce sample size, this would perhaps be considered the more appropriate allocation.
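Both totals can be approximated with a normal-approximation version of the Wald-type test for means with a common variance. The formula below is the usual large-sample expression for this design and is an assumption of this sketch, not the exact routine used in the demonstration, so it lands slightly under the 352 and 326 shown on screen.

```python
from statistics import NormalDist


def three_arm_total_n(mu_e, mu_r, mu_p, theta, variance, weights,
                      alpha=0.025, power=0.80):
    """Approximate total sample size for the ratio-of-differences test.

    weights are the allocation proportions (experimental, reference, placebo)
    and are assumed to sum to 1; alpha is one-sided.
    """
    w_e, w_r, w_p = weights
    z_sum = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    # Contrast that is positive when the experimental arm keeps at least
    # a fraction theta of the reference effect over placebo.
    contrast = mu_e - theta * mu_r - (1 - theta) * mu_p
    variance_factor = 1 / w_e + theta ** 2 / w_r + (1 - theta) ** 2 / w_p
    return z_sum ** 2 * variance * variance_factor / contrast ** 2


n_study = three_arm_total_n(1.56, 1.56, 0.0, theta=0.5, variance=6.25,
                            weights=(0.38, 0.38, 0.24))
n_optimal = three_arm_total_n(1.56, 1.56, 0.0, theta=0.5, variance=6.25,
                              weights=(0.5, 0.25, 0.25))

print(round(n_study, 1), round(n_optimal, 1))
```

The approximation gives a total just under 350 for the 0.38 / 0.38 / 0.24 allocation and just under 323 for the 0.5 / 0.25 / 0.25 allocation, so it reproduces the direction and rough size of the saving from the optimal allocation.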
And we could compare that to a standard non-inferiority test. Obviously, if we were using a standard two-arm test, we would usually use the differences rather than the ratio of differences, so it's not necessarily as easy to see exactly what that is. But basically we can see that, for this case, a difference of half of 1.56 is 0.78.
The expected difference is zero, as we're expecting them to be the same under the alternative. So just to note, this margin is 1.56 divided by two because the non-inferiority ratio is 0.5; that's a relatively simple calculation in this case. It's not saying that's what we expect the difference to be. And then a standard deviation of 2.5, as we're back on the standard deviation scale here, and a power of 80%.
And you'd have 163 per group here, which would be less sample size in total for this design, of course.
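For comparison, the two-arm figure can be approximated with the standard normal-approximation formula for a non-inferiority test of two means with equal allocation. This is a sketch under that assumption rather than the exact calculation shown on screen, which presumably includes a small-sample correction, so it comes out slightly below 163.

```python
from statistics import NormalDist


def two_arm_ni_n_per_group(margin, expected_diff, sd, alpha=0.025, power=0.80):
    """Approximate per-group sample size for a two-arm non-inferiority test of means.

    margin is the non-inferiority margin (positive), expected_diff is the
    assumed true difference (new minus reference), alpha is one-sided.
    """
    z_sum = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return 2 * sd ** 2 * z_sum ** 2 / (expected_diff + margin) ** 2


margin = 0.5 * 1.56   # half of the reference effect over placebo
n_two_arm = two_arm_ni_n_per_group(margin=margin, expected_diff=0.0, sd=2.5)

print(round(margin, 2), round(n_two_arm, 1))
```

The approximation gives a little over 161 per group, which rounds up to 162 and is close to the 163 per group reported in the demonstration.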
But you can see that the sample size per group will be less in the three-arm design with equal allocation to the experimental and reference groups, which is the most common way it's done, even though the other one is technically the optimal design.
So, 2:2:1 or ones like that are the most common for this type of design.
So, you can see here, 163 per group versus 134 per group. Even though adding the placebo group means that you're looking at more people in total, you need fewer in your experimental groups, and the placebo group, in certain cases, could be seen as being less difficult to deal with.
OK, I think that's pretty much our time up, so I'll just get into the conclusions now. Trials will often require adjustments from our standard methods. What we're talking about is your standard t-test, your standard log-rank test, your standard non-inferiority design, your standard fully randomized design. If you have complications, you need to adjust for them. If you have complexities, they have to be adjusted for in your design and in your analysis. If you don't do so, it's very likely you'll have an underpowered study. You'll have a higher chance of mistakes, and you'll end up having drugs thrown away that shouldn't be, or coming to erroneous conclusions that something is effective. And that's obviously not acceptable in any type of study, but especially in clinical trials. So, in terms of non-proportional hazards, this is a very common issue, especially in immunotherapies.
And there's a lot of discussion now about weighted models, max-combo, piecewise weighted linear rank models, and model-free measures such as the restricted mean survival time. We looked at the weighted models approach today. Cluster randomization is a very real constraint in the real world. For example, vaccine studies often deal with cluster randomization: it's easier to vaccinate the whole school than to vaccinate 50% in a school and then not vaccinate the other 50%, or give them a placebo vaccination.
And you need to adjust for the cluster effect. And if you have a more complex design, like a stepped-wedge design, you need to adjust for that too. A three-arm non-inferiority trial gives you a direct comparison against placebo for both your treatment and the reference treatment, dealing with the assay sensitivity problem.
Be aware that there's a nice, flexible, easy-to-understand framework to do testing in that case. This isn't a huge leap in terms of doing the testing; it's actually relatively simple. So, don't be afraid of the statistical side if you choose to go with a three-arm trial.
So, that completes our webinar today. I just want to thank you so much for listening. I hope you have learned something useful today. And if you have any questions after the webinar, feel free to e-mail us at info at ... dot com.
And just to let you know, a couple of months ago we released our summer 2020 release, which included 26 new tables in these areas.
And if you want to try anything you saw here today, whether that be something you don't have access to in nQuery, or you want to access something like the adaptive features in nQuery Pro, feel free to go to ... dot com forward slash trial. You can try nQuery with all of its features activated within your browser for a couple of days. And if you want to get started with nQuery and find some of our videos and former webinars, you can go to ... dot com forward slash stars. If you want some of the references, they are available in these slides, and, as I said, the recording and the slides will be sent to you after this webinar is complete.
So, I'm just going to take a moment here to look at any questions that have come in during the webinar. As I said, any I don't get to today, I'll make sure to get back to you via e-mail in a couple of days.
OK, there are just a couple of questions here. I won't get to them all today, but one was about extending the non-proportional hazards models to deal with interim analyses. That's probably something that's going to happen fairly soon, I would imagine. I mean, technically you can do that already for the log-rank test; you could simulate that using a log-rank test at the moment. But, obviously, as we said, the log-rank test is inefficient in this case. There is work happening on that, though, and I think I've actually seen a few papers on it already. Once that research becomes more validated, it's obviously something that we'll definitely look at. I suppose the delayed effect does raise a practical complication: how do you make inferences at the early interim analyses? So I'm very interested to keep reading up on that area, but it's not something we're looking at, probably, in the immediate future.
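For readers who want to see the difference between the standard and weighted log-rank tests mentioned in this answer, here is a minimal sketch of a Fleming-Harrington G(rho, gamma) weighted log-rank statistic for a single final analysis. Setting rho = 0 and gamma = 0 gives the standard log-rank test, while rho = 0 and gamma = 1 puts more weight on later events, which is one common choice for a delayed effect. The function name and the toy data are hypothetical, and the sketch does not cover interim analyses.

```python
import math


def weighted_logrank_z(times, events, groups, rho=0.0, gamma=1.0):
    """Fleming-Harrington G(rho, gamma) weighted log-rank z statistic.

    times  : follow-up time for each subject
    events : 1 if the event was observed, 0 if censored
    groups : 1 for the treatment arm, 0 for the control arm
    A negative z means fewer treatment-arm events than expected.
    """
    data = list(zip(times, events, groups))
    event_times = sorted({t for t, e, _ in data if e == 1})
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current time
    score = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)              # at risk
        n1 = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, treatment
        d = sum(1 for ti, e, _ in data if ti == t and e == 1)   # events
        d1 = sum(1 for ti, e, g in data
                 if ti == t and e == 1 and g == 1)              # events, treatment
        weight = surv ** rho * (1.0 - surv) ** gamma
        score += weight * (d1 - d * n1 / n)
        if n > 1:
            variance += (weight ** 2 * d * (n1 / n) * (1.0 - n1 / n)
                         * (n - d) / (n - 1))
        surv *= 1.0 - d / n
    return score / math.sqrt(variance) if variance > 0 else 0.0


# Toy data: control events happen early, treatment events happen late.
times = [1, 2, 3, 4]
events = [1, 1, 1, 1]
groups = [0, 0, 1, 1]
print(weighted_logrank_z(times, events, groups, rho=0, gamma=0))  # standard log-rank
print(weighted_logrank_z(times, events, groups, rho=0, gamma=1))  # late-weighted
```

Simulating an interim analysis would mean computing a statistic like this on the data available at each look, which is where the delayed effect makes early inferences difficult.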
And then there's just one question here about the stepped-wedge design. I didn't fully understand the question, but basically it's asking why I included all of that, basically, missing data. As I said, they're not really missing data.
What we're trying to replicate here is the fact that this study was done in a certain way, which is to say that they had a 12-week lead-in period before they did anything. Then they did the classroom intervention for these nurses or midwives, and then they followed up for 12 weeks afterwards. So if week 13 was the first intervention, they only followed up for 12 weeks after that. They didn't follow up until the end, at week 38 or 39, only for 12 weeks after the intervention.
You can see there are these times here, for the first cluster for example, where they weren't being followed up anymore, so that data is effectively missing. This isn't really the problematic kind of missing data, and it's adjusted for in the stepped-wedge design. So basically it's trying to replicate what really happened, and based on the results that we got, you can see that's probably how they were doing their sample size calculation.
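To make the "not really missing" cells concrete, here is a minimal sketch that builds an incomplete stepped-wedge schedule of the kind described in this answer: a 12-week lead-in, a first crossover at week 13, and only 12 weeks of follow-up after each cluster's crossover. The number of clusters, the 4-week gap between crossovers, and the assumption that every cluster is observed from week 1 are hypothetical choices for illustration, not details of the actual study.

```python
def stepped_wedge_schedule(n_clusters, lead_in=12, follow_up=12, step=4):
    """Build a cluster-by-week schedule for an incomplete stepped-wedge design.

    Each cell is "C" (observed under control), "I" (observed under
    intervention) or "." (not observed by design). The step length and
    number of clusters are illustrative assumptions.
    """
    last_start = lead_in + 1 + (n_clusters - 1) * step
    total_weeks = last_start + follow_up - 1
    schedule = []
    for k in range(n_clusters):
        start = lead_in + 1 + k * step  # first intervention week for cluster k
        end = start + follow_up - 1     # last followed-up week for cluster k
        row = []
        for week in range(1, total_weeks + 1):
            if week < start:
                row.append("C")
            elif week <= end:
                row.append("I")
            else:
                row.append(".")         # follow-up has ended, so no data
        schedule.append(row)
    return schedule


for row in stepped_wedge_schedule(n_clusters=4):
    print("".join(row))
```

The "." cells at the end of the earlier clusters are the planned gaps: those cluster-periods contribute no observations, so a sample size calculation for this layout should leave them out rather than treat them as dropout.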
OK, as I mentioned, there are a few other things here, but nothing too important. I'll get back to you by e-mail.
I just want to say thank you so much once again for attending, and, as I said, if you have any questions, feel free to e-mail info at ... dot com. I hope to hear from you soon. So, once again, thank you so much, and goodbye.
