# What Statistical Considerations to take in Clinical Trial Designs

July 5, 2019

This is an excerpt from our webinar "Innovative Strategies for Successful Trial Design."


##### Transcription

So, statistical considerations: as a statistician myself I'm probably incredibly biased, and there is a lot of interest and focus on this area, but, as I'll discuss later, we often talk about it more than the things that actually matter. There was an interesting flowchart I saw on Twitter, during the current debate about p-values, which essentially showed that 90% of the debate is about p-values, and only 10% is about what questions you are asking, what model you are choosing, and all the other questions that come with designing an effective study, or whether the study is plausible in the first place. Regardless, statistics does have an effect (perhaps just slightly more emphasized than it needs to be).

## What Statistical Considerations to take in Trial Designs

So obviously the statistical and analysis choices that you make have a very important effect on what questions you can ask, which hypotheses are of interest, and how to answer them. I'm grouping together here the purely statistical questions, like whether I should pick a t-test or a Wilcoxon test, or a p-value versus a Bayes factor, with more general analysis questions like the estimands framework, for example. These are all in the same area, and the latter often ends up being more important. My general feedback would be that the statistical models and methods you choose should really be considered an emergent property of the design you want to do. If you design a good trial, and have taken careful consideration of the constraints that exist around answering the question you have, then the appropriate statistical model, or the choice of appropriate models, will often fall out of that process almost naturally. It usually makes better sense to first think about "What is a good trial to design?", and only then about "Do I want a t-test versus a Wilcoxon test?", etc.

Sample size is one level further down: once you have picked your statistical methodology, sample size can be discussed at that point. Asking for a sample size before you have thought about the design of your study, or the model for your study, is a fool's game in my opinion. I see it truly as an emergent property of doing good design and good modelling in the first place. In clinical trials specifically, I think two of the bigger issues are, first, "What is success?" The one-sided 2.5% type 1 error (alpha, significance level, however you want to describe it) is obviously something that exists and is expected to exist going forward. It is included in all the literature you get from the FDA and similar agencies, typically for phase 3 trials I should say, and most of the methods built for confirmatory trials tend to replicate that type 1 error criterion as well.
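To make this concrete, here is a minimal sketch (with hypothetical numbers) of how a sample size falls out once the model and success criterion are fixed: a one-sided two-sample comparison of means at the 2.5% level with 90% power, using the standard normal-approximation formula.

```python
import math
from statistics import NormalDist

def two_sample_n_per_arm(delta, sigma, alpha=0.025, power=0.9):
    # n per arm for a one-sided two-sample test of means (normal
    # approximation): n = 2 * sigma^2 * (z_{1-alpha} + z_{power})^2 / delta^2
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha)
    z_b = nd.inv_cdf(power)
    return math.ceil(2 * (sigma * (z_a + z_b) / delta) ** 2)

# Hypothetical inputs: mean difference 2.5, SD 5, one-sided alpha 2.5%, 90% power.
print(two_sample_n_per_arm(delta=2.5, sigma=5.0))  # 85 per arm
```

Note that every input here (the effect size, the variability, the error rates) was settled by the design discussion before the formula was ever applied.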

But obviously, in wider pre-clinical research and academia, the idea of having a p-value threshold at all is a much more contested issue. The second issue is what to measure and how to measure what's of interest: the estimands question, of course, and things like responder analysis. Even within clinical trials, there is very much a tendency to split people into responders versus non-responders, that is, to split your data into categories, and that is often a poor choice. There is real value in the wider push to use all the data as it is, rather than turning it into a more digestible format at the modelling stage; turning a continuous measure into a binary measure, for example, is generally not a good idea. I've hinted at this already, but just to note: advanced statistical methods are not an excuse for a bad design. If you have the ability to improve the design and avoid issues that are avoidable, then you really should. So perhaps there's a note of caution not to get lost in the tyranny of small differences, or the tyranny of small vanities.
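A quick simulation illustrates the cost of dichotomizing a continuous outcome. The numbers below are all hypothetical (a true mean shift of 0.5 SD, 60 patients per arm, a responder cut-off at zero); the point is only that the same true effect is detected more often when the outcome is analysed as measured.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(1)
nd = NormalDist()

def continuous_p(x, y):
    # One-sided two-sample z-test on the raw means (large-sample approximation).
    n = len(x)
    se = math.sqrt(stdev(x) ** 2 / n + stdev(y) ** 2 / n)
    return 1 - nd.cdf((fmean(y) - fmean(x)) / se)

def responder_p(x, y, cut=0.0):
    # The same data dichotomized at a (hypothetical) responder threshold.
    n = len(x)
    p1 = sum(v > cut for v in x) / n
    p2 = sum(v > cut for v in y) / n
    pbar = (p1 + p2) / 2
    se = math.sqrt(2 * pbar * (1 - pbar) / n)
    return 1 - nd.cdf((p2 - p1) / se)

wins_cont = wins_bin = 0
for _ in range(400):
    control = [random.gauss(0.0, 1.0) for _ in range(60)]
    treated = [random.gauss(0.5, 1.0) for _ in range(60)]
    wins_cont += continuous_p(control, treated) < 0.025
    wins_bin += responder_p(control, treated) < 0.025

# The continuous analysis rejects the null more often than the responder split.
print(wins_cont, wins_bin)
```

In other words, the responder analysis throws away information and pays for it in power, which in practice means larger trials for the same question.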

## Bayesian Statistics in Clinical Trial Design

The topic that I'll cover briefly here is an example of a Bayesian criterion in the context of clinical trial design. Just to mention that Bayesian methods are of great interest in clinical trials; at the earlier stages they are already pretty much part of the natural ecosystem, and they are of growing interest for confirmatory trials as well. One of the issues for them at the confirmatory stage is that the one-sided 2.5% type 1 error criterion continues to be the standard one. (I'll come back to the inverse conditional issue shortly.)

In terms of sample size and Bayes, or design and Bayes, there are two different ways that Bayesian approaches end up influencing study design, and sample size calculations in particular. First, if you're going to do a Bayesian analysis, say using a credible interval, you can do designs and sample size calculations that are built around the fact that you're going to do a Bayesian analysis. But there is also interest in Bayesian criteria which exist either in harmony with, as complementary pieces to, or as pseudo-replacements for pre-existing frequentist analyses. Bayes factors, for example, would be something in that area: a hybrid between a Bayesian approach and a frequentist p-value.
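As a sketch of the simplest case, the Bayes factor for two point hypotheses about a normal mean reduces to a likelihood ratio. The inputs below are hypothetical (an observed mean difference of 2.0 with n = 50 and SD 5, comparing "no effect" against an assumed effect of 2.5).

```python
import math
from statistics import NormalDist

def bayes_factor_point(xbar, n, sigma, delta0=0.0, delta1=2.5):
    # BF10 for two point hypotheses about a normal mean: the likelihood of
    # the observed sample mean under H1 (difference = delta1) versus under
    # H0 (difference = delta0).
    se = sigma / math.sqrt(n)
    return NormalDist(delta1, se).pdf(xbar) / NormalDist(delta0, se).pdf(xbar)

# Hypothetical trial result: the data are ~40x more likely under H1 than H0.
print(bayes_factor_point(xbar=2.0, n=50, sigma=5.0))
```

Real applications put a prior distribution on the alternative rather than a single point, but the point-hypothesis version shows the basic quantity being reported.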

The first of the two major issues that these methods deal with is the prior uncertainty that you have at the design stage. You have estimates of what you think is going to happen, say in the context of a sample size calculation, but you don't really know that value for sure. So maybe it makes more sense to treat the assumed effect not as a fixed value (say, the mean difference of 2.5 from earlier), but as a distribution.
That's one of the big differences between Bayesian and frequentist approaches: Bayesians tend to treat unknowns as random variables (so here, a normal distribution with mean 2.5 and some variance associated with it, rather than the fixed value 2.5).

The other issue, of course, is the inverse conditional. The traditional frequentist p-value behind a statistically significant result is the probability of getting data as extreme as, or more extreme than, the data you actually got, given that the null hypothesis is true. You are assuming a world where the null hypothesis is true, so usually you are assuming you're wrong: assuming that the mean difference between your treatment and the standard treatment is 0, for example. Then you are saying, "given that's true, how likely is it that we saw the data that we actually saw?": under some data-generating model, how likely is the data we actually got? So the big issue there, I suppose, is whether that really reflects practical decision-making. When people want to test something, do they think "I'm going to assume I'm wrong; how likely am I to have seen what I did, given I'm wrong?", or do they think "I have this data; how likely is it that I'm right or that I'm wrong, given that this data occurred?". We know that the probability of the data given the hypothesis is certainly not equal to the probability of the hypothesis given the data.

What Lee & Zelen argue is that, from a clinician's point of view, or a practical point of view, you should create posterior errors: equivalents of the type 1 and type 2 errors. Instead of the type 1 error being the probability of rejecting the null hypothesis when the null hypothesis is actually true (that's your alpha, which we control at 0.05, or 0.025 one-sided, typically), they argue: "no, what you actually want is, given the data we got, how likely is it that the point hypothesis we picked, delta equals 0, is actually true?". It's the inverse conditional problem, basically. They argue that the latter question, "how likely is it that the hypothesis is true given my data?", perhaps better reflects true medical decision-making. Now, I'm not going to go into that debate; this is an hour-long webinar and we still have another section to cover, but just to say that the debate exists and that's their perspective.
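A small calculation in the spirit of these posterior errors shows how far apart the two conditionals can be. All the inputs are hypothetical: a sceptical prior P(H0) = 0.9, a trial run at one-sided alpha = 0.025 with 90% power; Bayes' theorem then gives the inverse conditionals.

```python
# Hypothetical setup: sceptical prior, standard alpha, 90% power.
p_h0, alpha, power = 0.9, 0.025, 0.9

# P(reject | H0) = alpha is what the frequentist criterion controls, but the
# posterior type 1 error is the inverse conditional P(H0 | reject).
p_reject = alpha * p_h0 + power * (1 - p_h0)          # total prob. of rejecting
posterior_alpha = alpha * p_h0 / p_reject             # P(H0 | reject)

# Likewise, the posterior type 2 error is P(H1 | accept).
p_accept = (1 - alpha) * p_h0 + (1 - power) * (1 - p_h0)
posterior_beta = (1 - power) * (1 - p_h0) / p_accept  # P(H1 | accept)

print(posterior_alpha, posterior_beta)  # 0.2 and ~0.011
```

So under this sceptical prior, a "significant" trial still leaves a 20% posterior probability that the null is true, even though the type 1 error was controlled at 2.5%; the two quantities answer genuinely different questions.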

One note I would make is that the perspective of a clinician, or a practical decision-maker, may be slightly different from that of a regulator. As a clinician, it probably makes sense to evaluate things as "I just want the best option possible." A regulator's perspective, I would note, is sceptical: working from a starting assumption that the treatment doesn't work, and requiring you to prove that it does. Starting from a testing rubric where the test is "given that the treatment doesn't work, how likely is it that we would have seen what we saw?" does make more sense from the perspective of someone deciding whether a treatment is safe and efficacious enough to be allowed into the wider ecosystem.

Now, that is separate from the wider debate about the actual stochastic behaviour of the p-value versus Bayesian criteria, which is obviously also very influential; here I'm just thinking in terms of the principled question of whether it should be data given hypothesis, or hypothesis given data. I can see both sides, basically, and maybe the different approaches make sense for different types of actors deciding between treatments: doctors versus regulators, for example.

Like any trial design, careful planning can mitigate or eliminate certain risks.
If you are exploring adaptive designs, one important factor is to select validated and trusted software that is designed for adaptive trials. nQuery has dedicated adaptive trial design functionality that contains a selection of sample size tables designed specifically for areas of adaptive design.

We recently hosted a webinar examining the advantages and disadvantages of adaptive sample size re-estimation, which you can watch on demand.
