When the ASA released a statement in The American Statistician on p-values and statistical significance in 2016, it created a new wave of discussion and debate on p-values and null hypothesis significance testing (NHST).
With numerous alternative approaches to statistical inference being championed and some calls to abandon statistical significance completely, there has never been a better time to catch up on these alternatives and see their potential effects on study planning and sample size determination. We will explore these issues in this free webinar presented by nQuery.
To watch a recording of this webinar and download the slides, just click the image below.
Can't watch a video right now?
The slides and auto-generated transcript for the webinar are below.
So welcome to this webinar “Alternatives to the P value and power, the effect on sample size determination” presented by nQuery.
What we're hoping to do is have a discussion about the current controversy around p-values and, of course, the associated idea of statistical power, and what changes we could see in how sample size determination is done at the various stages of clinical trials and in wider research. So hopefully you'll get an idea of both the controversy around p-values and some of the future methods of interest if you choose an alternative. Before we get started, my name is Roman Fitzpatrick. I'm the Head of Statistics here at Statsols and I'll be presenting today's webinar. I've been the nQuery lead researcher since nQuery version 3.0, and I've given guest talks and guest workshops at places like the FDA and JSM. So hopefully you'll get some interesting tidbits out of today's talk and some interesting theory, as well as some practical guidance on sample size that doesn't involve p-values and power.
So in terms of what we'll cover in today's webinar, first we'll be covering the current controversy. Basically, what is power? What are p-values?
Why are they being criticized currently, and what are some of the alternatives out there, in terms of different approaches to doing analysis and inference for your study, but also, by extension, how you would do the planning for such a study if you weren't using a p-value or hypothesis testing, and how that would affect the sample size determination in particular. Then we're going to focus on two main alternative approaches. One is sample size for frequentist intervals, with confidence intervals being the common one there, and the other is Bayesian sample size, which will focus on Bayesian intervals but also briefly mention the idea of Bayesian testing. Then there will be some discussion and conclusions at the end. Before we get started, just a few frequently asked questions. If you want a recording of this webinar, it is being recorded, and that will be sent out after the webinar is complete, with the slides I'm using sent alongside that recording. So you'll be getting those once this webinar is complete. If you have any questions, use the questions tab on the right-hand side of the webinar screen and I'll try to answer a couple of these at the end. For any I don't answer during the webinar, I will email you afterwards with the appropriate response. So if you have questions, keep them coming, and if there are any technical issues make sure to ask about those as well.
That's just so I know to make changes on our end to deal with those. If you want to know what we're working on today, the examples are a confidence interval for two means, a tolerance interval for one mean, a credible interval for one mean and a posterior error test for the two-sample z-test. As mentioned, this webinar is being presented by nQuery, which is sample size software used very widely in the pharmaceutical and biotechnology industries, with 91% of organizations who do clinical trials for the FDA having a license for nQuery, with some of those on this slide here. So obviously the whole point is to give a lot of background, but then to show practically how to do sample size calculations for these alternatives to power and p-values in software such as nQuery. Obviously, the general lessons will be extensible to other software that does the same thing as well.
So sample size determination. I suppose the first thing I want to do is just give a little bit of background about this.
So sample size determination is the process of finding the appropriate sample size for your study, a rather generic term. But basically what we want is a study which has a reasonable chance of success, given some assumptions about what we think is going to happen in the design. Sample size determination is a very common part of clinical trial design for various reasons, but basically we're trying to balance the ethical and practical issues that exist in a clinical trial. First and foremost, of course, we want to do a trial or study which is likely to generate valid conclusions. We do not want a trial that is very underpowered, because in that case we massively increase the chance that the result or conclusion we get from our study is completely different from what will actually happen in the population.
So Andrew Gelman talks about what are known as Type S errors and Type M errors. Type S errors are sign errors, where the sign of the effect found in your study is the opposite of what actually exists in the population. Obviously a very huge issue there. Then there are Type M errors, magnitude errors, where the magnitude of the effect found in your trial is different from the magnitude of the actual effect in the population. You know, you can have a difference of one in your actual study but the actual effect is 0.1, or maybe it's actually a hundred and you've underestimated it. Both of those would be pretty problematic. But of course those need to be considered against the cost constraints and the ethical issues that exist. So sample size determination is therefore kind of a Goldilocks problem. We want to have enough people so we can make valid conclusions and have a good study, but we don't want to expose people unnecessarily to a treatment regimen that hasn't been tested yet, and we don't want to create trials that are so expensive that they become financially and economically unfeasible. And of course, for all these reasons, sample size determination is a fairly standard regulatory requirement from entities such as the FDA for confirmatory trials.
But if you look at things like the Nature Publishing Group, in the study protocol template that they give, the first thing you'll find under statistical considerations is: how did you get your sample size?
So in terms of sample size determination, it goes without saying that the most widely used criterion is statistical power, and effectively the power is the probability that you will detect a true effect of a specified size. It's important to note that what the term power has become associated with is the idea of doing hypothesis testing, and of course by extension p-values. This is also known as null hypothesis significance testing, NHST, in many places as well. So obviously this is the kind of area where you set a significance level of, say, 5 percent, and then if your p-value is less than 0.05 or 5% you come to the conclusion that you can reject the null hypothesis. This is still the most widely used form of inference, and that is reflected in the fact that statistical power is really about giving you the chance of rejecting, of having a significant p-value given that you should have a significant p-value, that is to say, given that you do have a meaningful difference.
That's basically what power is trying to measure: your probability of getting a success, in the sense of a significant p-value. But of course, it's important to note that p-values as a concept are under significant scrutiny in today's literature, in terms of the effects that they've had on research and in terms of their usage. I'll be talking about that scrutiny in the next slide. But if that's happening, we need to ask what alternatives exist out there for inference, so for when we actually do our analysis. Just as importantly, we also want to think about how we are going to design our study if we're not going to be relying on p-values or on hypothesis testing, what alternatives exist, and what effect that has on the sample size determination. Because the points about sample size determination, about valid inference, cost constraints etc., don't go away just because you've decided not to use power.
So what is the controversy over NHST and p-values? Well, there have been many long-standing criticisms of NHST and p-values, which I won't go into in detail here. But basically I think the biggest ones, and the ones that people are most worried about, are their misinterpretation and their openness to abuse. The misinterpretation is often related to the idea of the inverse conditional, which is the last point here as well. Basically, in null hypothesis significance testing and the p-value, we're getting a measure of how likely our data is given the hypothesis is true, or perhaps false in the case of power. So what's the problem with that? Well, most people argue that intuitively, when we do statistical testing, what we're actually interested in is not how likely the data I got is given the hypothesis is true.
It's how likely is my theory, my hypothesis, to be true given the data that I got? That's the inverse conditional, and of course we know that the probability of A given B does not equal the probability of B given A in most practical circumstances. What that means is that when people try to explain what a result significant at the 5% level is, they often misinterpret it. They say something like, oh, that means there's a less than 5% chance the results are due to random chance, or other types of misinterpretations. I won't go through them all, to avoid proliferating them, but that's very common. Whereas obviously the p-value has a very strict definition, which is the probability that you would get the data you did, or more extreme, given that the null hypothesis is true. So you've created a fixed universe where the null hypothesis, say that the mean difference is equal to 0, is actually true, and all you're saying is, suppose we were to generate data from that world.
How likely is it that we would have got the result that we did? There are other issues around the p-value in terms of abuse that are also very significant. We've all heard a lot about ideas like p-hacking: basically, if you just do enough tests, you're likely to get a significant p-value just due to the stochastic nature of the data. That's obviously a major issue, more of an ethical and practical issue, but if you have approaches which are more robust against that, that would be very welcome. And there are a couple of other more minor points. The p-value is directly related to n, which means that if you have a low n your chance of getting a significant result is very low, and if you go to a high enough n then eventually you will get a significant p-value for any arbitrarily small nonzero difference. Zero is just one point, the one that happens to be the most likely under the null hypothesis.
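As a rough illustration of how the p-value depends on n, here is a minimal sketch. It assumes a one-sample z-test with a known standard deviation, and the observed difference of 0.01 and SD of 1 are made-up numbers, not from the webinar; the function name is purely illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(diff, sd, n):
    """Two-sided p-value for a one-sample z-test of H0: mean difference = 0."""
    z = diff * sqrt(n) / sd  # the test statistic grows with sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same tiny observed difference at increasing n.
for n in (100, 10_000, 100_000, 1_000_000):
    print(n, two_sided_p_value(0.01, 1.0, n))
```

The same practically negligible difference moves from clearly non-significant to clearly significant purely because n increases.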
But at a high n, basically any value that is not exactly zero is likely to be significant. The p-value obviously also ignores the effect size. If you quote a p-value without the effect size, which is more common than we would like, you're ignoring the magnitude of the effect, the clinical significance of that effect and other practical issues. And then there's a general Bayesian argument that prior data, or a prior at least, should be implemented in your statistical framework. But despite all this, it is still the most widely used statistical framework in practice. That is changing a little bit, though. These criticisms go all the way back to, basically, the 50s, when NHST became standardized by combining the approaches of Fisher with Neyman-Pearson, which was somewhat controversial at the time itself. I'm not going to go into the history of that, but NHST has a very interesting history in and of itself. In recent times there has been a new debate, brought back to the fore by the statement by the American Statistical Association on p-values and hypothesis testing, where they emphasized,
you know, what the precise definition is, and that good design can ameliorate some of the issues we have with p-values, but also that the p-value is intrinsically tied to a certain type of design, basically randomized trials are very tied to p-values in most cases, while also specifically focusing on the weaknesses, including some of the ones that I've given above.
And so, the supplementary material, I would recommend reading that; it's very, very good. Lots of very different views on the topic. If you want to get a more opinionated view on it, I would definitely recommend reading all of that. But basically I think that precipitated a renewed interest in the topic, and of course this year we had some articles and papers in journals like Nature and The American Statistician which have renewed the calls, which themselves are quite long-standing, for abandoning the p-value completely and abandoning hypothesis testing completely. But the big open question, I suppose, is what should replace them.
I think there are two levels to that. One is general day-to-day research being done in universities and places like that, preclinical research etc., where I think it's much more open to debate, whereas in the context of confirmatory clinical trials, phase III trials, I think it's a little bit more narrow. In the first area there's obviously a strong belief that the reliance on this kind of machine for generating significant results, or for being able to categorize results, is a problem, and people are very much against that. Whereas in the context of clinical trials, I think there will always be a requirement from a regulatory point of view to pre-specify the success criterion a priori, to prevent abuse by motivated actors.
Obviously that argument does apply to other research, but I think when you're talking about the approval of drugs and such, a pre-specified success criterion is basically required to prevent undue influence being placed upon regulatory agencies and forcing them to think about things after results are collected. So I think there's a little bit of nuance there. But I think in general research the p-value has definitely not been fit for purpose.
So what I'm talking about here is just two alternatives that have been put forward for doing sample size determination without using power, and there are kind of two schools here. Obviously there have been lots of other proposals, for example different sample size methods for prediction models, but the two I'm going to talk about are these. The first is basing sample size not on power but on the precision of your estimate. So instead of focusing on whether you will get a significant result or not, you should focus more of your energy, and more of your sample size calculation, on how accurate you want your final estimate to be. Obviously that means you're going to be using something like a statistical interval, a confidence interval, a credible interval, a tolerance interval etc., and what you want to do is make sure that the width of that interval, given some assumptions about what's going to happen, is no larger than some pre-specified length.
So that's a change in focus, and there have been several different people putting forward methods in this area. Obviously confidence intervals in general are quite widely used and quite well known. Both Rothman recently in 2018 and Kelly further back in 2003 argued that confidence interval sample size could be seen as a superior alternative to power in many cases, particularly in general research and preclinical research. The other one is the idea of using other success criteria, if you want to base your sample size on some type of test, and obviously the most likely candidate is a set of Bayesian criteria. So this is the idea of basing sample size on a different criterion than power and the p-value, such as a Bayesian criterion like the Bayes factor or the posterior error, or on not just the power but power alongside other criteria. So you want not just your statistical power to be sufficient, you want to say that there are other aspects that have to be fulfilled as well, perhaps from a Bayesian framework, perhaps from other error types such as your FDR, your likelihood ratio etc.
This is a much wider category than the first one, in the sense that people have lots of different views on what success should mean a priori. Some foundational papers in the area would be Adcock, who did some work on credible intervals and that kind of framework, and Lindley in 1997, who talked about the choice of sample size, a very interesting paper which took a decision-theoretic cost approach to sample size calculations. There has been a wide array of different methods put forward for Bayesian criteria, ranging from Bayes factors to the posterior error test etc., and for frequentist alternatives like the FDR, the false discovery rate, there is sample size for that as well.
So this is a much wider category, and today we're going to focus mostly on the Bayesian perspective, but just to mention, it's a much wider topic than I'll be able to cover today. And I think this idea of other success criteria in clinical trials specifically is, for me, the big open question that will have to be answered. Because, as I said, in a regulatory context pre-specifying what success is before the study occurs is such a central part of good regulation that if you're not willing to say that a Bayes factor has to be at least 30, or that your FDR has to meet some value like 0.05, or some other criterion, then it's unlikely to replace the current strict 2.5% type I error rate that is the current requirement for phase III confirmatory trials in the vast majority of cases.
So the first alternative is interval estimation. In the broadest terms, what is a statistical interval? Basically, a statistical interval is one which contains the probable range for an unknown parameter. That might be a mean, that might be a proportion of the expected population, that might be a proportion of a future sample, and many other different things. Obviously "probable" here is being used in its broadest sense, but probable under the frequentist framework is the probability under repeated sampling. So we're basically saying that if we were to repeat the procedure for the confidence interval, or other type of frequentist interval, then across all of those repeated samples we expect, with a certain confidence, a certain percentage of those intervals to contain the unknown parameter.
And so using the interval approach emphasizes the estimate value but also emphasizes its uncertainty. If you're going to be using intervals at the end for inference rather than testing, that means you will be looking not just at whether that specific effect size is significant; you have to think, well, this is the effect size that I got, what does that mean in terms of clinical significance, in terms of real-world significance? The hope is therefore to move away from the idea of significance without looking at the effect size. The effect size should be more central to what we think about when we get a result, rather than the effect size just being turned into a p-value and then forgotten about.
But the uncertainty point is also important, because obviously you have an estimate which is no longer just a point estimate. I think one of the biggest mistakes being made right now in terms of alternatives, in getting rid of the p-value, is the danger of turning around and talking just about the estimate value as the value, and I think that ignores the core part of statistics. The whole reason statistics is a field is that uncertainty is absolutely vital to understanding and making inferences of any kind, whether you're going to be using a p-value or not. So hopefully by using interval estimation, rather than just talking about estimate values or talking about testing, you combine the importance of the effect size value with the idea of how we deal with the uncertainty in that estimate, which of course is what the testing approach is implicitly doing with the degrees of freedom and the variance etc.
So, some things there. Ironically, when you get to sample size determination, the first part, the actual effect size value, ends up dropping out, because in many cases the specific effect size that you're going to get doesn't matter for calculating the confidence interval, and it's really the precision of the interval that you're targeting. You want to know how wide the interval is, and by implication what you're really focusing on for sample size purposes is that variance, that uncertainty in the parameter estimate. Now, the parameter estimate does have an effect sometimes, depending on the type of test you're doing. Basically, are the location of the estimate and the variance of the estimator independent or not? That is usually true for, say, means, but not for proportions.
But the basic point here is that sample size determination is basically asking you to think about what quality of estimate you want, how good an estimate you want in the sense of having a good amount of certainty about the final value you end up having, and the effect size ends up being de-emphasized for sample size purposes. But remember, sample size is just a means of getting a good estimate. If you get a good estimate, then when you start looking at what estimate you actually got in the final study, that's when the thinking has to happen, because you're not tying the statistical method to the testing process intrinsically anymore.
So it goes without saying that the confidence interval from the frequentist school is by far the most widely used statistical interval, and the layman's definition would be the probability that you will capture the true value for a parameter, such as the mean, the mean difference or the odds ratio, under repeated sampling. That's to say, if I set my confidence interval process at the start and do that a thousand times, and my assumptions around the distribution etc. have been fulfilled, then with a 95% confidence level I would expect 95% of those thousand intervals to contain the true parameter. Once again, a kind of counterintuitive interpretation. It makes no statement about how likely an individual CI is to contain the parameter, only about what proportion of the set of CIs would contain the actual true value, with the assumptions being correct.
Once again there's the inverse conditional idea there: this is given that your assumptions are correct.
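To make the repeated-sampling interpretation concrete, here is a small simulation sketch. It assumes normally distributed data with a known standard deviation (so a z-interval is exact), and the sample size, number of repetitions and seed are arbitrary illustrative choices rather than anything from the webinar.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_coverage(true_mean=0.0, sd=1.0, n=25, level=0.95, reps=5000, seed=1):
    """Fraction of repeated-sample z-intervals (known SD) that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sd / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        # The interval is sample mean +/- half_width; check whether it covers the truth.
        if abs(fmean(sample) - true_mean) <= half_width:
            hits += 1
    return hits / reps


print(ci_coverage())  # proportion of intervals covering the true mean
```

The printed proportion should sit close to the nominal 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure across repetitions.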
So the sample size determination is targeting the interval width. That's quite closely tied to the idea of the standard error of the estimate, which is tied to your normal law-of-large-numbers type theory, and the idea is that this gives you the correct width on average, which is what you're usually targeting. If you look at the sample size method in the top right-hand corner here, that's basically saying that on average this n will give you the correct width. But it's important to note that the interpretation of average here is a little bit unusual. You might think that's like the median, that 50% of the time the width won't be larger than it. But due to the distributional assumption around the standard deviation, the variance, which is more related to the chi-squared than the normal distribution in actuality, the probability of the true width not being wider than the specified width is actually a little bit higher than 50%.
There's an adjustment for that which we can do, called the coverage correction, to specify what probability we want for the length not being higher than our target length. So if we have a target length of one, by default, if we just use the normal derivation, maybe 40 percent of the time the length might be slightly higher than our target length, whereas we might want to set an 80% chance that it will not be higher. It's important to note that, yes, that probably does sound a bit low, but of course most of the lengths that you're likely to see above the target length will still be near the target. They won't be off in the wilderness, way larger than you expect.
One small note I always like to make when it comes to precision sample size: if you have location independence with the variance, so basically the estimate of the value does not affect the variance, which is true for, say, the means case if you're assuming a Z statistic, a Z distribution, then the power and the confidence interval can effectively be related as concepts. Basically, you can think of it like this: take the lower limit of your confidence interval (we're assuming that higher effects are better in this case), or in general the limit that's closest to the null hypothesis value, and take the difference between that lower limit and the null hypothesis value.
So say the difference between 0 and the lower limit, and then basically calculate power for that, and that actually gives you the power, as you can see with this kind of standardized effect size of k here.
And so that's what you can get there. It kind of makes sense to a certain extent if you think about it. If the lower limit of a CI was exactly on the null hypothesis value, you're basically asking what the chances are that the lower limit will be less than the null hypothesis value. Well, the implicit distribution for the lower limit is centered right on the null hypothesis value, so in that case there's a 50% chance the lower limit will be higher than the null hypothesis value, if that makes sense.
But of course, we're not focusing on power today. It's just a small note you may be interested in.
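The note above on relating the confidence interval to power can be sketched as follows. This assumes a z-based interval with a known standard error and that higher effects are better; here k is taken to be the distance from the null value to the expected lower limit, measured in standard errors, which is my reading of the slide rather than something the transcript spells out. The function names are illustrative.

```python
from statistics import NormalDist

NORM = NormalDist()


def power_from_lower_limit(effect, se, null=0.0, level=0.95):
    """Power computed from the expected lower CI limit, as in the note above."""
    z = NORM.inv_cdf(1 - (1 - level) / 2)
    expected_lower = effect - z * se      # expected lower limit of the interval
    k = (expected_lower - null) / se      # standardized distance from the null
    return NORM.cdf(k)


def power_direct(effect, se, null=0.0, level=0.95):
    """Usual one-sided z-test power at alpha = (1 - level) / 2."""
    z = NORM.inv_cdf(1 - (1 - level) / 2)
    return 1 - NORM.cdf(z - (effect - null) / se)


# Hypothetical inputs: effect of 1.5 with a standard error of 0.5.
print(power_from_lower_limit(1.5, 0.5), power_direct(1.5, 0.5))
```

Both routes give the same number, and when the expected lower limit sits exactly on the null value (k = 0) the power is 50%, which is the intuition described above.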
So what I'm going to do here is just a couple of fairly simple confidence interval sample sizes, because I think the assumption is that most of you are familiar with doing this. The major thing we're doing here is changing the framework, in the sense that we have a confidence level instead of a significance level, the standard deviation would be the same thing as if we were doing a one-sample t-test, but now instead of power we're targeting the half-width of the confidence interval. This is from a real paper looking at rosiglitazone for insulin sensitivity, from Diabetic Medicine, and it's just a simple example that we'll explore very quickly here.
So for anyone who isn't familiar with nQuery, in nQuery the sample size calculations happen in the top left-hand corner here, in this spreadsheet-type format. Each column is an individual calculation, with white rows being inputs and yellow rows being outputs. That is to say, you can solve for the distance from mean to limit (the confidence interval half-width) or the sample size, but you need to input these other rows first, and each row is a required assumption for the calculation. So we need to know the confidence level, we need to know the standard deviation etc. Also note that on the right-hand side there's a little help card here giving you guidance on each of these rows. So if you don't know a standard value, say 0.95 for the confidence level, it gives you that there, as well as the acceptable range of values.
So first let's quickly replicate the answer here, with 0.95 or 95% (this is on the proportion scale, obviously) for a two-sided interval, with a standard deviation equal to 1.7 and a distance from mean to limit of 1.5. And we see that we get a sample size n of 10 per group.
We get a little statement here, and we could do some plots as well if we're interested, but we won't focus on that now. So basically you can see that sample size calculations for confidence intervals are just as simple, if not simpler, than doing a power calculation for a t-test. So in terms of barriers to considering this approach over null hypothesis testing, there really aren't any from a sample size calculation point of view. And of course software like nQuery will tend to have sample size calculations for confidence intervals for a wide variety of the common types of parameters you're interested in, ranging from hazard ratios to odds ratios, log odds ratios, mean differences, mean ratios etc. All of those are basically covered in most standard software such as nQuery, so that's not a major issue here.
And so the major thing to emphasize here is that we have this distance from mean to limit of 1.5. That means we want an estimate of a certain quality, with the limits no further than 1.5 away from it, and that's how confident we want to be in our final result. As I mentioned, though, there is something called coverage correction, because this is only really guaranteeing that on average the length will be 1.5. And as I say, "on average" doesn't really line up with it being 50/50, due to the distributional assumptions of the variance and its relation to the chi-squared. It's closer, I believe, to around 40%.
So if you wanted to control the probability that the length of the interval you get in your final analysis is no larger than your target length, you can use the coverage probability to adjust for that. This is the same example for two means, except we now have a coverage probability here. So let's just put in the same values we had previously, just for clarity, because you can see that all the inputs are the same for this example except for the coverage probability. You can see here that when the standard deviation must be estimated, its observed value may be smaller or larger than the true value. The coverage probability is the probability that the observed interval will be no longer than the specified value of w here. 0.8 is a standard value, and you can kind of think of this as almost like the power for the interval. So it's not just that we want the confidence level of 0.95; we also want to say that we don't want the length of our interval to be longer than the target length, with a given probability.
So the probability of the width, the coverage probability, however you want to phrase it, is somewhat equivalent to the power of your interval. It's definitely worth considering when you're doing your sample size calculation, but you will note, of course, that because we're being more strict, the sample size requirement has increased from 10 to 14. So that's something to keep in mind as well. Going back to the slide, the coverage probability derivation is far more complex.
But if you looked at this equation here and wondered where it came from, think about the actual equation for a confidence interval, where the half-width equals the standardized z statistic (1.96 for a 95% two-sided interval) multiplied by the standard deviation divided by the square root of n. All that's been done for this equation, which is what we used in the first example, is to rearrange it so that the half-width is on the right-hand side and n is on the left-hand side, and then we square all of this because n originally appeared as the square root of n. That's how it comes about, and that more naive estimate is what most standard estimates are, but do keep in mind this criterion of coverage probability. As I mentioned, this is right on average. The chance of ending up with a limit distance of something like 15 is highly, highly unlikely; I'd argue you're more likely to get values like 1.6 or 1.55. Maybe you consider those not good enough and you really want 1.5. That's when you get the coverage probability involved.
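Written out, the rearrangement just described looks as follows. This is a sketch for the simple case the speaker describes (a single mean with the standard deviation treated as known); the symbols $w$ for the half-width, $\sigma$ for the standard deviation and $z_{1-\alpha/2}$ for the standardized statistic (1.96 for a 95% two-sided interval) are labels chosen here, not notation taken from the slides.

```latex
w = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n} = \frac{z_{1-\alpha/2}\,\sigma}{w}
\quad\Longrightarrow\quad
n = \left(\frac{z_{1-\alpha/2}\,\sigma}{w}\right)^{2}
```

In practice n is rounded up to the next whole number.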
So before we move on, because I think the actual sample size calculations in both these cases are going to be relatively trivial, I'm just going to do a very simple case for the proportion to illustrate one of the minor points I made. In this case you'll notice there is no mention of the actual value of the mean difference. This is a confidence interval for the difference of two means, but there's no mean difference anywhere here, and that's because under the t distribution or z distribution assumptions for the two means there is independence between the location estimate and the variance, so we only really need to know the variance. But in the proportions case the variance and the estimate are not independent. We know that the variance is related to p times 1 minus p if we're talking about a single proportion. So now the value of p, or pi, the value of the proportion, is affecting the variance estimate itself.
Therefore you need to have the effect size, because it's basically the effect size and the variance calculation in a single parameter. For this case, what we're looking at is a 95% two-sided confidence interval with an expected proportion of 50%. Here a sample size of 24 would give a limit distance of 0.2, and you can see that we get approximately 0.2 for the distance from proportion to limit for n of 24. One other note I'd make about proportions, and indeed about more complicated cases than the means case, is that oftentimes you'll have lots of different options for the confidence interval approach. You could use the log version of the proportion. For rare events, you might want an interval given a probability of observing a single event. And there are other adjustments, such as the finite population adjustment, very similar to what you have in a t-test, et cetera.
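To make the dependence on the proportion concrete, here is a minimal Python sketch. It assumes the simple normal-approximation (Wald) interval, which may not be the exact method behind the table shown in the webinar, and the function name is made up for illustration.

```python
import math
from statistics import NormalDist


def proportion_half_width(n, p, confidence=0.95):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    Assumption: a plain Wald interval; other interval types (exact, log-based,
    and so on) would give somewhat different numbers.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # The variance term p * (1 - p) depends on the proportion itself,
    # which is why the expected proportion has to be specified.
    return z * math.sqrt(p * (1 - p) / n)


# Webinar example: expected proportion 0.5 and n = 24
print(round(proportion_half_width(24, 0.5), 3))
# Same n, but an expected proportion far from 0.5 gives a narrower interval
print(round(proportion_half_width(24, 0.1), 3))
```

The variance is largest at a proportion of 0.5, so planning with 0.5 is the conservative choice when the expected proportion is unclear.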
And there are exact versions. In the two-sample case for proportions there's Newcombe and lots of different versions of intervals. So do know that when you get beyond the means case, the continuous case, the range of options available, even for the same type of parameter and even with the same distributional assumptions, can vary quite widely.
So this example is pretty simple, and I hopefully don't need to go through it too much. These other examples just show some of the variations. As I mentioned, for proportions you'd have the log odds ratio for both the paired case and the two-sample case and various other options, and we're adding more of those all the time for all of those different situations.
Before I finish up with the frequentist intervals: in many cases you're probably finding a sample size for confidence intervals, and that's fine. But there are other frequentist intervals that are used, or perhaps are underused, that can target things other than a parameter value such as a mean difference or another type of standardized parameter. Two of the frequentist intervals that you may be familiar with, but maybe haven't used in a while, are the prediction interval and the tolerance interval. The prediction interval is an interval to contain a future sample, subject or parameter with a given probability. There are lots of different definitions of this, but they all focus on the idea that you build an interval on a current set of data and you want that interval to contain, with a certain confidence, a value from a future sample.
The classic case, the kind you were probably taught in your intro stats class, is where you want to create a prediction interval that has a certain confidence level, a certain probability under repeated sampling, of containing a single future observation. That's to say I have a set of data, say a hundred people, and I want to create a prediction interval. That's basically saying that if I take a single measure from a future person from the same population, I want a 95% chance that the interval I create will include that future person. And it's not a trivial thing, because confidence intervals and prediction intervals have historically been misused or misinterpreted.
I think there's a classic case related to doping testing in sports, where a claimant in one jurisdiction successfully appealed their ban on the basis that, at that time, confidence intervals were being used to measure whether your result was considered unusual or not. They successfully argued that you should actually be using something like a prediction interval, which will tend to be wider than a confidence interval, and that this is what should have been done even within the frequentist framework; we're not even talking about Bayesian stuff there, just frequentist. They made a success of that, and these days that's been changed to use things like prediction intervals and tolerance intervals rather than confidence intervals when you're talking about the range of values accepted, if it is based on statistical science, of course. For sample size determination purposes, it's important to note that the sample size is not based on the actual width you're going to get for your prediction interval but is related to the limiting interval. I don't have enough time to cover limiting intervals here, but for a single value it would basically be equal to the standardized z statistic for your confidence level, so 1.96 for a two-sided 95% interval, times your standard deviation.
That will be your limiting interval in the case of a single future observation, and the question is how much wider your prediction interval will be versus that limiting interval for a given sample size. And then there are tolerance intervals, which are intervals to contain a given proportion of the sampled population with a given probability. So rather than talking about an interval which will contain the mean a given percentage of the time under repeated sampling, this says I want to contain a certain proportion of the sampled population with a given probability. So you have the population coverage: you want to have, say, 95% of the population within your interval. One small note there is that, to do this from a frequentist point of view, you have to assume what the underlying population distribution is: exponential, normal, etc.
So it is quite tightly tied to having a fairly confident belief in the underlying distribution. That's true of most statistical testing, but it's particularly true for tolerance intervals.
Basically, your sample size determination becomes calculating what's known as the tolerance factor, which for all intents and purposes simply replaces your standardized z statistic and sample size part, and you multiply that by your standard deviation to get your tolerance interval. It doesn't really have as much of an intuitive or important interpretation standing alone, but that's where the sample size determination calculations come in. They just calculate this tolerance factor, often known as k, and then you can get your n relatively trivially from that.
So what I'm going to do is just take some sample size examples from the two cases that were covered in the original papers that the sample size methods in nQuery are based on, and quickly show those off to illustrate what I was talking about. I'm not going to go too much into them because they work quite similarly to all of your standard sample size calculations. In the case of the tolerance interval, we have a 95% confidence level for what is a lower interval, and let me explain why. Exponential distributions are usually used for lifetime distributions, and since a longer life is usually better, we're looking for an interval with a lower limit for the lifetime, say for an industrial process or indeed perhaps for a survival process. We want a lower limit, but we're happy for the interval to contain anything above it; we're not really worried about lifetimes that are too long.
Then we have a 90% proportion covered. That's the proportion of the expected distribution, in this case exponential: we want 90% of that distribution to be in our future interval. We have an exceedance margin here of 0.05 and an exceedance probability of 0.05. These work, I would say, very similarly to the coverage probability correction we were talking about earlier, except instead of a generic one where we just don't want any values of the length greater than our target length, this is specifying some upper limit for the kind of exceedance that's particularly important to us.
So in this case we want 90% covered on average, but we only want there to be a 5% chance that 95% is exceeded, basically.
So that's the first example. The other example, as I mentioned, is the prediction interval, which is related to the relative width, and that's a relatively simple calculation. In this context there are two approaches. Let me do it the first way, which is basically asking what the expected relative width is, relative, as I said, to the limiting interval, whose definition is given in the help card here if you're interested. In this case the relative width is 1.177, so you want a width that is 1.177 times the limiting interval. We have a future sample size here of 1, so this is the classic prediction interval: we want an interval for a single future observation from the same population, and we get a sample size of 11.
There's a slight issue here, in that it doesn't return the closest value but the value that ensures you have less than the expected relative width. If we put 10 here, you'll see that this is really the closer value. So if you want the closest, bear in mind that this is really just guaranteeing that the length is less than your target length.
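The relative-width calculation can be sketched in Python. This is one plausible reading of the criterion, not a confirmed description of the nQuery table: it assumes normal data, a two-sided interval of the form mean ± t × s × sqrt(1 + 1/n) for one future observation, and a limiting half-width of z × sigma. All function names are invented for illustration, and the t quantile is computed numerically so that only the standard library is needed.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """P(T <= x) for x >= 0: Simpson's rule on [0, x] plus the lower half (0.5)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_quantile(p, df):
    """Quantile for p > 0.5, found by bisection on the numerical CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def expected_sd_ratio(n):
    """E[s] / sigma for a normal sample of size n."""
    return math.sqrt(2 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2))


def expected_relative_width(n, confidence=0.95):
    """Expected prediction-interval width divided by the limiting width."""
    p = 1 - (1 - confidence) / 2
    z = NormalDist().inv_cdf(p)
    t = t_quantile(p, n - 1)
    return t * math.sqrt(1 + 1 / n) * expected_sd_ratio(n) / z


def smallest_n(target, confidence=0.95, n_max=1000):
    """Smallest n whose expected relative width does not exceed the target."""
    for n in range(2, n_max + 1):
        if expected_relative_width(n, confidence) <= target:
            return n
    raise ValueError("target relative width not reached within n_max")


print(smallest_n(1.177))
print(round(expected_relative_width(10), 4), round(expected_relative_width(11), 4))
```

Under these assumptions, n = 10 lands just above the 1.177 target and n = 11 is the first value below it, which is the rounding behaviour described above.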
And as I mentioned, there are two approaches. The other approach here uses the upper prediction bound, which is really just an upper bound for the relative width with an exceedance probability. That looks pretty familiar, right? It's basically saying this is the width we're targeting and we don't want to exceed it, with a given probability. That seems very familiar because it's doing the same thing as the coverage probability.
So in both these cases the sample size calculations are relatively simple. There's a little homework required on your part if you want to understand where these concepts and values come from, but once you've figured that out, doing the calculation itself is relatively trivial in software such as nQuery.
So with the last 15 minutes or so, what I want to cover is Bayesian sample size. Briefly note that I've covered Bayesian sample size many times in these webinars before, including the last webinar, which was about using Bayesian sample size methods to complement frequentist sample size, or more accurately to complement p-value and power-based sample size. Today is much more about sample size for Bayesian methods rather than just improving sample size methods. The wider context is that Bayesian analysis is becoming more widely popular, whether in terms of sample size or, more likely, in general inference, due to its ability to integrate prior knowledge. That was one of the criticisms we had of p-values: it's not impossible, but it is relatively difficult to properly integrate prior knowledge, although there is stuff like combining p-values, etc.
The other reason is a more intuitive interpretation. That's more of an opinion than a fact, but the Bayesian interpretation is the inverse conditional of the frequentist interpretation. Remember, in the frequentist interpretation it's the probability of the data given the hypothesis is true, whereas the Bayesian interpretation, if you have some hypothesis you're testing, is the probability the hypothesis is true given the data, and some would definitely argue that's the more intuitive interpretation and what people are looking for when they do statistical inference. For sample size determination purposes, there are two main schools of Bayesian sample size. The first, which we're going to focus on today, is sample size for Bayesian methods. This is looking for the sample size for specific values of Bayesian parameters such as Bayes factors, credible intervals, continual reassessment method quantities like the probability of selecting the right dose level in a phase I MTD trial, and things like utility functions.
That last one is the cost-theoretic approach by Lindley I mentioned earlier on, which I'm not going to focus on too much today. The other school is improving frequentist sample size methods using Bayesian approaches and thinking, concepts like assurance and MBL intervals, and also in terms of adaptive design. I did a whole webinar on those a month ago. The recording is available on our website, or if you get in contact at info@statsols.com, we'll be happy to share the recording of that webinar and the slides. But today the focus is on sample size for Bayesian methods, and the easiest way to get into it, which fits in nicely with what we've just covered, is this: instead of what would probably be considered replicating the mistake of trying to create an inferential standard that's based on testing rather than accuracy or variance, why don't we just do sample size, if we do Bayesian analysis, based on the credible interval, which is basically the equivalent of the confidence interval viewed through the Bayesian lens.
In intuitive terms, a credible interval is stating, in layman's terms, that a given interval has a certain probability of containing the true value, obviously through the lens of a given percentage of the posterior distribution being within the credible interval. Credible intervals are one of the most commonly used Bayesian methods for interval estimation; well, probably the most commonly used Bayesian method for interval estimation. Using the methods by Adcock initially, and then extended by Joseph and Bélisle, you can do these kinds of intervals for the means and proportions cases. There is a bit of a complication that comes in here in terms of what your selection criterion is, which I'll talk about in a moment, and there are also different scenarios in terms of whether you know your value for the precision, which is just the reciprocal of the variance, so precision and variance are directly related.
And also, are you just doing a pure credible interval, or are you thinking about frequentist intervals at the same time? That's the MBL approach, which I'll mention later on, but which I covered in detail in the last webinar.
The easiest way to discuss the complications is to actually do an example. This is just a table from the original Joseph and Bélisle paper. I'm going to replicate the example here, where they had a gamma prior for the precision with 2 and 2 for the two gamma parameters, a hyperparameter of 10, a length of 0.2, and a 1 minus alpha, basically a coverage probability, of 0.95.
For the sake of time I've done this already so I can focus on the actual interesting part, which is what some of these different options mean. The coverage error here is just 1 minus the value given in the table below. This is our coverage error, which is basically the proportion of the time that the interval will not contain the true value. It's now okay to make that simpler, more intuitive interpretation, versus the confidence interval case. Once we've done that, we then need to think about the prior parameters. Remember, it's a credible interval, and we're going to assume normally distributed data for the purposes of this, so it will look a lot like a confidence interval, but we do now get to integrate prior information into our credible intervals.
Basically, we need to know what the variance is to get our interval, so what we ask now is how much we know about our prior estimate for the variance, so we can apply a prior to it. If we don't know anything about that we might use some of the more standard uninformative priors, but I think for the case of a variance we usually have some decent idea of the estimate. There are various ways you could get the prior for this value; the maxim "today's posterior is tomorrow's prior" comes to mind.
But regardless of where you get the prior from, we're assuming here that it's characterized as a gamma distribution for the precision, which, as mentioned, is the reciprocal of the variance. This comes down to the fact that the inverse gamma is often used for the variance, but the gamma distribution is perhaps more intuitive and better known, and it's a standard way of parametrizing in the Bayesian school regardless. In the gamma distribution, lower values of the gamma parameters usually imply that your prior is less informative. So, assuming that the ratio of these two parameters is the same, the lower they are, the less informative we're saying we believe the prior to be. In this case we have a fairly weak prior for the variance. We're saying that the strength of the evidence we based this prior on wasn't that strong, so let's keep it weak, and obviously when we get real data we'll be updating that to get a much better posterior distribution, which will be much more influenced by the real data than by the prior.
So that's the first thing here. It's important to note that if you keep the ratio of these two the same, you can show very trivially that the prior mean for the gamma is equal to the ratio of these two parameters. So if the ratio is kept constant, your estimate for the precision, and by extension the variance and standard deviation, will be kept constant. So if I keep the ratio but just keep increasing them, say I set these to 200 and 200, this is saying I have a really strong prior now but the estimate of the precision is still the same.
It's still expected to be 1, because the same value divided by itself equals 1. In that simple case you can also easily derive that if the precision equals 1, the variance equals 1, since the reciprocal of 1 is 1, and if the variance is 1, the standard deviation is also equal to 1. In the original case we assumed a relatively weak prior, and you can see that as the prior got stronger the sample size went down a lot. That's very unsurprising, of course, because if we have this prior data, we're going to be using that property of the Bayesian framework for integrating prior data, using it appropriately based on how strong our belief in the prior is. If we have a strong prior, and of course we're assuming for planning purposes that we've got a good estimate of the prior, then the sample size ends up being greatly reduced.
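The point about keeping the ratio fixed can be checked with a few lines of Python. This sketch assumes the two gamma parameters are a shape and a rate, so that the prior mean is their ratio as described; the function name and the plug-in standard deviation are illustrative choices, not something taken from nQuery.

```python
import math


def gamma_prior_summary(shape, rate):
    """Summaries of a Gamma(shape, rate) prior on the precision.

    Assumption: shape/rate parametrization, so the prior mean is shape / rate.
    """
    prior_mean = shape / rate               # expected precision
    prior_sd = math.sqrt(shape) / rate      # spread of the prior around that mean
    implied_sd = 1 / math.sqrt(prior_mean)  # plug-in standard deviation of the data
    return prior_mean, prior_sd, implied_sd


for shape, rate in [(2, 2), (200, 200)]:
    mean, sd, implied = gamma_prior_summary(shape, rate)
    print(f"Gamma({shape}, {rate}): mean precision = {mean:.2f}, "
          f"prior sd = {sd:.3f}, implied data sd = {implied:.2f}")
```

Both priors centre the precision at 1, but the second is far more concentrated around that value, which is what "a stronger prior with the same estimate" means here.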
The posterior interval length is basically the same as the confidence interval length, except note this is the total length, not the half-length. And the hyperparameter n0 is really just what you give as your prior sample size. So if we are taking these prior parameter values from a study of size 10, this is an adjustment factor for that. In real terms, what it often comes down to is that, compared with setting the hyperparameter equal to 0, which is basically where you got this prior from no previous data, you just take that value away, so you can see there's a difference of 10. It doesn't match exactly in all cases, but that's roughly a good rule of thumb. You can see that the parameters required, regardless of your criterion, are exactly the same, and we have a subset here of 0.05, and so on and so forth.
The main difference between these criteria is what they allow to be correct only on average. The ACC, the average coverage criterion, says we want the coverage, basically the probability of containing the true value, to be correct only on average, while the length will always be exactly 0.2. The ALC, the average length criterion, allows the length to be correct on average, but the coverage has to be exactly right. That actually maps more closely onto what we think of in terms of confidence interval calculations, but ironically it ends up having a quite different interpretation from the confidence interval in the extreme, which we'll see in a moment. And the WOC, the worst outcome criterion, is basically an extension of the ALC but has a subset here, and the definition of this is very similar to the coverage correction we saw earlier. We only want a 5% chance that the length will be exceeded in our actual final interval. That's what this subset is doing here.
As I mentioned, if we had a very strong prior, you might intuitively think that this ends up being pretty similar to the idea of a frequentist interval, right? It comes back to that point you might have heard, that confidence intervals and credible intervals don't end up being that different overall, and you can actually show that to a certain extent. If I set a 95% confidence level, which is equivalent to the 5% coverage error we saw previously, a standard deviation equal to 1, which we noted holds where the two gamma parameters are the same (precision equal to 1, variance equal to 1, standard deviation equal to 1), and a half distance of 0.1, which is half of 0.2, we get a sample size of 385. Now I take the ACC criterion and make this a really informative prior.
Let's say it's 2000 and 2000, a very informative prior with very low variance. We'll see the sample size ends up being the same as we got for the confidence interval, except for the hyperparameter here, because obviously the confidence interval doesn't take into account that you had prior data. If you hadn't had any prior data, those would become the same thing. And even though in the confidence interval case we're assuming the error is fixed and the length is variable, it's actually the Bayesian case where the error is variable and the length is fixed that ends up coming to the same value. That reflects the inverse conditional problem: you're focusing on two different things, almost, with different things allowed to be right on average, and yet they end up with roughly the same result. That comes back to the point people have made about confidence intervals and credible intervals being quite similar. As I mentioned, MBL intervals are very similar, except that we want not just our credible interval to be of a certain length, but also our confidence interval, with the same assumptions about the variance, to be of the right length.
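The comparison can be reproduced with a short Python sketch. The first function is the standard confidence interval formula. The second is not the Joseph and Bélisle ACC formula itself; it is the limiting case in which the precision is treated as known, which is what an extremely informative prior on the precision approaches. In that case the posterior precision of the mean is (n + n0) times the data precision, so the prior sample size n0 is simply subtracted. Function names are made up for illustration.

```python
import math
from statistics import NormalDist


def ci_sample_size(sd, half_width, confidence=0.95):
    """Frequentist n for a mean: smallest n with z * sd / sqrt(n) <= half_width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sd / half_width) ** 2)


def credible_sample_size_known_precision(precision, length, n0, confidence=0.95):
    """Bayesian n for a normal mean with known precision (an assumed limiting case).

    The posterior precision is (n + n0) * precision, so the credible interval
    has total length 2 * z / sqrt((n + n0) * precision).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = 4 * z ** 2 / (precision * length ** 2) - n0
    return max(0, math.ceil(n))


print(ci_sample_size(sd=1.0, half_width=0.1))
print(credible_sample_size_known_precision(precision=1.0, length=0.2, n0=10))
print(credible_sample_size_known_precision(precision=1.0, length=0.2, n0=0))
```

With no prior sample size the two calculations coincide, and with a prior sample size of 10 the Bayesian figure is lower by exactly that amount, matching the rule of thumb mentioned earlier.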
But I covered those in the webinar last month, so I'm not going to cover them here today. The principles and inputs are very similar, so if you can do a sample size calculation for credible intervals, you can do the same for the MBL, the mixed Bayesian/likelihood approach.
I'm going to very briefly mention the posterior error approach, because I did cover it last month. The posterior error approach is probably somewhere halfway between being a method for improving frequentist statistics and being a full-on alternative criterion itself, because it allows you to target the posterior error, but the posterior error is based on allowing success to be defined by the p-value itself. In fact, if you think about the Bayes factor, one of the most common and well-popularized ways of calculating the Bayes factor is a derivation based on the p-value, and I think this posterior error approach ends up being very similar. I covered this example in the last webinar, so if you want more detail you can see it there, but basically we have these inputs here.
Just note that these are inverted: alpha here corresponds to beta and vice versa, which is a bit unusual. And we have a 50% prior probability against the null hypothesis, so we are integrating the prior belief against the null hypothesis.
We enter 2.64 as our difference in means, 10.6 as the standard deviation for each group, and a sample size ratio of 1. We get 254, which is not too far in this case from the 524 total in that case. This would obviously change with our prior: 50/50 against the null hypothesis is a little bit similar to the frequentist case, I suppose, but if we change the prior probability against the null, things get very different. And you can see that, for these Bayesian criteria, we're automatically getting the frequentist criteria that we're looking for.
So that's a significance level of 0.05 and a power of 25%. If you want to play around with that more, I would definitely recommend using the posterior error calculator option from the Assistants menu, because in this case you can go either way. For a given significance level and power and a prior probability against the null, let's say 0.75, I can get back these posterior errors, just remembering that this alpha really is closer to this value here. So this is like 76% Bayesian power, and this is like a 0.018 Bayesian significance level.
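The direction from frequentist errors to posterior errors is just Bayes' theorem, and a minimal Python sketch makes that explicit. The function and variable names are made up, the formulas are a plain application of Bayes' theorem rather than a confirmed copy of the calculator's internals, and the power of 0.90 is an assumed illustrative input that the transcript does not confirm.

```python
def posterior_errors(alpha, power, prior_alt):
    """Posterior error probabilities from frequentist error rates via Bayes' theorem.

    alpha     : significance level, P(significant | no true effect)
    power     : P(significant | true effect)
    prior_alt : prior probability of a true effect (against the null)
    """
    beta = 1 - power
    prior_null = 1 - prior_alt
    # P(no true effect | significant result)
    false_positive = alpha * prior_null / (alpha * prior_null + power * prior_alt)
    # P(true effect | non-significant result)
    false_negative = beta * prior_alt / (beta * prior_alt + (1 - alpha) * prior_null)
    return false_positive, false_negative


# Illustrative inputs: alpha = 0.05, assumed power = 0.90, prior = 0.75
fp, fn = posterior_errors(alpha=0.05, power=0.90, prior_alt=0.75)
print(round(fp, 3), round(1 - fn, 2))
```

With these assumed inputs the two results land close to the 0.018 and 76% figures mentioned above. The first quantity conditions on a significant result and the second on a non-significant one, which is why the roles of alpha and beta look swapped relative to the frequentist definitions.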
But I think it's important to note that for the Bayes factor, which is one of the criteria being put forward as an alternative to p-values, there was a very well-publicized derivation of the Bayes factor based on the p-value, and I think the idea of the posterior error here, as defined by Lee and Zelen, is very similar.
So, discussion and conclusions. Sample size determination is vital for study design, with power being the most common criterion. But as I said, it's intrinsically tied to the NHST and p-values controversy, and therefore if we're going to have alternatives to NHST and p-values, we need sample size determination methods for those alternatives. If we're going to use frequentist intervals more commonly as the final thing we're targeting rather than the p-value, then sample size methods are well developed, not just for confidence intervals but also for the other, maybe less well-known but still very interesting, tolerance and prediction intervals.
I think it is important to note that they exist, and not to just use confidence intervals by default when you're really looking for something else. Credible intervals have that same advantage of relying on an interval framework rather than a testing framework, but have that more intuitive interpretation from the Bayesian framework: there's a 95% chance that a given credible interval contains the true value. That's still a bit of a simplification, but it's the more accurate simplification, because the confidence interval statement is that 95% of all confidence intervals will contain the true value; it doesn't make any statement about how likely a given confidence interval is to contain the true value. And the posterior error approach is an example, perhaps an early example, of a different success criterion that you might want to consider alongside or in addition to power when doing your sample size calculation. So you're not just looking at your standard frequentist errors but also at your Bayesian errors, and things like Bayes factors will probably lean into that same idea going forward.
So that completes the webinar. I just want to thank you all for attending, and I'll be looking at the questions here in a moment. But just in terms of some final bits and bobs: our summer update happened in May this year, so that should be downloadable right now automatically.
And basically we added a series of new features: 29 new classic tables, that is fixed-term, non-adaptive tables, in hierarchical modeling and interval estimation; five new assurance and posterior error related tables (the Lee and Zelen approach); and seven new adaptive tables related to sample size re-estimation and conditional power. In particular, unblinded sample size re-estimation for survival data is probably the big thing that was added in that update, and you can get the details at www.statsols.com/whats-new. And just to mention that if you don't have a license for nQuery, or you don't have a license for the Bayesian or adaptive features, you can get a free trial of those by simply filling in the form at statsols.com/trial. There's no download or installation required, so basically you will be able to use a version of nQuery within your browser for a trial period. All we need is basically your basic details and your email address, so that we can send you on the relevant process for getting into the trial. So if you want to try either nQuery in total, if you don't have nQuery, or any of the more complex features, there is an opportunity to do so.
So, without any further ado: if you have any questions after today's webinar, just email me at info@statsols.com. I suppose what I'd prefer to do is to thank you so much for listening. I'm going to take a couple of questions here at the end. I'm going to take a moment to just read them and answer a few live on air. But otherwise, if you're interested, the references will be in the slides at the end here for all of the examples and other things referenced, and I'll be back in a moment, just to answer a few questions.
Okay, so there's a general question of how likely I think it is, personally I suppose is what they're asking, that power and p-values are going to be phased out in the near future. It's a bit longer than that, but that's the basic gist. Personally, as I said, in nQuery we're mostly focused on clinical trials, mostly focused on the pharmaceutical and biotechnology places. I think the 2.5% type I error makes more sense in the clinical trial context for, as I said, one practical reason. As a regulatory agency, if you basically say we will get a result, or get our inferential statistic, and then decide what success means, there's obviously a lot more room there for negotiation and a lot more room there for, you know, difficulties, because in many cases the regulatory agencies are looking at all these different studies at the same time.
Whereas each trial is very motivated to have success, and therefore, for regulators, having a single success criterion means that you have a regularized approach. There's no idea of it being biased, there's no idea of them favoring one person over another. It removes that kind of controversy and makes their job just easier, even if that means that they might miss a few more results than if they went to a more open-minded approach, like using the estimate with its interval and stuff like that. One other thing I would note: I talked about the inverse conditional being the more intuitive interpretation. I think from the perspective of a regulatory agency, the less intuitive interpretation of the frequentist school kind of philosophically makes sense. I would argue it's a devil's advocacy type of argument, where you as a regulatory agency are going to assume that the null hypothesis is true, which is usually that the drug is ineffective. So we're assuming that's true; now tell me how likely it is that what you got in your study is what would have happened given that the devil's advocate argument, that you're wrong, is actually true. So I think philosophically it probably makes more sense in that context.
Whereas obviously in day-to-day research happening at universities, happening in companies, happening in preclinical research, that interpretation isn't really as required. That's people doing their own research, trying to get evidence for their hypothesis, which is often ill-defined, or not well defined already. Having a more general approach, which puts at the center of it "let's get results, interpret them and think about them, and then make conclusions about what we're going to do next based on that", perhaps makes more sense in that context. And the basic idea of the probability of the hypothesis given the data is maybe more the kind of evidence that they're thinking of.
And there was just one small question about the posterior error approach, I think probably because I skipped over it a little bit too quickly. As I mentioned, I did this in a previous webinar which covers it, but the question is just where the prior probability parameter comes from. At the end of the day, it's kind of similar to all prior parameters: it has to come from intuition, expert elicitation, or previous data. That's basically just you, within a frequentist framework, going: okay, how much do I believe the null hypothesis is true or not a priori? Because obviously one of the big issues, or one of the jokes that you often get about frequentist approaches, is how they would deal with testing that the sun is going to rise in the morning, situations like that where there is such a very large prior probability in play. Why aren't you including that? That is what Bayesians would obviously argue, and this is just an opportunity to do that.
And if, at the end of the day, you're going to base your final result on a p-value, why don't you just include this kind of thinking, these kinds of errors, in your thinking at the same time? And, as I said, the Bayes factor, or one derivation of the Bayes factor based on the p-value, basically makes a very similar argument, with basically the same parameter used to get from the p-value to the Bayes factor: this probability against the null.
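One widely cited calibration that goes from a p-value to a Bayes factor is the bound 1 / (-e * p * ln p), valid for p < 1/e, which gives the largest Bayes factor in favour of the alternative under fairly general conditions. It may or may not be the exact derivation referred to above, so the sketch below is an assumption-laden illustration. The function names and the 50% prior probability for the alternative are hypothetical choices.

```python
import math


def bayes_factor_bound(p):
    """Upper bound on the Bayes factor for H1 versus H0 implied by a p-value.

    Uses 1 / (-e * p * ln p), which only applies for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


def max_posterior_prob_alt(p, prior_alt=0.5):
    """Largest posterior probability of H1 given the bound and an assumed prior."""
    prior_odds = prior_alt / (1.0 - prior_alt)
    posterior_odds = bayes_factor_bound(p) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


bfb = bayes_factor_bound(0.05)
post = max_posterior_prob_alt(0.05, prior_alt=0.5)
print(f"Bayes factor bound at p = 0.05: {bfb:.2f}")
print(f"Max posterior probability of H1 (prior 0.5): {post:.2f}")
```

As with the posterior error calculation, the prior probability is the extra input that turns a p-value based quantity into a statement about how likely the hypothesis is given the data.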
Okay, there's a couple of other minor questions, mostly related to practical things like the slides et cetera, and a couple of other smaller ones, so I'll get back to you by email. I just want to finally thank everyone again for attending. I hope you learned some interesting ideas here. And once again, thank you so much, and I look forward to having you at the next webinar. Thanks a lot and goodbye.