
An Overview of Multiple Imputation in SOLAS for Missing Data 5.0

July 30, 2015

In our previous post we discussed the pervasive problem of missing data in data analysis. To recap quickly: consider a data set with 5 variables measured at the start of a study and then monthly for six months, giving 5 × 7 = 35 values per case. If each variable is 95% complete, with a random 5% of its values missing, then the proportion of cases expected to be incomplete is 1 - (0.95)^35 = 0.834. That is, only 17% of the cases would be complete, and with traditional complete-case analysis you would lose 83% of your data.
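The arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function name is made up for this example, and it assumes that values go missing independently of one another.

```python
def expected_incomplete_fraction(n_variables, n_timepoints, completeness):
    """Probability that a case has at least one missing value,
    assuming each value is missing independently of the others."""
    n_values = n_variables * n_timepoints
    return 1.0 - completeness ** n_values


# 5 variables at baseline plus 6 monthly follow-ups = 7 time points, 35 values per case
fraction = expected_incomplete_fraction(5, 7, 0.95)
print(round(fraction, 3))  # prints 0.834
```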

How Does It Work?

With SOLAS 5.0™, missing values in a data set are filled in with plausible estimates to produce a complete data set that can be analyzed using complete-data inferential methods. The software is designed to accommodate a range of missing data scenarios in both longitudinal and single-observation study designs.

SOLAS for Missing Data Analysis

What methods are used?

Hot-deck Imputation sorts respondents and non-respondents into a number of imputation subsets according to a user-specified set of covariates. An imputation subset comprises the cases that share the same values of those covariates. Missing values are then replaced with values taken from matching respondents (i.e. respondents that are similar with respect to the covariates). SOLAS™ will not impute any missing value for which no matching respondent is found.
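The idea can be sketched in plain Python as follows. This is not SOLAS code: the function name, the dictionary-based data layout and the use of `None` for a missing value are assumptions made for illustration, and exact matching on the covariates is the simplest possible definition of an imputation subset.

```python
import random
from collections import defaultdict


def hot_deck_impute(cases, target, covariates, rng=None):
    """Fill missing `target` values (None) with a value drawn at random
    from respondents that share the same covariate values."""
    rng = rng or random.Random()
    # Group the observed target values by covariate combination.
    donors = defaultdict(list)
    for case in cases:
        if case[target] is not None:
            donors[tuple(case[c] for c in covariates)].append(case[target])

    imputed = []
    for case in cases:
        filled = dict(case)
        if filled[target] is None:
            pool = donors.get(tuple(case[c] for c in covariates))
            if pool:  # leave the value missing when no matching respondent exists
                filled[target] = rng.choice(pool)
        imputed.append(filled)
    return imputed


cases = [
    {"sex": "F", "site": 1, "score": 12.0},
    {"sex": "F", "site": 1, "score": None},  # one matching respondent
    {"sex": "M", "site": 2, "score": None},  # no matching respondent
]
result = hot_deck_impute(cases, "score", ["sex", "site"], random.Random(0))
```

In this small example the second case receives the value 12.0 from its only matching respondent, while the third case stays missing because no respondent shares its covariate values.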

The Predictive Model Based Methods currently available are Ordinary Least Squares (OLS) Regression, applied to continuous or ordinal data, and a Discriminant Model, applied to categorical data. Multiple imputations are generated using a regression model of the imputation variable on a set of user-specified covariates. The regression model parameters are drawn at random from their Bayesian posterior distribution, which is based on the cases for which the imputation variable is observed. Each imputed value is the predicted value from these randomly drawn parameters plus a randomly drawn error term. The error term is added to prevent over-smoothing of the imputed data, while the parameters are drawn from a posterior distribution to reflect the extra uncertainty that arises because the regression parameters can only be estimated, not determined, from the observed data.
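A minimal sketch of the regression approach is given below, under several simplifying assumptions that are not taken from SOLAS itself: a single covariate, a non-informative prior (so that the residual variance is drawn as the residual sum of squares divided by a chi-square draw), and `None` marking a missing value. The function name is hypothetical, and at least three observed cases are needed.

```python
import math
import random


def regression_impute(x, y, rng=None):
    """Return a copy of y with None entries imputed from a simple linear
    regression of y on x, using parameters drawn from their posterior."""
    rng = rng or random.Random()
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    x_mean = sum(xi for xi, _ in obs) / n
    y_mean = sum(yi for _, yi in obs) / n
    sxx = sum((xi - x_mean) ** 2 for xi, _ in obs)
    slope_hat = sum((xi - x_mean) * (yi - y_mean) for xi, yi in obs) / sxx
    sse = sum((yi - y_mean - slope_hat * (xi - x_mean)) ** 2 for xi, yi in obs)

    # Draw the residual variance: SSE divided by a chi-square draw with n - 2 df.
    chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
    sigma = math.sqrt(sse / chi2)
    # With x centred, the intercept and slope are independent given sigma.
    intercept = rng.gauss(y_mean, sigma / math.sqrt(n))
    slope = rng.gauss(slope_hat, sigma / math.sqrt(sxx))

    # Predicted value from the drawn parameters plus a drawn error term.
    return [
        yi if yi is not None
        else intercept + slope * (xi - x_mean) + rng.gauss(0.0, sigma)
        for xi, yi in zip(x, y)
    ]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, None, 9.8, None]
# Five imputed data sets, each from an independent draw of the parameters.
imputations = [regression_impute(x, y, random.Random(seed)) for seed in range(5)]
```

Because both the parameters and the error term are redrawn each time, the five completed data sets differ from one another in the imputed positions only.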

The Propensity Score Method applies an implicit model approach based on Propensity Scores and an Approximate Bayesian Bootstrap to generate the imputations. The propensity score is the estimated probability that a particular element of data is missing. The missing data are filled in by sampling from the cases that have a similar propensity to be missing. The multiple imputations are independent repetitions from a Posterior Predictive Distribution for the missing data, given the observed data.
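The sampling step can be sketched as follows. The sketch assumes the propensity scores have already been estimated (for example by a logistic regression of the missingness indicator on the covariates) and are passed in as a list; the function name, the default of five groups and the use of `None` for missing values are illustrative choices, not SOLAS settings.

```python
import random


def abb_impute(values, propensity, n_groups=5, rng=None):
    """Impute None entries of `values` within groups of similar propensity
    score, using an approximate Bayesian bootstrap in each group."""
    rng = rng or random.Random()
    n = len(values)
    # Rank cases by propensity score and cut the ranking into roughly equal groups.
    order = sorted(range(n), key=lambda i: propensity[i])
    imputed = list(values)
    for g in range(n_groups):
        group = order[g * n // n_groups:(g + 1) * n // n_groups]
        observed = [values[i] for i in group if values[i] is not None]
        missing = [i for i in group if values[i] is None]
        if not observed:
            continue  # no donors in this group; leave the values missing
        # Approximate Bayesian bootstrap: first resample the observed values
        # with replacement, then draw the imputations from that resample.
        resample = [rng.choice(observed) for _ in observed]
        for i in missing:
            imputed[i] = rng.choice(resample)
    return imputed


values = [1.0, 2.0, None, 10.0, None, 12.0]
propensity = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
completed = abb_impute(values, propensity, n_groups=2, rng=random.Random(0))
```

The extra resampling step is what distinguishes this from a simple hot-deck draw: it adds between-imputation variability that reflects uncertainty about the donor distribution.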

The Mahalanobis Distance Matching Method is used to identify cases that have similar characteristics to cases that have missing values. Missing data are filled in by sampling from the closest cases. The multiple imputations are independent repetitions drawn from the range of closest cases.
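A small sketch of distance-based matching is shown below. To keep it self-contained it handles exactly two covariates, so the inverse covariance matrix can be written out by hand; the function name, the `n_closest` parameter and the `None` convention are assumptions for this example rather than SOLAS options.

```python
import random


def mahalanobis_impute(rows, target, n_closest=3, rng=None):
    """rows: list of (x1, x2) covariate pairs, fully observed.
    target: list of values with None where the value is missing."""
    rng = rng or random.Random()
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    # Sample covariance matrix of the two covariates.
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    det = s11 * s22 - s12 ** 2

    def distance_sq(a, b):
        d1, d2 = a[0] - b[0], a[1] - b[1]
        # d' S^-1 d written out for a 2x2 covariance matrix
        return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det

    donors = [i for i in range(n) if target[i] is not None]
    imputed = list(target)
    for i in range(n):
        if target[i] is None:
            closest = sorted(donors, key=lambda j: distance_sq(rows[i], rows[j]))
            imputed[i] = target[rng.choice(closest[:n_closest])]
    return imputed


rows = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0), (1.0, 1.0)]
target = [1.0, 2.0, 3.0, 4.0, None]
completed = mahalanobis_impute(rows, target, n_closest=1, rng=random.Random(0))
```

With `n_closest=1` the last case simply takes the value of its nearest neighbour at (0, 0); a larger `n_closest` lets repeated calls produce different imputations.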

The Predictive Mean Matching Method applies Ordinary Least Squares Regression to estimate a predicted value for each case in the data set. Rather than being used directly as imputations, the predicted values are used to identify similarities between cases with missing values and fully observed cases. Cases are sorted into Donor Pools and, as in the Propensity Score Method, imputations are drawn from these pools.
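The matching step might look like the sketch below. It assumes a single covariate and defines each donor pool as the `pool_size` observed cases with the closest predicted values; both choices, along with the function name, are simplifications for illustration and may differ from how SOLAS forms its pools.

```python
import random


def pmm_impute(x, y, pool_size=3, rng=None):
    """Impute None entries of y by predictive mean matching on x."""
    rng = rng or random.Random()
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    x_mean = sum(xi for xi, _ in obs) / n
    y_mean = sum(yi for _, yi in obs) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in obs)
             / sum((xi - x_mean) ** 2 for xi, _ in obs))

    def predict(xi):
        return y_mean + slope * (xi - x_mean)

    imputed = list(y)
    for i, (xi, yi) in enumerate(zip(x, y)):
        if yi is None:
            # Donor pool: observed cases whose predicted values are closest.
            pool = sorted(obs, key=lambda d: abs(predict(d[0]) - predict(xi)))
            # Impute an actually observed value, not the prediction itself.
            imputed[i] = rng.choice(pool[:pool_size])[1]
    return imputed


x = [1.0, 2.0, 3.0, 10.0, 11.0, 2.5]
y = [1.2, 1.9, 3.1, 10.2, 10.8, None]
completed = pmm_impute(x, y, pool_size=3, rng=random.Random(0))
```

Because every imputed value is borrowed from a real case, the imputations always stay within the range of the observed data.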

The Combination Method applies both the Propensity Score Method and the Predictive Mean Matching Method described above to the data set. As a result, each case has a propensity score and a predicted value associated with it. These two quantities are then used as covariates, and the Mahalanobis Distance method is applied to find cases that can be used to impute the missing values.

Regardless of the method used, once the imputed values are generated, the resulting data sets can be analyzed with any complete-data statistical method, while the uncertainty around the missing data is taken into account by imputing two or more different values for each missing data entry.
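The post does not describe how the results from the separate completed data sets are combined. A common approach, shown here as an assumption rather than as a description of SOLAS, is to pool them with Rubin's rules: average the estimates, and add the average within-imputation variance to an inflated between-imputation variance. The function name is hypothetical.

```python
import statistics


def pool_estimates(estimates, variances):
    """Combine per-data-set estimates and their squared standard errors
    into one pooled estimate and its total variance."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Estimates of the same quantity from three imputed data sets
pooled, total = pool_estimates([10.0, 12.0, 11.0], [1.0, 1.0, 1.0])
```

The between-imputation term is what carries the uncertainty due to the missing data: if all imputed data sets gave the same estimate, the total variance would reduce to the ordinary complete-data variance.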

In the white paper The Consequences of Missing Data in the ATLAS ACS 2-TIMI 51 Trial we discuss not only the consequences of missing data but also consider some alternative approaches which could have been adopted.


To receive more information and be among the first to get a free trial of SOLAS 5.0™, just sign up and we'll be in touch soon. You can also watch our recent webinar, 'Visual Sensitivity Analysis for Missing Data'.

 
