Chapter 12 Missing Values and Imputation
12.1 Introduction
Missing observations are common in real-life datasets. Values may not be recorded for a number of reasons. For example, environmental data are often not recorded because of failures in measurement devices or instruments. In clinical trials, survival times and other covariates may be missing because of patient dropout.
The literature on missing values is ample. To mention a few works, Little and Rubin (2002) provide a nice overview of missing values and imputation. Buuren (2018) gives a more recent description of the area, which includes examples using R code. A curated list of R packages for missing data is available in the Missing Data CRAN Task View at https://cran.r-project.org/web/views/MissingData.html.
In general, missing values can seldom be ignored. Models that include a way to account for missing data should be preferred to simply ignoring the missing observations. When missing values can be modeled from the observed data, imputation models can be used to provide estimates of the missing observations. Models can be extended to incorporate a sub-model for the imputation.
The different mechanisms that lead to missing observations in the data are introduced in Section 12.2. INLA can fit models with missing observations in the response by computing their predictive distribution, as discussed in Section 12.3. In principle, INLA cannot directly handle missing observations in the covariates, as they are part of the latent field, but imputation of missing covariates using different methods is discussed in Section 12.4. Multiple imputation of missing observations in the covariates using INLA within MCMC is described in Section 12.5.
12.2 Missingness mechanism
Dealing with missing observations usually requires prior reflection on how the data went missing, i.e., the missingness mechanism. Depending on the reasons why observations are missing, the statistical analysis needs to be conducted in different ways. Little and Rubin (2002) describe the different mechanisms for missing data.
Missing completely at random (MCAR) occurs when the probability of a value being missing is independent of both the observed and unobserved data. This means that the missing values can be ignored and the analysis can be conducted as usual. Missing at random (MAR) occurs when missingness depends only on the observed data. In this case, the missingness can be introduced into the model so that missing observations are imputed as part of the model fitting. Finally, missing not at random (MNAR) occurs when missingness depends on the unobserved (missing) data, possibly in addition to the observed data. This scenario is difficult to tackle since the data provide no information about the missingness mechanism or the missing values.
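To make these three mechanisms concrete, the following minimal sketch (not part of the analyses below; the toy variables x and y are purely illustrative) simulates missingness in a response under each mechanism:

# Toy data: covariate x and response y
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
# MCAR: the probability of being missing does not depend on the data
mis.mcar <- rbinom(n, 1, 0.2) == 1
# MAR: the probability of being missing depends only on the observed covariate x
mis.mar <- rbinom(n, 1, plogis(-1 + 1.5 * x)) == 1
# MNAR: the probability of being missing depends on the unobserved value of y itself
mis.mnar <- rbinom(n, 1, plogis(-2 + 0.5 * y)) == 1
y.mcar <- ifelse(mis.mcar, NA, y)
y.mar <- ifelse(mis.mar, NA, y)
y.mnar <- ifelse(mis.mnar, NA, y)
# Proportion of missing values under each mechanism
c(MCAR = mean(mis.mcar), MAR = mean(mis.mar), MNAR = mean(mis.mnar))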
Furthermore, a distinction between missing observations in the response variable and in the covariates can be made. Missing observations in the response variable can naturally be predicted, since the distribution of the response values is determined by the statistical model to be fit. On the other hand, missing values in the covariates are explanatory variables in the model and it is not always clear how they can be estimated. Similarly, missing indices in the definition of the latent random effects are difficult to handle, as it is not always obvious how their values can be filled in. However, different imputation mechanisms can be considered.
INLA does not allow missing values in the definition of the effects in the latent field, and this is a practical problem for model fitting. In general, INLA will not include the fixed or random term in the linear predictor of an observation if the associated covariate has a value of NA. See Gómez-Rubio, Cameletti, and Blangiardo (2019) and the FAQ at the INLA website for details.
12.3 Missing values in the response
First of all, handling missing observations in the response will be considered. Given that the distribution of the response variable is part of the model, it is possible to predict the missing values by computing their predictive distribution. Given a missing response \(y_m\), its predictive distribution is
\[ \pi(y_m \mid \mathbf{y}_{obs}) = \int \pi(y_m, \bm\theta \mid \mathbf{y}_{obs})d\bm\theta = \int\pi(y_m \mid \mathbf{y}_{obs}, \bm\theta) \pi(\bm\theta \mid \mathbf{y}_{obs})d\bm\theta . \]
INLA will automatically compute the predictive distributions for all missing values in the response, which should be assigned a NA value when defining the data.
To illustrate the computation of the predictive distribution of the missing values in the response, the fdgs dataset (in package mice, van Buuren and Groothuis-Oudshoorn 2011) will be used. This dataset records information on 10030 children measured within the Fifth Dutch Growth Study 2009 (Schonbeck et al. 2011, 2013). Table 12.1 shows the different variables in the dataset.
Variable | Description | # missing |
---|---|---|
id | Child unique id number. | 0 |
reg | Region (a factor with 5 levels). | 0 |
age | Age (in years). | 0 |
sex | Sex (boy or girl). | 0 |
hgt | Height (in cm). | 23 |
wgt | Weight (in kg). | 20 |
hgt.z | Re-scaled height (as a Z-score). | 23 |
wgt.z | Re-scaled weight (as a Z-score). | 20 |
Data can be loaded and summarized as follows:
library("mice")
data(fdgs)
summary(fdgs)
## id reg age sex
## Min. :100001 North: 732 Min. : 0.008 boy :4829
## 1st Qu.:106352 East :2528 1st Qu.: 1.619 girl:5201
## Median :203855 South:2931 Median : 8.085
## Mean :180091 West :2578 Mean : 8.158
## 3rd Qu.:210591 City :1261 3rd Qu.:13.548
## Max. :401955 Max. :21.993
##
## hgt wgt hgt.z wgt.z
## Min. : 46.0 Min. : 2.58 Min. :-4.470 Min. :-5.040
## 1st Qu.: 83.8 1st Qu.: 11.60 1st Qu.:-0.678 1st Qu.:-0.625
## Median :131.5 Median : 27.50 Median :-0.019 Median : 0.026
## Mean :123.9 Mean : 32.38 Mean :-0.006 Mean : 0.046
## 3rd Qu.:162.3 3rd Qu.: 51.10 3rd Qu.: 0.677 3rd Qu.: 0.707
## Max. :208.0 Max. :135.30 Max. : 3.900 Max. : 4.741
## NA's :23 NA's :20 NA's :23 NA's :20
Note that several variables in the dataset have missing observations. In particular, height (hgt) and weight (wgt), which are common variables used as response or predictors in models.
In order to provide a smaller dataset to speed up computations, only the children with missing values (in height and weight) and another 1000 taken at random will be used in the analysis:
# Subset for speed-up and testing
# Subset 1: observations with NA's
subset1 <- which(is.na(fdgs$wgt) | is.na(fdgs$hgt))
# Subset 2: random sample of 1000 individuals
set.seed(1)
subset2 <- sample((1:nrow(fdgs))[-subset1], 1000)
# Subset 1 + subset 2
fdgs.sub <- fdgs[c(subset1, subset2), ]
summary(fdgs.sub)
## id reg age sex hgt
## Min. :100098 North: 78 Min. : 0.071 boy :493 Min. : 46.0
## 1st Qu.:106293 East :275 1st Qu.: 1.743 girl:550 1st Qu.: 86.3
## Median :204306 South:298 Median : 8.594 Median :136.1
## Mean :183214 West :250 Mean : 8.565 Mean :127.0
## 3rd Qu.:211388 City :142 3rd Qu.:14.164 3rd Qu.:165.1
## Max. :401949 Max. :21.884 Max. :199.0
## NA's :23
## wgt hgt.z wgt.z
## Min. : 2.58 Min. :-4.263 Min. :-4.075
## 1st Qu.: 11.96 1st Qu.:-0.668 1st Qu.:-0.652
## Median : 29.00 Median :-0.046 Median :-0.007
## Mean : 33.61 Mean :-0.023 Mean : 0.020
## 3rd Qu.: 53.10 3rd Qu.: 0.645 3rd Qu.: 0.696
## Max. :117.30 Max. : 3.200 Max. : 3.630
## NA's :20 NA's :23 NA's :20
Next, a model to predict height based on age and sex is fit to the subset data set fdgs.sub with INLA. This model includes the option to compute the predictive distributions of the missing observations in the response by setting compute = TRUE in argument control.predictor.
library("INLA")
inla(hgt ~ age + sex, data = fdgs.sub,
hgt.inla <-control.predictor = list(compute = TRUE))
summary(hgt.inla)
##
## Call:
## "inla(formula = hgt ~ age + sex, data = fdgs.sub,
## control.predictor = list(compute = TRUE))"
## Time used:
## Pre = 0.394, Running = 0.336, Post = 0.128, Total = 0.858
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 77.148 0.720 75.735 77.148 78.560 77.148 0
## age 6.023 0.055 5.914 6.023 6.131 6.023 0
## sexgirl -4.716 0.730 -6.149 -4.716 -3.284 -4.716 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.007 0.00 0.007 0.007
## 0.975quant mode
## Precision for the Gaussian observations 0.008 0.007
##
## Expected number of effective parameters(stdev): 3.00(0.00)
## Number of equivalent replicates : 340.06
##
## Marginal log-Likelihood: -3976.26
## Posterior marginals for the linear predictor and
## the fitted values are computed
In order to show the predictive distribution, we will first obtain the indices of the first two children with missing values of height in the reduced dataset and show their row names in the original dataset.
hgt.na <- which(is.na(fdgs.sub$hgt))[1:2]
rownames(fdgs.sub)[hgt.na]
## [1] "273" "2034"
Next, the summary statistics of the fitted values can be shown using index hgt.na:
hgt.inla$summary.fitted.values[hgt.na, c("mean", "sd")]
## mean sd
## fitted.Predictor.0001 122.25 0.5353
## fitted.Predictor.0005 83.25 0.6830
Similarly, a model can be fit to explain weight based on age and sex and to compute the predictive distribution of the missing observations:
wgt.inla <- inla(wgt ~ age + sex, data = fdgs.sub,
  control.predictor = list(compute = TRUE),
  control.compute = list(config = TRUE))
summary(wgt.inla)
##
## Call:
## c("inla(formula = wgt ~ age + sex, data = fdgs.sub,
## control.compute = list(config = TRUE), ", " control.predictor =
## list(compute = TRUE))")
## Time used:
## Pre = 0.356, Running = 0.315, Post = 0.0693, Total = 0.74
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 6.259 0.436 5.403 6.259 7.115 6.259 0
## age 3.331 0.034 3.265 3.331 3.397 3.331 0
## sexgirl -2.379 0.446 -3.255 -2.379 -1.503 -2.379 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.02 0.001 0.018 0.02
## 0.975quant mode
## Precision for the Gaussian observations 0.021 0.02
##
## Expected number of effective parameters(stdev): 3.00(0.00)
## Number of equivalent replicates : 341.01
##
## Marginal log-Likelihood: -3486.82
## Posterior marginals for the linear predictor and
## the fitted values are computed
This model could be used as a simple imputation mechanism for the missing values of weight, so that the complete dataset could be used to fit, for example, another model to explain height based on age, sex and weight. Here, we check the posterior means of the predictive distribution of weight for the first two children with missing values of the variable wgt:
wgt.na <- which(is.na(fdgs.sub$wgt))
wgt.inla$summary.fitted.values[wgt.na[1:2], c("mean", "sd")]
## mean sd
## fitted.Predictor.0002 23.07 0.3204
## fitted.Predictor.0003 27.62 0.3332
12.3.1 Joint model with height and weight
The two previous models consider height and weight separately, but it is clear that there is a high correlation between height and weight, which is caused by the age and sex of the child. For this reason, a joint model for height and weight could be proposed to exploit a correlated effect between the coefficients of age in both models.
This can be easily implemented with INLA by using a model with two Gaussian likelihoods (for height and weight, respectively) and the latent effect iid2d to include two correlated coefficients in the two parts of the model. Hence, the model will be the following:
\[ hgt_i = \alpha_h + \beta_h age_i + \beta_{h,1} sex_i + \epsilon_{h,i} \]
\[ wgt_i = \alpha_w + \beta_w age_i + \beta_{w,1} sex_i + \epsilon_{w,i} \]
Here, \(\alpha_h\) and \(\alpha_w\) are model intercepts, \(\beta_h\) and \(\beta_w\) coefficients on age (which will be correlated), \(\beta_{h,1}\) and \(\beta_{w,1}\) coefficients on sex, and \(\epsilon_{h,i}\) and \(\epsilon_{w,i}\) two error terms.
Vector \((\beta_h, \beta_w)^{\top}\) is modeled using a multivariate Gaussian distribution with mean \((0, 0)^{\top}\) and covariance matrix
\[ \left[ \begin{array}{cc} 1 / \tau_h & \rho / \sqrt{\tau_h \tau_w}\\ \rho / \sqrt{\tau_h \tau_w} & 1 / \tau_w\\ \end{array} \right] \]
Here, \(\tau_h\) and \(\tau_w\) are the precisions of the coefficients and \(\rho\) is the correlation parameter, which takes values between -1 and 1. This model is implemented in INLA as the iid2d latent effect (see Section 3.3 for more details). Note that the covariate age will be introduced as a weight of this latent effect (see below for details) when it is defined with function f().
First of all, the bivariate response variable needs to be put in a two-column matrix given that the model will be made of two likelihoods (see Section 6.4 for details).
n <- nrow(fdgs.sub)
y <- matrix(NA, nrow = 2 * n, ncol = 2)
y[1:n, 1] <- fdgs.sub$hgt
y[n + 1:n, 2] <- fdgs.sub$wgt
Similarly, given that the model will include two intercepts, these need to be defined explicitly as covariates with all values equal to one. The structure must also be a two-column matrix to have two different intercepts, one for each likelihood.
I <- matrix(NA, nrow = 2 * n, ncol = 2)
I[1:n, 1] <- 1
I[n + 1:n, 2] <- 1
Variable sex needs to be put in a similar format so that the model includes two separate coefficients, one in each of the two parts of the model.
SEX <- matrix(NA, nrow = 2 * n, ncol = 2)
SEX[1:n, 1] <- fdgs.sub$sex
SEX[n + 1:n, 2] <- fdgs.sub$sex
Variable age also needs to be put in a different format but, given that it will be a weight in the iid2d latent effect, it must be passed as a vector of replicated values:
age.joint <- rep(fdgs.sub$age, 2)
Finally, an index vector to indicate which coefficient to use from the iid2d model is required. This index will be 1 for the first half of observations (to indicate that the coefficient is \(\beta_h\)) and 2 for the second half (to indicate that the coefficient is \(\beta_w\)).
idx.age <- rep(1:2, each = n)
The model is fit and summarized as seen below. Note how the estimates of the intercepts and the coefficients for sex are very similar to those from the univariate models.
# Model formula
joint.f <- y ~ -1 + I + f(idx.age, age, model = "iid2d", n = 2) + SEX
# Model fit
fdgs.joint <- inla(joint.f,
  data = list(y = y, I = I, SEX = SEX, age = age.joint, idx.age = idx.age),
  family = rep("gaussian", 2),
  control.predictor = list(compute = TRUE))
# Summary
summary(fdgs.joint)
##
## Call:
## c("inla(formula = joint.f, family = rep(\"gaussian\", 2), data =
## list(y = y, ", " I = I, SEX = SEX, age = age.joint, idx.age =
## idx.age), control.predictor = list(compute = TRUE))" )
## Time used:
## Pre = 0.276, Running = 2.17, Post = 0.101, Total = 2.55
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## I1 81.740 1.272 79.242 81.740 84.236 81.740 0
## I2 8.658 0.773 7.140 8.658 10.174 8.658 0
## SEX1 -4.649 0.729 -6.081 -4.649 -3.219 -4.649 0
## SEX2 -2.377 0.445 -3.250 -2.377 -1.504 -2.377 0
##
## Random effects:
## Name Model
## idx.age IID2D model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.007 0.000 0.007 0.007
## Precision for the Gaussian observations[2] 0.020 0.001 0.018 0.020
## Precision for idx.age (component 1) 0.309 0.088 0.164 0.301
## Precision for idx.age (component 2) 1.177 0.314 0.640 1.154
## Rho1:2 for idx.age 0.782 0.082 0.601 0.790
## 0.975quant mode
## Precision for the Gaussian observations 0.008 0.007
## Precision for the Gaussian observations[2] 0.021 0.020
## Precision for idx.age (component 1) 0.508 0.284
## Precision for idx.age (component 2) 1.867 1.110
## Rho1:2 for idx.age 0.916 0.810
##
## Expected number of effective parameters(stdev): 5.99(0.003)
## Number of equivalent replicates : 341.13
##
## Marginal log-Likelihood: -7481.61
## Posterior marginals for the linear predictor and
## the fitted values are computed
The coefficients for age are part of the random effects of the model. The estimates are very similar to the ones obtained in the univariate models. Note how the standard deviations seem to be slightly smaller (a quick comparison is sketched after the output below).
fdgs.joint$summary.random$idx.age
## ID mean sd 0.025quant 0.5quant 0.975quant mode kld
## 1 1 6.024 0.05538 5.915 6.024 6.133 6.024 3.453e-07
## 2 2 3.328 0.03355 3.262 3.328 3.394 3.328 1.345e-07
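For comparison, the posterior means and standard deviations of the age coefficients from the two univariate models can be extracted as follows (this assumes that objects hgt.inla and wgt.inla fit above are still available):

# Age coefficient in the univariate model for height
hgt.inla$summary.fixed["age", c("mean", "sd")]
# Age coefficient in the univariate model for weight
wgt.inla$summary.fixed["age", c("mean", "sd")]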
The estimates of height of the children that had missing values are now:
fdgs.joint$summary.fitted.values[hgt.na, c("mean", "sd")]
## mean sd
## fitted.Predictor.0001 122.20 0.5348
## fitted.Predictor.0005 83.19 0.6825
In a similar way, the estimates of weight of the children with missing values explored above can also be checked:
fdgs.joint$summary.fitted.values[n + wgt.na[1:2], c("mean", "sd")]
## mean sd
## fitted.Predictor.1045 23.08 0.3192
## fitted.Predictor.1046 27.62 0.3319
12.4 Imputation of missing covariates
As seen in the previous section, the missing values in the response can be handled by computing their predictive distribution and this is possible because the data generating process is specified in the model likelihood. Missing values in the covariates pose a different problem as covariates are supposed to be explanatory variables and the model does not provide any information about the generating process of the covariates. However, imputation models can be proposed by exploiting the correlation among the different observed variables.
A simple approach to tackle missing observations in the covariates is to build a regression model to explain the covariate with the missing observations and plug in the obtained estimates (e.g., posterior means) from their predictive distributions (Little and Rubin 2002). This approach is simple and easy to implement in most cases, but it ignores the uncertainty about the imputed values.
A better approach would be to include a sub-model for the imputation of missing values within the main model to be fit. This will propagate the uncertainty of the imputed values throughout the model. However, this means that an extra level is required in the model and INLA is not designed to include this (see, for example, Gómez-Rubio, Cameletti, and Blangiardo 2019).
In order to develop an example of the approaches mentioned above to fit models with missing observations in the covariates, we will build a model to explain height based on age, sex and weight. Note that now the missing observations in height are part of the response (and their predictive distributions can be computed) but the missing observations in weight are part of the latent field.
First of all, a model is fit to the reduced dataset fdgs.sub. Note that this data.frame contains missing observations of variable wgt, which is in the linear predictor. INLA will not remove the rows with missing observations in wgt before fitting the model, but the wgt term is dropped from the linear predictor of those observations, which means that they are not actually used to estimate the coefficient of wgt. This model will provide a baseline to compare to other approaches.
hgt.noimp <- inla(hgt ~ 1 + age + sex + wgt, data = fdgs.sub)
summary(hgt.noimp)
##
## Call:
## "inla(formula = hgt ~ 1 + age + sex + wgt, data = fdgs.sub)"
## Time used:
## Pre = 0.193, Running = 0.221, Post = 0.0198, Total = 0.434
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 75.308 0.740 73.855 75.308 76.760 75.308 0
## age 4.971 0.147 4.681 4.971 5.260 4.971 0
## sexgirl -4.012 0.716 -5.417 -4.012 -2.608 -4.012 0
## wgt 0.318 0.041 0.236 0.318 0.399 0.318 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.008 0.00 0.007 0.008
## 0.975quant mode
## Precision for the Gaussian observations 0.009 0.008
##
## Expected number of effective parameters(stdev): 4.00(0.00)
## Number of equivalent replicates : 255.03
##
## Marginal log-Likelihood: -3954.24
Next, we use the previous model fit to wgt to plug in the posterior means of the missing observations into a new imputed dataset called fdgs.plg:
fdgs.plg <- fdgs.sub
fdgs.plg$wgt[wgt.na] <- wgt.inla$summary.fitted.values[wgt.na, "mean"]
Note how the values of wgt in the new dataset fdgs.plg do not contain any NA’s:
summary(fdgs.plg$wgt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.58 12.00 29.00 33.54 52.90 117.30
This new dataset is used to fit a new model where there are only missing observations in the response:
hgt.plg <- inla(hgt ~ 1 + age + sex + wgt, data = fdgs.plg,
  control.predictor = list(compute = TRUE))
summary(hgt.plg)
##
## Call:
## "inla(formula = hgt ~ 1 + age + sex + wgt, data = fdgs.plg,
## control.predictor = list(compute = TRUE))"
## Time used:
## Pre = 0.215, Running = 0.384, Post = 0.0747, Total = 0.674
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 73.982 0.751 72.507 73.982 75.455 73.982 0
## age 4.348 0.171 4.012 4.348 4.683 4.348 0
## sexgirl -3.490 0.705 -4.874 -3.490 -2.107 -3.490 0
## wgt 0.503 0.049 0.407 0.503 0.599 0.503 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.008 0.00 0.007 0.008
## 0.975quant mode
## Precision for the Gaussian observations 0.009 0.008
##
## Expected number of effective parameters(stdev): 4.00(0.00)
## Number of equivalent replicates : 255.03
##
## Marginal log-Likelihood: -3932.10
## Posterior marginals for the linear predictor and
## the fitted values are computed
The estimates of height for the first two children with missing observations are:
hgt.plg$summary.fitted.values[hgt.na[1:2], ]
## mean sd 0.025quant 0.5quant 0.975quant mode
## fitted.Predictor.0001 118.20 0.6431 116.94 118.20 119.46 118.20
## fitted.Predictor.0005 82.63 0.6530 81.35 82.63 83.91 82.63
Note how the estimates of the model effects have changed. It is worth mentioning that this last model has used the subjects with missing observations of weight to estimate its coefficient, because their values have now been imputed.
Including the full imputation mechanism in the model will require a sub-model within the main model. This is not currently implemented in INLA, but it can be implemented as a new latent effect with the methods described in Section 11 (Gómez-Rubio, Cameletti, and Blangiardo 2019). Alternatively, Cameletti, Gómez-Rubio, and Blangiardo (2019) propose sampling from the predictive distribution of the imputation model, fitting models conditional on these imputed values and then using Bayesian model averaging on all the fit models to estimate a final model. The resulting model will account for the uncertainty of the imputation mechanism. In the next sections we provide different solutions to this problem by using samples from the imputation model and by implementing an imputation mechanism using INLA within MCMC (Gómez-Rubio and Rue 2018).
12.4.1 Sampling from the imputation model
When imputed values are plugged into the data, the actual model fit is conditional on these values and, therefore, ignores the uncertainty about the imputation process. The marginals resulting from this model can be represented as
\[ \pi(\theta_t \mid \mathbf{y}_{obs}, \mathbf{x}_{mis} = x^*_{mis}) \] where \(x^*_{mis}\) are the plugged-in values of the missing covariates and \(\theta_t\) represents a latent effect or hyperparameter in the model.
Note that the imputation model will produce predictive distributions \(\pi_I(\mathbf{x}_{mis} \mid \mathbf{y}_{imp})\), where \(\mathbf{y}_{imp}\) are the observed data used in the imputation model. This model may have further parameters \(\bm\theta_{I}\), for which posterior marginals can be obtained (but these are not really of interest now).
The posterior distribution of the parameters in the model can be obtained by integrating out the missing observations with regard to the imputation model, i.e.,
\[ \pi(\theta_t \mid \mathbf{y}_{obs}) = \int\pi(\theta_t, \mathbf{x}_{mis} \mid \mathbf{y}_{obs}) d\mathbf{x}_{mis} = \int\pi(\theta_t \mid \mathbf{x}_{mis}, \mathbf{y}_{obs}) \pi(\mathbf{x}_{mis} \mid \mathbf{y}_{obs}) d\mathbf{x}_{mis} \]
Note that now \(\mathbf{y}_{obs}\) refers to the observed data for the main model and the imputation model, i.e., \(\mathbf{y}_{obs} = (\mathbf{y}, \mathbf{y}_{imp})\). \(\mathbf{y}\) now includes the response variables plus any other fully observed covariates in the main model. The previous equation makes it clear that the predictive distribution of the missing observations is informed by the observed data in the general model and the data used in the imputation model.
As \(\pi(\mathbf{x}_{mis} \mid \mathbf{y}_{obs})\) is seldom available, it can be replaced by the predictive distribution of \(\mathbf{x}_{mis}\) obtained with the imputation model, i.e., \(\pi_I(\mathbf{x}_{mis} \mid \mathbf{y}_{imp})\). This assumes that \(\mathbf{x}_{mis}\) is only informed by the observed data in the imputation model (\(\mathbf{y}_{imp}\)) and ignores any dependence on the response variable and the main model. Then, the approximation is
\[ \pi(\theta_t \mid \mathbf{y}_{obs}) \simeq \int\pi(\theta_t \mid \mathbf{x}_{mis}, \mathbf{y}_{obs}) \pi_I(\mathbf{x}_{mis} \mid \mathbf{y}_{imp}) d\mathbf{x}_{mis} \]
Hence, the posterior marginal of \(\theta_t\) can be approximated by sampling \(n_{imp}\) values of \(\mathbf{x}_{mis}\) from their predictive distribution, \(\{\mathbf{x}^{(i)}_{mis}\}_{i=1}^{n_{imp}}\) (see discussion in Cameletti, Gómez-Rubio, and Blangiardo 2019):
\[ \pi(\theta_t \mid \mathbf{y}_{obs}) \simeq \frac{1}{n_{imp}} \sum_{i=1}^{n_{imp}}\pi(\theta_t \mid \mathbf{y}_{obs}, \mathbf{x}_{mis} = \mathbf{x}^{(i)}_{mis}) \]
Note that \(\pi(\theta_t \mid \mathbf{y}_{obs}, \mathbf{x}_{mis} = \mathbf{x}^{(i)}_{mis})\) is obtained by fitting a model with INLA in which the missing observations have been set to \(\mathbf{x}^{(i)}_{mis}\).
The following code obtains 50 samples from the posterior distribution of the imputation model using function inla.posterior.sample() (see Section 2.7):
n.imp <- 50
wgt.pred <- inla.posterior.sample(n.imp, wgt.inla)
The samples include the linear predictor. For the first set of samples, the values of the imputed weights (i.e., the linear predictor) are:
wgt.pred[[1]]$latent[wgt.na, ]
## Predictor:2 Predictor:3 Predictor:4 Predictor:6 Predictor:9
## 22.477 27.522 52.876 13.306 7.518
## Predictor:11 Predictor:14 Predictor:15 Predictor:21 Predictor:22
## 11.746 12.493 6.210 17.201 70.037
## Predictor:28 Predictor:29 Predictor:34 Predictor:35 Predictor:36
## 44.484 12.355 23.686 26.014 55.913
## Predictor:37 Predictor:38 Predictor:39 Predictor:40 Predictor:42
## 34.227 37.247 33.164 36.284 37.637
The original dataset can be completed with each of the 50 sets of imputed values and the model can then be fit with INLA:
imp.models <- lapply(1:n.imp, function(i) {
  fdgs.plg$wgt[wgt.na] <- wgt.pred[[i]]$latent[wgt.na, ]
  inla(hgt ~ 1 + age + sex + wgt, data = fdgs.plg,
    control.predictor = list(compute = TRUE))
})
The fit models can be put together by computing the average model using Bayesian model averaging (Hoeting et al. 1999). For this, function inla.merge() is used with equal weights (\(1/n_{imp}\)):
model.imp <- inla.merge(imp.models, rep(1 / n.imp, n.imp))
The marginals for the fixed effects and hyperparameters are then extracted to be used later:
marg.fixed <- model.imp$marginals.fixed
marg.hyperpar <- model.imp$marginals.hyperpar
Summary statistics of the marginals of the fixed effects can be obtained as usual:
summary(model.imp)
##
## Call:
## "inla(formula = hgt ~ 1 + age + sex + wgt, data = fdgs.plg,
## control.predictor = list(compute = TRUE))"
## Time used:
## Pre = 80.8, Running = 11.2, Post = 5.92, Total = 97.8
## Fixed effects:
## mean sd
## (Intercept) 75.712 2.702
## age 3.722 0.641
## sexgirl -1.869 2.343
## wgt 0.654 0.185
##
## Model hyperparameters:
## mean sd
## Precision for the Gaussian observations 0.013 0.002
##
## Marginal log-Likelihood: -278.21
## Posterior marginals for the linear predictor and
## the fitted values are computed
Similarly, summary statistics of the predictive distribution for the first two children with missing values of height can be obtained using the estimates of the linear predictor:
model.imp$summary.linear.predictor[hgt.na[1:2], ]
## mean sd
## Predictor.01 118.7 2.412
## Predictor.05 85.0 2.193
Note that if the likelihood is not Gaussian, the linear predictor needs to be transformed with the inverse link function in order to obtain estimates of the fitted values.
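As a sketch of this point (hypothetical, since the models above are Gaussian): with a Poisson likelihood and a log link, the marginal of a fitted value could be obtained from the marginal of the linear predictor with function inla.tmarginal(). Here, fit stands for some fitted model whose marginals of the linear predictor are available (depending on the INLA version they may need to be requested explicitly):

# Sketch: transform a linear predictor marginal to the response scale
# (Poisson likelihood, log link); 'fit' is an assumed, already fitted model
marg.lp <- fit$marginals.linear.predictor[[1]]  # marginal of one linear predictor
marg.fitted <- inla.tmarginal(exp, marg.lp)     # apply the inverse link (exp)
inla.zmarginal(marg.fitted)                     # summary on the response scale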
In order to assess the impact of the imputed values of weight in the model, the averaged predictive distribution for a given child with a missing value of height can be compared to the predictive distributions obtained from fitting the models to the 50 imputed datasets. Figure 12.1 shows the averaged predictive distribution (in black) as well as the 50 individual predictive distributions. In this particular case, all of them are very similar.
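A sketch of how such a figure could be produced is shown below, assuming that model.imp and imp.models retain the marginals of the linear predictor (recent versions of INLA may require requesting these marginals explicitly):

# Sketch of Figure 12.1: averaged predictive distribution (black) versus the
# predictive distributions from the 50 individual imputed datasets (grey)
child <- hgt.na[1]  # first child with missing height
plot(model.imp$marginals.linear.predictor[[child]], type = "l", lwd = 2,
  xlab = "Height (cm)", ylab = "Density")
for (i in 1:n.imp) {
  lines(imp.models[[i]]$marginals.linear.predictor[[child]], col = "grey")
}
# Re-draw the averaged predictive distribution on top
lines(model.imp$marginals.linear.predictor[[child]], lwd = 2)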
12.5 Multiple imputation of missing values
Gómez-Rubio and Rue (2018) discuss the use of INLA within MCMC to fit models with missing observations. In a Bayesian framework, missing observations can be treated as any other parameter in the model, which means that they need to be assigned a prior distribution (if an imputation model is not provided).
This example will be illustrated using the nhanes2 dataset (Schafer 1997), available in the mice package (van Buuren and Groothuis-Oudshoorn 2011). The R code reproduced here is taken from the example in Gómez-Rubio and Rue (2018) and it is available from GitHub (see Preface for details).
The nhanes2 dataset is a subset of 25 observations from the National Health and Nutrition Examination Survey (NHANES). Table 12.2 shows the different variables in this dataset, which can be loaded and summarized as:
library("mice")
data(nhanes2)
summary(nhanes2)
## age bmi hyp chl
## 20-39:12 Min. :20.4 no :13 Min. :113
## 40-59: 7 1st Qu.:22.6 yes : 4 1st Qu.:185
## 60-99: 6 Median :26.8 NA's: 8 Median :187
## Mean :26.6 Mean :191
## 3rd Qu.:28.9 3rd Qu.:212
## Max. :35.3 Max. :284
## NA's :9 NA's :10
Variable | Description |
---|---|
age | Age group (numeric, with 1 = 20-39, 2 = 40-59 and 3 = 60+). |
bmi | Body mass index (in kg/m\(^2\)). |
hyp | Hypertensive (numeric, with 1 = no and 2 = yes). |
chl | Cholesterol level (mg/dL). |
Note how there are missing observations of the body mass index and the cholesterol level. The interest now is to build a model to explain the cholesterol level based on the other variables in the dataset:
\[ chl_i = \alpha + \beta_1 bmi_i + \beta_2 age^{40-59}_i + \beta_3 age^{60+}_i + \epsilon_i,\ i=1, \ldots, 25 \]
Here, \(age^{40-59}_i\) and \(age^{60+}_i\) are indicator variables derived from age to indicate whether a patient is in age group 40-59 or 60+, respectively. In addition, \(\alpha\) is the model intercept, \(\beta_j,\ j=1,\ldots, 3\) are coefficients and \(\epsilon_i\) an error term.
This model can be fit to the original dataset as follows (but bear in mind that INLA will not remove the rows in the dataset with missing observations of the covariates):
m1 <- inla(chl ~ 1 + bmi + age, data = nhanes2)
summary(m1)
##
## Call:
## "inla(formula = chl ~ 1 + bmi + age, data = nhanes2)"
## Time used:
## Pre = 0.202, Running = 0.0554, Post = 0.0186, Total = 0.276
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## (Intercept) 150.689 30.138 92.974 149.933 212.846 148.584 0
## bmi 1.241 1.139 -1.069 1.254 3.473 1.278 0
## age40-59 15.570 19.042 -23.214 15.994 52.005 16.820 0
## age60-99 34.929 22.327 -11.106 35.674 76.792 37.173 0
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant
## Precision for the Gaussian observations 0.001 0.00 0.00 0.001
## 0.975quant mode
## Precision for the Gaussian observations 0.001 0.001
##
## Expected number of effective parameters(stdev): 3.19(0.202)
## Number of equivalent replicates : 4.69
##
## Marginal log-Likelihood: -94.86
In order to consider the imputation of the missing observations together with model fitting, INLA within MCMC will be used as described in Section 11.5. This means that the Metropolis-Hastings algorithm will be used to sample new values of the missing observations of bmi and a new model will be fit on the imputed dataset at every step of the algorithm.
Before starting with the implementation of this method, a copy of the original dataset (d.mis) is created, as required by the general implementation of the algorithm in function INLAMH(); then, the indices of the missing values of bmi (idx.mis) and their number (n.mis) are obtained. These will be used in the implementation of the functions required by the Metropolis-Hastings algorithm.
# Generic variables for model fitting
d.mis <- nhanes2
idx.mis <- which(is.na(d.mis$bmi))
n.mis <- length(idx.mis)
The first function to be defined (fit.inla()) is the one to fit the model with INLA given the imputed values of bmi:
# Fit linear model with R-INLA given the imputed values of bmi
# data: Dataset
# x.mis: Imputed values
fit.inla <- function(data, x.mis) {
  data$bmi[idx.mis] <- x.mis
  res <- inla(chl ~ 1 + bmi + age, data = data)
  return(list(mlik = res$mlik[1, 1], model = res))
}
Next, the proposal distribution is defined. In this case, a Gaussian distribution centered at the current imputed values of bmi with variance 10 is used. Two functions are created: dq.beta(), to compute the density, and rq.beta(), to obtain random values:
# Proposal x -> y
# Density
dq.beta <- function(x, y, sigma = sqrt(10), log = TRUE) {
  res <- dnorm(y, mean = x, sd = sigma, log = log)
  if(log) {
    return(sum(res))
  } else {
    return(prod(res))
  }
}
# Random values
rq.beta <- function(x, sigma = sqrt(10)) {
  rnorm(length(x), mean = x, sd = sigma)
}
Next, the prior on the missing values is set: in this case, a Gaussian distribution with mean equal to that of the observed values and standard deviation twice that of the observed values. This provides an informative prior but with ample variability:
# Prior for the missing values of bmi
prior.beta <- function(x, mu = mean(d.mis$bmi, na.rm = TRUE),
    sigma = 2 * sd(d.mis$bmi, na.rm = TRUE), log = TRUE) {
  res <- dnorm(x, mean = mu, sd = sigma, log = log)
  if(log) {
    return(sum(res))
  } else {
    return(prod(res))
  }
}
The implementation of the Metropolis-Hastings algorithm is available in function INLAMH(). A total of 100 simulations will be used for inference, obtained after 50 burn-in iterations and thinning one in 10 from 1000 iterations.
library("INLABMA")
# Set initial values to mean of bmi
rep(mean(d.mis$bmi, na.rm = TRUE), n.mis)
d.init <-#Run MCMC simulations
INLAMH(d.mis, fit.inla, d.init,
inlamh.res <-
rq.beta, dq.beta, prior.beta, n.sim = 100, n.burnin = 50, n.thin = 10)
The simulated values of the missing observations of bmi can be put together and summarized as follows:
# Show results
x.sim <- do.call(rbind, inlamh.res$b.sim)
summary(x.sim)
## V1 V2 V3 V4
## Min. : 9.69 Min. : 8.97 Min. : 3.75 Min. :10.7
## 1st Qu.:21.06 1st Qu.:24.90 1st Qu.:23.57 1st Qu.:18.6
## Median :25.31 Median :28.50 Median :30.20 Median :21.6
## Mean :24.96 Mean :27.28 Mean :29.68 Mean :22.4
## 3rd Qu.:29.20 3rd Qu.:30.95 3rd Qu.:35.47 3rd Qu.:25.3
## Max. :38.02 Max. :38.68 Max. :53.45 Max. :38.6
## V5 V6 V7 V8
## Min. : 8.62 Min. : 0.83 Min. : 8.99 Min. : 8.85
## 1st Qu.:21.97 1st Qu.:20.26 1st Qu.:22.82 1st Qu.:23.37
## Median :28.73 Median :25.71 Median :27.51 Median :28.14
## Mean :27.77 Mean :25.88 Mean :27.50 Mean :28.17
## 3rd Qu.:33.18 3rd Qu.:30.40 3rd Qu.:32.29 3rd Qu.:32.53
## Max. :41.76 Max. :49.38 Max. :44.75 Max. :53.56
## V9
## Min. :12.7
## 1st Qu.:21.4
## Median :26.4
## Mean :27.7
## 3rd Qu.:33.1
## Max. :52.4
Figure 12.2 shows density estimates of the posterior distributions of the nine imputed values of bmi. As can be seen, there is ample variability in the posterior distributions. This may be due to the vague prior assigned to the missing values of bmi and the small sample size of the dataset with full observations, which makes it difficult to obtain accurate estimates of the missing values. Hence, a more informative prior or an imputation model that exploits the available information better (e.g., a linear regression) may be required to produce more accurate estimates.
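A minimal sketch of how a figure such as Figure 12.2 can be reproduced from the matrix of samples x.sim is the following:

# Sketch of Figure 12.2: kernel density estimates of the posterior distributions
# of the nine imputed values of bmi (columns of x.sim)
par(mfrow = c(3, 3))
for (j in seq_len(n.mis)) {
  plot(density(x.sim[, j]), main = paste("Imputed bmi value", j), xlab = "bmi")
}
par(mfrow = c(1, 1))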
This example has shown how to fit models that can include a simple imputation mechanism for missing values in the covariates. In order to obtain estimates of the parameters of the full model, the fit models obtained in the INLA within MCMC run must be put together with function inla.merge():
nhanes2.models <- lapply(inlamh.res$model.sim, function(X) { X$model })
nhanes2.imp <- inla.merge(nhanes2.models, rep(1, length(nhanes2.models)))
summary(nhanes2.imp)
##
## Call:
## "inla(formula = chl ~ 1 + bmi + age, data = data)"
## Time used:
## Pre = 86.5, Running = 8.23, Post = 5.34, Total = 100
## Fixed effects:
## mean sd
## (Intercept) 41.953 63.724
## bmi 4.929 2.238
## age40-59 29.421 17.994
## age60-99 49.389 23.081
##
## Model hyperparameters:
## mean sd
## Precision for the Gaussian observations 0.001 0.001
##
## Marginal log-Likelihood: -90.54
As compared to the model fit to the observations without missing values of bmi, the posterior means of all the fixed effects change. The uncertainty about the intercept and the coefficient of bmi also increases (i.e., they have a larger posterior standard deviation). Note that this may be due to the vague prior given to the missing values of bmi. However, the uncertainty about the estimates of age is very close to that obtained with the previous model.
12.6 Final remarks
Although the analysis of data with missing observations is feasible with INLA, it is computationally very expensive when the missing values are in the covariates. Computing the predictive distribution of missing observations in the response is still reasonable because it is implemented inside the INLA package, but when different models need to be fit to impute missing observations in the covariates, computation times increase. This may be alleviated by the use of parallel computing methods and hardware, as sketched below.
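For example, the models fit to the 50 imputed datasets in Section 12.4.1 could be fit in parallel. The following is a minimal sketch using package parallel; the number of cores is illustrative, and mclapply() relies on forking, which is not available on Windows:

library("parallel")
# Fit the models on the imputed datasets in parallel
# (sketch; mirrors the lapply() loop of Section 12.4.1)
imp.models.par <- mclapply(1:n.imp, function(i) {
  fdgs.plg$wgt[wgt.na] <- wgt.pred[[i]]$latent[wgt.na, ]
  inla(hgt ~ 1 + age + sex + wgt, data = fdgs.plg,
    control.predictor = list(compute = TRUE))
}, mc.cores = 4)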
Finally, Gómez-Rubio, Cameletti, and Blangiardo (2019) describe a promising approach to include new latent effects in the model to impute missing observations in the covariates.
References
Buuren, Stef van. 2018. Flexible Imputation of Missing Data. 2nd ed. Boca Raton, FL: CRC Press.
Cameletti, Michela, Virgilio Gómez-Rubio, and Marta Blangiardo. 2019. “Bayesian Modelling for Spatially Misaligned Health and Air Pollution Data Through the INLA-SPDE Approach.” Spatial Statistics 31: 100353. https://doi.org/10.1016/j.spasta.2019.04.001.
Gómez-Rubio, Virgilio, Michela Cameletti, and Marta Blangiardo. 2019. “Missing Data Analysis and Imputation via Latent Gaussian Markov Random Fields.” arXiv E-Prints, December, arXiv:1912.10981. http://arxiv.org/abs/1912.10981.
Gómez-Rubio, Virgilio, and Håvard Rue. 2018. “Markov Chain Monte Carlo with the Integrated Nested Laplace Approximation.” Statistics and Computing 28 (5): 1033–51.
Hoeting, Jennifer, David Madigan, Adrian Raftery, and Chris Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14: 382–401.
Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data. 2nd ed. New York: Wiley & Sons, Inc.
Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Schonbeck, Y., H. Talma, P. van Dommelen, B. Bakker, S. E. Buitendijk, R. A. Hirasing, and S. van Buuren. 2013. “The World’s Tallest Nation Has Stopped Growing Taller: The Height of Dutch Children from 1955 to 2009.” Pediatric Research 73 (3): 371–77.
Schonbeck, Y., H. Talma, P. van Dommelen, B. Bakker, S. E. Buitendijk, R. A. Hirasing, and S. van Buuren. 2011. “Increase in Prevalence of Overweight in Dutch Children and Adolescents: A Comparison of Nationwide Growth Studies in 1980, 1997 and 2009.” PLoS ONE 6 (11): e27608.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://www.jstatsoft.org/v45/i03/.