Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues that theme and models multiple data generating processes in a single model. The code for this post is available at the github repository. Many useful resources helped me understand these models; some are mentioned below and others are mentioned in the previous post on this topic.

I use the R library flexmix for the sample data and for fitting the model, in addition to Stan and an MCMC approach. The data distribution is shown in the figure below, which suggests that there may be two data generation processes at work here. To start with, we try to find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see if our choice of distribution is appropriate. A previous post on posterior predictive checks covers this topic in more detail and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.

The left panel of the figure shows the density plot for the raw data, and the right panel shows the same data with a smattering of 200 predictive data sets from the fitted distribution – visually it appears to be fine, with the left component slightly higher in some cases. The aspects of the data that I want our distribution to capture are the mean, variance, maximum and minimum values, and the test P-values below suggest that any variation can be explained by sampling variability.

> ivTestQuantities
variance  minimum  maximum     mean 
   0.380    0.155    0.300    0.420 
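For reference, each of these P-values compares the observed test quantity with the same quantity computed on the simulated data sets. A minimal sketch of one common recipe, assuming the 200 simulated data sets are held in the columns of a matrix mDraws (the names here are mine, not the repository's):

## two sided posterior predictive P-value for a test quantity
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  min(left, right)
}
## e.g. the variance test quantity
getPValue(apply(mDraws, 2, var), var(NPreg$yn))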

The first steps are to fit a mixture regression model with 2 components and one predictor using the flexmix package and then to repeat this using Stan.

> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2)

       prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795

The same model is now broken down into parts and fit using Stan – you can also write a log posterior function in R to do this, but I find that the optimiser tends not to explore the parameter space if we track the mixture components as well. However, if we keep the mixture components fixed, it tends to converge to the same point as Stan.
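As an aside, the heart of such a model is the mixture likelihood. A minimal R sketch of the log likelihood for a 2-component normal mixture regression, computed stably with the log-sum-exp trick (the function and argument names are mine, not the repository's):

## log( w*N(y|mu1,s1) + (1-w)*N(y|mu2,s2) ), summed over observations
mixLogLik = function(y, mu1, mu2, sigma1, sigma2, w){
  l1 = log(w) + dnorm(y, mu1, sigma1, log=TRUE)
  l2 = log(1-w) + dnorm(y, mu2, sigma2, log=TRUE)
  m = pmax(l1, l2) # log-sum-exp trick to avoid underflow
  sum(m + log(exp(l1 - m) + exp(l2 - m)))
}
## e.g. with each component mean given by its regression line
# mixLogLik(NPreg$yn, -0.69 + 5.16*NPreg$x, 31.85 + 0.18*NPreg$x, 3.4, 7.8, 0.44)

Stan's built-in log_mix function performs the same stable calculation internally.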

> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=2000.

                   mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul  4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparing the results from the two procedures, i.e. flexmix and the Stan model, shows close agreement between the two methods (note that the component labels are switched: flexmix's Comp.1 corresponds to the Stan model's second component, and vice versa).

Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (current data) values. There are two ways to get these values: either get a predicted value from each component separately, or get one aggregate predicted value by combining the components.
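The values below are posterior means pulled out of the Stan fit; they can be extracted along these lines (a sketch – the column names follow the print(fit.stan) output above, and the exact repository code may differ):

mStan = as.matrix(fit.stan) # matrix of posterior draws
intercepts = colMeans(mStan[, c('mu[1]', 'mu[2]')])
betas = colMeans(mStan[, c('betasMix1[1]', 'betasMix2[1]')])
iMixWeights = round(colMeans(mStan[, c('iMixWeights[1]', 'iMixWeights[2]')]), 2)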

> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1])
> iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with the component-wise fitted values from fit.flex
> f = fitted(fit.flex)
> head(f)
       Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
> head(cbind(iPred1, iPred2))
       [,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101

However, a single aggregate predicted value for the response variable may be more appropriate in certain settings, and requires taking a weighted average of the components (with the weights given by the mixing probabilities).

> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> ## aggregate fitted values from flexmix (assumed extracted along these lines)
> p.agg = predict(fit.flex, aggregate=T)[[1]]
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
> head(dfPredicted)
      stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close

Comparing the aggregate values from the two models shows that both have produced similar results, and the mean squared errors in each case are also similar.

Hierarchical Linear Regression – 2 Level Random Effects Model

Regression is a popular approach to modelling, where a response variable is modelled as a function of certain predictors in order to understand the relations between variables. I fit a linear model in a previous post, using the bread and peace model, and explored various ways of solving the equation.

In this post, I want to fit a 2 level linear model, using the concept of Random Effects modelling. There is a lot of literature available on this subject, and some references are shown below:

[1] Bates, D. M. (2010). lme4: Mixed-Effects Modelling with R.

[2] West, B. T., Welch, K. B., & Gałecki, A. T. (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC.

[3] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. CRC Press.

The code in this post can be seen on the github page here.

The simulated data in this example consist of 30 observations of a normally distributed variable (so in generalised linear modelling speak, the link function is the identity). These 30 observations can be sub-grouped into 5 individual groups. We assume that there is a common relationship between these 5 groups (e.g. 5 machines of the same type, or 5 persons from the same cohort). This grouping of course requires some domain knowledge and common sense. Ref [3] has a nice introductory chapter on hierarchical modelling which is a recommended read – in particular the concept of exchangeability. The data also have one predictor variable.
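To make the structure concrete, data of this shape might be simulated along these lines (a sketch – the parameter values and seed are illustrative assumptions, not the repository's):

set.seed(123)
group = factor(rep(1:5, each=6))         # 30 observations in 5 groups
ranIntercept = rnorm(5, mean=0, sd=1.6)  # group deflections centred on 0
pred = runif(30, 2, 20)
resp = (2 + ranIntercept[group]) + 1.5 * pred + rnorm(30, sd=2)
dfData = data.frame(resp, pred, group)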

The figure shows the data in 3 panels: the top left shows the data at the population level (a higher level, where all observations are part of one big group); the top right shows the same data with the 5 sub-groups coloured differently; the bottom panel shows the 5 groups plotted separately, one per panel (ref [1] has some nice examples of how to plot structured data in various ways). Looking at this figure, we can see that the slope in each group is about equal, but the intercept for each group is different. We would like to account for this in our modelling, treating the group intercepts as randomly distributed around the central intercept.
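A quick way to draw something like the bottom panel, in the style of ref [1] (assuming the lattice package is available):

library(lattice)
## one panel per group, points plus a least squares line in each
xyplot(resp ~ pred | group, data=dfData, type=c('p', 'r'), layout=c(5,1))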

################### first use normal regression and lme4 package
fit.lm = lm(resp ~ pred, data=dfData)
summary(fit.lm)
Call:
lm(formula = resp ~ pred, data = dfData)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3043 -1.6194 -0.4117  1.5824  6.7582 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.28521    0.92127    2.48   0.0194 *  
pred         1.50345    0.08684   17.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.616 on 28 degrees of freedom
Multiple R-squared:  0.9146,    Adjusted R-squared:  0.9115 
F-statistic: 299.7 on 1 and 28 DF,  p-value: < 2.2e-16
# residual se = 2.616

####################################################
#### fit a 2 level model with random intercept

library(lme4)
fit.lme = lmer(resp ~ pred + (1 | group), data=dfData, REML=F)
summary(fit.lme)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: resp ~ pred + (1 | group)
   Data: dfData

     AIC      BIC   logLik deviance df.resid 
   141.3    146.9    -66.6    133.3       26 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5264 -0.7096 -0.2699  0.4328  2.6193 

Random effects:
 Groups   Name        Variance Std.Dev.
 group    (Intercept) 2.664    1.632   
 Residual             3.777    1.943   
Number of obs: 30, groups:  group, 5

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.89979    1.00879   1.883
pred         1.54593    0.06604  23.408

Correlation of Fixed Effects:
     (Intr)
pred -0.594
## random effect or group level sd = 1.632
## population (residual) level sd = 1.943
## implied total sd = sqrt(1.632^2 + 1.943^2) = 2.538,
## close to the residual se of 2.616 from the simple linear model

The interesting numbers to compare are the coefficients for the population (higher level group) in the simple linear model and in the model with random effects – the slope coefficients are about the same, but the intercepts differ between the two models. The second, and perhaps more important, set of numbers is the amount of residual noise left unexplained. The simpler model has a residual standard deviation of 2.6, while the 2 level model has 2 noise terms: a random effect (group level) standard deviation of 1.6 and a remaining population level standard deviation of 1.9 (smaller than in the simpler model). This suggests that you can explain more of the sampling noise in your data by using a random effects model and accounting for the structure of the data.
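As a check, the variance components can be pulled out of the lmer fit and combined; the implied total standard deviation is close to the residual standard error of the simple linear model:

## variance components from the lmer fit
vc = as.data.frame(VarCorr(fit.lme))
vc
## implied total sd of the 2 level model
sqrt(sum(vc$vcov))  ## approx 2.54, vs 2.616 from the simple lm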

Now let's break this model down into parts and write our own models: the first uses an MCMC based approach via Stan, and the second uses an optimisation with resampling based approach. In both cases we are trying to estimate multiple parameters in a parameter space, and we hope to converge to a point.

The code for the Stan model is in the same directory of the github repository, and is structurally very similar to the R code shown below, which defines the log posterior function (see the post on parameter estimation for details).

mylogpost = function(theta, data){
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
  ## mapping variable: maps each group level intercept to its
  ## respective data points at the likelihood level
  groupIndex = data$groupIndex
  ## parameters to track/estimate
  sigmaRan = exp(theta['sigmaRan']) # random effect scale/sd
  sigmaPop = exp(theta['sigmaPop']) # population level sd
  betas = theta[grep('betas', names(theta))] # vector of betas i.e. regression coefficients for population
  iGroupsJitter = theta[grep('ran', names(theta))]# random effects jitters for the group deflections
  
  ## random effect jitter for the population intercept
  # each group contributes a jitter centred on 0
  # population intercept + random jitter gives each group its own intercept
  ivBetaRand = betas[1] + iGroupsJitter
  # expand the group level intercepts to one per observation
  ivIntercept = ivBetaRand[groupIndex] # expand using the mapping variable
  iFitted = as.matrix(mModMatrix[,2:ncol(mModMatrix)]) %*% betas[2:ncol(mModMatrix)]
  iFitted = ivIntercept + iFitted # using identity link so no transformation required
  ## priors
  # sigma priors
  lhp = dunif(sigmaRan, 0, 2, log=T) + dunif(sigmaPop, 0, 5, log=T)
  # random effects prior
  lran = sum(dnorm(iGroupsJitter, 0, sigmaRan, log=T))
  # priors for the betas
  lp = sum(dcauchy(betas, 0, 10, log=T))
  # write the likelihood function
  lik = sum(dnorm(resp, iFitted, sigmaPop, log=T))
  val = lhp + lran + lp + lik
  return(val)
}

Some points about this code snippet:

  • sigmaRan is the standard deviation at the group level. All group deflections are assumed to be drawn from the same distribution centred on zero. This is a general assumption and may not hold if you have different types of groups; if you feel brave, you can assign a different distribution to each type of group.
  • sigmaPop is the population level standard deviation at the likelihood level.
  • I have set the priors for sigmaRan and sigmaPop as uniform on 0 to 2 and 0 to 5 respectively. This is not necessary, but it helps if you are having problems with convergence.
  • The population level regression coefficients are given a zero-centred Cauchy distribution.
  • We are tracking 9 parameters in this optimisation problem: 2 standard deviations, 2 regression coefficients and 5 group level deflections.
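To run the optimiser we need the data list and a start vector for these 9 parameters; a minimal sketch (the names are assumptions chosen to match mylogpost above, and the repository code may differ):

lData = list(resp=dfData$resp,
             mModMatrix=model.matrix(resp ~ pred, data=dfData),
             groupIndex=as.numeric(dfData$group))
## 2 log sd terms, 2 betas and 5 group deflections
start = c(sigmaRan=log(1), sigmaPop=log(1), betas=rep(0, 2), ran=rep(0, 5))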
fit.lap = mylaplace(mylogpost, start, lData)

### lets use the results from laplace and SIR sampling with a t proposal density
### to take a sample
tpar = list(m=fit.lap$mode, var=fit.lap$var*2, df=4)
## get a sample directly and using sir (sampling importance resampling with a t proposal density)
s = sir(mylogpost, tpar, 5000, lData)
apply(s, 2, mean)[-c(1:2)]
    betas1     betas2       ran1       ran2       ran3       ran4       ran5 
 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
exp(apply(s, 2, mean)[1:2])
sigmaPop sigmaRan 
2.082105 1.351142 

We use the sampling importance resampling (SIR) algorithm with a t proposal density to generate a random sample for our parameters, using the mode and covariance matrix calculated earlier with the function mylaplace (sir here is the function from the LearnBayes package).

I show the results from all 4 methods below, and you can see they are pretty similar.

############### you can compare the results from stan and lmer and the logposterior function
# Inference for Stan model: linearRegressionRandomEffects.
# 4 chains, each with iter=5000; warmup=2500; thin=1; 
# post-warmup draws per chain=2500, total post-warmup draws=10000.
# 
#                    mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
# betas[1]           1.775   0.038 1.620  -1.736   0.935   1.841   2.738   4.786  1853 1.000
# betas[2]           1.549   0.001 0.070   1.411   1.503   1.549   1.597   1.685  5649 1.001
# sigmaRan           2.961   0.051 2.471   0.875   1.728   2.414   3.487   8.093  2377 1.001
# sigmaPop           2.094   0.005 0.327   1.563   1.863   2.056   2.280   2.843  4928 1.001
# rGroupsJitter[1]   0.503   0.036 1.623  -2.559  -0.419   0.434   1.326   3.996  2083 1.000
# rGroupsJitter[2]   0.901   0.037 1.635  -2.068  -0.055   0.808   1.733   4.493  1997 1.000
# rGroupsJitter[3]  -0.636   0.036 1.618  -3.838  -1.516  -0.646   0.225   2.813  2036 1.000
# rGroupsJitter[4]   2.140   0.036 1.660  -0.808   1.131   2.013   3.008   5.892  2077 1.000
# rGroupsJitter[5]  -2.448   0.035 1.652  -5.842  -3.378  -2.406  -1.502   0.792  2179 1.000

# > summary(fit.lme)
# Linear mixed model fit by maximum likelihood  ['lmerMod']
# Formula: resp ~ pred + (1 | group)
# Data: dfData
# 
# Random effects:
#   Groups   Name        Variance Std.Dev.
# group    (Intercept) 2.664    1.632   
# Residual             3.777    1.943   
# Number of obs: 30, groups:  group, 5
# 
# Fixed effects:
#   Estimate Std. Error t value
# (Intercept)  1.89979    1.00879   1.883
# pred         1.54593    0.06604  23.408
# 
# Random effect jitters from lme
# > t(ranef(fit.lme)$group)
#              1         2          3       4         5
# (Intercept) 0.3888984 0.7627657 -0.6863616 1.94236 -2.407663

# > fit.lap$mode[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.9037216  1.5456933  0.3897490  0.7787912 -0.7075750  1.8900023 -2.3491993 
# > exp(fit.lap$mode)[1:2]
# sigmaPop sigmaRan 
# 1.805638 1.463412

## SIR sampling - these will slightly change each time you run it
# > apply(s, 2, mean)[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
# > exp(apply(s, 2, mean)[c(1:2)])
# sigmaPop sigmaRan 
# 2.082105 1.351142 

However, in problems with many parameters and groups, the optimisation based approach may not converge, so an optimisation+SIR, Stan or lmer based approach may be required – or a different optimisation library.

Model Quality Scores

In a previous post we calculated some model quality scores. However, I had been struggling to calculate the log likelihood and AIC for a random effects model, due to a mistake I was making in the calculations. Eventually I figured it out with help from another post I found here.

## first write the log predictive density function
## dmnorm, the multivariate normal density, comes from the mnormt package
library(mnormt)
lpd = function(theta, data){
  betas = theta # vector of betas i.e. regression coefficients for population
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
  group = data$group
  sigmaRan = exp(data$sigmaRan) # random effect scale/sd
  sigmaPop = exp(data$sigmaPop) # population level sd
  # calculate fitted value
  iFitted = mModMatrix %*% betas
  ######## perform notational acrobatics
  # construct the variance covariance matrix
  z = model.matrix(resp ~ 0 + group)
  zt = t(z)
  mRan = diag(sigmaRan^2, nrow=nlevels(group), ncol=nlevels(group))
  mRan = z %*% mRan %*% zt
  mPop = diag(sigmaPop^2, nrow=length(resp), ncol=length(resp))
  mCov = mRan + mPop
  ## likelihood function with posterior theta
  return(dmnorm(resp, iFitted, mCov, log=T))
}

lData$group = dfData$group
lData$sigmaRan = log(1.632)
lData$sigmaPop = log(1.943)
theta = fixef(fit.lme)
 
lpd(theta, lData)
[1] -66.63871
## compare with lme
logLik(fit.lme)
'log Lik.' -66.63871 (df=4)

The code above shows that to calculate these scores we need the log predictive density function, and that to calculate the log likelihood correctly we need a multivariate normal density with a covariance matrix that accounts for both the population level and the group level variance.
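With the log likelihood reproduced, the AIC follows directly; the degrees of freedom here are 4 (intercept, slope and the 2 standard deviations), matching the df=4 reported by logLik:

## AIC = -2*logLik + 2*df
-2 * lpd(theta, lData) + 2 * 4
## compare with lme4
AIC(fit.lme)
# both give approximately 141.3, as in the lmer summary above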

I have been looking into calculating WAIC for structured models and doing cross-validation (which is on my todo list) – however, this can get tedious when you have to take the group structure into account. I will try and cover that in a future post.