Hierarchical Linear Regression – 2 Level Random Effects Model

Regression is a popular modelling approach in which a response variable is modelled as a function of one or more predictors, in order to understand the relations between the variables. In a previous post I fitted a simple linear model, using the bread and peace model, and showed various ways to solve the equation.

In this post, I want to fit a 2 level linear model, using the concept of Random Effects modelling. There is a lot of literature available on this subject, and some references are shown below:

[1] Bates, D. M. (2010). lme4: Mixed-effects modelling with R. link

[2] West, B. T., Welch, K. B., & Gałecki, A. T. (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC.

[3] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). Chapman & Hall/CRC.

The code in this post can be seen on the github page here.

The simulated data in this example consists of 30 observations of a Normally distributed response variable (so, in generalised linear modelling terms, the link function is the identity). These 30 observations can be sub-grouped into 5 individual groups. We assume that there is a common relationship between these 5 groups (e.g. 5 machines of the same type, or 5 persons from the same cohort). This grouping of course requires some domain knowledge and common sense. Ref [3] has a nice introductory chapter on hierarchical modelling which is a recommended read, in particular the concept of exchangeability. The data also has one predictor variable. A minimal sketch of how data with this structure could be simulated is shown below.
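This is not the exact simulation used in the post (that code is on the github page); the parameter values below are made up purely to illustrate the structure, with the variable names matching those used in the rest of the post (dfData, resp, pred, group).

## a sketch of simulating grouped data: 5 groups, 6 observations each (30 in total),
## a common slope and group specific intercepts around a central intercept
set.seed(123)
iGroups = 5
iObs = 6
group = factor(rep(1:iGroups, each = iObs))
pred = runif(iGroups * iObs, 1, 20)
ivIntercepts = rnorm(iGroups, mean = 2, sd = 1.5) # group intercepts around a central value
resp = ivIntercepts[as.numeric(group)] + 1.5 * pred + rnorm(iGroups * iObs, 0, 2)
dfData = data.frame(resp, pred, group)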

The figure shows the data in 3 panels: the top left shows the data at the population level (a higher level, where all observations are part of one big group); the top right shows the same data but with the 5 sub-groups coloured differently; the bottom panel shows the 5 groups plotted separately, one per panel (Ref [1] has some nice examples of how to plot structured data in various ways). Looking at this figure, we can see that the slope is about the same in each group, but the intercept differs from group to group. We would like to account for this in our model by treating the group intercepts as randomly distributed around a central intercept.
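The figure itself is produced in the code on github; as a rough sketch, similar views of the data can be drawn with the lattice package (ref [1] uses lattice throughout), assuming dfData as above:

## a sketch of the three views of the data using lattice
library(lattice)
xyplot(resp ~ pred, data = dfData)                    # all observations as one big group
xyplot(resp ~ pred, data = dfData, groups = group,
       auto.key = list(columns = 5))                  # the 5 sub-groups coloured differently
xyplot(resp ~ pred | group, data = dfData,
       type = c('p', 'r'))                            # one panel per group, with a regression line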

################### first use normal regression and lme4 package
fit.lm = lm(resp ~ pred, data=dfData)
summary(fit.lm)
Call:
lm(formula = resp ~ pred, data = dfData)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3043 -1.6194 -0.4117  1.5824  6.7582 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.28521    0.92127    2.48   0.0194 *  
pred         1.50345    0.08684   17.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.616 on 28 degrees of freedom
Multiple R-squared:  0.9146,    Adjusted R-squared:  0.9115 
F-statistic: 299.7 on 1 and 28 DF,  p-value: < 2.2e-16
# residual se = 2.616

####################################################
#### fit a 2 level model with random intercept

library(lme4)
fit.lme = lmer(resp ~ pred + (1 | group), data=dfData, REML=F)
summary(fit.lme)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: resp ~ pred + (1 | group)
   Data: dfData

     AIC      BIC   logLik deviance df.resid 
   141.3    146.9    -66.6    133.3       26 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5264 -0.7096 -0.2699  0.4328  2.6193 

Random effects:
 Groups   Name        Variance Std.Dev.
 group    (Intercept) 2.664    1.632   
 Residual             3.777    1.943   
Number of obs: 30, groups:  group, 5

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.89979    1.00879   1.883
pred         1.54593    0.06604  23.408

Correlation of Fixed Effects:
     (Intr)
pred -0.594
## random effect or group level sd = 1.632
## population level sd = 1.943
## variances add: total variance = 2.664 + 3.777 = 6.441, i.e. total sd = sqrt(6.441) ≈ 2.54

The first numbers I would like to look at are the coefficients for the population (higher level group) from the simple linear model and from the random effects model: the slope coefficients are about the same, but the intercepts differ between the two models. The second, and perhaps more important, set of numbers is the amount of residual noise left unexplained. The simple model has a residual standard deviation of about 2.6, while the 2 level model has two noise terms: a random effect (group level) standard deviation of about 1.6 and a remaining population level standard deviation of about 1.9, which is smaller than that of the simple model. This suggests that you can explain more of the sampling noise in your data by using a random effects model and accounting for the structure of the data.
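To put a rough number on how much of the variability the grouping captures, we can express the group level variance as a fraction of the total variance, using the variance components from the lmer output above (a small back-of-the-envelope calculation, not part of the original analysis):

## group variance as a fraction of total variance, from the lmer variance components
2.664 / (2.664 + 3.777)
[1] 0.4136004

So roughly 40% of the total variance is attributable to differences between the groups.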

Now let's break this model down into parts and write our own models: the first is an MCMC based approach using Stan, and the second uses optimisation followed by a resampling based approach. In both cases we are trying to estimate multiple parameters in a parameter space, and we hope the estimates converge to a point.

The code for the Stan model is in the same directory of the repository on github, and is structurally very similar to the R code shown below, which defines the log posterior function (see the post on parameter estimation for details).

mylogpost = function(theta, data){
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
##### this mapping variable maps each group level intercept to each data point
#### at the likelihood level
  groupIndex = data$groupIndex  ## mapping variable to map each random effect to its respective response variable
  ## parameters to track/estimate
  sigmaRan = exp(theta['sigmaRan']) # random effect scale/sd
  sigmaPop = exp(theta['sigmaPop']) # population level sd
  betas = theta[grep('betas', names(theta))] # vector of betas i.e. regression coefficients for population
  iGroupsJitter = theta[grep('ran', names(theta))]# random effects jitters for the group deflections
  
  ## random effect jitter for the population intercept
  # each group contributes a jitter centered on 0
  # population intercept + random jitter
  ivBetaRand = betas[1] + iGroupsJitter
  # create a vector of intercepts, one unique intercept for each group (random effect)
  ivIntercept = ivBetaRand[groupIndex] # expand this intercept using the mapping variable
  iFitted = as.matrix(mModMatrix[,2:ncol(mModMatrix)]) %*% betas[2:ncol(mModMatrix)]
  iFitted = ivIntercept + iFitted # using identity link so no transformation required
  ## priors
  # sigma priors
  lhp = dunif(sigmaRan, 0, 2, log=T) + dunif(sigmaPop, 0, 5, log=T)
  # random effects prior
  lran = sum(dnorm(iGroupsJitter, 0, sigmaRan, log=T))
  # priors for the betas
  lp = sum(dcauchy(betas, 0, 10, log=T))
  # write the likelihood function
  lik = sum(dnorm(resp, iFitted, sigmaPop, log=T))
  val = lhp + lran + lp + lik
  return(val)
}

Some points about this code snippet:

  • sigmaRan is the standard deviation for the group level deflections. All groups are assumed to share the same noise distribution, centred on zero. This is a general assumption and may not hold if you have different types of groups; if you feel brave, you can assign a different distribution to each type of group.
  • sigmaPop is the population level standard deviation, used at the likelihood level.
  • I have set the priors for sigmaRan and sigmaPop as uniform on 0 to 2 and 0 to 5 respectively. This is not strictly necessary, but it can help if you are having problems with convergence.
  • The population level regression coefficients are given a zero centred Cauchy prior.
  • We are tracking 9 parameters in this optimisation problem: 2 standard deviations, 2 regression coefficients and 5 group level deflections. A sketch of how the data list and starting values for these parameters might be set up is shown below, before the call to mylaplace.
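To run the optimisation we need the data as a list and a named vector of starting values whose names match those used inside mylogpost (sigmaPop, sigmaRan, betas, ran). The actual setup is in the github code; a minimal sketch, assuming dfData as before, could be:

## data list for mylogpost (a sketch; the exact setup is in the github code)
lData = list(resp = dfData$resp,
             mModMatrix = model.matrix(resp ~ pred, data = dfData),
             groupIndex = as.numeric(dfData$group))

## starting values: 2 log-scale standard deviations, 2 betas, 5 group deflections
start = c(sigmaPop = log(1), sigmaRan = log(1),
          betas = rep(0, times = 2), ran = rep(0, times = 5))

With these in place we can run the Laplace approximation and draw a sample. The function mylaplace is the optimisation wrapper from the earlier post on parameter estimation, and the sir function used below for sampling importance resampling matches the one in the LearnBayes package.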
fit.lap = mylaplace(mylogpost, start, lData)

### let's use the results from laplace and SIR sampling with a t proposal density
### to take a sample
tpar = list(m=fit.lap$mode, var=fit.lap$var*2, df=4)
## get a sample directly and using sir (sampling importance resampling with a t proposal density)
s = sir(mylogpost, tpar, 5000, lData)
apply(s, 2, mean)[-c(1:2)]
    betas1     betas2       ran1       ran2       ran3       ran4       ran5 
 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
exp(apply(s, 2, mean)[1:2])
sigmaPop sigmaRan 
2.082105 1.351142 

We use the sampling importance resampling (SIR) algorithm with a multivariate t proposal density to generate a random sample of our parameters, using the mode and covariance matrix calculated earlier with the function mylaplace.

I show the results from all 4 methods below, and you can see they are pretty similar.

############### you can compare the results from stan and lmer and the logposterior function
# Inference for Stan model: linearRegressionRandomEffects.
# 4 chains, each with iter=5000; warmup=2500; thin=1; 
# post-warmup draws per chain=2500, total post-warmup draws=10000.
# 
#                    mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
# betas[1]           1.775   0.038 1.620  -1.736   0.935   1.841   2.738   4.786  1853 1.000
# betas[2]           1.549   0.001 0.070   1.411   1.503   1.549   1.597   1.685  5649 1.001
# sigmaRan           2.961   0.051 2.471   0.875   1.728   2.414   3.487   8.093  2377 1.001
# sigmaPop           2.094   0.005 0.327   1.563   1.863   2.056   2.280   2.843  4928 1.001
# rGroupsJitter[1]   0.503   0.036 1.623  -2.559  -0.419   0.434   1.326   3.996  2083 1.000
# rGroupsJitter[2]   0.901   0.037 1.635  -2.068  -0.055   0.808   1.733   4.493  1997 1.000
# rGroupsJitter[3]  -0.636   0.036 1.618  -3.838  -1.516  -0.646   0.225   2.813  2036 1.000
# rGroupsJitter[4]   2.140   0.036 1.660  -0.808   1.131   2.013   3.008   5.892  2077 1.000
# rGroupsJitter[5]  -2.448   0.035 1.652  -5.842  -3.378  -2.406  -1.502   0.792  2179 1.000

# > summary(fit.lme)
# Linear mixed model fit by maximum likelihood  ['lmerMod']
# Formula: resp ~ pred + (1 | group)
# Data: dfData
# 
# Random effects:
#   Groups   Name        Variance Std.Dev.
# group    (Intercept) 2.664    1.632   
# Residual             3.777    1.943   
# Number of obs: 30, groups:  group, 5
# 
# Fixed effects:
#   Estimate Std. Error t value
# (Intercept)  1.89979    1.00879   1.883
# pred         1.54593    0.06604  23.408
# 
# Random effect jitters from lme
# > t(ranef(fit.lme)$group)
#              1         2          3       4         5
# (Intercept) 0.3888984 0.7627657 -0.6863616 1.94236 -2.407663

# > fit.lap$mode[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.9037216  1.5456933  0.3897490  0.7787912 -0.7075750  1.8900023 -2.3491993 
# > exp(fit.lap$mode)[1:2]
# sigmaPop sigmaRan 
# 1.805638 1.463412

## SIR sampling - these will change slightly each time you run it
# > apply(s, 2, mean)[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
# > exp(apply(s, 2, mean)[c(1:2)])
# sigmaPop sigmaRan 
# 2.082105 1.351142 

However, in problems with many parameters and groups the optimisation based approach may not converge, so an optimisation+SIR, Stan or lmer based approach may be required, or a different optimisation library.

Model Quality Scores

In a previous post we calculated some model quality scores. However, I had been struggling to calculate the log likelihood and AIC for a random effects model, due to a mistake I was making in the calculations. Eventually I figured it out with help from another post I found here.

## first write the log predictive density function
library(mnormt) # provides dmnorm, used below for the multivariate normal density
lpd = function(theta, data){
  betas = theta # vector of betas i.e. regression coefficients for population
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
  group = data$group
  sigmaRan = exp(data$sigmaRan) # random effect scale/sd
  sigmaPop = exp(data$sigmaPop) # population level sd
  # calculate fitted value
  iFitted = mModMatrix %*% betas
  ######## perform notational acrobatics
  # construct the variance covariance matrix
  z = model.matrix(resp ~ 0 + group)
  zt = t(z)
  mRan = diag(sigmaRan^2, nrow=nlevels(group), ncol=nlevels(group))
  mRan = z %*% mRan %*% zt
  mPop = diag(sigmaPop^2, nrow=length(resp), ncol=length(resp))
  mCov = mRan + mPop
  ## likelihood function with posterior theta
  return(dmnorm(resp, iFitted, mCov, log=T))
}

lData$group = dfData$group
lData$sigmaRan = log(1.632)
lData$sigmaPop = log(1.943)
theta = fixef(fit.lme)
 
lpd(theta, lData)
[1] -66.63871
## compare with lme
logLik(fit.lme)
'log Lik.' -66.63871 (df=4)

The code above shows that, to calculate these scores, we need the log predictive density function; and to calculate the log likelihood correctly we need a multivariate normal density with a covariance matrix that accounts for both the population level variance and the group level variance. With the log likelihood in hand, the AIC follows in the usual way, as sketched below.
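As a quick check (a small calculation using the standard AIC formula with k = 4 estimated parameters: 2 fixed effects and 2 variance terms), the value computed from this log likelihood matches the AIC of 141.3 reported by lmer earlier:

## AIC = 2*k - 2*logLik with k = 4
2*4 - 2*lpd(theta, lData)
[1] 141.2774
## same value directly from the lme4 fit
AIC(fit.lme)
[1] 141.2774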

I have been looking into calculating WAIC for structured models or doing cross-validation (which is on my to-do list); however, this can get tedious when you have to take the structure of the groups into account. I will try to cover that in a post sometime in the future.
