Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues that theme and tries to capture multiple data generating processes in a single model. The code for this post is available at the github repository. Many useful resources helped me understand this model; some are mentioned below and others in the previous post on this topic.

I use the R library flexmix for the sample data and for fitting the model, in addition to a Stan/MCMC approach. The data distribution is shown in the figure below, and looking at it suggests that there may be two data generation processes at work here. To start, we try to find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see if our choice of distribution is appropriate. A previous post on posterior predictive checks covers this topic in more detail and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.

The left panel of the figure shows the density plot for the raw data, and the right panel shows the same data with a smattering of 200 predictive data sets from the fitted distribution – visually it appears to be fine, with the left component slightly higher in some cases. The aspects of the data that I want our distribution to capture are the mean, variance, maximum and minimum values, and the test p-values suggest that any variation can be explained by sampling variability.

> ivTestQuantities
variance  minimum  maximum     mean 
   0.380    0.155    0.300    0.420 
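The check itself is straightforward to sketch: draw replicated data sets from the fitted mixture and compare a test quantity with its observed value. The mixture parameters and data below are illustrative stand-ins, not the fitted values from this post.

```r
## posterior predictive check sketch for one test quantity (the variance);
## parameter values here are made-up stand-ins, not the fitted ones
set.seed(123)
obs = c(rnorm(100, 0, 3.4), rnorm(100, 32, 7.8))  # stand-in for the raw data
w = c(0.44, 0.56); m = c(0, 32); s = c(3.4, 7.8)  # assumed mixture parameters
T.obs = var(obs)
## 200 replicated data sets from the assumed mixture, variance of each
T.rep = replicate(200, {
  comp = sample(1:2, length(obs), replace=TRUE, prob=w)
  var(rnorm(length(obs), m[comp], s[comp]))
})
## p-value: proportion of replicated variances at least as large as observed
p.var = mean(T.rep >= T.obs)
```

A p-value near 0 or 1 would flag that the distribution fails to capture that aspect of the data; mid-range values, like those in the table above, indicate sampling variability.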

The first steps are to fit a mixture regression model with 2 components and one predictor using the flexmix package and then to repeat this using Stan.

> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

flexmix(formula = yn ~ x, data = NPreg, k = 2)

       prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795

The same model is now broken down into parts and fit using Stan – you can also write a log posterior function in R to do this, but I find that the optimiser tends not to explore the parameter space well if we track the mixture weights as well. However, if we keep the mixture weights fixed, it tends to converge to the same point as Stan.
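A minimal sketch of what such an R log likelihood with fixed mixture weights might look like follows; the data, function names and starting values here are all hypothetical stand-ins for illustration.

```r
## sketch: mixture regression log likelihood with the mixture weights held
## fixed; simulated data and names are hypothetical, chosen to mimic NPreg
set.seed(2017)
NPreg.sim = data.frame(x = runif(200, 0, 10))
comp = sample(1:2, 200, replace=TRUE, prob=c(0.44, 0.56))
NPreg.sim$yn = ifelse(comp == 1,
                      -0.7 + 5.2*NPreg.sim$x + rnorm(200, 0, 3.4),
                      31.8 + 0.2*NPreg.sim$x + rnorm(200, 0, 7.8))

lpMixFixed = function(theta, data, w = c(0.44, 0.56)){
  mu = theta[1:2]; betas = theta[3:4]; sigma = exp(theta[5:6])
  lik1 = dnorm(data$yn, mu[1] + betas[1]*data$x, sigma[1])
  lik2 = dnorm(data$yn, mu[2] + betas[2]*data$x, sigma[2])
  ll = sum(log(w[1]*lik1 + w[2]*lik2))  # log of the weighted mixture density
  if (!is.finite(ll)) ll = -1e10        # guard against numerical underflow
  ll
}

## maximise (fnscale = -1) from sensible starting values
fit.op = optim(c(0, 30, 5, 0, log(3), log(8)), lpMixFixed, data = NPreg.sim,
               method = 'BFGS', control = list(fnscale = -1))
```

With the weights fixed, a general purpose optimiser recovers coefficients close to the generating values, consistent with the convergence described above.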

> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=2000.

                   mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul  4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparing the results from the two procedures, i.e. flexmix and the Stan model, suggests good agreement between the two methods.

Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (on current data) values. There are two ways to get these: either get a predicted value from each component, or get one aggregate predicted value across components.

> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1])
> iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with fit.flex
> head(f)
       Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
> head(cbind(iPred1, iPred2))
       [,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101

However, getting one aggregate predicted value for the response variable may be more appropriate in certain settings, and requires us to take a weighted average of the components (the weights determined by the mixing probabilities).

> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
> head(dfPredicted)
      stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close

Comparing the aggregate values from the two models shows that both have produced similar results, and the mean squared errors in each case are also similar.

Parameter Estimation

In statistical or mathematical models our aim is to look at the data and estimate the parameters, along with the uncertainty of those estimates. Generally speaking, looking at a data set, we wish to choose a likelihood/noise/sampling distribution that fits the data. A distribution requires some parameters, Θ; e.g. a normal distribution (a very common error/noise distribution in practice) has 2 parameters, the mean (µ) and the standard deviation (σ). In a Bayesian setting we will perhaps put some constraints on these parameters, which are called priors, and eventually the model will (hopefully) converge to a point in this parameter space (driven by the constraints from the priors and the data). For single- and two-parameter models, under the constraints of the data and the priors, the model parameters and their uncertainty can be estimated using a variety of methods. I try to show a few of these methods along with R code.
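As a toy illustration of the idea (with made-up numbers): the likelihood scores how plausible the data are under candidate parameter values, and estimation amounts to finding the region of parameter space the data support.

```r
## toy illustration: a normal log likelihood evaluated at two candidate
## parameter settings for simulated data (all values made up)
set.seed(7)
y = rnorm(50, mean = 10, sd = 2)
ll = function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))
ll(10, 2)   # candidate near the truth
ll(8, 2)    # candidate further away gets a lower log likelihood
```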

The examples and methodological details, including equations can be seen in

[1] Albert, J. (2009). Bayesian Computation with R. Springer.

[2] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2008). Bayesian Data Analysis. Chapman & Hall/CRC Texts in Statistical Science.

The R code for these examples can be found here in the github repository.

The data we use here are marathon running times from Ref [1], available in the LearnBayes package.

Analytical Estimation

We first use an analytical approach to estimate the parameters, adding some simulation to estimate the uncertainty. Details and equations can be seen around pages 62-68 of Ref [2].

### first we will use a conjugate analysis and calculate the parameters
### for the data analytically as shown in Gelman 2008 - see page 62 
### also see page 64 of bayesian computation with R
## We are not using any prior information, but to make this function more general 
## I am adding some prior information with 0s
## look at page 68 of Bayesian Data Analysis (Gelman) for the formulas
## (the function name here is arbitrary; the original was lost)
simPost = function(dat.grp){
  # prior variance
  sigma.0 = 0
  # prior observations
  k.0 = 0
  # prior degrees of freedom
  v.0 = k.0 - 1
  # prior mean
  mu.0 = 0
  # calculate conjugate posterior
  # number of observations in the data
  n = length(dat.grp)
  # prior observations + data points = posterior number of observations
  k.n = k.0 + n
  # posterior degrees of freedom
  v.n = v.0 + n
  # mean and sd for the data
  xbar = mean(dat.grp)
  s = sd(dat.grp)
  # posterior mean: weighted average of prior mean and data mean
  mu.n = (k.0/k.n * mu.0) + (n/k.n * xbar)
  # posterior variance
  sigma.n = (( v.0*sigma.0 ) + ( (n-1)*(s^2) ) + ( (k.0*n/k.n)*((xbar-mu.0)^2) )) / v.n
  #post.scale = ((prior.dof * prior.scale) + (var(dat.grp) * (length(dat.grp) - 1))) / post.dof
  ## simulate posterior variance and conditional posterior mean
  sigma = (sigma.n * v.n)/rchisq(1000, v.n)
  mu = rnorm(1000, mu.n, sqrt(sigma)/sqrt(k.n))
  return(list(mu=mu, var=sigma))
}

We can call this function to simulate the posterior mean and variance, i.e. the parameter values that fit within the parameter space under the constraints of the data and priors.
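With all the prior terms set to zero, the posterior above reduces to the familiar non-informative closed form, which a self-contained sketch can illustrate on made-up running times (the data here are simulated stand-ins, not the marathon data):

```r
## minimal self-contained version of the conjugate simulation, with
## non-informative priors and made-up running times
set.seed(42)
dat = rnorm(20, 277, 60)                     # stand-in for the marathon data
n = length(dat); s2 = var(dat)
## posterior variance: scaled inverse chi-squared draws
sigma = (n - 1) * s2 / rchisq(1000, n - 1)
## conditional posterior for the mean, given each simulated variance
mu = rnorm(1000, mean(dat), sqrt(sigma / n))
```

The spread of the simulated mu and sigma values is the uncertainty in the estimates.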

Grid Approximation

Another useful way to estimate the parameters, when analytical formulas are not available or practical, is to use a grid-based approximation. For a grid approximation the trick is first to define a function that returns one value and accepts the parameters and data that need tracking. The simplest such function is the likelihood, but we follow the strategy from [1] and [2] and use the log posterior to define the function.

## define a log posterior function
lp = function(theta, data){
  s = theta[2]
  m = theta[1]
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, sqrt(s), log=T))
  # flat prior; a constant does not change the shape of the posterior
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

The function accepts two arguments: theta, a vector of the parameters that need tracking/calculating, and the data. Together the data and theta define a parameter space within which we can ‘move’ and estimate the constrained/fitted theta. The other thing you may notice is that we are doing everything on the log scale, so this function returns the log posterior, or the log likelihood if you do not use any priors.

## define a discrete grid of parameters
Mu = seq(250, 302, by=0.5)
Sigma2 = seq(500, 9000, by=10)

We define the grid with a certain spacing, which can be adjusted depending on how much accuracy we want in our estimates. Once the grid is defined, one can use a double loop, or a shortcut like the outer command, to create a matrix where each cell contains the value of the function at that combination of grid points.

## calculate the log posterior at each grid point 
lm = matrix(NA, nrow=length(Mu), ncol = length(Sigma2))
for (r in seq_along(Mu)){
  for (c in seq_along(Sigma2)){
    s = c('mu'=Mu[r], 'sigma2'=Sigma2[c])
    lm[r, c] = lp(s, lData)
  }
}

# convert from log posterior 
# subtract max before exponentiating
lm = lm-max(lm)
lm = exp(lm)
# set total prob of grid to 1
lm = lm/sum(colSums(lm))

lm is the grid matrix that is filled inside this double loop. Once the matrix is filled, we follow the strategies from [1] and [2]: subtract the maximum before exponentiating (converting back from the log scale), and then set the total grid probability to 1.
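The double loop can also be replaced by the outer shortcut mentioned earlier. A self-contained sketch (with a toy log posterior and made-up data, so it runs on its own):

```r
## filling the grid without an explicit double loop, using outer();
## the toy log posterior and data here are made up for illustration
set.seed(1)
lDataToy = list(time = rnorm(30, 277, 60))
lpToy = function(theta, data)
  sum(dnorm(data$time, theta[1], sqrt(theta[2]), log = TRUE))
Mu = seq(250, 302, by = 0.5)
Sigma2 = seq(500, 9000, by = 10)
## outer expands the grid; Vectorize lets the scalar function accept vectors
lgrid = outer(Mu, Sigma2, Vectorize(function(m, s2) lpToy(c(m, s2), lDataToy)))
lgrid = exp(lgrid - max(lgrid))   # back from the log scale, safely
lgrid = lgrid / sum(lgrid)        # total grid probability = 1
```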

Samples can be generated from this matrix as shown in the code below.

## take samples from this distribution
pSigma2 = colSums(lm)
pMu = rowSums(lm)

## add a uniform jitter centered at zero and half of grid spacing to make distribution continuous
sig2Sample = runif(1000, -1*10/2, 10/2) +  sample(size = 1000, Sigma2, replace = T, prob = pSigma2)
# for each sampled value of sig2Sample, get the posterior mu, conditioned on sig2Sample
muSample = rnorm(1000, mean(time), sqrt(sig2Sample)/sqrt(length(time)))

Optimization Based Approach

This time we explore the surface of the function using an optimizer, and calculate the variation using the Hessian. As usual, the technical details can be found in Ref [1].

Start by defining the log posterior function as before; the main difference this time is that we rescale the standard deviation parameter onto the log scale.

lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, s, log=T))
  # flat prior; a constant does not change the shape of the posterior
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(time), 'sigma'=log(var(time)))
lData = list('time'=time)
lp2(start, lData)

## try the laplace function from LearnBayes
fit = laplace(lp2, start, lData)
se = sqrt(diag(fit$var))

## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(time), exp(sigSample.op)/sqrt(length(time)))

I prefer the optimization based approach for multiparameter problems, and it is simpler to implement as long as you can provide sensible starting values. There are a couple of other methods I will cover at some point, such as rejection sampling and importance sampling.

MCMC using RStan and Stan

The final method I will cover briefly is doing this in Stan, using an MCMC based approach. The model is very similar to the one defined in the function earlier, but is written in the Stan modelling language syntax.

data {
    int<lower=1> Ntotal; // number of observations
    real y[Ntotal]; // observed data
}
parameters {
    real mu; // posterior mean
    real<lower=0> sig; // posterior sd
}
model {
  y ~ normal(mu, sig);
}

I call this model from R using the rstan library, and extract the simulated data points from the resulting Stan fit object.

Comparing All the Estimates

I plot a contour plot of the log posterior function and smatter the simulated data points from each method over it; they all agree pretty well. I also show density plots comparing the estimates of the posterior means and variances, and their uncertainty, under each method.