Plausible Reasoning for Scientific Problems: Belief Driven by Priors and Data.

Plausible reasoning requires constructing rational arguments by use of syllogisms, and their analysis by deductive and inductive logic. Using this method of reasoning, and expressing our beliefs in a scientific hypothesis numerically using probability theory, is one of my interests. In this post I try to condense the material from the first 4 chapters of [1] – starting with a very brief description of syllogisms, logic, probabilities and sampling distributions, and finishing with a problem of scientific hypothesis testing. I have used material from various sources, and some of it links with previous posts on model comparisons and mixture models. The two books that can be looked at for more details are:

[1] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

[2] Lavine, M. (2013). Introduction to Statistical Thought. Retrieved from


Syllogisms are systematic ways of presenting logical arguments, and appeal to our intuitive reasoning abilities to draw conclusions. Usually we have two premises, a major and a minor, which lead to a conclusion.

Strong Syllogism

if\; A\; is\; true,\; then\; B\; is\; true

\frac{A\;is\; true }{\therefore B\; is\; true}

In the contrapositive form:

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; false }{\therefore A\; is\; false}

Weak Syllogism

In common scientific problems, we have to deal with weaker syllogisms. Following the example from [1]:

A ≡ It will rain at 10 am at the latest

B ≡ It will be cloudy before 10 am

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; true }{\therefore A\; becomes\; more\; plausible}

The major premise, if A then B, creates a logical environment in which B is a logical consequence of A (not necessarily a physical causal consequence). Verification of the consequences of A (in this logical environment) increases our confidence in A. Another form can be expressed as:

\frac{A\;is\; false }{\therefore B\; becomes\; less\; plausible}

One of the reasons for B being true has been eliminated, hence we feel less confident about B.

Boolean Algebra

The ideas of plausible reasoning can be represented as symbolic logic using Boolean algebra. The symbols and operations we will use are

Symbol   Name             Other Names
AB       Logical Product  Conjunction
A+B      Logical Sum      Disjunction

Given two propositions A and B, one is true if and only if the other is true – i.e. they have the same truth value. So if A is a necessary and sufficient condition for B, they are logically equivalent propositions.

Some propositions and their meanings include:

\bar A \equiv A\; is\; false \newline The\; relation\; between\; A\; and\; \bar A\; is\; reciprocal: \newline A \equiv \bar A\; is\; false

The relationship between two propositions can be expressed as an implication

A \Rightarrow B \newline A\bar B \; is \; false \newline (\bar A + B) \; is \; true \newline A = AB \newline if \;A \; is \; true \; then \; B \; is \; true\newline if \;B \; is \; false \; then \; A \; is \; false\newline if \;A \; is \; false, \; it \; says \; nothing \; about \; B\newline if \;B \; is \; true, \; it \; says \; nothing \; about \; A\newline

Quantitative Rules

These rules have been formally derived in Chapter 2 of Ref [1], which is a recommended read. I will just mention the rules briefly.

Product Rule

AB | C : A and B are true given C; broken down into parts, it can be written in two equivalent ways:

  • B | C is true
  • accepting B as true, decide that A is true: A | BC

or

  • A | C is true
  • accepting A as true, decide that B is true: B | AC

In functional form this can be written as: (AB | C) = F[(B|C), (A|BC)]

With 3 propositions, like (ABC | D), we can break the problem down into parts, treating e.g. BC as a single proposition:

  1. (ABC | D)
  2. = (BC | D) (A | BCD)
  3. = (C | D) (B | CD) (A | BCD)

Sum Rule

The logical product A \bar A is always false, while the logical sum A + \bar A is always true.

Primitive sum rule: (A|B)+(\bar A| B) = 1

Extended or generalised sum rule: (A+B|C) = (A|C)+(B|C)-(AB |C)
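Since these probabilities behave like long-run frequencies, the generalised sum rule can be checked numerically – a minimal sketch with a simulated joint sample (the propositions and probability values are my own toy choices):

```r
## numerical check of the generalised sum rule:
## P(A+B|C) = P(A|C) + P(B|C) - P(AB|C), estimated as frequencies
set.seed(42)
n = 1e5
A = rbinom(n, 1, 0.3) == 1
B = (rbinom(n, 1, 0.4) == 1) | A   # make B overlap with A so AB is not rare
lhs = mean(A | B)                  # P(A+B)
rhs = mean(A) + mean(B) - mean(A & B)
c(lhs = lhs, rhs = rhs)            # agree, since the rule is an identity
```

The two values agree because the rule holds pointwise for the indicator variables, before we even average.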

Qualitative Properties

The statements of Logic can be expressed in the form of product and sum rules.

A \Rightarrow B \newline C \;is \;the \;major \;premise \newline C \equiv A \Rightarrow B \newline if \;A \;is \;true \;then \;B \;is \;true \newline p(B|AC) = \frac{p(AB|C)}{p(A|C)}\newline \newline if \;B \;is \;false \;then \;A \;is \;false \newline p(A\mid \bar B C) = \frac{p(A\bar B \mid C)}{p(\bar B \mid C)}\newline \newline if \;B \;is \;true \;then \;A \;becomes \;more \;plausible \newline p(A\mid B C) = p(A \mid C)\frac{p(B \mid AC)}{p(B \mid C)}\newline \newline if \;A \;is \;false \;then \;B \;becomes \;less \;plausible \newline p(B\mid \bar A C) = p(B \mid C)\frac{p(\bar A \mid BC)}{p(\bar A \mid C)}\newline
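These qualitative properties can be illustrated with a quick numerical example (the probability values below are my own, purely for illustration): when A makes B likely, observing B raises the plausibility of A.

```r
## if B is true, A becomes more plausible: p(A|BC) = p(A|C) p(B|AC) / p(B|C)
pA    = 0.3        # p(A|C), prior plausibility of A
pBgA  = 0.9        # p(B|AC), A makes B very likely
pBgnA = 0.4        # p(B | not-A, C)
pB   = pA * pBgA + (1 - pA) * pBgnA   # p(B|C), by the sum and product rules
pAgB = pA * pBgA / pB                 # p(A|BC)
c(prior = pA, posterior = round(pAgB, 3))  # posterior 0.491 > prior 0.3
```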

Elementary Sampling Theory

The rules we have available are the product and sum rules described earlier. Using the principle of indifference (just one of many principles): if B is the background information, and (H_{1}, H_{2}, .. H_{N}) are equally likely, mutually exclusive and exhaustive hypotheses, then

P(H_{i}\mid B) = \frac{1}{N} \; \; 1 \leq i \leq N

As an example:

  • B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
  • Ri ≡ Red ball on ith draw.
  • Wi ≡ White ball on ith draw.

For the first draw,

P(R_{1}\mid B) = \frac{M}{N} \newline P(W_{1}\mid B) = 1-\frac{M}{N}

These probability values should not be confused with physical properties of the urn; they represent a state of knowledge or information – before any actual ball is drawn. This state of knowledge changes when a new question is asked – what is the probability of red on the first two draws? (use the product rule)

P(R_{1} R_{2} \mid B)=P(R_{1} \mid B)P(R_{2} \mid R_{1}B) \newline P(R_{1} R_{2} \mid B)=\frac{M}{N}\frac{M-1}{N-1}
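This can also be checked by simulation – a small sketch (the values of N, M and the number of simulations are my own choices):

```r
## simulate drawing 2 balls without replacement and compare with M/N * (M-1)/(N-1)
set.seed(11)
N = 10; M = 4                       # 10 balls, 4 of them red
urn = c(rep(1, M), rep(0, N - M))   # 1 = red, 0 = white
bothRed = replicate(1e5, sum(sample(urn, 2)) == 2)  # sample() draws without replacement
mean(bothRed)                       # simulated estimate
(M / N) * ((M - 1) / (N - 1))       # exact value = 2/15
```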

Working this way (you can see the full derivation in [1]) we can look at any sequence of draws – and if the question becomes: how many ways can two red balls be drawn in three draws, then we are looking at the multiplicity of this event: R_{1}R_{2}W_{3}, R_{1}W_{2}R_{3}, W_{1}R_{2}R_{3}

which can be calculated using the binomial coefficient \binom{3}{2} = 3 (in general, \binom{n}{r} = \frac{n!}{r!(n-r)!}).


So the question can be posed as:

  • B ≡ Urn with N balls, with M red balls and (N-M) white balls.
  • A ≡ ‘r red balls in n draws, in any order’
  • P(A|B) ≡ P(r | N, M, n) ≡ Hypergeometric distribution (if we sample without replacement)
  • P(A|B) ≡ P(r | N, M, n) ≡ Binomial distribution (if we sample with replacement)

These probability distributions are called sampling distributions or direct probabilities: given some hypothesis (e.g. the contents M, N of the urn), what is the probability that we shall obtain some specified data D (e.g. some sequence of red and white balls)? These sampling distributions make predictions about potential observations – and if the correct hypothesis is known, the predictions and observed data will agree closely.
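Both sampling distributions are built into R, so the with/without replacement cases can be compared directly (the parameter values are my own illustration):

```r
## P(r | N, M, n): r red balls in n draws from an urn with M red balls out of N
N = 20; M = 8; n = 5; r = 0:n
p.hyper = dhyper(r, m = M, n = N - M, k = n)   # without replacement
p.binom = dbinom(r, size = n, prob = M / N)    # with replacement
round(rbind(p.hyper, p.binom), 3)  # similar shapes; the hypergeometric is narrower
```

The hypergeometric has the smaller variance, since each draw without replacement carries information about the remaining contents of the urn.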

As another example for calculating probabilities of events [2] let:

  • X ≡ set of outcomes, e.g. rolling a six-sided die {1, 2, 3, 4, 5, 6}
  • Y ≡ set of subsets of X
  • μ : Y → [0, 1], i.e. the function μ has domain Y and maps into the interval 0 to 1.

We can determine the function μ using simulation, logic or experience. Following the example from [2] (the section on Basic Probability) – we want to calculate the probability of winning on the come-out roll in the game of craps (we need to roll a 7 or 11 using two six-sided dice).

P(win on come-out roll) = P(7) + P(11) = 6/36 + 2/36 ≈ 0.222

The example in the book [2] shows how to calculate this mathematically or via simulation.

win = function() {
  r = sample(1:6, 2, replace=T)
  if (sum(r) == 7 || sum(r) == 11) return(1) else return(0)
}
mean(replicate(1000, win()))
[1] 0.224

Elementary Hypothesis Testing:

In a general scientific problem, we already have the data and a set of hypotheses, and want to decide which hypothesis is more likely given the information. The problem can be written down as propositions:

X = prior \; information \newline H = hypothesis \newline D = data

The joint probability is expanded using the product rule:

P(HD\mid X) = P(D \mid X) P(H \mid DX) = P(H \mid X) P(D \mid HX)

Performing some acrobatics, the equation becomes:

P(H \mid DX) = P(H \mid X) \frac{P(D \mid HX)}{P(D \mid X)} \;\; .. Eq(1)

This is of course the famous Bayes theorem, and the left side of equation 1 is the posterior probability – I tend to think of this (as most times your hypothesis will consist of parameters) as the point in parameter space where your parameters converge, as they are constrained by the priors and the data. The factor P(D|HX) in equation 1 is a function with two names, depending on the context: 1) the sampling distribution, when H is fixed and D varies; 2) the likelihood function L(H), when the data are fixed while H changes. Ref [2] has a nice section on likelihood functions and should be read with the associated R code in the book.

I show a slightly modified example from Ref [1] (where this problem is solved in multiple ways):

  • X ≡ There are 15 machines making widgets: 10 of those machines make good quality widgets, 4 make average quality widgets and 1 is out of tune and makes mostly bad widgets. Each machine has a tuning parameter, which is the proportion of bad widgets; we model that using a beta distribution.
  • M1 ≡ On average the machine produces bad widgets with a proportion of 0.16
  • M2 ≡ On average the machine produces bad widgets with a proportion of 0.49
  • M3 ≡ On average the machine produces bad widgets with a proportion of 0.83

We use equation 1 to estimate the posterior parameters for each hypothesis or model.


We can treat this problem in various ways: using a parameter estimation approach, calculating 95% high density intervals to assign a p-value, or a mixture distribution approach as shown in previous posts e.g. here and here. However the statement of the problem also shows that we can assign a prior probability or weight to each hypothesis. The code snippet below shows the priors for the 3 models and the prior weight for each model – the full code can be found in the github repository.

### define three machines
## model m1 - machine makes few defective widgets
m1 = function(th) dbeta(th, 1, 5, log = T)

## model m2 - machine makes an average number of defective widgets
m2 = function(th) dbeta(th, 5, 5, log = T)

## model m3 - machine makes mostly defective widgets
m3 = function(th) dbeta(th, 5, 1, log = T)

## define an array that represents number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of coming from a categorical distribution 
mix.prior = c(m1=10/15 ,m2= 4/15 ,m3= 1/15)
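Since the beta priors are conjugate to the bernoulli observations, the posterior weight of each model after a run of good/bad widgets can be sketched with marginal likelihoods – note this is my own illustrative sketch, not the code from the repository:

```r
## for a Beta(a, b) prior on the proportion of bad widgets, the marginal
## likelihood of s bad and f good widgets (in a given order) is
## Beta(a+s, b+f) / Beta(a, b); posterior weight ~ prior weight x marginal lik
marg.lik = function(a, b, s, f) beta(a + s, b + f) / beta(a, b)
ab = rbind(m1 = c(1, 5), m2 = c(5, 5), m3 = c(5, 1))
mix.prior = c(m1 = 10/15, m2 = 4/15, m3 = 1/15)

post.weights = function(data) {   # data: vector of 1 (bad) / 0 (good)
  s = sum(data); f = length(data) - s
  w = mix.prior * apply(ab, 1, function(p) marg.lik(p[1], p[2], s, f))
  w / sum(w)                      # normalise so the weights sum to 1
}
round(post.weights(c(1, 1, 0)), 2)   # weights after bad, bad, good
```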

As the data are generated, the posterior weight for each model changes relative to the others, depending on whether we see a bad widget (1) or a good widget (0). The results below show the posterior weights as the totals of good and bad widgets change along with the number of observations.

        m1   m2   m3 Data - 1=Bad
 [1,] 0.78 0.20 0.02            1
 [2,] 0.53 0.42 0.05            1
 [3,] 0.64 0.34 0.02            0
 [4,] 0.46 0.49 0.05            1
 [5,] 0.33 0.59 0.08            1
 [6,] 0.25 0.64 0.11            1
 [7,] 0.19 0.66 0.16            1
 [8,] 0.15 0.65 0.20            1
 [9,] 0.12 0.63 0.25            1
[10,] 0.09 0.60 0.30            1
[11,] 0.08 0.57 0.36            1
[12,] 0.06 0.53 0.41            1
[13,] 0.05 0.49 0.46            1
[14,] 0.04 0.44 0.51            1
[15,] 0.07 0.58 0.35            0
[16,] 0.09 0.67 0.24            0
[17,] 0.12 0.72 0.17            0
[18,] 0.14 0.74 0.12            0
[19,] 0.17 0.74 0.09            0
[20,] 0.15 0.75 0.11            1
[21,] 0.13 0.75 0.12            1
[22,] 0.12 0.74 0.14            1

The figure below shows the same numbers and how one hypothesis becomes more plausible while others become less as evidence accumulates – the points represent the prior weights for each hypothesis.


Interestingly, hypothesis M3 starts as a very rare event – almost a dead hypothesis – that is resurrected [1] to a level where we start to consider it very plausible by the time we have about 14 data points, as we had seen a lot of bad widgets. This example shows an elegant way to compare multiple hypotheses with respect to each other, with our belief driven in different directions by the prior information and the data. In a more complex model with many parameters, the denominator of equation 1, i.e. P(D|X), can be difficult to calculate, and I would perhaps try an MCMC based approach in a single finite mixture model, rather than an optimisation based approach.

Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues that theme and tries to model multiple data generating processes as a single model. The code for this post is available at the github repository. There were many useful resources that helped me understand this model; some are mentioned below and others are mentioned in the previous post on this topic:

I use the R library flexmix for the sample data and for fitting the model, in addition to an MCMC approach using Stan. The data distribution is shown in the figure below; looking at the figure suggests that there may be two data generation processes at work. To start with, we try to find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see if our choice of distribution is appropriate. A previous post on posterior predictive checks covers this topic in more detail and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.

The figure in the left panel shows the density plot for the raw data, and the panel on the right shows the same data with a smattering of 200 predictive data sets from the fitted distribution – visually it appears to be fine, with the left component slightly higher in some cases. The aspects of the data that I want our distribution to capture are the mean, variance, maximum and minimum values, and the test p-values suggest that any variation can be explained by sampling variability.

> ivTestQuantities
variance  minimum  maximum     mean 
   0.380    0.155    0.300    0.420 
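Test quantities of this kind can be computed along the following lines – a hedged sketch with stand-in data and a plain normal (not the mixture) as the fitted distribution, just to show the mechanics of a posterior predictive p-value:

```r
## posterior predictive p-value for one test quantity (T = variance):
## compare T on the observed data against T on replicated data sets
set.seed(7)
ydata = rnorm(200, 10, 3)            # stand-in for the observed data
Tobs = var(ydata)
Trep = replicate(200, var(rnorm(200, mean(ydata), sd(ydata))))
mean(Trep >= Tobs)                   # proportion of replicates at least as extreme
```

A p-value far from 0 or 1 indicates the discrepancy between data and model can be explained by sampling variability.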

The first steps are to fit a mixture regression model with 2 components and one predictor using the flexmix package and then to repeat this using Stan.

> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

flexmix(formula = yn ~ x, data = NPreg, k = 2)

       prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795

The same model is now broken down into parts and fit using Stan – you can also write a log posterior function in R to do this, but I find that the optimiser tends not to explore the parameter space if we also track the mixture components. However, if we keep the mixture components fixed, then it tends to converge to the same point as Stan.
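For completeness, a log posterior for this model with the mixture weights held fixed could be sketched in R roughly as below – the variable names and priors are my own, and this is not the repository code:

```r
## log posterior for a 2-component normal mixture regression,
## with the mixing weights w held fixed
mylogpost.mix = function(theta, data) {
  y = data$y; X = data$X; w = data$w            # X is the n x 2 model matrix
  b1 = theta[c('i1', 's1')]                     # intercept/slope, component 1
  b2 = theta[c('i2', 's2')]                     # intercept/slope, component 2
  sig = exp(theta[c('ls1', 'ls2')])             # sds tracked on the log scale
  m1 = X %*% b1; m2 = X %*% b2
  ## likelihood: each observation is a weighted mixture of the two normals
  lik = sum(log(w[1] * dnorm(y, m1, sig[1]) + w[2] * dnorm(y, m2, sig[2])))
  ## weak cauchy priors on the regression coefficients
  lp = sum(dcauchy(c(b1, b2), 0, 10, log = TRUE))
  lik + lp
}
```

This could then be handed to optim or a Laplace approximation; with the weights fixed it should converge near the flexmix and Stan estimates.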

> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=2000.

                   mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul  4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparing the results from the two procedures, i.e. flexmix and the Stan model, shows good agreement between the two methods (note that the component labels are switched between the two fits).

Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (current data) values. There are two ways to get this data: either get predicted values for each component, or get an aggregate predicted value using all components.

> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1])
> iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with fit.flex
> head(f)
       Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
> head(cbind(iPred1, iPred2))
       [,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101

However, getting one aggregate predicted value for the response variable may be more appropriate in certain settings; this requires us to take a weighted average of each component (with the weights determined by the mixing probabilities).

> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
> head(dfPredicted)
      stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close

Comparing the aggregate values from the two models show that both have produced similar results, and the mean squared errors in each case are also similar.

Hierarchical Linear Regression – 2 Level Random Effects Model

Regression is a popular approach to modelling where a response variable is modelled as a function of certain predictors – to understand the relations between variables. I used a linear model in a previous post, using the bread and peace model – and various ways to solve the equation.

In this post, I want to fit a 2 level linear model, using the concept of Random Effects modelling. There is a lot of literature available on this subject, and some references are shown below:

[1] lme4: Mixed-effects modelling with R. Douglas M. Bates (2010). link

[2] West, B. T., Welch, K. B., & Gałecki, A. T. (2007). Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC.

[3] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition. Texts in Statistical Science, CRC Press.

The code in this post can be seen on the github page here.

The simulated data in this example consists of 30 observations of a normally distributed variable (so, in generalised linear modelling speak, the link function is the identity). These 30 observations can be sub-grouped into 5 individual groups. We assume that there is a common relationship between these 5 groups (e.g. 5 machines of the same type; 5 persons from the same cohort). This grouping of course requires some domain knowledge and common sense. Ref [3] has a nice introductory chapter on hierarchical modelling which is a recommended read – in particular the concept of exchangeability. The data also has one predictor variable.

The figure shows the data in 3 panels: the top left shows the data at the population level (or a higher level – where all observations are part of one big group); the top right shows the same data but with the 5 sub-groups coloured differently; the bottom panel shows the 5 groups plotted separately in individual panels (Ref [1] has some nice examples of how to plot structured data in various ways). Looking at this figure, we can see that the slope in each group is about equal, however the intercept for each group differs. We would like to account for this in our model by treating the group intercepts as randomly distributed around the central intercept.
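To make the setup concrete, data of this shape could be simulated along the following lines (a sketch with my own parameter values, not necessarily those used in the repository):

```r
## simulate 30 observations in 5 groups with a random intercept per group
set.seed(123)
iGroups = 5; iObs = 6                       # 5 groups x 6 observations = 30
group = factor(rep(1:iGroups, each = iObs))
pred = runif(iGroups * iObs, 1, 20)
## group intercepts are jitters around the population intercept
intercepts = 2 + rnorm(iGroups, 0, 1.6)
resp = intercepts[group] + 1.5 * pred + rnorm(length(pred), 0, 2)
dfSim = data.frame(resp, pred, group)
str(dfSim)
```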

################### first use normal regression and lme4 package
fit.lm = lm(resp ~ pred, data=dfData)
summary(fit.lm)

Call:
lm(formula = resp ~ pred, data = dfData)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3043 -1.6194 -0.4117  1.5824  6.7582 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.28521    0.92127    2.48   0.0194 *  
pred         1.50345    0.08684   17.31   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.616 on 28 degrees of freedom
Multiple R-squared:  0.9146,    Adjusted R-squared:  0.9115 
F-statistic: 299.7 on 1 and 28 DF,  p-value: < 2.2e-16
# residual se = 2.616

#### fit a 2 level model with random intercept

library(lme4)
fit.lme = lmer(resp ~ pred + (1 | group), data=dfData, REML=F)
summary(fit.lme)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: resp ~ pred + (1 | group)
   Data: dfData

     AIC      BIC   logLik deviance df.resid 
   141.3    146.9    -66.6    133.3       26 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.5264 -0.7096 -0.2699  0.4328  2.6193 

Random effects:
 Groups   Name        Variance Std.Dev.
 group    (Intercept) 2.664    1.632   
 Residual             3.777    1.943   
Number of obs: 30, groups:  group, 5

Fixed effects:
            Estimate Std. Error t value
(Intercept)  1.89979    1.00879   1.883
pred         1.54593    0.06604  23.408

Correlation of Fixed Effects:
     (Intr)
pred -0.594
## random effect or group level sd = 1.632
## population level sd = 1.943
## total sd = sqrt(2.664 + 3.777) = 2.54 (variances add, not standard deviations)

The interesting numbers I would like to look at are the coefficients for the population (higher level group) in the simple linear model and the model with random effects – the slope coefficients are about the same, but the intercepts differ between the two models. The second (and perhaps more important) set of numbers is the amount of residual noise left unexplained. The simpler model has a residual standard deviation of 2.6, while the 2 level model has 2 noise terms: the random effect (group) level standard deviation of 1.6 and the remaining population level standard deviation of 1.9 (which is smaller than in the simpler model). So this suggests that you can explain more of the noise in your data by using a random effects model and accounting for the structure of the data.

Now let's break this model down into parts and write our own models; the first is an MCMC based approach using Stan, and the second uses an optimisation-with-resampling approach. In both cases you are trying to estimate multiple parameters in a parameter space, and you hope to converge to a point.

The code for the Stan model is in the same directory of the repository at github – and is structurally very similar to the R code shown below, which defines the log posterior function (see the post on parameter estimation for details).

mylogpost = function(theta, data){
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
##### this mapping variable maps each group level intercept to each data point
#### at the likelihood level
  groupIndex = data$groupIndex  ## mapping variable to map each random effect to its respective response variable
  ## parameters to track/estimate
  sigmaRan = exp(theta['sigmaRan']) # random effect scale/sd
  sigmaPop = exp(theta['sigmaPop']) # population level sd
  betas = theta[grep('betas', names(theta))] # vector of betas i.e. regression coefficients for population
  iGroupsJitter = theta[grep('ran', names(theta))]# random effects jitters for the group deflections
  ## random effect jitter for the population intercept
  # each group contributes a jitter centered on 0
  # population slope + random jitter
  ivBetaRand = betas[1] + iGroupsJitter
  # create a matrix of betas with the new interceptr/unique intercept for each random effect
  ivIntercept = ivBetaRand[groupIndex] # expand this intercept using the mapping variable
  iFitted = as.matrix(mModMatrix[,2:ncol(mModMatrix)]) %*% betas[2:ncol(mModMatrix)]
  iFitted = ivIntercept + iFitted # using identity link so no transformation required
  ## priors
  # sigma priors
  lhp = dunif(sigmaRan, 0, 2, log=T) + dunif(sigmaPop, 0, 5, log=T)
  # random effects prior
  lran = sum(dnorm(iGroupsJitter, 0, sigmaRan, log=T))
  # priors for the betas
  lp = sum(dcauchy(betas, 0, 10, log=T))
  # write the likelihood function
  lik = sum(dnorm(resp, iFitted, sigmaPop, log=T))
  val = lhp + lran + lp + lik
  return(val)
}

Some points about this code snippet:

  • sigmaRan is the standard deviation for the group level. All groups are assumed to have the same noise centred on zero. This is a general assumption and may not be true if you have different types of groups. If you feel brave, you can assign a different distribution to each type of group.
  • sigmaPop is the population distribution at the likelihood level.
  • I have set the priors for the sigmaRan and sigmaPop as uniform 0 to 2 and 0 to 5 respectively. This is not necessary but helps the convergence process, if you are having problems with convergence.
  • The population level regression coefficients are given a 0 centred Cauchy distribution.
  • We are tracking 9 parameters in this optimisation problem: 2 standard deviations, 2 regression coefficients and 5 group level deflections.

fit.lap = mylaplace(mylogpost, start, lData)

### lets use the results from laplace and SIR sampling with a t proposal density
### to take a sample
tpar = list(m=fit.lap$mode, var=fit.lap$var*2, df=4)
## get a sample directly and using sir (sampling importance resampling with a t proposal density)
s = sir(mylogpost, tpar, 5000, lData)
apply(s, 2, mean)[-c(1:2)]
    betas1     betas2       ran1       ran2       ran3       ran4       ran5 
 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
exp(apply(s, 2, mean)[1:2])
sigmaPop sigmaRan 
2.082105 1.351142 

We use the sampling importance resampling (SIR) algorithm with a t proposal density to generate a random sample for our parameters, using the mode and covariance matrix calculated earlier with the function mylaplace.
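The sir function here comes from the LearnBayes package; the general mechanics can be sketched independently for a one-dimensional target (this sketch is my own, not the LearnBayes implementation):

```r
## sampling importance resampling: draw from a heavy-tailed proposal,
## weight each draw by target density / proposal density, then
## resample in proportion to those weights
sir.sketch = function(logtarget, n) {
  prop = rt(n * 5, df = 4) * 2                            # scaled t4 proposal
  lw = sapply(prop, logtarget) - dt(prop / 2, df = 4, log = TRUE)
  w = exp(lw - max(lw))                                   # stabilised weights
  sample(prop, n, replace = TRUE, prob = w)
}
set.seed(3)
s1 = sir.sketch(function(x) dnorm(x, log = TRUE), 5000)   # standard normal target
c(mean = mean(s1), sd = sd(s1))                           # close to 0 and 1
```

The scale constant of the proposal density cancels in the weight normalisation, so it is omitted; the heavy t tails ensure the proposal covers the target.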

I show the results from all 4 methods below, and you can see they are pretty similar.

############### you can compare the results from stan and lmer and the logposterior function
# Inference for Stan model: linearRegressionRandomEffects.
# 4 chains, each with iter=5000; warmup=2500; thin=1; 
# post-warmup draws per chain=2500, total post-warmup draws=10000.
#                    mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff  Rhat
# betas[1]           1.775   0.038 1.620  -1.736   0.935   1.841   2.738   4.786  1853 1.000
# betas[2]           1.549   0.001 0.070   1.411   1.503   1.549   1.597   1.685  5649 1.001
# sigmaRan           2.961   0.051 2.471   0.875   1.728   2.414   3.487   8.093  2377 1.001
# sigmaPop           2.094   0.005 0.327   1.563   1.863   2.056   2.280   2.843  4928 1.001
# rGroupsJitter[1]   0.503   0.036 1.623  -2.559  -0.419   0.434   1.326   3.996  2083 1.000
# rGroupsJitter[2]   0.901   0.037 1.635  -2.068  -0.055   0.808   1.733   4.493  1997 1.000
# rGroupsJitter[3]  -0.636   0.036 1.618  -3.838  -1.516  -0.646   0.225   2.813  2036 1.000
# rGroupsJitter[4]   2.140   0.036 1.660  -0.808   1.131   2.013   3.008   5.892  2077 1.000
# rGroupsJitter[5]  -2.448   0.035 1.652  -5.842  -3.378  -2.406  -1.502   0.792  2179 1.000

# > summary(fit.lme)
# Linear mixed model fit by maximum likelihood  ['lmerMod']
# Formula: resp ~ pred + (1 | group)
# Data: dfData
# Random effects:
#   Groups   Name        Variance Std.Dev.
# group    (Intercept) 2.664    1.632   
# Residual             3.777    1.943   
# Number of obs: 30, groups:  group, 5
# Fixed effects:
#   Estimate Std. Error t value
# (Intercept)  1.89979    1.00879   1.883
# pred         1.54593    0.06604  23.408
# Random effect jitters from lme
# > t(ranef(fit.lme)$group)
#              1         2          3       4         5
# (Intercept) 0.3888984 0.7627657 -0.6863616 1.94236 -2.407663

# > fit.lap$mode[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.9037216  1.5456933  0.3897490  0.7787912 -0.7075750  1.8900023 -2.3491993 
# > exp(fit.lap$mode)[1:2]
# sigmaPop sigmaRan 
# 1.805638 1.463412

## SIR sampling - these will slighly change each time you run it
# > apply(s, 2, mean)[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5 
# 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244 
# > exp(apply(s, 2, mean)[c(1:2)])
# sigmaPop sigmaRan 
# 2.082105 1.351142 

However, in problems with many parameters and groups the optimisation based approach may not converge, so an optimisation+SIR, Stan or lmer based approach – or a different optimisation library – may be required.

Model Quality Scores

In a previous post we calculated some model quality scores. However I had been struggling in calculating the Log Likelihood and AIC for a Random effects model – due to a mistake I was making in the calculations. Eventually I figured it out with help from another post I found here.

## first write the log predictive density function
## (dmnorm comes from the mnormt package)
library(mnormt)
lpd = function(theta, data){
  betas = theta # vector of betas i.e. regression coefficients for population
  ## data
  resp = data$resp # resp
  mModMatrix = data$mModMatrix
  group = data$group
  sigmaRan = exp(data$sigmaRan) # random effect scale/sd
  sigmaPop = exp(data$sigmaPop) # population level sd
  # calculate fitted value
  iFitted = mModMatrix %*% betas
  ######## perform notational acrobatics
  # construct the variance covariance matrix
  z = model.matrix(resp ~ 0 + group)
  zt = t(z)
  mRan = diag(sigmaRan^2, nrow=nlevels(group), ncol=nlevels(group))
  mRan = z %*% mRan %*% zt
  mPop = diag(sigmaPop^2, nrow=length(resp), ncol=length(resp))
  mCov = mRan + mPop
  ## likelihood function with posterior theta
  return(dmnorm(resp, iFitted, mCov, log=T))
}

lData$group = dfData$group
lData$sigmaRan = log(1.632)
lData$sigmaPop = log(1.943)
theta = fixef(fit.lme)
lpd(theta, lData)
[1] -66.63871
## compare with lme
logLik(fit.lme)
'log Lik.' -66.63871 (df=4)

The code above shows that to calculate these scores we need the log predictive density function, and to calculate the log likelihood correctly we need to use a multivariate normal density with a covariance matrix that accounts for both the population level variance and the group level variance.
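With the log likelihood in hand, AIC follows directly; a quick check against the lmer output above (df = 4: the intercept, the slope, and the two standard deviations):

```r
## AIC = -2 * logLik + 2 * df
ll = -66.63871          # log likelihood from lpd() / logLik(fit.lme) above
df = 4
aic = -2 * ll + 2 * df
round(aic, 1)           # 141.3, matching the AIC reported by lmer earlier
```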

I have been looking into calculating WAIC for structured models and doing cross-validation (which is on my to-do list) – however this can get tedious when you have to take into account the structure of the groups. I will try to cover that in a post sometime in the future.

Compare Transformations & Batch Effects in Omics Data

While analysing high dimensional data, e.g. from omics (genomics, transcriptomics, proteomics etc.), we are essentially measuring multiple response variables (i.e. genes, proteins, metabolites etc.) in multiple samples, resulting in an r × n matrix X with r variables and n samples. The data capture can lead to multiple batches or groups in the data – a batch can be anything from a simple grouping like male vs female, to lanes on a sequencer, or wells in a capture plate. Statistically the question is: is the covariate or batch ignorable, or should it be accounted for in our model?

The X matrix (data matrix) usually has to be normalised, which means the data undergo some sort of transformation to account for certain technical differences due to these batches, e.g. different samples being sequenced to different depths on a Next Generation Sequencer. This kind of systematic effect due to technical variation can be ‘adjusted’, and the samples scaled/transformed to make them identically distributed. There are different methods of normalisation depending on the data capture technology; this post has two aims:

  1. A simple workflow for diagnostic checks for your normalisation: we normalise the same data using two different methods, and check which method may be more appropriate.
  2. Introducing the concepts of object oriented programming in R, with S4 objects.

The data and code used in this post can be found here.
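To make the idea of a scaling normalisation concrete, here is a minimal sketch that equalises sample medians (purely illustrative – this is not one of the two methods compared below, and the data are simulated):

```r
## toy example: scale each sample (column) so that sample medians agree
set.seed(123)
## 2 samples captured at different 'depths' (Poisson rates 10 and 20)
mData = matrix(rpois(200, lambda=rep(c(10, 20), each=100)), ncol=2)
## size factor for each sample relative to the overall median
iSize = apply(mData, 2, median) / median(mData)
mNorm = sweep(mData, 2, iSize, '/')
apply(mNorm, 2, median) # medians are now comparable
```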

1. Compare two Normalisation Methods:

In genomics we capture many variables, and many of these variables are inter-dependent (e.g. co-expressed genes); on average, most of these variables are assumed to be identically distributed between samples (this however depends on how different the samples are from each other – e.g. comparing types of apples vs apples and oranges). In the example below we use some gene expression data that has been normalised using two different methods:

Method 1: We use housekeeping genes to normalise the samples; however, these samples are different cell types, and the housekeeping genes may act differently in each sub-type.

Method 2: We use the genes that are highly expressed across all samples, look for a common factor among these cells, and normalise using that criterion.

Note: We are not arguing about which method is correct – both are mathematically and biologically sound – but about which one is suitable for our sample types.

The data in the plots below has been normalised and log transformed (in positive space) and zeros represent missing data. The samples have been coloured on our batch of interest i.e. the machine ID.

(Figure: CDiagnostics_median) The figure above is a box plot of the data with the medians highlighted as a bar. We can see that in both methods the samples from the second machine (071216) have more missing data. Could this be because good quality samples were run first and bad quality ones later? Otherwise, comparing the two methods, I do not see a strong reason to select one over the other; if I had to choose, I would lean towards method 2, as the medians are overall slightly more equal.


We assume that each sample is normally distributed on a log scale; this is a very big assumption, but a convenient one for diagnostic checks. Besides, even if the data are not normally distributed, the mean is approximately normally distributed given the number of data points (we have about 530 variables in each sample). We plot the posterior mean for each sample (see here for some details), along with the 95% high density interval (HDI) for each mean. The mean for S_4 and its HDI do not overlap with e.g. S_14 in method 1, while in method 2 all the HDIs can overlap. There is not a clear choice here, but I would prefer the normalisation in method 2.
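As a sketch of how such an interval for a sample mean can be obtained (assuming normality on the log scale; the data and variable names here are made up, and the central 95% interval is used as an approximation to the HDI for a symmetric posterior):

```r
## simulate the posterior of the mean of one (log-scale) sample
set.seed(123)
ivSample = rnorm(530, mean=5, sd=1) # stand-in for one log-transformed sample
n = length(ivSample)
## with weak priors the posterior of the mean is approximately
## normal with standard error s/sqrt(n)
ivPost = rnorm(10000, mean(ivSample), sd(ivSample)/sqrt(n))
quantile(ivPost, c(0.025, 0.975)) # central 95% interval for the mean
```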


As we have used a normal model for the average of each sample, and the normal distribution has a second parameter (the scale – sigma, the standard deviation), we also compare the posterior sigma for these samples. In this case the standard deviation and the HDI of each sample are more comparable in method 2; I would consider this strong evidence for the second method.


The figures show the posterior proportion of the data that is missing after normalisation with each method (this can differ because some normalisation methods also include background signal removal). There does not appear to be a clear winner in this case, although batch 071216 again has more missing data.


This figure shows the first 2 components from the Principal Component Analysis of the data matrix X. Looking at the distribution of the points along the two axes, visually the data are more clumped on the left (method 1) while they look slightly more random on the right (comparatively of course – method 2). This plot adds another piece of evidence towards using method 2 rather than method 1.

All these pieces of evidence for or against one method should be taken together, along with considering the types of samples being compared and the nuances of the data capture technology. In this case most of the evidence (although not very strong individually) collectively favours method 2. Furthermore, I would consider including the machine ID as a covariate in the model, or use a hierarchical model with the machine ID as a random effect in the model.

2. Object Oriented Programming in R – S4 Objects:

Before I talk about some of my experience with S4 objects, these two sites are good starting points for learning more: site1, site2. I like the structure of C++, and working on larger projects in R can get messy if the programs/scripts are not structured. I will put some code here for the class CDiagnosticPlots.

Class Declaration:

setClass('CDiagnosticPlots', slots=list(mData='matrix', csTitle='character', lData='list', lParam='list'))

I define the class name, and the slots i.e. variables that this object will hold. Generally if I want the class to grow in the future, I will keep a slot like lData of type list, that I can append to for future use.


## constructor
## ARGS: mData = data matrix with samples in columns (subject space) and variables in rows (variable space)
CDiagnosticPlots = function(mData, csTitle){
  # check some diagnostics here and create object
  if (!is.matrix(mData)) stop('CDiagnosticPlots: mData is not object of class matrix')
  ## create the object
  ob = new('CDiagnosticPlots', mData=mData, csTitle=csTitle, lData=list(), lParam=list())
  ob@lParam = CDiagnosticPlotsGetParameters(ob)
  return(ob)
}

The constructor can perform various checks, before creating the object using the new function. It will return the object of the type CDiagnosticPlots which can be used in your program.
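To see the declare/construct/access pattern end to end, here is a self-contained toy version (the class name CToy and its slots are made up for this sketch):

```r
library(methods) # S4 machinery

## minimal version of the declare/construct/access pattern
setClass('CToy', slots=list(mData='matrix', csTitle='character'))

CToy = function(mData, csTitle){
  # perform checks before creating the object
  if (!is.matrix(mData)) stop('CToy: mData is not a matrix')
  new('CToy', mData=mData, csTitle=csTitle)
}

ob = CToy(matrix(rnorm(20), ncol=2), 'toy example')
ob@csTitle   # slots are accessed with the @ operator
dim(ob@mData)
```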

Generic Functions:

Generic functions are very useful: a function with the same name can be called for different object types. This is an example from one of my other repositories here. I define two different classes, but both use a plotting function with the same name.

## snippet from first function
setGeneric('plot.var.selection', def = function(ob, ...) standardGeneric('plot.var.selection'))
setMethod('plot.var.selection', signature='CVariableSelection.RandomForest', definition = function(ob, ...){
  # plot the variable importance as bar plots with 
  # confidence intervals
  # ... (plotting code omitted in this snippet)
})

setMethod('plot.var.selection', signature='CVariableSelection.ReduceModel', definition = function(ob, ...){
  # plot the test and training error against number of variables
  tr = colMeans(ob@mTrain)
  te = colMeans(ob@mTest)  
  m = cbind(tr, te)
  matplot(1:nrow(m), m, type='b', pch=20, lty = 1, lwd=2, col=1:2, xaxt='n', xlab='No. of Variables', ylab='Error Rate')
  legend('topright', legend = c('Train', 'Test'), fill=1:2)
  axis(1, at = 1:nrow(m), las=2)
})

Inheritance can also be implemented and an example can be seen in the link above.
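Inheritance in S4 is expressed with the contains argument to setClass; a minimal sketch (the class names here are mine, not from the repository):

```r
library(methods) # S4 machinery

setClass('CBase', slots=list(mData='matrix'))
## the derived class inherits the slots of CBase and adds its own
setClass('CDerived', contains='CBase', slots=list(csNote='character'))

ob = new('CDerived', mData=matrix(1:4, 2, 2), csNote='extra slot')
is(ob, 'CBase') # TRUE - a CDerived object is also a CBase
ob@mData        # slot inherited from CBase
```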

Model Checking: Scoring and Comparing Models

This is another post in the series of model checking posts. Previously we looked at which aspects of the data and model are compatible, using posterior predictive checks. Once we have selected a model or a set of models for the data, we would like to score and compare them. One aspect of comparison, using Mixture Distributions and Bayes Factors, has been shown in a previous post. Here we calculate information criteria and leave one out cross-validation scores, using the posterior predictive distribution and adjusting for the number of parameters in the model. The examples and algorithms used in the post can be seen in more detail in:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). Book.

[2] Albert, J., Gentleman, R., Parmigiani, G., & Hornik, K. (2009). Bayesian computation with R. Bayesian Computation with R.

[3] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. Springer texts in statistics (Vol. XIV).

The predictive accuracy of a model can be used to evaluate and compare a model against other candidate models. The accuracy can be defined using some form of a function, e.g. mean squared error, log-likelihood, log predictive density etc. This ties into the concept of deviance or information criteria, which puts different models (with different parameters) on a common scale (along with some adjustments made for the number of parameters being estimated).
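As a concrete (and deliberately simple) illustration of putting a model on the deviance scale, here is a sketch for a normal model with made-up data:

```r
## deviance = -2 * log-likelihood, a common scale for comparing models
set.seed(123)
y = rnorm(50, 10, 2) # made-up data
## log-likelihood under a normal model with plug-in estimates
llik = sum(dnorm(y, mean(y), sd(y), log=TRUE))
iDeviance = -2 * llik # lower deviance = better (apparent) fit
iDeviance
```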

The data to calculate this predictive accuracy can be recycled (within sample) or external (out of sample i.e. new data) or using cross-validation (Ref 3 has a good chapter on cross-validation).

Example Data and Model

The data used in this example is from the ‘bread and peace’ model, and the paper can be seen here, while the example is from Ref [1], and the R code can be found in the github repository.

We first fit a linear model using the lm function in R.

Call:
lm(formula = VoteShare ~ IncomeGrowth, data = dfData)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8370 -0.3748  0.1379  1.7745  5.6291 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   45.8027     1.7297  26.480 1.07e-12 ***
IncomeGrowth   3.1809     0.7226   4.402 0.000715 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.816 on 13 degrees of freedom
Multiple R-squared:  0.5985,    Adjusted R-squared:  0.5676 
F-statistic: 19.38 on 1 and 13 DF,  p-value: 0.0007149

There are three parameters being estimated here, the two coefficients and the standard deviation for the likelihood function.

## write the model with the log posterior function
# use the likelihood with non-informative priors
mylogpost = function(theta, data){
  sigma = exp(theta['sigma'])
  betas = theta[-1]
  x = data$pred
  y = data$resp
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(betas, nrow = length(betas))
  iFitted = mModMatrix %*% mCoef # using identity link
  ## likelihood function
  llik = sum(dnorm(y, iFitted, sigma, log=T))
  ## prior, non informative
  lprior = 1
  lpost = llik + lprior
  return(lpost)
}

We define this regression model using our custom function mylogpost; we track 3 parameters (sigma and the 2 betas) and use non-informative priors, so our results should be similar to those from the lm model.

> fit = laplace(mylogpost, start, lData)
> fit
$mode
    sigma     beta0     beta1 
 1.268109 45.806382  3.180065 

$var
              sigma         beta0         beta1
sigma  3.335900e-02  0.0002461944 -5.857976e-05
beta0  2.461944e-04  2.5949267841 -8.909553e-01
beta1 -5.857976e-05 -0.8909553156  4.528745e-01

$int
[1] -38.72532

$converge
[1] TRUE

> se = sqrt(diag(fit$var))
> se
    sigma     beta0     beta1 
0.1826445 1.6108776 0.6729595 
## coef and sd
> fit$mode[-1]
    beta0     beta1 
45.806382  3.180065 
> exp(fit$mode[1])

We take a sample from the posterior distribution of these three parameters using a multivariate t proposal density, with mylogpost as the target density, via a Sampling Importance Resampling (SIR) algorithm (I will write a post about these sampling methods at some point in the future). You could try the approach suggested in the post on parameter estimation, but if the parameters are sampled independently then we cannot account for the covariance between them – which is avoided by using a multivariate proposal.

## parameters for the multivariate t density
tpar = list(m=fit$mode, var=fit$var*2, df=4)
## get a sample directly and using sir (sampling importance resampling with a t proposal density)
s = sir(mylogpost, tpar, 1000, lData)
#s = rmt(10000, fit$mode, S = fit$var)
sigSample = s[,'sigma']
beta0Sample = s[,'beta0']
beta1Sample = s[,'beta1']

We use the SIR algorithm from [2], but you can write your own; it's straightforward once you know how. The tpar is a list of parameters for the multivariate t proposal density. Just to show how our sampler works, we also use STAN to generate samples for these parameters using MCMC. The figure below shows the samples generated by SIR as histograms, and the lines are the samples generated by STAN using MCMC (they agree pretty well). The fourth plot at the bottom right panel shows the data and the regression line.
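The idea behind SIR itself is simple; here is a minimal one-dimensional sketch (not the implementation from [2] – the target here is a standard normal and all names are mine):

```r
## sampling importance resampling (SIR), 1-d sketch
## target density: standard normal; proposal density: t with 3 df
set.seed(123)
lTarget = function(x) dnorm(x, log=TRUE)
## 1. draw from the proposal
xProp = rt(10000, df=3)
## 2. log importance weights = log target - log proposal
lWeights = lTarget(xProp) - dt(xProp, df=3, log=TRUE)
w = exp(lWeights - max(lWeights)) # subtract max before exponentiating
## 3. resample in proportion to the weights
xPost = sample(xProp, 1000, replace=TRUE, prob=w)
c(mean(xPost), sd(xPost)) # should be close to (0, 1)
```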


## first write the log predictive density function
lpd = function(beta0, beta1, sig){
  sigma = exp(sig)
  x = lData$pred
  y = lData$resp
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(c(beta0, beta1), nrow = 2)
  iFitted = mModMatrix %*% mCoef # using identity link
  ## likelihood function with posterior theta
  return(sum(dnorm(y, iFitted, sigma, log=T)))
}

The scoring function for the model check is the log predictive density, which is basically the log-likelihood function, but using the posterior/fitted parameters.

“In the d-dimensional normal distribution, the logarithm of the density function is a constant plus a \chi_{d}^{2} distribution divided by −2” [1]. The posterior distribution of the log predictive density has a maximum of ~ −40.3 and a mean of ~ −41.9; the difference of about 1.6 is close to d/2 = 3/2 (1.5) predicted from theory, where d = 3 is the number of parameters being estimated [see page 171 of Ref 1 for details].
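This property is easy to check by simulation (a sketch, not part of the original analysis): for draws of a d-dimensional normal parameter vector, the log density is a constant minus half a chi-squared variable, so the maximum minus the mean of the log density should be approximately d/2.

```r
## max - mean of the log density over posterior draws should be ~ d/2
set.seed(123)
d = 3
mTheta = matrix(rnorm(10000 * d), ncol=d) # draws from a d-dim standard normal
## log density at each draw (dropping the normalising constant)
lDens = -0.5 * rowSums(mTheta^2)
max(lDens) - mean(lDens) # close to d/2 = 1.5
```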

Predictive Scores

We show three predictive scores, along with adjustments for the number of parameters being estimated, plus leave one out cross-validation. The details for these scores can be found in References [1] and [3], and elsewhere in the literature. Lower values of these scores imply a higher predictive accuracy.

Akaike Information Criterion (AIC)

This is converted to a deviance scale after adjusting for the number of parameters which is 3 in our case.

> iAIC = (lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - 3) * -2
> AIC(fit.lm)
[1] 86.59985
> iAIC
[1] 86.59986

The AIC calculated using lpd function and using the AIC function in R on the linear model fit object are the same.

Deviance Information Criterion (DIC)

This is a somewhat Bayesian version of the AIC, and uses the posterior mean (instead of the maximum likelihood estimate) for θ, and a data-based bias correction.

> ## pDIC are the effective number of parameters
> ## 2 * [lpd(Expectation(theta)) - Expectation(lpd(Sample of thetas from posterior))]
> # calculate E(lpd(theta))
> eLPD = mean(sapply(1:1000, function(x) lpd(beta0Sample[x], beta1Sample[x], sigSample[x])))
> # calculate lpd(E(theta)) and pDIC
> pDIC = 2 *(lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - eLPD)
> iDIC = (lpd(fit$mode['beta0'], fit$mode['beta1'], fit$mode['sigma']) - pDIC) * -2
> pDIC
[1] 3.424087
> iDIC
[1] 87.44804

The effective number of parameters will change slightly depending on the simulation size, but both numbers are pretty close (i.e. AIC and DIC).

Watanabe-Akaike or widely available information criterion (WAIC)

This is a more Bayesian approach and is an approximation to cross-validation. We show one version of this approach here. First we calculate the log pointwise predictive density, which is slightly different from the log predictive density calculated earlier.

## log pointwise predictive density
lppd = function(beta0, beta1, sig, index){
  sigma = exp(sig)
  x = lData$pred[index]
  y = lData$resp[index]
  # get the predicted or fitted value 
  mModMatrix = model.matrix(y ~ x)
  mCoef = matrix(c(beta0, beta1), nrow = 2, byrow = T)
  iFitted = mModMatrix %*% mCoef # using identity link
  ## likelihood function with posterior theta
  return(mean(dnorm(y, iFitted, sigma, log=F)))
}

# log pointwise predictive probability of the observed data under the fitted
# model is
ilppd = sum(log(sapply(1:15, function(x) lppd(beta0Sample, beta1Sample, sigSample, x))))

The function above is slightly different from the lpd function, and works on one data point at a time (see the index argument).

> ## effective numbers of parameters pWAIC1
> pWAIC1 = 2 * (ilppd - eLPD)
> iWAIC = -2 * (ilppd - pWAIC1)
> pWAIC1
[1] 2.245018
> iWAIC
[1] 86.26897

The effective number of parameters is about 2.2, instead of 3. This is because the effective number of parameters in a model is also a function of the data, and can be considered a random variable – hence in more complex models it is not straightforward to just count the number of parameters.

Leave-one-out cross-validation

Cross-validation can be computationally expensive and awkward in structured models, but simply put, the data are partitioned into training and test sets. The parameters are estimated on the training set while the fit is evaluated on the test set.

## leave one out cross validation
# we need to fit the model 15 times each time removing one data point
lFits = lapply(1:15, function(x){
  start = c('sigma' = log(sd(dfData$VoteShare[-x])), 'beta0'=0, 'beta1'=0)
  lData = list(pred=dfData$IncomeGrowth[-x], resp=dfData$VoteShare[-x])
  laplace(mylogpost, start, lData)
})

# lets take samples for posterior theta
lSamples = lapply(1:15, function(x){
  fit = lFits[[x]]
  tpar = list(m=fit$mode, var=fit$var*2, df=4)
  ## get a sample directly and using sir (sampling importance resampling with a t proposal density)
  lData = list(pred=dfData$IncomeGrowth[-x], resp=dfData$VoteShare[-x])
  s = sir(mylogpost, tpar, 1000, lData)
})

## calculate lppd on the hold out set
iLppd = sapply(1:15, function(x){
  s = lSamples[[x]]
  log(lppd(s[,'beta0'], s[,'beta1'], s[,'sigma'], x))
})
iLppd = sum(iLppd)
# calculate lppd on all data
# calculated earlier ilppd
# effective number of parameters
iParamCV = ilppd - iLppd
# Given that this model includes two linear coefficients and a variance parameter, these
# all look like reasonable estimates of effective number of parameters. [Gelman 2013]
# on deviance scale iCV is
iCV = -2 * iLppd
# after correction for number of parameters
iCV = -2 * (iLppd - iParamCV)

Generally, predictive accuracy measures should be used in parallel with posterior predictive checks, starting with simpler models and expanding. This can be applied to comparing nested models: the full model implements perhaps all the meaningful parameters, while in restricted models some parameters are set to 0 or forced to be equal. In this case the improvement in fit should justify the complexity of the model, the interpretation effort, and the additional difficulty in fitting.

Model Checking: Posterior Predictive Checks

Once a model is fit and parameters estimated, we would look at how well the model explains the data and what aspects of the data generation process in nature are not captured by the model. Most of the material covered in this post follows the examples from:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). Book.

The example data I use here is Simon Newcomb’s experiment to measure the speed of light. The R source code and the data are present in the github repository. Import the data and define a model function i.e. log posterior function to model this data.

## define a log posterior function
lp = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$vector # observed data vector
  log.lik = sum(dnorm(d, m, s, log=T))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

Here we assume the data can be modelled using a normal distribution and estimate the parameters.

## try the laplace function from LearnBayes
fit = laplace(lp, start, lData)
se = sqrt(diag(fit$var))

## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

#### the posterior interval reported on P 67 Gelman [2013] is
## ybar ± 1.997*s/sqrt(66) = [23.6, 28.8]
## compare with our intervals
fit$mode['mu']-1.96*se['mu']; fit$mode['mu']+1.96*se['mu']
##      mu 
## 23.63911 
##      mu 
## 28.78365 
quantile(muSample.op, c(0.025, 0.975))
##     2.5%    97.5% 
## 23.61960 28.83061 

The figure below suggests that the normal model may not be appropriate for this data, as there is an extreme outlier measurement around -44.


There are various approaches to model checking e.g.:

  • Using external validation: use data to fit the model (i.e. estimate parameters), make predictions about the new or future data using the model and parameters, collect new real data and compare with the predictions.
  • Recycle existing data: if new data are not available. This also involves other modifications to model fitting, such as cross-validation techniques. I will talk about this further in a future post.

How do we compare the predictions with the real data? We define some metrics to compare against, which are called Test Quantities.

Prediction of New/Future Data (Posterior Predictive Distribution)

New data or future data can be generated using simulation, from the model and estimated parameters.

## observed data should look plausible under the posterior predictive distribution
## Draw simulated values from the joint posterior of Yrep and compare to Yobs and look for systematic differences

## Gelman [2013] P 144 -
## sample 66 values, 20 times, each time drawing a fresh draw of sd and mean from the joint posterior
mDraws = matrix(NA, nrow = 66, ncol=20)

for (i in 1:20){
  p = sample(1:1000, size = 1)
  s = exp(sigSample.op[p])
  m = muSample.op[p]
  mDraws[,i] = rnorm(66, m, s)
}

In the example shown above we sample the standard deviation parameter, and then sample the mean parameter (conditioned on the standard deviation – notice we use the same index p). We draw 66 samples and repeat the process 20 times; this is akin to repeating the experiment 20 times, each time taking 66 measurements. The figure below shows some of the histograms of the 20 simulations and the original data.


It appears that the replications do not include the outlier measurement at −44, which suggests that our current model does not capture that part of the data generation process. Graphical checks can be a bit tedious in high-throughput settings, so we can also use quantitative checks by defining Test Quantities.

Test Quantities

A test quantity is a function of the original data and replicated data, with some optional additional parameters. We can evaluate the discrepancy between the test quantities calculated on the original and the replicated data by computing a PValue, and watch for extreme tail-area PValues. We define 5 test quantities in the current case, representing: Variance, Mean, Symmetry, Minimum and Maximum.

# The procedure for carrying out a posterior predictive model check requires specifying a test
# quantity, T (y) or T (y, θ), and an appropriate predictive distribution for the replications
# y rep [Gelman 2008]
## variance
T1_var = function(Y) return(var(Y))

## is the model adequate except for the extreme tails
T1_symmetry = function(Y, th){
  Yq = quantile(Y, c(0.90, 0.10))
  return(abs(Yq[1]-th) - abs(Yq[2]-th))

## min quantity
T1_min = function(Y){

## max quantity
T1_max = function(Y){

## mean quantity
T1_mean = function(Y){

## calculate bayesian p-value for this test statistic
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))

Extreme PValues (typically less than 0.01) for the test quantities suggest areas of failure for the model which can be addressed by expanding the model, or ignored if appropriate for the applied question at hand. “The relevant goal is not to answer the question, ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions.” [1].
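As a usage sketch, here is the variance test quantity pushed through getPValue with simulated observed and replicated data (the numbers are made up, not from the Newcomb analysis):

```r
## Bayesian p-value for the variance test quantity, on simulated data
set.seed(123)
getPValue = function(Trep, Tobs){ # as defined above
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))
}
yObs = rnorm(66, 26, 10) # stand-in for the observed data
## T(yrep) calculated over 1000 replicated data sets
Trep = replicate(1000, var(rnorm(66, 26, 10)))
getPValue(Trep, var(yObs)) # extreme tail areas would flag model failure
```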

Under the normal distribution model, the 5 test quantities show the following PValues.

> mChecks[,'Normal']
Variance Symmetry      Max      Min     Mean 
    0.47     0.16     0.01     0.00     0.47 

The tests for the Min and Max quantities fail, suggesting that this model does not perform well in the tail regions. Perhaps a heavy-tailed distribution, like a T distribution with low degrees of freedom, or a contaminated normal distribution, will be more useful.

## define a second log posterior function for mixture with contaminated normal distribution
lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  mix = 0.9
  cont = theta[3]
  d = data$vector # observed data vector
  log.lik = sum(log(dnorm(d, m, s) * mix + dnorm(d, m, s*cont) * (1-mix)))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'cont'=1)
lp2(start, lData)

## try the laplace function from LearnBayes
fit2 = laplace(lp2, start, lData)
se2 = sqrt(diag(fit2$var))

sigSample.op = rnorm(1000, fit2$mode['sigma'], se2['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

A contaminated normal distribution is a mixture distribution where we add 2 more parameters to the model: the mixing probability and the contamination parameter. We fix the mixing probability at 0.9 and track the contamination parameter.
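To get a feel for what this distribution looks like, here is a sketch that draws random samples from a contaminated normal (the function rcontnorm is mine, with the mixing probability fixed at 0.9 as in the model):

```r
## draw n samples from a contaminated normal: with probability mix
## from N(m, s), otherwise from the wider component N(m, s*cont)
rcontnorm = function(n, m, s, cont, mix=0.9){
  ind = rbinom(n, 1, mix)
  rnorm(n, m, ifelse(ind == 1, s, s * cont))
}
set.seed(123)
y = rcontnorm(10000, m=26, s=5, cont=4)
c(mean(y), sd(y)) # the sd is inflated well beyond 5 by the heavy tail
```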

Here are a couple of web links if you are more interested in contaminated normal distributions (Link 1, Link 2).

The third distribution we try is a T distribution with small degrees of freedom. We use a slightly different parameterization of the t-distribution which is nicely explained in another blog here.

lp3 = function(theta, data){
  # function to use a scale parameter with the t distribution
  ## see here
  dt_ls = function(x, df, mu, a) 1/a * dt((x - mu)/a, df)
  ## likelihood function
  lf = function(dat, pred){
    return(log(dt_ls(dat, nu, pred, sigma)))
  }
  nu = exp(theta['nu']) ## normality parameter for t distribution
  sigma = exp(theta['sigma']) # scale parameter for t distribution
  m = theta[1]
  d = data$vector # observed data vector
  if (nu < 1) return(-Inf) # nu is already on the natural scale here
  log.lik = sum(lf(d, m))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'nu'=log(2))
lp3(start, lData)

op = optim(start, lp3, control = list(fnscale = -1), data=lData)

## try the laplace function from LearnBayes
fit3 = laplace(lp3, start, lData)
se3 = sqrt(diag(fit3$var))

sigSample.op = rnorm(1000, fit3$mode['sigma'], se3['sigma'])
nuSample = exp(rnorm(1000, fit3$mode['nu'], se3['nu']))
# threshold the sample values to above or equal to 1
nuSample[nuSample < 1] = 1

## generate random samples from alternative t-distribution parameterization
## see
rt_ls <- function(n, df, mu, a) rt(n,df)*a + mu
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

Using the 3 approaches we calculated the test quantities and the PValues for these quantities.

> round(mChecks, 2)
         Normal NormalCont    T
Variance   0.47       0.34 0.34
Symmetry   0.16       0.10 0.14
Max        0.01       0.06 0.18
Min        0.00       0.21 0.14
Mean       0.47       0.48 0.47

We can see that the normal model does not seem to fit the data well in terms of outliers, while the Contaminated Normal and T distributions, both fit the data better.

The figure below shows the histogram of the data and a smattering of density lines from the 200 posterior predictive replications of the data. We can see that the normal model does not fit well in the tails, while the other two show a reasonable fit – at least the discrepancy can be explained by sampling variation, according to the PValues.


While researching the material for this blog I also came across the post about this data set and extreme outliers in data sets at Andrew Gelman’s blog.

Parameter Estimation

In statistical or mathematical models our aim is to look at the data and estimate the parameters and the uncertainty of those estimates. Generally speaking, looking at a data set, we wish to choose a likelihood/noise/sampling distribution that fits the data. A distribution requires some parameters, Θ; e.g. a normal distribution (a very common error/noise distribution used in practice) has 2 parameters, the mean (µ) and the standard deviation (σ). In a Bayesian setting we will perhaps put some constraints on these parameters, called priors, and eventually the model will (hopefully) converge to a point in this parameter space (driven by the constraints from the priors and the data). For single- and two-parameter models, under the constraints of the data and the priors, the model parameters and their uncertainty can be estimated using a variety of methods. I try to show a few methods along with R code.

The examples and methodological details, including equations can be seen in

[1] Albert, J., Gentleman, R., Parmigiani, G., & Hornik, K. (2009). Bayesian computation with R. Bayesian Computation with R.

[2] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2008). Bayesian Data Analysis. Chapman Texts in Statistical Science Series.

The R code for these examples can be found here in the github repository.

The data we use here is marathon running times, and is from Ref [1], and can be found in the LearnBayes package.

Analytical Estimation:

We first use an analytical approach to estimate the parameters, and add some simulation to estimate the uncertainty. Details and equations can be seen around pages 62–68 of Ref [2].

### first we will use a conjugate analysis and calculate the parameters
### for the data analytically as shown in Gelman 2008 - see page 62 
### also see page 64 of bayesian computation with R
## We are not using any prior information, but to make this function more general 
## I am adding some prior information with 0s
## look at page 68 of Bayesian Data Analysis (Gelman) for the formulas
## (the original function and data-mean variable names were lost in
##  formatting; simpost and xbar below are stand-ins)
simpost = function(dat.grp){
  # prior variance
  sigma.0 = 0
  # prior observations
  k.0 = 0
  # prior degrees of freedom
  v.0 = k.0 - 1
  # prior mean
  mu.0 = 0
  # calculate conjugate posterior
  # number of observations in the data
  n = length(dat.grp)
  # prior observations + data points = posterior number of observations
  k.n = k.0 + n
  # posterior degrees of freedom
  v.n = v.0 + n
  # mean and sd for the data
  xbar = mean(dat.grp)
  s = sd(dat.grp)
  # posterior mean
  mu.n = (k.0/k.n * mu.0) + (n/k.n * xbar)
  # posterior var
  sigma.n = (( v.0*sigma.0 ) + ( (n-1)*(s^2) ) + ( (k.0*n/k.n)*((xbar-mu.0)^2) )) / v.n
  #post.scale = ((prior.dof * prior.scale) + (var(dat.grp) * (length(dat.grp) - 1))) / post.dof
  ## simulate posterior variance and conditional posterior mean
  sigma = (sigma.n * v.n)/rchisq(1000, v.n)
  mu = rnorm(1000, mu.n, sqrt(sigma)/sqrt(k.n))
  return(list(mu=mu, var=sigma))
}

We can call this function to simulate the posterior mean and variance, that fit within the parameter space under the constraints of the data and priors.

Grid Approximation

Another useful way to estimate the parameters, when analytical formulas are not available or practical, is a grid based approximation. The trick is first to define a function that returns one value and accepts the parameters and data that need tracking. The simplest such function is the likelihood function; however, we follow the strategy from [1] and [2] and use the log posterior to define the function.

## define a log posterior function
lp = function(theta, data){
  s = theta[2]
  m = theta[1]
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, sqrt(s), log=T))
  # flat prior contributes only a constant on the log scale
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

The function accepts two arguments: theta, a vector of parameters that need tracking/calculating, and the data. Together the data and theta define a parameter space within which we can ‘move’ and estimate the constrained/fitted theta. The other thing you may notice is that we are doing everything on the log scale, so this function returns the log posterior, or the log likelihood if you do not use any priors.

## define a discrete grid of parameters
Mu = seq(250, 302, by=0.5)
Sigma2 = seq(500, 9000, by=10)

We define the grid with a certain spacing, which can be adjusted depending on how much accuracy we want in our estimates. Once the grid is defined, one can use a double loop, or a shortcut like the outer command, to create a matrix where each cell contains the value of the function at that combination of grid points.

## calculate the log posterior at each grid point 
lm = matrix(NA, nrow=length(Mu), ncol = length(Sigma2))
for (r in seq_along(Mu)){
  for (c in seq_along(Sigma2)){
    s = c('mu'=Mu[r], 'sigma2'=Sigma2[c])
    lm[r, c] = lp(s, lData)
  }
}

# convert from log posterior 
# subtract max before exponentiating
lm = lm-max(lm)
lm = exp(lm)
# set total prob of grid to 1
lm = lm/sum(colSums(lm))

lm is the grid matrix filled inside this double loop. Once the matrix is filled, we follow the strategies from [1] and [2]: subtract the maximum before exponentiating (converting back from the log scale, without numerical underflow), and then normalise so the total grid probability is 1.
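The subtract-the-max step matters numerically. A small illustration in Python, using toy log posterior values purely for demonstration:

```python
import numpy as np

logp = np.array([-1000.0, -1001.0, -1002.0])  # toy log posterior values on a tiny grid

# naive exponentiation underflows to zero in double precision
naive = np.exp(logp)

# subtracting the max first preserves the relative probabilities
p = np.exp(logp - logp.max())
p = p / p.sum()
```

The naive version loses everything, while the shifted version keeps the correct ratios between grid points (here each point is a factor of e less likely than the previous one).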

Samples can be generated from this matrix as shown in the code below.

## take samples from this distribution
pSigma2 = colSums(lm)
pMu = rowSums(lm)

## add a uniform jitter centered at zero and half of grid spacing to make distribution continuous
sig2Sample = runif(1000, -1*10/2, 10/2) +  sample(size = 1000, Sigma2, replace = T, prob = pSigma2)
# for each sampled value of sig2Sample, get the posterior mu, conditioned on sig2Sample
muSample = rnorm(1000, mean(time), sqrt(sig2Sample)/sqrt(length(time)))
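The jitter-and-sample idea can be sketched in isolation; here is a Python version with a toy marginal (the grid and probabilities are hypothetical, with the same spacing of 10 used above):

```python
import numpy as np

rng = np.random.default_rng(7)

# toy discrete marginal over a grid for sigma^2, spacing of 10
step = 10.0
grid = np.arange(500.0, 9000.0, step)
p = np.exp(-0.5 * ((grid - 4000.0) / 800.0) ** 2)
p = p / p.sum()

# sample grid points, then add uniform jitter of half the grid
# spacing on each side to make the distribution continuous
draws = rng.choice(grid, size=5000, p=p) + rng.uniform(-step / 2, step / 2, size=5000)
```

The jitter is centred at zero, so it smooths the discrete draws without shifting their mean.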

Optimization based approach

This time we will explore the surface of the function using an optimizer, and calculate the variation using the hessian. As usual, the technical details can be found in reference [1].

Start by defining the log posterior function as before; the main difference this time is that we put the standard deviation parameter on the log scale.

lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$time # data from marathon times
  log.lik = sum(dnorm(d, m, s, log=T))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(time), 'sigma'=log(sd(time)))
lData = list('time'=time)
lp2(start, lData)

## try the laplace function from LearnBayes
fit = laplace(lp2, start, lData)
se = sqrt(diag(fit$var))

## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(time), exp(sigSample.op)/sqrt(length(time)))

I prefer the optimization based approach for multiparameter problems, and it is simpler to implement as long as you can provide sensible starting values. There are a couple of other methods I will cover at some point, like rejection sampling and importance sampling.
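The same mode-plus-hessian idea can be sketched in Python, with scipy standing in for the LearnBayes laplace function (the data here are simulated and hypothetical):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
d = rng.normal(280.0, 50.0, size=200)   # hypothetical times

def neg_log_post(theta):
    # sd kept on the log scale, as in the post, so the
    # optimizer works on an unconstrained parameter
    m, log_s = theta
    return -np.sum(stats.norm.logpdf(d, m, np.exp(log_s)))

start = np.array([d.mean(), np.log(d.std(ddof=1))])
fit = optimize.minimize(neg_log_post, start, method="BFGS")
mode = fit.x            # posterior mode
cov = fit.hess_inv      # inverse hessian approximates the posterior covariance
se = np.sqrt(np.diag(cov))
```

From here the sampling mirrors the R code: draw the log sd from a normal centred at its mode, then draw mu conditional on it.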

MCMC using RStan and Stan

The final method I will cover briefly uses Stan and an MCMC based approach. The model is very similar to the one defined in the function earlier, but is written in the Stan modelling language syntax.

data {
    int<lower=1> Ntotal; // number of observations
    real y[Ntotal]; // observed data
}
parameters {
    real mu; // posterior mean
    real<lower=0> sig; // posterior sd
}
model {
  y ~ normal(mu, sig);
}
I call this model from R using the rstan library, and extract the simulated data points from the Stan fit object.

Comparing All the Estimates

I plot a contour plot of the log posterior function and scatter the simulated points from each method over it, and they all agree pretty well. I also show density plots comparing the estimates of the posterior means and variances, and their uncertainty.