Plausible Reasoning for Scientific Problems: Belief Driven by Priors and Data.

Plausible reasoning requires constructing rational arguments by use of syllogisms, and analysing them with deductive and inductive logic. Using this method of reasoning, and expressing our beliefs in a scientific hypothesis numerically using probability theory, is one of my interests. In this post I try to condense the material from the first 4 chapters of [1] – starting with a very brief description of syllogisms, logic, probabilities and sampling distributions, and finishing with a problem of scientific hypothesis testing. I have used material from various sources, and some of it links with the previous posts on model comparisons and mixture models. The two books that can be consulted for more details are:

[1] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.

[2] Lavine, M. (2013). Introduction to Statistical Thought. Retrieved from http://www.stat.duke.edu/~michael/book.html

Syllogisms

Syllogisms are systematic ways of presenting logical arguments, and appeal to our intuitive reasoning abilities to draw conclusions. Usually we have two premises, a major and a minor, which lead to a conclusion.

Strong Syllogism

if\; A\; is\; true,\; then\; B\; is\; true

\frac{A\;is\; true }{\therefore B\; is\; true}

In the contrapositive form:

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; false }{\therefore A\; is\; false}

Weak Syllogism

In common scientific problems, we have to deal with weaker syllogisms. Following the example from [1]:

A ≡ It will rain at 10 am at the latest

B ≡ It will be cloudy before 10 am

if\; A\; is\; true,\; then\; B\; is\; true

\frac{B\;is\; true }{\therefore A\; becomes\; more\; plausible}

The major premise, if A then B, creates a logical environment in which B is a logical consequence of A (not the physical causal consequence). Verification of the consequences of A (in this logical environment) increases our confidence in A. Another form can be expressed as:

\frac{A\;is\; false }{\therefore B\; becomes\; less\; plausible}

One of the reasons for B being true has been eliminated, hence we feel less confident about B.

Boolean Algebra

The ideas of plausible reasoning can be represented as symbolic logic using Boolean algebra. The symbols and operations we will use are:

Symbol   Name              Other Names
AB       Logical Product   Conjunction, And, Intersection
A+B      Logical Sum       Disjunction, Or, Union

Given two propositions A and B, one is true if and only if the other is true – i.e. they have the same truth value. So if A is the necessary and sufficient condition for B, they are logically equivalent propositions.

Some propositions and their meanings include:

\bar A \equiv (A\; is\; false) \newline A \equiv (\bar A\; is\; false) \newline The\; relation\; between\; A\; and\; \bar A\; is\; reciprocal

The relationship between two propositions can be expressed as an implication

A \Rightarrow B \newline A\bar B \; is \; false \newline (\bar A + B) \; is \; true \newline A = AB

if A is true, then B is true; if B is false, then A is false; if A is false, it says nothing about B; if B is true, it says nothing about A.
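
A quick truth-table check in R (my own illustration, not from the references) confirms that these formulations of the implication are equivalent – the three derived columns below agree on every row:

tt = expand.grid(A=c(TRUE, FALSE), B=c(TRUE, FALSE))
tt$imp1 = !(tt$A & !tt$B)        # "A AND not-B" is false
tt$imp2 = !tt$A | tt$B           # "not-A OR B" is true
tt$imp3 = tt$A == (tt$A & tt$B)  # A = AB
tt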

Quantitative Rules

These rules are formally derived in Chapter 2 of Ref [1], which is a recommended read. I will just mention the rules briefly.

Product Rule

AB | C : A and B are true given C; broken down into parts it can be written as:

  • B | C is true
  • accepting B is true, decide that A is true A | BC

OR

  • A | C is true
  • accepting A is true, decide that B is true B | AC

In functional form this can be written as: (AB | C) = F[(B|C), (A|BC)]

With 3 propositions, like (ABC | D), we can break the problem down into parts, treating e.g. BC as a single proposition:

  1. (ABC | D)
  2. = (BC | D) (A | BCD)
  3. = (C | D) (B | CD) (A | BCD)

Sum Rule

The logical product A \bar A is always false, while the logical sum A + \bar A is always true.

Primitive sum rule: (A|B)+(\bar A| B) = 1

Extended or generalised sum rule: (A+B|C) = (A|C)+(B|C)-(AB |C)
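
As a quick numerical check of the generalised sum rule (my own toy example, with C being a single fair die roll, A = the roll is even and B = the roll is greater than 3):

x = 1:6                          # the six outcomes, equally likely given C
A = x %% 2 == 0                  # A: roll is even
B = x > 3                        # B: roll is greater than 3
mean(A | B)                      # (A+B|C) = 4/6
mean(A) + mean(B) - mean(A & B)  # (A|C) + (B|C) - (AB|C), also 4/6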

Qualitative Properties

The statements of Logic can be expressed in the form of product and sum rules.

A \Rightarrow B, \; where \; C \; is \; the \; major \; premise: \; C \equiv (A \Rightarrow B)

if A is true then B is true:
p(B \mid AC) = \frac{p(AB \mid C)}{p(A \mid C)}

if B is false then A is false:
p(A \mid \bar B C) = \frac{p(A \bar B \mid C)}{p(\bar B \mid C)}

if B is true then A becomes more plausible:
p(A \mid BC) = p(A \mid C)\frac{p(B \mid AC)}{p(B \mid C)}

if A is false then B becomes less plausible:
p(B \mid \bar A C) = p(B \mid C)\frac{p(\bar A \mid BC)}{p(\bar A \mid C)}
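
These rules can be checked numerically. The small R illustration below is my own example (two fair dice as the background information C, A ≡ the first die shows a 6, B ≡ the sum is at least 10) and shows the third rule at work – learning that B is true makes A more plausible:

S = expand.grid(d1=1:6, d2=1:6)            # the 36 equally likely outcomes given C
A = S$d1 == 6
B = S$d1 + S$d2 >= 10
mean(A)                                    # p(A|C)  = 1/6
mean(A & B) / mean(B)                      # p(A|BC) = 0.5, A has become more plausible
mean(A) * (mean(A & B)/mean(A)) / mean(B)  # p(A|C) p(B|AC)/p(B|C), the same value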

Elementary Sampling Theory

The rules we have available are the product and sum rules described earlier. Using the principle of indifference (just one of many such principles): if B is the background information and (H_{1}, H_{2}, \dots, H_{N}) are equally likely, mutually exclusive and exhaustive hypotheses, then

P(H_{i}\mid B) = \frac{1}{N} \; \; 1 \leq i \leq N

As an example:

  • B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
  • Ri ≡ Red ball on ith draw.
  • Wi ≡ White ball on ith draw.

For the first draw,

P(R_{1}\mid B) = \frac{M}{N} \newline P(W_{1}\mid B) = 1-\frac{M}{N}

These probability values should not be confused with physical properties of the urn; they represent our state of knowledge or information – before any actual ball is drawn. This state of knowledge changes when a new question is asked – what is the probability of red on the first two draws? (use the product rule)

P(R_{1} R_{2} \mid B)=P(R_{1} \mid B)P(R_{2} \mid R_{1}B) \newline P(R_{1} R_{2} \mid B)=\frac{M}{N}\frac{M-1}{N-1}

Working this way (you can see the full derivation in [1]) we can look at any sequence of draws – and if the question becomes: in how many ways can two red balls be drawn in three draws, then we are looking at the multiplicity of this event: R_{1}R_{2}W_{3}, R_{1}W_{2}R_{3}, W_{1}R_{2}R_{3},

which can be calculated using the binomial coefficient:

\binom{n}{r} = \frac{n!}{r!\,(n-r)!}

So the question can be posed as:

  • B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
  • A ≡ ‘r red balls in n draws, in any order’
  • P(A|B) ≡ P (r | N, M, n) ≡ Hypergeometric distribution (if we sample without replacement)
  •  P(A|B) ≡ P (r | N, M, n) ≡ Binomial distribution. (sample with replacement)

These probability distributions are called sampling distributions or direct probabilities: given some hypothesis (e.g. the contents M, N of the urn), what is the probability that we shall obtain some specified data D (e.g. some sequence of red and white balls)? These sampling distributions make predictions about potential observations – and if the correct hypothesis is known, then the predictions and the observed data will agree closely.
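
For example, with an urn of N = 10 balls of which M = 4 are red, the probability of r = 2 red balls in n = 3 draws can be computed directly in R under both sampling schemes (a small illustration of my own, using the built-in dhyper and dbinom functions):

N = 10; M = 4; n = 3; r = 2
## without replacement: hypergeometric, P(r | N, M, n)
dhyper(r, M, N-M, n)
## with replacement: binomial, probability of red is M/N on every draw
dbinom(r, n, M/N)
## the multiplicity of the event, i.e. the number of orderings of 2 reds in 3 draws
choose(n, r)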

As another example for calculating probabilities of events [2] let:

  • X ≡ set of outcomes e.g. rolling a six-sided die {1, 2, 3, 4, 5, 6}
  • Y ≡ subsets of X
  • μ : Y → [0, 1] where the function μ has the domain Y and has an image from 0 to 1.

We can determine the function μ using simulation, logic or experience. Following the example from [2] (the section on Basic Probability) – we want to calculate the probability of winning on the come-out roll in the game of craps (we need to roll a 7 or 11 using two six-sided dice).

P(win on come-out roll) = P(7) + P(11)

The example in the book [2] shows how to calculate this mathematically or via simulation.

## simulate one come-out roll of two dice: return 1 for a win (7 or 11), 0 otherwise
sim.game = function() {
  r = sample(1:6, 2, replace=T)
  if (sum(r) == 7 || sum(r) == 11) return(1) else return(0)
}
## repeat the game 1000 times and estimate the probability of winning
win = replicate(1000, sim.game())
sum(win)/1000
[1] 0.224
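
The exact value can also be computed by enumerating all 36 outcomes of the two dice (a short check of my own), and it is close to the simulated estimate above:

## enumerate the 36 equally likely outcomes of two dice and count 7s and 11s
s = outer(1:6, 1:6, "+")
sum(s %in% c(7, 11)) / 36
[1] 0.2222222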

Elementary Hypothesis Testing:

In a general scientific problem, we already have the data and a set of hypotheses, and we want to decide which hypothesis is more likely given this information. The problem can be written down as propositions:

X \equiv prior \; information \newline H \equiv hypothesis \newline D \equiv data

The joint parameter space is expanded using the product rule:

P(HD \mid X) = P(D \mid X) P(H \mid DX) = P(H \mid X) P(D \mid HX)

Performing some acrobatics, the equation becomes

P(H \mid DX) = P(H \mid X) \frac{P(D \mid HX)}{P(D \mid X)} \;\;\; ... \; Eq(1)

This is of course the famous Bayes theorem, and the left side of equation 1 is the posterior probability – I tend to think of this (as most times your hypothesis will consist of parameters) as the point in parameter space towards which your parameters converge, as they are constrained by the priors and the data. The factor P(D|HX) in equation 1 is a function with two names depending on the context: 1) the sampling distribution, when H is fixed and D varies; 2) the likelihood function L(H), when the data are fixed while H varies. Ref [2] has a nice section on likelihood functions and should be read along with the associated R code in the book.

I show a slightly modified example from Ref [1] (where this problem is solved in multiple ways):

  • X ≡ There are 15 machines, making widgets, and 10 of those machines make good quality widgets, 4 make average quality widgets and 1 is out of tune and makes mostly bad widgets. Each machine has a tuning parameter, which is the proportion of bad widgets, and we model that using a beta distribution.
  • M1 ≡ On average the machine produces bad widgets with a proportion of 0.16
  • M2 ≡ On average the machine produces bad widgets with a proportion of 0.49
  • M3 ≡ On average the machine produces bad widgets with a proportion of 0.83

We use equation 1, to estimate the posterior parameters for each hypothesis or model.

P(M_{i} \mid DX) = P(M_{i} \mid X) \frac{P(D \mid M_{i}X)}{P(D \mid X)}

We can treat this problem in various ways: using a parameter estimation approach, calculating 95% highest density intervals to assign a p-value, or a mixture distribution approach as shown in previous posts (e.g. here and here). However, the statement of the problem also shows that we can assign a prior probability or weight to each hypothesis. The code snippet below shows the priors for the 3 models and the prior weight for each model – the full code can be found in the github repository.

### define three machines
## model m1 - machine makes few defective widgets
m1 = function(th) dbeta(th, 1, 5, log = T)

## model m2 - machine makes an average proportion of defective widgets
m2 = function(th) dbeta(th, 5, 5, log = T)

## model m3 - machine makes mostly defective widgets
m3 = function(th) dbeta(th, 5, 1, log = T)

## define an array that represents the number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of as coming from a categorical distribution
mix.prior = c(m1=10/15, m2=4/15, m3=1/15)

As the data are generated, the posterior weight for each model changes relative to the others, depending on whether we see a bad widget (1) or a good widget (0). The results below show the posterior weights as the total number of good or bad widgets changes, along with the number of observations.

        m1   m2   m3 Data - 1=Bad
 [1,] 0.78 0.20 0.02            1
 [2,] 0.53 0.42 0.05            1
 [3,] 0.64 0.34 0.02            0
 [4,] 0.46 0.49 0.05            1
 [5,] 0.33 0.59 0.08            1
 [6,] 0.25 0.64 0.11            1
 [7,] 0.19 0.66 0.16            1
 [8,] 0.15 0.65 0.20            1
 [9,] 0.12 0.63 0.25            1
[10,] 0.09 0.60 0.30            1
[11,] 0.08 0.57 0.36            1
[12,] 0.06 0.53 0.41            1
[13,] 0.05 0.49 0.46            1
[14,] 0.04 0.44 0.51            1
[15,] 0.07 0.58 0.35            0
[16,] 0.09 0.67 0.24            0
[17,] 0.12 0.72 0.17            0
[18,] 0.14 0.74 0.12            0
[19,] 0.17 0.74 0.09            0
[20,] 0.15 0.75 0.11            1
[21,] 0.13 0.75 0.12            1
[22,] 0.12 0.74 0.14            1

The figure below shows the same numbers and how one hypothesis becomes more plausible while others become less as evidence accumulates – the points represent the prior weights for each hypothesis.

[Figure: posterior weights of the three hypotheses as evidence accumulates; the points show the prior weights]

Interestingly, hypothesis M3 is a very rare event, almost a dead hypothesis, that is resurrected [1] to a level where we start to consider it very plausible by the time we have 14 data points, as we had seen a lot of bad widgets. This example shows an elegant way to compare multiple hypotheses with respect to each other, with our belief driven in different directions by the prior information and the data. In a more complex model with many parameters, the denominator of equation 1, i.e. P(D|X), can be difficult to calculate, and I would perhaps try an MCMC-based approach in a single finite mixture model, rather than an optimisation-based approach.
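
As a rough sketch of this kind of sequential update (my own conjugate beta-binomial version, not the repository code, so it will not reproduce the exact numbers in the table above): the posterior weight of each machine is proportional to its prior weight times the marginal likelihood of the widgets seen so far, which is available in closed form because the beta prior is conjugate to the Bernoulli likelihood.

## beta parameters of the three machines, matching the dbeta calls above
alpha = c(1, 5, 5); beta.par = c(5, 5, 1)
mix.prior = c(m1=10/15, m2=4/15, m3=1/15)

## marginal likelihood of s bad and f good widgets under machine i (beta-binomial)
marg.lik = function(i, s, f) beta(alpha[i]+s, beta.par[i]+f) / beta(alpha[i], beta.par[i])

## illustrative data: 1 = bad widget, 0 = good widget
d = c(1, 1, 0, 1, 1, 1)
for (n in seq_along(d)){
  s = sum(d[1:n]); f = n - s
  w = sapply(1:3, marg.lik, s=s, f=f) * mix.prior
  print(round(w / sum(w), 2))   # posterior weights after n observations
}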

Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues on that theme and tries to model multiple data generating processes in a single model. The code for this post is available at the github repository. There were many useful resources that helped me understand this model; some are mentioned below, while others are mentioned in the previous post on this topic.

I use the R library flexmix for the sample data and for fitting the model, in addition to a Stan/MCMC-based approach. The data distribution is shown in the figure below, which suggests that there may be two data generation processes at work here. To start with, we try to find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see if our choice of distribution is appropriate. A previous post on posterior predictive checks covers this topic in more detail and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.

The figure in the left panel shows the density plot for the raw data, and the panel on the right shows the same data with a smattering of 200 predictive data sets from the fitted distribution – visually it appears to be fine, with the left component slightly higher in some cases. However, the aspects of the data that I want our distribution to capture are the mean, variance, maximum and minimum values, and the test PValues suggest that any variation can be explained by sampling variability.

> ivTestQuantities
variance  minimum  maximum     mean 
   0.380    0.155    0.300    0.420 

The first steps are to fit a mixture regression model with 2 components and one predictor using the flexmix package and then to repeat this using Stan.

> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2)

       prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215 

> ## fitted coefficients
> parameters(fit.flex)
                     Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795

The same model is now broken down into parts and fit using Stan – you can also write a log posterior function in R to do this, but I find that the optimiser tends not to explore the parameter space well if we track the mixture components as well. However, if we keep the mixture components fixed, it tends to converge to the same point as Stan.
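
For reference, here is a minimal sketch of such an R log posterior (my own illustration, with the mixture weight held fixed in the spirit of the simplification mentioned above and flat priors; the parameter layout is illustrative, not the repository's) that could be handed to an optimiser such as optim or laplace:

## log posterior for a 2-component normal mixture regression with fixed weights
mylogpost.mixreg = function(theta, data){
  b1 = theta[1:2]                         # intercept and slope, component 1
  b2 = theta[3:4]                         # intercept and slope, component 2
  s1 = exp(theta[5]); s2 = exp(theta[6])  # scales tracked on the log scale
  w = 0.5                                 # mixture weight kept fixed
  mu1 = b1[1] + b1[2] * data$x
  mu2 = b2[1] + b2[2] * data$x
  ## each observation comes from component 1 OR component 2
  log.lik = sum(log(w * dnorm(data$y, mu1, s1) + (1-w) * dnorm(data$y, mu2, s2)))
  log.prior = 1                           # flat priors, as in the other examples
  return(log.lik + log.prior)
}
## e.g. optim(c(0, 5, 30, 0, log(3), log(8)), mylogpost.mixreg,
##            control=list(fnscale=-1), data=list(x=NPreg$x, y=NPreg$yn))

The Stan fit below, in contrast, also samples the mixture weights (iMixWeights).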

> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=2000.

                   mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul  4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Comparing the results from the two procedures, i.e. flexmix and the Stan model, suggests a nice convergence between the two methods.

Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (on current data) values. There are two ways to get this: either get a predicted value for each component, or get an aggregate predicted value across the components.

> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1])
> iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with fit.flex
> head(f)
       Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
> head(cbind(iPred1, iPred2))
       [,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101

However, getting one aggregate predicted value for the response variable may be more appropriate in certain settings, and will require us to take a weighted average of the components (the weights determined by the mixing probabilities).

> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
> head(dfPredicted)
      stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close

Comparing the aggregate values from the two models shows that both have produced similar results, and the mean squared errors in each case are also similar.

Model Checking: Posterior Predictive Checks

Once a model is fit and the parameters estimated, we want to look at how well the model explains the data and which aspects of the data generation process in nature are not captured by the model. Most of the material covered in this post follows the examples from:

[1] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science).

The example data I use here is Simon Newcomb's experiment to measure the speed of light. The R source code and the data are present in the github repository. Import the data and define a model function, i.e. a log posterior function, to model this data.

## define a log posterior function
lp = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  d = data$vector # observed data vector
  log.lik = sum(dnorm(d, m, s, log=T))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

Here we assume the data can be modelled using a normal distribution and estimate the parameters.

## try the laplace function from LearnBayes
library(LearnBayes)
## start holds the starting values and lData the data list (the measurements are
## in lData$vector) - both are set up in the full code in the github repository
fit = laplace(lp, start, lData)
se = sqrt(diag(fit$var)) # approximate posterior standard errors from the laplace fit

## lets take a sample from this
sigSample.op = rnorm(1000, fit$mode['sigma'], se['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

#### the posterior interval reported on P 67 Gelman [2013] is
## y.bar ± 1.997*s/sqrt(66) = [23.6, 28.8]
## compare with our intervals
fit$mode['mu']-1.96*se['mu']; fit$mode['mu']+1.96*se['mu']
##      mu 
## 23.63911 
##      mu 
## 28.78365 
quantile(muSample.op, c(0.025, 0.975))
##     2.5%    97.5% 
## 23.61960 28.83061 

The figure below suggests that the normal model may not be appropriate for this data, as there is an extreme outlier measurement around -44.

[Figure: histogram of Newcomb's speed of light measurements]

There are various approaches to model checking e.g.:

  • Using external validation: use data to fit the model (i.e. estimate parameters), make predictions about the new or future data using the model and parameters, collect new real data and compare with the predictions.
  • Recycle existing data: if new data is not available. This also involves other modifications to model fitting, like cross-validation techniques. I will talk about this further in a future post.

How do we compare the predictions with the real data? We define some metrics to compare against, which are called Test Quantities.

Prediction of New/Future Data (Posterior Predictive Distribution)

New data or future data can be generated using simulation, from the model and estimated parameters.

## POSTERIOR PREDICTIVE CHECKING
## observed data should look plausible under the posterior predictive distribution
## Draw simulated values from the joint posterior of Yrep and compare to Yobs and look for systematic differences

## Gelman [2013] P 144 -
## sample 66 values, 20 times, each time drawing a fresh draw of sd and mean from the joint posterior
mDraws = matrix(NA, nrow = 66, ncol=20)

for (i in 1:20){
  p = sample(1:1000, size = 1)
  s = exp(sigSample.op[p])
  m = muSample.op[p]
  mDraws[,i] = rnorm(66, m, s)
}

In the example shown above we sample the standard deviation parameter, and then sample the mean parameter (conditioned on the standard deviation – notice we use the same index p). We draw 66 samples and repeat the process 20 times; this is akin to repeating the experiment 20 times and each time taking 66 measurements. The figure below shows some of the histograms of the 20 simulations and the original data.

[Figures: histograms of the original data and of the 20 replicated data sets]

It appears that the replications do not include the outlier measurement at -44, which suggests that our current model does not capture that part of the data generation process. Graphical checks can be a bit tedious in high-throughput settings, so we can also use some quantitative checks by defining Test Quantities.

Test Quantities

A test quantity is a function of the original data and replicated data, with some optional additional parameters. We can evaluate the discrepancy between the test quantities calculated on the original data and on the replicated data by computing a PValue, and watch for extreme tail-area PValues. We define 5 test quantities in the current case, representing: variance, mean, symmetry, minimum and maximum.

# The procedure for carrying out a posterior predictive model check requires specifying a test
# quantity, T (y) or T (y, θ), and an appropriate predictive distribution for the replications
# y rep [Gelman 2008]
## variance
T1_var = function(Y) return(var(Y))

## is the model adequate except for the extreme tails
T1_symmetry = function(Y, th){
  Yq = quantile(Y, c(0.90, 0.10))
  return(abs(Yq[1]-th) - abs(Yq[2]-th))
} 

## min quantity
T1_min = function(Y){
  return(min(Y))
} 

## max quantity
T1_max = function(Y){
  return(max(Y))
} 

## mean quantity
T1_mean = function(Y){
  return(mean(Y))
} 

## calculate bayesian p-value for this test statistic
getPValue = function(Trep, Tobs){
  left = sum(Trep <= Tobs)/length(Trep)
  right = sum(Trep >= Tobs)/length(Trep)
  return(min(left, right))
}
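
For example, the PValue for the variance quantity can be computed by applying it to each replicated data set in mDraws and comparing with the observed data (an illustrative usage, assuming mDraws and lData are as defined above):

## test quantity on each replicated data set and on the observed data
## (in practice one would use many more than 20 replications)
t.rep = apply(mDraws, 2, T1_var)
t.obs = T1_var(lData$vector)
getPValue(t.rep, t.obs)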

Extreme PValues (typically less than 0.01) for the test quantities suggest areas of failure for the model which can be addressed by expanding the model, or ignored if appropriate for the applied question at hand. “The relevant goal is not to answer the question, ‘Do the data come from the assumed model?’ (to which the answer is almost always no), but to quantify the discrepancies between data and model, and assess whether they could have arisen by chance, under the model’s own assumptions.” [1].

Under the normal distribution model, the 5 test quantities show the following PValues.

> mChecks[,'Normal']
Variance Symmetry      Max      Min     Mean 
    0.47     0.16     0.01     0.00     0.47 

The tests for the Min and Max quantities fail, suggesting that this model does not perform well in the tail regions. Perhaps using a heavy-tailed distribution, like a T distribution with low degrees of freedom or a contaminated normal distribution, will be more useful.

## define a second log posterior function for mixture with contaminated normal distribution
lp2 = function(theta, data){
  # we define the sigma on a log scale as optimizers work better
  # if scale parameters are well behaved
  s = exp(theta[2])
  m = theta[1]
  mix = 0.9
  cont = theta[3]
  d = data$vector # observed data vector
  log.lik = sum(log(dnorm(d, m, s) * mix + dnorm(d, m, s*cont) * (1-mix)))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'cont'=1)
lp2(start, lData)

## try the laplace function from LearnBayes
fit2 = laplace(lp2, start, lData)
fit2
se2 = sqrt(diag(fit2$var))

sigSample.op = rnorm(1000, fit2$mode['sigma'], se2['sigma'])
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))

A contaminated normal distribution is a mixture distribution where we add 2 more parameters to the model: the mixing probability and the contamination parameter. We fix the mixing probability at 0.9 and track the contamination parameter.

Here are a couple of web links if you are more interested in contaminated normal distributions (Link 1, Link 2).
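
To get a feel for the shape, here is a tiny illustration of my own (with made-up values: unit scale, contamination parameter 3 and mixing weight 0.9) comparing the tail mass of a contaminated normal with that of a plain normal:

## two-sided tail probability P(|X| > q) for the contaminated normal
p.tail = function(q, s=1, cont=3, mix=0.9)
  2 * (mix * pnorm(-q, 0, s) + (1-mix) * pnorm(-q, 0, s*cont))
2 * pnorm(-3)   # plain normal: ~0.003
p.tail(3)       # contaminated normal: ~0.034, much heavier tails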

The third distribution we try is a T distribution with small degrees of freedom. We use a slightly different parameterization of the t-distribution which is nicely explained in another blog here.

lp3 = function(theta, data){
  # function to use to use scale parameter
  ## see here https://grollchristian.wordpress.com/2013/04/30/students-t-location-scale/
  dt_ls = function(x, df, mu, a) 1/a * dt((x - mu)/a, df)
  ## likelihood function
  lf = function(dat, pred){
    return(log(dt_ls(dat, nu, pred, sigma)))
  }
  nu = exp(theta['nu']) ## normality parameter for t distribution
  sigma = exp(theta['sigma']) # scale parameter for t distribution
  m = theta[1]
  d = data$vector # observed data vector
  if (nu < 1) return(-Inf) # nu is already exp(theta['nu']), so compare it to 1 directly
  log.lik = sum(lf(d, m))
  log.prior = 1
  log.post = log.lik + log.prior
  return(log.post)
}

# sanity check for function
# choose a starting value
start = c('mu'=mean(ivTime), 'sigma'=log(sd(ivTime)), 'nu'=log(2))
lp3(start, lData)

op = optim(start, lp3, control = list(fnscale = -1), data=lData)
op$par
exp(op$par[2:3])

## try the laplace function from LearnBayes
fit3 = laplace(lp3, start, lData)
fit3
se3 = sqrt(diag(fit3$var))

sigSample.op = rnorm(1000, fit3$mode['sigma'], se3['sigma'])
nuSample = exp(rnorm(1000, fit3$mode['nu'], se3['nu']))
# threshold the sample values to above or equal to 1
nuSample[nuSample < 1] = 1

## generate random samples from alternative t-distribution parameterization
## see https://grollchristian.wordpress.com/2013/04/30/students-t-location-scale/
rt_ls <- function(n, df, mu, a) rt(n,df)*a + mu
muSample.op = rnorm(1000, mean(lData$vector), exp(sigSample.op)/sqrt(length(lData$vector)))
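
A sketch of how replicated data sets could then be drawn under the t model (mirroring the loop used for the normal model earlier, and assuming the samples defined above):

## draw 20 replicated data sets of 66 measurements from the fitted t model
mDraws.t = matrix(NA, nrow=66, ncol=20)
for (i in 1:20){
  p = sample(1:1000, size=1)   # a fresh joint draw of nu, sigma and mu
  mDraws.t[,i] = rt_ls(66, nuSample[p], muSample.op[p], exp(sigSample.op[p]))
}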

Using the 3 approaches we calculated the test quantities and the PValues for these quantities.

> round(mChecks, 2)
         Normal NormalCont    T
Variance   0.47       0.34 0.34
Symmetry   0.16       0.10 0.14
Max        0.01       0.06 0.18
Min        0.00       0.21 0.14
Mean       0.47       0.48 0.47

We can see that the normal model does not seem to fit the data well in terms of outliers, while the contaminated normal and T distributions both fit the data better.

The figure below shows the histogram of the data and a smattering of density lines from the 200 posterior predictive replications of the data. We can see that the normal model does not fit well in the tails, while the other two show a reasonable fit – at least any discrepancy can be explained by sampling variation, according to the PValues.

[Figure: histogram of the data with density lines from 200 posterior predictive replications under the three models]

While researching the material for this blog I also came across the post about this data set and extreme outliers in data sets at Andrew Gelman’s blog.

Mixture Distributions and Model Comparison

The following text and code snippets show examples from two books on Bayesian Data Analysis:

[1] Kruschke, J. K. (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, Second Edition. Academic Press.

[2] Albert, J. (2009). Bayesian Computation with R. Springer. http://doi.org/10.1007/978-0-387-92298-0

The examples I refer to here can be seen on Page 50 of Ref [2] (Mixtures of conjugate priors) and Chapter 10 of Ref [1] (Model Comparison). I try to connect the two ideas presented there, where two independent models are either modelled together as a mixture distribution, or separately and then compared.

Problem:

We can imagine this problem as two machines or two coins that have a parameter theta (θ) – in the case of coins, the proportion of heads. Coin A is biased to produce more tails, Coin B is biased to produce more heads. We are presented with some data, i.e. a number of heads and a number of tails.

  1. Which coin generated this data?
  2. What is the distribution of the parameter theta before and after observing the data?

The R code for this analysis and the figures can be found here.

Following the example from ref [2], let's define the two distributions:

### define two models that explain the data
## model m1
## it is a beta distribution with a weight on the left side
g1 = function(theta) dbeta(theta, 6, 14)

## model m2
## it is a beta distribution with a weight on the right side
g2 = function(theta) dbeta(theta, 14, 6)

We assign a weight to each model, i.e. a prior probability for each model, and this can be used to generate the mixture distribution.

## define an array that represents number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of coming from a categorical distribution 
mix.prior = c(50, 50)
mix.prior = mix.prior/sum(mix.prior)

## define a joint prior space for these 2 models
## (mix.prior[1] AND g1) OR (mix.prior[2] AND g2)
g.mix = function(theta) mix.prior[1]*g1(theta) + mix.prior[2]*g2(theta)

This function represents the prior distribution or prior beliefs about how the parameter theta for each coin is distributed.

[Figures: prior densities for the two models, and the resulting mixture prior]

The observed data shows 7 heads and 3 tails. We now define the sampling distribution/likelihood functions and posterior distributions for each model. The posterior distribution represents how much parameter theta changes for each model after observing the data. I have used the word conjugate here, which I will explain at a later time – for the time being think of it as a simple analytical way to calculate the posterior.

## we flip a coin, perform a binary experiment 
## our prior beliefs are that the experiment tends to follow two extreme models
## we will either see more 1s or more 0s, depending on which model (m1 or m2) is 
## more representative of the data
data = c(success=7, fail=3)

## the posterior is proportional to Likelihood * Prior
## P(Data | Theta, Model) X P(Theta, Model)

## define the Likelihood function, both models share the same likelihood functional form
lik = function(data, theta) dbinom(data['success'], sum(data), theta)

## since this is a conjugate analysis, as prior is beta distributed and likelihood is binomial
## the posterior can be derived analytically 
## using a similar analogy as before when describing priors for each model
## we can define a posterior for each of the 2 models
g1.post = function(theta, data) dbeta(theta, 6+data['success'], 14+data['fail'])
g2.post = function(theta, data) dbeta(theta, 14+data['success'], 6+data['fail'])

We have assigned a prior mixture probability in the variable mix.prior; however, we need to calculate the posterior mixing probability, which requires some algebra, and I have tried to show that in the code below. For details you should read the first few pages of Chapter 10 in ref [1].

In order to calculate the mixture probability, i.e. the probability of each model m1 or m2 after we see the data, we factor the joint probability P(Data, Model[1]) in two ways:

P(Model[1] | Data) X P(Data) = P(Data | Model[1]) X P(Model[1])
P(Model[1] | Data) = P(Data | Model[1]) X P(Model[1]) / P(Data) … Equation (1)

where P(Data) can be expanded using summation across all models:
P(Data) = Sum over each Model[i] of P(Data, Model[i])

We need to calculate a few things:
P(Data | Model[i]) – this is the prior predictive distribution for the data given the selected model. So for each model we solve this equation:

P(Data, Theta) = P(Data | Theta) X P(Theta)

P(Data) = P(Data | Theta) X P(Theta) / P(Theta | Data)
In words this means: Prior predictive distribution for Data = Likelihood X Prior / Posterior

## Prior predictive probability for Data = Likelihood X Prior / Posterior
## for model 1
data.prior.g1 = function(data, theta){
  ret = lik(data, theta) * g1(theta) / g1.post(theta, data)
  return(ret)
}
## for model 2
data.prior.g2 = function(data, theta){
  ret = lik(data, theta) * g2(theta) / g2.post(theta, data)
  return(ret)
}

## P(Data | Model) for each model should be the same for any value of theta
## you can use that as a sanity check
th = seq(0.01, 0.99, by=0.01)
data.g1 = mean(data.prior.g1(data, th))
data.g2 = mean(data.prior.g2(data, th))

We now have the necessary quantities to solve equation 1.

## P(Data) = (P(Data | Model[1]) X P(Model[1])) + (P(Data | Model[2]) X P(Model[2])
## we can calculate the posterior for Model 1
## P(Model[1] | Data) = P(Data | Model[1]) X P(Model[1]) / P(Data)
mix.post = data.g1 * mix.prior[1] / (data.g1 * mix.prior[1] + data.g2 * mix.prior[2])
mix.post = c(mix.post, 1-mix.post)

## (mix.post[1] AND g1.post) OR (mix.post[2] AND g2.post)
g.mix.post = function(theta, data) mix.post[1]*g1.post(theta, data) + mix.post[2]*g2.post(theta, data)

The figures below show the posterior distribution of theta for each of the models of the data i.e. head biased and tail biased coins.

[Figures: posterior densities for the two models, and the posterior mixture distribution]

You can see that the posterior distribution of theta for model one has shifted more to the right (towards higher values of theta), as the data has influenced this shift. The mixture distribution has also shifted most of its weight towards the right with a very slight bump around 0.4.

We can now approximate this distribution of theta on a grid, and take a random sample from this distribution.

## approximate the posterior theta on the grid
p.th = g.mix.post(th, data)
p.th = p.th/sum(p.th)
th.sam = sample(th, 10000, replace = T, prob=p.th)
th.sam = th.sam + runif(10000, 0.01/2 * -1, 0.01/2)
summary(th.sam);
##   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.2041  0.6224  0.6922  0.6748  0.7512  0.9342 

I will add another post on grid approximations at a later point. For the time being, this is a useful method to generate random samples from a distribution and this sample can be used to calculate various statistics or parameters of the distribution.
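
For example, posterior summaries such as tail probabilities or credible intervals can be read straight off the sample (an illustrative use of th.sam):

## posterior probability that theta exceeds 0.5, and a 95% credible interval
mean(th.sam > 0.5)
quantile(th.sam, c(0.025, 0.975))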

We will also approximate this using another technique, where we first define a function that takes one or more parameters as input (defining the parameter space) and the data, and returns one value, the log posterior at that position in the parameter space. The code below defines two functions, one for each model, and uses an optimiser to explore the surface of each function to find its maximum.

library(LearnBayes)
library(car)
logit.inv = function(p) {exp(p)/(exp(p)+1) }

mylogpost_m1 = function(theta, data){
  ## theta contains parameters we wish to track
  th = logit.inv(theta['theta'])
  suc = data['success']
  fail = data['fail']
  
  # define likelihood function
  lf = function(s, f, t) return(dbinom(s, s+f, t, log=T))
  
  # calculate log posterior
  val = lf(suc, fail, th) + dbeta(th, 6, 14, log = T)
  return(val)
}

mylogpost_m2 = function(theta, data){
  ## theta contains parameters we wish to track
  th = logit.inv(theta['theta'])
  suc = data['success']
  fail = data['fail']
  
  # define likelihood function
  lf = function(s, f, t) return(dbinom(s, s+f, t, log=T))
  
  # calculate log posterior
  val = lf(suc, fail, th) + dbeta(th, 14, 6, log = T) 
  return(val)
}

# choose sensible starting values for the optimiser
start = c(theta=logit(0.5))
data = c(success=7, fail=3)

fit_m1 = laplace(mylogpost_m1, start, data)
fit_m2 = laplace(mylogpost_m2, start, data)

logit.inv(fit_m1$mode)
##   theta 
##0.428616 
logit.inv(fit_m2$mode)
##   theta 
##0.7142694 

The values of theta at the maximum of each function are similar to those calculated using the grid approximation.

[Figure: histogram of the grid-approximation sample, the mixture posterior density, and the maxima for the two models]

The figure above shows the histogram of the random sample generated using grid approximation, the line represents the value of the mixture posterior g.mix.post, and the red  and green lines represent the maximum points for the 2 models calculated using an optimiser.

Bayes Factor:

Simply put, the Bayes Factor (BF) indicates how much the prior odds on the two models change after seeing the data. I.e. how much evidence there is for model 1 vs model 2. As usual the technical details can be found in the references 1 and 2.

## Bayes factor for the ratios of posterior predictive distribution
## of the 2 models
## P(Data | Model[1]) / P(Data | Model[2])
BF = data.g1 / data.g2
## OR posterior odds for the models / prior odds for the 2 models
mix.post[1]/mix.post[2]/(mix.prior[1]/mix.prior[2])

The BF calculated analytically is 0.102 for model 1 relative to model 2. The general convention for a discrete decision about the models is that there is significant evidence for model 1 when BF > 3, and for model 2 when BF < 1/3 (i.e. 0.33). In our case the BF is about 0.1, which shows significant support for model 2, suggesting that the data was generated from model 2 rather than model 1.

The BF can also be approximated using the optimisation approach shown earlier, where the function laplace returns an approximation to the log of P(Data | Model) in the slot $int.

> round(exp(fit_m1$int - fit_m2$int), 2)
## BF  
##   0.09 

This is very close to the BF of 0.1 calculated analytically. Both analyses suggest that the data was generated by Coin or Machine B.