## Plausible Reasoning for Scientific Problems: Belief Driven by Priors and Data.

Plausible reasoning requires constructing rational arguments by use of syllogisms, and their analysis by deductive and inductive logic. Using this method of reasoning and expressing our beliefs, in a scientific hypothesis, in a numerical manner using probability theory is one of my interest. I try to condense the material from the first 4 chapters of [1] in this post – starting with a very brief description of syllogisms, logic, probabilities, sampling distributions and finishing up with a problem of scientific hypothesis testing. I have used the material from various sources and some of it links with the previous posts on model comparisons and mixture models. The two books that can be looked at for more details are:

[1] Jaynes, E. T. (2003). Probability theory the logic of science. Cambridge University Press (2003)

## Syllogisms

Syllogisms are systematic ways of presenting logical arguments, and appeal to our intuitive reasoning abilities to draw conclusions. Usually we have two premises Major, Minor and that leads to a conclusion.

#### Strong Syllogism

$A\; is\; true,\; then\; B\; is\; true$

$\frac{A\;is\; true }{\therefore B\; is\; true}$

In the inverse form:

$A\; is\; true,\; then\; B\; is\; true$

$\frac{B\;is\; false }{\therefore A\; is\; false}$

#### Weak Syllogism

In common scientific problems, we have to deal with weaker syllogisms. Following the example from [1]:

A ≡ It will rain at 10 am at the latest

B ≡ It will be cloudy before 10 am

$A\; is\; true,\; then\; B\; is\; true$

$\frac{B\;is\; true }{\therefore A\; becomes\; more\; plausible}$

The major premise if A then B, creates a logical environment, where B is a logical consequence of A (not the physical causal consequence). Verification of the consequences of A (in this logical environment) increases our confidence in A. Another form can be expressed as:

$\frac{A\;is\; false }{\therefore B\; becomes\; less\; plausible}$

one of the reasons for B being true has been eliminated, hence we feel less confident about B.

## Boolean Algebra

The ideas of plausible reasoning can be represented as symbolic logic using Boolean algebra. The symbols and operations we will use are

Symbol Name Other Names
AB Logical Product Conjunction
And
Intersection
A+B Logical Sum Disjunction
Or
Union

Given two propositions A and B, one is true if an only if the other is true – i.e. they have same truth value. So if A is the necessary and sufficient condition for B, they are logically equivalent propositions.

Some propositions and their meanings include:

$\bar A \equiv A\; is\; false \newline Relation \;between \; A\;, \bar A \; is \; reciprocal \newline A = \bar A \; is\; false$

The relationship between two propositions can be expressed as an implication

$A \Rightarrow B \newline A\bar B \; is \; false \newline (\bar A + B) \; is \; false \newline A = AB \newline if \;A \; is \; true \; then \; B \; is \; true\newline if \;B \; is \; false \; then \; A \; is \; false\newline if \;A \; is \; false \; says \; nothing \; about \; B\newline if \;B \; is \; true \; says \; nothing \; about \; A\newline$

### Quantitative Rules

These rules have been formally derived in Chapter 2 Ref [1] and it is a recommended read. I will just mention the rules briefly.

#### Product Rule

AB | C : A and B are true given C; broken down into parts it can be written as:

• B | C is true
• accepting B is true, decide that A is true A | BC

OR

• A | C is true
• accepting A is true, decide that B is true B | AC

In functional form this can be written as: (AB | C) = F[(B|C), (A|BC)]

With 3 propositions, like (ABC | D), we can break the problem down into parts and use e.g. BC as a single proposition:

1. (ABC | D)
2. (BC | D) (A | BCD)
3. (C | D) (B|CD) (A | BCD)

#### Sum Rule

The logical product of $A \bar A$ is always false, while logical sum is $A + \bar A$ is always true.

Primitive sum rule: $(A|B)+(\bar A| B) = 1$

Extended or generalised sum rule: $(A+B|C) = (A|C)+(B|C)-(AB |C)$

### Qualitative Properties

The statements of Logic can be expressed in the form of product and sum rules.

$A \Rightarrow B \newline C \;is \;the \;major \;premise \newline C \equiv A \Rightarrow B \newline if \;A \;is \;true \;then \;B \;is \;true \newline p(B|AC) = \frac{p(AB|C)}{p(A|C)}\newline \newline if \;B \;is \;false \;then \;A \;is \;false \newline p(A\mid \bar B C) = \frac{p(A\bar B \mid C)}{p(\bar B \mid C)}\newline \newline if \;B \;is \;true \;then \;A \;becomes \;more \;plausible \newline p(A\mid B C) = p(A \mid C)\frac{p(B \mid AC)}{p(B \mid C)}\newline \newline if \;A \;is \;false \;then \;B \;becomes \;less \;plausible \newline p(B\mid \bar A C) = p(B \mid C)\frac{p(\bar A \mid BC)}{p(\bar A \mid C)}\newline$

## Elementary Sampling Theory

The rules we have available are the product and sum rules described earlier, and using the principle of indifference (just one of the many principles), if B is the background information while $(H_{1}, H_{2}, .. H_{N})$ are equally likely mutually exclusive and exhaustive hypotheses then

$P(H_{i}\mid B) = \frac{1}{N} \; \; 1 \leq i \leq N$

As an example:

• B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
• Ri ≡ Red ball on ith draw.
• Wi ≡ White ball on ith draw.

For the first draw,

$P(R_{1}\mid B) = \frac{M}{N} \newline P(W_{1}\mid B) = 1-\frac{M}{N}$

These probability values should not be confused with physical properties of the urn, but are representative of state of knowledge or information – before any actual ball is drawn. This state of knowledge changes changes when a new question is asked – what is the probability of red on first two draws? (use the product rule)

$P(R_{1} R_{2} \mid B)=P(R{1} \mid B)P(R{2} \mid R{1}B) \newline P(R_{1} R_{2} \mid B)=\frac{M}{N}\frac{M-1}{N-1}$

Working this way (you can see the full derivation in [1]) we can look at any sequence of drawing – and if the question becomes: how many ways two red balls can be drawn in three draws, then the we are looking at the multiplicity of this event $R_{1}R_{2}W_{3}, R_{1}W_{2}R_{3}, W_{1}R_{2}R_{3}$

which can be calculated using the binomial coefficient

So the question can be posed as:

• B ≡ Urn with N balls, with M red balls  and (N-M) white balls.
• A ≡ ‘r red balls in n draws, in any order’
• P(A|B) ≡ P (r | N, M, n) ≡ Hypergeometric distribution (if we sample without replacement)
•  P(A|B) ≡ P (r | N, M, n) ≡ Binomial distribution. (sample with replacement)

These probability distributions are called sampling distributions or direct probabilities: Given some hypothesis (e.g. contents M, N of the urn), what is the probability that we shall obtain some specified data D (e.g. some sequence of red and white balls). These sampling distributions make predictions about potential observations – and if the correct hypothesis is known then the predictions and observed data will agree closely.

As another example for calculating probabilities of events [2] let:

• X ≡ set of outcomes e.g. rolling a six-sided die {1, 2, 3, 4, 5, 6}
• Y ≡ subsets of X
• μ : Y → [0, 1] where the function μ has the domain Y and has an image from 0 to 1.

We can determine the function μ using simulation, logic or experience. Following the example from [2] (the section on Basic Probability) – we want to calculate the probability of winning on the come-out roll in the game of craps (we need to roll a 7 or 11 using two six-sided die.

P(win on come-out roll) = P(7) + P(11)

The example in the book [2] shows how to calculate this mathematically or via simulation.

sim.game = function() {
r = sample(1:6, 2, replace=T)
if (sum(r) == 7 || sum(r) == 11) return(1) else return(0)
}
win = replicate(1000, sim.game())
sum(win)/1000
[1] 0.224

## Elementary Hypothesis Testing:

In a general scientific problem, we already have the data and a set of hypotheses and want to decide which hypothesis is more likely when looking at the information. The problem can be written down as propositions:

$X = prior \; information \newline H = hypothesis \newline D = data \newline \newline The \; joint\; parameter\; space\; is\; expanded\; using\; product\; rule\newline P(HD\mid X) = P(D \mid X) P(H \mid DX) = P(H \mid X) P(D \mid HX)\newline Performing\; some\; acrobatics,\; the\; equation\; becomes\newline P(H \mid DX) = P(H \mid X) \frac{P(D \mid HX)}{P(D \mid X)} .. Eq(1)$

This is of course the famous Bayes theorem, and the left side of equation 1 is the posterior probability – I tend to think of this (as most times your hypothesis will consist of parameters) as the point in the vector space where your parameters converge when they are being restricted by the priors and the data. The factor in equation 1, P(D|HX) is a function with 2 names depending on the context: 1) sampling distribution when H is fixed and D varies; 2) likelihood function L(H) when the data is a fixed parameter while H changes. Ref [2] has a nice section on likelihood functions and should be read with the associated R code in the book.

I show a slightly modified example from Ref [1] (where this problem is solved in multiple ways)

• X ≡ There are 15 machines, making widgets, and 10 of those machines make good quality widgets, 4 make average quality widgets and 1 is out of tune and makes mostly bad widgets. Each machine has a tuning parameter, which is the proportion of bad widgets, and we model that using a beta distribution.
• M1 ≡ On average the machine produces bad widgets with a proportion of 0.16
• M2 ≡ On average the machine produces bad widgets with a proportion of 0.49
• M3 ≡ On average the machine produces bad widgets with a proportion of 0.83

We use equation 1, to estimate the posterior parameters for each hypothesis or model.

We can treat this problem in various ways, using a parameter estimation approach, calculating 95% high density intervals to assign a p-value , a mixture distribution approach as shown in previous posts e.g. here and here. However the statement of the problem also shows us that we can assign a prior probability or weight to each hypothesis. The code snippet below shows the priors for the 3 models and the prior weights for each model – the full code can be found in the github repository.

### define three machines
## model m1 - machine makes less defective widgets
m1 = function(th) dbeta(th, 1, 5, log = T)

## model m2 - machine makes between average defective widgets
m2 = function(th) dbeta(th, 5, 5, log = T)

## model m3 - machine makes almost everything bad
m3 = function(th) dbeta(th, 5, 1, log = T)

## define an array that represents number of models in our parameter space
## each index has a prior weight/probability of being selected
## this can be thought of coming from a categorical distribution
mix.prior = c(m1=10/15 ,m2= 4/15 ,m3= 1/15)

As the data is generated, the posterior weight for each model changes with respect to each other, depending on if we see a bad widget (1) or a good widget (0). The results below show the posterior weights as the total number of good or bad widgets change along with the number of observations.

m1   m2   m3 Data - 1=Bad
[1,] 0.78 0.20 0.02            1
[2,] 0.53 0.42 0.05            1
[3,] 0.64 0.34 0.02            0
[4,] 0.46 0.49 0.05            1
[5,] 0.33 0.59 0.08            1
[6,] 0.25 0.64 0.11            1
[7,] 0.19 0.66 0.16            1
[8,] 0.15 0.65 0.20            1
[9,] 0.12 0.63 0.25            1
[10,] 0.09 0.60 0.30            1
[11,] 0.08 0.57 0.36            1
[12,] 0.06 0.53 0.41            1
[13,] 0.05 0.49 0.46            1
[14,] 0.04 0.44 0.51            1
[15,] 0.07 0.58 0.35            0
[16,] 0.09 0.67 0.24            0
[17,] 0.12 0.72 0.17            0
[18,] 0.14 0.74 0.12            0
[19,] 0.17 0.74 0.09            0
[20,] 0.15 0.75 0.11            1
[21,] 0.13 0.75 0.12            1
[22,] 0.12 0.74 0.14            1

The figure below shows the same numbers and how one hypothesis becomes more plausible while others become less as evidence accumulates – the points represent the prior weights for each hypothesis.

Interestingly hypothesis M3 is a very rare event, or almost a dead hypothesis, that is resurrected [1] to a level where we start to consider it very plausible by the time we have 16 data points, as we had a lot of  bad widgets. This example shows an elegant way to compare multiple hypotheses with respect to each other, and our belief is driven by the prior information and the data in different directions. In a more complex model with many parameters, the denominator of the equation one i.e P(D) can be difficult to calculate, and I would perhaps try a MCMC based approach in a single finite mixture model, rather than optimisation based approach.

## Methods of handling and working with missing data (part 1)

#### Description

In biology, the presence of missing values is a common occurrence for example in proteomics and metabolomics study. This represents a real challenge if one intends to perform an objective statistical analysis avoiding misleading conclusions. The leading causes of incompletely observed data are truncation and censoring which are often wrongly used interchangeably. You can refer to the post here that explains the difference between the two.

This blog aims to describe methods of handling missing data including data cleaning and quality checking (part 1) and another blog will follow soon (part 2) to discuss the potential follow-up analysis.

#### Data cleaning

If the missing data is nominal, the most common approach used is to replace it with different categorical value. Example: Let’s say the feature of interest is hair colour with values brown, blonde and red but for some subjects this information is missing. In that case, one can replace the missing records with ‘not available’ or something similar. By doing this we create a 4th category which will be treated as another dimension in any statistical analysis one does downstream.

On the other hand, if the missing data is on an ordinal or numeric scale, a more sophisticated approach for handling the missing data is required. Dropping or ignoring the records containing missing values does not always works as it assumes that the number of missing values is relatively small and totally at random. However, often this is not the case and can introduce a substantial bias because the information is simply lost. For instance, in gel-based proteomics, the amount of missing data is considerable (~10-50%) and not at random but could be related to the staining procedure used or abundances that are close enough to the limit of detection of the instrument (Pedreschi et al. 2008).

An ad hoc approach often used is to replace the missing data by a fixed value such as mean (in case of normally distributed data) or median (when the data is skewed) of observed values. When a missing value is the result of value being below the detection limit, a threshold or fixed value can be justifiable. This form of ‘data cleaning’ is useful but not always encouraged as it artificially reduces the variance which potentially could affect the strength of relationships with other variables as a single value is used to replace all the missing data.

Approaches like NIPALS directly deals with the missing values during the multivariate analysis and are more sensible way of handling randomly missing data. NIPALS is the acronym for Nonlinear Iterative Partial Least Squares it was introduced for Principal Component Analysis (PCA) and not what we now know as Partial least square regression. This method is generally used in chemometrics and proteomics and is tolerant to small amounts of missing data (upto 5-20%). It performs PCA using an iterative procedure. Uses weighted regressions with null weights for the missing entries thus missing data has no influence on the model.

Another sophisticated approach often used when large amount of data is incomplete is to impute the missing values iteratively during the estimation of the model parameters. For this several methods have been proposed such as k-nearest neighbour, singular value decomposition or maximum likelihood estimation etc. There are several R packages like ‘mice’, ‘Hmisc’, ‘VIM’ that implements some of these imputation algorithms. In the end one must consider the structure of the data and a compromise should be found between a sound statistical and biological interpretation of the data.

#### Quality Check and exploratory analysis

Suppose you have a dataset with some missing data.

# Load the required R packages
library(plsdepot)
library(cluster)

# Here we create a random dataframe
df <- data.frame(A = 1:10, B = 11:20, C = 1:10)

##   A B C
## 1 1 11 1
## 2 2 12 2
## 3 3 13 3
## 4 4 14 4
## 5 5 15 5
## 6 6 16 6

# Then we add few missing values (NA) in the dataframe
df_miss<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))

##    A B C
## 1 NA 11 1
## 2  2 12 2
## 3  3 13 NA
## 4 NA 14 NA
## 5  5 15 5
## 6  6 16 6

##### A) Principal Component Analysis (PCA)

Normal PCA to check the quality of the samples won’t work in this case as ‘prcomp’ function in R does not work with missing values. Instead, one can use NIPALS algorithm to compute PCA scores and loadings. In R,  library ‘plsdepot‘ implements NIPALS.

pc_nipal<- nipals (df_miss, comps = 2, scaled = TRUE)

#----Plot PCA on rows/observations/samples
plot (pc_nipal, what = "observation", comps = c(1, 2), cex = 0.6,show.names = TRUE, col.labels = "red")

##### B) Circle of correlation plot

Variables are displayed using the correlations of each block of variables with the components of the other block.

#----Plot PCA on columns/variables
plot (pc_nipal, what = "variables", comps = c(1, 2), cex = 0.5,xlim=c(-50,150),show.names = TRUE, offset = 0.1,col.arrows = "#FE992955",col.points = "#5592e3",col.axis = "black")

In general, the plot represents the correlation between the variables/features:

• The closer a variable appears on the perimeter of the circle, the better it is represented.
• In addition, if two variables are highly correlated, they will appear near each other.
• If two variables are negatively correlated, they will tend to appear in opposite extremes.
• If two variables are uncorrelated, they will be orthogonal to each other.
##### C)Clustering

Similarly, to perform clustering without removing rows where NAs are present, the gower distance metric can be used. It is a dissimilarity/distance coefficient that handles missing data well and implemented in function ‘daisy‘ in the R package ‘cluster‘.

#----Compute all the pairwise dissimilarities (distances) between observations in the data set
diss<-daisy(t(df_miss),metric="gower")

#---Computes agglomerative hierarchical clustering of the dataset.
distance_agnes<-agnes(diss,metric = "euclidean",method="complete")

hcd<-as.dendrogram(distance_agnes)

plot(distance_agnes,which.plots = 2,main="Dendrogram with Daisy function(grower metric)")

Therefore, this is a good alternative to quality check data with missingness instead of discarding data or introducing any form of bias in your analysis.

The information about the R packages used can be found below.

# print the package versions used ---#
sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.5 (Sierra)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cluster_2.0.5 plsdepot_0.1.17
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.4 magrittr_1.5 rprojroot_1.2 tools_3.3.1
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.9 stringi_1.1.2
## [9] rmarkdown_1.5 knitr_1.16 stringr_1.1.0 digest_0.6.12
## [13] evaluate_0.10

## Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues on that theme and tries to model multiple data generating processes into a single model. The code for this post is available at the github repository. There were many useful resources that helped me understand this model, and some are mentioned below and others are mentioned in the previous post in this topic:

I use the R library flexmix, for the sample data and fitting the model, in addition to Stan and MCMC approach. The data distribution is shown in the figure below, where looking at the figure suggests that there may be two data generation processes at work here. To start with, we try and find a distribution that will capture certain aspects of this data, and use posterior predictive checks to see if our distribution choice is appropriate. A previous post on posterior predictive checks covers this topic in more detail, and tests different distributions. To keep it short, I just try a normal mixture distribution with 2 mixture components and then perform some tests.

The figure in the left panel shows the density plot for the raw data, and the panel on the right shows the same data, with smattering of 200 predictive data sets from the fitted distribution – visually it appears to be fine, with the left component slightly higher in some cases. However the aspects of the data that I want to our distribution to capture are the mean, variance, maximum and minimum values, and the test result PValues suggest that any variation can be explained by sampling variability.

> ivTestQuantities
variance  minimum  maximum     mean
0.380    0.155    0.300    0.420

The first steps are to fit a mixture regression model with 2 components and one predictor using the flexmix package and then to repeat this using Stan.

> fit.flex = flexmix(yn ~ x, data=NPreg, k=2)
> summary(fit.flex)

Call:
flexmix(formula = yn ~ x, data = NPreg, k = 2)

prior size post>0 ratio
Comp.1 0.561  104    198 0.525
Comp.2 0.439   96    146 0.658

'log Lik.' -704.0635 (df=7)
AIC: 1422.127   BIC: 1445.215

> ## fitted coefficients
> parameters(fit.flex)
Comp.1     Comp.2
coef.(Intercept) 31.7644433 -0.7685309
coef.x            0.1867837  5.1841287
sigma             7.7718438  3.3828795

The same model is now broken down into parts and fit using Stan – you can write a log posterior function in R as well to do this, but I find that the optimiser tends to not explore the parameter space if we track the mixture components as well. However if we keep the mixture components fixed, then it tends to converge to the same point as Stan.

> print(fit.stan, digi=3)
Inference for Stan model: fitNormalMixtureRegression.
1 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=2000.

mean se_mean    sd     2.5%      25%      50%      75%    97.5% n_eff  Rhat
sigma[1]          3.405   0.010 0.395    2.676    3.126    3.395    3.667    4.223  1448 1.002
sigma[2]          7.848   0.019 0.632    6.725    7.407    7.805    8.270    9.164  1142 1.000
iMixWeights[1]    0.439   0.001 0.046    0.350    0.406    0.438    0.472    0.530  1000 1.000
iMixWeights[2]    0.561   0.001 0.046    0.470    0.528    0.562    0.594    0.650  1000 1.000
betasMix1[1]      5.163   0.003 0.146    4.870    5.067    5.160    5.262    5.452  2000 1.001
betasMix2[1]      0.176   0.013 0.347   -0.501   -0.056    0.171    0.391    0.861   751 1.001
mu[1]            -0.686   0.016 0.713   -2.072   -1.164   -0.705   -0.228    0.750  2000 1.000
mu[2]            31.846   0.067 1.971   27.898   30.590   31.890   33.136   35.791   853 1.000
lp__           -708.099   0.063 1.885 -712.367 -709.142 -707.715 -706.712 -705.449   899 1.001

Samples were drawn using NUTS(diag_e) at Tue Jul  4 16:10:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Comparing the results from the two procedures, i.e. flexmix and the Stan model, suggest a nice convergence between the two methods.

## Fitted or Predicted Values

Once the coefficients of the regression equation have been determined, we can calculate the predicted (on new data) or fitted (current data) values. There are two ways to get this data: either get predicted data for each component; Or get an aggregate predicted value using each component.

> intercepts
[1] -0.6598333 31.5694812
> betas
[1] 5.1614796 0.2241089
> iMixWeights
[1] 0.44 0.56
> iPred1 = lStanData$X %*% c(intercepts[1], betas[1]) > iPred2 = lStanData$X %*% c(intercepts[2], betas[2])
> ## compare with fit.flex
Comp.1    Comp.2
[1,] 32.54457 20.883670
[2,] 31.98889  5.460877
[3,] 32.19311 11.129077
[4,] 32.87877 30.159296
[5,] 32.20489 11.456077
[6,] 33.19092 38.822976
[,1]     [,2]
1 20.897771 32.50550
2  5.542358 31.83878
3 11.185795 32.08381
4 30.132872 32.90649
5 11.511366 32.09795
6 38.758702 33.28101

However getting one aggregate predicted value for the response variable may be more appropriate in certain settings, and will require us to take a weighted average of each component (the weight determined by the mixing probability).

> iAggregate = cbind(iPred1, iPred2)
> iAggregate = sweep(iAggregate, 2, iMixWeights, '*')
> iAggregate = rowSums(iAggregate)
> dfPredicted = data.frame(stan=iAggregate, flexmix=p.agg)
stan  flexmix
1 27.39810 27.42164
2 20.26835 20.33447
3 22.88868 22.93915
4 31.68610 31.68404
5 23.03985 23.08942
6 35.69120 35.66522
> ## calculate Mean Squared Error MSE
> mean((dfPredicted$flexmix - NPreg$yn)^2)
[1] 104.4622
> mean((dfPredicted$stan - NPreg$yn)^2)
[1] 104.3325
> ## both pretty close

Comparing the aggregate values from the two models show that both have produced similar results, and the mean squared errors in each case are also similar.

## Hierarchical Linear Regression – 2 Level Random Effects Model

Regression is a popular approach to modelling where a response variable is modelled as a function of certain predictors – to understand the relations between variables. I used a linear model in a previous post, using the bread and peace model – and various ways to solve the equation.

In this post, I want to fit a 2 level linear model, using the concept of Random Effects modelling. There is a lot of literature available on this subject, and some references are shown below:

[1] lme4: Mixed-effects modelling with R. Douglas M. Bates (2010). link

[2] West, B. T., Welch, K. B., Ga, A. T., & Crc, H. (2007). LINEAR MIXED MODELS A Practical Guide Using Statistical Software. Statistics in Medicine (Vol. 27). http://doi.org/10.1002/sim.3167

[3] Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian Data Analysis, Third Edition (Texts in Statistical Science). Book.

The code in this post can be seen on the github page here.

The simulated data in this example consists of 30 observations of a Normal distributed variable (so in the generalised linear modelling speak, the link function is Identity). These 30 observations can be sub-grouped into 5 individual groups. We assume that there is a common relationship between these 5 groups (e.g. 5 Machines of same type; 5 Persons from same cohort;). This grouping of course requires some domain knowledge and common sense. In ref [3] there is a nice introductory chapter on hierarchical modelling which is a recommended read – in particular the concept of exchangeability. The data also has one predictor variable.

The figure shows the data in 3 panels: Top Left shows the data at the population level (or a higher level – where all observations are part of one big group); Top Right shows the same data but the 5 sub-groups are coloured differently; Bottom panel shows the 5 groups plotted separately on each panel (Ref [1] has some nice examples of how to plot structured data in various ways). Looking at this figure, we can see that the slope in each group is about equal, however the intercept for each group is different. We would like to consider this in our modelling and model the group intercepts, randomly distributed around the central intercept.

################### first use normal regression and lme4 package
fit.lm = lm(resp ~ pred, data=dfData)
summary(fit.lm)
Call:
lm(formula = resp ~ pred, data = dfData)

Residuals:
Min      1Q  Median      3Q     Max
-5.3043 -1.6194 -0.4117  1.5824  6.7582

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.28521    0.92127    2.48   0.0194 *
pred         1.50345    0.08684   17.31   <2e-16 ***
---
Signif. codes:  0***0.001**0.01*0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.616 on 28 degrees of freedom
Multiple R-squared:  0.9146,    Adjusted R-squared:  0.9115
F-statistic: 299.7 on 1 and 28 DF,  p-value: < 2.2e-16
# residual se = 2.616

####################################################
#### fit a 2 level model with random intercept

library(lme4)
fit.lme = lmer(resp ~ pred + (1 | group), data=dfData, REML=F)
summary(fit.lme)
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: resp ~ pred + (1 | group)
Data: dfData

AIC      BIC   logLik deviance df.resid
141.3    146.9    -66.6    133.3       26

Scaled residuals:
Min      1Q  Median      3Q     Max
-1.5264 -0.7096 -0.2699  0.4328  2.6193

Random effects:
Groups   Name        Variance Std.Dev.
group    (Intercept) 2.664    1.632
Residual             3.777    1.943
Number of obs: 30, groups:  group, 5

Fixed effects:
Estimate Std. Error t value
(Intercept)  1.89979    1.00879   1.883
pred         1.54593    0.06604  23.408

Correlation of Fixed Effects:
(Intr)
pred -0.594
## random effect or group level sd = 1.632
## population level sd = 1.9434
## total = 1.6322+1.9434 = 3.5756

The interesting numbers I would like to look at are coefficients for the population (higher level group) in a simple linear model and a model with random effects – the slope coefficients are about the same but the intercepts are different in both models. Furthermore, the second (which is perhaps more important part) set of numbers are the amount of residual noise unexplained. The simpler model has a residual standard deviation of 2.6, while the 2 level model has 2 noise terms: Random effect level standard deviation 1.6 and remaining population level standard deviation 1.9 (which is smaller than the simpler model).  So this suggests that you can explain more of your sampling noise in your data by using a Random effects model and accounting for the structure of the data.

Now lets break this model down into parts and write our own models, the first one is a MCMC based approach using Stan, and the second one is using an optimisation with resampling based approach. In both cases you are trying to estimate multiple parameters in a parameter space, and you hope to converge to a point.

The code for the Stan model is in the same directory of the repository at github – and is structurally very similar to the R code shown below, which defines the log posterior function (see the post on parameter estimation for details).

mylogpost = function(theta, data){
## data
resp = data$resp # resp mModMatrix = data$mModMatrix
##### this mapping variable maps each group level intercept to each data point
#### at the likelihood level
groupIndex = data$groupIndex ## mapping variable to map each random effect to its respective response variable ## parameters to track/estimate sigmaRan = exp(theta['sigmaRan']) # random effect scale/sd sigmaPop = exp(theta['sigmaPop']) # population level sd betas = theta[grep('betas', names(theta))] # vector of betas i.e. regression coefficients for population iGroupsJitter = theta[grep('ran', names(theta))]# random effects jitters for the group deflections ## random effect jitter for the population intercept # each group contributes a jitter centered on 0 # population slope + random jitter ivBetaRand = betas[1] + iGroupsJitter # create a matrix of betas with the new interceptr/unique intercept for each random effect ivIntercept = ivBetaRand[groupIndex] # expand this intercept using the mapping variable iFitted = as.matrix(mModMatrix[,2:ncol(mModMatrix)]) %*% betas[2:ncol(mModMatrix)] iFitted = ivIntercept + iFitted # using identity link so no transformation required ## priors # sigma priors lhp = dunif(sigmaRan, 0, 2, log=T) + dunif(sigmaPop, 0, 5, log=T) # random effects prior lran = sum(dnorm(iGroupsJitter, 0, sigmaRan, log=T)) # priors for the betas lp = sum(dcauchy(betas, 0, 10, log=T)) # write the likelihood function lik = sum(dnorm(resp, iFitted, sigmaPop, log=T)) val = lhp + lran + lp + lik return(val) } Some points about this code snippet: • sigmaRan is the standard deviation for the group level. All groups are assumed to have the same noise centred on zero. This is a general assumption and may not be true if you have different types of groups. If you feel brave, you can assign a different distribution to each type of group. • sigmaPop is the population distribution at the likelihood level. • I have set the priors for the sigmaRan and sigmaPop as uniform 0 to 2 and 0 to 5 respectively. This is not necessary but helps the convergence process, if you are having problems with convergence. • The population level regression coefficients are given a 0 centred Cauchy distribution. • We are tracking 9 parameters in this optimisation problem: 2 standard deviations, 2 regression coefficients and 5 group level deflections. fit.lap = mylaplace(mylogpost, start, lData) ### lets use the results from laplace and SIR sampling with a t proposal density ### to take a sample tpar = list(m=fit.lap$mode, var=fit.lap$var*2, df=4) ## get a sample directly and using sir (sampling importance resampling with a t proposal density) s = sir(mylogpost, tpar, 5000, lData) apply(s, 2, mean)[-c(1:2)] betas1 betas2 ran1 ran2 ran3 ran4 ran5 1.8598225 1.5462439 0.3754429 0.7316874 -0.5849880 1.7242024 -2.1096244 exp(apply(s, 2, mean)[1:2]) sigmaPop sigmaRan 2.082105 1.351142 We use Sampling importance resampling (SIR) algorithm using a t proposal density to generate a random sample for our parameters using the mode and covariance matrix calculated earlier with the function mylaplace. I show the results from all 4 methods below, and you can see they are pretty similar. ############### you can compare the results from stan and lmer and the logposterior function # Inference for Stan model: linearRegressionRandomEffects. # 4 chains, each with iter=5000; warmup=2500; thin=1; # post-warmup draws per chain=2500, total post-warmup draws=10000. # # mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat # betas[1] 1.775 0.038 1.620 -1.736 0.935 1.841 2.738 4.786 1853 1.000 # betas[2] 1.549 0.001 0.070 1.411 1.503 1.549 1.597 1.685 5649 1.001 # sigmaRan 2.961 0.051 2.471 0.875 1.728 2.414 3.487 8.093 2377 1.001 # sigmaPop 2.094 0.005 0.327 1.563 1.863 2.056 2.280 2.843 4928 1.001 # rGroupsJitter[1] 0.503 0.036 1.623 -2.559 -0.419 0.434 1.326 3.996 2083 1.000 # rGroupsJitter[2] 0.901 0.037 1.635 -2.068 -0.055 0.808 1.733 4.493 1997 1.000 # rGroupsJitter[3] -0.636 0.036 1.618 -3.838 -1.516 -0.646 0.225 2.813 2036 1.000 # rGroupsJitter[4] 2.140 0.036 1.660 -0.808 1.131 2.013 3.008 5.892 2077 1.000 # rGroupsJitter[5] -2.448 0.035 1.652 -5.842 -3.378 -2.406 -1.502 0.792 2179 1.000 # > summary(fit.lme) # Linear mixed model fit by maximum likelihood ['lmerMod'] # Formula: resp ~ pred + (1 | group) # Data: dfData # # Random effects: # Groups Name Variance Std.Dev. # group (Intercept) 2.664 1.632 # Residual 3.777 1.943 # Number of obs: 30, groups: group, 5 # # Fixed effects: # Estimate Std. Error t value # (Intercept) 1.89979 1.00879 1.883 # pred 1.54593 0.06604 23.408 # # Random effect jitters from lme # > t(ranef(fit.lme)$group)
#              1         2          3       4         5
# (Intercept) 0.3888984 0.7627657 -0.6863616 1.94236 -2.407663

# > fit.lap$mode[-c(1:2)] # betas1 betas2 ran1 ran2 ran3 ran4 ran5 # 1.9037216 1.5456933 0.3897490 0.7787912 -0.7075750 1.8900023 -2.3491993 # > exp(fit.lap$mode)[1:2]
# sigmaPop sigmaRan
# 1.805638 1.463412

## SIR sampling - these will slighly change each time you run it
# > apply(s, 2, mean)[-c(1:2)]
# betas1     betas2       ran1       ran2       ran3       ran4       ran5
# 1.8598225  1.5462439  0.3754429  0.7316874 -0.5849880  1.7242024 -2.1096244
# > exp(apply(s, 2, mean)[c(1:2)])
# sigmaPop sigmaRan
# 2.082105 1.351142

However in problems with many parameters and groups, the optimisation based approach may not converge, so optimisation+SIR, Stan or lmer based approach may be required – or a different optimization library required.

## Model Quality Scores

In a previous post we calculated some model quality scores. However I had been struggling in calculating the Log Likelihood and AIC for a Random effects model – due to a mistake I was making in the calculations. Eventually I figured it out with help from another post I found here.

## first write the log predictive density function
lpd = function(theta, data){
betas = theta # vector of betas i.e. regression coefficients for population
## data
resp = data$resp # resp mModMatrix = data$mModMatrix
group = data$group sigmaRan = exp(data$sigmaRan) # random effect scale/sd
sigmaPop = exp(data$sigmaPop) # population level sd # calculate fitted value iFitted = mModMatrix %*% betas ######## perform notational acrobatics # construct the variance covariance matrix z = model.matrix(resp ~ 0 + group) zt = t(z) mRan = diag(sigmaRan^2, nrow=nlevels(group), ncol=nlevels(group)) mRan = z %*% mRan %*% zt mPop = diag(sigmaPop^2, nrow=length(resp), ncol=length(resp)) mCov = mRan + mPop ## likelihood function with posterior theta return(dmnorm(resp, iFitted, mCov, log=T)) } lData$group = dfData$group lData$sigmaRan = log(1.632)
lData$sigmaPop = log(1.943) theta = fixef(fit.lme) lpd(theta, lData) [1] -66.63871 ## compare with lme logLik(fit.lme) 'log Lik.' -66.63871 (df=4) The code above shows that to calculate these scores, we need the log predictive density function and to calculate the log likelihood correctly we need to use a multivariate normal density with a covariance matrix that accounts for population level variance and group level variance. I have been looking into calculating WAIC for structured models or doing cross-validation (which is on my todo list) – however this can get tedious when you have to take into account the structure of the groups. I will try and cover that in a post at sometime in the future. ## Normalising Nanostring data This is a quick R guide to learn about Nanostring technology (nCounter) and how to pre-process the data profiled on this platform. ### Description The nCounter system from Nanostring Technologies is a direct, reliable and highly sensitive multiplexed measurement of nucleic acids (DNA and RNA) based on a novel digital barcode technology. It involves Custom Codeset of genes or off-the-shelf preassembled panels and on single cell (more details on NanoString website). Each mRNA Expression CodeSet contains probes designed against fourteen ERCC transcript sequences. – Six of these sequences are used as positive hybridization controls and eight are designed as negative controls. – These positive controls and negative controls are present in each CodeSet independent of the sample. These help in normalising for any technical/systemic variability. – In addition, the codesets can contain some housekeeping genes which can be used for normalising sample variability (biological normalisation) i.e. to correct for differences in sample input between assays. It is based on the assumption that the target sequences of the house keeping genes are consistent in their expression levels. Note: Read the nCounter guide available in the the link for more details: (https://www.nanostring.com/application/files/1214/8942/4642/MAN-C0011-03_nCounter_Gene_Expression_Data_Analysis_Guidelines.pdf) ### Load the dataset The data produced by the nCounter Digital Analyzer (nanostring) are exported as a Reporter Code Count (RCC) file which is a comma-separated text (.csv) file that contains the counts for each gene in a sample. Each cartridge has 12 lanes i.e. 12 samples can be profiled on one nanostring cartridge. For processing the data one can apply the normalization steps recommended by the company (using NanoStringNorm R package). Alternatively, the data can be treated as regular digital counts (RNA-seq) and can be analysed using edgeR TMM normalisation approach. However, in our experience former works better then the latter as it accounts for cross-hybridization related biases by allowing user to do background correction. # Load the required R packages library(NanoStringNorm) You can read the RCC files in two different ways i.e.use the excel import function read.xls.RCC to read directly from nCounter output files if provided in .xls format by the facility. However, do ensure that you are using the worksheet with the raw counts and not something that has been processed. An example dataset can be downloaded from GEO (GSE51488). # read the raw counts from the RCC excel spreadsheet output by the nCounter platform df <-read.xls.RCC("GSE51488_GAMA_Nanostring_RAW_Spleen_1.xls", sheet = 1) or, you can use the following to process single sample markup RCC files (example:GSE95100) and merge the individual .RCC files together in one variable. # read the raw counts from individual RCC files from the directory (path of .RCC files ) df <-read.markup.RCC(rcc.path = ".",rcc.pattern = "*.RCC|*.rcc",exclude = NULL,include = NULL,nprobes = -1) ### Pre-processing Firstly, remove systemic biases by using geometric mean. # use geometric mean for technical normalisation all_samples_gm <- NanoStringNorm(x = df,anno = NA,CodeCount = 'geo.mean',Background = 'none',SampleContent = 'none', round.values = FALSE, take.log =FALSE,return.matrix.of.endogenous.probes =FALSE) Then, correct for cross-hybridization and normalise for sample variability by using background correction and house keeping genes respectively. # use housekeeping genes along with background correction(mean+2SD) for biological normalisation---# normalised_df <- NanoStringNorm(x = all_samples_gm,anno = NA,CodeCount = 'none',Background = 'mean.2sd',SampleContent = 'housekeeping.geo.mean', round.values = FALSE,is.log = FALSE, take.log = TRUE, return.matrix.of.endogenous.probes = TRUE ) This returns the normalised values in log2 scale. If you want the data to be on linear scale then change take.log = FALSE # save the normalised data in a file---# write.table(normalised_df,"Normalised_data_nanostring.csv",sep=",",quote=F,row.names = T,col.names = T) The information about the R packages can be found below. # print the package versions used ---# sessionInfo() ## R version 3.3.1 (2016-06-21) ## Platform: x86_64-apple-darwin13.4.0 (64-bit) ## Running under: OS X 10.12.5 (Sierra) ## locale: ## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 ## attached base packages: ## [1] parallel stats graphics grDevices utils datasets [7] methods base ## other attached packages: ## [1] NanoStringNorm_1.1.21 vsn_3.40.0 Biobase_2.32.0 ## [4] BiocGenerics_0.18.0 gdata_2.17.0 ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.9 knitr_1.16 magrittr_1.5 ## [4] zlibbioc_1.18.0 munsell_0.4.3 lattice_0.20-34 ## [7] colorspace_1.3-2 stringr_1.1.0 plyr_1.8.4 ## [10] tools_3.3.1 grid_3.3.1 gtable_0.2.0 ## [13] affy_1.50.0 htmltools_0.3.5 gtools_3.5.0 ## [16] assertthat_0.1 yaml_2.1.14 lazyeval_0.2.0 ## [19] rprojroot_1.2 digest_0.6.12 preprocessCore_1.34.0 ## [22] tibble_1.2 affyio_1.42.0 ggplot2_2.2.1 ## [25] evaluate_0.10 rmarkdown_1.5 limma_3.28.21 ## [28] stringi_1.1.2 BiocInstaller_1.22.3 scales_0.4.1 ## [31] backports_1.0.4 ## Using R to export results into Excel ## Applying conditional formatting on a sheet based on the values from a different sheet This is the first post in the series “Tips and Tricks for Data Science”. In this post I will show how to create Excel files with conditional formatting in R. As an example I will focus on colouring cells in a sheet based on values stored in a different sheet but the technique is easily extensible to other needs. Let’s start by importing a toy dataset and by familiarising ourselves with its structure. This dataset was obtained by analyzing a publicly available gene expression dataset (http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-22886/). The original dataset contains gene expression values for multiple different blood cell subsets. List of differentially expressed genes for five specific comparisons (Neutrophils vs Monocytes, CD4+ T cells vs CD8+ T cells, IgM memory B cells vs IgG/IgA memory B cells, CD8+ T cells vs NK cells and B cells vs Plasmacells) were obtained using limma and results were stored in a single data frame. The dataset can be downloaded from (https://github.com/rmelchiotti/DataCATz/blob/master/Datasets/blog_RM2_data.RData). library(openxlsx) library(reshape2) load("/media/development/DataCATz/Datasets/blog_RM2_data.RData") head(limma_results_multiple_comparisons) ## logFC AveExpr t P.Value adj.P.Val ## 229967_at 5.796050 6.605532 25.98790 8.554785e-48 1.937231e-43 ## 234644_x_at 3.185765 6.756721 24.28746 4.018272e-45 4.549688e-41 ## 226154_at -3.686182 8.345932 -21.74632 7.064384e-41 5.332433e-37 ## 224374_s_at -3.180816 8.541149 -21.02976 1.271144e-39 7.196262e-36 ## 229789_at 1.629650 6.436400 19.92510 1.234818e-37 5.592492e-34 ## 230585_at 1.832602 6.559401 18.63027 3.194017e-35 1.205475e-31 ## B comparison probe ## 229967_at 97.62712 NeutrovsMono 229967_at ## 234644_x_at 91.69283 NeutrovsMono 234644_x_at ## 226154_at 82.21080 NeutrovsMono 226154_at ## 224374_s_at 79.39618 NeutrovsMono 224374_s_at ## 229789_at 74.93025 NeutrovsMono 229789_at ## 230585_at 69.49491 NeutrovsMono 230585_at The data frame contains, for each probe (think gene) and each comparison, log-fold changes, adjusted p-values and additional variable which will not be used in this post. The goal is to export an Excel table where rows correspond to probes, columns correspond to comparisons and each cell contains the p-value obtained for that particular gene and that particular comparison using limma. In addition we would like to colour each cell by its corresponding fold change. In order to do this we make use of an R package, openxlsx (https://cran.r-project.org/web/packages/openxlsx/index.html), specifically developed for the manipulation of Excel files. Let’s start by creating two data frames, one containing p-values for each probe and comparison and one containing fold changes for each probe and comparison. For simplicity we will focus only on those probes which are significant in at least one of the comparisons. # Identifing probes which are significant in at least one comparison (adjusted p-value<0.001) significant<-dcast(subset(limma_results_multiple_comparisons,adj.P.Val<0.001),probe~.,length,value.var="comparison") colnames(significant)<-c("probe","num_significant_comparisons") head(significant) ## probe num_significant_comparisons ## 1 200000_s_at 1 ## 2 200001_at 1 ## 3 200002_at 1 ## 4 200003_s_at 1 ## 5 200004_at 1 ## 6 200005_at 1 significant_limma_results_multiple_comparisons<-subset(limma_results_multiple_comparisons,probe %in% significant$probe)

# Generating data frame containing p-values (rows correspond to probes, columns correspond to comparisons)
rownames(pvals)<-pvals$probe pvals$probe<-NULL
##               BvsPlasma  CD4vsCD8     CD8vsNK  IgMvsIgA NeutrovsMono
## 200000_s_at 0.032460135 0.9242417 0.824582060 0.9695951 2.949946e-12
## 200001_at   0.005092352 0.9575582 0.965266195 0.9992567 4.887914e-11
## 200002_at   0.522338942 0.3181947 0.001017482 0.9705612 1.255813e-21
## 200003_s_at 0.342090910 0.8663762 0.250467121 0.9992567 3.991931e-09
## 200004_at   0.279605958 0.7619152 0.982076553 0.9992567 7.507483e-06
## 200005_at   0.579021244 0.6999011 0.001318081 0.9650455 5.786168e-16
# Generating data frame containing fold changes (rows correspond to probes, columns correspond to comparisons in the same order as above)
fold_changes<-dcast(data=significant_limma_results_multiple_comparisons,formula=probe~comparison,value.var="logFC")
rownames(fold_changes)<-fold_changes$probe fold_changes$probe<-NULL
##              BvsPlasma    CD4vsCD8     CD8vsNK     IgMvsIgA NeutrovsMono
## 200000_s_at -0.6563666  0.09343405  0.12728432  0.273197551    -2.013805
## 200001_at   -0.8709291 -0.05733401 -0.02998118 -0.007288651    -1.916991
## 200002_at   -0.1731417 -0.39430381  0.93701092  0.215177795    -2.470631
## 200003_s_at  0.2244004 -0.10897526  0.35226591  0.087253733    -1.185663
## 200004_at   -0.3472753  0.22445483  0.01535964  0.122980752    -1.203665
## 200005_at   -0.1283781 -0.17952769  0.77665262  0.188921812    -1.647487

Now that the two data frames are ready we can start the exporting process. We use the openxlsx package to perform this function. We start by creating a workbook and adding two worksheet, one called p-values and one called fold_changes. In the first worksheet we store the p-values, in the second worksheet we store the fold changes.

xlsx_filename<-"/media/development/DataCATz/Tables/results.xlsx"

# Creating workbook
wb<-openxlsx::createWorkbook("Results")

# Creating and filling the p-values worksheet
writeData(wb,sheet=1,pvals,rowNames=TRUE)

# Creating and filling the fold changes worksheet
writeData(wb,sheet=2,fold_changes,rowNames=TRUE)
openxlsx::saveWorkbook(wb,xlsx_filename,overwrite=TRUE)

We then move on to formatting. In this particular case we want to colour each p-value in the p-values worksheet by its fold change. We start by creating a gradient of colours, blue for negative fold changes and red for positive fold_changes (we switch colour for every 1 log-fold change increase).

logfc_intervals<-c(-Inf,seq(-8,8,1),Inf)
blues<-colorRampPalette(c("blue", "white"))
reds<-colorRampPalette(c("white","red"))
list_of_colours<-blues(floor((length(logfc_intervals)-1)/2))
list_of_colours<-c(list_of_colours,reds(floor((length(logfc_intervals)-1)/2)))

We then apply conditional formatting using these colours. Each fold change is assigned to a colour according to its value. This colour is then applied to the corresponding cell using the function addStyle.

# Applying colours
for(i in seq(1,length(logfc_intervals)-1)){
cells_to_colour<-which(pvals<0.05 & fold_changes>logfc_intervals[i] & fold_changes<logfc_intervals[i+1],arr.ind=TRUE)
style<-createStyle(fgFill=list_of_colours[i])
}

The last step consists in saving the workbook into an Excel file.

# Saving the workbook in an Excel file
openxlsx::saveWorkbook(wb,xlsx_filename,overwrite=TRUE)

There are many other ways to perform conditional enrichment using openxlsx. In particular if the condition used to colour cells in a sheet is based on values stored in that sheet (for example if we wanted to colour p-values based on their values), the function openxlsx::conditionalFormatting might be a better option.

The R file used to generate the Excel file described in this post is available for download from git (https://github.com/rmelchiotti/DataCATz/tree/master/Scripts). The Excel file generated by the script can be found here. Below is a list of the packages used.

# Printing package version
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] reshape2_1.4.2  openxlsx_4.0.17
##
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.0  backports_1.1.0 plyr_1.8.4      magrittr_1.5
##  [5] rprojroot_1.2   tools_3.4.0     htmltools_0.3.6 yaml_2.1.14
##  [9] Rcpp_0.12.11    stringi_1.1.5   rmarkdown_1.5   knitr_1.16
## [13] stringr_1.2.0   digest_0.6.12   evaluate_0.10

## Compare Transformations & Batch Effects in Omics Data

While analysing high dimensional data, e.g. from Omics (Genomics, Transcriptomics, Proteomics etc.) – we are essentially measuring multiple response variables (i.e. genes, proteins, metabolites etc.) in multiple samples, resulting in a $rXn$ matrix X with r variables and n samples. The data capture can lead to multiple batches or groups in the data – a batch can be anything from a simple grouping like Male vs Female to Lanes on a sequencer, or Wells in a capture plate. Statistically this means that: Is the covariate or batch ignorable or should it be account for in our model.

The X matrix (data matrix) usually has to be normalised, which means the data has to undergo some sort of transformation to account for certain technical differences due to these batches e.g. like different samples being sequenced to different depths in an Next Generation Sequencer. This kind of systematic effect which is due to technical variation can be ‘adjusted’ and the samples are scaled/transformed to make them identically distributed. There are different methods of normalisation based on the data capture technology; but this post has two aims:

1. A simple workflow for diagnostic checks for your normalisation: we normalise the same data using two different methods, and check which method may be more appropriate.
2. Introducing the concepts of object oriented programming in R, with S4 objects.

The data and code used in this post can be found here.

## 1. Compare two Normalisation Methods:

In genomics, we are capturing many variables, and many of these variables are inter-dependent (e.g. co-expressed genes) and on average most of these variables are assumed to be identically distributed between samples (this however depends on how different each sample is from the other – e.g. comparing types of apples or apples vs oranges). In the example below we are using some gene expression data that has been normalised using two different methods:

Method 1: We used house keeping genes to normalise the samples, even though these samples are different cell types, hence the house keeping genes may act differently in each sub-type.

Method 2: We use the common highly expressed genes across all samples, and look for a common factor among these cells, and normalise using that criteria.

Note: We are not arguing which method is correct, both are essentially mathematically and biologically sound – but which one is suitable for our sample types.

The data in the plots below has been normalised and log transformed (in positive space) and zeros represent missing data. The samples have been coloured on our batch of interest i.e. the machine ID.

The figure above is a box plot of the data with the medians highlighted as a bar. We can see that in both the methods, the samples from the second machine (071216) has more missing data. Could this be because good quality samples were run first and bad quality later? Otherwise comparing the two methods, I don’t think there is any strong reason to select one over the other; if I had to choose one I would lean towards method 2 as the medians overall are slightly more equal?

We assume the each sample is normally distributed on a log scale; this is a very big assumption, but a convenient one for diagnostic checks. Besides even if the data is not normally distributed, the mean is normally distributed, due to the number of data points (we have about 530 or so variables in each sample). We are plotting the posterior mean for each sample (see here for some details), along with the 95% high density interval (HDI) for each mean. The mean for S_4 and its HDI does not overlap with e.g. S_14 in method 1, while in method 2 all the HDIs can overlap. There is not a clear choice here, but I would prefer the normalisation in method 2.

As we have used a normal model for the average of each sample, and the normal distribution has a second parameter (the scale – sigma, standard deviation), we compare the posterior sigma for these samples. In this case the standard deviation and the HDI in each sample are more comparable in method 2, I would consider this as strong evidence for the second method.

The figures show the posterior proportion of the data that is missing after a normalisation done using each method (The reason this can be different, as some normalisation methods may include background signal removal as well). However there does not appear to be a clear winner in this case. However the batch id 071216 has more missing data.

This figure shows the first 2 components from the Principal Component Analysis of the data matrix X. Looking at the distribution of the points along the two axes – visually I can see that the data is more clumped on the left (method 1) while is looks slightly more random on the right (comparatively of course – method 2). This plot also adds another piece of evidence towards using method 2 rather than method 1.

All these pieces of evidence for or against one method should be taken together, along with considering the types of samples being compared and the nuances of the data capture technology. In this case most of the evidence (although not very strong individually) collectively favours method 2. Furthermore, I would consider including the machine ID as a covariate in the model, or use a hierarchical model with the machine ID as a random effect in the model.

## 2. Object Oriented Programming in R – S4 Objects:

Before I talk about some of my experience with S4 objects, these two sites are good starting points for learning more about this, site1, site2. I liked the structure of C++ and working on larger projects in R can get messy if the programs/scripts are not structured. I will put some code in here for the class CDiagnosticPlots.

## Class Declaration:

setClass('CDiagnosticPlots', slots=list(mData='matrix', csTitle='character', lData='list', lParam='list'))

I define the class name, and the slots i.e. variables that this object will hold. Generally if I want the class to grow in the future, I will keep a slot like lData of type list, that I can append to for future use.

## Constructor:

## constructor
## ARGS: mData = data matrix with samples in columns (subject space) and variables in rows (variable space)
CDiagnosticPlots = function(mData, csTitle){
# check some diagnostics here and create object
if (class(mData) != 'matrix') stop('CDiagnosticPlots: mData is not object of class matrix')

## create the object
ob = new('CDiagnosticPlots', mData=mData, csTitle=csTitle, lData=list(), lParam=list())
ob@lParam = CDiagnosticPlotsGetParameters(ob)
return(CDiagnosticPlotsBuild(ob))
}

The constructor can perform various checks, before creating the object using the new function. It will return the object of the type CDiagnosticPlots which can be used in your program.

## Generic Functions:

Generic functions are very useful and the function with the same name can be called for different object types. This is an example from one of my other repositories here. I define two different classes but both use the plotting function with the same name.

## snippet from first function
setGeneric('plot.var.selection', def = function(ob, ...) standardGeneric('plot.var.selection'))
setMethod('plot.var.selection', signature='CVariableSelection.RandomForest', definition = function(ob, ...){
# plot the variable importance as bar plots with
# confidence intervals
})

setMethod('plot.var.selection', signature='CVariableSelection.ReduceModel', definition = function(ob, ...){
# plot the test and training error agasint number of variables
tr = colMeans(ob@mTrain)
te = colMeans(ob@mTest)
m = cbind(tr, te)
matplot(1:nrow(m), m, type='b', pch=20, lty = 1, lwd=2, col=1:2, xaxt='n', xlab='No. of Variables', ylab='Error Rate')
legend('topright', legend = c('Train', 'Test'), fill=1:2)
axis(1, at = 1:nrow(m), las=2)
})

Inheritance can also be implemented and an example can be seen in the link above.