Methods of handling and working with missing data (part 1)

Description

In biology, the presence of missing values is a common occurrence for example in proteomics and metabolomics study. This represents a real challenge if one intends to perform an objective statistical analysis avoiding misleading conclusions. The leading causes of incompletely observed data are truncation and censoring which are often wrongly used interchangeably. You can refer to the post here that explains the difference between the two.

This blog aims to describe methods of handling missing data including data cleaning and quality checking (part 1) and another blog will follow soon (part 2) to discuss the potential follow-up analysis.

Data cleaning

If the missing data is nominal, the most common approach used is to replace it with different categorical value. Example: Let’s say the feature of interest is hair colour with values brown, blonde and red but for some subjects this information is missing. In that case, one can replace the missing records with ‘not available’ or something similar. By doing this we create a 4th category which will be treated as another dimension in any statistical analysis one does downstream.

On the other hand, if the missing data is on an ordinal or numeric scale, a more sophisticated approach for handling the missing data is required. Dropping or ignoring the records containing missing values does not always works as it assumes that the number of missing values is relatively small and totally at random. However, often this is not the case and can introduce a substantial bias because the information is simply lost. For instance, in gel-based proteomics, the amount of missing data is considerable (~10-50%) and not at random but could be related to the staining procedure used or abundances that are close enough to the limit of detection of the instrument (Pedreschi et al. 2008).

An ad hoc approach often used is to replace the missing data by a fixed value such as mean (in case of normally distributed data) or median (when the data is skewed) of observed values. When a missing value is the result of value being below the detection limit, a threshold or fixed value can be justifiable. This form of ‘data cleaning’ is useful but not always encouraged as it artificially reduces the variance which potentially could affect the strength of relationships with other variables as a single value is used to replace all the missing data.

Approaches like NIPALS directly deals with the missing values during the multivariate analysis and are more sensible way of handling randomly missing data. NIPALS is the acronym for Nonlinear Iterative Partial Least Squares it was introduced for Principal Component Analysis (PCA) and not what we now know as Partial least square regression. This method is generally used in chemometrics and proteomics and is tolerant to small amounts of missing data (upto 5-20%). It performs PCA using an iterative procedure. Uses weighted regressions with null weights for the missing entries thus missing data has no influence on the model.

Another sophisticated approach often used when large amount of data is incomplete is to impute the missing values iteratively during the estimation of the model parameters. For this several methods have been proposed such as k-nearest neighbour, singular value decomposition or maximum likelihood estimation etc. There are several R packages like ‘mice’, ‘Hmisc’, ‘VIM’ that implements some of these imputation algorithms. In the end one must consider the structure of the data and a compromise should be found between a sound statistical and biological interpretation of the data.

Quality Check and exploratory analysis

Suppose you have a dataset with some missing data.

# Load the required R packages
library(plsdepot)
library(cluster)

# Here we create a random dataframe
df <- data.frame(A = 1:10, B = 11:20, C = 1:10)
head(df)

##   A B C
## 1 1 11 1
## 2 2 12 2
## 3 3 13 3
## 4 4 14 4
## 5 5 15 5
## 6 6 16 6

# Then we add few missing values (NA) in the dataframe
df_miss<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
head(df_miss)

##    A B C
## 1 NA 11 1
## 2  2 12 2
## 3  3 13 NA
## 4 NA 14 NA
## 5  5 15 5
## 6  6 16 6

A) Principal Component Analysis (PCA)

Normal PCA to check the quality of the samples won’t work in this case as ‘prcomp’ function in R does not work with missing values. Instead, one can use NIPALS algorithm to compute PCA scores and loadings. In R,  library ‘plsdepot‘ implements NIPALS.


#----Using nipals to perform calculate PCA loadings and scores
pc_nipal<- nipals (df_miss, comps = 2, scaled = TRUE)

#----Plot PCA on rows/observations/samples
plot (pc_nipal, what = "observation", comps = c(1, 2), cex = 0.6,show.names = TRUE, col.labels = "red")

B) Circle of correlation plot

Variables are displayed using the correlations of each block of variables with the components of the other block.

#----Plot PCA on columns/variables
plot (pc_nipal, what = "variables", comps = c(1, 2), cex = 0.5,xlim=c(-50,150),show.names = TRUE, offset = 0.1,col.arrows = "#FE992955",col.points = "#5592e3",col.axis = "black") 

plot2

In general, the plot represents the correlation between the variables/features:

  • The closer a variable appears on the perimeter of the circle, the better it is represented.
  • In addition, if two variables are highly correlated, they will appear near each other.
  • If two variables are negatively correlated, they will tend to appear in opposite extremes.
  • If two variables are uncorrelated, they will be orthogonal to each other.
C)Clustering

Similarly, to perform clustering without removing rows where NAs are present, the gower distance metric can be used. It is a dissimilarity/distance coefficient that handles missing data well and implemented in function ‘daisy‘ in the R package ‘cluster‘.

#----Compute all the pairwise dissimilarities (distances) between observations in the data set
diss<-daisy(t(df_miss),metric="gower")

#---Computes agglomerative hierarchical clustering of the dataset.
distance_agnes<-agnes(diss,metric = "euclidean",method="complete")

hcd<-as.dendrogram(distance_agnes)

plot(distance_agnes,which.plots = 2,main="Dendrogram with Daisy function(grower metric)")

plot3

Therefore, this is a good alternative to quality check data with missingness instead of discarding data or introducing any form of bias in your analysis.

The information about the R packages used can be found below.

# print the package versions used ---#
sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.5 (Sierra)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cluster_2.0.5 plsdepot_0.1.17
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.4 magrittr_1.5 rprojroot_1.2 tools_3.3.1
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.9 stringi_1.1.2
## [9] rmarkdown_1.5 knitr_1.16 stringr_1.1.0 digest_0.6.12
## [13] evaluate_0.10

Normalising Nanostring data

This is a quick R guide to learn about Nanostring technology (nCounter) and how to pre-process the data profiled on this platform.

Description

The nCounter system from Nanostring Technologies is a direct, reliable and highly sensitive multiplexed measurement of nucleic acids (DNA and RNA) based on a novel digital barcode technology. It involves Custom Codeset of genes or off-the-shelf preassembled panels and on single cell (more details on NanoString website).

Each mRNA Expression CodeSet contains probes designed against fourteen ERCC transcript sequences.

– Six of these sequences are used as positive hybridization controls and eight are designed as negative controls.

– These positive controls and negative controls are present in each CodeSet independent of the sample. These help in normalising for any technical/systemic variability.

– In addition, the codesets can contain some housekeeping genes which can be used for normalising sample variability (biological normalisation) i.e. to correct for differences in sample input between assays. It is based on the assumption that the target sequences of the house keeping genes are consistent in their expression levels.

Note:  Read the nCounter guide available in the the link for more
details: (https://www.nanostring.com/application/files/1214/8942/4642/MAN-C0011-03_nCounter_Gene_Expression_Data_Analysis_Guidelines.pdf)

Load the dataset

The data produced by the nCounter Digital Analyzer (nanostring) are exported as a Reporter Code Count (RCC) file which is a comma-separated text (.csv) file that contains the counts for each gene in a sample. Each cartridge has 12 lanes  i.e. 12 samples can be profiled on one nanostring cartridge.

For processing the data one can apply the normalization steps recommended by the company (using NanoStringNorm R package). Alternatively, the data can be treated as regular digital counts (RNA-seq) and can be analysed using edgeR TMM normalisation approach. However, in our experience former works better then the latter as it accounts for cross-hybridization related biases by allowing user to do background correction.

# Load the required R packages
library(NanoStringNorm)

You can read the RCC files in two different ways i.e.use the excel import function read.xls.RCC to read directly from nCounter output files if provided in .xls format by the facility. However, do ensure that you are using the worksheet with the raw counts and not something that has been processed. An example dataset can be downloaded from GEO (GSE51488).

# read the raw counts from the RCC excel spreadsheet output by the nCounter platform
df <-read.xls.RCC("GSE51488_GAMA_Nanostring_RAW_Spleen_1.xls", sheet = 1)

or,

you can use the following to process single sample markup RCC files (example:GSE95100) and merge the individual .RCC files together in one variable.

# read the raw counts from individual RCC files from the directory (path of .RCC files )
df <-read.markup.RCC(rcc.path = ".",rcc.pattern = "*.RCC|*.rcc",exclude = NULL,include = NULL,nprobes = -1)

Pre-processing

Firstly, remove systemic biases by using geometric mean.

# use geometric mean for technical normalisation
all_samples_gm <- NanoStringNorm(x = df,anno = NA,CodeCount = 'geo.mean',Background = 'none',SampleContent = 'none', round.values = FALSE, take.log =FALSE,return.matrix.of.endogenous.probes =FALSE)

Then, correct for cross-hybridization and normalise for sample variability by using background correction and house keeping genes respectively.

# use housekeeping genes along with background correction(mean+2SD) for biological normalisation---#
normalised_df <- NanoStringNorm(x = all_samples_gm,anno = NA,CodeCount = 'none',Background = 'mean.2sd',SampleContent = 'housekeeping.geo.mean', round.values = FALSE,is.log = FALSE, take.log = TRUE, return.matrix.of.endogenous.probes = TRUE )

This returns the normalised values in log2 scale. If you want the data to be on linear scale then change take.log = FALSE

# save the normalised data in a file---#
write.table(normalised_df,"Normalised_data_nanostring.csv",sep=",",quote=F,row.names = T,col.names = T)

The information about the R packages can be found below.

# print the package versions used ---#
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.5 (Sierra)

## locale:

## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

## attached base packages:

## [1] parallel  stats     graphics  grDevices utils     datasets  
   [7] methods  base     

## other attached packages:
## [1] NanoStringNorm_1.1.21 vsn_3.40.0            Biobase_2.32.0    ## [4] BiocGenerics_0.18.0   gdata_2.17.0         

## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.9           knitr_1.16       magrittr_1.5     
##  [4] zlibbioc_1.18.0       munsell_0.4.3    lattice_0.20-34  
##  [7] colorspace_1.3-2      stringr_1.1.0     plyr_1.8.4       
## [10] tools_3.3.1           grid_3.3.1        gtable_0.2.0     
## [13] affy_1.50.0           htmltools_0.3.5   gtools_3.5.0   
## [16] assertthat_0.1        yaml_2.1.14       lazyeval_0.2.0   
## [19] rprojroot_1.2         digest_0.6.12     preprocessCore_1.34.0
## [22] tibble_1.2            affyio_1.42.0     ggplot2_2.2.1    
## [25] evaluate_0.10         rmarkdown_1.5     limma_3.28.21    
## [28] stringi_1.1.2         BiocInstaller_1.22.3  scales_0.4.1     ## [31] backports_1.0.4

 

Compare Transformations & Batch Effects in Omics Data

While analysing high dimensional data, e.g. from Omics (Genomics, Transcriptomics, Proteomics etc.) – we are essentially measuring multiple response variables (i.e. genes, proteins, metabolites etc.) in multiple samples, resulting in a rXn matrix X with r variables and n samples. The data capture can lead to multiple batches or groups in the data – a batch can be anything from a simple grouping like Male vs Female to Lanes on a sequencer, or Wells in a capture plate. Statistically this means that: Is the covariate or batch ignorable or should it be account for in our model.

The X matrix (data matrix) usually has to be normalised, which means the data has to undergo some sort of transformation to account for certain technical differences due to these batches e.g. like different samples being sequenced to different depths in an Next Generation Sequencer. This kind of systematic effect which is due to technical variation can be ‘adjusted’ and the samples are scaled/transformed to make them identically distributed. There are different methods of normalisation based on the data capture technology; but this post has two aims:

  1. A simple workflow for diagnostic checks for your normalisation: we normalise the same data using two different methods, and check which method may be more appropriate.
  2. Introducing the concepts of object oriented programming in R, with S4 objects.

The data and code used in this post can be found here.

1. Compare two Normalisation Methods:

In genomics, we are capturing many variables, and many of these variables are inter-dependent (e.g. co-expressed genes) and on average most of these variables are assumed to be identically distributed between samples (this however depends on how different each sample is from the other – e.g. comparing types of apples or apples vs oranges). In the example below we are using some gene expression data that has been normalised using two different methods:

Method 1: We used house keeping genes to normalise the samples, even though these samples are different cell types, hence the house keeping genes may act differently in each sub-type.

Method 2: We use the common highly expressed genes across all samples, and look for a common factor among these cells, and normalise using that criteria.

Note: We are not arguing which method is correct, both are essentially mathematically and biologically sound – but which one is suitable for our sample types.

The data in the plots below has been normalised and log transformed (in positive space) and zeros represent missing data. The samples have been coloured on our batch of interest i.e. the machine ID.

CDiagnostics_medianThe figure above is a box plot of the data with the medians highlighted as a bar. We can see that in both the methods, the samples from the second machine (071216) has more missing data. Could this be because good quality samples were run first and bad quality later? Otherwise comparing the two methods, I don’t think there is any strong reason to select one over the other; if I had to choose one I would lean towards method 2 as the medians overall are slightly more equal?

CDiagnostics_mean

We assume the each sample is normally distributed on a log scale; this is a very big assumption, but a convenient one for diagnostic checks. Besides even if the data is not normally distributed, the mean is normally distributed, due to the number of data points (we have about 530 or so variables in each sample). We are plotting the posterior mean for each sample (see here for some details), along with the 95% high density interval (HDI) for each mean. The mean for S_4 and its HDI does not overlap with e.g. S_14 in method 1, while in method 2 all the HDIs can overlap. There is not a clear choice here, but I would prefer the normalisation in method 2.

CDiagnostics_sigma

As we have used a normal model for the average of each sample, and the normal distribution has a second parameter (the scale – sigma, standard deviation), we compare the posterior sigma for these samples. In this case the standard deviation and the HDI in each sample are more comparable in method 2, I would consider this as strong evidence for the second method.

CDiagnostics_missing

The figures show the posterior proportion of the data that is missing after a normalisation done using each method (The reason this can be different, as some normalisation methods may include background signal removal as well). However there does not appear to be a clear winner in this case. However the batch id 071216 has more missing data.

CDiagnostics_PCA

This figure shows the first 2 components from the Principal Component Analysis of the data matrix X. Looking at the distribution of the points along the two axes – visually I can see that the data is more clumped on the left (method 1) while is looks slightly more random on the right (comparatively of course – method 2). This plot also adds another piece of evidence towards using method 2 rather than method 1.

All these pieces of evidence for or against one method should be taken together, along with considering the types of samples being compared and the nuances of the data capture technology. In this case most of the evidence (although not very strong individually) collectively favours method 2. Furthermore, I would consider including the machine ID as a covariate in the model, or use a hierarchical model with the machine ID as a random effect in the model.

2. Object Oriented Programming in R – S4 Objects:

Before I talk about some of my experience with S4 objects, these two sites are good starting points for learning more about this, site1, site2. I liked the structure of C++ and working on larger projects in R can get messy if the programs/scripts are not structured. I will put some code in here for the class CDiagnosticPlots.

Class Declaration:

setClass('CDiagnosticPlots', slots=list(mData='matrix', csTitle='character', lData='list', lParam='list'))

I define the class name, and the slots i.e. variables that this object will hold. Generally if I want the class to grow in the future, I will keep a slot like lData of type list, that I can append to for future use.

Constructor:

## constructor
## ARGS: mData = data matrix with samples in columns (subject space) and variables in rows (variable space)
CDiagnosticPlots = function(mData, csTitle){
  # check some diagnostics here and create object
  if (class(mData) != 'matrix') stop('CDiagnosticPlots: mData is not object of class matrix')
  
  ## create the object
  ob = new('CDiagnosticPlots', mData=mData, csTitle=csTitle, lData=list(), lParam=list())
  ob@lParam = CDiagnosticPlotsGetParameters(ob)
  return(CDiagnosticPlotsBuild(ob))
}

The constructor can perform various checks, before creating the object using the new function. It will return the object of the type CDiagnosticPlots which can be used in your program.

Generic Functions:

Generic functions are very useful and the function with the same name can be called for different object types. This is an example from one of my other repositories here. I define two different classes but both use the plotting function with the same name.

## snippet from first function
setGeneric('plot.var.selection', def = function(ob, ...) standardGeneric('plot.var.selection'))
setMethod('plot.var.selection', signature='CVariableSelection.RandomForest', definition = function(ob, ...){
  # plot the variable importance as bar plots with 
  # confidence intervals
})

setMethod('plot.var.selection', signature='CVariableSelection.ReduceModel', definition = function(ob, ...){
  # plot the test and training error agasint number of variables
  tr = colMeans(ob@mTrain)
  te = colMeans(ob@mTest)  
  m = cbind(tr, te)
  matplot(1:nrow(m), m, type='b', pch=20, lty = 1, lwd=2, col=1:2, xaxt='n', xlab='No. of Variables', ylab='Error Rate')
  legend('topright', legend = c('Train', 'Test'), fill=1:2)
  axis(1, at = 1:nrow(m), las=2)
})


Inheritance can also be implemented and an example can be seen in the link above.