Methods of handling and working with missing data (part 1)

Description

In biology, missing values are a common occurrence, for example in proteomics and metabolomics studies. They pose a real challenge for anyone who intends to perform an objective statistical analysis and avoid misleading conclusions. The leading causes of incompletely observed data are truncation and censoring, which are often wrongly used interchangeably. You can refer to the post here that explains the difference between the two.

This blog describes methods for handling missing data, including data cleaning and quality checking (part 1); another blog will follow soon (part 2) to discuss potential follow-up analyses.

Data cleaning

If the missing data is nominal, the most common approach is to replace it with a distinct categorical value. For example, suppose the feature of interest is hair colour, with values brown, blonde and red, but for some subjects this information is missing. In that case, one can replace the missing records with ‘not available’ or something similar. By doing this we create a fourth category, which will be treated as another dimension in any downstream statistical analysis.
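As a minimal sketch of this recoding in base R (the hair-colour values here are hypothetical), `addNA()` promotes NA to an explicit factor level, which we then relabel:

```r
# Hypothetical hair-colour data with two missing entries
hair <- factor(c("brown", "blonde", NA, "red", NA))

# addNA() turns NA into an extra factor level, which we then relabel
hair <- addNA(hair)
levels(hair)[is.na(levels(hair))] <- "not available"

table(hair)
```

The relabelled level behaves like any other category in downstream analyses, which is exactly the extra dimension described above.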

On the other hand, if the missing data is on an ordinal or numeric scale, a more sophisticated approach for handling it is required. Dropping or ignoring records containing missing values does not always work, as it assumes that the number of missing values is relatively small and that they occur completely at random. Often this is not the case, and discarding records can introduce substantial bias because information is simply lost. For instance, in gel-based proteomics, the amount of missing data is considerable (~10-50%) and not at random: it can be related to the staining procedure used or to abundances close to the limit of detection of the instrument (Pedreschi et al. 2008).
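A quick illustration of how much complete-case analysis can discard (toy data frame with hypothetical values):

```r
# Toy data frame with scattered NAs (hypothetical values)
df <- data.frame(A = c(1, NA, 3, 4), B = c(10, 20, NA, 40))

# Complete-case analysis: na.omit() drops every row containing an NA,
# here discarding half of the observations
df_complete <- na.omit(df)
nrow(df_complete)  # 2
```

Two scattered NAs in different columns are enough to remove two of the four rows, including the observed values that shared those rows.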

An ad hoc approach often used is to replace the missing data with a fixed value, such as the mean (for normally distributed data) or the median (when the data is skewed) of the observed values. When a value is missing because it falls below the detection limit, a threshold or fixed value can be justifiable. This form of ‘data cleaning’ is useful but not always encouraged: because a single value replaces all the missing entries, it artificially reduces the variance, which can weaken the apparent relationships with other variables.
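A minimal sketch of mean imputation in base R, on a hypothetical vector, which also demonstrates the variance shrinkage mentioned above:

```r
# Hypothetical numeric vector with missing values
x <- c(2, 4, NA, 8, NA, 6)

# Replace each NA with the mean of the observed values
x_mean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_mean  # 2 4 5 8 5 6

# The variance of the imputed vector is smaller than that of the observed values
var(x, na.rm = TRUE)  # ~6.67
var(x_mean)           # 4
```

Swapping `mean` for `median` gives the skewed-data variant; either way the imputed points sit exactly on the centre of the distribution, shrinking the spread.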

Approaches like NIPALS deal with the missing values directly during the multivariate analysis and are a more sensible way of handling randomly missing data. NIPALS, an acronym for Nonlinear Iterative Partial Least Squares, was originally introduced for Principal Component Analysis (PCA), not for what we now know as Partial Least Squares regression. The method is widely used in chemometrics and proteomics and is tolerant to small amounts of missing data (up to 5-20%). It performs PCA using an iterative procedure of weighted regressions, assigning null weights to the missing entries so that they have no influence on the model.

Another sophisticated approach, often used when a large amount of data is incomplete, is to impute the missing values iteratively during the estimation of the model parameters. Several methods have been proposed for this, such as k-nearest neighbours, singular value decomposition and maximum likelihood estimation. R packages like ‘mice’, ‘Hmisc’ and ‘VIM’ implement some of these imputation algorithms. In the end, one must consider the structure of the data, and a compromise should be found between a sound statistical and a sound biological interpretation of the data.
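As a sketch of what iterative imputation looks like in practice, here is one possible workflow with the ‘mice’ package (assuming it is installed; the data frame and its values are hypothetical):

```r
library(mice)

# Hypothetical data frame with missing numeric values
df_miss <- data.frame(A = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
                      B = c(2, 5, NA, 9, 11, 12, 15, NA, 19, 21))

# Create m = 5 completed datasets using predictive mean matching ("pmm")
imp <- mice(df_miss, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

# Extract the first completed dataset; no NAs remain
df_imputed <- complete(imp, 1)
anyNA(df_imputed)  # FALSE
```

Multiple imputation deliberately produces several completed datasets so that downstream analyses can be pooled, propagating the uncertainty about the missing values rather than hiding it.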

Quality check and exploratory analysis

Suppose you have a dataset with some missing data.

# Load the required R packages
library(plsdepot)
library(cluster)

# Here we create a toy data frame
df <- data.frame(A = 1:10, B = 11:20, C = 1:10)
head(df)

##   A B C
## 1 1 11 1
## 2 2 12 2
## 3 3 13 3
## 4 4 14 4
## 5 5 15 5
## 6 6 16 6

# Then we add a few missing values (NA) to the data frame
df_miss <- as.data.frame(lapply(df, function(cc)
  cc[sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE)]))
head(df_miss)

##    A B C
## 1 NA 11 1
## 2  2 12 2
## 3  3 13 NA
## 4 NA 14 NA
## 5  5 15 5
## 6  6 16 6

A) Principal Component Analysis (PCA)

A standard PCA to check the quality of the samples won’t work in this case, as the ‘prcomp’ function in R does not handle missing values. Instead, one can use the NIPALS algorithm to compute the PCA scores and loadings. In R, the library ‘plsdepot’ implements NIPALS.


#----Use nipals to calculate PCA loadings and scores
pc_nipal <- nipals(df_miss, comps = 2, scaled = TRUE)

#----Plot PCA on rows/observations/samples
plot(pc_nipal, what = "observation", comps = c(1, 2), cex = 0.6,
     show.names = TRUE, col.labels = "red")

B) Circle of correlation plot

In this plot, variables are displayed using their correlations with the principal components.

#----Plot PCA on columns/variables
plot(pc_nipal, what = "variables", comps = c(1, 2), cex = 0.5,
     xlim = c(-50, 150), show.names = TRUE, offset = 0.1,
     col.arrows = "#FE992955", col.points = "#5592e3", col.axis = "black")

[Figure: circle of correlation plot]

In general, the plot represents the correlations between the variables/features:

  • The closer a variable appears on the perimeter of the circle, the better it is represented.
  • In addition, if two variables are highly correlated, they will appear near each other.
  • If two variables are negatively correlated, they will tend to appear in opposite extremes.
  • If two variables are uncorrelated, they will be orthogonal to each other.
C) Clustering

Similarly, to perform clustering without removing rows where NAs are present, the Gower distance metric can be used. It is a dissimilarity/distance coefficient that handles missing data well, and it is implemented in the function ‘daisy’ in the R package ‘cluster’.

#----Compute all the pairwise dissimilarities (distances); df_miss is transposed
#----so that dissimilarities are computed between the variables (columns)
diss <- daisy(t(df_miss), metric = "gower")

#----Compute agglomerative hierarchical clustering of the dataset
#----(a 'metric' argument would be ignored here, as a dissimilarity is supplied)
distance_agnes <- agnes(diss, method = "complete")

hcd <- as.dendrogram(distance_agnes)

plot(distance_agnes, which.plots = 2, main = "Dendrogram with daisy function (Gower metric)")

[Figure: dendrogram from agglomerative hierarchical clustering]

These methods are therefore good alternatives for quality-checking data with missingness, instead of discarding records or introducing bias into the analysis.

The information about the R packages used can be found below.

# print the package versions used ---#
sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.5 (Sierra)
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cluster_2.0.5 plsdepot_0.1.17
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.4 magrittr_1.5 rprojroot_1.2 tools_3.3.1
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.9 stringi_1.1.2
## [9] rmarkdown_1.5 knitr_1.16 stringr_1.1.0 digest_0.6.12
## [13] evaluate_0.10

Using R to export results into Excel

Applying conditional formatting on a sheet based on the values from a different sheet

This is the first post in the series “Tips and Tricks for Data Science”. In this post I will show how to create Excel files with conditional formatting in R. As an example I will focus on colouring cells in a sheet based on values stored in a different sheet but the technique is easily extensible to other needs.

Let’s start by importing a toy dataset and by familiarising ourselves with its structure. This dataset was obtained by analyzing a publicly available gene expression dataset (http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-22886/). The original dataset contains gene expression values for multiple different blood cell subsets. Lists of differentially expressed genes for five specific comparisons (Neutrophils vs Monocytes, CD4+ T cells vs CD8+ T cells, IgM memory B cells vs IgG/IgA memory B cells, CD8+ T cells vs NK cells and B cells vs plasma cells) were obtained using limma, and the results were stored in a single data frame. The dataset can be downloaded from (https://github.com/rmelchiotti/DataCATz/blob/master/Datasets/blog_RM2_data.RData).

library(openxlsx)
library(reshape2)
load("/media/development/DataCATz/Datasets/blog_RM2_data.RData")
head(limma_results_multiple_comparisons)
##                 logFC  AveExpr         t      P.Value    adj.P.Val
## 229967_at    5.796050 6.605532  25.98790 8.554785e-48 1.937231e-43
## 234644_x_at  3.185765 6.756721  24.28746 4.018272e-45 4.549688e-41
## 226154_at   -3.686182 8.345932 -21.74632 7.064384e-41 5.332433e-37
## 224374_s_at -3.180816 8.541149 -21.02976 1.271144e-39 7.196262e-36
## 229789_at    1.629650 6.436400  19.92510 1.234818e-37 5.592492e-34
## 230585_at    1.832602 6.559401  18.63027 3.194017e-35 1.205475e-31
##                    B   comparison       probe
## 229967_at   97.62712 NeutrovsMono   229967_at
## 234644_x_at 91.69283 NeutrovsMono 234644_x_at
## 226154_at   82.21080 NeutrovsMono   226154_at
## 224374_s_at 79.39618 NeutrovsMono 224374_s_at
## 229789_at   74.93025 NeutrovsMono   229789_at
## 230585_at   69.49491 NeutrovsMono   230585_at

The data frame contains, for each probe (think gene) and each comparison, log-fold changes, adjusted p-values and additional variables which will not be used in this post.

The goal is to export an Excel table where rows correspond to probes, columns correspond to comparisons and each cell contains the p-value obtained for that particular gene and that particular comparison using limma. In addition, we would like to colour each cell by its corresponding fold change. To do this we make use of an R package, openxlsx (https://cran.r-project.org/web/packages/openxlsx/index.html), developed specifically for the manipulation of Excel files.

Let’s start by creating two data frames, one containing p-values for each probe and comparison and one containing fold changes for each probe and comparison. For simplicity we will focus only on those probes which are significant in at least one of the comparisons.

# Identifying probes which are significant in at least one comparison (adjusted p-value<0.001)
significant<-dcast(subset(limma_results_multiple_comparisons,adj.P.Val<0.001),probe~.,length,value.var="comparison")
colnames(significant)<-c("probe","num_significant_comparisons")
head(significant)
##         probe num_significant_comparisons
## 1 200000_s_at                           1
## 2   200001_at                           1
## 3   200002_at                           1
## 4 200003_s_at                           1
## 5   200004_at                           1
## 6   200005_at                           1
significant_limma_results_multiple_comparisons<-subset(limma_results_multiple_comparisons,probe %in% significant$probe)

# Generating data frame containing p-values (rows correspond to probes, columns correspond to comparisons)
pvals<-dcast(data=significant_limma_results_multiple_comparisons,formula=probe~comparison,value.var="adj.P.Val")
rownames(pvals)<-pvals$probe
pvals$probe<-NULL
head(pvals)
##               BvsPlasma  CD4vsCD8     CD8vsNK  IgMvsIgA NeutrovsMono
## 200000_s_at 0.032460135 0.9242417 0.824582060 0.9695951 2.949946e-12
## 200001_at   0.005092352 0.9575582 0.965266195 0.9992567 4.887914e-11
## 200002_at   0.522338942 0.3181947 0.001017482 0.9705612 1.255813e-21
## 200003_s_at 0.342090910 0.8663762 0.250467121 0.9992567 3.991931e-09
## 200004_at   0.279605958 0.7619152 0.982076553 0.9992567 7.507483e-06
## 200005_at   0.579021244 0.6999011 0.001318081 0.9650455 5.786168e-16
# Generating data frame containing fold changes (rows correspond to probes, columns correspond to comparisons in the same order as above)
fold_changes<-dcast(data=significant_limma_results_multiple_comparisons,formula=probe~comparison,value.var="logFC")
rownames(fold_changes)<-fold_changes$probe
fold_changes$probe<-NULL
head(fold_changes)
##              BvsPlasma    CD4vsCD8     CD8vsNK     IgMvsIgA NeutrovsMono
## 200000_s_at -0.6563666  0.09343405  0.12728432  0.273197551    -2.013805
## 200001_at   -0.8709291 -0.05733401 -0.02998118 -0.007288651    -1.916991
## 200002_at   -0.1731417 -0.39430381  0.93701092  0.215177795    -2.470631
## 200003_s_at  0.2244004 -0.10897526  0.35226591  0.087253733    -1.185663
## 200004_at   -0.3472753  0.22445483  0.01535964  0.122980752    -1.203665
## 200005_at   -0.1283781 -0.17952769  0.77665262  0.188921812    -1.647487

Now that the two data frames are ready we can start the exporting process, using the openxlsx package. We start by creating a workbook and adding two worksheets, one called pvalues and one called fold_changes. In the first worksheet we store the p-values; in the second we store the fold changes.

xlsx_filename<-"/media/development/DataCATz/Tables/results.xlsx"

# Creating workbook
wb<-openxlsx::createWorkbook("Results")

# Creating and filling the p-values worksheet
openxlsx::addWorksheet(wb,"pvalues",gridLines=TRUE)
writeData(wb,sheet=1,pvals,rowNames=TRUE)

# Creating and filling the fold changes worksheet
openxlsx::addWorksheet(wb,"fold_changes",gridLines=TRUE)
writeData(wb,sheet=2,fold_changes,rowNames=TRUE)
openxlsx::saveWorkbook(wb,xlsx_filename,overwrite=TRUE)

We then move on to formatting. In this particular case we want to colour each p-value in the pvalues worksheet by its fold change. We start by creating a gradient of colours, blue for negative fold changes and red for positive ones (the colour changes at every 1 log-fold-change increment).

# Generating gradient of colours
logfc_intervals<-c(-Inf,seq(-8,8,1),Inf)
blues<-colorRampPalette(c("blue", "white"))
reds<-colorRampPalette(c("white","red"))
list_of_colours<-blues(floor((length(logfc_intervals)-1)/2))
list_of_colours<-c(list_of_colours,reds(floor((length(logfc_intervals)-1)/2)))

We then apply conditional formatting using these colours. Each fold change is assigned to a colour according to its value. This colour is then applied to the corresponding cell using the function addStyle.

# Applying colours
for(i in seq(1,length(logfc_intervals)-1)){
  cells_to_colour<-which(pvals<0.05 & fold_changes>logfc_intervals[i] & fold_changes<logfc_intervals[i+1],arr.ind=TRUE)
  style<-createStyle(fgFill=list_of_colours[i])
  openxlsx::addStyle(wb,sheet=1,rows=cells_to_colour[,1]+1,cols=cells_to_colour[,2]+1,style=style)
}

The last step consists of saving the workbook into an Excel file.

# Saving the workbook in an Excel file
openxlsx::saveWorkbook(wb,xlsx_filename,overwrite=TRUE)

There are many other ways to perform conditional formatting using openxlsx. In particular, if the condition used to colour cells in a sheet is based on values stored in that same sheet (for example, colouring p-values based on their own values), the function openxlsx::conditionalFormatting might be a better option.
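As a hedged sketch of that alternative (the sheet name, toy values and colour here are hypothetical), rule-based formatting with conditionalFormatting could look like this; note that rule-based styles use bgFill rather than fgFill:

```r
library(openxlsx)

# Hypothetical workbook with a small p-value table
wb2 <- createWorkbook()
addWorksheet(wb2, "pvalues")
pvals_example <- data.frame(BvsPlasma = c(0.03, 0.52), CD4vsCD8 = c(0.92, 0.001))
writeData(wb2, sheet = 1, pvals_example)

# Highlight every cell in the range that satisfies the rule "<0.05";
# the header occupies row 1, so the data sit in rows 2:3
conditionalFormatting(wb2, sheet = 1, cols = 1:2, rows = 2:3,
                      rule = "<0.05", style = createStyle(bgFill = "#FFC7CE"))

saveWorkbook(wb2, "results_conditional.xlsx", overwrite = TRUE)
```

Because the rule lives inside the Excel file itself, the highlighting updates automatically if the cell values are later edited, unlike the static styles applied with addStyle above.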

The R file used to generate the Excel file described in this post is available for download from git (https://github.com/rmelchiotti/DataCATz/tree/master/Scripts). The Excel file generated by the script can be found here. Below is a list of the packages used.

# Printing package version
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.2  openxlsx_4.0.17
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.0  backports_1.1.0 plyr_1.8.4      magrittr_1.5   
##  [5] rprojroot_1.2   tools_3.4.0     htmltools_0.3.6 yaml_2.1.14    
##  [9] Rcpp_0.12.11    stringi_1.1.5   rmarkdown_1.5   knitr_1.16     
## [13] stringr_1.2.0   digest_0.6.12   evaluate_0.10