Principal component analysis is a commonly used technique in the multivariate statistics and pattern recognition literature. In this post I try to merge the geometric and algebraic interpretations of data as vectors in a vector space and relate them to PCA. The three major sources used in this blog are: [1] Thomas D. Wickens (1995). The... Continue Reading →
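To make the algebraic side of that idea concrete, here is a minimal sketch of PCA for two variables, assuming the sample covariance (divisor n - 1) and plain Python with no libraries. The function name `pca_2d` and the example points are hypothetical and not taken from the post; the eigenvalues of the 2x2 covariance matrix are obtained from its trace and determinant.

```python
import math

def pca_2d(points):
    """Illustrative PCA for 2-D points (hypothetical helper, not from the post).

    Returns the two eigenvalues of the sample covariance matrix (largest first),
    the unit vector of the first principal axis, and the scores (projections of
    the centred points onto that axis).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Centre the data so that each variable has mean zero.
    centred = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centred) / (n - 1)
    syy = sum(y * y for _, y in centred) / (n - 1)
    sxy = sum(x * y for x, y in centred) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    gap = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    lam1, lam2 = trace / 2.0 + gap, trace / 2.0 - gap

    # Eigenvector belonging to the largest eigenvalue.
    if abs(sxy) > 1e-12:
        axis = (lam1 - syy, sxy)
    elif sxx >= syy:
        axis = (1.0, 0.0)
    else:
        axis = (0.0, 1.0)
    norm = math.hypot(axis[0], axis[1])
    axis = (axis[0] / norm, axis[1] / norm)

    # Geometric view: each score is the length of the projection onto the axis.
    scores = [x * axis[0] + y * axis[1] for x, y in centred]
    return (lam1, lam2), axis, scores

# Made-up example points lying exactly on the line y = x.
eigenvalues, axis, scores = pca_2d([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
print(eigenvalues, axis, scores)
```

In this sketch the variance of the scores equals the largest eigenvalue, which is the link between the geometric picture (projection onto an axis) and the algebraic one (eigen-decomposition of the covariance matrix).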

# Methods of handling and working with missing/censored data (part-2)

Description: As discussed in my last post, missing data in big data analysis cannot always be ignored; handling it requires a good understanding of the data and user decisions. In biology, this generally occurs when the data are subject to limits of detection or quantification (a censoring or truncation mechanism). These... Continue Reading →

# Plausible Reasoning for Scientific Problems: Belief Driven by Priors and Data.

Plausible reasoning requires constructing rational arguments by use of syllogisms, and their analysis by deductive and inductive logic. Using this method of reasoning, and expressing our beliefs in a scientific hypothesis numerically through probability theory, is one of my interests. I try to condense the material from the first 4 chapters of... Continue Reading →

# Methods of handling and working with missing data (part 1)

Description: In biology, missing values are a common occurrence, for example in proteomics and metabolomics studies. This represents a real challenge if one intends to perform an objective statistical analysis that avoids misleading conclusions. The leading causes of incompletely observed data are truncation and censoring, two terms that are often wrongly used interchangeably. You can... Continue Reading →
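The difference between the two mechanisms can be sketched in a few lines. This is a minimal illustration, assuming a lower limit of detection; the values, the limit and the helper names `censor` and `truncate` are all hypothetical and not taken from the post.

```python
# Hypothetical measurements and limit of detection (illustrative values only).
measurements = [0.4, 1.2, 0.1, 2.5, 0.8]
limit_of_detection = 0.5

def censor(values, limit):
    """Left-censoring: every sample stays in the data set, but a value below
    the limit is only known to be 'below the limit' (stored as None + flag)."""
    return [(v, False) if v >= limit else (None, True) for v in values]

def truncate(values, limit):
    """Left-truncation: samples below the limit never enter the data set,
    so even their existence is unknown to the analyst."""
    return [v for v in values if v >= limit]

censored = censor(measurements, limit_of_detection)
truncated = truncate(measurements, limit_of_detection)

print(len(censored))   # number of records after censoring
print(len(truncated))  # number of records after truncation
```

Under censoring the number of records is unchanged and the analyst knows how many values fell below the limit; under truncation that count is lost, which is why the two mechanisms call for different statistical treatment.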

# Regression & Finite Mixture Models

I wrote a post a while back about Mixture Distributions and Model Comparisons. This post continues on that theme and tries to combine multiple data-generating processes in a single model. The code for this post is available at the GitHub repository. There were many useful resources that helped me understand this model, and some... Continue Reading →

# Hierarchical Linear Regression – 2 Level Random Effects Model

Regression is a popular approach to modelling in which a response variable is modelled as a function of certain predictors, in order to understand the relations between variables. I used a linear model in a previous post, using the bread and peace model, and showed various ways to solve the equation. In this post, I want to fit... Continue Reading →

# Normalising Nanostring data

This is a quick R guide to learn about Nanostring technology (nCounter) and how to pre-process the data profiled on this platform. Description: The nCounter system from Nanostring Technologies provides direct, reliable and highly sensitive multiplexed measurement of nucleic acids (DNA and RNA) based on a novel digital barcode technology. It involves Custom Codeset... Continue Reading →

# Using R to export results into Excel

Applying conditional formatting on a sheet based on the values from a different sheet. This is the first post in the series "Tips and Tricks for Data Science". In this post I will show how to create Excel files with conditional formatting in R. As an example I will focus on colouring cells in a... Continue Reading →

# Compare Transformations & Batch Effects in Omics Data

While analysing high-dimensional data, e.g. from Omics (Genomics, Transcriptomics, Proteomics etc.), we are essentially measuring multiple response variables (i.e. genes, proteins, metabolites etc.) in multiple samples, resulting in an $latex r \times n$ matrix $latex X$ with $latex r$ variables and $latex n$ samples. The data capture can lead to multiple batches or groups in the data -... Continue Reading →