## Description

As discussed in my last blog here, missing data in big data analysis cannot always be ignored and requires a good understanding of the data and user decisions on how to handle this scenario. In biology, this generally occurs when the data is subjected to limits of detection or quantification (censoring or truncation mechanism). These values are non-ignorable and need to be considered in any statistical analysis done on the study. However, in the presence of censoring/truncation the dependent variable ‘*y*’ has number of observations that are constrained which could create problem for data modeling. This blog approaches the problem of modeling censored data using R.

## Data modeling

Let us consider a simple scenario (non-biological) of censored data. Suppose you have some data on consumer behaviour where proportion of observations on the dependent variable are missing or censored. This problem occurs frequently in cases where different extents of purchase are observed among buyers (users) but are essentially zero for non-buyers. Thus, one will be tempted to set extents of purchase to zero among non-buyers and simply regress extents of purchase against predictors (demographics, lifestyle measures etc.) The inherent weakness of such a procedure lies in the concentration of zero values in the non-buyers’ sub-sample. Thus, making the fitted regression line to be too flat and estimated coefficients to be biased.

A better and widely-used method to estimate linear relationships between variables when there is either left- or right-censoring in the dependent variable is ‘tobit’ regression. This censored regression model uses a truncated normal distribution with inflation at the censoring point and works by estimating the probability of being censored and then incorporates that into the equation. For more detailed explanation of the model you can refer to the resources here [1,2]. Thus, ‘*tobit*’ is a censored gaussian regression.

In R, tobit and other censored regression models have been implemented in packages like *‘censReg’, ‘VGAM’, ‘MCMC’, ‘NADA’* . Here, we will focus on the ** survreg ( )** function in the

*‘survival’*library in R which implements maximum likelihood estimation for the censored data. The distribution included in the survreg function are Normal, exponential, Weibull, logistic, log-normal and log-logistic.

### An example on how to approach this problem

Let us assume we have a dataset in which we are measuring 8 different analytes concentrations across 6 subjects from 2 groups disease vs control. Each analyte has a limit of quantification (marked by asterisk * in the example) and values below that limit are not measured. In our example, analyte1, analyte3, analyte4 and analyte8 have censored values. Thus, we are dealing with a left-censored dataset.

#### Step1: Preparing the data in the required format

We need the following set of information before we fit a model to each individual analyte:

**A. Observed/measured Values for the 8 different analytes**

#----read the data file data<-read.table("missingdataexample.csv",sep=",",header=TRUE,row.names = 1) print(data)

**B. Predictors to be used in the model (age, gender etc)**

#----read the phenodata for the samples: will be used as predictors in the model phenoData<-read.table("sampleInformation.csv",sep=",",header=TRUE,,row.names =1) print(phenoData)

**C. Censoring Information i.e which of the observed values are censored**

Finally, we need the information indicating which of these values are censored and the type of censoring.

#----create a boolean matrix of censored data(matrix not a vector as we are working with 8 different analyte and thus will be eventually creating seperate models for each of the analytes/rows) df <- data.frame(lapply(data[,], function(x) as.numeric(as.character(x)))) censored<-which(is.na(t(df))) notcensored<-which(!is.na(t(df))) censor_data<-vector() censor_data[censored] <- as.logical(TRUE) censor_data[notcensored] <- as.logical(FALSE) censor_data<-matrix(t(censor_data),nrow=nrow(data),ncol=ncol(data),byrow = T) colnames(censor_data)<-colnames(data) rownames(censor_data)<-rownames(data) print(censor_data)

#### Step 2: Creating the Surv object

The censored response information is captured by creating a *Surv* object for the *survreg* function in the **survival** R package.

my.surv <- Surv(values, event=!censored, type='left’)

* Note:* In survival analysis, we define event/death = TRUE and non-event/alive(censored) = FALSE. Since we have set censored as TRUE and not-censored as FALSE in our example, therefore in the above, we define event as the

**logical not**of the value of the censoring indicator.

Now, the surv object for analyte1 in our data will be :

## [1] 0.03000 0.26000 <0.00726 0.10000 0.09000 0.08000

The censored data is shown by the sign ‘*<*‘ implying that for one of our samples, analyte1 has values below limit of quantification and is therefore left-censored. Once, we have created the *Surv* object, we can fit any linear model to this response object.

#### Step 3: Fitting the model

We have our data in the correct format required by survreg(). Next, let us try fitting a parametric model for each of these analytes respectively.

#----Load the required R packages library(survival) #----remove the asterisk from the values in the data variable before fitting the model data<-as.matrix(data[,]) data[,]<-gsub("[*]","",data[,]) data_mod<-data #----create a function for the survreg model modelFunction = function(dat){ df <- data.frame(exp=as.numeric(as.character(t(data_mod[dat,]))),censored=(censor_data[dat,]),group=as.factor(phenoData$group),sex=phenoData$sex,age=phenoData$age) my.surv <- Surv(df$exp, event=!df$censored, type='left') return(survreg(my.surv~group+sex+age,data=df,dist='gaussian',score=TRUE,control = list(maxiter=90))) } #----fit a survreg model for each of the 8 analytes in our dataset index <- 1:nrow(data_mod) lSurv <- lapply(index, modelFunction) names(lSurv) <- rownames(data_mod) #----Lets look at the model for analyte8 (has one censored observation for control_1) summary(lSurv[[8]]) ## Call: ## survreg(formula = my.surv ~ group + sex + age, data = df, dist = "gaussian", ## control = list(maxiter = 90), score = TRUE) ## Value Std. Error z p ## (Intercept) 2.907 2.6564 1.094 2.74e-01 ## groupDisease 51.296 1.6751 30.622 6.15e-206 ## sexMale 2.248 1.1322 1.985 4.71e-02 ## age 0.242 0.0514 4.707 2.51e-06 ## Log(scale) 0.148 0.3162 0.468 6.40e-01 ## ## Scale= 1.16 ## ## Gaussian distribution ## Loglik(model)= -7.8 Loglik(intercept only)= -24.6 ## Chisq= 33.5 on 3 degrees of freedom, p= 2.5e-07 ## Number of Newton-Raphson Iterations: 16 ## n= 6

#### Interpreting the output:

- Firstly, R reminds us what the model we ran, what options we specified, etc.
- The likelihood ratio chi-square of 33.5 (df=3) with a p-value of 0.00000025 tells us that our model as a whole fits significantly better than an empty model (i.e. a model with no predictors).
- Then, in the table we see the coefficients, their standard errors, the z-statistic and associated p-values. The coefficients forgroup=disease, sex and age are statistically significant.
- For a one unit increase in age, there is a 0.242 point increase in the predicted value of analyte8 concentration
- The term for group and sex (categorical) have slightly different interpretation then the continuous predictor age. The predicted value of analyte is 52.526 points higher for
*diseased*subjects than for*control*samples. Similarly, in*Males*the predicted value of analyte is 2.248 points higher then the*females*.

#### References

- Tobin, J. 1958. Estimation of relationships for limited dependent variables. Econometrica 26: 24-36
- https://menghublog.wordpress.com/2014/12/28/the-use-of-tobit-and-truncated-regressions-for-limited-dependent-variables/

## Leave a Reply