## Statistical methods for the analysis of left-censored variables

Statistical analysis methods for left-censored variables and observations with values below a limit of quantitation or detection

Published: March 7, 2013

### Abstract

In some applications statisticians are confronted with values which are reported to be below a limit of detection or quantitation. These left-censored variables are a challenge in the statistical analysis. In a simulation study, we compare different methods to deal with this type of data in statistical applications. These include measures of location, dispersion, association, and statistical modeling. Our simulation study showed that the multiple imputation approach and the Tobit regression lead to unbiased estimates, whereas the naïve methods including simple substitution of non-detects lead to unreliable estimates. We illustrate the application of the multiple imputation approach and the Tobit regression with an example from occupational epidemiology.

### Zusammenfassung (English translation)

In statistical practice, variables with values below a limit of quantitation or detection occur time and again. These variables are left-censored and therefore pose a challenge for statistical analysis. In a simulation study, we compare estimation methods for calculating measures of location and dispersion, correlations, and regression parameters for such variables. Our results show that the multiple imputation method and Tobit regression lead to unbiased estimates. Naïve methods, including the simple substitution of censored observations, by contrast yield unreliable estimates. We illustrate the application of the multiple imputation method and Tobit regression with an example from occupational epidemiology.

### Introduction

In some applications statisticians are confronted with left-censored variables. In medical research, for example, biomarkers can only be measured with a certain precision, and some measurements fall below a limit of detection or quantitation. Similarly, in epidemiology an exposure measurement might be non-detectable. As noted by Helsel (2010) [1] and Lubin et al. (2004) [2], these non-detects have to be analyzed with great care, because ignoring left-censoring in the statistical analysis might lead to biased estimates, improper models and, hence, erroneous conclusions. We use the abbreviation LOD (limit of detection) to refer to the cut-off value of censored variables. The aim of this paper is to compare and validate different methods for the statistical analysis of left-censored variables. The methods discussed are: restricting the dataset to measurable values; substituting censored values by one-half, two-thirds, or one over the square root of two times LOD; multiple imputation of left-censored values; and Tobit regression. In a simulation study, we calculate the median and the interquartile range as measures of location and dispersion, the Pearson correlation coefficient as a measure of association, and linear regression models as an example of statistical modeling. Finally, we use data from the WELDOX study [3], [4], a cross-sectional study on welders, to show how the presented methods can be applied in occupational epidemiology. Here we are interested in describing the concentrations of respirable particles (RP) in welding fume and in quantifying the predictors of this exposure.

### Methods of the simulation study

To compare different methods for the statistical analysis of left-censored variables, we conducted a simulation study. For each setting, 1000 complete datasets of 250 observations each were generated without any censored values. The proportion of values below LOD was set to 10%, 25% and 50%. This was achieved by calculating the 10^{th}, 25^{th} and 50^{th} percentile in each simulated dataset and using this value as LOD. We compared the estimates of the different methods based on the censored datasets with the estimates based on the complete datasets. The statistical analysis comprised estimates of quartiles, Pearson correlation coefficients and simple linear regression parameters. The quartiles were calculated according to definition number two of Hyndman and Fan (1996) [5]. The simulated variables for the estimation of quartiles were drawn from a rectangular distribution U(0,20), a normal distribution N(50,50) and a log-normal distribution lnN(3,2). The simulated datasets for the analysis of the Pearson correlation were drawn from a bivariate normal distribution with correlation *ρ* = 0.25, 0.50, or 0.75. The simulated datasets for the regression models followed the equation y_{i} = 4 + 5 x_{i} + e_{i}, where i = 1, … , 250 and e_{i} ~ N(0,σ²). σ² was set to 100 or 1500, and x_{i} was drawn from a rectangular distribution U(10,50). For the statistical analysis of left-censored variables the following methods were applied and compared: naïve methods, multiple imputation, calculation of an upper and lower bound of the quartiles, and Tobit regression. All calculations were performed with the SAS software, version 9.2 (SAS Institute Inc., Cary, NC, USA).
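The construction of one simulated regression dataset can be sketched as follows. This is a minimal illustration in Python with NumPy, not the SAS code used in the study. The function name `simulate_censored_regression`, the seed handling, and the use of NumPy's default quantile rule to set the LOD (instead of definition two of Hyndman and Fan) are assumptions made for this example.

```python
import numpy as np

def simulate_censored_regression(n=250, sigma2=100.0, prop_censored=0.25, seed=0):
    """Simulate y = 4 + 5 x + e and left-censor y at an empirical percentile.

    Returns x, the complete y, the censored y (NaN below LOD), and the LOD.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(10, 50, size=n)               # x ~ U(10, 50)
    e = rng.normal(0.0, np.sqrt(sigma2), size=n)  # e ~ N(0, sigma^2)
    y_complete = 4 + 5 * x + e

    # LOD is the chosen percentile of this dataset
    # (assumption: NumPy's default linear-interpolation quantile).
    lod = np.quantile(y_complete, prop_censored)

    # Values below LOD are treated as unknown in the censored dataset.
    y_censored = np.where(y_complete < lod, np.nan, y_complete)
    return x, y_complete, y_censored, lod

x, y_complete, y_censored, lod = simulate_censored_regression()
slope, intercept = np.polyfit(x, y_complete, 1)   # reference fit on complete data
print(f"LOD = {lod:.1f}, censored share = {np.isnan(y_censored).mean():.3f}")
print(f"complete-data fit: intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Repeating this 1000 times per setting, with a different seed each time, would give the kind of replicate datasets against which the censored-data estimates are compared.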

#### Naïve methods

Computationally easy to apply are the naïve methods for dealing with left-censored variables. The discard method restricts the analysis dataset to values above LOD. Simple substitution methods replace values below LOD with a constant. In this simulation study, we applied three different simple substitution methods: substitution by one-half times LOD (½*LOD), substitution by two-thirds times LOD (2/3*LOD) and substitution by one over the square root of two times LOD (1/√2*LOD).
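For illustration, the four naïve methods can be written in a few lines of Python. This is a sketch only. The names `discard`, `substitute` and `SUBSTITUTION_FACTORS` are hypothetical, and the example assumes that censored entries are identified by comparing the stored values with the LOD.

```python
import numpy as np

# Substitution factors applied to the LOD in the three simple substitution methods.
SUBSTITUTION_FACTORS = {
    "half": 1 / 2,
    "two_thirds": 2 / 3,
    "inv_sqrt2": 1 / np.sqrt(2),
}

def discard(values, lod):
    """Discard method: keep only values that are not below the LOD."""
    values = np.asarray(values, dtype=float)
    return values[values >= lod]

def substitute(values, lod, factor):
    """Simple substitution: replace every value below the LOD by factor * LOD."""
    values = np.asarray(values, dtype=float)
    return np.where(values < lod, factor * lod, values)

measurements = [0.2, 0.5, 1.0, 3.0]
print(discard(measurements, lod=1.0))
for name, factor in SUBSTITUTION_FACTORS.items():
    print(name, substitute(measurements, lod=1.0, factor=factor))
```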

#### Multiple imputation

The multiple imputation method is described in detail in the work of Lubin et al. (2004) [2]. In brief, a lower and an upper bound for each observation are defined. It is assumed that the probability distribution of measurements within the lower and upper bound of the censoring interval depends only on observed data and follows the same distribution. In a first step, 100 bootstrap datasets [6] are sampled with replacement from the original dataset. These bootstrap datasets are of the same size as the original dataset. Applying the Tobit regression approach, the cumulative distribution function *G* of the left-censored variable is estimated for each bootstrap sample. In each dataset, single imputations of each value below LOD are done as follows: Let *z* be a uniform random value between G(lower bound) and G(upper bound). Then, the single imputed value is equal to G^{–1}(z). In the next step, for each of the 100 imputed datasets the point estimates of interest are calculated, for example the parameters of a linear regression model. Finally, these estimates are combined according to the method described by Little and Rubin (1987) [7] and the mean of the point estimates is the final point estimate.
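The steps above can be sketched in Python for the simplest case of a single normally distributed variable without covariates. This is an illustrative sketch under stated assumptions, not the implementation used in the study. It assumes that the censored-normal maximum likelihood fit stands in for the Tobit step, that the imputation is applied to the original dataset using the parameters from each bootstrap fit, and that the lower bound defaults to minus infinity. The function names are hypothetical.

```python
import numpy as np
from scipy import optimize, stats

def fit_censored_normal(y, censored, lod):
    """ML estimate of (mu, sigma) for a normal variable left-censored at lod."""
    obs = y[~censored]
    n_cens = int(censored.sum())

    def neg_loglik(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)  # keeps sigma positive
        ll = stats.norm.logpdf(obs, mu, sigma).sum()        # observed values
        ll += n_cens * stats.norm.logcdf(lod, mu, sigma)    # values below LOD
        return -ll

    start = np.array([obs.mean(), np.log(obs.std())])
    res = optimize.minimize(neg_loglik, start, method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

def multiple_imputation_estimate(y, censored, lod, statistic=np.median,
                                 n_imputations=100, lower=-np.inf, seed=0):
    """Average a statistic over imputed datasets (bootstrap + inverse-CDF draws)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_imputations):
        # Step 1: bootstrap sample of the same size, drawn with replacement.
        idx = rng.integers(0, n, size=n)
        # Step 2: estimate the distribution G from the bootstrap sample.
        mu, sigma = fit_censored_normal(y[idx], censored[idx], lod)
        # Step 3: z ~ U(G(lower), G(LOD)); the imputed value is G^{-1}(z).
        g_low = stats.norm.cdf(lower, mu, sigma)
        g_up = stats.norm.cdf(lod, mu, sigma)
        z = rng.uniform(g_low, g_up, size=int(censored.sum()))
        imputed = y.copy()
        imputed[censored] = stats.norm.ppf(z, mu, sigma)
        # Step 4: point estimate of interest on the imputed dataset.
        estimates.append(statistic(imputed))
    # Step 5: the final point estimate is the mean of the point estimates.
    return float(np.mean(estimates))
```

Here `y` is a NumPy array whose censored entries may hold any placeholder (for example NaN), and `censored` is a boolean array marking them. Only the point estimate is combined in this sketch; variance estimation across imputations is left out.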

#### Calculation of an upper and lower bound of the quartiles

For left-censored variables, a range that contains the true quartiles can be calculated. The upper bound of a quartile is obtained by substituting values below LOD with the LOD and then calculating the quartile according to definition number two of Hyndman and Fan (1996) [5]. Similarly, the lower bound of a quartile is obtained by substituting values below LOD with zero or the smallest possible value of the variable of interest and then calculating the quartile as usual. If the variable has no smallest possible value, the lower bound cannot be determined.
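A small Python sketch of this bounding procedure is given below. The helper `quantile_type2` is a hand-written version of definition two of Hyndman and Fan (inverse empirical distribution function with averaging at discontinuities). The function names and the `smallest` parameter are assumptions made for this illustration.

```python
import numpy as np

def quantile_type2(x, p):
    """Sample quantile, definition 2 of Hyndman and Fan (1996)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    h = n * p
    j = int(np.floor(h + 1e-9))      # tolerance guards against rounding error
    if h - j > 1e-9:
        return x[min(j, n - 1)]      # n*p not an integer: next order statistic
    lo = x[max(j - 1, 0)]
    hi = x[min(j, n - 1)]
    return 0.5 * (lo + hi)           # n*p an integer: average the two neighbours

def quartile_bounds(values, censored, lod, p, smallest=0.0):
    """Lower and upper bound of the p-quantile of a left-censored variable.

    `smallest` is the smallest possible value of the variable; pass None if
    no such value exists, in which case the lower bound is not determinable.
    """
    values = np.asarray(values, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    upper = quantile_type2(np.where(censored, lod, values), p)
    if smallest is None:
        return None, upper
    lower = quantile_type2(np.where(censored, smallest, values), p)
    return lower, upper

values = np.array([np.nan, np.nan, 3, 4, 5, 6, 7, 8])
censored = np.isnan(values)
print(quartile_bounds(values, censored, lod=2.0, p=0.25))  # bounds differ
print(quartile_bounds(values, censored, lod=2.0, p=0.50))  # bounds coincide
```

In this toy dataset a quarter of the values are censored, so the bounds for the first quartile differ, while the median is unaffected by the censoring and both bounds coincide.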

#### Tobit regression

The basic idea of Tobit regression is to treat the left-censored variable y as the observed outcome of a normally distributed latent variable y* [8]. This leads to the following model equations

$$y^{*} = \beta_{0} + x\beta + u, \qquad y = \max(\mathrm{LOD},\, y^{*})$$

with u|x ~ N(0,σ^{2}). For y = LOD, the density of y is equal to the probability of observing y* < LOD, and for y > LOD the density of y is the same as the density of y*. The parameters of the model are estimated using a maximum likelihood approach. Detailed information about Tobit regression can be found in the work of Amemiya (1984) [9].
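The maximum likelihood estimation can be illustrated with a short Python sketch for a single covariate. Observed values contribute the normal density to the likelihood, and censored values contribute the probability that the latent variable falls below the LOD. This is a minimal sketch, not the SAS procedure used in the study. The function name `fit_tobit`, the centering of x, and the choice of optimizer are assumptions made for this example.

```python
import numpy as np
from scipy import optimize, stats

def fit_tobit(x, y, censored, lod):
    """ML fit of y* = b0 + b1 * x + u, u ~ N(0, sigma^2), left-censored at lod.

    Entries of y flagged in `censored` are ignored (they may be NaN).
    Returns (b0, b1, sigma).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    x_mean = x.mean()
    xc = x - x_mean                      # centering improves numerical conditioning

    def neg_loglik(theta):
        a, b1, log_sigma = theta
        sigma = np.exp(log_sigma)        # keeps sigma positive
        mean = a + b1 * xc
        ll_obs = stats.norm.logpdf(y[~censored], mean[~censored], sigma).sum()
        ll_cens = stats.norm.logcdf(lod, mean[censored], sigma).sum()
        return -(ll_obs + ll_cens)

    # Start values: ordinary least squares on the observed values only.
    b1_0, a_0 = np.polyfit(xc[~censored], y[~censored], 1)
    resid = y[~censored] - (a_0 + b1_0 * xc[~censored])
    start = np.array([a_0, b1_0, np.log(resid.std())])

    res = optimize.minimize(neg_loglik, start, method="BFGS")
    a, b1, log_sigma = res.x
    return a - b1 * x_mean, b1, float(np.exp(log_sigma))
```

With data simulated as y = 4 + 5x + e and a quarter of the y values censored, a call such as `fit_tobit(x, y_censored, np.isnan(y_censored), lod)` should return estimates close to the intercept 4 and the slope 5, within sampling error.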

### Results of the simulation study

The multiple imputation method showed the best results in the simulation study in estimating the median and the first and third quartile of a left-censored variable (Table 1 [Tab. 1]). The discard method overestimated all quartiles. The three simple substitution methods estimated the quartiles accurately in all cases where the percentage of values below LOD was smaller than the percentile of interest. Otherwise the percentiles were estimated with a systematic error. Similarly, the calculation of upper and lower bounds gave the true quartile when the percentage of values below LOD was smaller than the percentile of interest. Otherwise, it yielded a wide range that included the true quartile.

The correlation coefficient was very accurately estimated by the multiple imputation method (Table 2 [Tab. 2]). The other methods underestimated the correlation coefficient in all cases. An increasing percentage of values below LOD resulted in less accurate estimates of the correlation coefficient.

Table 3 [Tab. 3] shows that the discard method overestimated the intercept and underestimated the slope of the regression line in all cases. The simple substitution methods also gave very imprecise estimates of the regression parameters. When a higher percentage of values below LOD was present or σ² increased, the error of the estimates increased. The regression based on multiple imputation and the Tobit regression showed very accurate results for the regression parameters in all simulated settings. In an additional simulation run, we changed the simulation by using a fixed and identical set of x- and cut-off values for all simulated datasets. The results were very similar to the results presented in Table 3 [Tab. 3]. Again only multiple imputation and the Tobit regression showed very accurate results for the regression parameters (data not shown).

### Methods of the application example

The measurement of welding fume in the WELDOX study served as an example for the application of the methods to analyze a left-censored variable. We used data from 215 welders recruited between May 2007 and October 2009 in a cross-sectional study. The WELDOX study was approved by the Ethics Committee of the Ruhr University Bochum and was conducted in accordance with the Helsinki Declaration. Details about the WELDOX study and technical information about the exposure measurements are described by Lehnert et al. (2012) [3] and Pesch et al. (2012) [4]. In brief, the welders were equipped with a sampling system to determine the exposure to RP during a shift, and the content of manganese (Mn) in the respirable welding fume was determined. Each workplace was documented by technicians, and the descriptions included the welding technique applied (gas metal arc welding, flux-cored arc welding, tungsten inert gas welding, shielded metal arc welding, and miscellaneous techniques), the materials used (mild steel, stainless steel, and miscellaneous materials) and information about the space (confined or non-confined). Confined work spaces were locations with a strongly restricted air exchange. The efficiency of the local exhaust ventilation was rated by a team of experts. Regression models were applied to determine potential predictors of the concentrations of RP. Due to the skewed distribution of the exposure variable, it was log-transformed prior to the regression analysis. In addition to the statistical methods described in the simulation study, a third approach based on data from respirable Mn was applied. All measurement values of respirable Mn were above LOD and were closely associated with the measurement values of RP. Therefore we used the Mn data for a multiple imputation of the concentrations of RP below LOD as described in Lehnert et al. (2012) [3]. Similar to the multiple imputation approach described above, values below LOD were imputed 100 times using a Tobit regression approach with the log-transformed respirable Mn concentrations as independent variable and the log-transformed RP concentration as dependent variable.

### Results of the application example

About 30% (65 out of 215) of the measurements of the concentration of RP were below LOD. The median of the concentrations of RP was 1.29 mg/m^{3}. None of the 215 measurements of respirable Mn was below LOD. The log-transformed concentration of welding fume was strongly positively correlated with the log-transformed concentration of respirable Mn (multiple imputation: r = 0.948). Table 4 [Tab. 4] presents the results of the regression analysis calculated by Tobit regression, multiple imputation and multiple imputation based on Mn data. The effect estimates were in good accordance. In the models, flux-cored arc welding had about two-fold higher concentrations of RP than gas metal arc welding (Tobit regression, multiple imputation, multiple imputation based on data from respirable Mn: 2.26, 2.30, 2.25). Tungsten inert gas welding had the lowest concentrations of all welding techniques. Welding of stainless steel resulted in concentrations about 0.6-fold (0.57, 0.52, 0.55) those of welding mild steel. The exposure in a confined work space was about two-fold (1.79, 2.01, 1.87) higher than in a non-confined space. Efficiently used local exhaust ventilation reduced the concentrations of RP to about 0.4-fold (0.43, 0.40, 0.43).
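The fold changes reported here can be related to a regression on the log scale. As a hedged sketch, assume a natural-log transformation and write β_k for the coefficient of an indicator variable d_k, such as efficient local exhaust ventilation. These symbols are introduced for illustration only and do not appear in the original analysis.

```latex
\ln(\mathrm{RP}) = \beta_0 + \beta_k d_k + \dots + u
\quad\Longrightarrow\quad
\frac{\mathrm{RP} \mid d_k = 1}{\mathrm{RP} \mid d_k = 0} = e^{\beta_k}
```

Under this reading, a factor of 0.43 would correspond to β_k = ln(0.43) ≈ −0.84, that is, concentrations about 57% lower with all other predictors held fixed.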

### Discussion

Our simulation study confirmed that the application of naïve methods can lead to a systematic error in the estimates. In the presence of a left-censored variable, multiple imputation or Tobit regression should be preferred instead. However, Tobit regression can only be applied to estimate regression models, whereas multiple imputation is more flexible. We demonstrated with an example from occupational epidemiology that Tobit regression and multiple imputation both worked well to determine the main predictors of the exposure to respirable welding fume and that these methods yielded similar estimates.

Median and interquartile range are robust measures of central tendency and dispersion. The discard method always overestimated the quartiles and should not be applied. The simple substitution methods work very well as long as the percentage of left-censored values is below the percentile of interest, but in all other cases multiple imputation gives the best results. Further information about methods to measure the central tendency and dispersion of a left-censored variable is summarized by Helsel (2005) [10].

Multiple imputation is also the best method in our simulation study to estimate the correlation coefficient in the presence of a left-censored variable. The other approaches lead to biased estimates. This finding is in line with a study of Lyles et al. (2001) [11]. Chu et al. (2008) [12] explored nonparametric methods to estimate the correlation coefficient and concluded that further research is needed in this field.

In the regression analysis, both multiple imputation and Tobit regression led to precise estimates of the regression parameters, whereas the estimates based on the discard method and the simple substitution methods were again unreliable. Even when only 10% of the measurements were below LOD, the intercept estimate of the naïve methods had a strong bias. Our simulation study confirmed the results of Lubin et al. (2004) [2], who stated that “… any single value to impute missing measurement data is not advisable” and that “multiple imputation of missing data is the best approach of ensuring unbiased estimates of effects and nominal CIs”. Lubin et al. (2004) [2] compared Tobit regression, multiple imputation, single imputation, inserting ½*LOD, and inserting the conditional expected value E[Y|Y < LOD] in a simulation study with a regression model with zero intercept and no covariates. We could show that their result also holds in regression models including a covariate. Furthermore, our simulation study demonstrated that the discard method and other substitution methods, including the insertion of 2/3*LOD and 1/√2*LOD, lead to biased effect estimates.

Our example from occupational medicine showed that multiple imputation and Tobit regression can be readily applied to real datasets. The WELDOX study shows that the major determinants of the exposure to respirable welding fume are the welding technique applied, the material used, the possibilities of air exchange at the work space and the use of efficient local exhaust ventilation. A detailed discussion can be found in Lehnert et al. (2012) [3].

Multiple imputation and Tobit regression are recommended for the analysis of left-censored variables. On the one hand, Tobit regression needs fewer computational resources than multiple imputation. On the other hand, Tobit regression is a specialized method for regression models, whereas multiple imputation can be applied in a variety of statistical approaches. Tobit regression is the best approach in applications where the sole interest is to estimate regression parameters; in all other cases the multiple imputation approach should be preferred.

### Conclusion

In the presence of a left-censored variable, naïve methods should not be applied. We showed that these methods are inferior to the multiple imputation method in the analysis of central tendency and dispersion, correlation and in statistical modeling. Tobit regression is as suitable as multiple imputation for estimating regression models with a left-censored dependent variable.

### Notes

#### Acknowledgement

The WELDOX study was financially supported by the German Social Accident Insurance (DGUV). We thank the WELDOX study group and all welders who participated. We are thankful to Ying Cheng for addressing the problems of left-censored variables in her diploma thesis, which was supervised by Jörg Rahnenführer, professor for “Statistical methods in genetics and chemometrics” at the Department of Statistics of TU Dortmund University, Germany.

#### Competing interests

The authors declare that they have no competing interests.

### References

1. Helsel D. Much ado about next to nothing: incorporating nondetects in science. Ann Occup Hyg. 2010;54:257-62. DOI: 10.1093/annhyg/mep092
2. Lubin JH, Colt JS, Camann D, Davis S, Cerhan JR, Severson RK, Bernstein L, Hartge P. Epidemiologic evaluation of measurement data in the presence of detection limits. Environ Health Perspect. 2004 Dec;112(17):1691-6. DOI: 10.1289/ehp.7199
3. Lehnert M, Pesch B, Lotz A, Pelzer J, Kendzia B, Gawrych K, Heinze E, Van Gelder R, Punkenburg E, Weiss T, Mattenklott M, Hahn JU, Möhlmann C, Berges M, Hartwig A, Brüning T; Weldox Study Group. Exposure to inhalable, respirable, and ultrafine particles in welding fume. Ann Occup Hyg. 2012 Jul;56(5):557-67. DOI: 10.1093/annhyg/mes025
4. Pesch B, Weiss T, Kendzia B, Henry J, Lehnert M, Lotz A, Heinze E, Käfferlein HU, Van Gelder R, Berges M, Hahn JU, Mattenklott M, Punkenburg E, Hartwig A, Brüning T. Levels and predictors of airborne and internal exposure to manganese and iron among welders. J Expo Sci Environ Epidemiol. 2012 May-Jun;22(3):291-8. DOI: 10.1038/jes.2012.9
5. Hyndman RJ, Fan Y. Sample Quantiles in Statistical Packages. Am Stat. 1996;50:361-5.
6. Efron B. Bootstrap methods: Another look at the jackknife. Ann Stat. 1979;7:1-26. DOI: 10.1214/aos/1176344552
7. Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley; 1987.
8. Tobin J. Estimation of Relationships for Limited Dependent Variables. Econometrica. 1958;26:24-36. DOI: 10.2307/1907382
9. Amemiya T. Tobit models: A survey. J Econom. 1984;24(1-2):3-61. DOI: 10.1016/0304-4076(84)90074-5
10. Helsel DR. Nondetects and data analysis: Statistics for censored environmental data. Hoboken, NJ: Wiley-Interscience; 2005. (Statistics in practice).
11. Lyles RH, Fan D, Chuachoowong R. Correlation coefficient estimation involving a left censored laboratory assay variable. Stat Med. 2001 Oct;20(19):2921-33. DOI: 10.1002/sim.901
12. Chu H, Nie L, Zhu M. On estimation of bivariate biomarkers with known detection limits. Environmetrics. 2008;19:301-317. DOI: 10.1002/env.868