gms | German Medical Science

50. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds)
12. Jahrestagung der Deutschen Arbeitsgemeinschaft für Epidemiologie (dae)

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie
Deutsche Arbeitsgemeinschaft für Epidemiologie

September 12-15, 2005, Freiburg im Breisgau

Statistical methods for the validation of questionnaires: Discrepancy between theory and practice

Meeting Abstract

  • Martina Schmidt - DKFZ, Heidelberg
  • Karen Steindorf - DKFZ, Heidelberg

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. Deutsche Arbeitsgemeinschaft für Epidemiologie. 50. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie (gmds), 12. Jahrestagung der Deutschen Arbeitsgemeinschaft für Epidemiologie. Freiburg im Breisgau, 12.-15.09.2005. Düsseldorf, Köln: German Medical Science; 2005. Doc05gmds383

The electronic version of this article is complete and available at:

Published: September 8, 2005

© 2005 Schmidt et al.
This is an Open Access article distributed under the terms of the Creative Commons License. It may be reproduced, distributed, and made publicly available, provided that the author and source are cited.




In principle, there is wide agreement that questionnaires used in epidemiological studies should be validated. However, there is less clarity about study designs, appropriate statistical methods, and the interpretation of validation studies. While clear recommendations for appropriate methods exist in the theoretically oriented literature, these methods are infrequently found in published validation studies. We therefore reviewed the validation methods used in practice, using the example of physical activity questionnaires, and discussed these methods with respect to theoretical statistical insights. Two resources were used: a systematic literature review and a validation study that we had recently completed.

Material and Methods

Using PubMed, we identified validation studies for physical activity questionnaires published in the last 5 years. We compiled an overview of the statistical methods used in these publications and critically discussed them on the basis of theoretical findings. Using a relative validation study that we had performed in 211 women, we described alternative methods and additionally compared them with the commonly used methods by simulation.


Results

We identified 46 validation studies for physical activity questionnaires published in the last 5 years in PubMed. In the majority of studies, physical activity was assessed as a continuous variable. Of these 46 publications, 40 (87.0%) based their conclusions largely on Pearson's or Spearman's correlation coefficients. Bland and Altman [1] had criticized this common validation approach: correlation coefficients are highly influenced by the range of the measured variable and hence depend on the selection of the study population. If the selected population is heterogeneous, this will result in a higher correlation coefficient than in a homogeneous population with the same absolute measurement errors. Furthermore, correlation coefficients cannot detect systematic errors. They are measures of association, not of agreement between the questionnaire and the reference. Bellach [2] showed by simulations that highly biased questionnaires can yield correlation coefficients as high as those of an unbiased, good questionnaire.
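These two points can be illustrated with a minimal simulation sketch in Python (invented numbers chosen for illustration only, not the simulation code of Bellach or of the present study): an instrument that doubles every value correlates with the truth as highly as an unbiased one, and restricting the range of the population lowers the correlation even though the absolute measurement error is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# "True" physical activity (arbitrary units) in a heterogeneous population
truth = rng.normal(loc=30.0, scale=10.0, size=n)

# Unbiased questionnaire: random error only
unbiased = truth + rng.normal(0.0, 5.0, n)

# Heavily biased questionnaire: doubles every value, same random error
biased = 2.0 * truth + rng.normal(0.0, 5.0, n)

r_unbiased = np.corrcoef(truth, unbiased)[0, 1]
r_biased = np.corrcoef(truth, biased)[0, 1]
# Both correlations are high, although the biased instrument is far off
# in absolute terms: correlation measures association, not agreement.

# Range restriction: the same unbiased instrument in a homogeneous subgroup
mask = (truth > 25) & (truth < 35)
r_restricted = np.corrcoef(truth[mask], unbiased[mask])[0, 1]
# r_restricted is clearly lower than r_unbiased, despite identical
# absolute measurement error.
```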

Paired tests for the comparison of means were found in 5 (10.9%) publications. These tests also need to be interpreted with caution. The conclusion that two methods are in sufficient agreement because their means do not differ statistically significantly is inappropriate: significant systematic bias is less likely to be detected when it is accompanied by a large amount of random error. Likewise, a statistically significant difference between the means does not necessarily imply disagreement between the questionnaire and the reference, because it is the magnitude of the mean difference, not its significance, that matters.
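Both failure modes of the paired test can be sketched in a few lines (again with invented numbers, not the data of the reviewed studies): a real bias drowned in random error in a small sample, versus a negligible bias that becomes highly significant in a very large sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Case 1: real systematic bias (+5) hidden by large random error, small n.
truth = rng.normal(30.0, 10.0, 30)
questionnaire = truth + 5.0 + rng.normal(0.0, 20.0, 30)
t1, p1 = stats.ttest_rel(questionnaire, truth)
# p1 may well exceed 0.05: no evidence of bias, but no evidence of
# agreement either.

# Case 2: tiny, practically irrelevant bias (+0.2) in a huge sample.
truth_big = rng.normal(30.0, 10.0, 100_000)
questionnaire_big = truth_big + 0.2 + rng.normal(0.0, 2.0, 100_000)
t2, p2 = stats.ttest_rel(questionnaire_big, truth_big)
# p2 is far below 0.05, although a mean difference of 0.2 is negligible
# for most practical purposes.
```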

Our simulations and the analyses of our validation study confirmed these limitations. As an appropriate alternative, the "limits of agreement" method of Bland and Altman [3] is suggested in the literature. This mainly descriptive method aims to describe and quantify the agreement between the questionnaire and the reference. Bland-Altman plots or limits of agreement were found in 10 (21.7%) publications. However, the Bland-Altman methods were not always applied or interpreted correctly. Furthermore, the conclusions of most of these 10 validation studies were based partly or fully on other methods, mainly on correlations.
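The computation behind the limits of agreement is straightforward; the following sketch (illustrative data, not from the present study) reports the systematic bias and the 95% limits separately, which is exactly the separation of systematic and random error advocated here.

```python
import numpy as np

def limits_of_agreement(method_a, method_b):
    """Bland-Altman analysis: mean difference (bias) and 95% limits of agreement."""
    diff = np.asarray(method_a, dtype=float) - np.asarray(method_b, dtype=float)
    bias = diff.mean()                 # systematic error
    sd = diff.std(ddof=1)              # random error
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Example: questionnaire vs. reference measurement (simulated)
rng = np.random.default_rng(2)
reference = rng.normal(30.0, 10.0, 200)
questionnaire = reference + 3.0 + rng.normal(0.0, 6.0, 200)  # bias +3, error sd 6

bias, lower, upper = limits_of_agreement(questionnaire, reference)
# Roughly 95% of the paired differences fall between lower and upper.
```

On a Bland-Altman plot, the paired differences are plotted against the means of the two methods, with horizontal lines drawn at the bias and at the two limits.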


Discussion

The majority of the reviewed validation studies relied on unsatisfactory statistical methods, mainly correlation coefficients. Some authors argued that for many epidemiological studies the correct ranking of participants is relevant rather than the correct absolute values; hence, calculating correlations would be sufficient, and evaluating the absolute agreement between questionnaire and reference would not be necessary. However, researchers should be extremely cautious in (i) concluding acceptable validity from a high correlation alone; (ii) comparing correlation coefficients between different studies or between subgroups (e.g. men vs. women); and (iii) extrapolating the results to another, possibly more homogeneous, population [4].

The evaluation of a questionnaire is not a matter of statistical hypothesis testing. The relevant question is rather whether the questionnaire is good enough for a specific purpose. The validation study should therefore provide the data needed to decide on the usefulness of the questionnaire for a certain aim, or to calculate sample sizes or minimal detectable effects. Furthermore, systematic and random errors should be presented separately, and possible sources of error should be investigated. Additionally, in case-control studies it is of special interest whether measurement errors differ between cases and controls. Bland-Altman plots and limits of agreement, including reporting of the systematic bias, are simple descriptive methods for presenting the agreement between a questionnaire and the reference (validity). Deciding how much disagreement is still acceptable for one's own purpose is not always easy. However, for correlation coefficients there are likewise no clear criteria for which value is still acceptable. In our review we found a wide range of correlation coefficients, often below 0.5, that were interpreted as indicating valid or "reasonably valid" instruments. The frequently used significance tests of the null hypothesis H0: r=0 are inappropriate.

Our literature review showed a discrepancy between theory and practice regarding questionnaire validation. Correlation coefficients can be misleading in validation studies and should be used only with extreme caution and correct interpretation. Future studies should present systematic and random errors separately and investigate possible sources of error or factors influencing them. Whether the detected amount of measurement error is acceptable should be judged in relation to the intended use of the questionnaire. Bland-Altman plots and limits of agreement have slowly found their way into validation practice in recent years, though unfortunately they are not always correctly applied and interpreted. More frequent and correct use of the Bland-Altman methods would be desirable.


References

1. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1:307-10.
2. Bellach B. Remarks on the use of Pearson's correlation coefficient and other association measures in assessing validity and reliability of dietary assessment methods. Eur J Clin Nutr. 1993;47 Suppl 2:S42-5.
3. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8:135-60.
4. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 1998;26:217-38.