gms | German Medical Science

GMS Journal for Medical Education

Gesellschaft für Medizinische Ausbildung (GMA)

ISSN 2366-5017

Reliability of a science admission test (HAM-Nat) at Hamburg medical school

research article medicine


  • author Johanna Hissbach - Universitätsklinikum Hamburg-Eppendorf, Institut für Biochemie und molekulare Zellbiologie, Hamburg, Deutschland
  • author Dietrich Klusmann - Universitätsklinikum Hamburg-Eppendorf, Institut und Poliklinik für Medizinische Psychologie, Hamburg, Deutschland
  • corresponding author Wolfgang Hampe - Universitätsklinikum Hamburg-Eppendorf, Institut für Biochemie und molekulare Zellbiologie, Hamburg, Deutschland

GMS Z Med Ausbild 2011;28(3):Doc44

doi: 10.3205/zma000756, urn:nbn:de:0183-zma0007562

This is the English version of the article.
The German version can be found at:

Received: October 8, 2010
Revised: March 29, 2011
Accepted: June 1, 2011
Published: August 8, 2011

© 2011 Hissbach et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License. You are free to share (copy, distribute and transmit) the work, provided the original author and source are credited.


Objective: The University Hospital in Hamburg (UKE) started to develop a test of knowledge in natural sciences for admission to medical school in 2005 (Hamburger Auswahlverfahren für Medizinische Studiengänge, Naturwissenschaftsteil, HAM-Nat). This study is a step towards establishing the HAM-Nat. We are investigating

parallel forms reliability,
the effect of a crash course in chemistry on test results, and
correlations of HAM-Nat test results with a test of scientific reasoning (similar to a subtest of the "Test for Medical Studies", TMS).

Methods: 316 first-year students participated in the study in 2007. They completed different versions of the HAM-Nat test which consisted of items that had already been used (HN2006) and new items (HN2007). Four weeks later half of the participants were tested on the HN2007 version of the HAM-Nat again, while the other half completed the test of scientific reasoning. Within this four week interval students were offered a five day chemistry course.

Results: Parallel forms reliability for four different test versions ranged from rtt=.53 to rtt=.67. The retest reliabilities of the HN2007 halves were rtt=.54 and rtt=.61. Correlations of the two HAM-Nat versions with the test of scientific reasoning were r=.34 and r=.21. The crash course in chemistry had no effect on HAM-Nat scores.

Conclusions: The results suggest that further versions of the test of natural sciences will not easily conform to the standards of internal consistency, parallel-forms reliability and retest reliability. Much care has to be taken in order to assemble items which could be used interchangeably for the construction of new test versions. The test of scientific reasoning and the HAM-Nat are tapping different constructs. Participation in a chemistry course did not improve students’ achievement, probably because the content of the course was not coordinated with the test and many students lacked motivation to do well in the second test.

Keywords: Student selection medical school, External validity, Reliability, Admission test


In 2005 Hamburg Medical School started to develop a test of natural sciences (HAM-Nat) as a tool for student admission after a change in federal law allowed German medical schools to select 60% of their student body by admission procedures such as written tests [1], [2].

Until 2008 the Medical Faculty of Hamburg selected candidates solely on the basis of school grade point average (GPA). This is a straightforward approach and GPA is predictive of study success. For the 1986 and 1987 cohorts of medical students, Trost et al. [3] reported a correlation of r=.48 between GPA and grades in the written part of the first clinical examination; the correlation for the oral part was r=.34. In a meta-analysis, Trapmann et al. [4] report a corrected predictive power of r=.58 for grades in the first section of study. High predictive validity of GPA is also reported in international studies [5] and for degree programs other than medicine [4]. In a prospective British study, A-levels showed predictive power for professionalism in the medical field [6].

Nevertheless, GPA as a selective tool is criticized on many accounts [4]:

different standards between schools and federal states make GPA scores incomparable;
reliability and validity of GPA are insufficient;
there are different standards between teachers and classrooms;
predictive power for study success later in the curriculum is weak.

Since a large number of candidates apply for medical school, a high level of GPA is necessary for admission. In the years of 2005-2007 applicants for Hamburg Medical School needed GPA scores of at least 1.6 to 1.7 (a low score means high achievement). As GPA is influenced by the type of school, combinations of subjects, and evaluative standards, using GPA as the only admission criterion raises issues of fairness [7]. Additional selection criteria may compensate for some of the shortcomings of GPA.

Some German medical schools use the “Test for Medical Studies” (TMS) to complement GPA in selection. This test was mandatory for all applicants to medical school in the years between 1986 and 1996. It includes questions from the field of natural sciences; however, the targeted construct is not knowledge but the ability to study successfully [8]. Correlations of TMS scores and GPA range from r=.37 to r=.48 and the authors conclude that GPA and TMS measure sufficiently separable facets of academic achievement [3]. Predictive power of the TMS is mainly based on four subtests for abilities required in the medical curriculum (medical and scientific comprehension, quantitative and formal problems, text comprehension, diagrams and tables) [3].

Various countries employ tests of subject-specific knowledge relevant for the respective courses [9]. Knowledge tests are used for medical school selection in Belgium [10] and Austria [11]. Reibnegger et al. [12] reported an increase of successful students from 23% to 84% after a demanding admission procedure had been introduced at the University of Graz, Austria (mean percentages of three years before and after the admission procedure had been established). Simultaneously the drop-out rate in the first year of medical school decreased from 10% to 1%. The majority of test items were natural science problems similar to HAM-Nat items.

Since 2003 some British medical schools have introduced the Biomedical Admissions Test (BMAT) for student selection. The subtest “scientific knowledge and application” was a useful predictor for examination marks in the first and second year of study [13]. Predictive power of the second part of the BMAT which entails multiple choice items on problem solving, text comprehension, and the interpretation of tables and figures (“aptitude and skill”) is considerably lower [14]. In Germany the HAM-Nat is the only testing program for medical school selection focusing specifically on natural sciences.

With the HAM-Nat test a second selection criterion in addition to GPA will be introduced, a criterion that is uniform for all applicants and might be evaluated consecutively. The HAM-Nat is expected to measure knowledge of natural sciences that is relevant for success in the first two years of the curriculum and thereby help to select applicants with a good chance of completing the course successfully. Moreover, an excellent HAM-Nat score may compensate for a weaker GPA. The internet page of Hamburg Medical School not only offers information about the curriculum but also presents HAM-Nat test items for a self-test of knowledge in natural sciences. Potential applicants may examine their motivation to study medicine and assess their chances to succeed. Hamburg Medical School deliberately aims at pre-selection by self-evaluation. Preparation for the HAM-Nat is tantamount to preparation for the first two years of study because the HAM-Nat examines basic knowledge required for the science classes during that period.

A preliminary version of the HAM-Nat was presented in 2006 to a sample of high school students. From this pilot study a first 2006 version was derived and tested with the 2006 cohort of already admitted students [15]. Subsequently, new items were generated for a 2007 version of the test. The existence of two test versions raises the question of test equivalence.

This study attempts to answer this question and will additionally examine retest reliability. We also analyze the effect of a five-day crash course (training in basic chemistry) on HAM-Nat scores and the relation of the HAM-Nat to a test of “scientific reasoning”. This test resembles the TMS subtest “medical and scientific comprehension”.


Hampe et al. [15] describe the test development of the 2006 HAM-Nat test form (HN2006). After exclusion of 8 items with low item-total correlations, the HN2006 consists of 52 multiple choice items from mathematics, chemistry, physics, and biology. To create a parallel test form for 2007 (HN2007), high school teachers and university lecturers from clinical and basic science departments generated 60 new items similar in content and structure to the HN2006 items.

Example of a HAM-Nat question:

Oxidation of an aldehyde yields...
A) an ester.
B) a ketone.
C) a carboxylic acid.
D) an alcohol.
E) an alkene.

Each item presents one correct answer and four distractors; testees have 1.5 minutes to answer each question. The topics covered in the test and some sample items from HN2006 and HN2007 are published on the internet page of Hamburg Medical School for self-testing.

Test of “scientific reasoning”

The “scientific reasoning” test is similar to the TMS subtest “medical and scientific comprehension” with regard to form and content. Both tests comprise 24 multiple choice items and both were developed by ITB-Consulting. Each question starts with the description of a scientific problem. Subsequently, the testee has to decide which statement out of five options following the text is true. The duration of the test is limited to 55 minutes. Prior scientific knowledge is not needed to answer these questions since the test is designed to measure intellectual abilities relevant to the medical curriculum: comprehension of complex problems and deductive reasoning. Rights to use the test were purchased from ITB-Consulting.

Chemistry course

The chemistry department of Hamburg University regularly offers this five-day course to first year medical students before the beginning of the term. The intention of this optional course is to level previous knowledge of students. Several parallel courses of 30-40 students are run by tutors. The tutors present topics from senior years of secondary school, e.g. the concept of matter, chemical reaction, and the structure of organic compounds, and afterwards students work on problems. The course’s contents are similar to HAM-Nat topics. However, tutors did not teach to the test as they were not familiar with the HAM-Nat.

Study design

First testing: Parallel forms

The HN2006 was divided into two halves A and B of 26 items each. As the test had been accessible on the internet, participants might have known these items. The HN2007 comprises two halves C and D of 30 new items each. Each participant worked on one half from each test version (AC, AD, BC, or BD), namely 26 old items from HN2006 and 30 new items from HN2007 (see Figure 1 [Fig. 1]). Before the second testing, each student had the opportunity to attend the five-day training course. Participants stated how many days of the course they attended.

Second testing: Retest and test of “scientific reasoning”

Participants were randomly assigned to two groups. Four weeks after the first testing 96 participants took the complete HN2007, meaning test halves C and D. Therefore, they had already answered one half of the items while the other half was unknown. The following week, the second group of participants (N=91) took the subtest “scientific reasoning”. The test was conducted by our study group and invigilated by members of faculty. The “scientific reasoning” test was specifically conducted for this study, irrespective of the official, nationwide testing of the TMS.


All students of the 2007 cohort were offered study participation in the first semester orientation week. Study participation was voluntary, and all students gave written informed consent. 316 students (77% of the cohort) agreed to participate (see Figure 1 [Fig. 1]). One third of the sample was male and two thirds female, which corresponds to the distribution in the total cohort. The mean secondary school GPA was 1.8. The second test was conducted in a compulsory class during the first term. In contrast to the first testing in the orientation week, many attendees were not willing to retake the test. Results of the first and second testing could be matched for 170 students (54% of the original sample). No significant differences regarding GPA and gender distribution were found between the group of test repeaters and those who declined the retest. 91 participants worked on the “scientific reasoning” test and 79 wrote the HN2007 again. The effect of the chemistry course can be evaluated with data from 52 students, who took the HAM-Nat twice and stated the number of days that they attended the course. 15 students stated they had attended 3 or fewer days, while 37 attended more than 3 days.

Statistical analysis

Parallel forms reliability requires true scores and error variances to be equal across forms. Equal means and distributions of data, as well as a high correlation between test versions, indicate high parallel forms reliability. Retest reliability assumes that the participants’ true scores and the measurement error remain constant between two assessments. It reflects the degree to which repeated measurement with the same instrument in the same population yields consistent results.

Pearson correlations are calculated to quantify parallel forms reliability and the correspondence of HAM-Nat and “scientific reasoning”. Retest reliability of HN2007 was assessed by means of Spearman’s rank correlation coefficient.
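The two coefficients named above can be illustrated with a minimal sketch. It assumes plain lists of test scores; the function names and the example data are hypothetical and are not taken from the study, whose analyses were run in PASW 18.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical first-test and retest scores for six participants.
first_test = [12, 15, 9, 20, 14, 11]
retest = [13, 14, 3, 19, 15, 10]
print(round(pearson(first_test, retest), 2), round(spearman(first_test, retest), 2))
```

Because the rank correlation uses only the ordering of scores, a single participant with a very low retest score affects it less than it affects the Pearson coefficient.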

Cronbach’s α is the expected value of a correlation of two randomly selected item sets (of k items) from the universe of all possible items for the measured construct. If HN2006 and HN2007 are parallel forms, the correlations of test halves must be as high as their internal consistencies. If the correlations are different in size, the item set is either drawn from different item universes or items are not randomly selected.
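As an illustration of the internal consistency coefficient, the following sketch computes Cronbach’s α from a persons-by-items matrix of 0/1 item scores using the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total scores). The function name and the tiny response matrix are hypothetical and serve only to show the calculation.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of per-person item score lists."""
    k = len(scores[0])  # number of items
    item_variances = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_variance = pvariance([sum(person) for person in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses of four persons to three items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(cronbach_alpha(responses))
```

Under the argument above, the α of a test half obtained this way would be compared with the observed correlation between halves drawn from HN2006 and HN2007.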

A general linear model (GLM) of HN2006 and HN2007 total scores was employed, with “test version” as a repeated measurement factor (within subjects factor) and “group” (AC, AD, BC, BD) as a between subjects factor. A significant repeated measurement factor means that test versions differ in difficulty, while significant interaction effects with group point to differences in test halves within one test version.

To estimate the effect of the chemistry course on chemistry test item performance, the variable “attendance at the course” was dichotomized (0-3 vs. >3 days) and included in a new model as a between subjects factor as well as the within subject factors “items separated by subject” (chemistry vs. other questions) and “time” (first testing vs. retest). PASW 18 for Windows [16] was used for these analyses.


Internal consistency and parallel forms reliability

Inter item correlations for all scales ranged from r=-.22 to r=.53 (mean: r=.06), internal consistencies from α=.56 to α=.69 (see Table 1 [Tab. 1]), and parallel forms correlations from r=.53 to r=.67 (see Table 2 [Tab. 2]).

Retest reliability

Retest reliability was calculated for HN2007. Spearman’s rank correlation for test half C was rtt=.52 (n=46) and for test half D rtt=.61 (n=34) (see Figure 2 [Fig. 2]). The corresponding Pearson correlations were rtt=.54 and rtt=.56. Some participants scored considerably worse in the retest than in the first testing. Exclusion of 9 participants with retest scores below 6 did not raise the correlation coefficient (test half C rtt=.45, n=39; test half D rtt=.61, n=32), even though Figure 2 [Fig. 2] might suggest such an effect.

Differences between test versions HN2006 and HN2007

The general linear model (GLM) gives a more detailed look at differences between test versions. A GLM with the factors “test version” (HN2006 vs. HN2007) as a repeated measurement factor and test half (A or B vs. C or D) as a between subjects factor showed that significantly fewer HN2007 items were solved correctly as compared to the HN2006 version (38.5% vs. 45.2%, F(1,312)=101.5; p<.001). While all participants scored equally high in both test halves of version HN2006 (F(1,312)=2.3; p=.128), test half D was more difficult than test half C in the HN2007 version (35.1% vs. 40.6% correct answers, F(1,312)=11.4; p=.001). Gender, included as a between subjects factor, had no significant effect on performance in the different test versions (F=.468, p=.495), even though males showed higher total scores as compared to females (44% vs. 40% correct answers, t=-2.64; p=.009).

Effect of the chemistry course

Scores of the first and second testing were analyzed separately for chemistry and other items (biology, mathematics, and physics) to check the effect of the chemistry course on performance. Fewer chemistry items were answered correctly as compared to items from the other subjects (35.8% vs. 43.4% correct answers, F(1,78)=25.6, p<.001). There was neither an improvement nor a decline of HN2007 test results after the course; not even chemistry items were answered correctly more often (interaction effect: F(1,78)=0.26; p=.610). The dichotomized variable “days of participation in the course” (0-3 vs. 4-5 days of attendance) showed no significant effect on HN2007 total scores (F(1,50)=2.4; p=.124) or chemistry scores (F(1,50)=0.1; p=.759). The sample for this analysis is reduced to n=52. Including gender in the model yielded no significant interaction effects (all p>.289).

Publicity of test items

In the retest condition, participants had already seen half of the HN2007 items, while the other half was new. Known items were not answered correctly significantly more often than in the first test (41.5% vs. 40.1%; F(1,50)=0.4; p=.543). Exclusion of participants with very low retest scores did not alter results.

Correlation of HAM-Nat and GPA

Correlation coefficients for HAM-Nat and GPA scores ranged between r=-.34 and r=-.13 (see Table 1 [Tab. 1]) for the different versions (mean correlation r=-.24). GPA and the test “scientific reasoning” showed a correlation of r=-.11 (n=90).

Correlation of the HAM-Nat and “scientific reasoning”

The correlations of the subtest “scientific reasoning” with HN2006 test halves A and B were both r=.34. Correlation coefficients for HN2007 test halves C and D were r=.19 and r=.23, respectively (see Figure 3 [Fig. 3]). For the combined test halves HN2006 (A+B) and HN2007 (C+D) the correlation coefficients were r=.34 and r=.21. The two coefficients did not differ significantly (p=.350; test with Fisher’s z [17]).
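A comparison of two correlation coefficients via Fisher’s z transformation can be sketched as follows. This is only an illustration under stated assumptions: it uses the textbook formula for two independent samples and assumes n=91 for both coefficients; the text does not state which variant from [17] or which sample sizes entered the reported test. The function name is hypothetical.

```python
from math import atanh, erfc, sqrt

def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided test of r1 vs. r2 using Fisher's z (independent-samples formula)."""
    z1, z2 = atanh(r1), atanh(r2)                 # Fisher z transformation
    standard_error = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p_two_sided = erfc(abs(z) / sqrt(2))          # equals 2 * (1 - Phi(|z|))
    return z, p_two_sided

# Assumption: both correlations are based on n=91 participants.
z, p = fisher_z_difference(0.34, 91, 0.21, 91)
print(round(z, 2), round(p, 2))  # p comes out at roughly .35 under these assumptions
```

Under these assumptions the sketch gives a p value close to the reported p=.350. Since both coefficients stem from the same participants, a test for dependent correlations would be an alternative worth considering.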


Results can be summarized as follows:

Significantly more old HN2006 items were solved correctly as compared to new HN2007 items, and taking HN2007 twice did not improve test performance.
HN2006 and HN2007 differed neither with regard to their internal consistencies nor with regard to their correlations with a third test, “scientific reasoning”.
Internal consistencies of the different test versions were not significantly different from the correlation between test halves (parallel forms reliability).

Why is the HN2006 easier than the HN2007? Maybe some participants were familiar with the old test items due to their publication on the internet. However, we assume that participants did not prepare for the test because they had already been admitted to medical school and nothing was at stake. We do not know how many students took the internet self-test. However, taking the HN2007 twice within a four week period did not lead to better results. Why should the supposedly infrequent visit of this internet page have an effect? It is more likely that test developers produced more difficult items.

Varying difficulties of HAM-Nat test forms are not problematic in themselves, since the purpose of the test is to rank applicants by a combined score of HAM-Nat and further admission criteria (GPA, further tests). As long as tests produce the same rank ordering, they are exchangeable. However, a test used for student selection should exhibit a profile which is constant over different cohorts.

Rank correlation coefficients are a measure of reproducibility. For test halves C and D of the HN2007 they were r=.52 and r=.61. These are not very high values given that participants had seen the same items four weeks prior to the second testing. This low level of reproducibility might be due to an important source of error which applies to the whole study design: since stakes were low, test score variation is not only due to differences in knowledge but also to differences in test motivation. This is especially true for the retest condition with just above half of the sample taking part. At this time point, participants were busy with the first weeks of the term. Therefore, the observed retest correlation probably underestimates the true retest reliability.

The especially low performance in chemistry items could be explained by the fact that most German schools introduce chemistry classes later into the curriculum than other natural sciences. Moreover, more students drop this subject in sixth form as compared to other science classes. If, for example, biology is dropped in sixth form, pupils still have had more years studying biology as compared to the scenario where chemistry classes are dropped. Offering a training course in chemistry seems worthwhile. But why did we not see better results in the chemistry items of the HAM-Nat retest? For this part motivation to do well is very important and probably participants were not motivated enough. Another explanation could be that HAM-Nat items covered knowledge which was not taught in the course. This finding draws attention to the process of writing items. New items should correspond to the typical teaching material that applicants use for test preparation. Only if this is the case, test preparation can improve chances to be admitted – one of the intended effects of the HAM-Nat. To improve further versions of the HAM-Nat test, a list of topics was published in 2008 to help applicants with their preparation. All subsequent HAM-Nat items can be reliably assigned to one or more topics of the list of subjects.

HN2006 items had been preselected by item total correlation in a first test run, which was not the case for HN2007. This might explain the slightly, though not significantly, lower internal consistency of HN2007. To check this, we excluded items with corrected item total correlations <.10 from both scales and recalculated internal consistencies. HN2006 contained merely 5 items below .10 while HN2007 contained 15 items that had to be excluded. After exclusion of these items, internal consistencies for all test halves rose to values between .60 and .70. Therefore, internal consistencies are only slightly higher than correlations of test halves, and we cannot reject the null hypothesis that both tests are drawn from the same item universe and that they are randomly selected from this universe.
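The exclusion step described above can be sketched as follows, assuming a persons-by-items matrix of 0/1 scores. The corrected item total correlation is the Pearson correlation between an item and the sum of all other items. Function names and data are hypothetical; only the .10 cut-off comes from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def corrected_item_total(scores):
    """Correlation of each item with the total score of the remaining items."""
    k = len(scores[0])
    correlations = []
    for i in range(k):
        item = [person[i] for person in scores]
        rest = [sum(person) - person[i] for person in scores]
        correlations.append(pearson(item, rest))
    return correlations

# Hypothetical responses of four persons to three items (1 = correct).
responses = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
item_rest = corrected_item_total(responses)
retained = [i for i, r in enumerate(item_rest) if r >= 0.10]  # cut-off from the text
print(retained)
```

Internal consistency would then be recalculated on the retained items only.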

We expected correlations of the HAM-Nat with the external criterion “scientific reasoning” to be low, as the “scientific reasoning” test targets reasoning ability and intelligence while the HAM-Nat targets knowledge and its application.

Even though the test versions HN2006 and HN2007 only differ with regard to the number of correctly solved items, results indicate that it is difficult to develop parallel test forms for knowledge of sciences. Erroneously assuming that test forms are parallel (beta error) at this stage of test development is more harmful than the contrary error.

Despite many measures to prevent items from becoming public, new items have to be written every year. However, a certain proportion of old items with good psychometric properties should be reused to raise test quality and to estimate equivalence of new test versions. The larger the item pool, the more items can be reused. Methods that are able to estimate sample independent test characteristics should be used for subsequent HAM-Nat test versions. Models within the item response theory (IRT) framework [18] allow comparisons across different test versions and cohorts of students. Therefore, the aim of our project is to assemble a pool of validated items.
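To illustrate what sample independent item characteristics mean, the following sketch uses the Rasch model, one of the simplest IRT models. It is chosen here purely as an example; the text does not state which IRT model is intended for the HAM-Nat. The function names, ability values, and difficulty value are hypothetical.

```python
from math import exp

def rasch_probability(theta, b):
    """Probability of a correct answer for ability theta and item difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def expected_percent_correct(abilities, b):
    """Expected share of correct answers to one item in a cohort."""
    return sum(rasch_probability(theta, b) for theta in abilities) / len(abilities)

# Two hypothetical cohorts answering the same item (difficulty b = 0.5).
weaker_cohort = [-1.0, -0.5, 0.0, 0.5]
stronger_cohort = [0.0, 0.5, 1.0, 1.5]
print(round(expected_percent_correct(weaker_cohort, 0.5), 2))
print(round(expected_percent_correct(stronger_cohort, 0.5), 2))
```

The percentage of correct answers differs between the cohorts although the item parameter b is the same, which is why a difficulty parameter of this kind, unlike a raw percentage, can be compared across test versions and cohorts.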


We are grateful to Prof. U. Koch-Gromus and Dr. B. Andresen for open discussions and their collaboration, and we would like to thank D. Münch-Harrach and C. Kothe for their support in data management. This research was funded by the “Foerderfonds Lehre”, a grant of the Universitaetsklinikum Hamburg Eppendorf.

Competing interests

The authors declare that they have no competing interests.


Bundesministerium für Bildung und Forschung. Hochschulrahmengesetz. BGBl. 2005;I:3835. Zugänglich unter/available from:
Hansestadt Hamburg. Hochschulzulassungsgesetz Hamburg, HmbGVBl. 2004:515-517. Zugänglich unter/available from:
Trost G, Flum F, Fay E, Klieme E, Maichle U, Meyer M, Nauels HU. Evaluation des Tests für Medizinische Studiengänge (TMS): Synopse der Ergebnisse. Bonn: ITB; 1998.
Trapmann S, Hell B, Weigand S, Schuler H. Die Validität von Schulnoten zur Vorhersage des Studienerfolgs - eine Metaanalyse. Z Padagog Psychol. 2007;21(1):11-27. DOI: 10.1024/1010-0652.21.1.11
Ferguson E, James D, Madeley L. Factors associated with success in medical school: systematic review of the literature. BMJ. 2002;324(7343):952-957. DOI: 10.1136/bmj.324.7343.952
McManus IC, Smithers E, Partridge P, Keeling A, Fleming PR. A levels and intelligence as predictors of medical careers in UK doctors: 20 year prospective study. BMJ. 2003;327(7407):139-142. DOI: 10.1136/bmj.327.7407.139
Wissenschaftsrat. Empfehlungen zur Reform des Hochschulzugangs. Berlin: Wissenschaftsrat; 2004. Zugänglich unter/available from:
Trost G. Test für Medizinische Studiengänge (TMS): Studien zur Evaluation, 20. Arbeitsbericht. Bonn: Institut für Test- und Begabungsforschung; 1996.
Koeller O, Baumert J. Das Abitur - immer noch ein gültiger Indikator für die Studierfähigkeit? Politik Zeitgeschichte. 2002;B26. Zugänglich unter/available from:,0,Das_Abitur_immer_noch_eing%FCltiger_Indikator_f%FCr_die_Studierf%E4higkeit.html
Janssen PJ. Vlaanderens toelatingsexamen arts-tandarts: resultaten na 9 jaar werking. Ned Tijdschr Geneeskd. 2006;62:1569-81. DOI: 10.2143/TVG.62.22.5002592
Smolle J, Neges H, Macher S, Reibnegger G. Aufnahmeverfahren für das Medizinstudium: Erfahrungen der Medizinischen Universität Graz. GMS Z Med Ausbild. 2007;24(3):Doc141. Zugänglich unter/available from:
Reibnegger G, Caluba HC, Ithaler D, Manhal S, Neges HM, Smolle J. Progress of medical students after open admission or admission based on knowledge tests. Med Educ. 2010;44(2):205-214. DOI: 10.1111/j.1365-2923.2009.03576.x
Emery JL, Bell JF. The predictive validity of the BioMedical Admissions Test for pre-clinical examination performance. Med Educ. 2009;43(6):557-564. DOI: 10.1111/j.1365-2923.2009.03367.x
McManus IC, Ferguson E, Wakeford R, Powis D, James D. Predictive validity of the Biomedical Admission Test: An evaluation and case study. Med Teach. 2011;33:53-57. DOI: 10.3109/0142159X.2010.525267
Hampe W, Klusmann D, Buhk H, Muench-Harrach D, Harendza S. Reduzierbarkeit der Abbrecherquote im Humanmedizinstudium durch das Hamburger Auswahlverfahren für Medizinische Studiengaenge - Naturwissenschaftsteil (HAM-Nat). GMS Z Med Ausbild. 2008;25(2):Doc82. Zugänglich unter/available from:
PASW. Predictive Analysis SoftWare. Rel. 18.0.0 ed. Chicago: SPSS Inc.; 2009.
Müller KH. Beitrag zum Prüfen der Differenz zwischen 2 Korrelationskoeffizienten. Biometr Z. 1971;13(5):342-361. DOI: 10.1002/bimj.19710130507
Embretson SE, Reise SP. Item response theory for psychologists. Mahwah, N.J.: L. Erlbaum Associates; 2000.