gms | German Medical Science

GMS Journal for Medical Education

Gesellschaft für Medizinische Ausbildung (GMA)

ISSN 2366-5017

Cost analysis for computer supported multiple-choice paper examinations

research article medicine

  • author Alexander Mandel - Wuerzburg University, Medical Faculty, Wuerzburg, Germany
  • author Alexander Hörnlein - Wuerzburg University, Faculty of Mathematics and Computer Science, Chair of Artificial Intelligence and Applied Informatics, Wuerzburg, Germany
  • author Marianus Ifland - Wuerzburg University, Faculty of Mathematics and Computer Science, Chair of Artificial Intelligence and Applied Informatics, Wuerzburg, Germany
  • author Edeltraud Lüneburg - Wuerzburg University, Medical Faculty, Wuerzburg, Germany
  • author Jürgen Deckert - Wuerzburg University, Medical Faculty, Wuerzburg, Germany
  • corresponding author Frank Puppe - Wuerzburg University, Faculty of Mathematics and Computer Science, Chair of Artificial Intelligence and Applied Informatics, Wuerzburg, Germany

GMS Z Med Ausbild 2011;28(4):Doc55

doi: 10.3205/zma000767, urn:nbn:de:0183-zma0007672

This is the English version of the article.
The German version can be found at: http://www.egms.de/de/journals/zma/2011-28/zma000767.shtml

Received: September 10, 2010
Revised: June 16, 2011
Accepted: June 16, 2011
Published: November 15, 2011

© 2011 Mandel et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.


Abstract

Introduction: Multiple-choice examinations are still fundamental for assessment in medical degree programs. In addition to content-related research, optimizing the technical procedure is an important question. Medical examiners face three options: paper-based examinations with or without computer support, or completely electronic examinations. Critical aspects are the effort for formatting, the logistic effort during the actual examination, the quality, promptness and effort of the correction, the time for making the documents available for inspection by the students, and the statistical analysis of the examination results.

Methods: For three semesters, a computer program for entering and formatting MC questions in medical and other paper-based examinations has been in use and continuously improved at Wuerzburg University. In the winter semester (WS) 2009/10 eleven, in the summer semester (SS) 2010 twelve and in WS 2010/11 thirteen medical examinations were conducted with the program and evaluated automatically. For the last two semesters the remaining manual workload was recorded.

Results: The effort for formatting and for the subsequent analysis, including adjustments of the assessment, of an average examination with about 140 participants and about 35 questions was 5-7 hours for exams without complications in WS 2009/10, about 2 hours in SS 2010 and about 1.5 hours in WS 2010/11. Including exams with complications, the average time was about 3 hours per exam in SS 2010 and 2.67 hours in WS 2010/11.

Discussion: For conventional multiple-choice exams, computer-based formatting and evaluation of paper-based exams offers a significant time saving for lecturers compared with manual correction of paper-based exams, and compared with purely electronically conducted exams it requires a much simpler technical infrastructure and fewer staff during the exam.

Keywords: Educational Measurement (I2.399), Self-Evaluation Programs (I2.399.780), Multiple-Choice Examination, Cost Analysis


Introduction

Multiple Choice (MC) exams still play a prominent role in medical assessment [8]. In addition to content-related work [9], [2], the question arises of how the technical aspects can be optimized. There are three basic options for implementing MC exams: exam papers with or without computer support, or electronic examinations:

  • A. Traditionally, the instructor creates an exam paper with a word processing system, prints the exam sheets, corrects the answers by hand and transfers the results to a spreadsheet program that calculates the scores.
  • B. A better option, for which commercial software is already available, uses computers to scan the responses and automatically insert the results into a spreadsheet program.
  • C. Further automation is possible if the students write their exams directly on the computer, with the results being transmitted to a server and evaluated automatically afterwards.

The decision for the most economical alternative depends both on the technical equipment and on the selected process model, whereby the risk of technical failure must be considered. In this paper we examine the efficiency of implementing paper-based exams with computer support (B) and compare it with the other two alternatives A and C. In contrast to B, there are numerous publications about C (e.g. [6], [3]), including cost analyses, in which different hardware variants of electronic exam setups are compared (students' own laptops vs. university computers in a dedicated testing center or in CIP pools vs. complete outsourcing).

Some publications provide reference values for effort and costs in comparison with conventional tests, which we discuss in the following. In [7], the total costs for a single exam, consisting of investment costs, personnel costs and printing costs, are €1423 for option A, €1072 for option B and €1746 for option C, assuming 96 examinations per year over a period of 3 years. Concerning the time required, a comparison of the estimates from two studies is shown in Table 1 [Tab. 1] ([7], Tables 1 and 2, and [1], Table 27).

A comparison of the two estimates shows significant discrepancies, which are probably partly due to the fact that [7] considers fewer exam participants and fewer questions per exam. Overall, it is apparent that [1] calculates much higher time expenses throughout; its estimate of 200 hours for the manual evaluation of A (half a minute per question with 60 questions in 400 exams) is probably based on a mix of free-text and closed questions, while C uses only closed questions. Furthermore, it is striking that for exam preparation and execution [1] includes considerable effort for functionality tests of the computers as well as for technical supervisors and technical support, while these factors are neglected in the estimates of [7]. From both studies, a significant potential for option B can be derived if it is possible to combine the advantages of A (little technical effort in preparing and carrying out the exams) with the advantages of C (short correction time).

In the following, we analyze the time required for computer-supported pure multiple-choice paper exams with automatic correction of the scanned answer sheets. Other question types that require a number or text input can be co-managed, but they have to be corrected manually. For this purpose, a component for computer-supported paper exams was developed at the University of Wuerzburg, after experiences with the paid service spidMED of the IMPP [https://www.impp.de/spidMED/ (link checked 11.7.2011; service terminated 1.7.2011)] and with a commercial program for optical mark recognition of multiple-choice exams. Since this component is based on a university-wide framework for the development of case-based training systems (see [5], [4]), which is funded by tuition fees, the additional investment costs were relatively low. In Section 2, the process model and the critical aspects of computer-supported paper exams are described; Section 3 presents the technical effort for the various phases of the 12 and 13 exams in SS 2010 and WS 2010/11, respectively (without considering the content-related work); and in Section 4, option B is compared on a qualitative level with options A and C.


Methods and process model

Critical factors in the implementation of examinations are, in addition to the content-related work on which we do not focus here, the logistics of the examination procedure, the quality, speed and cost of the exam correction, the provision of documents for inspection, and the statistical analysis of the exam results. Below, we describe a general process model with several variations:

Creating and formatting of the exam

Questions in a written examination may come from one or more lecturers (e.g. for a lecture series); old questions on paper or from a database can be reused, or the questions can be created completely or partially anew. Often the questions are checked by different persons, so that there are several iterations. The questions may relate to images or case descriptions, and there are often several related questions ("key feature questions"). The answer alternatives can be of type A (single selection), type X (true/false) or PickN (multiple selection) (cf. http://www.let.ethz.ch/exam_eval/onlinetests/faq/nomenklatur_fragetypen.pdf). While in manual exam correction (A) the lecturers usually format the questions directly in a word processing program, indirect formats are common in B and C: either the questions are selected from a database or the lecturers enter them in a specific format from which the computer generates the exam. There are two variants for this entry: input via a form, or input into a word processing system following layout specifications, which is converted into the internal format by a parse operation. To discourage copying among exam participants, under option A two to four exam variants are often created by manually swapping questions and answer alternatives. Under options B and C the swapping is mostly automated, so that each participant gets a different exam version (a sketch of such per-participant shuffling follows below).
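
To illustrate the automated swapping under options B and C, the following Python sketch shuffles the question order and the answer order separately for each participant using a seeded random generator, so that every exam version can be reproduced exactly during correction. It is a minimal sketch under our own assumptions about the data structures; it is not the format or code used in the Wuerzburg system.

    import random
    from dataclasses import dataclass

    @dataclass
    class Question:
        text: str
        answers: list   # answer alternatives in original order
        correct: set    # indices of correct answers (type A: one, PickN: several)

    def randomize_exam(questions, matriculation_number, exam_id):
        """Return a participant-specific exam version together with the
        permutations needed to map marked answers back to the original numbering."""
        rng = random.Random(f"{exam_id}-{matriculation_number}")  # reproducible per participant
        question_order = list(range(len(questions)))
        rng.shuffle(question_order)
        version = []
        for qi in question_order:
            q = questions[qi]
            answer_order = list(range(len(q.answers)))
            rng.shuffle(answer_order)
            version.append({
                "original_question": qi,
                "text": q.text,
                "answers": [q.answers[ai] for ai in answer_order],
                "answer_map": answer_order,  # position on the sheet -> original answer index
            })
        return version

During correction, the stored answer_map allows the crosses detected on the scanned answer sheet to be translated back to the original answer indices before scoring.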

In our study no question database existed yet. The lecturers defined the questions and were largely relieved of the formatting: the exam texts were sent as a Word file to a coordinator who carried out the necessary formatting. Both single-choice and multiple-choice questions (type A and PickN) were used; the latter are marked in Table 2 [Tab. 2] by a "yes" in the column "multiple answers per question". While in WS 2009/10 a relatively complicated input format with many options was used, from SS 2010 onwards the input format was aligned with the most common templates of the lecturers and simplified. This simplified format (see Figure 1 [Fig. 1]) was communicated to the lecturers in order to relieve the coordinator; however, the lecturers were not obliged to follow it, as the coordinator remained responsible for the final editing (the parse step is illustrated in the sketch below). In our effort measurements in Section 3, we therefore start from an arbitrarily formatted exam text and measure, as a first step, the cost of the subsequent formatting by the coordinator.
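
The parse step from the lecturer's text to the internal question representation can be sketched as follows in Python. Since the actual WORD input format is shown in Figure 1 [Fig. 1] and cannot be reproduced here, the sketch assumes a hypothetical plain-text layout (numbered question line, lettered answer lines, correct answers marked with an asterisk) purely for illustration.

    import re

    # Hypothetical plain-text layout (the real WORD format is the one shown in Figure 1):
    #   1. Question text
    #   *a) correct answer
    #    b) distractor
    SAMPLE = """\
    1. Which vitamin deficiency causes scurvy?
    *a) Vitamin C
     b) Vitamin D
     c) Vitamin B12
    """

    QUESTION = re.compile(r"^\s*\d+\.\s*(.+)$")
    ANSWER = re.compile(r"^\s*(\*?)\s*[a-z]\)\s*(.+)$")

    def parse_exam(text):
        """Convert the lecturer's text into a list of question dictionaries."""
        questions = []
        for line in text.splitlines():
            if m := QUESTION.match(line):
                questions.append({"text": m.group(1), "answers": [], "correct": []})
            elif (m := ANSWER.match(line)) and questions:
                q = questions[-1]
                if m.group(1):                      # a leading "*" marks a correct answer
                    q["correct"].append(len(q["answers"]))
                q["answers"].append(m.group(2))
        return questions

    print(parse_exam(SAMPLE))

From such an internal representation, the formatted exam, the answer sheet and the correction key can all be generated.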

Exam preparation and exam management

This step comprises the cost of printing the exam as well as the effort for laying out the sheets in the auditorium and for exam supervision. Printing can be done on the department's own printers or in a copy shop; in the latter case a PDF file is sent and the printed exams are collected afterwards. The exams are usually laid out on the tables in the auditorium. For personalized exams an alphabetical seating plan must be created so that participants can find their exam; for non-personalized exams the students write their name and matriculation number on the answer sheet, which has to be transferred into the analysis file afterwards. Depending on the number of participants, the exam requires supervision by one or more persons.

At the University of Wuerzburg, the cost of printing a typical medical exam can be estimated as follows: with about 140 participants and about 35 questions, roughly 20 pages per copy, i.e. 140 * 20 = 2800 pages, are printed, which at 2 cents per page amounts to around 56 Euro per exam (still to be paid by the lecturers; colour copies are correspondingly more expensive). In Table 2 [Tab. 2] the column "personalization" indicates whether personalized exams were used, and the column "randomization" marks whether questions and answers were swapped automatically in order to impede copying. To simplify the correction, a separate answer sheet was created on which the answers to all questions are marked (see Figure 2 [Fig. 2], left as used in SS 2010 and right as used in WS 2010/11). After the initial scanning experiences in WS 09/10, much more attention was paid in SS 2010 to good print quality and to the use of pencil and eraser to prevent scribbling, which markedly improved the level of automation during correction (see section Exam evaluation). The step exam preparation and management summarizes efforts that in our model lie with the lecturer: printing and stapling with 0.5 to 1 hour (either on own printers with stapling, or in a copy shop including pick-up time), and the preparation and supervision during the actual exam with typically two people for about one hour of exam time. Since these expenses of approximately 3 hours occur for every written exam regardless of the coordinator, they are not listed separately in Table 2 [Tab. 2] but are included in the discussion.
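
The printing cost estimate above can be written down as a small Python helper; the figures of about 20 pages per copy and 2 cents per page are those stated in the text, everything else is a parameter.

    def printing_cost(participants=140, pages_per_copy=20, euro_per_page=0.02):
        """Rough printing cost of one exam in euros (black-and-white copies)."""
        return participants * pages_per_copy * euro_per_page

    print(printing_cost())  # 140 * 20 = 2800 pages -> 56.0 euros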

Exam evaluation

While in option A the lecturer corrects the exams and manually transfers the data into a spreadsheet program, and in option C the computer delivers the raw results instantly, the efficiency of option B depends on scanning speed and quality. Since it is not uncommon for questions to be subsequently removed from the valuation or for the rating scheme to be adapted, the simplicity of adapting the evaluation is important in all options A, B and C. In addition, in options B and C statistics on the questions, e.g. discriminative power ("Trennschärfe"), are generated automatically.
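
As an illustration of why the adaptability of the evaluation matters, the following Python sketch rescores an answer matrix after questions have been removed from the valuation and computes a simple discrimination index (corrected item-total correlation, one common way to operationalize "Trennschärfe"). The data layout and the choice of index are our own assumptions, not the statistics produced by the Wuerzburg software.

    import statistics

    def score_exam(answers, key, dropped=frozenset()):
        """answers: {participant: {question: marked answer}}; key: {question: correct answer}.
        Questions listed in 'dropped' are excluded from the valuation."""
        active = [q for q in key if q not in dropped]
        return {p: sum(1 for q in active if marks.get(q) == key[q])
                for p, marks in answers.items()}

    def discrimination(answers, key, question):
        """Correlation of the item score with the total score excluding the item."""
        item = [1 if marks.get(question) == key[question] else 0
                for marks in answers.values()]
        rest = list(score_exam(answers, key, dropped={question}).values())
        if len(set(item)) < 2 or len(set(rest)) < 2:
            return 0.0      # undefined if either column is constant
        return statistics.correlation(item, rest)  # Python 3.10+

Removing a question from the valuation is then just a matter of adding it to 'dropped' and re-running the scoring, which is what makes the adaptation of the assessment scheme cheap in options B and C.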

The main focus of this study is the detailed analysis of the time required for the evaluation of an exam with option B. Therefore this step is broken down into smaller steps:

  • Scanning, in the simplest case, consists of inserting the answer sheets into a scanner. In some exams the answer sheets were stapled to the information sheets, or answer sheets of different examinations were mixed, so that they first had to be separated and sorted; this effort was counted as well. While in WS 09/10 a high-performance scanner in the university library was used, which proved impractical because of transport times and the need for an appointment, in SS 10 an inexpensive scanner (about 1000 €) was purchased for exam evaluation, which, however, could only hold about 50 sheets at a time and did not deliver very good scan quality. The lower scan quality could, however, be compensated by better evaluation software (see next item).
  • The analysis includes the automatic recognition of crosses on the scanned sheets, with manual inspection and rework where necessary. The program for optical mark recognition was revised and replaced by an improved version in each of the three semesters. All versions offered a clear view for manual checking, in which confidently recognized crosses were marked green, probably recognized crosses were marked red, and yellow or pink markings were used if the number of detected crosses was greater or less than the number of expected marks (see the sketch after this list). The current version, in use since WS 2010/11, combines three different methods for optical mark recognition, which extends the runtime of the recognition program but clearly reduces the effort of manual rework. In all versions the recognition results were written to an Excel spreadsheet with the scores for each participant and each question, including various statistics such as discriminative power, as well as the documents for the exam inspection.
  • If some questions were unclear or had to be adjusted or removed from the valuation for other reasons, effort for the adaptation of the assessment scheme occurred. Although this effort depends on the exam content, it is reported in Table 2 [Tab. 2].
  • The overall communication effort is listed in Table 2 [Tab. 2] in the column "Other/Support". It naturally decreases over the semesters as the lecturers become familiar with the process model of exam implementation, but it is higher when complications arise.
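
The traffic-light check mentioned above can be sketched as follows in Python: each checkbox of a question receives a fill ratio from the scan, crosses are detected by thresholding, and the colour depends on the detection confidence and on whether the number of detected crosses matches the expected number. The thresholds and the exact colour assignment for too many vs. too few crosses are illustrative assumptions; the actual system combines three recognition methods.

    def classify_marks(fill_ratios, expected_marks, sure=0.6, maybe=0.3):
        """fill_ratios: darkness of each checkbox of one question (0..1, from the scan).
        Returns the detected answer indices and a traffic-light colour for manual review."""
        detected = [i for i, r in enumerate(fill_ratios) if r >= maybe]
        confident = all(fill_ratios[i] >= sure for i in detected)
        if len(detected) > expected_marks:
            colour = "yellow"        # more crosses than expected -> must be checked
        elif len(detected) < expected_marks:
            colour = "pink"          # fewer crosses than expected -> must be checked
        elif confident:
            colour = "green"         # confidently recognized
        else:
            colour = "red"           # probably correct, but should be verified
        return detected, colour

    # Example: a type A question, exactly one cross expected
    print(classify_marks([0.05, 0.82, 0.12, 0.08], expected_marks=1))  # ([1], 'green')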

For all exams in Table 2 [Tab. 2], with one exception in SS 2010, participants were given different exam sheets with the same questions but with the order of questions and answer alternatives interchanged ("randomization = yes" in Table 2 [Tab. 2]). Choosing this option requires trust in the technology, since manual correction of randomized exams would be very difficult. On the other hand, it is an important argument for the use of computer-supported exams, since it obviously impedes copying and simplifies the distribution of the exam in the exam room. An overview of the process model for exam preparation and processing is shown in Figure 3 [Fig. 3].


Results

In WS 09/10 eleven, in SS 2010 twelve and in WS 10/11 thirteen multiple-choice exams in medicine were created and evaluated with computer assistance. All examinations but one were randomized. While from SS 2010 onwards the coordinator recorded the actual expenditures, for WS 09/10 there were only rough estimates by the same coordinator for a typical exam without major complications. The results are shown in Table 2 [Tab. 2].

In SS 2010 and WS 10/11 all exams but four were personalized, i.e. each participant's name and matriculation number was printed on the exam (with reserve exams for unregistered latecomers). In SS 10 and WS 10/11 there were on average 143 and 137 participants per exam, respectively, with an average of 37 questions. Almost half of the exams allowed multiple answers per question, the others only one answer. We measured the time required by the coordinator, who supports the lecturers in exam preparation and evaluation. The average time is divided into five areas according to section 2:

  • Postprocessing of the exam template: while this took 2-3 hours in WS 09/10, the time decreased in SS 10 and WS 10/11 to only 49 minutes, for uncomplicated exams even to 32 and 23 minutes, respectively. A further drop is expected here, since it is only a matter of time until the lecturers get used to the format in which they send their questions to the coordinator: the more similar it is to the WORD input format shown in Figure 1 [Fig. 1], the less rework remains for the coordinator.
  • Scanning: the scanning effort depends mainly on the capacity of the sheet feeder and the scanning speed. With the currently used, relatively simple scanner, scanning an exam with about 140 answer sheets takes 20-25 minutes at best when there are no complications. The measured average over all exams was 28 minutes in WS 10/11 and 42 minutes in SS 2010; the higher value was mainly due to the fact that the scanner settings had to be adapted for each exam in order to achieve an optimal result. These steps are now either carried out by the analysis software or have become unnecessary because the answer sheet no longer contains gray values.
  • Evaluation: the most critical step is the evaluation of the optical mark recognition on the answer sheets, because it determines the practicality of the whole process. To ensure the quality of the recognition, a manual verification step with display of the detected crosses in traffic-light colors (see section 2) is part of the evaluation. In SS 10 and WS 10/11 the average expenditure was about 50 minutes per exam with approximately 140 participants and 37 questions each. Since different optical mark recognition programs were used, it is more informative to consider the evaluation effort of the examinations corrected with the new recognition program, i.e. all exams in WS 10/11 except the two pathology exams. Here the average evaluation time almost halved, to only 26 minutes per exam.
  • Adaptation of the assessment scheme: this effort depends on factors that cannot be influenced by the evaluation method and is addressed only indirectly, in that the software should make correcting the assessment scheme or removing individual questions from the valuation relatively simple. In SS 10 and WS 10/11 the average expenditure was approximately 20 minutes and in most cases zero; only in the Infectious Diseases exam in WS 2010/11 was it unusually high, at 180 minutes, due to special circumstances.
  • Miscellaneous/Support: general communication beyond the times indicated above amounted to 12 minutes per exam in WS 10/11 and 20 minutes in SS 10.

In sum, the coordinator's effort for the exam process, excluding printing and exam supervision, for an exam with about 140 participants and about 35 questions decreased from 5-7 hours for "good" exams without complications in WS 2009/10 to about 2 hours in SS 2010 and finally to 1.5 hours in WS 2010/11. For the most efficiently corrected exam, "general medicine" in WS 10/11 with 121 participants and 30 questions, the effort was just 65 minutes. Including complications, the average time rose to 160 and 179 minutes per exam in WS 10/11 and SS 2010, respectively; in WS 2009/10 this figure was much higher and was not recorded. The figures clearly show that for the overall average efficiency, the occurrence and handling of complications is almost as important as the basic model.


Discussion

Overall, it can be said that in SS 2010, and even more so in WS 10/11, the time effort for lecturers and coordinator was quite low. Although there is always room for improvement, the average values of the nine uncomplicated exams in WS 2010/11 should be fairly close to the optimum of about 1 to 1.5 hours per exam (excluding the content-related effort). To this, the 0.5 to 1 hour needed for printing the exam has to be added. The total is comparable to the minimum time required for exam supervision, which is approximately 2 hours and cannot be optimized. However, these numbers were not achieved immediately: during the introduction phase in WS 2009/10, the effort for exams without complications was 5 to 7 hours, and both the lecturers and the coordinator had significantly higher expenses, which were very high for exams with complications.

The number of exam participants and the number of questions per exam seem to have a relatively small influence on the overhead: although more questions increase the formatting effort, and more questions and more participants increase the scanning and evaluation effort, the additional effort is limited compared with the basic effort. However, the empirical data do not allow firm statements, because the exams were relatively homogeneous in the number of questions and participants, and the exams that deviated had complications and were therefore not comparable.

Based on these data, we take up the cost models for options A, B and C from [7] and [1] and compare them on a qualitative level with the expenses outlined above for option B. In the comparison of options A and B, the efforts for creating the exam and for the exam procedure remain about the same; the differences result from the exam evaluation. The correction time for option A, estimated at 13.5 hours per exam in [7] and even higher in [1], drops to 1 to 1.5 hours in our analysis of option B when there are no complications. To this must be added the costs for the scanner (about 1000 €) and for the development or acquisition and maintenance of the software (which were very low at the University of Wuerzburg, because the correction software was only one additional component of a large blended-learning project, see above).

In the comparison of options B and C, the correction times are roughly comparable, and in both options software has to be developed or acquired and maintained. While option B incurs time for creating and printing the exam documents (about 0.5 hours for formatting and 0.5 to 1 hour for printing), option C requires function tests of the computers, which are not reported in [7] but amount to 8 to 32 hours in [1]. The same applies to the exam procedure: with C, technically skilled personnel should be present in addition to the exam supervisors, which is not necessary with B. Further differences concern the printing costs (at least 56 € per exam) and the scanner investment in option B versus the infrastructure investment required for purely electronic exams in option C. The latter is difficult to quantify, since there are many variants, ranging from a fully equipped test center with its own computers to the use of students' laptops; in this context, the study [1] suggests that lower investment costs lead to (much) higher time effort. Finally, the risk and severity of complications, which carry much more weight in C than in B, also have to be considered.
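
To make the comparison more concrete, the following Python sketch assembles the per-exam time figures quoted above into a rough total for each option. It uses only the point estimates mentioned in the text (midpoints where a range is given); values the text does not quantify, such as the formatting effort for A and C, are carried over from B as an assumption, the correction time for C is set equal to B's because the text describes them as roughly comparable, and investment costs, technical support staff during the exam and the risk of complications are left out. It is an illustration, not a complete cost model.

    # Rough per-exam time comparison in hours, assembled from figures quoted in the text.
    options = {
        "A (manual)":        {"formatting": 0.5, "printing": 0.75, "supervision": 2.0,
                              "correction": 13.5},                        # correction from [7]
        "B (scanned paper)": {"formatting": 0.5, "printing": 0.75, "supervision": 2.0,
                              "correction": 1.25},                        # our measurements
        "C (electronic)":    {"formatting": 0.5, "printing": 0.0,  "supervision": 2.0,
                              "correction": 1.25, "function tests": 8.0}, # lower bound from [1]
    }

    for name, steps in options.items():
        print(f"{name:18} {sum(steps.values()):5.2f} h  {steps}")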

The aim of relieving the lecturers as inexpensively as possible of the correction of multiple-choice exams can therefore best be achieved with option B, i.e. paper-based exams with computer support. According to our investigations, electronic exams currently only pay off when the opportunities of new task types beyond conventional multiple choice are used, such as long-menu or other question types, the presentation of videos, work with virtual microscopes or the solving of interactive training cases.


Competing interests

The authors declare that they have no competing interests.


References

1.
Bücking J, Schwedes K, Laue H. Computergestützte Klausuren an der Universität Bremen, ZMML (Zentrum für Multimedia in der Lehre), Arbeitsbericht. Bremen: Universität Bremen; 2007. Zugänglich unter/available from: http://www.eassessment.uni-bremen.de/documents/eKlausurenBerichtZMML.pdf
2.
Fischer M, Kopp V. Computer-based pre-clinical assessment: Does the embedding of multiple-choice questions in a clinical context change performance? GMS Z Med Ausbild. 2006;23(3):Doc52. Zugänglich unter/available from: http://www.egms.de/static/de/journals/zma/2006-23/zma000271.shtml
3.
Frey P. Computerbasiert prüfen: Möglichkeiten und Grenzen. GMS Z Med Ausbild. 2006;23(3):Doc49. Zugänglich unter/available from: http://www.egms.de/static/de/journals/zma/2006-23/zma000268.shtml
4.
Hörnlein A, Ifland M, Klügl P, Puppe F. Konzeption und Evaluation eines fallbasierten Trainingssystems im universitätsweiten Einsatz (CaseTrain). GMS Med Inform Biom Epidemiol. 2009;5(1):Doc07. DOI: 10.3205/mibe000086
5.
Hörnlein A, Mandel A, Ifland M, Lüneburg E, Deckert J, Puppe F. Akzeptanz medizinischer Trainingsfälle als Ergänzung zu Vorlesungen. GMS Z Med Ausbild. 2011;28(3):Doc42. DOI: 10.3205/zma000754
6.
Kopp V, Herrmann S, Müller T, Vogel P, Liebhardt H, Fischer MR. Einsatz eines fallbasierten Computerprüfungsinstruments in der klinischen Lehre: Akzeptanz der Studierenden. GMS Z Med Ausbild. 2005;22(1):Doc11. Zugänglich unter/available from: http://www.egms.de/static/de/journals/zma/2005-22/zma000011.shtml
7.
Krückeberg J, Paulmann V, Fischer V, Haller H, Matthies H. Elektronische Testverfahren als Bestandteil von Qualitätsmanagement und Dynamisierungsprozessen in der medizinischen Ausbildung. GMS Med Inform Biom Epidemiol. 2008;4(2):Doc08. Zugänglich unter/available from: http://www.egms.de/static/de/journals/mibe/2008-4/mibe000067.shtml
8.
Möltner A, Duelli R, Resch F, Schultz JH, Jünger J. Fakultätsinterne Prüfungen an den deutschen medizinischen Fakultäten. GMS Z Med Ausbild. 2010;27(3):Doc44. DOI: 10.3205/zma000681
9.
Smolle J. Klinische MC-Fragen rasch und einfach erstellen – ein Praxisleitfaden für Lehrende. Berlin/New York: Walter de Gruyter; 2008.