gms | German Medical Science

GMS Journal for Medical Education

Gesellschaft für Medizinische Ausbildung (GMA)

ISSN 2366-5017

Effects of a rater training on rating accuracy in a physical examination skills assessment

research article medicine

  • corresponding author Gunther Weitz - Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Medizinische Klinik I, Lübeck, Deutschland
  • author Christian Vinzentius - Institut für Qualitätsentwicklung an Schulen Schleswig-Holstein, Kronshagen, Deutschland
  • author Christoph Twesten - Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Medizinische Klinik I, Lübeck, Deutschland
  • author Hendrik Lehnert - Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Medizinische Klinik I, Lübeck, Deutschland
  • author Hendrik Bonnemeier - Universitätsklinikum Schleswig-Holstein, Campus Kiel, Medizinische Klinik III, Kiel, Deutschland
  • author Inke R. König - Universität zu Lübeck, Institut für Medizinische Biometrie und Statistik, Lübeck, Deutschland

GMS Z Med Ausbild 2014;31(4):Doc41

doi: 10.3205/zma000933, urn:nbn:de:0183-zma0009338

This is the English version of the article.
The German version can be found at: http://www.egms.de/de/journals/zma/2014-31/zma000933.shtml

Received: January 8, 2014
Revised: March 24, 2014
Accepted: August 20, 2014
Published: November 17, 2014

© 2014 Weitz et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc-nd/3.0/deed.en). You are free: to Share – to copy, distribute and transmit the work, provided the original author and source are credited.


Abstract

Background: The accuracy and reproducibility of medical skills assessment are generally low. Rater training has little or no effect. Our knowledge in this field, however, relies on studies involving video ratings of overall clinical performances. We hypothesised that a rater training focussing on the frame of reference could improve accuracy in grading the curricular assessment of a highly standardised physical head-to-toe examination.

Methods: Twenty-one raters assessed the performance of 242 third-year medical students. Eleven raters had been randomly assigned to undergo a brief frame-of-reference training a few days before the assessment. 218 encounters were successfully recorded on video and re-assessed independently by three additional observers. Accuracy was defined as the concordance between the raters' grade and the median of the observers' grade. After the assessment, both students and raters filled in a questionnaire about their views on the assessment.

Results: Rater training did not have a measurable influence on accuracy. However, trained raters rated significantly more stringently than untrained raters, and their overall stringency was closer to the stringency of the observers. The questionnaire indicated a higher awareness of the halo effect in the trained raters group. Although the self-assessment of the students mirrored the assessment of the raters in both groups, the students assessed by trained raters felt more discontent with their grade.

Conclusions: While training had some marginal effects, it failed to have an impact on the individual accuracy. These results in real-life encounters are consistent with previous studies on rater training using video assessments of clinical performances. The high degree of standardisation in this study was not suitable to harmonise the trained raters’ grading. The data support the notion that the process of appraising medical performance is highly individual. A frame-of-reference training as applied does not effectively adjust the physicians' judgement on medical students in real-life assessments.

Keywords: rater training, rating accuracy, skills assessment, physical examination skills, randomised controlled trial


Introduction

The physical examination is a core clinical competence for every physician. A major task in medical education is to impart profound physical examination skills. However, recent literature raises concerns over declining abilities of graduates to perform a thorough physical examination [1], [2]. Factors contributing to this development include a scarcity of good teaching patients, skilled faculty, and time for bedside teaching [3], [4]. Also, increasing specialisation has led to an over-reliance on technology and a loss of the big picture [4], [5]. Hence, teaching and accurately assessing basic examination skills may more and more become a challenge in medical education.

Over the last decades several strategies have been established to secure the quality of physical examination skills training. These include the introduction of standardised patients (SPs) and patient instructors [6], [7], the application of checklists and rating forms [8], the implementation of Objective Structured Clinical Examinations (OSCEs) [9], and systematic direct observations of patient encounters [10]. In practice, however, these tools do not always take effect as intended. For example, in a study from Taiwan, 22% of final-year students reported never having been observed performing a physical examination (36% never by faculty), and 10% did not yet feel confident with the procedure [11].

In our faculty, we evaluate a standardized head-to-toe examination of every third year student immediately after a tutorial over the first five weeks of the semester. However, we frequently receive complaints concerning the fairness of these assessments. Reliability and accuracy of faculty evaluation is indeed known to be low [12]. Structuring the evaluation by a rating form markedly increases the accuracy of the observations, but does not improve the agreement in the overall assessment [13]. This may be due to the fact that the raters' strategies to integrate information are rather individual and that the frame of reference differs between the raters [14], [15]. Studies from personnel psychology indicate that frame-of-reference training in groups can improve the accuracy in performance appraisal [16], [17]. The goal of such training is to teach raters to share a common conceptualisation of performance. It thereby imposes more accurate schemes [18]. We therefore planned to implement a rater training for the assessment of the physical examination skills.

Surprisingly, studies on rater training in medical education are scarce and the results are somewhat disappointing. In a small study, Newble and co-workers investigated the impact of training on the ratings of five videotaped physical examinations [19]. They gave either no training, performance feedback to the raters in one group, or feedback with additional training, including a discussion of another videotaped encounter, in a third group. There were no notable differences in the re-ratings of the videotapes after two months in either group. Holmboe and co-workers studied the effects of an intensive multi-dimensional rater training on the ratings of videotaped patient encounters eight months after training [20]. The trained faculty was more stringent and had a smaller range in some of the ratings. More recently, Cook and co-workers investigated the effects of similar but shorter training on interrater reliability and accuracy of mini-CEX ratings in a resident program [21]. The training did not improve these parameters.

To test whether a rater training would improve accuracy in our setting we undertook this study. Our setting differs from previous studies in several aspects:

Firstly, our rating focussed on a defined skill rather than assessing overall performance. Secondly, we standardised the physical examination task and all raters were familiar with the faculty standard. And thirdly, while previous studies relied on videotaped and scripted situations to determine the quality of rating, we had the chance to study real-life encounters between examinees and standardised patients (SPs). This implies that the grading was relevant and that the raters had to announce their decisions face to face to the examinees. For evaluation we videotaped all the exams and let three observers independently grade the students' performance retrospectively. We hypothesised that trained raters would rate more in line with the post-hoc observers and would hence be more accurate. We also sought to assess the effects of the rater training on stringency and the range of the grades.


Methods

Curricular embedment

The physical examination skills assessment was part of a course in physical examination to medical students at the beginning of their third year. The goal of this part of the course is to teach the students the basics of the physical examination in general internal medicine. After training the course continues with bedside teaching. The procedure is standardised to a head-to-toe screening physical examination and includes the inspection of head and mouth, the inspection and palpation of the neck, the complete examination of thorax and abdomen, an orientating examination of the vascular system (including measurement of one blood pressure), and the inspection of the limbs. A video explaining the standardised procedure is accessible to all students on the web. Other elements of the physical examination such as the pelvic, the musculoskeletal, and the neurological examination are taught in other parts of the course.

Training takes place in the first five weeks of the winter semester. It consists of five ninety-minute lectures and the same amount of training with peer examinations in groups of six students instructed by one experienced internist each. The assessment of the students’ skills is scheduled in the sixth week. The students’ task is to present the standardised examination with a standardized patient (SP) in a time limit of ten minutes. The raters are physicians selected from the medical departments of the University Hospital. They watch the students’ performance, give feedback, and rate the performance by assigning a grade (German school grading, see table 1 [Tab. 1]). They do not interfere with the students’ examination nor do they ask theoretical questions. Each rater assesses six students in a time frame of fifteen minutes per student on two days each. The SPs are healthy students. They are instructed to behave passively and only to comply with coherent commands.

For this study, twenty-one physicians were chosen to rate the performance of 242 students. All twenty-one raters were familiar with the learning objectives of the course, the freely accessible video of the standard procedure on the web, and the feedback code. Eleven of these twenty-one individuals were randomly chosen to undergo the rater training. For the randomisation, the raters were numbered and then assigned to the groups by numbers derived from a website creating random numbers in a given range. To determine the accuracy of the grading, all the examinations were videotaped for further evaluation. Both raters and students gave written informed consent before the study. The study was approved by the local ethics committee. The work was carried out in accordance with the Declaration of Helsinki, and the anonymity of all participants was guaranteed.

Intervention

The eleven raters chosen for the training were split into two groups (six and five persons per group, respectively) to achieve a smaller group size. The training was scheduled at the end of the week before the skills assessment (Thursday and Friday afternoon, respectively, skills assessment on Monday and Tuesday afternoon). Training was limited to ninety minutes. In a short introduction, the moderator (author GW) stated the goals and the standards of the assessment as well as the rating dimensions (see table 2 [Tab. 2]). The raters were then shown four videos of different fourth-year students performing the standardised examination with an SP at different levels of competence. The videos were presented in the same order in both training groups. After each presentation the raters were asked to assess the performance using a checklist with seven dimensions (see table 2 [Tab. 2]) and to write down their grades for each item (see table 1 [Tab. 1]). The raters then read out all their grades, and for each item the raters with the most divergent grades were asked to justify their judgement. The ensuing discussions of all participants were chaired by the moderator. After all dimensions had been discussed, the moderator gave feedback featuring the embedded faults of each video.

Examination skills assessment

Because the untrained raters were not familiar with checklist forms both the trained and the untrained raters were asked to assign an overall grade for the whole performance of the students (see table 1 [Tab. 1]). Hence, the scoring method of the training was abandoned for the actual assessment. After the assessment each of the tested students was asked to fill in a questionnaire about his or her views on the assessment and to grade his or her own performance. Additionally, the raters were asked to give information on their experience in assessing students, their views on the idea of rater training and on their own performance (see figure 1 [Fig. 1]), and (in case of the trained raters) their satisfaction with the training on a five-point scale. The videos of the examinations were collected from the examination rooms, cut, and the allocation to trained and untrained raters was made anonymous.

Video-based re-assessment

All the videotaped examinations were re-evaluated by three observers, one faculty member and two fifth-year students who as a group underwent the same training described above (moderated by author CV). All videotapes were evaluated by global rating at first. Subsequently the observers performed the dimension-evaluation that had been applied in the training concluding with a second overall rating. The observers rated the videos independently of each other and were unaware of the randomisation. The reference rating for the analysis was defined as the median of the three observers' ratings.
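
The reference rating and the resulting accuracy measure can be illustrated with a minimal sketch. The ordinal coding of the German school grades used here is an assumption for illustration only and may differ from the coding in table 1; the names GRADE_ORDER, reference_rating and rating_error are hypothetical.

```python
from statistics import median

# Assumed ordinal coding of German school grades (best to worst).
# The actual coding in table 1 of the article may differ.
GRADE_ORDER = ["1+", "1", "1-", "2+", "2", "2-", "3+", "3", "3-",
               "4+", "4", "4-", "5+", "5", "5-", "6"]
GRADE_STEP = {grade: step for step, grade in enumerate(GRADE_ORDER)}


def reference_rating(observer_grades):
    """Median of the observers' grades, expressed as an ordinal step."""
    return median(GRADE_STEP[g] for g in observer_grades)


def rating_error(rater_grade, observer_grades):
    """Absolute difference between the rater's grade and the reference."""
    return abs(GRADE_STEP[rater_grade] - reference_rating(observer_grades))


# Hypothetical encounter: the rater gave 2+, the three observers 2, 2- and 2.
error = rating_error("2+", ["2", "2-", "2"])
print(error)
```

A value of zero would mean that the rater's grade matched the observers' median exactly; larger values indicate lower accuracy for that encounter.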

Statistics

The grades are given as medians with 1st and 3rd quartiles. For reasons of graphical presentability, the mean ±standard error of the mean (SEM) and ±standard deviation (SD), respectively, are used in the figures. The range of the ratings is given as the mean standard deviation per rater ±SD. Kendall's coefficient of concordance was calculated for every pair of observers and for all three observers together. The primary outcome measure was the absolute value of the difference between the raters' and the observers' ratings. The determining factor was the training, and the unit of analysis was the student, taking into account that every rater evaluated several students. This model was analysed by generalised estimating equations with an exchangeable correlation structure. Parameter estimates β with standard errors are presented. Likewise, the effect of experience on accuracy and the effect of the training on grading were investigated using generalised estimating equations.
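
For readers unfamiliar with Kendall's coefficient of concordance, the following is a minimal sketch of how it can be computed with a correction for tied grades. It is not the SPSS or R procedure used in this study; the function names and the example grades (coded as ordinal steps) are hypothetical, and no P value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for ratings[j][i] = grade by observer j for encounter i."""
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[i] for row in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of tied grades.
    ties = 0
    for row in ratings:
        for value in set(row):
            t = row.count(value)
            ties += t ** 3 - t
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Hypothetical grades of three observers for five encounters.
observers = [
    [4, 5, 3, 4, 6],
    [4, 5, 3, 5, 6],
    [5, 6, 4, 4, 7],
]
print(round(kendalls_w(observers), 2))
```

W ranges from 0 (no agreement in the ranking of encounters) to 1 (identical rankings). Note that an observer who is uniformly more stringent than the others can still be perfectly concordant, because only the rank order enters the coefficient.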

For the self-assessment of students, the effect of training on the self-assessed grades as well as on the agreement with the raters’ grade was analyzed using Mann-Whitney U tests. Moreover, concordance between the raters' grading and the self-assessment was estimated by Kendall's coefficient. To control for multiple tests, we adhered to the following test hierarchy: Firstly we tested the concordance between the three observers using a significance level of 5%. Only if this was significant, we tested whether the training had an effect on the accuracy, again at a significance level of 5%. All the other tests are reported for descriptive purposes only. All analyses were performed using SPSS and R, version 2.15.0 [http://www.R-project.org].
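
The test hierarchy described above corresponds to a fixed-sequence procedure, which can be sketched as follows. The function name and the result labels are hypothetical; the two P values in the example are those reported in the Results section.

```python
def hierarchical_tests(p_values, alpha=0.05):
    """Fixed-sequence testing: hypotheses are tested in the given order,
    each at level alpha, and testing stops at the first non-significant
    result. Later hypotheses are then descriptive only."""
    results = []
    still_testing = True
    for name, p in p_values:
        if still_testing and p < alpha:
            results.append((name, "significant"))
        elif still_testing:
            results.append((name, "not significant"))
            still_testing = False
        else:
            results.append((name, "not tested (descriptive only)"))
    return results


outcome = hierarchical_tests([
    ("concordance between observers", 5.84e-19),
    ("effect of training on accuracy", 0.64),
])
for name, verdict in outcome:
    print(name, "->", verdict)
```

Because each hypothesis is only tested if all preceding ones were significant, every confirmatory test can be carried out at the full 5% level without further adjustment.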


Results

Global results

All twenty-one raters completed the study. The characteristics of trained and untrained raters are given in table 3 [Tab. 3]. The randomly chosen training group was older and there were more males, and more senior and experienced physicians in this group. Of the 247 students scheduled for the assessment, 242 (98%) completed the assessment, and 218 assessments (90%) were successfully taped on video. 208 students of the latter group (95%) completed the questionnaire. The median of the number of rated students per rater was 11 in each group (4-12 in the untrained and 5-12 in the trained group, respectively).

Observers’ ratings and their concordance

To assess the accuracy of the ratings, the median of the global ratings of the three observers was used as comparison. The difference between this median and the grade of the rater was used to estimate the (lack of) accuracy. To evaluate the adequacy of this, we estimated the coefficient of concordance between the observers, which was 0.70 (P=5.84×10⁻¹⁹). The concordance was higher between the two student observers (0.90, P=6.58×10⁻¹²) than between the faculty member and the students (0.70, P=1.26×10⁻⁴ and 0.73, P=1.01×10⁻⁵, respectively). Sixty-one percent and 75% of the two student observers' ratings, respectively, equalled the median of all three observers, compared with 30% of the faculty member's ratings. The median overall grading [1st;3rd quartile] of the observers was 2 [1-;2-] (German school grading, see table 1 [Tab. 1]). Comparing the grades among observers, the faculty member's median grade [1st;3rd quartile] was more stringent than the students' grades (2- [2+;3] versus 2 [1-;2-] for both students). The overall ratings of the observers after assessing the seven dimensions (see table 2 [Tab. 2]) were virtually the same as these ratings and did not enter further analysis.

Effect of training on grading and accuracy, effect of experience on accuracy

The median overall grading [1st;3rd quartile] of the trained raters was 2 [1-;2-] and of the untrained raters 2+ [1;2], respectively. The pairs of means (±SEM) of the raters’ and median observers’ gradings are given in figure 2 [Fig. 2]. In the generalised estimating equations model, the trained raters were more stringent than those without the training (β=-0.94 ±0.36, P=0.01). No effect of the training on rating accuracy was detectable (β=-0.09 ±0.20, P=0.64). The factor experience of the raters did not have any influence on the accuracy of the ratings (β=-0.12 ±0.17, P=0.48).

Self-assessment by the students

Similar to the grades of the raters, the students in the group with trained raters assessed themselves more stringently than the students in the group with untrained raters (2 [2+;2] and 2+ [1-;2], respectively; P=0.01 from the Mann-Whitney U test). The concordance between the raters' grading and the self-assessment of the students was high in both groups (Kendall's coefficient 0.83 and 0.80 in the group with trained and untrained raters, respectively, P=1.29×10⁻⁵ and P=1.25×10⁻⁴). However, students in the group with trained raters disagreed more strongly with their assessment, finding their grade more often inadequate (P=5.74×10⁻³ from the Mann-Whitney U test).

The range of grades applied by each rater did not differ between the groups. The mean standard deviations of the grades were 0.56 ±0.18 in the group of trained raters and 0.61 ±0.15 in the group of untrained raters. The corresponding standard deviations of the observers' medians were 0.67 ±0.26 and 0.66 ±0.19, and those of the students' self-assessment were 0.49 ±0.21 and 0.50 ±0.10.
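
The measure of range used here (mean standard deviation per rater ±SD) can be sketched as follows. The data are hypothetical numeric grades, the function name is illustrative, and the use of the sample standard deviation is an assumption, since the article does not state which estimator was applied.

```python
from statistics import mean, stdev


def rater_spread(grades_by_rater):
    """Return (mean, SD) of the per-rater standard deviations of grades."""
    per_rater_sd = [stdev(grades) for grades in grades_by_rater.values()]
    return mean(per_rater_sd), stdev(per_rater_sd)


# Hypothetical numeric grades given by three raters.
example = {
    "rater A": [1, 2, 3],
    "rater B": [2, 2, 2],
    "rater C": [1, 3, 5],
}
spread_mean, spread_sd = rater_spread(example)
print(f"{spread_mean:.2f} ±{spread_sd:.2f}")
```

A rater who uses only a narrow band of grades contributes a small standard deviation, so a lower group mean indicates that the raters in that group differentiated less between students.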

The raters' views on the idea of a rater training and on their own performance are given in figure 1 [Fig. 1]. Of the eleven trained raters, ten agreed that they felt more secure in their judgement after the training; one rater was neutral in this regard.


Discussion

The present study failed to show an effect of a rater training on the raters' accuracy. Trained raters were more stringent than their untrained counterparts but did not apply a wider range of grades. These results largely reflect the outcome of previous studies on rater training in a medical context. In the study by Newble and coworkers [19], raters were asked to fill in a rating form and rating quality was measured by the consistency of the raters in assessing five videotaped encounters. Similar to our study, this study focussed on physical examination skills. Despite the rather specific task, the overall consistency was only moderate to acceptable and did not change after the training. The most inconsistent ratings were given in the items "general approach to the patient" and "general observation", indicating that global rating categories (as applied in our study) were more difficult to agree on than more specific categories.

Holmboe and co-workers studied the effects of a four-day faculty development course on the rating of nine scripted videotaped clinical encounters using a mini-CEX rating form [20]. The trained faculty members felt significantly more comfortable with their evaluations of real-life encounters in a follow-up survey. After eight months the participants were re-assessed. The trained raters were found to rate significantly more stringently, in part with smaller ranges of ratings. The accuracy of ratings was represented by the capability to discriminate three different levels of competence displayed in the videos. This discrimination was good in the trained and untrained raters both before and after the training. Although the approach in this study was fundamentally different from ours, the higher stringency of the trained raters and the lack of evidence for an effect on accuracy very much resemble the results of our study.

The effect of a rater training on accuracy was studied more specifically by Cook and coworkers [21]. Eighteen of the thirty-two videos used in the pre- and post-test, respectively, were the same scripted videos used by Holmboe and co-workers. The time span between training and the re-assessment in this study was one month. Accuracy was estimated by discrimination of the mean ratings between the scripted levels of competence, by the frequency with which ratings matched scripted performance, and (because of disagreements with the scripted performance ratings) by a chance-corrected agreement using intraclass correlation coefficients. The rater training had no effect at all on any of these accuracy measures. Notably, the interrater reliability for the ratings in the subcategory "physical examination" was comparably small. This might indicate that it was particularly difficult to achieve an agreement on the performance ratings in physical examination.

Our study differed from the previous studies on rater training in one decisive point. While the other studies used prepared video scenes to assess rating scores, we investigated real-life student-SP encounters and re-assessed them by video recordings. Re-assessing examinations by videotapes may have an impact on the ratings, and this has been studied previously. In a study dealing with an OSCE assessing joint examination skills, the investigators found a moderate interrater-reliability between live and video raters [22]. The authors point out that the range was similar to previously published interrater-reliability scores of live raters [23]. A second study with pharmacy students specifically studied the intra-rater reliability after one month [24]. The reliability was high; however, due to a higher stringency in the video rating, more candidates would have failed in the post-hoc assessment. A higher stringency in video ratings had already been observed in the first study on joint examinations and tended to be present in our study as well. This effect is most likely due to the fact that an on-scene rater has to announce his judgement face-to-face to the student, while a video observer does not have to take responsibility for his ratings. Announcing decisions face-to-face indeed influences the ratings towards greater leniency [25]. Since this affected both groups equally in our study, we do not consider it crucial for the interpretation of our data.

To overcome the problem of low interrater-reliability in rating medical encounters [26], we re-assessed the videotaped encounters by three observers each. Two of the observers were senior students; the third observer was a faculty member. Trained students have been shown to be as reliable in rating the practical skills of their junior peers as faculty staff [27], [28]. These studies also show that faculty staff rate more stringently than do student assessors. This was obvious in our study. Hence, by choosing the median of the three observers as the measure for accuracy, the student observers' ratings dominated the re-ratings. This might be a concern in the interpretation of the data. Moreover, the randomisation process in our study skewed the allocation of the raters to the groups: the raters in the training group were older, more likely to be male and senior, and more frequently experienced in testing students. These factors have been shown to have no [29], [30] or marginal [31] influence on ratings. Accordingly, in our study we were also unable to find an influence of the factor "rating experience" on rating accuracy.

Other concerns might be the size of the study and the type of intervention. To reduce the effect of intra-observer variability we tried to achieve a sample size of at least ten examinees per rater. Due to the size of the students' cohort, the number of raters was therefore limited to a little over twenty. This was also the number of physicians we were able to recruit from the medical departments for the time of the exams. The time limit of the training was related to the time spent for the ratings (ninety minutes on either day). A greater number of raters or more training would not have been feasible in our setting. We also believe that the effort of a more intensive intervention with the chance of a measurable effect on accuracy would not match the benefit.

However, some other aspects of the study seem noteworthy. Firstly, the time between the training and the exams was relatively short, implying that the effect of the training was still present at the time of the ratings. Secondly, the task to be presented by the students was very clear and uniform. Hence, case specificity and contextual factors as sources of rater errors [14], [32] could largely be eliminated from the experiment. And thirdly, one can also argue that the training indeed had some kind of effect on accuracy. The stringency of overall grading of the trained raters was significantly closer to the observers' gradings and (despite the lack of individual accuracy) can be viewed as more accurate for the group. Consequently, the trained raters were rather less lenient than more stringent. The effect had already been observed in the study by Holmboe and coworkers [20] and suggests that the training in a way helped to standardise the raters' frame of reference by assigning a more appropriate range of ratings. However, the idiosyncrasy of processing the observations and converting the judgements to an ordinal scale [33] within this range obviously remained unaffected.

The untrained raters also denied the possibility of a halo effect in their ratings more consistently than the trained raters. This might well be a training effect and implies that training may be able to raise the awareness of a cognitive bias. Moreover, the students assessed by the trained raters rather felt incorrectly judged, stating more often that their grading was inadequate. This can easily be explained by the more stringent grades in this group. The observation that despite this difference there was a similarly high concordance between the self-assessment of the students and the raters' gradings in both groups could be due to the fact that the students filled in the questionnaire straight after the announcement of the grade. Hence, although the students in the trained raters' group were more likely discontent with the grading, their self-assessment was strongly influenced by the raters' judgement.

In conclusion, our study focussed on the curricular assessment of a very specific task, a brief and highly standardised physical examination. Rater training failed to have an impact on the raters' individual accuracy. However, the stringency of ratings was more in line with the observers' assessment when the raters were trained. Moreover, the trained raters were rather aware of a halo effect and their ratees were more likely discontent with their grade. The data suggest that rater training did have some kind of effect but that the idiosyncrasy of judgement in assessing complex medical skills is too strong to be influenced by a single training. The effort of implementing rater training in order to improve fairness of exams may therefore not be effective. Ratings of medical performance, however, should be interpreted with discretion.


Acknowledgements

The authors are deeply indebted to Prof. Jana Jünger and Dr. Andreas Möltner from the Center of Excellence for Assessment in Medicine Baden-Württemberg (Heidelberg, Germany) for sharing their expertise during the planning phase and in the statistical evaluation, and for providing video equipment. We would also like to thank Sebastian Sosnowki and Christopher Beck for their excellent assistance in videotaping the exams and evaluating the recordings. We also sincerely acknowledge the commitment of Jennifer Miles Davis in proofreading the manuscript.


Competing interests

The authors declare that they have no competing interests.


References

1.
Horwitz RI, Kassirer JP, Holmboe ES, Humphrey HJ, Verghese A, Croft C, Kwok M, Loscalzo J. Internal medicine residency redesign: proposal of the Internal Medicine Working Group. Am J Med. 2011;124(9):806-812. DOI: 10.1016/j.amjmed.2011.03.007 External link
2.
Clark D, III, Ahmed MI, Dell'italia LJ, Fan P, McGiffin DC. An argument for reviving the disappearing skill of cardiac auscultation. Cleve Clin J Med. 2012;79(8):536-537, 544. DOI: 10.3949/ccjm.79a.12001 External link
3.
Smith MA, Burton WB, Mackay M. Development, impact, and measurement of enhanced physical diagnosis skills. Adv Health Sci Educ Theory Pract. 2009;14(4):547-556. DOI: 10.1007/s10459-008-9137-z External link
4.
Ramani S, Ring BN, Lowe R, Hunter D. A pilot study assessing knowledge of clinical signs and physical examination skills in incoming medicine residents. J Grad Med Educ. 2010;2(2):232-235. DOI: 10.4300/JGME-D-09-00107.1 External link
5.
Alexander EK. Perspective: moving students beyond an organ-based approach when teaching medical interviewing and physical examination skills. Acad Med. 2008;83(10):906-909. DOI: 10.1097/ACM.0b013e318184f2e5 External link
6.
Ainsworth MA, Rogers LP, Markus JF, Dorsey NK, Blackwell TA, Petrusa ER. Standardized patient encounters. A method for teaching and evaluation. JAMA. 1991;266(10):1390-1396. DOI: 10.1001/jama.1991.03470100082037 External link
7.
Barley GE, Fisher J, Dwinnell B, White K. Teaching foundational physical examination skills: study results comparing lay teaching associates and physician instructors. Acad Med. 2006;81(10 Suppl):S95-S97. DOI: 10.1097/00001888-200610001-00024 External link
8.
Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138(6):476-481. DOI: 10.7326/0003-4819-138-6-200303180-00012 External link
9.
Newble D. Techniques for measuring clinical competence: objective structured clinical examinations. Med Educ. 2004;38(2):199-203. DOI: 10.1111/j.1365-2923.2004.01755.x External link
10.
Pelgrim EA, Kramer AW, Mokkink HG, van den EL, Grol RP, van der Vleuten CP. In-training assessment using direct observation of single-patient encounters: a literature review. Adv Health Sci Educ Theory Pract. 2011;16(1):131-142. DOI: 10.1007/s10459-010-9235-6 External link
11.
Chen W, Liao SC, Tsai CH, Huang CC, Lin CC, Tsai CH. Clinical skills in final-year medical students: the relationship between self-reported confidence and direct observation by faculty or residents. Ann Acad Med Singapore. 2008;37(1):3-8.
12.
Holmboe ES, Hawkins RE. Methods for evaluating the clinical competence of residents in internal medicine: a review. Ann Intern Med. 1998;129(1):42-48. DOI: 10.7326/0003-4819-129-1-199807010-00011
13.
Noel GL, Herbers JE Jr, Caplow MP, Cooper GS, Pangaro LN, Harvey J. How well do internal medicine faculty members evaluate the clinical skills of residents? Ann Intern Med. 1992;117(9):757-765. DOI: 10.7326/0003-4819-117-9-757
14.
Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. Opening the black box of clinical skills assessment via observation: a conceptual model. Med Educ. 2011;45(10):1048-1060. DOI: 10.1111/j.1365-2923.2011.04025.x
15.
Yeates P, O'Neill P, Mann K, Eva K. Seeing the same thing differently: mechanisms that contribute to assessor differences in directly-observed performance assessments. Adv Health Sci Educ Theory Pract. 2013;18(3):325-341. DOI: 10.1007/s10459-012-9372-1
16.
Woehr DJ. Rater training for performance appraisal: a quantitative review. J Occup Organ Psychol. 1994;67:189-205. DOI: 10.1111/j.2044-8325.1994.tb00562.x
17.
Lievens F. Assessor training strategies and their effects on accuracy, interrater reliability, and discriminant validity. J Appl Psychol. 2001;86(2):255-264. DOI: 10.1037/0021-9010.86.2.255
18.
Gorman CA, Rentsch JR. Evaluating frame-of-reference rater training effectiveness using performance schema accuracy. J Appl Psychol. 2009;94(5):1336-1344. DOI: 10.1037/a0016476
19.
Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ. 1980;14(5):345-349. DOI: 10.1111/j.1365-2923.1980.tb02379.x
20.
Holmboe ES, Hawkins RE, Huot SJ. Effects of training in direct observation of medical residents' clinical competence: a randomized trial. Ann Intern Med. 2004;140(11):874-881. DOI: 10.7326/0003-4819-140-11-200406010-00008
21.
Cook DA, Dupras DM, Beckman TJ, Thomas KG, Pankratz VS. Effect of rater training on reliability and accuracy of mini-CEX scores: a randomized, controlled trial. J Gen Intern Med. 2009;24(1):74-79. DOI: 10.1007/s11606-008-0842-3
22.
Vivekananda-Schmidt P, Lewis M, Coady D, Morley C, Kay L, Walker D, Hassell AB. Exploring the use of videotaped objective structured clinical examination in the assessment of joint examination skills of medical students. Arthritis Rheum. 2007;57(5):869-876. DOI: 10.1002/art.22763
23.
Newble DI, Hoare J, Elmslie RG. The validity and reliability of a new examination of the clinical competence of medical students. Med Educ. 1981;15(1):46-52. DOI: 10.1111/j.1365-2923.1981.tb02315.x
24.
Sturpe DA, Huynh D, Haines ST. Scoring objective structured clinical examinations using video monitors or video recordings. Am J Pharm Educ. 2010;74(3):44. DOI: 10.5688/aj740344
25.
Klimoski R, Inks L. Accountability forces in performance appraisal. Organ Behav Hum Decis Proc. 1990;45:194-208. DOI: 10.1016/0749-5978(90)90011-W
26.
Martin JA, Reznick RK, Rothman A, Tamblyn RM, Regehr G. Who should rate candidates in an objective structured clinical examination? Acad Med. 1996;71(2):170-175. DOI: 10.1097/00001888-199602000-00025
27.
Ogden GR, Green M, Ker JS. The use of interprofessional peer examiners in an objective structured clinical examination: can dental students act as examiners? Br Dent J. 2000;189(3):160-164.
28.
Chenot JF, Simmenroth-Nayda A, Koch A, Fischer T, Scherer M, Emmert B, Stanske B, Kochen MM, Himmel W. Can student tutors act as examiners in an objective structured clinical examination? Med Educ. 2007;41(11):1032-1038. DOI: 10.1111/j.1365-2923.2007.02895.x
29.
Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affecting the reliability of ratings of students' clinical skills in a medicine clerkship. J Gen Intern Med. 1992;7(5):506-510. DOI: 10.1007/BF02599454
30.
Kogan JR, Hess BJ, Conforti LN, Holmboe ES. What drives faculty ratings of residents' clinical skills? The impact of faculty's own clinical skills. Acad Med. 2010;85(10 Suppl):S25-S28. DOI: 10.1097/ACM.0b013e3181ed1aa3
31.
McManus IC, Thompson M, Mollon J. Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling. BMC Med Educ. 2006;6:42. DOI: 10.1186/1472-6920-6-42
32.
Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15(4):270-292. DOI: 10.1207/S15328015TLM1504_11
33.
Gingerich A, Regehr G, Eva KW. Rater-based assessments as social judgments: rethinking the etiology of rater errors. Acad Med. 2011;86(10 Suppl):S1-S7. DOI: 10.1097/ACM.0b013e31822a6cf8