Evaluator Differences in PCL-R Scores Suggest Need for Training

Differences in PCL-R scores between evaluators indicate that some of the variability in scoring is attributable to the evaluators themselves, although these differences were smaller among evaluators who had received formal training on the PCL-R. This is the bottom line of a recently published article in Law and Human Behavior. Below is a summary of the research and findings as well as a translation of this research into practice.

Featured Article | Law and Human Behavior | 2014, Vol. 38, No. 4, 337-345

Evaluator Differences in Psychopathy Checklist-Revised Factor and Facet Scores


Marcus T. Boccaccini, Sam Houston State University
Daniel C. Murrie, University of Virginia
Katrina A. Rufino, Baylor College of Medicine/The Menninger Clinic
Brett O. Gardner, Sam Houston State University


Recent research suggests that the reliability of some measures used in forensic assessments—such as Hare’s (2003) Psychopathy Checklist-Revised (PCL-R)—tends to be weaker when applied in the field, as compared with formal research studies. Specifically, some of the score variability in the field is attributable to evaluators themselves, rather than the offenders they evaluate. We studied evaluator differences in PCL-R scoring among 558 offenders (14 evaluators) and found evidence of large evaluator differences in scoring for each PCL-R factor and facet, even after controlling for offenders’ self-reported antisocial traits. There was less evidence of evaluator differences when we limited analyses to the 11 evaluators who reported having completed a PCL-R training workshop. Findings provide indirect but positive support for the benefits of PCL-R training, but also suggest that evaluator differences may be evident to some extent in many field settings, even among trained evaluators.


Keywords: evaluator differences, field reliability, rater agreement, Psychopathy Checklist-Revised, workshop training

Summary of the Research

The authors analyzed PCL-R scores for 558 offenders who had been evaluated for Sexually Violent Predator (SVP) civil commitment in Texas between 1999 and 2011 by 14 different evaluators, in order to determine how much of the variability in scoring was attributable to evaluators. Scores on the Antisocial (ANT) scale/subscales of the Personality Assessment Inventory (PAI) served as a self-report indication of these offenders' antisocial traits, so that evaluator differences in PCL-R scoring could be separated from genuine differences among the offenders.

“There were 558 offenders scored on the PCL-R by one of the 14 SVP evaluators. The mean age among offenders was 43.14 years (SD = 10.99). Offenders were identified as white (50.0%), black (32.4%), Hispanic (16.4%), or from another ethnic background (1.1%)” (p. 339).

“The mean PCL-R Total score across offenders and evaluators was 20.91 (SD = 8.15), which falls at about the 40th percentile among North American male offenders and somewhat below the mean of 24.2 for sexual offenders reported in the PCL-R manual. Factor 1 (M = 8.89, SD = 4.26), Facet 1 (M = 3.46, SD = 2.50), and Facet 2 (M = 4.85, SD = 2.20) scores were at or slightly above the mean compared to North American offenders, falling between the 50th and 60th percentiles. Factor 2 (M = 9.42, SD = 4.41), Facet 3 (M = 4.46, SD = 2.47), and Facet 4 (M= 5.02, SD = 2.59) were somewhat below the mean among North American offenders, falling between the 33rd to 45th percentiles” (p. 340).

Mean PCL-R factor and facet scores varied widely from one evaluator to the next: “the mean PCL-R Total score assigned by an evaluator ranged from 13.79 (18th percentile) to 30.71 (88th percentile)…[and] there was also a large amount of variability in Factor 1 and Factor 2 scores, with mean Factor 1 scores ranging from the 24th (M = 4.80) to 90th (M = 13.42) percentiles and mean Factor 2 scores ranging from the 17th (M = 5.83) to 76th (M = 14.85) percentiles…[The data] suggest, however, that a small subset of outlying evaluators may be responsible for much of the variability” (p. 340).

The mean total, factor, and facet scores suggest that a substantial amount of the variability in PCL-R scores is attributable to evaluators, but they do not provide a clear, quantifiable estimate of it. The authors therefore used multilevel linear modeling (MLM) to estimate the proportion of variance in PCL-R scores attributable to differences among evaluators and to test whether that proportion was statistically significant.

Results of the MLM analyses indicated, “about 32% of the variance in PCL-R Total scores was attributable to differences among evaluators…[in addition, there was] a similar and large amount of variance attributable to evaluators for both Factor 1 (23%, p = .001) and Factor 2 (25%, p = .001) scores…the proportion of variance attributable to evaluators was smallest for Facet 2 (13%, p = .001) and largest for Facet 3 (24%, p = .001) and Facet 4 (19%, p = .001)” (p. 341).
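For readers who want a feel for what "variance attributable to evaluators" means numerically, the following is a minimal sketch rather than a reproduction of the authors' analysis. The article used multilevel linear modeling; the sketch swaps in a simpler one-way random-effects ANOVA estimator of the same kind of proportion (an intraclass correlation). All data are synthetic, and every name and parameter value (`simulate_scores`, 40 cases per evaluator, the two standard deviations) is a hypothetical choice made for illustration. The standard deviations are set so that the true evaluator share in the simulation is roughly one third; the synthetic scores are not constrained to the instrument's actual scoring range.

```python
import random
from statistics import mean


def evaluator_variance_share(scores_by_evaluator):
    """Estimate the share of score variance attributable to evaluators.

    Uses the one-way random-effects ANOVA estimator of the intraclass
    correlation, with the usual adjustment for unequal caseloads.
    """
    groups = [g for g in scores_by_evaluator.values() if g]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two evaluators and repeated cases")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Between-evaluator and within-evaluator sums of squares.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective caseload per evaluator when caseloads are unequal.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)

    # Negative estimates are truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / n0)
    total = var_between + ms_within
    return var_between / total if total > 0 else 0.0


def simulate_scores(n_evaluators=14, cases_per_evaluator=40,
                    evaluator_sd=4.6, offender_sd=6.7, seed=0):
    """Generate synthetic scores (hypothetical values, not the study data)."""
    rng = random.Random(seed)
    data = {}
    for e in range(n_evaluators):
        # Each evaluator has a personal tendency to score high or low.
        tendency = rng.gauss(0.0, evaluator_sd)
        data[f"evaluator_{e}"] = [
            20.0 + tendency + rng.gauss(0.0, offender_sd)
            for _ in range(cases_per_evaluator)
        ]
    return data


share = evaluator_variance_share(simulate_scores())
print(f"Estimated share of variance due to evaluators: {share:.2f}")
```

With only 14 evaluators, the estimate can land well above or below the true simulated share depending on the seed, which is one reason to treat any single estimate of this kind with caution.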

PAI scores for the ANT scale/subscales were used to consider whether some evaluators had been assigned to score more antisocial and potentially psychopathic offenders than others. Results indicated that the offenders assigned to each evaluator tended to have similar levels of antisocial and psychopathic traits and that “differences in antisocial traits (i.e., ANT scores) cannot explain evaluator differences in PCL-R scoring” (p. 342).

“To examine the potential impact of having completed PCL-R training on evaluator differences, [the authors] reran the MLM models for the PCL-R using only scores from the 11 evaluators who reported having completed PCL-R training. Findings from this third set of PCL-R models yielded effects that were generally consistent with PCL-R field reliability research. Although there was still evidence of statistically significant evaluator differences for each PCL-R factor and facet score, the amount of variance attributable to evaluators was 20% or lower for each PCL-R score. In terms of absolute value, the amount of variance attributable to evaluators was larger for Factor 1 (17%) than Factor 2 (13%), and lowest for Facet 4 (9%)” (p. 342).

Translating Research into Practice

The authors found “evidence of evaluator differences in PCL-R scoring for each factor and facet score…and no evidence that these evaluator differences in PCL-R scoring were attributable to some evaluators being assigned to score a subset of more antisocial or psychopathic offenders than others. There was substantial variability in PCL-R scoring attributable to evaluators even after controlling for differences in PAI ANT scores” (p. 342).

“Overall, [these] findings add to the small but growing body of research suggesting that the reliability of PCL-R scores may be weaker in the field than in controlled research studies, particularly among sexual offenders…[and] they also reinforce concerns about potentially inaccurate scores influencing decisions about offenders” (p. 342).

“In practice, it may be that unreliability in PCL-R scoring is difficult to identify in any specific case and only becomes evident over time. In [this] study, evaluator differences were much smaller after [the authors] removed three outlying evaluators from [the] analyses” (p. 343).

“[The] findings of less variability in scoring among trained evaluators parallel those from a study of the predictive validity of PCL-R scores among Texas sexual offenders in which PCL-R scores assigned by a subset of prolific evaluators were stronger predictors of violent recidivism than scores assigned by less prolific evaluators” (p. 343).

“Together, findings of smaller evaluator differences and stronger predictive validity among subsets of trained and prolific evaluators provide indirect, but potentially promising support for the value of PCL-R training and evaluation experience…[these] findings suggest that there is a clear need for more research on the potential benefits of PCL-R training and experience” (p. 343).

Evaluators who use the PCL-R in practice should obtain the necessary training on the administration, scoring, and interpretation of this instrument and should remain cautious about deviating from the scoring criteria set out in the PCL-R manual.

Other Interesting Tidbits for Researchers and Clinicians

“Future evaluator differences research should consider the offender, evaluator, and evaluation characteristics that may explain some evaluator difference effects. It could be that evaluator differences are more evident among sexual offenders than nonsexual offenders, or among pedophilic sexual offenders than other sexual offenders, because these offenders may evoke stronger emotional reactions from some evaluators. It is also possible that the evaluator differences we have identified are limited to the relatively small pool of state-contracted evaluators conducting SVP evaluations in Texas. Or, that these Texas differences are limited to only a small subset of evaluators within this pool. Indeed, we were able to reduce evaluator difference effects by removing only three atypical evaluators. It is also possible that evaluator differences may only be evident among less experienced evaluators, or those with less formal PCL-R training. Our relatively small sample of evaluators and the limited information available about their training necessarily limits the conclusions we can draw about the potential benefits of completing a PCL-R training workshop. Because many evaluators cite workshop training as evidence of their competence in PCL-R administration and scoring, the field would benefit from studies designed to directly assess the pre- and post-workshop reliability of evaluators completing workshop training” (pp. 343-344).
