Using analytic rating scales to assess English/Chinese bi-directional interpretation: A longitudinal Rasch analysis of scale utility and rater behavior




Analytic rating scale, scale-based assessment, Rasch measurement, scale utility, rater behavior, consecutive interpreting


Descriptor-based analytic rating scales are increasingly used to assess interpretation quality. However, little empirical evidence unequivocally supports the effectiveness of such scales or the reliability of the raters who use them. This longitudinal study therefore seeks to shed light on scale utility and rater behavior in English/Chinese interpretation performance assessment, using multifaceted Rasch measurement. Specifically, the study focuses on criterion/scale difficulty, scale effectiveness, rater severity/leniency, and rater self-consistency across English-to-Chinese and Chinese-to-English interpreting and over three time points. The results are discussed with emphasis on the utility of analytic rating scales and the variability of rater behavior in interpretation assessment. The results also have implications for developing reliable, valid, and practical instruments to assess interpretation quality.

Author Biography

Chao Han, Southwest University

College of International Studies






How to Cite

Han, C. (2018). Using analytic rating scales to assess English/Chinese bi-directional interpretation: A longitudinal Rasch analysis of scale utility and rater behavior. Linguistica Antverpiensia, New Series – Themes in Translation Studies, 16.