Speech-to-speech assessment vs text-based assessment of simultaneous interpreting: Tapping into the potential of large language models and the metrics of machine translation quality estimation in enabling automatic assessment
Keywords:
simultaneous interpreting, SI, speech-to-speech assessment, S2S, text-based assessment, automatic assessment, large language model, LLM, machine translation quality estimation, MTQE

Abstract
There is growing interest in using machine translation quality estimation (MTQE) metrics and models, together with large language models (LLMs), to automatically assess information fidelity in interpreting. However, studies to date have been limited to text-based assessment. This study compared speech-to-speech (S2S) assessment with text-based assessment. The experiment began by segmenting audio recordings of simultaneous interpreting (SI) into one-minute intervals and isolating the source speech and the target interpretation in each segment. We used LLMs, BLASER, and speech embeddings (the last hidden states from HuBERT and Wav2Vec) to assess interpreting quality at the speech level. In addition, we explored automatic speech recognition (ASR) for transcribing the segments, coupled with human verification, and applied an LLM alongside MTQE models such as COMET and TransQuest for minute-level text-based assessment. The findings indicate the following: (1) LLMs cannot assess interpreting quality directly at the speech level, but they demonstrate certain capabilities in text-based assessment of transcriptions, showing a moderately high correlation with human ratings (Pearson r = 0.66); (2) in contrast, BLASER operates directly at the speech level and shows a comparable correlation with human judgements (r = 0.63), confirming its potential for speech-based quality assessment; (3) a combined metric integrating the S2S and text-based assessments, as proposed in this study, accounts for approximately 47% of the variance in human judgement scores, highlighting the potential of integrated metrics to advance machine-learning models for assessing interpreting quality. Such metrics offer an automated, cost-effective, and labour-saving way to evaluate SI performed by human interpreters, and they enable direct quality estimation in end-to-end speech-to-text (S2T) and S2S machine interpreting (MI) systems for continuous quality monitoring during training and deployment.
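For readers who wish to experiment with the speech-level track, the sketch below shows how BLASER 2.0 can score a source/interpretation segment pair directly from audio using Meta's open-source sonar package. It is a minimal illustration, not the authors' exact pipeline: the file paths, the English–Mandarin language pair, and the encoder names are assumptions made for demonstration.

```python
# Minimal sketch: speech-level QE with BLASER 2.0 via the sonar package.
# Assumed environment: pip install sonar-space fairseq2
from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
from sonar.models.blaser.loader import load_blaser_model

# Hypothetical one-minute segments: source speech and its interpretation.
SRC_WAV, TGT_WAV = "source_min01.wav", "interp_min01.wav"

# Language-specific SONAR speech encoders (the Mandarin encoder name below
# is an assumption; check the sonar model cards for the exact identifier).
src_embedder = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")
tgt_embedder = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_cmn")

src_emb = src_embedder.predict([SRC_WAV])  # one SONAR embedding per file
tgt_emb = tgt_embedder.predict([TGT_WAV])

# Reference-free ("QE") BLASER head: compares source and interpretation
# directly, with no gold translation required.
blaser_qe = load_blaser_model("blaser_2_0_qe").eval()
score = blaser_qe(src=src_emb, mt=tgt_emb).item()
print(f"BLASER 2.0 QE score for this minute: {score:.2f}")
```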
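The HuBERT/Wav2Vec embedding approach can likewise be sketched with Hugging Face transformers. The checkpoint, paths, and mean-pooling choice below are illustrative assumptions rather than the study's exact setup, and since this HuBERT checkpoint is monolingual English, the resulting cross-lingual similarity is only a rough proxy, in line with the exploratory framing above.

```python
# Minimal sketch: segment embeddings from HuBERT's last hidden state.
# Assumed environment: pip install transformers torchaudio
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

SRC_WAV, TGT_WAV = "source_min01.wav", "interp_min01.wav"  # hypothetical paths

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def embed(path: str) -> torch.Tensor:
    """Mean-pool the last hidden state into one 768-d vector per segment."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # 16 kHz mono
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)

# Cosine similarity between the source and interpretation embeddings.
sim = torch.cosine_similarity(embed(SRC_WAV), embed(TGT_WAV), dim=0)
print(f"speech-embedding similarity for this minute: {sim.item():.3f}")
```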
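The text-based track (ASR transcription followed by reference-free MTQE) can be sketched as follows, using openai-whisper for ASR and the COMETKiwi QE checkpoint from the unbabel-comet package. The model choices and paths are illustrative assumptions; in the study, ASR output was additionally verified by humans before scoring.

```python
# Minimal sketch of the text-based track: ASR + reference-free COMET QE.
# Assumed environment: pip install openai-whisper unbabel-comet
import whisper
from comet import download_model, load_from_checkpoint

SRC_WAV, TGT_WAV = "source_min01.wav", "interp_min01.wav"  # hypothetical paths

# 1) Transcribe both one-minute segments (the study adds human verification here).
asr = whisper.load_model("small")
src_text = asr.transcribe(SRC_WAV)["text"]
tgt_text = asr.transcribe(TGT_WAV)["text"]

# 2) Score the pair with a reference-free QE model (no gold translation needed).
ckpt = download_model("Unbabel/wmt22-cometkiwi-da")  # gated; requires HF access
comet = load_from_checkpoint(ckpt)
out = comet.predict([{"src": src_text, "mt": tgt_text}], batch_size=1, gpus=0)
print(f"COMETKiwi QE score for this minute: {out.scores[0]:.3f}")
```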
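Finally, the combined metric's reported explanatory power (approximately 47% of the variance in human scores) corresponds to the R² of a model that regresses human ratings on the speech-level and text-level scores. The sketch below illustrates this with synthetic placeholder data, since the study's actual scores are not reproduced here.

```python
# Minimal sketch: combining S2S and text-based scores and reading off R^2.
# The arrays below are synthetic placeholders, NOT the study's data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 120                                    # e.g., 120 one-minute segments
blaser = rng.normal(3.5, 0.6, n)           # speech-level (S2S) scores
comet = rng.normal(0.75, 0.08, n)          # text-level QE scores
human = 0.8 * blaser + 8.0 * comet + rng.normal(0, 0.5, n)  # toy ratings

X = np.column_stack([blaser, comet])
reg = LinearRegression().fit(X, human)
print(f"R^2 = {reg.score(X, human):.2f}")  # the paper reports ~0.47 on real data
```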
License
Copyright (c) 2025 Xiaoman Wang, Binhua Wang

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the CC BY-NC 4.0 Deed, which allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal. The material cannot be used for commercial purposes.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
