Respeaking certification: Bringing together training, research and practice

Research and training in respeaking are still lagging behind professional practice. One of the consequences of this lack of training opportunities is the UK government’s refusal, in 2016, to use the Disabled Students’ Allowances (DSA) to provide for respoken subtitles, arguing that respeaking was not a qualified profession. In order to tackle this issue, the Galician Observatory for Media Accessibility set up LiRICS, the Live Respeaking International Certification Standard, which aims to set and maintain high international standards in the respeaking profession. In 2019, after assessing the online certification process proposed by LiRICS, the Department of Education in the UK concluded that it meets their requirements and that LiRICS-certified respeakers are eligible for Disabled Students’ Allowances funding. This article outlines, first, the current provision of respeaking training around the world and the assessments of live subtitling quality carried out to date, both of which inform the LiRICS online certification process presented here. The focus is then placed on the actual certification process, including a description of the tests, the platform used and the quality assurance process. This is followed by an analysis of the respeakers’ performance, which has been shown to be in line with current professional standards.
Romero-Fresco, P., Melchor-Couto, S., Dawson, H., Moores, Z. & Pedregosa, I. (2019). Respeaking certification: Bringing together training, research and practice. Linguistica Antverpiensia, New Series: Themes in Translation Studies, 18, 216–236.


Introduction
Widely held as one of the most challenging modalities in media accessibility, intralingual live subtitling (or "real-time captioning", as it is known in the United States) is defined by the International Telecommunication Union (ITU) (2015) as "the real-time transcription of spoken words, sound effects, relevant musical cues, and other relevant audio information" (p. 2) to enable deaf or hard-of-hearing persons to follow a live audiovisual programme. Live subtitles may be produced through different methods, including standard keyboards, dual keyboards, Velotype and the two most common approaches, namely, stenography and respeaking (Lambourne, 2006). Stenography uses a system of machine shorthand in which letters or groups of letters phonetically represent syllables, words, phrases and punctuation marks. This is the preferred method to produce live subtitles in the United States and Canada, and it is also used in other countries such as Italy and Spain, especially for transcriptions and subtitles in courtrooms, classrooms, meetings and other settings. Respeaking (known as "real-time voice writing" in the United States) is currently the preferred method for live subtitling around the world, especially for live subtitles on TV (Romero-Fresco & Eugeni, in press). It refers to the production of written text (such as subtitles) by means of speech recognition and may be defined as "a technique in which a respeaker listens to the original sound of a (live) programme or event and respeaks it, including punctuation marks and some specific features for the deaf and hard-of-hearing audience, to a speech-recognition software, which turns the recognised utterances into subtitles displayed on the screen with the shortest possible delay" (Romero-Fresco, 2011, p. 1). Despite the popularity and widespread use of respeaking in the media accessibility industry and the impact it has on millions of viewers, research in this area is still hard to come by.
The translation and interpreting database BITRA shows that only 4% of the academic publications on accessibility and 0.8% of published outputs on AVT, respectively, deal with live subtitling and respeaking. This is in stark contrast with research on audio description, for instance. Although it is much less widespread in the industry than live subtitling (Rossignol-Farjon & Cimino, 2016) and it is received by a smaller number of users, it features twice as many publications in BITRA. One of the reasons for this lack of scholarly activity in respeaking (and live subtitling in general) may lie in the very few respeaking courses available at universities (no more than a handful around the world) and in the very few scholars working in this area (Romero-Fresco, 2018). In addition, respeaking is often taught as a component within larger modules on subtitling for the deaf and hard of hearing (Romero-Fresco, 2012b). The respeaking part is typically small and often taught at the end of the academic year, when students have already chosen the research topic for their dissertations.
The lack of academic research and training means that scholars have had very little influence on the training and working conditions of professional respeakers and many subtitling companies have no choice but to set up their own in-house respeaking programmes (Robert, Schrijver, & Diels, 2020). As a further consequence, the UK government, which was using the Disabled Students' Allowances (DSA) to fund the provision of live subtitles to provide access to school and university lectures for deaf students, decided in 2016 not to use these funds for respeaking. The argument provided for this decision was that there was little recognized training and, most importantly, no professional certification of respeakers (J. Ward, AiMedia, personal communication, September 18, 2019).
In order to tackle this issue, the Galician Observatory for Media Accessibility (GALMA), a research centre at the Universidade de Vigo concerned with the analysis of media accessibility quality in Galicia and at an international level, set up LiRICS, the Live Respeaking International Certification Standard. As part of its international activity, GALMA provides consultancy services to companies (Netflix, Sub-ti, AiMedia, etc.), broadcasters (Sky, TVE, VRT, etc.) and government regulators in Australia, Canada and the United Kingdom. The aim of LiRICS is to help set and maintain high international standards in the respeaking profession in order to create a pool of respeakers who can provide high-quality access through subtitles for live TV programmes and live events. In September 2019, the Canadian Radio-television and Telecommunications Commission approved LiRICS (jointly with the Canadian company Keeble Media Inc.) as the official certification body to assess live subtitling quality on TV (CRTC, 2019). In the same month, the Department of Education in the United Kingdom assessed the online certification process proposed by LiRICS and concluded that it meets their requirements and that LiRICS-certified respeakers are now eligible for Disabled Students' Allowances funding:
"We fully support this venture and can confirm that respeakers who pass the LiRICS certification have the required skill set to provide live remote captioning to students with hearing impairment. The tests, the platform and the quality assurance process used and undertaken have been devised thoroughly by an international team of expert researchers and professionals that have pioneered the development of live respeaking. This certification is the product of years of action research involving testing, user's feedback and continuously evolving modelling. The Department of Education's DSA funding is thus available for LiRICS-certified respeakers." (P. Higgs, Student Finance Directorate, Department for Education, United Kingdom, personal communication, September 20, 2019)
Before presenting LiRICS, this article outlines the current provision of respeaking training around the world and the assessments of live subtitling quality carried out so far, both of which inform the certification process presented here. The following section deals with the actual certification process, including a description of the tests and the platform used and the quality assurance process. This section is followed by an analysis of the respeakers' views and performance.

Respeaking training around the world
Live subtitling was first introduced in Europe in the early 1980s, when the British channel ITV began to subtitle headlines of public events using a standard keyboard (Lambourne, 2006). Since the subtitles produced with this method were not fast enough to keep up with the speech rates of many live programmes, other approaches were tested, such as the Velotype (a syllabic keyboard developed in the Netherlands), tandem methods involving from two to five subtitlers who would share the workload in a given programme and, in the 1990s, stenography (Lambourne, 2006; Orero, 2006; Romero-Fresco, 2011). Steno-made subtitles are fast and reliable, but also expensive, as the training required to become a competent live subtitler using this method takes between three and four years (Romero-Fresco, 2018). Respeaking was first tested as an alternative method for producing live subtitles in Europe in 2001, both by VRT (Vlaamse Radio- en Televisieomroeporganisatie), the national public service broadcaster in Flanders, Belgium, and by the BBC in the United Kingdom. Other European countries such as Spain, France and Italy followed suit in 2004, 2007 and 2008, respectively, helping to consolidate respeaking as the prevailing live subtitling method in Europe (Romero-Fresco, 2011).
The provision of academic training and scholarly research in this area did not start until 2006 (Eugeni & Mack, 2006). In 2007, the University of Antwerp created the first postgraduate course for respeakers as part of its MA in Interpreting and, following the first pedagogical proposals on respeaking (Arumí Ribas & Romero-Fresco, 2008; Russello, 2010), the Universitat Autònoma de Barcelona and the University of Roehampton, London, launched their own courses in 2008 as part of their respective MAs in Audiovisual Translation. By 2011, respeaking was taught in the following higher-education institutions (HEIs): University of Antwerp (in Dutch), Leeds University (in English), Universitat Autònoma de Barcelona (in Spanish), Universidade de Vigo (in Spanish), University of Bologna-Forlì (in Italian) and University of Roehampton (in English, Spanish, French, German and Italian) (Romero-Fresco, 2012b). Since then, courses on AVT have mushroomed all over the continent and respeaking has become not only the preferred method to produce live subtitles (Romero-Fresco & Eugeni, in press), but also a professional area that is in constant need of new professionals (Robert et al., 2020). However, training in respeaking has been incorporated in only a handful of other HEIs (SDI Munich, University of Warsaw and the European University of Valencia) and it has been reduced at the Universitat Autònoma de Barcelona.
Several reasons may account for this, but perhaps the two main ones are the high cost of the resources (software and equipment) required for respeaking and especially the limited number of respeaking trainers available. In fact, some of the abovementioned higher-education (HE) respeaking courses are taught by the same trainers, who are also often asked to create other bespoke vocational courses. Some examples of the latter are the courses delivered by Romero-Fresco and Melchor-Couto at Macquarie University (Australia) on prerecorded respeaking, the University of Helsinki and the national Finnish broadcaster YLE on live respeaking for TV, the University of Vigo on interlingual English-Spanish respeaking or the Galician TV station TVG for respeaking and post-edition in Galician.
The limited offering of respeaking training in HE is also at the core of two recent EU-funded Erasmus+ projects: LTA (Live Text Access) and ILSA (Interlingual Live Subtitling for Access), which aim to create open source and flexible training materials for intralingual and interlingual respeakers, respectively. The first findings produced by the ILSA project (Robert et al., 2020) are the results of the largest questionnaire on respeaking training and practice disseminated so far among professional respeakers. This questionnaire has proved useful not only in providing a clear picture of the current landscape in respeaking training but also in informing the certification of respeakers described and analysed in this article. The questionnaire was filled in by 126 participants from 27 countries, including European countries but also Australia, Brazil, Canada, China, India, Iran, Korea, Malaysia and South Africa. Only a minority of respondents (13%) were trained at university, normally as part of face-to-face postgraduate courses on Audiovisual Translation or Interpreting. Respeaking courses range from one hour per week for eight weeks to two hours per week for 28 weeks. They are eminently practical, although they are often introduced by one or two theoretical units. They also include sessions on familiarization with the speech-recognition software, the creation of a voice profile, dictation practice and respeaking practice covering different audiovisual genres (from slow speeches to sports broadcasts and more challenging programmes such as news reports, interviews and chat shows). In some cases, trainees are also taught how to correct their mistakes live.
The majority of the respondents (87%) were trained (partially or fully) in-house, which differs from HE training in several respects. First, most in-house trainees are asked to take an aptitude test (normally as part of the selection process prior to employment), which may consist of a language test and a respeaking or a dictation test. Depending on the company, the training is organized either as on-the-job training led by colleagues without a real course or structure or as longer courses (from one intensive week to three months) focused on subtitling, speech recognition and respeaking practice. This is similar to the training offered by HEIs, although arguably less structured and more focused on the particular subtitling software used for the production of respoken subtitles. Most of the participants were assessed through continuous assessment using the NER model (Romero-Fresco & Martínez, 2015), which calculates the accuracy rate of the respoken text based on the number of words (N), edition errors (E) and recognition errors (R).
This brief overview of the respeaking training landscape points to a few issues that are relevant for the purposes of this article. First, there is a considerable contrast between the increasing scope and impact of respeaking in the AVT industry and the limited training offering at HEIs, which has barely increased over the past decade. Training is mostly delivered in-house and varies considerably across companies, which explains the existence of the two abovementioned EU-funded projects aimed at producing streamlined respeaking training material to be used in industry and HEIs, the UK government's decision not to use Disabled Students' Allowances unless professional respeakers are recognized/certified and, consequently, the creation of the LiRICS certification. Secondly, the analysis of the different components included in the HE and in-house training courses as described by the respondents of the ILSA questionnaire reveals some of the key competences that professional respeakers acquire through training and that can be tested in the certification process, such as preparation, dictation, edition, respeaking different genres of television and applying corrections. Finally, the questionnaire also shows that professional respeakers are often familiar with the NER model, which makes it a useful tool to use in the assessment of respeakers' performance for LiRICS.
Assessing the quality of live subtitles is still one of the most often debated topics in this area, as it has proved to be of interest not only to researchers but also to companies and users. Different assessment methods have been proposed for this purpose. Some are based on subtitling theory (Eugeni, 2012), while others have their origin in the professional market (Dumouchel, Boulianne, & Brousseau, 2011) or in scientific efforts to automate quality assessment (Apone, Brooks, & O'Connell, 2010). In Canada, for instance, the Canadian Radio-television and Telecommunications Commission (CRTC) launched a two-year project in 2012 to analyse the quality of the live captions provided in 265 programmes using the so-called Verbatim Test (English Broadcasters Group [EBG], 2014). Only 19% of the programmes analysed reached the 95% threshold established by the Verbatim Test, which considers accuracy as the extent to which the captions match the audio of a programme verbatim.
Broadcasters criticized this method for its inability to assess accuracy and for pushing captioners to provide verbatim captions that may be too fast for the viewers to read instead of correctly edited captions that retain the meaning of the audio. The NER model (Romero-Fresco & Martínez, 2015) was developed in 2012 in order to allow for the possibility of correct (and incorrect) editing and for the occurrence of different types of error in live subtitling. Over the past few years, it has been widely used by universities, broadcasters, access service providers and regulators in countries such as Switzerland, Italy, the United Kingdom, Spain, France, South Africa and Australia. The NER model accounts for editing errors (caused by the respeakers' decisions when they need to edit the original audio content if it is not possible to respeak it verbatim) and recognition errors (caused by the interaction between the respeakers and the speech-recognition software). These errors can in turn be minor (the viewer may notice the error but the main meaning is retained), standard (the main meaning is lost) or serious (the incorrect meaning appears on the screen). In 2013, the UK governmental regulator Ofcom adopted the NER model in order to set up the largest study conducted so far on the quality of live subtitling, which analysed the accuracy, delay, speed and edition rate of 78,000 subtitles from news programmes, entertainment programmes and chat shows broadcast by all terrestrial TV channels in the country (Romero-Fresco, 2016).
The results of this project have proved very useful in informing the design of the LiRICS certification process presented here and in providing a benchmark against which to compare the performance of the candidates.
Ofcom's reports on the quality of subtitles show an overall accuracy rate of 98.4%, which is above the 98% threshold set by the NER model and may be considered as acceptable quality. Generally speaking, subtitles with a less than 98% accuracy rate are regarded as substandard, subtitles with a 98-98.49% rate are regarded as acceptable, subtitles with a 98.5-98.99% rate as good, subtitles with a 99-99.49% rate as very good and subtitles with a 99.5-100% rate as excellent. Of the 300 programmes analysed across the two-year period, 23% of the programmes did not reach the required accuracy threshold, whereas 77% did. Of the latter, 21% had acceptable subtitles, 29% had good subtitles, 22% had very good subtitles and 5% had excellent subtitles. Interestingly, 60% of the programmes that did not reach the required accuracy threshold are chat shows (as compared to 26% entertainment shows and 14% news shows), which highlights the extent to which genre can have an impact on subtitle quality. News programmes obtained the highest average accuracy rate, at 98.75%, as they normally feature only one speaker at a time and they tend to combine pre-recorded (and thus 100% accurate) and live subtitles. Entertainment programmes, with the added difficulty of featuring several speakers but also with a combination of live and pre-recorded subtitles, followed with an average accuracy rate of 98.54%. Finally, chat shows had an average accuracy rate just below the threshold (97.9%), which can be explained by the high speech rates, the presence of multiple speakers and the absence of a script or pre-recorded subtitles.
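The quality bands described above can be expressed as a simple lookup. The following Python sketch is illustrative only: the function name `quality_band` is hypothetical, and the boundaries assume that accuracy rates are reported to two decimal places, as they are in the text.

```python
def quality_band(accuracy_rate: float) -> str:
    """Map a NER accuracy rate (in percent) to the bands listed in the text."""
    if not 0.0 <= accuracy_rate <= 100.0:
        raise ValueError("accuracy rate must be between 0 and 100")
    if accuracy_rate < 98.0:
        return "substandard"
    if accuracy_rate < 98.5:
        return "acceptable"
    if accuracy_rate < 99.0:
        return "good"
    if accuracy_rate < 99.5:
        return "very good"
    return "excellent"


# Average accuracy rates per genre as reported for the Ofcom study.
genre_averages = {"news": 98.75, "entertainment": 98.54, "chat shows": 97.9}
bands = {genre: quality_band(rate) for genre, rate in genre_averages.items()}
for genre, band in bands.items():
    print(f"{genre}: {band}")
```

Applied to the genre averages quoted above, news and entertainment programmes fall in the good band, whereas chat shows fall just below the threshold.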
In the samples analysed during the two-year project, 69% of the errors observed in the subtitles were editing errors, that is, those caused by incorrect omissions or additions made by the subtitlers, errors of speaker identification, etc. The remaining 31% were recognition errors, that is, those caused by the interaction between the subtitler and the steno machine or the speech-recognition software. Once again, these percentages vary depending on the genres. In chat shows, which feature very high speech rates that are hard for subtitlers to follow, 75% of the errors were caused by incorrect editions (typically omissions) and 25% by misrecognitions. Entertainment programmes, where speech rates are slower, contain 69% edition errors and 31% recognition errors. Finally, news programmes feature 61% edition errors and 39% recognition errors. This relative increase in recognition errors in the news as compared to entertainment programmes and chat shows may be due both to the effort made by subtitlers to type/respeak fast in order to keep up with the audio without editing too much and to the very content of the news, which is likely to include more specialized terms and unexpected proper nouns than chat shows and entertainment programmes. This is one of the reasons why having access to the script before the programme (when one is available) can help to improve accessibility for the viewers.
As far as the seriousness of the errors is concerned, in general 56% of the errors found in the Ofcom sample were minor (i.e., they do not prevent the viewers from following the content of the programme), 39% were standard (i.e., they trigger confusion or cause full factual omissions) and 5% were serious (i.e., they introduce misleading information). These figures vary depending on whether they relate to edition errors (56% of which were minor, 42% were standard and 2% were serious) or recognition errors (62% of which were minor, 31% were standard and 7% were serious). In other words, editing errors tend to be more problematic than recognition errors, since it is more common to have standard edition errors (omissions of full sentences) than standard recognition errors (nonsensical misrecognitions). The different genres also play an important role here. As a result of the effort made by the subtitlers to improve the quality of the subtitles for the news, one-third of the errors in these programmes are standard and two-thirds are minor. In contrast, in chat shows the fast speech rates and the overlapping interventions of the speakers force the subtitlers to rush and to omit more information. As a result, 56% of the edition errors were minor and as many as 43% were standard. In other words, almost one in two errors found in chat shows involves the omission of a full sentence; in the worst cases, this may cause the viewers to lose the thread of the programme.
The large-scale research conducted so far on the quality of live subtitles is essential to informing the creation of a certification for professional respeakers, especially when it comes to grading the audiovisual material according to levels of difficulty (e.g., by genre), setting accuracy thresholds for the different levels and analysing the performance of the candidates compared to current professional standards (accuracy, types and severity of errors, etc.).

LiRICS certification
LiRICS is an online certification process, where testing is carried out remotely. This section describes the different steps involved in the creation of the certification process, including the design of the test, the different aspects required to conduct it online and the development of a quality assurance process.

Designing the test
The LiRICS test assesses the candidates' ability to respeak across two broad contexts: TV, on the one hand, and education and live events, on the other. The genres selected for assessment in the TV subject area are news, sports and entertainment/chat shows, that is, the genres covered in the Ofcom project with the addition of sports, which has been assessed in the abovementioned two-year trial set up in Canada by the CRTC. For education and live events, the genres chosen are a class or a lecture, a conference presentation and an interview, which are the most common genres respoken by AiMedia, the biggest provider of respeaking for live events in Europe and Australia and the company whose subtitlers effectively participated in the pilot certification test.
Within these genres, materials were chosen at three different levels of difficulty in order to assess a candidate's ability. Level 1 is the lowest level of assessment and the entry level for all candidates; once a candidate has passed this assessment, they are able to move on to levels 2 and 3. The pilot study was based on this initial Level 1 certification.
A number of parameters differentiate the difficulty between levels, content being one of them. Level 1 test materials are of a more general nature and not as specialized as those of levels 2 and 3. Other aspects are also taken into account, such as sound quality, the number and delivery of the speakers (in terms of pronunciation, accent, speech rate and whether they speak spontaneously or read a written text) or whether there are any visual aids that support the verbal information.

Assessment-setting proformas
Assessment-setting proformas were produced as tools for the assessing team to check that the selected video clips were set at the correct level. For each video clip a proforma was filled out, which included the following information: genre, provenance, title, publication date and synopsis of the video with the time frame that had been selected to be respoken. The image below shows an example of the first section of an assessment-setting proforma. In addition to the aforementioned features, the proforma identified a list of challenges which the respeaker could encounter during the test. Types of challenge included change of speaker, proper nouns, specialized terminology and the need to restructure sentences. For Level 1, each video clip was required to contain a minimum of five challenges. For each challenge, the time code, text and type of challenge were noted on the proforma. The image below shows an example of the second section of an assessment-setting proforma.
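The information recorded on a proforma can be pictured as a small data structure. The Python sketch below is an illustration only: the class and field names are hypothetical, the example values are invented placeholders, and only the Level 1 minimum of five challenges is taken from the text.

```python
from dataclasses import dataclass, field
from typing import List

MIN_CHALLENGES_LEVEL_1 = 5  # minimum stated in the text for Level 1 clips


@dataclass
class Challenge:
    time_code: str  # position in the clip, e.g. "00:03:12"
    text: str       # the words that pose the challenge
    kind: str       # e.g. "change of speaker", "proper noun"


@dataclass
class Proforma:
    genre: str
    provenance: str
    title: str
    publication_date: str
    synopsis: str
    time_frame: str  # portion of the video selected to be respoken
    level: int
    challenges: List[Challenge] = field(default_factory=list)

    def meets_minimum(self) -> bool:
        """Check the Level 1 rule; other levels are not specified in the text."""
        if self.level != 1:
            raise ValueError("only the Level 1 minimum is given in the text")
        return len(self.challenges) >= MIN_CHALLENGES_LEVEL_1


# Invented placeholder data, not taken from a real proforma.
kinds = ["change of speaker", "proper noun", "specialized terminology",
         "sentence restructuring", "proper noun"]
example = Proforma(
    genre="conference presentation", provenance="placeholder source",
    title="Placeholder title", publication_date="2019-01-01",
    synopsis="Placeholder synopsis", time_frame="00:00:00-00:15:00", level=1,
    challenges=[Challenge(f"00:0{i}:00", "placeholder text", k)
                for i, k in enumerate(kinds, start=1)],
)
print(example.meets_minimum())
```

Keeping the challenges as structured entries, rather than free text, would make it straightforward to check that each selected clip satisfies the minimum before it is approved for a given level.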

Video clips
The video clips chosen for the LiRICS Level 1 certification test were these: a classroom setting focusing on the science of sound for primary-school children with a duration of 12 minutes 9 seconds and spoken at 166 words per minute (wpm); a conference presentation discussing the science of simplicity with a duration of 15 minutes 13 seconds and spoken at 163 wpm, and a Q&A interview on the topic of feminism with a duration of 14 minutes 54 seconds and spoken at 175 wpm.
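Since accuracy in the NER model is calculated relative to a number of words, it can be helpful to estimate the approximate length of each clip in words from the durations and speech rates given above. The Python sketch below does this under the assumption that the quoted words-per-minute figures are averages over the whole clip; the function name is hypothetical and the results are estimates, not figures reported in the study.

```python
def approx_word_count(minutes: int, seconds: int, wpm: int) -> int:
    """Estimate the words spoken in a clip from its duration and average wpm."""
    total_seconds = minutes * 60 + seconds
    # Integer arithmetic keeps the estimate deterministic (rounded down).
    return total_seconds * wpm // 60


# Durations and speech rates of the three Level 1 clips, as given in the text.
clips = {
    "classroom": (12, 9, 166),
    "conference presentation": (15, 13, 163),
    "Q&A interview": (14, 54, 175),
}
word_counts = {name: approx_word_count(*spec) for name, spec in clips.items()}
for name, count in word_counts.items():
    print(f"{name}: about {count} words")
```

On these assumptions, each clip contains roughly 2,000 to 2,600 words.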

Testing online
Setting up online testing for the LiRICS pilot posed several challenges and involved careful consideration of different aspects such as identity verification and invigilation.

Key considerations in creating the platform
When creating the testing platform, it was therefore essential to incorporate identity verification and to prevent opportunities for cheating during the test within the design of the platform and the procedures surrounding its use (LearningLight, n.d.). The platform also had to be one that enabled the respeakers to replicate their workplace routines to allow them to perform as well as possible. For assessment purposes, the examiners needed to be provided with a transcript of the respoken text and a recording of the respeakers' voices. After much investigation, it was decided that five different pieces of software would be used to achieve this testing set-up: the respeakers would dictate with their regular respeaking software, Dragon NaturallySpeaking; and the Screencast-o-matic screen recorder, Google Drive, Classmarker and Vimeo would be used to create the other elements of the platform. Careful thought was given to the precise process of running the tests so the experience for the candidates would be as smooth and stress-free as possible.

Identity verification and invigilation
Candidates were asked to provide a screen recording of the duration of the test. They were asked to record both audio and video with Screencast-o-matic and to set their screen up so that the certifiers would be able to see the testing platform, the video being respoken and the window recording them at work. This prevented the candidates from pausing or restarting the test clip, which would have invalidated the test. The audio included the voice of each respeaker over the original video, which served to prevent cheating. On completion, the candidates shared the video with the assessors via Google Drive. The official identity of the candidates had already been confirmed by the candidates' company. As testing expands, identity documents will need to be checked before the test begins.

Test procedures
Classmarker was selected as the testing platform, as it allows remote testing which can be scheduled and timed. This set-up formalized the testing process. Each candidate received a specific time slot for the test, during which an invigilator would be online and available to respond to any queries and deal with any technical issues that might arise. Each test expired ten minutes after the due completion time, ensuring that testing conditions were the same for all candidates.
The clips used for the test were open source, so, during the pilot study, a number of measures were taken to prevent the candidates from accessing and identifying the clips before they respoke them. The candidates accessed the assessment clips through a link to Vimeo, which was cued to appear question-by-question after the preparation slot ended. This meant that respeakers could use the preparation time to research the general theme of the clip but not listen to the content that they were about to respeak. Furthermore, the candidates were requested not to share information about the clips with other candidates yet to take the test and the tests were scheduled across a short period of time to support this. When the certification is rolled out on a larger scale, a bank of video clips from different sources will be used, which will remove the possibility of candidates' identifying and preparing for the test clip in advance.
In order to minimize technical problems on the day, the candidates were provided with very thorough instructions a week before their scheduled test date about how to set up each piece of software correctly and to use them in combination. In addition, the first question in the test comprised a trial run which allowed them to select the desktop set-up that was most comfortable for them and ensured that the different pieces of software required worked simultaneously before the testing proper began.

Technical challenges and future considerations
The pre-test instructions meant that, for the most part, testing ran smoothly and minimal technical difficulties were experienced. The main technical issues experienced involved the respoken text not appearing on screen or disappearing entirely. The video recording guaranteed that this text could be retrieved by assessors so that candidates were not disadvantaged. The recording also facilitated quality assurance during the marking process (section 4.3 below).
The recording process posed two key problems. First, some candidates placed a required window on their second screen, so it was not recorded. This did not compromise invigilation or assessment quality, but it meant that the marking process needed to be adapted. Second, some homeworkers felt that the recording was an intrusion. However, since recording is integral to the certification process, earlier notification will be given in future tests to allow candidates to make any adjustments required to their working set-up.

Quality assurance process
The tests were marked with the NER model (Romero-Fresco & Martínez, 2015), following the same process used during Ofcom's sampling of live television subtitling in 2014 (Romero-Fresco, 2016), and all examiners were experienced in using this model. Procedures for first and second marking were determined to ensure rigorous marking and fair treatment of all the candidates. Since the pilot was the first certification of live respeaking to take place, it also revealed areas for further refinement to the quality assurance process, which will be put in place during subsequent rounds of testing.

Marking process
For effective use of the NER model, a careful comparison of the words spoken in the original source video (the verbatim transcript) and the respoken transcript is required. As mentioned in section 3, differences between the two are classified as either edition (E) errors, where the respeaker has added, omitted or changed something in the original, or recognition (R) errors, where mispronunciations or mishearing mean that a word is incorrectly recognized by the speech-recognition software. The severity of these errors is weighted and scored and the value obtained is then deducted from the total number of words spoken (N), allowing the accuracy to be calculated:

Accuracy = (N − E − R) / N × 100

Correct editions, where the respeaker has edited the text without causing loss of information, are also noted.
Internal marking sheets were designed to ensure homogeneous marking among examiners and errors were automatically calculated:

Figure 4. Internal marking sheet

The marking sheet allowed the full transcript to be divided into smaller chunks of text, usually a sentence long each (or independent idea unit, as per NER terminology). This made the process of marking and reviewing far clearer for all the examiners involved and allowed for easy comparison of responses across candidates, ensuring consistency in marking. Comment boxes allowed each marker to explain the error seen and second markers to add their responses, where necessary, to the first marker's decision. The coloured columns allowed each marker to record the types of error noted. The second marker worked with a copy of the first marker's sheet and adjusted any discrepancies according to their own marking.

In accordance with the NER model, recognition and edition errors were divided into three categories: serious, standard and minor errors. Serious errors changed the meaning of the original and carried a penalty of one mark; standard errors resulted in the omission of an information unit or disrupted the flow of meaning (Romero-Fresco, 2016) and had a penalty of 0.5 marks; minor errors still allowed the viewer to follow the meaning of the original and had a penalty of 0.25 marks. Correct editions were totalled but not weighted. As the markers entered the types of error, the Excel spreadsheet automatically calculated the total number of points deducted. At the end, in addition to completing the NER calculation, each marker wrote a detailed comment to capture the overall quality of the respeaking.
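The calculation performed by the marking sheet can be illustrated with a short sketch. This is not the spreadsheet used by the LiRICS examiners; it is a minimal Python illustration that assumes the penalties stated above (1, 0.5 and 0.25 marks), the standard NER formula (N − E − R) / N × 100 and a pass threshold of 98%. The names `Error`, `ner_accuracy` and `passes`, as well as the example figures, are hypothetical.

```python
from dataclasses import dataclass

# Penalty per error severity, as described for the NER model in the text.
PENALTIES = {"serious": 1.0, "standard": 0.5, "minor": 0.25}


@dataclass
class Error:
    kind: str      # "edition" or "recognition"
    severity: str  # "serious", "standard" or "minor"


def ner_accuracy(n_words: int, errors: list[Error]) -> float:
    """Return the accuracy rate as a percentage: 100 * (N - E - R) / N."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    # E and R are the summed penalties of edition and recognition errors.
    edition = sum(PENALTIES[e.severity] for e in errors if e.kind == "edition")
    recognition = sum(PENALTIES[e.severity] for e in errors if e.kind == "recognition")
    return 100 * (n_words - edition - recognition) / n_words


def passes(accuracy: float, threshold: float = 98.0) -> bool:
    """Assumed pass rule: accuracy at or above the 98% NER threshold."""
    return accuracy >= threshold


# Hypothetical respoken text of 1,000 words (illustrative figures only).
example_errors = (
    [Error("edition", "serious")] * 2        # 2 x 1.0  = 2.0
    + [Error("edition", "standard")] * 10    # 10 x 0.5 = 5.0
    + [Error("recognition", "minor")] * 12   # 12 x 0.25 = 3.0
)
print(ner_accuracy(1000, example_errors))  # 99.0
```

In this hypothetical example the ten penalty points deducted from 1,000 words yield an accuracy rate of 99%, which would be above the threshold. Correct editions are deliberately left out of the calculation, since they are totalled but not weighted.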
Where a candidate did not achieve the expected level of accuracy, a third marker reviewed the assessment for confirmation.

Feedback to candidates
Each candidate received an assessment report with feedback on their performance. In addition to finding out whether they had passed or failed and receiving their overall accuracy score across all three tests, for each clip they received detailed comments on their performance and the level of accuracy attained and a grid indicating the total number of errors in each category.

Interrater disagreement
The interrater disagreement seen across first and second marking for LiRICS is 0.24%. This figure was calculated by averaging the difference between first and second marking for the 27 respoken texts produced by nine respeakers. Although this interrater disagreement is higher than the one obtained in the Ofcom study, which was reported at 0.09% (Romero-Fresco, 2016), it is still negligible. It is equivalent to 0.5 on a 1 to 10 scale and it means that all 27 respoken texts assessed by two markers were placed in the same band (i.e., pass or fail).
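The averaging described above can be sketched as follows. This is an illustration only: it assumes that the difference is taken as the absolute difference, in percentage points, between the two markers' accuracy rates for each text, and that the band is decided by the 98% threshold. The function names and the three pairs of scores are hypothetical and are not the LiRICS data.

```python
def interrater_disagreement(first: list[float], second: list[float]) -> float:
    """Mean absolute difference between paired accuracy rates (percentage points)."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(a - b) for a, b in zip(first, second)) / len(first)


def same_band(a: float, b: float, threshold: float = 98.0) -> bool:
    """True if both markers place the text on the same side of the pass threshold."""
    return (a >= threshold) == (b >= threshold)


# Hypothetical accuracy rates awarded by the first and second markers.
first_marks = [98.5, 99.0, 97.5]
second_marks = [98.25, 99.5, 97.5]

print(interrater_disagreement(first_marks, second_marks))  # 0.25
print(all(same_band(a, b) for a, b in zip(first_marks, second_marks)))  # True
```

Checking the band alongside the average matters because a small mean difference could still hide a text that one marker passes and the other fails.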

Modifications to marking process
During the pilot test, it became clear that, despite the rigorous marking of both first and second examiners, a formalized moderation process or discussion forum which allowed responses to be compared across candidates was also required. To further support the first and second markers, moderation sessions will be held to agree on expected standards for each genre and to co-mark respoken passages that might be particularly problematic. In the pilot test, the first and second markers were randomly assigned, but as a way forward a more rationalized method of marking would be to assign the same first marker to the same genre. This, we expect, would reduce the extent of interrater disagreement.

Analysis of the LiRICS pilot
The first phase of the LiRICS project, completed in December 2018, can be regarded as a pilot study. It involved 27 tests by nine candidates, all of them professional respeakers. Six of them were awarded a Level 1 LiRICS certificate and three of them will need to resit all or some of the tests. In total, the LiRICS examiners analysed 6,000 subtitles and 40,000 words, which is far from the 78,000 subtitles and 546,000 words analysed in the Ofcom project (Romero-Fresco, 2016). However, the unexpected similarity between the results obtained in the two projects makes it possible to draw more conclusions from this pilot than may have been expected.
Tables 1-6 compare the results of the two projects and, more specifically, the accuracy rates, types (edition vs recognition) and severity (minor, standard and serious) of errors in general and per genre. The average accuracy rate obtained by all candidates for all videos is 98.5%, practically the same as the rate obtained by the dozens of respeakers analysed over two years for the Ofcom project (98.4%). In the LiRICS pilot, 23% of the videos analysed did not attain the minimum NER threshold of 98% and 77% did, exactly the same percentages found in the Ofcom project. A slight difference may be found in the distribution between acceptable, good, very good and excellent subtitles, with better results in the LiRICS pilot, that is, a higher percentage of good, very good and excellent subtitles than in the Ofcom study. This makes sense if we consider that candidates were being tested for Level 1 certification and that there are two more levels of increased difficulty.
As shown in Table 2, once again, the difference between genres has proved to have a considerable impact on the respeakers' performance. In the Ofcom project, the most challenging genre was the chat shows, 60% of which did not attain the required quality threshold (as compared to 26% for entertainment shows and 14% for news shows). In the LiRICS test, it is the classroom clip that proved to be particularly difficult, with 66% of the programmes not attaining the threshold (as compared to 17% for both the Q&A and the conference clips). Whereas in the chat shows the main problems were the lack of a script, the high speech rates and the overlapping speech, the classroom clip proved particularly challenging because of the interaction between teacher and students. Not all of the pupils were always fully audible and, even when they were, respeakers were not always sure whether or not to include them in the subtitles.

[...] the conference clip. On the whole, these are slightly better results per genre than those obtained in the Ofcom project (97.9% for chat shows, 98.54% for entertainment shows and 98.75% for news programmes); this is shown by the fact that 22% of the candidates produced very good or excellent subtitles (over 99% accuracy rate) for the Q&A clip and as many as 55% produced good subtitles (between 98.5% and 99%) for the conference clip. See Table 3 below.

The findings relating to the types (edition vs recognition) and severity of errors (minor, standard and serious), presented in Tables 4, 5 and 6, provide further data with which to analyse the respeakers' performance and, more specifically, the low scores obtained in the classroom clip. The LiRICS pilot confirms that incorrect editions, that is, instances in which meaning is lost because respeakers cannot keep up with the original audio and omit information, are the most common type of error in respeaking.
Edition errors are approximately twice as frequent as recognition errors (65% edition errors vs 35% recognition errors in LiRICS, as compared to 69% edition errors vs 31% recognition errors in the Ofcom project). It is also evident that the more challenging the genre, the higher the percentage of edition errors: 76.5% edition errors vs 23.5% recognition errors in the classroom clip and, in the Ofcom project, 75% edition errors vs 25% recognition errors in chat shows. The analysis of error severity reveals further parallels between the LiRICS pilot and the Ofcom project: 56% minor errors, 36% standard errors and 8% serious errors in the former and 56% minor errors, 39% standard errors and 5% serious errors in the latter. Once again, edition errors tend to be more serious than recognition errors and this is particularly true of the most challenging genres, in this case the classroom clip, where one in two edition errors is standard (that is, one in two errors involves the omission of a full sentence), as opposed to one in five in the case of the conference and the Q&A clips. Those omitted sentences were often comments by the students. The breakdown of the type and severity of errors in the LiRICS study shows once again that the key to achieving high-quality subtitles is not only to have fewer errors but also to control their severity and to ensure that most of them are minor rather than standard or serious.

Conclusions
The considerable impact that live subtitling and, more specifically, respeaking has had on both the audiovisual translation market and society as a whole has not been matched by an equally significant body of publications or training courses. Academic research and training in respeaking had a slow start, which meant that many companies had no choice but to develop their own in-house training programmes. Although by 2012 six European HEIs were offering respeaking courses, seven years later the offering has not increased as much as could have been expected. This may be due to the lack of respeaking trainers, which is related to the low number of researchers working in this area. This landscape might help explain why the UK government decided not to use Disabled Students' Allowances for respeaking until professional respeakers were recognized or certified and, most importantly, it justifies the need for an official certification such as LiRICS.
As shown in this article, the creation of a respeaking certification is a complex and time-consuming process that involves careful consideration of the choice and grading of material, technical set-up, online testing, peer-reviewed assessment, etc. The procedure chosen for LiRICS and tested in a pilot scheme is by no means perfect, but it has been shown to be effective enough to achieve the goal of certifying professional respeakers, while also revealing aspects that are still in need of improvement.
The analysis of the data obtained from the candidates has also proved extremely useful. Their striking resemblance to the results of the much larger Ofcom study (Romero-Fresco, 2016) reveals a few interesting lessons: first, despite the low number of participants in the pilot, their results can probably be extrapolated and are representative of the current provision of respoken subtitles in the United Kingdom; secondly, we have probably succeeded in aligning the certification with current professional standards; and, thirdly, the decision to draw on the Ofcom project as an inspiration for designing the assessment included in LiRICS has been useful. Some limitations have also been revealed, not least of them the excessive difficulty of the classroom clip. Whereas it made sense for some genres such as chat shows to be more difficult than others in the Ofcom project, this should not apply to LiRICS, where all the clips should be equally difficult at a level commensurate with the level that is being tested. Another limitation is the absence of some key live subtitling dimensions such as delay and subtitling speed, which have not been tested in LiRICS.
Finally, the data obtained in the LiRICS pilot have also helped to provide a clearer picture of respeaking, highlighting the importance of edition errors over recognition errors and the need to control error severity in order to obtain a satisfactory accuracy rate. This should be fed back into training, which does not always focus on these elements. This last point helps to end this article on a positive note. Despite the limited number of trainers and researchers working on respeaking, the vibrant work produced in this area is an example of fruitful cross-fertilization between teaching, research and professional practice. The first training programmes in respeaking at HEIs introduced the NER model as a method with which to rate students' work; this was then used in research projects to assess the quality of the live TV subtitles, such as the Ofcom study, which has in turn informed the design and implementation of LiRICS. The circle is now closed, as the findings of the LiRICS project feed back into respeaking training programmes.
As new challenges appear in the professional field of live subtitling (for instance, interlingual live subtitling and live subtitles produced by automatic speech recognition), it is essential to maintain this tight connection between training, research and practice in order to ensure that the rapid development of live subtitling is accompanied by the required quality standards that can guarantee full access for all viewers.