Effects of thinking aloud on cognitive effort in translation

This study investigates the effects of thinking aloud on cognitive effort in translation as a function of source-text difficulty level. It does so by considering time on task, duration of different translation phases (i.e., orientation, draft, revision), cognitive effort of processing the source and target texts, and translation quality. Twenty participants took part in an English–Chinese translation experiment, which comprised two matched sessions: translating while thinking aloud and translating silently. Their translation processes were recorded by means of an eye tracker and a key logger. An adapted NASA Task Load Index was employed to elicit their subjective assessments of translation difficulty levels. The quality of their translations was evaluated. The results of the study reveal a number of important effects of thinking aloud on cognitive effort during translation: on translation duration, cognitive effort, the perceived level of difficulty of a translation as measured by NASA-TLX, and on translating easier texts.


Introduction
As a data elicitation method, thinking aloud (TA) refers to verbalizing one's thoughts out loud while engaging in a task, and written transcripts of the verbalizations are called think-aloud protocols (TAPs) (Ericsson & Simon, 1993; Jääskeläinen, 2010). In translation process research, it used to be a primary method for investigating translators' cognitive processes from the mid-1980s until the end of the 1990s. In recent years, TA has been adopted by some translation researchers (e.g., Angelone, 2010; Vieira, 2017). And yet, for various reasons, TAP-based translation studies have dwindled noticeably. One reason is the emerging and increasing use of keystroke logging, eye-tracking and other methods (see Jakobsen, 2017). Another reason is concerns and doubts about TA's validity and comprehensiveness (Hansen, 2005; Jääskeläinen, 2011). To date, few studies (e.g., Jakobsen, 2003) have empirically tested TA's validity, despite calls for more research into this method (Jääskeläinen, 2017).
This study is an empirical investigation of the impact of TA on the translation process. It seeks to examine the effects of TA on cognitive effort in translation by considering time on task, duration of different translation phases, cognitive effort in processing the source texts (STs) and target texts (TTs), and translation quality, as functions of the difficulty of STs. It bases its analysis on verbal, keystroke logging and eye-tracking data collected from the experiment, and post-performance rating of the level of difficulty of a translation. The methods of data collection are complementary and they aim to provide an accurate picture of the effects of TA on the translation process.
We first review and discuss the theories, arguments and recent empirical findings concerning TAPs. Then we introduce the concept of cognitive effort, its position in translation process research and its various measures before we describe our experiment set-up and present the findings.

Validity of TAPs
The theory that verbal protocols can be used to elicit data on cognitive processes was proposed by Ericsson and Simon (1980, 1993), who have provided substantial empirical support for it. Ericsson and Simon (1993) hold that automated or unconscious processes are unavailable for TA. The controversies concerning the validity of TAPs revolve around two points: (1) whether TA alters the thought processes being studied (reactivity), and (2) whether TAPs are an accurate reflection of thoughts (veridicality; Bowles, 2018). Recently, Fox et al. (2011), in a meta-analysis of 94 studies (involving almost 3,500 participants) comparing performance while thinking aloud with that in a matching silent condition, found that TA results in little or no reliable difference in performance between the TA and silent conditions, though it does prolong the time taken to reach a solution.
Most of the studies included in the meta-analysis of Fox et al. (2011) involved problem-solving tasks (e.g., playing chess), making the relevance of their findings to linguistic tasks (such as translation and reading) unclear (Bowles, 2018). Partly for this reason, since the early years of translation process research, the validity and completeness of TAPs have been controversial. Toury (1991), for example, suspected that "spoken" translation (i.e., TA) and "written" translation might involve different strategies and therefore TA might interfere with translation. Hansen (2005) contended that TA "must have an impact … on the thought processes, on the translation process and on the translation product" (p. 519). Sun (2011), in his review, analysed these claims and argued that no strong evidence suggests that TA significantly changes or influences the translation process, although several variables (e.g., task or text characteristics) in a specific study might influence the validity and completeness of TAPs.
Two oft-cited empirical studies in this regard are Krings (2001) and Jakobsen (2003). Krings (2001), in his study of the process of post-editing machine-translated texts as well as translation, found that TA slowed down the process by roughly 30%. Jakobsen (2003) reported that TA delayed translation by about 25% and forced translators to process text in smaller segments. Their findings have been interpreted as evidence that TA might change the course or structure of cognitive processes (e.g., Nitzke, 2019, p. 92). This, however, is a misunderstanding of their findings and the expression "course or structure of cognitive processes".
Metaphorically speaking, if a person follows the same path as usual, we would say their path (or course) is unchanged, even when they walk more slowly and take smaller steps than usual. Cognitive structure means "rules for processing information or for connecting experienced events" (Kohlberg, 1969, p. 349) and it is exhibited in four types of statement (Ericsson & Simon, 1993, p. 198):
• intentions (goals and future states of the participant);
• cognitions (attention to selected aspects of the current situation);
• planning (exploring sequences of possibilities mentally); and
• evaluations (explicit or implicit comparisons of alternatives).
If one translates a metaphor literally under the silent condition but paraphrases it under the TA condition, then TA has undoubtedly changed the translation process or cognitive structure. Obviously, this is not what the findings by Krings (2001) and Jakobsen (2003) are about.
TA has also been used in reading and writing studies. As translation involves reading and writing, related studies in these two fields merit our attention. Smith et al. (2019) provided a methodological review of 76 original empirical studies published between 2000 and 2015 that used verbal reports (including TAPs) to investigate the reading processes of language learners; six of these studies discussed the reactivity of verbal reports. Leow and Morgan-Short (2004) and Bowles and Leow (2005) found that verbalization had a minimal effect on text comprehension and written production and did not compromise the validity of the study. Godfroid and Spino (2015) reported that neither eye-tracking nor TA affected text comprehension, although TA had a small, positive effect on vocabulary recognition. Bowles (2010) did a meta-analysis of 14 studies and concluded that TAPs "can reliably be used as a data collection tool" (p. 138); meanwhile, she pointed out that reactivity is multidimensional and, depending on many factors, can improve or hinder task performance (p. 137).
Compared with reactivity, veridicality (that is, whether TAPs reflect thoughts accurately) has been a minor concern for research across all fields (Bowles, 2018). In a study among 43 Chinese sophomores, Yang (2019) compared TAPs with retrospective verbal reports in an English-as-a-Foreign-Language (EFL) writing task and reported that TAPs were largely accurate, although there were various insignificant omissions in them. Thought processes mediating revision decisions were thought aloud for only about 8% of the revisions; the monitoring processes were therefore mostly under-represented in the TAPs. Based on the participants' reflections, the omissions appeared to be covert (as intermediary processes), transient (a second or two), nearly automatized or intuitive, making them hard to verbalize. According to Ericsson and Simon's theory (1993, p. 90), only information in focal attention can be verbalized, and such omissions are not in focal attention. In other words, TAPs do not provide an exhaustive view of the cognitive processes that occur during a task (Jääskeläinen, 2017). This attests to the usefulness of retrospective verbal reports in providing complementary information such as elaborations, rationalizations and justifications, although, because of concerns about accuracy, such information is not what Ericsson and Simon (1993, p. xvi) recommend deriving from retrospective verbal reports.

Cognitive effort and its measurement
The term "cognitive effort" is used in many fields and has many synonyms, including mental load, mental workload, cognitive workload, cognitive load, mental effort and task difficulty (Sun, 2019). Mental workload (sometimes referred to as cognitive workload) has been an important concept in human factors and industrial psychology. It is defined in various ways, for example, as the "portion of the operator's limited capacity actually required to perform a particular task" (O'Donnell & Eggemeier, 1986, p. 42-2). The term "cognitive load" has been used in psychology since the 1960s, and is defined by Block et al. (2010) in a meta-analytic review as "the amount of information-processing (especially attentional or working memory) demands during a specified time period; that is, the amount of mental effort demanded by a primary task" (p. 331).
The term "mental load" first appeared in the 1920s and has been used interchangeably with mental workload in the field of psychology. "Cognitive effort" has been defined as "the engaged proportion of limited-capacity central processing" (Tyler et al., 1979, p. 607). "Difficulty", from a cognitive perspective, refers to the amount of cognitive effort required to solve a problem (see Sun, 2015). From these definitions, we may or may not see the distinctions among them, but we can see the connections.
According to Muñoz Martín (2012), mental load is a pivotal construct in Cognitive Translation Studies (CTS); research in this regard can help us to identify important characteristics of the translation process, unravel the complex relationships between attention, consciousness, problem-solving, automation and expertise, and may be beneficial for both current theoretical and empirical efforts in translation process research. Similarly, Lacruz (2017) believes that studying cognitive effort is "key to gaining insights into the translation process" (p. 387).
A central theme in the study of mental load, cognitive effort and translation difficulty is how to measure them. In their bibliography on mental workload assessment, Wierwille and Williges (1980) identified 28 specific techniques in four major categories: subjective opinion, spare mental capacity, primary task and physiological measures. In translation process research, the applicable techniques can be classified into three major categories: subjective, performance and physiological measures (e.g., measures of brain, eye, cardiac or muscle functions). Subjective measures have been used frequently and the most commonly used subjective measure is the rating scale. A frequently employed rating scale is the Task Load Index (NASA-TLX) developed by Hart and Staveland (1988), which has six workload-related subscales: mental demand, physical demand, temporal demand, effort, performance, and frustration level. It has been adopted and adapted in translation research (see Sun & Shreve, 2014). The performance measures include time on task, typing behaviour (such as pauses, deletions) recorded by key-logging programs, eye movement and gaze direction (e.g., fixation count, fixation duration) revealed by eye-tracking, and translation quality (Halverson, 2017; Lacruz, 2017). An important factor to consider in a study related to cognitive effort is the characteristics of the ST/SL or TT/TL, which have been found to correlate with cognitive effort (Halverson, 2017; Liu et al., 2019; Sun & Shreve, 2014) and may have an impact on the effect of TA on cognitive effort (Sun, 2011).

Participants
Twenty graduate students in Translation or Interpreting from a university in Beijing participated in this experiment in early 2019. Three of them took part in an ST difficulty evaluation, three in a pilot test and 14 were the informants in the main experiment. They were native Chinese speakers from 21 to 24 years of age. All the participants bar one were female. They all signed an informed consent form and each received about USD 25 in compensation.

Test passages
Five passages were used in this study. One of them was in Chinese, about 100 words long, and was used in the copying task so that participants could get used to the keyboard used in all the experimental tasks. The other four passages were in English, each about 150 words long, and were used as STs in the translation tasks. These four self-contained passages were excerpts taken from two recently published texts entitled, respectively, "The Uses of Nostalgia", retrieved from The Economist, and "Worldwide Cancer Rates Rising", retrieved from voanews.com.
Based on the Flesch Reading Ease formula, the Nostalgia piece was more difficult, with the two excerpts (i.e., ST1 and ST2) scoring 46.4 and 45.4 (corresponding to an 11th grade level), respectively; the Cancer piece was easier, with the two excerpts (i.e., ST3 and ST4) scoring 72 and 65.9 (corresponding to a 7th grade level), respectively. The translation difficulty level of the two excerpts taken from the same text in both cases was further confirmed to be approximately the same in terms of task duration and subjective evaluation by three participants. Therefore, two excerpts taken from the same text were homogeneous in terms of word count, text type, topic and ST difficulty.
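The Flesch Reading Ease score referred to above is conventionally computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), with higher scores indicating easier texts. The following is a minimal sketch of that calculation, not the tool used in the study (which is not named in the text). The function names and the vowel-group syllable counter are illustrative assumptions; a heuristic counter of this kind will not reproduce the published scores exactly.

```python
import re


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("rate"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula applied to heuristic counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Because the syllable count dominates the score, any real use of this sketch would need a more reliable syllable source before its output could be compared with the values reported for ST1 to ST4.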
Task analysis was conducted before the experiment. The English texts did not require specialized knowledge in order to be comprehended. Since the participants were not allowed to access the internet during the tests, English explanations for a few difficult words and Chinese translations for English proper nouns were shown to the participants on paper for a few minutes just before the test on the computer began.

Instruments and experimental setup
A Tobii TX300 eye tracker and a high-performance PC were used in the experiment; installed on the PC were the key logger Translog-II for recording keyboard and mouse events, Tobii Studio, and Audacity for audio recording.
In an eye-tracking experiment, the type and model of the eye tracker, the physical distance between the eye tracker and the participant's eyes, the type of chair for the participant, the participant's ability to touch type and light intensity, among other factors, may have an impact on the quality of the data collected (Hvelplund, 2014; O'Brien, 2009). Therefore, pilot studies are indispensable.
The participants were informed beforehand not to wear contact lenses, false eyelashes or eyeliners, which would affect the validity of the eye-tracking data (O'Brien, 2009). The distance between eye tracker and participant was around 64 cm, as recommended by the Tobii Studio Manual (p. 36). A comfortable non-swivelling, medium-height chair was selected.
The pilot study showed that a participant's sitting posture had an influence on the collection of eye-tracking data. Participants who moved their body forward and backwards from time to time when performing the task produced low-quality data. During one pilot session, a participant moved her chair and the quality of the eye-tracking data immediately became poor. Therefore, before an experiment started, we did eye-tracking calibration and adjusted the chair position to ensure the distance between eye tracker and participant was optimal, and asked the participant to keep their posture steady while translating. The participant's eye-movement habits may also result in data-collection failure: if a participant tends to blink very fast or look up and to the left (or right), they will be a poor fit for experiments with a remote eye tracker. This was confirmed in our pilot study.
The project setup in Translog-II was also an important factor to consider. For Tobii Studio to record the whole screen of Translog-II, the Translog-II User's window was adjusted out of fullscreen mode. The text in the Translog-II window was double-spaced so that the participant's gaze path and fixations would be easier to recognize.

Test procedure
Each participant completed two sessions: first, they translated two passages (ST1 and ST3) silently and, second, they translated the other two passages (ST2 and ST4) while TA (see Table 1). Behaviour at the keyboard and mouse was recorded with Translog-II, which does not track translation behaviour outside its interface (e.g., web searches), so the participants were not allowed to access the internet or refer to print materials.
In each session, the translation task was preceded by the previously noted copying task and the two texts were presented in random order. The sessions had no time constraints. For each participant, there was a two-day interval between the two sessions. The participants were told that the quality of their translations would be assessed and therefore they could review their translation and make revisions if necessary when they had finished the draft and before submitting it. This was supposed to push participants to try their best in the experiment. The participants were not informed of the focus of this study, that is, the impact of TA on the measures of cognitive effort. After the participants had finished the translation of a passage, they were asked to complete an adapted NASA-TLX survey for evaluating the translation difficulty level of the passage, which comprised four items with English and Chinese descriptions: mental demand, effort, frustration and performance. Each category was rated on a 0-10 scale, with 0 being extremely low and 10 being extremely high (see Sun & Shreve, 2014).
Before the TAPs session, the participants received some training on TA: based on Ericsson and Simon's (1993, p. 376) recommendations, they were asked to multiply two numbers in their head and say out loud everything that they would say to themselves silently; they were then requested to translate a few sentences from English into Chinese on paper while thinking aloud so that they would grow accustomed to the experiment. In addition, the participants were told to focus on the translation task, especially when facing a choice between TA and performing the translation task (Sun, 2011), and to avoid articulating what they assumed the observer wanted to hear (Jääskeläinen, 2017). After they had finished the translation task, each of them was interviewed about the difficulties they had encountered with their translation and the influence of TA on their translation process and performance. After all the tests were completed, the participants' translations were assessed with regard to quality by two graders. Participants' verbalizations and post-performance interviews were then transcribed and analysed. These verbal data were analysed together with key-logging and eye-tracking data.

Data quality and analysis
Two post-experiment measures were adopted to ensure data quality: Tobii Studio's indicators of the recording quality and Translog-II's replay function. Tobii Studio presented several types of information after each recording, including the ID of the recording, the name of the participant, date, duration, Gaze Samples and Weighted Gaze Samples, the last two of which are indicators of the number of valid samples during a recording.
According to the Tobii Studio User's manual (v. 3.4.8), the Gaze Samples value shows how many of the eye-tracking samples in a recording have usable gaze data and indicates how useful the recording will be for analysis. It is expressed as a percentage that is "calculated by dividing the number of eye-tracking samples with usable gaze data that were correctly identified, by the number of attempts" (p. 41); 100% means that one or two eyes were detected for the full recording and 50% indicates that one or two eyes were detected for half of the recording session. A low value means that the participant's eyes could not be found because they were looking away from the screen or were closed during parts of the recording. As human beings cannot keep their eyes open without blinking, Gaze Samples cannot be 100% in an authentic experiment. The Weighted Gaze Samples value is "weighted based on if both or just one eye was detected in a sample" (p. 41); 100% shows that two eyes were detected for the full recording whereas 50% indicates that one eye was detected throughout the recording or two eyes for half of the recording.
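The difference between the two indicators can be illustrated with a short sketch. It assumes that each sampling attempt is represented as a pair of flags (left eye detected, right eye detected) and that a one-eye sample carries half the weight of a two-eye sample. That weighting is an assumption chosen to agree with the 100% and 50% examples quoted from the manual; it is not Tobii's documented algorithm, and the function name is hypothetical.

```python
from typing import List, Tuple


def gaze_sample_percentages(samples: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (gaze_samples_pct, weighted_gaze_samples_pct).

    Each sample is (left_eye_detected, right_eye_detected) for one
    sampling attempt of the eye tracker.
    """
    if not samples:
        raise ValueError("at least one sampling attempt is required")
    total = len(samples)
    # Gaze Samples: an attempt counts if at least one eye was detected.
    any_eye = sum(1 for left, right in samples if left or right)
    # Weighted Gaze Samples (assumed weighting): two eyes = 1.0, one eye = 0.5.
    weighted = sum((int(left) + int(right)) / 2 for left, right in samples)
    return 100 * any_eye / total, 100 * weighted / total
```

Under this weighting, a recording in which only one eye is ever detected scores 100% on Gaze Samples but 50% on Weighted Gaze Samples, which is why the weighted value is the stricter of the two.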
Compared to Gaze Samples, Weighted Gaze Samples seem to be a more precise measure. Therefore, Weighted Gaze Samples were adopted and those recording sessions with a Weighted Gaze Samples percentage below 60% were replayed and examined in Translog-II. In the end, data from one of the 14 formal participants were discarded and only 13 participants' data were analysed.
The Translog-II keylogging files were uploaded to an online TPR-DB management tool, which tokenized the STs and TTs, segmented them into sentences, and aligned them at a sentence level. After some adjustments, the sentence-aligned STs and TTs were manually aligned at a word level. The TPR-DB management tool then generated TPR-DB tables, which were downloaded to the local computer for statistical analysis. These TPR-DB tables (see Figure 1) contained more than 200 features that can be used to describe and model translation behaviour, such as timestamps, durations, pauses, deletions, insertions, part of speech tags, number of typed keystrokes per activity unit, number of fixations per activity unit and word translation entropy (see .
Data from eye-tracking and key-logging were analysed together with TAPs. […] Skewness and kurtosis, two measures of the shape of a distribution, were found to be 0.7 and 0.4, respectively, which indicated the distribution of data was moderately right-skewed. Two outliers with durations of 2,201.55 and 2,267.45 s were found with a box plot; they exerted great influence on the variance of the duration of ST2 translations, and therefore were excluded from the inferential statistical analysis. A density plot then showed that the duration data of the 50 observations (i.e., 13 participants × 4 passages − 2 outliers) were normally distributed. Linear mixed-regression modelling was employed to estimate the influence of TA on translation. […] Ericsson and Simon's theory (1993, p. xxxii) that participants slow down only moderately due to the additional verbalization and also with Jakobsen's (2003) finding that TA delayed translation by about 25%.
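The box-plot screening mentioned in this paragraph can be sketched as follows. The text does not state which outlier rule was applied, so the sketch assumes the conventional box-plot rule (Tukey's fences at 1.5 times the interquartile range); the function name and the choice of the "inclusive" quartile method are likewise assumptions.

```python
import statistics
from typing import List, Tuple


def iqr_outliers(values: List[float], k: float = 1.5) -> Tuple[List[float], List[float]]:
    """Split values into (kept, outliers) using fences at k * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers
```

Applied to a list of task durations in seconds, the second returned list would contain the observations that a standard box plot draws as individual points beyond the whiskers.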

Effects of TA on translation duration
The t value of fixed effects of ST difficulty was found to be 4.489 (p <0.01), showing that ST difficulty had a significant influence on translation duration; the duration for translating a difficult ST was estimated to be 304.01 s longer than for translating an easier ST.
The interaction between the TA condition and ST difficulty condition was also found to be significant; the t value of fixed effects of the interaction was −2.149 (p <0.01). The duration difference in the translation of easier texts between the silent condition and the TA condition was estimated to be 341.59 s; for more difficult texts, the difference was only 130.93 s. From this, it can be inferred that when the level of ST difficulty increases, the difference in translation duration between the silent condition and the TA condition will decrease.

Effects of thinking aloud on different translation phases
In Translog data, the translation process was divided into orientation, draft and revision phases (Dragsted & Carl, 2013; Nitzke & Oster, 2016). The orientation phase is that during which a translator reads the ST before any insertion or deletion takes place. It lasts from the beginning of a translation session until the first keystroke. The draft phase starts from the end of the orientation phase until the last token of the ST is translated. The revision phase refers to the period during which a translator deletes and/or inserts words in the TT after having produced a complete draft, and it starts with the first deletion or insertion. It should be mentioned that this division of phases is based on key demarcation points and it is not intended to filter out instances of revision taking place prior to a complete draft being in place.
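The demarcation points described above translate directly into phase durations. The sketch below assumes four timestamps in seconds per session (session start, first keystroke, completion of the draft, session end); these parameter names are hypothetical and are not Translog-II or TPR-DB field names. As a simplification, the revision phase is taken to run from draft completion to the end of the session, so any pause before the first revision edit is counted as revision time.

```python
from dataclasses import dataclass


@dataclass
class PhaseDurations:
    orientation: float  # session start to first keystroke
    draft: float        # first keystroke to last ST token translated
    revision: float     # draft completion to end of session (simplified)


def split_phases(session_start: float, first_keystroke: float,
                 draft_complete: float, session_end: float) -> PhaseDurations:
    """Compute phase durations in seconds from the four demarcation points."""
    if not session_start <= first_keystroke <= draft_complete <= session_end:
        raise ValueError("timestamps must be in chronological order")
    return PhaseDurations(
        orientation=first_keystroke - session_start,
        draft=draft_complete - first_keystroke,
        revision=session_end - draft_complete,
    )
```

Because the split relies only on these boundaries, revisions made while drafting remain inside the draft phase, as the text notes.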

On the duration of the orientation phase
The mean duration of the orientation phase (n = 50) was 21.32 s, median 8.51 s and standard deviation (SD) 30.7 s. The minimum duration of the orientation phase was 2.53 s, whereas the maximum was 137.98 s. The skewness and kurtosis were found to be 2.31 and 4.83, respectively, indicating the distribution of data was heavily right-skewed and there were outliers. A box plot (see Figure 2) analysis of the data showed ten outliers.
Sun, S., Li, T., & Zhou, X. (2020). Effects of thinking aloud on cognitive effort in translation. Linguistica Antverpiensia, New Series: Themes in Translation Studies, 19, 132–151.
In a Translation Progression Graph (see Figure 3), the horizontal axis shows the time (in milliseconds (ms)) when the translation of the ST was produced; the left vertical axis enumerates the emerging ST words whereas the right vertical axis gives the target words; blue dots represent fixations on the ST; green dots represent fixations on the TT; black characters represent insertions; red characters represent deletions. The orientation phase in Figure 3 is marked by an orange rectangle. As we can see, the participant read all the sentences of the ST before typing the TT.
Such great discrepancies among the observations pointed to participants' different translation styles. During the orientation phase, several types of behaviour could be distinguished via the eye-tracking data: (1) reading through the ST before translation; (2) skimming the ST quickly before translation; (3) reading the first couple of words or sentences before pressing the first key (Carl et al., 2011).
Since the distribution of data was heavily right-skewed, the data were transformed and normalized by taking the logarithm before statistical analysis. A linear mixed model was built (t = −0.804, p = 0.426 >0.05) and it indicated that the fixed effect of TA was not significant. Then, ST difficulty was added into the model and the t-value was 0.431 (p = 0.669 >0.05); it showed that the fixed effects of ST difficulty were not significant either. The interaction between TA and ST difficulty was not significant (p = 0.693 >0.05). Therefore, the duration of orientation was not influenced significantly by TA, ST difficulty or the interaction of both. This makes sense, as the participants might simply be reading the ST out loud.

On the duration of the draft phase
The mean duration of the draft phase (n = 50) was 901.89 s, median 860.59 s and SD 322.13 s. The data were transformed and normalized by taking the logarithm before statistical analysis. A linear mixed model comprising two fixed effects (TA, ST difficulty) and one random effect (participant) was applied. Both fixed effects (t = 3.825 for TA condition, p = 0.0004 <0.01; t = 2.141 for ST difficulty, p = 0.039 <0.05) were found to be significant on the duration of the draft phase. The interaction between TA and ST difficulty had no significant influence (p = 0.693 >0.05) on the duration of the draft phase. In other words, TA increased the duration of the draft phase and, compared to translating easier texts, translating difficult texts required a longer duration in this phase.
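The model described in this paragraph can be written out explicitly. The notation below is an assumption introduced for illustration (the source gives no formula): i indexes participants, j indexes passages, TA and Difficulty are 0/1 indicator variables, and the logarithm is taken to be the natural logarithm, since the base is not stated.

```latex
\log(\mathrm{duration}_{ij}) = \beta_0 + \beta_1\,\mathrm{TA}_{ij} + \beta_2\,\mathrm{Difficulty}_{ij} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here u_i is the random intercept for participant i and ε_ij is the residual. Testing the interaction corresponds to adding a term β_3 (TA_ij × Difficulty_ij) and checking whether β_3 differs significantly from zero.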

On the duration of the revision phase
The mean duration of the revision phase (n = 50) was 239.49 s, median 223.11 s and SD 168.59 s. The minimum duration was 6.19 s, whereas the maximum was 627.97 s. The distribution of data was slightly right-skewed. A linear mixed model comprising two fixed effects (TA, ST difficulty) and one random effect (participant) was applied. The results showed that the fixed effect of TA was not significant (t = 1.696, p = 0.097 >0.05); the fixed effect of ST difficulty was significant (t = 2.426, p = 0.020 <0.05); the interaction between the two predictor variables had no significant influence (p = 0.693 >0.05) on the duration of the revision phase. In other words, as the ST difficulty level increased, the duration of the revision phase also increased. The best model estimated that the duration of the revision phase for difficult texts would be 89.97 s longer than that for easier texts and the standard error was 39.38 s; the duration of the revision phase under the TA condition would be 65.30 s longer than under the silent condition, and the standard error was 39.38 s.
As to random effects, the residual variance (that is, the variability left unexplained by the fixed effects in the model), which in this case was 22,861, was much larger than the participant variance. This means that other factors were affecting the duration of the revision phase, such as a participant's translation style, time pressure, fatigue and even hunger.

Effects of thinking aloud on processing the STs and TTs
The exploration of the effects of TA on translation needs to take cognitive effort into account. In Translog data, cognitive effort can be measured by fixation count and fixation duration.
Because, text-wise, translation consists of comprehending the ST and producing the TT, we need to study the effects of TA on processing the STs and TTs according to fixation count and fixation duration. In the TPR-DB tables, "FixS and FixT are the number of fixations on the source token(s) and on the target token(s), while TrtS and TrtT represent the total reading time, i.e. the sum of all fixation durations on the source and target text respectively" (Carl et al., 2016, p. 21).
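As a rough illustration of how these four measures relate to raw fixation data, the sketch below aggregates a list of fixations into text-level counts and total reading times. The input format (a window label plus a duration in milliseconds) and the function name are assumptions made for this example; the TPR-DB itself computes these features per token or segment from its own tables.

```python
from typing import Dict, Iterable, Tuple


def reading_effort(fixations: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate fixations into FixS/FixT (counts) and TrtS/TrtT (total ms).

    Each fixation is (window, duration_ms), where window is "source" or "target".
    """
    totals: Dict[str, float] = {"FixS": 0, "FixT": 0, "TrtS": 0.0, "TrtT": 0.0}
    for window, duration_ms in fixations:
        if window == "source":
            totals["FixS"] += 1
            totals["TrtS"] += duration_ms
        elif window == "target":
            totals["FixT"] += 1
            totals["TrtT"] += duration_ms
        else:
            raise ValueError(f"unknown window: {window!r}")
    return totals
```

Comparing these totals between the silent and TA sessions of the same participant gives the kind of contrast analysed in this section.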
We replayed the data collected with Translog-II and found that the data from two participants contained few fixations; we therefore excluded them (i.e., eight translations by the two participants) from our statistical analysis. […] In other words, the participants spent more time looking at the TT under the TA condition than under the silent condition; when working on a difficult ST, the participants had more fixations on and spent more time looking at the TT compared with working on an easier ST. The findings that participants spent more time looking at both the ST and TT under the TA condition than under the silent condition are also in line with previous findings that TA would prolong the translation duration by about 30% compared to the silent condition.

Cross-check with the adapted NASA Task Load Index
The adapted NASA-TLX includes four subscales and represents the participant's subjective evaluation of the translation difficulty level of a passage. Descriptive statistical analysis of NASA-TLX scores showed that the trimmed mean (TM) of translations under the TA condition (TM = 5.34) was greater than that under the silent condition (TM = 4.76) and that the trimmed mean of translations of two more difficult STs (TM = 5.5) was greater than that of two easier STs (TM = 4.6).
A linear mixed model comprising two fixed effects (TA, ST difficulty) and one random effect (participant) was applied. Both fixed effects (t = 4.139 for TA condition, p = 0.0001 <0.01; t = 5.551 for ST difficulty, p = 2.17e-06 <0.01) were found to be significant on NASA-TLX scores. The interaction between TA and ST difficulty was significant (p = 0.029 <0.05).
In other words, TA had a significant influence on the perceived translation difficulty level. The NASA-TLX score was estimated by the best model to increase by 0.947, from 4.124 under the silent condition to 5.071 under the TA condition. In addition, the NASA-TLX score for translating difficult texts would be 1.270 higher than for translating easier texts. TA would reduce the effect of ST difficulty on the NASA-TLX score (estimate for the TA × ST difficulty interaction = −0.732); that is, when the task changes from translating an easy ST under the silent condition to translating a difficult ST under the TA condition, the NASA-TLX score would increase by 1.485 (i.e., 0.947 + 1.270 − 0.732). In short, the effects of TA on translation duration and on the perceived translation difficulty level were consistent. Therefore, TA increased the cognitive effort in translation.
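The arithmetic behind these estimates can be made explicit with a small sketch. It assumes treatment (dummy) coding, with 4.124 as the model estimate for an easy ST under the silent condition; the function name and the coding scheme are our assumptions, while the four coefficients are those reported above.

```python
# Fixed-effect estimates as reported in the text.
BASELINE = 4.124            # assumed: easy ST, silent condition
TA_EFFECT = 0.947           # change when thinking aloud
DIFFICULTY_EFFECT = 1.270   # change when the ST is difficult
INTERACTION = -0.732        # TA x ST difficulty


def predicted_tlx(ta, difficult):
    """Predicted NASA-TLX score; ta and difficult are 0 or 1 (assumed coding)."""
    return (BASELINE
            + TA_EFFECT * ta
            + DIFFICULTY_EFFECT * difficult
            + INTERACTION * ta * difficult)


silent_easy = predicted_tlx(0, 0)
ta_easy = predicted_tlx(1, 0)
silent_difficult = predicted_tlx(0, 1)
ta_difficult = predicted_tlx(1, 1)
total_increase = ta_difficult - silent_easy
```

Under these assumptions, the predicted score rises from 4.124 (easy ST, silent) to 5.071 (easy ST, TA), and the combined change from the easy/silent cell to the difficult/TA cell is 1.485, matching the sum given in the text.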

Cross-check with the post-experiment interview
When a participant finished their two translation sessions, they were interviewed in Chinese and asked three questions. In answer to the first question about whether they felt comfortable with TA, almost all of them said that their thoughts were faster than their verbalizations and that TA interfered with their translation speed. This was consistent with our statistical finding that TA had a significant influence on translation duration.
In response to the second question, about the influence of TA on their translation process, the participants' answers varied. Some reported that TA helped them check whether the translation sounded natural and make mindful decisions about which words to use. Others responded that TA tended to dampen sparks of ideas and insights and to increase their frustration. Yet others replied that TA interfered with their translating at the beginning, but that they then got used to it. Two participants said TA had not influenced their translation process.
The third question concerned the impact of TA on translation quality, and here the participants had differing opinions, ranging from positive to negative to no influence at all. Some believed that TA had no influence on translation quality when they were working on a difficult ST. The results from our statistical analysis of translation quality scores showed that TA had no influence on the quality of the translation of difficult STs (p = 0.957 > 0.05), but lowered the quality of the translation of easier STs (p = 0.013 < 0.05). This was, to some extent, consistent with Bowles and Leow's (2005) finding that verbalization did not affect text comprehension or written production significantly, although metalinguistic verbalization appeared to cause a significant decrease in text comprehension as opposed to nonmetalinguistic verbalization.

Conclusion
This study investigated the effects of TA on cognitive effort in translation according to time on task, the duration of different translation phases (i.e., orientation, draft, revision), cognitive effort involved in processing the STs and TTs, and translation difficulty as a function of ST difficulty level. The results were as follows: (1) TA has a significant influence on translation duration, and the duration under the TA condition is about 30% longer than that under the silent condition.
(2) The effects of TA on cognitive effort vary as a function of ST difficulty: as the level of ST difficulty increases, the difference in translation duration between the silent condition and the TA condition decreases. (3) TA increases the duration of the draft phase, and thus the cognitive effort expended in it, but its effects on the duration of the orientation phase and the revision phase are not significant. (4) TA increases the cognitive effort expended on ST processing and TT processing, as measured by fixation duration. (5) TA significantly increases the perceived level of translation difficulty as measured by NASA-TLX. (6) The participants' post-experiment interviews indicate that the effects of TA on the translation process may differ from one participant to another. (7) TA has no influence on the quality of the translation of difficult texts, but slightly reduces the quality of the translation of easier texts.
This study did not take into account such factors as the directionality of translation and translation expertise (novices vs. professionals), both of which probably affect the results (see Jakobsen, 2003); we leave these factors to future research. It would seem that, at this stage, sweeping generalizations such as "TA changes the course or structure of cognitive processes" are neither helpful nor convincing, for there are many observed and latent variables involved. We recommend investigating the specific effects of TA on the translation process, for example, whether TA makes participants "reluctant to make large-scale lexical changes, like omissions or additions" (Jääskeläinen, 2011, p. 20), so that researchers can avoid those "disturbing" factors through careful research design and, as a result, elicit valid data.