Methodological considerations for survey research: Validity, reliability, and quantitative analysis

As translation and interpreting studies continue to develop cognitive theories of translator and interpreter behavior and processing, there has been increased emphasis on research methods and data collection methodologies to glean new insights into the translation process. This article presents a critical review of survey research methods in Cognitive Translation Studies and argues for their inclusion as a means of better understanding translator and interpreter attitudes, behaviors, perceptions, and values. The article begins with a reflection on measurement and the need for alignment with theoretical frameworks and constructs; then it reviews important considerations when developing theoretically-grounded, empirically-based survey instruments, namely, validity, reliability, measurement invariance, and quantitative analysis. The article concludes with a call for additional methodological reflection on developing and using survey instruments.

Consequently, the history of Cognitive Translation Studies is replete with techniques to procure numerical data that reflect various cognitive processes, and many authors have outlined the history of innovation in such research methods (e.g., Alves, 2015; Alves & Hurtado Albir, 2017; Jääskeläinen, 2011; Muñoz Martín, 2017; Shreve & Angelone, 2010b). These histories generally trace the first phase of cognitive studies to the mid-1980s and the predominance of descriptive research based on data collected with think-aloud protocols (TAPs), a method drawn from the field of cognitive psychology (Ericsson & Simon, 1984). Over time, concerns emerged regarding ecological validity, the limits of concurrent and retrospective verbalizations, and the lack of rigorous methodological implementation that might hinder the utility of TAPs (e.g., Bernardini, 2001; Jääskeläinen, 2010; Li, 2004). In response, a second phase emerged that emphasized process-oriented research through keystroke logging, screen recording, and triangulation of multiple research methods (Alves, 2003; Jakobsen, 2003). Methodological innovation in the form of eye-tracking and pupillometry later ushered in a third phase marked by these new methods to provide reflective measures of cognitive processes (Hvelplund, 2014; O'Brien, 2009). Therefore, the overarching history is often presented as a series of innovations in data collection methods and related technology.
If one hopes to identify a current, fourth phase in the field, it might be best characterized by ongoing methodological innovation and triangulation (see Alves & Hurtado Albir, 2017). However, the apparent consensus about the previous three phases can also obscure the diversity of methods, the interdisciplinarity, and the overlapping adoption, which have always characterized Cognitive Translation Studies (Alvstad et al., 2011; O'Brien, 2013). Additional data collection methods – corpus analysis, imaging technologies (e.g., electroencephalography [EEG], functional magnetic resonance imaging [fMRI]), and other psychophysiological measures (e.g., heart rate, blood pressure, stress hormones) – suffice as examples to demonstrate the available range of quantitative methods. These methods have been further augmented by qualitative approaches (Risku, 2014) and triangulation that combines various forms of evidence.
Concomitant with the recent flourishing of innovation in methods is a renewed emphasis on theory building and more formal definition of constructs (e.g., House, 2013; Muñoz Martín, 2016; Shreve & Angelone, 2010b). Perhaps with further hindsight a future generation of scholars will condense the first three phases of research in Cognitive Translation Studies into one longer era of method-driven innovation (in which TAPs, keystroke-logging software, and eye-tracking hardware drove changes in empirical research) and a currently emerging second era including greater attention to theory building, consolidation, and testing. One such attempt to identify broad theoretical paradigms is Muñoz Martín's (2017) distinction between computational translatology and cognitive translatology. The latter emphasizes that cognition is embodied, embedded, enactive, extended, and affective (4EA; Clark, 1996) and recognizes the role of context and situated cognition (Pöchhacker, 2005; see also Muñoz Martín, 2016).
Mellinger, C. D., & Hanson, T. A. (2020). Methodological considerations for survey research: Validity, reliability, and quantitative analysis. Linguistica Antverpiensia, New Series: Themes in Translation Studies, 19, 172–190.
Notably lacking in this history of methodological innovation and theoretical development is the rigorous use of psychometric testing as a supplement to other measurement techniques. Whereas surveys do appear in the published literature, they have not played a prominent role in theory development and testing, despite their utility in such related fields as psychology and communication studies (Boyle et al., 2015; Groth-Marnat & Wright, 2016; Rubin et al., 2011). Measures of cognition and related constructs are necessarily proxies that presume a high degree of correlation between an assumed, underlying construct and an observable phenomenon.
Therefore, constructs in Cognitive Translation Studies can be measured by responses to questionnaires in addition to the other common data collection methods outlined above.
There have been occasional calls for the development of valid and reliable questionnaires (e.g., Alves & Hurtado Albir, 2017), and some discipline-specific scales are available (e.g., Lee, 2014). However, the traditional Likert-type scale has not received the same rigorous treatment and wide application as other methods, such as eye-tracking and keystroke logging. Therefore, the present study provides a critical review of survey instruments as a theoretically-grounded measure that can help with understanding the various traits and characteristics of translators and interpreters, on the one hand, and the users of language services, on the other. This review focuses on validity as an explicit link between theory and survey instrument development, the importance of establishing reliability and measurement invariance, and the analysis of quantitative survey data. The overall aim is to link theoretical and methodological work from survey design and statistics to cognitive research on translation and interpreting (T&I).

Measurement and Likert-type scales
Surveys can be used to assess latent constructs, which are well-defined, theory-based concepts that yield testable hypotheses. Whereas any measurement technique is susceptible to misuse, survey instruments may be particularly vulnerable because they can be created, circulated, and analyzed with relative ease. Surveys (and especially online surveys) have the additional allure of a potentially larger sample size that can span multiple geographic areas and reach a diverse and scattered sample of respondents (Mellinger, 2015). However, poorly designed measurement scales can invalidate statistical analysis, leading to errors in inference and implications of empirical research. This issue is compounded by a dearth of critical reflection on survey methods in translation and interpreting studies. Therefore, this section briefly reviews the philosophy of measurement and the specific format and construction of Likert-type scales¹ that, we argue, can be of use in Cognitive Translation Studies.
The act of measurement presupposes a theoretical framework, and Borsboom (2005) advocates that psychometrics – specifically, latent variable analysis – demands ontological realism of attributes in order to justify the effort to measure them. This philosophical stance is rarely explained directly in discussions of research methods, but it is implicitly embraced by many empirical scholars. Consider, for example, House's (2013) question of whether observable behavior, such as keystroke logging, is truly informative about unobservable cognitive processes. Similar questions have been raised about the effectiveness of TAPs, fMRIs, and every available measurement tool, and also about whether the observed data correlate with cognitive processes or other attributes of the individual. However, such critiques implicitly assume the existence of underlying cognitive and/or affective phenomena in theorizing about their nature and in attempting to describe valid and reliable means of measuring them for use in quantitative analysis.
The key issue for any measurement is whether the manifest, observable variables are useful and adequate proxies for the underlying phenomenon of interest. For example, much of the eye-tracking research rests on the eye-mind assumption (Just & Carpenter, 1980; see also Hvelplund, 2014, 2017) to justify the use of specific eye movements and visual attention as indicators of cognitive processes. The use of surveys similarly presupposes the existence of an underlying construct and the correlation of the measurement with its degree or intensity. For instance, Angelelli's (2004) Interpreter Interpersonal Role Inventory (IPRI) measure makes two contentions: first, that visibility is a meaningful and stable construct; second, that the IPRI survey is able to distinguish different levels of that construct among respondents. The first claim is one of ontology, while the second is one of validity.
By advocating ontological realism, we follow Borsboom (2005) by considering solely reflective surveys – in which an underlying attribute is assumed to be the source of variation in responses – while omitting consideration of formative surveys (cf. Edwards & Bagozzi, 2000). The difference might be most easily illustrated in reception studies – for example, satisfaction with subtitles. A formative construct (also sometimes referred to as an emergent construct or an index) might use items concerning the size, color, placement, and pacing of text on the screen; satisfaction on each of these individual aspects would be summed to a total satisfaction score. In contrast, a reflective scale would consider the implications of satisfaction and possibly include items about enjoyment, understanding, and/or intention to view more subtitled content.
Reflective latent variables are favored for several reasons. First, the mathematics of factor analysis relies on the assumption that the observed items covary due to their joint causation by the underlying construct. Second, an ongoing debate among philosophers, statisticians, and applied researchers across disciplines questions the distinction, value, and even the legitimacy of formative measurement scales. Wilcox et al. (2008) review this debate and conclude that formative measures are problematic when used for measuring latent variables.² Finally, almost all scales used in psychology and social science research are reflective scales (Bollen, 2002). Therefore, for statistical, theoretical, and practical reasons, theory-driven research in Cognitive Translation Studies should favor reflective measurement scales by first defining the latent trait to be studied and then considering which observable items will reflect that latent attribute.
A fuller discussion of the ontology and epistemology of surveys lies beyond the scope of this article, but these issues deserve at least brief acknowledgement before proceeding to the practical problems of survey design. The agenda of cognitive translation scholars with an interest in surveys should begin with theoretical matters and the operationalization of discipline-specific constructs as well as the recognition and reuse of extant measures and constructs from neighboring disciplines. Then, scale development can proceed with item writing, pilot testing, and factor analysis to provide a sound basis for applied research.
One example of the need for the alignment of philosophy, definition of constructs, and practical measurement is the concept of translation competence. Multiple approaches exist for operationalizing competence, with studies having been conducted in an effort to identify various subcompetences of the translation task (e.g., Hurtado Albir, 2017) and to understand the extent to which competence has been acquired by students or novices. More recently, a questionnaire has been developed to examine competence using self-report data (Schaeffer et al., 2020). These types of instruments hinge on the existence of an underlying construct that can be measured. However, recent scholarship has called into question the utility of competence as a theoretical construct that is grounded in the extant literature in psychology (e.g., Shreve et al., 2018). The potential disconnect between instrument development and the theoretical status of the underlying construct may raise questions about the construct validity of these instruments. Moreover, this lack of alignment demonstrates the iterative nature of research and the imperative for continuous refinement, with research studies serving as the foundation for theory development, which can then be examined through empirical work. As debate and empirical research continue, a pressing issue will be harmony among theory, construct definitions, and measurement tools.
Surveys can take many forms, but this review concentrates on Likert-type scales as a means to measure attitudes (Likert, 1932). The format of such a scale consists of individual items in the form of statements (not questions) to which respondents mark their level of agreement or disagreement. The responses are quantified (including any necessary reverse coding) and then summed to yield a respondent's score. Not all surveys that use a multiple-choice response format are properly called Likert-type surveys. That designation should be reserved for any survey that is conceived as a unified instrument to reflectively measure a construct with the intention that the item scores be summed for analysis. Likert-type scales can be used in a wide range of applications to measure latent constructs that indicate attitudes, knowledge, perceptions, and values (Vogt & Johnson, 2016).
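The scoring procedure described above (quantifying responses, reverse coding where necessary, and summing) can be illustrated with a minimal sketch in Python. The function name `score_likert`, the five-point response format, and the example responses are hypothetical and serve only as an illustration; they are not drawn from any published instrument.

```python
def score_likert(responses, reverse_items=(), n_points=5):
    """Sum one respondent's item responses after reverse coding.

    responses: list of ints in 1..n_points, one per item.
    reverse_items: zero-based indices of negatively worded items.
    """
    total = 0
    for index, value in enumerate(responses):
        if not 1 <= value <= n_points:
            raise ValueError(f"item {index}: response {value} outside 1..{n_points}")
        # Reverse coding maps 1 -> n_points, 2 -> n_points - 1, and so on.
        total += (n_points + 1 - value) if index in reverse_items else value
    return total


# Hypothetical respondent: four items, the third of which is negatively worded.
example_score = score_likert([4, 5, 2, 4], reverse_items={2})
print(example_score)  # 4 + 5 + (6 - 2) + 4 = 17
```

The summed score, rather than any individual item response, is the quantity intended for subsequent statistical analysis.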
Cognitive translation scholars have previously used Likert-type scales to explore the relationships among various constructs and traits in the context of translation and interpreting. Two of the many available examples are personality traits (Hubscher-Davidson, 2009) and self-efficacy (e.g., Bolaños-Medina, 2014; Jiménez Ivars et al., 2014; Lee, 2014, 2018). There are also examples of scales developed to measure constructs directly related to issues of language, translation, and interpreting, such as interpreter visibility (Angelelli, 2004) and language learning motivation (Csizér & Dörnyei, 2005). Unfortunately, other published literature does not always exhibit the same level of rigor demonstrated in these studies. Common modeling and statistical errors can lead to confounded research instruments and undermine the researcher's ability to draw conclusions.
Increasing the utility and legitimacy of survey scales in Cognitive Translation Studies requires recognition that the purpose of a scale is to quantify latent constructs. The true relationships among theoretical constructs cannot be directly observed, but the statistical relationships among the measured variables from survey scales can be computed to test hypotheses. The power of the statistical tests and the legitimacy of the conclusions rely on the presumption of both validity and reliability of a reflective survey scale; unstable measurement contributes to smaller estimated effect sizes through attenuation bias and a corresponding increase in Type II statistical error (see Mellinger & Hanson, 2017). Given the philosophical importance and practical implications of survey quality, empirical researchers in Cognitive Translation Studies need to follow best practices in survey design and implementation to ensure accurate measurement. In order to contribute toward that end, the remainder of this article proceeds by discussing issues of validity, reliability, and quantitative analysis of surveys. These methodological discussions are then examined in light of cognitive translation and interpreting studies as a call for their inclusion in the methodological repertoire of T&I researchers interested in cognition.
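The attenuation bias mentioned above can be made concrete with a small sketch. It assumes the classical test theory result usually attributed to Spearman, in which the expected observed correlation equals the true correlation multiplied by the square root of the product of the two scales' reliabilities. The function name and the numerical values are hypothetical.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation under Spearman's attenuation formula."""
    for name, value in (("reliability_x", reliability_x),
                        ("reliability_y", reliability_y)):
        if not 0 < value <= 1:
            raise ValueError(f"{name} must lie in (0, 1]")
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical values: a true correlation of 0.50 between two constructs,
# measured by scales with reliabilities of 0.64 and 0.81.
observed = attenuated_correlation(0.50, 0.64, 0.81)
print(round(observed, 2))  # 0.5 * sqrt(0.64 * 0.81) = 0.5 * 0.72 = 0.36
```

Under these assumed values, the estimated effect shrinks by more than a quarter, which illustrates why unreliable measurement reduces statistical power.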

Validity
Validity can be described as the property of a scale to produce a measurement that accurately reflects an underlying construct. In other words, the scale measures what it intends to measure (Litwin, 1995). Validity can also be thought of as alignment between a measure and theoretical definitions, relationships, and predictions (Messick, 1995). Therefore, validity is the primary concern for any scale development and for the evaluation of scales for reuse (AERA, 1999). In addition to creating and validating scales specifically for use in Cognitive Translation Studies, there is an opportunity to contribute to other social sciences that acknowledge the importance and influence of translation in adapting scales for multiple languages (Hambleton & Patsula, 1998; Smith, 2010). This section discusses the philosophy and terminology of validity while providing several examples from translation and interpreting studies.
Validity is a holistic evaluation that a scale is appropriate, useful, and meaningful in measuring a construct (Kane, 1994) and has traditionally been conceived of in three broad categories: content, construct, and criterion validity (Cronbach & Meehl, 1955). However, modern scholarship stresses that validity is a single property of a test (e.g., AERA, 1999). In particular, Messick (1995) proposed consolidating all validity under the umbrella of construct validity while also describing six aspects to be considered in evaluating validity, notably including the impact of a survey's use on respondents. In a succinct definition, Borsboom (2005) argues that a test is valid if and only if it measures an existing, underlying attribute that causes observable variation in the measurement outcome. Recent standards stress the unitary nature of validity, but for the applied researcher the traditional tripartite division (i.e., content, construct, and criterion validity) can still provide a useful scheme for accumulating and describing evidence in the process of validating a scale (e.g., Goodwin & Leech, 2003). Indeed, discussions of these so-called types of validity can be found in handbooks on T&I research methods and research studies as forms of evidence supporting claims of measurement validity.
Content validity describes the extent to which a survey covers all aspects of a construct and also subsumes the more superficial standard of face validity, which is the extent to which the items in a survey appear relevant to a reader familiar with the construct being measured (DeVellis, 2017). Therefore, content validity relies on theory to describe the construct to be measured. In particular, theory provides insight into the relevant wording, concepts, and dimensionality of a construct. For example, Lee (2014) bases the Interpreting Self-Efficacy scale on social cognitive theory and on related scales that define self-efficacy as encompassing three factors: self-confidence, self-regulatory efficacy, and preference for task difficulty.
Content and construct validity are established, in part, by including questions that align with each of these dimensions. Moving forward, an important task of scholars in Cognitive Translation Studies is to develop and probe theories for construct definitions and their associated dimensionality to create and test measurement scales that provide valid inferences.
While there is no statistical test for content validity, correlation coefficients are often employed as partial evidence for the other two traditional types of validity: criterion and construct validity. Criterion validity considers the extent to which a scale aligns with an observable trait and encompasses the subcategories of predictive and concurrent validity. Examples of predictive validity often arise in studies related to student performance, screening, and proficiency tests (e.g., Bontempo & Napier, 2011; Lee, 2018).
Meanwhile, construct validity involves correlations with other latent variables in a nomological network (Cronbach & Meehl, 1955). Validating a scale for cognitive translation theories involves collecting evidence of correlations with both manifest variables and a web of relationships with other constructs. Ongoing research on default translation (Halverson, 2019) illustrates these multiple evidential processes. While product-oriented research examines the output of the translation process, other research considers the psychological and cognitive processes that might lead to the existence of a default translation. Therefore, both direct observation and theoretical constructs are considered, which is an example of the types of multiple validation techniques needed for surveys.
Validity is a characteristic of a scale in its particular use and context (Chan, 2014). To illustrate this point with an admittedly extreme example, a scale to measure introversion might be well-conceived and valid for that purpose, but that same scale would clearly be invalid and useless as a measure of translation competence. Adaptation and borrowing of scales from adjacent disciplines is useful, but the practice demands reflection on the instrument's validity and theoretical alignment if the underlying construct is not identical. To date, survey development in Cognitive Translation and Interpreting Studies (CTIS) has been too ad hoc and has lacked sufficient theoretical motivation (Muñoz Martín, 2017; Shreve & Angelone, 2010b). Some exceptions do exist (e.g., Angelelli, 2004; Csizér & Dörnyei, 2005; Lee, 2014), but the advancement of the discipline requires explicit alignment of theory and survey scales to provide valid measurement and to aid replication.

Reliability and measurement invariance
The reliability of a survey instrument refers to its ability to produce consistent and reproducible results. Variation across repeated measurements of the same underlying level of a construct is presumed to arise from measurement error (Nunnally, 1978), and the results of a reliable scale should be stable across time (test-retest reliability), items (internal reliability), and groups (measurement invariance).³ The purpose of establishing reliability is to separate variability due to measurement error from true differences attributable to the underlying construct. Consistency across multiple measurements indicates lower error in the measurement tool and improves the power and interpretation of subsequent statistical analysis. Moreover, reliability allows results obtained from survey instruments to be compared more confidently across research studies, in this way facilitating theory building through replication.
Perhaps the most widely-used method for reporting reliability is Cronbach's alpha, which is a measure of internal reliability based on the proportion of variance that can be attributed to a latent variable (DeVellis, 2017). Alpha is appropriate for latent variable analysis (though not for formative scales; see Streiner, 2003), and the statistic is often described as the average of all split-half reliabilities (Warrens, 2015). Common lore among applied researchers is that Nunnally (1978) justified 0.70 as the standard level for acceptability. However, as with any statistical rule of thumb, this figure is only one benchmark, and the evaluation of reliability should consider multiple factors in a more complete assessment of reliability, including the number of items in the scale and its intended use (Cortina, 1993; Peters, 2014).
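For readers who wish to see the computation, the following is a minimal sketch of Cronbach's alpha using the standard variance-based formula, alpha = k / (k - 1) × (1 - sum of item variances / variance of summed scores), where k is the number of items. The function name and the tiny data set are hypothetical, and the sketch assumes complete data that have already been reverse coded.

```python
from statistics import variance


def cronbach_alpha(data):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(data[0])
    if n_items < 2 or len(data) < 2:
        raise ValueError("need at least two items and two respondents")
    # Variance of each item across respondents (columns of the data matrix).
    item_variance_sum = sum(variance(column) for column in zip(*data))
    # Variance of the summed scale scores across respondents.
    total_variance = variance([sum(row) for row in data])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: four respondents, two items.
responses = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(round(cronbach_alpha(responses), 2))  # 2 * (1 - (10/3) / (16/3)) = 0.75
```

Because the coefficient depends on the number of items and on the sample at hand, the value computed this way describes one administration of a scale and should be interpreted alongside the benchmark discussion above.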
For several reasons, the property of reliability cannot be fully established by reporting a single statistic (e.g., Cronbach's alpha) in the initial development of a scale. First, any computed reliability coefficient is a function of the sample data and not an established quality of the survey instrument itself, so researchers must report Cronbach's alpha every time that the survey is administered (DeVellis, 2017). A lack of survey instruments in the field of T&I research makes this somewhat uncommon to date. However, there are examples in the extant literature. For example, Mellinger and Hanson (2018) reported alpha coefficients from published examples in previous studies along with the figures from their sample as part of their methodological discussion of several survey instruments. In addition, a confidence interval for Cronbach's alpha can be reported to provide further information about the likely range of the true value (Mellinger & Hanson, 2017).
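One way to obtain such an interval is a percentile bootstrap over respondents, sketched below. This is an illustrative substitute and not necessarily the procedure described in the sources cited above; the function names, the number of resamples, the seed, and the data are all hypothetical.

```python
import random
from statistics import variance


def cronbach_alpha(data):
    """Cronbach's alpha for rows of respondents and columns of items."""
    n_items = len(data[0])
    item_variance_sum = sum(variance(column) for column in zip(*data))
    total_variance = variance([sum(row) for row in data])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


def bootstrap_alpha_ci(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for Cronbach's alpha."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    estimates = []
    for _ in range(n_resamples):
        # Resample respondents (rows) with replacement.
        resample = [rng.choice(data) for _ in data]
        try:
            estimates.append(cronbach_alpha(resample))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with no score variance
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    tail = (1 - level) / 2
    last = len(estimates) - 1
    return estimates[int(tail * last)], estimates[int((1 - tail) * last)]


# Hypothetical responses: eight respondents, four items on a five-point format.
responses = [
    [4, 5, 4, 4], [2, 2, 3, 2], [5, 5, 4, 5], [3, 3, 3, 4],
    [1, 2, 2, 1], [4, 4, 5, 4], [2, 3, 2, 2], [5, 4, 5, 5],
]
lower, upper = bootstrap_alpha_ci(responses)
print(f"alpha = {cronbach_alpha(responses):.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

With samples as small as this hypothetical one, the interval is likely to be wide, which is itself useful information when reporting reliability.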
Yet, Cronbach's alpha has some notable statistical shortcomings, including variations due to survey length, inter-item correlation structure, and sample characteristics (Agbo, 2010). For this reason, additional techniques should be coupled with reporting the single statistic. The assessment of reliability can also include item analysis, which could be informal assessment of language, leave-one-out analysis using Cronbach's alpha, or item response theory employing item characteristic curves. Alternative measures, such as omega, have also been proposed (Dunn et al., 2014). Software implementation and widespread adoption often lag statistical innovations, which reinforces the need to remain current with one's reading and training in quantitative methods and/or collaborate with statisticians and psychometricians in conducting empirical research.
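The leave-one-out analysis mentioned above can be sketched as follows: alpha is recomputed with each item removed in turn, and a marked increase flags an item that may not fit the scale. The function names and the data are hypothetical, and the sketch assumes at least three items and complete, reverse-coded data.

```python
from statistics import variance


def cronbach_alpha(data):
    """Cronbach's alpha for rows of respondents and columns of items."""
    n_items = len(data[0])
    item_variance_sum = sum(variance(column) for column in zip(*data))
    total_variance = variance([sum(row) for row in data])
    return n_items / (n_items - 1) * (1 - item_variance_sum / total_variance)


def alpha_if_item_deleted(data):
    """Return alpha recomputed with each item dropped, in item order."""
    n_items = len(data[0])
    if n_items < 3:
        raise ValueError("need at least three items to drop one")
    results = []
    for dropped in range(n_items):
        reduced = [
            [value for index, value in enumerate(row) if index != dropped]
            for row in data
        ]
        results.append(cronbach_alpha(reduced))
    return results


# Hypothetical responses: four respondents, three items.
responses = [[1, 2, 3], [2, 1, 1], [3, 4, 2], [4, 3, 2]]
print(round(cronbach_alpha(responses), 2))  # full scale: 0.5
print(round(alpha_if_item_deleted(responses)[2], 2))  # without the third item: 0.75
```

In this contrived example, removing the third item raises alpha from 0.50 to 0.75, the kind of pattern that would prompt a closer look at that item's wording and theoretical fit rather than its automatic deletion.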
Internal reliability considers only the relationship among responses to the items of a scale, but nearly every aspect of survey design has been examined for the possibility of both the introduction of bias and the influence on data quality (Choi & Pak, 2005; DeCastellarnau, 2018). Brown (1996) categorizes many of the possible influences into five sources: (1) the test itself, (2) scoring procedures, (3) administration procedures, (4) the test environment, and (5) the individual examinees. Other factors that could affect reliability include respondent motivation and the thoroughness and comprehensibility of the instructions. The diversity of possible influences implies the need for thoughtful choices in all aspects of survey design and administration. The discussion below highlights three issues that are especially common in Cognitive Translation Studies: online administration, translation of surveys, and cross-cultural differences.
First, online surveys are a common modality to conveniently reach a larger sample of the geographically-dispersed population of professional translators and interpreters (Mellinger, 2015). However, by using this data collection technique the researcher relinquishes control of the testing environment and cannot answer any questions, to name just two potential threats to reliability. Whereas much of the commentary related to online surveys has focused on data security and ethics (e.g., Buchanan & Hvizdak, 2009), recent years have seen an increased interest in the psychometric properties of online surveys. Generally, results have shown that online administration of a previously-developed survey does not damage its internal reliability (e.g., Zlomke, 2009). Still, researchers should report how the data were collected, explain whether the survey had been developed or validated for that modality, and describe any potential problems with reliability as a result of the data collection method.
A second threat to reliability is the possible effect of translation on survey responses (Harkness et al., 2004; Harkness et al., 2010). For instance, lexical choices that increase ambiguity or alter the valence of the items can affect responses and reliability, whereas mistranslations may undermine the researcher's ability to measure any potential underlying construct. These challenges can also manifest when adapting materials into signed languages (Graybill et al., 2010).
A third issue that can influence reliability is data collection across different cultural groups (e.g., McGorry, 2000). Reliability can be degraded due to a lack of familiarity with the format of Likert-type scales and cultural bias. For instance, Flaskerud (2012) documents the influence of a respondent's literacy on survey data, and Lee et al. (2002) reveal cross-cultural differences in respondents' willingness to select extreme answers at the endpoints of the scale. Translation and interpreting studies researchers are typically attuned to the challenges of working with multiple, distinct groups; however, explicit reflection on this topic is often taken up by those outside of the discipline. Consequently, this aspect of T&I research methods may be an area worth greater attention as the field continues to evolve.
If a scale is tested and found to behave similarly across a range of samples, it can be said to possess measurement invariance. More formally, measurement invariance concerns the factor structure (configural invariance), factor loadings (metric invariance), the item intercepts that permit mean comparisons (scalar invariance), and equality of error variances (strict invariance). Measurement invariance has received less attention than validity and reliability in T&I research; its importance is perhaps more recognized in psychology (e.g., Kankaraš & Moors, 2010; Lubke et al., 2003; Milfont & Fischer, 2010). For example, the Beck Depression Inventory and Children's Depression Inventory both measure depression but for adult and youth populations respectively; meanwhile, IQ testing has a long and troubled history with cross-group comparisons and issues of measurement invariance (Wicherts & Dolan, 2010). Strict measurement invariance is necessary for direct comparisons across groups, and it is a difficult standard for any scale to meet. CTIS naturally involves multicultural, multilinguistic samples. Scales that lack measurement invariance could be interpreted differently across these groups and yield non-comparable results (Coulacoglou & Saklofske, 2017). As the development and use of Likert-type scales expands, establishing measurement invariance will only become more important.
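The four levels of invariance named above are commonly expressed as increasingly restrictive constraints on a multi-group factor model. The notation below is a sketch of that standard formulation; the symbols are assumptions introduced here for illustration and do not appear in the original discussion. For respondent $i$ in group $g$, the vector of item responses $x_{ig}$ is modeled with intercepts $\tau_g$, loadings $\Lambda_g$, latent scores $\xi_{ig}$, and errors $\varepsilon_{ig}$ with covariance matrix $\Theta_g$.

```latex
\begin{align*}
  \text{Measurement model:} \quad
    & x_{ig} = \tau_g + \Lambda_g \xi_{ig} + \varepsilon_{ig},
      \qquad \operatorname{Cov}(\varepsilon_{ig}) = \Theta_g \\
  \text{Configural:} \quad
    & \Lambda_g \text{ has the same pattern of zero and nonzero loadings for all } g \\
  \text{Metric:} \quad
    & \Lambda_g = \Lambda \text{ for all } g \\
  \text{Scalar:} \quad
    & \Lambda_g = \Lambda \text{ and } \tau_g = \tau \text{ for all } g \\
  \text{Strict:} \quad
    & \Lambda_g = \Lambda,\ \tau_g = \tau, \text{ and } \Theta_g = \Theta \text{ for all } g
\end{align*}
```

Each level adds a constraint to the one before it, which is why the stricter forms of invariance are progressively harder for a scale to satisfy.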
Multiple approaches have been developed to deal with the issues of creating reliable and invariant surveys across diverse samples (e.g., King et al., 2004). Because psychometric properties are established, in part, through data collection, the nature of the respondents influences the structure and properties of a survey. Therefore, the measurement provided by a scale can be presumed to be valid and reliable only for respondents who are similar to the original sample used to develop the scale. Larger samples, increased replication, and the adoption of best practices in survey methods will allow for the valid use of scales and their widespread adoption.

Quantitative analysis
Rigor in quantitative analysis in translation and interpreting studies continues to improve in terms of statistical design, analysis, and reporting. Scholars who examine large datasets derived from eye-tracking, keystroke logging, and corpus studies have explored a sophisticated range of quantitative tools (e.g., Balling, 2008; Oakes & Ji, 2012), and general volumes on research methods have further contributed to this trend (e.g., Angelelli & Baer, 2016; Mellinger & Hanson, 2017; O'Brien & Saldanha, 2014). In this section, we highlight three common errors in survey analysis. The first two errors were selected because of their prevalence in reported research, whereas the third error relates to the underlying mathematical structures involved in the analysis of surveys.⁴
One common error in survey analysis is conducting single-item comparisons. Because a Likert-type scale is conceived and constructed as a unified instrument, only the summed scores should be subject to statistical analysis. In particular, comparisons of the means of single items are almost never appropriate (Carifio & Perla, 2007). Reported results must maintain the distinction between single items and scales: individual items can be summarized and described only qualitatively, whereas summed scales are appropriate for statistical testing. Such is the case across disciplines; however, T&I research that draws on survey data has unfortunately, at times, relied on single-item comparisons to draw larger conclusions. Researchers should always be cautious of overgeneralization based on a single test or result, and survey results are no exception. Our intent here is not to single out studies that have conducted single-item analysis; rather, we hope to cast a more critical eye on results from survey research and present ways by which the methods can be improved.
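The distinction between single items and summed scales can be illustrated with a minimal sketch. The responses, the choice of a reverse-coded item, and the function names below are all hypothetical and serve only to show the order of operations: items are first combined into one score per respondent, and only those summed scores enter the statistical comparison (here, Welch's t statistic with its approximate degrees of freedom).

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical 5-point responses to a four-item scale; each row is one respondent.
# The item at index 2 is assumed to be negatively worded and is reverse-coded.
GROUP_A = [[4, 5, 2, 4], [5, 4, 1, 5], [3, 4, 2, 4], [4, 4, 2, 5], [5, 5, 1, 4]]
GROUP_B = [[2, 3, 4, 2], [3, 2, 3, 3], [2, 2, 4, 1], [3, 3, 3, 2], [1, 2, 5, 2]]
REVERSED_ITEMS = {2}
SCALE_MAX = 5


def scale_score(responses, reversed_items=REVERSED_ITEMS, scale_max=SCALE_MAX):
    """Sum the items into a single scale score, reverse-coding where needed."""
    return sum(
        (scale_max + 1 - r) if i in reversed_items else r
        for i, r in enumerate(responses)
    )


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# The unit of analysis is the summed score, not any individual item.
scores_a = [scale_score(r) for r in GROUP_A]
scores_b = [scale_score(r) for r in GROUP_B]
t_stat, df = welch_t(scores_a, scores_b)
print(f"summed scores A={scores_a}, B={scores_b}, t={t_stat:.2f}, df={df:.1f}")
```

The sketch stops at the test statistic; obtaining a p-value requires the t distribution from a statistics package, and a real study would of course need a far larger sample than five respondents per group.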
A second error is the belief that standard parametric analyses cannot be applied to data collected with Likert-type scales. While it is true that some scholars raise concerns about the level of measurement of a survey scale, arguing that nonparametric methods are more appropriate (e.g., Jamieson, 2004), the consensus among statisticians is that treating the summed scales as continuous data and conducting traditional parametric analysis will yield acceptable statistical results (Carifio & Perla, 2008; Norman, 2010). Only in particular cases such as severe departures from normality, one-sided tests, moderate sample sizes, or considerable differences in sample size among groups is nonparametric analysis likely to be required (Harpe, 2015).⁵
A third error that is sometimes made in the use of survey data is using a statistical method that is inappropriate for latent factor analysis. One specific example is the use of principal components analysis, which should be supplanted by exploratory factor analysis in determining the factor structure of scales (Fabrigar et al., 1999). Additional examples are the incorrect use of path analysis, which should be reserved for use with manifest variables, and partial least squares, which is less powerful than factor analysis of latent variables when sufficient sample sizes are collected (Cole & Preacher, 2014; Rönkkö & Evermann, 2013). The primary issue is that applied researchers need to select a statistical model that aligns with the data and the research question, while understanding the relationships among the variables (Edwards & Bagozzi, 2000).
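The difference between principal components analysis and exploratory factor analysis noted above can be made concrete by comparing how each decomposes the covariance (or correlation) matrix of the items. The notation is conventional and is an assumption here rather than the article's own: x is the vector of observed items, xi the latent factors with covariance Phi, Lambda the loadings, and Theta a diagonal matrix of unique variances; W and D hold the eigenvectors and eigenvalues used by principal components.

```latex
% Common factor model (the basis of EFA): items reflect latent factors plus unique variance
\mathbf{x} = \boldsymbol{\Lambda}\boldsymbol{\xi} + \boldsymbol{\varepsilon},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}

% Principal components: components are weighted sums of the items, with no separate error term
\mathbf{c} = \mathbf{W}^{\top}\mathbf{x},
\qquad
\boldsymbol{\Sigma} = \mathbf{W}\mathbf{D}\mathbf{W}^{\top}
```

Because the component model has no counterpart to Theta, it treats all item variance, including measurement error, as variance to be summarized, whereas the factor model attributes only the shared variance to the latent construct.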
Scholars who wish to develop new survey scales must pay particular attention to the proper use of factor analysis, although full treatment of the topic is impossible within the confines of a single article.⁶ Factor analysis is typically divided into exploratory and confirmatory models, although both are nested within the overarching framework of structural equation modeling (SEM), which comprises measurement and structural models and unifies such disparate approaches as path analysis, factor analysis, and item response theory models (Beaujean, 2014). Exploratory factor analysis (EFA) involves a number of decisions, both practical and statistical. The practical decisions include study design, construct definition, and sample selection. The statistical choices consist of selecting a model fitting procedure, identifying the number of factors, determining a rotation methodology, and deciding which items to retain (Fabrigar et al., 1999). As a complement to the selection procedures of EFA, confirmatory factor analysis (CFA) involves comparisons of data with a priori models (Beaujean, 2014). The primary outputs of CFA are model fit indices with which to assess alignment with the theoretical construct. Once measures are determined to fit the hypothesized definitions and to possess acceptable validity and reliability, the measurement model of factor analysis can be combined with a structural model to test hypotheses about the relationships among various constructs and other observed variables.
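One of the statistical choices listed above, identifying the number of factors, can be sketched with parallel analysis, a commonly used technique that compares the eigenvalues of the observed correlation matrix with those obtained from random data of the same dimensions. This is a minimal illustration under stated assumptions and not a procedure prescribed by the article: the data are simulated continuous stand-ins for six items driven by two factors, the reference values are mean eigenvalues rather than a percentile, and all function names are hypothetical. Only the Python standard library is used, so eigenvalues are computed with a small Jacobi rotation routine.

```python
import math
import random


def correlation_matrix(data):
    """Pearson correlations between the columns of data (rows are respondents)."""
    n, p = len(data), len(data[0])
    cols = [[row[j] for row in data] for j in range(p)]
    centered = [[v - sum(c) / n for v in c] for c in cols]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def eigenvalues_symmetric(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                sign = 1.0 if diff >= 0 else -1.0
                # Rotation angle in [-pi/4, pi/4] that zeroes the (p, q) entry
                theta = 0.5 * math.atan2(2 * a[p][q] * sign, abs(diff))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(data, n_sims=50, seed=0):
    """Count leading eigenvalues that exceed the mean eigenvalue from random normal data."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    observed = eigenvalues_symmetric(correlation_matrix(data))
    totals = [0.0] * p
    for _ in range(n_sims):
        noise = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        for j, value in enumerate(eigenvalues_symmetric(correlation_matrix(noise))):
            totals[j] += value
    reference = [t / n_sims for t in totals]
    retained = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:
            break
        retained += 1
    return retained, observed, reference


def simulate_two_factor_data(n=300, seed=1):
    """Hypothetical data: items 1-3 load on one factor, items 4-6 on another."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        f1, f2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append([0.8 * f + 0.6 * rng.gauss(0, 1) for f in (f1, f1, f1, f2, f2, f2)])
    return rows


n_factors, observed, reference = parallel_analysis(simulate_two_factor_data())
print("suggested number of factors:", n_factors)
print("observed eigenvalues:", [round(v, 2) for v in observed])
```

The result only informs the choice of how many factors to extract; model fitting, rotation, and item retention remain separate decisions that would normally be carried out in dedicated statistical software.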
In this section, the discussion has centered on latent factor analysis, which is a subject-focused approach to modeling, in contrast to item response theory (IRT) and the Rasch model, which consider both the subject and the survey items. Both approaches (i.e., latent factor analysis and IRT) have their advocates, strengths, and weaknesses. For instance, IRT is favored in educational testing and in any setting with dichotomous (e.g., right/wrong) responses (Wirth & Edwards, 2007). However, factor analysis is an appropriate tool for initial scale development to measure continuous latent constructs and for application in social scientific research. Furthermore, innovations and developments in statistical methods are ongoing (e.g., van Bork et al., 2019). Methodological and analytical diversity can strengthen a field, and researchers must select statistical methods that are appropriate to their research questions.
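The sense in which the Rasch model considers both the subject and the item can be shown in a few lines. In the standard dichotomous Rasch model, the probability of a correct (or endorsed) response depends only on the difference between a person parameter and an item parameter. The parameter values and names below are hypothetical and chosen purely for illustration.

```python
from math import exp


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct or endorsed response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + exp(-(theta - difficulty)))


# Hypothetical person abilities and item difficulties on the same logit scale
abilities = {"respondent_low": -1.0, "respondent_high": 1.5}
difficulties = {"easy_item": -0.5, "hard_item": 1.0}

# Every person-item pair receives its own response probability
table = {
    (person, item): round(rasch_probability(theta, b), 3)
    for person, theta in abilities.items()
    for item, b in difficulties.items()
}
for (person, item), prob in table.items():
    print(f"{person} on {item}: {prob}")
```

When a person's parameter equals an item's difficulty, the probability is exactly one half; it rises as ability exceeds difficulty and falls as difficulty exceeds ability.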

Conclusion
A wide range of research methods is available in cognitive translation and interpreting studies, with increasingly refined approaches to measurement and triangulation. Having multiple tools available enables researchers to explore constructs and hypotheses that were previously more difficult to observe, with the results now providing insights into cognitive theories of translation and interpreting. The present critical review of survey instruments explains some important aspects of Likert-type scales and suggests their utility in translation and interpreting studies. Several examples from the field are provided to illustrate their potential use, given their ability to examine underlying latent constructs that may inform our understanding of behaviors, attitudes, and perceptions during the translation and interpreting task.
The discussion here has emphasized the creation and analysis of new survey instruments specific to CTIS, given the need to align directly with theory development and testing. We have also addressed how the adoption or adaptation of existing scales from neighboring disciplines can provide researchers with useful sources of measurement. The topic of survey translation has been largely omitted, although translation scholars could play an important role in developing the scholarship on that topic. Over-reliance on back-translation and notions of equivalence are problematic in much of the literature on this topic (e.g., Behr & Shishido, 2016). The perspective here is also limited to that of the researcher, although a substantial body of work exists on the survey response process (Schwarz, 2007). All surveys require consideration of validity, reliability, and rigorous quantitative analysis, which is the motivation for their selection here.
CTIS has developed in parallel with new data collection methods, including TAPs, keystroke logging, eye-tracking, and other innovative technologies. A current shift in emphasis in the field should encourage the development and refinement of theories in tandem with improvements in research methods and cross-discipline collaboration to allow for generalization and the advancement of Cognitive Translation Studies as a rigorous science (House, 2013). Surveys can be one important tool contributing to the definition and use of latent constructs that will develop along with theory and empirical work in the field.
Although we suggest that survey instruments be added to the repertoire of translation process research, we do so in full recognition of some limitations of these instruments. No single research tool is optimal for all measurements or for all studies, and surveys are not without their challenges and detractors. Reid (1990) writes in a narrative style to acknowledge the challenges of designing, translating, and pilot testing surveys and of analyzing the resulting data. Furthermore, Gorard (2010) presents a skeptical view of the use of Likert-type scales for measuring latent constructs. The present article highlights several similar aspects of survey design (philosophical, practical, and quantitative) that need to be considered in the process of creating and adapting scales. In particular, we have addressed and summarized issues related to validity, reliability, and quantitative analysis in order to provide guidance and examples. These three areas are by no means exhaustive, and more work is needed with regard to sampling theory, triangulation, item writing, and adaptation of existing instruments. However, an emphasis on validity, reliability, and quantitative analysis can serve as a foundation for more rigorous research that develops and employs surveys in Cognitive Translation Studies.