The visual world paradigm monitors participants' eye movements in the visual workspace as they listen to or produce spoken language. This paradigm can be used to investigate the online processing of a wide range of psycholinguistic questions, including semantically complex statements such as disjunctive statements.
In a typical eye tracking study using the visual world paradigm, participants' eye movements to objects or pictures in the visual workspace are recorded via an eye tracker as the participant produces or comprehends spoken language describing the concurrent visual world. This paradigm has high versatility, as it can be used in a wide range of populations, including those who cannot read and/or who cannot overtly give their behavioral responses, such as preliterate children, elderly adults, and patients. More importantly, the paradigm is extremely sensitive to fine-grained manipulations of the speech signal, and it can be used to study the online processing of most topics in language comprehension at multiple levels, such as fine-grained acoustic-phonetic features, the properties of words, and linguistic structures. The protocol described in this article illustrates how a typical visual world eye tracking study is conducted, with an example showing how the online processing of some semantically complex statements can be explored with the visual world paradigm.
Spoken language is a fast, ongoing information flow that disappears right away. Experimentally studying this transient, rapidly changing speech signal is a challenge. Eye movements recorded in the visual world paradigm can be used to overcome this challenge. In a typical eye tracking study using the visual world paradigm, participants' eye movements to pictures in a display or to real objects in a visual workspace are monitored as they listen to, or produce, spoken language depicting the contents of the visual world1,2,3,4. The basic logic, or the linking hypothesis, behind this paradigm is that comprehending or planning an utterance will (overtly or covertly) shift participants' visual attention to a certain object in the visual world. This attention shift has a high probability of initiating a saccadic eye movement that brings the attended area into foveal vision. With this paradigm, researchers intend to determine at what temporal point, with respect to some acoustic landmark in the speech signal, a shift in the participant's visual attention occurs, as measured by a saccadic eye movement to an object or a picture in the visual world. When and where saccadic eye movements are launched in relation to the speech signal are then used to infer online language processing. The visual world paradigm can be used to study both spoken language comprehension1,2 and production5,6. This methodological article focuses on comprehension studies. In a comprehension study using the visual world paradigm, participants' eye movements on the visual display are monitored as they listen to spoken utterances about the visual display.
Different eye tracking systems have been designed over the years. The simplest, least expensive, and most portable system is just a normal video camera that records an image of the participant's eyes. Eye movements are then manually coded through frame-by-frame examination of the video recording. However, the sampling rate of such an eye tracker is relatively low, and the coding procedure is time consuming. Thus, contemporary commercial eye tracking systems normally use optical sensors to measure the orientation of the eye in its orbit7,8,9. To understand how a contemporary commercial eye tracking system works, the following points should be considered. First, to correctly measure the direction of foveal vision, an infrared illuminator (normally with a wavelength around 780-880 nm) is placed along or off the optical axis of the camera, making the image of the pupil distinguishably brighter or darker than the surrounding iris. The image of the pupil and/or of the pupil-corneal reflection (normally the first Purkinje image) is then used to calculate the orientation of the eye in its orbit. Second, the gaze location in the visual world is actually contingent not only on the eye orientation with respect to the head but also on the head orientation with respect to the visual world. To accurately infer the point of regard from the eye orientation, the light source and the camera of the eye tracker either are fixed with respect to participants' heads (head-mounted eye trackers) or are fixed with respect to the visual world (table-mounted or remote eye trackers). Third, participants' head orientation must either be fixed with respect to the visual world or be computationally compensated if the head is free to move. When a remote eye tracker is used in a head-free-to-move mode, participants' head position is typically recorded by placing a small sticker on the forehead. The head orientation is then computationally subtracted from the eye orientation to retrieve the gaze location in the visual world. Fourth, a calibration and a validation process are then required to map the orientation of the eye to the point of regard in the visual world. In the calibration process, participants' fixation samples from known target points are recorded to map the raw eye data to gaze position in the visual world. In the validation process, participants are presented with the same target points as in the calibration process. The difference between the computed fixation position from the calibrated results and the actual position of the fixated target in the visual world is then used to judge the accuracy of the calibration. To further reconfirm the accuracy of the mapping process, a drift check is normally applied on each trial, where a single fixation target is presented to participants to measure the difference between the computed fixation position and the actual position of the current target.
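As a concrete illustration of how calibration and validation accuracy is quantified, the offset between the computed fixation position and the target, measured in pixels, can be converted into degrees of visual angle. Below is a minimal sketch in R; the screen geometry and offset values are hypothetical examples, not recommendations, and must be replaced with those of the actual setup.

```r
# A minimal sketch (in R) of converting a validation offset in pixels into
# degrees of visual angle. All numbers are hypothetical example values.
px_per_cm <- 1920 / 53.0   # horizontal resolution / physical screen width in cm
view_dist <- 70            # viewing distance in cm
offset_px <- c(12, 9)      # x/y offset between computed fixation and target, in pixels

offset_cm  <- sqrt(sum((offset_px / px_per_cm)^2))        # Euclidean offset in cm
offset_deg <- 2 * atan(offset_cm / (2 * view_dist)) * 180 / pi
offset_deg                 # validation error in degrees of visual angle
```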
The primary data of a visual world study are a stream of gaze locations in the visual world, recorded at the sampling rate of the eye tracker and ranging over the whole or part of the trial duration. The dependent variable used in a visual world study is typically the proportion of samples in which participants' fixations fall within a certain spatial region of the visual world across a certain time window. To analyze the data, a time window, often referred to as the period of interest, must first be selected. The time window is typically time-locked to the presentation of some linguistic event in the auditory input. Furthermore, the visual world must be divided into several regions of interest (ROIs), each of which is associated with one or more objects. One such region contains the object corresponding to the correct comprehension of the spoken language, and is thus often called the target area. A typical way to visualize the data is a proportion-of-fixation plot, where at each bin in a time window, the proportion of samples with a look to each region of interest is averaged across participants and items.
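To make this concrete, the following sketch in R (using dplyr and ggplot2) computes and plots fixation proportions per 20 ms bin. The column names (subject, time_rel, roi) are hypothetical placeholders for a sample report exported from the eye tracker, with time_rel already realigned to the onset of the critical word; this is a sketch under those assumptions, not a definitive implementation.

```r
# A minimal sketch of a proportion-of-fixation plot. 'samples' is assumed to
# hold one row per gaze sample, with columns subject, time_rel (ms, realigned
# to the critical word onset), and roi (the fixated region of interest).
library(dplyr)
library(ggplot2)

prop_data <- samples %>%
  mutate(bin = floor(time_rel / 20) * 20) %>%        # 20 ms bins
  group_by(subject, bin, roi) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(subject, bin) %>%
  mutate(prop = n / sum(n)) %>%                      # proportion per ROI, per subject
  group_by(bin, roi) %>%
  summarise(prop = mean(prop), .groups = "drop")     # average over subjects
# Note: in real data, ROI x bin combinations with zero samples should be
# filled with 0 (e.g., with tidyr::complete) before averaging.

ggplot(prop_data, aes(bin, prop, colour = roi)) +
  geom_line() +
  labs(x = "Time from critical word onset (ms)",
       y = "Proportion of fixations")
```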
Using the data obtained from a visual world study, different research questions can be answered: a) On the coarse-grain level, are participants' eye movements in the visual world affected by different auditory linguistic inputs? b) If there is an effect, what is the trajectory of the effect over the course of the trial? Is it a linear or a higher-order effect? and c) If there is an effect, then on the fine-grain level, when is the earliest temporal point where such an effect emerges and how long does this effect last?
To statistically analyze the results, the following points should be considered. First, the response variable, i.e., the proportion of fixations, is bounded below and above (between 0 and 1), and thus follows a multinomial distribution rather than a normal distribution. Hence, traditional statistical methods based on the normal distribution, such as the t-test, ANOVA, and linear (mixed-effects) models10, cannot be directly utilized until the proportions have been transformed into unbounded variables, such as with the empirical logit formula11, or have been replaced with unbounded dependent variables such as the Euclidean distance12. Statistical techniques that do not require the assumption of a normal distribution, such as generalized linear (mixed-effects) models13, can also be used. Second, to explore the changing trajectory of the observed effect, a variable denoting the time series has to be added into the model. This time-series variable is originally the eye tracker's sampling points realigned to the onset of the language input. Since the changing trajectory is typically not linear, a higher-order polynomial function of the time series is normally added into the (generalized) linear (mixed-effects) model, i.e., growth curve analysis14. Furthermore, participants' eye position at the current sampling point is highly dependent on the previous sampling point(s), especially when the recording frequency is high, resulting in the problem of autocorrelation. To reduce the autocorrelation between adjacent sampling points, the original data are often down-sampled or binned. In recent years, generalized additive mixed models (GAMMs) have also been used to tackle autocorrelated errors12,15,16. The width of the bins varies among different studies, ranging from several milliseconds to several hundred milliseconds. The narrowest bin a study can choose is restricted by the sampling rate of the eye tracker used in the specific study. For example, if an eye tracker has a sampling rate of 500 Hz, then the bin width cannot be smaller than 1000/500 = 2 ms. Third, when a statistical analysis is repeatedly applied to each time bin of the period of interest, the familywise error induced by these multiple comparisons should be tackled. As we described earlier, the trajectory analysis informs the researcher whether the effect observed on the coarse-grain level is linear with respect to the change of time, but does not show when the observed effect begins to emerge and how long it lasts. To determine the temporal point at which the observed difference starts to diverge, and to figure out the duration of the period over which the observed effect lasts, a statistical analysis has to be repeatedly applied to each time bin. These multiple comparisons introduce the so-called familywise error, no matter what statistical method is used. The familywise error is traditionally corrected with the Bonferroni adjustment17. Recently, a method called the nonparametric permutation test, originally used in the neuroimaging field18, has been applied to the visual world paradigm19 to control for the familywise error.
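The following sketch in R (using lme4) illustrates these analysis steps: the empirical logit transformation, a second-order growth curve model, and bin-by-bin comparisons with Bonferroni correction. The data frame binned and its columns (y, N, bin, cond, subject) are hypothetical placeholders; this is a sketch under those assumptions, not a definitive analysis pipeline.

```r
# A minimal sketch, assuming a data frame 'binned' with, per subject and time
# bin, the number of samples on the target ROI (y) out of all samples (N),
# plus a two-level condition factor 'cond'. All names are hypothetical.
library(lme4)

# Empirical logit transform; 0.5 is added to avoid log(0)
binned$elog <- with(binned, log((y + 0.5) / (N - y + 0.5)))

# Growth curve analysis: second-order orthogonal polynomial of time bin
t_poly <- poly(binned$bin, 2)
binned$t1 <- t_poly[, 1]
binned$t2 <- t_poly[, 2]
m <- lmer(elog ~ cond * (t1 + t2) + (1 + t1 + t2 | subject), data = binned)
summary(m)

# Bin-by-bin comparison (assuming exactly two conditions), with Bonferroni
# correction for the number of bins tested
pvals <- sapply(split(binned, binned$bin), function(d)
  t.test(elog ~ cond, data = d)$p.value)
p.adjust(pvals, method = "bonferroni")
```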
Researchers using the visual world paradigm intend to infer the comprehension of some spoken language from participants' eye movements in the visual world. To ensure the validity of this deduction, other factors that possibly influence the eye movements should be either ruled out or controlled. The following two factors are among the common ones that need to be considered. The first factor involves systematic patterns in participants' exploratory fixations that are independent of the language input, such as the tendency to fixate on the top-left quadrant of the visual world, and the fact that moving the eyes in the horizontal direction is easier than in the vertical direction12,20. To make sure that the observed fixation patterns are related to the objects, not to the spatial locations where the objects are situated, the spatial position of an object should be counterbalanced across different trials or across different participants. The second factor that might affect participants' eye movements is the basic image features of the objects in the visual world, such as luminance contrast, color, and edge orientation, among others21. To diagnose this potential confound, the visual display is normally presented prior to the onset of the spoken language, or prior to the onset of the critical acoustic marker of the spoken language, for about 1000 ms. During the temporal period from the onset of the test image to the onset of the test audio, the language input or the disambiguation point of the language input has not been heard yet. Any difference observed between conditions during this period should therefore be attributed to confounding factors such as the visual display per se, rather than to the language input. Hence, eye movements observed in this preview period provide a baseline for determining the effect of the linguistic input. This preview period also allows participants to become familiar with the visual display, and reduces the systematic bias of the exploratory fixations when the spoken language is presented.
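As an illustration of such counterbalancing, the following sketch in R rotates object-to-position assignments across four presentation lists with a Latin square, so that each object type appears in each screen quadrant equally often across participants. The object and position labels are hypothetical.

```r
# A minimal sketch (in R) of a Latin square rotating four object types over
# four screen positions across four presentation lists. Labels are hypothetical.
objects   <- c("target", "competitor", "distractor1", "distractor2")
positions <- c("top_left", "top_right", "bottom_left", "bottom_right")

latin_square <- sapply(0:3, function(shift)
  positions[(seq_along(positions) + shift - 1) %% 4 + 1])
colnames(latin_square) <- paste0("list", 1:4)
rownames(latin_square) <- objects
latin_square   # each row: one object type; each column: one presentation list
```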
To illustrate how a typical eye tracking study using the visual world paradigm is conducted, the following protocol describes an experiment adapted from L. Zhan17 to explore the online processing of semantically complex statements, i.e., disjunctive statements (S1 or S2), conjunctive statements (S1 and S2), and but-statements (S1 but not-S2). In ordinary conversation, the information expressed by some utterances is actually stronger than their literal meaning. Disjunctive statements like Xiaoming's box contains a cow or a rooster are such utterances. Logically, the disjunctive statement is true as long as the two disjuncts Xiaoming's box contains a cow and Xiaoming's box contains a rooster are not both false. Therefore, the disjunctive statement is true when the two disjuncts are both true, in which case the corresponding conjunctive statement Xiaoming's box contains a cow and a rooster is also true. In ordinary conversation, however, hearing the disjunctive statement often suggests that the corresponding conjunctive statement is false (the scalar implicature), and that the truth values of the two disjuncts are unknown to the speaker (the ignorance inference). Accounts in the literature differ in whether the two inferences are grammatical or pragmatic processes22,23,24,25,26. The experiment shows how the visual world paradigm can be used to adjudicate between these accounts, by exploring the online processing of the three complex statements.
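The truth-conditional contrast among the three statement types can be summarized in a simple truth table; the short R snippet below merely enumerates these truth conditions for illustration.

```r
# Truth conditions of the three statement types, enumerated in R.
# S1 = "Xiaoming's box contains a cow"; S2 = "Xiaoming's box contains a rooster".
s1 <- c(TRUE, TRUE, FALSE, FALSE)
s2 <- c(TRUE, FALSE, TRUE, FALSE)
data.frame(S1 = s1, S2 = s2,
           or_stmt  = s1 | s2,   # S1 or S2: false only when both disjuncts are false
           and_stmt = s1 & s2,   # S1 and S2: true only when both conjuncts are true
           but_stmt = s1 & !s2)  # S1 but not-S2: true only when S1 is true and S2 is false
```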
All subjects must give informed written consent before the administration of the experimental protocols. All procedures, consent forms, and the experimental protocol were approved by the Research Ethics Committee of the Beijing Language and Culture University.
NOTE: A comprehension study using the visual world paradigm normally consists of the following steps: introduce the theoretical problem to be explored; form an experimental design; prepare the visual and auditory stimuli; frame the theoretical problem with regard to the experimental design; select an eye tracker to track participants' eye movements; select software and build a script with it to present the stimuli; and code and analyze the recorded eye-movement data. Specific experiments can differ from one another in any of the described steps. As an example, the following protocol describes how to conduct the experiment and discusses some points that researchers need to keep in mind when they build and conduct their own experiments using the visual world paradigm.
1. Prepare Test Stimuli
2. Frame the Theoretical Prediction with Regard to the Experimental Design
3. Build the Experimental Script
4. Recruit Participants
5. Conduct the Experiment
NOTE: When participants are typically developed adults, one experimenter is enough to conduct the experiment. But if participants are from special populations, such as children, two or more experimenters are required.
6. Data Coding and Analyses
Participants' behavioral responses are summarized in Figure 4. As described earlier, the correct response to a conjunctive statement (S1 and S2) is the big open box, such as Box A in Figure 1. The correct response to a but-statement (S1 but not S2) is the small open box containing the first-mentioned animal, such as Box D in Figure 1. Critically, which box is chosen in response to the disjunctive statement (S1 or S2) depends on how the disjunctive statement is processed: the small closed box, such as Box B in Figure 1, is chosen only when the scalar implicature and the ignorance inferences relating to the disjunctive statement are both computed. Figure 4 illustrates that when comprehending a disjunctive statement (S1 or S2), participants compute both inferences.
Furthermore, participants' eye fixations on the small closed box, especially when these fixations are followed by behavioral responses on that box, also suggest that the scalar implicature and the ignorance inferences are computed. The two inferences should have been processed no later than the earliest temporal point when participants shift their visual attention and fixations to the small closed box. Participants' eye movements are summarized in Figure 5. As we can see in panel B, participants' eye fixations on the small closed box (Box B) do not increase unless the sentential connective is the disjunctive connective, or. Furthermore, this increase begins no later than the offset of the disjunctive connective. This suggests that both the scalar implicature and the ignorance inferences are computed before the offset of the sentential connective, i.e., immediately upon encountering the disjunctive connective.
Figure 1. An example of the test images used in the experiment. The gray area is the test image actually presented to participants. The blue boxes, the dotted lines, and the pixels denoting the width of the elements are only for the purposes of illustration and are invisible to participants. (Adapted from experiment one of L. Zhan17 with permission). Please click here to view a larger version of this figure.
Figure 2. An example of the test sentences used in the experiment (Reprinted from L. Zhan17 with permission). Please click here to view a larger version of this figure.
Figure 3. The hierarchical structure of a typical eye-tracking experiment using the visual world paradigm. The audio illustrated in the image is the English translation of the Mandarin Chinese used in the experiment. Please click here to view a larger version of this figure.
Figure 4. Participants' behavioral responses recorded in the experiment (Adapted from experiment one of L. Zhan17 with permission). Please click here to view a larger version of this figure.
Figure 5. Participants' eye movements observed in the experiment. The onset and offset of the sentential connectives are signified by two dashed vertical lines. A significant difference between the disjunctive and the baseline condition at a certain sampling point is signified by the gray area (p < .05, Bonferroni adjusted) (Adapted from L. Zhan17 with permission). Please click here to view a larger version of this figure.
To conduct a visual world study, there are several critical steps to follow. First, researchers intend to deduce the interpretation of the auditorily presented language via participants' eye movements in the visual world. Hence, in designing the layout of the visual stimuli, the properties of eye movements in a natural task that potentially affect participants' eye movements should be controlled. The effect of the spoken language on participants' eye movements can then be recognized. Second, acoustic cues in the spoken language are transient, and there are no acoustic signatures solely corresponding to a certain linguistic category. To correctly time-lock participants' eye movements to the onset of some linguistic marker, researchers should find an objective and consistent way to define the boundary of the relevant linguistic structure. Third, to correctly map participants' eye rotation around the head to their point of regard in the visual world, researchers need to perform one or several runs of the calibration, validation, and drift correction processes. Fourth, the data obtained from a visual world study have some peculiar properties, such as being bounded below and above and having autocorrelated errors. These peculiar properties should be considered when a method is selected to statistically analyze the data.
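As an illustration of the time-locking mentioned in the second point, the following sketch in R realigns raw sample timestamps to a per-trial acoustic landmark. The data frames samples and landmarks, and their columns, are hypothetical; the landmark onsets are assumed to have been measured in the sound files beforehand (e.g., in Praat).

```r
# A minimal sketch, assuming 'samples' holds raw sample timestamps per trial
# and 'landmarks' holds, per trial, the onset (in ms) of the critical
# connective as measured in the sound files. All names are hypothetical.
samples_locked <- merge(samples, landmarks, by = "trial")
samples_locked$time_rel <-
  samples_locked$timestamp - samples_locked$connective_onset
# time_rel = 0 now corresponds to the onset of the critical connective
```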
A visual world study consists of three essential components: the visual display, the spoken language, and the experimental task. Modifications can be made to any component to fulfill researchers' specific purposes. First, the visual display is normally a screen depicting an array of pictures. But it can also be a screen depicting an array of printed words32, a schematic scene33,34, or a real-world scene containing real objects1,35. Second, the spoken utterances can be a word36, a simple sentence33,34, a semantically complex statement17,37,38, or a dialogue39. Third, in terms of the experimental task, participants are either simply asked to look at the visual world and listen carefully to the auditory utterances33,34, or are required to make some behavioral response, such as acting out the movements described by the auditory utterance1, determining whether or not the auditory utterance applies to the visual display38, or choosing the correct image in the visual display that the spoken utterance is talking about17.
The visual world paradigm, compared with other psycholinguistic techniques, has several unique advantages. First, the visual world paradigm can be used in a wide range of populations, including those who cannot read and/or who cannot overtly give their behavioral responses, such as preliterate children40,41,42,43,44,45, elderly adults46, and patients (e.g., aphasics)47. This is because the language stimuli are presented auditorily rather than in the form of written text, and language comprehension is inferred from participants' implicit, automatic eye movements rather than from their overt behavioral responses. Second, the visual world paradigm is extremely sensitive to fine-grained manipulations of the speech signal. This paradigm can be used to study the online processing of most topics in language comprehension at multiple levels, such as fine-grained acoustic-phonetic features36,48,49, the properties of words33,34, linguistic structures1,50, and the logical structures of semantically complex statements like concessives37, biconditionals37, conditionals38, and disjunctions17.
The visual world paradigm, compared with other psycholinguistic techniques, also has some potential limitations. First, participants' interpretation of the spoken language is deduced from their eye movements in the visual world, not from the actual interpretation of the language stimuli per se. Hence, the language properties that can be studied with the visual world paradigm are limited to those that can be visualized, i.e., they should be somehow related to the entities or events in the visual world. Second, the visual world used in the paradigm is normally more restricted than the actual visual world, with a limited set of pictured referents and a limited set of potential actions. This so-called closed-set problem48 might create task-specific strategies, such that the observed language processing does not generalize beyond the specific situations created within the experiment. Hence, the paradigm might not be sensitive to characteristics of linguistic knowledge and experience lying outside of the closed set established on a given trial.
The visual world paradigm essentially explores the integration of information from the visual domain and information from the auditory domain. Theoretically, any information that can be processed by these two sensory modalities can potentially be studied using this paradigm. For example, the visual world can be a virtual reality environment or a dynamic video. The auditory input is not necessarily language, and can be of any other format, such as music or sounds from the natural world. Furthermore, this paradigm can be further extended to explore the integration of information from domains other than the visual and the auditory. For example, researchers might use this technique to see how participants' fixations in the visual world are affected by different smells, different touches, etc.
The authors have nothing to disclose.
This research was supported by the Science Foundation of Beijing Language and Culture University under the Fundamental Research Funds for the Central Universities (Approval number 15YJ050003).
Pixelmator | Pixelmator Team | http://www.pixelmator.com/pro/ | image editing app |
Praat | Open Source | http://www.fon.hum.uva.nl/praat/ | Sound analysis and editing software |
Eyelink 1000plus | SR-Research, Inc | https://www.sr-research.com/products/eyelink-1000-plus/ | remote infrared eye tracker |
Experimental Builder | SR-Research, Inc | https://www.sr-research.com/experiment-builder/ | eye tracker software |
Data Viewer | SR-Research, Inc | https://www.sr-research.com/data-viewer/ | eye tracker software |
R | Open Source | https://www.r-project.org | free software environment for statistical computing and graphics |