Using multimodal sensors is a promising way to understand the role of social interactions in educational settings. This paper describes a methodology for capturing joint visual attention from colocated dyads using mobile eye-trackers.
Recent technological advances make it possible to study social interactions at a microlevel with unprecedented accuracy. High-frequency sensors, such as eye-trackers, electrodermal activity wristbands, EEG bands, and motion sensors, provide observations at the millisecond level. This level of precision allows researchers to collect large datasets on social interactions. In this paper, I discuss how multiple eye-trackers can capture a fundamental construct in social interactions, joint visual attention (JVA). JVA has been studied by developmental psychologists to understand how children acquire language, by learning scientists to understand how small groups of learners work together, and by social scientists to understand interactions in small teams. This paper describes a methodology for capturing JVA in colocated settings using mobile eye-trackers. It presents some empirical results and discusses implications of capturing microobservations to understand social interactions.
JVA has been extensively studied over the last century, especially by developmental psychologists studying language acquisition. It was quickly established that joint attention is not just a way to learn words but also a precursor to children's theories of mind1. Thus, it plays a significant role in many social processes, such as communicating with others, collaborating, and developing empathy. Autistic children, for instance, lack the ability to coordinate their visual attention with their caregivers, which is associated with significant social impairments2. Humans need joint attention to become functional members of society, to coordinate their actions, and to learn from others. From children acquiring their first words to teenagers learning from schoolteachers, students collaborating on projects, and groups of adults working toward common goals, joint attention is a fundamental mechanism to establish common ground between individuals3. In this paper, I focus on the study of JVA in educational research. Understanding how joint attention unfolds over time is of primary importance for the study of collaborative learning processes. As such, it plays a predominant role in socioconstructivist settings.
The exact definition of joint attention is still debated4. This paper is concerned with a subconstruct of joint attention (JA), namely JVA. JVA happens when two subjects are looking at the same place at the same time. It should be noted that JVA does not provide any information about other important constructs of interest in the study of JA, such as monitoring common, mutual, and shared attention, or more generally, awareness of the cognition of another group member. This paper operationalizes and simplifies JVA by combining the eye-tracking data from two participants and analyzing the frequency with which they align their gazes. For a more comprehensive discussion of the JA construct, the interested reader can refer to Siposova et al.4.
Over the past decade, technological advances have radically transformed research on JVA. The main paradigm shift was to use multiple eye-trackers to obtain quantitative measures of attentional alignments, as opposed to qualitatively analyzing video recordings in a laboratory or ecological setting. This development has allowed researchers to collect precise, detailed information about dyads' visual coordination. Additionally, eye-trackers are becoming more affordable: until recently, their use was reserved for academic settings or large corporations. It is now possible to purchase inexpensive eye-trackers that generate reliable datasets. Finally, the progressive inclusion of gaze-tracking capabilities into existing devices like high-end laptops and virtual and augmented reality headsets suggests that eye-tracking will soon become ubiquitous.
Because of the popularization of eye-tracking devices, it is important to understand what they can and cannot tell us about social interactions. The methodology presented in this paper marks a first step in this direction. I address two challenges in capturing JVA from multiple eye-trackers: synchronizing the data 1) on the temporal scale and 2) on the spatial scale. More specifically, this protocol makes use of fiducial markers placed in real-world environments to inform computer vision algorithms where participants are orienting their gaze. This new kind of methodology paves the way to rigorous analysis of human behavior in small groups.
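To make the temporal challenge concrete, the following is a minimal sketch, not the implementation used in this protocol. It assumes that each recording is a list of (timestamp in seconds, x, y) samples and that the times at which the start and end synchronization markers appeared in each participant's video are already known. The function names, the data layout, and the nearest-sample resampling strategy are all hypothetical choices made for illustration.

```python
from bisect import bisect_left

def trim_and_rebase(samples, start_t, end_t):
    """Keep samples recorded between the start and end sync markers and
    express their timestamps relative to the start marker."""
    return [(t - start_t, x, y) for (t, x, y) in samples if start_t <= t <= end_t]

def resample(samples, rate_hz, duration_s):
    """Place a trimmed stream on a regular time grid (nearest sample).
    Grid points with no sample within half a step are marked None."""
    times = [t for (t, _, _) in samples]
    half_step = 0.5 / rate_hz
    grid = []
    for i in range(int(round(duration_s * rate_hz)) + 1):
        target = i / rate_hz
        j = bisect_left(times, target)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(times)]
        best = min(candidates, key=lambda k: abs(times[k] - target), default=None)
        if best is None or abs(times[best] - target) > half_step:
            grid.append(None)  # missing data at this grid point
        else:
            grid.append((samples[best][1], samples[best][2]))
    return grid
```

Once both participants' streams are trimmed to the same marker-delimited interval and resampled at the same rate, index i in one list refers to the same moment as index i in the other, which is what later comparisons of gaze positions require.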
This research protocol complies with the guidelines of Harvard University's human research ethics committee.
1. Participant Screening
2. Preparation for the Experiment
3. Running the experiment
4. Preprocessing the dual eye-tracking data
5. Analyzing the dual eye-tracking data
The methodology presented above was used to study students who were following a vocational training program in logistics (n = 54)12. In this experiment, pairs of students interacted with a Tangible User Interface (TUI) that simulated a small-scale warehouse. The fiducial markers placed on the TUI allowed the research team to remap students' gazes onto a common plane and compute levels of JVA. Findings indicated that groups who had higher levels of JVA tended to do better at the task given to them, learned more, and had a better quality of collaboration13 (Figure 8, left side). Dual eye-tracking datasets also allowed us to capture particular group dynamics like the free-rider effect. We estimated this effect by identifying who was likely to have initiated each moment of JVA (i.e., whose gaze was there first) and who responded to it (i.e., whose gaze was there second). We found a significant correlation between learning gains and the students' tendency to equally share the responsibility of initiating and responding to offers of JVA. In other words, groups in which the same person always initiated moments of JVA were less likely to learn (Figure 8, right side) and groups where this responsibility was equally shared were more likely to learn. This finding shows that we can go beyond merely quantifying JVA and actually identify group dynamics and productivity using dual eye-tracking data.
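The idea of attributing each moment of JVA to an initiator and a responder can be sketched as follows. This is one possible operationalization under stated assumptions, not the procedure used in the study: both gaze streams are assumed to be already synchronized and remapped onto a common plane, stored as lists of (x, y) tuples with None for missing samples, and the initiator is taken to be the participant whose gaze had been resting near the shared location for more consecutive frames before the episode began. All names and the dwell-based rule are hypothetical.

```python
import math

def near(p, q, radius_px):
    """True when both points exist and lie within radius_px of each other."""
    return (p is not None and q is not None
            and math.hypot(p[0] - q[0], p[1] - q[1]) <= radius_px)

def jva_onsets(gaze_p1, gaze_p2, radius_px):
    """Frame indices at which an episode of JVA begins."""
    onsets, previous = [], False
    for i, (a, b) in enumerate(zip(gaze_p1, gaze_p2)):
        current = near(a, b, radius_px)
        if current and not previous:
            onsets.append(i)
        previous = current
    return onsets

def dwell_before(gaze, onset, target, radius_px):
    """Consecutive frames just before onset spent near the target."""
    count, i = 0, onset - 1
    while i >= 0 and near(gaze[i], target, radius_px):
        count += 1
        i -= 1
    return count

def count_initiations(gaze_p1, gaze_p2, radius_px):
    """Tally how often each participant was at the shared location first."""
    counts = {"p1": 0, "p2": 0, "tie": 0}
    for onset in jva_onsets(gaze_p1, gaze_p2, radius_px):
        a, b = gaze_p1[onset], gaze_p2[onset]
        target = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)  # shared location
        d1 = dwell_before(gaze_p1, onset, target, radius_px)
        d2 = dwell_before(gaze_p2, onset, target, radius_px)
        if d1 > d2:
            counts["p1"] += 1
        elif d2 > d1:
            counts["p2"] += 1
        else:
            counts["tie"] += 1
    return counts
```

A dyad in which the "p1" and "p2" counts are similar would correspond to the balanced pattern described above, whereas a large difference would correspond to one person initiating most moments of JVA.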
Figure 1: Each participant generates two video feeds with the X,Y coordinates of their gaze on each video frame. This methodology addresses synchronizing the data temporally and spatially between the participants.
Figure 2: A methodology for synchronizing the two datasets. A unique fiducial marker is briefly shown on a computer screen to tag the start and the end of the activity.
Figure 3: Using fiducial markers disseminated in the environment to remap participants' gazes onto a common plane (left side). White lines indicate fiducial markers that have been detected in both images.
Figure 4: Examples of poor data quality. Left: A blurred frame from the eye-tracking video caused by a sudden head movement. Fiducial markers could not be detected in this image. Right: A failed homography where the fiducial marker data were not properly synchronized with the video feed.
Figure 5: Heatmaps. Left: A heatmap of the eye-tracking data remapped onto the experimental scene. This visualization was used as a sanity check for the homography. Right: A group that had too much missing data and had to be discarded.
Figure 6: Cross-recurrence graph generated from three dyads to visualize JVA. P1 represents time for the first participant, P2 represents time for the second participant. Black squares show JVA; gray squares show moments where participants are looking at different places; white squares show missing data. Squares along the main diagonal indicate moments where participants looked at the same place at the same time. This visualization was used as a sanity check for measures of JVA from the combined eye-tracking data.
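The structure of a cross-recurrence graph such as the one in Figure 6 can be illustrated with a minimal sketch. It assumes two synchronized gaze streams already remapped onto a common plane, stored as lists of (x, y) tuples with None for missing samples; the function names, the distance threshold parameter, and the text rendering are hypothetical and only mirror the black/gray/white coding of the figure.

```python
import math

def cross_recurrence(gaze_p1, gaze_p2, radius_px):
    """Cell [i][j] is True when P1's gaze at time i lies within radius_px of
    P2's gaze at time j, False when the gazes are farther apart, and None
    when either sample is missing."""
    matrix = []
    for a in gaze_p1:
        row = []
        for b in gaze_p2:
            if a is None or b is None:
                row.append(None)
            else:
                row.append(math.hypot(a[0] - b[0], a[1] - b[1]) <= radius_px)
        matrix.append(row)
    return matrix

def render(matrix):
    """Text rendering: '#' for JVA, '.' for different places, ' ' for missing."""
    symbols = {True: "#", False: ".", None: " "}
    return ["".join(symbols[cell] for cell in row) for row in matrix]

def diagonal(matrix):
    """Cells on the main diagonal, i.e., both gazes compared at the same time."""
    return [matrix[i][i] for i in range(min(len(matrix), len(matrix[0])))] if matrix else []
```

Because the matrix grows with the square of the number of samples, long recordings would likely need to be downsampled before plotting.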
Figure 7: A video frame where JVA was detected between two participants (red dots). Richardson et al.11 recommend looking at a time window of ±2 s when computing JVA. Additionally, researchers need to define the minimal distance between two gaze points to count as JVA. A radius of 100 pixels was chosen for the middle image above.
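The two parameters mentioned in this caption, a time window and a distance radius, can be combined into a JVA measure along the following lines. This is a sketch of one possible reading rather than the study's implementation: it assumes synchronized gaze streams on a common plane (lists of (x, y) tuples, None for missing data), expresses the time window in frames (2 s multiplied by the sampling rate), and counts a frame as JVA when any of the partner's gaze points inside the window falls within the radius. Function and parameter names are hypothetical.

```python
import math

def jva_flags(gaze_p1, gaze_p2, radius_px=100, window_frames=0):
    """For each frame of P1: True if some P2 gaze sample within
    +/- window_frames lies within radius_px, False if none does,
    None if P1 or every P2 sample in the window is missing."""
    flags = []
    for i, a in enumerate(gaze_p1):
        lo = max(0, i - window_frames)
        hi = min(len(gaze_p2), i + window_frames + 1)
        partner = [b for b in gaze_p2[lo:hi] if b is not None]
        if a is None or not partner:
            flags.append(None)
        else:
            flags.append(any(
                math.hypot(a[0] - b[0], a[1] - b[1]) <= radius_px
                for b in partner))
    return flags

def jva_proportion(flags):
    """Share of frames with valid data that were flagged as JVA."""
    valid = [f for f in flags if f is not None]
    return sum(valid) / len(valid) if valid else float("nan")
```

With window_frames set to 0, only gaze points recorded at the same instant are compared; widening the window makes the measure more tolerant of short delays between one participant looking somewhere and the other following.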
Figure 8: Examples of results. Data from Schneider et al.12 where the percentage of time looking at the same place at the same time was correlated with participants' quality of collaboration: r(24) = 0.460, P = 0.018 (left side) and imbalances in initiating/responding to offers of JVA were correlated with their learning gains: r(24) = −0.47, P = 0.02 (right side).
The methodology described in this paper provides a rigorous way to capture JVA in colocated dyads. With the emergence of affordable sensing technology and improved computer vision algorithms, it is now possible to study collaborative interactions with an accuracy that was previously unavailable. This methodology leverages fiducial markers disseminated in the environment and uses homographies as a way to remap participants' gazes onto a common plane. This allows researchers to rigorously study JVA in colocated groups.
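To illustrate how a homography remaps a gaze point onto a common plane, the following minimal sketch estimates the 3×3 transformation from exactly four marker correspondences and applies it to a gaze coordinate. It is not the pipeline used in this protocol: the function names are hypothetical, the marker coordinates are assumed to have been detected beforehand, and in practice an established computer vision routine (for example, OpenCV's findHomography) fitted on more than four markers would typically be preferred for robustness.

```python
def solve_linear(a, b):
    """Solve a square linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate marker configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def estimate_homography(src, dst):
    """Homography mapping four (x, y) points in a participant's video frame
    to the matching (u, v) points on the common plane (h33 fixed to 1)."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def remap(h, point):
    """Apply the homography to a gaze point (with perspective division)."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

Estimating one such matrix per video frame for each participant, and remapping both gaze points with it, would place the two gazes in the same coordinate system so that their distance can be compared.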
This method includes multiple sanity checks that need to be performed at various points of the experiment. Because this is a complex procedure, researchers need to make sure that the resulting datasets are complete and valid. Finally, it is recommended to conduct pilot studies before the actual experiment, and to reconstruct participants' interactions through a video after data collection is completed (Figure 3, Figure 4, Figure 5, Figure 6).
There are several limitations associated with this method:
Number of participants. While this methodology works well for two participants, analysis becomes more complicated with larger groups. Fiducial markers can still be used to remap gazes onto a ground truth, but knowing how to identify JVA becomes a more nuanced process. Should JVA be defined as the times when everyone is looking at the same place at the same time, or when two participants are gazing at the same place? Additionally, visualizations like the cross-recurrence graph become impractical with more than 2–3 people.
Settings. The method described in this paper is appropriate for small, controlled settings (e.g., laboratory studies). Open-ended settings, such as outdoors or large spaces, are usually too complicated to instrument with fiducial markers and thus can limit the usefulness of the eye-tracking data. Additionally, the fiducial markers can be distracting and clutter the environment. In the future, better computer vision algorithms will be able to automatically extract common features between two perspectives. There are already algorithms that exist for this purpose, but we found that the level of accuracy was not yet acceptable for the type of experiment described above.
AOIs. Related to the point above, computing the homography and the cross-recurrence graph works well with a stable number of areas of interest (AOIs), but corrections have to be made when comparing different tasks with different numbers of areas of interest.
Use of equipment. Mobile eye-trackers can be obtrusive, affecting participants' behavior or failing to work with particular eye physiology.
In conclusion, the methodology described in this paper is a promising way to study colocated interactions. It allows researchers to capture a precise metric for JVA, which is a critical construct in the social sciences1. Additionally, it is possible to detect more fine-grained indicators of collaborative learning through this methodology12 compared to traditional qualitative analyses. In short, it is a more efficient and accurate way to study social interactions.
Potential applications of this method include designing interventions to support collaboration through real-time eye-tracking data. Some pioneering work has produced shared gaze visualizations using remote eye-trackers, which have been shown to benefit collaborative learning from a distance14. Dyads who could see the gaze of their partner in real time exhibited more JVA, collaborated better, and achieved higher learning gains compared to a control group. Future work will examine whether this kind of intervention can support collaborative processes in colocated settings (e.g., through virtual or augmented reality headsets).
The authors have nothing to disclose.
The development of this methodology was supported by the National Science Foundation (NSF #0835854), the Leading House Technologies for Vocation Education, funded by the Swiss State Secretariat for Education, Research and Innovation, and the Harvard School of Education's Dean Venture Fund.
Tobii Glasses 2 | Tobii | N/A | https://www.tobiipro.com/product-listing/tobii-pro-glasses-2/ |
Fiducial markers | Chili lab – EPFL, Switzerland | N/A | https://github.com/chili-epfl/chilitags |