The objective of this protocol is to detail how to collect video data for use in the laboratory, how to record eye-tracking data from participants as they view that video, and how to efficiently analyze the content of the videos they were looking at using a machine learning technique.
As individuals increasingly live in cities, methods to study their everyday movements, and the data that can be collected, become important and valuable. Eye-tracking data are known to relate to a range of feelings, health conditions, mental states, and actions. However, because vision is the result of constant eye movements, teasing out what is important from what is noise is complex and data intensive. Furthermore, a significant challenge is controlling for what people look at compared to what is presented to them.
The following presents a methodology for combining and analyzing eye-tracking on a video of a natural and complex scene with a machine learning technique for analyzing the content of the video. In the protocol we focus on analyzing data from filmed videos, how a video can be best used to record participants' eye-tracking data, and importantly how the content of the video can be analyzed and combined with the eye-tracking data. We present a brief summary of the results and a discussion of the potential of the method for further studies in complex environments.
Our daily lived experiences of urban environments greatly affect our health and well-being. Our well-being can depend on the amount of green space that we view and experience1,2,3, and these views can be quantified using eye-tracking equipment to guide decision making about park design. However, the volume of eye-tracking data that is generated, and the task of making sense of these data, present a challenge. As the equipment for recording gaze data in a laboratory or natural setting becomes easier to use and more powerful, researchers need to consider how data can be collected and analyzed validly to help answer decision-making questions.
So far, a great deal of eye-tracking research has used photographs in a survey or laboratory setting4. While this methodology allows for a great deal of reproducibility and control over results, it is unable to take advantage of the latest advances in eye-tracking technology, which include the use of video and wearable mobile eye-trackers. Furthermore, we would argue that the act of walking and relaxing is necessarily dynamic, particularly when oriented towards a task such as wayfinding5. Therefore, a fully scientific understanding of these settings should take place outside the laboratory. However, at the moment, eye-tracking in a real-life naturalistic setting makes comparing the experience between subjects very difficult. For example, if we wanted to compare whether one respondent looks at trees more than another, how could we control for the fact that their point of view would be constantly changing compared to others, or that their head might have turned? Detailed analysis under these conditions is nearly impossible with current analysis techniques. We would argue that it is important to control the viewing area available to the individual being studied and, in the analysis, to be able to account for the total scene being viewed at any one point in time.
There is a set of theories linking stress levels and perceptions of safety to landscape views, along with well-established measures of stress6,7. There has also been a rapid increase in the sophistication of eye-tracking equipment to measure gaze8. Eye-tracking is important because involuntary eye movements may be more reliably connected to preference, stress, and other traditional measures than surveys and intrusive physiological tests such as salivary cortisol levels. The objective of this research is to develop tools that enable more precise measurement of eye-tracking data applied to more naturalistic settings, so as to provide further evidence for, or to refute, long-standing landscape theories that have informed park design for decades.
The aim of this project is to develop and test a novel analysis technique that can generate relevant eye-tracking data for different videos of simulated park walks. Our work reported here and elsewhere9 represents a half-way point between the naturalistic setting of a fully mobile eye-tracking system and the laboratory-based photo studies referred to above. In particular, we concentrate on using videos as the stimulus material, exploring how this material can be used to test the amount of fascination that different parks generate in the City of Melbourne. Our work is based on the assumption that detailed analysis of videos is a necessary step before undertaking a fuller, more naturalistic assessment of the potential of parks to provide restoration from stress.
In this study, we employed a desktop eye-tracker with videos of walks through urban parks and asked participants to imagine that they were taking a relaxing walk through a park. We describe a method that allows the amount of time participants spent looking at different objects to be compared between parks. Desktop studies are generally easier to control than mobile eye-tracking studies and allow comparative analysis of each subject.
Standard eye-tracking software uses a manual area of interest (AOI) tool, in which an operator draws boundaries around objects of interest in each scene. This enables the amount of time that participants spent looking at different objects to be counted automatically. For video data, this process is labor intensive and subject to operator subjectivity and error. In later versions of eye-tracking analysis software, AOIs can automatically track objects across frames when they remain the same size in the video. This is an improvement; however, it is only intended for a small number of stimuli in each image, and each image must be checked and confirmed.
Manual labelling of objects in an image is common and supported by image editing software such as the GNU Image Manipulation Program (GIMP). Given that 1 s of video produces 30 frames or images, the manual labelling of videos is impractical. In addition, AOI labelling by drawing vector polygons around the edges of complex objects such as tree canopies is very time consuming. Finally, while it is conceivably possible to calculate the size of objects in a field of view using vector labelling, this feature is not currently available.
The method that we report on below deals with these limitations by employing automatic labelling of objects. This is possible using an image processing technique known as semantic labelling, in which every pixel in every frame of the video is assigned a label indicating an object class. Machine learning is used to derive pixel classifiers for each object class of interest. These classifiers provide a probabilistic label for each pixel (known as unary potentials), which are then refined in a subsequent optimization process to achieve the final labelled output. The classifiers learn statistical decision boundaries between object classes in the space of features extracted from the image, including texture, a histogram of edge orientations, RGB color values, and the normalized image coordinates. An appropriate technique for this is implemented in the DARWIN machine learning toolbox10 and is described below.
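For this step the protocol relies on the DARWIN toolbox (a C++ library). Purely as an illustrative sketch of the underlying idea, the following Python code trains a per-pixel classifier on a simplified feature set (RGB values and normalized image coordinates only, omitting the texture and edge-orientation features and the subsequent optimization step). The class list, data structures, and function names are assumptions for illustration, not part of the published method.

```python
# Illustrative sketch only (the protocol itself uses the DARWIN toolbox):
# per-pixel semantic labelling with a simplified feature set.
import numpy as np
import cv2
from sklearn.ensemble import RandomForestClassifier

# Hypothetical class list; indices in the label images refer to positions here.
CLASSES = ["sky", "tree", "grass", "path", "artificial object"]

def pixel_features(bgr_frame):
    """Per-pixel features: RGB values plus normalized x, y image coordinates."""
    h, w = bgr_frame.shape[:2]
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB).reshape(-1, 3) / 255.0
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel() / w, ys.ravel() / h], axis=1)
    return np.hstack([rgb, coords])  # shape: (h*w, 5)

def train_classifier(frames, label_images):
    """Fit a pixel classifier on hand-labelled frames (-1 marks unlabelled pixels)."""
    X, y = [], []
    for frame, labels in zip(frames, label_images):
        feats = pixel_features(frame)
        mask = labels.ravel() >= 0
        X.append(feats[mask])
        y.append(labels.ravel()[mask])
    clf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
    clf.fit(np.vstack(X), np.concatenate(y))
    return clf

def label_frame(clf, frame):
    """Label every pixel of a new frame. predict_proba yields the per-pixel
    class probabilities (the 'unary potentials') before any spatial refinement."""
    probs = clf.predict_proba(pixel_features(frame))
    return clf.classes_[probs.argmax(axis=1)].reshape(frame.shape[:2])
```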
Ethical approval for this project was given by the Australian Catholic University ethics committee (approval number #201500036E). This ensured that informed consent was gained from all participants, that all participants took part voluntarily, and that participants' data remained anonymous and confidential. In addition, approval was given because the method and equipment met Australian safety standards and regulations.
1. Filming Urban Scenes that Can Be Used in an Eye-Tracking Study
2. Setup and Desktop Calibration of the Eye-tracking Equipment
3. Recruitment and Ethics
4. Participant Experimental Setup
5. Connecting the Viewed Parks to Impressions of the Videos
6. Automatic Video Analysis for Area of Interest Extraction
7. Registering the Eye-tracking Data to the Video Content
8. Displaying the Amount of Time that Participants Examined Different Classes of Objects in the Videos
NOTE: Due to the large size of the eye-tracking data, the Python programming language is recommended for the steps up to 8.4, although other data processing software can also be used.
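As a minimal, illustrative example (not part of the protocol itself), a large tab-separated eye-tracking export could be read into Python as follows; the file name and column names are hypothetical and depend on the eye-tracking software and its export settings.

```python
# Hypothetical example of loading an eye-tracking export; column names depend
# on the software and export settings actually used.
import pandas as pd

gaze = pd.read_csv("participant_01_export.tsv", sep="\t",
                   usecols=["timestamp_ms", "gaze_x_px", "gaze_y_px"])
# Samples where one or both eyes could not be located appear as missing values.
gaze = gaze.dropna(subset=["gaze_x_px", "gaze_y_px"])
print(f"{len(gaze)} valid gaze samples ({len(gaze) / 120:.1f} s at 120 Hz)")
```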
Figure 1 and Figure 2 show the result of taking all eye-tracking data for the whole video across all participants and producing a heat map; this is the standard approach available in eye-tracking software packages. Comparing Figure 1 and Figure 2 shows that, on average, participants scanned left and right along the x coordinate of the video in Figure 1, whereas the heat map in Figure 2 has a rounder shape. This is because the video in Figure 1 contained more horizontal elements throughout than the park in Figure 2. The image behind the heat map is a single frame and does not adequately represent the full content of the video.
The desktop eye-tracking system and software only count results where both eyes can be located at the same time. In other words, if one or both eyes cannot be located, the data are counted as lost. In our case, eye-tracking data were captured for >80% of the time.
Figure 3 and Figure 4 show the result of applying the protocol and using machine learning to analyze the content of the videos. Figure 3 shows the dense vegetation of Fitzroy Gardens compared to the relatively sparse vegetation of Royal Park (Figure 4). In the latter, more sky is visible and more of the scene is dominated by shrub vegetation. Figure 5 shows the percentage of fixation time spent on the different objects over the course of the video for one of the participants. It shows that although the path is clearly visible throughout the video, the participant only looks at this feature occasionally, but at key points. Similarly, as Figure 6 shows, although a tiny fraction of the content of the Royal Park video pertains to artificial objects, the participant in the study examined these features to a comparatively great extent.
The findings in Figure 3, Figure 4, Figure 5, and Figure 6 can be summarized in Figure 7 and Figure 8 for all 39 participants whose data were used in this study. Figure 7 shows the dwell time for all participants when looking at objects throughout the length of the video. Figure 8 shows these same data divided by the amount of time and space that these different objects occupied in the video. A value of 1 indicates that the dwell time can be accounted for by the amount of the object in the video. Figure 8 shows that artificial objects such as street lamps and benches were dwelt on to a greater extent than other objects (>1). Figure 8 also shows that objects that were less pertinent, such as the sky in both videos, were viewed comparatively less (<1).
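As an illustration, the relative measure shown in Figure 8 can be computed as a ratio of proportions, assuming per-class gaze-sample counts and per-class pixel counts have already been derived from the eye-tracking data and the labelled video; the function and variable names below are illustrative only.

```python
# Illustrative sketch: share of gaze samples on each class divided by the share
# of pixels that class occupies across the whole video. A value of 1 means the
# dwell time is fully explained by how much of the video the object covers.
def relative_dwell(gaze_counts, pixel_counts):
    total_gaze = sum(gaze_counts.values())
    total_pixels = sum(pixel_counts.values())
    return {c: (gaze_counts[c] / total_gaze) / (pixel_counts[c] / total_pixels)
            for c in gaze_counts if pixel_counts.get(c, 0) > 0}
```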
Figure 1: Example heat map that is produced for the whole video of one of the parks. This shows where the majority of the eye tracks were located. Note the horizontal shape of the heat map because of the dominance of horizontal elements in the video.
Figure 2: Example heat map that is produced for the whole video of another one of the parks. Note the more rounded shape because of the presence of vertical and horizontal elements in the video.
Figure 3: Histogram of content in the video of Fitzroy Gardens analyzed using the machine learning technique.
Figure 4: Histogram of content in the video of Royal Park analyzed using the machine learning technique.
Figure 5: Eye tracking fixation time spent looking at objects in Fitzroy Gardens.
Figure 6: Eye tracking fixation time spent looking at objects in Royal Park.
Figure 7: Aggregated eye tracking dwell times for all participants and objects for both parks.
Figure 8: Aggregated eye tracking dwell times relative to content for all participants and objects for both parks.
Generally, in standard software packages for analyzing eye-tracking data, a vector AOI is used. Even for a single still image, the size of the vector cannot easily be measured. Furthermore, including all AOIs in an image and calculating the relative areas of the AOIs is laborious. It is almost impossible to do this manually on a video without a machine learning technique such as the one described. The instruction to participants to imagine a relaxing walk was a relatively simple statement that implies a free-viewing situation. A much more precise scenario could be used, and different scenario descriptions given before the same stimulus can also affect eye movements11.
Our research used a method for accurately classifying objects in a video and analyzing the extent to which these were visible to the eye. The method represents one means of accounting for the extent to which eye movements are linked to objects of interest. This enables the presence of objects of interest in a field of view to be controlled for when comparing eye-tracking data from different subjects with a mobile eye-tracking device, or from different locations when using a desktop-based system. Considerations that can affect the automatic classification of objects using the machine learning approach we employ include lighting, the shapes of the objects being classified (e.g., not just one type of tree but a range of tree types, their position in the image, the way they are lit, etc.), texture, and color. Considerations on the number of frames include the length of the video and the variation in visual conditions exhibited across it, as well as the number of object classes to identify and the frequency of their appearance. The frames change every 1/25 s, whereas eye-tracking with the software occurs every 1/120 s (120 Hz). Eye-tracking data are therefore collected at 4.8 times the rate at which a new frame appears. It is not possible to match the two completely accurately, but the frame label should change approximately every five gaze samples. For example, eye-tracking data recorded between 99 and 101 s were associated with the frame at the 100th second of the video.
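As an illustration of this registration step, the following Python sketch maps each gaze sample to the frame shown at that instant and tallies the object class under the gaze point; the data structures and column names are assumptions for illustration rather than part of the published protocol.

```python
# Illustrative sketch: register 120 Hz gaze samples to 25 fps labelled frames.
FRAME_RATE = 25.0  # video frames per second

def gaze_to_class_counts(gaze_df, labelled_frames, classes):
    """gaze_df: DataFrame with columns time_s, x_px, y_px (hypothetical names);
    labelled_frames: one integer label image per video frame;
    classes: list of class names indexed by label value."""
    counts = {c: 0 for c in classes}
    for t, x, y in gaze_df[["time_s", "x_px", "y_px"]].itertuples(index=False):
        # Clamp to the last frame in case the recording runs slightly long.
        frame_idx = min(int(t * FRAME_RATE), len(labelled_frames) - 1)
        labels = labelled_frames[frame_idx]
        h, w = labels.shape
        if 0 <= int(x) < w and 0 <= int(y) < h:  # ignore off-screen samples
            counts[classes[labels[int(y), int(x)]]] += 1
    return counts  # gaze samples per class; divide by 120 to convert to seconds
```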
A significant advantage of current eye-tracking packages is that they allow users to review a film of their own eye-tracking fixations and pathways and describe why they looked at certain objects. This produces a qualitative data set that can reveal why individual subjects think they looked at certain objects. An extension of the project would be to also show them the amount of time they spent looking at objects at different times relative to the content, for example the information in Figure 8. However, doing this rapidly enough while controlling for the number of objects in a scene is currently not possible.
For example, participants could be asked to view their own recorded gaze paths and to describe why they had looked at particular objects12. In our case, at the end of each film participants were asked to rate each of the parks on a scale of 1-10: firstly, whether they felt they would be able to rest and recover in that environment (1 = not very much, 10 = very much) and, secondly, how much they liked the park (1 = not very much, 10 = very much).
The authors have nothing to disclose.
This work was financially supported by the City of Melbourne and partially by ARC DP 150103135. We would like to thank Eamonn Fennessy for his advice and collaborative approach. Special thanks to research assistants Isabelle Janecki and Ethan Chen, who also helped collect and analyze these data. All errors remain the authors' own.
Name | Company | Description
12 mm lens | Olympus | Lens
Panasonic GH4 | Panasonic | Video camera
Tobii Studio (version 2.1.14) | Tobii | Software
Tobii x120 desktop eye-tracker | Tobii | Eye-tracker