Machine learning algorithms have been trained to use patterns of brain activity to “decode” stimuli presented to humans. Here, we demonstrate that the same technique can decode naturalistic video content from the brains of two domestic dogs. We find that decoders based on the actions in the videos were successful in dogs.
Recent advancements using machine learning and functional magnetic resonance imaging (fMRI) to decode visual stimuli from the human and nonhuman cortex have resulted in new insights into the nature of perception. However, this approach has yet to be applied substantially to animals other than primates, raising questions about the nature of such representations across the animal kingdom. Here, we used awake fMRI in two domestic dogs and two humans, obtained while each watched specially created dog-appropriate naturalistic videos. We then trained a neural net (Ivis) to classify the video content from a total of 90 min of recorded brain activity from each. We tested both an object-based classifier, attempting to discriminate categories such as dog, human, and car, and an action-based classifier, attempting to discriminate categories such as eating, sniffing, and talking. Compared to the two human subjects, for whom both types of classifier performed well above chance, only action-based classifiers were successful in decoding video content from the dogs. These results demonstrate the first known application of machine learning to decode naturalistic videos from the brain of a carnivore and suggest that the dog’s-eye view of the world may be quite different from our own.
The brains of humans, like other primates, demonstrate the parcellation of the visual stream into dorsal and ventral pathways with distinct and well-known functions-the "what" and "where" of objects1. This what/where dichotomy has been a useful heuristic for decades, but its anatomical basis is now known to be much more complex, with many researchers favoring a parcellation based on recognition versus action ("what" vs. "how")2,3,4,5. Additionally, while our understanding of the organization of the primate visual system continues to be refined and debated, much remains unknown about how the brains of other mammalian species represent visual information. In part, this lacuna is a result of the historical focus on a handful of species in visual neuroscience. New approaches to brain imaging, however, are opening up the possibility of noninvasively studying the visual systems of a wider range of animals, which may yield new insights into the organization of the mammalian nervous system.
Dogs (Canis lupus familiaris) present a rich opportunity to study the representation of visual stimuli in a species evolutionarily distant from primates, as they may be the only animal that can be trained to cooperatively participate in MRI scanning without the need for sedation or restraints6,7,8. Due to their co-evolution with humans over the last 15,000 years, dogs also inhabit our environments and are exposed to many of the stimuli that humans encounter on a daily basis, including video screens, which are the preferred way of presenting stimuli in an MRI scanner. Even so, dogs may process these common environmental stimuli in ways that are quite different from humans, which begs the question of how their visual cortex is organized. Basic differences-such as a lack of a fovea, or being a dichromat-may have significant downstream consequences not only for lower-level visual perception but also for higher-level visual representation. Several fMRI studies in dogs have demonstrated the existence of both face- and object-processing regions that appear to follow the general dorsal/ventral stream architecture seen in primates, although it remains unclear whether dogs have face-processing regions per se or whether these regions are selective for the morphology of the head (e.g., dog vs. human)9,10,11,12,13. Regardless, the brain of a dog, being smaller than most primates, would be predicted to be less modularized14, so there may be more intermixing of types of information in the streams or even privileging of certain types of information, like actions. It has been suggested, for example, that movement might be a more salient feature in canine visual perception than texture or color15. Additionally, as dogs do not have hands, one of the primary means through which we interact with the world, their visual processing, particularly of objects, may be quite different than that of primates. In line with this, we recently found evidence that interaction with objects by mouth versus paw resulted in greater activation in object-selective regions in the dog brain16.
Although dogs may be accustomed to video screens in their home environment, that does not mean they are used to looking at images in an experimental setting the same way a human would. The use of more naturalistic stimuli may help to resolve some of these questions. In the last decade, machine learning algorithms have achieved considerable success in decoding naturalistic visual stimuli from human brain activity. Early successes focused on adapting classical, blocked designs to use brain activity to both classify the types of stimuli an individual was seeing, as well as the brain networks that encoded these representations17,18,19. As more powerful algorithms were developed, especially neural networks, more complex stimuli could be decoded, including naturalistic videos20,21. These classifiers, which are typically trained on neural responses to these videos, generalize to novel stimuli, allowing them to identify what a particular subject was observing at the time of the fMRI response. For example, certain types of actions in movies can be accurately decoded from the human brain, like jumping and turning, while others (e.g., dragging) cannot22. Similarly, although many types of objects can be decoded from fMRI responses, general categories appear to be more difficult. Brain decoding is not limited to humans, providing a powerful tool to understand how information is organized in the brains of other species. Analogous fMRI experiments with nonhuman primates have found distinct representations in the temporal lobe for dimensions of animacy and faciness/bodiness, which parallels that in humans23.
As a first step toward understanding dogs' representations of naturalistic visual stimuli, awake fMRI was used in two highly MRI-adept domestic dogs to measure cortical responses to dog-appropriate videos. In this study, naturalistic videos were used because of their potentially greater ecological validity to a dog and because of their demonstrated success with neural nets that map video content to dog movement24. Over three separate sessions, 90 min of fMRI data were obtained from each dog's responses to 256 unique video clips. For comparison, the same procedure was performed on two human volunteers. Then, using a neural network, we trained and tested classifiers to discriminate either "objects" (e.g., human, dog, car) or "actions" (e.g., talking, eating, sniffing) using varying numbers of classes. The goals of this study were two-fold: 1) determine whether naturalistic video stimuli could be decoded from the dog cortex; and 2) if so, provide a first look into whether the organization was similar to that of humans.
The dog study was approved by the Emory University IACUC (PROTO201700572), and all owners gave written consent for their dog's participation in the study. Human study procedures were approved by the Emory University IRB, and all participants provided written consent before scanning (IRB00069592).
1. Participants
2. Stimuli
3. Experimental design
4. Imaging
5. Stimulus labels
6. fMRI preprocessing
7. Analyses
The most common metrics to assess model performance in machine learning analyses include precision, accuracy, recall, and F1 score. Accuracy is the overall percentage of model predictions that are correct, given the true data. Precision is the percentage of the model's positive predictions that are actually positive (i.e., the true positive rate), while recall is the percentage of true positives in the original data that the model is able to successfully predict. F1 score is the weighted average of precision and recall and acts as an alternate measure of accuracy that is more robust to class imbalance. However, the Ivis differs from other commonly used machine learning algorithms in that its output is not binary. Given a particular input of brain voxels, each output element represents the probabilities corresponding to each of the classes. Computing accuracy, precision, recall, and F1 for these outputs required binarizing them in a "winner takes all" fashion, where the class with the highest probability was considered the one predicted for that volume. This approach eliminated important information about the ranking of these probabilities that was relevant to assessing the quality of the model. Thus, while we still computed these traditional metrics, we used the Label Ranking Average Precision (LRAP) score as the primary metric to compute the accuracy of the model on the test set. This metric essentially measures to what extent the classifier assigned higher probabilities to true labels37.
To different degrees, the neural net classifier was successful for both humans and dogs. For humans, the algorithm was able to classify both objects and actions, with three-class models for both achieving a mean accuracy of 70%. The LRAP score was used as the primary metric to compute the accuracy of the model on the test set; this metric measures the extent to which the classifier assigned higher probabilities to true labels37. For both humans, the median LRAP scores were greater than the 99th percentile of a randomly permuted set of labels for all models tested (Table 1; Figure 2). For dogs, only the action model had a median LRAP percentile rank significantly greater than chance in both participants (Table 1; p = 0.13 for objects and p < 0.001 for actions; mean three-class action model LRAP score for dogs = 78th percentile). These results were true for all subjects individually, as well as when grouped by species.
Given the classifier's success, we trained and tested with additional classes to determine the limits of the model. This included computing dissimilarity matrices for the entire 52 potential classes of interest using the Python package scipy's hierarchical clustering algorithm, which clustered classes based on the similarity of an individual's brain response to each, as defined by pairwise correlation. Of the additional models tested, the model with the highest median LRAP percentile ranking in both dogs had five classes: the original "talking", "eating", and "sniffing", plus two new classes, "petting" and "playing" (Figure 2). This model had a median LRAP percentile rank significantly greater than that predicted by chance for all participants (Table 1; p < 0.001 for both dogs and humans; mean five-class action model LRAP score for dogs = 81st percentile).
When back-mapped to their respective brain atlases, the feature importance scores of voxels revealed a number of clusters of informative voxels in the occipital, parietal, and temporal cortices of both dogs and humans (Figure 3). In humans, the object-based and action-based models revealed a more focal pattern than in the dogs and in regions typically associated with object recognition, although with slight differences in the spatial location of object-based voxels and action-based voxels.
We checked that these species differences were not a result of the task-correlated motion of the dogs moving more to some types of videos than others (e.g., videos other than dogs, say, cars). We calculated the Euclidean norm of the six motion parameters and fit a linear mixed-effects model using the R package lme4, with class as a fixed effect and run number as a random effect for each dog. For each of the final models, we found no significant effect of class type on motion for either Daisy (F(2, 2252) = 0.83, p = 0.44 for object-based and F(4, 1235) = 1.87, p = 0.11 for action-based) or Bhubo (F(2, 2231) = 1.71, p = 0.18 for object-based and F(4, 1221) = 0.94, p = 0.45 for action-based).
Figure 1: Naturalistic videos and presentation in MRI bore. (A) Example frames from video clips shown to the participants. (B) Bhubo, a 4-year-old Boxer-mix, watching videos while undergoing awake fMRI. Please click here to view a larger version of this figure.
Figure 2: Model performance in dogs and humans. The distribution of LRAP scores, presented as percentile rankings of their null distributions, over 100 iterations of training and testing the Ivis machine learning algorithm for a three-class object-based model, a three-class action-based model, and a five-class action-based model, where models attempted to classify BOLD responses to naturalistic video stimuli obtained via awake fMRI in dogs and humans. Scores are aggregated by species. An LRAP score with a very high percentile ranking indicates that the model would be very unlikely to achieve that LRAP score by chance. A model performing no better than chance would have a median LRAP score percentile ranking of ~50. Dashed lines represent the median LRAP score percentile ranking for each species across all 100 runs. Please click here to view a larger version of this figure.
Figure 3: Regions important for the discrimination of three-class object and five-class action models. (A) Human and (B) dog participants. Voxels were ranked according to their feature importance using a random forest classifier, averaged across all iterations of the models. The top 5% of voxels (i.e., those used to train models) are presented here, aggregated by species and transformed to group space for visualization purposes (atlases: humans34 and dogs35). Labels show dog brain regions with high feature importance scores, based on those identified by Johnson et al.35. Abbreviation: SSM = the suprasylvian gyrus. Please click here to view a larger version of this figure.
Model Type | Training Accuracy | Test Accuracy | F1 Score | Precision | Recall | LRAP score median percentile | |
Human 1 | Object (3 class) | 0.98 | 0.69 | 0.48 | 0.52 | 0.49 | >99 |
Action (3 class) | 0.98 | 0.72 | 0.51 | 0.54 | 0.54 | >99 | |
Action (5 class) | 0.97 | 0.51 | 0.28 | 0.37 | 0.27 | >99 | |
Human 2 | Object (3 class) | 0.98 | 0.68 | 0.45 | 0.5 | 0.47 | >99 |
Action (3 class) | 0.98 | 0.69 | 0.46 | 0.5 | 0.48 | >99 | |
Action (5 class) | 0.97 | 0.53 | 0.3 | 0.4 | 0.27 | >99 | |
Bhubo | Object (3 class) | 0.99 | 0.61 | 0.38 | 0.41 | 0.39 | 57 |
Action (3 class) | 0.98 | 0.63 | 0.38 | 0.4 | 0.4 | 87 | |
Action (5 class) | 0.99 | 0.45 | 0.16 | 0.29 | 0.13 | 88 | |
Daisy | Object (3 class) | 1 | 0.61 | 0.38 | 0.43 | 0.39 | 43 |
Action (3 class) | 0.97 | 0.62 | 0.35 | 0.38 | 0.35 | 60 | |
Action (5 class) | 0.99 | 0.44 | 0.16 | 0.27 | 0.13 | 76 |
Table 1: Aggregated metrics of the Ivis machine learning algorithm over 100 iterations of training and testing on BOLD responses to naturalistic video stimuli obtained via awake fMRI in dogs and humans. The object models had three target classes ("dog", "human", "car"), and the action models had either three or five classes (three class: "talking", "eating", "sniffing"; five class: "talking", "eating", "sniffing", "petting", "playing"). Values significantly greater than chance are shown in bold.
Supplementary Table 1: Class labels. Please click here to download this File.
Supplementary Movie 1: Sample video clip. Please click here to download this File.
The results of this study demonstrate that naturalistic videos induce representations in dogs' brains that are stable enough over multiple imaging sessions that they can be decoded with fMRI-similar to results obtained in both humans and monkeys20,23. While previous fMRI studies of the canine visual system have presented stripped-down stimuli, such as a face or object against a neutral background, the results here demonstrate that naturalistic videos, with multiple people and objects interacting with each other, induce activation patterns in the dog cortex that can be decoded with a reliability approaching that seen in the human cortex. This approach opens up new avenues of investigation for how the dog's visual system is organized.
Although the field of canine fMRI has grown rapidly, to date, these experiments have relied on relatively impoverished stimuli, such as pictures of people or objects against neutral backgrounds10,12,13. Additionally, while these experiments have begun to identify brain regions analogous to the primate fusiform face area (FFA), involved in face processing, and the lateral occipital cortex (LOC), for object processing, there remains disagreement over the nature of these representations, such as whether dogs have face areas per se responding to similar salient features as primates or whether they have separate representations for dogs and humans or faces and heads, for example9,13. Dogs, of course, are not primates, and we do not know how they interpret these artificial stimuli divorced from their usual multisensory contexts with sounds and smells. Some evidence suggests that dogs do not treat images of objects as representations of real things12. Although it is not possible to create a true multisensory experience in the scanner, the use of naturalistic videos may mitigate some of the artificialness by providing dynamic stimuli that more closely match the real world, at least to a dog. For the same reasons, the use of naturalistic stimuli in human fMRI research has gained popularity, demonstrating, for example, that sequences of events in a movie are represented in the cortex across multiple time scales and that movies are effective at inducing reliable emotion activation38. As such, while naturalistic videos do remain relatively impoverished stimuli, their success in human neuroscience begs the question of whether similar results can be obtained in dogs.
Our results show that a neural net classifier was successful in decoding some types of naturalistic content from dog brains. This success is an impressive feat given the complexity of the stimuli. Importantly, because the classifier was tested on unseen video clips, the decoding model picked up broad categories that were identifiable across clips rather than properties specific to individual scenes. We should note there are multiple metrics for quantifying the performance of a machine learning classifier (Table 1). As naturalistic videos, by their nature, will not have equal occurrences of all classes, we took a prudent approach by constructing a null distribution from the random permutation of labels and assessing the significance referenced to that. Then, we found that the success of the dog models was statistically significant, achieving 75th-90th percentile scores, but only when the videos were coded based on the actions present, such as playing or talking.
The test sets, unlike the training sets, were not balanced across classes. Comprising only 20% of the data, undersampling to the smallest class size would have resulted in very small sample sizes for each class, such that any statistics calculated would have been unreliable. To avoid the possibility of inflated accuracy from this imbalance, the null distribution of the LRAP was computed by randomly permuting the order of the classes 1,000 times for each model iteration. This null distribution acted as a reference for how well the model was likely to perform by chance. Then, the true LRAP was then converted to a percentile ranking in this null distribution. A very high percentile ranking, for example, 95%, would indicate that a score that high arose only 5% of the time in 1,000 random permutations. Such a model could, therefore, be deemed to be performing well above chance. To determine if these percentile rankings are significantly greater than that expected by chance-that is, the 50th percentile-statistically, the median LRAP percentile ranking across all 100 iterations for each model was calculated and a one-sample Wilcoxon signed rank test was performed.
Although the primary goal was to develop a decoder of naturalistic visual stimuli for dogs, comparisons to humans are unavoidable. Here, we note two major differences: for each type of classifier, the human models performed better than the dog models; and the human models performed well for both object- and action-based models, while the dog models performed for action-based only. The superior performance of the human models could be due to several factors. Human brains are roughly 10 times larger than dog brains, so there are more voxels from which to choose to build a classifier. To put the models on equal footing, one should use the same number of voxels, but this could be in either an absolute or relative sense. Although the final model was based on the top 5% of informative voxels in each brain (a relative measure), similar results were obtained using a fixed number of voxels. Thus, it seems more likely that performance differences are related to how humans and dogs perceive video stimuli. As noted above, while dogs and humans are both multisensory in their perception, the stimuli may be more impoverished to a dog than a human. Size cues, for example, may be lost, with everything appearing to be a toy version of the real world. There is some evidence that dogs categorize objects based on size and texture before shape, which is almost opposite to humans39. Additionally, scent, not considered here, is likely an important source of information for object discrimination in dogs, particularly in the identification of conspecifics or humans40,41,42. However, even in the absence of size or scent cues, in the unusual environment of the MRI scanner, the fact that the classifier worked at all says that there was still information relevant to the dogs that could be recovered from their brains. With only two dogs and two humans, the species differences could also be due to individual differences. The two dogs, however, represented the best of the MRI-trained dogs and excelled at holding still while watching videos. While a larger sample size would certainly allow more reliable distinctions to be drawn between species, the small number of dogs that are capable of doing awake fMRI and who will watch videos for periods long enough will always limit generalizability to all dogs. While it is possible that specialized breeds, like sighthounds, might have more finely tuned visual brain responses, we believe that individual temperament and training are more likely to be the major determinants of what is recoverable from a dog's brain.
These species differences raise the question of what aspect of the videos the dogs were paying attention to. One approach to answering this question relies on simpler video stimuli. Then, by using isolated images of, say, humans, dogs, and cars, both individually and together against neutral backgrounds, we might be able to reverse engineer the salient dimensions to a dog. However, this is both methodologically inefficient and further impoverishes the stimuli from the real world. The question of attention can be solved by the decoding approach alone, in effect, using the model performance to determine what is being attended to43. Along these lines, the results here suggest that, while the humans attended to both the actors and the actions, the dogs were more focused on the actions themselves. This might be due to differences in low-level motion features, such as the movement frequency when individuals are playing versus eating, or it might be due to a categorical representation of these activities at a higher level. The distribution of informative voxels throughout the dog's cortex suggests that these representations are not just low-level features that would otherwise be confined to visual regions. Further study using a wider variety of video stimuli may illuminate the role of motion in category discrimination by dogs.
In summary, this study has demonstrated the feasibility of recovering naturalistic visual information from the dog cortex using fMRI in the same way that is done for the human cortex. This demonstration shows that, even without sound or smells, salient dimensions of complex scenes are encoded by dogs watching videos and that these dimensions can be recovered from their brains. Secondly, based on the small number of dogs that can do this type of task, the information may be more widely distributed in the cortex than typically seen in humans, and the types of actions seem to be more easily recovered than the identity of the actors or objects. These results open up a new way of examining how dogs perceive the environments they share with humans, including video screens, and suggest rich avenues for future exploration of how they and other non-primate animals "see" the world.
The authors have nothing to disclose.
We thank Kate Revill, Raveena Chhibber, and Jon King for their helpful insights in the development of this analysis, Mark Spivak for his assistance recruiting and training dogs for MRI, and Phyllis Guo for her help in video creation and labeling. We also thank our dedicated dog owners, Rebecca Beasley (Daisy) and Ashwin Sakhardande (Bhubo). The human studies were supported by a grant from the National Eye Institute (Grant R01 EY029724 to D.D.D.).
3 T MRI Scanner | Siemens | Trio | |
Audio recordings of scanner noise | homemade | none | |
Camera gimbal | Hohem | iSteady PRO 3 | |
Dog-appropriate videos | homemade | none | |
fMRI processing software | AFNI | 20.3.01 | |
Mock scanner components | homemade | none | Mock head coil and scanner tube |
Neural net software | Ivis | 1.7.1 | |
Optical flow software | OpenCV | 4.2.0.34 | |
Projection system for scanner | homemade | none | |
Trophy Cam HD | Bushnell | 119874 | |
Video camera | GoPro | HERO7 | |
Visualization software | ITK-SNAP | 3.6.0 | |
Windows Video Editor | Microsoft | Windows 11 version |
.