This paper proposes an artificial intelligence-based system to automatically detect whether students are paying attention to the class or are distracted. This system is designed to help teachers maintain students’ attention, optimize their lessons, and dynamically introduce modifications in order for them to be more engaging.
The attention level of students in a classroom can be improved through the use of Artificial Intelligence (AI) techniques. By automatically identifying the attention level, teachers can employ strategies to regain students’ focus. This can be achieved through various sources of information.
One source is the emotions reflected on students' faces: AI can detect emotions such as neutral, disgust, surprise, sadness, fear, happiness, and anger. Additionally, the direction of the students' gaze can potentially indicate their level of attention. Another source is the students' body posture. By using cameras and deep learning techniques, posture can be analyzed to determine the level of attention. For example, students who are slouching or resting their heads on their desks may have a lower level of attention. Smartwatches distributed to the students can provide biometric and other data, including heart rate and inertial measurements, which can also be used as indicators of attention. By combining these sources of information, an AI system can be trained to identify the level of attention in the classroom. However, integrating the different types of data poses a challenge that requires creating a labeled dataset; expert input and existing studies are consulted for accurate labeling. In this paper, we propose the integration of such measurements, the creation of a dataset, and a potential attention classifier. To provide feedback to the teacher, we explore various methods, such as smartwatches or computers. Once the teacher becomes aware of attention issues, they can adjust their teaching approach to re-engage and motivate the students. In summary, AI techniques can automatically identify the students' attention level by analyzing their emotions, gaze direction, body posture, and biometric data. This information can assist teachers in optimizing the teaching-learning process.
In modern educational settings, accurately assessing and maintaining students' attention is crucial for effective teaching and learning. However, traditional methods of gauging engagement, such as self-reporting or subjective teacher observations, are time-consuming and prone to biases. To address this challenge, Artificial Intelligence (AI) techniques have emerged as promising solutions for automated attention detection. One significant aspect of understanding students' engagement levels is emotion recognition1. AI systems can analyze facial expressions to identify emotions, such as neutral, disgust, surprise, sadness, fear, happiness, and anger2.
Gaze direction and body posture are also crucial indicators of students' attention3. By utilizing cameras and advanced machine learning algorithms, AI systems can accurately track where students are looking and analyze their body posture to detect signs of disinterest or fatigue4. Furthermore, incorporating biometric data enhances the accuracy and reliability of attention detection5. By collecting measurements, such as heart rate and blood oxygen saturation levels, through smartwatches worn by students, objective indicators of attention can be obtained, complementing other sources of information.
This paper proposes a system that evaluates an individual's level of attention using color cameras and other sensors. It combines emotion recognition, gaze direction analysis, body posture assessment, and biometric data to provide educators with a comprehensive set of tools for optimizing the teaching-learning process and improving student engagement, thereby enhancing the overall educational experience. By applying AI techniques, this data can even be evaluated automatically.
The main goal of this work is to describe the system that allows us to capture all the information and, once captured, to train an AI model that allows us to obtain the attention of the whole class in real-time. Although other works have already proposed capturing attention using visual or emotional information6, this work proposes the combined use of these techniques, which provides a holistic approach to allow the use of more complex and effective AI techniques. Moreover, the datasets hitherto available are limited to either a set of videos or one of biometric data. The literature includes no datasets that provide complete data with images of the student's face or their body, biometric data, data on the teacher's position, etc. With the system presented here, it is possible to capture this type of dataset.
The system associates a level of attention with each student at each point of time. This value is a probability value of attention between 0% and 100%, which can be interpreted as low level of attention (0%-40%), medium level of attention (40%-75%), and high level of attention (75%-100%). Throughout the text, this probability of attention is referred to as the level of attention, student attention, or whether students are distracted or not, but these are all related to the same output value of our system.
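As an illustration (not taken from the paper's implementation), the mapping from the output probability to the three qualitative levels can be expressed as a small helper function; the function name and exact boundary handling are assumptions:

```python
def attention_level(probability: float) -> str:
    """Map an attention probability (0-100) to a qualitative level.

    Thresholds follow the ranges described in the text:
    low (0%-40%), medium (40%-75%), and high (75%-100%).
    """
    if probability < 40:
        return "low"
    if probability < 75:
        return "medium"
    return "high"
```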
Over the years, the field of automatic engagement detection has grown significantly due to its potential to revolutionize education. Researchers have proposed various approaches for this area of study.
Ma et al.7 introduced a novel method based on a Neural Turing Machine for automatic engagement recognition. They extracted certain features, such as eye gaze, facial action units, head pose, and body pose, to create a comprehensive representation of engagement recognition.
EyeTab8, another innovative system, uses models to estimate where someone is looking with both eyes. It was specifically designed to run smoothly on a standard, unmodified tablet. The system harnesses well-known image processing and computer vision algorithms; its gaze estimation pipeline includes a Haar-like feature-based eye detector, as well as a RANSAC-based limbus ellipse fitting approach.
Sanghvi et al.9 propose an approach that relies on vision-based techniques to automatically extract expressive postural features from videos recorded from a lateral view, capturing the behavior of the children. An initial evaluation is conducted, involving the training of multiple recognition models using contextualized affective postural expressions. The results obtained demonstrate that patterns of postural behavior can effectively predict the engagement of the children with the robot.
In other works, such as Gupta et al.10, a deep learning-based method is employed to detect the real-time engagement of online learners by analyzing their facial expressions and classifying their emotions. The approach utilizes facial emotion recognition to calculate an engagement index (EI) that predicts two engagement states: engaged and disengaged. Various deep learning models, including Inception-V3, VGG19, and ResNet-50, are evaluated and compared to identify the most effective predictive classification model for real-time engagement detection.
In Altuwairqi et al.11, the researchers present a novel automatic multimodal approach for assessing student engagement levels in real-time. To ensure accurate and dependable measurements, the team integrated and analyzed three distinct modalities that capture students' behaviors: facial expressions for emotions, keyboard keystrokes, and mouse movements.
Guillén et al.12 propose the development of a monitoring system that uses electrocardiography (ECG) as a primary physiological signal to analyze and predict the presence or absence of cognitive attention in individuals while performing a task.
Alban et al.13 utilize a neural network (NN) to detect emotions by analyzing the heart rate (HR) and electrodermal activity (EDA) values of various participants in both time and frequency domains. They find that an increase in the root-mean-square of successive differences (RMSDD) and the standard deviation normal-to-normal (SDNN) intervals, coupled with a decrease in the average HR, indicate heightened activity in the sympathetic nervous system, which is associated with fear.
Kajiwara et al.14 propose an innovative system that employs wearable sensors and deep neural networks to forecast the level of emotion and engagement in workers. The system follows a three-step process. Initially, wearable sensors capture and collect data on behaviors and pulse waves. Subsequently, time series features are computed based on the behavioral and physiological data acquired. Finally, deep neural networks are used to input the time series features and make predictions on the individual's emotions and engagement levels.
In other research, such as Costante et al.15, an approach based on a novel transfer metric learning algorithm is proposed, which utilizes prior knowledge of a predefined set of gestures to enhance the recognition of user-defined gestures. This improvement is achieved with minimal reliance on additional training samples. Similarly, a sensor-based human activity recognition framework16 is presented to address the goal of the impersonal recognition of complex human activities. Signal data collected from wrist-worn sensors is utilized in the human activity recognition framework developed, employing four RNN-based DL models (Long-Short Term Memories, Bidirectional Long-Short Term Memories, Gated Recurrent Units, and Bidirectional Gated Recurrent Units) to investigate the activities performed by the user of the wearable device.
The following protocol follows the guidelines of the University of Alicante's human research ethics committee, with approved protocol number UA-2022-11-12. Informed consent was obtained from all participants for this experiment and for the use of their data.
1. Hardware, software, and class setup
Figure 1: Hardware and data pipeline. Data from the cameras and smartwatches are gathered and fed to the machine learning algorithms for processing. Please click here to view a larger version of this figure.
Figure 2: Position of the sensors, teacher, and students. Diagram showing the positions of the cameras, smartwatches, and GUI in the classroom with the teacher and students. Please click here to view a larger version of this figure.
2. Capture and data processing pipeline
NOTE: All these steps are performed automatically by processing software deployed in a server. The implementation used for the experiments in this work was written in Python 3.8.
Figure 3: Data captured by the smartwatch. The smartwatch provides gyroscope, accelerometer, heart rate, and light condition readings as streams of data. Please click here to view a larger version of this figure.
Figure 4: Examples of the categories considered by the activity recognition model. Four different actions are recognized by the activity recognition model: handwriting, typing on a keyboard, using a smartphone, and resting position. Please click here to view a larger version of this figure.
The target group of this study is undergraduate and master's students, and so the main age group is between 18 and 25 years old. This population was selected because they can handle electronic devices with fewer distractions than younger students. In total, the group included 25 people. This age group can provide the most reliable results to test the proposal.
The attention-level results shown to the teacher have two parts. Part A shows individual information about the current attention level of each student. Part B shows the average attention of the whole class and its temporal history throughout the lesson. This captures the general trend of the students' attention in the classroom and allows the teacher to adapt their methodology live. Every second, the interface requests new information from the server (see the sketch after Figure 5). Furthermore, this view incorporates browser notifications, so drastic changes in the students' attention are shown in a non-intrusive manner while the teacher performs their activities normally, without the need to keep this GUI in the foreground. An example of this GUI can be seen in Figure 5.
Figure 5: Graphical user interface of the system. The level of attention is shown in a GUI that can be accessed by any internet browser on any capable device, such as a tablet, smartphone, and desktop or laptop computer. Please click here to view a larger version of this figure.
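The paper does not reproduce the server code; as a minimal sketch, assuming a Flask-style HTTP endpoint and an in-memory store updated by the processing pipeline (both hypothetical), the per-student values and class average polled by the GUI every second could be served as follows:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical in-memory store updated by the processing pipeline:
# student ID -> latest attention probability (0-100).
latest_attention = {}

@app.route("/attention")
def attention():
    """Return per-student attention and the class average for the 1 s GUI poll."""
    values = list(latest_attention.values())
    average = sum(values) / len(values) if values else 0.0
    return jsonify({"students": latest_attention, "class_average": average})
```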
As for the activity recognition model, a recurrent neural network was defined that receives as input a sequence of 200 measurements with 6 values each: three values from the accelerometer and three from the gyroscope. The model has an LSTM layer with 64 units followed by a softmax-activated fully connected layer with four output neurons, one per category. The architecture is depicted in Figure 6.
Figure 6: Architecture of the activity classifier. As input, the model takes smartwatch data and processes it through an LSTM layer followed by a fully connected layer. The output is the probability of the sample depicting each activity. Please click here to view a larger version of this figure.
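For reference, the architecture described above can be reconstructed in a few lines of Keras. This is an illustrative sketch following the stated dimensions (200 time steps of 6 inertial values, an LSTM with 64 units, and a four-way softmax output); the framework, optimizer, and loss function are assumptions, as they are not specified in the text.

```python
import tensorflow as tf

# Illustrative reconstruction of the activity classifier: 200 time steps,
# each with 6 values (3 accelerometer + 3 gyroscope axes), an LSTM with
# 64 units, and a softmax layer over the four activity categories.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(200, 6)),
    tf.keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```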
As an output, the classifier returns the class corresponding to the estimated action being performed by the student. This neural network was trained using data captured from 6 different individuals, each recorded while performing actions from the four categories for 200 s. All the captured data were then duplicated by generating a mirrored dataset in which the X-axis values of the sensors were inverted; this is equivalent to collecting data from both the right and left hands of all individuals. This is a common data augmentation practice in machine learning, intended to generate more samples from the existing dataset and avoid overfitting.
Each 200 s recording (stored as one record per second, with 50 samples per second) is split into 4 s windows of 200 samples, matching the LSTM input, by moving the window one second at a time; this yields 197 windows per recording, as sketched below. In total, 6 people x 4 classes x 2 hands x 197 windows = 9,456 data inputs. The data was separated into 90% training and 10% validation, and the network was trained for 300 epochs with a batch size of 64.
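A minimal sketch of this data preparation, assuming each recording is stored as a NumPy array of shape (n_samples, 6) with columns [acc_x, acc_y, acc_z, gyr_x, gyr_y, gyr_z] (an assumed layout), could look as follows:

```python
import numpy as np

def mirror_x(recording: np.ndarray) -> np.ndarray:
    """Duplicate a recording by inverting the X-axis of the accelerometer
    and gyroscope, simulating the watch being worn on the other hand."""
    mirrored = recording.copy()
    mirrored[:, 0] *= -1  # accelerometer X
    mirrored[:, 3] *= -1  # gyroscope X
    return mirrored

def sliding_windows(recording: np.ndarray, rate_hz: int = 50,
                    window_s: int = 4, stride_s: int = 1) -> np.ndarray:
    """Split a recording into 4 s windows (200 samples) with a 1 s stride.

    A 200 s recording at 50 Hz yields (10,000 - 200) / 50 + 1 = 197 windows.
    """
    window, stride = window_s * rate_hz, stride_s * rate_hz
    starts = range(0, len(recording) - window + 1, stride)
    return np.stack([recording[s:s + window] for s in starts])
```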
As shown in Figure 7, the model was trained for 300 epochs. The validation loss was less than 0.1%, and the validation accuracy was 97%. The metrics obtained highlight the good performance of the model.
Figure 7: Training and validation losses and accuracies. Training and validation losses and accuracies show that the model performance is adequate and does not suffer from overfitting. Please click here to view a larger version of this figure.
Finally, the results of each subsystem (head pose, pose estimation, emotion prediction, and activity recognition) are merged into a boosting classifier that provides a probability value on whether or not the student is attentive to the lesson.
To establish a clear conceptual and procedural basis for accurate labeling, expert input was gathered and existing studies were consulted, as described below.
Regarding expert input, the Delphi method was chosen20,21,22, a method that is increasingly relevant in the technological field23. As pointed out in a previous publication, the Delphi method is defined as an iterative, group, and anonymous process to generate opinions on a topic and explore consensus among experts on that topic23. In the case presented here, 6 experts contributed for 2 weeks and 2 rounds of consultation, in concurrence with Khodyakov et al.24. Due to the importance of the profile of the participating experts, the consultation included academic specialists from universities in the fields of Psychology, Pedagogy, and Computer Science. A quantitative method was used to collect the data. The results have led to a consensus on the labeling used in this study.
With regard to the studies consulted as a basis for the labeling, we started with an exploratory search in the main databases, such as WOS and Scopus. The contributions of earlier studies25,26,27,28 are worth mentioning in this regard. All of them address the problem of attention from specific perspectives, but not in a holistic way through an intelligent system, as this study intends to do. There are also studies that combine two specific sources, such as Zaletelj et al.29, which focuses on facial and body features, but these fall short of a global approach such as the one taken here. One previous work stands out30, citing Posner's taxonomy, which is taken into account in this study. Posner considers attention as a set of isolatable neural systems (alertness, orienting, and executive control), which often work together to organize behavior30.
The boosting classifier is an ensemble algorithm that learns weights for each weak output of the classifier and generates a final value by means of a weighted combination of each individual decision. This information, as discussed in step 2.9, is presented in real-time via a web interface so that the teacher can notice drastic changes in the attention level of the class with browser notifications. With this visualization interface, which shows the real-time evolution of the students' overall attention level, teachers can adapt their classes to engage students in their lessons and get more out of the class.
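The concrete boosting implementation is not reproduced here; as a minimal sketch, assuming a scikit-learn GradientBoostingClassifier and a per-student feature vector concatenating the subsystem outputs (3 head angles, 18 body pose coordinates, 7 emotion probabilities, and 4 activity probabilities, for 32 features in total; this layout is assumed for illustration), the final stage could look as follows:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_attention_classifier(X: np.ndarray, y: np.ndarray) -> GradientBoostingClassifier:
    """Train the ensemble on labeled samples.

    X: (n_samples, 32) feature vectors built from the subsystem outputs.
    y: expert labels (0 = distracted, 1 = attentive).
    """
    clf = GradientBoostingClassifier()
    clf.fit(X, y)
    return clf

def attention_probability(clf: GradientBoostingClassifier, features: np.ndarray) -> float:
    """Return the probability (0-100%) that a student is attentive."""
    return float(clf.predict_proba(features.reshape(1, -1))[0, 1]) * 100
```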
Table 1 shows the dataset structure, which is composed of the following elements: Individual camera: one image per second at 960 x 720 pixels RGB; Zenithal cameras: two images per second at 1920 x 1080 pixels RGB; Gyroscope: 50 samples per second, each decomposed into 3 floating point values with 19 decimal places corresponding to the X, Y, Z coordinates, measuring angular velocity in °/s; Accelerometer: 50 samples per second, each decomposed into 3 floating point values with 19 decimal places corresponding to the X, Y, Z coordinates, measuring acceleration in m/s2; Rotation vector: 50 samples per second, each decomposed into a quaternion of 4 floating point values with 19 decimal places (with values between -1 and 1); Heart rate: one value per second measuring beats per minute; Light sensor: approximately 8-10 integer values per second measuring the light level; Head direction: for each image, 3 decimal numbers representing the X-axis (roll), Y-axis (pitch), and Z-axis (yaw), which indicate the tilt of the head; Body pose: for each image, 18 decimal numbers representing the X and Y coordinates of 9 key points.
Individual camera | Zenithal cameras | Gyroscope | Accelerometer | Rotation vector | Heart rate | Light condition | Head direction | Body pose |
960 x 720 pixels RGB image | 2 x (1920 x 1080 pixels) RGB image | 50 x 3 (XYZ) decimal numbers | 50 x 3 (XYZ) decimal numbers | 50 x quaternion | beats per minute | 10 x lumens | 3 (XYZ) decimal numbers | 9 x 2 (XY) decimal numbers |
Table 1: Structure of the dataset. The dataset contains the different data streams used for classification. All data come from the smartwatch biometric sensors and the images taken from the different cameras.
This work presents a system that measures the attention level of students in a classroom using cameras, smartwatches, and artificial intelligence algorithms. This information is subsequently presented to the teacher to give them an overview of the general state of the class.
One of the main critical steps of the protocol is the synchronization of the smartwatch information with the color camera images, as these have different frequencies. This was solved by deploying Raspberry Pi boards as servers that receive information from the smartwatches and cameras with their respective timestamps and perform a rough matching of this information, as sketched below. Finally, this information is sent to a centralized server for further processing. The other critical step of the protocol is the definition of the final classifier that generates the final inference from the data obtained from different sources. To resolve this point, the raw data must be preprocessed through different subsystems to generate valuable information such as the pose, emotion, or action of the user. This information, together with biometric data, is normalized and combined to estimate the attention level of each student.
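As an illustration of the rough timestamp matching performed on the Raspberry Pi servers (the actual implementation is not reproduced here, and the 0.5 s tolerance is an assumption), each camera frame could be paired with the closest smartwatch packet in time:

```python
def match_by_timestamp(frames, packets, tolerance_s: float = 0.5):
    """Pair each camera frame with the closest smartwatch packet in time.

    frames and packets are lists of (timestamp, data) tuples sorted by
    timestamp; pairs further apart than tolerance_s are discarded.
    """
    matched, i = [], 0
    for t_frame, frame in frames:
        # Advance through packets while the next one is at least as close.
        while i + 1 < len(packets) and \
                abs(packets[i + 1][0] - t_frame) <= abs(packets[i][0] - t_frame):
            i += 1
        t_packet, packet = packets[i]
        if abs(t_packet - t_frame) <= tolerance_s:
            matched.append((t_frame, frame, packet))
    return matched
```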
The experimental results suggest that head direction and body pose can be accurately estimated using the zenithal and individual cameras, while emotion recognition performs better on the most common emotions, such as happiness or neutrality. Regarding the smartwatch-based action classifier, highly distinguishable actions, such as writing, typing, or texting, present good detection accuracy and are suitable for the system.
This system offers several advantages and distinct features when compared to existing or alternative methods, as described below.
Objective and continuous measurements
Smartwatches and cameras provide objective and continuous measurements of students' attention throughout the class session. Traditional methods, such as self-reporting with questionnaires, can be subjective and prone to bias. The use of wearable devices and cameras eliminates reliance on self-reporting or external judgments, allowing educators to obtain more reliable and detailed data on attention levels.
Real-time information
The method captures attention data in real-time within the natural context of the classroom. It allows teachers to understand attention patterns during the course of a lesson. This is crucial as attention fluctuates during a class, and capturing real-time data allows for a more accurate assessment of attention dynamics.
Multimodal data integration
By combining data from smartwatches, which provide physiological measures, and cameras, which provide visual information, educators can gain a better understanding of attention. Physiological measures, such as movement patterns, can complement visual observations from cameras, providing a richer and more nuanced representation of attentional states. This multimodal approach increases the reliability and validity of attention assessment.
However, although this system has several advantages, there are some limitations to be considered, as described below.
Ethical and privacy concerns
Collecting physiological and visual data from students raises privacy concerns. It is essential to ensure that proper informed consent is obtained from participants and that data are anonymized and securely stored to protect the privacy and rights of the individuals involved. Unauthorized access or misuse of sensitive data must be prevented.
Reliability and validity of data
Although smartwatches and cameras can provide objective measurements, ensuring the reliability and validity of the data collected can pose challenges. Technical limitations of the devices, such as sensor accuracy or signal interference, can affect data quality. Calibration and validation procedures are necessary to establish the accuracy and consistency of the measurements.
Interpretation of attention signals
Interpreting attention-related signals obtained from smartwatches and cameras requires careful analysis. Attention is a complex cognitive process influenced by various factors, and physiological signals or visual information may not always directly correlate with attention levels, as they might not capture other important aspects of attention, such as cognitive engagement, selective focus, or mental effort.
Invasive or disruptive nature
Wearing devices or being observed by cameras may alter students' behavior or attention levels. Some students may feel self-conscious or uncomfortable, which may affect their natural attention patterns. It is important to consider the potential impact of the method itself on the attention being measured and to minimize any disruption to the learning environment. To reduce potential distractions, the use of the smartwatch was simplified: it is simply worn like a normal watch, and all other functions are disabled, so that it is not itself a distraction that can alter the learner's level of attention. The students receive a brief explanation of the experiment and its goals, giving them time to get accustomed to the setting.
Regarding head pose17, pose estimation18, and emotion prediction19, comprehensive and exhaustive experimentation is carried out in the corresponding articles. In summary, these three systems work properly, achieving high accuracy and resolving the corresponding tasks. However, despite the good performance, the approaches have some limitations. Head direction estimation tends to provide erroneous predictions when the head is heavily tilted in any direction: in that case, the landmark estimation performs poorly because many of the points of interest are not present in the image as a consequence of self-occlusion, which inevitably leads to poor performance. The same issue is present in the pose estimation method, as the approach is similar. If important key points are not visible within the input image, the pose estimation methods tend to infer them anyway, leading to erroneous and impossible poses. This effect can be mitigated by correctly placing the zenithal and individual cameras. As for emotion prediction, it accurately detects some facial expressions while struggling with others. For instance, neutral and happy emotions are consistently detected, while the system is more prone to fail on fear, disgust, and surprise if they are only slightly or fleetingly shown.
Regarding network interference, the specifications of the D-Link DSR-1000AC router indicate a maximum WLAN data transfer rate of 1300 Mbps. An average participant in the experiment transmits approximately 4.344 Mbit/s: three images per second at an average of 180 kB each, plus a 3 kB/s sensor data package from the smartwatch, gives 543 kB/s, or about 4.344 Mbit/s. Therefore, also accounting for the zenithal images, the theoretical maximum number of students that can be connected simultaneously is 297. However, this number will be influenced by the final number of devices connected to the router and by the congestion level of the WiFi channel at the time of the experiment.
Finally, the proposed method has several important implications and potential applications in various research areas, as described below.
Education and research
This method can provide valuable insights into the factors that affect students' attention and engagement in the classroom. Researchers can analyze the data collected from smartwatches and cameras to understand how different teaching methods, classroom environments, or even individual student characteristics influence attention levels.
Special education
This can be especially beneficial for studying attention-related challenges in students with special needs. Researchers can use data collected from smartwatches and cameras to identify patterns that lead to attention difficulties. This information can help develop targeted interventions and personalized strategies to support students with attention deficit disorders or other attention-related conditions.
The authors have nothing to disclose.
This work was developed with funding from Programa Prometeo, project ID CIPROM/2021/017. Prof. Rosabel Roig holds the UNESCO Chair "Education, Research and Digital Inclusion".
4 GPUs Nvidia A40 Ampere | NVIDIA | TCSA40M-PB | GPU for centralized model processing server |
FusionServer 2288H V5 | X-Fusion | 02311XBK | Platform that includes power supply and motherboard for centralized model processing server |
Memory Card Evo Plus 128 GB | Samsung | MB-MC128KA/EU | Memory card for the operation of the Raspberry Pi 4B 2GB. One for each Raspberry Pi. |
NEMIX RAM – 512 GB Kit DDR4-3200 PC4-25600 8Rx4 EC | NEMIX | M393AAG40M32-CAE | RAM for centralized model processing server |
Processor Intel Xeon Gold 6330 | Intel | CD8068904572101 | Processor for centralized model processing server |
Raspberry PI 4B 2GB | Raspberry | 1822095 | Local server that receives requests from the smartwatches and forwards them to the general server. One for every two students. |
Samsung Galaxy Watch 5 (40mm) | Samsung | SM-R900NZAAPHE | Smartwatch that monitors each student's activity. One for each student. |
Samsung MZQL23T8HCLS-00B7C PM9A3 3.84Tb Nvme U.2 PCI-Express-4 x4 2.5inch Ssd | Samsung | MZQL23T8HCLS-00B7C | Internal storage for centralized model processing server |
WebCam HD Pro C920 Webcam FullHD | Logitech | 960-001055 | Webcam HD. One for each student plus two for student poses. |