This article presents a method for estimating same-day P300 speller Brain-Computer Interface (BCI) accuracy using a small testing dataset.
Performance estimation is a necessary step in the development and validation of BCI systems. Unfortunately, even modern BCI systems are slow, making the collection of sufficient validation data a time-consuming task for end users and experimenters alike. Yet without sufficient data, random variation in performance can lead to false inferences about how well a BCI is working for a particular user. For example, P300 spellers commonly operate at around 1-5 characters per minute, so estimating accuracy with 5% resolution requires 20 characters (4-20 min). Despite this time investment, the confidence bounds for accuracy estimated from 20 characters can be as wide as ±23%, depending on the observed accuracy. A previously published method, Classifier-Based Latency Estimation (CBLE), was shown to be highly correlated with BCI accuracy. This work presents a protocol for using CBLE to predict a user's P300 speller accuracy from relatively few characters (~3-8) of typing data. The resulting confidence bounds are tighter than those produced by traditional methods. The method can thus be used to estimate BCI performance more quickly, more accurately, or both.
Brain-computer interfaces (BCIs) are a noninvasive technology that allows individuals to communicate directly through machines, bypassing the physical limitations imposed by the body. A BCI can serve as an assistive device operated directly by the brain. A BCI uses the brain activity of the user to determine whether the user intends to choose a certain key (letter, number, or symbol) displayed on the screen1. In a typical computer system, a user physically presses the intended key on a keyboard. In a BCI system with a visual display, however, the user only needs to focus on the desired key; the BCI then selects the intended key by analyzing the measured brain signals1. Brain activity can be measured using various techniques. Though there are competing BCI technologies, electroencephalography (EEG) is considered a leading technique due to its noninvasive nature, high temporal resolution, reliability, and relatively low cost2.
Applications of BCI include communication, device control, and entertainment3,4,5,6. One of the most active BCI application areas is the P300 speller, introduced by Farwell and Donchin7. The P300 is an event-related potential (ERP) produced in response to the recognition of a rare but relevant stimulus8. When a person recognizes their target stimulus, they automatically produce a P300. The P300 is an effective signal for a BCI because it conveys the participant's recognition of the target event without requiring an outward response9.
The P300 BCI has attracted researchers from computer science, electrical engineering, psychology, human factors, and various other disciplines. Advances have been made in signal processing, classification algorithms, user interfaces, stimulation schemes, and many other areas10,11,12,13,14,15. Regardless of the research area, however, the common thread in all of this work is the necessity of measuring BCI system performance. This task typically requires generating a test dataset. The necessity is not limited to research; eventual clinical application as an assistive technology will likely require an individual validation set for each end user to ensure the system can provide reliable communication.
Despite the considerable research applied to the P300 BCI, the systems are still quite slow. While the majority of people are able to use a P300 BCI16, most P300 spellers produce text on the order of 1-5 characters per minute. Unfortunately, this slow speed means that generating test datasets requires substantial time and effort from participants, experimenters, and eventual end users. Measuring BCI system accuracy is a binomial parameter estimation problem, and many characters of data are necessary for a good estimate.
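To illustrate how wide these binomial bounds can be, the exact (Clopper-Pearson) confidence interval can be computed directly. The sketch below is illustrative only, not part of the provided GUI, and assumes SciPy is available:

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) binomial confidence interval for
    k correct selections out of n attempted characters."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# 10 of 20 characters correct: the 95% interval is roughly (0.27, 0.73),
# i.e., about +/-23% around the observed 50% accuracy.
lo, hi = clopper_pearson(10, 20)
print(lo, hi)
```

At 20 characters and 50% observed accuracy, the interval width matches the ±23% bound quoted in the abstract.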
To estimate the presence or absence of the P300 ERP, most classifiers use a binary classification model, which assigns a binary label (e.g., "presence" or "absence") to each trial or epoch of EEG data. The general equation used by most classifiers can be expressed as:

$y = f(\mathbf{w}^T \mathbf{x} + b)$

where $y$ is the classifier's score, which represents the probability of the P300 response being present, $\mathbf{x}$ is the feature vector extracted from the EEG signal, $\mathbf{w}$ is the weight vector, and $b$ is a bias term17. The function $f$ is a decision function that maps the input data to the output label and is learned from a set of labeled training data using a supervised learning algorithm17. During training, the classifier is trained on a labeled dataset of EEG signals, where each signal is labeled as either having a P300 response or not. The weight vector and bias term are optimized to minimize the error between the predicted output of the classifier and the true label of the EEG signal. Once the classifier is trained, it can be used to predict the presence of the P300 response in new EEG signals.
Different classifiers use different decision functions, such as linear discriminant analysis (LDA), stepwise linear discriminant analysis (SWLDA), least squares (LS), logistic regression, support vector machines (SVMs), or neural networks (NNs). The least-squares classifier is a linear classifier that minimizes the sum of squared errors between the predicted and true class labels. It predicts the class label of a new test sample using the following equation:
$\hat{y} = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$ (1)
where the sign function returns +1 if the product $\mathbf{w}^T \mathbf{x}$ is positive and -1 if it is negative, and the weight vector $\mathbf{w}$ is obtained from the training feature matrix $X$ and class labels $y$ using the equation below:
$\mathbf{w} = (X^T X)^{-1} X^T y$ (2)
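Equations (1) and (2) can be illustrated with a short NumPy sketch on synthetic data; the random feature matrix stands in for downsampled EEG epochs and is not the article's MATLAB implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix X (trials x features) and
# labels y in {-1, +1}; real features would be downsampled EEG epochs.
n, d = 200, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true + 0.5 * rng.normal(size=n))

# Equation (2): w = (X^T X)^-1 X^T y, solved without an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Equation (1): predicted label = sign(w^T x).
y_hat = np.sign(X @ w)
print("training accuracy:", np.mean(y_hat == y))
```

Solving the normal equations with `np.linalg.solve` rather than forming the matrix inverse is the standard, numerically preferable way to evaluate equation (2).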
In earlier research, we argued that Classifier-Based Latency Estimation (CBLE) can be used to estimate BCI accuracy17,18,19. CBLE is a strategy for evaluating latency variation by exploiting the classifier's temporal sensitivity18. While the conventional approach to P300 classification uses a single time window synchronized with each stimulus presentation, CBLE creates multiple time-shifted copies of the post-stimulus epochs and detects the time shift that yields the maximum classifier score, thereby estimating the latency of the P300 response17,18. Here, this work presents a protocol that estimates BCI performance from a small dataset using the variance of the CBLE estimates (vCBLE). As a representative analysis, the number of characters is varied to make predictions of the overall performance of an individual. For both example datasets, the root mean square error (RMSE) was computed for both the vCBLE-based prediction and the direct accuracy estimate. The results indicate that the RMSE of the vCBLE-based predictions was consistently lower than that of accuracy estimates derived from 1 to 7 tested characters.
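The core CBLE idea, scoring time-shifted copies of an epoch and taking the shift that maximizes the classifier score as the latency estimate, can be sketched as follows. The Gaussian template, sampling rate, and circular shifting via `np.roll` are illustrative simplifications, not the published implementation:

```python
import numpy as np

fs = 256                      # sampling rate (Hz); illustrative
t = np.arange(fs) / fs        # 1 s epoch
template = np.exp(-0.5 * ((t - 0.3) / 0.05) ** 2)  # idealized P300 peaking at 300 ms

def cble_latency(epoch, w, shifts, fs):
    """Score time-shifted copies of an epoch with a linear classifier w and
    return the shift (in ms) that maximizes the score -- the CBLE estimate."""
    scores = [np.roll(epoch, -s) @ w for s in shifts]
    return 1000.0 * shifts[int(np.argmax(scores))] / fs

# A response delayed by ~50 ms relative to the classifier's template:
delay = int(0.05 * fs)        # 12 samples
epoch = np.roll(template, delay)
shifts = range(-32, 33)
print(cble_latency(epoch, template, shifts, fs))  # -> 46.875 (= 12/256 s)
```

The estimator recovers the imposed delay to within one sample; in practice, the variance of these per-epoch estimates over a session yields vCBLE.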
We developed a Graphical User Interface (GUI) called "CBLE Performance Estimation" to implement the proposed methodology. Example code that runs on the MATLAB platform is also provided (Supplementary Coding File 1). The example code performs all of the steps applied in the GUI; the individual steps are provided to assist the reader in adapting the method to a new dataset. This code employs a publicly available dataset, "Brain Invaders calibration-less P300-based BCI using dry EEG electrodes Dataset (bi2014a)," to evaluate the proposed method20. Participants played up to three game sessions of Brain Invaders, each session consisting of 9 levels of the game. Data collection continued until all levels were completed or the participant lost all control over the BCI system. The Brain Invaders interface included 36 symbols that flashed in 12 groups of six aliens. According to the Brain Invaders P300 paradigm, a repetition consisted of 12 flashes, one for each group. Of these 12 flashes, two contained the Target symbol (Target flashes), while the remaining 10 did not (non-Target flashes). More information on this paradigm can be found in the original reference20.
The CBLE approach was also implemented on a Michigan dataset, which contained data from 40 participants18,19. The data of eight participants had to be discarded because their tasks were incomplete. The study required three visits from each participant. On the first day, each participant typed a 19-character training sentence, followed by three 23-character testing sentences on Days 1, 2, and 3. In this example, the keyboard included 36 characters grouped into six rows and six columns. Each row or column flashed for 31.25 ms, with an interval of 125 ms between flashes. Between characters, a 3.5 s pause was provided.
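As a concrete example of the timing parameters above and in Table 1, converting an epoch window from milliseconds to (decimated) sample indices around a flash might look like this; the function name and truncation-based rounding are illustrative choices, not taken from the provided code:

```python
import numpy as np

fs = 256              # Michigan sampling rate (Hz), from Table 1
decim = 13            # decimation factor, from Table 1
win_ms = (-100, 900)  # CBLE epoch window relative to each flash (ms)

def epoch_indices(stim_sample, win_ms, fs, decim):
    """Sample indices (keeping every `decim`-th sample) of the epoch
    around one flash onset; truncation toward zero is a simplification."""
    start = stim_sample + int(win_ms[0] * fs / 1000)
    stop = stim_sample + int(win_ms[1] * fs / 1000)
    return np.arange(start, stop, decim)

idx = epoch_indices(stim_sample=1000, win_ms=win_ms, fs=fs, decim=decim)
print(idx[0], idx[-1], len(idx))  # -> 975 1222 20
```

The decimated epoch (20 samples per channel here) forms the feature vector passed to the classifier of equations (1) and (2).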
Figure 1 shows the block diagram of the proposed method. The detailed procedure is described in the protocol section.
The "CBLE Performance Estimation" GUI was applied to two datasets: the "BrainInvaders" dataset and the Michigan dataset. For the "BrainInvaders" dataset, data collection was approved by the Ethical Committee of the University of Grenoble Alpes20. Michigan data were collected under University of Michigan Institutional Review Board approval19. Data were analyzed under Kansas State University exempt protocol 7516. If collecting new data, follow your institution's IRB-approved process for obtaining informed consent. Here, the proposed protocol was evaluated using offline analysis of previously recorded, de-identified data and therefore did not require additional informed consent.
The graphical user interface (GUI) included in this article can handle two distinct dataset formats. The first format is associated with the BCI2000 software, while the second is the "BrainInvaders" dataset format. To use the "Brain Invaders" format, data must be pre-processed as described in step 1 of the protocol section. When dealing with the "BCI2000" dataset format, step 1 can be omitted.
1. Data preparation
2. Downloading and installing the GUI package
3. Storing the dataset in a subfolder of the GUI location
4. Opening the installed GUI
5. Choosing the dataset format
6. Loading the EEG data file
7. Setting the parameters
8. BrainInvaders only: Splitting the dataset into training and test set
9. Training a model with the training dataset
NOTE: Step 9.1 is applicable for "Brain Invaders" format, and step 9.2 is applicable for "BCI2000" format.
10. Predicting the accuracy of the test set
11. Getting X-target accuracies
12. Calculating vCBLE
13. Calculating the root mean square error (RMSE) of BCI accuracy and vCBLE
14. Visualizing the analysis results
15. Predicting the performance of an individual participant
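The RMSE computation in step 13 compares predicted and actual accuracies across participants; a minimal sketch with hypothetical accuracy values (not taken from either dataset) is:

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error between predicted and actual BCI accuracies."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical per-participant values: accuracy predicted from the vCBLE
# fit vs. the accuracy actually observed.
pred = [0.90, 0.75, 0.60, 0.95]
true = [0.85, 0.80, 0.65, 0.90]
print(rmse(pred, true))  # approximately 0.05
```

The same function applies whether the predictions come from the vCBLE fit or from a small-sample accuracy estimate, which is how the two approaches are compared in the figures below.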
The proposed protocol has been tested on two different datasets: "BrainInvaders" and the Michigan dataset. These datasets were introduced briefly in the Introduction section. The parameters used for the two datasets are listed in Table 1. Figures 2–4 depict the findings obtained using the "BrainInvaders" dataset, whereas Figures 5–7 show the results obtained from the Michigan dataset.
The "BrainInvaders" dataset has 64 participants. Figure 2 presents the relationship between BCI accuracy and vCBLE for all 64 participants. It shows that vCBLE is highly negatively correlated with BCI accuracy, although a few outliers are observed. Figure 3 illustrates the RMSE of vCBLE and actual accuracy when the prediction was made based on the accuracy of testing characters. It shows that the RMSE of the prediction based on the vCBLE fit was lower than that of the accuracy estimate based on any number of testing characters from 1 to 10. For the "BrainInvaders" dataset, vCBLE is thus capable of predicting BCI accuracy using only 7 characters. In Figure 4, the prediction was made from the vCBLE of test sets with 2, 5, 7, and 10 characters, respectively. Here, a leave-one-participant-out approach was employed in the regression analysis to predict the accuracy of each individual participant. BCI accuracy and vCBLE were estimated over 100 repetitions. The lower and upper bounds are ±2 standard deviations from the mean. All four conditions indicate that minimal variance is observed when the number of participants in the training set exceeds 10. It was concluded that about 10 individuals are required to build the regression model relating vCBLE to accuracy for a particular experimental paradigm.
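The leave-one-participant-out regression described above can be sketched as follows; the synthetic vCBLE/accuracy pairs and the use of a first-order polynomial fit (`np.polyfit`) are illustrative assumptions, not the GUI's actual fitting routine:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: vCBLE is negatively correlated with accuracy,
# mimicking the relationship in Figure 2 (values are not real data).
n_participants = 20
vcble = rng.uniform(0.0, 1.0, n_participants)
accuracy = np.clip(1.0 - 0.8 * vcble + 0.05 * rng.normal(size=n_participants), 0.0, 1.0)

preds = np.empty(n_participants)
for i in range(n_participants):
    mask = np.arange(n_participants) != i           # hold out participant i
    slope, intercept = np.polyfit(vcble[mask], accuracy[mask], 1)
    preds[i] = slope * vcble[i] + intercept         # predict held-out accuracy

print("leave-one-out RMSE:", np.sqrt(np.mean((preds - accuracy) ** 2)))
```

Each participant's accuracy is predicted from a model fitted to the remaining participants, so the reported RMSE reflects generalization to an unseen individual.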
In the second example, the Michigan dataset has 32 participants, all of whom typed one training sentence on Day 1 and three testing sentences on Days 1, 2, and 3. The test sentences were 23 or 24 characters in length, and many participants made additional selections to correct errors made during online operation. Figure 5 shows that the vCBLE model performed better when the training and testing datasets were collected on the same day. In fact, the prediction from the vCBLE fit resulted in a lower RMSE than the accuracy estimate based on any number of testing characters from 1-20 when the training and testing data were collected on the same day. Figure 6 shows that, overall, the vCBLE fit had a lower RMSE when the test included fewer than six characters. Additionally, Figure 7 shows that the RMSE of the vCBLE accuracy estimate decreases by only about 0.025 between three characters and the optimal number of characters. This implies that there is little benefit to collecting more than three characters for the small test set.
Figure 1: Block diagram of the proposed protocol. (a) Data pre-processing and feature extraction. (b) P300 classification. (c) Evaluation of vCBLE. (d) Predicting the accuracy of an individual.
Figure 2: Accuracy vs. vCBLE. BCI accuracy plotted against vCBLE using the "bi2014a" dataset. It shows a high negative correlation between accuracy and vCBLE.
Figure 3: RMSE of BCI accuracy and vCBLE. The RMSE of vCBLE and accuracy plotted against different test dataset sizes (1-10) using the "bi2014a" dataset. Overall, vCBLE performs better than BCI accuracy.
Figure 4: Comparison of models using RMSE. Models were built with predictions made from test sets of different sizes. Top left: 2 targets; top right: 5 targets; bottom left: 7 targets; bottom right: 10 targets.
Figure 5: RMSE values of vCBLE models. A separate model was built to predict accuracy over three different days using the Michigan dataset. The RMSE values for models built using different test dataset sizes are shown.
Figure 6: Model comparison. The mean of the RMSE over three days was computed for the vCBLE and the accuracy models using Michigan data.
Figure 7: Mean RMSE difference from the best model. For each day, the minimum RMSE value was subtracted from each character's RMSE value. The mean was calculated over the three days. This graph represents the average performance of a model using a certain test dataset size compared to the best model.
Dataset name | Number of participants | ID length | Channel IDs | Sampling rate (Hz) | Decimation factor | Original window (ms) | CBLE window (ms) | Training samples | Target number, X |
Brain Invaders | 64 | N/A | [1:16] | 512 | 26 | [100, 600] | [0, 700] | 20 | 10 |
Michigan | 32 | 4 | [1:16] | 256 | 13 | [4, 804] | [-100, 900] | N/A | 20 |
Table 1: Standard parameters for the "BrainInvaders" and Michigan datasets.
Supplemental Coding File 1.
Supplemental Coding File 2.
This article outlined a method for estimating BCI accuracy using a small P300 dataset. Here, the current protocol was developed based on the "bi2014a" dataset, although the efficacy of the protocol was confirmed on two different datasets. To successfully implement this technique, it is crucial to establish certain variables, such as the epoch window for the original data, the window for time shifting, the down-sampling ratio, and the size of both the training and testing datasets. These variables are determined by the characteristics of the dataset being used, including the number of targets or characters, the number of sequences, and the total number of participants.
The findings from the "bi2014a" dataset indicate that vCBLE's prediction outperforms character-level BCI accuracy for all test conditions involving test datasets of one to ten characters. However, when the test dataset comprises more than seven targets, the performance of vCBLE shows minimal further change. Results from the Michigan data suggest that using vCBLE to predict same-day performance will outperform the accuracy-based estimate if the test dataset contains fewer than six characters. Interestingly, increasing the amount of data used to build this model yields only marginal improvement after the first few characters of data. Overall, this implies that it is not necessary to collect large amounts of data to predict same-day accuracy.
According to the outcomes of the "bi2014a" dataset, a minimum of 10 participants appears necessary to construct a regression model that can forecast an individual's BCI accuracy. However, this also depends on the number of characters or sequences used in both the training and testing phases. The "bi2014a" dataset includes several participants who had a relatively small number of total targets. It is worth mentioning that the vCBLE prediction method has already been successfully tested on small datasets of 32 and 9 participants, respectively, and has demonstrated effective performance17,18. However, those datasets had a relatively larger number of total targets, such as 19 characters in the training session and a minimum of 23 characters in the testing session.
There are a few limitations to be aware of when applying this method. From the analysis of the Michigan dataset, the vCBLE model seems to perform worse when the training and test data are collected on different days. Also, this method requires multiple participants to build a custom model for a given dataset. Moreover, the proposed method has been tested on four classifiers: a least-squares classifier, stepwise linear discriminant analysis, a support vector machine (SVM), and a sparse autoencoder (SAE)17,18. However, the protocol should be applicable to any time-sensitive classifier. Despite these limitations, the potential time savings to the research and clinical communities warrant further investigation and application.
The authors have nothing to disclose.
The data used for representative results were collected from the work supported by the National Institute of Child Health and Human Development (NICHD), the National Institutes of Health (NIH) under Grant R21HD054697, and the National Institute on Disability and Rehabilitation Research (NIDRR) in the Department of Education under Grant H133G090005 and Award Number H133P090008. The rest of the work was funded in part by the National Science Foundation (NSF) under award #1910526. Findings and opinions within this work do not necessarily reflect the positions of NICHD, NIH, NIDRR or NSF.