The study introduces a training-testing paradigm to investigate old/new effects in event-related potentials under confident and doubtful prosodic scenarios. The data reveal an enhanced late positive component between 400 and 850 ms at Pz and other electrodes. This pipeline can be used to explore factors beyond speech prosody and their influence on cue-binding target identification.
Recognizing familiar speakers from vocal streams is a fundamental aspect of human verbal communication. However, it remains unclear how listeners can still discern a speaker’s identity in expressive speech. This study develops a memorization-based approach to individual speaker identity recognition and an accompanying electroencephalogram (EEG) data analysis pipeline, which monitors how listeners recognize familiar speakers and tell unfamiliar ones apart. EEG data capture online cognitive processes during voice-based distinction between new and old speakers, offering a real-time measure of brain activity that overcomes the limits of reaction time and accuracy measurements. The paradigm comprises three steps: listeners establish associations between three voices and their names (training); listeners indicate the name corresponding to a voice from three candidates (checking); and listeners distinguish between three old and three new speaker voices in a two-alternative forced-choice task (testing). The speech prosody in testing was either confident or doubtful. EEG data were collected using a 64-channel EEG system, preprocessed, and imported into RStudio for ERP and statistical analysis and into MATLAB for brain topography. Results showed that an enlarged late positive component (LPC) was elicited in the old-talker compared to the new-talker condition in the 400-850 ms window at Pz and a wider range of other electrodes in both prosodies. However, the old/new effect was robust at central and posterior electrodes for doubtful prosody, whereas it was robust at anterior, central, and posterior electrodes for confident prosody. This study proposes that this experimental design can serve as a reference for investigating speaker-specific cue-binding effects in various scenarios (e.g., anaphoric expression) and in pathologies such as phonagnosia.
Human vocal streams are rich in information, such as emotion1,2, health status3,4, biological sex5, age6, and, more importantly, the individual vocal identity7,8. Studies have suggested that human listeners have a robust capacity to recognize and differentiate their peers’ identities through voices, overcoming within-speaker variations around the average-based representation of speaker identity in the acoustic space9. Such variations are brought about by acoustic manipulation (fundamental frequency and vocal tract length, i.e., F0 and VTL) that corresponds to no clear pragmatic intentions9, emotional prosodies10, and vocal confidence that conveys speakers’ feeling of knowing11. Behavioral experiments have focused on many factors that influence listeners’ performance in recognizing talkers, including language-related manipulations8,12,13, participant-related characteristics such as music experience or reading ability14,15, and stimulus-related adaptations like backward speech or nonwords16,17; more can be found in literature reviews18,19. A few recent experiments have investigated how individual variation in the speaker identity representation might undermine recognition accuracy, considering aspects including high versus low emotional expressiveness16 and neutral versus fearful prosodies5; more possible scenarios remain open for further investigation, as suggested by a review20.
The first research gap is that studies of the neural underpinnings of speaker identification have yet to fully explore how within-speaker variation challenges listeners’ brain activity. For example, in an fMRI-based speaker recognition task by Zäske et al., participants’ right posterior superior temporal gyrus (pSTG), right inferior/middle frontal gyrus (IFG/MFG), right medial frontal gyrus, and left caudate showed reduced activation when talkers were correctly identified as old versus new, regardless of whether the linguistic content was the same or different21. However, an earlier electroencephalography (EEG) study by Zäske et al. did not observe this old/new effect when speaker identity variation was introduced through different texts22. Specifically, a larger late positive component (LPC) ranging from 300 to 700 ms, detected at the Pz electrode when listeners encountered a familiar trained talker expressing the same text (i.e., hearing a replay with non-varied linguistic content), was absent when the talkers delivered new texts.
In support of the assertion made by Zäske et al.21, this study hypothesizes that an old/new effect can still be observed in event-related potential (ERP) analyses despite differences in linguistic content between training and testing sessions. This rationale stems from the notion that the absence of the old/new effect in Zäske et al.22, under conditions where different texts were used, may be attributed to the lack of an additional check session during the training task to ensure thorough and effective identity learning, as suggested by Lavan et al.23. Consequently, the first objective of the study is to examine this hypothesis by adding a checking session to the training-testing paradigm22.
Another key question this study aims to address is the robustness of speaker identification in the presence of speech prosody. Previous behavioral studies have suggested that listeners particularly struggle to recognize talkers across different prosodies, which indicates a modulatory role of prosodic context: listeners underperformed when the training and testing prosodies differed. This study tests this by asking listeners to recognize familiar talkers in either confident or doubtful prosodies24. The observed ERP differences are expected to help explain how speech prosody influences identity recognition.
The core objective of the current study is to investigate the robustness of the old/new effect in speaker recognition, specifically examining whether there are differences in recognizing talkers in confident versus doubtful prosodies. Xu and Armony10 performed a behavioral study using a training-testing paradigm, and their findings suggest that listeners cannot overcome prosodic differences (e.g., trained to recognize a talker in neutral prosody and tested on fearful prosody) and achieve accuracy below chance level10. Acoustic analysis indicates that speakers expressing varied emotive states are associated with VTL/F0 modulation; for example, confident prosody is characterized by lengthened VTL and lower F0, whereas the opposite is true for doubtful prosody11,24. Another piece of evidence comes from the study by Lavan et al.23, which confirmed that listeners can adapt to VTL and F0 changes of the speaker and form average-based representations of the talkers. Reconciling these findings, this study proposes that, from a behavioral perspective, listeners are likely to still recognize the talker’s identity across prosodies (e.g., trained to recognize one in confident prosody but tested in doubtful prosody; reported in a separate manuscript in preparation). Yet, the neural correlates of speaker identification, specifically the generalizability of the old/new effect observed by Zäske et al.22, remain unclear. Hence, the current study aims to validate the robustness of the old/new effect with confident versus doubtful prosodies as testing contexts.
The study departs from previous research paradigms in old/new effect studies. While past research focused on how old/new talker recognition influences perception, this study extends this by incorporating two confidence levels (confident versus doubtful) into the paradigm (thus, a 2 × 2 design). This allows speaker recognition to be investigated within the contexts of confident and doubtful speech prosodies. The paradigm enables the exploration of the robustness of old/new effects. The analyses of memory effects and regions of interest (ROI) within both confident and doubtful speech contexts serve as evidence for this investigation.
Altogether, the study aims to update the understanding of the EEG correlates of voice recognition, with the hypotheses that the enlarged LPC of the EEG old/new effect is observable even when 1) the linguistic content is not the same, and 2) confident or doubtful prosody is present. This study investigated the hypotheses through a three-step paradigm. First, during the training phase, participants established associations between three voices and their corresponding names. Subsequently, in the checking phase, they were tasked with identifying the name corresponding to a voice from a selection of three candidates. This checking, following Lavan et al.23, aims to overcome insufficient familiarization with old speakers, which may explain why the old/new effect was not observed when the text in the training and testing phases differed6 and why listeners could not recognize talkers across neutral and fearful prosodies10. Finally, in the testing phase, participants distinguished between three old and three new speaker voices in a two-alternative forced-choice task, with speech prosody presented as either confident or doubtful. EEG data were collected using a 64-channel EEG system and underwent preprocessing before analysis. Statistical analysis and event-related potential (ERP) analysis were conducted in RStudio, while MATLAB was used for brain topography analysis.
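The testing phase described above crosses talker familiarity (old versus new) with prosody (confident versus doubtful). A minimal sketch of how such a trial list could be assembled is given below, in Python. The talker labels, the number of repetitions, and the function name `build_testing_trials` are hypothetical placeholders and are not taken from the actual experiment program.

```python
import itertools
import random

# Hypothetical talker labels; the text only states three old and three new talkers.
OLD_TALKERS = ["old_1", "old_2", "old_3"]
NEW_TALKERS = ["new_1", "new_2", "new_3"]
PROSODIES = ["confident", "doubtful"]


def build_testing_trials(n_repeats=1, seed=0):
    """Return a shuffled list of testing trials crossing talker and prosody."""
    trials = []
    for talker, prosody in itertools.product(OLD_TALKERS + NEW_TALKERS, PROSODIES):
        # The correct answer in the two-alternative forced-choice task.
        memory = "old" if talker in OLD_TALKERS else "new"
        for _ in range(n_repeats):
            trials.append({"talker": talker, "memory": memory, "prosody": prosody})
    # A seeded generator keeps the illustrative order reproducible.
    random.Random(seed).shuffle(trials)
    return trials


# n_repeats=2 is an arbitrary illustrative value.
trials = build_testing_trials(n_repeats=2)
```

With six talkers, two prosodies, and two repetitions, the sketch yields 24 trials that are balanced across the old/new and confident/doubtful cells.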
Regarding design details, this study proposes a speaker identity learning experiment that controls for the talker’s height, which is related to VTL and influences impressions of who is talking23. This aspect also influences social impressions, such as perceived dominance25, and such higher-level impression formation might interact with decoding speaker identity26.
The Ethics Committee of the Institute of Linguistics, Shanghai International Studies University, has approved the experiment design described below. Informed consent was obtained from all participants for this study.
1. Preparation and validation of the audio library
2. Programming for EEG data collection
3. Collecting EEG data
4. EEG data processing
NOTE: The following descriptions involve EEG data preprocessing, statistical analysis, and visualization using MATLAB and RStudio for batch processing.
The classic old/new effect is characterized by a significant increase in listeners' brain activity at the Pz electrode (between 300 and 700 ms) in the old talker condition compared to the new talker condition, particularly when the speech content of the testing session matches that of the training session22. The protocol reveals an updated version of this effect. First, larger positive trends are observed at the Pz electrode and across the entire brain region for the old condition compared to the new talker condition between 400 and 850 ms. Second, the effect holds when the speech content in the testing session differs from that of the training session. Third, both confident and doubtful speech prosody conditions exhibit these trends. Lastly, the old/new effect is more pronounced in the doubtful condition during the testing session (Figure 2).
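The old/new effect described above is quantified as a voltage difference within a time window. The following Python sketch shows one way to compute the mean amplitude in the 400 to 850 ms window and the old-minus-new difference for a single electrode. The function names and the toy waveforms are illustrative assumptions and do not reproduce the study's data or preprocessing scripts.

```python
def window_mean(times_ms, voltages_uv, start_ms=400, end_ms=850):
    """Mean voltage of the samples whose time falls inside [start_ms, end_ms]."""
    samples = [v for t, v in zip(times_ms, voltages_uv) if start_ms <= t <= end_ms]
    if not samples:
        raise ValueError("no samples fall inside the requested window")
    return sum(samples) / len(samples)


def old_new_difference(times_ms, old_erp, new_erp, start_ms=400, end_ms=850):
    """Old-minus-new mean amplitude; positive values indicate a larger LPC for old talkers."""
    old_mean = window_mean(times_ms, old_erp, start_ms, end_ms)
    new_mean = window_mean(times_ms, new_erp, start_ms, end_ms)
    return old_mean - new_mean


# Toy waveforms (hypothetical values in microvolts), sampled every 200 ms.
times = [0, 200, 400, 600, 800, 1000]
old = [0.0, 1.0, 3.0, 4.0, 2.0, 0.5]
new = [0.0, 1.0, 2.0, 2.5, 1.5, 0.5]
diff = old_new_difference(times, old, new)
```

In this toy case only the samples at 400, 600, and 800 ms fall inside the window, so the old-minus-new difference is 1.0 µV.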
The LMER analysis with the formula
lmer(Voltage ~ Memory * ROI + (1|Subject) + (1|Channel))
suggests that both memory type (old versus new) and ROI have main effects, and that there is an interaction between memory and ROI (Table 1). Further post-hoc analysis revealed that, across all brain regions (anterior, central, and posterior), the old condition exhibits a larger positive voltage than the new condition (Table 2). Comparing the beta values suggests that the old/new effect was more pronounced at central and posterior electrodes than at anterior electrodes: for the combined dataset, Anterior β = .40, Central β = .63, and Posterior β = .60; for the confident dataset, Anterior β = .61, Central β = .63, and Posterior β = .76; and for the doubtful dataset, Anterior β = .44, Central β = .87, and Posterior β = .69. The involvement of central and posterior electrodes was most noticeable in the doubtful prosody condition.
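The model above expects one row per observation with the columns Voltage, Memory, ROI, Subject, and Channel. As a rough sketch of how such a long-format table could be prepared before it is passed to R, the Python snippet below labels each channel with an ROI and writes the rows as CSV text. The channel-to-ROI map, the record fields, and the function names are assumptions made for illustration only; the text does not list which channels form each region.

```python
import csv
import io

# Illustrative ROI map only: the text does not list which channels make up
# each region, so these three midline sites are an assumption.
ROI_BY_CHANNEL = {"Fz": "anterior", "Cz": "central", "Pz": "posterior"}
FIELDS = ["Subject", "Channel", "ROI", "Memory", "Voltage"]


def to_long_rows(records):
    """Convert per-trial channel amplitudes into one row per observation."""
    rows = []
    for rec in records:
        roi = ROI_BY_CHANNEL.get(rec["channel"])
        if roi is None:
            continue  # skip channels outside the illustrative ROI map
        rows.append({
            "Subject": rec["subject"],
            "Channel": rec["channel"],
            "ROI": roi,
            "Memory": rec["memory"],
            "Voltage": rec["voltage"],
        })
    return rows


def rows_to_csv(rows):
    """Serialize long-format rows to CSV text that R could load with read.csv."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy records (not the study's data).
records = [
    {"subject": "s01", "channel": "Pz", "memory": "old", "voltage": 2.1},
    {"subject": "s01", "channel": "Pz", "memory": "new", "voltage": 1.4},
    {"subject": "s01", "channel": "Fz", "memory": "old", "voltage": 0.9},
    {"subject": "s01", "channel": "EOG", "memory": "old", "voltage": 5.0},
]
long_rows = to_long_rows(records)
csv_text = rows_to_csv(long_rows)
```

Keeping Channel as its own column is what allows the `(1|Channel)` random intercept in the formula to be estimated alongside `(1|Subject)`.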
With the formula
lmer(Voltage ~ Memory + (1|Subject))
we confirmed the existence of old/new effects in the Pz electrode. At the Pz electrode, a main effect of memory (old versus new) was observed (F(1, 69341.99) = 120.46, p < .001, η²p = .002, β = .425, SE = .039, z-ratio = 10.98, p < .001). In the confident-only condition, a main effect of memory (old versus new) was observed at the Pz electrode (F(1, 34318.32) = 5.04, p = .025, η²p = .0001, β = .125, SE = .056, z-ratio = 2.25, p = .025). In the doubtful-only condition, a main effect of memory (old versus new) was observed at the Pz electrode (F(1, 34993.20) = 317.02, p < .001, η²p = .009, β = .914, SE = .051, z-ratio = 17.81, p < .001).
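The partial eta squared values reported above can be related to the F statistics through the standard conversion η²p = (F × df1) / (F × df1 + df2). The Python sketch below applies this conversion to the reported F values and degrees of freedom. It is offered as a consistency check under the assumption that the effect sizes were derived this way; the exact routine used by the authors' software is not stated in the text.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F statistic and its degrees of freedom to partial eta squared."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, numerator df, denominator df) as reported for the Pz electrode.
reported = {
    "combined": (120.46, 1, 69341.99),
    "confident": (5.04, 1, 34318.32),
    "doubtful": (317.02, 1, 34993.20),
}

estimates = {name: partial_eta_squared(*stats) for name, stats in reported.items()}
```

After rounding, the estimates agree with the reported values of .002 (combined), .0001 (confident), and .009 (doubtful).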
Figure 1: Workflow of the data collection for each block. In (A) Training, listeners hear a voice and associate it with the name presented subsequently. Listeners are required to remember three old talkers. The language that appeared in the program was originally Chinese. The A and C represent names such as Xiao (Junior) ZHANG. In (B) Checking, listeners identify the talker's name upon hearing a voice by pressing 1, 2, or 3 on the number pad to associate the voice identity with names like Xiao ZHAO. In (C) Testing, listeners hear a voice and classify it as spoken by an old or a new speaker. As illustrated in (D) Prosody Design, listeners learn three talkers who speak only confidently or only doubtfully, but hear six talkers speak both confidently and doubtfully. Versions A and B are mutually exclusive. If Version A appears with a male or female speaker, Version B will appear with the corresponding female or male speaker.
Figure 2: The old/new effect. (A, B, C) ERPs at the Pz electrode, with the 400 to 850 ms window indicated in grey, for the prosody-combined, confident-only, and doubtful-only conditions, respectively. (D, E, F) Topography of the old minus new condition across all electrodes (depicted as black dots) for the prosody-combined, confident-only, and doubtful-only conditions, respectively.
Context | Effect | F value | Pr(>F) | Eta2_partial |
Combined | Memory | 9938.98 | .00 | .00 |
Combined | ROI | 4.13 | .02 | .13 |
Combined | Memory:ROI | 182.37 | .00 | .00 |
Confident | Memory | 7291.22 | .00 | .00 |
Confident | ROI | 3.60 | .03 | .12 |
Confident | Memory:ROI | 41.94 | .00 | .00 |
Doubtful | Memory | 8333.38 | .00 | .00 |
Doubtful | ROI | 4.65 | .01 | .15 |
Doubtful | Memory:ROI | 290.15 | .00 | .00 |
Table 1: Results from LMER analysis for old/new effect across brain regions: Combined, confident, and doubtful datasets. Using post-hoc analysis, * significant at p < .05, ** significant at p < .01, *** significant at p < .001.
Context | Brain Region | Contrast | Estimate | SE | z | p |
Combined | Anterior | old-new | .40 | .01 | 43.70 | .00*** |
Combined | Central | old-new | .63 | .01 | 61.74 | .00*** |
Combined | Posterior | old-new | .60 | .01 | 67.51 | .00*** |
Confident | Anterior | old-new | .61 | .01 | 46.63 | .00*** |
Confident | Central | old-new | .63 | .01 | 43.22 | .00*** |
Confident | Posterior | old-new | .76 | .01 | 59.95 | .00*** |
Doubtful | Anterior | old-new | .44 | .01 | 35.95 | .00*** |
Doubtful | Central | old-new | .87 | .01 | 64.05 | .00*** |
Doubtful | Posterior | old-new | .69 | .01 | 57.75 | .00*** |
Table 2: Post-hoc test results for old/new effects across brain regions: Combined, confident, and doubtful datasets. Using post-hoc analysis, significant at p < .001 (***).
The study presents a pipeline for EEG data collection and analysis, focusing on recognizing previously learned speaker identities. This study addresses variations between learning and recognition phases, including differences in speech content22 and prosody10. The design is adaptable to a range of research fields, including psycholinguistics, such as pronoun and anaphoric processing41.
The training-testing paradigm is a classic experimental design used to assess participants’ learning outcomes on specific topics such as voice learning42,43. This paradigm evaluates how well participants have learned particular information (as reflected in accuracy)10. It allows researchers to introduce variables incrementally under controlled experimental conditions, such as different prosodies during training and testing phases, to understand their influence on voice recognition accuracy, for example, VTL/F0 modulated voices23, fearful versus neutral10, or doubtful versus confident in this study.
However, the paradigm has limitations. The differences between the learning and testing environments can affect the validity of experimental results, as controlled learning conditions may not reflect the more variable testing conditions. For instance, the training session uses a single prosody rather than a proportional mix of prosodies, such as 30% versus 70%44. To address this imbalance, a more diverse learning environment could better replicate real-life scenarios where speakers use varied prosodies while interacting with listeners. Additionally, this study acknowledges that the complexity of the experimental design, involving multiple stages and sophisticated programming (using tools like RStudio, MATLAB, and Python), can be challenging for newcomers.
The primary insight concerns the importance of adequate familiarization and a check phase. Xu and Armony’s work shows that, without sufficient training and checks, listeners struggle to identify old talkers at above-chance levels10. Additionally, Zäske et al. found that the LPC old/new effect was present only when the same text was repeated, not with different text22. In this study, the implementation of a check phase revealed that the old/new ERP effect persists even with different text stimuli, supporting the claims of the fMRI study21. The study suggests that, for training-testing-based paradigms, inserting a check session is critical. It allows listeners to form a robust impression of the speaker’s acoustic identity, associating a talker with a specific symbol, such as a name23. Without sufficient learning of the speaker’s representation, listeners may struggle to adapt to within-speaker variations10.
This study also observed the role of prosody as a binding cue for speaker recognition45. Contrary to previous views that prosody may hinder old talker recognition, this study found the old/new effect present across confident and doubtful prosody conditions. This robust effect suggests a modulatory role of prosody in speaker recognition. Further analysis revealed differences in anterior region activation across prosody conditions. Confident prosody elicited lower levels of the old/new effect in anterior regions compared to doubtful prosody. This finding suggests that confident speech may make talker identification more challenging due to extended vocal tract length and lowered fundamental frequency, potentially leading to increased attention from listeners11,29.
This study’s design can inform future investigations into recognition impairments in patient populations, such as those with prosopagnosia or phonagnosia46,47. Additionally, modifications to accommodate participants with shorter attention spans, such as individuals with autism spectrum disorders48, could enhance study accessibility.
Furthermore, the paradigm extends beyond speaker recognition to the investigation of pronoun processing and anaphoric comprehension within psycholinguistic research. Coopmans and Nieuwland41 demonstrate how neural oscillatory synchronization patterns distinguish between antecedent activation and integration in anaphor comprehension, which aligns with this study’s exploration of identity-related cues. Similar cues include communicative styles (e.g., literal or ironic statements), word orders (Subject-Object-Verb (SOV) or Object-Subject-Verb (OSV) sentence structure44,45,49,50), and, as in this paper, vocal expression types (confident vs. doubtful prosody).
The authors have nothing to disclose.
This work was supported by the Natural Science Foundation of China (Grant No. 31971037); the Shuguang Program supported by the Shanghai Education Development Foundation and Shanghai Municipal Education Committee (Grant No. 20SG31); the Natural Science Foundation of Shanghai (22ZR1460200); the Supervisor Guidance Program of Shanghai International Studies University (2022113001); and the Major Program of the National Social Science Foundation of China (Grant No. 18ZDA293).
64Ch Standard BrainCap for BrainAmp | Easycap GmbH | Steingrabenstrasse 14 DE-82211 | https://shop.easycap.de/products/64ch-standard-braincap |
Abrasive Electrolyte-Gel | Easycap GmbH | Abralyt 2000 | https://shop.easycap.de/products/abralyt-2000 |
actiCHamp Plus | Brain Products GmbH | 64 channels + 8 AUX | https://www.brainproducts.com/solutions/actichamp/ |
Audio Interface | Native Instruments GmbH | Komplete audio 6 | https://www.native-instruments.com/en/products/komplete/audio-interfaces/komplete-audio-6/ |
Foam Eartips | Neuronix | ER3-14 | https://neuronix.ca/products/er3-14-foam-eartips |
Gel-based passive electrode system | Brain Products GmbH | BC 01453 | https://www.brainproducts.com/solutions/braincap/ |
High-Viscosity Electrolyte Gel | Easycap GmbH | SuperVisc | https://shop.easycap.de/products/supervisc |