Memorization-Based Training and Testing Paradigm for Robust Vocal Identity Recognition in Expressive Speech Using Event-Related Potentials Analysis

Published: August 09, 2024

Summary

The study introduces a training-testing paradigm for investigating old/new effects in event-related potentials under confident and doubtful prosodic scenarios. The data reveal an enhanced late positive component between 400 and 850 ms at Pz and other electrodes. This pipeline can be extended to explore factors beyond speech prosody and their influence on cue-binding target identification.

Abstract

Recognizing familiar speakers from vocal streams is a fundamental aspect of human verbal communication. However, it remains unclear how listeners discern a speaker’s identity in expressive speech. This study develops a memorization-based individual speaker identity recognition approach and an accompanying electroencephalogram (EEG) data analysis pipeline, which monitor how listeners recognize familiar speakers and tell unfamiliar ones apart. EEG data capture the online cognitive processes of distinguishing new from old speakers based on voice, offering a real-time measure of brain activity that overcomes the limitations of reaction-time and accuracy measurements alone. The paradigm comprises three steps: listeners establish associations between three voices and their names (training); listeners indicate the name corresponding to a voice from three candidates (checking); and listeners distinguish between three old and three new speaker voices in a two-alternative forced-choice task (testing). The speech prosody in testing was either confident or doubtful. EEG data were collected using a 64-channel EEG system, preprocessed, and then imported into RStudio for ERP and statistical analyses and into MATLAB for brain topography. Results showed that an enlarged late positive component (LPC) was elicited in the old-talker compared to the new-talker condition in the 400-850 ms window at Pz and a wider range of electrodes in both prosodies. Yet the old/new effect was robust at central and posterior electrodes for doubtful prosody perception, whereas it spanned anterior, central, and posterior electrodes in the confident prosody condition. This study proposes that this experimental design can serve as a reference for investigating speaker-specific cue-binding effects in various scenarios (e.g., anaphoric expression) and in pathologies such as phonagnosia.

Introduction

Human vocal streams are rich in information, such as emotion1,2, health status3,4, biological sex5, age6, and, more importantly, individual vocal identity7,8. Studies have suggested that human listeners have a robust capacity to recognize and differentiate their peers’ identities through voices, overcoming within-speaker variations around an average-based representation of the speaker’s identity in acoustic space9. Such variations are brought about by acoustic manipulations (of fundamental frequency and vocal tract length, i.e., F0 and VTL) that correspond to no clear pragmatic intention9, by emotional prosodies10, and by vocal confidence conveying the speaker’s feeling of knowing11. Behavioral experiments have examined many factors that influence listeners’ performance in recognizing talkers, including language-related manipulations8,12,13, participant-related characteristics such as musical experience or reading ability14,15, and stimulus-related adaptations like backward speech or nonwords16,17; more can be found in literature reviews18,19. A few recent experiments have investigated how individual variation in the representation of speaker identity might undermine recognition accuracy, considering aspects including high versus low emotional expressiveness16 and neutral versus fearful prosodies5; more scenarios remain open for further investigation, as suggested by a review20.

Regarding the first research gap, the study notes that work on the neurological underpinnings of speaker identification has yet to fully explore how within-speaker variation challenges listeners’ brain activity. For example, in an fMRI-based speaker recognition task by Zäske et al., participants’ right posterior superior temporal gyrus (pSTG), right inferior/middle frontal gyrus (IFG/MFG), right medial frontal gyrus, and left caudate showed reduced activation when old talkers were correctly identified relative to new talkers, regardless of whether the linguistic content was the same or different21. However, an earlier electroencephalography (EEG) study by Zäske et al. did not observe this old/new effect when speaker identity variation was introduced through different texts22. Specifically, a larger late positive component (LPC), ranging from 300 to 700 ms and detected at the Pz electrode when listeners encountered a familiar trained talker expressing the same text (i.e., hearing a replay with unvaried linguistic content), was absent when the talkers delivered new texts.

In support of the assertion made by Zäske et al.21, this study suspects that an old/new effect can still be observed in event-related potential (ERP) analyses despite differences in linguistic content between the training and testing sessions. The rationale is that the absence of the old/new effect in Zäske et al.22, where different texts were used, may be attributed to the lack of an additional check session during the training task to ensure thorough and effective identity learning, as suggested by Lavan et al.23. Consequently, the first objective of the study is to test this hypothesis by adding a checking session to the training-testing paradigm22.

Another key question this study addresses is the robustness of speaker identification in the presence of speech prosody. Previous behavioral studies have suggested that listeners particularly struggle to recognize talkers across different prosodies, indicating a modulatory role of prosodic context: listeners underperformed when the training and testing prosodies differed. This study tests this by requiring listeners to recognize familiar talkers in either confident or doubtful prosody24. The observed ERP differences are expected to help explain how speech prosody influences identity recognition.

The core objective of the current study is to investigate the robustness of the old/new effect in speaker recognition, specifically examining whether recognizing talkers differs between confident and doubtful prosodies. Xu and Armony10 performed a behavioral study using a training-testing paradigm, and their findings suggest that listeners cannot overcome prosodic differences (e.g., trained to recognize a talker in neutral prosody and tested on fearful prosody), achieving accuracy below chance level10. Acoustic analysis indicates that speakers expressing varied emotive states modulate VTL and F0; for example, confident prosody is characterized by a lengthened VTL and lower F0, whereas the opposite holds for doubtful prosody11,24. Further evidence comes from Lavan et al.23, who confirmed that listeners can adapt to VTL and F0 changes of a speaker and form average-based representations of talkers. From a behavioral data perspective, this study reasons that listeners are likely to still recognize a talker’s identity across prosodies (e.g., trained in confident prosody but tested in doubtful prosody; reported in a separate manuscript in preparation). Yet the neural correlates of speaker identification, specifically the generalizability of the old/new effect observed by Zäske et al.22, remain unclear. Hence, the current study aims to validate the robustness of the old/new effect with confident versus doubtful prosodies as testing contexts.

The study introduces a departure from previous research paradigms on old/new effects. While past research focused on how old/new talker recognition influences perception, this study extends the paradigm by incorporating two confidence levels (confident versus doubtful), yielding a 2 x 2 design. This allows speaker recognition to be investigated within the contexts of confident and doubtful speech prosodies, enabling exploration of the robustness of old/new effects. Analyses of memory effects and regions of interest (ROIs) within both confident and doubtful speech contexts serve as evidence for this investigation.

Altogether, the study aims to update the understanding of the EEG correlates of voice recognition, with the hypotheses that the enlarged LPC of the EEG old/new effect is observable even when 1) the linguistic content differs and 2) confident or doubtful prosody is present. The hypotheses were investigated through a three-step paradigm. First, during the training phase, participants established associations between three voices and their corresponding names. Next, in the checking phase, they identified the name corresponding to a voice from three candidates. This check, following Lavan et al.23, aims to overcome insufficient familiarization with the old speakers, which may explain why the old/new effect was not observed when the text differed between training and testing6 and why listeners could not recognize talkers across neutral and fearful prosodies10. Finally, in the testing phase, participants distinguished between three old and three new speaker voices in a two-alternative forced-choice task, with speech prosody presented as either confident or doubtful. EEG data were collected using a 64-channel EEG system and preprocessed before analysis. Statistical and event-related potential (ERP) analyses were conducted in RStudio, while MATLAB was used for brain topography.
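The core of the ERP analysis described above — baseline correction, averaging per condition, and taking the mean amplitude in the 400-850 ms LPC window — can be sketched as follows. This is an illustrative sketch only, not the authors' pipeline (they used RStudio and MATLAB): the sampling rate, epoch length, and synthetic single-channel "Pz" data below are assumptions for demonstration.

```python
import numpy as np

# Assumed acquisition parameters (not stated in this excerpt)
SFREQ = 500                               # sampling rate, Hz
TMIN = -0.2                               # epoch start relative to voice onset, s
times = TMIN + np.arange(600) / SFREQ     # 600 samples: -200 ms to ~1000 ms

def mean_lpc_amplitude(epochs, times, t0=0.400, t1=0.850):
    """Baseline-correct each trial, average across trials, and return the
    mean amplitude in the LPC window (t0-t1 s) for one channel."""
    baseline = epochs[:, times < 0].mean(axis=1, keepdims=True)
    corrected = epochs - baseline          # subtract each trial's pre-stimulus mean
    erp = corrected.mean(axis=0)           # grand average across trials
    window = (times >= t0) & (times <= t1)
    return erp[window].mean()

# Synthetic single-channel epochs (60 trials x 600 samples): the "old talker"
# condition carries an extra positivity in the LPC window, mimicking the effect.
rng = np.random.default_rng(0)
lpc_boost = 1.5 * ((times >= 0.4) & (times <= 0.85))   # ~1.5 microvolts
old_epochs = rng.normal(0, 2.0, (60, times.size)) + lpc_boost
new_epochs = rng.normal(0, 2.0, (60, times.size))

old_new_effect = (mean_lpc_amplitude(old_epochs, times)
                  - mean_lpc_amplitude(new_epochs, times))
print(round(old_new_effect, 2))   # positive difference = the old/new effect
```

In the actual study this windowed mean amplitude would be computed per participant, electrode, and prosody condition and then submitted to mixed-effects models in R; the sketch only shows how the dependent measure is derived from epoched EEG.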

Regarding design details, this study proposes a speaker identity learning experiment that controls for the talker’s height, which is related to VTL and influences impressions of who is talking23. This aspect also influences social impressions, such as perceived dominance25, and such higher-level impression formation might interact with decoding speaker identity26.

Protocol

The Ethics Committee of the Institute of Linguistics, Shanghai International Studies University, approved the experiment design described below. Informed consent was obtained from all participants in this study.

1. Preparation and validation of the audio library

Audio recording and editing: Create a Chinese vocal database following the standard procedure for making a previous English version, while making adaptations where needed to fit into the conte…
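The audio-editing step above is truncated, but stimulus libraries of this kind are typically equalized in intensity so that loudness differences cannot cue speaker identity. A minimal sketch of such a step, assuming mono waveforms as floating-point arrays (the `target_rms` value is a hypothetical choice, not taken from the protocol):

```python
import numpy as np

def rms_normalize(signal, target_rms=0.05):
    """Scale a waveform so its root-mean-square amplitude equals target_rms.
    Silent clips are returned unchanged."""
    rms = np.sqrt(np.mean(signal ** 2))
    if rms == 0:
        return signal.copy()
    return signal * (target_rms / rms)

# Example: a 1 s, 220 Hz sine tone at arbitrary amplitude
tone = 0.3 * np.sin(2 * np.pi * 220 * np.linspace(0, 1, 44100))
normalized = rms_normalize(tone)
print(round(float(np.sqrt(np.mean(normalized ** 2))), 3))  # 0.05
```

In practice this equalization would be applied uniformly to every recording in the database (the published protocol uses Praat for acoustic editing; the Python version here is only for illustration).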

Representative Results

The classic old/new effect is characterized by a significant increase in listeners' brain activity at the Pz electrode (between 300 and 700 ms) when the speech content of the testing session matches that of the training session, particularly in the old-talker condition compared to the new-talker condition22. The protocol unveils an updated version of this effect: first, larger positive trends were observed at the Pz electrode and across the entire brain region for the old condition compared to th…

Discussion

The study presents a pipeline for EEG data collection and analysis, focusing on recognizing previously learned speaker identities. This study addresses variations between the learning and recognition phases, including differences in speech content22 and prosody10. The design is adaptable to a range of research fields, including psycholinguistic topics such as pronoun and anaphoric processing41.

The training-testing paradigm is a c…

Disclosures

The authors have nothing to disclose.

Acknowledgements

This work was supported by the Natural Science Foundation of China (Grant No. 31971037); the Shuguang Program supported by the Shanghai Education Development Foundation and Shanghai Municipal Education Committee (Grant No. 20SG31); the Natural Science Foundation of Shanghai (22ZR1460200); the Supervisor Guidance Program of Shanghai International Studies University (2022113001); and the Major Program of the National Social Science Foundation of China (Grant No. 18ZDA293).

Materials

64Ch Standard BrainCap for BrainAmp Easycap GmbH Steingrabenstrasse 14 DE-82211 https://shop.easycap.de/products/64ch-standard-braincap
Abrasive Electrolyte-Gel Easycap GmbH Abralyt 2000 https://shop.easycap.de/products/abralyt-2000
actiCHamp Plus Brain Products GmbH 64 channels + 8 AUX https://www.brainproducts.com/solutions/actichamp/
Audio Interface Native Instruments GmbH Komplete audio 6 https://www.native-instruments.com/en/products/komplete/audio-interfaces/komplete-audio-6/
Foam Eartips Neuronix ER3-14  https://neuronix.ca/products/er3-14-foam-eartips
Gel-based passive electrode system Brain Products GmbH BC 01453 https://www.brainproducts.com/solutions/braincap/
High-Viscosity Electrolyte Gel  Easycap GmbH SuperVisc https://shop.easycap.de/products/supervisc

References

  1. Larrouy-Maestri, P., Poeppel, D., Pell, M. D. The sound of emotional prosody: Nearly 3 decades of research and future directions. Perspect Psychol Sci. 17456916231217722 (2024).
  2. Pell, M. D., Kotz, S. A. Comment: The next frontier: Prosody research gets interpersonal. Emot Rev. 13 (1), 51-56 (2021).
  3. Cummins, N., et al. Multilingual markers of depression in remotely collected speech samples: A preliminary analysis. J Affect Disord. 341, 128-136 (2023).
  4. Cummins, N., Baird, A., Schuller, B. W. Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning. Methods. 151, 41-54 (2018).
  5. Kennedy, E., Thibeault, S. L. Voice-gender incongruence and voice health information-seeking behaviors in the transgender community. Am J Speech Lang Pathol. 29 (3), 1563-1573 (2020).
  6. Zäske, R., et al. Electrophysiological correlates of voice memory for young and old speakers in young and old listeners. Neuropsychologia. 116, 215-227 (2018).
  7. Lavan, N., Burton, A. M., Scott, S. K., McGettigan, C. Flexible voices: Identity perception from variable vocal signals. Psychon Bull Rev. 26, 90-102 (2019).
  8. Perrachione, T. K., Del Tufo, S. N., Gabrieli, J. D. Human voice recognition depends on language ability. Science. 333 (6042), 595 (2011).
  9. Lavan, N., Knight, S., McGettigan, C. Listeners form average-based representations of individual voice identities. Nat Commun. 10 (1), 2404 (2019).
  10. Xu, H., Armony, J. L. Influence of emotional prosody, content, and repetition on memory recognition of speaker identity. Q J Exp Psychol. 74 (7), 1185-1201 (2021).
  11. Jiang, X., Pell, M. D. The sound of confidence and doubt. Speech Commun. 88, 106-126 (2017).
  12. Winters, S. J., Levi, S. V., Pisoni, D. B. Identification and discrimination of bilingual talkers across languages. J Acoust Soc Am. 123 (6), 4524-4538 (2008).
  13. Orena, A. J., Polka, L., Theodore, R. M. Identifying bilingual talkers after a language switch: Language experience matters. J Acoust Soc Am. 145 (4), EL303-EL309 (2019).
  14. Xie, X., Myers, E. The impact of musical training and tone language experience on talker identification. J Acoust Soc Am. 137 (1), 419-432 (2015).
  15. Kadam, M. A., Orena, A. J., Theodore, R. M., Polka, L. Reading ability influences native and non-native voice recognition, even for unimpaired readers. J Acoust Soc Am. 139 (1), EL6-EL12 (2016).
  16. Fleming, D., Giordano, B. L., Caldara, R., Belin, P. A language-familiarity effect for speaker discrimination without comprehension. Proc Natl Acad Sci. 111 (38), 13795-13798 (2014).
  17. White, K. S., Yee, E., Blumstein, S. E., Morgan, J. L. Adults show less sensitivity to phonetic detail in unfamiliar words, too. J Mem Lang. 68 (4), 362-378 (2013).
  18. Levi, S. Methodological considerations for interpreting the language familiarity effect in talker processing. Wiley Interdiscip Rev Cogn Sci. 10 (2), e1483 (2019).
  19. Perrachione, T. K., Frühholz, S., Belin, P. Recognizing speakers across languages. In: The Oxford Handbook of Voice Perception. 515-538 (2018).
  20. Lavan, N., Burton, A. M., Scott, S. K., McGettigan, C. Flexible voices: Identity perception from variable vocal signals. Psychon Bull Rev. 26 (1), 90-102 (2019).
  21. Zäske, R., Hasan, B. A. S., Belin, P. It doesn’t matter what you say: fMRI correlates of voice learning and recognition independent of speech content. Cortex. 94, 100-112 (2017).
  22. Zäske, R., Volberg, G., Kovács, G., Schweinberger, S. R. Electrophysiological correlates of voice learning and recognition. J Neurosci. 34 (33), 10821-10831 (2014).
  23. Lavan, N., Knight, S., McGettigan, C. Listeners form average-based representations of individual voice identities. Nat Commun. 10 (1), 1-9 (2019).
  24. Chen, W., Jiang, X. Voice-cloning artificial-intelligence speakers can also mimic human-specific vocal expression. Preprints. (2023).
  25. Pisanski, K., Anikin, A., Reby, D. Vocal size exaggeration may have contributed to the origins of vocalic complexity. Philos Trans R Soc B. 377 (1841), 20200401 (2022).
  26. Belin, P., Fecteau, S., Bedard, C. Thinking the voice: Neural correlates of voice perception. Trends Cogn Sci. 8 (3), 129-135 (2004).
  27. Praat: Doing phonetics by computer. Available from: https://www.fon.hum.uva.nl/praat/ (2022).
  28. Jiang, X., Pell, M. D. On how the brain decodes vocal cues about speaker confidence. Cortex. 66, 9-34 (2015).
  29. Jiang, X., Gossack-Keenan, K., Pell, M. D. To believe or not to believe? How voice and accent information in speech alter listener impressions of trust. Q J Exp Psychol. 73 (1), 55-79 (2020).
  30. Rigoulot, S., Pell, M. D. Seeing emotion with your ears: Emotional prosody implicitly guides visual attention to faces. PLoS One. 7 (1), e30740 (2012).
  31. Cui, X., Jiang, X., Ding, H. Affective prosody guides facial emotion processing. Curr Psychol. 42 (27), 23891-23902 (2023).
  32. Memorization-based training and testing paradigm for robust vocal identity recognition in expressive speech using event-related potentials analysis. Available from: https://osf.io/6zu83/ (2024).
  33. BrainVision Recorder. Available from: https://www.brainproducts.com/downloads/recorder/ (2024).
  34. Jiang, X., Paulmann, S., Robin, J., Pell, M. D. More than accuracy: Nonverbal dialects modulate the time course of vocal emotion recognition across cultures. J Exp Psychol. 41 (3), 597 (2015).
  35. Jiang, X., Pell, M. D. The feeling of another’s knowing: How "mixed messages" in speech are reconciled. J Exp Psychol. 42 (9), 1412 (2016).
  36. Zhou, X., et al. Semantic integration processes at different levels of syntactic hierarchy during sentence comprehension: An ERP study. Neuropsychologia. 48 (6), 1551-1562 (2010).
  37. Jiang, X., Tan, Y., Zhou, X. Processing the universal quantifier during sentence comprehension: ERP evidence. Neuropsychologia. 47 (8-9), 1799-1815 (2009).
  38. Acunzo, D. J., Mackenzie, G., Van Rossum, M. C. W. Systematic biases in early ERP and ERF components as a result of high-pass filtering. J Neurosci Methods. 209 (1), 212-218 (2012).
  39. Bates, D. Fitting linear mixed models in R. R News. 5 (1), 27-30 (2005).
  40. Oostenveld, R., Fries, P., Maris, E., Schoffelen, J. M. FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci. 2011, 1-9 (2011).
  41. Coopmans, C. W., Nieuwland, M. S. Dissociating activation and integration of discourse referents: Evidence from ERPs and oscillations. Cortex. 126, 83-106 (2020).
  42. Humble, D., et al. The Jena Voice Learning and Memory Test (JVLMT): A standardized tool for assessing the ability to learn and recognize voices. Behav Res Methods. 55 (3), 1352-1371 (2023).
  43. Holmes, E., To, G., Johnsrude, I. S. How long does it take for a voice to become familiar? Speech intelligibility and voice recognition are differentially sensitive to voice training. Psychol Sci. 32 (6), 903-915 (2021).
  44. Kroczek, L. O. H., Gunter, T. C. Communicative predictions can overrule linguistic priors. Sci Rep. 7 (1), 17581 (2017).
  45. Kroczek, L. O. H., Gunter, T. C. The time course of speaker-specific language processing. Cortex. 141, 311-321 (2021).
  46. Schroeger, A., et al. Atypical prosopagnosia following right hemispheric stroke: A 23-year follow-up study with MT. Cogn Neuropsychol. 39 (3-4), 196-207 (2022).
  47. Garrido, L., et al. Developmental phonagnosia: A selective deficit of vocal identity recognition. Neuropsychologia. 47 (1), 123-131 (2009).
  48. Schelinski, S., Borowiak, K., von Kriegstein, K. Temporal voice areas exist in autism spectrum disorder but are dysfunctional for voice identity recognition. Soc Cogn Affect Neurosci. 11 (11), 1812-1822 (2016).
  49. Holle, H., Gunter, T. C. The role of iconic gestures in speech disambiguation: ERP evidence. J Cogn Neurosci. 19 (7), 1175-1192 (2007).
  50. Regel, S., Coulson, S., Gunter, T. C. The communicative style of a speaker can affect language comprehension? ERP evidence from the comprehension of irony. Brain Res. 1311, 121-135 (2010).
Cite This Article
Chen, W., Jiang, X. Memorization-Based Training and Testing Paradigm for Robust Vocal Identity Recognition in Expressive Speech Using Event-Related Potentials Analysis. J. Vis. Exp. (210), e66913, doi:10.3791/66913 (2024).
