This research aimed to compare L1-L2 English and L1-L2 Portuguese in order to determine how much a foreign accent accounts for both rhythm metrics and prosodic-acoustic parameters, as well as for the choice of the target voice in a voice lineup.
This research aims to examine both the prosodic-acoustic features and the perceptual correlates of foreign-accented English and foreign-accented Brazilian Portuguese, and to check how the speakers' productions of foreign and native accents correlate with the listeners' perception. In the Methodology, we conducted a speech production procedure with a group of American speakers of L2 Brazilian Portuguese and a group of Brazilian speakers of L2 English, and a speech perception procedure in which we performed voice lineups for both languages. For the speech production statistical analysis, we ran Generalized Additive Models to evaluate the effect of the language groups on each class of features (metric or prosodic-acoustic), controlling for the smoothing effect of the covariate(s) of the opposite class. For the speech perception statistical analysis, we ran a Kruskal-Wallis test and a post-hoc Dunn's test to evaluate the effect of the voices in the lineups on the scores given by the listeners. We additionally conducted acoustic (voice) similarity tests based on Cosine similarity and Euclidean distance. Results showed significant acoustic differences between the language groups in terms of variability of f0, duration, and voice quality. For the lineups, the results indicated that prosodic features of f0, intensity, and voice quality correlated with the listeners' perceived judgments.
Accent is a salient and dynamic aspect of communication and fluency, both in the native language (L1) and in a foreign language (L2)1. A foreign accent comprises the phonetic features of L2 speech in a target language, and it can change over time in response to the speaker's L2 experience, speaking style, input quality, and exposure, among other variables. A foreign accent can be quantified as a (scalar) degree of difference between the L2 speech produced by a foreign speaker and a local or reference accent of the target language2,3,4,5.
This research aims to examine both the prosodic-acoustic features and the perceptual correlates of foreign-accented English and foreign-accented Brazilian Portuguese (BP), as well as to check to what extent the speakers' productions of foreign and native accents correlate with the listeners' perception. Prior research in the forensic field has demonstrated the robustness of vowels and consonants in foreign accent identification, either as stable cues in the long-term analysis of an individual (the Lo Case6) or as features yielding high speaker-identification accuracy (the Lindbergh Case7). However, the exploration of prosodic-acoustic features based on duration, fundamental frequency (f0, i.e., the acoustic correlate of pitch), intensity, and voice quality (VQ) has gained increasing attention8,9. Thus, the choice of prosodic-acoustic features in this study represents a promising avenue in the forensic phonetics field8,10,11,12,13.
The present research is supported by studies dedicated to foreign accents as a form of voice disguise in forensic cases14,15, as well as to the preparation of voice lineups for speaker recognition16,17. For instance, speech rate played an important role in the identification of German, Italian, Japanese, and Brazilian speakers of English18,19,20,21,22. Beyond speech rate, long-term spectral and f0 features challenge even proficient L2 speakers who are familiar with the target language, because memory, attention, and emotion are taxed during long speech turns, which is reflected in the speaker's phonetic performance23. In the forensic field, what actually sounds like a foreign accent is held to depend more on the listener's unfamiliarity with it than on any none-to-extreme degree of foreign-accented speech24.
In the perceptual domain, a common forensic tool used since the mid-1990s for the recognition of criminals is the voice lineup (auditory recognition), which is analogous to the visual lineup used to identify the perpetrator of a crime16,25. In a voice lineup, the suspect's voice is presented alongside foils (voices similar in sociolinguistic aspects such as age, sex, geographical location, dialect, and cultural status) for identification by an earwitness. The success or failure of a voice lineup depends on the number of voice samples and on the sample durations25,26. Furthermore, for real-world samples, audio quality consistently affects the accuracy of voice recognition; poor audio quality can distort the unique characteristics of a voice27. In the case of voice similarity, fine phonetic detail based on f0 can confuse the listener during voice recognition28,29. Such acoustic features extend beyond f0 and encompass elements of duration and spectral features of intensity and VQ30. This view of multiple prosodic features is crucial in the context of forensic voice comparison to ensure accurate speaker identification9,14,15,29,31.
In summary, studies in forensic phonetics have produced mixed findings on foreign accent identification over recent decades. On the one hand, a foreign accent does not seem to affect the process of identifying a speaker32,33 (especially if the speaker is unfamiliar with the target foreign accent34). On the other hand, there are findings in the opposite direction12,34,35.
This work received approval from a human research ethics committee. Furthermore, informed consent was obtained from all participants involved in this study to use and publish their data.
1. Speech production
NOTE: We collected speech from a reading task in both L1 English and L2 BP, produced by Group 1: the American English speakers from the U.S.A. (AmE-S), and in both L1 BP and L2 English, produced by Group 2: the Brazilian speakers (Bra-S). See Figure 1 for a flowchart for speech production.
Figure 1: Schematic flowchart for speech production. Please click here to view a larger version of this figure.
Figure 2: Screenshot from phonetic alignment using the MAUS forced aligner. (A) The dashed rectangle is meant for dragging and dropping the 'my_file.wav'/'my_file.txt' files, or for clicking inside to search for such files in a folder; the upload button is indicated by the red arrow. (B) The uploaded files from panel A (see blue arrow), the pipeline to be used, the language of the file pairs, the file format to be returned, and a 'true/false' button for keeping all files. (C) The Terms of Usage checkbox (see green arrow), the Run Web Services button, and the Results (TextGrid files to be downloaded). Please click here to view a larger version of this figure.
Figure 3: Screenshot of the realignment procedure. (A) Input settings form for the realignment procedure. (B) Partial waveform, broadband spectrogram with the f0 (blue) contour, and six tiers segmented (and labeled) as tier 1: units from one vowel onset to the next vowel onset (V_to_V); tier 2: units of vowel (V), consonant (C), and pause (#); tier 3: phonic representations of the V_to_V units; tier 4: some words from the text; tier 5: some chunks (CH) of speech from the text; tier 6: tonal tier containing the highest (H) and the lowest (L) tone of each speech chunk produced by a female AmE-S. (C) Input settings for the automatic extraction of the acoustic features. Please click here to view a larger version of this figure.
2. Speech perception
NOTE: We carried out four voice lineups in English with American listeners and four lineups in BP with Brazilian listeners. See Figure 4 for a flowchart for speech perception.
Figure 4: Schematic flowchart for speech perception. Please click here to view a larger version of this figure.
Figure 5: Directory setup for speech perception. Lineup folders. Each folder contains six L1 voices, the L2 target voice, the "CreateLineup" script, and the voice lineup audio file (returned after running the script). Please click here to view a larger version of this figure.
Results for speech production
In this section, we describe the performance of the statistically significant prosodic-acoustic features and rhythm metrics. The prosodic features were speech, articulation, and pause rates, which are related to duration, and shimmer, which is related to voice quality. The rhythm metrics were the standard deviation (SD) of syllable duration, the SD of consonant duration, the SD of vocalic or consonantal duration, and the variation coefficient of syllable duration (see the Supplemental Table S1).
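A minimal R sketch of how such rates and variability metrics can be derived from the segmented durations (the protocol's own SpeechRhythmExtractor Praat script is the actual implementation; the input columns and the pause-rate definition below are assumptions):

```r
# Hypothetical input: one row per segmented unit, with columns
# dur (duration in seconds) and type ("syllable", "vowel", "consonant", "pause").
rates_and_metrics <- function(seg) {
  syll  <- seg$dur[seg$type == "syllable"]
  vow   <- seg$dur[seg$type == "vowel"]
  cons  <- seg$dur[seg$type == "consonant"]
  pause <- seg$dur[seg$type == "pause"]
  total <- sum(seg$dur)                                        # total time, pauses included
  list(
    speech_rate        = length(syll) / total,                 # syllables per second, pauses included
    articulation_rate  = length(syll) / (total - sum(pause)),  # pauses excluded
    pause_rate         = length(pause) / total,                # one possible definition of pause rate
    sd_syllable        = sd(syll),                             # SD-S
    sd_consonant       = sd(cons),                             # SD-C
    sd_vowel           = sd(vow),                              # SD-V
    sd_vowel_consonant = sd(c(vow, cons)),                     # SD-V_C
    varco_syllable     = 100 * sd(syll) / mean(syll)           # Varco-S (variation coefficient)
  )
}
```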
The Speech rate (Figure 6A) decreases more rapidly for L1-L2 English than for L1-L2 BP, although the rate is lower for both L2s. The Speech rate is related to three metrics: the SD of syllable duration (SD-S), the SD of vowel duration (SD-V), and the syllabic variation coefficient (Varco-S). It is also important to keep in mind that both the AmE-S and Bra-S groups are proficient in their L2s. Such a rate seems to be affected by the higher syllable durations and the lower variability produced by L1-L2 BP, resulting in less-steep slopes.
The Articulation rate (Figure 6D) is related to SD-S and Varco-S, as well as to the Syllabic Rhythm Ratio (RR-S); the latter represents the ratio between shorter and longer syllables. Such metrics seem to affect the Articulation rate in the same way they affect the Speech rate: sentence length and variation in duration increase L2 speech planning and L2 cognitive load9,23,54. A higher cognitive-linguistic load may be imposed by longer sentences, in a way that affects the prosodic-acoustic parameters55.
Regarding the prosodic-acoustic features of Speech and Articulation rates, our findings suggest that the more a speaker varies the duration of syllables (and vowels), the lower such rates. We hypothesize that both L1 and L2 BP sound somewhat slower than L1 and L2 English, and that L1 English sounds faster than all other language levels. This might be linked to speech rhythm and to the hypothesis that, while L1-L2 BP leans toward the syllable-timed pole of the rhythm continuum, L1-L2 English moves toward the stress-timed pole21,22.
The Local shimmer (Figure 6B) is influenced by SD-S and Varco-S. Both productions of the Bra-S group (L1-BP and L2-English) present a somewhat monotonic trajectory as the variability of syllable duration increases (SD and Varco, see Supplemental Table S1). The perception of vocal effort might be apparent when listening to the Bra-S productions because of the low variation, which ranges between 8.5 and 9 dB; Bra-S speech is affected by vocal load. L1-BP is highly affected by sentence length while being less sensitive to high variations of syllable duration (the BP rhythmic pattern shows less syllabic variation22,56).
The Pause rate (Figure 6C) shows that L2-English speakers produce longer pauses (from 200 ms to 375 ms) than the L1-English, L1-BP, and L2-BP speakers (means of 30 ms for L1-English, 50 ms for L1-BP, and 100 ms for L2-BP). Speech planning might have affected L2-English production here owing to increased brain activity54,57. Higher language proficiency is expected to result in greater reading speed and span54; nevertheless, even for proficient L2 speakers, reading tasks demanded more effort in the higher-order cognitive aspects of foreign speech and in language processing54,57. Our findings show, at least on a preliminary basis, that L2-BP speakers may be less affected by pauses than L2-English speakers, owing to differences in the rhythmic patterns of the two languages.
Figure 6: Generalized Additive Models for the prosodic-acoustic features. On the Y-axis, the response variable (the prosodic-acoustic feature to be modeled); on the X-axis, the predictor variables: the factor 'Language' plus the covariates that provide the shape and trajectories of the curves (i.e., the smoothing effects: 'Language + covariates'), and the product of 'Language' and a metric feature, which provides a more accurate projection of the response variable (i.e., factor-smooth interactions: 'covariate:Language'). Abbreviation: GAMs = Generalized Additive Models. Please click here to view a larger version of this figure.
In relation to the rhythm metrics, for SD-S (Figure 7A), our findings reveal that the more a speaker varies the f0 curve and increases the Speech rate, the more SD-S decreases for all the language levels (a mirror effect of Figure 6A). In the case of the SD for consonants (SD-C, Figure 7C), L1-BP, contrary to L1-English, shows low variation as Speech rate or Pause duration increases. Such an SD direction is expected, since the L1-English variability of syllable duration is (statistically) significantly higher than that of L1-BP44,58. The Varco-S (Figure 7B and Supplemental Table S1) seems somewhat correlated between the L1 and the L2 of the same language group. As f0 variability and Speech rate increase, the Varco-S decreases for L1-BP and seems to decrease and then attenuate for L2-BP. For the English productions, as f0 variability and Speech rate increase, Varco-S increases and then attenuates for both L1-English and L2-English.
Regarding the SD for vowels or consonants (SD-V_C, Figure 7D and Supplemental Table S1), our results show some similarities with the Varco-S performance (Figure 7B). For instance, higher Speech rate, Pause duration, and f0 variability promote a fall-rise trajectory of the SD-V_C for L1-BP and L2-BP. In the English productions, as such prosodic features increase, SD-V_C slightly decreases and then attenuates for L1-English, and slightly increases and then attenuates for L2-English. The correlation of the Varco-S and the SD-V_C between the L1 and L2 of the same language group may be partially explained by the Lombard Effect, which refers to the alteration of acoustic parameters in speech due to background noise59.
In foreign speech, depending on the complexity of the task (e.g., read speech), speakers have used similar acoustic strategies in both the L1 and the L259,60. On the one hand, Marcoux and Ernestus found that f0 measures (range and median) in L2 Lombard speech suggest that L2 English speakers were influenced by their L1 Dutch61,62. On the other hand, more recent findings propose that, although L2 Lombard speech is expected to differ from L1 Lombard speech owing to the higher cognitive load of speaking the L2, L2 English speakers produced Lombard speech in the same direction as L1 English speakers63. The extent to which our findings align with Marcoux and Ernestus63 and diverge from Waaning59, Villegas et al.60, and Marcoux and Ernestus61,62 needs further investigation.
Figure 7: Generalized Additive Models for the metric features. On the Y-axis, the response variable (the metric feature to be modeled); on the X-axis, the predictor variables: the factor 'Language' plus the covariates that provide the shape and trajectories of the curves (i.e., the smoothing effects: 'Language + covariates'), and the product of 'Language' and a prosodic-acoustic feature, which provides a more accurate projection of the response variable (i.e., factor-smooth interactions: 'covariate:Language'). Abbreviation: GAMs = Generalized Additive Models. Please click here to view a larger version of this figure.
Results for speech perception
In relation to the BP voice lineups, in lineup #1 (Figure 8A), the foil voice #3 was judged to be the target voice; in fact, the target voice was voice #4. Results show no statistically significant difference (see Supplemental Table S1) between foil #3 (83%) and the target voice #4 (71%). In lineup #2 (Figure 8B), a comparison between the target voice #3 (40%) and foil #4 (20%) also showed no statistically significant difference (see Supplemental Table S1). For the English voice lineups, in lineup #1 (Figure 9A), there was no significant difference (see Supplemental Table S1) between foil #2 (26%) and the target voice #4 (54%).
In the analysis of voice similarity, both the Cosine similarity (CoSim) and the Euclidean distance (EucDist, [EucDist transformed]54) showed a consistent correlation between foil #3 and the target for the BP lineup #1 (CoSim = 0.99; EucDist = 77.4, [0.93]). The correlation between foil #4 and the target was similarly strong in the BP lineup #2 (CoSim = 0.99; EucDist = 76.3, [0.93]). In the English lineup #1, the correlation was consistent for CoSim (0.83) but weak-to-satisfactory for EucDist (213.0, [0.47]). Overall, both CoSim and EucDist were satisfactory techniques for determining voice similarity; in addition, CoSim performed slightly more effectively, as addressed by Gahman and Elangovan52 (see Supplemental Table S1).
In terms of similarity, what could have led to an increase in false alarms? Prosodic-acoustic features showed evidence that, although foils and target voices performed (statistically) significantly differently (with the exception of the lineups presented in Figure 8A,B and Figure 9A), listeners' choices were somewhat spread across different foils in the other lineups (Figure 8C,D and Figure 9B-D). Such judgments seem to depend more on one (or perhaps a few) specific acoustic features that speakers share than on a whole class of prosodic-acoustic features. For instance, our study presented several features (co)varying between productions; nevertheless, the perceptual results show listeners' preference for a specific foil as a match for the target voice8,35,64,65. Further analyses are necessary in future work.
It is worth considering the choice of a multidimensional parametric matrix for acoustic similarity66, like the one presented in this study. Our findings are aligned with previous studies that used acoustic features based on f0, VQ, and intensity9,35; such features are strongly recommended for speaker comparison tasks in the forensic field9,66. It is nevertheless important to highlight that, in the present research, none of the foils was predicted to be similar to the target voice, although we used a variant of McFarlane's guidelines16,17 in preparing the voice lineups for foreign-accented speaker recognition. Furthermore, we must emphasize that our voice lineup procedure differs from McFarlane's guidelines in important ways, including stimulus length, number of voices, selection of foils (randomized here), and speech style.
The extent to which the prosodic-acoustic features of f0, voice quality, and intensity relate to listeners' perception depends heavily on the speaking style used in the research. Here, we used reading passages. The majority of earwitness studies opt for (semi-)spontaneous stimuli as a balanced solution, e.g., the DyVis corpus67, which has been extensively employed in contemporary earwitness investigations28,68. The reason we chose reading tasks is that our corpus, at the time this manuscript was composed, consisted exclusively of data obtained from reading tasks. It is, however, important to acknowledge that the compilation of a (semi-)spontaneous corpus is already in progress. Furthermore, reading a story can generate a dependable level of uniformity and variation, both intra-speaker and inter-speaker. This facilitates the emphasis on prosodic or phonetic characteristics (individual or dialectal) in voice comparison situations. It becomes especially noticeable in instances of vocal load during the production of a foreign accent, even among proficient speakers8,9,23,54.
Figure 8: Bar plots containing the results for the four lineups carried out by the Brazilian listeners. (A,B) Non-significant differences between the target voice (in blue) and a foil (in light blue). (C,D) Significant differences between the foils and the target voice. Abbreviation: NS = non-significant. Please click here to view a larger version of this figure.
Figure 9: Bar plots containing the results for the four lineups carried out by the American listeners. (A) Non-significant difference between the target voice (in red) and a foil (in pink). (B-D) Significant differences between the foils and the target voice. Abbreviation: NS = non-significant. Please click here to view a larger version of this figure.
Supplemental Table S1: Speech production statistics – Generalized Additive Models (GAMs): the F-statistics (degrees of freedom) and the adjusted R2 of each statistically significant prosodic-acoustic feature (Figure 6) and rhythm metric (Figure 7) per Language level, as well as the individual values of each smooth term (the covariates). Speech perception statistics: the Kruskal-Wallis χ2 statistics (degrees of freedom), the p-values, and the adjusted η2, as well as a post-hoc Dunn's test for pairwise comparison (Figure 8 and Figure 9) of each statistically non-significant difference between the pairs of target-foil voices (Figure 8A,B and Figure 9A). Acoustic similarity matrices for the Cosine and Euclidean distances of each lineup. Please click here to view a larger version of this figure.
The current protocol presents a novelty in the field of (forensic) phonetics. It is divided into two phases: one based on production (acoustic analysis) and one based on perception (judgment analysis). The production phase comprises data preparation and forced alignment, realignment, and automatic extraction of prosodic-acoustic features, in addition to the statistical analysis. This protocol connects data collection to data analysis in a faster and more efficient way than traditional protocols based on predominantly manual segmentation.
The realignment process, shown in protocol steps 1.3.6 to 1.3.6.9, operates on the phone and word units generated in protocol steps 1.3.1 to 1.3.4. Its novelty is that it generates a new TextGrid containing three new text tiers: a syllabic tier, a speech chunk tier that can be generated automatically or manually based on words, and a tone tier that is quite useful for calculating f0 measures. The realignment guidelines proposed for this protocol were previously used in studies of L2 speech rhythm21,69,70, in studies on the detection of foreign accent degree using machine learning techniques22, and in studies in the forensic domain9.
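For illustration only, the sketch below mimics the core of that regrouping in R (the protocol's own VVUnitAligner Praat script is the actual implementation, and its rules are richer than this simplification; all column names are hypothetical):

```r
# Hypothetical phone table: columns label, start, end (in seconds);
# vowels labeled "V", consonants "C", pauses "#", as in tier 2 of Figure 3.
v_to_v_units <- function(phones) {
  onsets <- which(phones$label == "V")               # each vowel onset opens a V_to_V unit
  starts <- phones$start[onsets]
  ends   <- c(starts[-1], phones$end[nrow(phones)])  # a unit ends at the next vowel onset
  data.frame(start = starts, end = ends, dur = ends - starts)
}

# Chunk-level tonal tier: highest (H) and lowest (L) f0 value within each speech chunk.
# 'f0' is assumed to hold columns time and hz; 'chunks' holds columns start and end.
chunk_tones <- function(f0, chunks) {
  t(apply(chunks, 1, function(ch) {
    inside <- f0$hz[f0$time >= ch["start"] & f0$time <= ch["end"]]
    c(H = max(inside, na.rm = TRUE), L = min(inside, na.rm = TRUE))
  }))
}
```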
The automatic extraction of prosodic features, presented in protocol steps 1.3.7 to 1.3.7.10, generates a multidimensional matrix of prosodic-acoustic features, as demonstrated in the current research. Such features include melodic, durational, and spectral (voice quality and intensity) speech features71. A considerable number of these acoustic features are relatively insensitive, and somewhat robust, to the noisy audio recordings often collected in forensic and other scenarios72. Furthermore, the multidimensional matrix can be applied to speech recognition systems via several AI techniques (work in progress). It can also be quite useful for teaching prosodic aspects in L2 pronunciation classes21 (work in progress).
The statistical analysis of this part of the protocol relies on the GAM method, since we dealt with a complex relationship between a specific acoustic feature and the language factor across speakers. To model a specific acoustic feature, for instance, it was necessary to check how other acoustic features from the matrix (the smooth terms) behaved, so that we could describe more accurately what was happening during speech production.
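A minimal mgcv sketch of this model structure, corresponding to the smoothing effects and factor-smooth interactions described in Figure 6 and Figure 7 (dataset and column names are hypothetical):

```r
library(mgcv)

# 'prod_df' is assumed to hold one row per speaker/passage, with columns
# speech_rate, sd_syll (SD of syllable duration), and Language
# (factor with levels L1-English, L2-English, L1-BP, L2-BP).
prod_df$Language <- as.factor(prod_df$Language)

m <- gam(speech_rate ~ Language +              # parametric effect of the Language factor
           s(sd_syll, by = Language, k = 5),   # factor-smooth interaction: covariate:Language
         data = prod_df, method = "REML")

summary(m)                        # F-statistics, edf, adjusted R2 (as in Supplemental Table S1)
plot(m, pages = 1, shade = TRUE)  # one smooth trajectory per Language level
```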
Regarding the perception analysis, the current protocol consisted of preparing and implementing the voice lineups, the statistical analysis, and the acoustic similarity analysis. To prepare the lineups, we used the CreateLineup script for Praat, which automatically generates an audio file for each lineup containing the foils and the target voice in random order, as described in protocol steps 2.2 to 2.2.1.9.
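A hypothetical R sketch of the lineup-assembly logic (the protocol's own CreateLineup script for Praat is the actual implementation; this simplification omits any inter-voice silence or prompts the script may add):

```r
library(tuneR)

lineup_dir <- "lineup_01"   # hypothetical folder containing the six foils and the target voice
wav_files  <- list.files(lineup_dir, pattern = "\\.wav$", full.names = TRUE)
wav_files  <- sample(wav_files)                    # randomize the presentation order

waves  <- lapply(wav_files, readWave)              # assumes mono files with matching sampling rate and bit depth
lineup <- do.call(bind, waves)                     # concatenate in the randomized order
writeWave(lineup, file.path(lineup_dir, "voice_lineup.wav"))
```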
For the statistical analysis of this protocol, we ran the Kruskal-Wallis non-parametric test, since our data did not meet the Analysis of Variance (ANOVA) assumptions. Because the analysis relates the judgment scores given by the listeners to the voices of each lineup (the target and the foils), we kept the logic of a one-way linear model, using its non-parametric counterpart.
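A minimal R sketch of these tests (column names are hypothetical; FSA::dunnTest is one common Dunn's test implementation):

```r
# 'judgments' is assumed to hold one row per listener judgment, with columns
# score (the judged score) and voice (factor: the target or one of the foils).
kruskal.test(score ~ voice, data = judgments)

# Post-hoc Dunn's test for pairwise target-foil comparisons
library(FSA)
dunnTest(score ~ voice, data = judgments, method = "bonferroni")
```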
For the acoustic similarity analysis, we used the AcousticSmilarity_cosine_euclidean script for Python. The program opens the computer's directory tree and asks the user to choose the file containing the prosodic-acoustic features of the foils and the target voice that compose the lineup under analysis. At the click of a button, the script generates three similarity matrices (a Cosine, a Euclidean, and a transformed, zero-to-one Euclidean) as both '.txt' (tab-delimited) and '.csv' files. Keep in mind that every dataset (the '.txt' file containing the prosodic-acoustic features of the foils and the target) must be in the original folder of the lineup, and every lineup folder must have a copy of the AcousticSmilarity_cosine_euclidean script.
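A minimal R sketch of the same computations (the protocol's own script is written in Python; the zero-to-one rescaling shown here is one common transform and may differ from the one used by the script):

```r
# 'feats' is assumed to be a numeric matrix with one row per voice (foils and target)
# and one column per prosodic-acoustic feature.
cosine_sim  <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
euclid_dist <- function(a, b) sqrt(sum((a - b)^2))

n <- nrow(feats)
cos_mat <- euc_mat <- matrix(NA, n, n, dimnames = list(rownames(feats), rownames(feats)))
for (i in seq_len(n)) for (j in seq_len(n)) {
  cos_mat[i, j] <- cosine_sim(feats[i, ], feats[j, ])
  euc_mat[i, j] <- euclid_dist(feats[i, ], feats[j, ])
}
euc_transformed <- 1 / (1 + euc_mat)   # assumed zero-to-one rescaling; not necessarily the script's formula

write.table(round(cos_mat, 3), "cosine_similarity.txt", sep = "\t", quote = FALSE)
write.csv(round(euc_transformed, 3), "euclidean_transformed.csv")
```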
In summary, the current protocol represents an advance in (forensic) phonetics research. Comprising distinct phases (production and perception analyses, as well as acoustic similarity), it bridges the gap between data collection and multidimensional acoustic-feature analysis. Researchers can benefit from a holistic perspective, exploring both production and perception aspects. Moreover, this protocol ensures research reproducibility, allowing others to build upon the guidelines of this work.
Limitations and future directions
Keep in mind that the protocol will only work as intended if data preparation follows the guidelines proposed in protocol steps 1.3.1 to 1.3.4. Moreover, in the forced alignment procedures, we must be aware that an external pre-trained forced aligner (such as MAUS, used in the current work) is susceptible to segmentation errors, especially with foreign-accented data for which no model was trained, when the audio quality is poor, or when the aligner does not cover the language of the audio (which makes the expert's work more difficult and time-consuming). Forensic transcription, especially of longer audio files, encounters difficulties with respect to the accurate and reliable representation of spoken recordings72.
There are also several challenges in transcribing and aligning longer audio files. Aligner performance is affected by speaking style and phonetic environment, as well as by dialect-specific factors. Although there are aligner-specific differences, their performance is broadly similar and generates segmentations that compare well to manual alignment overall73. In addition, the L1 and L2 English productions were aligned with a pre-trained acoustic model of American English, owing to the unavailability of foreign-accented speech models in MAUS. Furthermore, there are no pre-trained acoustic models for BP in MAUS; for the BP data, the pre-existing acoustic models of Italian were used (see NOTE, protocol step 1.3.5.7).
In future work (in progress), we will use the Montreal Forced Aligner (MFA)74 as part of the protocol. A great advantage of MFA is the availability of pre-trained acoustic models and grapheme-to-phoneme models for a wide variety of languages (including BP), as well as the ability to train new acoustic and grapheme-to-phoneme models on any new dataset (i.e., our foreign-accented speech corpus)75.
The authors have nothing to disclose.
This study was supported by the National Council for Scientific and Technological Development – CNPq, grant no. 307010/2022-8 for the first author, and grant no. 302194/2019-3 for the second author. The authors would like to express their sincere gratitude to the participants of this research for their generous cooperation and invaluable contributions.
CreateLineup | Personal collection | # | Praat script for voice lineup preparation |
Dell I3 (with solid-state drive – SSD) | Dell | # | Laptop computer |
Praat | Paul Boersma & David Weenink | # | Software for phonetic analysis |
Python 3 | Python Software Foundation | # | Interpreted, high-level, general-purpose programming language |
R | The R Project for Statistical Computing | # | Programming language for statistical computing |
Shure Beta SM7B | Shure | # | Microphone |
SpeechRhythmExtractor | Personal collection | # | Praat script for automatic extraction of acoustic features |
SurveyMonkey | SurveyMonkey Inc. | # | Platform for assembling free customizable surveys, with a suite of back-end programs that include data analysis, sample selection, debiasing, and data representation |
Tascam DR-100 MKII | Tascam | # | Digital voice recorder |
The Munich Automatic Segmentation System MAUS | University of Munich | # | Forced-aligner of audio (.wav) and linguistic information (.txt) files |
VVUnitAligner | Personal collection | # | Praat script for automatic realignment and post-processing of phonetic units |