One-Shot Speaker Identification for a Service Robot using a CNN-based Generic Verifier

Ivette Vélez¹, Caleb Rascon² and Gibrán Fuentes-Pineda³

Abstract. In service robotics, there is an interest in identifying the user by voice alone. However, in application scenarios where a service robot acts as a waiter or a store clerk, new users are expected to enter the environment frequently. Typically, speaker identification models need to be retrained when this occurs, which can take an impractical amount of time. In this paper, a new approach for speaker identification through verification has been developed using a Siamese Convolutional Neural Network (SCNN) architecture, which learns to generically verify if two audio signals are from the same speaker. By having an external database of recorded audio of the users, identification is carried out by verifying the speech input against each of its entries. If new users are encountered, it is only required to add their recorded audio to the external database for them to be identified, without retraining. The system was evaluated in four different aspects: the performance of the verifier, the performance of the system as a classifier using clean audio, its speed, and its accuracy in real-life settings. Its performance, in conjunction with its one-shot-learning capabilities, makes the proposed system a viable alternative for speaker identification for service robots.

I. INTRODUCTION

It is of great interest that machines, especially service robots, interact with humans in a similar manner as a human would. Thus, there is a growing interest in correctly identifying the speaker with whom the robot is interacting by their voice alone, and in doing so in a real-life setting [1], [2]. In such scenarios, however, new users are often introduced into the environment, such as when a new customer enters a restaurant or when a family member visits the user's home.
Accommodating these new users is expected from a service robot, which includes identifying them by their voice. However, typical speaker identification systems require a retraining process every time a new speaker is added [3].

In this work, we propose an approach that does not have this requirement (known as one-shot learning). It relies on a generic verification model that establishes if two audio recordings are from the same speaker. This is complemented by having an external database of audio recordings of the users to be identified and applying the model to each of its entries to verify if the speech input is from any of the users in the database. Once all the entries are verified, the results are used to establish which known speaker the speech input comes from; this process can also deem the user as unknown. Additionally, and more importantly, if the user is unknown and their identification is of interest in the future, it is only required to add their speech input as another entry in the database; the verification model does not need to be retrained.

Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM), Mexico. ¹ ijvelezt@gmail.com ² caleb.rascon@iimas.unam.mx ³ gibranfp@unam.mx

The architecture of the proposed model is based on a Siamese Convolutional Neural Network (SCNN). The resulting verification model in this work is able to extract suitable audio features and to learn a function that determines the similarity between both inputs so as to verify if the two recordings are from the same speaker.

This system is planned to be deployed on a service robot in a real-life setting. To benefit the rhythm of the human-robot interaction, the proposed system is expected to respond quickly, and it will be evaluated in this aspect.
Additionally, the system is expected to have a high level of performance, so that it does not frequently mistake one user for another and cause frustration. As discussed in Section II, most speaker identification systems in real-life settings have an accuracy of around 80%, and those with greater accuracy require the speakers to be known a priori. Moreover, most service robotics settings assume that there is a limited number of users with which to interact. Thus, we consider any level of performance above 80% with a limited number of speakers as acceptable, given the one-shot-learning nature of the proposed system.

A video demonstration of the full system, as well as all relevant downloads (corpora, source code, models, etc.), can be found at http://calebrascon.info/oneshotid/.

The remainder of this paper is organized as follows: a summary of the related works is presented in Section II; in Section III the proposed system is described, as well as the three main components of the core model (the training data set, the representation of the data, and its architecture); the methodology used for evaluating the models as well as their results are presented in Section IV; and we conclude our work in Section V.

II. RELATED WORKS

There is a vast amount of literature on speaker identification, with two important types of techniques: by classification and by verification. Classification techniques train one model with an output limited to the classes (in this case, known speakers) that it was trained with. When a new speaker enters the scenario, the model attempts to match them with one of the known speakers, resulting in a false positive. On the other hand, verification techniques train a model for each known speaker, with their outputs providing the probability that the speaker is the one the model was trained for.
These techniques are able to register if a speaker is unknown (when all the models provide low probabilities); however, they generally tend to be less accurate than classification techniques [3].

Since the proposed system employs a verifier to measure the similarity between two audio recordings in the speaker domain, the remainder of this section reviews works that are related to this approach.

In [4], speaker identification is carried out by comparing a measure of similarity between the audio of a speaker to be identified and the patterns previously generated for the known speakers. Later, in [5], the author proposes changes to this system where the database and the architecture change; however, the similarity measure is kept. Neural networks have also been used with raw audio [6], where a CNN extracts the relevant information and an ad-hoc verifier is generated for every speaker.

In [7], the authors describe the VoxCeleb database and train deep learning models for identification and verification of speakers. They use the cosine distance between two signals. In [8], speaker verification is carried out by using a Siamese model of two Long Short-Term Memory (LSTM) networks, and a contrastive loss for the verification.

It is important to note that the use of neural networks in conjunction with audio has not been limited to the identification and verification of the speaker, but has also addressed the issue of extracting audio features to make them more robust. The works of [9], [10], [11], [12], [13] use embeddings or complete neural networks to generate features to which statistical methods are applied to verify a speaker.

The aforementioned works, depending on the database used for training, achieved performances above 80% accuracy for verification. Greater performance has been recently achieved, and has been applied in the field of service robotics.
For example, in [1] a speaker identification process is carried out by inputting the power spectrum of the speech input; a model is trained for each speaker, and then the Euclidean distance between the models' outputs is measured. The authors report a 96% average rate of identification for 20 speakers. In [2], 10 speakers were identified using 32 MFCC-based characteristics. The reported rate of identification reached 100% only in certain locations of the speaker, and required such location to be known as part of the training data.

It is very important to mention that in all the aforementioned works that carry out identification via classification, it is assumed that the users to be identified are known. During the testing phase, new speakers are not added, so it is implied that the models are not able to identify speakers that they were not trained for.

III. PROPOSED SYSTEM

The proposed system is divided into three parts: a generic verifier that outputs the distance between two audio inputs in the speaker domain; an external database that stores audio entries of known users; and a user selector, which takes as input a list of verification scores. A diagram of the complete proposed system is shown in Figure 1.

[Figure 1: block diagram of the identification system. Blocks: speaker to recognize, pre-processing, database (entries A, B, ...), verifier (output 0-1), storage of results, selection of speaker, result.]

Fig. 1: Diagram for speaker identification. When an identification is carried out, the system iterates through the database entries, verifying each with the data of the speaker to recognize, and storing each verification result. When the iteration ends, a speaker is selected based on the verification results.

As can be seen, the central part of the system is the verifier, which is expected to indicate if two audio recordings are from the same user or not. By satisfying this expectation, the system provides two virtues that are of great interest to service robotics.
First, the system permits having several entries per speaker which, as will be seen, contributes to the robustness of their identification. Second, the system also permits the addition of new speakers without requiring the model to be retrained. When the selection process deems the new audio data as not belonging to any of the known speakers, a human-robot interaction can be carried out to ask the unknown user their name and add them to the database for future identifications. This operation is summarized in Figure 2.

[Figure 2: block diagram. Blocks: speaker C, identification system with database (A, B), result (unknown speaker), addition of new speaker, database (A, B, C).]

Fig. 2: Diagram for speaker identification with speaker addition.

As can be concluded, it is very important that the generic verifier at the center of the system is trained with a database with a large number of speakers, so that it can generalize. Additionally, it is also important that the audio data is appropriately represented and that a proper architecture is chosen, so the system as a whole achieves the performance and speed previously discussed. These three aspects (databases, data representation, and architecture) are described in the remainder of this section.

A. Databases

The LibriSpeech database [14] is based on the LibriVox project. It was recorded in a controlled environment, with just one speaker talking at a time, and with little variance of background noise between recordings. This database was chosen since it is text-independent, which will make the model robust against any phrasing. It contains more than 100 speakers, which will grant the model generality in its verification. Also, it does not involve any monetary requirement, which simplified its procurement.

Another database that was used is VoxCeleb [7]. It was obtained from YouTube videos of interviews of celebrities.
This resulted in the database having more than 6000 speakers recorded in many different recording environments, which varied in terms of noise presence and distance from the microphone to the user. This database was chosen because of these variations, since it is expected to make the verification model robust against them. Additionally, it also does not involve any monetary requirement, which simplified its procurement.

For training, 80% of each database was used, with 10% for validation and 10% for testing. Since the focus of the proposed approach is to verify generically between speakers, the training, validation and testing datasets do not share speakers. That is, none of the recorded data of the speakers in the validation and testing datasets was used as part of the training process.

For further testing, an evaluation corpus was recorded based on the testing subset of LibriSpeech, referred to here as LibriSpeechReal. Two real environments were used: an open-cubicle computer lab (background noise at -45 dBFS with τ60 ≈ 0.51 s) and an office (background noise at -47 dBFS with τ60 ≈ 0.39 s). A monitor speaker reproducing the LibriSpeech testing subset data was placed 1 m from a flat-response microphone, which recorded such reproduction. This corpus emulates what a service robot would hear (in terms of noise and reverberation) from a speaker in a real-life setting.

B. Representation of the data

The audio signals are converted to a time-frequency spectrogram before being fed to the model. Two variations of this spectrogram-based data representation were used as inputs for the trained models. Both variations are generated using a 1 s segment of audio, using an overlap of 50%, and only using frequencies of the lower half of the frequency spectrum as the input to the model. Preliminary testing showed no major differences in performance between using only the lower half of the spectrum and using the full frequency range.
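As a minimal sketch of the framing arithmetic behind this representation, the snippet below counts the overlapping frames in a 1 s segment and converts a number of retained FFT bins into a frequency limit. It rests on two assumptions that the text does not state: a 16 kHz sampling rate (the value under which the 4 kHz and 1.28 kHz limits quoted for the two variations work out), and that the 50% overlap applies between consecutive FFT windows. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # assumption: the sampling rate is not stated in the text


def frame_starts(n_samples, n_fft, overlap=0.5):
    """Start indices of the full-length frames, given a fractional overlap."""
    hop = int(n_fft * (1.0 - overlap))
    return list(range(0, n_samples - n_fft + 1, hop))


def bin_upper_frequency_hz(n_bins, n_fft, sample_rate_hz=SAMPLE_RATE_HZ):
    """Frequency reached when keeping the lowest n_bins bins of an n_fft-point FFT."""
    return n_bins * sample_rate_hz / n_fft


if __name__ == "__main__":
    one_second = SAMPLE_RATE_HZ  # number of samples in a 1 s segment
    # (FFT points, retained bins) for the two variations used in this work.
    for n_fft, n_bins in [(1024, 256), (400, 32)]:
        n_frames = len(frame_starts(one_second, n_fft))
        limit_hz = bin_upper_frequency_hz(n_bins, n_fft)
        print(f"n_fft={n_fft}: {n_frames} frames, lowest {n_bins} bins reach {limit_hz:.0f} Hz")
```

Under these assumptions, the smaller window yields a much smaller input (fewer bins per frame), which is one plausible reason for the run-time differences reported later.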
Both variations are normalized as part of their calculation.

One variation, referred to here as Spect 256, applies a 1024-point FFT window, and only uses the lower 256 FFT points of the frequency range. These 256 FFT points represent frequencies up to 4 kHz. The other variation, referred to here as Spect 32, applies a 400-point FFT window, and only uses the lower 32 FFT points of the frequency range. These 32 FFT points represent frequencies up to 1.28 kHz.

It is important to note that other types of representation were tested, such as the FFT of the whole audio segment, the Mel-Frequency Cepstral spectrum of the whole audio segment, as well as using the upper half of the frequency range. However, the performance obtained with these representations did not improve upon the results obtained with the representations described in this section.

C. Architecture

Several architectures were tested for use as a generic verifier. In this section, the three models that obtained the highest performance are described. These three models are based on two architectures: VGG [15] and ResNet 50 [16].

In the proposed models, these architectures are arranged as a Siamese network [17], [18]. When used for verification, these networks consist mainly of two elements: feature extraction and similarity calculation. To this effect, their first layers are mainly convolutional layers, while the last layers are fully connected. The outputs of the latter are passed through a SoftMax function to calculate the probability of the two inputs being from the same speaker.

For each model, 800,000 audio files were randomly selected from the database used (either LibriSpeech or VoxCeleb) for training per epoch, while 80,000 audio files were randomly selected for validation, and 80,000 for testing. Each data set had the same number of positive and negative examples.
Although the audio files are the same throughout the epochs, the audio input that is fed to the models is different due to the random selection of the segment of audio from the file.

1) VGG: In Figure 3, the proposed Siamese network inspired by VGG 16 [15] can be seen. It is composed of: 4 convolutional layers for each part of the Siamese network; 2 pooling layers; and 3 fully-connected layers for identification. This network corresponds to the first 4 layers and the last 3 layers of the original VGG 16 architecture. Since this network only uses 7 layers of the VGG 16 architecture, it is referred to here as VGG7. The convolutional layers use 64 filters of 3 × 3 for the first two layers and 128 filters for layers 3 and 4. The intermediate layers use ReLU as activation function.

[Figure 3: network diagram. Pre-processing: each signal (A and B) passes through VAD and a spectrogram. Architecture: each branch contains two 3x3 conv, 64 layers, two 3x3 conv, 128 layers and pool/2 layers; the branches feed the fully connected layers fc 1024, fc 1024 and fc 2.]

Fig. 3: Siamese VGG7.

Two of the proposed models are based on this network, each using one of the data representations discussed earlier. The model referred to here as VGG7 256 uses the Spect 256 data representation, while the model referred to here as VGG7 32 uses the Spect 32 data representation.

The training of this network was done in batches of 50 samples, using cross entropy as loss and stochastic gradient descent as optimization algorithm. This model was first trained for 15 epochs with a learning rate of 0.01 with LibriSpeech; another model was later trained for 8 epochs with the same learning rate with VoxCeleb.

2) ResNet 50: In Figure 4, the proposed Siamese network based on ResNet 50 [16] can be seen. It is composed of: 1 convolutional layer with 64 filters of 7 × 7; 16 bottleneck blocks; and 2 fully connected layers. Intermediate layers use batch normalization and ReLU as activation function.
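The layer counts stated above can be checked with a short tally. This is only a bookkeeping sketch, not a network definition: the per-group block counts (3, 4, 6, 3) and filter sizes are read from Figure 4, the counts are per Siamese branch, and the variable and function names are hypothetical.

```python
# Each entry: (number of bottleneck blocks,
#              filters of the 1x1 and 3x3 convolutions,
#              filters of the closing 1x1 convolution), as drawn in Figure 4.
BOTTLENECK_GROUPS = [
    (3, 64, 256),
    (4, 128, 512),
    (6, 256, 1024),
    (3, 512, 2048),
]
CONVS_PER_BLOCK = 3  # 1x1 conv, 3x3 conv, 1x1 conv


def count_layers(groups, stem_convs=1, fc_layers=2):
    """Tally bottleneck blocks, convolutional layers and fully connected layers."""
    blocks = sum(n_blocks for n_blocks, _, _ in groups)
    convs = stem_convs + blocks * CONVS_PER_BLOCK
    return {"blocks": blocks, "conv": convs, "fc": fc_layers}


if __name__ == "__main__":
    print(count_layers(BOTTLENECK_GROUPS))
```

The tally gives 16 bottleneck blocks, in agreement with the description above.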
[Figure 4: network diagram. Pre-processing: each signal (A and B) passes through VAD and a spectrogram. Architecture, per branch: 7x7 conv, 64, /2; max pool/2 and avg pool/2 layers; and four groups of bottleneck blocks. Group 1: 3 blocks of (1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256). Group 2: 4 blocks of (1x1 conv, 128; 3x3 conv, 128; 1x1 conv, 512). Group 3: 6 blocks of (1x1 conv, 256; 3x3 conv, 256; 1x1 conv, 1024). Group 4: 3 blocks of (1x1 conv, 512; 3x3 conv, 512; 1x1 conv, 2048). The branches feed the fully connected layers fc, 2048 and fc, 2.]

Fig. 4: Siamese ResNet 50.

The proposed model based on this network is referred to here only as ResNet for ease of reference, and it uses the Spect 32 data representation.

To speed up the training process, an initialization network similar to ResNet was used, with the last layer performing both speaker verification and speaker classification.
After an acceptable loss was achieved, the initialized weights were re-used to start the training of ResNet.

The training of this network was done with LibriSpeech in batches of 10 samples, using cross entropy as loss with the L2 norm as regularizer and stochastic gradient descent as optimization algorithm. This model, after the initialization, was trained for 10 epochs with a learning rate of 0.01, and later for another 10 epochs with a learning rate of 0.001.

IV. EVALUATION AND RESULTS

The three models previously described were evaluated as part of the proposed system, which can add new speakers to identify. The models were evaluated in four aspects: the performance of the verifier, the performance of the system as a classifier using clean audio, the speed of the system, and the accuracy of the system when it is evaluated in real-life settings. As mentioned before, an accuracy higher than 80% is considered acceptable, since the closest system to our proposed approach is [7], and it had an accuracy of 80.5%.

A. Evaluation of verifier

The different trained verifiers were evaluated with 1,000 samples: 500 samples from the same speakers and 500 from different speakers. These audio recordings were randomly selected from the testing data set. This process was performed 10 times for each verifier, and the averages of the true positives, true negatives, false positives and false negatives of these results were obtained. Then, the precision, recall, F1 and accuracy were calculated, and are shown in Figure 5.

[Figure 5: bar chart of performance (%) per model. Precision: ResNet 89, VGG7 256 89, VGG7 32 91. Recall: ResNet 97, VGG7 256 98, VGG7 32 97. F1: ResNet 92, VGG7 256 94, VGG7 32 94. Accuracy: ResNet 92, VGG7 256 93, VGG7 32 94.]

Fig. 5: Evaluation of verifier with 1000 audios.

As can be seen, the model with the highest precision is VGG7 32, reaching 91%, followed by VGG7 256 and ResNet with 89%. Although the models have different results in the different metrics, they do not differ by more than 2%.
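The four metrics reported in Figure 5 follow from the averaged confusion counts in the usual way. The sketch below shows that computation; the function name is hypothetical, and the counts in the usage example are made up for illustration and are not taken from the evaluation.

```python
def verification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1 and accuracy from confusion counts.

    A "positive" is a pair of recordings from the same speaker.
    Assumes at least one predicted positive and one actual positive.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical counts for one run of 500 same-speaker and 500 different-speaker pairs.
    metrics = verification_metrics(tp=485, tn=440, fp=60, fn=15)
    for name, value in metrics.items():
        print(f"{name}: {100 * value:.1f}%")
```

A verifier with high recall and lower precision, as in Figure 5, accepts nearly all same-speaker pairs at the cost of some false acceptances of different-speaker pairs.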
All results are above the desired 80%, which implies that the proposed verification models function appropriately for determining whether two audio signals are from the same speaker or not. It is important to mention that these results only evaluate the performance of the models as standalone generic verifiers, and do not use an external database to carry out the identification process.

B. Evaluation as a classifier

To evaluate the system as a classifier, the external database is considered as static a priori knowledge of known users. The final selection is the speaker with the highest average verification result of all the users in the database against the speaker to recognize.

In this evaluation, an accuracy heat map is obtained for each model. Each heat map presents the accuracy of the system for different combinations of the number of known speakers and the number of audio recordings the database has of each known speaker. Each speaker-audios combination is tested such that each known speaker is verified against known speakers and unknown speakers alike in a balanced manner.

The intent of the accuracy heat maps is to show how the performance of the system changes as it has more known users and more audio recordings per known user. These accuracy heat maps are shown in Figure 6.

[Figure 6: three heat maps, (a) ResNet, (b) VGG7 256, (c) VGG7 32. Axes: 'Audios' and 'Speakers', each from 0 to 10. Color scale: accuracy from 0 to 100.]

Fig. 6: Accuracy heat maps as classifier. 'Audios' indicates the number of audios known per speaker. 'Speakers' indicates the number of known speakers.

As can be seen, the three models have a similar pattern: as the number of speakers increases, more known audio recordings in the external database are required to achieve a better performance. However, the three systems obtained an accuracy higher than 70% in all speaker-audios combinations, while some combinations reached values very close to 100%.
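The selection rule used in this evaluation (the known speaker with the highest average verification result) can be sketched as follows. This is an illustration under assumptions, not the authors' code: `verify` stands for any function that returns a same-speaker score in [0, 1], the database is a plain dictionary, and the `unknown_threshold` used to declare a speaker unknown is hypothetical, since the text does not specify how that decision is made.

```python
from statistics import mean


def identify(query, database, verify, unknown_threshold=0.5):
    """Return the best-matching known speaker, or None if the speaker seems unknown.

    database: dict mapping speaker name -> list of stored entries (e.g. spectrograms).
    verify:   callable(query, entry) -> score in [0, 1]; higher means "same speaker".
    """
    averages = {
        name: mean(verify(query, entry) for entry in entries)
        for name, entries in database.items()
        if entries
    }
    if not averages:
        return None
    best = max(averages, key=averages.get)
    # Assumption: a speaker is deemed unknown when even the best average is low.
    return best if averages[best] >= unknown_threshold else None


def enroll(database, name, entry):
    """Add an entry for a (possibly new) speaker; the verifier is not retrained."""
    database.setdefault(name, []).append(entry)


if __name__ == "__main__":
    # Toy stand-in for the verifier: entries are numbers, closer means more similar.
    def toy_verify(query, entry):
        return max(0.0, 1.0 - abs(query - entry))

    db = {}
    enroll(db, "A", 0.10)
    enroll(db, "A", 0.20)
    enroll(db, "B", 0.90)
    print(identify(0.15, db, toy_verify))
```

Averaging over several entries per speaker is what lets additional recordings improve robustness, while `enroll` shows why adding a speaker only touches the database.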
Additionally, VGG7 32 is the model with the best classification results, with an accuracy of 97.7%, followed by VGG7 256 with 95% and ResNet with 94.5%. Moreover, the overall average accuracy of the system with the VGG7 32 model is 97%, which implies that it performed with high accuracy in all speaker-audios combinations.

C. System speed

To determine the speed of the system when using each model, the average run time was measured over 10 sets of verifications of 1 audio segment against 100 others that are stored in the database as time-frequency spectrograms generated from 1-second audio segments. Storing the spectrograms avoids recalculating them from the audio data in the database for each verification. The times measured for each verification were: the time to calculate the spectrogram of the input audio and load the spectrogram database (t_spec), and the time that the verification process takes to run the CNN model through all the database entries (t_model). These run times are shown in Figure 7.

[Figure 7: bar chart of time (s) per model. t_spec: ResNet 0.13, VGG7 256 0.26, VGG7 32 0.13. t_model: ResNet 0.61, VGG7 256 0.29, VGG7 32 3.78 · 10^-2. t_total: ResNet 0.75, VGG7 256 0.55, VGG7 32 0.17.]

Fig. 7: Average run time for each model. t_spec indicates the run time of the calculation of the frequency spectrogram of the input data and of loading the spectrogram database. t_model indicates the run time of the verification process to run the CNN model through all the database entries. t_total indicates the total run time.

As can be seen in Figure 7, the VGG-based models are faster than ResNet, which is to be expected since they are smaller and require fewer computations. VGG7 32 is much faster than VGG7 256, in part because the VGG7 256 architecture has many more parameters, and because calculating a 1024-point FFT spectrogram takes more time than calculating a 400-point FFT spectrogram.
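The run times reported in Figure 7 allow a quick back-of-the-envelope check of where the time goes. The sketch below uses only those reported values; the per-entry cost is a derived estimate that assumes the model time is spread evenly over the 100 database entries, and the names are hypothetical.

```python
# Run times in seconds as reported in Figure 7: (t_spec, t_model, t_total).
RUN_TIMES_S = {
    "ResNet": (0.13, 0.61, 0.75),
    "VGG7 256": (0.26, 0.29, 0.55),
    "VGG7 32": (0.13, 3.78e-2, 0.17),
}
N_ENTRIES = 100  # database entries verified in each measurement


def spectrogram_share(model):
    """Percentage of the total run time spent on t_spec."""
    t_spec, _, t_total = RUN_TIMES_S[model]
    return 100.0 * t_spec / t_total


def model_time_per_entry(model):
    """Estimated CNN time per database entry (assumes an even split)."""
    _, t_model, _ = RUN_TIMES_S[model]
    return t_model / N_ENTRIES


if __name__ == "__main__":
    for name in RUN_TIMES_S:
        share = spectrogram_share(name)
        per_entry_ms = 1000.0 * model_time_per_entry(name)
        print(f"{name}: t_spec share {share:.1f}%, about {per_entry_ms:.2f} ms per entry")
```

For the VGG-based models the spectrogram step is a large part of the total, while the estimated per-entry model cost for VGG7 32 is well under a millisecond.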
It is important to note that in the VGG-based models, the calculation of the spectrogram takes up a considerable amount of their total run time. In the case of VGG7 32, it takes 76.5% of the total run time; in the case of VGG7 256, it takes 47.3%. This indicates that, if the system is to be made faster, it may be necessary to find a faster way to calculate the spectrogram, or to find another type of data representation that is faster to calculate.

However, it is to be noted that with the VGG7 32 model, the total run time of the system is 0.17 s, which is an acceptable response time given its one-shot-learning nature. In addition, the bulk of the total run time is taken by the calculation of the spectrogram of the input data and the loading of the database, both of which only need to be carried out once. This indicates that the response time will not significantly increase as the number of users increases, unless it grows several orders of magnitude beyond 10 users, which is unlikely given the application case of service robotics.

D. Evaluation in real-life settings

Up until this point, the evaluated models were trained with LibriSpeech, which is a clean data set. However, given the service robotics application case, it is of interest to evaluate the system in a real-life setting. For this purpose, an evaluation corpus based on LibriSpeech was recorded in two real environments. This corpus is referred to here as LibriSpeechReal (details are provided in Section III-A).

The VGG7 32 model was chosen because it is both the most accurate and the fastest model (as seen in Sections IV-B and IV-C), the two qualities relevant for the service robotics application case. The resulting accuracy heat map is shown in Figure 8. As can be seen, the performance of the system decreases considerably, up to 51% in many cases.
This is due to the lack of robustness of the verification model against noise and reverberation, since it was trained with a clean dataset.

To overcome this situation, the VGG7 32 model was retrained with the VoxCeleb database [7] (described in Section III-A). In this case, more than 2.5 million samples were used per training epoch. The evaluation with LibriSpeechReal of the retrained VGG7 32 model is shown as a heat map in Figure 9.

As can be seen, its performance has increased significantly, now obtaining an average accuracy of 81.2%. More importantly, it can also be seen that the system exceeds the accuracy threshold expected for a service robot in many of the speaker-audios combinations.

[Figure 8: heat map for VGG7 32. Axes: 'Audios' and 'Speakers', each from 0 to 10. Color scale: accuracy from 0 to 100 %.]

Fig. 8: Accuracy heat maps with LibriSpeechReal for a model trained with the LibriSpeech dataset. 'Audios' indicates the number of audios known per speaker. 'Speakers' indicates the number of known speakers.

[Figure 9: heat map for VGG7 32. Axes: 'Audios' and 'Speakers', each from 0 to 10. Color scale: accuracy from 0 to 100 %.]

Fig. 9: Accuracy heat maps with LibriSpeechReal for a model trained with the VoxCeleb dataset. 'Audios' indicates the number of audios known per speaker. 'Speakers' indicates the number of known speakers.

For further analysis, two confusion matrices are shown in Tables I and II, to allow us to analyze with which speakers the model is failing. The matrices belong to the combination of 10 known speakers with 1 audio per speaker, and to the combination of 10 known speakers with 10 audios per speaker. These two speaker-audios combinations were chosen since they present the widest variety of change when increasing the number of audios per speaker. It is worth mentioning that the confusion matrices for the speaker-audios combinations with 1 or 2 known speakers, independently of the number of audios, score a near perfect result, with the model not confusing any of the known speakers or the unknown ones.

TABLE I: Confusion matrix for the combination of 10 known speakers with 1 spectrogram in the database. 'Sp.' represents the label of the speaker and 'U' represents unknown speakers.

      Known speakers
Sp.   1   2   3   4   5   6   7   8   9  10   U
1    16   1   3   0   0   0   0   0   0   0   0
2     0  10   0   1   0   0   2   1   0   6   0
3     7   0   4   2   0   0   0   0   6   1   0
4     3   3   0   3   0   7   1   0   3   0   0
5     1   0   0   2   9   0   5   0   0   3   0
6     1   0   0   0   0  19   0   0   0   0   0
7     0   1   0   2   5   0   6   3   0   3   0
8     0   0   0   2   0   2   1   9   0   6   0
9     1   2   2   1   0   0   0   0  14   0   0
10    2   1   1   0   0   0   1   1   0  14   0
U     0   0   0   0   0   0   0   1   0   1  18

TABLE II: Confusion matrix for the combination of 10 known speakers with 10 spectrograms in the database. 'Sp.' represents the label of the speaker and 'U' represents unknown speakers. A green cell represents an improvement over Table I, a blue cell represents no improvement, and a red cell represents a decline.

      Known speakers
Sp.   1   2   3   4   5   6   7   8   9  10   U
1    20   0   0   0   0   0   0   0   0   0   0
2     0  13   0   0   0   0   0   3   0   4   0
3     4   0  12   1   0   0   0   0   3   0   0
4     0   0   0  15   0   4   0   0   1   0   0
5     0   3   0   0   7   0   7   0   0   3   0
6     0   0   0   0   0  20   0   0   0   0   0
7     0   0   0   0   2   0  15   0   0   3   0
8     0   1   0   0   0   0   3  10   0   6   0
9     1   0   2   7   0   1   0   0   9   0   0
10    0   1   0   0   0   0   1   4   0  14   0
U     0   0   0   0   0   0   2   0   0   0  18

As can be seen, there is a considerable improvement for most of the users in the confusion matrix shown in Table II, which implies that having more audio entries per speaker contributes to its performance. However, this change does not seem to affect the detection of unknown speakers, which is to be expected, since audio entries of unknown speakers are not stored in the database.

It can also be seen that there is one user that was not affected by the change in the number of audio entries, and that there are two users that were mistaken more often with more audio entries. This is contrary to the tendency shown by the majority of the users. Although we are not able to explain definitively why this happened, we believe that it may be attributed to the manner in which the audio entry is selected: the system asks the user to talk for a pre-specified amount of time; then a 1 s
audio segment with an average energy above a pre-specified threshold is randomly chosen from the recording. This was carried out as a form of Voice Activity Detection (VAD) to ensure that the audio segment stored in the database had speech information with which to identify the user in the future. However, using this basic form of VAD may result in storing segments of audio with a considerable amount of silence in them. When additional testing was carried out, it was found that modifying the VAD threshold did not impact the overall performance of the system in any meaningful way. Thus, a more sophisticated VAD system may be required. Moreover, other data representation techniques could be used to normalize the input in terms of energy as well as feature extraction.

V. CONCLUSIONS

In this work, a speaker identification system was proposed based on a generic verifier and a dynamic external database of audio recordings of known speakers. Because of its generality, this system does not need to be retrained when new speakers enter the scenario, since these can be flexibly added to the database. For a demonstration and access to all relevant information, visit http://calebrascon.info/oneshotid/.

The performance of the three best-performing models tested for the role of the generic verifier was shown. These were Siamese convolutional models, based on two architectures: VGG 16 and ResNet 50.

The overall performance of the system showed an average accuracy of 97% with a clean testing corpus and an average accuracy of 81.2% with real-life recordings. However, a compromise needs to be struck between the number of audio entries per known speaker stored in the database and the number of speakers that it needs to identify.

Speed evaluations showed that VGG7 32 is the fastest of the proposed models, and its response time to verify 1 audio against 100 spectrograms is 0.17 s.
This response time is acceptable, given the one-shot-learning nature of the system as a whole. A model trained with a noisy database proved to be robust against noise, achieving an accuracy above the desired 80% in a real-life setting with most combinations of number of speakers vs. audio entries per speaker. Interestingly, most of these combinations are aligned with the application case of service robotics.

It is left for future work to integrate this system as part of a task of a service robot and to test it in the robot’s social interaction. Additionally, the training parameters of the proposed models could be refined to further improve the verification performance. Finally, as mentioned before, a more sophisticated VAD system could be employed and other types of data representation could be explored.

ACKNOWLEDGMENT

The authors would like to acknowledge the support of CONACYT through research grant 251319, UC-MEXUS through research grant CN-17-54, and PAPIIT-UNAM through research grant IA104016.
