Research

Speaker Verification


"A Segment-Based Speaker Verification System Using SUMMIT"

My master's thesis describes the development of a segment-based speaker verification system. Our investigation is motivated by past observations that speaker-specific cues may manifest themselves differently depending on the manner of articulation of the phonemes. By treating the speech signal as a concatenation of phone-size units, one may be able to capitalize on measurements for such units more readily. A potential side benefit of such an approach is that one may be able to achieve good performance with unit (i.e., phonetic inventory) and feature sizes that are smaller than what would normally be required for a frame-based system, thus deriving the benefit of reduced computation.

To carry out our investigation, we started with SUMMIT, the segment-based speech recognition system developed in our group, and modified it to suit our needs. The speech signal was first transformed into a hierarchical segment network using frame-based measurements. Next, acoustic models were developed for a small set of six broad phoneme classes using diagonal Gaussians, preceded by principal component analysis. Speaker-specific models were adapted from a general speaker-independent model. The initial feature vectors included averages of MFCCs, plus other more speaker-specific measurements such as energy, fundamental frequency (F0), and duration. The size and content of the feature vectors were determined through a greedy algorithm that optimized overall speaker verification performance. Below is a block diagram of the system.
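The greedy feature search mentioned above can be illustrated with a minimal sketch. It assumes a forward search that adds one measurement at a time; the thesis summary does not say whether the search ran forward or backward. The function name `greedy_forward_select` and the `evaluate` callback (standing in for a full verification experiment that returns a score where higher is better) are hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_forward_select(
    candidates: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> List[str]:
    """Grow a feature set one measurement at a time.

    `evaluate` is a hypothetical stand-in for running a verification
    experiment with the given features; higher scores are better.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining:
        # Score every one-feature extension of the current set.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining candidate improves performance
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected
```

In this sketch the search stops as soon as no single added feature improves the score, which keeps the feature vector small at the cost of possibly missing combinations that only help together.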


To facilitate a comparison with previously reported work, our speaker verification experiments were carried out using 630 speakers from the TIMIT corpus. 462 of the speakers were used to train the general speaker-independent model, and the remaining 168 speakers were used for evaluation. Each speaker-specific model was developed from the eight SI and SX sentences. Verification was performed using the two SA sentences common to all speakers.

To score an utterance, we first collapsed its phonetic transcription into the six broad manner classes and then computed a Viterbi forced alignment. Speaker verification was achieved by comparing the forced alignment score of the purported speaker with those obtained with the models of the speaker's competitors. Ideally, the purported speaker's score should be compared to the scores of every other system user. To reduce computation, we adopted a procedure in which the score for the purported speaker is compared only to the scores of a cohort set consisting of the 14 most similar speakers. The cohorts were determined by using a Mahalanobis distance to measure the similarity between two speakers. Specifically, Mahalanobis distances between a particular speaker's model and all 168 speaker models were computed. These distances were then rank ordered, and the 14 speaker models with the smallest distances were assigned as cohorts for that particular speaker. We have found that this method significantly reduces computation while minimally affecting overall performance.
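As an illustration of the cohort selection just described, here is a minimal sketch. It assumes each speaker model is summarized by a single mean vector and that all speakers share one set of diagonal variances; the actual model parameterization used for the distance is not given here. The names `mahalanobis` and `select_cohort` are hypothetical, and the target speaker is excluded from their own cohort.

```python
import math
from typing import Dict, List, Sequence


def mahalanobis(u: Sequence[float], v: Sequence[float],
                variances: Sequence[float]) -> float:
    """Mahalanobis distance under an assumed diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))


def select_cohort(target: str,
                  speaker_means: Dict[str, Sequence[float]],
                  variances: Sequence[float],
                  cohort_size: int = 14) -> List[str]:
    """Return the `cohort_size` speakers closest to `target`."""
    ranked = sorted(
        (mahalanobis(speaker_means[target], mean, variances), name)
        for name, mean in speaker_means.items()
        if name != target  # a speaker is not their own competitor
    )
    return [name for _, name in ranked[:cohort_size]]
```

With 168 evaluation speakers and a cohort of 14, each verification trial needs 15 forced alignments (the claimed model plus its cohort) rather than 168.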

During testing, forced alignment scores were computed on the two test utterances using the speaker's model and each of the models in the speaker's cohort set. These scores were then rank ordered, and the user was accepted if his/her model's score was within the top $N$ scores, where $N$ is a parameter we varied in our experiments. To test for false acceptance, we used as impostors only the members of a speaker's cohort set. We were able to achieve a performance of 100% correct acceptance with 4.85% false acceptance while retaining a simple system design and reducing computation significantly through the use of a small number of features on broad classes, diagonal Gaussian speaker models, and cohort sets.
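The rank-based decision rule can be sketched as follows. This is only an illustration, assuming that a higher forced alignment score means a better match (as with log-likelihoods) and that ties are resolved in favor of the claimed speaker. The function name `accept_claim` is hypothetical.

```python
from typing import Sequence


def accept_claim(claimed_score: float,
                 cohort_scores: Sequence[float],
                 top_n: int) -> bool:
    """Accept if the claimed speaker's score ranks within the top N.

    Rank 1 is the best score; cohort scores that tie with the claimed
    score do not push the claimed speaker down the ranking.
    """
    rank = 1 + sum(1 for score in cohort_scores if score > claimed_score)
    return rank <= top_n


# Example: one cohort model outscores the claimed model, so its rank is 2.
scores = [-12.0, -15.0, -9.0]
print(accept_claim(-10.0, scores, top_n=1))  # False
print(accept_claim(-10.0, scores, top_n=2))  # True
```

Raising $N$ makes true speakers easier to accept but also lets more impostors through, which is the trade-off explored by varying $N$.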

