This thesis deals with the separation of mixed speech signals from a single acquisition channel, a problem commonly referred to as co-channel speech separation. The goal of the thesis is to present contributions towards the design and implementation of a robust and enhanced co-channel speech separation system.
The phenomenon of co-channel speech commonly occurs due to the combination of speech signals from simultaneous and independent sources into one signal at the receiving microphone, or when two speech signals are transmitted simultaneously over a single channel. An efficient co-channel speech separation system is an important front-end component in many applications such as Automatic Speech Recognition (ASR), Speaker Identification (SID), and hearing aids.
The separation process of co-channel speech consists mainly of three stages: Analysis, Separation, and Reconstruction. The central separation stage represents the heart of the system, in which the target speech is separated from the interfering speech. At the front end, since the separation process works on one segment of co-channel speech at a time, a means must be found in the analysis stage to accurately classify each segment as single-speaker or multi-speaker before separation. Precise estimation of each speaker's speech model parameters is another important task in the analysis stage. The speech signal of the desired speaker is finally synthesized from its estimated parameters in the reconstruction stage. In order to have a reliable overall speech separation system, improvements need to be achieved in all three stages.
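The three-stage structure described above can be sketched as a frame-by-frame pipeline. The sketch below is purely illustrative: the function names, frame sizes, and the three callables are hypothetical placeholders, not the thesis's actual implementation.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping analysis frames."""
    starts = range(0, len(x) - frame_len + 1, hop)
    return [x[s:s + frame_len] for s in starts]

def separate_cochannel(x, classify, estimate, synthesize,
                       frame_len=256, hop=128):
    """Hypothetical three-stage skeleton: the caller supplies the analysis
    (classify, estimate) and reconstruction (synthesize) callables."""
    out = np.zeros(len(x))
    for i, frame in enumerate(frame_signal(x, frame_len, hop)):
        state = classify(frame)          # Analysis: voicing-state of the segment
        params = estimate(frame, state)  # Analysis: per-speaker model parameters
        target = synthesize(params)      # Separation + Reconstruction of target
        out[i * hop : i * hop + frame_len] += target  # overlap-add
    return out
```

The overlap-add loop simply re-assembles the per-frame target estimates; a real system would also handle windowing and frame-boundary smoothing.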
This thesis introduces a classification algorithm that is capable of determining the voicing-state of co-channel speech. The algorithm uses features of the reconstructed state-space of the speech data as a measure to identify the three voicing-states of co-channel speech: Unvoiced/Unvoiced (U/U), Voiced/Unvoiced (V/U), and Voiced/Voiced (V/V). The proposed method requires neither a priori information nor speech training data. Nonetheless, simulation results show enhanced performance in identifying the three voicing-states at different target-to-interference ratio (TIR) values as well as at different levels of background noise compared to other existing techniques.
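State-space reconstruction here refers to time-delay embedding of the speech samples. The thesis's actual features are not reproduced in this summary; the sketch below shows only the embedding step plus one simple, hypothetical dispersion feature, to illustrate the kind of measure such a classifier can threshold (voiced speech traces a compact near-periodic orbit, unvoiced speech fills the embedded space more diffusely).

```python
import numpy as np

def delay_embed(x, dim=3, tau=5):
    """Reconstruct a state-space trajectory from a 1-D signal by
    time-delay embedding: rows are [x[n], x[n+tau], ..., x[n+(dim-1)*tau]]."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def trajectory_spread(x, dim=3, tau=5):
    """Hypothetical illustrative feature: mean distance of the embedded
    points from the trajectory centroid."""
    emb = delay_embed(x, dim, tau)
    centroid = emb.mean(axis=0)
    return float(np.mean(np.linalg.norm(emb - centroid, axis=1)))
```

A practical classifier would combine several such state-space features and compare them against thresholds to assign each segment to U/U, V/U, or V/V.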
A time-domain method to precisely estimate the sinusoidal model parameters of co-channel speech is also presented. The method requires neither the calculation of the discrete Fourier transform nor multiplication by a window function, both of which degrade the estimate of the sinusoidal model parameters. The method incorporates a least-squares estimator and an adaptive technique to model and separate the co-channel speech into its individual speakers. The application of this method to speech data demonstrates its effectiveness in separating co-channel speech signals with different TIRs.
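The adaptive stage is not reproduced in this summary, but the core time-domain least-squares step can be illustrated. The sketch below assumes the component frequencies are already known and fits amplitudes and phases directly from the samples, with no DFT and no analysis window; the function name and interface are hypothetical.

```python
import numpy as np

def fit_sinusoids(s, freqs, fs):
    """Least-squares fit of sum_k A_k * cos(2*pi*f_k*t + phi_k) to signal s,
    directly in the time domain, for known frequencies freqs (Hz)."""
    t = np.arange(len(s)) / fs
    # Design matrix: a cosine and a sine column per candidate frequency.
    A = np.column_stack([f(2 * np.pi * fk * t)
                         for fk in freqs for f in (np.cos, np.sin)])
    c, *_ = np.linalg.lstsq(A, s, rcond=None)
    # a*cos(wt) + b*sin(wt) = R*cos(wt + phi) with R = hypot(a, b),
    # phi = atan2(-b, a).
    amps = np.hypot(c[0::2], c[1::2])
    phases = np.arctan2(-c[1::2], c[0::2])
    return amps, phases
```

Because the normal equations are solved over the raw samples, no spectral leakage or window-induced bias enters the amplitude and phase estimates.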