In a video conferencing environment, it is desirable to isolate the active talker. Traditionally, talker localization is performed acoustically, using a beamforming microphone array, or visually, using image processing techniques. Because these approaches rely on audio or video data alone, they are prone to errors. In this thesis, a new modular multimodal architecture is designed. Data from each localization modality are kept separate from the outset, and localization is performed independently on each data stream. To study the effectiveness of this modular multimodal architecture, this thesis combines audio, visual, and infrared cues to locate talkers in the video conferencing environment. Special-purpose acoustic, video, and thermal localizers are developed to perform the localization. The individual results from these localizers are then combined using data fusion techniques to produce the final estimate of the talker's location. Two common fusion methods, the summing voter and the Bayesian network, are studied in this thesis. Two novel fusion methods, the talker-occupancy-grid-assisted summing voter and the talker-occupancy-grid-assisted Bayesian network, are also investigated. A unique algorithm that uses correlation lags to detect acoustic reflections is also developed in the course of this thesis. Based on results from experiments and computer simulations, the proposed multimodal localization method outperforms, in both accuracy and robustness, single-modal methods that rely only on audio, video, or infrared data.
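The summing-voter fusion mentioned above can be sketched in a few lines; this is a minimal illustration only, and the candidate positions, score values, and function name are assumptions for the example, not the thesis's actual implementation.

```python
# Minimal sketch of a summing voter for multimodal talker localization.
# Each localizer (acoustic, video, thermal) is assumed to report a
# confidence score for every candidate talker position; the voter sums
# the scores across modalities and picks the position with the highest
# combined score.

def summing_voter(modality_scores):
    """modality_scores: list of dicts mapping candidate position -> score."""
    totals = {}
    for scores in modality_scores:
        for position, score in scores.items():
            totals[position] = totals.get(position, 0.0) + score
    # The fused estimate is the position with the largest summed score.
    return max(totals, key=totals.get)

# Hypothetical per-modality scores for three seats around a table.
audio = {"seat1": 0.2, "seat2": 0.7, "seat3": 0.1}
video = {"seat1": 0.1, "seat2": 0.6, "seat3": 0.3}
infrared = {"seat1": 0.3, "seat2": 0.5, "seat3": 0.2}
print(summing_voter([audio, video, infrared]))  # seat2 has the largest sum
```

Even when one modality is noisy or wrong, agreement between the remaining modalities can still pull the summed vote toward the correct position, which is the robustness argument behind fusing independent localizers.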