This paper presents a real-time human-robot interface system (HRIS) that processes both speech and vision information to improve the quality of communication between a human and an autonomous mobile robot. The HRIS comprises a real-time speech attention system and a real-time face tracking system. In the speech attention system, a microphone-array voice acquisition subsystem estimates the speaker's direction and purifies the speaker's speech signal in a noisy environment. The face tracking system tracks the speaker's face under illumination variation and reacts to face motion. The proposed HRIS thus provides a robot with the abilities to find a speaker's direction, track the speaker's face, turn its body toward the speaker, focus its attention on the speaker who is talking to it, and purify the speaker's speech. Experimental results show that the HRIS not only purifies the speech signal effectively but also tracks faces in real time under illumination variation.
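For intuition only: the abstract does not specify how the microphone array estimates the speaker's direction, but a common baseline (not necessarily the authors' method) is time-difference-of-arrival (TDOA) estimation for a two-microphone pair, where the lag of the cross-correlation peak between the channels gives the arrival-time difference and hence the bearing. A minimal sketch under these assumptions, with all function names and parameters hypothetical:

```python
import math

def estimate_doa(sig_left, sig_right, fs, mic_dist, c=343.0):
    """Estimate direction of arrival (degrees) for a 2-mic pair via TDOA.

    Illustrative sketch, not the paper's algorithm. Assumes far-field source,
    sample rate fs (Hz), microphone spacing mic_dist (m), sound speed c (m/s).
    """
    # Only lags consistent with the array geometry are physically possible.
    max_lag = int(mic_dist / c * fs) + 1
    best_lag, best_val = 0, float("-inf")
    # Brute-force cross-correlation over the admissible lag window.
    for lag in range(-max_lag, max_lag + 1):
        s = 0.0
        for i in range(len(sig_left)):
            j = i + lag
            if 0 <= j < len(sig_right):
                s += sig_left[i] * sig_right[j]
        if s > best_val:
            best_val, best_lag = s, lag
    tau = best_lag / fs                       # time difference of arrival (s)
    x = max(-1.0, min(1.0, c * tau / mic_dist))  # clamp for asin's domain
    return math.degrees(math.asin(x))
```

A practical system would use a frequency-domain correlator (e.g. GCC-PHAT) for robustness to reverberation, but the geometry and the lag-to-angle mapping are the same.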