Voice control in a multi-talker and multimedia environment is disclosed. In one aspect, there is provided a method comprising: receiving a microphone signal for each zone in a plurality of zones of an acoustic environment; generating a processed microphone signal for each zone in the plurality of zones of the acoustic environment, the generating including removing echo caused by audio transducers in the acoustic environment from each of the microphone signals, and removing interference from each of the microphone signals; and performing speech recognition on the processed microphone signals.
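The three claimed steps (per-zone microphone input, echo and interference removal, then speech recognition) can be sketched as a small pipeline. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: the abstract does not say how echo or interference is removed, so a normalized LMS (NLMS) adaptive filter stands in for echo removal, a single playback reference signal is assumed to represent the audio transducers, and the interference remover and recognizer are caller-supplied placeholders. All names (`nlms_echo_cancel`, `voice_control_pipeline`, `playback_ref`, `remove_interference`, `recognize`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Signal = List[float]


def nlms_echo_cancel(mic: Sequence[float], ref: Sequence[float],
                     taps: int = 8, mu: float = 0.5,
                     eps: float = 1e-8) -> Signal:
    """Subtract an adaptive estimate of the playback echo from one mic signal.

    NLMS is an assumed stand-in; the source does not name an algorithm.
    """
    weights = [0.0] * taps          # adaptive estimate of the echo path
    history = [0.0] * taps          # most recent reference samples, newest first
    residual: Signal = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate   # mic sample with estimated echo removed
        power = sum(h * h for h in history) + eps
        weights = [w + mu * error * h / power
                   for w, h in zip(weights, history)]
        residual.append(error)
    return residual


def voice_control_pipeline(
    mics: Dict[str, Signal],
    playback_ref: Signal,
    remove_interference: Callable[[str, Dict[str, Signal]], Signal],
    recognize: Callable[[Signal], str],
) -> Dict[str, str]:
    """Run the claimed steps for every zone and return one result per zone."""
    # Step 1: remove echo caused by the audio transducers, zone by zone.
    echo_free = {zone: nlms_echo_cancel(sig, playback_ref)
                 for zone, sig in mics.items()}
    # Step 2: remove interference. The hypothetical callback sees every
    # zone's echo-free signal so it can suppress talkers from other zones.
    processed = {zone: remove_interference(zone, echo_free)
                 for zone in echo_free}
    # Step 3: perform speech recognition on each processed signal.
    return {zone: recognize(sig) for zone, sig in processed.items()}
```

Passing the interference remover and recognizer as functions keeps the sketch neutral about those stages; a practical system would also need to handle several playback channels and simultaneous talkers, which this sketch does not attempt.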