Patent attributes
A speech recognition computer system uses video input as well as audio input of known speech when the system is being trained to recognize unknown speech. The video of the speaker can be captured using multiple cameras, from multiple angles. The audio can be captured using multiple microphones. The video and audio can be sampled so that the timing of events in the video and audio can be determined from the content itself, independent of the clock of any audio or video capture device. Video features, such as a speaker's moving body parts, can be extracted from the video and randomly sampled for use in a speech modeling process. Audio is modeled at the phoneme level, which provides word mapping with minor additional effort. The trained speech recognition computer system can then be used to recognize the text of speech from video/audio of unknown speech.
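As an illustration of determining timing from content rather than from a capture device's clock, the following minimal sketch estimates the sample offset between two recordings of the same event by cross-correlation. Cross-correlation is an assumption chosen for illustration; the abstract does not state which technique is used. The function name `estimate_lag`, its parameters, and the toy signals are hypothetical.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the lag, in samples, at which `delayed` best matches `reference`.

    A positive result means the event appears later in `delayed` than in
    `reference`. Illustrative brute-force cross-correlation only; not the
    method described in the source.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                # Accumulate the product of overlapping samples at this lag.
                score += r * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals (hypothetical): the same short pulse seen by two microphones,
# with the second capture starting three samples later.
mic_a = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]

print(estimate_lag(mic_a, mic_b, max_lag=5))  # → 3
```

Once such an offset is known, the streams can be shifted into a common timeline without consulting either device's clock. The same idea could in principle be applied between an audio track and a signal derived from video, though that pairing is likewise an assumption here.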