Patent attributes
Embodiments are associated with a speaker-independent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-independent attention-based encoder-decoder model associated with a first output distribution, a speaker-dependent attention-based encoder-decoder model to classify output tokens based on input speech frames, the speaker-dependent attention-based encoder-decoder model associated with a second output distribution, training of the second attention-based encoder-decoder model to classify output tokens based on input speech frames of a target speaker and simultaneously training the speaker-dependent attention-based encoder-decoder model to maintain a similarity between the first output distribution and the second output distribution, and performing automatic speech recognition on speech frames of the target speaker using the trained speaker-dependent attention-based encoder-decoder model.