Patent attributes
Speaker diarization techniques that enable processing of audio data to generate one or more refined versions of the audio data, where each of the refined versions of the audio data isolates one or more utterances of a single respective human speaker. Various implementations generate a refined version of audio data that isolates utterance(s) of a single human speaker by generating a speaker embedding for the single human speaker, and processing the audio data using a trained generative model—and using the speaker embedding in determining activations for hidden layers of the trained generative model during the processing. Output is generated over the trained generative model based on the processing, and the output is the refined version of the audio data.