A computer-implemented method includes obtaining, using a hardware processor, training data including utterances of speakers, and performing tasks to train a machine learning model that converts an utterance into a feature vector, each task using one of multiple subsets of the training data. The subsets of training data include a first subset of training data including utterances of a first number of speakers and at least one second subset of training data. Each second subset of training data includes utterances of a number of speakers that is less than the first number of speakers.
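
The subset structure described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function names `build_subsets` and `make_tasks`, the `(speaker_id, utterance)` input format, the random choice of speakers, and the decision to draw each second subset from the speakers of the first subset are all assumptions made for illustration. The model training itself is omitted.

```python
import random
from collections import defaultdict


def build_subsets(utterances, num_second_subsets=2, speakers_per_second=2, seed=0):
    """Split training data into one first subset and several second subsets.

    utterances: list of (speaker_id, utterance) pairs (hypothetical format).
    Returns (first, seconds), where each subset maps speaker_id -> utterances.
    """
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    speakers = sorted(by_speaker)
    # Each second subset must cover fewer speakers than the first subset.
    if not 0 < speakers_per_second < len(speakers):
        raise ValueError("speakers_per_second must be in [1, number of speakers)")

    # First subset: utterances of the "first number" of speakers
    # (assumption: all available speakers).
    first = {s: list(by_speaker[s]) for s in speakers}

    # Second subsets: utterances of a smaller number of speakers
    # (assumption: speakers are sampled at random from the first subset).
    rng = random.Random(seed)
    seconds = []
    for _ in range(num_second_subsets):
        chosen = rng.sample(speakers, speakers_per_second)
        seconds.append({s: list(by_speaker[s]) for s in chosen})
    return first, seconds


def make_tasks(first, seconds):
    """One training task per subset; each task uses exactly one subset."""
    return [{"task_id": i, "subset": subset}
            for i, subset in enumerate([first] + seconds)]
```

For example, with five speakers and `speakers_per_second=2`, `make_tasks(*build_subsets(data))` yields one task over all five speakers and further tasks over two speakers each, all of which would feed the same utterance-to-feature-vector model.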