Patent attributes
Receiving a raw speech signal from a human speaker;
providing an acoustic representation of the raw speech signal if the raw speech signal is determined to be within one of a plurality of pre-defined acoustic domains;
augmenting the raw speech signal with the acoustic representation to provide a plurality of augmented speech signals;
determining a set of a plurality of Mel frequency cepstral coefficients for each of the plurality of augmented speech signals, wherein each set of the plurality of Mel frequency cepstral coefficients is transformed using domain-dependent transformations to obtain an acoustic reference vector, such that there are a plurality of acoustic reference vectors for each one of the plurality of augmented speech signals;
stacking the plurality of acoustic reference vectors corresponding to each augmented speech signal to form a super acoustic reference vector; and
processing the super acoustic reference vector through a neural network which has been previously trained on data from a plurality of human speakers to obtain domain-independent embeddings for speaker recognition.
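
The data flow recited above (augment, compute Mel frequency cepstral coefficients, transform per domain, stack, embed) can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the patented implementation. The claim does not specify how augmentation is performed, what form the domain-dependent transformations take, or the network architecture, so the sketch assumes additive augmentation, one linear transformation per augmented signal, frame-averaged coefficients, and a small feed-forward network with random (untrained) placeholder weights. The domain-detection step is omitted. All function names, dimensions, and parameters are hypothetical.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc_mean(signal, sr=16000, n_fft=512, hop=160, n_filters=26, n_mfcc=13):
    """Frame-averaged MFCC vector (averaging over frames is an assumption)."""
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # DCT-II basis maps log mel energies to cepstral coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.arange(n_mfcc)[:, None] * (2 * n[None, :] + 1)
                 / (2 * n_filters))
    return (log_mel @ dct.T).mean(axis=0)

def embed(raw, domain_signals, transforms, net_weights):
    """Augment, extract MFCCs, transform per domain, stack, run the network."""
    vectors = []
    for rep, W in zip(domain_signals, transforms):
        augmented = raw + rep                     # assumed additive augmentation
        vectors.append(W @ mfcc_mean(augmented))  # acoustic reference vector
    h = np.concatenate(vectors)                   # super acoustic reference vector
    for W, b in net_weights[:-1]:
        h = np.maximum(W @ h + b, 0.0)            # ReLU hidden layers
    W, b = net_weights[-1]
    out = W @ h + b
    return out / (np.linalg.norm(out) + 1e-12)    # unit-length embedding

# Placeholder data: random noise stands in for speech and domain representations.
rng = np.random.default_rng(0)
sr, n_domains, n_mfcc, ref_dim, emb_dim = 16000, 3, 13, 8, 16
raw = rng.standard_normal(sr)
domain_signals = [0.1 * rng.standard_normal(sr) for _ in range(n_domains)]
transforms = [rng.standard_normal((ref_dim, n_mfcc)) for _ in range(n_domains)]
net = [(0.1 * rng.standard_normal((32, n_domains * ref_dim)), np.zeros(32)),
       (0.1 * rng.standard_normal((emb_dim, 32)), np.zeros(emb_dim))]

embedding = embed(raw, domain_signals, transforms, net)
print(embedding.shape)
```

In a real system the network weights would come from training on data from many speakers, as the claim requires, and two embeddings would typically be compared with a similarity score such as cosine similarity. The random weights here only demonstrate the shapes and the order of operations.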