Patent attributes
A voice conversion system for generating realistic, natural-sounding target speech is disclosed. The voice conversion system preferably comprises a neural network for converting source speech data to estimated target speech data; a global variance correction module; a modulation spectrum correction module; and a waveform generator. The global variance correction module is configured to scale and shift (or normalize and de-normalize) the estimated target speech data based on (i) a mean and standard deviation of the source speech data, and further based on (ii) a mean and standard deviation of the estimated target speech data. The modulation spectrum correction module is configured to apply a plurality of filters to the estimated target speech data after it has been scaled and shifted by the global variance correction module. Each filter is designed to correct the trajectory of one mel-cepstral (MCEP) coefficient, that is, the curve of that coefficient over time. Collectively, the plurality of filters is designed to correct the trajectories of all of the MCEP coefficients in the target speech data being generated from the source speech data. Once the MCEP coefficients are corrected, they are provided to the waveform generator, which is configured to generate the target voice signal that can then be played to the user via a speaker.
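The two correction stages described above can be illustrated with a minimal sketch in Python. It is not the disclosed implementation. It assumes that the MCEP data is held as a NumPy array with one row per frame and one column per coefficient, that statistics are computed per coefficient over the frames of one utterance, and that each filter is a short FIR kernel applied by convolution; the abstract does not specify any of these. The function names, the parameters `ref_mean` and `ref_std` (the reference statistics used for de-normalization, taken per the abstract from the source speech data), and the `kernels` argument are hypothetical.

```python
import numpy as np


def global_variance_correction(estimated, ref_mean, ref_std, eps=1e-8):
    """Normalize each coefficient with its own statistics, then
    de-normalize (scale and shift) with the reference statistics.

    estimated: array of shape (frames, coefficients)
    ref_mean, ref_std: arrays of shape (coefficients,)  [assumed inputs]
    """
    est_mean = estimated.mean(axis=0)
    est_std = estimated.std(axis=0)
    # eps guards against division by zero for a constant trajectory.
    normalized = (estimated - est_mean) / (est_std + eps)
    return normalized * ref_std + ref_mean


def modulation_spectrum_correction(mcep, kernels):
    """Apply one filter per coefficient trajectory (column).

    kernels: sequence of 1-D FIR kernels, one per coefficient
    (an assumed filter form; each kernel is shorter than the trajectory).
    """
    corrected = np.empty_like(mcep, dtype=float)
    for d in range(mcep.shape[1]):
        # mode="same" keeps the number of frames unchanged.
        corrected[:, d] = np.convolve(mcep[:, d], kernels[d], mode="same")
    return corrected
```

In this sketch the output of `global_variance_correction` would be passed to `modulation_spectrum_correction`, matching the order given in the abstract, and the result would then go to the waveform generator, which is not modeled here.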