Aspects of this disclosure provide techniques for generating a viseme and corresponding intensity pair. In some embodiments, a method includes generating, by a server, a viseme and corresponding intensity pair based on at least one of a clean vocal track or a corresponding transcription. The method may include generating, by the server, a compressed audio file based on at least one of the viseme, the corresponding intensity, music, or a visual offset. The method may further include generating, by the server or a client-end application, a buffer of raw pulse-code modulated (PCM) data by decoding at least a part of the compressed audio file, where the viseme is scheduled to align with a corresponding phoneme.
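
As an illustration only, the scheduling step could be sketched as mapping each viseme and intensity pair to a frame position in the decoded PCM buffer. The sketch below is not taken from the disclosure: the names `VisemeEvent` and `schedule_visemes`, the per-event start time in seconds, the 48 kHz sample rate, and the treatment of the visual offset as a signed time shift are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisemeEvent:
    """Hypothetical record: one viseme/intensity pair and its phoneme start time."""
    viseme: str
    intensity: float
    start_s: float  # assumed: phoneme start time in seconds within the audio


def schedule_visemes(events, sample_rate_hz, visual_offset_s=0.0):
    """Map each event to a PCM frame index, shifted by an assumed visual offset.

    Returns a list of (frame_index, viseme, intensity) tuples sorted by frame.
    """
    scheduled = []
    for ev in sorted(events, key=lambda e: e.start_s):
        # Shift by the offset and clamp so no event lands before the buffer start.
        shifted_s = max(0.0, ev.start_s + visual_offset_s)
        frame_index = round(shifted_s * sample_rate_hz)
        scheduled.append((frame_index, ev.viseme, ev.intensity))
    return scheduled


# Example input: two events given out of order, shown 50 ms early.
events = [VisemeEvent("AA", 0.8, 0.50), VisemeEvent("M", 0.4, 0.25)]
schedule = schedule_visemes(events, sample_rate_hz=48_000, visual_offset_s=-0.05)
```

In this sketch a client-end application playing back the PCM buffer would trigger each viseme when playback reaches the corresponding frame index, which keeps the mouth shape aligned with the phoneme regardless of how the compressed file was chunked during decoding.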