The present disclosure provides a method and apparatus for automatically synchronizing a first stream of media data with a second stream and outputting a third stream. The method comprises decoding the streams into channels; splitting the channels into a plurality of piecewise segments defined by control points, wherein control points with the same index are to be synchronized in time across all the channels; automatically adjusting the length of the media data in each segment in an optimal and hybrid manner using a linear or non-linear digital signal processing algorithm; synchronizing and mixing all the processed segments; and outputting the final mixed and encoded data stream. Specifically, one of the media streams is video and the other is audio or a translation voice in a different language. By keeping distortion controlled and minimized, the method achieves faster post-processing and optimal synchronization quality, thereby saving both time and cost for video language localization services.
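The per-segment length adjustment described above can be illustrated with a minimal Python sketch. It is not the disclosed algorithm. It assumes that each channel is described by a list of control-point times in seconds, that "hybrid" means the audio is stretched up to a limit and the video absorbs the remainder, and that simple linear interpolation stands in for the linear or non-linear signal processing step. The names `split_stretch`, `resample_linear`, and `max_audio_ratio`, and the limit of 1.25, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def segment_durations(control_points: List[float]) -> List[float]:
    """Durations of the segments between consecutive control points."""
    return [b - a for a, b in zip(control_points, control_points[1:])]


def split_stretch(
    video_cp: List[float],
    audio_cp: List[float],
    max_audio_ratio: float = 1.25,  # assumed distortion limit, not from the source
) -> List[Tuple[float, float]]:
    """Return one (audio_ratio, video_ratio) pair per segment.

    A ratio above 1 lengthens the segment, below 1 shortens it. After both
    ratios are applied, the audio and video segments have equal duration, so
    control points with the same index coincide in time.
    """
    if len(video_cp) != len(audio_cp):
        raise ValueError("channels must have the same number of control points")
    plans = []
    for v, a in zip(segment_durations(video_cp), segment_durations(audio_cp)):
        if v <= 0 or a <= 0:
            raise ValueError("control points must be strictly increasing")
        needed = v / a  # ratio that would make the audio alone match the video
        # Clamp the audio stretch to the assumed limit (hypothetical hybrid rule).
        audio_ratio = min(max(needed, 1.0 / max_audio_ratio), max_audio_ratio)
        # The video absorbs whatever the audio could not.
        video_ratio = (a * audio_ratio) / v
        plans.append((audio_ratio, video_ratio))
    return plans


def resample_linear(samples: List[float], ratio: float) -> List[float]:
    """Change the length of a segment by linear interpolation.

    This also shifts pitch; a pitch-preserving time-scale method would be
    needed for speech in practice.
    """
    if not samples:
        return []
    n_out = max(1, round(len(samples) * ratio))
    if len(samples) == 1 or n_out == 1:
        return [samples[0]] * n_out
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


if __name__ == "__main__":
    # Two segments: video control points at 0, 2, 5 s; audio at 0, 2.2, 4 s.
    for audio_ratio, video_ratio in split_stretch([0.0, 2.0, 5.0], [0.0, 2.2, 4.0]):
        print(f"audio x{audio_ratio:.3f}, video x{video_ratio:.3f}")
```

In this sketch the first segment is handled by the audio alone, because the required change is within the assumed limit. In the second segment the audio is stretched only up to the limit and the video is shortened to meet it, which is one possible reading of adjusting segment lengths in a hybrid manner with controlled distortion.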