Disclosed is a method for speech recognition performed by one or more processors of a computing device. The method includes inputting voice information into an encoder to extract a first feature vector, and calculating a first loss function. The method includes inputting the first feature vector extracted by the encoder into a first decoder to perform prediction on the voice information, calculating a second loss function, and extracting a second feature vector. The method includes inputting the second feature vector extracted by the first decoder into a second decoder to perform grapheme-based prediction, and calculating a third loss function. The method includes training at least one of the encoder, the first decoder, or the second decoder based on the first loss function, the second loss function, and the third loss function.
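
The data flow described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the names `StageOutput`, `combine_losses`, and `forward_pass`, the toy stage functions, and the fixed loss values are all hypothetical. Combining the three loss functions as a weighted sum is also an assumption, since the abstract states only that training is based on all three.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class StageOutput:
    """Hypothetical container: a stage's feature vector and its loss value."""
    features: Vector
    loss: float


Stage = Callable[[Vector], StageOutput]


def combine_losses(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the per-stage losses (assumed combination rule)."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses))


def forward_pass(
    voice: Vector,
    encoder: Stage,
    first_decoder: Stage,
    second_decoder: Stage,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> Tuple[Tuple[float, float, float], float]:
    """Chain encoder -> first decoder -> second decoder and collect losses."""
    enc = encoder(voice)                   # first feature vector, first loss
    dec1 = first_decoder(enc.features)     # second feature vector, second loss
    dec2 = second_decoder(dec1.features)   # grapheme-based stage, third loss
    losses = (enc.loss, dec1.loss, dec2.loss)
    return losses, combine_losses(losses, weights)


# Toy stand-ins for the three stages; real stages would be trained networks
# whose losses are computed against reference labels.
def toy_encoder(x: Vector) -> StageOutput:
    return StageOutput([v * 0.5 for v in x], loss=0.5)


def toy_first_decoder(f: Vector) -> StageOutput:
    return StageOutput([v + 1.0 for v in f], loss=0.25)


def toy_second_decoder(f: Vector) -> StageOutput:
    return StageOutput(list(f), loss=0.125)


losses, total = forward_pass(
    [2.0, 4.0], toy_encoder, toy_first_decoder, toy_second_decoder
)
```

In a real system the combined value would be passed to an optimizer that updates at least one of the three stages. The weights here are placeholders that control how strongly each stage's loss influences that update.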