A method and a system for automatically generating multiple captions of an image are provided. A method for training an auto image caption generation model according to an embodiment of the present disclosure includes: generating a caption attention map by using an image; converting the generated caption attention map into a latent variable by projecting the caption attention map onto a latent space; deriving a guide map by using the latent variable; and training to generate captions of an image by using the guide map and the image. Accordingly, a plurality of captions describing various characteristics of an image and including various expressions can be automatically generated.