A method and device for image generation are provided. The method includes: obtaining text describing the content of an image to be generated; extracting, using a text encoder, a text feature vector from the text; determining a semantic mask that serves as a spatial constraint on the image to be generated; and automatically generating the image using a generative adversarial network (GAN) model according to the semantic mask and the text feature vector.
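
The data flow of the four steps can be illustrated with a minimal sketch. It is not the claimed implementation: the function names (`encode_text`, `make_semantic_mask`, `generate_image`), the dimensions, and the label layout are all hypothetical, the text encoder is a hashed bag-of-words stand-in, and the "generator" uses fixed random weights in place of a trained GAN generator. It shows only how a text feature vector and a semantic mask could jointly condition each output pixel.

```python
import hashlib
import math
import random
from typing import List

TEXT_DIM = 8      # hypothetical size of the text feature vector
NUM_CLASSES = 3   # hypothetical number of semantic labels in the mask
H, W = 4, 4       # hypothetical output resolution


def encode_text(text: str, dim: int = TEXT_DIM) -> List[float]:
    """Stand-in text encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def make_semantic_mask(h: int = H, w: int = W) -> List[List[int]]:
    """Stand-in mask: top half is label 0, bottom-left 1, bottom-right 2."""
    return [
        [0 if r < h // 2 else (1 if c < w // 2 else 2) for c in range(w)]
        for r in range(h)
    ]


def generate_image(mask: List[List[int]], text_vec: List[float],
                   seed: int = 0) -> List[List[float]]:
    """Stand-in generator conditioned on the mask label, the text vector, and noise."""
    rng = random.Random(seed)
    # Fixed random weights stand in for trained generator parameters.
    class_w = [rng.uniform(-1.0, 1.0) for _ in range(NUM_CLASSES)]
    text_w = [rng.uniform(-1.0, 1.0) for _ in text_vec]
    text_term = sum(w * t for w, t in zip(text_w, text_vec))
    image = []
    for row in mask:
        out_row = []
        for label in row:
            z = rng.gauss(0.0, 0.1)  # per-pixel noise input
            out_row.append(math.tanh(class_w[label] + text_term + z))
        image.append(out_row)
    return image


if __name__ == "__main__":
    text_vec = encode_text("a red bird on a branch")
    mask = make_semantic_mask()
    image = generate_image(mask, text_vec)
    for row in image:
        print(" ".join(f"{v:+.2f}" for v in row))
```

In this sketch the mask fixes which label governs each pixel position (the spatial constraint), while the text feature vector shifts the values globally. A real system would replace both stand-ins with learned networks and train the generator adversarially against a discriminator.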