Systems, methods, and computer-readable media are described for performing weakly supervised semantic segmentation of input images. Self-guidance on attention maps is applied during training to cause a guided attention inference network (GAIN) to focus attention on an object in an input image in a holistic manner, rather than only on the most discriminative parts of the image. The self-guidance is provided jointly by a classification loss function and an attention mining loss function. Extra supervision can also be provided by using a select number of pixel-level labeled input images to enhance the semantic segmentation capabilities of the GAIN.
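
The joint self-guidance described above can be illustrated with a minimal sketch. It assumes one plausible formulation: the attention map is turned into a soft mask, the attended region is erased from the image, and the attention mining loss penalizes any remaining classifier confidence for the true classes on the erased image. The sigmoid soft mask, the parameters sigma, omega, and alpha, the multi-label cross-entropy, and all function names are illustrative assumptions and are not specified in the text above.

```python
import numpy as np

def soft_mask(attention, sigma=0.5, omega=10.0):
    # Assumed soft threshold: values near 1 where attention exceeds sigma.
    return 1.0 / (1.0 + np.exp(-omega * (attention - sigma)))

def erase_attended(image, attention, sigma=0.5, omega=10.0):
    # Remove the attended region from an (H, W, C) image using an (H, W) map.
    mask = soft_mask(attention, sigma, omega)
    return image - mask[..., None] * image

def classification_loss(logits, labels, eps=1e-12):
    # Assumed multi-label sigmoid cross-entropy on the original image.
    probs = 1.0 / (1.0 + np.exp(-logits))
    terms = labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps)
    return -np.mean(terms)

def attention_mining_loss(masked_scores, labels):
    # Mean score of the true classes on the erased image; a low value means
    # the attention map already covered the evidence for those classes.
    return float((masked_scores * labels).sum() / max(labels.sum(), 1.0))

def self_guidance_loss(logits, masked_scores, labels, alpha=1.0):
    # Joint objective: classification loss plus weighted attention mining loss.
    return classification_loss(logits, labels) + alpha * attention_mining_loss(masked_scores, labels)
```

In this sketch, minimizing the second term pushes the attention map to cover every region that supports the true class, since any evidence left in the erased image keeps the masked score high.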