In some implementations, a device may obtain a set of input images of an object of interest, including images that have a plain background. The device may obtain a set of background images, including images associated with the object of interest. For an input image, the device may generate a first modified image that removes the plain background of the input image, and a second modified image that is based on the first modified image and a background image. The device may generate, for the input image, a training image that includes an indication of a location of the object of interest depicted in the training image. The device may provide the training image to a training set, which includes a set of training images, for a computer vision model.
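
One way the steps above might be realized is sketched below, using NumPy arrays as images. This is an illustrative sketch under several assumptions that the description does not state: the plain background is near-white and is detected by a color tolerance, the second modified image is formed by pasting the cut-out object onto the background image at a given offset, and the location indication is an axis-aligned bounding box. All function and parameter names (`remove_plain_background`, `composite`, `bounding_box`, `make_training_example`, `tol`, `top`, `left`) are hypothetical.

```python
import numpy as np

def remove_plain_background(image, bg_color=(255, 255, 255), tol=10):
    """First modified image: RGBA copy with plain-background pixels made transparent.

    Assumes the plain background is close to `bg_color` (hypothetical heuristic).
    """
    diff = np.abs(image.astype(np.int16) - np.array(bg_color, dtype=np.int16))
    foreground = (diff > tol).any(axis=-1)
    alpha = np.where(foreground, 255, 0).astype(np.uint8)
    return np.dstack([image, alpha])

def composite(cutout, background, top, left):
    """Second modified image: paste the opaque cut-out pixels onto a background copy.

    Assumes the cut-out fits inside the background at offset (top, left).
    """
    out = background.copy()
    h, w = cutout.shape[:2]
    mask = cutout[..., 3] > 0
    region = out[top:top + h, left:left + w]  # view into `out`
    region[mask] = cutout[..., :3][mask]
    return out, mask

def bounding_box(mask, top, left):
    """Location indication as (x_min, y_min, x_max, y_max) in background coordinates."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return (int(left + cols[0]), int(top + rows[0]),
            int(left + cols[-1]), int(top + rows[-1]))

def make_training_example(image, background, top=0, left=0):
    """Training image plus the location of the object of interest."""
    cutout = remove_plain_background(image)
    composed, mask = composite(cutout, background, top, left)
    return {"image": composed, "bbox": bounding_box(mask, top, left)}

# Toy data: a 2x2 colored "object" on a white 4x4 input image,
# placed onto a uniform gray 8x8 background image.
image = np.full((4, 4, 3), 255, dtype=np.uint8)
image[1:3, 1:3] = (200, 30, 30)
background = np.full((8, 8, 3), 100, dtype=np.uint8)

training_set = []
training_set.append(make_training_example(image, background, top=2, left=3))
```

Because the location is derived from the same mask used to remove the plain background, no manual annotation is needed in this sketch. A real pipeline would likely use a more robust segmentation step than a fixed color tolerance.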