Using a first trained generative adversarial network, a first multimedia content is transformed into a text description of the first multimedia content. The text description is adjusted according to a constraint using a trained attention layer, the adjusting creating an adjusted text description. Using a trained model, the adjusted text description is transformed into a second multimedia content, the second multimedia content comprising an adjustment of the first multimedia content according to the constraint.
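The three stages described above (content to text, constrained adjustment of the text, text back to content) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the source: `Pipeline`, `toy_describe`, `toy_adjust`, and `toy_generate` are illustrative placeholders. The toy functions only stand in for the first trained generative adversarial network, the trained attention layer, and the trained model respectively; they perform no learning and are meant solely to show how data flows between the stages.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Hypothetical three-stage pipeline; each stage is a pluggable callable."""

    # Stand-in for the first trained GAN: multimedia content -> text description.
    describe: Callable[[bytes], str]
    # Stand-in for the trained attention layer: (description, constraint) -> adjusted description.
    adjust: Callable[[str, str], str]
    # Stand-in for the trained model: adjusted description -> second multimedia content.
    generate: Callable[[str], bytes]

    def run(self, first_content: bytes, constraint: str) -> bytes:
        description = self.describe(first_content)       # stage 1
        adjusted = self.adjust(description, constraint)   # stage 2
        return self.generate(adjusted)                    # stage 3


# Toy placeholders (assumptions for illustration only, not real models).
def toy_describe(content: bytes) -> str:
    # Pretend the content already encodes its own caption.
    return content.decode("utf-8")


def toy_adjust(description: str, constraint: str) -> str:
    # Pretend the adjustment simply annotates the description with the constraint.
    return f"{description} [constraint: {constraint}]"


def toy_generate(text: str) -> bytes:
    # Pretend generation is just re-encoding the adjusted description.
    return text.encode("utf-8")


pipeline = Pipeline(describe=toy_describe, adjust=toy_adjust, generate=toy_generate)
second_content = pipeline.run(b"a red car on a road", "make the car blue")
```

In this sketch the stages are passed in as callables so that any one of them could be replaced by an actual trained component without changing the control flow; that separation is a design choice of the example, not something the source specifies.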