Techniques for generating machine learning training data that corresponds to one or more downstream tasks are disclosed. In one example, a computer-implemented method comprises generating one or more synthetic data instances for training a machine learning model, and determining a value of respective ones of the one or more synthetic data instances with respect to at least one task. One or more additional synthetic data instances for training the machine learning model are generated based at least in part on the values of the respective ones of the one or more synthetic data instances.
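
The generate, value, and regenerate loop described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the disclosure: the instances are scalars, the "model" simply predicts the mean of its training data, the task is represented by a small validation set, the value of an instance is its leave-one-out contribution to validation loss, and additional instances are produced by perturbing the highest-valued ones. All function names are hypothetical.

```python
import random


def generate_instances(rng, n, center=0.0, spread=5.0):
    """Draw n synthetic scalar instances (toy generator, assumed for illustration)."""
    return [rng.gauss(center, spread) for _ in range(n)]


def task_loss(train, validation):
    """Toy model: predict the training mean; loss is mean squared error on the task's validation set."""
    if not train:
        return float("inf")
    pred = sum(train) / len(train)
    return sum((v - pred) ** 2 for v in validation) / len(validation)


def leave_one_out_values(instances, validation):
    """Value of instance i = task loss without i minus task loss with all instances.

    A positive value means that removing the instance hurts the task.
    """
    full = task_loss(instances, validation)
    return [
        task_loss(instances[:i] + instances[i + 1:], validation) - full
        for i in range(len(instances))
    ]


def generate_additional(rng, instances, values, n, top_k=3, jitter=0.5):
    """Generate n new instances by perturbing the top_k highest-valued instances."""
    ranked = sorted(zip(values, instances), reverse=True)
    seeds = [x for _, x in ranked[:top_k]]
    return [rng.choice(seeds) + rng.gauss(0.0, jitter) for _ in range(n)]


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
validation = [9.5, 10.0, 10.5]              # stands in for the downstream task
initial = generate_instances(rng, 10)       # step 1: generate synthetic instances
values = leave_one_out_values(initial, validation)           # step 2: value each one
additional = generate_additional(rng, initial, values, 10)   # step 3: generate more
```

Leave-one-out is used here only because it is short to write down; any other per-instance valuation with respect to the task could take its place in the same loop.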