Method or system for reinforcement learning that learns a domain randomization (DR) distribution ϕ while simultaneously optimizing an agent policy Π to maximize performance over the learned DR distribution; method or system for training a learning agent using data synthesized by a simulator based on both the performance of the learning agent and the range of parameters present in the synthesized data.
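
As an illustration only, the joint update described above can be sketched on a toy problem. Everything in this sketch is an assumption made for the example and is not taken from the definition: the quadratic `simulate` function stands in for a real simulator, the policy is a single scalar `gain`, the DR distribution is a Gaussian with a fixed `nominal` mean so that ϕ reduces to one learned `log_std`, and an entropy bonus (`entropy_coef`) is used as one possible way of rewarding a wide parameter range. The distribution is updated with a score-function gradient estimate and the policy with finite differences, both from the same batch of simulated episodes.

```python
import math
import random


def simulate(gain: float, mass: float) -> float:
    """Hypothetical black-box simulator: return of a policy with scalar
    `gain` in an environment whose randomized parameter is `mass`."""
    return -(gain - mass) ** 2


def train(steps=2000, batch=64, lr=0.05, entropy_coef=0.08, seed=0):
    rng = random.Random(seed)
    nominal = 1.0            # fixed mean of the DR distribution (assumption)
    log_std = math.log(0.5)  # phi: learned spread of the DR distribution
    gain = 0.0               # policy parameter
    h = 1e-3                 # finite-difference step size

    for _ in range(steps):
        std = math.exp(log_std)
        # Synthesize a batch of environments from the current DR distribution.
        masses = [rng.gauss(nominal, std) for _ in range(batch)]
        returns = [simulate(gain, m) for m in masses]
        baseline = sum(returns) / batch  # variance-reduction baseline

        # Policy gradient: central finite differences through the simulator.
        grad_gain = sum(
            (simulate(gain + h, m) - simulate(gain - h, m)) / (2 * h)
            for m in masses
        ) / batch

        # Score-function estimate of d E[return] / d log_std for a Gaussian.
        grad_phi = sum(
            (r - baseline) * (((m - nominal) / std) ** 2 - 1.0)
            for r, m in zip(returns, masses)
        ) / batch
        # Entropy bonus: d/d log_std of Gaussian entropy equals 1.
        grad_phi += entropy_coef

        # Update policy and DR distribution in the same iteration.
        gain += lr * grad_gain
        log_std += lr * grad_phi

    return gain, math.exp(log_std)


if __name__ == "__main__":
    learned_gain, learned_std = train()
    print(f"gain={learned_gain:.3f}, DR std={learned_std:.3f}")
```

In this toy setting the expected-return term pulls the spread down while the entropy bonus pushes it up, so the learned standard deviation should hover around sqrt(entropy_coef / 2), roughly 0.2 with the defaults shown. A real system would replace `simulate` with an actual simulator and the scalar `gain` with a full policy.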