Systems and methods for generating and evaluating driving scenarios with varying difficulty levels are provided. The disclosed systems and methods may be used to develop a suite of regression tests that track the progress of an autonomous driving stack. A robustness trace of a temporal logic formula may be computed from an always-eventually fragment using a computation graph. The robustness trace may be approximated by a smoothly differentiable computation graph, which can be implemented in existing machine learning programming frameworks. The systems and methods provided herein may be useful in automatic test case generation for autonomous or semi-autonomous vehicles.
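
By way of illustration only, the smooth approximation of a robustness trace may be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: it assumes a finite, discretely sampled signal of atomic-predicate robustness values, and it assumes that the max and min operators of the "eventually" and "always" semantics are replaced by log-sum-exp approximations. The function names (`soft_max`, `soft_min`, `eventually_trace`, `always_trace`, `always_eventually_trace`) and the `temperature` parameter are hypothetical and chosen for this example. Plain Python is used for self-containment; the same operations could be expressed in a machine learning framework to obtain gradients.

```python
import math


def soft_max(values, temperature=10.0):
    """Smooth approximation of max(values) via log-sum-exp.

    Assumption for this sketch: a larger temperature gives a tighter
    approximation. The result is never below the true maximum.
    """
    m = max(values)  # subtract the maximum for numerical stability
    total = sum(math.exp(temperature * (v - m)) for v in values)
    return m + math.log(total) / temperature


def soft_min(values, temperature=10.0):
    """Smooth approximation of min(values), derived from soft_max."""
    return -soft_max([-v for v in values], temperature)


def eventually_trace(signal, temperature=10.0):
    """Robustness trace of 'eventually': smooth max over each suffix."""
    return [soft_max(signal[t:], temperature) for t in range(len(signal))]


def always_trace(signal, temperature=10.0):
    """Robustness trace of 'always': smooth min over each suffix."""
    return [soft_min(signal[t:], temperature) for t in range(len(signal))]


def always_eventually_trace(signal, temperature=10.0):
    """Robustness trace of 'always eventually' on a finite signal."""
    return always_trace(eventually_trace(signal, temperature), temperature)


if __name__ == "__main__":
    # Hypothetical predicate robustness values, one per time step.
    signal = [-1.0, 0.5, -0.2, 2.0, -0.5]
    print(always_eventually_trace(signal, temperature=50.0))
```

In this sketch, a positive value at the first time step would indicate that the scenario satisfies the formula, and the magnitude may serve as a rough measure of scenario difficulty. Because every operation is smooth, the trace can in principle be differentiated with respect to scenario parameters.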