Methods and systems for training a neural network include determining a graph representation of a set of neural network training operations based on definition-use chains. A memory allocation queue is determined based on a slack value for each neural network training operation in the graph representation. Memory is allocated for each neural network training operation in the memory allocation queue. Execution of neural network training operations with non-zero slack is delayed to minimize the amount of memory allocated at any one time. Neural network training is then executed using the memory allocated for each neural network training operation.
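
The slack-based ordering described above can be illustrated with a minimal sketch. The abstract does not say how slack is computed or how the queue is ordered, so the following are assumptions made only for illustration: every operation takes one time step, slack is the latest feasible start minus the earliest feasible start, the queue is sorted by ascending slack, and operations with non-zero slack are pushed to their latest start. The function names (`compute_slack`, `allocation_queue`, `delayed_schedule`) and the toy operation names are hypothetical.

```python
from collections import defaultdict


def compute_slack(ops, uses):
    """Return (asap, alap, slack) for a DAG of operations.

    ops  : operation names in topological order (assumed input).
    uses : maps a producing op to the ops that consume its output,
           i.e. the definition-use edges of the graph.
    Assumes every operation takes exactly one time step.
    """
    preds = defaultdict(list)
    for producer, consumers in uses.items():
        for consumer in consumers:
            preds[consumer].append(producer)

    # Earliest start: one step after the latest-starting producer.
    asap = {}
    for op in ops:
        asap[op] = max((asap[p] + 1 for p in preds[op]), default=0)

    # Latest start: one step before the earliest-starting consumer,
    # or the end of the schedule if nothing consumes the result.
    horizon = max(asap.values())
    alap = {}
    for op in reversed(ops):
        alap[op] = min((alap[c] - 1 for c in uses.get(op, [])), default=horizon)

    slack = {op: alap[op] - asap[op] for op in ops}
    return asap, alap, slack


def allocation_queue(ops, slack):
    """Order ops by ascending slack (assumed policy); ties keep input order."""
    return sorted(ops, key=lambda op: slack[op])


def delayed_schedule(ops, asap, alap, slack):
    """Delay ops with non-zero slack to their latest start, so their
    output buffers are allocated as late as possible."""
    return {op: (alap[op] if slack[op] > 0 else asap[op]) for op in ops}


# Toy graph: a three-op chain plus an independent "aux" op that
# only the final "grad" op consumes.
ops = ["fwd1", "fwd2", "loss", "aux", "grad"]
uses = {"fwd1": ["fwd2"], "fwd2": ["loss"], "loss": ["grad"], "aux": ["grad"]}

asap, alap, slack = compute_slack(ops, uses)
queue = allocation_queue(ops, slack)
schedule = delayed_schedule(ops, asap, alap, slack)

print(slack)     # {'fwd1': 0, 'fwd2': 0, 'loss': 0, 'aux': 2, 'grad': 0}
print(queue)     # ['fwd1', 'fwd2', 'loss', 'grad', 'aux']
print(schedule)  # {'fwd1': 0, 'fwd2': 1, 'loss': 2, 'aux': 2, 'grad': 3}
```

In this toy graph, `aux` could run at step 0 but is not needed until `grad` runs at step 3, so it is delayed to step 2 and its output occupies memory for one step rather than three. An actual system would weigh tensor sizes and operation durations, which this sketch omits.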