Patent attributes
One embodiment provides an apparatus comprising an interconnect fabric comprising a processing cluster including an array of multiprocessors coupled to an interconnect fabric, scheduling circuitry to distribute a plurality of thread groups across the array of multiprocessors, each thread group comprising a plurality of threads. A first multiprocessor of the array of multiprocessors can be assigned to process a first thread group comprising a first plurality of threads including a first thread sub-group and a second thread sub-group. The second thread sub-group has a data dependency on the first thread sub-group and the first multiprocessor includes circuitry to cause threads of the second thread sub-group to sleep until the threads of the first thread sub-group have satisfied the data dependency.