Dynamic Task Parallelism with a GPU Work-Stealing Runtime ...

and implemented a runtime abstraction for dynamic task parallelism using the finish-async style API [3] on top of the regular data-parallel model of execution. Our CUDA work-stealing scheduler, which operates across multiple SMs in the same device, frees the programmer from the responsibility of load balancing dynamically created tasks. The runtime helps reduce data transfer ...