I think the latency overhead can be low enough these days, say on the order of 10-100 us over PCIe for the full round trip (kernel launch, copy data from CPU to GPU, read from global memory, do some compute, write to global memory, then copy data back to CPU and wait for completion on the CPU side, either by copying the result into some other buffer or by receiving it via a non-blocking DMA write). But there is a tradeoff between the size of the units of work you give the GPU, the efficiency of the compute (including the working-set size you need to load from global memory to do that compute), and the number of individual kernel launches you'd need to produce small pieces of the output.

There are also tricks involving atomics (or, in CUDA specifically, cooperative groups) that allow for persistent kernels: kernels that stay resident, continuously producing data and periodically ingesting commands from the CPU, so the CPU doesn't have to launch a new kernel every time it wants the GPU to do something.
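A minimal sketch of that persistent-kernel idea, using mapped pinned host memory and volatile polling rather than cooperative groups (the names `persistent_kernel`, the command enum, and the done-flag protocol are mine, not a standard API, and a production version would need a grid-wide protocol, error checking, and care around memory ordering):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

enum Command : int { CMD_IDLE = 0, CMD_WORK = 1, CMD_EXIT = 2 };

// Launched once; polls a host-written command word instead of being
// re-launched for every unit of work.
__global__ void persistent_kernel(volatile int *cmd, volatile int *done,
                                  float *buf, int n) {
    while (true) {
        int c = *cmd;                        // poll command from mapped host memory
        if (c == CMD_EXIT) break;
        if (c == CMD_WORK) {
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                buf[i] *= 2.0f;              // stand-in for the real compute
            __syncthreads();
            if (threadIdx.x == 0) {
                __threadfence_system();      // make results visible to the host
                *done = 1;                   // ack; host will retire the command
            }
            while (*cmd == CMD_WORK) { }     // wait until host moves off CMD_WORK
        }
    }
}

int main() {
    int *cmd, *done;
    cudaHostAlloc((void **)&cmd,  sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&done, sizeof(int), cudaHostAllocMapped);
    *cmd = CMD_IDLE;
    *done = 0;

    const int n = 1024;
    float *buf;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    // One launch for the lifetime of the workload.
    persistent_kernel<<<1, 256>>>(cmd, done, buf, n);

    // Issue a unit of work with a plain store, no new kernel launch.
    *cmd = CMD_WORK;
    volatile int *vdone = done;
    while (*vdone == 0) { }                  // spin until the kernel acks
    *done = 0;
    *cmd = CMD_EXIT;                         // shut the kernel down

    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaFreeHost(cmd);
    cudaFreeHost(done);
    return 0;
}
```

This assumes a single block (so `__syncthreads()` suffices for intra-kernel coordination) and unified virtual addressing (so the mapped host pointers are valid on the device); a multi-block version is where cooperative groups and `cg::this_grid().sync()` become useful.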