
> For maximum performance one would create one large job for each Core of the CPU used.

Dmitry Vyukov has suggested otherwise in a similar scenario using Go:

> If you split the image into say 8 equal parts, and then one of the goroutines/threads/cores accidentally took 2 times more time to complete, then whole processing is slowed down 2x. The slowdown can be due to OS scheduling, other processes/interrupts, unfortunate NUMA memory layout, different amount of processing per part (e.g. ray tracing) and other reasons. [...] size of a work item must never be dependent on input data size (in an ideal world), it must be dependent on overheads of parallelization technology. Currently a reference number is ~100us-1ms per work item. So you can split the image into blocks of fixed size (say 64x64) and then distribute them among threads/goroutines. This has advantages of both locality and good load balancing.

https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYk...

Or to put it another way: imagine there were zero concurrency overhead. Then splitting the work into the smallest possible jobs would be ideal, since that gives the smoothest division of labour: every processor stays busy, and all of them keep working until the entire task is complete.


