I'm not pooh-poohing improving the ability to get 10x speedups... that'd be awesome in many contexts.
Rather I think a lot of even "embarrassingly parallel" problems have lots of micro-sequential code within.
With your example, there are micro-sequential portions for allocating the processors that the split addition runs on, and then the sequential bit of adding the partial results back together again.
Those micro-sequential portions are right now lumped into one huge 'can't fix'; being able to exploit parallelism at that level would make huge speedups possible.
After all, Amdahl's law is 'bad' when you're looking at a 20% segment that you can't improve, but as the parallelizable fraction gets closer to 100%, the pay-offs of even small optimizations become larger and larger.
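To put rough numbers on that, here's a quick sketch of the usual form of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of workers. The function names and the sample fractions are just mine for illustration, not anything from the thread.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction and worker count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup as the worker count grows without limit: 1 / (1 - p)."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # Illustrative fractions only: watch how the ceiling moves as the
    # sequential remainder shrinks.
    for p in (0.80, 0.90, 0.99, 0.999):
        print(f"p = {p:<5}  ceiling = {max_speedup(p):8.1f}x  "
              f"with 1024 workers = {amdahl_speedup(p, 1024):8.1f}x")
```

The ceiling is 1 / (1 - p), so shaving the sequential part from 20% down to 10% only moves it from 5x to 10x, while shaving it from 1% down to 0.1% moves it from 100x to 1000x. That's why chipping away at the micro-sequential bits matters more the closer you already are to fully parallel.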