> This has been a standard optimization for half a century. The original C compiler for the PDP-11 did these transforms even when you turned off optimizations
Consider this: a common, easily applied optimization that compilers have been doing for half a century MAY have made its way into modern CPUs.
Transistors aren't nearly as power-hungry as you paint them, and CPUs aren't nearly as bad at optimization. There is no reason to swap a multiply or divide for a shift. The ONLY reason to make that switch is if you are dealing with the simplest of processors (such as a microwave's processor). If you are using anything developed in the last 10 years that consumes more than 1 W of power, chances are really high that you aren't saving any power by using shifts instead of multiplies. It is the sort of micro-optimization that fundamentally misunderstands how modern CPUs actually work and overestimates how much power or space transistors actually need.
If the ALU contained an early-out or fast path for simpler multiplies, the latency would read 1-3. You can verify this by looking at div, which does have an early out and has a latency of 35-88.
Any compiler that doesn't swap a multiply for a shift when it can is negligent.
Valid points, but in this case (and many others where you encounter power-of-2 mult/div), I'd consider that a shift might actually semantically be the more natural operation in the first place, instead of an "optimization" of the mult/div operation. (With their equivalence being obvious to any reader, it might not matter.)
I was not arguing that people writing code should perform strength reductions manually; I was explaining what they were and then stating that even ancient compilers do them automatically. While I did not explicitly state it, the logical follow-on is that programmers should almost never explicitly strength-reduce in their code; they should write the semantically clear version and let the compiler handle it for them.
You are correct that on modern CPUs there are often specifically recognized idioms where the processor can implicitly perform an instruction transform, such as a strength reduction from a multiply to a shift.
Having said that, it still makes sense for a compiler to perform strength reductions rather than depending on the CPU frontend, at least if your compiler has a reasonably good scheduling model for the CPU. I don't know of any modern production-quality compiler that would omit a simple strength reduction like this and leave it to the CPU.