Looks like there is already a Fortran compiler emitting MLIR IR, with support for a few OpenMP constructs and two SPEC CPU 2017 benchmarks running: https://github.com/compiler-tree-technologies/fc
If you're shipping binaries, you don't know the exact architecture in advance (there are many extensions to x86, and you don't know whether the end user is running a new enough processor to use all of them). If you don't use them, you are likely leaving performance on the table. So you want to select the fastest option supported by the processor you happen to be running on. You can do this with fairly minimal performance impact by linking in different versions of the function at runtime, but this requires some support from your compiler and runtime environment.
1. A portable binary where only individual SIMD operations are optimized for all targets.
2. Building the optimized binary for every target architecture when needed (either by the user or by the binary distributor).
The concern with (1) is that as the number of dynamically dispatched functions (or if-else nests deciding between versions) increases, the quality of the generated code degrades for every architecture. Basically, the compiler is left with opaque, unrecognizable function calls, which restricts even the target-independent optimizations (like GVN, CSE, constant propagation, etc.).
Say a user writes a SIMD program that is full of dynamically dispatched functions (opaque to the compiler); doesn't that impact performance heavily?
Isn't compiler support for optimizing SIMD operations necessary, rather than writing wrapper libraries? For example, lowering SIMD operation calls to existing vectorized math libraries that compilers recognize (e.g. sin(), cos(), pow() in libm).
Phoronix article: https://www.phoronix.com/scan.php?page=news_item&px=FC-LLVM-...