I think a big mistake in the article, in a context where performance is the main...

tremon · on March 30, 2023

Apache Arrow (https://arrow.apache.org/overview/) is built exactly around this idea: it's a library for managing the in-memory representation of large datasets.

ohr · on March 31, 2023

Author here: I agree, that's a great perf advice (esp. when you can restructure your code).

I couldn't get into a this in the article (would be too long), but this is a great point and the original library does this in a lot of places.

One problem in our use case is that the actual structs members are pretty big & that we need to group/regroup them a lot.

The fastest approach for us was to do something like in the article for the initial filtering, then build a hashmap of SoAs with the needed data, and do the heavier math on that.

Yoric · on March 30, 2023

Having used this approach in a few languages, I agree that it's (sometimes) (much) better for performance, but it tends to wreak havoc on readability.

snicker7 · on March 30, 2023

Languages and with some sort of decent metaprogramming support can alleviate this sort of issue (see Zig, Julia, Jai, etc.)

Yoric · on March 30, 2023

Do you have examples? I cannot think of a good way to do that.

jmalicki · on March 31, 2023

For python, you can use pandas. It even has arrow support now.

Yoric · on March 31, 2023

Yeah, I need to try it.

But the parent was suggesting that metaprogramming could make readability issues go away and I can't wrap my head around how it would do that.

rsfern · on April 1, 2023

In Julia you can use https://juliaarrays.github.io/StructArrays.jl/stable/ which lets your code use the array of structs interface but internally it’s a structure of arrays

Yoric · on April 1, 2023

Nice, thanks!

prirun · on March 30, 2023

Modern CPU caches are usually loaded in 64-byte units - much larger than 128 bits. I just ran some tests with a C program on an Intel I5 with both AoS and SoA using a list of 1B points with 32-bit X and Y components. Looping through the list of points and totaling all X and Y components was the same speed with either AoS or SoA.

It's easy to make intuitive guesses about how things are working that seem completely reasonable. But you have to benchmark because modern CPUs are so complex that reasoning and intuition mostly don't work.

Programs used for testing are below. I ran everything twice because my system wasn't always idle, so take the lower of the 2 runs.

  [jim@mbp ~]$ sh -x x
  + cat x1.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x + p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m12.078s
  user 0m7.319s
  sys 0m4.363s
  + ./x1
  s=1808348672

  real 0m9.415s
  user 0m6.677s
  sys 0m2.685s
  + cat x2.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      x[i] = i;
      y[i] = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += x[i] + y[i];
    }
    printf("s=%d\n", s);
  }
  + cc -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m9.753s
  user 0m6.713s
  sys 0m2.967s
  + ./x2
  s=1808348672

  real 0m9.642s
  user 0m6.674s
  sys 0m2.902s
  + cat x3.c
  #include <stdio.h>

  #define NUM 1000000000

  struct {
    int x;
    int y;
  } p[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++) {
      p[i].x = i;
    }
    for (i=0; i<NUM; i++) {
      p[i].y = i;
    }

    s=0;
    for (i=0; i<NUM; i++) {
      s += p[i].x;
    }
    for (i=0; i<NUM; i++) {
      s += p[i].y;
    }
    printf("s=%d\n", s);
  }
  + cc -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m13.844s
  user 0m11.095s
  sys 0m2.700s
  + ./x3
  s=1808348672

  real 0m13.686s
  user 0m11.038s
  sys 0m2.611s
  + cat x4.c
  #include <stdio.h>

  #define NUM 1000000000

  int x[NUM];
  int y[NUM];

  int main() {
    int i,s;

    for (i=0; i<NUM; i++)
      x[i] = i;
    for (i=0; i<NUM; i++)
      y[i] = i;

    s=0;
    for (i=0; i<NUM; i++)
      s += x[i];
    for (i=0; i<NUM; i++)
      s += y[i];

    printf("s=%d\n", s);
  }
  + cc -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m13.530s
  user 0m10.851s
  sys 0m2.633s
  + ./x4
  s=1808348672

  real 0m13.489s
  user 0m10.856s
  sys 0m2.603s

lights0123 · on March 30, 2023

It appears that you didn't enable optimizations. That's needed for SIMD, which can only be taken advantage of with contiguously packed data.

prirun · on March 31, 2023

Results with -O are below.

x1 (AoS) vs x2 (SoA): no performance difference x3 (arrays not in structure, both arrays in loop): slower x4 (arrays not in structure, one array in loop): faster

My advice is still not to assume that SoA is always faster than AoS without benchmarking.

  + cc -O -o x1 x1.c
  + ./x1
  s=1808348672

  real 0m11.775s
  user 0m3.540s
  sys 0m6.592s
  + ./x1
  s=1808348672

  real 0m5.427s
  user 0m2.727s
  sys 0m2.682s
  + cc -O -o x2 x2.c
  + ./x2
  s=1808348672

  real 0m5.185s
  user 0m2.296s
  sys 0m2.872s
  + ./x2
  s=1808348672

  real 0m5.140s
  user 0m2.273s
  sys 0m2.852s
  + cc -O -o x3 x3.c
  + ./x3
  s=1808348672

  real 0m6.423s
  user 0m3.745s
  sys 0m2.660s
  + ./x3
  s=1808348672

  real 0m6.485s
  user 0m3.741s
  sys 0m2.714s
  + cc -O -o x4 x4.c
  + ./x4
  s=1808348672

  real 0m4.875s
  user 0m2.205s
  sys 0m2.651s
  + ./x4
  s=1808348672

  real 0m4.894s
  user 0m2.189s
  sys 0m2.684s

Narishma · on March 31, 2023

The reason you're not seeing much difference is that your struct is very small, just 16 bytes. In the cases where you're using both x and y in the loop (x1 and x2), you can fit 4 of them in a cache line and you're not wasting space since you need to use both. In the case you're only using one of the values (x3), you're wasting half a cache line and that shows in the benchmark. If you had a bigger struct and/or where you're not using all the members in the calculation, you'd see a much bigger difference in performance between SoA and AoS.

rpep · on March 31, 2023

You need to add “-march=native” to turn on SIMD at the highest level your processor will support, otherwise it will just use SSE4.1 by default on most compilers (the lowest common denominator, as all x86-64 processors have it).