I think I lean towards per-thread sharding instead of mutex based or lock free d...

I think I lean towards per-thread sharding instead of mutex based or lock free data structures except for lockfree ringbuffers.

You can get embarassingly parallel performance if you split your data by thread and aggregate periodically.

If you need a consistent view of your entire set of data, that is a slow path with sharding.

In my experiments with multithreaded software I simulate a bank where many bankaccounts are randomly withdrawn from and deposited to. https://github.com/samsquire/multiversion-concurrency-contro...

  ShardedBank.java
  ShardedBank2.java
  ShardedBankNonRandom.java
  ShardedHashMap.java
  ShardedNonRandomBank2.java
  ShardedTotalOrder.java
  ShardedTotalRandomOrder.java

I get 700 million requests per second over 12 threads due to the sharding of money over accounts. Here I prioritise throughput of transacitons per second over balance checks (a consistent view of the data).

I am also experimenting with left-right concurrency control which trades memory usage for performance, it basically keeps two copies of data, one that is currently being read and written to and another inactive copy that is not active. You switch the data structures around periodically to "make changes visible" to that thread.