On a second note: there are many other aspects to concurrent scheduling, such as queue implementations. Especially if there is little actual "beef" inside the message processing, those differences dominate over cache misses; but that is a separate problem which has been covered extensively and in depth by others (e.g. Nitsan Wakart).
The focus of this experiment is locality/cache misses, but keep in mind that the different queueing implementations of the executors certainly add some bias.
As requested, I am adding results from the Linux "perf" tool to show that the random thread-to-actor assignment done by ThreadPoolExecutor and WorkStealingExecutor causes significant differences in cache misses.
Check out my recent post for a description of the test case.
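For readers who skip that post, here is a minimal sketch of the two scheduling strategies being compared, using plain java.util.concurrent executors. The actual benchmark uses its own (adjusted) executor implementations, and the thread/actor counts below are just illustrative assumptions:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SchedulingSketch {

    static final int THREADS = 12;   // e.g. 2-socket XEON, 6 cores each, no HT
    static final int ACTORS  = 8000;

    public static void main(String[] args) {
        // Variant A ("Dedicated"): one single-thread executor per worker thread.
        // Each actor is pinned to exactly one of them, so its local state
        // tends to stay in that core's caches.
        ExecutorService[] pinned = new ExecutorService[THREADS];
        for (int i = 0; i < THREADS; i++)
            pinned[i] = Executors.newSingleThreadExecutor();

        // Variant B ("ThreadPool"/"WorkStealing"): one shared pool, any worker
        // thread may process any actor's message, so actor state migrates
        // between cores and caches.
        ExecutorService shared = Executors.newWorkStealingPool(THREADS);

        for (int a = 0; a < ACTORS; a++) {
            final int actorId = a;
            pinned[actorId % THREADS].execute(() -> processMessage(actorId)); // pinned
            // shared.execute(() -> processMessage(actorId));                 // random
        }

        for (ExecutorService e : pinned) e.shutdown();
        shared.shutdown();
    }

    static void processMessage(int actorId) {
        // touch the actor's local state here (int array / HashMap in the tests)
    }
}
```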
Results with the adjusted SingleThreadExecutor (2-socket XEON, 6 cores each, no HT)
When the number of actors is increased to 8000 (i.e. 1000 actors per thread), "WorkStealing" and "Dedicated" perform similarly. Reason: executing 8000 actors round robin creates cache misses for both executors. Note that in a real-world server it is likely that there are active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.
"perf stat -e" and "perf stat -cs" results
(only the 2000, 4000 and 8000 local-state-size tests were run)
333,669,424 cache-misses
19.996366007 seconds time elapsed
20.230098005 seconds time elapsed
=> 9,300 context switches per second

39.610565607 seconds time elapsed
39.831169694 seconds time elapsed
=> 9,500 context switches per second

92.141264115 seconds time elapsed
87.547306379 seconds time elapsed
=> 290,000 context switches per second
A quick test with a more realistic test method
In order to get a more realistic impression, I replaced the synthetic int iteration with some dirty "real world" dummy work (some allocation plus HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (which should also hurt locality).
Note that this is rather short processing, so queue implementations and executor internals might dominate over locality here. This test is run on an Opteron 8c/16t * 2 sockets, a processor with only 8 KB of L1 cache. (BTW: the impl is extra dirty, so no performance-optimization comments please, thx)
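To give a rough idea of what that dummy work looks like per message, here is a sketch. The actual test code is not shown here, so class and field names as well as the allocation size are made-up assumptions:

```java
import java.util.HashMap;

// Rough sketch of the per-message "dummy work" described above
// (allocation plus HashMap put/get); names and sizes are illustrative only.
class DummyActorState {
    final HashMap<Integer, int[]> localState = new HashMap<>();
    final int mapSize;                 // varied from 64 up to 320k entries
    int counter;

    DummyActorState(int mapSize) { this.mapSize = mapSize; }

    void onMessage() {
        int key = counter++ % mapSize;
        int[] value = new int[16];     // some allocation
        value[0] = key;
        localState.put(key, value);    // HashMap put ...
        int[] prev = localState.get((key + mapSize / 2) % mapSize); // ... and get
        if (prev != null)
            prev[0]++;
    }
}
```

The varied parameter in the table below is the HashMap size (64 up to 320k entries); the per-message work itself stays the same.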
As ThreadPoolExecutor is abnormally bad in this test/processor combination, the plain numbers:
| 64 HMap entries | 256 HMap entries | 2000 HMap entries | 4000 HMap entries | 32k HMap entries | 320k HMap entries |
The conclusions basically stay the same as in the original post. Remember that cache misses are only one factor of overall runtime performance, so there are workloads where the results might look different. The quality/specialization of the queue implementation will have a huge impact if the processing consists of only a few lines of code.
Finally, my result:
Pinning actors to threads produced the lowest cache-miss rates in every case tested.