Friday, October 17, 2014

Follow up: Executors and Cache Locality Experiment

Thanks to Jean Philippe Bempel, who challenged my results (with good reason), I discovered an issue in the last post: code completion let me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadExecutor(), so the pinned-to-thread-actor results are actually even better than previously reported. The big picture has not changed that much, but it's still worth reporting.
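
For reference, a minimal sketch of the corrected choice (the benchmark wiring around it is omitted and the variable names are mine):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CorrectedExecutorChoice {
    public static void main(String[] args) {
        // accidentally used before (via code completion): the scheduled variant,
        // which uses a different internal (delay) queue and performed worse here
        // ExecutorService accidental = Executors.newSingleThreadScheduledExecutor();

        // what was intended: a plain single-threaded executor, one per actor,
        // so each actor stays pinned to "its" thread
        ExecutorService dedicated = Executors.newSingleThreadExecutor();
        dedicated.submit(() -> System.out.println("runs on the actor's dedicated thread"));
        dedicated.shutdown();
    }
}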

On a second note: there are many other aspects to concurrent scheduling, such as queue implementations etc. Especially if there is no "beef" inside the message processing, these differences become more dominant than cache misses, but that is a different problem which has been covered extensively and in depth by other people (e.g. Nitsan Wakart).

The focus of this experiment is locality/cache misses; keep in mind that the executors' different queueing implementations certainly add noise/bias.

As requested, I add results from the Linux "perf" tool to show that there are significant differences in cache misses caused by the random thread-to-actor assignment done by ThreadPoolExecutor and WorkStealingExecutor.

Check out my previous post for a description of the test case; a rough sketch of the three scheduling variants compared here is shown below.
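
The sketch below outlines the three executor setups under test, assuming the standard JDK factories; the actor/message code itself lives in the previous post, and class and variable names here are illustrative only.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SchedulingVariants {

    static final int THREADS = 8;

    // "Dedicated": one single-threaded executor per worker slot; each actor is
    // statically assigned to one of them, so its messages always run on the
    // same thread (good cache locality).
    static final ExecutorService[] dedicated = new ExecutorService[THREADS];
    static {
        for (int i = 0; i < THREADS; i++)
            dedicated[i] = Executors.newSingleThreadExecutor();
    }

    // "WorkStealing": a shared work-stealing pool; any idle worker thread may
    // pick up any actor's message.
    static final ExecutorService workStealing = Executors.newWorkStealingPool(THREADS);

    // "ThreadPool": a classic fixed thread pool with one shared queue.
    static final ExecutorService fixedPool = Executors.newFixedThreadPool(THREADS);

    public static void main(String[] args) {
        int actorId = 3; // hypothetical actor index
        Runnable message = () -> { /* process one actor message */ };

        dedicated[actorId % THREADS].submit(message); // always the same thread
        workStealing.submit(message);                 // any worker thread
        fixedPool.submit(message);                    // any worker thread

        for (ExecutorService e : dedicated) e.shutdown();
        workStealing.shutdown();
        fixedPool.shutdown();
    }
}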

Results with adjusted SingleThreadExecutor (XEON, 2 sockets, 6 cores each, no HT)


As in the previous post, the "dedicated" actor-pinned-to-thread variant performs best. For very small local state there are only few cache misses, so the differences are small, but they widen once each actor accesses a bigger chunk of memory. Note that ThreadPool is hampered by its internal scheduling/queuing mechanics; regardless of locality, it performs poorly.


When increasing the number of actors to 8000 (so 1000 actors per thread), "WorkStealing" and "Dedicated" perform similarly. Reason: executing 8000 actors round robin creates cache misses for both executors. Note that in a real-world server it is likely that there are active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.

"perf stat -e" and "perf stat -cs" results

(only the 2000, 4000 and 8000 local-size tests were run)

Dedicated/actor-pinned-to-thread:
333,669,424 cache-misses                                                
19.996366007 seconds time elapsed
185,440 context-switches                                            
20.230098005 seconds time elapsed
=> 9,300 context switches per second

workstealing:
2,524,777,488 cache-misses                                                
39.610565607 seconds time elapsed
381,385 context-switches                                            
39.831169694 seconds time elapsed
=> 9,500 context switches per second

fixedthreadpool:
3,213,889,492 cache-misses                                                
92.141264115 seconds time elapsed
25,387,972 context-switches                                            
87.547306379 seconds time elapsed
=> 290,000 context switches per second

A quick test with a more realistic test method

In order to get a more realistic impression, I replaced the synthetic int-iteration with some dirty "real world" dummy work (some allocation plus HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (which should also have a negative impact on locality).
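
A dummy workload in that spirit might look like the sketch below: per message, do some allocation plus a HashMap put/get against the actor's local state. Class name, sizes and the access pattern are illustrative assumptions, not the exact benchmark code.

import java.util.HashMap;

public class DummyActorState {

    private final HashMap<Integer, int[]> localState = new HashMap<>();
    private final int mapSize;
    private int counter;

    DummyActorState(int mapSize) {
        this.mapSize = mapSize;
    }

    // called once per "message"
    int processMessage() {
        int key = counter++ % mapSize;
        int[] value = localState.get(key);
        if (value == null) {
            value = new int[8];   // some allocation
        }
        value[0]++;               // touch the actor-local memory
        localState.put(key, value);
        return value[0];
    }

    public static void main(String[] args) {
        DummyActorState actor = new DummyActorState(2000); // e.g. the 2000-entry case
        for (int i = 0; i < 1_000_000; i++)
            actor.processMessage();
        System.out.println("map size after run: " + actor.localState.size());
    }
}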


Note that this is rather short processing, so queue implementations and executor internals might dominate the locality effects here. This test was run on an Opteron 8c/16t * 2 sockets, a processor with only 8 KB of L1 cache. (BTW: the implementation is extra dirty, so please no performance-optimization comments, thx.)


As ThreadPoolExecutor is abnormally bad in this test/processor combination, here are the plain numbers:

                 64 HMap    256 HMap   2000 HMap   4000 HMap   32k HMap   320k HMap
                 entries    entries    entries     entries     entries    entries
WorkStealing       1070       1071        1097        1129       1238        1284
Dedicated           656        646         661         649        721         798
ThreadPool         8314       8751        9412        9492      10269       10602

The conclusions basically stay the same as in the original post. Remember that cache misses are only one factor of overall runtime performance, so there are workloads where the results might look different. The quality/specialization of the queue implementation will have a huge impact in case the processing consists of only a few lines of code.

Finally, my result:
Pinning actors to threads produced the lowest cache-miss rates in every case tested.




3 comments:

  1. Hi Rudiger
    Small typo at the beginning
    "Code-completion let me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadScheduledExecutor()"
    Again you selected the Scheduled one :)
    Thanks for sharing
    Georges

  2. Argh .. poor me .. I am considering to fork openjdk and delete this method :-) Thx
