Friday, October 17, 2014

Follow up: Executors and Cache Locality Experiment

Thanks to Jean Philippe Bempel, who challenged my results (for good reason), I discovered an issue in my last post: code completion made me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadExecutor(), so the actor-pinned-to-thread results are actually even better than previously reported. The big picture has not changed much, but it's still worth reporting.
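To make the mix-up concrete, here is a minimal sketch (not the benchmark code from the post). Both factory methods return a single-threaded ExecutorService, so the wrong one compiles and runs without complaint. In OpenJDK, as far as I know, the scheduled variant sits on top of a ScheduledThreadPoolExecutor with a delay queue, while the plain one uses a ThreadPoolExecutor with a LinkedBlockingQueue, which is why the choice can skew timings. The class name ExecutorMixup and the helpers runAll and isScheduled are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorMixup {

    // Runs `tasks` trivial jobs on the executor, shuts it down and
    // returns how many jobs completed.
    static int runAll(ExecutorService executor, int tasks) {
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            executor.execute(done::incrementAndGet);
        }
        executor.shutdown();
        try {
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("executor did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return done.get();
    }

    // True if the executor also offers delayed/periodic scheduling.
    static boolean isScheduled(ExecutorService executor) {
        return executor instanceof ScheduledExecutorService;
    }

    public static void main(String[] args) {
        ExecutorService intended = Executors.newSingleThreadExecutor();
        ExecutorService accidental = Executors.newSingleThreadScheduledExecutor();
        System.out.println("intended is scheduled:   " + isScheduled(intended));
        System.out.println("accidental is scheduled: " + isScheduled(accidental));
        System.out.println("intended completed:   " + runAll(intended, 100_000));
        System.out.println("accidental completed: " + runAll(accidental, 100_000));
    }
}
```

Both executors complete every task on one thread; only the queue behind them differs, so the mistake shows up in the timings rather than as an error.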

On a second note: there are many other aspects to concurrent scheduling, such as queue implementations. Especially if there is no "beef" inside the message processing, these differences become more dominant than cache misses, but that is a separate problem that has been covered in depth by other people (e.g. Nitsan Wakart).

The focus of this experiment is locality/cache misses. Keep in mind that the different queueing implementations of the executors certainly add dirt/bias.

As requested, I am adding results from the Linux "perf" tool to show that there are significant differences in cache misses caused by the random assignment of threads to actors as done by ThreadPoolExecutor and WorkStealingExecutor.

Check out my recent post for a description of the test case.

Results with adjusted SingleThreadExecutor (Xeon, 2 sockets, 6 cores each, no HT)


As in the previous post, the "dedicated" actor-pinned-to-thread setup performs best. For very small local state there are only a few cache misses, so the differences are small, but they widen once each actor accesses a bigger chunk of memory. Note that ThreadPool is hampered by its internal scheduling/queuing mechanics; regardless of locality, it performs poorly.
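For readers who have not seen the previous post, the "dedicated" idea can be sketched roughly as follows. This is an illustrative sketch under my own assumptions, not the benchmark code: each actor is statically mapped to one single-threaded executor (here simply by actor id modulo thread count), so its local state is only ever touched by the same thread. The names PinnedDispatch, Actor, receive and countMovedActors are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PinnedDispatch {

    // Hypothetical actor: a chunk of local state that is only touched by its own messages.
    static final class Actor {
        final int id;
        final int[] localState;
        volatile long firstThread = -1;  // id of the first thread that ran this actor
        volatile boolean moved = false;  // set if a different thread ever ran it

        Actor(int id, int localSize) {
            this.id = id;
            this.localState = new int[localSize];
        }

        // "Message processing": iterate over the local state, similar to a synthetic int loop.
        void receive() {
            long current = Thread.currentThread().getId();
            if (firstThread == -1) {
                firstThread = current;
            } else if (firstThread != current) {
                moved = true;
            }
            for (int i = 0; i < localState.length; i++) {
                localState[i]++;
            }
        }
    }

    // Sends `rounds` messages to every actor, each actor pinned to one worker,
    // and returns how many actors were ever executed by more than one thread.
    static int countMovedActors(int threads, int actorCount, int rounds, int localSize) {
        ExecutorService[] workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = Executors.newSingleThreadExecutor();
        }
        Actor[] actors = new Actor[actorCount];
        for (int i = 0; i < actorCount; i++) {
            actors[i] = new Actor(i, localSize);
        }
        for (int r = 0; r < rounds; r++) {
            for (Actor actor : actors) {
                // Pinning: the same actor always goes to the same worker thread.
                workers[actor.id % threads].execute(actor::receive);
            }
        }
        try {
            for (ExecutorService worker : workers) {
                worker.shutdown();
                if (!worker.awaitTermination(60, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("worker did not terminate in time");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int movedCount = 0;
        for (Actor actor : actors) {
            if (actor.moved) {
                movedCount++;
            }
        }
        return movedCount;
    }

    public static void main(String[] args) {
        System.out.println("actors that changed threads: " + countMovedActors(8, 8000, 10, 2000));
    }
}
```

With a shared pool, any free thread may pick up the next message for an actor, so the same local state tends to wander between cores; the static mapping above rules that out by construction.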


When increasing the number of actors to 8000 (so 1000 actors per thread), "Workstealing" and "Dedicated" perform similarly. The reason: executing 8000 actors round-robin creates cache misses for both executors. Note that in a real-world server it's likely that there are both active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.

"perf stat -e" and "perf stat -cs" results

(only the 2000, 4000 and 8000 local size tests were run)

Dedicated/actor-pinned-to-thread:
333,669,424 cache-misses                                                
19.996366007 seconds time elapsed
185,440 context-switches                                            
20.230098005 seconds time elapsed
=> 9,300 context switches per second

workstealing:
2,524,777,488 cache-misses                                                
39.610565607 seconds time elapsed
381,385 context-switches                                            
39.831169694 seconds time elapsed
=> 9,500 context switches per second

fixedthreadpool:
3,213,889,492 cache-misses                                                
92.141264115 seconds time elapsed
25,387,972 context-switches                                            
87.547306379 seconds time elapsed
=> 290,000 context switches per second
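The raw counters are easier to compare once normalized. The small sketch below (the helper names perSecond and ratio are my own) recomputes context switches per second from the figures listed above and relates the cache-miss counts to the dedicated run. The per-second values come out close to the rounded numbers quoted, and the cache-miss counts of workstealing and fixedthreadpool are roughly 7.6 and 9.6 times those of the dedicated setup.

```java
public class PerfRates {

    // Events per second for a counter measured over `elapsedSeconds`.
    static double perSecond(long events, double elapsedSeconds) {
        return events / elapsedSeconds;
    }

    // How many times larger `value` is than `baseline`.
    static double ratio(long value, long baseline) {
        return (double) value / baseline;
    }

    public static void main(String[] args) {
        // Numbers copied from the perf output above.
        long dedicatedMisses = 333_669_424L;
        long workStealingMisses = 2_524_777_488L;
        long fixedPoolMisses = 3_213_889_492L;

        System.out.printf("dedicated    cs/s: %.0f%n", perSecond(185_440L, 20.230098005));
        System.out.printf("workstealing cs/s: %.0f%n", perSecond(381_385L, 39.831169694));
        System.out.printf("fixedpool    cs/s: %.0f%n", perSecond(25_387_972L, 87.547306379));

        System.out.printf("workstealing / dedicated cache misses: %.2f%n",
                ratio(workStealingMisses, dedicatedMisses));
        System.out.printf("fixedpool / dedicated cache misses:    %.2f%n",
                ratio(fixedPoolMisses, dedicatedMisses));
    }
}
```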








A quick test with a more realistic test method

To get a more realistic impression, I replaced the synthetic int iteration with some dirty "real world" dummy work (some allocation plus HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (which should also have a negative impact on locality).
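A message handler of this kind could look roughly like the sketch below. It is not the actual test code; the class HashMapWorkload, the actor MapActor and the exact mix of operations are assumptions chosen only to show "allocate something, then put/get on a per-actor HashMap of configurable size".

```java
import java.util.HashMap;
import java.util.Map;

public class HashMapWorkload {

    // Hypothetical actor whose local state is a HashMap with a fixed number of entries.
    static final class MapActor {
        private final Map<Integer, String> state = new HashMap<>();
        private final int entries;
        private int cursor = 0;
        private long messages = 0;

        MapActor(int entries) {
            this.entries = entries;
            for (int i = 0; i < entries; i++) {
                state.put(i, "init-" + i);
            }
        }

        // One "message": allocate a fresh value, overwrite one entry, read another one back.
        int receive() {
            int key = cursor;
            cursor = (cursor + 1) % entries;
            messages++;
            state.put(key, "msg-" + messages);                        // allocation + put
            String other = state.get((key + entries / 2) % entries);  // get from a distant key
            return other.length();
        }

        int size() {
            return state.size();
        }
    }

    // Sends `messageCount` messages to one actor and returns the final map size.
    static int run(int entries, int messageCount) {
        MapActor actor = new MapActor(entries);
        for (int i = 0; i < messageCount; i++) {
            actor.receive();
        }
        return actor.size();
    }

    public static void main(String[] args) {
        System.out.println("map size after run: " + run(4000, 1_000_000));
    }
}
```

Because only existing keys are overwritten, the map keeps its configured size, so the number of entries remains the knob that controls how much memory each actor touches.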


Note that this is rather short processing, so queue implementations and executor internals might dominate locality here. This test was run on an Opteron (8c16t * 2 sockets), a processor with only 8kb L1 cache. (BTW: the impl is extra dirty, so no performance optimization comments pls, thx)


As ThreadPoolExecutor is abnormally bad in this test/processor combination, here are the plain numbers:

HMap entries     64    256   2000   4000    32k   320k
WorkStealing   1070   1071   1097   1129   1238   1284
Dedicated       656    646    661    649    721    798
ThreadPool     8314   8751   9412   9492  10269  10602

The conclusions basically stay the same as in the original post. Remember that cache misses are only one factor in overall runtime performance, so there are workloads where results might look different. The quality/specialization of the queue implementation will have a huge impact when processing consists of only a few lines of code.

Finally, my result: 
Pinning actors to threads created the lowest cache miss rates in every case tested.




14 comments:

  1. Hi Rudiger
    Small typo at the beginning
    "Code-completion let me accidentally choose Executors.newSingleThreadScheduledExecutor() instead of Executors.newSingleThreadScheduledExecutor()"
    Again you selected the Scheduled one :)
    Thanks for sharing
    Georges

  2. Argh .. poor me .. I am considering to fork openjdk and delete this method :-) Thx
