Tuesday, October 14, 2014

Experiment: Cache effects when scheduling Actors with F/J, Threadpool, Dedicated Threads

Update: I accidentally used newSingleThreadScheduledExecutor instead of newFixedThreadPool(1) for the "Dedicated" test case (IDE code completion). With this corrected, "Dedicated" outperforms the others by an even wider margin. See the follow-up post for updated results and cache-miss measurements taken with the "perf" tool (they do not really change the big picture).

The experiment in my last post had a serious flaw: in an actor system, operations on a single actor are executed one after the other. However, by naively adding message-processing jobs to executors, private actor state was accessed concurrently, leading to false sharing and cache-coherency costs, especially for small local state sizes.

Therefore I modified the test. For each actor, the next message-processing job is scheduled only once the previous one has finished, so the experiment correctly resembles the behaviour of typical actors (or lightweight processes/tasks/fibers): no memory region is accessed concurrently.
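The modified scheduling can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the class name SequentialActor, the helper runAll and the use of a CountDownLatch are assumptions made for the example. The point is that an actor resubmits itself only at the end of run(), so at most one job per actor is ever in flight.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: an actor with at most one message-processing job in flight. */
class SequentialActor implements Runnable {
    private final ExecutorService executor;
    private final CountDownLatch done;
    private int remaining;   // messages still to process (expects >= 1)
    private long processed;  // private actor state, never accessed concurrently

    SequentialActor(ExecutorService executor, int messages, CountDownLatch done) {
        this.executor = executor;
        this.remaining = messages;
        this.done = done;
    }

    void start() {
        executor.execute(this);
    }

    @Override
    public void run() {
        processed++;                 // stand-in for real message processing
        if (--remaining > 0) {
            executor.execute(this);  // schedule the next message only now
        } else {
            done.countDown();        // this actor has finished all messages
        }
    }

    long processed() {
        return processed;
    }

    /** Runs all actors on a fixed pool and returns the per-actor message counts. */
    static long[] runAll(int threads, int actors, int messagesPerActor) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(actors);
        SequentialActor[] all = new SequentialActor[actors];
        for (int i = 0; i < actors; i++) {
            all[i] = new SequentialActor(pool, messagesPerActor, done);
            all[i].start();
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        long[] counts = new long[actors];
        for (int i = 0; i < actors; i++) {
            counts[i] = all[i].processed();
        }
        return counts;
    }
}
```

Because ExecutorService.execute establishes a happens-before edge between submission and execution, the plain fields need no extra synchronization as long as only one job per actor exists at any time.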

Experiment roundup:

Several million messages are scheduled to several "Actor"-simulating classes. Message processing is simulated by reading and writing the private, actor-local state in random order. There are more actors (24-8000) than threads (6-8). Note that the results (if any) should also hold for other lightweight concurrency schemes such as goroutines, fibers, tasks ...

The test is done with

  • ThreadPoolExecutor
  • WorkStealingExecutor
  • Dedicated Thread (Each Actor has a fixed assignment to a worker thread)
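As a rough sketch, the three variants could be set up like this. The class Schedulers and its method names are hypothetical, and the modulo assignment of actors to threads is an assumption; the only details taken from the post are newFixedThreadPool(1) for the "Dedicated" case and the use of a thread pool and a work-stealing (fork/join) pool for the other two.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical factory for the three scheduling variants under test. */
class Schedulers {
    /** Plain ThreadPoolExecutor: any worker may pick up any job. */
    static ExecutorService threadPool(int threads) {
        return Executors.newFixedThreadPool(threads);
    }

    /** Work-stealing pool (a ForkJoinPool, available since Java 8). */
    static ExecutorService workStealing(int threads) {
        return Executors.newWorkStealingPool(threads);
    }

    /** "Dedicated": one single-threaded executor per worker thread. */
    static ExecutorService[] dedicated(int threads) {
        ExecutorService[] workers = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = Executors.newFixedThreadPool(1);
        }
        return workers;
    }

    /** Fixed actor-to-thread assignment; modulo is an assumed policy. */
    static ExecutorService executorFor(ExecutorService[] dedicated, int actorIndex) {
        return dedicated[actorIndex % dedicated.length];
    }
}
```

With this layout an actor always submits its jobs to the same single-threaded executor, so every message of that actor runs on the same thread.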

Simulating an Actor accessing local state:
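A minimal sketch of such a simulated actor, assuming a byte array as local state and a per-actor random generator (the class name SimulatedActor and its parameters are illustrative, not taken from the benchmark source):

```java
import java.util.Random;

/** Hypothetical actor whose message processing touches its local state randomly. */
class SimulatedActor {
    private final byte[] state;   // private, unshared actor-local state
    private final Random random;  // per-actor generator (an assumption)

    SimulatedActor(int stateBytes, long seed) {
        this.state = new byte[stateBytes];
        this.random = new Random(seed);
    }

    /** Processes one "message": memAccesses random reads and writes on the state. */
    int process(int memAccesses) {
        int sum = 0;
        for (int i = 0; i < memAccesses; i++) {
            int idx = random.nextInt(state.length);
            sum += state[idx];  // read
            state[idx]++;       // write
        }
        return sum;  // returned so the loop cannot be optimized away
    }
}
```

The larger stateBytes is, the less likely two consecutive accesses land on the same cache line, which is the knob the "local state bytes" column in the results varies.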


Full Source of Benchmark

Suspicion:
As ThreadPoolExecutor and WorkStealingExecutor schedule each message on an arbitrary thread, they will produce more cache misses than pinning each actor to a fixed thread. My speculation is that work stealing cannot make up for the cost of those cache misses.


(Some) Variables:
  • Number of worker threads
  • Number of actors
  • Amount of work per message
  • Locality / Size of private unshared actor state


8 Threads 24 actors 100 memory accesses (per msg)


Interpretation:

For this particular load, fixed assigned threads outperform executors. Note: the larger the local state of an actor, the higher the probability of a prefetch failure and thus a cache miss. In this scenario my suspicion holds: work stealing cannot make up for the number of cache misses. Fixed assigned threads profit because it is likely that some state of a previously processed message still resides in cache when a new message is processed on the same actor.
It is remarkable how badly ThreadPoolExecutor performs in this experiment.

This scenario is typical for a backend-type service: there are few actors with high load. When running a front-end server with many clients, there are probably more actors, as typically there is one actor per client session. Therefore let's push the number of actors up to 8000:

8 Threads 8000 actors 100 memory accesses (per msg)



Interpretation:

With this number of actors, all execution schemes suffer from cache misses, as the accumulated state of 8000 actors is too big to fit into L1 cache. Therefore the cache advantage of fixed-assigned threads ("Dedicated") does not make up for the lack of work stealing. The work-stealing executor outperforms every other execution scheme if a large amount of state is involved.
This is a somewhat unrealistic scenario: in a real server application, client requests probably do not arrive "round robin"; some clients are more active than others. So in practice I'd expect "Dedicated" to retain at least some advantage from a higher cache hit rate. Anyway: when serving many (stateful) clients, WorkStealing can be expected to outperform.

Just to get a third variant: same test with 240 actors:


These results complete the picture: with fewer actors, cache effects outweigh work stealing. The higher the number of actors, the more cache misses occur, so work stealing starts outperforming dedicated threads.


Modifying other variables

Number of memory accesses

If processing a message involves few memory accesses, work stealing improves relative to the other two. Reason: fewer memory accesses mean fewer cache misses, so work stealing becomes more significant in the overall result.

Worker Threads:8 actors:24 #mem accesses: 20
local state bytes: 64 WorkStealing avg:505
local state bytes: 64 ThreadPool avg:2001
local state bytes: 64 Dedicated avg:557
local state bytes: 256 WorkStealing avg:471
local state bytes: 256 ThreadPool avg:1996
local state bytes: 256 Dedicated avg:561
local state bytes: 2000 WorkStealing avg:589
local state bytes: 2000 ThreadPool avg:2109
local state bytes: 2000 Dedicated avg:600
local state bytes: 4000 WorkStealing avg:625
local state bytes: 4000 ThreadPool avg:2096
local state bytes: 4000 Dedicated avg:600
local state bytes: 32000 WorkStealing avg:687
local state bytes: 32000 ThreadPool avg:2328
local state bytes: 32000 Dedicated avg:640
local state bytes: 320000 WorkStealing avg:667
local state bytes: 320000 ThreadPool avg:3070
local state bytes: 320000 Dedicated avg:738
local state bytes: 3200000 WorkStealing avg:1341
local state bytes: 3200000 ThreadPool avg:3997
local state bytes: 3200000 Dedicated avg:1428


Fewer worker threads

Fewer worker threads (e.g. 6) increase the probability of an actor's message being scheduled to the "right" thread "by accident", so the cache miss penalty is lower, which lets work stealing perform better relative to "Dedicated" (the fewer threads are used, the smaller the cache advantage of fixed-assigned "Dedicated" threads). Vice versa: as the number of cores involved increases, fixed thread assignment gets ahead.

Worker Threads:6 actors:18 #mem accesses: 100
local state bytes: 64 WorkStealing avg:2073
local state bytes: 64 ThreadPool avg:2498
local state bytes: 64 Dedicated avg:2045
local state bytes: 256 WorkStealing avg:1735
local state bytes: 256 ThreadPool avg:2272
local state bytes: 256 Dedicated avg:1815
local state bytes: 2000 WorkStealing avg:2052
local state bytes: 2000 ThreadPool avg:2412
local state bytes: 2000 Dedicated avg:2048
local state bytes: 4000 WorkStealing avg:2183
local state bytes: 4000 ThreadPool avg:2373
local state bytes: 4000 Dedicated avg:2130
local state bytes: 32000 WorkStealing avg:3501
local state bytes: 32000 ThreadPool avg:3204
local state bytes: 32000 Dedicated avg:2822
local state bytes: 320000 WorkStealing avg:3089
local state bytes: 320000 ThreadPool avg:2999
local state bytes: 320000 Dedicated avg:2543
local state bytes: 3200000 WorkStealing avg:6579
local state bytes: 3200000 ThreadPool avg:6047
local state bytes: 3200000 Dedicated avg:6907

Machine tested:

(real cores no HT)
$ lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 44
Stepping:              2
CPU MHz:               3067.058
BogoMIPS:              6133.20
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              12288K
NUMA node0 CPU(s):     1,3,5,7,9,11
NUMA node1 CPU(s):     0,2,4,6,8,10


Conclusion
  • Performance of executors depends heavily on the use case. There are workloads where cache locality dominates, giving an advantage of up to 30% over the work-stealing executor.
  • Performance of executors varies among different CPU types and models (L1 cache size and the cost of a cache miss matter here).
  • WorkStealing could be viewed as the better overall solution, especially if a lot of L1 cache misses are to be expected anyway.
  • The ideal executor would be WorkStealing with a soft actor-to-thread affinity. This would combine the strengths of both execution schemes and would yield significant performance improvements for many workloads.
  • Vanilla thread pools without work stealing and without actor-to-thread affinity perform significantly worse and should not be used to execute lightweight processes.
Source of Benchmark
