Saturday, June 27, 2015

Don't REST! Revisiting RPC performance and where Fowler might have been wrong ...

[Edit: The title is clickbait, of course. Fowler is aware of the async/sync issues and recently posted http://martinfowler.com/articles/microservice-trade-offs.html with a clarifying section regarding async.]

Hello my dear citizens of planet earth ...

There are many good reasons to decompose large software systems into decoupled message passing components (team size + decoupling, partial + continuous software delivery, high availability, flexible scaling + deployment architecture, ...).

With distributed applications comes the need for ordered, point-to-point message passing. This is different from client/server relations, where many clients send requests at a low rate and the server can choose to scale by processing requests concurrently on multiple threads.
Remote messaging performance is to distributed systems what method invocation performance is to non-distributed, monolithic applications.
(Guess what is one of the most optimized areas in the JVM: method invocation.)

[Edit: with "REST" I also refer to HTTP-based webservice-style APIs; this is somewhat imprecise.]

Revisiting high level remoting Abstractions

There have been various attempts at building high-level, location-transparent abstractions (e.g. CORBA, distributed objects), however in general those ideas have not found broad acceptance.

This article by Martin Fowler sums up the common-sense view pretty well:

http://martinfowler.com/articles/distributed-objects-microservices.html

Though not explicitly written, the article implies synchronous remote calls, where a sender blocks and waits for a remote result to arrive, thereby paying the cost of a full network round trip for each remote call performed.

With asynchronous remote calls, many of the complaints no longer hold true. When using asynchronous message passing, the granularity of remote calls ceases to be significant.

"course grained" processing

remote.getAgeAndBirth().then( (age,birth) -> .. );

is not significantly faster than two "fine grained" calls

all( remote.getAge(), remote.getBirth() ) 
   .then( resultArray -> ... ); 

as both variants include network round trip latency only once.

With synchronous remote calls, on the other hand, every single remote method call pays the penalty of one network round trip, and only then do Fowler's arguments hold.
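To make this concrete, here is a minimal, self-contained Java sketch using CompletableFuture. The "remote" calls are simulated with a 50 ms sleep standing in for one network round trip; all names and timings are made up for illustration and belong to no tested framework:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncGranularity {
    static final ExecutorService NET = Executors.newFixedThreadPool(2);

    // each simulated remote call costs one 50 ms "round trip"
    static CompletableFuture<Integer> getAge() {
        return CompletableFuture.supplyAsync(() -> { sleep(50); return 42; }, NET);
    }
    static CompletableFuture<String> getBirth() {
        return CompletableFuture.supplyAsync(() -> { sleep(50); return "1973-01-01"; }, NET);
    }
    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // both fine-grained requests go out immediately, so their round trips overlap
        String result = getAge()
            .thenCombine(getBirth(), (age, birth) -> age + " / " + birth)
            .join();
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(result);
        // roughly one round trip (~50 ms), not two sequential ones (~100 ms)
        System.out.println(elapsed < 100 ? "latency paid once" : "latency paid twice");
        NET.shutdown();
    }
}
```

The blocking join() is only there to keep the example short; in an actor system the combined result would be processed in a callback instead.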

Another element changing the picture is the availability of "spores": snippets of code that can be passed over the network and executed on the receiver side, e.g.

remote.doWithPerson( "Heinz", heinz -> {
    // executed remotely, streams data back to the caller
    stream( heinz.salaries().sum() / 12 ); finish();
}).then( averageSalary -> ... );

Spores can be implemented efficiently thanks to the availability of VMs and JIT compilation.

Actor systems and similar asynchronous message passing approaches have gained popularity in recent years. The main motivation was to ease concurrency, following the insight that multithreading with shared data does not scale well and is hard to master in an industrial-grade software development environment.

As large servers are in essence "distributed systems in a box", those approaches apply to distributed systems as well.

In the following I'll test the remote invocation performance of some frameworks. I'd like to prove that established frameworks are far from what is technically possible, and to show that popular technology choices such as REST are fundamentally unsuited as the foundation of large, fine-grained distributed applications.


Test Participants

Disclaimer: As the tested products are of medium to high complexity, there is a danger of misconfiguration or test errors, so if anybody has a (verified) improvement to one of the test cases, just drop me a comment or file an issue to the GitHub repository containing the tests:

https://github.com/RuedigerMoeller/remoting-benchmarks.

I verified by searching forums etc. that the numbers are roughly in line with what others have observed.

Features I expect from a distributed application framework:
  • Ideally fully location transparent. At the least, there should be a concise way (e.g. annotations, generators) to semi-automate marshalling.
  • It maps responses to their originating requests automatically (via callbacks, futures/promises or similar).
  • It is asynchronous.

products tested (Disclaimer: I am the author of kontraktor):
  • Akka 2.11
    Akka provides a high level programming interface, marshalling and networking details are mostly invisible to application code (full location transparency).
  • Vert.x 3.1
    provides a weaker level of abstraction compared to actor systems, e.g. there are no remote references. Vert.x has a symbolic notion of network communication (event bus, endpoints).
    As it's "polyglot", marshalling and message encoding need some manual support.
    Vert.x is kind of a platform and addresses many practical aspects of distributed applications such as application deployment, integration of popular technology stacks, monitoring, etc.
  • REST (RestExpress)
    As HTTP 1.1 based REST is limited by latency (synchronous protocol), I chose this one more or less at random.
  • Kontraktor 3, a distributed actor system on Java 8. I believe it hits a sweet spot regarding performance, ease of use and mental-model complexity. Kontraktor provides a concise, mostly location-transparent, high-level programming model (promises, streams, spores) supporting many transports (TCP, HTTP long poll, WebSockets).

Libraries skipped:
  • finagle - requires me to clone and build their fork of thrift 0.5 first. Then I'd have to define thrift messages, then generate, then finally run it. 
  • parallel universe - at the time of writing, the actor remoting was not in a testable state ("Galaxy" is alpha), the examples are without build files, and the Gradle build did not work. Once I managed to build, the programs expected configuration files which I could not find. Maybe worth a revisit (accepting pull requests as well :) ).

The Test

I took a standard remoting example:
The "Ask" test case:
The sender sends a message of two numbers; the remote receiver answers with the sum of those two numbers. The remoting layer has to track and match requests and responses, as there can be tens of thousands "in flight".
The "Tell" test case:
The sender sends fire-and-forget. No reply is sent from the receiver.
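To illustrate what "track and match" means, here is a hypothetical minimal sketch (not the code of any tested framework) of the correlation bookkeeping a remoting layer needs for the ask pattern: each request is assigned an id, a promise is parked under that id, and the incoming response completes it, so thousands of requests can be in flight at once.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class AskCorrelation {
    // pending requests, keyed by request id
    static final Map<Long, CompletableFuture<Long>> pending = new ConcurrentHashMap<>();
    static final AtomicLong ids = new AtomicLong();

    // "send" a request; the remote side is simulated inline here
    static CompletableFuture<Long> askSum(long a, long b) {
        long id = ids.incrementAndGet();
        CompletableFuture<Long> fut = new CompletableFuture<>();
        pending.put(id, fut);
        // pretend the wire delivered (id, a, b) and the receiver answered with the sum
        onResponse(id, a + b);
        return fut;
    }

    // called when a response frame (id, result) arrives from the network
    static void onResponse(long id, long result) {
        CompletableFuture<Long> fut = pending.remove(id);
        if (fut != null) fut.complete(result);
    }

    public static void main(String[] args) {
        System.out.println("sum = " + askSum(20, 22).join());
    }
}
```

In a real remoting layer the map would be per connection and the response would arrive asynchronously from a reader thread; the bookkeeping is the same.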

Results

Attention: Don't miss notes below charts.

Platform: Linux CentOS 7, dual socket, 20 real cores @ 2.5 GHz, 64 GB RAM. As the tests are ordered point-to-point, none of the tests made use of more than 4 cores.

                         tell Sum (msg/second)    ask Sum (msg/second)
Kontraktor Idiomatic     1.900.000                860.000
Kontraktor Sum-Object    1.450.000                795.000
Vert.x 3                 200.000                  200.000
AKKA (Kryo)              120.000                  65.000
AKKA                     70.000                   64.500
RestExpress              15.000                   15.000
Rest >15 connections     48.000                   48.000

let me chart that for you ..



Remarks:

  • Kontraktor 3 outperforms by a huge margin. I verified the test is correct and all messages are transmitted (if in doubt just clone the git repo and reproduce). 
  • Vert.x 3 seems to have built-in rate limiting. I saw peaks of 400k messages/second, however it averaged at 200k (hints for improvement welcome). In addition, the first connecting sender only gets 15k/second throughput; if I stop and reconnect, throughput is as charted.
    I tested the very first Vert.x 3 final release. For marshalling, fast-serialization (FST) was used (also used in Kontraktor). Will update as Vert.x 3 matures.
  • Akka. I spent quite some time on improving the performance, with mediocre results. As Kryo is roughly the same speed as FST serialization, I'd expect at least 50% of Kontraktor's performance.

    Edit:
    Further analysis shows Akka is hit by poor serialization performance. It has an option to use Protobuf for encoding, which might improve results (but why did Kryo not help, then?).

    Implications of using Protobuf:
    * each message needs to be defined in a .proto file, and a generator has to be run
    * frequently additional data transfer happens, like "app data => generated messages => app data"
    * no reference sharing support, so no cyclic object graphs can be transmitted
    * no implicit compression via serialization's reference sharing
    * unsure whether the ask() test profits, as it did not profit from Kryo either
    * Kryo performance is in the same ballpark as Protobuf but did not help that much either

    Smell: I had several people contacting me aiming to improve the Akka results. They somehow disappear.

    Once I find time I might add a Protobuf test. It's a pretty small test program, so if there were an easy fix, it should not be a huge effort to provide it. The git repo linked above contains a Maven-buildable, ready-to-use project.
  • REST. The poor throughput is not caused by RestExpress (which I found quite straightforward to use) but by HTTP 1.1's dependence on latency. If one moves a server to other hardware (e.g. a different subnet, the cloud), the throughput of a service can change drastically due to different latency. This might change with HTTP 2.
    Good news is: You can <use> </any> <chatty> { encoding: for messages }, as it won't make a big difference for point to point REST performance.
    Only when opening many connections (>20) concurrently does throughput increase. This messes up transaction/message ordering, so it can only be used for idempotent operations (a species mostly known from white papers and conference slides, rarely seen in the wild).


Misc Observations


Backpressure

Sending millions of messages as fast as possible can be tricky to implement in a non-blocking environment. A naive send loop
  • might block the processing thread,
  • builds up a large outbound queue, as putting is faster than taking + sending,
  • can prevent incoming callbacks from being enqueued and executed (= deadlock or OOM).
Of course this is a synthetic test case, however similar situations exist, e.g. when streaming large query results or sending large blobs to other nodes (e.g. init with reference data).

None of the libraries (except REST) handled this out of the box:
  • Kontraktor requires a manual increase of queue sizes over the default (32k) in order not to deadlock in the "ask" test. In addition, it is required to programmatically adapt the send rate using the backpressure signal raised by the TCP stack (network send blocks). This can be done non-blocking, "offer()" style.
  • For Vert.x I used a periodic task sending a burst of 500 to 1000 messages. Unfortunately the optimal number of messages per burst depends on hardware performance, so the test might need adaptation when run on e.g. a laptop.
  • For Akka I sent 1 million messages every 30 seconds in order to avoid implementing application-level flow control. It just queues up messages and degrades to something like 50 msg/s when used naively (big loop).
  • REST was not problematic here (synchronous HTTP 1.1 anyway). Degraded by default.
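The "offer()" style send loop can be sketched generically with plain Java threads and a bounded queue standing in for the actor/network layer (all names are made up; this is an illustration of the pattern, not any framework's code): the producer reacts to a full outbound queue instead of blocking or queueing unboundedly.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class OfferStyleSender {
    public static void main(String[] args) throws InterruptedException {
        // bounded outbound queue standing in for the remoting layer's send queue
        ArrayBlockingQueue<long[]> outbound = new ArrayBlockingQueue<>(1024);
        final int N = 100_000;

        // consumer simulates the network writer, deliberately stalling now and then
        Thread writer = new Thread(() -> {
            long sum = 0;
            try {
                for (int i = 1; i <= N; i++) {
                    long[] msg = outbound.take();
                    sum += msg[0] + msg[1];
                    if (i % 10_000 == 0) Thread.sleep(1); // simulated slow network
                }
            } catch (InterruptedException e) { return; }
            System.out.println("receiver sum " + sum);
        });
        writer.start();

        int sent = 0;
        long rejected = 0;
        while (sent < N) {
            // offer() returns false instead of blocking when the queue is full:
            // that is the backpressure signal. A real sender would return to the
            // event loop here rather than spin.
            if (outbound.offer(new long[]{ 1, 2 })) sent++;
            else { rejected++; Thread.yield(); }
        }
        writer.join();
        System.out.println("sent " + sent + ", offers rejected: " + rejected);
    }
}
```

Whenever the simulated network stalls, offer() starts failing and the producer backs off; nothing blocks and no unbounded queue builds up.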



Why is Kontraktor remoting that much faster?
  • premature optimization
  • adaptive batching works wonders, especially when applied to reference-sharing serialization
  • small performance compromises stack up, so reduce them bottom-up
Kontraktor is actually still far from optimal. It generates and serializes a "MethodCall { int targetId, [..], String methodName, Object[] args }" for each remoted message. It does not use Externalizable or other ways of bypassing generic (fast-)serialization.
Throughputs beyond 10 million remote method invocations/sec have been proven possible, at the cost of a certain fragility + complexity (unique ids and distributed systems ...) + manual marshalling optimizations.
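The adaptive batching idea can be sketched generically (an illustration of the concept, not Kontraktor's actual writer loop): block for the first message, then drain whatever else has accumulated since the last flush and send it as one batch, so batch size automatically grows under load and shrinks when idle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AdaptiveBatcher {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4096);
        // simulate a burst of 1000 queued messages
        for (int i = 0; i < 1000; i++) queue.put(i);

        List<Integer> batch = new ArrayList<>();
        int batches = 0, messages = 0;
        while (messages < 1000) {
            batch.clear();
            // block for at least one message, then grab everything else available
            batch.add(queue.take());
            queue.drainTo(batch, 511); // cap batch size at 512 messages
            // a real remoting layer would serialize + write the whole batch here,
            // letting reference-sharing serialization compress it as one unit
            messages += batch.size();
            batches++;
        }
        System.out.println(messages + " messages in " + batches + " batches");
    }
}
```

Under light load each "batch" is a single message (no added latency); under heavy load batches approach the cap and serialization/syscall overhead is amortized across hundreds of messages.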


Conclusion

  • As scepticism regarding distributed object abstractions is mostly performance-related, high-performance asynchronous remote invocation is a game changer.
  • Popular libraries have room for improvement in this area.
  • Don't use REST/HTTP to connect systems in (micro-)service oriented architectures: point-to-point performance is horrible. It has its applications in the area of (WAN) web services, platform-neutral, easily accessible APIs, and client/server patterns.
  • Asynchronous programming is different and requires different/new solution patterns (at the source code level). Learning to use asynchronous messaging primitives is unavoidable.
    "Pseudo-synchronous" approaches (e.g. fibers) are good for scaling multithreading, but do not work out for distributed systems.
  • Lack of craftsmanship can kill visions.


12 comments:

  1. How about

    Data data = remote.getData();
    data.getAge(), data.getBirth()

    1. That's the "use coarse-grained APIs to overcome latency" approach.

      The problem is, if there are say 80 columns instead of 2, the "getData()" thingy will be expensive encoding- and bandwidth-wise. This usually results in methods like getUpperHalfData(), getLowerHalfData() and getDetailData() being added to the remote interface, still transferring way more data than actually required. It helps reduce latency effects, however it is still an (inefficient) kludge that cannot be applied in general throughout a system.

      With sync calls a thread gets blocked, a very limited resource. One cannot go beyond, say, ~1000 concurrent requests, though your hardware can easily support tens of thousands of concurrent requests/transactions CPU- and network-wise. Just imagine what throughput you would get with the sum test above using sync RPC (estimate: 5..50k depending on network latency).

      In theory this also applies to inter-thread communication (replace threads with actors), however for most applications threads work fine and devs are used to them. So: lower reward, same effort in that area.

  2. Reviewing your library and benchmarks, it appears that there is bias in favor of kontraktor. Even though you claim to have spent significant time optimizing Akka this isn't evident from the benchmark. Note that I don't use any of these libraries and am just eyeballing the code, but from what I see its an invalid comparison.

    The benchmarks are of a single actor with one sender and one consumer. This single-producer / single-consumer (SPSC) throughput test is not a good indicator of concurrency and doesn't model a real application's behavior. This favors kontraktor's design choices much more than Akka's.

    Akka uses ForkJoinPool as its default executor, which is good at handling lots of actors being enqueued/dequeued quickly. Kontraktor uses a ThreadPoolExecutor with an unbounded LinkedBlockingQueue, which typically has worse performance. However this isn't evident in the benchmark because the executor is never contended and threads are not context switching rapidly.

    Akka is configured with the default mailbox, a ConcurrentLinkedQueue, even though the documentation recommends using a SingleConsumerOnlyUnboundedMailbox when a BalancingDispatcher is not used. Kontraktor uses a multiple-producer / single-consumer (MPSC) bounded array queue, which allows it to discard messages when overflowing (whereas Akka may not). The array queue performs excellently because it takes advantage of caching effects and is not contended, but at 512k size it is heavyweight. The reason Akka prefers a linked MPSC queue is to avoid wasted memory when having millions of actors in the system (at least that was a condition Viktor Klang had when asking for an algorithm on concurrency-interest).

    An MPSC linked queue can outperform an array queue under contention by using a combining backoff arena [1]. My implementation is 2.5x faster on a 4-core laptop and matches the performance with no contention (and beats it in the "optimistic" mode). That gap would increase on a larger machine because the caching effects are offset by producer contention. The array queue is faster than ConcurrentLinkedQueue with low/mild contention, and similar at high.

    I am sure there are many other differences in configuration and design choices that make the comparison unfair. We all know benchmarking is hard, but you don't acknowledge that or highlight any caveats of your scenario. There's nothing to indicate that you reached out for a peer review or profiled to understand the performance behavior of the different libraries.

    Like I said I don't use any of these libraries. Unfortunately if I decided to then I couldn't justify a choice based on your benchmarks because I think your analysis is flawed. Instead I'd have to look at other benchmarks, the api / feature set, and the community. I might even hold the benchmarks against you after seeing many other projects knowingly lie for self promotion, and I hold integrity as one of the primary characteristic an engineer must display.

    It would be really interesting to see accurate benchmarks and I hope your current set don't mislead you in development of your library.

    [1] https://github.com/ben-manes/caffeine/wiki/SingleConsumerQueue

    1. Hello Ben,

      deleted my previous reply as i have a tendency to rage hard at times :)

      1. I actually would be happy if Akka would show better results, from experience its:
      Be 2 times faster and you'll get appreciated, be >10 times faster and you are a troll.

      2. Point-to-point remoting performance is the foundation of good scale-out performance. That's why I am testing the basic pattern. In addition,
      for many applications serial, ordered streaming is of high importance.

      3. If a benchmark is utterly wrong, I get a verified correction the next day. If my benchmark tells the truth, people come up with all kinds of half-baked + mystic advice. I am aware I can be wrong and I am capable of accepting + publishing that. Just show me ze code.
      If you check the forums you'll see what I measured is in line with user complaints.

      4. It seems you missed that I am measuring **remoting** performance. Queue performance is mostly irrelevant for this test, as most CPU time is spent on marshalling and networking. So your argument is basically off topic ;). That's why using Kryo made a difference and changing dispatchers did not (I tried, see https://github.com/RuedigerMoeller/remoting-benchmarks/blob/master/akka/Server/src/main/resources/application.conf).
      Akka probably suffers from naively implemented marshalling and networking. Process-local messaging is on par. At a mere 70k-100k messages per second, contention is not an issue ..

      5. Kontraktor uses a bounded MPSC queue of JCTools, it does not use a ThreadPool based scheduler (the pool in Actors.java is used to isolate blocking calls)

      6. Your ConcurrentLinkedQueue seems like nice work, I hope you are not "lying for self promotion" ;)

      7. Citing you: "... your analysis is flawed. .. projects knowingly lie for self promotion ..."

      This kind of badmouthing after displaying utter failure to understand the determining factors of distributed messaging performance .. hm .. questionable. It is also naive to believe some threadpool/configuration magic will skyrocket point-to-point remoting performance by a factor of 10 (hint: it doesn't).

      8. I'll tell you what a misleading benchmark is: "High Performance. 50 million msg/sec on a single machine", cited from the Akka front page. I don't know why tech companies spend 80% on marketing and 20% on software development, but I can show you the results of that (hint: scroll up).

      9. Akka's remoting slowness, from what I can see, stems from some implementation/detail design mistakes, candidates being:

      1) definition of serializer libs per class (BTW: errors with object sharing) => cannot batch at the serialization level.
      2) even when using Kryo, it seems JDK serialization gets used for some classes. Instantiating + initializing a JDK ObjectInputStream already consumes ~14 microseconds => limits the message rate to 71428 per second (scroll up and compare ..).

      > Like I said I don't use any of these libraries.

      I don't buy that, even if you tell it 3 times (lol).

  3. This comment has been removed by the author.

    1. I have done some further analysis and my suspicion was confirmed: Akka suffers from poor serialization performance. So it's actually not Akka performing poorly but the way it serializes data. I'd be interested in having a protocol buffers benchmark.

      On the other hand: Protobuf serialization requires some explicit work. It's a significant drop in productivity if each message being remoted has to be declared in a .proto file and Protobuf wrappers have to be generated. In a real application one then frequently has to transform application data => protobuf wrapper => send, receive message => move protobuf wrapper data to application-level data. One rarely uses generated wrappers directly in application code (cannot add methods, because generated :) ).

      In addition Protobuf does not support shared references, so one cannot simply pass cyclic and complex objects. It will also be interesting whether the ask() test profits; from the Kryo example one can see that only the "tell" benchmark improves.

      I'll add a disclaimer to the post, as I agree it's unfair to put an "Akka is slow" stamp on it instead of "serialization with Akka is slow".

      On another note: fast-serialization as used by Kontraktor outperforms Protobuf by a good margin (especially in high-load batching scenarios). However, the gap might close.

  4. You are discussing about of good techniques by this post. Remote messaging is best technique and this information is clear my doubts about it. I learn Java from Best Java Development Course Training Institute in Jaipur so its information is very important for me. Please keep updating more information about it.

  5. Great article, learned couple of things and clear couple of misconception. thanks..

  6. I am experimenting with Kontraktor to build a simulation system for testing a component. I'd be interested in understanding better how to use the different actor 'calling' mechanisms. Here is my problem. I have a plain java class that loops through 96 timeslots each representing 15 minutes. During each slot metadata determines the number of each of 2 types of messages that will be sent, either to a MQ queue or a file or both. These are 'metered across the 15 minutes, so sending 900 messages in a 15 minute slot would mean sending 1 per second.

    How I implemented this is a loop, in which I use first mesageType1.sendForTheseIds(ArrayList); and then mesageType2.sendForTheseIds(ArrayList); The message sending Actors meter the output of their messages across the 15 minutes. The loop also has a 15 minute sleep before continuing. Given this sleep I do not have issues with messages coming out of order so did no use onSerial.

    Is there a better way to implement this scenario, better using the actor system?

    1. I meant "did not use serialOn" in that 2nd to last sentence

    2. "serialOn" is a very special method used to prevent out-of-order execution of code performing async IO or networking calls. In general, of course, an actor is FIFO.
      Using sleep inside an actor thread is a no-go; there are better ways to implement cyclic tasks (e.g. self().delayed( 1000, () -> loopMethod() )).

      Can we continue this discussion on GitHub? I'd prefer to see your full source code, as I am not completely sure I understand what exactly you are doing :).

      Just enter an issue on the Kontraktor project.

  7. Seems to be very informative blog. Comment section clears the doubt. Thanks for uploading.
