tag:blogger.com,1999:blog-43564022235817680632024-03-18T10:48:17.324+01:00Java is the new CHigh performance Java, realtime distributed computing, other stuff.Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.comBlogger25125tag:blogger.com,1999:blog-4356402223581768063.post-81213899982111103872017-12-31T00:32:00.002+01:002017-12-31T00:32:26.309+01:00A fast React Live Editing implementation<h2 style="clear: both; text-align: center;">
A fast React Live Editing implementation</h2>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.juptr.io/imgcache/hosted/moru0011/9ccba060-e725-4185-a34d-1905bbfc7a5b.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="250" data-original-width="800" height="199" src="https://www.juptr.io/imgcache/hosted/moru0011/9ccba060-e725-4185-a34d-1905bbfc7a5b.jpg" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px;">Hi folks, part of my personal mission to integrate good ol' Java with modern JavaScript frameworks was implementing "Hot Reload". I am not talking about "automatic browser refresh" but about live-patching the code of a React client without losing component state.</span><br style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif;" /><br />
<a href="https://www.juptr.io/juptrblogs/_57034b5b-fa76-4ef3-b6a8-7c715ee4f1a8.html">This blog has moved, continue reading here</a>Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com88tag:blogger.com,1999:blog-4356402223581768063.post-87135737682486309052017-08-15T22:18:00.001+02:002017-08-15T22:18:38.062+02:00Developing a React+JSX app from Java without node, webpack, and babel <br />
<div class="elem" id="main-image" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<div class="noemph" style="border-left: 12px solid transparent;">
</div>
</div>
<br />
<div class="elem" style="orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; widows: 2;">
<div class="noemph editable juptr-abstract-editor" style="border-left: 12px solid transparent; line-height: 1.6em; margin: 0px; padding: 0px;">
<div class="style-scope juptr-abstract-editor editable" style="line-height: 1.6em; margin: 0px; padding: 0px;">
<div style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
Implementing a web app server-side using Java has been getting ugly recently, as it requires running a full Node.js stack including a load of npm packages and JavaScript-related tooling. By implementing bundling and JSX transpilation in native Java, kontraktor can help out and make things simple again (#MTSA).</div>
<div style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<img height="360" src="https://www.juptr.io/imgcache/hosted/moru0011/775dc7e4-e93e-4bf7-8f71-38212129035d.jpg" width="640" /></div>
<div style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
</div>
<div class="elem" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<div class="noemph editable juptr-paragraph-editor" style="border-left: 12px solid transparent; font-size: 18px; line-height: 1.6em; margin: 0px; padding: 0px;">
<div class="style-scope juptr-paragraph-editor editable" style="font-size: 18px; line-height: 1.6em; margin: 0px; padding: 0px;">
</div>
</div>
</div>
<div style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
</div>
<div class="elem" id="main-image" style="orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; widows: 2;">
<div class="noemph" style="border-left: 12px solid transparent;">
<div class="style-scope juptr-image-editor editable" id="sidediv" style="display: inline-block; padding: 8px 0px; vertical-align: top; width: 688.002px;">
<div align="center" class="fullbleed style-scope juptr-image-editor editable" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; padding: 8px 0px; text-transform: none; white-space: normal; word-spacing: 0px;">
You can't have a Java article without picturing a cup of coffee ..</div>
Gone are the days of server side web apps, though there are apparently still people out there trying to get something fancy out of .. [this blog has moved, <span style="font-size: large;"><a href="https://www.juptr.io/juptrblogs/_e67b0d1e-98f8-44fb-b431-845005f36129.html">continue reading here</a>]</span><br />
<br /><div align="center" class="fullbleed style-scope juptr-image-editor editable" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 16px; font-style: italic; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; margin: 0px; padding: 8px 0px; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
</div>
</div>
</div>
</div>
</div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com76tag:blogger.com,1999:blog-4356402223581768063.post-69920450992213386372017-02-25T02:24:00.001+01:002017-02-25T02:25:30.245+01:00Move Code Not Data: 'Sandboxing' using Remote Lambda Execution<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.juptr.io/imgcache/hosted/moru0011/42d7e787-370f-45e2-9a7a-e9f391f9654e.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="372" src="https://www.juptr.io/imgcache/hosted/moru0011/42d7e787-370f-45e2-9a7a-e9f391f9654e.jpg" width="640" /></a></div>
<br />
<div style="text-align: center;">
<span style="background-color: white; color: #333333; font-family: "roboto" , "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 18px; font-style: italic;">Leveraging remote lambda execution to avoid being IO bound in a data intensive </span><strike style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">micro</strike><span style="background-color: white; color: #333333; font-family: "roboto" , "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 18px; font-style: italic;">service system. </span></div>
<div style="text-align: left;">
<span style="background-color: white; color: #333333; font-family: "roboto" , "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 18px; font-style: italic;"><br /></span>
<span style="background-color: white; color: #333333; font-family: "roboto" , "helvetica neue" , "helvetica" , "arial" , sans-serif; font-size: 18px; font-style: italic;"><br /></span></div>
<div style="text-align: left;">
<span style="color: #333333; font-family: "roboto" , "helvetica neue" , "helvetica" , "arial" , sans-serif;"><span style="background-color: white; font-size: 18px;">This blog has moved. <a href="https://www.juptr.io/juptrblogs/a2b6139f-b273-42bd-b533-e4695e27e353.html" target="_blank">Read the full article here</a>.</span></span></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com19tag:blogger.com,1999:blog-4356402223581768063.post-3126230702117470412016-11-29T19:19:00.002+01:002016-11-29T19:19:39.581+01:00Practice Report: A webapp with Polymer.js<img height="300" src="https://www.juptr.io/imgcache/hosted/moru0011/02515a19-157f-4396-94a0-612281f21ee3.jpg" width="640" /><br />
<br />
<br />
<span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">Part </span><b style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">two</b><span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;"> of a series of our journey building </span><a href="https://www.juptr.io/" style="background-color: white; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;" target="_blank">juptr.io</a><span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">:</span><br style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;" /><span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">Our experiences building a front end with google's </span><b style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">Polymer.js</b><br />
<b style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;"><br /></b>
<div class="elem" style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif;">
<div class="noemph editable juptr-subtitle-editor" style="border-left: 12px solid transparent; font-size: 24px; font-weight: bold; margin: 0px; padding: 20px 4px 12px 0px;">
<div class="style-scope juptr-subtitle-editor editable" style="padding: 20px 4px 12px 0px;">
Why Polymer ?</div>
</div>
</div>
<div class="elem" style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif;">
<div class="noemph editable juptr-paragraph-editor" style="border-left: 12px solid transparent; font-size: 18px; line-height: 1.6em; margin: 0px; padding: 0px;">
<div class="style-scope juptr-paragraph-editor editable" style="line-height: 1.6em; padding: 0px;">
</div>
There are numerous alternatives when choosing a framework to build web front ends, some of the most popular probably being ... <a href="https://www.juptr.io/juptrblogs/_50500fd7-1a89-468c-8347-eff21789d52b.html" style="color: black; font-style: italic; text-decoration: none;" target="_blank">[this blog has moved, continue reading here]</a></div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com16tag:blogger.com,1999:blog-4356402223581768063.post-73773211198668566952016-11-12T18:11:00.000+01:002016-11-12T18:11:12.448+01:00Make distributed programming great again: Microservices with Distributed Actor Systems<div class="style-scope juptr-abstract-editor editable" style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; line-height: 1.6em; padding: 4px;">
Part one of a series of our journey building juptr.io: </div>
<div style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">
How a fundamental abstraction streamlines, simplifies and speeds up developing a web scale system</div>
<div style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">
<img height="370" src="https://www.juptr.io/imgcache/hosted/moru0011/b560430c-fc40-483a-a93d-018d615d705e.jpg" width="640" /></div>
<div style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic;">
</div>
<div class="elem editable juptr-paragraph-editor" style="font-size: 18px; line-height: 1.6em; margin: 0px; padding: 0px;">
<div class="style-scope juptr-paragraph-editor editable" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: normal; letter-spacing: normal; line-height: 1.6em; margin: 0px; orphans: 2; padding: 0px; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
</div>
<br />
<div class="elem" id="main-image">
<div class="style-scope juptr-image-editor editable" id="sidediv" style="display: inline-block; font-size: 16px; font-style: italic; padding: 8px 0px; vertical-align: top; width: 700px;">
<div align="center" class="fullbleed style-scope juptr-image-editor editable" style="font-size: 16px; font-style: italic; margin: 0px; padding: 8px 0px;">
goes higher, faster and avoids dirty feet :)</div>
<div class="fullbleed style-scope juptr-image-editor editable" style="font-size: 16px; font-style: italic; margin: 0px; padding: 8px 0px; text-align: left;">
<span data-auto-link="true" data-href="http://Juptr.io" style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: normal;"><a href="http://juptr.io/" target="_blank">Juptr.io</a></span><span style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: normal;"> is a content personalization, sharing and authoring platform. We crawl tens of thousands of blogs and media sites (German + English), then classify the documents to enable personalized content curation, consumption & discussion.</span></div>
<div style="background-color: white; color: #333333; font-family: Roboto, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 18px; font-style: normal;">
This requires<br />
<div>
<ul>
<li>to crawl, store, query, analyze and classify millions of documents</li>
<li>horizontal scalability</li>
<li>a failure resistant distributed (micro)services architecture</li>
<li>Java/JavaScript interoperability</li>
</ul>
Compared to ordinary enterprise-level projects, a startup has much tighter time and budget constraints (especially a startup in Europe), so reducing development effort was a major priority.<br />
We decided to use the following foundation stack .. <a href="https://www.juptr.io/juptrblogs/c7839e9a-da8b-4a60-bc66-94558c30c4da.html" style="color: black; font-style: italic; text-decoration: none;" target="_blank">[this blog has moved, continue reading here]</a></div>
</div>
</div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com20tag:blogger.com,1999:blog-4356402223581768063.post-77021993984561991672016-10-19T21:53:00.003+02:002016-10-19T21:53:42.407+02:00<div class="elem editable juptr-title-editor" style="background-color: white; color: #333333; font-family: Roboto, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 40px; font-weight: 900; padding-bottom: 18px; padding-top: 32px;">
<div class="style-scope juptr-title-editor editable" style="padding-bottom: 18px; padding-top: 32px;">
Word disambiguation combining LDA and word embeddings</div>
</div>
<div class="elem editable juptr-abstract-editor" style="background-color: white; color: #333333; font-family: Roboto, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; line-height: 1.6em; margin: 0px; padding: 4px;">
<div class="style-scope juptr-abstract-editor editable" style="line-height: 1.6em; padding: 4px;">
<b>Topical search</b> tries to take into account the occurrence of search-term-related words to avoid matching irrelevant documents. This post outlines how LDA can be used to disambiguate a word set obtained using word embeddings.</div>
<div class="style-scope juptr-abstract-editor editable" style="line-height: 1.6em; padding: 4px;">
<img src="https://www.juptr.io/imgcache/hosted/moru0011/036c27c2-6d07-4bc7-9fc1-6d4017e5cbf1.jpg" /></div>
<div class="style-scope juptr-abstract-editor editable" style="line-height: 1.6em; padding: 4px;">
<a href="https://www.juptr.io/juptrblogs/5c39f627-1338-43aa-8e6b-f5b987ab4a15.html" target="_blank">[this blog has moved, continue reading here]</a></div>
<div class="style-scope juptr-abstract-editor editable" style="line-height: 1.6em; padding: 4px;">
<br /></div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com15tag:blogger.com,1999:blog-4356402223581768063.post-12941312898890543382016-10-19T21:51:00.006+02:002016-10-19T21:51:48.229+02:00<div class="elem editable juptr-title-editor" style="background-color: white; color: #333333; font-family: Roboto, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 40px; font-weight: 900; padding-bottom: 18px; padding-top: 32px;">
<div class="style-scope juptr-title-editor editable" style="padding-bottom: 18px; padding-top: 32px;">
Texting Bots: The command line interface rebranded?</div>
</div>
<br />
<div class="elem" id="main-image" style="background-color: white; color: #333333; font-family: Roboto, 'Helvetica Neue', Helvetica, Arial, sans-serif;">
</div>
<br />
<div class="elem editable juptr-abstract-editor" style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; font-family: Roboto, 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; font-style: italic; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 1.6em; margin: 0px; orphans: auto; padding: 4px; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">
<div class="style-scope juptr-abstract-editor editable" style="font-size: 18px; font-style: italic; line-height: 1.6em; margin: 0px; padding: 4px;">
Texting bots and chat bots are all the rage, but is typing to a mostly dumb AI really how users want to interact with software? Maybe other aspects besides AI and texting are the real killer feature of chat bots.</div>
<div class="style-scope juptr-abstract-editor editable" style="font-size: 18px; font-style: italic; line-height: 1.6em; margin: 0px; padding: 4px;">
<img height="383" src="https://www.juptr.io/imgcache/hosted/moru0011/2e1e9b2a-1448-4f03-82e4-b18f0881964b.jpg" width="640" /></div>
<div class="style-scope juptr-abstract-editor editable" style="font-size: 18px; font-style: italic; line-height: 1.6em; margin: 0px; padding: 4px;">
[<a href="https://www.juptr.io/juptrblogs/0fbd479f-23ed-4483-8819-95112ea5e665.html" target="_blank">blog has moved, continue here</a>]</div>
<div class="style-scope juptr-abstract-editor editable" style="font-size: 18px; font-style: italic; line-height: 1.6em; margin: 0px; padding: 4px;">
<br /></div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com13tag:blogger.com,1999:blog-4356402223581768063.post-7478814760318494832015-11-20T13:11:00.001+01:002015-11-20T20:55:47.727+01:00Clarification and Fix for Serialization ExploitThere have been reports of a Java serialization-based exploit (e.g. <a href="http://www.infoq.com/news/2015/11/commons-exploit" rel="nofollow" target="_blank">here</a>), enabling the attacker to execute arbitrary code. As usual, there is a lot of half-baked reasoning about this issue, so I'd like to present a fix for my fast-serialization library as well as tweaks to block deserialization of certain classes for stock JDK serialization.<br />
<br />
In my view the issue is not located in Java serialization itself, but in insecure programming in some common open source libraries.<br />
<br />
<blockquote class="tr_bq">
<i><span style="font-size: large;">Attributing this exploit to Java Serialization is wrong; the exploit requires door-opening non-JDK code to be present on the server.</span></i></blockquote>
<br />
Unfortunately some libraries accidentally do this; however, it's very unlikely this is a widespread (mis-)programming pattern.<br />
<br />
<br />
<span style="font-size: large;"><b>How can a server-side class open the door?</b></span><br />
<br />
<br />
<ul>
<li>the class has to define readObject() </li>
<li>the readObject() method has to be implemented in a way that allows executing code sent from the remote side</li>
</ul>
<br />
I struggle to imagine a sane use case where one would do something along the lines of<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">readObject(ObjectInputStream in) { Runtime.getRuntime().exec(in.readUTF()); }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">or</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">readObject(ObjectInputStream in) { defineAndExecuteClass(in.readBytes()); }</span><br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
Anyway, it happened. Again: something like the code above must be present on the server-side classpath; it cannot be injected from outside.<br />
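To make the point concrete, here is a minimal, purely illustrative sketch (the class names <span style="font-family: "courier new" , "courier" , monospace;">DoorOpener</span> and <span style="font-family: "courier new" , "courier" , monospace;">DeserializationDemo</span> are made up for this demo): a readObject() defined by any class on the classpath runs during deserialization, before the caller ever casts or inspects the result.

```java
import java.io.*;

// Illustrative only: shows that readObject() of any class on the classpath
// runs during deserialization, before the caller inspects the result.
class DoorOpener implements Serializable {
    private static final long serialVersionUID = 1L;
    static boolean sideEffectRan = false;
    String command;

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // An insecure implementation would do something like
        //   Runtime.getRuntime().exec(command);
        // here we only record that attacker-influenced code was reached.
        sideEffectRan = true;
    }
}

public class DeserializationDemo {
    public static void main(String[] args) throws Exception {
        DoorOpener d = new DoorOpener();
        d.command = "any attacker supplied string";
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(d);
        oos.close();

        // The "server" merely calls readObject() on untrusted bytes ...
        new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray())).readObject();

        // ... and DoorOpener.readObject() has already executed.
        System.out.println("side effect ran: " + DoorOpener.sideEffectRan);
    }
}
```

The side effect fires even though the caller never references DoorOpener explicitly; this is why a vulnerable class merely being on the classpath is enough.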
<br />
<br />
<br />
<b><span style="font-size: large;">Fixing this for <a href="https://github.com/RuedigerMoeller/fast-serialization" target="_blank">fast-serialization</a></span></b><br />
<br />
<br />
<ul>
<li>fast-serialization 2.43 introduces a delegate interface, which enables blacklisting/whitelisting of classes, packages, or whole package trees. By narrowing down the classes allowed for serialization, one can ensure only expected objects are deserialized.</li>
</ul>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3Mr-s_X-lt1ovc2SH0sr5xwFC7-p76AtqOZGSvBuoas8asehug93cbU3Vw9AHQFRgs-oD3KXpzWmaMhO8E5pQqHJbUxr-GAOKT3atfSFkRPkZTp-S2WJhj0fBkP9wqNOrY6VHNBwLil8/s1600/Screenshot+from+2015-11-20+12%253A47%253A48.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3Mr-s_X-lt1ovc2SH0sr5xwFC7-p76AtqOZGSvBuoas8asehug93cbU3Vw9AHQFRgs-oD3KXpzWmaMhO8E5pQqHJbUxr-GAOKT3atfSFkRPkZTp-S2WJhj0fBkP9wqNOrY6VHNBwLil8/s1600/Screenshot+from+2015-11-20+12%253A47%253A48.png" /></a></div>
<br />
<br />
<ul>
<li>if you cannot upgrade to fst 2.43 for backward compatibility reasons, at least register a custom serializer for the problematic classes (see exploit details). As registered custom serializers are processed by fast-serialization <b>before</b> falling back to JDK serialization emulation, throwing an exception in the <b>read</b> or <b>instantiate</b> method of the custom serializer will effectively block the exploit.</li>
<li>There is no significant performance impact, as this is executed on initialization only.</li>
</ul>
<br />
<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Hotfixing stock JDK Serialization implementation</span></b><br />
<br />
(Verified for JDK 1.8 only)<br />
<br />
Use a subclass of ObjectInputStream instead of the original implementation and add blacklisting/whitelisting by overriding the following method of ObjectInputStream:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2thAYSXBnPPZVAXjshYFDL2fp-KsADONFfnToEUctDFfNjSLhFLs20T4h8tEtsJOBTPqN-TreBtMkBFpOGgE5HUDHP7XsQkweHYdai0ONrqjHMY2rMClOarBiqVNbirzuGlbiQabOWg/s1600/Screenshot+from+2015-11-20+12%253A49%253A18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhF2thAYSXBnPPZVAXjshYFDL2fp-KsADONFfnToEUctDFfNjSLhFLs20T4h8tEtsJOBTPqN-TreBtMkBFpOGgE5HUDHP7XsQkweHYdai0ONrqjHMY2rMClOarBiqVNbirzuGlbiQabOWg/s1600/Screenshot+from+2015-11-20+12%253A49%253A18.png" /></a></div>
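As a sketch of the approach shown in the screenshot (the subclass name and the whitelist contents are my own; only ObjectInputStream.resolveClass and InvalidClassException are actual JDK API), a whitelisting stream could look like this:

```java
import java.io.*;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of a whitelisting ObjectInputStream. The class name and the
// whitelist entries are illustrative; adapt the set to the classes your
// application actually expects to deserialize.
public class WhitelistingObjectInputStream extends ObjectInputStream {

    private static final Set<String> ALLOWED = new HashSet<String>(Arrays.asList(
        "java.lang.Integer",
        "java.lang.Number",
        "java.util.ArrayList"
    ));

    public WhitelistingObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        // Reject any class descriptor not explicitly whitelisted, before
        // the JDK gets a chance to load and instantiate the class.
        if (!ALLOWED.contains(desc.getName())) {
            throw new InvalidClassException(desc.getName(),
                "blocked by deserialization whitelist");
        }
        return super.resolveClass(desc);
    }
}
```

Deserializing a whitelisted Integer then succeeds, while any other class (e.g. java.util.Date) is rejected with an InvalidClassException before it is loaded or instantiated.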
<br />
<br />
(maybe also check <span style="background-color: #344134; color: #a9b7c6; font-family: "source code pro";"><b>resolveProxyClass</b></span>, I'm not sure whether this can also be exploited [probably not]).<br />
This will have a performance impact, so it's crucial to have an efficient implementation of blacklisting/whitelisting here.<br />
<br />
<br />
<br />
<br />Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com20tag:blogger.com,1999:blog-4356402223581768063.post-44853187420884120852015-07-31T00:24:00.002+02:002015-08-04T20:15:00.056+02:00Polymer WebComponents Client served by Java ActorsFront-end development productivity has been stalling for a long time. I'd even say that, productivity-wise, there has been no substantial improvement since the days of Smalltalk (Bubble Event Model, MVC done right, component based).<br />
<br />
Weirdly enough, for a long time things got worse: document-centric web technology made classic and proven component-centric design patterns hard to implement.<br />
<br />
For reasons unknown to me (fill in some major raging), it took some 15 years until a decent and well-thought-out web component standard (proposal) arose: <b><a href="http://webcomponents.org/">WebComponents</a></b>. It's kind of an umbrella over several standards, most importantly:<br />
<ul>
<li>Html Templates - the ability to have markup snippets which are not "seen" or processed by the browser but can be "instantiated" programmatically later on.</li>
<li>Html Imports - finally the ability to include a bundle of CSS, HTML and JavaScript, implicitly resolving dependencies by not loading an import twice. Style does not leak out to the main document.</li>
<li>Shadow Dom</li>
</ul>
It would not be a real web technology without some serious drawbacks: Html Imports are handled by the browser, so loading a webcomponent app that resolves its dependencies via imports triggers many http requests.<br />
As the Html standard is strictly separated from the Http protocol, simple solutions such as server-side imports (to reduce the number of http requests) can never be part of an official standard.<br />
<br />
Anyway, with HTTP/2's pipelining and multiplexing features, html imports won't be an issue within some 3 to 5 years (..); until then some workarounds and shims are required.<br />
<br />
<b>Github repo </b>for this blog "Buzzword Cloud": <a href="https://github.com/RuedigerMoeller/kontraktor-polymer-example" target="_blank"><b><span style="color: #0b5394;">kontraktor-polymer-example</span></b></a><br />
<br />
<b>Live Demo</b>: <a href="http://46.4.83.116/polymer/" target="_blank"><b><span style="color: #0b5394;">http://46.4.83.116/polymer/</span></b></a> (just sign in with anything except user 'admin'). Requires WebSocket connectivity and a modern browser (uses webcomps-lite.js, so no IE). Curious whether it survives ;-).<br />
<b><br /></b>
Currently Chrome and Opera implement WebComponents; however, there is a well-working polyfill for Mozilla + Safari and a last-resort polyfill for IE.<br />
Using a server-side html import shim (as provided by kontraktor's http4k), web-component-based apps run on Android 4.x and iOS Safari.<br />
<br />
<b><span style="font-size: large;">(<a href="https://www.polymer-project.org/1.0/">Polymer</a>) WebComponents in short</span></b><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJcwuWybiLFFQf6ycGI5K2h7B9WmZd7ch-0PJ3hyphenhyphenqmUpOPjWBiUamO4xoKyiqbz34xKSJ2jDWeio5jBAWunMCeLbm7gDONZS9dSIQXpF5mzP_Dm-9neb5wxhk4RGzDs-QlzNVW35rD8HQ/s1600/webcomp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJcwuWybiLFFQf6ycGI5K2h7B9WmZd7ch-0PJ3hyphenhyphenqmUpOPjWBiUamO4xoKyiqbz34xKSJ2jDWeio5jBAWunMCeLbm7gDONZS9dSIQXpF5mzP_Dm-9neb5wxhk4RGzDs-QlzNVW35rD8HQ/s1600/webcomp.png" /></a></div>
<div style="text-align: center;">
<i><br /></i></div>
<div style="text-align: center;">
<i>a web component with dependencies (</i><span style="font-family: Courier New, Courier, monospace; font-size: x-small;">link rel='import'</span><i>), embedded style, a template and code</i></div>
<br />
<b><br /></b>
<b>Dependency Management with Imports ( "<link rel='import' ..>" )</b><br />
<br />
The browser does not load an imported document twice and evaluates html imports linearly. This way dependencies are resolved in the correct order implicitly. An imported html snippet may contain (nearly) arbitrary html such as scripts, css, .. most often it will contain the definition of a web component.<br />
<br />
<b>Encapsulation</b><br />
<b><br /></b>
Styles and templates defined inside an imported html component do not "leak" to the containing document.<br />
Web components support data binding (one/two way). Typically a Web component coordinates its direct children only (encapsulation). Template elements can be easily accessed with "this.$.elemId".<br />
<div>
<br /></div>
<b>Application structure</b><br />
<b><br /></b>
An application also consists of components. Starting from the main-app component, one subdivides an app hierarchically into smaller subcomponents, which has a nice self-structuring effect, as one creates reusable visual components along the way.<br />
<br />
The main <b>index.html</b> typically refers to the main app component only (kr-login is an overlay component handling authentication with the kontraktor webapp server):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUsE2sB0qyQkubySbd0TEXGOOv6biMzVCisg9MEXFISuMiCyE1M0LARr8nf6MQDhyphenhyphen3dDerwCLXDGpLTRis3SMwZkJArQYpPxf4ht2NNvQsLrbuVJebsCNaOHvzJLdgkfybQFCOF0vZSds/s1600/app.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUsE2sB0qyQkubySbd0TEXGOOv6biMzVCisg9MEXFISuMiCyE1M0LARr8nf6MQDhyphenhyphen3dDerwCLXDGpLTRis3SMwZkJArQYpPxf4ht2NNvQsLrbuVJebsCNaOHvzJLdgkfybQFCOF0vZSds/s1600/app.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
That's nice and pretty straightforward .. but let's have a look at what my simple sample application's footprint looks like:<br />
<br />
<div style="text-align: center;">
<span style="font-size: x-large;"><b>Web Reality:</b></span></div>
<div style="text-align: center;">
<span style="font-size: x-large;"><b><br /></b></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif9AMnnkqBRFW6OT3WmewAHtxVDpwuhaKwe-Xf7DB3Iz-tGaogTqVYHcSiA65K0HWqJ78GnE3ucWzORsOPh37dcBMtbQADnhhJGGbx-fYoe-ZL9qHX2U5i3R9v4cSJSbP4QHVbPdd8Uwc/s1600/chrome.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif9AMnnkqBRFW6OT3WmewAHtxVDpwuhaKwe-Xf7DB3Iz-tGaogTqVYHcSiA65K0HWqJ78GnE3ucWzORsOPh37dcBMtbQADnhhJGGbx-fYoe-ZL9qHX2U5i3R9v4cSJSbP4QHVbPdd8Uwc/s1600/chrome.png" /></a></div>
<br />
Hm .. you might imagine how such an app will load on a mobile device, where the number of concurrent connections typically is limited to 2..6 and a request latency of 200 to 500 ms isn't uncommon. As bandwidth increases continuously while latency roughly stays the same, reducing the number of requests pays off for many apps, even at the cost of increased initial bandwidth.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
In order to reduce the number of requests and bandwidth usage, the JavaScript community maintains a load of tools for minifying, aggregating and preprocessing resources. When using Java on the server side, one ends up with two build systems: gradle or maven for the Java stuff, plus node.js-based JavaScript tools like bower, grunt, vulcanize etc. In addition, a lot of temporary (redundant) artifacts are created this way.<br />
<br />
As the Java web server landscape mostly sticks to server-centric "Look-'ma-we-<u><b>can</b></u>-do-ze-web<b style="background-color: white; color: #252525; font-family: sans-serif; font-size: 14px; line-height: 12.8000001907349px;">™"</b> applications, it's hardly possible to make use of modern JavaScript frameworks with Java on the server side, especially as REAL JAVA DEVELOPERS DO NOT SIMPLY INSTALL NODE.JS <span style="font-size: x-small;">(though they have a Windows installer now ;) ..)</span>. Nashorn unfortunately isn't there yet; it currently fails to replace node.js due to missing APIs and detail incompatibilities.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7k0BxR-J6jOn6EKPKJEwVQPIRFHPZoGHjp9Q9nV7cou3GwJLcazSI-KLs4UBq-5H49nfxM8SMapSs03r4FheF8NXRCbzo9efLxvUibR0fEF8j1ZaXP-qy5_9iVL7wh-ici5jb9Ab3eLw/s1600/hero.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="211" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7k0BxR-J6jOn6EKPKJEwVQPIRFHPZoGHjp9Q9nV7cou3GwJLcazSI-KLs4UBq-5H49nfxM8SMapSs03r4FheF8NXRCbzo9efLxvUibR0fEF8j1ZaXP-qy5_9iVL7wh-ici5jb9Ab3eLw/s320/hero.jpg" width="320" /></a><br />
<br />
<br />
<br />
I think it's time for an unobtrusive pointer to my kontraktor actor library, which abstracts away most of the messy stuff, allowing for direct communication of JavaScript and Java Actors via http or websockets.<br />
<br />
<br />
<br />
Even when serving single page applications, there is stuff only a web server can implement effectively:<br />
<br />
<ul>
<li>dynamically inline Html-Import's </li>
<li>dynamically minify scripts (even inline javascript)</li>
</ul>
<br />
In essence, inlining, minification and compression are temporary wire-level optimizations; there is no need to add them to the build and clutter up your project. Kontraktor's Http4k optimizes served content dynamically when run in production mode.<br />
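To illustrate the inlining idea (Http4k's actual implementation uses a full JSoup parse; this stdlib-only, regex-based sketch with made-up names just shows the principle):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of html-import inlining at serve time.
public class ImportInliner {

    static final Pattern IMPORT = Pattern.compile(
        "<link\\s+rel=['\"]import['\"]\\s+href=['\"]([^'\"]+)['\"]\\s*/?>");

    // 'resolver' maps an href to that file's content; 'seen' ensures each
    // import is inlined only once, mirroring browser html-import semantics
    public static String inline(String html,
                                Function<String, String> resolver,
                                Set<String> seen) {
        Matcher m = IMPORT.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String href = m.group(1);
            String replacement = seen.add(href)
                ? inline(resolver.apply(href), resolver, seen) // imports may import
                : "";                                          // duplicate: drop it
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new HashMap<>();
        files.put("comp.html", "<template>hi</template>");
        String page = "<link rel='import' href='comp.html'>"
                    + "<link rel='import' href='comp.html'>";
        // the duplicate import is dropped, the first one is inlined
        System.out.println(inline(page, files::get, new HashSet<>()));
    }
}
```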
The same application served by (kontraktor-)Http4k in production mode:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgviBYujZaBAtiHdbhBv6QQudXyYfjq7RbHs5uVXOJ_LnUyKl125_11jA2MoMaltQQFxvEbpmXybmVaAstJhx80EeY8y6H8QybESxyPWU9VnhBKimI9bH2kMre7THvCeJPVSdq4adLdyQ/s1600/load_fast.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgviBYujZaBAtiHdbhBv6QQudXyYfjq7RbHs5uVXOJ_LnUyKl125_11jA2MoMaltQQFxvEbpmXybmVaAstJhx80EeY8y6H8QybESxyPWU9VnhBKimI9bH2kMre7THvCeJPVSdq4adLdyQ/s1600/load_fast.png" /></a></div>
<br />
Even better: as imports are removed during server-side inlining, the web app runs under iOS Safari and the Android 4.4.2 default browser.<br />
<br />
<div>
<span style="font-size: large;"><b>Actor based Server side</b></span><br />
<br />
Whether it's a TCP server socket accepting client(-socket)s or a client browser connecting to a web server: it's structurally the same pattern:<br />
<br />
1. server listens for authentication/connection requests.<br />
2. server authenticates/handles a newly connecting client and instantiates/opens a client connection (webapp terminology: session).<br />
<br />
What's different when comparing a TCP server and a WebApp Http server is<br />
<ul>
<li>underlying transport protocol</li>
<li>message encoding</li>
</ul>
As <a href="https://github.com/RuedigerMoeller/kontraktor" target="_blank">kontraktor</a> maps actors to a concrete transport topology at runtime, a web app server application does not look special:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSgyO-gPUzGOZDflT3-fyJ13m7CMaQiD618bz4pJT_FyBbQRMtNf6FUsTdIISHaC28xWSNc4YcMw4HV2IKQngQYqR2fGO9zIm3qgYzgfgK0Ivz3qqFUar8r9jjG-WMJsChYKd6rKC8MBs/s1600/autserver2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSgyO-gPUzGOZDflT3-fyJ13m7CMaQiD618bz4pJT_FyBbQRMtNf6FUsTdIISHaC28xWSNc4YcMw4HV2IKQngQYqR2fGO9zIm3qgYzgfgK0Ivz3qqFUar8r9jjG-WMJsChYKd6rKC8MBs/s1600/autserver2.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div style="text-align: center;">
<i>a generic server using actors</i></div>
<br />
The decision on transport and encoding is made by publishing the server actor. The underlying framework (<a href="https://github.com/RuedigerMoeller/kontraktor" target="_blank">kontraktor</a>) then maps the high-level behaviour description to a concrete transport and encoding. E.g. for websockets + http long-poll transports it would look like:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxQZbX3oW_3w-MG-pkzc0cfBWFk9Kc5cH6r8xaCrJf8e1J-IYeog-VJuzPH0cIPbLS0nUB-92glob6-oD3taM69JVFe0G9prZSBCqbHt5FipcYSBRgyALYZMfLJh6q9UDuPX9mq55MLXI/s1600/publishapi.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxQZbX3oW_3w-MG-pkzc0cfBWFk9Kc5cH6r8xaCrJf8e1J-IYeog-VJuzPH0cIPbLS0nUB-92glob6-oD3taM69JVFe0G9prZSBCqbHt5FipcYSBRgyALYZMfLJh6q9UDuPX9mq55MLXI/s1600/publishapi.png" /></a></div>
<br />
On client side, <a href="https://github.com/RuedigerMoeller/kontraktor/tree/trunk/modules/kontraktor-http/src/main/javascript/js4k">js4k.js</a> implements the api required to talk to java actors (using a <a href="https://github.com/RuedigerMoeller/kontraktor/wiki/Kontraktor-3#js4kjs">reduced tell/ask - style API</a>).<br />
<br />
So with a different transport mapping, a webapp backend might be used as an in-process or tcp-connectable service.<br />
<br />
So far so good; however, a webapp needs html-import inlining, minification and file serving ...<br />
At this point the abstraction game ends, so kontraktor simply leverages the flexibility of <a href="http://undertow.io/undertow-docs/undertow-docs-1.2.0/index.html">RedHat's Undertow</a> by providing a "resource path" FileResource handler.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC50_VaMrVZ7LDBYSVnWh09bnyDFFk0QuSrQ7EQoURoaUmFdl5RAxkHqhzhCKpOAQAgfqY_688elLPglFWC4vF_WO41MvoMC1iUhR3_lpJF_39FygE2uWecz4VtJACEoX-JOd9aBZVM-E/s1600/respath.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhC50_VaMrVZ7LDBYSVnWh09bnyDFFk0QuSrQ7EQoURoaUmFdl5RAxkHqhzhCKpOAQAgfqY_688elLPglFWC4vF_WO41MvoMC1iUhR3_lpJF_39FygE2uWecz4VtJACEoX-JOd9aBZVM-E/s1600/respath.png" /></a></div>
<br />
The resource path file handler parses .html files and inlines + minifies (depending on settings) html-imports, css links and script tags. Of course, this should be turned off during development.<br />
The resource path works like Java's classpath: a referenced file is looked up starting with the first path entry, advancing along the resource path until a match is found. This is handy for quickly switching versions of components in use and keeping your libs organized (no copying required).<br />
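The lookup itself is simple to picture; the following stdlib-only sketch (class and path names made up, not Http4k's actual code) mirrors the first-match-wins idea:

```java
import java.io.File;
import java.util.Optional;

// Classpath-style lookup along a resource path: the first entry containing
// the requested file wins, so earlier entries shadow later ones.
public class ResourcePath {

    private final String[] roots;

    public ResourcePath(String... roots) {
        this.roots = roots;
    }

    public Optional<File> lookup(String relativePath) {
        for (String root : roots) {          // earlier entries shadow later ones
            File f = new File(root, relativePath);
            if (f.isFile())
                return Optional.of(f);
        }
        return Optional.empty();             // not found on any path entry
    }

    public static void main(String[] args) {
        ResourcePath rp = new ResourcePath("./components-v2", "./components-v1");
        // picks ./components-v2/my-comp.html if present, else falls back to v1
        System.out.println(rp.lookup("my-comp.html"));
    }
}
```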
<br />
As html is fully parsed by Http4k (<a href="http://jsoup.org/">JSoup</a> parser ftw), it's recommended to keep your markup syntactically correct. In addition, keep in mind that actors require a non-blocking + async implementation of server-side logic, so have your blocking database calls "outsourced" to kontraktor's thread pool like<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqdQajj75YbjvIkhq185CEFObUgGu-auk0cFBIug8hTasWxsfZqJllmGj8NV14TFLPZ0cshY8XS9j5dbdAOjD-zF7QKsRn3Gc_i-ZCEcf4XstqIx2WujzGmVALO8NXNyfmUpFfDwfk8p8/s1600/sql.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqdQajj75YbjvIkhq185CEFObUgGu-auk0cFBIug8hTasWxsfZqJllmGj8NV14TFLPZ0cshY8XS9j5dbdAOjD-zF7QKsRn3Gc_i-ZCEcf4XstqIx2WujzGmVALO8NXNyfmUpFfDwfk8p8/s1600/sql.png" /></a></div>
<br />
<br />
<b>Conclusion:</b><br />
<ul>
<li>JavaScript frameworks + web standards keep improving. A rich landscape of libraries and ready-to-use components has grown.</li>
<li>they increasingly come with node.js-based tooling </li>
<li>JVM based non-blocking servers scale better and have a much bigger pool of server side software components</li>
<li>kontraktor Http4k + js4k.js help close the gap and simplify development by optimizing webapp file structure and size dynamically and by abstracting (not annotating!) away irrelevant details and enterprisey ceremony.</li>
</ul>
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com39tag:blogger.com,1999:blog-4356402223581768063.post-41039237038419766482015-06-27T12:36:00.000+02:002015-07-04T21:39:57.707+02:00Don't REST ! Revisiting RPC performance and where Fowler might have been wrong ..[<b>Edit: </b>The title is clickbait of course; Fowler is aware of the async/sync issues and recently posted http://martinfowler.com/articles/microservice-trade-offs.html with a clarifying section regarding async. ]<br />
<br />
Hello my dear citizens of planet earth ...<br />
<br />
There are many good reasons to decompose large software systems into decoupled message passing components (team size + decoupling, partial + continuous software delivery, high availability, flexible scaling + deployment architecture, ...).<br />
<br />
With distributed applications comes the need for ordered point-to-point message passing. This is different from client/server relations, where many clients send requests at a low rate and the server can choose to scale using multiple threads processing requests concurrently.<br />
<blockquote class="tr_bq">
<i><b>Remote Messaging performance is to distributed systems what method invocation performance is for non-distributed monolithic applications.</b></i></blockquote>
(guess what is one of the most optimized areas in the JVM: method invocation)<br />
<div>
<br />
[Edit: with "REST", I also refer to HTTP-based webservice-style APIs; this is somewhat imprecise]</div>
<br />
<b><span style="font-size: large;">Revisiting high level remoting Abstractions</span></b><br />
<span style="font-size: large;"><br /></span>
There have been various attempts at building high-level, location-transparent abstractions (e.g. CORBA, distributed objects); however, in general those ideas have not received broad acceptance.<br />
<br />
This article of Martin Fowler sums up common sense pretty well:<br />
<br />
<a href="http://martinfowler.com/articles/distributed-objects-microservices.html">http://martinfowler.com/articles/distributed-objects-microservices.html</a><br />
<br />
Though not explicitly written, the article implies <b>synchronous</b> remote calls, where a sender blocks and waits for a remote result to arrive, thereby paying the cost of a full network roundtrip for each remote call performed.<br />
<br />
With <b>asynchronous</b> remote calls, many of the complaints do not hold true anymore. When using asynchronous message passing, the granularity of remote calls is not significant anymore.<br />
<br />
"course grained" processing<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">remote.getAgeAndBirth().<b>then</b>( (age,birth) -> .. );</span></blockquote>
<br />
is not significantly faster than two "fine-grained" calls<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"><b>all</b>( remote.getAge(), remote.getBirth() )</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"> .<b>then</b>( resultArray -> ... );</span><span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> </span></blockquote>
<br />
as both variants include network <b>round trip </b>latency only<b> once</b>. <br />
<br />
On the other hand, with <b>synchronous</b> remote calls, every single remote method call pays the penalty of one network round trip, and <b>only then</b> do Fowler's arguments hold.<br />
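The difference can be sketched with plain CompletableFutures; the "remote" calls below are simulated locally, and the method names just echo the example above:

```java
import java.util.concurrent.CompletableFuture;

// Why granularity matters less with async calls: two fine-grained requests
// are sent without waiting for each other, so both responses arrive after
// roughly one round trip.
public class AsyncGranularity {

    public static CompletableFuture<Integer> getAge() {
        return CompletableFuture.supplyAsync(() -> 42);      // pretend network call
    }

    public static CompletableFuture<String> getBirth() {
        return CompletableFuture.supplyAsync(() -> "1975");  // pretend network call
    }

    public static void main(String[] args) {
        // both requests are in flight concurrently; thenCombine fires once
        // both responses have arrived - one round trip of latency in total
        String combined = getAge()
            .thenCombine(getBirth(), (age, birth) -> age + "/" + birth)
            .join();
        System.out.println(combined);

        // a synchronous style would pay the round trip twice:
        //   int age = remote.getAge();        // wait for round trip 1
        //   String birth = remote.getBirth(); // wait for round trip 2
    }
}
```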
<br />
Another element changing the picture is the availability of "Spores": snippets of code which can be passed over the network and executed at the receiver side, e.g.<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;">remote.<b>doWithPerson</b>( "Heinz", heinz -> {<br /> <span style="color: #444444;">// executed remotely, stream data back to callee</span><br /> <b>stream</b>( heinz.salaries().sum() / 12 ); <b>finish</b>();<br /> }).<b>then</b>( averageSalary -> .. );</span></blockquote>
<br />
Spores can be implemented efficiently given the availability of VMs and JIT compilation.<br />
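On the JVM, the Spore idea can be sketched with serializable lambdas; the "wire" below is just a byte array and all names are illustrative, not kontraktor's actual API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

// A code snippet travels as bytes and is executed on the receiver side
// against its local data.
public class SporeDemo {

    public interface SerFunction<A, B> extends Function<A, B>, Serializable {}

    public static byte[] toWire(SerFunction<?, ?> spore) {
        try {
            ByteArrayOutputStream bout = new ByteArrayOutputStream();
            new ObjectOutputStream(bout).writeObject(spore);
            return bout.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    public static <A, B> SerFunction<A, B> fromWire(byte[] bytes) {
        try {
            return (SerFunction<A, B>) new ObjectInputStream(
                new ByteArrayInputStream(bytes)).readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // "sender" side: capture the logic in a serializable lambda
        SerFunction<int[], Integer> averageSalary = salaries -> {
            int sum = 0;
            for (int s : salaries) sum += s;
            return sum / 12;                 // runs where the data lives
        };
        // "receiver" side: deserialize the spore, apply it to local data
        SerFunction<int[], Integer> received = fromWire(toWire(averageSalary));
        System.out.println(received.apply(new int[]{1200, 1200, 1200}));
    }
}
```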
<br />
Actor systems and similar asynchronous message-passing approaches have gained popularity in recent years. The main motivation was easing concurrency, plus the insight that multithreading with shared data does not scale well and is hard to master in an industrial-grade software development environment.<br />
<br />
As large servers are in essence "distributed systems in a box", those approaches also apply to distributed systems.<br />
<br />
In the following, I'll test the remote invocation performance of some frameworks. I'd like to prove that established frameworks are far from what is technically possible, and want to show that popular technology choices such as REST are fundamentally unsuited to form the foundation of large, fine-grained distributed applications.<br />
<br />
<br />
<b><span style="font-size: large;">Test Participants</span></b><br />
<br />
<b>Disclaimer:</b> As the tested products are of medium to high complexity, there is a danger of misconfiguration or test errors, so if anybody has a (verified) improvement to one of the testcases, just drop me a comment or file an issue in the github repository containing the tests:<br />
<br />
<a href="https://github.com/RuedigerMoeller/remoting-benchmarks">https://github.com/RuedigerMoeller/remoting-benchmarks</a>.<br />
<br />
I verified by googling forums etc. that numbers are roughly in line with what others have observed.<br />
<br />
Features I expect from a distributed application framework:<br />
<ul>
<li>Ideally fully location transparent. At least there should be a concise way (e.g. annotations, generators) to half-automate marshalling. </li>
<li>It is capable of mapping responses to their appropriate request callbacks automatically (via callbacks, futures/promises or whatever).</li>
<li>it's asynchronous</li>
</ul>
<br />
products tested (<b>Disclaimer:</b> I am the author of kontraktor):<br />
<ul>
<li><b>Akka 2.11</b><br />Akka provides a high level programming interface, marshalling and networking details are mostly invisible to application code (full location transparency).</li>
<li><b>Vert.x 3.1</b><br />provides a weaker level of abstraction compared to actor systems, e.g. there are no remote references. Vert.x has a symbolic notion of network communication (event bus, endpoints). <br />As it's "polyglot", marshalling and message encoding need some manual support. <br />Vert.x is kind of a platform and addresses many practical aspects of distributed applications such as application deployment, integration of popular technology stacks, monitoring, etc.</li>
<li><b>REST (RestExpress)</b><br />As Http 1.1-based REST is limited by latency (synchronous protocol), I chose this one more or less randomly.</li>
<li><b><a href="http://ruedigermoeller.github.io/kontraktor/" target="_blank">Kontraktor 3</a></b>, distributed actor system on Java 8. I believe it hits a sweet spot regarding performance, ease of use and mind model complexity. Kontraktor provides a concise, mostly location transparent high level programming model (Promises, Streams, Spores) supporting many transports (tcp, http long poll, websockets).</li>
</ul>
<br />
Libraries skipped:<br />
<ul>
<li>finagle - requires me to clone and build their fork of thrift 0.5 first. Then I'd have to define thrift messages, then generate, then finally run it. </li>
<li>parallel universe - at the time of writing, the actor remoting was not in a testable state ("Galaxy" is alpha), the examples are without build files, and the gradle build did not work. Once I managed to build, the programs were expecting configuration files which I could not find. Maybe worth a revisit (accepting pull requests as well :) ).</li>
</ul>
<br />
<span style="font-size: large;"><b>The Test</b></span><br />
<br />
I took a standard remoting example:<br />
<blockquote class="tr_bq">
The <b>"Ask"</b> testcase: </blockquote>
<blockquote class="tr_bq">
The sender sends a message of two numbers, the remote receiver answers with the sum of those 2 numbers. The remoting layer has to track and match requests and responses as there can be tens of thousand "in-flight". </blockquote>
<blockquote class="tr_bq">
The <b>"Tell" </b>testcase: </blockquote>
<blockquote class="tr_bq">
Sender sends fire-and-forget: no reply is sent from the receiver. </blockquote>
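The "Ask" case is the interesting one for a remoting layer: it has to correlate each response with one of potentially tens of thousands of in-flight requests. A minimal sketch of that bookkeeping (illustrative names, not any tested framework's actual code):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Every request gets an id; the matching response completes its future.
public class PendingRequests {

    private final AtomicLong ids = new AtomicLong();
    private final Map<Long, CompletableFuture<Object>> pending =
        new ConcurrentHashMap<>();

    // sender side: register a future under a fresh request id before sending
    public long register(CompletableFuture<Object> future) {
        long id = ids.incrementAndGet();
        pending.put(id, future);
        return id;
    }

    // response side: complete and forget the matching future
    public void onResponse(long id, Object result) {
        CompletableFuture<Object> f = pending.remove(id);
        if (f != null)
            f.complete(result);
    }

    public static void main(String[] args) {
        PendingRequests reqs = new PendingRequests();
        CompletableFuture<Object> sum = new CompletableFuture<>();
        long id = reqs.register(sum);   // "ask": 3 + 4
        reqs.onResponse(id, 7);         // response arrives, possibly out of order
        sum.thenAccept(System.out::println);
    }
}
```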
<br />
<span style="font-size: large;"><b>Results</b></span><br />
<br />
<b>Attention</b>: Don't miss notes below charts.<br />
<br />
Platform: Linux Centos 7 dual socket 20 real cores @2.5 GHZ, 64GB ram. As the tests are ordered point to point, none of the tests made use of more than 4 cores.<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="176"></col><col width="100"></col><col width="100"></col></colgroup><tbody>
<tr style="height: 21px;"><td style="padding: 2px 3px 2px 3px; vertical-align: bottom;"></td><td data-sheets-value="[null,2,"tell Sum"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">tell Sum <br />
(msg/second)</td><td data-sheets-value="[null,2,"ask Sum"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">ask Sum (msg/second)</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Kontraktor Idiomatic"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Kontraktor Idiomatic</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1900000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1.900.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,860000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">860.000</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Kontraktor Sum-Object"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Kontraktor Sum-Object</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,1450000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1.450.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,795000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">795.000</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Vert.x 3 "]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Vert.x 3 </td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,200000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">200.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,200000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">200.000</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"AKKA (Kryo)"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">AKKA (Kryo)</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,120000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">120.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,65000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">65.000</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"AKKA"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">AKKA</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,70000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">70.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,64500]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">64.500</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"RestExpress"]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">RestExpress</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,15000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">15.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,15000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">15.000</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Rest 100 con."]" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Rest >15 connections</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,48000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">48.000</td><td data-sheets-numberformat="[null,2,"#,##0",1]" data-sheets-value="[null,3,null,48000]" style="padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">48.000</td></tr>
</tbody></table>
<br />
let me chart that for you ..<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGL-TeXILi36zi9pyqZ29OfbBKnhUb-I87fBv5A44Et7ON2qkmg_Spz2neruUuO_Czu1U4sFiLAqGmQuVgT_mQaEQY_E2auwWZXpzLVUxUrRPfVo6zMbO2DQvmqtph6nzy8HkQ3xUvaQ/s1600/stats.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrGL-TeXILi36zi9pyqZ29OfbBKnhUb-I87fBv5A44Et7ON2qkmg_Spz2neruUuO_Czu1U4sFiLAqGmQuVgT_mQaEQY_E2auwWZXpzLVUxUrRPfVo6zMbO2DQvmqtph6nzy8HkQ3xUvaQ/s1600/stats.png" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<span style="font-size: large;">Remarks</span>:<br />
<br />
<ul>
<li><b>Kontraktor 3 </b>outperforms by a huge margin. I verified the test is correct and all messages are transmitted (if in doubt just clone the git repo and reproduce). </li>
<li><b>Vert.x 3 </b>seems to have built-in rate limiting. I saw peaks of 400k messages/second, however it averaged at 200k (hints for improvement welcome). In addition, the first connecting sender only gets 15k/second throughput; if I stop and reconnect, throughput is as charted. <br />I tested the very first Vert.x 3 final release. For marshalling, fast-serialization (FST) was used (also used in kontraktor). Will update as Vert.x 3 matures.</li>
<li><b>Akka</b>. I spent quite some time on improving the performance, with mediocre results. As Kryo is roughly of the same speed as fst serialization, I'd expect at least 50% of kontraktor's performance.<br /><b><br />Edit: </b>Further analysis shows Akka is hit by poor serialization performance. It has an option to use Protobuf for encoding, which might improve results (but why did Kryo not help then?). <br /><br />Implications of using protobuf:<br />* each message needs to be defined in a .proto file, and a generator has to be run<br />* frequently additional data transfer is done like "app data => generated messages => app data"<br />* no reference-sharing support, so no cyclic object graphs can be transmitted<br />* no implicit compression by serialization's reference sharing<br />* unsure whether the ask() test would profit, as it did not profit from Kryo either<br />* Kryo performance is in the same ballpark as protobuf but did not help that much either.<br /><br />Smell: I had several people contacting me aiming to improve the Akka results. They somehow disappear. <br /><br />Once I find time I might add a protobuf test. It's a pretty small test program, so if there was an easy fix, it should not be a huge effort to provide it. The git repo linked above contains a maven-buildable, ready-to-use project.</li>
<li><b>REST</b>. Poor throughput is not caused by RestExpress (which I found quite straightforward to use) but by Http 1.1's dependence on latency. If one moves a server to other hardware (e.g. a different subnet, cloud), the throughput of a service can change drastically due to different latency. This might change with Http 2.<br />Good news is: You can <use> </any> <chatty> { encoding: for messages }, as it won't make a big difference for point-to-point REST performance.<br />Only when opening many connections (>20) concurrently does throughput increase. This messes up transaction/message ordering, so it can only be used for idempotent operations (a species mostly known from white papers and conference slides, rarely seen in the wild).</li>
</ul>
<br />
<br />
<b><span style="font-size: large;">Misc Observations</span></b><br />
<br />
<b><br /></b>
<b>Backpressure</b><br />
<b><br /></b>
Sending millions of messages as fast as possible can be tricky to implement in a non-blocking environment. A naive send loop <br />
<ul>
<li>might block the processing thread </li>
<li>builds up a large outbound queue, as put is faster than take + send </li>
<li>can prevent incoming callbacks from being enqueued + executed (= deadlock or OOM).</li>
</ul>
Of course this is a synthetic test case; however, similar situations exist, e.g. when streaming large query results or sending large blobs to other nodes (e.g. initialization with reference data).<br />
<br />
None of the libraries (except REST) handled this out of the box:<br />
<ul>
<li>Kontraktor requires a manual increase of queue sizes over the default (32k) in order not to deadlock in the "<b>ask</b>" test. In addition it's required to programmatically adapt the send rate using the backpressure signal raised by the TCP stack (network send blocks). This can be done non-blocking, "offer()"-style.</li>
<li>For Vert.x I used a periodic task sending a burst of 500 to 1000 messages. Unfortunately the optimal number of messages per burst depends on hardware performance, so the test might need adjustment when run on e.g. a laptop.</li>
<li>For Akka I send 1 million messages every 30 seconds in order to avoid implementing application-level flow control. Used naively (big loop), it just queues up messages and degrades to something like 50 msg/s.</li>
<li>REST was not problematic here (synchronous Http1.1 anyway). Degraded by default.</li>
</ul>
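The "offer()"-style handling mentioned above can be sketched with a bounded queue (illustrative names; kontraktor's actual implementation differs):

```java
import java.util.concurrent.ArrayBlockingQueue;

// A bounded outbound queue refuses messages instead of growing without
// limit, so a non-blocking sender can adapt its rate when the network
// writer can't keep up.
public class BoundedSender {

    private final ArrayBlockingQueue<Object> outbound;

    public BoundedSender(int capacity) {
        outbound = new ArrayBlockingQueue<>(capacity);
    }

    // returns false as a backpressure signal: the caller should retry later
    // instead of blocking its (actor) thread or queueing unbounded memory
    public boolean trySend(Object msg) {
        return outbound.offer(msg);
    }

    // the network writer drains the queue; null means nothing to send
    public Object pollNext() {
        return outbound.poll();
    }

    public static void main(String[] args) {
        BoundedSender sender = new BoundedSender(2);
        System.out.println(sender.trySend("a"));  // accepted
        System.out.println(sender.trySend("b"));  // accepted
        System.out.println(sender.trySend("c"));  // rejected: backpressure
    }
}
```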
<br />
<b><br /></b>
<br />
<span style="font-size: large;"><b>Why is kontraktor remoting that faster ?</b></span><br />
<ul>
<li>premature optimization </li>
<li>adaptive batching works wonders, especially when applied to reference sharing serialization.</li>
<li>small performance compromises stack up, reduce them bottom up </li>
</ul>
Kontraktor actually is far from optimal. It still generates and serializes a "MethodCall() { int targetId, [..], String methodName, Object args[] }" for each remoted message. It does not use Externalizable or other ways of bypassing generic (fast-)serialization.<br />
Throughputs beyond 10 million remote method invocations/sec have been proven possible, at the cost of a certain fragility + complexity (unique ids and distributed systems ...) + manual marshalling optimizations.<br />
<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
<ul>
<li>As scepticism regarding distributed object abstractions is mostly performance-related, high-performance asynchronous remote invocation is a game changer</li>
<li>Popular libraries have room for improvement in this area </li>
<li>Don't use REST/Http for inter-system connections in (micro-)service-oriented architectures; point-to-point performance is horrible. It has its applications in the area of (WAN) web services, platform-neutral easily accessible APIs, and client/server patterns.</li>
<li>Asynchronous programming is different and requires different/new solution patterns (at source code level). It is unavoidable to learn the use of asynchronous messaging primitives. <br />"Pseudo-synchronous" approaches (e.g. fibers) are good for scaling multithreading, but do not work out for distributed systems.</li>
<li>lack of craftsmanship can kill visions.</li>
</ul>
<br />
<br />Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com37tag:blogger.com,1999:blog-4356402223581768063.post-88208137293204699822014-12-22T10:56:00.001+01:002015-01-05T02:19:17.707+01:00A persistent KeyValue Server in 40 lines and a sad fact<div>
<em style="background-color: #fcffee; color: #222222; font-family: Verdana, Geneva, sans-serif; font-size: 18px; line-height: 24.6399993896484px;">This post originally was submitted to the <a href="http://www.javaadvent.com/2014/12/a-persistent-keyvalue-server-in-40.html" target="_blank">Java Advent Calendar</a> and is licensed under the <a href="https://creativecommons.org/licenses/by/3.0/" style="color: #888888; text-decoration: none;">Creative Commons 3.0 Attribution</a> license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!</em></div>
<div>
<em style="background-color: #fcffee; color: #222222; font-family: Verdana, Geneva, sans-serif; font-size: 18px; line-height: 24.6399993896484px;"><br /></em>
<em style="background-color: #fcffee; color: #222222; font-family: Verdana, Geneva, sans-serif; font-size: 18px; line-height: 24.6399993896484px;">It also has been published on <a href="https://www.voxxed.com/blog/2014/12/persistent-keyvalue-server-40-lines-sad-fact/" target="_blank">voxxed.com</a> .</em><br />
<em style="background-color: #fcffee; color: #222222; font-family: Verdana, Geneva, sans-serif; font-size: 18px; line-height: 24.6399993896484px;"><br /></em></div>
Picking up Peter's <a href="http://www.javaadvent.com/2014/12/how-and-why-is-unsafe-used-in-java.html">well-written overview</a> on the uses of Unsafe, I'll take a <strike>short</strike> fly-by on how low-level techniques in Java can save development effort by enabling a higher level of abstraction <b>or</b> allow for Java performance levels probably unknown to many.<br />
<br />
My major point is to show that the conversion of objects to bytes and vice versa is an important fundamental, affecting virtually any modern Java application.<br />
<br />
Hardware likes to process streams of bytes, not object graphs connected by pointers, as <i>"All memory is tape"</i> <span style="font-size: x-small; font-weight: bold;">(M. Thompson, if I remember correctly ..).</span><br />
<span style="font-size: x-small; font-weight: bold;"><br /></span>Many basic technologies are therefore hard to use with vanilla Java heap objects:<br />
<div>
<ul>
<li><b>Memory Mapped Files</b> - a great and simple technology to persist application data safely, quickly & easily.</li>
<li><b>Network communication</b> is based on sending packets of bytes</li>
<li><b>Interprocess communication</b> (shared memory)</li>
<li><b>Large main memory</b> of today's servers (64GB to 256GB), which causes GC issues when kept on the Java heap</li>
<li>CPU caches work best on data stored as a continuous stream of bytes in memory</li>
</ul>
</div>
So in most cases, use of the Unsafe class boils down to helping transform a Java object graph into a continuous memory region and vice versa, using either<br />
<ul>
<li>[performance enhanced] <b>object serialization</b> or</li>
<li><b>wrapper classes</b> to ease access to data stored in a continuous memory region.</li>
</ul>
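As a minimal sketch of what such a transform rests on, here is raw off-heap access via Unsafe (assumes a JDK where sun.misc.Unsafe is still reachable via reflection, which holds for current JDKs through the jdk.unsupported module):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Minimal sketch: write and read a long in native (off-heap) memory.
public class UnsafeDemo {
    static Unsafe unsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);             // Unsafe has no public accessor
        return (Unsafe) f.get(null);
    }

    public static long roundTrip(long value) throws Exception {
        Unsafe u = unsafe();
        long addr = u.allocateMemory(8);   // 8 bytes outside the Java heap
        try {
            u.putLong(addr, value);        // the GC never sees this memory
            return u.getLong(addr);
        } finally {
            u.freeMemory(addr);            // manual lifecycle, no GC help
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(42L)); // prints 42
    }
}
```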
(Code & examples of this post can be found <a href="https://github.com/RuedigerMoeller/advcalendar2014">here</a>)<br />
<ul></ul>
<span style="font-size: x-large;"><br /></span><b><span style="font-size: x-large;">Serialization based Off-Heap</span></b><br />
<div>
<br />
Consider a retail web application with potentially millions of registered users. We are actually not interested in representing the data in a relational database, as all that's needed is quick retrieval of user-related data once a user logs in. Additionally one would like to traverse the social graph quickly.<br />
<br />
Let's take a simple user class holding some attributes and a list of 'friends' making up a social graph.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidqoCpjFdqdDtpUOoPlGdzwFcsra2ElbsKlvGRA0d2GVR1PjS_vMkexlkQeD_-Y8j6jFfEv-hVgmQaPJvPxW1erL_2VVmbyWqZrtfDPKOk6h5Uzrb0MzVhIp7w8Roi6LeEL2dYGvgFlQ/s1600/user.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidqoCpjFdqdDtpUOoPlGdzwFcsra2ElbsKlvGRA0d2GVR1PjS_vMkexlkQeD_-Y8j6jFfEv-hVgmQaPJvPxW1erL_2VVmbyWqZrtfDPKOk6h5Uzrb0MzVhIp7w8Roi6LeEL2dYGvgFlQ/s1600/user.png" /></a></div>
<br />
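In code, such a user class might look like this (an illustrative sketch along the lines of the pictured class; the field names are assumptions):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Illustrative user record; the list of user ids forms the social graph.
public class User implements Serializable {
    long id;
    String name;
    String email;
    List<Long> friends = new ArrayList<>(); // ids of befriended users

    public User(long id, String name, String email) {
        this.id = id;
        this.name = name;
        this.email = email;
    }

    public static void main(String[] args) {
        User u = new User(1L, "Ruediger", "r@example.com");
        u.friends.add(2L);
        System.out.println(u.friends.size()); // prints 1
    }
}
```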
The easiest way to store this on-heap is a simple huge HashMap.<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Alternatively one can use <b>off heap maps</b> to store large amounts of data. An off heap map stores its keys and values inside the native heap, so garbage collection does not need to track this memory. In addition, native heap can be told to automagically get synchronized to disk (memory mapped files). This even works in case your application crashes, as the OS manages write back of changed memory regions.<br />
<div>
<br /></div>
There are some open source off heap map implementations out there with various feature sets (e.g. <a href="https://github.com/OpenHFT/Chronicle-Map#complex-types">ChronicleMap</a>), for this example I'll use a plain and simple implementation featuring fast iteration (optional full scan search) and ease of use.<br />
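The memory-mapping part itself needs no Unsafe at all; plain NIO suffices. A minimal sketch (file location and mapping size are arbitrary choices for the example):

```java
import java.io.File;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

// Minimal sketch: a file mapped into memory acts as persistent native storage.
// Writes land in the OS page cache; the OS writes dirty pages back to disk
// even if the JVM crashes (force() additionally flushes explicitly).
public class MappedDemo {
    public static int writeAndReadBack(int value) throws IOException {
        File f = File.createTempFile("mapped-demo", ".bin");
        f.deleteOnExit();
        try (FileChannel ch = FileChannel.open(f.toPath(),
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.putInt(0, value); // stored off the Java heap, GC-invisible
            map.force();          // flush changed pages to disk
            return map.getInt(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndReadBack(123)); // prints 123
    }
}
```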
<br />
Serialization is used to store objects; deserialization is used to pull them back onto the Java heap. Pleasantly, I have written the (afaik) <a href="https://github.com/RuedigerMoeller/fast-serialization">fastest fully JDK-compliant object serialization</a> on the planet, so I'll make use of that.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Zntj14Bc3N4mP6hkm2NllN-RyN1DD5x_U6JdkoCKoIyNFS6R2KvI4duXb0fm0QqfTUCiUiDHu9LuFPsRYjfjIj4DJMgdpwnXb364lqrkDbt4zhU8fd4EMhPPDV1nSOUB14JJI_KwJw/s1600/omap.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6Zntj14Bc3N4mP6hkm2NllN-RyN1DD5x_U6JdkoCKoIyNFS6R2KvI4duXb0fm0QqfTUCiUiDHu9LuFPsRYjfjIj4DJMgdpwnXb364lqrkDbt4zhU8fd4EMhPPDV1nSOUB14JJI_KwJw/s1600/omap.png" /></a></div>
<br />
Done:<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<ul>
<li>persistence by memory mapping a file (the map will reload upon creation). </li>
<li>Java heap still empty to serve real application processing, with Full GC < 100ms. </li>
<li>Significantly less overall memory consumption. A serialized user record is ~60 bytes, so in theory 300 million records fit into ~18GB of server memory. No need to raise the big data flag and run 4096 hadoop nodes on AWS ;).</li>
</ul>
<div>
<br />
Comparing a regular in-memory Java HashMap and a fast-serialization based persistent off-heap map holding <b>15 million</b> user records shows the following results (on an older 3Ghz XEON 2x6):<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid rgb(204, 204, 204); font-family: arial, sans, sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="209"></col><col width="165"></col><col width="79"></col><col width="115"></col><col width="115"></col><col width="112"></col></colgroup><tbody>
<tr style="height: 21px;"><td style="background-color: #d9d9d9; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;"></td><td data-sheets-value="[null,2,"consumed Java Heap (MB)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">consumed Java Heap (MB)</td><td data-sheets-value="[null,2,"Full GC (s)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">Full GC (s)</td><td data-sheets-value="[null,2,"Native Heap (MB)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">Native Heap (MB)</td><td data-sheets-value="[null,2,"get/put ops per s"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">get/put ops per s</td><td data-sheets-value="[null,2,"required VM size"]" style="background-color: #d9d9d9; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">required VM size (MB)</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"HashMap"]" style="background-color: #d9d9d9; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; padding: 2px 3px; vertical-align: bottom;">HashMap</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,6865]" style="padding: 2px 3px; text-align: right; vertical-align: bottom;">6.865,00</td><td data-sheets-numberformat="[null,2,"#,##0.000",1]" data-sheets-value="[null,3,null,26.039]" style="padding: 2px 3px; text-align: right; vertical-align: bottom;">26,039</td><td data-sheets-value="[null,3,null,0]" style="padding: 2px 3px; text-align: right; vertical-align: bottom;">0</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,3800000]" style="padding: 2px 3px; text-align: right; vertical-align: bottom;">3.800.000,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,12000]" style="border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
12.000,00</div>
</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"OffheapMap (Serialization based)"]" style="background-color: #d9d9d9; border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; padding: 2px 3px; vertical-align: bottom;">OffheapMap (Serialization based)</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,63]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
63,00</div>
</td><td data-sheets-numberformat="[null,2,"#,##0.000",1]" data-sheets-value="[null,3,null,0.026]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
0,026</div>
</td><td data-sheets-numberformat="[null,2,"#,###",1]" data-sheets-value="[null,3,null,3050]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
3.050</div>
</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,750000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
750.000,00</div>
</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,500]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;"><div style="text-align: right;">
500,00</div>
</td></tr>
</tbody></table>
<br />
[<a href="https://github.com/RuedigerMoeller/advcalendar2014/blob/master/src/main/java/keyvalue/OffHeapMapExample.java">test source / blog project</a>] Note: You'll need at least 16GB of RAM to execute them.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbfHZka7bokFzrf2J6leJQITpPmhF_xW4nO1CvsNvTNYYO7Mgu5IDtZ7OyKwpm76erM8sgJ-9G_hkbDOCvl57gkN1qJcItzO31HiVgVJ6XS9dS3kDAH3wKEbM3uZ2Sm5Fu9m4_ZM2ySA/s1600/offheap.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbfHZka7bokFzrf2J6leJQITpPmhF_xW4nO1CvsNvTNYYO7Mgu5IDtZ7OyKwpm76erM8sgJ-9G_hkbDOCvl57gkN1qJcItzO31HiVgVJ6XS9dS3kDAH3wKEbM3uZ2Sm5Fu9m4_ZM2ySA/s1600/offheap.jpg" height="133" width="200" /></a></div>
As one can see, even with fast serialization there is a heavy penalty (~factor 5) in access performance; anyway, compared to other persistence alternatives it's still superior (1-3 microseconds per "get" operation, "put()" very similar).<br />
<br />
Use of JDK serialization would perform at least 5 to 10 times slower (direct comparison below) and would therefore render this approach useless.<br />
<br />
<br /></div>
<b><span style="font-size: x-large;">Trading performance gains against higher level of abstraction: "Serverize me"</span></b><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihUWqTB3Xl8mZio2voe98dH5wZvofPl_94ZwBirHcuKujPgndK3PymfH6pcnuMqJZoeLVy8S4HJuJJCAACFmKKv6RZyjtWXsMgitJr4uRb_T37afnAOgRrM1TtsevljCsc_K5-d2oqCA/s1600/parking-like-a-mini-boss-bike.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihUWqTB3Xl8mZio2voe98dH5wZvofPl_94ZwBirHcuKujPgndK3PymfH6pcnuMqJZoeLVy8S4HJuJJCAACFmKKv6RZyjtWXsMgitJr4uRb_T37afnAOgRrM1TtsevljCsc_K5-d2oqCA/s1600/parking-like-a-mini-boss-bike.gif" height="112" width="200" /></a></div>
A single server won't be able to serve (hundreds of) thousands of users, so we somehow need to share data amongst processes, even better: across machines.<br />
<br />
Using a fast implementation, it's possible to generously use (fast-)serialization for over-the-network messaging. Again: if this ran 5 to 10 times slower, it just wouldn't be viable. Alternative approaches require an order of magnitude more work to achieve similar results.<br />
<br />
By wrapping the persistent off-heap hash map with an Actor implementation (async ftw!), a few lines of code make up a persistent KeyValue server with a TCP-based and an HTTP interface (uses <a href="https://github.com/RuedigerMoeller/kontraktor">kontraktor actors</a>). Of course the Actor can still be used in-process if one decides so later on.<br />
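Stripped of kontraktor specifics, the underlying pattern can be sketched with plain JDK means: a single thread owns the map (so no locking is needed) and callers receive futures. This is illustrative only, not the actual KVServer code shown below:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Actor-style key/value store: one thread owns the map, clients communicate
// via messages (tasks queued on the "mailbox" executor) and futures.
public class KVActor {
    private final Map<String, Object> map = new HashMap<>();
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    public CompletableFuture<Object> get(String key) {
        return CompletableFuture.supplyAsync(() -> map.get(key), mailbox);
    }

    public CompletableFuture<Void> put(String key, Object value) {
        return CompletableFuture.runAsync(() -> map.put(key, value), mailbox);
    }

    public void shutdown() { mailbox.shutdown(); }

    public static void main(String[] args) {
        KVActor kv = new KVActor();
        kv.put("user1", "some data").join();
        System.out.println(kv.get("user1").join()); // prints some data
        kv.shutdown();
    }
}
```

In the real implementation the map is the persistent off-heap map and the mailbox is the actor's queue; exposing it via TCP or HTTP then only requires publishing the actor.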
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG2E1ptb-SFul0h_pCr1rJft7J1AphbtXriL0DKvmUmY_Sc9-jd5UfP2MejEdHMiRqp6we1yD6NQv_MpI1EkmxFxrwAnSbRFQSpSbpuHwN9J2L_VUIcKBqyZ2_5RHzVvhC_qQAxBieiQ/s1600/kvs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG2E1ptb-SFul0h_pCr1rJft7J1AphbtXriL0DKvmUmY_Sc9-jd5UfP2MejEdHMiRqp6we1yD6NQv_MpI1EkmxFxrwAnSbRFQSpSbpuHwN9J2L_VUIcKBqyZ2_5RHzVvhC_qQAxBieiQ/s1600/kvs.png" height="640" width="604" /></a></div>
<br />
Now that's a microservice. Given it lacks any attempt at optimization and is <b>single threaded</b>, it's reasonably fast [same XEON machine as above]:<br />
<ul>
<li>280_000 successful remote lookups per second </li>
<li>800_000 in case of failed lookups (key not found)</li>
<li>serialization based TCP interface (1 liner)</li>
<li>a stringy webservice for the REST-of-us (1 liner).</li>
</ul>
[<a href="https://github.com/RuedigerMoeller/advcalendar2014/tree/master/src/main/java/keyvalue">source: KVServer, KVClient</a>] Note: You'll need at least 16GB of RAM to execute the test.<br />
<br />
A real-world implementation might want to double performance by putting received serialized object byte[] directly into the map instead of encoding it twice (encode/decode once for transmission over the wire, then decode/encode again for the off-heap map).<br />
<br />
"RestActorServer.Publish(..);" is a one-liner to also expose the KVActor as a webservice in addition to raw TCP:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgZzCMYnPtQxgfyZuJ2biKnL1QWzG0-25F3gPrzdU3fGPS0eO2GY7iL-VZlL4qlsfXbDT76n8G9ieEyEB7cpzY2CVHTstar0DYq_RLlygk72TVnuqtW0UMXDqnsiDL7zC0yNe3h586fw/s1600/Screenshot+from+2014-12-17+00:19:18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgZzCMYnPtQxgfyZuJ2biKnL1QWzG0-25F3gPrzdU3fGPS0eO2GY7iL-VZlL4qlsfXbDT76n8G9ieEyEB7cpzY2CVHTstar0DYq_RLlygk72TVnuqtW0UMXDqnsiDL7zC0yNe3h586fw/s1600/Screenshot+from+2014-12-17+00:19:18.png" height="304" width="320" /></a></div>
<br />
<br />
<br />
<b><span style="font-size: x-large;">C like performance using flyweight wrappers / structs</span></b><br />
<b><br /></b>With serialization, regular Java objects are transformed into a byte sequence. One can do the opposite: create wrapper classes which read data from fixed or computed positions of an underlying byte array or native memory address. (E.g. see <a href="http://mechanical-sympathy.blogspot.de/2012/10/compact-off-heap-structurestuples-in.html">this blog post</a>.)<br />
<br />
By moving the base pointer, it's possible to access different records by just moving the wrapper's offset. Copying such a "packed object" boils down to a memory copy. In addition, it's pretty easy to write allocation-free code this way. One downside is that reading/writing single fields has a performance penalty compared to regular Java objects. This can be made up for by using the Unsafe class.<br />
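A minimal flyweight along those lines, using a ByteBuffer instead of Unsafe for brevity (the record layout, an int id followed by a long price, is an invented example):

```java
import java.nio.ByteBuffer;

// Flyweight: one wrapper instance reads many fixed-size records from a flat
// buffer just by moving its offset - no per-record allocation.
public class PriceFlyweight {
    static final int RECORD_SIZE = 12; // int id (4 bytes) + long price (8 bytes)
    private ByteBuffer buf;
    private int offset;

    public PriceFlyweight wrap(ByteBuffer buf) { this.buf = buf; return this; }

    // "Moving the base pointer": point this wrapper at record no. n
    public PriceFlyweight moveTo(int record) {
        this.offset = record * RECORD_SIZE;
        return this;
    }

    public int  getId()    { return buf.getInt(offset); }
    public long getPrice() { return buf.getLong(offset + 4); }
    public void set(int id, long price) {
        buf.putInt(offset, id);
        buf.putLong(offset + 4, price);
    }

    public static long demo() {
        ByteBuffer mem = ByteBuffer.allocateDirect(10 * RECORD_SIZE); // off-heap
        PriceFlyweight fw = new PriceFlyweight().wrap(mem);
        fw.moveTo(3).set(42, 1999L);       // write record #3, zero allocation
        return fw.moveTo(3).getPrice();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1999
    }
}
```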
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgouZX6VpMDX8wiHPl3v7Fdd7y_SMunOWfhjCrZbqcYk_XNvdu2SWiYgIJGrMHLJ3RR6WSgjd-pmydJ44S1ARdGBJcjvXlifdLilmg0Q4APzyUcg__9WXU-GF2JrM2alMNO9ztQMcUqHA/s1600/structs.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgouZX6VpMDX8wiHPl3v7Fdd7y_SMunOWfhjCrZbqcYk_XNvdu2SWiYgIJGrMHLJ3RR6WSgjd-pmydJ44S1ARdGBJcjvXlifdLilmg0Q4APzyUcg__9WXU-GF2JrM2alMNO9ztQMcUqHA/s1600/structs.gif" height="200" width="169" /></a></div>
"Flyweight" wrapper classes can be implemented manually as shown in the blog post cited, however as code grows this starts getting unmaintainable.<br />
Fast-serialization provides a byproduct, "struct emulation", supporting creation of flyweight wrapper classes from regular Java classes at runtime. Low-level byte fiddling in application code can be avoided for the most part this way.<br />
<br />
<br />
<br />
<div style="text-align: center;">
<i>How a regular Java class can be mapped to flat memory (fst-structs):</i></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMPvzQeoKLpn6YD9f2K96MB_vOHWENsYpVowlnDTIbMtG37pd-iaaXQx6FsUQzXTz4-mL7Lc8QTpVYOoqIeIFQInS_MigyzKIz7pGAF4iTFzyefCJA-vFYRfVZob1RuuJ4T5cUeiORNw/s1600/structs.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg96wjc_zn93hmVzfsyYC1gOBIiUTvBvrAX6YBm2EZa2AD6hxtI4hIQTXbgwNmwi2t1fLkDFI-1gzvQTXU5RsfQHG6Jyb9uMxXZfA5UElDUhJ3CghqZMzksk_jFtJ-u1oC8Cb1Rm-resQ/s1600/Capture.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg96wjc_zn93hmVzfsyYC1gOBIiUTvBvrAX6YBm2EZa2AD6hxtI4hIQTXbgwNmwi2t1fLkDFI-1gzvQTXU5RsfQHG6Jyb9uMxXZfA5UElDUhJ3CghqZMzksk_jFtJ-u1oC8Cb1Rm-resQ/s1600/Capture.PNG" height="208" width="640" /></a></div>
<br />
Of course there are simpler tools out there to help reduce manual programming of encoding (e.g. <a href="https://github.com/RichardWarburton/slab">Slab</a>) which might be more appropriate for many cases and use less "magic".<br />
<br />
<b>What kind of performance can be expected using the different approaches (sad fact incoming) ?</b><br />
<br />
Let's take the following struct class, consisting of a price update and an embedded struct denoting a tradable instrument (e.g. a stock), and encode it using various methods:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbg6gREjjwl6t7T9dh9FheRFL6uLQSArxfCp8pyapE_-A-ll5ElOqBwAXZNdnDVsaHIFNlvEuAzmB07GpCzZd-T5B9M2HAzoxW6J0KEDIrVD9wsZywNQpHAdZ-ZLFQhTx_ZxOkF64-iQ/s1600/struct.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbg6gREjjwl6t7T9dh9FheRFL6uLQSArxfCp8pyapE_-A-ll5ElOqBwAXZNdnDVsaHIFNlvEuAzmB07GpCzZd-T5B9M2HAzoxW6J0KEDIrVD9wsZywNQpHAdZ-ZLFQhTx_ZxOkF64-iQ/s1600/struct.png" height="640" width="532" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i>a 'struct' in code</i></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both;">
<b>Pure encoding performance:</b></div>
<div class="separator" style="clear: both;">
<br /></div>
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid rgb(204, 204, 204); font-family: arial, sans, sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="100"></col><col width="155"></col><col width="97"></col><col width="130"></col><col width="86"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Structs"]" style="background-color: #d9d9d9; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">Structs</td><td data-sheets-value="[null,2,"fast-Ser (no shared refs)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">fast-Ser (no shared refs)</td><td data-sheets-value="[null,2,"fast-Ser"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">fast-Ser</td><td data-sheets-value="[null,2,"JDK Ser (no shared)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">JDK Ser (no shared)</td><td data-sheets-value="[null,2,"JDK Ser"]" style="background-color: #d9d9d9; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">JDK Ser</td></tr>
<tr style="height: 21px;"><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,26315000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">26.315.000,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,7757000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">7.757.000,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,5102000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">5.102.000,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,649000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">649.000,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,644000]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">644.000,00</td></tr>
</tbody></table>
<b><br /></b>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr_DXP7SzzJ9bk7url-BsCmwcWaU-CzMM2rwW0rsI_wRA_ZrCYSEvIbZNuoR3Kp0-02CENOkWqLWl4cGVwS0AIr0KyfOc6f2ZX_lRGNKYZMO44hUg8KKGG8EiJlDQQDrIgoJ0yC5OrvA/s1600/adv1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr_DXP7SzzJ9bk7url-BsCmwcWaU-CzMM2rwW0rsI_wRA_ZrCYSEvIbZNuoR3Kp0-02CENOkWqLWl4cGVwS0AIr0KyfOc6f2ZX_lRGNKYZMO44hUg8KKGG8EiJlDQQDrIgoJ0yC5OrvA/s1600/adv1.PNG" /></a></div>
<b><br /></b><b><br /></b><b>Real world test with messaging throughput:</b><br />
<br />
In order to get a basic estimate of the differences in a real application, I ran an experiment on how different encodings perform when used to send and receive messages at a high rate via <a href="https://github.com/RuedigerMoeller/fast-cast">reliable UDP messaging</a>:<br />
<br />
<i>The Test:</i><br />
<i>A sender encodes messages as fast as possible and publishes them using reliable multicast, a subscriber receives and decodes them.</i><br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid rgb(204, 204, 204); font-family: arial, sans, sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="100"></col><col width="155"></col><col width="97"></col><col width="130"></col><col width="86"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Structs"]" style="background-color: #d9d9d9; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">Structs</td><td data-sheets-value="[null,2,"fast-Ser (no shared refs)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">fast-Ser (no shared refs)</td><td data-sheets-value="[null,2,"fast-Ser"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">fast-Ser</td><td data-sheets-value="[null,2,"JDK Ser (no shared)"]" style="background-color: #d9d9d9; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">JDK Ser (no shared)</td><td data-sheets-value="[null,2,"JDK Ser"]" style="background-color: #d9d9d9; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; border-top-color: rgb(0, 0, 0); border-top-style: solid; border-top-width: 1px; padding: 2px 3px; vertical-align: bottom;">JDK Ser</td></tr>
<tr style="height: 21px;"><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,6644107]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-left-color: rgb(0, 0, 0); border-left-style: solid; border-left-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">6.644.107,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,4385118]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">4.385.118,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,3615584]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">3.615.584,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,81582]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">81.582,00</td><td data-sheets-numberformat="[null,2,"#,##0.00",1]" data-sheets-value="[null,3,null,79073]" style="border-bottom-color: rgb(0, 0, 0); border-bottom-style: solid; border-bottom-width: 1px; border-right-color: rgb(0, 0, 0); border-right-style: solid; border-right-width: 1px; padding: 2px 3px; text-align: right; vertical-align: bottom;">79.073,00</td></tr>
</tbody></table>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp2yKYlcXbOxvoPaaet9wklXU1DtJJ4KnWdfIKfdumrIxEKIJvBRst2D_u6591keT6QWOZYd2KjkzbeSNEZ8_-qbM8GAwktIrfMZMMYY9bmtJY6RY98_1t8A-C_fZXTuV8y6KDvByX-w/s1600/adv2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp2yKYlcXbOxvoPaaet9wklXU1DtJJ4KnWdfIKfdumrIxEKIJvBRst2D_u6591keT6QWOZYd2KjkzbeSNEZ8_-qbM8GAwktIrfMZMMYY9bmtJY6RY98_1t8A-C_fZXTuV8y6KDvByX-w/s1600/adv2.PNG" /></a></div>
(Tests done on I7/Win8, XEON/Linux scores slightly higher, msg size ~70 bytes for structs, ~60 bytes serialization).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsX2iPkoigISuCYfnS1BsB9TIT_8ie2IJjTgRNTq2Tu900-HsaO5ptvcxjzbTfASzKm-9ofUt3jsJeTYKRv26U8w6uLPU0QNNeOlbJXjcr09cVDSRD6Og30d4l-7-yHtdHF73-zKqotg/s1600/omg.gif" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsX2iPkoigISuCYfnS1BsB9TIT_8ie2IJjTgRNTq2Tu900-HsaO5ptvcxjzbTfASzKm-9ofUt3jsJeTYKRv26U8w6uLPU0QNNeOlbJXjcr09cVDSRD6Og30d4l-7-yHtdHF73-zKqotg/s1600/omg.gif" height="120" width="200" /></a></div>
Slowest compared to fastest: a factor of 82. The test highlights an issue not covered by micro-benchmarking: encoding and decoding should perform similarly, as factual throughput is determined by Min(encoding performance, decoding performance). For unknown reasons, JDK serialization manages to encode the message tested ~500_000 times per second, but decoding performance is only 80_000 per second, so in the test the receiver gets dropped quickly:<br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span><span style="font-family: Courier New, Courier, monospace;">"</span><br />
<span style="font-family: Courier New, Courier, monospace;">...</span><br />
<span style="font-family: Courier New, Courier, monospace;">***** Stats for receive rate: 80351 per second *********</span><br />
<span style="font-family: Courier New, Courier, monospace;">***** Stats for receive rate: 78769 per second *********</span><br />
<span style="font-family: Courier New, Courier, monospace;">SUB-ud4q has been dropped by PUB-9afs on service 1</span><br />
<span style="font-family: Courier New, Courier, monospace;">fatal, could not keep up. exiting</span><br />
<span style="font-family: Courier New, Courier, monospace;">"</span><br />
(Creating backpressure here probably isn't the right way to address the issue ;-) )<br />
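The encode/decode asymmetry is easy to check with plain JDK serialization. Below is a minimal sketch (not the benchmark code used above; the <code>Msg</code> class and counts are made up for illustration) that times both directions and derives the sustainable rate as the minimum of the two:

```java
import java.io.*;

public class EncodeDecodeRate {
    // A trivial serializable message; the real benchmark used ~60-70 byte messages.
    static class Msg implements Serializable {
        long id; String payload = "hello";
        Msg(long id) { this.id = id; }
    }

    static byte[] encode(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(o); }
            return bos.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    static Object decode(byte[] b) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return ois.readObject();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        int n = 100_000;
        byte[] bytes = encode(new Msg(0));

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) encode(new Msg(i));
        double encPerSec = n / ((System.nanoTime() - t0) / 1e9);

        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) decode(bytes);
        double decPerSec = n / ((System.nanoTime() - t0) / 1e9);

        // Sustainable throughput is bounded by the slower direction.
        System.out.printf("encode/s=%.0f decode/s=%.0f sustainable=%.0f%n",
                encPerSec, decPerSec, Math.min(encPerSec, decPerSec));
    }
}
```

Run it a few times so the JIT warms up; the absolute numbers are meaningless here, the encode/decode ratio is what matters.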
<br />
<b>Conclusion</b>:<br />
<ul>
<li>Fast serialization allows for a level of abstraction in distributed applications that is impossible if the serialization implementation is either<br />- too slow,<br />- incomplete (e.g. cannot handle arbitrary serializable object graphs), or<br />- in need of manual coding/adaptation (this would put many restrictions on actor message types, Futures and Spores; a maintenance nightmare).</li>
<li>Low Level utilities like Unsafe enable different representations of data resulting in extraordinary throughput or guaranteed latency boundaries (allocation free main path) for particular workloads. These are impossible to achieve by a large margin with JDK's public tool set.</li>
<li>In distributed systems, communication performance is of fundamental importance. Looking at the numbers above, removing Unsafe is not the biggest fish to fry .. and JSON or XML won't fix this ;-).</li>
<li>While the HotSpot VM has reached an extraordinary level of performance and reliability, CPU is wasted in some parts of the JDK like there's no tomorrow. Given that we live in the age of distributed applications and data, moving stuff over the wire should be easy to achieve (not manually coded) and as fast as possible. </li>
</ul>
<div>
<br /></div>
<div>
<b>Addendum: bounded latency</b></div>
<div>
<br /></div>
A quick <a href="https://github.com/RuedigerMoeller/fast-cast/tree/3.0/examples/src/org/nustaq/fastcast/examples/latency">Ping Pong RTT latency benchmark</a> showing that Java can easily compete with C solutions, as long as the main path is allocation-free and techniques like those described above are employed:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5OhpmPm0M00uwkyi5f6YlM08apRv6p_b_BKXnbAeMC7volRbvrukyoq9jHeQrm6xMFOD2l8g3JqswwR3o-gvkVaTdxfQ5qIbo57r7X8OVAQRvIIzSHLqSFQCEFiJZzVOb3p2pmSALAQ/s1600/lat1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5OhpmPm0M00uwkyi5f6YlM08apRv6p_b_BKXnbAeMC7volRbvrukyoq9jHeQrm6xMFOD2l8g3JqswwR3o-gvkVaTdxfQ5qIbo57r7X8OVAQRvIIzSHLqSFQCEFiJZzVOb3p2pmSALAQ/s1600/lat1.png" height="233" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGd_oTGjW211f5LHomufr4mmiJEDufF7Nzk-9SqW7Q4xYkaiG-KhYPq5dOuxa78w4GiaoyPAA_lpsvZdZRYSXUSPMJtHEO9pCNs031tLJTDLt_3-3TYVeMHJL54FuEIZfObmTRaMPiAQ/s1600/lat2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhGd_oTGjW211f5LHomufr4mmiJEDufF7Nzk-9SqW7Q4xYkaiG-KhYPq5dOuxa78w4GiaoyPAA_lpsvZdZRYSXUSPMJtHEO9pCNs031tLJTDLt_3-3TYVeMHJL54FuEIZfObmTRaMPiAQ/s1600/lat2.png" height="243" width="640" /></a></div>
<br />
<i>[credits: charts+measurement done with HdrHistogram]</i><br />
<br />
This is an "experiment" rather than a benchmark (so do not read it as '<i>Proven: Java faster than C'</i>); it shows that low-level Java can compete with C in at least this low-level domain.<br />
Of course it's not exactly <i>idiomatic </i>Java code; however, it's still easier to handle, port and maintain than a JNI or pure C(++) solution. Low-latency C(++) code won't be that idiomatic either ;-)<br />
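To illustrate the "allocation free main path" idea: the hot loop below reuses a preallocated buffer and records latencies into a preallocated bucket array, so nothing is allocated per message and the GC never gets a chance to distort the latency distribution. This is a hedged sketch with invented names, not fast-cast's actual code; the real measurements used HdrHistogram rather than the plain long[] shown here.

```java
import java.nio.ByteBuffer;

public class AllocationFreePing {
    // Everything on the hot path is preallocated once.
    private final ByteBuffer buf = ByteBuffer.allocateDirect(64);
    private final long[] histogramUs = new long[1000]; // microsecond buckets

    // Stand-in for one wire round trip: encode into and decode from the reused buffer.
    void ping(long seq) {
        long t0 = System.nanoTime();
        buf.clear();
        buf.putLong(seq);   // "send"
        buf.flip();
        long echoed = buf.getLong(); // "receive"
        long us = (System.nanoTime() - t0) / 1_000;
        histogramUs[(int) Math.min(histogramUs.length - 1, us)]++; // record, no allocation
        if (echoed != seq) throw new IllegalStateException("corrupt message");
    }

    long recorded() {
        long sum = 0;
        for (long v : histogramUs) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        AllocationFreePing p = new AllocationFreePing();
        for (long i = 0; i < 1_000_000; i++) p.ping(i); // hot loop: zero allocation
        System.out.println(p.recorded() + " round trips recorded");
    }
}
```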
<br />
<b>About me:</b> I am a solution architect freelancing at an exchange company in the area of realtime GUIs, middleware, and low-latency CEP (Complex Event Processing), nightly hacking at <a href="https://github.com/RuedigerMoeller">https://github.com/RuedigerMoeller</a>.<br />
<br />
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com441tag:blogger.com,1999:blog-4356402223581768063.post-1884416616828291432014-10-28T19:59:00.003+01:002014-10-30T12:27:32.343+01:00The Internet is running in debug modeWith the rise of the Web, textual encodings like XML/JSON have become very popular. Of course, textual message encoding has many advantages from a developer's perspective. What most developers are not aware of is how expensive encoding and decoding of textual messages really are compared to a well-defined binary protocol.<br />
<br />
It's common to define a system's behaviour by its <b>protocol</b>. Actually, a protocol conflates two distinct aspects of communication, namely:<br />
<br />
<ul>
<li>Encoding of messages</li>
<li>Semantics and behavior (request/response, signals, state transition of communication parties ..)</li>
</ul>
<br />
Frequently (though not always), these two very distinct aspects are mixed up without need. So we are forced to run the whole Internet in "<b>debug mode</b>", as 99% of webservice and webapp communication is done using textual protocols.<br />
<br />
The overhead in CPU consumption compared to a well-defined binary encoding is a factor of ~3-5 (JSON) up to &gt;10-20 (XML). The unnecessary waste of bandwidth also adds to that greatly (yes, you can zip, but that in turn wastes even more CPU).<br />
<br />
I haven't calculated the numbers, but this is <b>environmental pollution at a big scale. </b>Unnecessary CPU consumption on this scale wastes <b>a lot</b> of energy (global warming, anyone?).<br />
<br />
Solution is easy:<br />
<ul>
<li>Standardize on a few simple encodings (pure binary, self-describing binary ("binary JSON"), textual)</li>
<li>Define the behavioral part of a protocol separately (excluding encoding)</li>
<li>Use textual encoding during development and binary encoding in production.</li>
</ul>
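A hypothetical sketch of the first two bullets: the behavioral part of the system talks only to a codec interface, so the wire format can be swapped between textual (development) and binary (production) without touching protocol logic. All names below are invented for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// The wire format becomes a pluggable strategy, separate from protocol behavior.
interface Codec {
    byte[] encode(long id, String name);
}

class TextCodec implements Codec {            // human-readable, "debug mode"
    public byte[] encode(long id, String name) {
        return ("{\"id\":" + id + ",\"name\":\"" + name + "\"}")
                .getBytes(StandardCharsets.UTF_8);
    }
}

class BinaryCodec implements Codec {          // compact, for production
    public byte[] encode(long id, String name) {
        byte[] n = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer b = ByteBuffer.allocate(8 + 2 + n.length);
        b.putLong(id).putShort((short) n.length).put(n);
        return b.array();
    }
}

public class CodecDemo {
    public static void main(String[] args) {
        Codec dev = new TextCodec(), prod = new BinaryCodec();
        System.out.println("text:   " + dev.encode(42, "order").length + " bytes");
        System.out.println("binary: " + prod.encode(42, "order").length + " bytes");
    }
}
```

For this toy message the JSON form is 24 bytes versus 15 bytes binary, and that is before even counting the cost of parsing text.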
Man, we could save a lot of webserver hardware if only HTTP headers could be binary encoded ..Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com136tag:blogger.com,1999:blog-4356402223581768063.post-1403115301617230982014-10-17T22:10:00.000+02:002014-10-18T01:02:29.336+02:00Follow up: Executors and Cache Locality ExperimentThanks to <a href="http://jpbempel.blogspot.de/" target="_blank">Jean Philippe Bempel</a>, who challenged my results (for a reason), I discovered an issue in the last post: code completion let me accidentally choose Executors.newSingleThread<b>Scheduled</b>Executor() instead of Executors.newSingleThreadExecutor(), so the pinned-to-thread-actor results are actually even better than reported previously. The big picture has not changed that much, but it's still worthwhile reporting.<br />
<br />
On a second note: there are many other aspects to concurrent scheduling, such as queue implementations etc. Especially if there is no "beef" inside the message processing, these differences become more dominant than cache misses, but that is another problem which has been covered extensively and in depth by other people (e.g. <a href="http://psy-lob-saw.blogspot.de/" target="_blank">Nitsan Wakart</a>).<br />
<br />
The focus of this experiment is locality/cache misses; keep in mind that the executors' different queueing implementations certainly add noise/bias.<br />
<br />
As requested, I add results from the Linux "perf" tool to prove there are significant differences in cache misses caused by the random thread-to-actor assignment done by ThreadPoolExecutor and WorkStealingExecutor.<br />
<br />
Check out my <a href="http://java-is-the-new-c.blogspot.de/2014/10/experiment-cache-effects-when.html" target="_blank">recent post</a> for a description of the test case.<br />
<br />
<span style="font-size: large;"><b>Results with adjusted SingleThreadExecutor (XEON 2 socket, each 6 cores, no HT)</b></span><br />
<span style="font-size: large;"><b><br /></b></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl7Wed9O7E1l98G6MCrX7zDhWkLX3BA1Z71QniPW0wwpFRU5MbhWcJQFBnPAGoRYn1uKc9rPQnUTjVge-e2pKJMBnW6i-yjL4WTQ-zMpZppRqvGZqnQlJtshunJmrj9V55Q4oOx24gIEg/s1600/2cnd-8-80-100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjl7Wed9O7E1l98G6MCrX7zDhWkLX3BA1Z71QniPW0wwpFRU5MbhWcJQFBnPAGoRYn1uKc9rPQnUTjVge-e2pKJMBnW6i-yjL4WTQ-zMpZppRqvGZqnQlJtshunJmrj9V55Q4oOx24gIEg/s1600/2cnd-8-80-100.png" height="238" width="640" /></a></div>
As in the previous post, the "Dedicated" actor-pinned-to-thread scheme performs best. For very small local state there are only a few cache misses, so the differences are small, but they widen once a bigger chunk of memory is accessed by each actor. Note that ThreadPool is hampered by its internal scheduling/queuing mechanics; regardless of locality, it performs weakly.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg__SUCxqNc7if-LrtxhvhLVS-_ZAtiogdbvRzKKzCaZCro9zyJErg-drqmXMXPmyitkv11PVFTI7DY_4t3Dg8xL6KAdAvMkKUCqbTPjwJlKzYaafXyjqOlbvf3he3072pHgOTPAGVYPgM/s1600/2cnd-8-8000-100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg__SUCxqNc7if-LrtxhvhLVS-_ZAtiogdbvRzKKzCaZCro9zyJErg-drqmXMXPmyitkv11PVFTI7DY_4t3Dg8xL6KAdAvMkKUCqbTPjwJlKzYaafXyjqOlbvf3he3072pHgOTPAGVYPgM/s1600/2cnd-8-8000-100.png" height="237" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
When increasing the number of actors to 8000 (so 1000 actors per thread), "WorkStealing" and "Dedicated" perform similarly. Reason: executing 8000 actors round robin creates cache misses for both executors. Note that in a real-world server it's likely that there are active and inactive actors, so I'd expect "Dedicated" to perform slightly better than in this synthetic test.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<b><span style="font-size: large;">"perf stat -e" and "perf stat -cs" results</span></b></div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div style="clear: both; text-align: left;">
(only<span style="font-family: inherit;"> <span style="background-color: white; color: #222222; font-size: 13px;">2000, 4000, 8000 local size tests were run)</span></span></div>
<div style="clear: both; text-align: left;">
<span style="font-family: inherit;"><span style="background-color: white; color: #222222; font-size: 13px;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: left;">
<b><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">Dedicated/actor-pinned-to-</span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">thread:</span></b></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><b>333,669,424</b> cache-misses </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span></div>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">19.996366007 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">185,440 context-switches </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">20.230098005 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">=> <b>9,300 context switches per second</b></span><br />
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><b>workstealing:</b></span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><b>2,524,777,488</b> cache-misses </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">39.610565607 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">381,385 context-switches </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">39.831169694 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">=> <b>9,500 </b></span><b style="color: #222222; font-family: arial, sans-serif; font-size: 12.7272720336914px;">context switches </b><b style="color: #222222; font-family: arial, sans-serif; font-size: 13px;">per second</b><br />
<div class="" style="clear: both; text-align: left;">
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><b>fixedthreadpool:</b></span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><b>3,213,889,492</b> cache-misses </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">92.141264115 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">25,387,972 context-switches </span><wbr style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"></wbr><span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"> </span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">87.547306379 seconds time elapsed</span><br />
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">=><b>290,000 context switches per second</b></span></div>
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii7ZijK2llaOc47Ifbhn9N7hLvIrpxCu-K5fz0MQmxinSVMFVdmGo3drl52pPmZvjMsL0bW_apTAZysoYdNtfkjN4JpEFEdsYu45V-v97dROuZ49-nvold76Hss52MMU-cOJj7LU3XECU/s1600/perf-context-switch.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQkgaP68oNJg9dUPpNB6Ri_U4XGgX4XvoPU0zjV-ituE6cyKYofytkvNmESheApia4AYlW_eQVCGzHZTrVonmN3Vjn9JmzDl0_bXLQeFnszjuUmTuCdrDWDpNF34cWbej3JCbHGpGgnus/s1600/perf-misses.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em; text-align: center;"></a><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii7ZijK2llaOc47Ifbhn9N7hLvIrpxCu-K5fz0MQmxinSVMFVdmGo3drl52pPmZvjMsL0bW_apTAZysoYdNtfkjN4JpEFEdsYu45V-v97dROuZ49-nvold76Hss52MMU-cOJj7LU3XECU/s1600/perf-context-switch.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEii7ZijK2llaOc47Ifbhn9N7hLvIrpxCu-K5fz0MQmxinSVMFVdmGo3drl52pPmZvjMsL0bW_apTAZysoYdNtfkjN4JpEFEdsYu45V-v97dROuZ49-nvold76Hss52MMU-cOJj7LU3XECU/s1600/perf-context-switch.png" height="223" width="400" /></a><strike><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQkgaP68oNJg9dUPpNB6Ri_U4XGgX4XvoPU0zjV-ituE6cyKYofytkvNmESheApia4AYlW_eQVCGzHZTrVonmN3Vjn9JmzDl0_bXLQeFnszjuUmTuCdrDWDpNF34cWbej3JCbHGpGgnus/s1600/perf-misses.png" height="216" width="400" /></strike><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgQkgaP68oNJg9dUPpNB6Ri_U4XGgX4XvoPU0zjV-ituE6cyKYofytkvNmESheApia4AYlW_eQVCGzHZTrVonmN3Vjn9JmzDl0_bXLQeFnszjuUmTuCdrDWDpNF34cWbej3JCbHGpGgnus/s1600/perf-misses.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a></div>
<br />
<br />
<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><b>A quick test with a more realistic test method</b></span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><b><br /></b></span></div>
<div class="separator" style="clear: both; text-align: left;">
In order to get a more realistic impression, I replaced the synthetic int-iteration with some dirty "real-world" dummy work (some allocation plus HashMap put/get). Instead of increasing the size of the "localstate" int array, I increase the HashMap size (which should also have a negative impact on locality).</div>
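The dummy work could look roughly like the following (an assumption for illustration, not the actual benchmark code; that is linked in the original post). Each "message" allocates a little garbage and does one random put and one random get against the actor's private map:

```java
import java.util.HashMap;
import java.util.Random;

public class DummyWork {
    // Simulated per-actor state: a HashMap whose size controls locality,
    // plus some garbage allocation per "message".
    static long process(HashMap<Integer, int[]> state, int mapSize, Random rnd) {
        int key = rnd.nextInt(mapSize);
        int[] fresh = new int[16];                    // allocation on each message
        fresh[0] = key;
        state.put(key, fresh);                        // put ...
        int[] hit = state.get(rnd.nextInt(mapSize));  // ... and get in random order
        return hit == null ? 0 : hit[0];
    }

    public static void main(String[] args) {
        HashMap<Integer, int[]> state = new HashMap<>();
        Random rnd = new Random(42);
        long sum = 0;
        // growing the map size (64 .. 320k in the tables below) hurts locality
        for (int i = 0; i < 100_000; i++) sum += process(state, 4000, rnd);
        System.out.println("done, checksum " + sum);
    }
}
```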
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxyIt_3VnMDq2b9cqBQAQc75T7G_QXix_Svfy1f8ehcW4waT-Pm-3_bqouhevJpM350y_lYHx5ofkS6FVFUWHO0tD6obTPjrSOZzPhirVb2-t6npPezbhgdG6qGUhMqfcKOw5eGaJ-RIE/s1600/method.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhxyIt_3VnMDq2b9cqBQAQc75T7G_QXix_Svfy1f8ehcW4waT-Pm-3_bqouhevJpM350y_lYHx5ofkS6FVFUWHO0tD6obTPjrSOZzPhirVb2-t6npPezbhgdG6qGUhMqfcKOw5eGaJ-RIE/s1600/method.png" height="290" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<b><br /></b></div>
<div class="separator" style="clear: both; text-align: left;">
Note that this is rather short processing, so queue implementations and executor internals might dominate locality effects here. This test is run on an Opteron 8c16t * 2 sockets, a processor with only 8 KB of L1 cache. (BTW: the impl is extra dirty, so no performance-optimization comments pls, thx)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5VtnO4GT-AALliL27-GTElEq4in6-kJBPLcfLBKvG6fsTqIOzu5fzMooqBSC88spPN8VgrPG_GK_9q6q_290WcsWpEA2zopREH_wFPxfBSPRYle1qXhXfR9GghLcJHFOMIEv0BAjk2Ow/s1600/2cnd-real-load.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg5VtnO4GT-AALliL27-GTElEq4in6-kJBPLcfLBKvG6fsTqIOzu5fzMooqBSC88spPN8VgrPG_GK_9q6q_290WcsWpEA2zopREH_wFPxfBSPRYle1qXhXfR9GghLcJHFOMIEv0BAjk2Ow/s1600/2cnd-real-load.png" height="230" width="640" /></a></div>
<br />
As ThreadPoolExecutor is abnormally bad in this test/processor combination, here are the plain numbers:<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="160"></col><col width="172"></col><col width="167"></col><col width="203"></col><col width="165"></col><col width="100"></col><col width="100"></col></colgroup><tbody>
<tr style="height: 21px;"><td style="padding: 2px 3px 2px 3px; vertical-align: bottom;"></td><td data-sheets-value="[null,2,"64 entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">64 HMap entries</td><td data-sheets-value="[null,2,"256 entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">256 <span style="font-size: 12.7272720336914px;">HMap</span><span style="font-size: 100%;">entries</span></td><td data-sheets-value="[null,2,"2000 entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">2000 <span style="font-size: 12.7272720336914px;">HMap</span><span style="font-size: 100%;">entries</span></td><td data-sheets-value="[null,2,"4000 entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">4000 <span style="font-size: 12.7272720336914px;">HMap</span><span style="font-size: 100%;">entries</span></td><td data-sheets-value="[null,2,"32k entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">32k <span style="font-size: 12.7272720336914px;">HMap</span><span style="font-size: 100%;">entries</span></td><td data-sheets-value="[null,2,"320k entries"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">320k <span style="font-size: 12.7272720336914px;">HMap</span><span style="font-size: 100%;">entries</span></td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"WorkStealing"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">WorkStealing</td><td data-sheets-value="[null,3,null,1070]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1070</td><td data-sheets-value="[null,3,null,1071]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1071</td><td data-sheets-value="[null,3,null,1097]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1097</td><td data-sheets-value="[null,3,null,1129]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1129</td><td data-sheets-value="[null,3,null,1238]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1238</td><td data-sheets-value="[null,3,null,1284]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">1284</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"Dedicated"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">Dedicated</td><td data-sheets-value="[null,3,null,656]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">656</td><td data-sheets-value="[null,3,null,646]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">646</td><td data-sheets-value="[null,3,null,661]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">661</td><td data-sheets-value="[null,3,null,649]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">649</td><td data-sheets-value="[null,3,null,721]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">721</td><td data-sheets-value="[null,3,null,798]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">798</td></tr>
<tr style="height: 21px;"><td data-sheets-value="[null,2,"ThreadPool"]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; vertical-align: bottom;">ThreadPool</td><td data-sheets-value="[null,3,null,8314]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8314</td><td data-sheets-value="[null,3,null,8751]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">8751</td><td data-sheets-value="[null,3,null,9412]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">9412</td><td data-sheets-value="[null,3,null,9492]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">9492</td><td data-sheets-value="[null,3,null,10269]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10269</td><td data-sheets-value="[null,3,null,10602]" style="background-color: white; color: #222222; font-size: 100%; padding: 2px 3px 2px 3px; text-align: right; vertical-align: bottom;">10602</td></tr>
</tbody></table>
<br />
The conclusions basically stay the same as in the original post. Remember that cache misses are only one factor of overall runtime performance, so there are workloads where results might look different. The quality/specialization of the queue implementation will have a huge impact when processing consists of only a few lines of code.<br />
<br />
Finally, my result:<b> </b><br />
<blockquote class="tr_bq">
<b><i>Pinning actors to threads created lowest cache miss rates in any case tested.</i></b></blockquote>
<br />
<span style="font-size: large;"><b><br /></b></span>
<br />
<br />
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com100tag:blogger.com,1999:blog-4356402223581768063.post-62740798078557090622014-10-14T21:57:00.002+02:002014-10-17T22:25:59.732+02:00Experiment: Cache effects when scheduling Actors with F/J, Threadpool, Dedicated Threads<b>Update: I accidentally used newSingleThreadScheduledExecutor instead of newFixedThreadPool(1)</b> <b>for the "Dedicated" test case [IDE code completion ..].</b> <b>With this corrected, "Dedicated" outperforms even more.</b> <b>See the <a href="http://java-is-the-new-c.blogspot.de/2014/10/follow-up-executors-and-cache-locality.html" target="_blank">follow-up post</a> for updated results + "perf" tool cache miss measurements (they do not really change the big picture).</b><br />
<br />
The experiment in my last post had a <span style="color: black;"><b>serious flaw</b></span>: In an actor system, operations on a single actor are executed one after the other. However, by naively adding message-processing jobs to executors, private actor state was accessed concurrently, leading to false sharing and cache-coherency-related costs, especially for small local state sizes.<br />
<br />
Therefore I modified the test. For each actor, the next message-processing job is scheduled only once the previous one has finished, so the experiment correctly resembles the behaviour of typical actors (or lightweight processes/tasks/fibers) without concurrent access to a memory region.<br />
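The resubmit-on-completion scheme can be sketched as follows (simplified, with invented names; this is not the benchmark code itself, which is linked further below). Each task submits its successor only after it has finished, so two threads never touch the actor's state at the same time:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SerializedActor {
    final int[] localState = new int[1024];
    volatile int processed;

    // Schedule message n; only when it finishes is message n+1 submitted,
    // so the actor's private state is never accessed concurrently.
    void scheduleNext(ExecutorService ex, int remaining, CountDownLatch done) {
        if (remaining == 0) { done.countDown(); return; }
        ex.execute(() -> {
            for (int i = 0; i < localState.length; i++) localState[i]++; // touch local state
            processed++;
            scheduleNext(ex, remaining - 1, done);
        });
    }

    // Convenience wrapper: run n serialized messages and wait for completion.
    int run(ExecutorService ex, int n) {
        CountDownLatch done = new CountDownLatch(1);
        scheduleNext(ex, n, done);
        try { done.await(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        return processed;
    }

    public static void main(String[] args) {
        ExecutorService ex = Executors.newWorkStealingPool();
        System.out.println(new SerializedActor().run(ex, 10_000)); // prints 10000
    }
}
```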
<br />
Experiment roundup:<br />
<br />
Several million messages are scheduled onto several actor-simulating classes. Message processing is simulated by reading and writing the private, actor-local state in random order. There are more actors (24-8000) than threads (6-8). Note that any results established will also hold for other lightweight concurrency schemes like goroutines, fibers, and tasks ...<br />
<br />
The test is done with<br />
<br />
<ul>
<li>ThreadPoolExecutor</li>
<li>WorkStealingExecutor</li>
<li>Dedicated Thread (Each Actor has a fixed assignment to a worker thread)</li>
</ul>
<br />
Simulating an Actor accessing local state:<br />
<ul>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvGMxXt2tHdt2d7KBuhow4xtNX7zI_jsgVL-G2BADVpjqKPB6DFexU0cIb460J0c7wYqLz4rm2iBImza6Dj-mgaPZDcyVbxEQ-uTXGUsIdbq8CadwHs7MHN0bFlha5ElLTQPMM7P-c93M/s1600/work2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvGMxXt2tHdt2d7KBuhow4xtNX7zI_jsgVL-G2BADVpjqKPB6DFexU0cIb460J0c7wYqLz4rm2iBImza6Dj-mgaPZDcyVbxEQ-uTXGUsIdbq8CadwHs7MHN0bFlha5ElLTQPMM7P-c93M/s1600/work2.png" /></a></div>
<b><br /></b>
<b></b><br />
<a href="https://github.com/RuedigerMoeller/executortest/tree/master/src/main/java/exec2" target="_blank">Full Source of Benchmark</a><br />
<br />
<b>Suspicion</b>:<br />
<blockquote class="tr_bq">
<b><i>As ThreadPoolExecutor and WorkStealingExecutor schedule each message on a random thread, they will produce more cache misses than pinning each actor to a fixed thread. The speculation is that work stealing cannot make up for the costs provoked by cache misses.</i></b></blockquote>
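The three execution schemes can be constructed roughly like this (a sketch with assumed helper names, not the benchmark source): a shared fixed pool, a work-stealing pool, and "Dedicated", where each actor is pinned by id to one single-threaded executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SchedulingSchemes {
    // "Dedicated": the actor-to-worker assignment is a pure function of the
    // actor id, so every message of one actor runs on the same thread.
    static ExecutorService pinned(ExecutorService[] dedicated, int actorId) {
        return dedicated[actorId % dedicated.length];
    }

    public static void main(String[] args) {
        int threads = 8;

        // 1) shared pool: each message may run on any of the worker threads
        ExecutorService threadPool = Executors.newFixedThreadPool(threads);

        // 2) work stealing (backed by ForkJoinPool): idle workers steal queued messages
        ExecutorService workStealing = Executors.newWorkStealingPool(threads);

        // 3) dedicated: one single-threaded executor per worker
        //    (the corrected variant uses newFixedThreadPool(1),
        //     not newSingleThreadScheduledExecutor)
        ExecutorService[] dedicated = new ExecutorService[threads];
        for (int i = 0; i < threads; i++) dedicated[i] = Executors.newFixedThreadPool(1);

        pinned(dedicated, 17).execute(() -> System.out.println("actor 17 always runs here"));

        threadPool.shutdown();
        workStealing.shutdown();
        for (ExecutorService e : dedicated) e.shutdown();
    }
}
```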
<b><br /></b>
<b><br /></b>
<b>(Some) Variables:</b><br />
<ul>
<li>Number of worker threads</li>
<li>Number of actors</li>
<li>Amount of work per message</li>
<li>Locality / Size of private unshared actor state</li>
</ul>
<br />
<br />
<b><span style="font-size: large;">8 Threads 24 actors 100 memory accesses </span></b><b><span style="font-size: large;">(per msg)</span></b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinbTq598zWJSJsp1KT-P8Upy8wsIUBAyNnPeh3y95YPOxRgUqJMJWIZ2bRqyw380EbUnivAiHhTm_oIV_pMz1VtrRxHEFRQ1zFg0nDLQa9fVbdfruYyPmMJT_ex0RBDdurZ1dYDnj71v8/s1600/8-24-100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinbTq598zWJSJsp1KT-P8Upy8wsIUBAyNnPeh3y95YPOxRgUqJMJWIZ2bRqyw380EbUnivAiHhTm_oIV_pMz1VtrRxHEFRQ1zFg0nDLQa9fVbdfruYyPmMJT_ex0RBDdurZ1dYDnj71v8/s1600/8-24-100.png" height="242" width="640" /></a></div>
<br />
<b>Interpretation:</b><br />
<b><br /></b>
For this particular load, fixed-assigned threads outperform the executors. Note: the larger an actor's local state, the higher the probability of a prefetch failure =&gt; cache miss. In this scenario my suspicion holds true: work stealing cannot make up for the number of cache misses. Fixed-assigned threads profit because it's likely that some state of a previously processed message still resides in cache once a new message is processed on an actor.<br />
It's remarkable how badly ThreadPoolExecutor performs in this experiment.<br />
<br />
This is a scenario typical for a backend-type service: there are few actors under high load. When running a front-end server with many clients, there are probably many more actors, as typically there is one actor per client session. Therefore, let's push the number of actors up to 8000:<br />
<br />
<b><span style="font-size: large;">8 Threads 8000 actors 100 memory accesses (per msg)</span></b><br />
<b style="font-size: x-large;"><br /></b>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiiaT4sDxIxRFmxIzVh0uQVoBk2kc7OhnGMYqIcVYR6jCulCbReT3bKdv_3gO6Gtm9VDTiXSyDBUSiUlTum3J4WetFRVDZuKTH6Py9gbRUa2l5xCW0iUca7GxUdq1KBGTnuXFZ1ELQdDA/s1600/8-8000-100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiiaT4sDxIxRFmxIzVh0uQVoBk2kc7OhnGMYqIcVYR6jCulCbReT3bKdv_3gO6Gtm9VDTiXSyDBUSiUlTum3J4WetFRVDZuKTH6Py9gbRUa2l5xCW0iUca7GxUdq1KBGTnuXFZ1ELQdDA/s1600/8-8000-100.png" height="234" width="640" /></a></div>
<b style="font-size: x-large;"><br /></b>
<b>Interpretation:</b><br />
<div>
<b><br /></b></div>
<div>
With this number of actors, all execution schemes suffer from cache misses, as the accumulated state of 8000 actors is too big to fit into the L1 cache. Therefore the cache advantage of fixed-assigned threads ('Dedicated') does not make up for the lack of work stealing. The work-stealing executor outperforms every other execution scheme once a large amount of state is involved.</div>
<div>
This is a somewhat unrealistic scenario, as in a real server application client requests probably do not arrive round-robin; some clients are more active than others. So in practice I'd expect 'Dedicated' to retain at least some advantage from higher cache hit rates. Anyway: when serving many stateful clients, work stealing can be expected to outperform.</div>
<div>
<br /></div>
<div>
To get a third data point: the same test with 240 actors:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSEqB6isoUtBfm94m9BNM_7JZjaT9p8TC9E6Wlm4Nzd4tuibxGG3j_2faiuYT_2HCMLAbpN-B2DW04-8hmZ0fE-R1iZ_4Wstnv_42Er8999A2uf9f35h1kwfn4kZpT3TwUicIXATZlVnQ/s1600/8-240-100.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSEqB6isoUtBfm94m9BNM_7JZjaT9p8TC9E6Wlm4Nzd4tuibxGG3j_2faiuYT_2HCMLAbpN-B2DW04-8hmZ0fE-R1iZ_4Wstnv_42Er8999A2uf9f35h1kwfn4kZpT3TwUicIXATZlVnQ/s1600/8-240-100.png" height="238" width="640" /></a></div>
<div>
<br /></div>
<div>
These results complete the picture: with fewer actors, cache effects supersede work stealing. The higher the number of actors, the more cache misses occur, so work stealing starts to outperform dedicated threads.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<b style="font-size: x-large;">Modifying other variables</b></div>
<div>
<b style="font-size: x-large;"><br /></b></div>
<div>
<b>Number of memory accesses</b></div>
<div>
<b><br /></b></div>
<div>
If processing a message requires only a few memory accesses, work stealing improves relative to the other two schemes. The reason: fewer memory accesses mean fewer cache misses, so the scheduling gains of work stealing weigh more in the overall result.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> ************** Worker Threads:8 actors:24 #mem accesses: 20</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 WorkStealing avg:505</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 ThreadPool avg:2001</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 Dedicated avg:557</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 WorkStealing avg:471</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 </span><span style="background-color: white; color: #222222; font-size: 12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:1996</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 Dedicated avg:561</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 WorkStealing avg:589</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 </span><span style="background-color: white; color: #222222; font-size: 
12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2109</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 Dedicated avg:600</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 WorkStealing avg:625</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 </span><span style="background-color: white; color: #222222; font-size: 12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2096</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 Dedicated avg:600</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 32000 WorkStealing avg:687</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 32000 </span><span style="background-color: white; color: #222222; font-size: 12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2328</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 32000 Dedicated avg:640</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 
WorkStealing avg:667</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 </span><span style="background-color: white; color: #222222; font-size: 12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:3070</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 Dedicated avg:738</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 3200000 WorkStealing avg:1341</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 3200000 </span><span style="background-color: white; color: #222222; font-size: 12.7272720336914px;">ThreadPool</span><span style="background-color: white; color: #222222; font-size: 13px;"> avg:3997</span></span></div>
<div>
<span style="background-color: white; color: #222222; font-size: 13px;"><span style="font-family: Courier New, Courier, monospace;">local state bytes: 3200000 Dedicated avg:1428</span></span></div>
<div>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><br /></span></div>
<div>
<span style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;"><br /></span></div>
<div>
<b>Fewer worker threads</b></div>
<div>
<b><br /></b></div>
<div>
Fewer worker threads (e.g. 6) increase the probability of an actor's message being scheduled onto the "right" thread by accident, so the cache-miss penalty is lower, which lets work stealing perform better relative to "Dedicated" (the fewer threads used, the smaller the cache advantage of fixed-assigned "Dedicated" threads). Vice versa: as the number of cores involved increases, fixed thread assignment pulls ahead.</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;">Worker Threads:6 actors:18 #mem accesses: 100</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 WorkStealing avg:2073</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2498</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 64 Dedicated avg:2045</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 WorkStealing avg:1735</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2272</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 256 Dedicated avg:1815</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span 
style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 WorkStealing avg:2052</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2412</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 2000 Dedicated avg:2048</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 WorkStealing avg:2183</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2373</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 4000 Dedicated avg:2130</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 32000 WorkStealing avg:3501</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 
13px;">local state bytes: 32000 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:3204</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 32000 Dedicated avg:2822</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 WorkStealing avg:3089</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:2999</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 320000 Dedicated avg:2543</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 3200000 WorkStealing avg:6579</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 3200000 </span></span><span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 12.7272720336914px;">ThreadPool</span><span style="font-family: Courier New, Courier, 
monospace;"><span style="background-color: white; color: #222222; font-size: 13px;"> avg:6047</span><br style="background-color: white; color: #222222; font-size: 13px;" /><span style="background-color: white; color: #222222; font-size: 13px;">local state bytes: 3200000 Dedicated avg:6907</span></span></div>
<div>
<br /></div>
<div>
Machine tested:</div>
<div>
<br /></div>
<div>
(real cores, no hyper-threading)</div>
<div>
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">$ lscpu </span></div>
<div>
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Architecture: x86_64</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">CPU op-mode(s): 32-bit, 64-bit</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Byte Order: Little Endian</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">CPU(s): 12</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">On-line CPU(s) list: 0-11</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Thread(s) per core: 1</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Core(s) per socket: 6</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Socket(s): 2</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">NUMA node(s): 2</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Vendor ID: GenuineIntel</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">CPU family: 6</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Model: 44</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Stepping: 2</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">CPU MHz: 3067.058</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">BogoMIPS: 6133.20</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">Virtualization: VT-x</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">L1d cache: 32K</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">L1i cache: 32K</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">L2 cache: 256K</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">L3 cache: 12288K</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">NUMA node0 CPU(s): 1,3,5,7,9,11</span><br />
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;">NUMA node1 CPU(s): 0,2,4,6,8,10</span></div>
<div>
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;"><br /></span></div>
<div>
<span style="background-color: white; color: #222222; font-family: 'Courier New', Courier, monospace; font-size: 13px;"><br /></span></div>
<div>
<b><span style="font-size: x-large;">Conclusion</span></b></div>
<div>
<ul>
<li><span style="font-family: inherit;">Performance of executors depends heavy on use case. There are work loads where cache locality dominates, giving an advantage of up to 30% over Work-Stealing Executor</span></li>
<li><span style="font-family: inherit;">Performance of executors varies amongst different CPU types and models (L1 cache size + cost of a cache miss matter here)</span></li>
<li><span style="font-family: inherit;">WorkStealing could be viewed as the better overall solution. Especially if a lot of L1 cache misses are to be expected anyway.</span></li>
<li><span style="font-family: inherit;">The <b>ideal executor would be WorkStealing with a soft actor-to-thread affinitiy</b>. This would combine the strength of both execution schemes and would yield significant performance improvements for many workloads</span></li>
<li><span style="font-family: inherit;">Vanilla thread pools without work stealing and actor-to-thread affinity perform significantly worse and should not be used to execute lightweight processes.</span></li>
</ul>
<a href="https://github.com/RuedigerMoeller/executortest/tree/master/src/main/java/exec2" target="_blank">Source of Benchmark</a></div>
<div>
<br style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;" /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com32tag:blogger.com,1999:blog-4356402223581768063.post-3383020716718658542014-10-12T18:03:00.000+02:002014-10-14T22:05:26.174+02:00Alternatives to Executors when scheduling Tasks/Actors<br />
<br />
Executors work well when fed with short units of stateless work, e.g. dividing a computation across many CPUs. However, they are sub-optimal for scheduling jobs which are part of an ongoing, larger unit of work, e.g. the messages of an actor or lightweight process.<br />
<br />
Many actor frameworks (and similar message-passing concurrency frameworks) schedule batches of messages using an executor service. As the executor service is not context aware, this spreads the processing of a single actor's/task's messages across several threads/CPUs.<br />
<br />
This can lead to many cache misses when accessing the state of an actor/process/task, as the processing thread changes frequently.<br />
Even worse, CPU caches cannot stabilize this way, as each new "Runnable" washes out the cached memory of previously processed tasks.<br />
<br />
A second issue arises when using busy-spin polling. If a framework reads its queues using busy spin, it generates 100% CPU load for each processing thread. So one would like to add a second thread/core only if absolutely required, which makes a one-thread-per-actor policy infeasible.<br />
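A minimal sketch of such a busy-spin consume loop (illustrative only, not Kontraktor's actual implementation) shows why each worker occupies a full core even when no messages arrive:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of a busy-spin consume loop: the worker polls its mailbox in a
// tight loop, so it burns 100% of a core even while idle.
class SpinningWorker implements Runnable {
    final Queue<Runnable> mailbox = new ConcurrentLinkedQueue<>();
    volatile boolean running = true;

    public void run() {
        while (running) {
            Runnable msg = mailbox.poll();
            if (msg != null)
                msg.run();
            // a real implementation might back off when idle, trading
            // latency for CPU, e.g. LockSupport.parkNanos(1)
        }
    }
}
```

The upside of spinning is minimal wake-up latency; the downside is exactly the per-thread CPU cost described above, which motivates scaling the number of such workers carefully.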
<br />
With Kontraktor 2.0 I implemented a different scheduling mechanism which achieves horizontal scaling using a very simple metric to measure the application's actual CPU requirements, without randomly spreading an actor's processing onto different cores.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoxzjAzdJU2FIblus9kQA4EGQAbJqhycWVrTIHykN5vGeA-83ep-mailkHP7ovP5XLqxr7d21JjdKw-Z65L2rNaFRmrz3T2kmmAhpjYVICX5bxN6h3hMVOV92jmhvmtElvS9z50XRoEDM/s1600/actor+scheduling.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoxzjAzdJU2FIblus9kQA4EGQAbJqhycWVrTIHykN5vGeA-83ep-mailkHP7ovP5XLqxr7d21JjdKw-Z65L2rNaFRmrz3T2kmmAhpjYVICX5bxN6h3hMVOV92jmhvmtElvS9z50XRoEDM/s1600/actor+scheduling.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
An actor has a fixed assignment to a worker thread ("DispatcherThread"). The scheduler periodically reschedules actors by moving them to another worker if necessary.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Since overly sophisticated algorithms tend to introduce high runtime costs, actual scheduling is done in a very simple manner:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
1) A thread is considered overloaded if its consume loop (pulling messages from the actor queues) has not been idle for N messages (currently N=1000).</div>
<div class="separator" style="clear: both; text-align: left;">
2) If a thread has been marked "overloaded", the actors with the largest mailbox sizes are migrated to a newly started thread (until #Threads == ThreadMax), as long as SUM_QUEUED_MSG(actors scheduled on thread A) > SUM_QUEUED_MSG(actors scheduled on newly created thread B).</div>
<div class="separator" style="clear: both; text-align: left;">
3) In case #Threads == ThreadMax, actors are rebalanced based on the summation of queued messages and "overload" observations.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
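Step 2 could be sketched like this (class and method names are mine, not Kontraktor's): move the actors with the largest mailboxes to the fresh worker while the overloaded worker still queues more messages in total than the new one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of scale-up step 2 (names are mine, not Kontraktor's): migrate
// the actors with the largest mailboxes from an overloaded worker to a
// freshly started one, as long as the overloaded worker still has more
// queued messages in total than the new worker.
class Rebalancer {

    interface ActorRef { int mailboxSize(); }

    static void scaleUp(List<ActorRef> overloaded, List<ActorRef> fresh) {
        overloaded.sort(Comparator.comparingInt(ActorRef::mailboxSize).reversed());
        while (overloaded.size() > 1 && queued(overloaded) > queued(fresh)) {
            fresh.add(overloaded.remove(0)); // move the biggest mailbox
        }
    }

    static int queued(List<ActorRef> actors) { // SUM_QUEUED_MSG
        return actors.stream().mapToInt(ActorRef::mailboxSize).sum();
    }
}
```

Using the plain mailbox sum keeps the scheduling decision cheap, which matters because it runs periodically on a hot path; the "Problem areas" below discuss where this simple metric misleads.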
<div class="separator" style="clear: both; text-align: left;">
Problem areas: </div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<ul>
<li>If the processing time of messages varies a lot, the summation of queued messages is misleading. An improvement would be to put a weight on each message type by profiling periodically. A simpler option would be to give an actor an additional weight to be multiplied with its queue size.</li>
<li>For load bursts, there is a latency until all of the available CPU's are actually used.</li>
<li>The delay until the JIT kicks in produces bad profiling data and leads to false scale-ups (this heals over time, so it's not that bad).</li>
</ul>
<div>
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">An experiment</span></b><br />
<b><span style="font-size: large;"><br /></span><span style="color: #660000;"><span style="background-color: #990000;"><span style="background-color: white;">Update: Woke up this morning and it came to my mind this experiment has a flaw, as jobs per workitem are scheduled in parallel for the executor tests, so I am probably measuring the effects of false sharing. Results below removed, check the <a href="http://java-is-the-new-c.blogspot.de/2014/10/experiment-cache-effects-when.html" target="_blank">follow up post.</a></span></span></span></b><br />
<br />
<br />
</div>
<div>
<b><span style="font-size: large;">Performance cost of adaptive Scaling</span></b></div>
<br />
<br />
To measure the cost of auto-scaling vs. explicit actor=>thread assignment, I ran the Computing-Pi test (see previous posts).<br />
<b>These numbers do not show locality effects, as they compare explicit scheduling with automatic scheduling</b>.<br />
<br />
<b>Test 1</b> manually assigns a Thread to each Pi computing Actor,<br />
<b>Test 2</b> always starts with one worker and needs to scale up automatically once it detects actual load.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivitvmnMHNYRazr8yOJBkbwZDS_xat8qMnk3jiIeYxCe76QQjB1wtEqOVM55rE3r478VlT9TcGSw1FR7ZU0i8kwhmWJY3TjCsRf8wac9NAL0jY96wi2fdzwldQb-e-xuAZzfiwjD2Waww/s1600/kpi.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivitvmnMHNYRazr8yOJBkbwZDS_xat8qMnk3jiIeYxCe76QQjB1wtEqOVM55rE3r478VlT9TcGSw1FR7ZU0i8kwhmWJY3TjCsRf8wac9NAL0jY96wi2fdzwldQb-e-xuAZzfiwjD2Waww/s1600/kpi.png" /></a></div>
<br />
(Note: the example requires kontraktor 2.0-beta-2 or later; if the 'parkNanos' lines are uncommented, it scales up to only 2-3 threads.)<br />
<br />
The test is run 8 times, increasing <b>thread_max</b> by one with each run.<br />
<br />
results:<br />
<br />
Kontraktor Autoscale (always start with 1 thread, then scale up to N threads)<br />
<br />
1 threads : 1527<br />
2 threads : 1273<br />
3 threads : 718<br />
4 threads : 630<br />
5 threads : 521<br />
6 threads : 576<br />
7 threads : 619<br />
8 threads : 668<br />
<br />
Kontraktor with dedicated assignment of threads to Actors (see commented line in source above)<br />
<br />
1 threads : 1520<br />
2 threads : 804<br />
3 threads : 571<br />
4 threads : 459<br />
5 threads : 457<br />
6 threads : 534<br />
7 threads : 615<br />
8 threads : 659<br />
<div>
<br /></div>
<div>
<b><span style="font-size: large;"><br /></span></b></div>
<div>
<b><span style="font-size: large;">Conclusion #2</span></b></div>
<div>
<b><span style="font-size: large;"><br /></span></b></div>
<div>
Differences in runtimes can be attributed mostly to the delay in scaling up. For deterministic load problems, prearranged actor scheduling is of course more efficient. However, thinking of a server receiving varying request load over time, automatic scaling is a valid option.</div>
<div>
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com25tag:blogger.com,1999:blog-4356402223581768063.post-79509533961097556482014-09-27T01:11:00.000+02:002014-09-27T01:18:55.595+02:00Breaking the habit: Solving the Dining Philosopher's problem with ActorsConcurrent programming with actors differs from traditional mutex-based approaches, so I'd like to show how a classical concurrency problem can be solved using actors. It takes some time to get used to, as future/actor-based concurrency looks different and uses different patterns.<br />
<br />
Additionally, I'll show how little effort is required to turn this into a distributed application, as asynchronous actor-based concurrency is much more resilient to network-induced latency.<br />
<br />
I'll use my Kontraktor actor library, version 2.0 beta, to also illustrate how Kontraktor's "message == asynchronous method call" approach works out in practice.<br />
<br />
<b>Resources:</b><br />
<br />
Kontraktor lib: <a href="https://github.com/RuedigerMoeller/kontraktor" target="_blank"><b>https://github.com/RuedigerMoeller/kontraktor</b></a>.<br />
Full source of sample is <a href="https://github.com/RuedigerMoeller/kontraktor/blob/trunk/src/examples/java/org/nustaq/kontraktor/examples/Dining.java" target="_blank"><b>here</b></a>.<br />
<br />
Description of the philosophers dining problem:<br />
<a href="http://en.wikipedia.org/wiki/Dining_philosophers_problem"><b>http://en.wikipedia.org/wiki/Dining_philosophers_problem</b></a><br />
(I am using a non-fair solution to keep things simple.)<br />
<br />
<br />
<b><span style="font-size: x-large;">The Table</span></b><br />
<br />
One actor represents the table. For each fork, a list of philosophers waiting for a "fork is free" notification is kept.<br />
<br />
Short explanation: in Kontraktor, all public methods are translated behind the scenes into asynchronous messages put onto an actor's queue ("mailbox"). The actor's worker thread takes the messages off the mailbox and calls the "real" implementation, so an actor is guaranteed to execute single-threaded. To identify asynchronous calls easily, I always put a $ in front of public actor methods.<br />
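A greatly simplified illustration of the idea, using a plain single-thread executor as the mailbox (Kontraktor generates this glue automatically via proxies; this is not its actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration of "message == asynchronous method call" (not Kontraktor's
// real machinery): a public "$"-method only enqueues work into the actor's
// mailbox; a single dispatcher thread drains it, so the real implementation
// runs single-threaded and the private state needs no synchronization.
class CounterActor {
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true); // don't keep the JVM alive just for this sketch
        return t;
    });
    private int count; // only ever touched by the mailbox thread

    public void $increment() {                 // asynchronous "message"
        mailbox.execute(this::incrementImpl);
    }

    private void incrementImpl() { count++; }  // the "real" implementation

    public Future<Integer> $get() {            // async read via a future
        return mailbox.submit(() -> count);
    }
}
```

Since the mailbox preserves FIFO order and uses one thread, callers can fire off `$increment()` from any thread without locks.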
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWuJW9wZ4rjL8LzUn7N3IfonFSKTnem0VzEePLgKdgKmyxsYF4uaU04a_JXc3bOG_fX1U07alZPZd2nnyMa35m0l-1fp02ItI51bKYHhXfph8-etoyx9KRVmk8AvIbC6hviaJQ19qBKqA/s1600/Screenshot+from+2014-09-26+21:30:49.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWuJW9wZ4rjL8LzUn7N3IfonFSKTnem0VzEePLgKdgKmyxsYF4uaU04a_JXc3bOG_fX1U07alZPZd2nnyMa35m0l-1fp02ItI51bKYHhXfph8-etoyx9KRVmk8AvIbC6hviaJQ19qBKqA/s1600/Screenshot+from+2014-09-26+21:30:49.png" /></a></div>
<br />
Explanation: when a philosopher tries to take a fork and the queue of waiting philosophers is empty, a fulfilled promise is returned; otherwise an unfulfilled promise is added to the queue. The caller code then looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikuAnPmNwke8KZq2Ee6L0nHPBc8bzVRepau6NM7zeFxH7a-uO7kvOsoe7fwxMPdUrBvyFztmjebxCs4BA6YYVoCTfvpQASnwlxGutaAT7Ifz8Nxj0kQZSL1xjYNpJzxTeAk7KUqdQV8dg/s1600/Screenshot+from+2014-09-26+21:36:39.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikuAnPmNwke8KZq2Ee6L0nHPBc8bzVRepau6NM7zeFxH7a-uO7kvOsoe7fwxMPdUrBvyFztmjebxCs4BA6YYVoCTfvpQASnwlxGutaAT7Ifz8Nxj0kQZSL1xjYNpJzxTeAk7KUqdQV8dg/s1600/Screenshot+from+2014-09-26+21:36:39.png" /></a></div>
<br />
Once a fork is returned, the next promise in the ArrayList (if any) is notified. signal() is equivalent to receive("dummy", null).<br />
<br />
Note that the "result" object is unused; I just need a signal from the returned future. That's why the fulfilled promise gets a "void" dummy result object (which then triggers immediate execution of the then-closure on the caller side).<br />
<br />
Note how I can use simple single-threaded collections and data structures, as the actor is guaranteed to execute single-threaded.<br />
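A sketch of the table logic in plain Java - using CompletableFuture as a stand-in for kontraktor's Promise, and assuming everything runs on the actor's thread (which is why plain ArrayLists are safe):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Per fork: a taken-flag plus a list of waiting "promises". Single-threaded
// by assumption, exactly like the actor-backed original.
class Table {
    final boolean[] taken = new boolean[5];
    final List<List<CompletableFuture<Void>>> waiters = new ArrayList<>();

    Table() { for (int i = 0; i < 5; i++) waiters.add(new ArrayList<>()); }

    // fulfilled future if the fork is free, else an unfulfilled one that is
    // completed when the fork is handed over
    CompletableFuture<Void> getFork(int num) {
        if (!taken[num]) {
            taken[num] = true;
            return CompletableFuture.completedFuture(null); // "signal" immediately
        }
        CompletableFuture<Void> promise = new CompletableFuture<>();
        waiters.get(num).add(promise);
        return promise;
    }

    void returnFork(int num) {
        if (waiters.get(num).isEmpty()) taken[num] = false;       // fork becomes free
        else waiters.get(num).remove(0).complete(null);           // hand over to next waiter
    }
}
```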
<br />
<b><span style="font-size: x-large;">The Philosopher</span></b><br />
<br />
A philosopher has a wonderful life which can be divided into 3 states:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">10 <b>think</b> randomTime</span><br />
<span style="font-family: Courier New, Courier, monospace;">20 be_<b>hungry</b>_and_try_to_grab_2_forks</span><br />
<span style="font-family: Courier New, Courier, monospace;">30 <b>eat</b> randomTime</span><br />
<span style="font-family: Courier New, Courier, monospace;">40 <b>goto</b> 10</span><br />
<br />
[rumors are of lost lines 15 and 35 related to indecorous social activities introduced after Epicurus happened to join the table]<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWNGCJcH6sqYyGOeXgS5P_xDFEWReQfUVXr1utuFlG4PzPD2Nxwa5bRw9-JNvA9HyO9H6NosV6RGDnzyTH2r1tuus6HyCExPLwvrC9ihAISIlCWNGHTKd5vFpexwhuSNA0AFiWZ3Tg68s/s1600/Screenshot+from+2014-09-26+22:32:50.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWNGCJcH6sqYyGOeXgS5P_xDFEWReQfUVXr1utuFlG4PzPD2Nxwa5bRw9-JNvA9HyO9H6NosV6RGDnzyTH2r1tuus6HyCExPLwvrC9ihAISIlCWNGHTKd5vFpexwhuSNA0AFiWZ3Tg68s/s1600/Screenshot+from+2014-09-26+22:32:50.png" /></a></div>
<br />
The meat is in the $live method:<br />
After thinking, the philosopher tries to grab one fork; the closure given to the "then" method gets executed once the fork becomes available. Once both forks have been grabbed, another "delayed" call keeps the "eat" state for a while, then both forks are returned and a $live message is put onto the actor's mailbox (= message queue) to start thinking again.<br />
<br />
Remarks:<br />
<br />
<ul>
<li>"delayed" just schedules the given closure onto the actors mailbox with a delay.</li>
<li>"self().$live()" puts a $live message onto the actors mailbox. If "this.$live()" would be called, the $live() message would be executed directly [synchronous] like a normal method. In this example this would not hurt, however in many cases this makes a difference.</li>
<li>Wait-free: the executing thread is never blocked. One could easily simulate thousands of philosophers on a single thread.</li>
</ul>
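The state loop can be mimicked in plain Java, with a single-threaded ScheduledExecutorService playing the role of actor thread plus mailbox: schedule() stands in for "delayed", submit() for "self().$live()". Fork-grabbing is elided; the sketch only shows the message flow:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// think -> hungry -> eat -> think ... driven entirely by (delayed) messages;
// the executing thread is never blocked.
class Philosopher {
    final ScheduledExecutorService self = Executors.newSingleThreadScheduledExecutor();
    final CompletableFuture<Integer> done = new CompletableFuture<>();
    int meals;

    void live() { // "$live"
        // think for a while, then get hungry ("delayed" = schedule onto own mailbox)
        self.schedule(this::hungry, 1, TimeUnit.MILLISECONDS);
    }

    void hungry() {
        // real code: table.getFork(left).then(... table.getFork(right).then(...))
        self.schedule(this::eaten, 1, TimeUnit.MILLISECONDS); // "eat" for a while
    }

    void eaten() {
        if (++meals >= 3) { done.complete(meals); self.shutdown(); return; }
        self.submit(this::live); // "self().$live()": enqueue, don't recurse
    }
}
```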
<br />
<b><span style="font-size: x-large;">Setup and run</span></b><br />
<br />
The main method to start the life of 5 philosophers looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmtBnFcCfSDXHAQAWMnLcKn4B8ijDC161CFTozab1942b9rHlnTFRSCXxw_3mybGi-Ad4ptsY2cHcUydAARKrBkNdcMJ3tIO1sv8yIi64vPDK8wg5G_Bjm4L739AgOBw05hQN4FD_G1z8/s1600/Screenshot+from+2014-09-26+23:01:38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmtBnFcCfSDXHAQAWMnLcKn4B8ijDC161CFTozab1942b9rHlnTFRSCXxw_3mybGi-Ad4ptsY2cHcUydAARKrBkNdcMJ3tIO1sv8yIi64vPDK8wg5G_Bjm4L739AgOBw05hQN4FD_G1z8/s1600/Screenshot+from+2014-09-26+23:01:38.png" /></a></div>
and<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWgCjj2aKzTNTQ8xecBTEri2w4ulQpndwiV8yZ3aTzhEm4RJ7d84PMhifnUGz55iuZuV5i4qEBMLzNw-vIpLEf2Eo44SHFIlPGhCI0K36SgVm7m3fTGGz8y1G19_YkmIrgsTJoL5UGZuQ/s1600/Screenshot+from+2014-09-26+23:03:09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgWgCjj2aKzTNTQ8xecBTEri2w4ulQpndwiV8yZ3aTzhEm4RJ7d84PMhifnUGz55iuZuV5i4qEBMLzNw-vIpLEf2Eo44SHFIlPGhCI0K36SgVm7m3fTGGz8y1G19_YkmIrgsTJoL5UGZuQ/s1600/Screenshot+from+2014-09-26+23:03:09.png" /></a></div>
<br />
Note: Hoarde is a utility to create and manage groups of actors. "startReportingThread" just starts a thread dumping the state of all philosophers:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAGZuHNINDSwx1INjr0_zfCbDU9BGq1yr-DKSpgnVB7nk4hzvXZE-E2g0672InkKdbWFZ4S2F9QIlwLY4BuAmcvO0iEJ-6hjwpbjN7By0F63zIrgDpaXjqm6P76165WBL-5OjOdYNgP5w/s1600/Screenshot+from+2014-09-26+23:18:14.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhAGZuHNINDSwx1INjr0_zfCbDU9BGq1yr-DKSpgnVB7nk4hzvXZE-E2g0672InkKdbWFZ4S2F9QIlwLY4BuAmcvO0iEJ-6hjwpbjN7By0F63zIrgDpaXjqm6P76165WBL-5OjOdYNgP5w/s1600/Screenshot+from+2014-09-26+23:18:14.png" /></a></div>
<br />
"Is this still Java?" .. probably weird looking stuff for the untrained eye ..<br />
(note: Philosopher.$getState is async, and yield() is used to wait for all futures to complete before executing the then() closure that prints the state)<br />
<br />
BTW: spacing is somewhat inconsistent as I started to reformat the code partially during editing .. but I won't redo the screenshots ..<br />
<br />
Yay, done:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeLacSprpIAQqrIuqmmWJVRWxKPCvXwb5lj7ThKiYpRlwFFsIF6lbVekilpe-mLAWlIVEJXDdBmnaeGphHMwmxWjN81hBg2oVn_jENahj4H0rY31GFYKV_D8OPOs0dP_6nx8Tgv-fKDs0/s1600/Screenshot+from+2014-09-26+23:11:36.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeLacSprpIAQqrIuqmmWJVRWxKPCvXwb5lj7ThKiYpRlwFFsIF6lbVekilpe-mLAWlIVEJXDdBmnaeGphHMwmxWjN81hBg2oVn_jENahj4H0rY31GFYKV_D8OPOs0dP_6nx8Tgv-fKDs0/s1600/Screenshot+from+2014-09-26+23:11:36.png" /></a></div>
<br />
<br />
<b><span style="font-size: x-large;">Let's make it a distributed application</span></b><br />
<br />
The killer feature of actor-based, asynchronous, lock/wait-free programming is easy distribution. A solution using blocking mutexes would need a rewrite or would have to rely on synchronous remote calls, so the bigger the network latency, the lower the throughput. Actors do not suffer from that.<br />
We can just change the startup to make the table run remotely:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY0ZD9SDxvIANy3Jr-NpEpYbCpGWwBQH9zlRpBu55nez6ScDSecQtfz0tpqbzAqkhuhrFhbvRXdLTD-p7OlxLdngNkJDyTNfsL0JnhEItuyfZlXIUnJISjuCUHBAsOnfCwIrfBytG4Rs0/s1600/Screenshot+from+2014-09-26+23:27:55.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY0ZD9SDxvIANy3Jr-NpEpYbCpGWwBQH9zlRpBu55nez6ScDSecQtfz0tpqbzAqkhuhrFhbvRXdLTD-p7OlxLdngNkJDyTNfsL0JnhEItuyfZlXIUnJISjuCUHBAsOnfCwIrfBytG4Rs0/s1600/Screenshot+from+2014-09-26+23:27:55.png" /></a></div>
<br />
Is this cool or what? I can now feed hoardes of philosophers with a single table server:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbJrBQSDJlEfk1wVG3_A6mQ6wV-JZT71rb9kKsBwsnAl9q2JeEcaB0_snhFMQfVZFbnywnIbgA6QYrkWR2LRW-riTtfwWI4TQLJfB5DLtMWehgMQ06QWevh9BZaDoVSf6CE6yZ0AK92iw/s1600/Screenshot+from+2014-09-26+23:33:00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbJrBQSDJlEfk1wVG3_A6mQ6wV-JZT71rb9kKsBwsnAl9q2JeEcaB0_snhFMQfVZFbnywnIbgA6QYrkWR2LRW-riTtfwWI4TQLJfB5DLtMWehgMQ06QWevh9BZaDoVSf6CE6yZ0AK92iw/s1600/Screenshot+from+2014-09-26+23:33:00.png" /></a></div>
<br />
<span style="font-size: large;"><b><br /></b></span>
<br />
<br />Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com37tag:blogger.com,1999:blog-4356402223581768063.post-46150314219416390922014-08-30T16:13:00.000+02:002014-08-30T20:33:14.438+02:00Backpressure in async communication systemsI recently saw this <a href="http://www.infoq.com/presentations/reactive-steams-akka" target="_blank">excellent presentation video</a> of Mr Kuhn .. and later found out about the community standardization effort of <a href="https://github.com/reactive-streams/reactive-streams/blob/v0.3/tck/src/main/resources/spec.md" target="_blank">reactive streams</a>.<br />
<br />
As I have done many experiments dealing with backpressure issues in high-volume async distributed systems, I'd like to point out some possible pitfalls of the "flow control" mechanism proposed in the reactive streams spec, and describe how I dealt with the problem (after having tried similar solutions).<br />
<br />
The problem of flow control arises in asynchronous communication when a sender sends messages at a higher pace than (one of) the receivers can process them.<br />
This problem is well known from networking; TCP actually implements a scheme similar to the one proposed in the RStreams spec. Another - less known - flow control scheme is NAK, typically used in low-latency brokerless (multicast) messaging products.<br />
<br />
The "Reactive Streams" spec proposes an acknowledge-based protocol (although it's not presented as such).<br />
<br />
<blockquote class="tr_bq">
Receiver sends "requestMore(numberOfMsg)" to Sender<br />
Sender sends "numberOfMsg" messages to Receiver</blockquote>
<br />
By emitting "requestMore", a receiver signals it has received+processed (or will soon have finished processing) previously received messages. It's an acknowledgement (ACK) message.<br />
<br />
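A minimal sketch of the resulting sender-side logic (class and method names are mine, not from the spec):

```java
import java.util.ArrayList;
import java.util.List;

// Credit/ACK scheme: the receiver grants credits via requestMore(n);
// the sender may only emit while credits remain.
class CreditedSender {
    long credits;                                  // granted by the receiver
    final List<Integer> wire = new ArrayList<>();  // stands in for the transport

    void requestMore(long n) { credits += n; }     // the receiver's ACK signal

    // true if the message could be sent under the current credit
    boolean trySend(int msg) {
        if (credits == 0) return false;            // backpressure: wait for ACK
        credits--;
        wire.add(msg);
        return true;
    }
}
```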
<span style="font-size: large;"><b>Problematic areas:</b></span><br />
<br />
<ol>
<li>It works for point-to-point communication only. If you have a multicast situation (e.g. one sender multicasts to N receivers), this does not work well.<br />Reason:<br />To make this work, the sender needs to be aware of all receivers, wait for every receiver to have sent a "requestMore", compute the minimum over all "requestMore" messages and finally send that amount of messages. Once a receiver dies, no "requestMore" is sent/received, so the sender can't send anything for an extended amount of time (receiver timeout). If one receiver pauses (e.g. GC's), a "requestMore" is missing, so the sender needs to stall all communication.<br />You can see these effects in action with JGroups using a configuration with a credit-based flow control protocol. Once a cluster node terminates, throughput falls to zero until the node-leave timeout is reached (similar issues are observable with the ACK-based NAKACK reliable UDP transport of JGroups). ACK ftl :-)</li>
<li>The actual throughput depends heavily on connection latency. As the ACK signal ("requestMore") takes longer to reach the sender, the receiver buffer must be enlarged depending on latency, and the size of the chunks requested by "requestMore" must be enlarged as well. If one does not do this, throughput drops as latency grows. In short:<br />sizeOfBuffer needs to be >= events processable per round-trip-latency interval. This can be solved by runtime introspection, but in practice it can be tricky, especially with bursty traffic.</li>
<li>As long as reliable communication channels (in-memory, TCP) are used: those transports already implement a flow control / backpressure mechanism, so "Reactive Streams" effectively duplicates functionality present at a lower and more efficient level. It's not too hard to make use of the backpressure signals of e.g. TCP (one solution is an "outbound" queue on the sender side, then observing its size). One can achieve the same flow control results without any explicit application-level ACK/pull messages like "requestMore". For in-memory message passing, one can simply inspect the queue size of the receiver [you need a queue implementation with a fast size() implementation then, ofc].</li>
</ol>
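Point (2) as back-of-the-envelope arithmetic:

```java
// Rule of thumb from point (2): the receive buffer (and the requestMore
// chunk size) must cover at least one round trip worth of throughput,
// otherwise the sender idles while the ACK is still in flight.
class BufferSizing {
    static long minBufferSize(long eventsPerSecond, long roundTripMillis) {
        return eventsPerSecond * roundTripMillis / 1000;
    }
}
```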
<div>
<br /></div>
<span style="font-size: large;"><b>Alternative: NAK instead of ACK, adaptive rate limits</b></span><br />
<div>
<br /></div>
<div>
The idea of NAK (= negative acknowledgement) is to raise backpressure signals from the receiver side only in case a component detects overload. If receivers are not in an overload situation, no backchannel message traffic is present at all.</div>
<div>
<br /></div>
<div>
The algorithm outlined below worked for me in a high throughput distributed cluster:</div>
<div>
<br /></div>
<div>
Each sender has a send rate limit which is increased stepwise over time. Usually the starting default is a high send rate (so it's actually not limiting). Receivers observe the size of their inbound queue. Once the queue size grows past certain limits, receivers emit a NAK( QFillPercentage ) message, e.g. at 50%, 75%, 87%, 100%. The sender then lowers the send rate limit depending on the "strength" of the NAK message. To avoid a permanent degrade, the sender increases the send rate limit stepwise again (the step size may depend on the time no NAK was received).</div>
<div>
<br /></div>
<div>
This works for 1:1 async communication patterns as well as for 1:N. In a 1:N situation, the sender reacts to the highest-strength NAK message within a timing window (e.g. 50ms). The 0-traffic-on-dying-receiver problem is not present. Additionally, the sender does not need to track the number of receivers/subscribers, which is a very important property in 1:N multicast messaging.</div>
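A sketch of the sender side; threshold handling and step sizes are illustrative only (the concrete system used empirically found values):

```java
// NAK-driven adaptive rate limit on the sender. A stronger NAK (higher
// reported queue fill) throttles harder; quiet periods recover stepwise.
class NakRateLimiter {
    static final long MAX_RATE = 100_000; // start high, i.e. effectively not limiting
    static final long STEP = 1_000;
    long rateLimit = MAX_RATE;            // messages per second

    // receiver reports inbound queue fill (e.g. 50, 75, 87, 100 percent)
    void onNak(int queueFillPercent) {
        rateLimit = Math.max(1, rateLimit * (100 - queueFillPercent / 2) / 100);
    }

    // called periodically while no NAK arrives: recover the send rate stepwise
    void onQuietPeriod() { rateLimit = Math.min(MAX_RATE, rateLimit + STEP); }
}
```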
<div>
<br /></div>
<div>
Finding the right numbers for the NAK triggers and the sender speed-up steps can be tricky, as they depend on the volatility of the send rate and the latency of the transport link. However, it should be possible to adaptively find appropriate values at runtime. In practice I used empirical values found by experimenting with the given system on sample loads, which is not a viable solution for a general implementation, ofc.</div>
<div>
<br /></div>
<div>
The backpressure signal raised by the NAK messages can be applied to the sender either blocking or non-blocking. In the concrete system I just blocked the sending thread, which effectively puts a dynamically computed (current rate limit) "cost" on each async message send operation. Though this is blocking, it did not hurt performance, as there was no "other work" to do in case a receiver can't keep up. However, backpressure signals can be applied in a non-blocking fashion as well ("if (actualSendRate < sendRateLimit) { send } else { do other stuff }").</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br />
<br />
<br />
<br />
<br />
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com34tag:blogger.com,1999:blog-4356402223581768063.post-50047402054559657022014-06-05T20:24:00.000+02:002014-11-23T12:01:56.250+01:00Java Concurrency: Learning from Node.js, Scala, Dart and Go<br />
<b>Preface</b><br />
<br />
The stuff below regarding asynchronism isn't exactly new; however, in the Java world you'll still see a lot of software using synchronous, blocking multi-threading patterns, frequently built deep into the architecture of apps and frameworks. Because of the Java tradition of unifying concurrency and parallelism, frequently even senior developers are not aware of the fact that one cannot build well-scaling, high-throughput distributed systems using typical Java synchronous multi-threading idioms.<br />
<br />
<b><span style="font-size: x-large;">So what is the problem with synchronous processing ?</span></b><br />
<b><br /></b>
Think of a web server handling user requests. In this example, to process a client request, multiple requests to another remote service have to be made.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRrpjmlxUG0R-iaYcAGYKvQ5s4UwMipDqRoJkb8NiJmtKSuWtC9R1THySBr9KfIJiZLul2Eu1_jnRkDGlJonof61vT2mhyswCDgQtH2rHJKQWOiR_ksPqNjPCFEL5S2qTFtTya9EDTVx4/s1600/serverservice.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRrpjmlxUG0R-iaYcAGYKvQ5s4UwMipDqRoJkb8NiJmtKSuWtC9R1THySBr9KfIJiZLul2Eu1_jnRkDGlJonof61vT2mhyswCDgQtH2rHJKQWOiR_ksPqNjPCFEL5S2qTFtTya9EDTVx4/s1600/serverservice.png" height="220" width="320" /></a></div>
<br />
Now, the service can process 10,000 requests per second, but request latency is 5 ms (network time, de/encoding). So a simple request/response takes 5+5 ms (back and forth) = 10 ms. A single-threaded client implemented in a blocking fashion can send/receive at most 100 requests per second and is therefore unable to saturate the service.<br />
If the webserver is expected to emit on average one service request per client request, its thread pool has to be sized to 100 worker threads in order to maximize throughput and saturate the service.<br />
<br />
Now consider that the remote service makes use of another service with same throughput and latency characteristics.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnvhbjMffCPa4SI84Lsk5yl_6V7fn1Mn8ZwF195WuYTmodXRT1XKRmgSzbvU1blRvx2DDMD6r6qvRHW-TjYYYR4S__ufg_ylnlEs5Cu02ud-xGEIDBMg09ugWlrp78ekyerFdGCcoyDZk/s1600/Screenshot+from+2014-06-01+15:17:11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnvhbjMffCPa4SI84Lsk5yl_6V7fn1Mn8ZwF195WuYTmodXRT1XKRmgSzbvU1blRvx2DDMD6r6qvRHW-TjYYYR4S__ufg_ylnlEs5Cu02ud-xGEIDBMg09ugWlrp78ekyerFdGCcoyDZk/s1600/Screenshot+from+2014-06-01+15:17:11.png" height="228" width="320" /></a></div>
<br />
This increases latency (as seen from the webserver) by another 10 ms, so the overall latency of a service request/response is now 20 ms, in turn requiring a worker pool of 200 threads in the webserver and 100 service threads in service (A).<br />
<br />
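The arithmetic of the two examples above, spelled out (this is essentially Little's law applied to blocking worker threads):

```java
// With blocking calls, one thread completes 1000/RTT requests per second,
// so threads needed = target throughput / per-thread throughput.
class PoolSizing {
    static long threadsNeeded(long targetRequestsPerSecond, long roundTripMillis) {
        long perThreadPerSecond = 1000 / roundTripMillis; // 10 ms RTT -> 100 req/s
        return targetRequestsPerSecond / perThreadPerSecond;
    }
}
```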
It is clear that in a complex distributed system this imposes a lot of problems when using thread-based synchronous processing. A change in the infrastructure (= latency changes) might require reconfiguration of many servers/services.<br />
<br />
A common solution to this dilemma is to just configure way too many worker threads, in turn running into multithreading-related issues without need. Additionally, <b>a thread is quite a heavyweight object.</b> It's impossible to scale up to, say, 100,000 or more threads. The higher the latency, the more threads must be used to saturate a remote service (e.g. think of inter-datacenter connections), so if you need to process e.g. 50k requests per second and need to query a remote service for each incoming request, you simply cannot use a synchronous, blocking style. Additionally, having more threads than CPU cores hampers overall performance (context switches), so having hundreds of threads per CPU will actually reduce throughput significantly.<br />
<br />
If you think of higher numbers (e.g. 100k requests per second), it's clear that one cannot compensate for synchronous, blocking programming by adding more threads.<br />
<br />
<blockquote class="tr_bq">
<i><b>With synchronous IO programming, latency has direct impact on throughput</b></i></blockquote>
<div>
<br />
<br />
<b><span style="font-family: inherit; font-size: x-large;">No problem, we can go asynchronous, we have threads ..</span></b><br />
<br />
(note: sleep is a placeholder for some blocking IO or blocking processing)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTuAnt2RAAqeHpC-nyakauQPprsYKVW3oWRtXZIUXGSdKpWxm3FvFPP1e9L0bgu5AvvUFBBNEFbhoZm-XHZ3Y12gVckusTSJGAJo8cUcYz1Z6YNmXl9vHJRNWSiydYZgwiWO3gsR3WNfA/s1600/bang.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTuAnt2RAAqeHpC-nyakauQPprsYKVW3oWRtXZIUXGSdKpWxm3FvFPP1e9L0bgu5AvvUFBBNEFbhoZm-XHZ3Y12gVckusTSJGAJo8cUcYz1Z6YNmXl9vHJRNWSiydYZgwiWO3gsR3WNfA/s1600/bang.png" /></a></div>
Bang! Because:<br />
<ul>
<li>HashMap must be changed to ConcurrentHashMap, else you might run into an endless loop when calling "get" concurrently with a "put" operation.</li>
<li>You might or might not see the result of "someInt++" (Java Memory Model). The field needs to be declared 'volatile' for visibility, and even then "someInt++" is not an atomic operation.</li>
</ul>
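A sketch of what the "fixed" multithreaded variant ends up looking like: every piece of shared state needs a concurrent counterpart (names are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Defensive concurrency: HashMap becomes ConcurrentHashMap, the counter
// becomes an AtomicInteger (volatile alone would not make ++ atomic).
class SharedState {
    final ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
    final AtomicInteger someInt = new AtomicInteger();

    void onCallback() {
        map.merge("hits", 1, Integer::sum);
        someInt.incrementAndGet();
    }

    // hammer the callback from several threads, return the observed count
    static int hammer(int calls) {
        SharedState s = new SharedState();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < calls; i++) pool.submit(s::onCallback);
        pool.shutdown();
        try { pool.awaitTermination(10, TimeUnit.SECONDS); } catch (InterruptedException e) { }
        return s.someInt.get();
    }
}
```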
Congrats! Now most of your code is multithreaded, which means you'll clutter your code with ConcurrentHashMaps and Atomics, synchronized blocks and deadlocks ... the step-up of "callback hell" is "multithreaded callback hell".<br />
<br />
<blockquote class="tr_bq">
<i><b>With plain java, asynchronous programming provokes unnecessary multithreading.</b></i></blockquote>
<br />
A thread-safe version of this code would deliver the callback "inside" the calling thread. But for that we need to enqueue the callback, and we need some kind of threading framework which ensures the queue is read and executed.<br />
<br />
Well, one can do this in plain java like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhdzkKQay-zNaQHN4ZqXhyJAQ-oJrU8wiSc6lMW9xqogXB0UkJCjvfQCke0xPLaZ58A1mb_dG8QIiMrTbqMACT0ug26ozZ_ZfXjcbO06Cp_ghL8MEbN8wl_35Ie8Oy9lEWJ75iWLuZq5M/s1600/nodjs1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhdzkKQay-zNaQHN4ZqXhyJAQ-oJrU8wiSc6lMW9xqogXB0UkJCjvfQCke0xPLaZ58A1mb_dG8QIiMrTbqMACT0ug26ozZ_ZfXjcbO06Cp_ghL8MEbN8wl_35Ie8Oy9lEWJ75iWLuZq5M/s1600/nodjs1.png" /></a></div>
<br />
Hm, some boilerplate code here - basically a handcrafted, simple implementation of an actor. Actually it's a very simple case, as the async call does not return values. To support return values, more handcrafting would be required (e.g. a callback interface to deliver the result of the blocking code to the calling thread); additionally, you won't want to do the result callback from your one-and-only, unblockable mailbox processor thread. Tricky stuff ahead.<br />
<br />
With an actor library, this stuff is solved inside the actor framework. Same code with kontraktor:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5CNjC2xsLWlyk8sACE-ruqgBHyn7atv9SaEllh30hL5E0ccNmBq8EUnG4Pec_c2AJ9FNN-Wbtf_Dg19DuEFwQAEQd_30enN6PUjOB1o-ue0KdRgaDNbgauPo5A0Ys0KtzGeOypCuuCoY/s1600/nodjsact.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5CNjC2xsLWlyk8sACE-ruqgBHyn7atv9SaEllh30hL5E0ccNmBq8EUnG4Pec_c2AJ9FNN-Wbtf_Dg19DuEFwQAEQd_30enN6PUjOB1o-ue0KdRgaDNbgauPo5A0Ys0KtzGeOypCuuCoY/s1600/nodjsact.png" /></a></div>
<br />
(I left some comments in as well, to make it obvious that only 2-3 lines of extra code are required compared to a synchronous implementation.)<br />
<b><br /></b>
<b>Reminder: <a href="http://geekandpoke.typepad.com/.a/6a00d8341d3df553ef017c31a16a73970b-pi" target="_blank">Bad Threads</a></b><br />
<b><br /></b>Using multi threaded asynchronous processing easily leads to massive multi threading, which raises a bunch of problems:<br />
<ol>
<li>Out-of-Control Threading: its hard to tell which parts of your application data/resources are accessed concurrently as software grows (an innocent call to your application framework from a within a foreign callback thread can literally make everything multithreaded unnoticed).</li>
<li>Because of (1) developers start synchronizing defensively</li>
<li>Because of (2) you get deadlocks</li>
<li>Because of (2) you run 500 Threads all waiting on contended locks, so your service goes down despite having just 60% CPU usage.</li>
<li>Because of (1) you get spurious unreproducable errors caused by random cache coherency/data visibility issues.</li>
</ol>
Anyway, lets have a look what newer languages bring to the table regarding concurrency ..<br />
<br />
<br />
<br /></div>
<b><span style="font-size: x-large;">Concurrency Idioms/Concepts in Go, Dart, Node.js, Scala</span></b><br />
<br />
<div>
Despite the hype and discussions surrounding those emerging languages, they technically present more or less the same basic concept for dealing with concurrency (ofc there are lots of differences in detail; especially Go is different):</div>
<div>
<blockquote class="tr_bq">
<i><b>Single threaded entities (with associated thread-local state) sequentially process incoming events from bounded or unbounded queue's.</b></i></blockquote>
This technical foundation can be presented in different ways to a programmer. With actors, functionality and data are grouped 'around' the queue ('mailbox'). In the channel model, larger entities ('Process', 'Isolate') explicitly open/close 'channels' (= queues = mailboxes) to exchange messages with other processing entities.<br />
<br />
<br />
<b><span style="font-size: large;">node.js/javascript</span></b><br />
<br />
A node.js process is a single actor instance. Instead of a 'mailbox', the queue in node.js is called the "event queue". Blocking operations are externalized to the runtime; the results of such asynchronous operations are then put onto the event queue. The programming model is callback/future style; however, there are extensions providing a fiber-like ("green" threads) programming model.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9miDRZrJjxajUeUaAH8ogZ_Drq61Ks0bOAehDgK8hled5JqUK6vMiLz_FMOQHg7l16WG644mwhsjEEw8YerJ_IXRLaEFxN077IMeLE2tzOQPbwu3wI605FT6Sxvj0fGEuo2F2YNOOLzY/s1600/jsasync.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9miDRZrJjxajUeUaAH8ogZ_Drq61Ks0bOAehDgK8hled5JqUK6vMiLz_FMOQHg7l16WG644mwhsjEEw8YerJ_IXRLaEFxN077IMeLE2tzOQPbwu3wI605FT6Sxvj0fGEuo2F2YNOOLzY/s1600/jsasync.png" /></a></div>
<br />
Multicore hardware is saturated by running several instances of node.js.<br />
<br />
<b><br /></b>
<span style="font-size: large;"><b>Dart</b> </span><br />
<br />
On the server side, Dart features a more sophisticated but similar model compared to JS. In contrast to JavaScript, Dart provides a richer set of utilities for dealing with concurrency.<br />
It's possible to run several event loops called 'Isolates' within a single process or in separate processes. Conceptually, each Isolate is run by a separate thread. Isolates communicate by putting messages onto each other's event loops. Very actor'ish - in fact this can be seen as an implementation of the actor pattern.<br />
As a special twist, a Dart <strike>Actor</strike> Isolate reads from 2 event queues, of which one has higher priority and (simplified) processes self-induced asynchronous code, while the other one gets events from the 'outside' world.<br />
<br />
API-wise, Dart favours 'Futures' over 'Callbacks':<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi99v2HNbt1EtByXibpxbHTawE9JZTUeoN4-3qWyNx_axJizgKrqp9bEw-ftw6xdXxghg7Su6kJYDc5PU4IAfneXn-uNPgsHYqoIXRazDZv8NozbTXzvWsrumr-eQYf8lUw1NK2YpkGaEU/s1600/dartfut.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi99v2HNbt1EtByXibpxbHTawE9JZTUeoN4-3qWyNx_axJizgKrqp9bEw-ftw6xdXxghg7Su6kJYDc5PU4IAfneXn-uNPgsHYqoIXRazDZv8NozbTXzvWsrumr-eQYf8lUw1NK2YpkGaEU/s1600/dartfut.png" height="111" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Ordering in Dart</i></td></tr>
</tbody></table>
<br />
Ordered execution of asynchronous pieces of code can then be chained using 'then'.<br />
<br />
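Since this blog's code is Java anyway, the Java analog of Dart's Future chaining is CompletableFuture; each stage runs after the previous async step completes, giving ordered execution without blocking (the step names here are made up):

```java
import java.util.concurrent.CompletableFuture;

// Ordered asynchronous steps chained with thenApply - the Java counterpart
// of Dart's future.then(...).
class Chained {
    static String run() {
        return CompletableFuture.supplyAsync(() -> "costlyQuery")
                .thenApply(r -> r + " -> expensiveWork")
                .thenApply(r -> r + " -> lengthyComputation")
                .join(); // blocking join only to extract the demo result
    }
}
```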
For messaging among Isolates, Dart provides generic deep-copy of arbitrary object graphs in-process, and simple datatypes for communication among Isolates running in different processes.<br />
<br />
<blockquote class="tr_bq">
<i><b>Both Dart and javascript actually use a flavour of the Actor pattern, though they use a different terminology.</b></i></blockquote>
<br />
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>Scala/Akka</b></span><br />
<br />
Although Scala has access to the same concurrency primitives as Java (it runs on the JVM), Scala has a tradition of asynchronism, which is reflected by the existence of many async tools, idioms and libraries. It's a very creative, adventurous and innovative platform, which can be both good and bad in industrial-grade software projects.<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqlZ0_tgA5IH4kcuirFRpewjNClTlzGuk3TgkIiRWgnXaBb_YxJS-9kSXQXNF1O18tGVnILj0ElsgjW2FV14Q00Ie2pjHs68injOQ5rSZQse2G2cQ7eUsH9Lh9ULLrO1WuFlGTa2QDgDE/s1600/akka.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqlZ0_tgA5IH4kcuirFRpewjNClTlzGuk3TgkIiRWgnXaBb_YxJS-9kSXQXNF1O18tGVnILj0ElsgjW2FV14Q00Ie2pjHs68injOQ5rSZQse2G2cQ7eUsH9Lh9ULLrO1WuFlGTa2QDgDE/s1600/akka.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><i>Ordering in Akka</i></td></tr>
</tbody></table>
As is to be expected, Scala's Akka and async libraries are very richly featured. Akka proposes a somewhat different system-design philosophy compared to Dart/Node.js, as it encourages a more fine-grained actor design.<br />
Akka promotes an untyped actor model (though typed actors are available) and has features such as actor-induced message reordering/dropping, re-dispatching, scanning the mailbox, etc. As always: great flexibility also paves the way for abusive and hard-to-maintain programming patterns. Splitting up an application into many actors (fine-grained actor design) is advantageous in that a sophisticated scheduler may parallelize execution better. On the other hand, this comes at a high cost regarding maintainability and controllability. Additionally, enqueuing a message instead of doing a plain method call has a high fixed overhead, so it's questionable whether fine-grained actor design pays off.<br />
<br />
As Akka originally was designed for Scala, it requires quite some boilerplate when used from Java. One either has to write immutable 'message classes' or (with typed actors) has to define an interface along with each actor (because typed actors use java.*.Proxy internally).<br />
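The kind of message-class boilerplate meant here can be sketched as follows. `ComputeSlice` is a hypothetical message for illustration, not an actual Akka sample: all fields final, no setters, so the message can be handed between actor threads without further synchronization.

```java
// Hypothetical immutable 'message class' as one has to write it when
// driving an untyped actor from Java. Immutability is what makes it safe
// to enqueue the instance into another actor's mailbox.
final class ComputeSlice {
    private final int start;
    private final int nrOfElements;

    ComputeSlice(int start, int nrOfElements) {
        this.start = start;
        this.nrOfElements = nrOfElements;
    }

    int getStart() { return start; }
    int getNrOfElements() { return nrOfElements; }
}
```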
<br />
As I showed in a previous post, it's not optimized to the bone; however, this should not be significant for real-world asynchronous IO/messaging-centric applications.<br />
<br />
Recent versions improved remoting of actors and also added support for multicast-based messaging (zeroMQ).<br />
<div>
<br />
<b><span style="font-size: large;">Go</span></b></div>
<br />
If you look at typical processing sequences in an actor-based asynchronous program, you'll probably notice it's a kind of handcrafted multiplexing of concurrently executed statement sequences onto a single thread. I'll call such a sequence a 'threadlet' (Edit: the rest of the world calls them 'Monad').<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHk6NYH7dMfse18-_1jjHNY19g2YfYHesGOdi4VuVADJvfaIRll9pPHkfefr8lvw3u71Ujks2S1TbvDu7fClfX4CvOjAKxDLE5fnnlF4kD5oDFzKv4MB4dKWGyXy-aAlBeDNDhkX8bPbg/s1600/urlasync.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHk6NYH7dMfse18-_1jjHNY19g2YfYHesGOdi4VuVADJvfaIRll9pPHkfefr8lvw3u71Ujks2S1TbvDu7fClfX4CvOjAKxDLE5fnnlF4kD5oDFzKv4MB4dKWGyXy-aAlBeDNDhkX8bPbg/s1600/urlasync.png" /></a></div>
<br />
Basically myFunc models an ordered sequence of actions ('threadlet'):<br />
<br />
<ul>
<li>getUrl</li>
<li>process contents</li>
<li>return result</li>
</ul>
<br />
Since getting the URL needs some time, it is executed asynchronously, so other events can be read from the inbox.<br />
<br />
If the runtime were automatically aware of "blocking calls", it could assist by providing the illusion of single-threaded execution, automatically splitting a "threadlet" into separate events. This is what Go does:<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">go func() {<br />&nbsp;&nbsp;&nbsp;&nbsp;text := getUrl("http://google.com")<br />&nbsp;&nbsp;&nbsp;&nbsp;// process text<br />&nbsp;&nbsp;&nbsp;&nbsp;results &lt;- result // a goroutine cannot 'return'; it sends to a channel<br />}()</span></blockquote>
While in an actor-based system one builds up a "virtual stack" of a "threadlet" by chaining futures and callbacks, Go manages to transform a "normal" looking piece of code behind the scenes. For that it builds up a (dynamically sized) stack which replaces the future chaining of the actor-based approaches. It's a non-native, lightweight thread.<br />
<br />
Though this might look very different, it's actually a similar concept: multiplexing small snippets of ordered statement sequences onto a single thread. It's similar to "Green Threads" (low-overhead, software-emulated threads). If such a 'goroutine'/'threadlet' needs to wait or is logically blocked, the processing thread does not stop but processes another goroutine (in case of actors: another message from the mailbox).<br />
<br />
As the Java, JavaScript and Dart virtual machines have no concept of "coroutines" (= a runnable piece of code with associated stack and state = software thread = green thread), a more explicit syntax to emulate green threads is required for asynchronous programming in those languages.<br />
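In Java that explicit syntax means writing the "threadlet" down as a chain of stages. A minimal sketch using Java 8's CompletableFuture; `getUrl` is a hypothetical placeholder for a non-blocking fetch, simulated here with a constant:

```java
import java.util.concurrent.CompletableFuture;

// A 'threadlet' spelled out by hand: each step becomes a stage that is
// multiplexed onto a pool thread, instead of blocking a dedicated thread.
class Threadlet {
    // hypothetical stand-in for async IO; a real version would complete
    // when the response arrives
    static CompletableFuture<String> getUrl(String url) {
        return CompletableFuture.supplyAsync(
                () -> "<html>content of " + url + "</html>");
    }

    static CompletableFuture<Integer> myFunc() {
        return getUrl("http://google.com")          // step 1: fetch
                .thenApply(text -> text.length());  // step 2: process, step 3: result
    }
}
```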
<blockquote class="tr_bq">
<i><b>The rationale behind go-routine/actor/future based asynchronism is that the programmer should specify concurrency of execution. The parallelism/scheduling is handled by the runtime system.</b></i></blockquote>
The Go runtime schedules go-routines onto one or more OS threads. Blocking operations are automatically detected (e.g. OS calls); blocking operations such as waiting on messages from a channel/queue actually do not block, as the current thread just continues with executing another go-routine.<br />
So from an abstract point of view, go-routines are similar to lightweight threads with built-in awareness of non-blocking message-queues (channels).<br />
<br />
<b>Channels replace mailboxes, futures, callbacks</b><br />
<br />
An actor encapsulates a queue (mailbox), a set of messages 'understood', and private (= thread-local) data. In Go, mailboxes are replaced by queue objects called 'channels', which can be created and deleted dynamically. A go-routine might choose to open one or more channels it is listening and sending to. The size of a channel can be set to 0, which equals synchronous data transfer (actually still non-blocking, as a go-routine != a thread).<br />
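Translated to JDK primitives (a sketch of the semantics, not of how Go implements it): a buffered channel of capacity N corresponds roughly to an `ArrayBlockingQueue(N)`, a channel of size 0 to a `SynchronousQueue`, where `put()` only returns once a receiver has taken the element. The big difference remains that `put()` here parks a whole OS thread, while Go merely deschedules the goroutine.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.SynchronousQueue;

class Channels {
    static String demo() {
        try {
            // buffered channel, capacity 2: send does not block while slots are free
            BlockingQueue<String> buffered = new ArrayBlockingQueue<>(2);
            buffered.put("a");
            buffered.put("b");

            // size-0 channel: rendezvous, the sender waits for a receiver
            BlockingQueue<String> rendezvous = new SynchronousQueue<>();
            Thread receiver = new Thread(() -> {
                try { rendezvous.take(); } catch (InterruptedException ignored) {}
            });
            receiver.start();
            rendezvous.put("c");   // returns only after the receiver took it
            receiver.join();
            return buffered.take() + buffered.take();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```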
<br />
<b>Shared State</b><br />
<br />
As go-routines might run in parallel on different cores if the runtime scheduler decides so, there might be multithreaded access to shared data structures (if two go-routines are scheduled on different threads).<br />
As far as I understood, in those cases manual synchronization is required. I'd see this as a disadvantage compared to the actor model, where it is very clear whether private or shared data gets accessed. Additionally, the concept of actor-local data matches current CPU designs, as each core has a non-shared cache.<br />
The actor pattern enforces thread-data-ownership by design. Astonishingly, I did not find something along the lines of a Go memory model comparable to the JMM, which is required when access to shared data from multiple threads is possible.<br />
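As a Java illustration of why such a memory model matters: two threads incrementing shared state need an atomic (or a lock); with a plain `int` field, updates could be lost and reads could see stale values. A minimal sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedCounter {
    // Two threads increment one shared counter. AtomicInteger gives
    // atomicity; with a plain int field, increments would race.
    static int count(int perThread) {
        AtomicInteger shared = new AtomicInteger();
        Runnable inc = () -> {
            for (int i = 0; i < perThread; i++) shared.incrementAndGet();
        };
        Thread a = new Thread(inc), b = new Thread(inc);
        a.start(); b.start();
        try {
            a.join(); b.join();  // join() also establishes a happens-before edge
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return shared.get();
    }
}
```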
<br />
See also<i><b> <a href="http://blog.golang.org/concurrency-is-not-parallelism" target="_blank">Concurrency is not parallelism</a></b></i><br />
<i><br /></i>
<i><br /></i><br />
<b>Dart and Node.js have a shared-nothing architecture</b><br />
<br />
No data can be shared among processes/isolates. In order to share data, messaging must be used (= data has to be encoded and copied). That's probably a contributing factor to the rise of NoSQL storage like Redis or MongoDB: they are used both for persistence and to mimic a shared 'heap'. In order to share complex data structures, some kind of serialization has to be used. Frequently simple solutions, such as transmitting primitive types only or JSON, are used, which adds quite some overhead to data sharing.<br />
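The copy overhead can be sketched in a few lines. JDK serialization is used here for brevity; real shared-nothing systems typically use JSON or faster binary codecs, but the encode-copy-decode round trip stays:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class MessageCopy {
    // The shared-nothing way of 'sharing' an object: encode it to bytes and
    // decode a fresh, independent copy on the other side.
    static Object copy(Serializable msg) {
        try {
            ByteArrayOutputStream bout = new ByteArrayOutputStream();
            ObjectOutputStream oout = new ObjectOutputStream(bout);
            oout.writeObject(msg);
            oout.flush();
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bout.toByteArray()));
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```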
<br />
Other advantages of Shared Nothing<br />
<ul>
<li>single threaded garbage collection</li>
<li>single threaded VM + JIT</li>
<li>No memory model like JMM required</li>
</ul>
<br />
<br />
<b>Performance constraints</b><br />
<b><br /></b>
While single-threaded concurrency has many advantages in IO-centric, middleware-layer applications, it is a disadvantage in situations where several cores should be used to speed up processing. E.g. it is not possible to implement something like the Disruptor without shared memory and access to native threading/volatile/barriers.<br />
<br />
<b>Why single threaded concurrency is easier than multithreaded concurrency (true parallelism)</b><br />
<b><br /></b>Actor/CSP based concurrency still is challenging at times; however, one does not have to deal with the issues arising from the memory-visibility rules of modern CPU architectures (see JMM). So a synchronous sequence of statements inside an actor can do a simple HashMap put or increment a counter without even thinking of synchronization or concurrent modification.<br />
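A sketch of this actor-confinement idea in plain Java (no framework): a single-threaded executor plays the actor thread and exclusively owns a HashMap. Callers never touch the map directly, so the map itself needs no synchronization:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CounterActor {
    // owned by exactly one thread, therefore a plain HashMap is fine
    private final Map<String, Integer> counts = new HashMap<>();
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();

    // 'message send': enqueue work, never touch the state from the caller
    void increment(String key) {
        mailbox.submit(() -> counts.merge(key, 1, Integer::sum));
    }

    // reads also go through the actor thread, preserving ordering
    int get(String key) {
        try {
            return mailbox.submit(() -> counts.getOrDefault(key, 0)).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() { mailbox.shutdown(); }
}
```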
<br />
<br />
<span style="font-size: x-large;">Scheduling</span><br />
<br />
As stated, actor (and asynchronous programming) splits up execution into smaller pieces of ordered statement sequences. Go does a similar thing with its go-routines.<br />
A challenge to runtime-system implementations is the scheduling of ordered statement sequences ("threadlets") onto "real" OS threads.<br />
<br />
The V8 VM (Node.js) schedules a single event loop (=mailbox) + some helper threads to "outsource" blocking operations. So the user application is run on a single thread.<br />
<br />
The Dart VM provides a more refined system, having a "micro event loop" and an external event loop. Click <a href="https://www.dartlang.org/articles/event-loop/" target="_blank">here for more details</a> .<br />
<br />
Akka has a pluggable scheduling strategy: thread-pool based, fork-join (applies work-stealing), and pinned (each actor has its own thread). The queue implementation of the mailbox can be chosen; however, there is no concept of several queues. Given that the "dual queues" are just an implementation of a "message priority" concept, it is clear similar concepts can be realized with Akka.<br />
<br />
In my pet-library 'kontraktor', I use a <a href="https://raw.githubusercontent.com/RuedigerMoeller/kontraktor/master/doc/img/Reality.png" target="_blank">dual queue approach</a> similar to Dart (by chance). The programmer defines the mapping of actors to native (dispatcher) threads. This gives the best results regarding cache locality <b>if </b>done right. A downside is that there is no dynamic adaption in case an application has varying load patterns.<br />
<br />
A pure work-stealing algorithm might perform well in micro-benchmarks; in real applications the penalty of cache misses when scheduling an actor on a different thread could be much higher than micro-benchmark results suggest.<br />
<br />
In order to implement a self-optimizing dispatcher, sophisticated runtime analysis is required (e.g. count cache misses of 'threadlets', track communication patterns among different actors) as outlined <a href="http://www1.cs.columbia.edu/~aho/cs6998/reports/12-12-11_DeshpandeSponslerWeiss_GO.pdf" target="_blank">here</a>. AFAIK no VM has actually implemented such sophisticated algorithms; there is ongoing research in the Go community.<br />
<br />
<a href="http://morsmachine.dk/go-scheduler" target="_blank">Further reading on GO's scheduler.</a><br />
<br />
<br />
<span style="font-size: x-large;"><b>Do we all </b><b>have</b><b> to learn Scala, Go or Dart now ?</b></span><br />
<br />
As Java has all the low-level concurrency tools at hand, mostly a change in habit and mind-culture is required. A lot of Java APIs and components feature synchronous, blocking interfaces. Avoid them; they will become problematic in the emerging world of globally distributed systems.<br />
<br />
In fact, JVM based languages are at advantage, because they have the <b>freedom to choose</b> or even mix different concurrency models.<br />
One can implement actors and futures (non-blocking ones, not the blocking version like java.util.concurrent.Future) using the JVM's concurrency primitives. With the introduction of the shorthand syntax for anonymous classes (some call them lambdas), asynchronous code does not look as messy as with prior versions of Java.<br />
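A minimal sketch of the difference: instead of calling the blocking `Future.get()`, a lambda callback is registered and runs when the result arrives, so the calling thread never blocks on the result. (The latch here exists only to make the demo deterministic.)

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

class NonBlockingFuture {
    static String demo() {
        StringBuilder log = new StringBuilder();
        CountDownLatch done = new CountDownLatch(1);
        CompletableFuture
            .supplyAsync(() -> 21)        // async producer
            .thenApply(x -> x * 2)        // transform, still without blocking
            .thenAccept(x -> {            // lambda callback instead of blocking get()
                log.append("result=").append(x);
                done.countDown();
            });
        try { done.await(); } catch (InterruptedException ignored) {}
        return log.toString();
    }
}
```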
<br />
If I were a student, I'd go with Scala. This will also improve your general programming skills a lot and bring in a lot of interesting ideas and design patterns.<br />
If you are in the middle of a Java career and already heavily invested, it's probably better to port/copy successful design patterns found in other, newer languages.<br />
<br />
A groundbreaking improvement to Java's concurrency toolset would require VM and language improvements:<br />
<br />
<ul>
<li>There has to be a way for the programmer to express concurrent/asynchronous execution, that differs from expressing parallel execution, so the JVM gets a chance to optimize execution of concurrent tasks. Currently, <b>concurrency</b> and <b>parallelism</b> are expressed using threads. </li>
<li>Native support for continuations would be required (probably a very hard requirement, especially regarding HotSpot). There are some implementations using pure bytecode weaving out there (e.g. WebFlow) which unfortunately have pretty high performance costs.</li>
</ul>
<br />
Anyway, with recent language improvements (java 8) callback/futures style is much more manageable than before.<br />
Another solution might come from hardware/OS: if threads could be realized in a much more lightweight way (+ faster context switches), software-level concurrency by multiplexing might not be required anymore.<br />
<br />
<br />
<b><span style="font-size: x-large;">Real world experience</span></b><br />
<span style="font-size: large;"><b><br /></b></span>
We have transformed two processes of a very IO-intensive clustered application from traditional multithreading to an actor-based approach. Both of them performed an order of magnitude faster thereafter. Additionally, the code base got much more maintainable, as the amount of shared data (data being accessed multithreaded) was reduced dramatically. The number of threads declined from &gt;300 to 2 (+ some pooled helper threads).<br />
<br />
<b>What we changed:</b><br />
<ul>
<li>We did not use threads to model concurrency. Concurrency != parallelism.</li>
<li>We used non-blocking IO.</li>
<li>We used an actor approach (kontraktor). This way an ownership relation is created between an execution entity (actor) and its data structures.</li>
<li><b>We respected the short version of the "reactive manifest":</b><br /><br /><br /> <span style="font-size: large;">Never block</span></li>
</ul>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
</div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com59tag:blogger.com,1999:blog-4356402223581768063.post-38235695072301193832014-01-28T01:01:00.000+01:002014-09-23T19:57:13.692+02:00Comparison of different concurrency models: Actors, CSP, Disruptor and ThreadsIn order to make use of modern multi-core/multi-socket hardware, Java offers threads and locks for concurrent programming. It is well known that this model suffers from a number of problems:<br />
<ul>
<li>Deadlocks</li>
<li>Bad scaling due to "over-synchronization" or contention on common resources </li>
<li>Errors caused by invalid synchronization are timing-dependent and hard to reproduce. Commonly they appear only under special conditions (load, hardware) in production.</li>
<li>As a software project grows, it gets hard to safely modify the application. A programmer needs global awareness of an application's control flow and data-access patterns in order to change the software without introducing errors.</li>
<li>The subtleties of the Java Memory Model are not well known to most Java programmers. Even if the JMM is known, incorrect code isn't obvious and fails only rarely. It can be a nightmare to spot such errors in the production logs of large applications.</li>
</ul>
So I am looking for alternative models to express concurrency. Therefore I run a simple benchmark of the most popular "alternative" concurrency models: Actors, CSP and the Disruptor.<br />
<br />
<br />
<br />
<span style="font-size: large;"><b>Actors</b></span><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://blog.easi.net/dpics/gallery/2013-08-20-01-29-43_full-mailbox.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://blog.easi.net/dpics/gallery/2013-08-20-01-29-43_full-mailbox.jpg" height="150" width="200" /></a></div>
<i>Real Life Analogy: People cooperating by sending mails to each other.</i><br />
<br />
An actor is like an object instance executed by a single thread. Instead of direct calls to methods, messages are put into the actor's "mailbox" (~queue). The actor reads and processes messages from the queue sequentially, single-threaded (with exceptions).<br />
<span style="font-size: large;"><span style="font-size: small;">Internal state is exposed/shared by passing messages (with copied state) to other Actors.</span></span><br />
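A minimal hand-rolled sketch (no framework; messages simplified to Runnables, where a real actor would dispatch on message types) of exactly this: a mailbox queue drained sequentially by one dedicated thread.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MiniActor {
    private final BlockingQueue<Runnable> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;

    MiniActor() {
        worker = new Thread(() -> {
            try {
                while (true) mailbox.take().run();  // sequential, single-threaded
            } catch (InterruptedException stop) {
                // actor shut down
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // async 'message send': enqueue and return immediately, never block the caller
    void tell(Runnable msg) { mailbox.offer(msg); }
}
```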
<div>
<span style="font-size: large;"><span style="font-size: small;"><br /></span></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://blog.easi.net/dpics/gallery/2013-08-20-01-29-43_full-mailbox.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"></a></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKQgUXccljaGcmAsJxKqbt71_Xuzsrbslrhn9AK4xdxoBoUoTthKL3E8hiHmSkFkOUODOlOD6vPJ8Afk3xNZN_SMJH9YqjKgEAw4mJMqtlTEHeFF11PGPTWnYqQnG-0n-DQGMror_Ti6s/s1600/actorz.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKQgUXccljaGcmAsJxKqbt71_Xuzsrbslrhn9AK4xdxoBoUoTthKL3E8hiHmSkFkOUODOlOD6vPJ8Afk3xNZN_SMJH9YqjKgEAw4mJMqtlTEHeFF11PGPTWnYqQnG-0n-DQGMror_Ti6s/s1600/actorz.png" height="241" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><br />
<br />
<br /></td></tr>
</tbody></table>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b><br /></b></span>
<span style="font-size: large;"><b>CSP</b> (Communicating Sequential Processes)</span><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://cdn-www.cracked.com/articleimages/wong/invent/bell1.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://cdn-www.cracked.com/articleimages/wong/invent/bell1.jpg" height="112" width="200" /></a></div>
<i>Real Life Analogy: People phoning each other.</i><br />
<br />
Though a different terminology is used, CSP systems can be seen as a special actor system having bounded mailboxes (queues) of size 0.<br />
<br />
So if one process (~actor) wants to pass a message to another process, the caller is blocked until the receiver accepts the message. Alternatively to being blocked, a CSP process can choose to do other things, e.g. check for incoming messages (this introduces non-determinism into the order of outgoing messages). A receiver cannot accept incoming messages while it is occupied with processing a prior one.<br />
<br />
<br />
<br />
<br />
<br />
<span style="font-size: large;"><b>Disruptor</b></span><br />
<b></b><br />
<b></b><br />
The Disruptor data structure came up some years ago; it was invented and pioneered by <span class="st"><a href="http://mechanical-sympathy.blogspot.com/" target="_blank">Martin Thompson</a>, <a href="http://bad-concurrency.blogspot.de/" target="_blank">Mike Barker</a>, and <a href="http://www.davefarley.net/" target="_blank">Dave Farley</a> at LMAX Exchange</span>.<br />
<br />
<i>Real Life Analogy: Assembly line</i><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYCB49wRgpDkOQ0sn6e0cXqb1XVPD4x7UxT44TY7Z6lqALDMq-Tfh5vxhgyhxppQo2fMkqaoqe3c6O8tfLqnfTbbmgsQ7ALE4kFWZ6WyVBCs0qVRQXyMs6wapYvpTVCTisH8La7F5fMvQ/s1600/assembly.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYCB49wRgpDkOQ0sn6e0cXqb1XVPD4x7UxT44TY7Z6lqALDMq-Tfh5vxhgyhxppQo2fMkqaoqe3c6O8tfLqnfTbbmgsQ7ALE4kFWZ6WyVBCs0qVRQXyMs6wapYvpTVCTisH8La7F5fMvQ/s1600/assembly.png" height="386" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdfwVccc6mOHvoqG5VgDy2TSNKeocwez1YVQoV6OkzL7o3c20TWkjw_blV4qpyQF4JHekjBMzr78jJUlCoUHZVH__d129d8A3MqzNRYKivsgBHirOKWdQme7ljjJruFHrgOqQf5ZkjBmU/s1600/disruptor.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdfwVccc6mOHvoqG5VgDy2TSNKeocwez1YVQoV6OkzL7o3c20TWkjw_blV4qpyQF4JHekjBMzr78jJUlCoUHZVH__d129d8A3MqzNRYKivsgBHirOKWdQme7ljjJruFHrgOqQf5ZkjBmU/s1600/disruptor.png" height="428" width="640" /></a></div>
<br />
<br />
The Disruptor is a bounded queue (implemented by a ring buffer) where producers add to the head of the queue (if slots are available; otherwise the producer is blocked).<br />
Consumers access the queue at different points, so each consumer has its own read position (cursor) inside the queue. Sequencing is used to manage queue access and processing order.<br />
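To make the cursor idea concrete, here is a toy single-producer/single-consumer ring buffer. This is emphatically not the LMAX Disruptor (no batching, no wait strategies, no multi-consumer gating), just the core claim/publish/consume protocol with one sequence per party:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy SPSC ring buffer: each side advances its own cursor; the volatile
// write to 'published' makes the slot contents visible to the consumer.
class ToyRingBuffer {
    private final long[] slots;
    private final int mask;
    private final AtomicLong published = new AtomicLong(-1); // producer cursor
    private final AtomicLong consumed  = new AtomicLong(-1); // consumer cursor

    ToyRingBuffer(int sizePowerOfTwo) {
        slots = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    void publish(long value) {
        long seq = published.get() + 1;
        while (seq - consumed.get() > slots.length) Thread.yield(); // buffer full
        slots[(int) (seq & mask)] = value;  // write the data in place...
        published.set(seq);                 // ...then make the slot visible
    }

    long consume() {
        long seq = consumed.get() + 1;
        while (published.get() < seq) Thread.yield(); // nothing published yet
        long v = slots[(int) (seq & mask)];
        consumed.set(seq);                  // hand the slot back to the producer
        return v;
    }
}
```

Note the data stays in place: only the cursors move, which is what avoids the allocation and copying an actor mailbox would do.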
<br />
<br />
<br />
<span style="font-size: large;"><b>Thread + locking of shared data</b></span><br />
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="http://csinvesting.org/wp-content/uploads/2013/02/Crowded-Pit.jpg" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://csinvesting.org/wp-content/uploads/2013/02/Crowded-Pit.jpg" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">contention</td></tr>
</tbody></table>
Ok, everybody knows this. It's Java's default way to model concurrent execution. Actually these are the primitives (+ CAS) used to build higher-level concurrency models like those described above.<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span><br />
<br />
<br />
While the CSP/actor model realizes inter-thread communication by passing/copying thread-local data to some queue or process, the Disruptor keeps data in place and assures only one thread (consumer or producer) is the owner of a data item (= queue slot) at a time.<br />
This model seems to fit existing CPU, memory and VM architectures better, resulting in high throughput and superior performance. It avoids high allocation rates and reduces the probability of cache misses. Additionally, it implicitly balances the processing speed of interdependent consumers without queues growing somewhere.<br />
There is no silver bullet for concurrent programming; one still has to plan carefully and needs to know what's going on.<br />
<br />
While I am writing this, I realize each of these patterns has its set of best-fit problem domains, so my initial intention of using a single benchmark to compare performance of different concurrency models might be somewhat off :-).<br />
<br />
<i></i><br />
<b><br /></b><br />
<span style="font-size: large;"><b>The Benchmark Application</b></span><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<b> </b><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheaE9el5yg_eTCcxdXHYU9GhoqvPKb_m_4KJRQ-dZnqJLbax5NMppSvfyOMuBUpWwjybeqnYNAm9Ja_TSzZWYt7iEgHvhdhF3UmyLGH5aFQEf33EQzUdii4kRq6ooNeWNggpthyuHb_5U/s1600/pi.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheaE9el5yg_eTCcxdXHYU9GhoqvPKb_m_4KJRQ-dZnqJLbax5NMppSvfyOMuBUpWwjybeqnYNAm9Ja_TSzZWYt7iEgHvhdhF3UmyLGH5aFQEf33EQzUdii4kRq6ooNeWNggpthyuHb_5U/s1600/pi.png" height="226" width="400" /></a></div>
<br />
<b>Update:</b> Jason Koch implemented a custom executor performing significantly better than the stock JDK executors. See his <b><a href="http://fasterjava.blogspot.de/2014/09/writing-non-blocking-executor.html" target="_blank">blog post</a>.</b><br />
<br />
The benchmark approximates the value of PI concurrently. N jobs are created, and the result of each job must be added up to finally get an approximation of PI. For maximum performance one would create one large job per core of the CPU used. With this setup all solutions would perform roughly the same, as there is no significant concurrent access and scheduling overhead.<br />
<br />
However if we compute PI by creating 1 million jobs, each computing a 0..100-loop slice of PI, we get an impression of:<br />
<ul>
<li>cost of accessing shared data when the result of each job is added to one single result value</li>
<li>overhead created by the concurrency control mechanics (e.g. sending messages/scheduling Runnables to a thread pool, CAS, Locks, volatile r/w ..).</li>
</ul>
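Each job computes a slice of the alternating (Leibniz) series for π. A sketch of what one pi-slice job does; the sequential driver here merely stands in for whichever concurrency mechanism distributes the jobs and sums the results:

```java
class PiSlice {
    // one job: a slice of the alternating series pi = 4 * sum((-1)^k / (2k+1))
    static double calculatePiFor(int start, int nrOfElements) {
        double acc = 0.0;
        for (int i = start * nrOfElements; i < (start + 1) * nrOfElements; i++) {
            acc += 4.0 * (1 - (i % 2) * 2) / (2 * i + 1);
        }
        return acc;
    }

    // sequential driver: N jobs, each result added to one single result value.
    // In the benchmark this accumulation is what exercises shared-data access.
    static double pi(int jobs, int iterationsPerJob) {
        double pi = 0.0;
        for (int job = 0; job < jobs; job++) {
            pi += calculatePiFor(job, iterationsPerJob);
        }
        return pi;
    }
}
```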
The test is run with 2 configurations<br />
<ul>
<li>100 iterations per pi-slice-job, 1,000,000 jobs</li>
<li>1,000 iterations per pi-slice-job, 100,000 jobs</li>
</ul>
As hardware I use <br />
<ul>
<li>AMD Opteron 6274 Dual Socket 2,2 Ghz. Each socket has 8 cores + 8 Hardware Threads (so overall 16 cores + 16 HW Threads)</li>
<li>Intel XEON@3Ghz dual socket six core (12 cores + Hyperthreading turned off). (2011 release)</li>
</ul>
I never use more threads than there are "real" cores.<br />
<br />
<br />
<br />
<span style="font-size: large;">Stop the blabber, gimme results ! </span><br />
<br />
<br />
See bottom of this post for the benchmark source.<br />
<ul>
<li><b>Threads </b>- a somewhat naive thread-based implementation of the Pi computation. With enough effort invested it would of course be possible to match any of the other results. At the core of the VM, threads and locks (+ atomics, volatile r/w, CAS) are the only concurrency primitives. However, there is no point in creating an ad-hoc copy of the Disruptor or an actor system in order to compare concurrency approaches.</li>
<li><b>Akka</b> - a popular Actor implementation on the VM. The benchmark has been reviewed and revised (especially the ActorSystem configuration can make a big difference) by the Akka crew. Threads are scheduled using Java 7's fork join pool. Actually the Pi computation is one of Akka's tutorial examples. </li>
<li><b>Abstraktor</b> - my experimental Actor/CSP implementation. It's using short bounded queues (so leans more to the CSP side) and avoids deadlocks by maintaining 2 queues per Actor (in and out). If the out-queue is blocked, it just reads from the in-queue. <br />I am using Nitsan Wakarts excellent MPSC queue implementation (check his <a href="http://psy-lob-saw.blogspot.de/" target="_blank">blog</a> or github <a href="https://github.com/nitsanw/JAQ-InABox" target="_blank">jaq-in-a-box</a>) and that's the major reason it shows kind of competitive performance+scaling. <br />I use this to get a rough baseline for comparision and experiment with different flavours of Actors/CSP. Probably the only thing one can do with it is to run the Pi bench ;-).<br /><b>Update: </b>The experimental version benchmarked here has been consolidated + improved. You can find it <a href="https://github.com/RuedigerMoeller/kontraktor" target="_blank">o</a>n <a href="https://github.com/RuedigerMoeller/kontraktor" target="_blank">https://github.com/RuedigerMoeller/kontraktor</a></li>
<li><b>Disruptor</b> - my naive approach implementing the benchmark based on the Disruptor 3.2. It turned out that I used a not-yet-polished utility class, however I keep this benchmark just to illustrate how smallish differences in implementation may have big consequences.</li>
<li><b>Disruptor2 </b>- As Michael Barker would have implemented it (Thanks :-) ). Its actually more than twice as fast for the 1 million test as the closest runner up.</li>
</ul>
<b>Intel (Xeon 2 socket each 6 cores, Hyperthreading off)</b><br />
<b> </b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7TzYlqqmG5rE3RIrTYo9_7H__SwJyIuhAYeS0Y5P2biyDg3OppRO4asMfRl6WBxTxiBs42rjYIN20vAqjJUPQHmr-iERIqpiAjHsLw-vBCjKeUosIVAWY8oiRu-yaZ0BE35WLSpPF_5A/s1600/100kintel.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7TzYlqqmG5rE3RIrTYo9_7H__SwJyIuhAYeS0Y5P2biyDg3OppRO4asMfRl6WBxTxiBs42rjYIN20vAqjJUPQHmr-iERIqpiAjHsLw-vBCjKeUosIVAWY8oiRu-yaZ0BE35WLSpPF_5A/s1600/100kintel.png" /></a></div>
Well, it seems with 100k jobs of 1,000 iterations each, the benchmark is dominated by computation, not concurrency control. Therefore I retry with 1 million jobs, each computing a 100-iteration slice.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUnL6WuQiIW5PUTGW-sAPhXsTRyKGYRqS2H2OtGtZWziu7bnZEXUe7ViCE4tq8tKHMrGi2LX-OLYtQVQTu1xPqhEy4fna_v6ScCw_AXu4iqDtSLhDkVgcboWe8yRBsTW0TCj7t4oAfhQo/s1600/1mintel.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUnL6WuQiIW5PUTGW-sAPhXsTRyKGYRqS2H2OtGtZWziu7bnZEXUe7ViCE4tq8tKHMrGi2LX-OLYtQVQTu1xPqhEy4fna_v6ScCw_AXu4iqDtSLhDkVgcboWe8yRBsTW0TCj7t4oAfhQo/s1600/1mintel.png" /></a></div>
<br />
Ok, we probably can see differences here :-)<b><br /></b><br />
<br />
<br />
<b>AMD Opteron 6274 Dual Socket 2,2 Ghz (=16 real cores, 16 HW Threads)</b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbztUpo5GEW5kENOEf_Xq2jU0GBB7Sdsv2EwQsmerUqUNxrqVv90S8kwMZUNZlqEh8raDCbcxV4vFwe6AAE7V8Ys-kk_KmlPB_c-pDX_9-puVInQDnaO5t7PXFIc2dNZ5QX-d6_0UZnZk/s1600/100kopt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhbztUpo5GEW5kENOEf_Xq2jU0GBB7Sdsv2EwQsmerUqUNxrqVv90S8kwMZUNZlqEh8raDCbcxV4vFwe6AAE7V8Ys-kk_KmlPB_c-pDX_9-puVInQDnaO5t7PXFIc2dNZ5QX-d6_0UZnZk/s1600/100kopt.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgd7H_I0hHiQ5gmVR808OZt4l4mfe9fiCECliPUdi4LVgGcUIVChw2bIU45TZxodvRIdHPMYN9Zjffhuf0Rdoo_9ik9TmLiTrJtcJQAlwAAhHzO1MwG54Qhiwu-IyxVV1yd0Waz53ZHnY0/s1600/100kopteron.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: left;">
Again the 1 million tiny-job variant spreads the difference amongst the approaches (and their implementation):</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzOiz1odLlHvascWpfmzPqrUTc_BSBrXc2wUkUyhkdcVT3-i6wMDfCtVfUrXNiOBtw6RZuOlflKJU2-5_BnPWkWyFB5q7xH4hZpfmyx-s3U6-OGW-_1CPEHLvW2bG93f1xUddyJQgXFA/s1600/1mopt.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgLzOiz1odLlHvascWpfmzPqrUTc_BSBrXc2wUkUyhkdcVT3-i6wMDfCtVfUrXNiOBtw6RZuOlflKJU2-5_BnPWkWyFB5q7xH4hZpfmyx-s3U6-OGW-_1CPEHLvW2bG93f1xUddyJQgXFA/s1600/1mopt.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Note there is like 5% run to run jitter (GC and stuff), however that does not change the big picture.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Last but not least: Best results per CPU architecture per lib:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVhyphenhyphenMZeNTkzdAUPZPyqKGDMxfiXtHb5T58wrtd20n05R_5xxiGvDs6oHUYBtB3TtQkBAT2_6L-itycW7IcIMwEjP4cjjJ1DAnPdmblzf6Ji0SyhRSXxrWiJRUeP7sLXjVJzeYyWr3QuBk/s1600/overall.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVhyphenhyphenMZeNTkzdAUPZPyqKGDMxfiXtHb5T58wrtd20n05R_5xxiGvDs6oHUYBtB3TtQkBAT2_6L-itycW7IcIMwEjP4cjjJ1DAnPdmblzf6Ji0SyhRSXxrWiJRUeP7sLXjVJzeYyWr3QuBk/s1600/overall.png" height="245" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><b><span style="background-color: white;">Discussion</span> of Results</b></span></div>
<br />
<br />
We see that scaling behaviour can be significantly different depending on the hardware platform used.<br />
<br />
Akka loses a lot of CPU time doing GC and allocation. If I modify Abstraktor to use an unbounded ConcurrentLinkedQueue (Akka uses those), it performs similarly to Akka and builds up temporary queues of >100k elements. This is an inherent issue of the Actor model. Leaning towards CSP (using very short bounded blocking queues with a size <100) also hurts performance, as threads block or spin waiting for input too often. However, with a queue size of 5000 elements things work out pretty well (at the cost of introducing other problems, such as deadlocks in the case of cyclic Actor graphs).<br />
The Disruptor library is very well implemented, so a good part of its performance advantage can be attributed to the excellent quality of the implementation.<br />
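The queue-size trade-off above can be sketched in a few lines of plain Java. This is an illustrative sketch only — the class and the Runnable-based mailbox are my invention, not Abstraktor's or Akka's actual API. The point: a bounded mailbox provides back-pressure (a producer blocks once the consumer falls behind) instead of piling up a 100k-element temporary queue, while a too-small capacity makes threads block far too often.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical bounded actor mailbox. With capacity ~5000, producers rarely
// block; with an unbounded queue there would be no back-pressure at all and
// temporary queues could grow without limit (the Akka/CLQ behaviour above).
public class BoundedMailboxSketch {
    private final BlockingQueue<Runnable> mailbox;

    public BoundedMailboxSketch(int capacity) {
        this.mailbox = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: blocks when the queue is full (back-pressure).
    public void send(Runnable msg) {
        try {
            mailbox.put(msg);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Consumer side: drains and executes one message, if any.
    public boolean processOne() {
        Runnable msg = mailbox.poll();
        if (msg == null) return false;
        msg.run();
        return true;
    }
}
```

Note that this back-pressure is exactly what introduces the deadlock risk in cyclic actor graphs: two actors with full mailboxes can block each other while sending.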
<br />
A quote from the Disruptor documentation regarding queuing:<br />
<blockquote class="tr_bq">
<i>3.5 The Problems of Queues</i><br />
<i>[...]
If an in-memory queue is allowed to be unbounded then for many classes
of problem it can grow unchecked until it reaches the point of
catastrophic failure by exhausting memory. This happens when producers
outpace the consumers. Unbounded queues can be useful in systems where
the producers are guaranteed not to outpace the consumers and memory is a
precious resource, but there is always a risk if this assumption
doesn’t hold and queue grows without limit. [...]</i><br />
<i>When in
use, queues are typically always close to full or close to empty due to
the differences in pace between consumers and producers. They very
rarely operate in a balanced middle ground where the rate of production
and consumption is evenly matched. [...] </i></blockquote>
Couldn't have said it better myself =).<br />
<br />
<br />
<br />
<span style="font-size: large;"><b>Conclusion</b></span> <br />
<br />
<br />
Although the Disruptor worked best for this example, I think looking for "the concurrency model to go for" is wrong. If we look at the real world, we see all 4 patterns used, depending on the use case. <br />
So a broad concurrency library would ideally integrate the assembly-line pattern (~Disruptor), queued messaging (~Actors), and unqueued communication (~CSP).<br />
<br />
<br />
<br />
<span style="font-size: large;"><b>Benchmark source</b></span><br />
<br />
<br />
<div>
<b>AKKA</b></div>
<div>
<br /></div>
<div>
<div style="height: 400px; overflow: scroll;">
<script src="https://gist.github.com/RuedigerMoeller/8272966.js"></script>
</div>
<br />
<br /></div>
<div>
<br />
<b>Multi Threading</b><br />
<b><br /></b>
<b><br /></b></div>
<div style="height: 400px; overflow: scroll;">
<script src="https://gist.github.com/RuedigerMoeller/8273307.js"></script>
</div>
<br />
<br />
<b>Abstraktor</b><br />
<br />
<br />
<div style="height: 400px; overflow: scroll;">
<script src="https://gist.github.com/RuedigerMoeller/8273494.js"></script>
</div>
<br />
<br />
<b>Disruptor naive</b><br />
<br />
<br />
<div style="height: 400px; overflow: scroll;">
<script src="https://gist.github.com/RuedigerMoeller/8659844.js"></script>
</div>
<br />
<br />
<b>Disruptor optimized</b><br />
<br />
<br />
<div style="height: 400px; overflow: scroll;">
<script src="https://gist.github.com/RuedigerMoeller/8659877.js"></script></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com206tag:blogger.com,1999:blog-4356402223581768063.post-72163629113117314792013-12-21T02:32:00.003+01:002014-08-31T01:02:44.960+02:00Big Data the 'reactive' way<h4>
</h4>
<b>This has been posted (by me ofc) on the Java Advent Calendar originally. Check it out for more interesting articles:</b><br />
<b><br /><a href="http://www.javaadvent.com/2013/12/big-data-reactive-way.html">http://www.javaadvent.com/2013/12/big-data-reactive-way.html</a></b><br />
<div>
<b><br /></b></div>
<div>
A metatrend going on in the IT industry is a shift from query-based, batch-oriented systems to (soft) realtime-updated systems. While this is often associated with financial trading only, there are many other examples, such as "Just-In-Time" logistics systems, airlines doing realtime pricing of passenger seats based on demand and load, C2C auction systems like eBay, realtime traffic control and many more.<br />
<br />
It is likely this trend will continue, as <b><i>the (commercial) value of information is time dependent</i></b>, value decreases with age of information.<br />
<br />
Automated trading in the finance sector is just a forerunner in this area, because a few microseconds of time advantage can be worth millions of dollars. It's natural that realtime processing systems evolve faster in this domain.<br />
<br />
However, big parts of traditional IT infrastructure are not designed for reactive, event-based systems. From query-based databases to the request-response-based HTTP protocol, the common paradigm is to store and query data "when needed".<br />
<br />
<h4>
Current Databases are static and query-oriented</h4>
<div>
<br /></div>
Current approaches to data management such as SQL and NoSQL databases focus on data transactions and <b>static querying </b>of data. Databases provide convenience in slicing and dicing data, but they do not support updating complex query results in real time. Up-and-coming NoSQL databases still focus on computing a static result.<br />
Databases are clearly not "reactive".<br />
<br />
<h4>
Current Messaging Products provide poor query/filtering options</h4>
<div>
<br /></div>
Current messaging products are weak at filtering. Messages are separated into different streams (or topics), so clients can do a raw preselection on the data received. However, this frequently means a client application receives roughly 10 times more data than needed and has to do fine-grained filtering 'on top'.<br />
A big disadvantage is that the topic approach builds the filter capabilities "into" the system's data design.<br />
E.g. if a stock exchange system splits streams on a per-stock basis, a client application still needs to subscribe to all streams in order to provide a dynamically updated list of "most active" stocks. Querying usually means "replay + search the complete message history".<br />
<br />
<h3>
A scalable, "continuous query" distributed Datagrid. </h3>
<div>
<br /></div>
I had the pleasure of doing the conceptual & technical design for a large-scale realtime system, so I'd like to share a generic, scalable solution for continuous query processing at high volume and large scale.<br />
<br />
It is common for realtime processing systems to be designed "event sourced": persistence is replaced by journaling transactions. System state is kept in memory; the transaction journal is required for historic analysis and crash recovery only.<br />
Client applications do not query, but listen to event streams instead. A common issue with event-sourced systems is the problem of the "late joining client": a late client would have to replay the whole system event journal in order to get an up-to-date snapshot of the system state.<br />
In order to support late-joining clients, a kind of "Last Value Cache" (LVC) component is required. The LVC holds the current system state and allows late joiners to bootstrap by querying it.<br />
In a high-performance, large-data system, the LVC component becomes a bottleneck as the number of clients rises.<br />
<br />
<h4>
Generalizing the Last Value Cache: Continuous Queries</h4>
<br />
In a continuous query data cache, a query result is kept up to date automatically. Queries are replaced by subscriptions.<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace;"><b>subscribe </b>* from Orders where<br /> symbol in ['ALV', 'BMW'] and<br /> volume > 1000 and<br /> owner='MyCompany'</span></blockquote>
creates a message stream which initially performs a query operation and afterwards updates the result set whenever a data change affects the query result (transparently to the client application). The system ensures each subscriber receives exactly the change notifications necessary to keep its "live" query results up-to-date.<br />
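Expressed in plain Java, the subscription condition above is just a predicate over rows. This is a hypothetical sketch — the Order class and its fields are invented for illustration, not part of any real API — but it shows why such conditions can be evaluated billions of times per second: they are ordinary JIT-compiled code, with no query language to parse or interpret.

```java
import java.util.function.Predicate;

// Sketch: the "subscribe * from Orders where ..." condition as plain Java.
public class SubscriptionSketch {
    public static class Order {
        public final String symbol;
        public final int volume;
        public final String owner;

        public Order(String symbol, int volume, String owner) {
            this.symbol = symbol;
            this.volume = volume;
            this.owner = owner;
        }
    }

    // symbol in ['ALV','BMW'] and volume > 1000 and owner = 'MyCompany'
    public static final Predicate<Order> CONDITION = o ->
        (o.symbol.equals("ALV") || o.symbol.equals("BMW"))
        && o.volume > 1000
        && "MyCompany".equals(o.owner);
}
```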
<br />
<br />
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikwkUyhshTrf4RDyT-RRvHeMuljTyE-UsDmzudcki-z31iaH9BmWWRWSJ1KWQIqW8euykIdBvT7f_Ub3Qi8iO8ZUyDQdkRFv1Xag0_MT5kF7NOjqk-6iwHCSR0wtGkFwNj9OzlXo-xWnk/s1600/Real+Live.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikwkUyhshTrf4RDyT-RRvHeMuljTyE-UsDmzudcki-z31iaH9BmWWRWSJ1KWQIqW8euykIdBvT7f_Ub3Qi8iO8ZUyDQdkRFv1Xag0_MT5kF7NOjqk-6iwHCSR0wtGkFwNj9OzlXo-xWnk/s640/Real+Live.png" height="640" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="font-size: 13px; text-align: center;">A distributed continuous query system: the LVC nodes hold the data. Transactions are sent to them on a message bus (red). The LVC nodes compute the actual difference caused by a transaction and send change notifications on a message bus (blue). This enables "processing nodes" to keep a mirror of their relevant data partition up-to-date. External clients connected via TCP/Http do not listen to the message bus (because multicast is not an option in WAN). "Subscription processors" keep the clients' continuous queries up-to-date by listening to the (blue) message bus and dispatching only the required change notifications to each client's point2point connection.</td></tr>
</tbody></table>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<b><br /></b><b><br /></b><br />
<br />
<b><br /></b>
<b><br /></b><b><br /></b><b> </b><br />
<b> </b><br />
<b><br /></b><b>Difference of data access patterns compared to static data management:</b><br />
<b><br /></b>
<br />
<ul>
<li>High write volume<br />Real time systems create a high volume of write access/change in data.</li>
<li>Fewer full table scans.<br />Only late-joining clients or changes of a query's condition require a full data scan. Because continuous queries make "refreshing" a query result obsolete, Read/Write ratio is ~ 1:1 (if one counts the change notification resulting from a transaction as "Read Access").</li>
<li>The majority of load is generated, when evaluating queries of active continuous subscriptions with each change of data. Consider a transaction load of 100.000 changes per second with 10.000 active continuous queries: this requires 100.000*10.000 = <b>1 Billion evaluations of query conditions per second</b>. That's still an underestimation: When a record gets updated, it must be tested whether the record has matched a query condition before the update and whether it matches after the update. A record's update may result in an add (because it matches after the change) or a remove transaction (because the record does not match anymore after a change) to a query subscription.</li>
</ul>
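The before/after matching described in the last point can be condensed into a small decision function (an illustrative sketch, not part of any real API): a record that starts matching a subscription produces an "add" notification, one that stops matching produces a "remove", and one that matches before and after produces an "update".

```java
// Sketch: classifying a record change against one continuous query.
public class DeltaSketch {
    public enum Notification { ADD, REMOVE, UPDATE, NONE }

    // matchedBefore/matchesAfter: result of evaluating the query condition
    // on the record before and after the update.
    public static Notification classify(boolean matchedBefore, boolean matchesAfter) {
        if (!matchedBefore && matchesAfter) return Notification.ADD;
        if (matchedBefore && !matchesAfter) return Notification.REMOVE;
        if (matchedBefore && matchesAfter)  return Notification.UPDATE;
        return Notification.NONE; // change is irrelevant to this subscription
    }
}
```

This is why every update costs two condition evaluations per active subscription, as estimated above.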
<br />
<b>Data Cluster Nodes </b><b>("LastValueCache Nodes")</b><br />
<br />
Data is organized in tables, column-oriented. Each table's data is evenly partitioned amongst all data grid nodes (= last value cache nodes = "LVC nodes"). By adding data nodes to the cluster, capacity is increased and snapshot queries (which initialize a subscription) are sped up through increased concurrency.<br />
<br />
There are three basic transactions/messages processed by the data grid nodes:<br />
<br />
<ul>
<li>AddRow(table,newRow), </li>
<li>RemoveRow(table,rowId), </li>
<li>UpdateRow(table, rowId, diff). </li>
</ul>
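A minimal sketch of these three transactions applied to an in-memory table (the Table and row representations are invented for illustration; a real LVC node would use a column-oriented layout as described above). Note that UpdateRow carries only the diff, i.e. the changed columns, not the full row.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the three grid transactions applied to a toy row store.
public class GridTransactions {
    public static class Table {
        // rowId -> (column -> value)
        public final Map<Integer, Map<String, Object>> rows = new HashMap<>();
    }

    public static void addRow(Table t, int rowId, Map<String, Object> newRow) {
        t.rows.put(rowId, newRow);
    }

    public static void removeRow(Table t, int rowId) {
        t.rows.remove(rowId);
    }

    // Only the changed columns (the diff) are transmitted and applied.
    public static void updateRow(Table t, int rowId, Map<String, Object> diff) {
        t.rows.get(rowId).putAll(diff);
    }

    // Convenience helper to build a single-column row/diff.
    public static Map<String, Object> row(String col, Object val) {
        Map<String, Object> m = new HashMap<>();
        m.put(col, val);
        return m;
    }
}
```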
<br />
The data grid nodes provide a lambda-like (row iterator) interface supporting the iteration of a table's rows using plain Java code. This can be used to perform map-reduce jobs and, as a specialization, the initial query required by newly subscribing clients. Since the ongoing computation of continuous queries is done in the "Gateway" nodes, the load of the data nodes correlates only weakly with the number of clients.<br />
<br />
All transactions processed by a data grid node are (re-)broadcasted using multicast "Change Notification" messages.<br />
<br />
<h4>
Gateway Nodes</h4>
<br />
Gateway nodes track subscriptions/connections to client applications. They listen to the global stream of change notifications and check whether a change influences the result of a continuous query (= subscription). This is very CPU-intensive.<br />
<br />
Two things make this work:<br />
<br />
<ol>
<li>By using plain Java to define a query, query conditions profit from JIT compilation; there is no need to parse and interpret a query language. HotSpot is one of the best optimizing JIT compilers on the planet.</li>
<li>Since multicast is used for the stream of global changes, one can add additional Gateway nodes with ~no impact on throughput of the cluster.</li>
</ol>
<br />
<h4>
Processor (or Mutator) Nodes</h4>
<div>
<br /></div>
<div>
These nodes implement logic on top of the cluster data. E.g. a statistics processor runs a continuous query for each table, incrementally counts the number of rows of each table and writes the results back to a "statistics" table, so a monitoring client application can subscribe to realtime data of current table sizes. Another example would be a "matcher processor" in a stock exchange: it listens to orders for a stock and, if orders match, removes them and adds a Trade to the "trades" table.</div>
<div>
If one sees the whole cluster as kind of a "giant spreadsheet", processors implement the formulas of this spreadsheet.</div>
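The statistics processor mentioned above can be sketched in a few lines (an illustrative toy, with invented names; a real processor would receive the add/remove notifications from its continuous queries over the blue message bus and write the counts back to the grid):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a processor node incrementally maintaining per-table row counts
// from the add/remove change notifications of its continuous queries.
public class StatisticsProcessor {
    private final Map<String, Integer> rowCounts = new HashMap<>();

    public void onAdd(String table)    { rowCounts.merge(table, 1, Integer::sum); }
    public void onRemove(String table) { rowCounts.merge(table, -1, Integer::sum); }

    // In the real system this value would be written back to a
    // "statistics" table for monitoring clients to subscribe to.
    public int rowCount(String table)  { return rowCounts.getOrDefault(table, 0); }
}
```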
<div>
<br />
<h4>
Scaling Out</h4>
</div>
<div>
<br /></div>
<div>
<ul>
<li>Data size:<br />increase the number of LVC nodes.</li>
<li>Number of clients:<br />increase the number of subscription processor nodes.</li>
<li>Transactions per second:<br />scale up processor nodes and LVC nodes.</li>
</ul>
Of course, the system relies heavily on the availability of a "real" multicast messaging bus system. Any point-to-point-oriented or broker-oriented networking/messaging will be a massive bottleneck.<br />
<br />
<br /></div>
<h3>
</h3>
<h3>
Conclusion</h3>
<div>
<br />
Building real time processing software backed by a continuous query system simplifies application development a lot.</div>
<div>
<ul>
<li>It's model-view-controller at large scale.<br />Astonishing: patterns used in GUI applications for decades have not been regularly extended to the backing data storage systems.</li>
<li>Any server side processing can be partitioned in a natural way. A processor node creates an in-memory mirror of its data partition using continuous queries. Processing results are streamed back to the data grid. Computing intensive jobs, e.g. risk computation of derivatives can be scaled by adding processor instances subscribing to distinct partitions of the data ("sharding").</li>
<li>The size of the code base shrinks significantly (both business logic and front-end).<br />A lot of code in handcrafted systems deals with keeping data up to date.</li>
</ul>
</div>
<h4>
</h4>
<h4>
About me</h4>
I am a technical architect/senior developer consultant at a European company heavily involved in stock & derivative trading systems.<br />
<br />
<br />
<em>This post is part of the <a href="http://javaadvent.com/">Java Advent Calendar</a> and is licensed under the <a href="https://creativecommons.org/licenses/by/3.0/">Creative Commons 3.0 Attribution</a> license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!</em></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com21tag:blogger.com,1999:blog-4356402223581768063.post-53064353357169626432013-12-15T19:39:00.000+01:002014-10-05T14:14:50.098+02:00Dart - a possible solution to the high.cost@lowquality.webapp syndromeI am professionally involved in the development of high-quality realtime middleware/GUI front-end applications. Up to now, we were forced to use traditional Fat Client/Server designs, as Web Applications fail to fulfil basic usability requirements (e.g. key shortcuts) and performance requirements (high-frequency realtime updates). Basic UI features often require nasty and unmaintainable JavaScript hacks (the awkward, un-fun kind of hack) which blow up development and testing costs.<br />
<br />
Regarding user experience/performance, checkout this example of a (good!) server centric framework:<br />
<div>
<br /></div>
<div>
<a href="http://www.wicket-library.com/wicket-examples/ajax/tree/simple;jsessionid=1D907A232F857EEC0AA5D7D687D1F143?0">http://www.wicket-library.com/wicket-examples/ajax/tree/simple;jsessionid=1D907A232F857EEC0AA5D7D687D1F143?0</a></div>
<div>
(no offence, Apache-Wicket is one of the best Web-Frameworks out there IMO).</div>
<div>
<ul>
<li>its sluggish</li>
<li>can't navigate with arrow keys</li>
<li>no standard multi selection (with Ctrl, Shift etc.)</li>
</ul>
</div>
if we go down the JavaScript route, we can do better:<br />
<a href="http://www.jqwidgets.com/jquery-widgets-demo/demos/jqxgrid/index.htm?(arctic)#demos/jqxgrid/filtering.htm">http://www.jqwidgets.com/jquery-widgets-demo/demos/jqxgrid/index.htm?(arctic)#demos/jqxgrid/filtering.htm</a><br />
<br />
However (having fiddled with JS myself) it's suspicious:<b> one can't resize any of the demo tables</b>. Googling leads to the following finding:<br />
<br />
<blockquote class="tr_bq">
<i>The Grid widget works with fixed width only which is specified by its ‘width’ property. It is not possible to set its width in percentages in this version. We’ll implement the requested feature for a future version. </i></blockquote>
<blockquote class="tr_bq">
<i>Best Regards, </i><i>XXXX, DevTeam</i></blockquote>
<br />
So <b>by the end of 2013, it is still a problem to achieve basic UI functionality with current web standards and browsers</b>. Even worse, doing so fails to respect basic software engineering principles such as DRY (don't repeat yourself), encapsulation, ..<br />
(check out the "source" tab of the link above to get an impression of the mess required to create a simple table demo component).<br />
<br />
This is not the fault of the libraries mentioned above; the root cause of this mess is the underlying half-baked W3C "standards" (+ browser implementations). They are apparently defined by people never involved in real-world application development.<br />
<br />
So we sit there with an<b> intact hype, but without a productive development stack </b>to fulfil the vision. Since management tends to share the hype but does not validate technical feasibility, this has led to some prominent mis-decisions, such as Steve Jobs claiming WebApps could be used on the iPhone instead of native apps (the first iPhone release had no native APIs); another example is Facebook relying too much on HTML5 and now hastily moving back to native phone apps. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
In the following I'll explain why I think we urgently need a <b>paradigm shift </b>in Web-Applications. Without a fundamental change, the industry will keep burning money trying to build Web Applications based on expensive, crappy, half-baked technology.<br />
<br />
<h2>
In the beginning there was REST</h2>
and it sucked right away ..<br />
<br />
The use of the <b>REST</b> (<b style="background-color: white; font-family: sans-serif; font-size: 13px; line-height: 19.1875px;">Representational State Transfer) </b>design pattern results in<br />
<br />
<ul>
<li>the absence of state in a server application.<br /><b>Any durable client state gets passed with each request of a client.</b> </li>
<li>each user interaction (e.g. clicking a button) requires a complete html document to be served, transmitted and rendered. </li>
<li>no concept of a "connection" at the HTTP-protocol level, <br />so authentication and security-related work (e.g. the SSL handshake) has to be repeated with each request/response (= change of GUI state). </li>
</ul>
<br />
This allowed WebGUIs to max out on bandwidth <b>in</b>efficiency and latency. It also leads to usability nuggets by losing subtle GUI state (such as scrollbar position or the longish forum post you started to write).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Z2SlsQwVlrLLDXLFhqWd7fzhavCzby1y9Qh4jl54PkH1aP8hddvjnc9oGuwOqUgJbXLs6KJQByZdumgW8vpUPRz_pXzxVvXy23nqMYnZN6a_pI02tX7aJVtQs0sEGISMKYc0gYYfb7s/s1600/web99.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2Z2SlsQwVlrLLDXLFhqWd7fzhavCzby1y9Qh4jl54PkH1aP8hddvjnc9oGuwOqUgJbXLs6KJQByZdumgW8vpUPRz_pXzxVvXy23nqMYnZN6a_pI02tX7aJVtQs0sEGISMKYc0gYYfb7s/s640/web99.PNG" height="278" width="640" /></a></div>
<br />
<h2>
JavaScript, Ajax and "Look ma, a menu in the browser"</h2>
Innovation (& browser wars) never sleeps. Industry-leading companies and consortiums figured out that the best way to address fundamental design issues is to provide many many many workarounds on top of them.<br />
<br />
Examples:<br />
<ul>
<li>Http Keep alive to reduce latency</li>
<li>Standardize on an untyped, weirdish scripting language, <br />which can be cluttered all across the html page in tiny snippets. Ah .. there is nothing like mixing program, content and styling into one big code soup ..</li>
<li>Content Caching (man, we had fun doing software updates ..) <br />to avoid superfluous reloads of static content</li>
<li>Define new Html Tags. A lot of them. <br />The advantage of having many html tags is that browsers have more room to implement them differently, allowing for the popular "if ( ie6 ) {...........} else if (mozilla) .. else if (safari) .." coding style. This way, bandwidth requirements could be extended even more.</li>
<li>CSS was embraced, <br />so when done with the definition of new tags, they could continue with the definition of new css attributes. Also, javascript could mess things up even better by modifying the HTML DOM and CSS definitions at runtime.</li>
<li>Async Http Requests (AJAX) <br />avoid the need to reload the whole html page.</li>
</ul>
These are kludges (some of them with notable negative side effects) addressing fundamental issues of an overly simplistic foundation technology. So we got what I call the MESS technology stack:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLugG_gSJhtzlcypPetAThHV5c7Yi5AbnG1NfgmovAURCo9JZHvRxjlf8JAlRg3Torx7BQVFMLsgvGA63iVSOra2YTi-6lEg5ASTAGNVxZWN7YDRXMbGJb2_ZbJJRgZc-QCN7KomL3zNI/s1600/web03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLugG_gSJhtzlcypPetAThHV5c7Yi5AbnG1NfgmovAURCo9JZHvRxjlf8JAlRg3Torx7BQVFMLsgvGA63iVSOra2YTi-6lEg5ASTAGNVxZWN7YDRXMbGJb2_ZbJJRgZc-QCN7KomL3zNI/s1600/web03.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The MESS principles also infiltrated server-side frameworks, as they had to deal with the MESS somehow. I'd speculate some of those frameworks are 90% "mess handlers", designed to virtualize a solidly engineered foundation in order to increase productivity when messing with the MESS.</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<h2>
Google to the rescue !!</h2>
<div>
One can safely assume Apple and Microsoft are not *that* interested in providing a fluid web application platform. If all applications were WebApplications, the importance of the underlying operating system would dwindle. All a user needs is a compliant browser (hello Chrome OS).</div>
<div>
<br /></div>
<div>
Google, as a major web application implementor/provider, suffers itself from the high.cost@low-quality.webapp syndrome.</div>
<div>
<br />
Some clever JavaScript hackers ('hacker' meant as a compliment here) started to figure out a simplified, more flexible web application design. They basically implemented a "Fat Client" in JavaScript. Instead of doing page-reloading Http requests, they used Async Http to retrieve data from the server and rendered this data by modifying parts of the displayed html document (via the DOM). Whenever you see a responsive 'cool' web application, you can safely assume it relies on this approach.<br />
<br />
However, JavaScript is not designed to work well for large business applications with many lines of code and the staff fluctuation typical of enterprise software development. Such a project becomes either a maintenance nightmare (<b>if </b>it succeeds at all) or a minimalistic solution hated by end-users.<br />
<br /></div>
<div>
The first widely known step in backing up the hype with a structured technology was the Google Web Toolkit (GWT), which relied on compiling ("transpiling") Java to browser-compatible JavaScript.<br />
For various reasons, this project was stalled (but not dropped) by Google in favour of their new Dart language before the technology reached a 'mature' state. It has been reported that turnaround times become unacceptable with larger applications, and it's unlikely this will change (because resources have been moved to Dart). Remembering Oracle's lawsuit against the use of Java in Android, this move by Google is not too surprising.<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif_YmIGq-YnozKFzKGVLXmxHD-dxjaax7l9XCrOTsm8afKzX6XibzmCXMX9afftjQe-kBW5XacA3m6B0jDXwNsR4FRyovHD6JwaUJljQGtQn8O9L73mDQ4Wu7621LLPqbofDGDYSsrRCU/s1600/webfuture1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEif_YmIGq-YnozKFzKGVLXmxHD-dxjaax7l9XCrOTsm8afKzX6XibzmCXMX9afftjQe-kBW5XacA3m6B0jDXwNsR4FRyovHD6JwaUJljQGtQn8O9L73mDQ4Wu7621LLPqbofDGDYSsrRCU/s1600/webfuture1.PNG" /></a></div>
Dart has the potential to reach a level of convenience which has been there for native apps for years:<br />
<br />
<ul>
<li>extensive API to the browser. Anything achievable with JavaScript can be done with Dart.</li>
<li>a modern, structured language with (optional) compile-time type checking, a package concept and support for many well-known principles of object-oriented programming like classes, interfaces (mixins), inheritance and lambdas. The concurrency model of Dart (Isolates, similar to Actors) could also be interesting on the server side.</li>
<li>compilable to JavaScript</li>
<li>support for fully encapsulated tag-based UI components (via polymer lib and upcoming WebComponents w3c standard)</li>
<li>backed by a company providing the worlds leading browser and a believable self-dependency on the technology.</li>
</ul>
<br />
<h2>
Conclusion</h2>
<div>
Mankind is not only capable of building chips with 800 million transistors, it also might be capable of creating standards defined by 800 million words .. scary ;-) .</div>
<div>
<br /></div>
<div>
<h2>
What about Java ?</h2>
</div>
<div>
Update: Bck2Brwsr and TeaVM are projects based on Java=>JS transpiling.<br />
<br />
While the Scala guys join the party late <a href="http://www.scala-lang.org/news/2013/11/29/announcing-scala-js-v0.1.html">http://www.scala-lang.org/news/2013/11/29/announcing-scala-js-v0.1.html</a>, Oracle misses the train and seems to favour JavaFX as a GUI platform (my prediction: it will fall asleep prior to reaching maturity).</div>
<div>
GWT has been open sourced, so maybe the community will continue its development. Given that there are more open source projects than developers willing to contribute, I doubt it will be able to keep pace with Google's money-backed Dart. Also, I could imagine the (aging) Java community shies away from (partially) deprecating their habitual server-side framework landscape.</div>
<div>
<br /></div>
<div>
There is also <a href="http://qooxdoo.org/contrib/project/qwt/about/java">http://qooxdoo.org/contrib/project/qwt/about/java</a>, backed by a bigger German hosting company. I have not tested it, but I doubt that purely community-driven development will keep pace. Being compliant with ever-changing draft specs and dealing with browser implementation variance is not fun ...</div>
<div>
<br /></div>
<div>
What's required is a bytecode-to-JavaScript transpiler. </div>
<div>
The browser API (DOM access) could be modeled by a set of interfaces at compile time and replaced by the transpiler with the appropriate JavaScript calls. </div>
<div>
Using static bytecode analysis, or by tracking required classes at runtime, the set of required transpiled classes/methods could be determined. Additionally, one would need a fall-back transpiler at runtime: if a client-side script instantiates an "un-transpiled" Java class or method, it could be transpiled on the fly at server side and loaded into the client.</div>
<div>
A nice thing about such a bytecode-level transpiler would be that all VM-based languages (Groovy, Scala, JRuby, Jython, ..) could profit from it. </div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioiKP_qdTSFvwitcJoLh7IlS_5eJ-BI7Si5GKN_AZ9Oy5Rgjb_DDiQIu8pnO76gch1hbUJL3MOVT0JS1qFWuDem9IKgZREdlv1K6BQ-OfGxtsugekYwoBk9Bf7Z7ZQhA3hBZafjEWX0iU/s1600/javatrans.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioiKP_qdTSFvwitcJoLh7IlS_5eJ-BI7Si5GKN_AZ9Oy5Rgjb_DDiQIu8pnO76gch1hbUJL3MOVT0JS1qFWuDem9IKgZREdlv1K6BQ-OfGxtsugekYwoBk9Bf7Z7ZQhA3hBZafjEWX0iU/s640/javatrans.PNG" height="278" width="640" /></a></div>
<i>Example of a possible Java-to-JavaScript transpiling solution: since Java is a widespread server platform, a java2js transpiler could work dynamically, which is a major strategic advantage over Dart, which needs to use static analysis & compilation because most server systems are not running native Dart.</i><br />
<br /></div>
<div>
Real money needs to be put into such a Java solution; with the slow-moving JCP we probably can expect something like this in 2016 or later ...<br />
It would be fun building such a transpiler system .. but no time - gotta mess with the MESS.<br />
<br />
<b>Update</b>: The Java EE 8 effort seems to adopt the new 'thin' webapp architecture with project "<b>Avatar</b>". I still haven't figured out the details. At first glance it looks like they bring JS to the server instead of bringing Java transpilation to the client .. which is a mad idea imho :-).<br />
<br />
<b>Update on GWT: </b>A Google engineer responded regarding GWT, telling me the project is alive and in active development. It turns out Dart and GWT are handled by distinct (competing?) divisions of Google. There still is evidence that Dart receives the real money while GWT is handed over to the community. Google plans to bring a native DartVM to Chrome within ~1 year (rumours) and also plans on delivering the DartVM to Android.</div>
<div>
<br /></div>
<div>
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com31tag:blogger.com,1999:blog-4356402223581768063.post-13737424586575489892013-10-12T00:24:00.000+02:002014-03-07T14:53:28.636+01:00Still using Externalizable to get faster Serialization with Java ?<br />
<b>Update: </b>When turning off detection of cycles, performance can be improved even further (FSTConfiguration.setUnshared(true)). However in this mode an object referenced twice is written twice.<br />
<br />
According to popular belief, a handcrafted implementation of the Externalizable interface is the only way to get fast Java object serialization.<br />
<br />
Manually coding "readExternal" and "writeExternal" methods is an error-prone and boring task. Additionally, with each change to a class's fields, the Externalizable methods need to be adapted.<br />
<br />
Contrary to popular belief, a good generic serialization implementation can be faster than a handcrafted Externalizable implementation 99%, if not 100%, of the time.<br />
<br />
Fortunately I managed to <b>save the world</b> with the <b><a href="https://github.com/RuedigerMoeller/fast-serialization" target="_blank">fast-serialization library</a></b> by addressing the disadvantages of Serialization vs Externalizable.<br />
<br />
<b>The Benchmark</b><br />
<br />
The following class will be benchmarked.<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;">public class BlogBench implements Serializable {</span><br />
<span style="font-size: x-small;"> public BlogBench(int index) {</span><br />
<span style="font-size: x-small;"> // avoid benchmarking identity references instead of StringPerf</span><br />
<span style="font-size: x-small;"> str = "Some Value "+index;</span><br />
<span style="font-size: x-small;"> str1 = "Very Other Value "+index;</span><br />
<span style="font-size: x-small;"> switch (index%3) {</span><br />
<span style="font-size: x-small;"> case 0: str2 = "Default Value"; break;</span><br />
<span style="font-size: x-small;"> case 1: str2 = "Other Default Value"; break;</span><br />
<span style="font-size: x-small;"> case 2: str2 = "Non-Default Value "+index; break;</span><br />
<span style="font-size: x-small;"> }</span><br />
<span style="font-size: x-small;"> }</span><br />
<span style="font-size: x-small;"> private String str;</span><br />
<span style="font-size: x-small;"> private String str1;</span><br />
<span style="font-size: x-small;"> private String str2;</span><br />
<span style="font-size: x-small;"> private boolean b0 = true;</span><br />
<span style="font-size: x-small;"> private boolean b1 = false;</span><br />
<span style="font-size: x-small;"> private boolean b2 = true;</span><br />
<span style="font-size: x-small;"> private int test1 = 123456;</span><br />
<span style="font-size: x-small;"> private int test2 = 234234;</span><br />
<span style="font-size: x-small;"> private int test3 = 456456;</span><br />
<span style="font-size: x-small;"> private int test4 = -234234344;</span><br />
<span style="font-size: x-small;"> private int test5 = -1;</span><br />
<span style="font-size: x-small;"> private int test6 = 0;</span><br />
<span style="font-size: x-small;"> private long l1 = -38457359987788345l;</span><br />
<span style="font-size: x-small;"> private long l2 = 0l;</span><br />
<span style="font-size: x-small;"> private double d = 122.33;</span><br />
<span style="font-size: x-small;">}</span></blockquote>
To benchmark Externalizable, a copy of the class above is made, but with an Externalizable implementation<br />
(source is <a href="http://code.google.com/p/fast-serialization/source/browse/trunk/src/test/java/de/ruedigermoeller/serialization/testclasses/blog/BlogBenchExternalizable.java" target="_blank">here</a>).<br />
The main loop for fast-serialization is identical; I just replace "ObjectOutputStream" with "FSTObjectOutput" and "ObjectInputStream" with "FSTObjectInput".
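For illustration, the JDK side of that loop can be sketched as below. This is a minimal, self-contained version using only java.io; class and field names are invented here, the real benchmark source is linked at the end of the post. For the FST variant, only the two stream classes are swapped as described.

```java
import java.io.*;

public class SerLoopSketch {
    // Simplified stand-in for the benchmarked class (names invented).
    static class Payload implements Serializable {
        int test1 = 123456;
        String str = "Some Value";
    }

    // Round-trips one object through JDK serialization and returns the copy.
    // The FST variant differs only in the stream classes:
    // new FSTObjectOutput(bout) / new FSTObjectInput(bin).
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bout)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bout.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        int runs = 20_000; // includes warmup; a real benchmark separates these
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            roundTrip(new Payload());
        }
        System.out.println("avg ns per round trip: " + (System.nanoTime() - start) / runs);
    }
}
```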
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7whAMAV488zg3im1s1dSQ7JgOU_26psqRwZEuMJz-ICinUgECIo6YLwI5A0VhqsqTaMDX5pg8-YpiG5OP92cgRI1kcVP7Hms-4_XpS3Rcw_jvh0fZjftaZn2tscO_ZbqrUF3doZsFFZY/s1600/dropin-perf.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7whAMAV488zg3im1s1dSQ7JgOU_26psqRwZEuMJz-ICinUgECIo6YLwI5A0VhqsqTaMDX5pg8-YpiG5OP92cgRI1kcVP7Hms-4_XpS3Rcw_jvh0fZjftaZn2tscO_ZbqrUF3doZsFFZY/s640/dropin-perf.PNG" height="240" width="640" /></a></div>
<b>Result</b>:<br />
<br />
<ul>
<li>Externalizable performance with JDK-Serialization is much better compared to Serializable</li>
<li>FST manages to serialize faster than a manually written "read/writeExternal" implementation</li>
</ul>
<br />
Size is 319 bytes for JDK Serializable, 205 for JDK Externalizable, and 160 for FST serialization. A pretty big gain for a search/replace operation vs. handcrafted coding ;-). BTW, if the "Externalizable" class is serialized with FST, it is still slightly slower than letting FST do generic serialization.<br />
<br />
<b>There is still room for improvement ..</b><br />
<br />
The test class is rather small, so setup + allocation of the input and output streams make up a significant part of the measured times. Fortunately FST provides mechanisms to reuse both FSTObjectInput and FSTObjectOutput. This yields ~200ns better read and write times.<br />
<br />
So "new FSTObjectOutput(outputStream)" is replaced with<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;">FSTConfiguration fstConf = FSTConfiguration.getDefaultConfiguration();<br />...<br />fstConf.getObjectOutput(bout)</span></blockquote>
<b>There is even more improvement ..</b><br />
<br />
Since Externalizable does not track references, and reference tracking is not required for the test class, we turn it off for our sample by using the @Flat annotation. We can also make use of the fact that "str2" is most likely to contain a default value ..<br />
<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;"><b>@Flat</b><br />public class BlogBenchAnnotated implements Serializable {<br /> public BlogBenchAnnotated(int index) {<br /> // avoid benchmarking identity references instead of StringPerf<br /> str = "Some Value "+index;<br /> str1 = "Very Other Value "+index;<br /> switch (index%3) {<br /> case 0: str2 = "Default Value"; break;<br /> case 1: str2 = "Other Default Value"; break;<br /> case 2: str2 = "Non-Default Value "+index; break;<br /> }<br /> }<br /> <b>@Flat</b> private String str;<br /> <b>@Flat</b> private String str1;<br /> <b>@OneOf</b>({"Default Value","Other Default Value"})<br /> <b>@Flat</b> private String str2;</span></blockquote>
<br />
<br />
<b>and another one ..</b><br />
<b><br /></b>
To be able to instantiate the correct class at read time, the class name must be transmitted. However, in many cases both reader and writer know (at least most of) the serialized classes at compile time. FST provides the possibility to register classes in advance, so only a number instead of a full class name is transmitted.<br />
<blockquote class="tr_bq">
<span style="font-size: x-small;">FSTConfiguration.getDefaultConfiguration().registerClass(BlogBenchAnnotated.class);</span></blockquote>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvsDVHKdnCCfzpk7QShLtFr_BqfHAOMPtlqreGoziHGuYlUlhAUOu8fr4eYKZWFBClD5ZNXzIk0XYi1bVZ0u0uz_Zgj4AJd3OfNqY00AhJpSewDiVXzL4paVMmGv2ZqDeH41XRjN5WrHg/s1600/fstopt.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgvsDVHKdnCCfzpk7QShLtFr_BqfHAOMPtlqreGoziHGuYlUlhAUOu8fr4eYKZWFBClD5ZNXzIk0XYi1bVZ0u0uz_Zgj4AJd3OfNqY00AhJpSewDiVXzL4paVMmGv2ZqDeH41XRjN5WrHg/s1600/fstopt.PNG" /></a></div>
<br />
<b>What's "Bulk"</b><br />
<b><br /></b>
Setup/reuse of streams actually requires some nanoseconds, so when benchmarking just read/write of tiny objects, a good part of the per-object time is stream init. If an array of 10 benchmark objects is written, time drops below 300ns per object read/write.<br />
Frequently an application will write more than one object into a single stream. For RPC-encoding applications, a kind of "batching", or simply writing into the same stream and calling "flush" after each object, achieves sub-300ns times in the real world. Of course object reference sharing must be turned off then (FSTConfiguration.setShared(false)).<br />
<br />
For completeness: JDK bulk (with manual Externalizable) yields 1197 nanos read and 378 nanos write, so it also profits from less initialization. Unfortunately, reuse of ObjectInput/OutputStream is not that easy to achieve, mainly because ObjectOutputStream already writes some bytes into the underlying stream when it is instantiated.<br />
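That header behaviour is easy to demonstrate with plain JDK streams. The sketch below (class name invented) also shows the usual JDK-side workaround for batching: write many objects into one stream and call reset() in between so the reference table does not grow without bound.

```java
import java.io.*;

public class StreamHeaderDemo {
    // Returns how many bytes an ObjectOutputStream pushes into the
    // underlying stream before any object is written: the constructor
    // immediately emits the stream header (STREAM_MAGIC + version).
    // This is why a pooled ObjectOutputStream cannot simply be pointed
    // at a fresh stream later.
    static int headerBytes() throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bout);
        out.flush();
        return bout.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes written before any object: " + headerBytes());

        // JDK-style batching: several objects into one stream; reset()
        // drops the reference table so earlier objects can be collected.
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bout)) {
            for (int i = 0; i < 3; i++) {
                out.writeObject("message " + i);
                out.reset();
                out.flush();
            }
        }
        System.out.println("3 strings batched into " + bout.size() + " bytes");
    }
}
```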
<br />
Note that if (constant) initialization time is taken out of the benchmarks, the relative performance gains of FST are even higher (see benchmarks on fast serialization site).<br />
<br />
Links:<br />
<br />
<a href="http://code.google.com/p/fast-serialization/source/browse/#svn%2Ftrunk%2Fsrc%2Ftest%2Fjava%2Fde%2Fruedigermoeller%2Fserialization%2Ftestclasses%2Fblog%253Fstate%253Dclosed" target="_blank">Source of this Benchmark</a><br />
<a href="https://github.com/RuedigerMoeller/fast-serialization" target="_blank">FST Serialization Library</a> (moved to github from gcode recently)<br />
<br />
<br />
<br />Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com74tag:blogger.com,1999:blog-4356402223581768063.post-45284516428161784582013-07-31T15:49:00.001+02:002013-07-31T20:16:53.609+02:00Impact of large primitive arrays (BLOBS) on Garbage CollectionWhile OldGen GC duration ("FullGC") depends on number of objects and the locality of their references to each other, young generation duration depends on the size of OldSpace memory. Alexej Ragozin has made extensive tests which give a good impression of young GC duration vs HeapSize <a href="http://blog.ragozin.info/2013/06/java-gc-in-numbers-parallel-young.html" target="_blank">here</a>.<br />
<br />
In order to avoid the heavy impact of long-lived data on OldGen GC, there are several workarounds/techniques.<br />
<br />
<ol>
<li>Put large parts of mostly static data off-heap using ByteBuffer.allocateDirect() or Unsafe.allocateMemory(). This memory is then used to store data (e.g. by using a fast serialization like <a href="http://code.google.com/p/fast-serialization/">http://code.google.com/p/fast-serialization/</a> [oops, I did it again] or specialized solutions like <a href="http://code.google.com/p/vanilla-java/wiki/HugeCollections">http://code.google.com/p/vanilla-java/wiki/HugeCollections</a>).<br />The downside is that one frequently has to implement manual memory management on top.<br /></li>
<li>"Instance saving" on heap by serializing into byte arrays or transforming data structures. This usually involves using open-addressed hashmaps without "Entry" objects, and large primitive arrays instead of small objects, like<br /><span style="font-family: Courier New, Courier, monospace;">class ReferenceDataArray {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> int x[];</span><br />
<span style="font-family: Courier New, Courier, monospace;"> double y[];</span><br />
<span style="font-family: Courier New, Courier, monospace;"> long z[];</span><span style="font-family: 'Courier New', Courier, monospace;"> </span><br />
<span style="font-family: Courier New, Courier, monospace;"> public ReferenceDataArray(int size) {</span><br />
<span style="font-family: Courier New, Courier, monospace;"> x = new int[size];</span><br />
<span style="font-family: Courier New, Courier, monospace;"> y = new ...;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> z = ...;</span><br />
<span style="font-family: Courier New, Courier, monospace;"> }</span><br />
<span style="font-family: Courier New, Courier, monospace;"> public long getZ(int index) { </span><span style="font-family: 'Courier New', Courier, monospace;">return z[index]; </span><span style="font-family: 'Courier New', Courier, monospace;">}</span><br />
<span style="font-family: Courier New, Courier, monospace;">}</span>,<br />
and replacement of generic collections of &lt;Integer&gt;, &lt;Long&gt; by specialized implementations using direct primitive int, long, ..<br />Whether it's worth crippling your code this way is questionable; however, the option exists.</li>
</ol>
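A minimal sketch of approach (1): a fixed-size-record store on top of a direct ByteBuffer. Class and method names are invented for illustration; real solutions additionally need allocation/free logic on top, i.e. the manual memory management mentioned above.

```java
import java.nio.ByteBuffer;

// Toy off-heap store: each record is one int + one double.
// The backing memory lives outside the Java heap, so the GC never
// scans, marks, or copies it.
public class OffHeapStore {
    private static final int RECORD_BYTES = 4 + 8; // int + double

    private final ByteBuffer buf;

    public OffHeapStore(int records) {
        buf = ByteBuffer.allocateDirect(records * RECORD_BYTES);
    }

    public void put(int index, int x, double y) {
        int off = index * RECORD_BYTES;
        buf.putInt(off, x);          // absolute puts, no position state
        buf.putDouble(off + 4, y);
    }

    public int getX(int index) {
        return buf.getInt(index * RECORD_BYTES);
    }

    public double getY(int index) {
        return buf.getDouble(index * RECORD_BYTES + 4);
    }

    public static void main(String[] args) {
        OffHeapStore store = new OffHeapStore(1000);
        store.put(3, 42, 1.5);
        System.out.println(store.getX(3) + " / " + store.getY(3));
    }
}
```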
<br />
Going the route outlined in (2) improves the effectiveness of OldGen GC a lot. Full GC duration can be in the range of 2s even with heap sizes in the 8 GB area. CMS performs significantly better as it can scan OldSpace faster and therefore needs less headroom in order to avoid Full GC.<br />
<br />
However there is still the fact, that YoungGen GC scales with OldSpace size.<br />
<br />
The scaling effect is usually associated with "card marking". Young GC has to remember which areas of OldSpace have been modified (in such a way that they reference objects in YoungGen). This is done with a kind of bit field where each bit (or byte) denotes the state (modified/reference created or similar) of a chunk ("card") of OldSpace.<br />
Primitive arrays are basically BLOBs for the VM; they cannot contain references to other Java objects, so theoretically there is no need to scan or card-mark areas containing BLOBs when doing GC. One could think e.g. of allocating large primitive arrays from the top of OldSpace and other objects from the bottom, reducing the amount of scanned cards.
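As a toy model of that mechanism (purely illustrative - this is not how HotSpot implements its card table, though HotSpot's default card size is indeed 512 bytes):

```java
public class CardTableToy {
    static final int CARD_SHIFT = 9; // 512-byte cards

    // One byte per 512-byte card of simulated old space.
    static byte[] cards(long oldSpaceBytes) {
        return new byte[(int) (oldSpaceBytes >> CARD_SHIFT) + 1];
    }

    // Write barrier: every reference store into old space dirties the
    // card covering the written address.
    static void writeBarrier(byte[] cardTable, long address) {
        cardTable[(int) (address >> CARD_SHIFT)] = 1;
    }

    // Young GC only needs to scan dirty cards for old->young references.
    static int dirtyCards(byte[] cardTable) {
        int n = 0;
        for (byte c : cardTable) if (c != 0) n++;
        return n;
    }

    public static void main(String[] args) {
        byte[] table = cards(1 << 20);  // 1 MB of simulated old space
        writeBarrier(table, 100);       // two writes land in the same card
        writeBarrier(table, 200);
        writeBarrier(table, 5000);      // a third write hits another card
        System.out.println("cards to scan: " + dirtyCards(table));
    }
}
```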
<br />
<b>Theory: blobs (primitive arrays) result in shorter young GC pauses than an equal amount of heap allocated as smallish objects.</b><br />
<br />
Therefore I'd like to do a small test, measuring the effects of allocating large primitive arrays (such as byte[], int[], long[], double[]) on NewGen GC duration.<br />
<br />
<h3>
The Test</h3>
<blockquote class="tr_bq">
import java.awt.Rectangle;<br />
import java.util.ArrayList;<br />
<br />
public class BlobTest {<br />
 static ArrayList&lt;byte[]&gt; blobs = new ArrayList&lt;byte[]&gt;();<br />
 static Object randomStuff[] = new Object[300000];<br />
 public static void main( String arg[] ) {<br />
 if ( Runtime.getRuntime().maxMemory() > 2*1024*1024*1024l) { // 'autodetect' available blob space from mem settings<br />
 int blobGB = (int) (Runtime.getRuntime().maxMemory()/(1024*1024*1024l));<br />
 System.out.println("Allocating "+blobGB*32+" 32Mb blobs ... (="+blobGB+"Gb) ");<br />
 for (int i = 0; i < blobGB*32; i++) {<br />
 blobs.add(new byte[32*1024*1024]);<br />
 }<br />
 System.gc(); // force VM to adapt ..<br />
 }<br />
 // create eden collected tmps with a medium promotion rate (promotion rate can be adjusted by size of randomStuff[])<br />
 while( true ) {<br />
 randomStuff[((int) (Math.random() * randomStuff.length))] = new Rectangle();<br />
 }<br />
 }<br />
}</blockquote>
<br />
<br />
The while loop at the bottom simulates the allocating application. Because random indices of the randomStuff array are overwritten, a lot of temporary objects are created: the same index is rewritten with another object instance. However, due to randomness, some indices will not be hit in time and live longer, so they get promoted. The larger the array, the less likely index overwriting gets, and the higher the promotion rate to OldSpace.<br />
<br />
In order to avoid bias from VM auto-adjusting, I pin NewGen sizes, so the only variation is the allocation of large byte[] on top of the allocation loop. (Note these settings are designed to encourage promotion; they are in no way optimal.)<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
commandline:<br />
<br />
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
java <b>-Xms1g -Xmx1g</b> -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2</div>
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
By adding more GB, the upper part of the test will use any heap above 1 GB to allocate byte[] arrays.</div>
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
<br />
java <b>-Xms3g -Xmx3g</b> -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2</div>
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
...</div>
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
...</div>
<div style="border: 0px; margin: 0px; padding: 0px; vertical-align: baseline;">
<div style="background-color: white; border: 0px; color: #222222; font-family: Arial, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px; vertical-align: baseline;">
<span style="background-color: transparent;">java</span><span style="background-color: transparent;"> </span><b>-Xms11g -Xmx11g</b> -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2</div>
<div>
<br /></div>
<div>
I am using byte[] arrays in the test; I verified that int[] and long[] behave exactly the same (a divisor must then be applied to adjust for the larger element size).</div>
</div>
<br />
<br />
<h3>
Results</h3>
<div>
(jdk 1.7_u21)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitXUSsbYGoXtoXDVZ7AJOsgighaiEimcKZfuCrVm7jRPkTH-CFQq7IXTRd69XAaBS7GkYOF4afv-BDAAKEYwwkTaL-LuY2JAdqFcpqbxgQj3aRyDgKXZCS5tnK305OPVLxlEr4T2yi8gk/s1600/defyoungblobscal.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitXUSsbYGoXtoXDVZ7AJOsgighaiEimcKZfuCrVm7jRPkTH-CFQq7IXTRd69XAaBS7GkYOF4afv-BDAAKEYwwkTaL-LuY2JAdqFcpqbxgQj3aRyDgKXZCS5tnK305OPVLxlEr4T2yi8gk/s400/defyoungblobscal.PNG" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLGim4A1Jsd7ps1cJJU974YFOImPTifkSqZWcj6AbMWn_RtJxB35JBu4kqasWd9yal1SUQ0RU0S5sW3sxsO9w2l36fAhE2gUiP-tJkAHeJO4AmIq8GKOnVyrBKSkcRlDmZlSfPXBDr4Lk/s1600/defblobsscal.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="297" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLGim4A1Jsd7ps1cJJU974YFOImPTifkSqZWcj6AbMWn_RtJxB35JBu4kqasWd9yal1SUQ0RU0S5sW3sxsO9w2l36fAhE2gUiP-tJkAHeJO4AmIq8GKOnVyrBKSkcRlDmZlSfPXBDr4Lk/s400/defblobsscal.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgg7V8-fsg7voghDFO4t7x-aNChAOkCMUOkhJ8QSuv0OqeYHB4cBUMAd8s8BjmUWfxoHjtXJY9Z6MB2ZsIGwGvsahXk6Urx0y5cdGZD7vzr80LOgC4scZzUQ8IOK0LROtULnxxzSsydkc/s1600/cmsblobscal.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgg7V8-fsg7voghDFO4t7x-aNChAOkCMUOkhJ8QSuv0OqeYHB4cBUMAd8s8BjmUWfxoHjtXJY9Z6MB2ZsIGwGvsahXk6Urx0y5cdGZD7vzr80LOgC4scZzUQ8IOK0LROtULnxxzSsydkc/s400/cmsblobscal.PNG" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi55Qqv15eCH8DAAHp2bG_XW2Qy7GS4rXdQDAVT6cz1J58bgj0wyGVNXsZtfBNIxXfNQZlVT94pXEvf444EWAjQbSPIn5hAKMEMGe32L5uQULYrBDTI7PBnZSlvW1xb9aGSp7Ff8iW8hB4/s1600/ogenblobs.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi55Qqv15eCH8DAAHp2bG_XW2Qy7GS4rXdQDAVT6cz1J58bgj0wyGVNXsZtfBNIxXfNQZlVT94pXEvf444EWAjQbSPIn5hAKMEMGe32L5uQULYrBDTI7PBnZSlvW1xb9aGSp7Ff8iW8hB4/s320/ogenblobs.PNG" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjyhOKPcF2OMe0cX3XGoFSjRYfzUb8BI6SEsdN1ECVCLpuk_QKtja7fwEj071dq6VxdXPBoRDlr7CISPDxcMKzERduCGwQw2TDrOkmG9PxlYznSut1njR9xCj5oVwxv7BGjybQ2Z3jgIM/s1600/ygenblobs.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjyhOKPcF2OMe0cX3XGoFSjRYfzUb8BI6SEsdN1ECVCLpuk_QKtja7fwEj071dq6VxdXPBoRDlr7CISPDxcMKzERduCGwQw2TDrOkmG9PxlYznSut1njR9xCj5oVwxv7BGjybQ2Z3jgIM/s400/ygenblobs.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The 'Objects' test was done by replacing the static byte[] allocation loop in the benchmark with</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
</div>
<div class="separator" style="clear: both;">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> for ( int i = 0; i < blobGB*2700000; i++ )</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"> nonblobs.add(new Object[] {<br /> new Rectangle(),new Rectangle(),new Rectangle(),<br /> new Rectangle(),new Rectangle(),new Rectangle(),<br /> new Rectangle(),new Rectangle(),new Rectangle(),<br /> new Rectangle()});</span></div>
<div>
<br /></div>
<br />
<h3>
Conclusion</h3>
</div>
<blockquote class="tr_bq">
Flattening data structures using on-heap allocated primitive arrays ('BLOBs') reduces OldGen GC overhead very effectively. </blockquote>
<blockquote class="tr_bq">
Young gen pauses are slightly reduced for CMS, so scaling with OldGen size is damped but not gone. For the DefaultGC (PSYoung), minor pauses are actually slightly higher when the heap is filled with BLOBs.</blockquote>
<blockquote class="tr_bq">
I am not sure if the observed young gen duration variance has anything to do with "card marking"; however, I am satisfied with quantifying the effects of different allocation types and sizes :-) </blockquote>
<br />
<h3>
Further Improvement incoming ..</h3>
<div>
<br /></div>
<div>
With this ingenious little optimization coming up in JDK 7u40</div>
<div>
<br /></div>
<div>
<a href="http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068625" style="background-color: white; border: 0px; color: #6611cc; cursor: pointer; font-family: Arial, Helvetica, sans-serif; font-size: 13px; margin: 0px; padding: 0px; text-decoration: none; vertical-align: baseline;" target="_blank">http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068625</a></div>
<div>
<br /></div>
<div>
scanning of unmarked cards speeds up by a factor of 8. </div>
<div>
<br />
Additionally, notice </div>
<div>
<a href="http://blog.ragozin.info/2012/03/secret-hotspot-option-improving-gc.html">http://blog.ragozin.info/2012/03/secret-hotspot-option-improving-gc.html</a></div>
<div>
<br /></div>
(for the test <span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=512</span> gave the best results)<br />
At least CMS Young Gen pause scaling is not too bad.<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdrncqIqKZ5gB_RftE2-sQZLjcun510GsVWlczQuNeSuot6RcxZFMI7g6go-fU4buplvZGesDe8WXg3To5gYGrkO8iGhViNclgDJmaFoe0BPuE68OEsYuklk3O_2hpG4qdD5-Ac1_gUn0/s1600/7u40.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdrncqIqKZ5gB_RftE2-sQZLjcun510GsVWlczQuNeSuot6RcxZFMI7g6go-fU4buplvZGesDe8WXg3To5gYGrkO8iGhViNclgDJmaFoe0BPuE68OEsYuklk3O_2hpG4qdD5-Ac1_gUn0/s400/7u40.PNG" width="400" /></a></div>
<h3>
And G1 ?</h3>
<div>
G1 fails to execute the test. Even if only 6GB of byte[] is allocated with 11GB of heap, it is still much more disruptive than CMS. It works if I use small byte[] chunks of 1MB size and set the region size to 32MB. Even then, pauses are longer compared to CMS. G1 seems to have problems with large arrays, which will be problematic for IO-intensive applications requiring many big byte buffers.</div>
<div>
<br /></div>
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com134tag:blogger.com,1999:blog-4356402223581768063.post-6913246861218661612013-07-28T20:42:00.002+02:002013-07-29T04:50:14.543+02:00What drives Full GC duration ?<h2>
</h2>
<div>
It's obvious that the more memory is allocated, the longer a full GC pause will be. However, a GC has to track references, so I want to find out if primitive fields (like int, long) also increase Full GC duration. Additionally, I'd like to find out if the structure of an object graph has an impact on GC duration.</div>
<div>
<br /></div>
<h2>
The Test</h2>
<div>
I am creating a number of instances of the following classes:</div>
<div>
<blockquote class="tr_bq">
static class GCWrapper {<br />
GCWrapper(Object referenced) {<br />
this.referenced = referenced;<br />
}<br />
Object referenced;<br />
}</blockquote>
</div>
<div>
<blockquote>
static class GCWrapperMiddle {<br />
int a,b,c,d,e,f,g,h,i,j,k,l;<br />
GCWrapperMiddle(Object referenced) {<br />
this.referenced = referenced;<br />
}<br />
Object referenced;<br />
} </blockquote>
<blockquote>
static class GCWrapperFat {<br />
int a,b,c,d,e,f,g,h,i,j,k,l;<br />
long aa,ab,ac,ad,ae,af,ag,ah,ai,jj,ak,al;<br />
GCWrapperFat(Object referenced) {<br />
this.referenced = referenced;<br />
}<br />
Object referenced;<br />
}</blockquote>
</div>
<div>
They all hold exactly one reference, but "Middle" and "Fat" additionally contain primitive fields, which increases their memory consumption a lot.<br />
After creation I chain those objects into a linked list. For the locality tests, the reference is pointed at a random other "GCWrapper" object to create "far" references.</div>
<div>
Additionally I am using the class below to test effects of "many references", a highly interlinked Object graph:</div>
<div>
<blockquote class="tr_bq">
static class GCWrapperLinked {<br />
GCWrapperLinked(Object referenced) {<br />
this.r1 = referenced;<br />
}<br />
Object r1,r2,r3,r4,r5;<br />
}</blockquote>
</div>
<div>
After instantiation, the test does 5 GCs using System.gc() and takes the last value (~3 calls are required until the time stabilizes). </div>
<div>
Tests are done on an Intel i7 4-core/8-thread mobile processor with a somewhat outdated JDK 7u21, as I am currently on vacation with a pretty bad internet connection ...</div>
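The exact timing code is not shown in the post; a sketch of how such a measurement can be done (hypothetical harness, names invented):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimer {
    // Accumulated GC time (ms) over all collectors, via the JMX beans.
    static long totalGcMillis() {
        long sum = 0;
        for (GarbageCollectorMXBean b : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = b.getCollectionTime(); // -1 if unsupported
            if (t > 0) sum += t;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Trigger 5 full GCs and keep the duration of the last one;
        // the first calls are typically slower until times stabilize.
        long lastMs = 0;
        for (int i = 0; i < 5; i++) {
            long t0 = System.nanoTime();
            System.gc();
            lastMs = (System.nanoTime() - t0) / 1_000_000;
        }
        System.out.println("last full GC took ~" + lastMs + " ms ("
                + totalGcMillis() + " ms total GC time so far)");
    }
}
```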
<div>
<br />
<h2>
What has more impact, Memory Size or Number of Objects (-References)?</h2>
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij3WZmNwRFw1-wly6z691g2sZ-MCJOvhlsCfAniwhHu-amv85tmqRdVREqhy1d5GtOlhAQZaAtazsIA-bjB4NuAptWu-tT8CbBJ6EVjDPNdT7ua9EdOEtyWEakMbvVWC-MkwepNTlYWZ4/s1600/numobj_real.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="336" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij3WZmNwRFw1-wly6z691g2sZ-MCJOvhlsCfAniwhHu-amv85tmqRdVREqhy1d5GtOlhAQZaAtazsIA-bjB4NuAptWu-tT8CbBJ6EVjDPNdT7ua9EdOEtyWEakMbvVWC-MkwepNTlYWZ4/s640/numobj_real.PNG" width="640" /></a></div>
<div>
The test was done with all collectors and different numbers of objects on the heap. Objects were allocated in a linked list like<br />
<blockquote class="tr_bq">
GCWrapperMiddle res = new GCWrapperMiddle(null);<br />
for ( int i = 0; i < len; i++ ) {<br />
res = new GCWrapperMiddle(res);<br />
}<br />
return res;</blockquote>
<div>
<br /></div>
The result shows a clear linear dependency between the number of objects and Full GC duration.<br />
Please don't conclude from the results that the Default GC (ParOldGC) is slowest, because<br />
a) subsequent calls to System.gc() without mutation may enable collectors to cut corners</div>
<div>
b) they might choose to use fewer/more threads during collection<br />
<br />
I test all Collectors just to ensure that observed interdependencies are present for all collectors.<br />
<br />
Now let's compare the effects of object size on Full GC duration. As described above, I pumped up object size by adding int and long fields. Those fields cannot contain references to other objects (no need to scan them) but still consume space. The test creates 6 million objects of each of the classes shown above. Heap consumption is:<br />
<br />
<ul>
<li>163 Mb for 'small'</li>
<li>427 Mb for 'middle'</li>
<li>1003 Mb for 'large'</li>
</ul>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXJNhSjVFBqKagFpdmer6krhy5X2jtz515sQQorNWSj9YSvLP7ldTrikBo5mdIQ5xF7dUDwVXmzep9ulhpY3NZY0qDHHS9t_YTD_L52vpdJDJFiXwK6EcVZE-5ft8f5g0RTdxRcfMAYFk/s1600/numobj.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="364" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhXJNhSjVFBqKagFpdmer6krhy5X2jtz515sQQorNWSj9YSvLP7ldTrikBo5mdIQ5xF7dUDwVXmzep9ulhpY3NZY0qDHHS9t_YTD_L52vpdJDJFiXwK6EcVZE-5ft8f5g0RTdxRcfMAYFk/s640/numobj.PNG" width="640" /></a></div>
<div>
Now that's interesting: <br />
<b>Object size has a minor impact on Full GC duration (the differences observed can be attributed mostly to CPU cache misses, I assume). The main driver of Full GC duration is the number of </b><b>object (-references).</b></div>
<div>
<br />
<br />
<h2>
Does GC duration depend on the number of objects or the number of object references?</h2>
</div>
<div>
<br />
To test this, I create an object graph where each GCWrapper object points to another randomly chosen GCWrapper object.</div>
<div>
<blockquote class="tr_bq">
GCWrapper res[] = new GCWrapper[len];<br />
for ( int i = 0; i < len; i++ ) {<br />
res[i] = new GCWrapper(null);<br />
}<br />
for ( int i = 0; i < len; i++ ) {<br />
res[i].referenced = res[((int) (Math.random() * len))];<br />
}<br />
return res;</blockquote>
</div>
<div>
and ('5 refs')</div>
<div>
<blockquote class="tr_bq">
GCWrapperLinked res[] = new GCWrapperLinked[len];<br />
for (int i = 0; i < res.length; i++) {<br />
res[i] = new GCWrapperLinked(null);<br />
}<br />
for (int i = 0; i < res.length; i++) {<br />
res[i].r1 = res[((int) (Math.random() * len))];<br />
res[i].r2 = res[((int) (Math.random() * len))];<br />
res[i].r3 = res[((int) (Math.random() * len))];<br />
res[i].r4 = res[((int) (Math.random() * len))];<br />
res[i].r5 = res[((int) (Math.random() * len))];<br />
}<br />
return res;</blockquote>
</div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIpnaeUKnol9np04YfleJTClZiI4xnX8n3nnXEsE1xfIJVK0vHh6FqkecLapgodwT0NZV4v4fER9-_HgefGLdvup2izeQFSFdNF60Mx8nkHlA-hNt_qy78YIW3uSyVuE41im79VjnJrpg/s1600/linkage.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgIpnaeUKnol9np04YfleJTClZiI4xnX8n3nnXEsE1xfIJVK0vHh6FqkecLapgodwT0NZV4v4fER9-_HgefGLdvup2izeQFSFdNF60Mx8nkHlA-hNt_qy78YIW3uSyVuE41im79VjnJrpg/s640/linkage.PNG" width="640" /></a></div>
<div>
<br /></div>
<div>
Oops, now that's surprising. It's strange that the DefaultGC (ParOldGC) actually gets faster for objects containing *more* references to (random) other objects!! This is not a test error and is reproducible.<br />
You probably know that today's CPUs have a multi-stage cache and that access to main memory is extremely expensive compared to cache hits, so locality (memory-wise) is pretty important. It's obvious this also hits Full GC when iterating the live set of objects.<br />
(see <a href="http://www.akkadia.org/drepper/cpumemory.pdf">http://www.akkadia.org/drepper/cpumemory.pdf</a> for more information you'd probably want to read about caches ;-) ).<br />
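The cache effect can be reproduced outside the GC by chasing indices through a sequentially linked vs. a randomly linked chain (illustrative sketch, names invented; on typical hardware the random chain is many times slower):

```java
import java.util.Random;

public class LocalityDemo {
    // Follows a chain of "next" indices through an int[]. With a
    // sequential permutation the prefetcher hides memory latency;
    // with a random one nearly every step is a cache miss.
    static long follow(int[] next, int steps) {
        long sum = 0;
        int i = 0;
        for (int s = 0; s < steps; s++) {
            i = next[i];
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1 << 22; // 16 MB of int indices, larger than typical caches
        int[] sequential = new int[n];
        for (int i = 0; i < n; i++) sequential[i] = (i + 1) % n;

        // Random cyclic permutation (Sattolo's algorithm), so the chain
        // visits every index exactly once, like the sequential one.
        int[] random = new int[n];
        for (int i = 0; i < n; i++) random[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i);
            int tmp = random[i]; random[i] = random[j]; random[j] = tmp;
        }

        for (int[] chain : new int[][]{sequential, random}) {
            long t0 = System.nanoTime();
            long sum = follow(chain, n);
            System.out.printf("sum=%d, %.0f ms%n", sum, (System.nanoTime() - t0) / 1e6);
        }
    }
}
```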
<br />
In contrast to the CMS collector, ParOldGC ('DefaultGC') is able to compact the heap during Full GC. It seems it does this in a way that optimizes the randomly linked object graph by moving objects close to the objects referencing them (that's my speculative explanation, others are welcome ..). <br />
<br />
This observation leads to the following test:<br />
A comparison of Full GC over object graphs of 6 million GCWrapper objects. One set up as a linked list (linked in allocation order) and one with randomly chosen referenced objects.<br />
<br />
<blockquote class="tr_bq">
GCWrapper res = new GCWrapper(null);<br />
for ( int i = 0; i < len; i++ ) {<br />
res = new GCWrapper(res);<br />
}<br />
return res;</blockquote>
<div style="text-align: center;">
<b><span style="font-size: large;">vs</span></b></div>
<div>
<blockquote class="tr_bq">
GCWrapper res[] = new GCWrapper[len];<br />
for ( int i = 0; i < len; i++ ) {<br />
res[i] = new GCWrapper(null);<br />
}<br />
for ( int i = 0; i < len; i++ ) {<br />
res[i].referenced = res[((int) (Math.random() * len))];<br />
}<br />
return res;</blockquote>
</div>
</div>
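The fragments above omit the surrounding class; a self-contained reconstruction might look like the following sketch. The GCWrapper class shape and the System.gc() timing are my assumptions — the original fragments only show the graph-building loops.

```java
import java.util.concurrent.ThreadLocalRandom;

// Minimal reconstruction of the benchmark fragments above: build the same
// object graph either in allocation order (good locality) or with random
// references (poor locality), then time a forced full collection.
public class LocalityBench {

    static class GCWrapper {
        GCWrapper referenced;
        GCWrapper(GCWrapper ref) { referenced = ref; }
    }

    // linked in allocation order => neighbors in memory reference each other
    static GCWrapper buildLinked(int len) {
        GCWrapper res = new GCWrapper(null);
        for (int i = 0; i < len; i++) {
            res = new GCWrapper(res);
        }
        return res;
    }

    // randomly chosen references => traversal jumps all over the heap
    static GCWrapper[] buildRandom(int len) {
        GCWrapper res[] = new GCWrapper[len];
        for (int i = 0; i < len; i++) {
            res[i] = new GCWrapper(null);
        }
        for (int i = 0; i < len; i++) {
            res[i].referenced = res[ThreadLocalRandom.current().nextInt(len)];
        }
        return res;
    }

    static long timeFullGCMillis() {
        long t = System.nanoTime();
        System.gc(); // requests a full collection (honored by the tested collectors)
        return (System.nanoTime() - t) / 1_000_000;
    }

    public static void main(String[] args) {
        Object graph = args.length > 0 && args[0].equals("random")
                ? buildRandom(6_000_000) : buildLinked(6_000_000);
        System.gc(); // let the first collection settle/compact the graph
        System.out.println("Full GC took " + timeFullGCMillis() + " ms");
        System.out.println("graph root: " + graph.getClass().getSimpleName());
    }
}
```

Run once with no argument (linked) and once with `random` under the collector you want to measure.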
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYxnx8CQgz38WUKSQso2Uj9A_Nn4Ed0DLJgkh3kEC0i0m1E45Br0xcXoZ3Khovx_32mxxP4QGq7_WL_QxQoybe8HXWXhPxpu_QlGZETx8P3MJZUsCgsnH_sQBQu3h-At4G1Ocd8Xav86E/s1600/locality.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYxnx8CQgz38WUKSQso2Uj9A_Nn4Ed0DLJgkh3kEC0i0m1E45Br0xcXoZ3Khovx_32mxxP4QGq7_WL_QxQoybe8HXWXhPxpu_QlGZETx8P3MJZUsCgsnH_sQBQu3h-At4G1Ocd8Xav86E/s640/locality.PNG" width="640" /></a></div>
<div>
Boom ! The impact of locality is massive. This will also have a strong impact on the performance of a Java application accessing those data structures.<br />
<br />
<h2>
<b>Conclusion:</b></h2>
</div>
<div>
<blockquote class="tr_bq">
<i><b>Full GC duration depends on the number of objects allocated and the locality of their references. It does not depend that much on actual heap size.</b></i></blockquote>
</div>
<div>
<b><br /></b>
<b><br /></b></div>
<h2>
<b>Questionable design patterns regarding (OldGen) GC and overall locality</b></h2>
<div>
<br />
I think one can safely assume that objects allocated at the same time have a high chance of being "near" each other (Note: CMS heap fragmentation may behave differently if the application fragments the heap). So whenever statically allocated data is updated to refer to a newly allocated object, chances are that the new object is far away from the object referencing it, which will cause cache misses. Especially the massive use of Immutables, as favoured by many designs (and newer VM-based languages), is actually deadly regarding locality and OldGen GC.</div>
<div>
<ul>
<li>Lazy initialization/instantiation</li>
<li>Value classes. E.g. GC-wise it's better to use a primitive long instead of a Date object.</li>
<li>Immutables (except preallocated Flyweight's)</li>
<li>Fine grained class design </li>
<li>Preferring Composition over Subclassing (many wrapper, delegate Objects)</li>
<li>JDK's HashMap uses entry objects. Use open addressed Hashmaps instead</li>
<li>Use of Long, Integer, Short instead of primitives in static data (unfortunately Java's generics encourage this). Even if the VM does some optimization here (e.g. internal caching of small Integers), one can easily verify that these objects still have a severe impact on GC duration</li>
</ul>
</div>
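As a tiny illustration of the value-class and boxing bullets above — the numbers are illustrative, not measured:

```java
import java.util.Date;

// Contrasts GC-friendly and GC-unfriendly static data layouts.
// One million Date objects are one million separate heap objects the
// full GC must traverse (and possibly relocate); a long[] is a single
// object with zero outgoing references.
public class StaticData {

    static Date[] boxedTimestamps = new Date[1_000_000];   // 1M objects for the GC
    static long[] primitiveTimestamps = new long[1_000_000]; // 1 object for the GC

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (int i = 0; i < boxedTimestamps.length; i++) {
            boxedTimestamps[i] = new Date(now + i); // allocates per entry
            primitiveTimestamps[i] = now + i;       // no allocation at all
        }
        // both representations carry the same information
        System.out.println(boxedTimestamps[0].getTime() == primitiveTimestamps[0]);
    }
}
```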
Rüdiger Möllerhttp://www.blogger.com/profile/03711813786574992852noreply@blogger.com31tag:blogger.com,1999:blog-4356402223581768063.post-29792082005247839202013-07-07T11:54:00.000+02:002013-07-09T00:15:56.081+02:00Tuning and benchmarking Java 7's Garbage Collectors: Default, CMS and G1<br />
[I am not a native English speaker, so please enable quirks mode ...]<br />
<br />
Due to trouble with long GC pauses in a project, I recently took a deeper look into GC details. There is not that much accessible information/benchmarks on the Web, so I thought I might share my tests and findings ^^. The last time I tested GC, some years ago, I just came to the conclusion that allocation of any form is evil in Java and avoided it as far as possible in my code.<br />
<br />
I will not describe the exact algorithms here as there is plenty of material regarding this. Understanding the algorithms in detail does not necessarily help in predicting actual behaviour in real systems, so I will do some tests and benchmarks.<br />
<br />
The concepts of Garbage Collection in Hotspot are explained e.g. <a href="http://www.oracle.com/webfolder/technetwork/tutorials/obe/java/gc01/index.html" target="_blank">here</a>.<br />
In depth coverage of algorithms and parameters can be found <a href="http://blog.ragozin.info/2011/06/understanding-gc-pauses-in-jvm-hotspots.html" target="_blank">here</a>. I will cover GC from an empirical point of view here.<br />
<br />
The basic idea of multi-generational Garbage Collection is to collect newer ("younger") objects more frequently, using a "minor" (short duration/pause) collection algorithm. Objects which survive one or more minor collections then move to the "OldSpace". The OldSpace is garbage collected by the "major" Garbage Collector. I will name them NewSpace and NewGC, OldSpace and OldGC.<br />
<br />
The NewGC algorithm is pretty much the same among the 3 Garbage Collectors HotSpot provides. The Old Generation Collector ("OldGC") makes the difference.<br />
<br />
<ul>
<li>In the <b>Default Collector</b>, OldSpace is cleaned up using a "Stop the World" Mark-Sweep.</li>
<li><b>The Concurrent Mark&Sweep</b> Collector (<b>CMS</b>) uses (surprise!) a concurrent implementation of Mark & Sweep collection. This means the running application is not stopped during major GC. However, it falls back to a Stop-The-World GC if it can't keep up with the application's allocation (promotion, tenuring) rate.</li>
<li><b>The G1</b> is also concurrent. It segments memory into equal chunks, trying to split the garbage collection process into smaller pieces by collecting segments which are likely to contain a lot of garbage first. A more detailed description can be found <a href="http://docs.oracle.com/javase/7/docs/technotes/guides/vm/G1.html" target="_blank">here</a>. It also falls back to a Full-Stop-GC in case it can't keep up with the application's allocation rate. However, the duration of those Full-GC pauses is shorter compared to the other OldSpace collectors.</li>
</ul>
<br />
Note that the term "parallel GC" refers to collection algorithms which run multithreaded, not necessarily in parallel to your application.<br />
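You can see which NewGC/OldGC pair your VM actually picked via the standard management beans — a small sketch (the bean names in the comment are examples; they differ per collector configuration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints the collector pair the running VM registered, plus their
// cumulative pause statistics. On HotSpot each configuration exposes
// one young-gen and one old-gen bean, e.g. "PS Scavenge"/"PS MarkSweep"
// for the parallel default, or "ParNew"/"ConcurrentMarkSweep" for CMS.
public class ShowCollectors {

    static int collectorCount() {
        return ManagementFactory.getGarbageCollectorMXBeans().size();
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", totalMs=" + gc.getCollectionTime());
        }
    }
}
```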
<br />
<br />
<h3>
The Benchmark</h3>
<div>
<br /></div>
<div>
I wrote a small program which emulates most of the stuff Garbage Collectors have problems with:</div>
<div>
<br /></div>
<div>
<ul>
<li>4GB of statically allocated objects which are rarely freed.<br />Problem for GC: with each major GC these objects must be traversed, so the larger your reference data, caches etc., the longer a full traversal collection will take.</li>
<li>A lot of temporary Object allocation of various size and age. "Age" refers to the amount of time the Object is referenced by the application.</li>
<li>Intentional partial replacements of pseudo-static data by new objects. This way I enforce "promotion" of objects to OldSpace, as they are long lived.</li>
</ul>
This is achieved by putting a lot of objects into a HashMap, then replacing a fraction of it. Additionally, the latency of a ~0,5 ms operation is measured and recorded to simulate processing of e.g. requests. This way I get a distribution of VM/GC-related pauses. The benchmark runs for 5 minutes, so long-term effects like heap fragmentation are <b>not </b>covered by the tests.</div>
<div>
<br /></div>
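The pause measurement described above can be sketched as a simple hiccup recorder — the bucket boundaries below are my assumption, chosen to match the millisecond-to-seconds histograms shown later:

```java
import java.util.TreeMap;

// Times a short operation in a loop and buckets observed latencies:
// 1 ms resolution below 100 ms, 100 ms resolution below 1 s, 1 s above.
// A GC pause during the timed operation shows up as an outlier bucket.
public class HiccupMeter {

    static final TreeMap<Long, Integer> histogram = new TreeMap<>();

    static void record(long millis) {
        long bucket;
        if (millis < 100)       bucket = millis;
        else if (millis < 1000) bucket = (millis / 100) * 100;
        else                    bucket = (millis / 1000) * 1000;
        histogram.merge(bucket, 1, Integer::sum);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 100; i++) {
            long t0 = System.nanoTime();
            Thread.sleep(0, 500_000); // stand-in for the ~0,5 ms operation
            record((System.nanoTime() - t0) / 1_000_000);
        }
        histogram.forEach((b, n) -> System.out.println("[" + b + "]\t" + n));
    }
}
```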
<div>
Note that this benchmark has an insane allocation rate and object age distribution. So results and VM tuning evaluated in this post illustrate the effects of some GC settings; it's not cut & paste stuff, and most of the sizings used to get this allocation-greedy benchmark to work are way too big for real-world applications.</div>
<div>
<br />
<br /></div>
<div>
<h3>
Real men use default settings ..</h3>
</div>
<div>
<br /></div>
<div>
Running the benchmark with -Xms8g -Xmx8g (the bench consumes 4g static, ~1g dynamic data) yields quite disgusting results:</div>
<div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxzbokUaI9UtH_NG9TcH_4PxJYSdk-B1vJewRl_XHQKgifQ8fjJ2I7ovAbrgZO6ht5S0_FjpKzeUI8cOLugFz4gfXcYJA8FTZofdn6y4W6neS4wXyBqRIBOF8fY-LPSJx73uas3dtgXyY/s1600/8gdefaultnotweaks.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="152" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgxzbokUaI9UtH_NG9TcH_4PxJYSdk-B1vJewRl_XHQKgifQ8fjJ2I7ovAbrgZO6ht5S0_FjpKzeUI8cOLugFz4gfXcYJA8FTZofdn6y4W6neS4wXyBqRIBOF8fY-LPSJx73uas3dtgXyY/s1600/8gdefaultnotweaks.png" width="640" /></a></div>
<br />
Each of the spikes represents a full GC with a duration of ~15 seconds ! During the full GC, the program is stopped. One can see that the Garbage Collector has some automatic calibration. Initially it did a pretty good job, but after the first GC things got really bad. <br />
I've noticed this in several tests. Most of the time the automatic calibration improves things, but sometimes it's vice versa (see Finding #2).<br />
<br />
Hiccups ( [interval in milliseconds] => number of occurrences ).<br />
<blockquote class="tr_bq">
<span style="font-size: xx-small;"><span style="font-family: Courier New, Courier, monospace;">Iterations 9324, </span><span style="font-family: 'Courier New', Courier, monospace;">-Xmx8g -Xms8g </span><span style="font-family: Courier New, Courier, monospace;"><br />[0]<span class="Apple-tab-span" style="white-space: pre;"> </span>4930<br />[1]<span class="Apple-tab-span" style="white-space: pre;"> </span>4366<br />[2]<span class="Apple-tab-span" style="white-space: pre;"> </span>8<br />[7]<span class="Apple-tab-span" style="white-space: pre;"> </span>1<br />[14000]<span class="Apple-tab-span" style="white-space: pre;"> </span>1<br />[15000]<span class="Apple-tab-span" style="white-space: pre;"> </span>10<br />[16000]<span class="Apple-tab-span" style="white-space: pre;"> </span>4<br />[17000]<span class="Apple-tab-span" style="white-space: pre;"> </span>3<br />[18000]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span></span></blockquote>
Read like e.g.: '4930' requests completed in 0 ms, 10 requests had a response time of 15 seconds. Throughput was 9324 requests.<br />
<br />
CMS did somewhat better; however, there were also severe program-stopping GC's. Note the >10 times higher throughput.<br />
<br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">Iterations 115726, ‐XX:+UseConcMarkSweepGC -Xmx8g -Xms8g<br />[0]<span class="Apple-tab-span" style="white-space: pre;"> </span>52348<br />[1]<span class="Apple-tab-span" style="white-space: pre;"> </span>62317<br />[2]<span class="Apple-tab-span" style="white-space: pre;"> </span>157<br />[46]<span class="Apple-tab-span" style="white-space: pre;"> </span>7<br />[61]<span class="Apple-tab-span" style="white-space: pre;"> </span>13<br />[62]<span class="Apple-tab-span" style="white-space: pre;"> </span>127<br />[63]<span class="Apple-tab-span" style="white-space: pre;"> </span>234<br />[64]<span class="Apple-tab-span" style="white-space: pre;"> </span>164<br />[65]<span class="Apple-tab-span" style="white-space: pre;"> </span>81<br />[66]<span class="Apple-tab-span" style="white-space: pre;"> </span>55<br />[67]<span class="Apple-tab-span" style="white-space: pre;"> </span>45<br />[68]<span class="Apple-tab-span" style="white-space: pre;"> </span>42<br />[69]<span class="Apple-tab-span" style="white-space: pre;"> </span>25<br />[70]<span class="Apple-tab-span" style="white-space: pre;"> </span>28<br />[71]<span class="Apple-tab-span" style="white-space: pre;"> </span>16<br />[72]<span class="Apple-tab-span" style="white-space: pre;"> </span>11<br />[73]<span class="Apple-tab-span" style="white-space: pre;"> </span>5<br />[74]<span class="Apple-tab-span" style="white-space: pre;"> </span>12<br />[14000]<span class="Apple-tab-span" style="white-space: pre;"> </span>7<br />[15000]<span class="Apple-tab-span" style="white-space: pre;"> </span>4<br />[17000]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span></blockquote>
<br />
<span style="font-family: inherit;">G1 does the best job here (still 10 unacceptable full-stop GC's..)</span><br />
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">Iterations 110052</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[0]<span class="Apple-tab-span" style="white-space: pre;"> </span>37681<br />[1]<span class="Apple-tab-span" style="white-space: pre;"> </span>67201<br />[2]<span class="Apple-tab-span" style="white-space: pre;"> </span>1981<br />[3]<span class="Apple-tab-span" style="white-space: pre;"> </span>725<br />[4]<span class="Apple-tab-span" style="white-space: pre;"> </span>465<br />[5]<span class="Apple-tab-span" style="white-space: pre;"> </span>306<br />[6]<span class="Apple-tab-span" style="white-space: pre;"> </span>217<br />[7]<span class="Apple-tab-span" style="white-space: pre;"> </span>182<br />[8]<span class="Apple-tab-span" style="white-space: pre;"> </span>113<br />[9]<span class="Apple-tab-span" style="white-space: pre;"> </span>105<br />[39]<span class="Apple-tab-span" style="white-space: pre;"> </span>13 </span><span style="font-family: 'Courier New', Courier, monospace; font-size: xx-small;">(skipped some intervals of minor occurence count)</span><span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br />[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>288<br />[200]<span class="Apple-tab-span" style="white-space: pre;"> </span>23<br />[300]<span class="Apple-tab-span" style="white-space: pre;"> </span>6<br />[1000]<span class="Apple-tab-span" style="white-space: pre;"> </span>11<br />[12000]<span class="Apple-tab-span" style="white-space: pre;"> </span>5<br />[13000]<span class="Apple-tab-span" style="white-space: pre;"> </span>5</span></blockquote>
<span style="font-family: inherit;">Hm .. pretty bad. Now, let's have a look into the GC internals:</span></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNodLQE9aAMcEsd5BUVZic8GQlmzHJ5EYAlhau830dv2urIMwH0Dmo-GakNOh396zcnFtbxsX0d3fnuF3wntyQO0pVo22auIJh4oN15KiuMiV2s_APJlOZ10lW96855Z38guJbFRXQ4hY/s1600/gc_intern.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNodLQE9aAMcEsd5BUVZic8GQlmzHJ5EYAlhau830dv2urIMwH0Dmo-GakNOh396zcnFtbxsX0d3fnuF3wntyQO0pVo22auIJh4oN15KiuMiV2s_APJlOZ10lW96855Z38guJbFRXQ4hY/s1600/gc_intern.PNG" width="640" /></a></div>
<div>
<br /></div>
<div>
If OldSpace (the big green one) is full, the application gets a full-stop GC. In order to avoid Full GC, we need to reduce the promotion rate to OldSpace. An object gets promoted if it is alive (aka referenced) for a longer time than NewSpace (=Eden+Survivor Spaces) holds it.</div>
<div>
<br /></div>
<div>
So time for </div>
<div>
<br /></div>
<div>
<blockquote class="tr_bq">
<i><b>Finding #1:</b> </i></blockquote>
<blockquote class="tr_bq">
<b><i>It is key to reduce the promotion rate from young gen to OldSpace in order to avoid Full-GC's. In case of concurrent OldSpace GC (CMS, G1), this will enable the collectors to keep up with the allocation rate.</i></b></blockquote>
<br /></div>
<div>
</div>
<h3>
</h3>
<h3>
Tuning the Young Generation</h3>
<div>
<br />
All 3 collectors available in Java 7 will profit from a proper NewGC setup.<br />
<br /></div>
<div>
Promotion happens if:</div>
<div>
<ul>
<li>The "survivor space" (S0, S1) is full. <br />The size of Survivor vs Eden is specified with the -XX:SurvivorRatio=N option. A large Eden will (usually) increase throughput and will catch ultra-short-lived temporary objects better. However, a large Eden means small survivor spaces, so middle-aged objects might get promoted to OldSpace too quickly, putting load on the OldSpace GC.</li>
<li>Survivors have survived more than <span style="color: #333333; font-family: 'Courier New'; font-size: 10pt;">-XX:MaxTenuringThreshold=N</span> (default 15) minor collections. Unfortunately I did not find an option to give a lower bound for this value, so one can only specify a maximum here. -XX:InitialTenuringThreshold might help; however, I found the VM will choose lower values anyway in some cases.</li>
</ul>
<span style="color: #333333;"></span><br />
<div>
<span style="color: #333333;"><span style="color: #333333;"><br /></span></span></div>
<span style="color: #333333;">
The following actions will reduce promotion (by encouraging survivors to live longer in young generation)</span></div>
<ul><span style="font-family: inherit;">
<li>Decreasing -XX:SurvivorRatio=N to values lower than 8 (this actually increases the size of the survivor spaces and decreases Eden size). <br />The effect is that survivors will stay in young gen for a longer time (if there is sufficient space).<br />This will reduce throughput, as survivors are copied between S0 and S1 with each minor GC.</li>
<li>Increase the overall size of the young generation with -XX:NewRatio=N. <br />"1" means the young generation will use 50% of your heap, "2" means it will use 33%, etc. A larger young gen reduces the heap available for long-lived objects but will reduce the number of minor GCs and increase the space available for survivors.</li>
</span>
<li>Increase -XX:MaxTenuringThreshold=N to values > 15.<br />Of course this only reduces promotion if the survivor space is large enough. Additionally, this is only an upper bound, so the VM might choose a lower value regardless of your setting (you can also try -XX:InitialTenuringThreshold).</li>
<span style="font-family: inherit;">
<li>Increasing the overall VM heap will help (in fact, more GB always help :-) ), as this will increase young generation (Eden+Survivor) and OldSpace size. An increase in OldSpace size reduces the number of required major GC's (or gives concurrent OldSpace collectors more headroom to complete a major GC concurrently, pauselessly).</li>
</span></ul>
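To check which of the sizings above actually took effect (the VM may override ratio-derived values), you can dump the heap pool sizes from inside the application — a small sketch; pool names vary by collector:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Prints the maximum size of each heap pool (Eden, Survivor, Old/Tenured)
// as the running VM sees it, which is handy for verifying that
// -XX:NewSize / -XX:SurvivorRatio settings were applied as intended.
public class ShowPools {

    static int poolCount() {
        return ManagementFactory.getMemoryPoolMXBeans().size();
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long max = pool.getUsage().getMax(); // -1 if undefined
            System.out.printf("%-25s max=%d MB%n",
                    pool.getName(), max < 0 ? -1 : max / (1024 * 1024));
        }
    }
}
```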
<span style="font-family: inherit;">
</span>
<br />
<div>
Which of these actions will have an effect depends on the allocation behaviour of your application.<br />
<br />
<br />
<blockquote class="tr_bq">
<i><b>Finding #2:</b> </i></blockquote>
<blockquote class="tr_bq">
<b><i>When adjusting Survivor Ratio and/or MaxTenuringThreshold manually, always switch off auto adjustment with<br /> <span style="font-family: Courier New, Courier, monospace;">-XX:-UseAdaptiveSizePolicy</span></i></b></blockquote>
<br />
<br />
Below are the effects of NewSpace settings on memory sizing. Note that the sizing of NewSpace directly reduces the available space in OldSpace. So if your application holds a lot of statically allocated data, increasing the size of the young generation might require you to increase the overall memory size.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkSpnlm8uu9JVz_Vcs6mtv6a-vFCrGIvF48ka78oPDenfhPplR6CTvd5d520xZDQh13526O1fwH-3WmXFDqjmHnSCJH4OFBoUT-DOfWxZUu73HyLGn9Jm8tfoYiDGj_HoPbh9xMIaO09o/s1600/gc-sizing.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgkSpnlm8uu9JVz_Vcs6mtv6a-vFCrGIvF48ka78oPDenfhPplR6CTvd5d520xZDQh13526O1fwH-3WmXFDqjmHnSCJH4OFBoUT-DOfWxZUu73HyLGn9Jm8tfoYiDGj_HoPbh9xMIaO09o/s1600/gc-sizing.PNG" width="640" /></a></div>
</div>
<div>
<br /></div>
<div>
Note: In practice one would evaluate the required size of NewSpace and specify it absolutely with <span style="color: #333333; font-family: 'Courier New'; font-size: 10pt;">-XX:NewSize=X -XX:MaxNewSize=X</span>. This way, changing -Xmx will affect OldSpace size only and will not mess up absolute survivor space sizes by applying ratios.<br />
<br />
Actually, the Default GC is the most useful collector for profiling the NewSpace setup, since there is no second background collector blurring OldSpace growth.</div>
<div>
<br />
<b>
Survivor Sizing</b></div>
<div>
<br />
The most important thing is to figure out a good sizing (absolute, not ratio) for the survivor spaces and the promotion counter (MaxTenuringThreshold). Unfortunately there is a strong interaction between Eden size and MaxTenuringThreshold: if Eden is small, objects are put into survivor spaces faster, and additionally the tenuring counter is incremented more quickly. This means that if Eden is doubled in size, you probably want to decrease your MaxTenuringThreshold and vice versa. This gets even more complicated as the number of Eden GCs (=minor GCs) also depends on the application's allocation rate.<br />
<br /></div>
<div>
The optimal survivor size is large enough to hold middle-lived objects under full application load without promoting them due to size shortage.<br />
<br />
<ul>
<li>If survivor space is too large, it's just a waste of memory which could be given to Eden instead. Additionally, there is a correlation between survivor size and minor GC pauses (throughput degradation, jitter) [Fixme: to be proven].</li>
<li>If survivor space is too small, Objects will be promoted to OldSpace too early even if TenuringThreshold is not reached yet.</li>
</ul>
<br />
<b>
MaxTenuringThreshold</b><br />
<br /></div>
<div>
This defines how many minor GC's an object may survive in survivor space before being tenured to OldSpace. Again, you have to optimize this under maximum application load, as without load there are fewer minor GC's, so the "survived" counters will be lower. Another issue to consider is that Eden size also affects the frequency of minor GC's. The VM will treat your value as an upper bound only and will automatically use lower values if it thinks those are more appropriate.<br />
<br />
<ul>
<li>If MaxTenuringThreshold is too high, throughput might suffer, as non-temporary objects will be subject to minor collection, which slows down the application. As said, the VM automatically corrects that.</li>
<li>If MaxTenuringThreshold is too low, temporary objects might get promoted to OldSpace too early. This can hamper the OldSpace collector and increase the number of OldSpace GC's. If the promotion rate gets too high, even the concurrent collectors (CMS, G1) will do a full-stop GC.</li>
</ul>
<br />
If in doubt, set MaxTenuringThreshold rather too high; this won't have a significant impact on application performance in practice. <br />
It also strongly depends on the coding quality of the application: if you are the kind of guy preferring zero-allocation programming, even MaxTenuringThreshold=0 might be adequate (there is also a kind of "alwaysTenure" option). The other extreme is "return new HashBuilder().computeHash(this);"-style code (some alternative VM languages produce lots of short- to mid-lived temporary garbage), where a setting like '30' or higher (which most often means: keep survivors as long as there is room in survivor space) might be required.<br />
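The two styles mentioned above could be contrasted like this (HashBuilder is the post's hypothetical name, so this sketch just mimics the pattern with boxed vs primitive accumulation):

```java
// Contrasts allocating and zero-allocation coding styles. Both methods
// compute the same hash; only the garbage they produce differs.
public class AllocationStyles {

    // Allocating style: one boxed Integer per loop step. The values
    // quickly exceed the VM's small-Integer cache, so nearly every
    // iteration creates a fresh temporary object for the GC.
    static int hashAllocating(int[] data) {
        Integer h = Integer.valueOf(1);
        for (int d : data) h = Integer.valueOf(31 * h + d);
        return h;
    }

    // Zero-allocation style: the accumulator stays a primitive
    // (effectively a register), producing no garbage at all.
    static int hashReusing(int[] data) {
        int h = 1;
        for (int d : data) h = 31 * h + d;
        return h;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 3};
        System.out.println(hashAllocating(sample) == hashReusing(sample));
    }
}
```

With the allocating style a higher MaxTenuringThreshold helps keep the short-lived boxes out of OldSpace; with the zero-allocation style there is simply nothing to tenure.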
<br />
<br />
Common sign of too slow promotion (MaxTenuringThreshold too high, Survivor size too high):<br />
(<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">-Xmx12g -Xms12g -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=3 -XX:NewRatio=2 -XX:MaxTenuringThreshold=20</span>)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwpfG18nT79zQ8jy5BLJZuUPzWdTIHnjdKpgf0wz0Wi8X-k2XiWnjs7_SyAEfdjFVkYe4RzuyblvVrK81OoN_yrgbmAaNHrNmAypIsWrJYqoqwR3HrAk9g_J2H56udw-MV-rlGHcC9kBY/s1600/survivortoolarge.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="136" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiwpfG18nT79zQ8jy5BLJZuUPzWdTIHnjdKpgf0wz0Wi8X-k2XiWnjs7_SyAEfdjFVkYe4RzuyblvVrK81OoN_yrgbmAaNHrNmAypIsWrJYqoqwR3HrAk9g_J2H56udw-MV-rlGHcC9kBY/s1600/survivortoolarge.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: x-small;"><i>bad survivor, TenuringThreshold setting</i></span></div>
<div>
<br /></div>
<div>
Initially it looks like no objects are promoted to OldSpace, as they actually "sit" in survivor space. Once the survivor spaces fill up, survivors are tenured to OldSpace, resulting in a sudden increase of the promotion rate. (5-minute chart of the benchmark running with the default GC; the application does not actually allocate more memory, it just constantly replaces small fractions of the initially allocated pseudo-static objects). This will probably confuse concurrent OldSpace collectors, as they will start concurrent collection too late. Beware: clever project managers might bug you to look for memory leaks in your application or to plow through the logs to find out "what happened at 14:36 when memory consumption all of a sudden started to rise".</div>
<div>
<br /></div>
<div>
Same benchmark with better settings (<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">-Xmx12g -Xms12g -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=4 -XX:NewSize=4g -XX:MaxNewSize=4g -XX:MaxTenuringThreshold=15</span>):</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHiamArLFUfMkw7w3R4WpRjYtcPUi4woIy-4M28g7SBh7PJUgWYYA0xVxLRgDZBgImIhE3LWLuuFjGeBKYH8FExUwwl56o35Y6Qz96bRpgob8VBQR13bmwGNYvuDiZ4m5ltt6bhMYwaKU/s1600/survivor_right.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="165" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHiamArLFUfMkw7w3R4WpRjYtcPUi4woIy-4M28g7SBh7PJUgWYYA0xVxLRgDZBgImIhE3LWLuuFjGeBKYH8FExUwwl56o35Y6Qz96bRpgob8VBQR13bmwGNYvuDiZ4m5ltt6bhMYwaKU/s1600/survivor_right.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i style="font-size: small;">better survivor, TenuringThreshold setting</i></div>
<div>
<br /></div>
<div>
The promotion rate now reflects the actual allocation rate of long-lived objects. Since there is no concurrent OldSpace collector in the default GC, it looks like permanent growth. Once the limit is reached, a Full-GC will be triggered and will clean up unused long-lived objects. A concurrent collector like CMS or G1 will now be able to detect the promotion rate and keep up with it.</div>
<br />
<b>
Eden Size</b><br />
<b><br /></b>
Eden size directly correlates with throughput, as most Java applications create a lot of temporary objects.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirrRzmnTkmpqVzEghGhYfV5AkmY06VZFINjBm58nU3W9gvGTblpD2aQGhgi5m1FvbvskuvduS0l9wImgzHerG744Fw-zjnbZQN4EUcg7HECUtv3-uSZJ8vcltrLveJrqb53DpLJYEDugs/s1600/throughputeden.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEirrRzmnTkmpqVzEghGhYfV5AkmY06VZFINjBm58nU3W9gvGTblpD2aQGhgi5m1FvbvskuvduS0l9wImgzHerG744Fw-zjnbZQN4EUcg7HECUtv3-uSZJ8vcltrLveJrqb53DpLJYEDugs/s1600/throughputeden.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i><span style="font-size: x-small;">as measured with high allocation rate application</span></i></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdtfZpW2H8-lP2MCTtIvMAzd2H1OyiFqJbZ38aN60oR2Ae2IYzbeI_O8Z9BbKVnl5-1KRistDHJWdAJXWfNiwDucVs8vQkxpYfgW4QqkkSPNLRIyfXBxuYVixIWaSGFvwNYZMP6aZT-Wo/s1600/pauseseden.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdtfZpW2H8-lP2MCTtIvMAzd2H1OyiFqJbZ38aN60oR2Ae2IYzbeI_O8Z9BbKVnl5-1KRistDHJWdAJXWfNiwDucVs8vQkxpYfgW4QqkkSPNLRIyfXBxuYVixIWaSGFvwNYZMP6aZT-Wo/s1600/pauseseden.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: x-small;"><i>Eden size: red: 642MB, blue: 1,88GB, yellow: 3,13GB, green: 4,39GB</i></span></div>
<br />
As one can see, Eden size correlates with the number, not the length, of minor collections.<br />
<br />
Settings used for benchmark (Survivor spaces are kept constant, only Eden is enlarged):<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=1 -XX:NewSize=1926m -XX:MaxNewSize=1926m -XX:MaxTenuringThreshold=40 <br />= 642m Eden<br /><br />-XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=3 -XX:NewSize=3210m -XX:MaxNewSize=3210m -XX:MaxTenuringThreshold=15 <br />= 1,88g Eden<br /><br />-XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=5 -XX:NewSize=4494m -XX:MaxNewSize=4494m -XX:MaxTenuringThreshold=12 <br />= 3,13g Eden<br /><br />-XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=7 -XX:NewSize=5778m -XX:MaxNewSize=5778m -XX:MaxTenuringThreshold=8 <br />= 4,39g Eden (heap size 14Gb required)</span><br />
<br />
<blockquote class="tr_bq">
<i><b>Finding #3: </b></i><br />
<ul>
<li><i><b>Eden size strongly correlates with throughput/performance for common java applications (with common allocation patterns). The difference can be massive.</b></i></li>
</ul>
<ul>
<li><i><b>Eden size, Allocation rate, Survivor size and TenuringThreshold are interdependent. If one of those factors is modified, the others need readjustment</b></i></li>
</ul>
<ul>
<li><i><b>Ratio-based options can be dangerous and lead to false assumptions. NewSpace size should be specified using absolute settings (-XX:NewSize=)</b></i></li>
</ul>
<ul>
<li><i><b>Wrong sizing can lead to strange allocation and memory consumption patterns</b></i> </li>
</ul>
</blockquote>
<div>
<br /></div>
<h3>
Tuning OldSpace Garbage Collectors</h3>
<div>
<br /></div>
<b>Default GC</b><br />
<b><br /></b>
<br />
<div>
The Default GC has a Full-Stop-GC (>15 seconds with the benchmark), so there is not much to do. If you want to run your application with the Default GC (e.g. because of high throughput requirements), your only choice is to tune the NewSpace GC very aggressively, then throw tons of heap at your application in order to avoid a Full-GC during the day. If your system runs 24x7, consider triggering a full GC using System.gc() at night when you expect system load to be low.</div>
<div>
Another possibility would be to size the VM even bigger than your physical RAM, so tenured objects are written to swap. However, you then have to be sure that a Full-GC is never triggered, because its duration would run into the minutes. I have not tried this.</div>
<div>
Of course one can always improve things by writing less memory-intensive code, but that is not the topic of this post.</div>
<br /></div>
<div>
<b>
Concurrent Mark & Sweep (CMS)</b><br />
<b><br /></b>
<br />
<div>
The CMS collector does a pretty good job as long as your promotion rate is not too high (which should not be the case if you optimized it as described above).</div>
<div>
One key setting of the CMS collector is when to trigger the concurrent full GC. If it is triggered too late, it might not finish in time and a Full-Stop-GC will happen. If you know your application has, say, 30% statically allocated data, you might want to set this to 30% with -XX:+UseCMSInitiatingOccupancyOnly<br />
-XX:CMSInitiatingOccupancyFraction=30. In practice I always start with a value of 0, then experiment with higher values once everything (NewSpace, OldSpace) is calibrated to operate without triggering Full-GC under load.</div>
<br />
If I copy the settings evaluated in the "NewGen Tuning" section straight over, the result is a permanent Full-GC. Why? <br />
Because CMS requires more heap than the Default GC: the same data structures simply require ~20-30% more memory. So we just have to multiply the NewSpace settings evaluated for the Default GC by 1.3.<br />
Additionally, a good start is to let the concurrent Mark & Sweep run all the time.<br />
<br />
So I go with (the second NewSpace config from above, multiplied by 1.3):<br />
<br />
-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly <br />
-XX:CMSInitiatingOccupancyFraction=10 -Xmx12g -Xms12g -XX:-UseAdaptiveSizePolicy <br />
-XX:SurvivorRatio=3 -XX:NewSize=4173m -XX:MaxNewSize=4173m <br />
-XX:MaxTenuringThreshold=15<br />
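The 1.3 scaling can be checked against the second NewSpace config from the table above (3210m * 1.3 = 4173m); the helper name is mine:

```java
public class CmsScale {
    // CMS needs roughly 20-30% more memory for the same data structures,
    // so NewSpace sizes tuned under the Default GC are multiplied by 1.3
    static long scaleForCms(long defaultGcMb) {
        return Math.round(defaultGcMb * 1.3);
    }

    public static void main(String[] args) {
        System.out.println(scaleForCms(3210)); // 4173 -> -XX:NewSize=4173m
    }
}
```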
<br />
The CMS is barely able to keep up with the promotion rate; throughput is 300.000, which is acceptable given that (in contrast to the tests above) the cost of Full-GC is included.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsN80LBcxejVSIS2WL0NfPlrkvsNa3OpIl0EzMFFpq3Nql09XdVCXGxobPsWfrHKwfineQOlSPs6qzlj9-UJotEbemw8UxMaldKBki-9dI9FS6B1MCLzIQG3jUE7-qr6wwcINd8g1-o0Y/s1600/cmsoldgen.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjsN80LBcxejVSIS2WL0NfPlrkvsNa3OpIl0EzMFFpq3Nql09XdVCXGxobPsWfrHKwfineQOlSPs6qzlj9-UJotEbemw8UxMaldKBki-9dI9FS6B1MCLzIQG3jUE7-qr6wwcINd8g1-o0Y/s1600/cmsoldgen.PNG" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i><span style="font-size: x-small;">CMS OldSpace collection can barely keep up with the promotion rate</span></i></div>
<br />
We can see from Visual GC (an excellent jVisualVM plugin) that OldSpace size is on the edge of triggering a full GC. In order to improve throughput, we would like to increase Eden. Unfortunately there is not enough headroom in OldSpace, as a reduction of OldSpace would trigger Full-Stop-GCs.<br />
Reducing the size of the survivor spaces is not an option either, as this would result in a higher tenuring rate and again trigger Full-GCs. The only solution is: more memory.<br />
Comparing throughput with the Default GC test above is not fair, as the Default GC would certainly run into Full-Stop-GCs if the test ran for longer than 5 minutes.<br />
On a side note: jVisualVM's memory chart (and the printouts of -verbose:gc) does not tell you the full story, as you cannot see the fraction of used memory in OldSpace.<br />
<br />
OK, so let's add 2 more GB to be able to increase Eden, resulting in<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly <br />-XX:CMSInitiatingOccupancyFraction=10 -Xmx14g -Xms14g -XX:-UseAdaptiveSizePolicy <br />-XX:SurvivorRatio=4 -XX:NewSize=5004m -XX:MaxNewSize=5004m <br />-XX:MaxTenuringThreshold=15</span><br />
<br />
(Eden size is 3,25 GB).<br />
This increases throughput by 10% to 330.000.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRin5NIinL6XuH1RybThV-NS0PTh8AzpHyKD51R3bODMaWt0X_QrWVqBOjjeMKsOnIZ0j95Jh7Y_W5meGzwIAXrXeWexSObTqU0W2r-qhXQmBY1QDyb_RC0mxrWs9ndAE_3Em0v07Ef28/s1600/cmspauses.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRin5NIinL6XuH1RybThV-NS0PTh8AzpHyKD51R3bODMaWt0X_QrWVqBOjjeMKsOnIZ0j95Jh7Y_W5meGzwIAXrXeWexSObTqU0W2r-qhXQmBY1QDyb_RC0mxrWs9ndAE_3Em0v07Ef28/s1600/cmspauses.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i><span style="font-size: x-small;">blue is Heap 12g, Eden 2,4g, red is Heap 14g, Eden 3,2g</span></i></div>
<br />
The longest pause observed in processing was 400ms. Exact response time distribution:<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">-XX:+UseConcMarkSweepGC <br />-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=10 <br />-Xmx14g -Xms14g <br />-XX:-UseAdaptiveSizePolicy <br />-XX:SurvivorRatio=4 -XX:NewSize=5004m -XX:MaxNewSize=5004m <br />-XX:MaxTenuringThreshold=15</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">Iterations 327175</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[0]<span class="Apple-tab-span" style="white-space: pre;"> </span>146321</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[1]<span class="Apple-tab-span" style="white-space: pre;"> </span>179968</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[2]<span class="Apple-tab-span" style="white-space: pre;"> </span>432</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[3]<span class="Apple-tab-span" style="white-space: pre;"> </span>33</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[14]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[70]<span class="Apple-tab-span" style="white-space: pre;"> </span>2</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[82]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[83]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[85]<span class="Apple-tab-span" style="white-space: pre;"> </span>2</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[87]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>3</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[200]<span class="Apple-tab-span" style="white-space: pre;"> </span>355</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[300]<span class="Apple-tab-span" style="white-space: pre;"> </span>44</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">[400]<span class="Apple-tab-span" style="white-space: pre;"> </span>11</span><br />
<br />
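The distribution above is bucketed: below 100 ms each millisecond gets its own bin, above that, pauses are grouped into 100 ms bins (so [200] counts delays between 200 and 299 ms). A sketch of such a bucketing function (my reconstruction; the benchmark's actual printout code is not shown here):

```java
public class LatencyBuckets {
    // < 100 ms: one bin per millisecond; >= 100 ms: one bin per 100 ms
    static long bucket(long millis) {
        return millis < 100 ? millis : (millis / 100) * 100;
    }

    public static void main(String[] args) {
        System.out.println(bucket(1));   // 1
        System.out.println(bucket(87));  // 87
        System.out.println(bucket(250)); // 200
        System.out.println(bucket(412)); // 400
    }
}
```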
<br />
<blockquote class="tr_bq">
<b><i>Finding #4:</i></b><br />
<ul>
<li><b><i>Eden size still correlates with throughput for CMS</i></b></li>
</ul>
<ul>
<li><b><i>Compared with the Default Hotspot GC all NewGen sizes must be multiplied by 1.3</i></b></li>
</ul>
<ul>
<li><b><i>CMS also requires 20%-30% more Heap for identical data structures</i></b></li>
</ul>
<ul>
<li><b><i>No Full-Stop-GCs at all if you can provide the necessary amount of memory (large NewSpace + HeadRoom in OldSpace)</i></b></li>
</ul>
</blockquote>
<br />
<b>
The G1 Collector</b><br />
<b><br /></b>
<br />
<div>
The G1 collector is different. Its heap is divided into fixed-size regions, each with its own role, so absolute values set with <span style="font-family: 'Courier New', Courier, monospace; font-size: xx-small;">-XX:NewSize=5004m -XX:MaxNewSize=5004m</span><span style="font-family: 'Courier New', Courier, monospace; font-size: xx-small;"> </span>are actually ignored (don't ask me how long I fiddled with G1 and the NewSize parameters until I realized this). Additionally, VisualGC does not work with the G1 collector.</div>
<div>
Anyway, we still know that the benchmark succeeds in being one of the greediest Java programs out there, so ..</div>
<div>
<ul>
<li>we like large survivor spaces and a large NewSpace to reduce promotion to OldSpace</li>
<li>we like large Eden to trade memory efficiency against throughput. Memory is cheap.</li>
</ul>
</div>
Unfortunately not much is known about the efficiency of the G1 OldSpace collector, so the only option is testing. If the G1 OldSpace collector is more efficient than the CMS OldSpace collector, we could decrease survivor space in favour of a bigger Eden, because a more efficient OldSpace collector lets us afford a higher promotion rate.</div>
<div>
<br /></div>
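Since only ratio-based sizing is acknowledged, -XX:NewRatio matters here: it is the ratio of OldSpace to NewSpace, so NewSpace = heap / (NewRatio + 1), and with -XX:NewRatio=1 half of the 12g heap goes to NewSpace. A quick check (the helper name is mine):

```java
public class G1NewSpace {
    // HotSpot: NewRatio = OldSpace / NewSpace, so NewSpace = heap / (NewRatio + 1)
    static long newSpaceMb(long heapMb, int newRatio) {
        return heapMb / (newRatio + 1);
    }

    public static void main(String[] args) {
        System.out.println(newSpaceMb(12288, 1)); // 6144 -> ~6g NewSpace with -Xmx12g
    }
}
```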
<div>
Here we go:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-Xmx12g -Xms12g -XX:+UseG1GC -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=1 -XX:NewRatio=1 -XX:MaxTenuringThreshold=15 </span></div>
<div>
yields:</div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 249372</span><br />
(values < 100ms skipped)</div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>10</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[200]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[400]<span class="Apple-tab-span" style="white-space: pre;"> </span>4</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[500]<span class="Apple-tab-span" style="white-space: pre;"> </span>198</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[600]<span class="Apple-tab-span" style="white-space: pre;"> </span>15</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[700]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span></div>
<br />
well, that's pretty good for a first effort. Let's try to decrease the survivors and start concurrent G1 earlier ..</div>
<div>
<br /></div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-Xmx12g -Xms12g -XX:+UseG1GC -XX:-UseAdaptiveSizePolicy <b>-XX:SurvivorRatio=2</b> -XX:NewRatio=1 -XX:MaxTenuringThreshold=15 <span style="background-color: white; white-space: pre-wrap;">-XX:InitiatingHeapOccupancyPercent=0</span></span><br />
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 233262</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>3</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[200]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[400]<span class="Apple-tab-span" style="white-space: pre;"> </span>81</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[500]<span class="Apple-tab-span" style="white-space: pre;"> </span>96</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[600]<span class="Apple-tab-span" style="white-space: pre;"> </span>3</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[700]<span class="Apple-tab-span" style="white-space: pre;"> </span>5</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[800]<span class="Apple-tab-span" style="white-space: pre;"> </span>2</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[1000]<span class="Apple-tab-span" style="white-space: pre;"> </span>18</span><br />
<div>
<br /></div>
<br />
err .. nope. Decreasing survivor space puts more load on the G1 OldSpace GC, as the number of pauses in the one-second range increased significantly. Just another blind shot: shrink the survivors further and increase the G1 OldSpace load even more ..<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-Xmx12g -Xms12g -XX:+UseG1GC -XX:-UseAdaptiveSizePolicy <b>-XX:SurvivorRatio=6</b> -XX:NewRatio=1 -XX:MaxTenuringThreshold=15 <span style="background-color: white; white-space: pre-wrap;">-XX:InitiatingHeapOccupancyPercent=0</span></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span></div>
<div>
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 253293</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>5</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[200]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[400]<span class="Apple-tab-span" style="white-space: pre;"> </span>186</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[500]<span class="Apple-tab-span" style="white-space: pre;"> </span>2</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[600]<span class="Apple-tab-span" style="white-space: pre;"> </span>9</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[700]<span class="Apple-tab-span" style="white-space: pre;"> </span>5</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[1000]<span class="Apple-tab-span" style="white-space: pre;"> </span>16</span><br />
<br />
same throughput as the initial trial, but larger pauses .. so I still need large survivor and Eden spaces. The only way to influence the absolute NewSpace size is changing the region size (I take the max size, of course).<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">-Xmx12g -Xms12g -XX:+UseG1GC -XX:-UseAdaptiveSizePolicy <b>-XX:SurvivorRatio=1</b> -XX:NewRatio=1 -XX:MaxTenuringThreshold=15 <b>-XX:G1HeapRegionSize=32m</b></span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><br /></span>
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 275464</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[100]<span class="Apple-tab-span" style="white-space: pre;"> </span>37</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[300]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[400]<span class="Apple-tab-span" style="white-space: pre;"> </span>5</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[500]<span class="Apple-tab-span" style="white-space: pre;"> </span>189</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[600]<span class="Apple-tab-span" style="white-space: pre;"> </span>16</span><br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[900]<span class="Apple-tab-span" style="white-space: pre;"> </span>1</span><br />
<br />
Bingo! Unfortunately all known tuning options are exhausted now .. except the main heap size ..<br />
<div>
<br /></div>
<br />
<br />
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><b>-Xmx14g -Xms14g</b> -XX:+UseG1GC -XX:SurvivorRatio=1 -XX:NewRatio=1 -XX:MaxTenuringThreshold=15 </span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">-XX:-UseAdaptiveSizePolicy </span><span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">-XX:G1HeapRegionSize=32m</span><br />
<div>
<b><br /></b></div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 296270<br />[100] 58<br />[200] 3<br />[300] 2<br />[400] 4<br />[500] 160<br />[600] 15<br />[700] 11</span><br />
<div>
<b><br /></b></div>
This is slightly worse in latency and throughput than a properly configured CMS. One last shot in order to figure out the behaviour under memory size constraints (the real peak size of the bench is only around 4.5g ... sigh):</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;"><b>-Xmx8g -Xms8g</b> -XX:+UseG1GC -XX:SurvivorRatio=1 <b>-XX:NewRatio=2</b> -XX:MaxTenuringThreshold=15 -XX:-UseAdaptiveSizePolicy -XX:G1HeapRegionSize=32m</span><br />
<div>
<br /></div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">Iterations 167721<br />[100] 37<br />[200] 1</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">[300] 3<br />[400] 297<br />[500] 7<br />[600] 9<br />[700] 5<br />[1000] 15<br />[2000] 1 </span><br />
<br />
<br />
<div>
It seems G1 excels in memory efficiency, getting by without overly long pauses (CMS falls into permanent Full-GC if I try to run the bench with an 8g heap). This can be important in memory-constrained server environments and on the client side. </div>
<div>
<br /></div>
<div>
<br />
<blockquote class="tr_bq">
<b><i>Finding #5:</i></b><br />
<ul>
<li><b><i>Absolute settings for NewSpace size are ignored by G1. Only the ratio based options are acknowledged.</i></b></li>
</ul>
<ul>
<li><b><i>In order to influence eden/survivor space, one can increase/decrease the region size up to 32m. This also gave the best throughput (possibly because of the larger eden)</i></b></li>
</ul>
<ul>
<li><b><i>G1 can handle memory constraints the best without extremely long pauses</i></b></li>
</ul>
<ul>
<li><b><i>Latency distribution is ok, but behind CMS</i></b></li>
</ul>
</blockquote>
<br /></div>
<h3>
Conclusion</h3>
<div>
<b>Disclaimer: </b><br />
<br />
<ul>
<li>The benchmark used is a worst case regarding allocation rate and memory waste, so take any finding with a grain of salt when applying GC optimizations to your program.</li>
<li>I did not run long-term tests; all tests ran for 5 minutes only. Due to the extreme allocation rate of the benchmark, a 5-minute run is likely equivalent to an hour of operation of a "real" program. In an application with a lower allocation rate, concurrent collectors will have more time to complete GCs concurrently, so you will probably never need an Eden size of 4GB in practice :).<br />I will provide long-term runs in a separate post (maybe :) ).</li>
</ul>
<br />
<br />
<br />
<b>Default GC (Serial Mark&Sweep, Serial NewGC)</b> shows the highest throughput as long as no Full-GC is triggered. <br />
If your system only has to run for a limited amount of time (like 12 hours) and you are willing to invest in a very careful programming style regarding allocation and keep large datasets off-heap, the Default GC can be the best choice. Of course there are also applications which are fine with some long GC pauses here and there.</div>
<div>
<br /></div>
<div>
<b>CMS </b>does best in pause-free, low-latency operation as long as you are willing to throw memory at it. Throughput is pretty good. Unfortunately it does not compact the heap, so fragmentation can become an issue over time. This is not covered here, as it would require running real application tests with many different object sizes for several days. CMS is still way behind commercial low-latency solutions such as Azul's Zing VM.</div>
<div>
<br /></div>
<div>
<b>G1 </b>excels in robustness and memory efficiency with acceptable throughput.<b> </b>While CMS and the Default GC react to OldSpace overflow with a Full-Stop-GC of several seconds up to minutes (depending on heap size and object graph complexity), G1 handles those situations much more robustly. Taking into account that the benchmark represents a worst-case scenario in allocation rate and programming style, the results are encouraging.</div>
<div>
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGKvPNwmiHtcUvIQ3voGAEIPEhwj_KCox5q4DTXVpCo3PHBdddJusQijs9AY_irjZbZaY9nOXWvEkPo0FQQuryZbUvwV4lqlBfpFKxLhQsihg8IEJia-2hZEimANZHKp4GKlQERLTQAsE/s1600/latencyall.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjGKvPNwmiHtcUvIQ3voGAEIPEhwj_KCox5q4DTXVpCo3PHBdddJusQijs9AY_irjZbZaY9nOXWvEkPo0FQQuryZbUvwV4lqlBfpFKxLhQsihg8IEJia-2hZEimANZHKp4GKlQERLTQAsE/s1600/latencyall.PNG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: x-small;"><i>Blue:CMS, Red: Default GC, Yellow: G1</i></span></div>
<div>
<br /></div>
Note: G1 has the lowest number of pauses. (Default GC has been tweaked to not do >15 second Full GC during test duration)<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiM-1PXZhyphenhyphen27hztHTZolErwFY6hXeyhMF8-8W3xaaxotj5V5ZUBUKB44XMXn8JmjDa0AnmHrayJJdw3B39k-5C1DPn5JkhrYgQHMW5MwRVbFdKCf4swRcqsgIux8tsNA7R_KjZ61STfB44/s1600/through.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiM-1PXZhyphenhyphen27hztHTZolErwFY6hXeyhMF8-8W3xaaxotj5V5ZUBUKB44XMXn8JmjDa0AnmHrayJJdw3B39k-5C1DPn5JkhrYgQHMW5MwRVbFdKCf4swRcqsgIux8tsNA7R_KjZ61STfB44/s1600/through.PNG" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<i><span style="font-size: x-small;">Default GC has been tweaked to not Full GC during test duration</span></i></div>
<div>
<b><br /></b></div>
<div>
<b><br /></b></div>
<div>
<b><br /></b></div>
<h3>
<b>The Benchmark source</b></h3>
</div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">public class FSTGCMark {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> static class UseLessWrapper {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> Object wrapped;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> UseLessWrapper(Object wrapped) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> this.wrapped = wrapped;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> static HashMap map = new HashMap();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> static int hmFillRange = 1000000 * 30; //</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> static int mutatingRange = 2000000; //</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> static int operationStep = 1000;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int operCount;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int milliDelayCount[] = new int[100];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int hundredMilliDelayCount[] = new int[100];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int secondDelayCount[] = new int[100];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> Random rand = new Random(1000);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int stepCount = 0;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> public void operateStep() {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> stepCount++;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( stepCount%100 == 0 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> // enforce some tenuring</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange)+mutatingRange;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new UseLessWrapper(""+stepCount)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( stepCount%200 == 199 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> // enforce some tenuring</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange)+mutatingRange*2;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new UseLessWrapper("a"+stepCount)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( stepCount%400 == 299 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> // enforce some tenuring</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange)+mutatingRange*3;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new UseLessWrapper("a"+stepCount)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( stepCount%1000 == 999 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> // enforce some tenuring</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * hmFillRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new UseLessWrapper("a"+stepCount)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/2; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new Dimension(key,key)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/8; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new UseLessWrapper(new UseLessWrapper(new UseLessWrapper(new UseLessWrapper("pok"+i))))));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/16; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new int[50]));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/32; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, ""+new UseLessWrapper(new int[100]));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/32; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> Object[] wrapped = new Object[100];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for (int j = 0; j < wrapped.length; j++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> wrapped[j] = ""+j;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(wrapped));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < operationStep/64; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * mutatingRange /64);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new int[1000]));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < 4; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int key = (int) (rand.nextDouble() * 16);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(key, new UseLessWrapper(new byte[1000000]));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> public void fillMap() {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for ( int i = 0; i < hmFillRange; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> map.put(i, new UseLessWrapper(new UseLessWrapper(""+i)));</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> public void run() {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> fillMap();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.gc();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.out.println("static alloc " + (Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()) / 1000 / 1000 + "mb");</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> long time = System.currentTimeMillis();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int count = 0;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> while ( (System.currentTimeMillis()-time) < runtime) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> count++;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> long tim = System.currentTimeMillis();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> operateStep();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int dur = (int) (System.currentTimeMillis()-tim);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( dur < 100 )</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> milliDelayCount[dur]++;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> else if ( dur < 10*100 )</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> hundredMilliDelayCount[dur/100]++;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> else {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> secondDelayCount[dur/1000]++;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.out.println("Iterations "+count);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> public void dumpResult() {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for (int i = 0; i < milliDelayCount.length; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int i1 = milliDelayCount[i];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( i1 > 0 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.out.println("["+i+"]\t"+i1);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for (int i = 0; i < hundredMilliDelayCount.length; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int i1 = hundredMilliDelayCount[i];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( i1 > 0 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.out.println("["+i*100+"]\t"+i1);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> for (int i = 0; i < secondDelayCount.length; i++) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int i1 = secondDelayCount[i];</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> if ( i1 > 0 ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> System.out.println("["+i*1000+"]\t"+i1);</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> int runtime = 60000 * 5;</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> public static void main( String arg[] ) {</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> FSTGCMark fstgcMark = new FSTGCMark();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> fstgcMark.run();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> fstgcMark.dumpResult();</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;"> }</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace; font-size: xx-small;">}</span></div>
</div>
<div>
<br /></div>