Wednesday, July 31, 2013

Impact of large primitive arrays (BLOBS) on Garbage Collection

While OldGen GC duration ("FullGC") depends on number of objects and the locality of their references to each other, young generation duration depends on the size of OldSpace memory. Alexej Ragozin has made extensive tests which give a good impression of young GC duration vs HeapSize here.

In order to avoid heavy impact of long lived data on OldGen GC, there are several workarounds/techniques.

  1. Put large parts of mostly static data Off-Heap using ByteBuffer.allocateDirect() or Unsafe.allocateMemory(). This memory is then used to store data (e.g. by using a fast serialization like http://code.google.com/p/fast-serialization/ [oops, I did it again] or specialized solutions like http://code.google.com/p/vanilla-java/wiki/HugeCollections ).
    Downside is, that one frequently has to implement a manual memory mangement on top.
  2. "instance saving" on heap by serializing into byte-arrays or transformation of datastructures. This usually involves using open adressed hashmaps without "Entry" Objects, large primitive arrays instead of small Objects like
    class ReferenceDataArray {
        int x[];
        double y[];
        long z[]; 
        public ReferenceDataArray(int size) {
             x = new int[size];
             y = new ...;
             z = ...;
        }
        public long getZ(int index) { return z[index]; }
    },
    replacement of generic collections with <Integer>, <Long> by specialized implementations with direct primitve int, long, ..
    If its worth to cripple your code this way is questionable, however the option exists.

Going the route outlined in (2) improves the effectivity of OldGen GC a lot. FullGC duration can be in the range of 2s even with heap sizes in the 8 GB area. CMS performs significantly better as it can scan OldSpace faster and therefore needs less headroom in order to avoid Full GC.

However there is still the fact, that YoungGen GC scales with OldSpace size.

The scaling effect is usually associated with "cardmarking". Young GC has to remember which areas of OldSpace have been modified (in such a way they reference objects in YoungGen). This is done with kind of a BitField where each bit (or byte) denotes the state of (modified/reference created or similar) a chunk ("card") of OldSpace.
Primitive Arrays basically are BLOBS for the VM, they cannot contain a reference to other Java Objects, so theoretically there is no need to scan or card-mark areas containing BLOBS them when doing GC. One could think e.g. of allocating large primitive arrays from top of oldspace, other objects from bottom this way reducing the amount of scanned cards.

Theory: blobs (primitive arrays) result in shorter young GC pauses then equal amount of heap allocated in smallish Objects.

Therefore I'd like to do a small test, measuring the effects of allocating large primitive arrays (such as byte[], int[], long[], double[]) on NewGen GC duration.

The Test

public class BlobTest {
    static ArrayList blobs = new ArrayList();
    static Object randomStuff[] = new Object[300000];
    public static void main( String arg[] ) {
        if ( Runtime.getRuntime().maxMemory() > 2*1024*1024*1024l) { // 'autodetect' avaiable blob space from mem settings
            int blobGB = (int) (Runtime.getRuntime().maxMemory()/(1024*1024*1024l));
            System.out.println("Allocating "+blobGB*32+" 32Mb blobs ... (="+blobGB+"Gb) ");
            for (int i = 0; i < blobGB*32; i++) {
                blobs.add(new byte[32*1024*1024]);
            }
            System.gc(); // force VM to adapt ..
        }
        // create eden collected tmps with a medium promotion rate (promotion rate can be adjusted by size of randomStuff[])
        while( true ) {
            randomStuff[((int) (Math.random() * randomStuff.length))] = new Rectangle();
        }
    }
}


The while loop at the bottom simulates the allocating application. Because I rewrite random indizes of the randomStuff arrays using a random index, a lot of temporary objects are created, because they same index is rewritten with another object instance. However because of random, some indices will not be hit in time and live longer, so they get promoted. The larger the array, the less likely index overwriting gets, the higher the promotion rate to OldSpace.

In order to avoid bias by VM-autoadjusting, I pin NewGen sizes, so the only variation is the allocation of large byte[] on top the allocation loop. (Note these settings are designed to encourage promotion, they are in now way optimal).


commandline:

java -Xms1g -Xmx1g -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2
by adding more GB the upper part of the test will use any heap above 1 GB to allocate byte[] arrays.

java -Xms3g -Xmx3g -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2
...
...
java -Xms11g -Xmx11g -verbose:gc -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=12 -XX:NewSize=100m -XX:MaxNewSize=100m -XX:MaxTenuringThreshold=2

I am using byte[] arrays in the test, I verified int[], long[] behave exactly the same (must apply divisor then to adjust for larger size).


Results

(jdk 1.7_u21)






The 'Objects' test was done by replacing the static byte[] allocation loop in the benchmark by

            for ( int i = 0; i < blobGB*2700000; i++ )
                nonblobs.add(new Object[] {
                         new Rectangle(),new Rectangle(),new Rectangle(),
                         new Rectangle(),new Rectangle(),new Rectangle(),
                         new Rectangle(),new Rectangle(),new Rectangle(),
                         new Rectangle()});


Conclusion

Flattening data structures using on-heap allocated primitive arrays ('BLOBS') reduces OldGen GC overhead very effective. 
Young Gen pauses slightly reduce for CMS, so scaling with OldGen size is damped but not gone. For DefaultGC (PSYoung), minor pauses are actually slightly higher when the heap is filled with BLOBs.
I am not sure if the observed young gen duration variance has anything to do with "card marking" however i am satisfied quantifying effects of different allocation types and sizes :-) 

Further Improvement incoming ..


With this genious little optimization coming up in JDK 7_u40


card scanning of unmarked cards speeds up by a factor of 8. 

Additonally notice 

(for the test  -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=512 gave the best results)
At least CMS Young Gen pause scaling is not too bad.

And G1 ?

G1 fails to execute the test. If one only allocates 6GB of byte[] with 11GB of heap, it still is much more disruptive than CMS. It works if I use small byte[] chunks of 1MB size and set page size to 32MB. Even then pauses are longer compared to CMS. G1 seems to have problems with large object arrays which will be problematic for IO intensive applications requiring big and many byte buffers.

No comments:

Post a Comment