Sweeping 700× faster

Our initial experiments with the new garbage collector design indicate that it dramatically improves the performance of the slowest GC phase (sweeping), as expected, to the extent that it brings the overall performance of our C++ prototype within 10% of OCaml's without introducing any of the overheads of generational garbage collection.

Moreover, the algorithm is very simple and, in particular, parallel and concurrent variants should be much easier to design with this style of collector than with generational collectors, because regions are a suitable granularity for parallelism.

With the traditional algorithm based on allocated and free lists, the sweep phase of the GC accounted for around a third of the program's total running time, with each sweep taking around 70µs. Using the new bitwise algorithm, each sweep takes just 100ns, a 700× improvement, and sweeping accounts for just 0.7% of the total running time of the program.
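The bitwise sweep can be sketched as follows. This is a minimal illustration under our own assumptions, not HLVM's actual implementation: we assume each region tracks its blocks with a single 64-bit allocated bitmap and a single 64-bit mark bitmap, so sweeping a whole region is a couple of word-sized operations rather than a walk over per-block lists.

```cpp
#include <cstdint>

// Hypothetical per-region state: bit i of `allocated` is set when block i
// is in use; bit i of `marked` is set when the marking phase found block i
// reachable.
struct Region {
  uint64_t allocated;
  uint64_t marked;
};

// Bitwise sweep: every allocated-but-unmarked block is freed at once.
// Returns the bitmap of freed blocks.
inline uint64_t sweep(Region &r) {
  uint64_t freed = r.allocated & ~r.marked; // garbage blocks
  r.allocated = r.marked;                   // only survivors stay allocated
  r.marked = 0;                             // reset for the next GC cycle
  return freed;
}
```

The entire sweep is two bitwise operations and two stores per region, which is why it runs in nanoseconds rather than microseconds.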

Our prototype mark-region collector using a fake marking phase is now only 2–10% slower than OCaml, without having sacrificed either its better performance on other benchmarks or its multicore capability. However, our new design has increased the cost of both allocation and marking. Although allocation and marking accounted for a tiny proportion of the total running time, it remains to be seen whether this new GC algorithm will be faster than a traditional generational collector in this best-case scenario for generational collection once it is implemented in a real VM. In particular, the overheads of HLVM's current shadow stack may well make it slower.
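The extra allocation cost comes from the bitmap representation: instead of bumping a pointer, the allocator must find a free block in the region's bitmap. A minimal sketch, under our own assumptions (a single 64-bit allocated bitmap per region, and GCC/Clang's `__builtin_ctzll` for the bit scan; names hypothetical):

```cpp
#include <cstdint>

// Hypothetical region: bit i of `allocated` is set when block i is in use.
struct Region {
  uint64_t allocated = 0;
};

// Returns the index of the newly allocated block, or -1 if the region is
// exhausted and a fresh region must be fetched (which needs synchronization).
inline int alloc_block(Region &r) {
  uint64_t free_bits = ~r.allocated;
  if (free_bits == 0) return -1;      // region full
  int i = __builtin_ctzll(free_bits); // lowest free block
  r.allocated |= uint64_t(1) << i;
  return i;
}
```

A bit scan plus an OR is still cheap, but it is more work per allocation than a generational nursery's pointer bump, which is where the added allocation cost in this design comes from.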

Our collection strategy also captures the benefits of Appel's semi-generational collection algorithm. Appel's algorithm compacted nursery survivors to one end of the nursery and reused the remaining space as a successively smaller nursery until the entire nursery was full of reachable values, which were then promoted to the old generation. Our region-based collector allows each thread to sweep its own region locally, independently of all other regions and cores. This leads to the same non-linearity that makes Appel's algorithm so effective, but without the overheads of copying. Using regions of 62 blocks, each 64 bytes long, local sweeping increased the number of allocations performed before a new region was required from 62 to a whopping 3,000. Obtaining a new region requires synchronization, so this simple improvement reduced the rate of synchronizations by around a factor of 50. With 8-byte regions, the 11-queens problem can be solved with a single region and, therefore, without any contention at all between threads.

Therefore, we believe our new GC design will allow us to make HLVM's performance competitive on functional benchmarks like this one without losing its substantial advantages on imperative code, where it is often several times faster than OCaml.

