
Showing posts from January, 2010

Naive parallelism with HLVM

The latest OCaml Journal article, High-performance parallel programming with HLVM (23rd January 2010), described a simple but effective parallelization of the HLVM implementation of the ray tracer. Compared with similarly naive parallelizations of the fastest serial implementations in C++ and Haskell, we obtained the following results, which exhibit several interesting characteristics:

- Even the naive parallelization in C++ is significantly faster than HLVM and Haskell.
- C++ and HLVM scale well, with performance improving as more cores are used.
- Despite having serial performance competitive with HLVM, the naively-parallelized Haskell scales poorly. In particular, Haskell failed to obtain a competitive speedup with up to 5 cores and its performance even degraded significantly beyond 5 cores, running 4.4× slower than C++ on 7 cores.

The efficiency of the parallelization can be quantified as the speed of the parallel version on multiple cores relative to its speed on a single core:...

Quadcore ARMs

Since the publication of our recent article about the upcoming ARM architecture in the context of the exploding netbook market, Marvell have announced their production of the world's first quadcore ARM CPU. ARM also announced 2GHz-capable Cortex-A9 cores in September 2009. Hopefully we'll get the chance to port our new multicore-capable HLVM project to the ARM architecture before long. That should be an easy task thanks to LLVM's existing support for ARM; LLVM is already used on the Apple iPhone, which is ARM-based.

Naïve Parallelism: a rebuttal

Several people, including Simon Marlow of Microsoft Research, have objected to our rejection of Saynte's new Haskell code, claiming that the alterations were minor and that they optimize serial performance. This is simply not true. Firstly, over half of the lines of code in the entire program have been altered. Secondly, the new version is slower than Lennart's original version 5 on 1 core. So there is no logical reason to choose the revised Haskell as the basis of a naive parallelization unless you want to cherry-pick results by leveraging knowledge of how they will scale after parallelization. Suffice it to say, doing so would be bad science. This is illustrated in the following graph of performance results for Lennart's original version 5 vs Saynte's new revised Haskell with 11 levels of spheres at 1,024×1,024: The original code is 5% faster on 1 core but scales poorly and is 2.7× slower on 7 cores. The substantially-revised "serial" Haskell code was obviously specif...

Naïve Parallelization: C++ vs Haskell

A member of the Haskell community recently published a blog article revisiting our ray tracer language comparison, claiming to address the question of how naïve parallelizations in these two languages compare. The objective was to make only minimal changes to the programs in order to parallelize them and then compare performance. Our attempts to verify those results turned up a lot of interesting information. Firstly, the Haskell program that was supposedly naïvely parallelized was not the original but, in fact, a complete rewrite. This raises the question of whether the rewrite was specifically designed to be amenable to parallelization and, therefore, whether it is representative of naïve parallelization at all. The C++ used was the original, with minimal changes to parallelize a single loop. Secondly, although the serial benchmark results covered a spectrum of inputs, the parallel results covered only a single case and retrospectively identified the optimal results without alluding ...
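The "minimal change" made to the C++ amounts to splitting one outer loop across threads. A sketch of that idea using std::thread over contiguous row bands (the `shade` function and image layout here are illustrative stand-ins, not the original program's code):

```cpp
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

// Stand-in for the per-pixel work done inside the ray tracer's outer loop.
double shade(int x, int y) { return std::sin(x * 0.1) * std::cos(y * 0.1); }

// Naive parallelization: carve the image rows into one contiguous band
// per hardware thread and render each band concurrently. No other part
// of the program changes.
void render(std::vector<double>& image, int width, int height) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        int y0 = static_cast<int>(height * t / n);
        int y1 = static_cast<int>(height * (t + 1) / n);
        workers.emplace_back([&image, width, y0, y1] {
            for (int y = y0; y < y1; ++y)
                for (int x = 0; x < width; ++x)
                    image[y * width + x] = shade(x, y);
        });
    }
    for (auto& w : workers) w.join();
}
```

Note that static banding like this can suffer load imbalance when scene complexity varies across rows, which is one reason genuinely naive parallelizations of the same algorithm can scale quite differently.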

HLVM on the ray tracer language comparison

We recently completed a translation of our famous ray tracer language comparison to HLVM. The translation is equivalent to the most highly optimized implementations written in other languages, and this allows us to compare HLVM with a variety of competing languages for the first time. The results are astonishing. Running the benchmark with the default settings (level=9, n=5, rendering 87,381 spheres at 512×512) on 32-bit x86 gives the following times for different languages: These results show that HLVM already provides competitive performance for a non-trivial benchmark. HLVM took 6.7s, whereas C++ (compiled with g++ 4.3.3) took only 4.3s and Haskell (compiled with GHC 6.12) took 13.9s. However, cranking up the level parameter to 12 in order to increase the complexity of the scene, rendering a whopping 5,592,405 spheres, we find that HLVM blows away the other garbage-collected languages and is even able to keep up with C++: This remarkable result is a consequence of HLVM's space-ef...

Will Intel lose the computer market to ARM in 2012?

The core of the computer industry, laptops and desktops, is on the brink of revolution, with Intel and Microsoft set to lose their long-term stranglehold for three main reasons:

- Netbook sales are exploding and eating into the laptop and desktop markets.
- Ever more users are operating "in the cloud", where the CPU architecture and OS are irrelevant.
- Demand has shifted, with users now wanting small, solid-state, fanless computers with long battery lives.

Sales of netbooks rose from only 0.4M in 2007 to 11.4M in 2008 and 35M in 2009, with projected sales of 139M in 2013. By the end of 2008, notebook sales had overtaken desktop sales for the first time in history. The tremendous growth in netbook sales was precipitated by the global financial climate, which made cheaper devices more alluring for cash-strapped consumers. Google are desperately trying to get a piece of the action with their Chrome OS, an operating system designed around their Chrome web browser specifically for the new bre...

High-performance parallelism with HLVM

Our open source HLVM project recently reached a major milestone with new support for high-performance shared-memory parallel programming. This is a major advance because it places HLVM among the world's fastest high-level language implementations. The previous HLVM implementation had demonstrated the proof of concept using a stop-the-world mark-sweep garbage collector. This new release optimizes that robust foundation by carrying thread-local data in registers in order to provide excellent performance. The following benchmark results show HLVM beating OCaml on many serial benchmarks on x86 despite the overheads of HLVM's new multicore-capable garbage collector: HLVM is over 2× faster than OCaml on average over these results and 4× faster on the floating-point Fibonacci function.
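Keeping allocator state thread-local is the essence of the optimization described above: the mutator's fast path touches no shared state and needs no locks. A deliberately simplified sketch of thread-local bump allocation in C++ (not HLVM's actual implementation, which keeps this state in registers and falls back to the garbage collector when a region fills):

```cpp
#include <cstddef>
#include <cstdint>

// Each thread allocates from its own bump-pointer region, so the common
// case is a pointer comparison and an addition with no synchronization.
constexpr std::size_t kRegionSize = 1 << 20;

struct Region {
    alignas(16) std::uint8_t buf[kRegionSize];
    std::size_t top = 0;  // offset of the next free byte
};

// One region per thread; a real collector would register these regions
// with the GC so it can scan them during collection.
thread_local Region region;

void* gc_alloc(std::size_t bytes) {
    bytes = (bytes + 15) & ~std::size_t{15};      // round up to 16-byte alignment
    if (region.top + bytes > kRegionSize)
        return nullptr;                           // would trigger a collection here
    void* p = region.buf + region.top;
    region.top += bytes;
    return p;
}
```

The contrast is with a shared free-list or a locked heap, where every allocation from every thread contends on the same state and serializes the very parallelism the benchmark is trying to exploit.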