DSC 2014. Day 1

Peter Dalgaard, Deepayan Sarkar and Martin Maechler in Brixen

This is a report on the first day of the Directions in Statistical Computing (DSC) conference that took place in Brixen, Italy (see here for an introduction). Performance enhancements were the main theme of the day, covering not just improvements to R itself but also alternative implementations of the language.

Luke Tierney started by presenting the new implementation of reference counting, an experimental feature of the current R development branch. This should reduce unnecessary copies of objects when they are modified, which is an important cause of poor performance. Reference counting replaces the simpler “named” mechanism for determining whether an object needs to be copied when it is modified. It uses the same 2-bit field used by “named”, so counts are limited to the set {0,1,2,3+}, where 3+ is “sticky” and cannot be decremented. However, this is sufficient to stop user-defined replacement functions (i.e. function calls that appear on the left-hand side of an assignment) from generating spurious copies.
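To make the idea of a saturating 2-bit count concrete, here is a minimal sketch in Python. It is not R's actual C implementation: the names Obj, incr_ref, decr_ref, needs_copy and set_element are hypothetical, and the rule that an object must be copied when its count exceeds 1 is an assumption made for illustration.

```python
# Illustrative sketch only: a reference count stored in a 2-bit field.
# All names here are hypothetical and do not correspond to R's C API.

STICKY = 3  # the "3+" state: the largest value a 2-bit field can hold


class Obj:
    def __init__(self, data):
        self.data = list(data)
        self.refcount = 0  # always one of 0, 1, 2, 3


def incr_ref(obj):
    # Saturate at STICKY instead of overflowing the 2-bit field.
    if obj.refcount < STICKY:
        obj.refcount += 1


def decr_ref(obj):
    # Once the count has reached STICKY the true count is unknown,
    # so it is never decremented again.
    if 0 < obj.refcount < STICKY:
        obj.refcount -= 1


def needs_copy(obj):
    # Assumption: in-place modification is safe only when at most
    # one reference to the object exists.
    return obj.refcount > 1


def set_element(obj, i, value):
    """Return the object holding the modified data, copying only if shared."""
    target = Obj(obj.data) if needs_copy(obj) else obj
    target.data[i] = value
    return target
```

With a single reference, set_element modifies the object in place; with two or more, it works on a copy and leaves the original untouched. Because the sticky state cannot be decremented, an object that has ever had three or more references is treated as shared from then on.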

Radford Neal (University of Toronto) reviewed some of the optimizations to the R interpreter that he has included in pretty quick R (pqR), his fork of the R-2.15.0 code base. Some of these optimizations are described in more detail in his blog. Two of them, deferred evaluation and task merging, are also used in the alternative R implementations Riposte and Renjin (see below). Notably, pqR already includes full reference counting.

Thomas Kalibera talked about changes to the R byte code compiler that can improve performance without changing R semantics. He has been working with Luke Tierney to incorporate these changes into R.

Of course, optimization requires performance analysis. Helena Kotthaus (Technical University of Dortmund) presented a suite of R-based facilities for performance analysis, including an instrumented version of R and a set of benchmarks. They can all be found in the allr GitHub repository.

In addition to these efforts to improve R, there are several alternate implementations written from scratch. Most of these were well represented at the meeting.

  • Michael Haupt presented FastR, a re-implementation of R in Java, built on top of the Truffle interpreter and the Graal byte compiler. FastR is a collaboration between Purdue University, Johannes Kepler University Linz, and Oracle Labs.
  • Alexander Bertram (BeDataDriven) presented Renjin, another open source R interpreter that runs on the JVM.
  • Justin Talbot (Tableau Software) presented Riposte, a fast interpreter and Just-In-Time (JIT) compiler for R written in C++.

Two further R implementations that were not specifically presented at the meeting should also be mentioned here:

  • CXXR is a refactorization of the R code base by Andrew Runnalls (University of Kent). It replaces parts of the R interpreter written in C with C++ code. Andrew Runnalls was not present at the meeting.
  • TIBCO Enterprise Runtime for R (TERR) is a clean-room re-implementation of R. This is the only closed-source R implementation. Bill Dunlap from TIBCO was an active participant in the meeting but did not give a presentation. However, a presentation on TERR vs R performance, concentrating on memory management issues, was given by Michael Sannella (TIBCO) at last year’s useR! conference, and is available here.

Jan Vitek (Purdue University) gave a nice overview of these R implementations and the problems they face. One of the key issues is the lack of a formal specification of the R language. In other words, there is no document that formally sets out what is allowed, what is not allowed and what is undefined behaviour. The only way to test whether you have a correct re-implementation is to try to run code that also runs on R. The prize here is to be able to run the 5709 R packages on CRAN without modification, but many of the re-implementations fall short of this goal.

As Robert Gentleman noted in summing up, we now have a “language community” of computer scientists interested in R in addition to the R user and developer communities.

