Sunday, July 29, 2012

Pre-allocate your vectors

Or else welcome the good ol' friend, the O(n^2) back in your life.

There are three common ways people add elements to vectors in R. First way:

f1 <- function (n) {
    l <- list()
    for(idx in 1:n) {
        l <- append(l, idx)
    }
    return(l)
}

This is bad. It does what it looks like: adds one element to the vector and then copies it to another variable, which just happens to be the same variable 'l'. This is clearly O(n^2).

The second method is this:
f2 <- function (n) {
    l <- list()
    for(idx in 1:n) {
        l[[length(l) + 1]] <- idx
    }
    return(l)
}

This approach looks decent because, coming from a Python and Java background, a good implementation would allow appending elements to vectors in amortized O(1) time. However, this holds an unpleasant surprize. Another good ol' O(n^2) in lurking in the shadows here.

Time taken to create a vector of size n
The method f2 is arguably better than f1, but none of them is any match for f3 which simply uses pre-allocation:

f3 <- function (n) {
    l <- vector("list", n)
    for(idx in 1:n) {
        l[[idx]] <- idx
    }
    return(l)
}


Of course these are just cooked up examples, but it is not difficult to imagine situations where such incremental creation of list is inevitable (think reading from DB cursors). For this particular example, the fastest solution involves a sapply but is not much faster then f3:

f4 <- function (n) {
    return(as.list(sapply(1:n, function (idx) idx)))
}

So in most cases it'll be better to overestimate the number of items in a list and then truncating it to the right size than dynamically expanding it.

And don't forget to smile and wave at the hidden O(n^2)!

~
musically_ut

PS: If you want to run the profile yourself, here's the gist of the functions and the profiling code.

Colors in R console

Update 2: The colorout package has moved again and is now available on GitHub.

install.packages('devtools')
library(devtools)
install_github('jalvesaq/colorout')

Update: The colorout package on CRAN has not been updated to be compatible with R version 3.x.x yet. However, if you compile and install it yourself, it still works.

download.file("http://www.lepem.ufc.br/jaa/colorout_1.0-1.tar.gz", destfile = "colorout_1.0-1.tar.gz")
install.packages("colorout_1.0-1.tar.gz", type = "source", repos = NULL)

 R console


Let's face it, the R-console is one of the more uninviting things I have seen, perhaps second only to the Ocaml console which comes without readline.

This is what my (and probably your) R console looks like this:
R console without color
What I see on the console is a single color which, firstly, makes it a challenge to separate stderr, output, warnings and errors, and, secondly, is just boring. I did not know what I could do about it and I stuck with it because the only other option seemed to be moving to a GUI (JGR or Rstudio). This is not to say that there is something wrong with the GUIs, but I prefer working with only my keyboard and, hence, shun GUIs almost as a rule. (Eclipse for Java is an exception.)

But it changed when I discovered the package colorout on CRAN, which makes my console look like this:
R console with color (using colorout)
This makes it remarkably easier to differentiate different forms of output.
The coloring is not perfect (notice that the second '-' in the interval outputted for tmp30 is assumed to be a negative sign instead of being a separator), but I would choose it any given day over the drab single color.
I did need a little tweak to my .Rprofile file since I did not like the way the default settings display the error:

~
musically_ut