Wednesday, August 31, 2011

R: Going faster than vectorization

Introduction

Update: R's by function can be used for the same task, since my data is stored in a single data.frame. I have tested it out later in the post.


Recently I had to run through a large data.frame in R, selecting values from one column based on an equality constraint on another column, along these lines:

# For each target value, scan row1 and pull the matching entries of row2
for (i in 1:length(values)) {
  t <- my.var$row2[ my.var$row1 == values[i] ]
  do.something.interesting(t)
}

This is a fairly common operation, and I had thought that beyond the vectorization I had already done (a term borrowed from MATLAB), R would take care of the optimizations under the hood.

At this point, I wondered how I would do this on a larger database, and it struck me that perhaps I could do better than R by manually indexing the data and finding the relevant intervals myself. How much faster could that be?



Analysis

The vectorized version works by looping through each element of row2 and selecting only those for which the corresponding row1 value equals values[i]. Hence, each extraction is itself an O(n) operation.

In the worst case, if all values in row1 are unique, the complexity of this solution will be O(n * n)! Welcome back, good ol' O(n^2) solutions, I missed you. [See question 2 here].

However, if I have sorted the data by row1 and I know a priori the start[i] and end[i] indices for the i-th unique value in row1, then while iterating for values[i] I can splice the row2 vector from start[i] to end[i] and get the result in roughly O(end[i] - start[i]) time for each values[i]. Summing these together, the intervals telescope to cover the whole vector, so I should be able to finish the entire loop in O(n) time irrespective of the input. Of course, I have assumed that it takes a constant amount of time to locate an index in a vector (i.e. that a vector behaves like the classical array data structure).

These start[i] and end[i] values can be found in O(n) time by a linear scan through the sorted row1, or in any faster way; it does not affect the overall worst-case complexity.
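
To make this concrete, here is a minimal sketch on toy data of my own (the key/val vectors and the use of table and cumsum to build the boundaries are my illustration, not code from the original post):

# Toy data (mine): a sorted key column and a value column
key <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)           # plays the role of row1 (sorted)
val <- c(10, 11, 12, 20, 21, 30, 31, 32, 33)  # plays the role of row2

uq    <- unique(key)
end   <- cumsum(table(key))        # last index of each unique key, one linear pass
start <- c(1, head(end, -1) + 1)   # first index of each unique key

for (j in seq_along(uq)) {
  t <- val[ start[j]:end[j] ]      # O(end[j] - start[j] + 1) extraction
  # do.something.interesting(t)
}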

Implementation

For the implementation, I use my own handy data.frame, tmp2, and loop over all its unique user.id values.

> length(tmp2$user.id)  # -> n
[1] 16569
> length(unique(tmp2$user.id))  # -> m
[1] 459
> is.unsorted(tmp2$user.id)
[1] FALSE

The naive implementation is as expected. For the optimized splicing version, I used findInterval to get the boundaries of each value in row1. From the help page for findInterval:

if i <- findInterval(x, v), we have v[i[j]] ≤ x[j] < v[i[j] + 1] where v[0] := -Inf, v[N+1] := +Inf, and N <- length(v)
Hence, i[1] is the location of the last occurrence of x[1] in v, and in general i[j] is the location of the last occurrence of x[j]. In my case, since all occurrences of a value are necessarily contiguous, the boundary indices for the 1st value will be (1, i[1]) and for the k-th value (k > 1) will be (i[k-1] + 1, i[k]). This can easily be adapted for non-contiguous values as well.
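
To check this reading of the help page, here is a small example on the same kind of toy sorted vector as above (again my own made-up data, not from the post):

v <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)  # sorted, with repeated values
x <- unique(v)                     # 1 2 3
i <- findInterval(x, v)
i                                  # 3 5 9: index of the last occurrence of each x[j] in v
loc <- c(0, i)                     # prepend 0, as done in the optimized version below
rbind(start = head(loc, -1) + 1,   # 1, then i[k-1] + 1
      end   = i)                   # paired with i[k]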

Since findInterval checks whether the vector passed to it is sorted, which takes O(n) time, it is not exactly the O(m * log(n)) binary search I would like. But since I make only one call to it to get all the boundaries, I am willing to tolerate that much delay in exchange for an easy-to-write implementation.

Naive Version:
# Relying on vectorization to extract data from the 'interval' column
f1 <- function() {
  uq <- unique(tmp2$user.id)
  for (i in 1:100) {              # repeat 100 times for timing
    for (j in 1:length(uq)) {
      tmp3 <- tmp2$interval[ tmp2$user.id == uq[j] ]
    }
  }
}

Optimized Version:
# Splicing the 'interval' column myself
f2 <- function() {
  uq <- unique(tmp2$user.id)
  for (i in 1:100) {              # repeat 100 times for timing
    loc <- c(0, findInterval(uq, tmp2$user.id))  # boundaries of each user.id
    for (j in 1:length(uq)) {
      tmp4 <- tmp2$interval[ (loc[j]+1):loc[j+1] ]
    }
  }
}

The functions run each method 100 times to make warm-up times negligible.

Results

The results are here:

# Time for the vectorized version
> system.time(f1())
  user  system elapsed
 96.566   0.880 174.327

And for the manually optimized version:

# Time for the manual splicing version
> system.time(f2())
   user  system elapsed
  1.152   0.024   2.165

Conclusion

So, at least for my dataset, the hand-optimized version is roughly 80 times faster than the naive implementation (174.3 s vs 2.2 s elapsed), even though my dataset is far from the worst possible input. This speedup comes at the cost of a slightly more complicated algorithm, which took only one more line of code than the naive implementation.

Lesson learned: vectorization is much easier to write (and fairly fast) for data extraction, but a little more thought about the algorithm can yield a much faster implementation. Do not expect R (or any language, for that matter) to do all the optimization for you.

Update
It turns out that this is such a common operation that R has a built-in function for it, called by.
It seemed unfair to compare the optimized version against only the naive version when there is a better alternative that requires much less code. Also, it no longer seemed right to compare only the extraction of the relevant data.

Hence, I upgraded the code a bit in each case to calculate the average of the second column (renamed to timestamp here):

Naive Version:

# Relying on vectorization to compute the mean of the 'timestamp' column per user.id
f1 <- function() {
  uq <- unique(tmp2$user.id)
  tmp3 <- vector('numeric', length(uq))
  for (i in 1:100) {              # repeat 100 times for timing
    for (j in 1:length(uq)) {
      tmp3[j] <- mean(tmp2$timestamp[ tmp2$user.id == uq[j] ])
    }
  }
  tmp3
}

Optimized Version:
# Splicing the 'timestamp' column myself
f2 <- function() {
  uq <- unique(tmp2$user.id)
  tmp3 <- vector('numeric', length(uq))
  for (i in 1:100) {              # repeat 100 times for timing
    loc <- c(0, findInterval(uq, tmp2$user.id))  # boundaries of each user.id
    for (j in 1:length(uq)) {
      tmp3[j] <- mean(tmp2$timestamp[ (loc[j]+1):loc[j+1] ])
    }
  }
  tmp3
}

Version with by:
f3 <- function() {
  for (i in 1:100) {              # repeat 100 times for timing
    tmp3 <- by(tmp2$timestamp, tmp2$user.id, mean)
  }
  as.vector(tmp3)
}
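
For reference, here is what by returns on a tiny made-up data.frame (the data below is mine, not the post's tmp2); the result is a grouped object that as.vector flattens into a plain numeric vector, as f3 does above:

df <- data.frame(user.id   = c(1, 1, 2, 2, 2, 3),
                 timestamp = c(10, 12, 20, 22, 24, 30))

res <- by(df$timestamp, df$user.id, mean)  # one mean per user.id, printed by group
res
as.vector(res)                             # 11 22 30: the group means as a plain vector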

The relative ordering of the results is the same as before:

> system.time(t1 <- f1()); system.time(t2 <- f2()); system.time(t3 <- f3())
   user  system elapsed
581.028 215.762 803.874  # Naive version
   user  system elapsed
117.935 118.775 239.934  # Optimized version
   user  system elapsed
204.529 120.748 329.097  # Version with by

Hence, the advantage of using a cleverer algorithm that exploits the structure of the data is still there, but the code it requires is now much longer than what is needed for the really naive version (the one-liner using by). The trade-off is much tighter now.

If I were doing this only once, I probably would not bother with the optimized version and would just go with the one-liner using by. However, if I were putting it into a script, I would still opt for the optimized version.


3 comments:

Tushar Jain said...

hey ut.. gud going!.. soon u wid b getting a nobel prize for sure.. :)

Shreyes said...

Hey,

Awesome post. Just one quick and silly question: while explaining the implementation you have used a loop for i in 1:100. Why is that? What does "i" do here?

musically_ut said...

The functions run each method 100 times to make warm-up times negligible.


If I run each of the functions just once, it might run slower for a multitude of reasons (data being fetched into the cache for the first time, data being swapped in and out of disk, etc.).

However, after such a warm-up, the majority of the runs that follow should take approximately the average time. I can simply divide the values I show by 100 to get the average execution time of each variant of the function.

This is similar to how one reduces the error in the estimate of a mean in the presence of noise: the confidence interval for the mean narrows in proportion to the inverse square root of the number of (independent) iterations.