Introduction

**Update:** The *by* function of R can be used for the same task, since my data is stored in the same data.frame. I have tested that out later in the post.

Recently I had to run through a large **data.frame** in R, **selecting** values from a given column based on an equality constraint on another column, in this format:

```
for (i in 1:length(value)) {
  t <- my.var$row2[ my.var$row1 == value[i] ];
  do.something.interesting(t);
}
```

This is a fairly common operation, and I had thought that beyond the vectorization I had done (a term borrowed from MATLAB), R would take care of optimizations *under the hood*.

At this point, I wondered how I would do it on a larger database, and it struck me that perhaps I could do better than R if I manually indexed the data and found the relevant intervals myself. How much faster could that be?

Analysis

The vectorized version works by looping through each element of *row2*, selecting only those entries for which the corresponding *row1* value equals *value[i]*. Hence, the data extraction itself is an *O(n)* operation.

In the worst case, if all values in *row1* are unique, the complexity of this solution will be *O(n * n)*! Welcome back, good ol' *O(n^2)* solutions, I missed you. [See question 2 here.]

However, if I have sorted the data by *row1* and I know *a priori* the *start[i]* and *end[i]* for the *i-th* unique value in *row1*, then while iterating for *value[i]* I can splice the *row2* vector from *start[i]* to *end[i]* and get the result in roughly *O(end - start)* time for each *value[i]*. Summing these together and telescoping, I should be able to finish the entire loop in *O(n)* time irrespective of the input. Of course, I have assumed that it takes a constant amount of time to locate an index in a vector (that is, that a vector behaves like the classical *array* data structure).
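To make this concrete, here is a minimal sketch of the idea in R, using the illustrative names from the introduction (*my.var*, *row1*, *row2*, *do.something.interesting*); it is a sketch of the approach, not my actual code:

```
# Assumes my.var is sorted by row1. One O(n) pass over the runs of
# row1 yields start[i] and end[i]; each splice then costs O(end - start).
uq    <- unique(my.var$row1)               # unique values, in sorted order
end   <- cumsum(rle(my.var$row1)$lengths)  # end[i]: last index of the i-th value
start <- c(1, head(end, -1) + 1)           # start[i]: first index of the i-th value
for (i in 1:length(uq)) {
  t <- my.var$row2[ start[i]:end[i] ]      # splice instead of scanning all of row1
  do.something.interesting(t)
}
```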

These *start[i]* and *end[i]* values can be found in *O(n)* time by a linear search through the sorted *row1*, or in any faster way; it does not affect the overall worst-case complexity.

Implementation

For the implementation, I use my own handy **data.frame** and loop over **all unique values**.

```
> length(tmp2$user.id) # -> n
[1] 16569
> length(unique(tmp2$user.id)) # -> m
[1] 459
> is.unsorted(tmp2$user.id)
[1] FALSE
```

The naive implementation is as expected. For implementing the optimized *splicing* version, I used *findInterval* to get the correct *boundaries* of the values in *row1*. From the help page for *findInterval*:

> if `i <- findInterval(x, v)`, we have `v[i[j]] ≤ x[j] < v[i[j] + 1]` where `v[0] := -Inf`, `v[N+1] := +Inf`, and `N <- length(v)`

Hence, *i[1]* would be the location of the last occurrence of *x[1]* in *v*, and *i[N]* would be the last location of *x[N]*. In my case, since all values are necessarily contiguous, the boundary indices for the *1st* value will be (*1, i[1]*), and for the *k-th* value (with *k > 1*) they will be (*i[k-1] + 1, i[k]*). This can easily be adapted for non-contiguous values as well.
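As a quick toy illustration (the data here is made up, not from my dataset): applied to the unique values of a sorted vector, *findInterval* returns the last index of each value:

```
v  <- c(1, 1, 1, 2, 2, 3)  # a sorted column, playing the role of row1
uq <- unique(v)            # 1 2 3
findInterval(uq, v)        # 3 5 6 -- the last index of each unique value
```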

Since *findInterval* does a check to see whether the vector passed to it is sorted, which takes *O(n)* time, it is not exactly the *O(m * log(n))* binary search I would like. But since I make only one call to it to get *all* the boundaries, I choose to tolerate that much delay for an easy-to-write implementation.

__Naive Version:__

```
# Relying on Vectorization to extract data from row 'interval'
f1 <- function() {
  uq <- unique(tmp2$user.id)
  for (i in 1:100) {
    for (j in 1:length(uq)) {
      tmp3 <- tmp2$interval[ tmp2$user.id == uq[j] ]
    }
  }
}
```

__Optimized Version:__

```
# Splicing 'interval' row myself
f2 <- function() {
  uq <- unique(tmp2$user.id)
  for (i in 1:100) {
    loc <- c(0, findInterval(uq, tmp2$user.id)) # Boundaries of each user.id
    for (j in 1:length(uq)) {
      tmp4 <- tmp2$interval[ (loc[j]+1):loc[j+1] ]
    }
  }
}
```
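Note the 0 prepended to *loc*: it lets the same expression *(loc[j]+1):loc[j+1]* yield the boundaries for *j = 1* without a special case.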

The functions run each method 100 times to make *warm-up* times negligible.

Results

The results are here:

```
# Time for the vectorized version
> system.time(f1())
   user  system elapsed
 96.566   0.880 174.327
```

And for the manually optimized version:

```
# Time for the manual splicing version
> system.time(f2())
   user  system elapsed
  1.152   0.024   2.165
```

Conclusion

So, at least for my dataset, the hand-optimized version is roughly 80 times faster than the naive implementation (174.3 s vs. 2.2 s elapsed), even though my dataset is far from the worst possible input. This speedup comes at the cost of a slightly more complicated algorithm, which took only one more line of code than the naive implementation.

**Lesson learned:** Vectorization is much easier to write (and fairly fast) for data extraction, but a little more thought about the algorithm can yield a much faster implementation. Do not expect R (or any language, for that matter) to do all the optimization for you.

Update

It turns out that this is such a common operation that R has a built-in function to do it, called *by*.

It seemed unfair to compare the optimized version with the naive one when there is a better version that requires much less code. Also, it no longer seemed right to compare just the extraction of the relevant data.

Hence, I upgraded the code a tad for each variant to calculate the average of the second column (renamed *timestamp* here):

__Naive Version:__

```
# Relying on Vectorization to extract data from row 'interval'
f1 <- function() {
  uq <- unique(tmp2$user.id)
  tmp3 <- vector('numeric', length(uq))
  for (i in 1:100) {
    for (j in 1:length(uq)) {
      tmp3[j] <- mean(tmp2$timestamp[ tmp2$user.id == uq[j] ])
    }
  }
  tmp3
}
```

__Optimized Version:__

```
# Splicing 'interval' row myself
f2 <- function() {
  uq <- unique(tmp2$user.id)
  tmp3 <- vector('numeric', length(uq))
  for (i in 1:100) {
    loc <- c(0, findInterval(uq, tmp2$user.id)) # Boundaries of each user.id
    for (j in 1:length(uq)) {
      tmp3[j] <- mean(tmp2$timestamp[ (loc[j]+1):loc[j+1] ])
    }
  }
  tmp3
}
```

__Version with *by*:__

```
f3 <- function() {
  for (i in 1:100) {
    tmp3 <- by(tmp2$timestamp, tmp2$user.id, mean)
  }
  as.vector(tmp3)
}
```

The results are still as before:

```
> system.time(t1 <- f1()); system.time(t2 <- f2()); system.time(t3 <- f3())
   user  system elapsed
581.028 215.762 803.874   # Naive version
   user  system elapsed
117.935 118.775 239.934   # Optimized version
   user  system elapsed
204.529 120.748 329.097   # Version with by
```

Hence, the advantage of using a cleverer algorithm that exploits the structure in the data is still there, but the code it requires is now much more than the one-liner needed with *by*. The trade-off is much tighter now.

If I were using this only once, then I could probably not bother with the optimized version and just go with the one-liner using *by*. However, if I were putting it into a script, I would still opt for the optimized version.

## 3 comments:

hey ut.. gud going!.. soon u wid b getting a nobel prize for sure.. :)

Hey,

Awesome post! Just one quick and silly question: while explaining the implementation you have used a loop for i in 1:100. Why is that? What does "i" do here?

The functions run each method 100 times to make warm-up times negligible.

If I run each of the functions just once, it might run slower for a multitude of reasons (pre-fetching data into the cache for the first time, swapping data in/out of the drive, etc.).

However, after such a warm-up, the majority of the runs that follow should give approximately the average value. I can simply divide the values I show by 100 to get the average execution time of each variant of the function.

This is similar to how one reduces the error in the calculation of a mean in the presence of noise: the confidence interval for the mean shrinks in proportion to the inverse square root of the number of (independent) iterations.
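For instance, dividing the totals reported earlier in the post by 100 gives the per-run averages (plain arithmetic, shown in R):

```
174.327 / 100  # average elapsed time per run of f1: ~1.74 s
2.165 / 100    # average elapsed time per run of f2: ~0.02 s
```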
