Showing posts with label perl. Show all posts

Thursday, October 27, 2011

Verbosity of programming languages

It is not often that one sees the same functionality implemented in more than one language, except in toy examples or for mere bragging rights. However, the AI Challenge's starter packages:

  1. Each implement the same algorithm in different languages
  2. Use the same I/O infrastructure
  3. Were not written intentionally to be better than the others (no bias)

So I thought they would make a good data set for looking at how verbose the languages are while controlling for the task at hand:


Lines of code it takes to implement the same functionality
in different programming languages.
(Red bars likely to be higher than shown)

Sunday, September 11, 2011

What Perl got right and R got wrong

R tries to do the right thing by having very short names for functions one uses often:

  c()     Creating vectors
  t()     Transposing a matrix
  q()     Quitting R
  is()    Generic isInstanceOf
  by()    Apply after grouping
  as()    Generic type cast

Common primitive functions in R

As much as I like not having to type extra characters to get to these functions, I have always had to be extra cautious when naming my variables, for fear of accidentally overwriting any of them. Interestingly, R selectively ignores such overrides, letting the primitives prevail if possible:

> R.version.string
[1] "R version 2.12.1 (2010-12-16)"

> c <- function(...) { 42 }   # Accidentally overriding c()
> c(45, 67, 78)               # Expected behavior
[1] 42

> c <- "42"                   # Now it should be a scalar, right?
> c
[1] "42"

> c(45, 67, 78)               # Magical return from the dead!
[1] 45 67 78

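This treatment of one identifier as both a variable and a primitive function is grossly inconsistent, especially since functions are first-class objects, the same as any vector or string. What happens is that when a name appears in call position, R searches for a function with that name and skips any binding that is not a function, so the string stored in c is passed over and the built-in c() is found again. If I override an identifier, even a primitive, I expect it to be really overridden.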

Though the intention of cutting keystrokes is clearly good here, this inconsistency feels avoidable.

On the other hand
In contrast, this highlights how well Perl does the same thing. Larry Wall, the creator of Perl, was a linguist by training, and one of his tenets while creating Perl was to make it as succinct as possible. To this end, Perl ventured into corners of the keyboard that only APL had gone beyond. Also, on an unrelated note, he won the International Obfuscated C Code Contest twice, and some say that after Perl became mainstream, there wasn't much point left in obfuscation contests.

Perl also short-cuts most commonly used functions and type qualifiers:

  my      Variable declaration
  @a      Array variable prefix
  %a      Hash variable prefix
  $a      Scalar variable prefix
  s//     Substitution operator
  m//     Match operator

Common short built-ins in Perl

However, in Perl, the abbreviations look completely different from identifiers; it is impossible to even imagine confusing them. In R, identifiers and built-in primitives look uniform, but their treatment is actually inconsistent.

And we have a winner ...

I think between R and Perl, Perl is the one which got it completely right.

The design principle to be learned here perhaps is that of having different keystrokes for different folks. The onus of preserving the (abbreviated) primitive functions should lie with the language, and it can be done:

  • the right way by syntactical obviousness as Perl does it,
  • the good ol' way by reserving keywords as Python does it or,
  • the wrong way by allowing overrides and resuscitating primitives as R does it.



Sunday, December 12, 2010

Are Firefox sessions really this short?

I have been meaning to play with the Test Pilot data for quite a while now. The primary problem was, err ..., my hard disk space.

Anyhow, I got the data downloaded finally, only to see that my computer's memory was the next bottleneck. Things looked hopeless till I started looking at it as an exercise in online algorithms.

Then I saw one thin silver lining: the data in the csv file was presorted by the user_ids.

There was my window! If I could pre-segment the data by user id, then I could run through one user at a time instead of reading all the data into memory. It was not much, but the linear grouping of the users already saved me a lot of effort. My next hope was that, within each user id, the data would be ordered by timestamp. Sadly it wasn't, but one cannot have everything in life anyway. :-)
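The one-user-at-a-time idea can be sketched roughly as follows. This is only an illustration in Python, not the script used for the post (which was written in Perl); the column names `user_id`, `event` and `timestamp` and the sample rows are assumptions, not the real Test Pilot schema.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def iter_users(rows):
    """Yield (user_id, rows_for_that_user) one user at a time.

    Relies on all rows of a user being contiguous in the input, so only
    one user's rows are held in memory at once.
    """
    for user_id, group in groupby(rows, key=itemgetter("user_id")):
        # Rows inside a user's block are not time-ordered, so sort the
        # (small) per-user batch by timestamp.
        yield user_id, sorted(group, key=lambda r: int(r["timestamp"]))


# Hypothetical sample data standing in for the real csv file.
sample = io.StringIO(
    "user_id,event,timestamp\n"
    "1,BROWSER_SHUTDOWN,900\n"
    "1,BROWSER_STARTUP,100\n"
    "1,BROWSER_STARTUP,250\n"
    "2,BROWSER_STARTUP,50\n"
)

# Collected into a list only for this tiny demo; on the real file one
# would loop over iter_users() directly.
batches = list(iter_users(csv.DictReader(sample)))
for user_id, user_rows in batches:
    print(user_id, [(r["event"], r["timestamp"]) for r in user_rows])
```

Note that `groupby` only merges adjacent rows, which is exactly why the presorted user ids matter: on unsorted input the same user would show up in several separate batches.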


  • Extracting the data

After a quick Perl script to confirm the ordering of user_ids, I wrote a script to calculate the time intervals between a BROWSER_STARTUP event and a BROWSER_SHUTDOWN event in the data. There were various instances where a BROWSER_STARTUP event was present without a BROWSER_SHUTDOWN event. Probably the browser was started while the system was online (most uses of Firefox are online, after all) but the system was not online when the browser was shut down (or crashed). So, I skipped all the orphaned start-up events. Some start-up events might also have happened while the host was offline, and ignoring these would bias the intervals towards being longer. I cleaned up the data (there were a few instances, on the order of tens, where the session appeared to last beyond 7 days) and plotted the intervals (on a semi-log scale) in R.
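A minimal sketch of this pairing step for a single user is shown below. The exact pairing rule of the original script is not given in the post, so the rule here (a start-up followed by another start-up is an orphan and is skipped, and sessions longer than 7 days are dropped) is an assumption, as are the function and variable names.

```python
STARTUP = "BROWSER_STARTUP"
SHUTDOWN = "BROWSER_SHUTDOWN"
WEEK_MS = 7 * 24 * 60 * 60 * 1000  # assumed clean-up cutoff: 7 days


def session_intervals(events, max_ms=WEEK_MS):
    """Return (intervals_ms, skipped_startups) for one user's events.

    `events` is an iterable of (event_name, timestamp_ms) pairs in any order.
    """
    intervals = []
    skipped = 0
    pending_start = None  # timestamp of a start-up still waiting for a shutdown
    for name, ts in sorted(events, key=lambda e: e[1]):
        if name == STARTUP:
            if pending_start is not None:
                skipped += 1  # previous start-up never saw a shutdown
            pending_start = ts
        elif name == SHUTDOWN and pending_start is not None:
            length = ts - pending_start
            if length <= max_ms:  # drop implausibly long sessions
                intervals.append(length)
            pending_start = None
    if pending_start is not None:
        skipped += 1  # trailing start-up with no shutdown
    return intervals, skipped


events = [
    (STARTUP, 0),
    (SHUTDOWN, 1000),
    (STARTUP, 5000),   # orphan: followed by another start-up
    (STARTUP, 9000),
    (SHUTDOWN, 9500),
]
intervals, skipped = session_intervals(events)
print(intervals, skipped)
```

A shutdown with no preceding start-up is silently ignored in this sketch; whether the original script did the same is not stated.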


Now I am a user who generally carries his laptop around and has his browser open almost all the time. I was in for a small surprise, enough to make me go over the code twice before I realized that the code was correct and my intuition was wrong.

  • Visualizing the data

This data extraction took a whopping 30 minutes on my meager laptop. The stats are here (all numerical values in milliseconds):



Maximum interval = 534616265 ms (about 6 days)
Minimum interval = 31 ms
Mean = 4286667.42334182 ms (about 1.2 hours)
Median = 896310 ms (about 15 minutes!)
Skipped 54748 out of 633212 BROWSER_STARTUP events
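As a quick sanity check on the units, the figures above can be converted with a few lines of Python. The numbers are copied from the post; the variable names are only illustrative.

```python
MS_PER_MINUTE = 60 * 1000
MS_PER_HOUR = 60 * MS_PER_MINUTE
MS_PER_DAY = 24 * MS_PER_HOUR

# Values reported in the post, in milliseconds.
max_ms = 534616265
mean_ms = 4286667.42334182
median_ms = 896310

max_days = max_ms / MS_PER_DAY
mean_hours = mean_ms / MS_PER_HOUR
median_minutes = median_ms / MS_PER_MINUTE
skipped_fraction = 54748 / 633212  # orphaned start-ups / all start-ups

print(f"max    = {max_days:.2f} days")
print(f"mean   = {mean_hours:.2f} hours")
print(f"median = {median_minutes:.2f} minutes")
print(f"skipped start-ups = {skipped_fraction:.1%}")
```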




50% of the sessions were less than 15 minutes long, and this is while the intervals are biased to be longer!


Of course, there are some very long sessions, which pull the mean up to over an hour, but considering that there might be outliers in the data, the median is a far more robust measure. In case it is not apparent, on the semi-log plot this looks like a beautiful log-normal distribution staring one in the face.
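To illustrate why the median is the safer summary here, the sketch below uses made-up session lengths (not the Test Pilot data) and adds a single 6-day session. It also shows, purely under the assumption that the data is exactly log-normal, how the reported mean and median would pin down the spread parameter sigma, since for a log-normal distribution mean / median = exp(sigma^2 / 2).

```python
import math
import statistics

# Made-up session lengths in minutes, for illustration only.
typical_minutes = [5, 10, 15, 20, 25]
with_outlier = typical_minutes + [6 * 24 * 60]  # one 6-day session

mean_before = statistics.mean(typical_minutes)
median_before = statistics.median(typical_minutes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # jumps by two orders of magnitude
print("median:", median_before, "->", median_after)  # barely moves

# Mean and median reported in the post (milliseconds).
ratio = 4286667.42334182 / 896310
# Only meaningful if the distribution really is log-normal (an assumption).
sigma_estimate = math.sqrt(2 * math.log(ratio))
print(f"mean/median = {ratio:.2f}, implied sigma = {sigma_estimate:.2f}")
```

The implied sigma is on the natural-log scale of the session lengths; it is a back-of-the-envelope figure, not a fit.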

It took me a while to imagine that most operations with a browser end rather quickly for most people; they do not live their lives in a browser. In other words, I am an outlier here!


How long do Firefox sessions last?

  • Conclusion

I am sure most people at Mozilla know this and, if this is indeed correct, they are probably making sure that Firefox is well suited for short operations and not extended week-long uses. I am not sure whether these are competing demands; perhaps making the start-up quicker, pre-indexing search files, etc. also helps in extended sessions. However, it appears (with more than just a pinch of salt) that Firefox should focus on the quick-draw-and-fire users to gain market share, in the short term at least. This is the exact opposite of the direction of Chrome's development, by the way.

However, there are alternative explanations. I am not sure how many of these were voluntary terminations; it might be that these are regular restarts. Or it might be an indication of a bigger problem, as this comment indicates: frequent cold restarts forced by the user to reduce the memory footprint of the browser. Or it might be that people use Firefox only as a compatibility browser: when something (Java applets or broken JavaScript code, for example) does not work correctly in other browsers, they fire up Firefox, do the task and go back. This last question can be answered using the Survey data.

I will keep digging, hoping to find something interesting.

Feedback welcome.

Update: A few more important factors are:

  1. What were the average session times per user?
  2. How many browser sessions were still open when the week ended?

These concerns have been addressed in a follow-up post.


~
musically_ut