
Saturday, November 26, 2011

Handling NULLs and NAs

Real world data always has missing and blatantly incorrect values.

This becomes a painful issue when building predictive models. While there are multiple ways of imputing data, it is difficult to tell whether one is doing a good enough job. To make matters worse, the rows with missing data might not be missing at random. For example, all incomes above a certain threshold might be deliberately set to NA to preserve anonymity. However, the model developer might not be aware of this censoring. Imputing the data with any central measure will not only fail to capture this bit of information, but will actually make predictions worse.

A similar kind of encoding might be present when one sees columns with values outside their natural limits. For example, a column recording the number of questions answered (out of 5) in a test might use the value -1 to mark absentees.

In the worst case, a model developed by completely dropping the offending column might perform better than one built on an imputed data-set.

In most cases, we can do better.
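For instance, one way of doing better is to keep the fact of missingness around as a feature instead of silently papering over it. Below is a minimal sketch using pandas; the column names and the data frame are made up purely for illustration:

import numpy as np
import pandas as pd

# Hypothetical data: 'income' is censored above a threshold (stored as NA) and
# 'questions_answered' uses -1 as a sentinel for absentees.
df = pd.DataFrame({
    "income": [42000, 55000, np.nan, 61000, np.nan],
    "questions_answered": [3, 5, -1, 4, -1],
})

# Turn the sentinel value into a proper NA before doing anything else.
df.loc[df["questions_answered"] == -1, "questions_answered"] = np.nan

# Keep the missingness itself as an explicit feature, so a model can learn from
# the censoring instead of having it papered over by the imputation.
for col in ["income", "questions_answered"]:
    df[col + "_missing"] = df[col].isna().astype(int)
    df[col] = df[col].fillna(df[col].median())   # impute with a central measure

print(df)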

Wednesday, August 10, 2011

Humble Indie Bundle 3: A review

It was a jolly good ride for the Humble Indie Bundle 3, and the second half was far better and more exciting than the first:
15 days of the Humble Indie Bundle 3

Thursday, July 14, 2011

Tough times for Geneva

Looking through the data on the number of companies which declared bankruptcy in Geneva (data available from here), I noticed something:


So, as soon as I left for my internship in Japan in July '10, things headed down. They hit rock bottom in October 2010 and improved spectacularly in March '11, the month I landed back in Switzerland.

Something was up in Geneva while I was not here (July '10 - Feb '11).

Just sayin ...


~
musically_ut

PS: I was a little sad to see 27 businesses close their doors on 24th December, 2010 and 15 businesses on 23rd December, 2009.

Sunday, June 19, 2011

Topics of TED Talks

Today I noticed on Twitter this link to the description of all TED Talks which have happened so far (as of 16th June, 2011). It immediately looked like something worth looking (read: visualizing) into. So I did.

I created a wordle of the topics (after removing names of authors from some topics):
Topics of TED talks, created using wordle
The topics world, new, life and future clearly stand out, showing what TED is really about.

Saturday, May 7, 2011

Declaring software open-source considered helpful

There have been two Humble Bundles released in the last six months:

  1. Humble Indie Bundle in December, 2010
  2. Humble Frozenbyte Bundle in April, 2011
The summary statistics alone are enough to challenge some established stereotypes: e.g., the average amount contributed by Linux users has consistently been significantly larger than that of their Mac OS or Windows counterparts, showing that Linux users are willing to pay more than non-Linux users for quality software.

On a larger scale, considering these to be real-world economic experiments, there is much more information buried in the data than can be gleaned from the summary alone. Here, I look at five-minute sampled data to see what effect making the software open-source had on the sales:

Contribution and purchases data for the Humble Frozenbyte bundle.
The projections show what the contributions and purchases might have been if Frozenbyte had not declared their games open source.
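The post does not say how the projections were computed; one simple way to produce such a counterfactual is to fit a trend to the five-minute samples taken before the open-source announcement and extrapolate it past that point. A rough sketch, with entirely made-up stand-in numbers for the contribution series and the announcement time:

import numpy as np

# Stand-in data: 'minutes' since the bundle launch, sampled every 5 minutes, and
# cumulative 'contributions' in dollars. The real series was sampled from the
# Humble Frozenbyte Bundle page; these numbers are invented.
minutes = np.arange(0, 2880, 5)                       # two days of 5-minute samples
contributions = 50.0 * minutes + np.random.randn(len(minutes)) * 200

announcement = 1440                                   # assumed time of the open-source announcement
before = minutes < announcement

# Fit a straight line to the pre-announcement samples ...
slope, intercept = np.polyfit(minutes[before], contributions[before], 1)

# ... and extrapolate it as the "what might have been" projection.
projection = slope * minutes[~before] + intercept
extra = contributions[~before][-1] - projection[-1]
print("extra contributions relative to the projected trend:", round(extra, 2))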

Monday, December 27, 2010

Bar charts with icons and most experimenting users

Showing icons instead of the boring colors in charts is an oft-used Infographics technique. I too used it to draw the lower part of a graph.

Now, with the PictureBar class, making them in Processing is easy.  The output of the code is akin to this (the icons are only samples and can be substituted with your own):
Sample output of the PictureBar program

Sample output after some post processing

The program
The source code of the class is well documented, but here are a few pointers as to how the code (in particular the draw() function) works (a minimal sketch of step 3 follows the list):
  1. Given the number of icons to be present in a bar and the distance desired between bars, it determines the width of bars:
    • This also fixes the icon size because we know the number of icons to put at each level
  2. Select the bar which has the maximum value and determine how many rows of icons will fit in the provided height (the entire height is used)
  3. Decide how many icons to put in the last row of the bar with maximum value:
    • The total number of icons in this bar will determine the value (in percentage) of each icon
    • The decision is based on which number minimizes the difference between the desired and represented percentages across bars.
  4. The bars are drawn using the given number of icons either from top to bottom or from bottom to top (default)
    • However, the rows are always filled from left to right.
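The class itself is written in Processing; as a rough illustration of step 3, here is a small Python sketch of how the icon count for the largest bar could be chosen so that the represented percentages stay close to the desired ones (the function name and parameters are made up, not taken from the actual class):

def icons_in_largest_bar(values, full_rows, icons_per_row):
    # Step 3: the largest bar already has `full_rows` complete rows (step 2);
    # only the number of icons in its last row is free. Try each possibility
    # and keep the one whose per-icon value best reproduces every bar's share.
    total, largest = sum(values), max(values)
    best_icons, best_err = None, float("inf")
    for last_row in range(1, icons_per_row + 1):
        n_icons = full_rows * icons_per_row + last_row
        icon_value = largest / n_icons                # the value one icon stands for
        err = sum(abs(v - round(v / icon_value) * icon_value) / total * 100
                  for v in values)                    # error in percentage points
        if err < best_err:
            best_icons, best_err = n_icons, err
    return best_icons

# e.g. with the survey counts discussed below:
print(icons_in_largest_bar([3361, 114, 243, 37, 47], full_rows=9, icons_per_row=10))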

Example usage of this class is given in the picturebars.pde file. As I am learning Processing as well as playing with the Mozilla Test Pilot data, this is the result of combining both together (and then post processing the output with Gimp).

A word about the data
The data used here is the user survey conducted as part of the Test Pilot suite. The survey was optional and a total of 4,081 users answered it. Out of these, 3,361 users used Firefox as their primary (or only) browser, and 279 either did not answer this question or marked 'other'. The chart here shows the distribution of these pagan beta testers. The exact numbers are given to the right of the graph.

Digging a little deeper
Though the graphic shown here is only for demonstration purposes, after some analysis there is one interesting observation that can be made.

We already know that the market share of different browsers (as of 27th December 2010) is approximately:

  1. Internet Explorer: 46 % of users
  2. Firefox: 30 % of users
  3. Chrome: 12 % of users
  4. Safari: 6 % of users
  5. Opera: 2 % of users
  6. Mobile browsers (others): 4 %
This can safely be taken to be the prior distribution of web users. This means that an average user (about whose browser preference we know nothing) will be an IE user with 46% probability, a Firefox user with 30% probability, and so on. Further, if we know that they are not a Firefox user (that is, they have honestly marked on the survey they voluntarily submitted that Firefox is not their primary browser), then by Bayes' rule the probability of being in each remaining class is jacked up by 1/(1 - 0.3), or by about 1.43:
  1. 65 % should be IE users
  2. 17 % should be Chrome users
  3. 9 % should be Safari users
  4. 3 % should be Opera users
By these estimates, the 441 users who are beta testing Firefox should have been divided thus (see the sketch after this list):
  1. 286 IE users instead of 114
  2. 75 Chrome users instead of 243 (!)
  3. 37 Safari users (Bang on!)
  4. 13 Opera users instead of 47
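The expected counts above follow directly from the rescaled prior; here is a quick sketch of the arithmetic (the small differences from the numbers above come from the post rounding the posterior percentages before multiplying by 441):

prior = {"IE": 0.46, "Chrome": 0.12, "Safari": 0.06, "Opera": 0.02}
p_not_firefox = 1 - 0.30                       # Firefox holds ~30% of the market

observed = {"IE": 114, "Chrome": 243, "Safari": 37, "Opera": 47}
total = sum(observed.values())                 # the 441 non-Firefox beta testers

for browser, p in prior.items():
    expected = p / p_not_firefox * total       # Bayes' rule: rescale by 1/(1 - 0.3)
    print(f"{browser:7s} expected ~ {expected:4.0f}, observed = {observed[browser]}")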
Now, before we conclude anything from this, 441 is not a large number, and the 279 people who did not answer this question (or gave 'other' as an answer) could have completely changed the picture.

Taken with this pinch of salt, there are some interesting hypotheses that can be formed, remembering that these users were beta testing Firefox:
  1. IE users either love their browser very much, or they do not experiment a lot
  2. Chrome and Opera users love to experiment with other browsers (Firefox)
Of course, one can make lofty claims about the attrition rates of other browsers, but I do not think that the data is sufficient to conclude that.

Any way to test these?

Appendix
  1. All logos are taken from the respective Wikipedia entries of the browsers. 
  2. Large Firefox logo taken from here.
  3. All graphics shared under Creative Commons Attribution-ShareAlike 3.0 license

Thursday, December 16, 2010

How long will your browser session last?

Browser sessions

This is a follow-up to my last post, and the analysis takes a different direction in the next post, where I talk about the beta testers who are not Firefox users.

First, a short recap (a sketch of the pairing step follows the list):
  1. Extracted the BROWSER_STARTUP and BROWSER_SHUTDOWN events from this data set.
  2. Sorted them by user_ids and then timestamps.
  3. Preserved only alternating startup/shutdown events for each user.
    • Discarded about 10% of the data here (578,496 entries remained)
  4. Ignoring the user, found out the distribution of the session times and plotted it.
  5. Was surprised.
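Steps 1-3 amount to pairing each startup with the shutdown that follows it for the same user. A minimal sketch of that pairing (the file name and column names are guesses at the Test Pilot schema, not the real ones):

import csv
from collections import defaultdict

# Collect startup/shutdown events per user, then pair them up in time order.
# Unpaired (orphaned) events are simply dropped, as in the original analysis.
events_by_user = defaultdict(list)
with open("events.csv") as f:
    for row in csv.DictReader(f):
        if row["event"] in ("BROWSER_STARTUP", "BROWSER_SHUTDOWN"):
            events_by_user[row["user_id"]].append((int(row["timestamp"]), row["event"]))

durations = []
for user, events in events_by_user.items():
    events.sort()                              # order by timestamp
    start = None
    for ts, kind in events:
        if kind == "BROWSER_STARTUP":
            start = ts                         # a repeated startup replaces the orphan
        elif start is not None:                # a shutdown with a pending startup
            durations.append(ts - start)
            start = None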
Unterminated sessions

One of the concerns was that the longer browser sessions might have still been 'on' when the study period ended.  There were only about 10,000 browser sessions open at the week's end, which is less than 2% of the total browser sessions in the data set. Hence, the long-lasting browser sessions would not have affected the end results much.

User sessions

Also, it is clear (actually, only in hindsight) that users who open their browser only for short periods will open it more often in a given fixed period. This is a classical problem of Palm calculus. As we are looking at time-limited data (1 week long), the shorter browser sessions have a greater propensity to show up. However, this does not invalidate the previous results: from the browser's point of view, it will still be closed within 15 minutes 50% of the time.
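The effect is easy to see with a tiny simulation (the session lengths are entirely made up): even when most users have long sessions, the browser itself sees mostly short ones, because short sessions recur far more often within a fixed week:

# "Short" users run back-to-back 10-minute sessions, "long" users 10-hour ones,
# over one week; the numbers are invented purely to illustrate the bias.
WEEK_MIN = 7 * 24 * 60
users = [10] * 40 + [600] * 60                 # per-user session length in minutes

session_lengths = []
for length in users:
    session_lengths.extend([length] * (WEEK_MIN // length))

median_user = sorted(users)[len(users) // 2]
median_session = sorted(session_lengths)[len(session_lengths) // 2]
print("median from the users' point of view:  ", median_user, "min")      # 600
print("median from the browser's point of view:", median_session, "min")  # 10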

Browser's point of view of session times


Or, when stated more aesthetically:
Firefox session time distribution

However, from the user's point of view, the scenario is a little different. Looking at the average length of browser sessions for each user (more than 25,000 users have at least one open/close event pair and 95% have more than 2 such events), it clearly stands out that the number of people whose average session time lies between 15 seconds and 15 minutes is not very high:
Number of users who have the given average session time (log scale)

Note that the graph ticks are not aligned to the bin divisions.

Difference

Hence, this visualization, which makes clear the difference between how many users experience a given average session length and how many times the browser experiences a given session length:
The distribution of users and Firefox sessions against their distribution times.
This is close to what will be my final entry to the Mozilla Open Data Visualization Competition.

Update:


I did not like the cramped feel of the objects on the graph, so I sacrificed some accuracy (the 5% and 3% bars are now the same length in pixels; but then again, they do not even have error bars).

Hence, I condensed the graphs, changed a little text and decided to go with this:

The data is the same, but the Firefox bar lengths and the user bar lengths are now comparable in size. Even though comparing them does not make much sense, it is slightly better to have the percentage sizes nearly equal, I think.

Conclusion


So what can we take away from this? The next improvement which Firefox should aim for. Consider the following feature from two different points of view:

  1. If the average Firefox session is less than 15 minutes for only 10% of users, would Firefox starting 5 seconds faster make a difference?
  2. If 45% of the time Firefox is opened and closed in a span of 15 seconds to 15 minutes, would shaving 5 seconds off the startup times make a difference?
Should the priority be more satisfied users or better software?
Which features / improvements will appeal to users more, and which are minor updates?
Which ones should you advertise?
Which point of view should the development team take? 

This is just one trade-off; there might be more trade-offs involved in serving long-term uses/users better than short-term ones. Knowing how the scenario looks from the user's and from the browser's point of view would certainly help in making these decisions and in deciding when a feature is a killer one.

Update: The visualization, along with several other excellent entries, is featured here: https://testpilot.mozillalabs.com/testcases/datacompetition

~
musically_ut

Epilogue:
  1. Test pilot Visualization taken from here, designed by mart3ll
  2. Mozilla Logo from here.
  3. All graphics shared under Creative Commons Attribution-ShareAlike 3.0 license

Sunday, December 12, 2010

Are Firefox sessions really this short?

I have been meaning to play with the Test Pilot data for quite a while now. The primary problem was, err..., my hard-disk space.

Anyhow, I got the data downloaded finally, only to see that my computer's memory was the next bottleneck. Things looked hopeless till I started looking at it as an exercise in online algorithms.

Then I saw one thin silver lining: the data in the csv file was presorted by the user_ids.

There was my window! If I could pre-segment the data based on the user_ids, then I could run through one user at a time, instead of reading all the data at once. Though it does not sound like much, this linear grouping of users already saved me a lot of effort. My next hope was that, after ordering by user ids, the data would also be ordered by timestamps. Sadly it wasn't, but one cannot have everything in life anyway. :-)
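Since the rows come grouped by user_id, the file can be streamed one user at a time instead of being loaded whole. A minimal sketch of that idea using itertools.groupby (the file name and column names are assumptions about the Test Pilot schema):

import csv
from itertools import groupby

def one_user_at_a_time(path):
    # Rely on the rows being pre-grouped by user_id: groupby only joins
    # consecutive rows, so at most one user's rows are in memory at once.
    with open(path) as f:
        for user_id, rows in groupby(csv.DictReader(f), key=lambda r: r["user_id"]):
            yield user_id, list(rows)

for user_id, rows in one_user_at_a_time("testpilot_events.csv"):
    rows.sort(key=lambda r: int(r["timestamp"]))   # timestamps are not presorted
    # ... process this user's events here ...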


  • Extracting the data

After a quick Perl script to confirm the ordering of user_ids, I wrote a script to calculate the time intervals between a BROWSER_STARTUP event and a BROWSER_SHUTDOWN event in the data. There were various instances when a BROWSER_STARTUP event was present without a matching BROWSER_SHUTDOWN event: probably the browser was started while the system was online (most uses of Firefox are online, after all) but the system was offline when the browser was shut down (or crashed). So I skipped all the orphaned start-up events. Also, some startup events might have happened while the host was offline, and missing these would bias the intervals towards being longer.  I cleaned up the data (there were a few instances, ~10, where the session lasted beyond 7 days?) and plotted the intervals (on a semi-log scale) in R.


Now I am a user who generally carries his laptop around and has his browser open almost all the time. I was in for a small surprise, enough to make me go over the code twice before I realized that the code was correct and my intuition was wrong.

  • Visualizing the data

This data extraction took a whopping 30 minutes on my meager laptop. The stats are here (all numerical values in milliseconds):



Maximum interval = 534616265 ms  (6 days)
Minimum interval = 31 ms
Mean = 4286667.42334182 ms (about 1.2 hours)
Median = 896310ms (15 minutes!)
Skipped 54748 out of 633212 BROWSER_STARTS




50% of sessions were less than 15 minutes long, and this despite the intervals being biased towards the longer side!


Of course, there are some very long sessions, which pull the mean up to over an hour, but considering that there might be outliers in the data, the median is a far more robust measure. In case it is not apparent, this is a beautiful log-normal distribution staring one in the face.
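For a log-normal distribution, the median is exp(mu) and the mean is exp(mu + sigma^2/2), so the mean-to-median ratio alone pins down the spread. Plugging in the stats above as a quick sanity check (my own back-of-the-envelope, not from the original post):

import math

mean_ms = 4286667.42       # mean session length from the stats above
median_ms = 896310.0       # median session length

# median = exp(mu), mean = exp(mu + sigma^2 / 2)  =>  sigma^2 = 2 * ln(mean / median)
sigma = math.sqrt(2 * math.log(mean_ms / median_ms))
mu = math.log(median_ms)
print(f"implied mu = {mu:.2f}, sigma = {sigma:.2f}")   # roughly mu ~ 13.7, sigma ~ 1.8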

It took me a while to accept that most browser sessions end rather quickly for most people; they do not live their lives inside a browser. In other words, I am an outlier here!


How long do Firefox sessions last?

  • Conclusion

I am sure most people at Mozilla know this and, if it is indeed correct, they are probably making sure that Firefox is well suited for short sessions rather than extended, week-long uses. I am not sure whether these are competing demands; perhaps making the start-up quicker, pre-indexing search files, etc. also help in extended sessions. However, it appears (to be taken with more than just a pinch of salt) that Firefox should focus on the quick-draw-and-fire users for gaining market share, in the short term at least. This is taking the exact opposite direction of Chrome's development, by the way.

However, alternate explanations are possible. I am not sure how many of these were voluntary terminations; it might be that these are regular restarts. Or it might be an indication of a bigger problem, as this comment indicates: frequent cold-restarts forced by the user to reduce the memory footprint of the browser. Or it might be that people are using Firefox only as a compatibility browser: when something (like Java applets or broken JavaScript code) does not work correctly on other browsers, they fire up Firefox, do the task, and return. This last question can be answered using the survey data.

Will keep digging and hoping to find something interesting.

Feedback welcome.

Update: A few more important factors are:

  1. What were the average session times per user?
  2. How many Browser sessions were still open when the week ended?
These concerns have been addressed in this post.


~
musically_ut

Saturday, November 13, 2010

Get more play count information in Rhythmbox

I am obsessed with keeping track of what my musical taste is like, and with being able to get quantitative as well as qualitative information about it. Last.fm (and now libre.fm) with audio scrobbling (both via Rhythmbox and my old iPod Shuffle) satiates my need for numbers most of the time, but I would like to know a little more. Initially, I just wanted to know which album I listen to the most, but it could easily be generalised to much more.
Hence, I made a small plugin, extra-playcount, for Rhythmbox (note: it may be better to get the most recent version of the files from the launchpad code branch), which gives me just that little bit more information. To see the plugin in action, select multiple songs and open Properties to get an extra tab in the info window.
The additional data presented is:

  1. Number of songs by each artist
  2. Playcount of songs by each artist
  3. Number of songs in each album
  4. Playcount of each album
For bug tracking and development I have set it up as a project on launchpad here.

Installation instructions:
  • Extract the contents of the tar ball in home_directory/.gnome2/rhythmbox/plugins
  • Restart Rhythmbox.
  • Go to the menu Edit → Plugins and enable Extra-Playcount

Extra playcount plugin in action in Rhythmbox