Contact Me


  • Akshay Java's Facebook profile

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

hacks

July 19, 2008

What is the Dunbar's Number for Social Networks?

Many folks are really excited about FriendFeed. Personally, I have found that a post gets a lot more comments once it shows up on FriendFeed. Recently Yuval Atzmon's User21 blog released a list of the most followed users on FriendFeed. Since I too had a crawl of FriendFeed running in much the same way as Yuval, I decided to look at the complementary question: "How many users do people follow on FriendFeed?" While the crawl is not yet complete (and complete statistics will have to wait), the numbers are really striking! Some users follow more than 1,000 "friends":

sthayden 3190
scobleizer 3087
juliomedina 2760
thomashawk 2557
jasoncalacanis 2447
theillife 2045
mrsth 1961
pookakoo 1814
czarphanguye 1736
brynyoungblut 1716
eposter 1562
susangrisantiguitarist 1550


I find this really amazing. Unlike Twitter, FriendFeed posts are accompanied by longer conversations, so following someone can be more involved. I can barely keep up with all the information flying past me everywhere right now! I guess 1,500+ "friends" would be way too much for me!

Sociologists often talk about Dunbar's number, which

is the supposed cognitive limit to the number of individuals with whom any one person can maintain stable social relationships.

In human contact networks, Dunbar's number is said to be around 150. It may well be that social tools, and microblogging especially, are pushing this limit further. Studies on Twitter, LiveJournal and other social networking sites seem to support this observation. I wonder then: what would Dunbar's number be on social networks? 300? 500? Any guesses? Perhaps a comparison across the published papers that have studied different social networks might hold some clues.

[BTW, I am akshayjava on FriendFeed]

July 13, 2008

Watching your Work

Via Marc Pickett I

Here is a cool script that Marc wrote to help improve productivity. It takes a screenshot every minute, so you can view a "timelapse" of your day (or week even). See how much time you spend on wikipedia, email, video games, and writing time management scripts vs. doing actual work.


The video is 6.8 hours crammed into 17 seconds. Marc says he was on the train from Cambridge, so he wasted less time on the internet than normal. So perhaps the apparent productivity wasn't because Big Brother was watching, eh?

But this is a really neat way to summarize how you spent your time during the day. Thanks Marc, for sharing the script and the video with us!
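Marc's script itself is not reproduced here, but the arithmetic behind the video is easy to check. The sketch below is only an illustration under stated assumptions: one screenshot per minute (from the post), every screenshot becoming exactly one video frame (an assumption), and a hypothetical `capture` callback standing in for whatever screenshot tool is actually used.

```python
import time


def frames_for(hours, interval_s=60):
    """Number of screenshots taken over `hours` at one shot per interval."""
    return round(hours * 3600 / interval_s)


def playback_fps(frames, video_seconds):
    """Frame rate needed to play `frames` images in `video_seconds`."""
    return frames / video_seconds


def record(capture, frames, interval_s=60, sleep=time.sleep):
    """Call capture(index) once per interval.

    `capture` is a hypothetical callback (e.g. a wrapper around a screenshot
    tool); `sleep` is injectable so the loop can be tried without waiting.
    """
    for i in range(frames):
        capture(i)
        if i < frames - 1:
            sleep(interval_s)


frames = frames_for(6.8)        # 6.8 hours at one shot per minute
fps = playback_fps(frames, 17)  # squeezed into a 17-second video
print(frames, fps)              # 408 frames at 24.0 frames per second
```

Under those assumptions, 6.8 hours gives 408 screenshots, and 17 seconds of video works out to the familiar 24 frames per second.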

Download Watchme script

July 07, 2008

A Basic Proxy Server Using Google App Engine

I had been hoping to find time to check out Google App Engine. It has been widely touted as Google's answer to Amazon EC2 (which I have used a bit recently). Here is a very simple hack -- a basic proxy server built on Google App Engine:

http://proxifyit.appspot.com/

I have seen many "Free" anonymization/proxy servers online -- but in my experience, most of them are pretty slow or just crowded with advertisements. This is where Google App Engine (or EC2 for that matter) could be a good platform! It was dead simple to code up this demo by just following the guide and using the URL Fetch API.

My first reaction: Google App Engine is real fun! It was much easier than using EC2 and the turnaround time was unbelievable -- Python gurus would love it!

Unfortunately, I am pretty swamped right now but I would like to get to the following tasks to improve it:

  • Handle relative URLs correctly
  • Handle Stylesheets
  • When clicking a new link pass it to proxifyit again
  • Options to disable cookies
  • Options to disable javascript
  • Options to strip out ads
  • Add a captcha if there are too many requests in a given period of time.
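The first three items on this list (relative URLs, stylesheets, and passing clicked links back through the proxy) all come down to rewriting the URLs in the fetched page. Here is a minimal sketch of that idea in plain Python. It is not the code running on proxifyit: the `?url=` query parameter is a hypothetical naming choice, and the regular expression is a crude stand-in for a real HTML parser.

```python
import re
from urllib.parse import quote, urljoin

# Proxy address from the post; the "?url=" parameter name is an assumption.
PROXY = "http://proxifyit.appspot.com/"

# Matches href="..." and src="..." attributes (single or double quoted).
_LINK_ATTR = re.compile(r'''\b(href|src)=(["'])(.*?)\2''', re.IGNORECASE)


def proxify(link, page_url, proxy=PROXY):
    """Resolve `link` against the page it came from, then wrap it in a proxy URL."""
    absolute = urljoin(page_url, link)  # handles relative URLs
    return proxy + "?url=" + quote(absolute, safe="")


def rewrite_links(html, page_url, proxy=PROXY):
    """Point every href/src in `html` (links, stylesheets, images) at the proxy."""
    def repl(match):
        attr, q, link = match.group(1), match.group(2), match.group(3)
        return f"{attr}={q}{proxify(link, page_url, proxy)}{q}"
    return _LINK_ATTR.sub(repl, html)


page = '<link rel="stylesheet" href="../a.css"><a href="/about">About</a>'
print(rewrite_links(page, "http://example.com/x/y.html"))
```

A real version would also need to skip `javascript:` and `mailto:` links and rewrite URLs inside the stylesheets themselves.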

Nothing like a quick hack at the end of a busy day... :-)

July 02, 2008

Community Detection via Matrix Factorization

One form of matrix factorization is Singular Value Decomposition (SVD). This is a powerful technique and it has several applications in information retrieval and graph analysis.

Another matrix factorization technique I had mentioned recently was Non-negative Matrix Factorization (NNMF). The advantage usually cited for NNMF over SVD is that its factors are generally much easier to interpret, thanks to the non-negativity constraint.

Matrix factorization can be achieved via optimization methods. Suppose a matrix A (shown in the figure on the left) of size 20*20 is to be factorized into two matrices, X of size 20*4 and W' of size 4*20. The following objective function can then be minimized:

J = || A - XW' ||_F

The cost function is the Frobenius norm of the difference between the original matrix A and XW', i.e. the error in approximating A as a product of two matrices. This can be solved using gradient-based methods (conjugate gradient, for example), and MATLAB's Optimization Toolbox (fminunc; tutorial) is one way to implement this. Note that the code actually minimizes the squared norm and adds a penalty term ||W||_F^2 on W. Following is the MATLAB code as an example:

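% Toy adjacency matrix: four fully connected communities of five nodes each
test = ones(5,5);
B = blkdiag(test,test,test,test);   % 20x20
M = rand(40,4);                     % initial guess: X (rows 1-20) stacked on W (rows 21-40)

% Tell fminunc that obj_fun1 also returns the gradient
options = optimset('GradObj','on');
[xnew,fval] = fminunc(@(Z) obj_fun1(Z,B,20),M,options);

% --- obj_fun1.m ---
function [fun,Grad] = obj_fun1(Z,A,nodes)
    % Unpack the stacked variable into the two factors
    X = Z(1:nodes,:);
    W = Z(nodes+1:end,:);

    % Objective: squared reconstruction error plus a penalty on W
    fun = norm(A-X*W','fro')^2 + norm(W,'fro')^2;

    if nargout > 1
        Grad1 = 2*(X*(W'*W)-A*W);        % derivative with respect to X
        Grad2 = 2*(W*(X'*X)-A'*X)+2*W;   % derivative with respect to W
        Grad = [Grad1; Grad2];
    end
end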


Once we minimize the objective function, we can obtain the solution for X as
  82.5664   -1.1484   79.4176  -39.0137
   82.5664   -1.1482   79.4176  -39.0137
   82.5666   -1.1485   79.4176  -39.0139
   82.5666   -1.1485   79.4176  -39.0139
   82.5667   -1.1472   79.4173  -39.0141
   -6.3391  -18.4040   68.2625   88.6399
   -6.3389  -18.4039   68.2623   88.6397
   -6.3389  -18.4039   68.2624   88.6398
   -6.3388  -18.4037   68.2622   88.6396
   -6.3386  -18.4036   68.2621   88.6395
   75.9984   13.2685  -57.4890   70.8761
   75.9989   13.2680  -57.4891   70.8759
   75.9985   13.2681  -57.4889   70.8761
   75.9985   13.2687  -57.4891   70.8760
   75.9989   13.2681  -57.4890   70.8758
  -17.6262  112.9716   27.4847   14.5483
  -17.6257  112.9713   27.4844   14.5482
  -17.6263  112.9716   27.4847   14.5483
  -17.6259  112.9715   27.4844   14.5484
  -17.6255  112.9708   27.4844   14.5482

From this, the community structure can be easily determined: rows with nearly identical values belong to the same community, and grouping them recovers the four original blocks. However, a much faster and simpler (in terms of implementation) way to accomplish this goal is to use something like Singular Value Decomposition (SVD).

The above code is just a simple illustrative example. However, for me it was a worthwhile experiment to try out and to understand how matrix factorization via optimization can be useful in community detection.
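Reading the communities off X can also be automated. The sketch below (in Python rather than MATLAB, purely for illustration) groups rows whose entries agree to within a tolerance. The sample rows are copied from the solution shown earlier, two per block; the tolerance of 0.01 is an arbitrary choice for this toy example, not something the optimization prescribes.

```python
def group_rows(rows, tol=0.01):
    """Group row indices whose entries all agree with a group's first row within tol."""
    groups = []  # list of (representative_row, member_indices)
    for i, row in enumerate(rows):
        for rep, members in groups:
            if max(abs(a - b) for a, b in zip(row, rep)) <= tol:
                members.append(i)
                break
        else:
            groups.append((row, [i]))
    return [members for _, members in groups]


# Two rows from each of the four blocks of the solution for X above.
X = [
    [82.5664, -1.1484, 79.4176, -39.0137],
    [82.5667, -1.1472, 79.4173, -39.0141],
    [-6.3391, -18.4040, 68.2625, 88.6399],
    [-6.3386, -18.4036, 68.2621, 88.6395],
    [75.9984, 13.2685, -57.4890, 70.8761],
    [75.9989, 13.2681, -57.4890, 70.8758],
    [-17.6262, 112.9716, 27.4847, 14.5483],
    [-17.6255, 112.9708, 27.4844, 14.5482],
]
communities = group_rows(X)
print(communities)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

On real graphs the rows will not be this cleanly separated, so a proper clustering step (k-means on the rows, for instance) would be the more realistic choice.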

Some recent, interesting papers that use different Matrix Factorizations:

If you have pointers to other interesting references, tutorials or software, I would appreciate it if you could leave me a comment.

June 09, 2008

The Echo Chamber Model

The blogosphere is often described as an echo chamber. Ideas, comments, controversies and discussions all reverberate in this online space. In the paper titled "Information Diffusion in the Blogspace", Daniel Gruhl et al. studied the dynamics of how topics propagate in the blogosphere. They say that resonance is a rare phenomenon that can be described as a

fascinating phenomenon in which a massive response in the community is triggered
by a minute event in the real world.

While few topics reach the state where they "resonate" throughout the community/blogosphere, the echo chamber is still active with millions of posts and discussions in the form of trackbacks, comments, Twitter posts, etc. The blogosphere is a huge graph and the echo chamber phenomenon can be modeled in terms of random walks on this graph. Imagine that your post is read by your immediate neighbors, their posts (in reply to yours) are read by their neighbors, and so on. Now, how long would it take for this sequence to come back as a post or a comment on your blog? This is captured by the commute time (or round-trip time): the expected number of steps for a random walk to start at a given node (say A), reach another node (say B) and return to A.
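To make the commute time concrete, here is a small sketch that computes it exactly for a toy graph. It assumes a simple random walk on a connected, undirected, unweighted graph, and it solves the standard hitting-time equations (h(v) = 1 + average of h over v's neighbors, with h(target) = 0) by plain Gaussian elimination. This is only practical for tiny graphs and is not the method used in the paper.

```python
from fractions import Fraction


def hitting_times(adj, target):
    """Expected number of steps to first reach `target` from every node.

    `adj` maps each node to a list of its neighbors (connected, undirected).
    """
    nodes = [v for v in adj if v != target]
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # One equation per node v: h(v) - (1/deg v) * sum of h(u) over neighbors u = 1
    rows = []
    for v in nodes:
        row = [Fraction(0)] * (n + 1)
        row[index[v]] += 1
        for u in adj[v]:
            if u != target:  # h(target) = 0, so it drops out
                row[index[u]] -= Fraction(1, len(adj[v]))
        row[n] = Fraction(1)
        rows.append(row)
    # Gauss-Jordan elimination with exact fractions
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = 1 / rows[col][col]
        rows[col] = [x * inv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    h = {v: rows[index[v]][n] for v in nodes}
    h[target] = Fraction(0)
    return h


def commute_time(adj, a, b):
    """Expected steps to go from a to b and back to a."""
    return hitting_times(adj, b)[a] + hitting_times(adj, a)[b]


path = {0: [1], 1: [0, 2], 2: [1]}            # 0 - 1 - 2
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(commute_time(path, 0, 2), commute_time(triangle, 0, 1))  # 8 4
```

The two ends of the path are "farther apart" (commute time 8) than two corners of the triangle (commute time 4), even though the path has fewer edges: the extra route in the triangle makes the round trip quicker.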

Commute time embedding is a way to map the original graph (which lies in a high-dimensional space) into a two-dimensional (or any low-dimensional) space in such a way that the Euclidean distance in the low-dimensional embedding preserves (i.e. is proportional to) the commute times in the original graph. Interestingly, this can be quite useful in describing how ideas or information might flow in the network. For more details about the exact method to compute the embedding, please refer to the paper. It is a neat technique that can be described in terms of the eigenvectors of the graph Laplacian (and can also be related to heat diffusion in the graph).

Consider the classic Karate club example. The original graph and the corresponding embedding in a two-dimensional space are shown in the accompanying figures. Notice how nodes 33 and 34 are very close to each other. Compare this to node 3, which is almost equidistant from the central nodes (33/34 and 1) of the two communities.

I think that this might be an interesting way to model the echo chamber of the blogosphere. In this representation, blogs that are closer to each other in Euclidean space are also closer in terms of social distance (another term used in the literature that means approximately the same thing as commute time).

May 30, 2008

My First FireEagle App: Pizza Coupons Search

Yesterday, I discussed an idea around the FireEagle geolocation API. I was envisioning an app where you could have a mobile phone and, as you walk down the mall or any other location, it would pre-fetch relevant coupons and offers from the local restaurants. As grad students, we always learn to find good pizza deals online. So I decided to use the FireEagle API to develop a pizza coupon finder. The way it works is that it authenticates with FireEagle to access your current location, fetches the coupons from Google Maps, and then parses the output to display on your mobile phone or in a browser. If you already have a FireEagle account, you can try it at the following URL: http://wikimatix.com/coupon/pizza.php
The application will first try to authenticate with FireEagle and request the appropriate permission to access your exact or approximate location, and then pass this on to the Google Coupon Finder.

Finally you have all the coupons you need to order your fresh pizza. The documentation and example walkthrough code in FireEagle's developer area are excellent. It took hardly any time to put together this demo!

I think that the possibilities that this opens up for mobile advertising are exciting. We should also keep an eye on Android -- this space is gonna be fun to watch. [Update: Fixed the broken link. Sorry]

May 29, 2008

Yahoo FireEagle: Geolocation made simple

This service is currently in alpha, but thanks to Pranam Kolari I was able to get an invitation to Yahoo!'s FireEagle platform. FireEagle is an easy way to manage and share location information across many applications. Currently, I publish my location information across many different sites and applications, and it is rare that I put in the actual effort to update it everywhere. For example, I use Dopplr to publish my travel plans, Twitter and Brightkite to update my current location, and Facebook to indicate my home address and other details. I was impressed with how easy it was (using OAuth) to allow Dopplr and others to share and access information with FireEagle. If you have a GPS-enabled phone you can even update the geolocation on the go! Damn! That's neat!

One really compelling application is Wikinear.com -- it shows you the nearest places of interest by matching the location information obtained from FireEagle with Wikipedia entries. This is great, especially if you are traveling to a new location or a tourist spot and would like to know the places of interest nearby.

Another very cool application is Metosphere. (PS: I wish I had an iPhone!) With this app, you can leave a digital message for a given location, see places and events of interest and even report graffiti and city repairs! This gives me a reason to believe that the next big thing is going to be mobile advertising. The advantage of having geolocation information specific to a user easily available is immense. This reminds me of a project at the eBiquity research group a few years back, called Agents2go, that talked about a very similar concept. Imagine that you are walking down the street during lunch and the agent on your iPhone automatically collects coupons or finds deals at the nearest restaurants as you walk by. The idea that we can have a query-free, geographically relevant search is really exciting. Yahoo! is innovating and pushing hard on the open initiative. With the availability of an API, it would be fun to integrate Google Coupons! (OK, here is one more fascinating idea and little time at hand!)

Location is a very sensitive piece of information and the best part of FireEagle is that you can manage permissions and privacy settings or even temporarily stop sharing your location. You can allow a specific application to only access location information at a certain granularity: exact, zip, neighborhood, state or even country.  More at Techcrunch.

May 09, 2008

News feed vs. blog posts vs. email

What is the difference in size distribution of a news wire vs. a blog post vs. email message?

The three images below compare the size distributions of news wires (Reuters collection), blog posts (from the ICWSM dataset) and email messages (Enron Corpus). The charts show histograms of the document sizes in these collections:

[Figures: document size histograms for the Reuters, blog post and Enron collections]

The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the MATLAB scripts for power-law fits (thanks to Aaron Clauset).

[Figures: power-law fits for the Reuters, blog post and email size distributions]

The linguistic properties of blogs, email and news stories are quite different, and this has already been highlighted in several research papers. While the three data sets are quite different in many ways, here I am analyzing just the size distributions. The important points to note are:

  • News wire stories are quite short
  • Blogs and emails are much longer and have heavy-tailed distributions
  • Power law exponents for blog size distribution and email size distribution are quite similar (around 2.7)
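For readers who want to try a fit like this without MATLAB, the core of the exponent estimate is short. The sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)), with the cutoff xmin fixed by hand at 2000 bytes as in the analysis above. It is a simplified illustration, not a port of the scripts used here (which, among other things, can also estimate xmin), and the data is synthetic rather than taken from the three corpora.

```python
import math
import random


def powerlaw_alpha(sizes, xmin):
    """Maximum-likelihood exponent of a continuous power law on the tail x >= xmin."""
    tail = [x for x in sizes if x >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(n, alpha, xmin, rng):
    """Draw n synthetic sizes from a power law by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


# Synthetic "document sizes" with a known exponent, just to exercise the estimator.
rng = random.Random(0)
sizes = sample_powerlaw(20000, alpha=2.7, xmin=2000.0, rng=rng)
estimate = powerlaw_alpha(sizes, xmin=2000.0)
print(round(estimate, 2))
```

With 20,000 samples the estimate should land close to the true exponent of 2.7; on real data the choice of xmin matters a great deal and deserves more care than a fixed cutoff.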

So...what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. Sometimes the initial news story is quickly reported by agencies like Reuters/AP. These are at times brief and to the point to allow readers to get a quick gist of its contents.

In contrast, blog posts tend to be much longer than news wires. Citizen journalism is full of opinions, thoughts and punditry, which bloats the post. This also goes back to my previous analysis of blog homepage size vs. Web page size. Indeed, the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).

What I had not expected was the similarity in the slopes for email and blogs. One thing to note, however, is that here the emails are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs. Recently, industry's top usability expert Jakob Nielsen concluded that readers skim through and read at most 20% of the words on a Web page. There are millions of blog posts every day, and very little time to read them all in detail. The volume of email is limited by a person's social network, but for blogs the act of prioritizing what to read is left entirely to the user. This necessitates the use of memetrackers and explains the popularity of filtering tools like Digg, Techmeme, etc. By summarizing popular blog posts and providing blurbs for them, such tools essentially act as a "social news wire service for the blogosphere".

May 02, 2008

Leveraging Web and Social Media for Recommendations

Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I am not very familiar with the recommendation systems literature, it seems obvious that the Web and social media are a great source of information that could be useful when building such systems. Bloggers' profile pages, wishlists, Netflix queues, book lists and the blog posts themselves are potential clues to which two items may be related to each other.

As a simple example, consider the movie "Pulp Fiction". By querying Google for all the inlinks to the IMDB page of Pulp Fiction and counting which other movies are "co-cited" on the linking pages, we get a list of five movies that are most likely to be related to "Pulp Fiction":

Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise, though, was LOTR. I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case at least.

Just for fun, here is another example with "Sin City", another one of my favorite movies.

Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".

April 24, 2008

Favorite Commandline Hack

One of my favorite commandline hacks is demonstrated by the following example:

history | gawk -F ' ' '{print $2}' | sort | uniq -c | sort -nr | more

This takes a text file (or, in our case, the history of commands used), prints just the field we care about, sorts the result and counts the number of times each term occurs, listing the most frequent first. For example, here are the top commands I have used on this server:

    373    ls
    268    cd
     42    more
     29    ps
     27    rm
     25    du
     24    ./bin/startup.sh
     22    exit
     17    source
     15    emacs
     14    sudo
     13    ssh

This is immensely useful and a quick way to do anything from processing a huge file to counting the number of times a link occurs, word counting and all sorts of processing that comes up frequently with large social media datasets. It's a really easy way to do some mundane tasks without having to write a script or much code. So what are your favorite commandline hacks? Share the joy! :-)
