Contact Me


  • Akshay Java's Facebook profile

Social Media Events

Friends

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

data mining

July 17, 2008

Clustering Triples from Social Data

TagtriplesBy far, the most prevalent data available in social media is tagging information. For example, in del.icio.us a user may tag a URL or in Flickr she may tag an image. One of the questions that comes up is how to then cluster social data that is rich in tags. Some techniques available ignore the user information and use only a bipartite graph consisting of tags and URLs. Another method is to represent two pieces of evidence (user-tag;tag-blog) in a tripartite graph (where nodes are of three different types: users, tags and urls). However, Realtripleseven this type of structure actually  misses the higher order relation between the three nodes. Note that the information available is really in triples of the type <user, tag, url>. This information is not captured by the tripartite graph model. In particular, two users may be connected via a common tag even if the actual URL they bookmarked is vastly different.

There are some techniques using Tensor Matrix Factorization that can handle such data. However, the question of how to deal with triple (or higher) information from social data is quite interesting. Moreover, being able to do so efficiently and in an online fashion would also be important. I believe that this topic may be of significant interest in the upcoming social media and data mining conferences. The implications of these techniques would be in building better recommendation systems and personalization algorithms.

[Thanks Vlad Korolev for some of the discussions related to this post]

July 02, 2008

Community Detection via Matrix Factorization

Communities One form of matrix factorization is Singular Valued Decomposition (SVD). This is a powerful technique and it has several applications in information retrieval and graph analysis.

Another matrix factorization technique I had mentioned recently was Non Negative Matrix Factorization (NNMF). The advantage of NNMF over SVD is that it is easier to compute and is generally much easier to interpret due to the strict positivity constraint.

Matrix factorization can be achieved via optimization methods. Suppose a matrix A (shown in the figure on the left) of size 20*20 was to be factorized into two matrices X of size 20*4 and W' of size 4*20, the following objective function can be minimized:

J = || A - XW'||_f 

The cost function minimizes the Frobenius norm between the original matrix A an XW', i.e. the error in approximating A as a product of two matrices. This can be solved using conjugate gradient methods and MATLAB's optimization toolbox (fminunc; tutorial) is one way to implement this. Following is the MATLAB code as an example:

test = ones(5,5);
B = blkdiag(test,test,test,test);
M = rand(40,4);

[xnew,fval] = fminunc(@obj_fun1,M,options,B,20);

function [fun,Grad] = obj_fun1(Z,A,nodes)
    [m,n] = size(Z);
   
    X = Z(1:nodes,:);
    W = Z(nodes+1:end,:);
   
   % Objective Function
   fun = norm(A-X*W','fro')^2+norm(W,'fro')^2;
  
    if nargout > 1  
      Grad1 = 2*(X*(W'*W)-A*W); 
      Grad2 = 2*(W*(X'*X)-A'*X)+W;
      Grad = [Grad1; Grad2];
    end


Once we minimize the objective function, we can obtain the solution for X as
  82.5664   -1.1484   79.4176  -39.0137
   82.5664   -1.1482   79.4176  -39.0137
   82.5666   -1.1485   79.4176  -39.0139
   82.5666   -1.1485   79.4176  -39.0139
   82.5667   -1.1472   79.4173  -39.0141
   -6.3391  -18.4040   68.2625   88.6399
   -6.3389  -18.4039   68.2623   88.6397
   -6.3389  -18.4039   68.2624   88.6398
   -6.3388  -18.4037   68.2622   88.6396
   -6.3386  -18.4036   68.2621   88.6395
   75.9984   13.2685  -57.4890   70.8761
   75.9989   13.2680  -57.4891   70.8759
   75.9985   13.2681  -57.4889   70.8761
   75.9985   13.2687  -57.4891   70.8760
   75.9989   13.2681  -57.4890   70.8758
  -17.6262  112.9716   27.4847   14.5483
  -17.6257  112.9713   27.4844   14.5482
  -17.6263  112.9716   27.4847   14.5483
  -17.6259  112.9715   27.4844   14.5484
  -17.6255  112.9708   27.4844   14.5482

From this essentially the community structure can be easily determined (observe the rows can be grouped to reflect the original communities). However, a much faster and efficient (in terms of implementation) way to accomplish this goal is using something like Singular Valued Decomposition (SVD).

The above code is just a simple illustrative example. However, for me it was a worthwhile experiment to try out and to understand how matrix factorization via optimization can be useful in community detection.

Some recent, interesting papers that use different Matrix Factorizations:

I would appreciate if anyone has pointers to other interesting references/tutorials/software and could please leave me a comment.

June 19, 2008

Email Interview: Nihaar Gupta, Youlicit

Nihaar Gupta, VP of Product development at Youlicit has kindly obliged to have an email interview with SocialMedia Research Blog. Following are the responses to some of the questions I had for him:

1) Please describe Youlicit to us?

Youlicit at its core is a discovery engine (http://blog.youlicit.com/?p=23). We want to connect you to the most relevant and recommended information as effortlessly as possible. As of now, we are building a technology that allows a user to find the most recommended sites (recommended by people around the web) related to a given URL. We believe that people are the best judges of content and more often than not, the information you are looking for has been found by someone before. Our goal is to aggregate that information and allow the user to access it with the click of a button.


2) Tell Us about your background, the team behind Youlicit and how started it?

Youlicit came about as a result of trying to solve our own frustrations with trying to find information on the web. With the enormous amount of user-generated content and annotations on the web, we saw a huge amount of valuable data that was inaccessible and fragmented. For the sake of brevity, the team bios & background are here http://blog.youlicit.com/?page_id=6


3) Please give us a brief overview of the technology behind Youlicit?

Youlicit aggregates user annotations of websites and other user generated content and analyzes it to create a URL-URL mapping of websites based on relevance and quality. Using this mapping we are able to deliver related and recommended sites to a user with a click of a button.


4) While using Youlicit plugin, I felt that one of the challenges is the coverage -- how do you plan to address this and build your current index?

We are constantly working on improving our coverage. There are two metrics we strive to maximize for our results, quality and relevance. In regards to quality, we’re always looking to increase our database of “quality sites” by tapping into the various kinds of user annotations (denoting quality content) that exist on the web (bookmarks, tags, votes, comments). In regards to relevance, we’re always researching novel ways to extrapolate connections between websites and map URL’s back to our database of “quality sites”.


5) How do you ranking the 'Enhanced Links' in the plugin? Do you also take into account how many users actually click through the suggested links?

Each result in the Youlicit More widget (and on Youlicit’s site) has a score based on the metrics above, quality of the site and relevance to the item being queried. We are looking into ways of scoring the results from implicit/explicit feedback that we get from users (clicks, recommends).


6) How do you ensure that the Enhanced links feature is non-intrusive?

The current version has manifested itself after a few weeks of alpha testing with a handful of bloggers. That said, we are still looking for feedback on the user interface and would love to hear opinions on how to make it more useful and less intrusive for bloggers/blog readers.


7) How would you compare the plugin to sphere's related blog posts?

While Sphere focuses related & recent blogosphere content, we, at Youlicit, are trying to provide the blog reader with more seminal information related to the blogger’s topic of conversation. For instance, if you are reading a blog entry on global warming, you are more likely to receive the most recommended articles (blogs, sites, essays) on Global warming from around the web rather than  recent blog entries on that topic.


8) What are the other features on Youlicit?

Youlicit’s primary product is a Firefox extension to access that allows a user to access our results during his/her browsing experience. We are in the process of redesigning our website and streamlining the current offering to focus on this button. Down the road we would like to be able to deliver personalized recommendations for users as well as connect users to people based on transient and long-terms interests (ideally using a person’s interests to enhance his/her social graph).


9) Would the plugin be adverting supported?

We do see advertising as a very possible source of monetization. Given the fact that we are providing contextually relevant information, the search model of advertising applies nicely. We are also exploring other possible means of monetization but as of right now the priority is to build something that people find useful.


10) What are the next things to look out for at Youlicit?

As I mentioned above, we are stripping down Youlicit to bring the focus back to its core; the Youlicit More functionality via the Firefox extension and blog widget. We expect to release a new designed website very soon. And as always, we love to hear feedback on what you think so far and how we can improve.


Youlicit: Search Less and Find More!

Youlicit Youlicit is a new tool that helps you "Search less and find more". Often we forget that search is only one means to find what we are looking for. Even search by itself is not the endpoint of an information need or a query. This tool reminds me of the "berry picking model" of Information Retrieval that I had read about first in my IR Class. The model basically says that:

Information need is not satisfied by a single set of documents but by bits and pieces found along the way.

The paper  titled "The Design of Browsing and Berrypciking Technique for Online Search Interface" describes a searcher as

Moving through many actions towards a general goal of satisfactory completion of research related to an information need.

What Youlicit does is provide this ability implicitly, without the reader (or more generally a searcher) having to go through the trouble of navigating and mentally processing through hyperlinks or firing search queries to find related content. Youlicit takes care of all that on your behalf. By providing a simple plugin, the Youlicit widget automatically highlights some of the related, relevant links and provides useful suggestions -- all without your audience ever leaving your blog. I love the idea and the neat implementation that these guys have built. (The very same need was what lead me to hack this Wikipedia related widget a few weeks earlier.)

On the Youlicit site, you have lots more interesting tools. You can discover new content that is relevant to your interests, find related users and share links with them or follow their interests. Youlicit is paving the way for social browsing tools and is a neat concept that is well implemented. Their index does not seem to be very large at the moment and I feel that it would get better as they start to seriously scale up. In the interim, I feel that there might be stopgap solutions that they could be employ -- for example the Alexa related URLs for the links that are not currently in Youlicit's index.

In relation to this plugin, one tool that is similar is the  Sphere plugin that shows related blog posts. I feel that sphere serves a complementary need. From what I understand Youlicit aims to find the interesting blogs and Web URLs one might want to look into in relation to a given hyperlink.

Another plugin is the Snap plugin -- which shows a screenshot of the outlink. However, in my opinion snap does not really serve much purpose and is a bad tool from a usability perspective.

Youlicit is non-intrusive and you are gonna enjoy the serendipity of finding interesting new links! Give it a spin!

June 09, 2008

The Echo Chamber Model

The blogosphere if often described as an echo chamber. Ideas, comments, controversies and discussions are all reverberate in this online space. In the paper titled "Information Diffusion in the Blogspace", Daniel Gruhl et al. studied the dynamics of topics propagate in the Blogosphere. They say that resonance is a rare phenomenon that can be described as

fascinating phenomenon in which a massive response in the community is triggered
by a minute event in the real world.

While few topics reach the state where they "resonate" through out the community/blogosphere, the echo chamber is still active with millions of posts and discussions in the form of trackbacks, comments and twitter posts, etc. The blogosphere is a huge graph and the echo chamber phenomena can be easily modeled in terms of random walks in this graph. Imagine that your post is read by your immediate neighbors and their posts (in reply to yours) is read by their neighbors and so on. Now what is the probability that this sequence would terminate with a post or a comment back on your blog? This is nothing but the commute time or the round trip time for the random walk to start on a given node (say A), traverse to a node (say B) and return back to the node A.

KarateCommute time embedding is a way to map the original graph (which lies in a high dimensional space) into a two (or any low) dimensional space in such a way that the euclidean distance in the low dimensional embedding preserves (i.e proportional to) the commute times in the original graph. Interestingly, this can be quite useful in describing how ideas or information might flow in the network. For more details about the exact method to compute the embedding please refer to the paper. It is a neat technique that can be described in terms of the eigenvectors of the graph Laplacian (and can also be related to heat diffusion in the graph).Commutetime

Consider the classic Karate club example. The original graph and the corresponding embedding in a two dimensional space is shown below. Notice how the nodes 33 and 34 are very close to each other. Compare this to node 3 which is almost equidistant to the central nodes (33/34 and 1) of both the communities.

I think that this might be an interesting way to model the echo chamber of the blogosphere. In this representation, blogs that are closer to each other in Euclidean space are also closer in terms of social distance (another term used in the literature that approximately means the same thing as commute time).

June 07, 2008

Quantifying Social Capital

Since attending the talk by Dr. Tufekci on "The New Social Physics", I have been thinking about social capital and how we can quantify it.



Social Capital, is a term used to describe the intrinsic value of social networks. Robert Putnam attributes social capital to "civic engagement" and as an indicator of "communal health". Pierre Bourdieu describes how social capital explains how people find jobs through their social connections. Similarly, the strength of weak links, describes the importance of social relations in job searches.

In general social capital can be described in terms of bridging capital and bonding capital. Bridging capital is the notion that chance, long range social connections, that we build as part of our social interactions, can help us connect with heterogeneous groups. The idea of "bridges" in social relation is again similar to the notion of connectors in the book "Tipping Point" by Malcolm Gladwell. As Gladwell describes it 

"Connectors are people who link us to the world ... people with a special gift for bringing the world together"

So how can we quantify the connectors? According to the Wikipedia entry:

"There is no widely held consensus on how to measure social capital, which is one of its weaknesses."

One simple way to explain such relations is to use the analogy of bridges. These are individuals who link across the different departments in your office, they are the researchers who have worked with people from other fields and universities. We all know such people and run to them to find our link to the "other side of the planet". One measure that might be useful in quantifying bridging capital is "Edge betweenness". It is a measure that indicates the number of all pair shortest paths that flow through an edge.  In other words, if the edge is removed, many pairs of nodes need to follow a longer path to communicate with each other. Edge-betweenness is also used in some of the community detection techniques. By iteratively removing edges that are most in-between, we can identify the communities that exist in a social network.

Bridging capital is easier to conceptualize than bonding capital. Bonding capital is defined as

"the value assigned to social networks between homogeneous groups of people"

Another way of thinking about bonding capital is as follows: If you were in an urgent need for $500, which you promised to return the next day -- whom would you turn to? The person most likely to lend you the money is one with whom you share a strong bond. In my opinion, one simple way in which bonding capital can be quantified is by measuring the "strength of ties" or the number of common relations you share with another individual. If we think about business partners, it is quite likely that they share the same set of social relationships. Thus their bonding capital will be quite high. By identifying social relationships, which if removed would cause the least effect in the overall shortest path distance between other nodes, we can identify the edges that have a high bonding capital.

Karate Here is a simple example, from the classic karate club dataset. In this graph there are 34 nodes that represent the members of a karate club. During this study the group split into two with half the members going to the founder of the club, node 34 and the rest to the instructor, node 1. Here we can find interesting examples of bridging and bonding capital. The edge between node 33 and node 34 reflect a high bonding capital.. both these nodes have several friends in common. On the other hand the link from 9 to 1 among others are examples of bridging capital.

I would also like to point out a very interesting piece of research by Dr. Tufekci on social relations in Facebook. According to her study, she found that women use Facebook as a means to establish bonding capital while men are trying to increase bridging capital. This is an excellent study on Facebook from a social science perspective and I would highly recommend reading it.

June 06, 2008

The Cost of Manual Annotation

PaintingmonkeyFor many machine learning tasks it can be quite difficult to get the "ground truth" data. In some cases the best way to verify the results is by a painful, laborious and mindnumbing task of labeling data manually. When Pranam and I were working on the Splog detection task, we spent a good deal of time painstakingly labeling independently if a blog from a random sample was legitimate or spam. The part that makes this task difficult is:

  • Sometimes it is not clear when something is a blog is a splog
  • You need to also look at the inlinks and outlinks
  • Plagiarized content makes it harder to judge authenticity
  • Sploggers are getting more sophisticated in the methods they are using

On an average we spent about 2-3 minutes per blog and in the end, were only able to hand label a small collection. In comparison to many other tasks this was still a relatively straightforward judgment. Consider the task of relevance ranking that NIST has to perform each year for the TREC tracks. Here the goal is for the annotators to figure out if a result is relevant for a query. The guidelines are strict and NIST has many professional annotators who are trained to perform these tasks. Even more complicated are some of the annotation that might be required in certain Natural Language Processing experiments. These can range anywhere from just verifying a parse tree or an output to actually constructing gold standards or hand crafted parse trees. Moreover, some NLP tools require tremendous amounts of linguistic resources -- be it tediously constructing an ontology, lexicon, gazetteer lists or identifying word senses. Many of these tasks require linguists or experts whose time might be quite valuable.

lets consider a simple case where the annotator was asked to label a URL with a tag. Lets also say that it takes roughly a minute to load the page, quickly glance over it, make a judgment and then type in the appropriate labels. I know from experience that this is not a minute but more like 1.5-2 minutes on an average (try it! it is a braindead boring task and if you are asked to do it continuously, you will slow down!). If say I can work 10 hours on this task without loss in quality of my annotations, it would only result in 600 URLs being tagged. UMBC pays lets say around $10/hour for on-campus jobs. That means we would spend about $100 just to label 600 URLs. Not so sure if that would be the best way I would like to spend a hundred bucks! Additionally, just one human annotator is never sufficient. You always need to answer questions like : "So, what was your inter-annotator agreement?". Well then you just blew another $200 or $300 on this task and still have just 600 URLs marked up. No wonder del.icio.us and Flickr are such amazing sources for free (yaay!), human assessed labels and annotations. It works out great if you can use these instead.

Mechanical Turk is an attempt by Amazon to make it easy for such tasks to be distributed to people who would perform them in return for small micropayments. However, it seems to me that the incentive for completing the task (or doing it well) is so minimal that there is very little enthusiasm around this product. I suspect, the only people completing these HITS are individuals in countries where the dollar still has some value. In fact, there seems to be something fishy about atleast a handful of HITS that are high paying. It looks like the system is totally getting gamed by spammers - look at this example requesting 20 backlinks to a site ($3) or creating bogus accounts ($7) and likes.

The most interesting tool for manual annotation that I have ever seen is the ESP game. It is an excellent entertaining and I must warn you totally addictive. The game works by showing you an image and asking you to label it. You are randomly paired with another player and both get a score if a word matches. This totally ingenious way of collecting annotations for images means that the annotations come absolutely free of cost!

To bring the cost of manual annotation to zero or close to it the best incentive is to provide some value to the annotator. Ofcourse some tasks are so specific or specialized that this might be truly difficult without actually paying someone to go through it. 

From a research perspective, when we build a classifier and use the UCI dataset we have a good "gold standard" and accessible body of literature that has studied the very same data. But as we are dealing with ever increasing size of datasets, access to ground truth or manually verified samples is becoming even more challenging. So is the significance of using it. What does it really mean to annotate less than 0.1% of the data (say you have a very large collection of blogs, images, graph -- whatever social media content you can imagine)?

[Image Courtesy http://www.socialfiction.org/]


May 16, 2008

Nonnegative Matrix Factorization

Sometimes you learn about a new mathematical technique that is so intriguing that it can be only described as "beautiful". Nonnegative matrix factorization is one such method that I did not know of until quite recently. The details of the method are available in the paper "Document Clustering Based On Non-negative Matrix Factorization" by Wei Xu, Xin Liu, Yihong Gong.   

The basic idea behind this method is that you want to factorize a matrix X into two smaller matrices U and V such that, both U and V are non negative. This is achieved by using minimizing the following optimization function

Equation So if we have a matrix X that represents a Term*Document matrix: it can be factorized into the two matrices U and V such that U signifies the Term*ClusterAssociation and V transpose signifies the ClusterAssociation*Document matrix. Now since the two matrices U and V are non negative, meaning all the elements in them are >= 0, we can identify the cluster to which a document belongs by projecting the vector V onto the dimension with the highest value.

Classsic3nmfSingular Valued Decomposition(SVD), decomposes X into dense matrices that can contain negative elements and it is not always intuitive what the basis vectors really signify. However using NMF the clusters are readily and directly available from the factorization. In addition, the sparsity makes this technique quite appealing.

In the following example, I have clustered the CLASSIC3 dataset, which is a standard corpus frequently used for evaluating different clustering methods. Notice how the three datasets CISI, MEDLINE and CRANFIELD line up nicely along the three different axis.

I like this method for its simplicity and intuition and have been exploring its use in clustering blog/social data.

May 09, 2008

News feed vs. blog posts vs. email

What is the difference in size distribution of a news wire vs. a blog post vs. email message?

The below three images compare the size distribution of news wires (Reuters collection) , blog posts (from the ICWSM dataset) and email messages (Enron Corpus).  The charts show the histograms of the size of the documents in these collections:

Reuters Blogposts_3 Enron_2

The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the matlab scripts for powerlaw fits (Thanks to Aaron Cluaset). 

ReuterslawBlogpostlaw Emaillaw_3

The linguistic properties of blogs email and news stories are quite different and this has already been highlighted in several research papers. While the three data sets are quite different in many ways, here I am analyzing just the size distributions. The  important point to note is 

  • News wire stories are quite short
  • Blogs and emails are much longer and have a heavy tail distribution
  • Power law exponents for blog size distribution and email size distribution are quite similar (around 2.7)

So...what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. Sometimes the initial news story is quickly reported by agencies like Reuters/AP. These are at times brief and to the point to allow readers to get a quick gist of its contents.

In contrast the size of blogs tend to be much larger than news wires. Citizen journalism is full of opinions thoughts and punditry thus bloating the post. This also goes back to my previous analysis of the blog homepage size vs. Web page size. Indeed the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).

What I had not expected was the similarity in the slopes for email and blogs. One thing to note however is that here the emails are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs. Recently, industry's top usability expert Jakob Nielsen concluded that readers skim through and read at most 20% of the words on a Webpage. While there are millions of blog posts every day... there is very little time to read them all in detail. The volume of email is limited by a person's social network but for blogs the act of prioritizing what to read is entirely left upon the user. This essentially necessitates the use of Memetrackers and explains the popularity of filtering tools like digg, techmeme etc. By summarizing popular blog posts and providing blurbs for these, such tools essentially act as a  "social news wire service for the blogosphere".

May 02, 2008

Leaveraging Web and Social Media for Recommendations

Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I do not have a lot of familiarity with recommendation systems literature, it seems obvious that the Web and Social Media is a great source of information that could be useful when building such systems.  Bloggers' profile pages, wishlists, netflix queues, book lists and the blog posts themselves are potential clues to learn which two items may be related to each other.

As a simple example, consider the movie "Pulp Fiction", by querying Google for all the inlinks to the IMDB homepage of Sin City Pulp Fiction and counting which are the other movies that are "co-cited" here is a list of five movies that are most likely to be related to "Pulp Fiction":

Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise though was LOTR, I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case atleast.

Just for fun, here is another example with "Sin City" another one of my favorite movies.

Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".

Google Ads

Related Wikipedia Entries

Ads

Recent Readers

Search this blog


  • WWW
    socialmedia.typepad.com

July 2008

Sun Mon Tue Wed Thu Fri Sat
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    
I Love 6A

Please Support