Contact Me


  • Akshay Java's Facebook profile

Social Media Events

Friends

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

similarity measures

May 16, 2008

Nonnegative Matrix Factorization

Sometimes you learn about a new mathematical technique that is so intriguing that it can be only described as "beautiful". Nonnegative matrix factorization is one such method that I did not know of until quite recently. The details of the method are available in the paper "Document Clustering Based On Non-negative Matrix Factorization" by Wei Xu, Xin Liu, Yihong Gong.   

The basic idea behind this method is that you want to factorize a matrix X into two smaller matrices U and V such that, both U and V are non negative. This is achieved by using minimizing the following optimization function

Equation So if we have a matrix X that represents a Term*Document matrix: it can be factorized into the two matrices U and V such that U signifies the Term*ClusterAssociation and V transpose signifies the ClusterAssociation*Document matrix. Now since the two matrices U and V are non negative, meaning all the elements in them are >= 0, we can identify the cluster to which a document belongs by projecting the vector V onto the dimension with the highest value.

Classsic3nmfSingular Valued Decomposition(SVD), decomposes X into dense matrices that can contain negative elements and it is not always intuitive what the basis vectors really signify. However using NMF the clusters are readily and directly available from the factorization. In addition, the sparsity makes this technique quite appealing.

In the following example, I have clustered the CLASSIC3 dataset, which is a standard corpus frequently used for evaluating different clustering methods. Notice how the three datasets CISI, MEDLINE and CRANFIELD line up nicely along the three different axis.

I like this method for its simplicity and intuition and have been exploring its use in clustering blog/social data.

April 20, 2008

SemanticHacker Challenge and Wikipedia Widgets

When writing a blog post, we create links to Wikipedia articles because it is an easy and quick way to provide information to readers, without having to go into too many details. It is not always possible to find, link or even know about all the articles that might be related to the post's topic.

Zareen Syed, one of my colleagues, recently presented a paper at ICWSM on "Wikipedia as an Ontology for Describing Documents". This paper talks about how Wikipedia articles can be used to associate the concepts with a given document. Over at SemanticHacker a similar approach is taken to find the "Simplified Semantic Signatures" for a document.  TextWise, is offering upto a Million Dollars in funding to any idea that can

Turn on the Power of Semantics

I liked their demo, API and tools they provide. Inspired by the effectiveness ofWikimatix Wikipedia in describing documents and my desire to play with widgets, I decided to mash these two goals together. If you look on the right column of this blog, you would find a small widget that displays the related Wikipedia entries for a given page/post and is powered by the SemanticHacker API. This was a quick hack and the system can be improved by

  • focusing on the post content alone (right now I just pass the current page URL).
  • Adding more features, like identifying people, places etc and highlighting them differently

While I dont think this by itself might qualify for a $1 Million challenge, I think there might be something interesting here. I have a couple of ideas around this and scant time to work on it. Finally, in true Web 2.0 spirit -- not sure what, if any, is the business model (I just built this for kicks; semantic hacker is looking for a bizplan from participants).

I dont think that the widget itself is ready for prime time -- it was an evening hack. But if you would like to play with it leave me a note in the comments and I shall provide a link to you so you can install it on your blog.

April 17, 2008

Clustering Tags and Links Simultaneously

I have a large collection of blogs that I would like to cluster. For the blogs in my collection, I have the following information:

  • Text aggregated across all the posts
  • A dump of the homepage of the blog
  • Tags obtained from del.icio.us or "Feeds that matter"
  • And finally how the blogs link to each other (aggregated across all the posts)

If we want to cluster the blogs, one can take the following approaches:

  1. Cluster the blogs using only the text (Kmeans style)
  2. Cluster the blogs using only the link information (By say performing NCut)
  3. First cluster all the tags (so you would group "tech", "technology", "geek" etc) and then group the blogs that are categorized in each of these clusters.
  4. Using a bipartite graph of the form <Blog,tag>. This is also known as co-clustering.

CoclusteringThe problem with using the text alone is that the data can be huge (typically in Gigs) and it is almost essential to use some dimensionality reduction techniques like PCA etc. By performing NCut over the graph alone, you land up ignoring the text or the tag information that could be quite helpful. By clustering in only the tag space and then mapping the resulting clusters to the blog graph, we miss how the blogs might be interconnected. Finally, co-clustering technique too misses the fact that the blogs might link to each other and this information can be quite useful in grouping them together.

The question I was wondering about was how to cluster the blogs while using both the link and the tag information simultaneously. I think that the tags provide a reduce dimensional representation of the text. The tag space is of a much lower dimension than the complete text of the collection and it can be efficient to cluster the blogs using tags than the complete text.

Recently, I came across a very interesting piece of work that talked about "Classification ConstrainedMatrixlink_2 Dimensionality Reduction". The idea here is that if we have a bunch of data points for which we know the true class information, we would like to classify the remaining points such that the available class information is utilized to guide the dimensionality reduction.

Matrix_2The above paper applied very well to our problem. One simple way to combine the two different information is to use the matrix W' shown here. C is a k*n matrix where n is the number of blogs and k is the number of tags. While W is the adjacency matrix of size n*n. I is an identity matrix of size k*k. Typically, we can get away by using only the top 200-500 most popular tags. Now, if we perform NCut using the matrix W' instead of using W alone, we can combine/fold-in the tag information with the link information. The parameter \beta controls how much importance we would like to give to the graph versus the tag information.


For example for a blog graph consisting of ~3000 blogs, I fetched the tags from del.icio.us and the link graph from the WWE/Buzzmetrics collection. Figure on the right shows 20 clusters if we use just the link information and perform NCut and the same graph using the simultaneous clustering method. Notice, how the 20 odd clusters seem nicely block diagonalized in comparison with the results using link information alone. It would be interesting to look at how these techniques compare with that using the entire text from all the blogs. I am working on this and the intuition is: we might be able to show that using just tag and link information, we can do as good (or better) than using the entire text. Afterall tags are nothing but a place holder for the "topic" of the blog.

An even more interesting extension would be to use the original paper that presented the constrained classification using dimensionality reduction and apply the same technique to our problem, such that we can perform clustering using just a small set of labeled blog and labels (matrix C).

March 04, 2008

Measuring Similarity in Social Media?

What is similarity? Does the richness of meta-data and the range of content in social media make it harder to define similarity in the context of Social Media? Jonathan's comments reminded me of a few thoughts I had recently on similarity in Social Media. And I think Jon raises a critical point which applies to this post as well. Like relevance, how we define similarity would really be dependent on the task.

Nevertheless, the traditional view of similarity measure is in terms of document similarity measured using cosine similarity. An extension is that rather than using content similarity, some researchers model topics using techniques like LDA, thus resulting in topic similarity. While, content/topic similarity is surely still used in blog analysis there are wider set of interesting features to build on.

Tag Similarity: Folksonomies for instance offer user-generated tags. This is a helpful feature to find out if two blogs are somehow related. The tags used by the author of the post is another feature. These may or may not correspond with the tags used by their readers (or on bookmarks in del.icio.us). I find it interesting that these two may often complement each other. Authors might tag their posts with specifics (like "Hilary clinton", "immigration" etc), while the reader can simply categorize the blog under a general tag "politics".

Community Structure: The graph itself has cues that offer hints about the similarity between two blogs. Community structure is definitely of importance and blogs that structurally belong to the same cluster are often related. The slight issue is that link semantics in the Blogosphere can be different from that of the Web.

Link Semantics: Traditional view of links is that of endorsement, but in the Blogosphere, sentiments may even play a big role. Related with link semantics is also the question of relation-type. Blogrolls and linkrolls are yet another way in which two blogs can be related. Adding a blog in my blogroll might not necessarily mean that its topically related -- it could just mean a sort of vague friendship / foaf:friend type of a relation.

Additionally, a significant percentage of links created on blogs point to main stream media. One can consider this as a bipartite graph  with blogs and MSM sites as the two node types. Clustering blogs that point to similar set of MSM sources can reveal additional similarity cues. This would be more particularly distinct in political domain.

While I approach this question from the viewpoint of blogs, social network and flickr communities have still further relation-types and features. In such systems comments, group memberships, affiliations etc. can be additional indicators for similarity measures.

I believe that it is still an open question how such varied features can be utilized to measure similarity in Social Media. I am sure that there are going to be a TON of interesting discussions on some of these issues at ICWSM, which I am eagerly looking forward to.

Google Ads

Related Wikipedia Entries

Ads

Recent Readers

Search this blog


  • WWW
    socialmedia.typepad.com

July 2008

Sun Mon Tue Wed Thu Fri Sat
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    
I Love 6A

Please Support