Contact Me


  • Akshay Java's Facebook profile

Social Media Events

My Network

FriendFeed

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

« Notes from DEBSM Workshop: Using SentiWordNet for Multilingual Sentiment Analysis | Main | Notes from DEBSM Workshop: Inferring Privacy Information via Social Relations »

April 17, 2008

Clustering Tags and Links Simultaneously

I have a large collection of blogs that I would like to cluster. For the blogs in my collection, I have the following information:

  • Text aggregated across all the posts
  • A dump of the homepage of the blog
  • Tags obtained from del.icio.us or "Feeds that matter"
  • And finally how the blogs link to each other (aggregated across all the posts)

If we want to cluster the blogs, one can take the following approaches:

  1. Cluster the blogs using only the text (Kmeans style)
  2. Cluster the blogs using only the link information (By say performing NCut)
  3. First cluster all the tags (so you would group "tech", "technology", "geek" etc) and then group the blogs that are categorized in each of these clusters.
  4. Using a bipartite graph of the form <Blog,tag>. This is also known as co-clustering.

CoclusteringThe problem with using the text alone is that the data can be huge (typically in Gigs) and it is almost essential to use some dimensionality reduction techniques like PCA etc. By performing NCut over the graph alone, you land up ignoring the text or the tag information that could be quite helpful. By clustering in only the tag space and then mapping the resulting clusters to the blog graph, we miss how the blogs might be interconnected. Finally, co-clustering technique too misses the fact that the blogs might link to each other and this information can be quite useful in grouping them together.

The question I was wondering about was how to cluster the blogs while using both the link and the tag information simultaneously. I think that the tags provide a reduce dimensional representation of the text. The tag space is of a much lower dimension than the complete text of the collection and it can be efficient to cluster the blogs using tags than the complete text.

Recently, I came across a very interesting piece of work that talked about "Classification ConstrainedMatrixlink_2 Dimensionality Reduction". The idea here is that if we have a bunch of data points for which we know the true class information, we would like to classify the remaining points such that the available class information is utilized to guide the dimensionality reduction.

Matrix_2The above paper applied very well to our problem. One simple way to combine the two different information is to use the matrix W' shown here. C is a k*n matrix where n is the number of blogs and k is the number of tags. While W is the adjacency matrix of size n*n. I is an identity matrix of size k*k. Typically, we can get away by using only the top 200-500 most popular tags. Now, if we perform NCut using the matrix W' instead of using W alone, we can combine/fold-in the tag information with the link information. The parameter \beta controls how much importance we would like to give to the graph versus the tag information.


For example for a blog graph consisting of ~3000 blogs, I fetched the tags from del.icio.us and the link graph from the WWE/Buzzmetrics collection. Figure on the right shows 20 clusters if we use just the link information and perform NCut and the same graph using the simultaneous clustering method. Notice, how the 20 odd clusters seem nicely block diagonalized in comparison with the results using link information alone. It would be interesting to look at how these techniques compare with that using the entire text from all the blogs. I am working on this and the intuition is: we might be able to show that using just tag and link information, we can do as good (or better) than using the entire text. Afterall tags are nothing but a place holder for the "topic" of the blog.

An even more interesting extension would be to use the original paper that presented the constrained classification using dimensionality reduction and apply the same technique to our problem, such that we can perform clustering using just a small set of labeled blog and labels (matrix C).

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00e550300155883400e551f28a018834

Listed below are links to weblogs that reference Clustering Tags and Links Simultaneously :

Comments

Feed You can follow this conversation by subscribing to the comment feed for this post.

Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Working...
Your comment could not be posted. Error type:
Your comment has been saved. Comments are moderated and will not appear until approved by the author. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.

Working...

Post a comment

Comments are moderated, and will not appear until the author has approved them.

Sponsors

Ads

Search this blog


  • WWW
    socialmedia.typepad.com

Recent Readers

April 2009

Sun Mon Tue Wed Thu Fri Sat
      1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30    

Please Support