I have a large collection of blogs that I would like to cluster. For the blogs in my collection, I have the following information:
- Text aggregated across all the posts
- A dump of the homepage of the blog
- Tags obtained from del.icio.us or "Feeds that matter"
- And finally how the blogs link to each other (aggregated across all the posts)
To cluster the blogs, one could take any of the following approaches:
- Cluster the blogs using only the text (k-means style)
- Cluster the blogs using only the link information (by, say, performing a normalized cut, or NCut)
- First cluster all the tags (so you would group "tech", "technology", "geek", etc.) and then group the blogs that fall under each of these tag clusters
- Cluster over a bipartite graph of the form <blog, tag>; this is also known as co-clustering
The problem with using the text alone is that the data can be huge (typically gigabytes), so dimensionality reduction techniques such as PCA become almost essential. Performing NCut over the link graph alone ignores the text and tag information, which could be quite helpful. Clustering in the tag space alone and then mapping the resulting clusters back onto the blog graph misses how the blogs are interconnected. Finally, the co-clustering technique also ignores the fact that blogs link to each other, and this information can be quite useful in grouping them together.
The question I was wondering about was how to cluster the blogs using both the link and the tag information simultaneously. I think of the tags as a reduced-dimensional representation of the text: the tag space has a much lower dimension than the full text of the collection, so clustering the blogs using tags can be far more efficient than using the complete text.
Recently, I came across a very interesting piece of work on "Classification Constrained Dimensionality Reduction". The idea is that if we have a set of data points for which the true class labels are known, we can use that label information to guide the dimensionality reduction while classifying the remaining points.
The above paper applies very well to our problem. One simple way to combine the two sources of information is to use the matrix W' shown here. C is a k*n matrix, where n is the number of blogs and k is the number of tags; W is the n*n adjacency matrix of the link graph; and I is a k*k identity matrix. Typically, we can get away with using only the top 200-500 most popular tags. Now, if we perform NCut using the matrix W' instead of W alone, we fold the tag information into the link information. The parameter β controls how much importance we give to the graph versus the tags.
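The step above can be sketched in a few lines of numpy and scikit-learn. One caveat: the exact block layout of W' comes from a figure not reproduced here, so the arrangement below, with W and the identity on the diagonal and βC on the off-diagonal blocks, is my assumption based on the description; the toy link graph and tag matrix are likewise made up for illustration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def combined_affinity(W, C, beta):
    """Fold tag information into the link graph.

    W    : (n, n) blog-blog adjacency matrix
    C    : (k, n) tag-blog incidence matrix
    beta : weight trading off link vs. tag information
    Returns the (n+k, n+k) augmented affinity matrix W'.
    NOTE: this block layout is an assumption; the post refers
    to a figure that is not reproduced here.
    """
    k = C.shape[0]
    top = np.hstack([W, beta * C.T])        # blog rows: links + tag memberships
    bottom = np.hstack([beta * C, np.eye(k)])  # tag rows
    return np.vstack([top, bottom])

# Toy example: 4 blogs forming two link communities, 2 tags.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([[1, 1, 0, 0],   # tag 0 appears on blogs 0 and 1
              [0, 0, 1, 1]],  # tag 1 appears on blogs 2 and 3
             dtype=float)

Wp = combined_affinity(W, C, beta=0.5)

# Spectral clustering on a precomputed affinity matrix approximates NCut.
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(Wp)
blog_labels = labels[:4]  # the first n entries correspond to the blogs
print(blog_labels)
```

Since blogs and tags both become nodes of the augmented graph, the clustering assigns labels to tags as well; for the final answer we simply keep the first n labels, which belong to the blogs.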
For example, for a blog graph consisting of ~3000 blogs, I fetched the tags from del.icio.us and the link graph from the WWE/Buzzmetrics collection. The figure on the right shows 20 clusters obtained by performing NCut on the link information alone, and the same graph clustered with the simultaneous method. Notice how the 20-odd clusters look nicely block-diagonal compared with the results from link information alone. It would be interesting to see how these techniques compare with clustering on the entire text of all the blogs. I am working on this, and the intuition is that using just tag and link information we may do as well as (or better than) using the entire text. After all, tags are nothing but a placeholder for the "topic" of a blog.
An even more interesting extension would be to take the original paper's classification-constrained dimensionality reduction and apply the same technique to our problem, so that we could perform clustering with just a small set of labeled blogs and their labels (the matrix C).