Monitoring the evolution of interests in the blogosphere
Iraklis Varlamis (Presenter), Vasilis Vassalos, Antonis Palaios
The authors present a simple (perhaps even obvious) hypothesis that "Interests of bloggers converge around real world events".
A prototype system was built to analyze blogs and monitor the evolution of interests. Existing search engines monitor the popularity of terms in isolation; related terms are not clustered in any way to identify topics. The motivating example was the football World Cup season, where there might be terms like "football", "round", "final", "result" and "world cup". An aggregated score for the topic would be more helpful than having to find and view the histogram for each of these terms. Also, one cannot know all the terms for a given topic. In many cases, multilingual content and differences in vocabulary make it even harder to rely on term popularity alone. Other approaches require training each topic on a given pool of pre-annotated training documents. This is expensive, and such a semi-supervised method can be strongly affected by the selection of the initial document pool.
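To make the World Cup example concrete, here is a minimal sketch of rolling per-term daily counts up into a single topic score. The term list, the sample counts and the function name `topic_score` are made up for illustration; the talk did not describe how the system scores topics, and in the system itself the term groups would come from clustering rather than a hand-written list.

```python
from collections import Counter

# Hypothetical topic definition, written by hand for this sketch only.
WORLD_CUP_TERMS = {"football", "round", "final", "result", "world cup"}

def topic_score(daily_term_counts, topic_terms):
    """Sum the counts of all topic terms for each day."""
    return {
        day: sum(counts[term] for term in topic_terms)
        for day, counts in daily_term_counts.items()
    }

# Made-up term counts for two days (a Counter returns 0 for missing terms).
daily = {
    "day1": Counter({"football": 3, "final": 1, "election": 5}),
    "day2": Counter({"football": 7, "final": 4, "result": 2, "election": 1}),
}

scores = topic_score(daily, WORLD_CUP_TERMS)
print(scores)  # {'day1': 4, 'day2': 13}
```

One series per topic replaces five separate term histograms, which is the convenience the authors were arguing for.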
The goal of the paper is to build an unsupervised approach that clusters documents and identifies the evolution of interests. The algorithm presented is not incremental, although the authors say they have some recent work on an incremental version. The current version uses the DBSCAN clustering technique. The term space is clustered at the post level, and the evaluation uses inter-cluster similarity, intra-cluster similarity and the utility function. They used the BuzzMetrics dataset, but focused only on blogs that have a post every day, which brings the dataset down to 2500 blogs. The posts are classified into topics and then the blogs are clustered. The categories are determined by using the Cyberfiber newsgroups to find the topics and training documents. This is an interesting dataset that I was not aware of.
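For readers unfamiliar with DBSCAN, the following is a small self-contained sketch of the general technique applied to toy term-count vectors. It is not the authors' implementation: the cosine distance, the `eps` and `min_pts` values, the four-term vocabulary and the sample posts are all assumptions chosen for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query; includes the point itself.
        return [j for j in range(len(points))
                if cosine_distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: expand the cluster
                queue.extend(reachable)
    return labels

# Toy posts as counts over the vocabulary [football, final, bombing, london].
posts = [
    [3, 1, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0],   # football-related posts
    [0, 0, 3, 2], [0, 0, 2, 2], [0, 0, 1, 3],   # London bombing posts
    [1, 0, 0, 1],                               # mixed post
]

labels = dbscan(posts, eps=0.2, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

A useful property for this setting is that DBSCAN does not need the number of topics in advance and can leave off-topic posts unassigned as noise, as happens with the mixed post here.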
Some questions that came up concerned the choice of newswire documents to train the classifiers, and the accuracy of those classifiers. While the original hypothesis seems obvious, it can be seen as a litmus test for the efficacy of the clustering algorithms. In particular, the examples presented used the "London Bombing" event. Any clustering algorithm that groups related terms must at the very least pass this litmus test to show that it works.
I think that in general there are perhaps two types of events. One concerns a very specific group of individuals ("County elections", social events, conferences, etc.), where online communities emerge due to offline events. The other is around real-world events like "London bombings" or "Hurricane Katrina", both of which cut across community boundaries and invoke a reaction from almost every part of the blogosphere, almost in unison. In such cases one might see that the original community affiliation of the bloggers doesn't matter, and tech, politics, religion and every other community might be talking about the same issue.
the link to cyberfier is not working. i'm quite interested in this kind of research, as i investigate research designs concerning agenda-setting processes on web environments.
Posted by: jan | April 16, 2008 at 01:32 PM
Sorry about the broken link. I have fixed it now and it is http://www.cyberfiber.com/
Posted by: Akshay Java | April 17, 2008 at 02:19 PM