By far the most prevalent data available in social media is tagging information. For example, a user may tag a URL in del.icio.us or an image in Flickr. One question that comes up is how to cluster social data that is rich in tags. Some available techniques ignore the user information and use only a bipartite graph consisting of tags and URLs. Another method is to represent two pieces of evidence (user-tag; tag-URL) in a tripartite graph, where nodes are of three different types: users, tags and URLs. However, even this type of structure misses the higher-order relation among the three nodes. Note that the information available is really in triples of the type <user, tag, url>, and the tripartite graph model does not capture this. In particular, two users may be connected via a common tag even if the URLs they actually bookmarked are vastly different.
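To make the information loss concrete, here is a minimal Python sketch. The users, the tag and the URLs are made up for illustration, and `project` is a hypothetical helper that keeps only the user-tag and tag-URL edges of the tripartite graph. Two different sets of triples end up with exactly the same edges, so the graph alone cannot say who bookmarked what.

```python
def project(triples):
    """Reduce <user, tag, url> triples to the two edge sets of a tripartite graph."""
    user_tag = {(user, tag) for user, tag, _url in triples}
    tag_url = {(tag, url) for _user, tag, url in triples}
    return user_tag, tag_url


# Hypothetical bookmarks: both users use the tag "jaguar", but for different URLs.
triples_a = {
    ("alice", "jaguar", "cars.example"),
    ("bob", "jaguar", "cats.example"),
}

# The same users, tag and URLs, with the bookmarks swapped.
triples_b = {
    ("alice", "jaguar", "cats.example"),
    ("bob", "jaguar", "cars.example"),
}

# The triple sets differ, yet their graph projections are identical.
same_projection = project(triples_a) == project(triples_b)
same_triples = triples_a == triples_b
```

Here `same_projection` is true while `same_triples` is false: the link between a particular user and a particular URL through the tag is lost once the triples are broken into pairwise edges.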
Some techniques based on tensor factorization can handle such data. Still, the question of how best to deal with triple (or higher-order) information from social data remains quite interesting, and being able to do so efficiently and in an online fashion would also be important. I believe this topic may be of significant interest at upcoming social media and data mining conferences. Such techniques could lead to better recommendation systems and personalization algorithms.
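As a rough illustration of the input that tensor-based techniques work on, the sketch below stores the triples as a sparse third-order tensor whose three modes are users, tags and URLs. The function name `build_sparse_tensor`, the dictionary-based storage and the sample data are assumptions made for this example only; the factorization step itself is not shown.

```python
from collections import defaultdict


def build_sparse_tensor(triples):
    """Map <user, tag, url> triples to integer coordinates and count each cell."""
    # One label -> integer id dictionary per mode: users, tags, urls.
    index = ({}, {}, {})
    entries = defaultdict(int)
    for triple in triples:
        # Assign the next free id to a label the first time it is seen.
        key = tuple(
            ids.setdefault(label, len(ids)) for ids, label in zip(index, triple)
        )
        entries[key] += 1
    return index, dict(entries)


# Hypothetical tagging events, in arrival order.
events = [
    ("alice", "jaguar", "cars.example"),
    ("bob", "jaguar", "cats.example"),
    ("alice", "python", "docs.example"),
]

index, entries = build_sparse_tensor(events)
shape = tuple(len(ids) for ids in index)  # (users, tags, urls)
```

Because only the observed cells are stored, a new triple can be added by updating a single entry, which is one small ingredient of the online setting mentioned above.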
[Thanks to Vlad Korolev for some of the discussions related to this post]
I guess this paper is related to what you mean:
"Automated Tag Clustering: Improving search and exploration in the tag space" by Philipp Keller et al.
(One of the authors explains the idea in simple terms on his blog: http://www.pui.ch/phred/automated_tag_clustering/)
The difference is that they propose to cluster tags, not triples, to improve tag space navigation and exploration.
Given the database of triples, one can compute distances between tags. Then, using Girvan-Newman community detection, one can group the tags into semantically cohesive clusters.
I actually loved this idea, and modified it to use Wikipedia-based semantic similarity between tags.
Posted by: Maria Grineva | July 18, 2008 at 03:18 AM
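The tag-distance step described in the comment above can be sketched as follows. The comment does not say which distance the paper uses, so the Jaccard distance over the sets of URLs each tag was applied to is an assumption made here, and the triples are invented for the example. The Girvan-Newman clustering step is not implemented in this sketch.

```python
from itertools import combinations


def tag_distances(triples):
    """Jaccard distance between every pair of tags, based on the URLs they annotate."""
    urls_by_tag = {}
    for _user, tag, url in triples:
        urls_by_tag.setdefault(tag, set()).add(url)

    distances = {}
    for tag_a, tag_b in combinations(sorted(urls_by_tag), 2):
        shared = urls_by_tag[tag_a] & urls_by_tag[tag_b]
        union = urls_by_tag[tag_a] | urls_by_tag[tag_b]
        # 0.0 means identical URL sets, 1.0 means no URL in common.
        distances[(tag_a, tag_b)] = 1 - len(shared) / len(union)
    return distances


# Hypothetical database of <user, tag, url> triples.
triples = [
    ("alice", "python", "docs.example"),
    ("bob", "programming", "docs.example"),
    ("bob", "python", "tutorial.example"),
    ("carol", "cooking", "recipes.example"),
]

distances = tag_distances(triples)
```

With this data, "programming" and "python" share one of two URLs and end up at distance 0.5, while "cooking" is at distance 1.0 from both. A graph whose edges connect tags at small distances could then be passed to a community detection routine.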