Contact Me


  • Akshay Java's Facebook profile

Social Media Events

Friends

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

clustering

July 17, 2008

Clustering Triples from Social Data

TagtriplesBy far, the most prevalent data available in social media is tagging information. For example, in del.icio.us a user may tag a URL or in Flickr she may tag an image. One of the questions that comes up is how to then cluster social data that is rich in tags. Some techniques available ignore the user information and use only a bipartite graph consisting of tags and URLs. Another method is to represent two pieces of evidence (user-tag;tag-blog) in a tripartite graph (where nodes are of three different types: users, tags and urls). However, Realtripleseven this type of structure actually  misses the higher order relation between the three nodes. Note that the information available is really in triples of the type <user, tag, url>. This information is not captured by the tripartite graph model. In particular, two users may be connected via a common tag even if the actual URL they bookmarked is vastly different.

There are some techniques using Tensor Matrix Factorization that can handle such data. However, the question of how to deal with triple (or higher) information from social data is quite interesting. Moreover, being able to do so efficiently and in an online fashion would also be important. I believe that this topic may be of significant interest in the upcoming social media and data mining conferences. The implications of these techniques would be in building better recommendation systems and personalization algorithms.

[Thanks Vlad Korolev for some of the discussions related to this post]

July 12, 2008

Detecting Commmunities via Simultaneous Clustering of Graphs and Folksonomies

Great news! Our WebKDD 2008 submission titled "Detecting Commmunities via Simultaneous Clustering of Graphs and Folksonomies" was accepted. I received a number of very helpful comments and suggestions from the reviewers and I am working on updating the paper. I will make the pre-print version available on the ebiquity site as soon as it is ready. In the mean time here is the abstract for this paper:

We present a simple technique for detecting communities by utilizing both the link structure and folksonomy (or tag) information that is readily available in most social media systems. A simple way to describe our approach is by defining a community as a set of nodes in a graph that link more frequently to within this set than outside it and they share similar tags. Our technique is based on the Normalized Cut (NCut) algorithm and can be easily and efficiently implemented. We validate our method by using a real network of blogs and tag information obtained from a social bookmarking site. We also verify our results on a citation network for which we have access to ground truth cluster information. Our method, Simultaneous Cut (SimCut), has the advantage that it can group related tags and cluster the nodes simultaneously.

Interestingly, I had discussed some of the basic idea behind this paper in my earlier blog post "Clustering Tags and Links Simultaneously."

July 07, 2008

Shiny New Toys at Searchme!

A quick note:

Searchme launches cool new Video and Flickr search tools. This is combined with the power of stacks. Stacks are a neat way to organize relevant results, be it videos, images or Webpages, into bundles. (how often have you wondered -- "if only I could recall that search term that gave me the right answer". Well now just stack it!) Searchme provides a feature that allows users to share stacks they have created on your blog or as a Facebook page. This is really neat. Whats even better is that since the stack is AJAX based, it is dynamically updated whenever new results are added to it. Check out the stack titled "Colbert" I created as an example. The best part of visual search is that you can quickly scan for documents and media and identify the stuff you are most interested in.

Another really neat feature in Searchme is its Classification engine. For example, did you know that another celebrity that shares a similar name is Claudette Colbert, a 1930's actress! Would never have found this trivia had it not been for the movies section that Searchme identified for my query "colbert"!

July 02, 2008

Community Detection via Matrix Factorization

Communities One form of matrix factorization is Singular Valued Decomposition (SVD). This is a powerful technique and it has several applications in information retrieval and graph analysis.

Another matrix factorization technique I had mentioned recently was Non Negative Matrix Factorization (NNMF). The advantage of NNMF over SVD is that it is easier to compute and is generally much easier to interpret due to the strict positivity constraint.

Matrix factorization can be achieved via optimization methods. Suppose a matrix A (shown in the figure on the left) of size 20*20 was to be factorized into two matrices X of size 20*4 and W' of size 4*20, the following objective function can be minimized:

J = || A - XW'||_f 

The cost function minimizes the Frobenius norm between the original matrix A an XW', i.e. the error in approximating A as a product of two matrices. This can be solved using conjugate gradient methods and MATLAB's optimization toolbox (fminunc; tutorial) is one way to implement this. Following is the MATLAB code as an example:

test = ones(5,5);
B = blkdiag(test,test,test,test);
M = rand(40,4);

[xnew,fval] = fminunc(@obj_fun1,M,options,B,20);

function [fun,Grad] = obj_fun1(Z,A,nodes)
    [m,n] = size(Z);
   
    X = Z(1:nodes,:);
    W = Z(nodes+1:end,:);
   
   % Objective Function
   fun = norm(A-X*W','fro')^2+norm(W,'fro')^2;
  
    if nargout > 1  
      Grad1 = 2*(X*(W'*W)-A*W); 
      Grad2 = 2*(W*(X'*X)-A'*X)+W;
      Grad = [Grad1; Grad2];
    end


Once we minimize the objective function, we can obtain the solution for X as
  82.5664   -1.1484   79.4176  -39.0137
   82.5664   -1.1482   79.4176  -39.0137
   82.5666   -1.1485   79.4176  -39.0139
   82.5666   -1.1485   79.4176  -39.0139
   82.5667   -1.1472   79.4173  -39.0141
   -6.3391  -18.4040   68.2625   88.6399
   -6.3389  -18.4039   68.2623   88.6397
   -6.3389  -18.4039   68.2624   88.6398
   -6.3388  -18.4037   68.2622   88.6396
   -6.3386  -18.4036   68.2621   88.6395
   75.9984   13.2685  -57.4890   70.8761
   75.9989   13.2680  -57.4891   70.8759
   75.9985   13.2681  -57.4889   70.8761
   75.9985   13.2687  -57.4891   70.8760
   75.9989   13.2681  -57.4890   70.8758
  -17.6262  112.9716   27.4847   14.5483
  -17.6257  112.9713   27.4844   14.5482
  -17.6263  112.9716   27.4847   14.5483
  -17.6259  112.9715   27.4844   14.5484
  -17.6255  112.9708   27.4844   14.5482

From this essentially the community structure can be easily determined (observe the rows can be grouped to reflect the original communities). However, a much faster and efficient (in terms of implementation) way to accomplish this goal is using something like Singular Valued Decomposition (SVD).

The above code is just a simple illustrative example. However, for me it was a worthwhile experiment to try out and to understand how matrix factorization via optimization can be useful in community detection.

Some recent, interesting papers that use different Matrix Factorizations:

I would appreciate if anyone has pointers to other interesting references/tutorials/software and could please leave me a comment.

June 09, 2008

The Echo Chamber Model

The blogosphere if often described as an echo chamber. Ideas, comments, controversies and discussions are all reverberate in this online space. In the paper titled "Information Diffusion in the Blogspace", Daniel Gruhl et al. studied the dynamics of topics propagate in the Blogosphere. They say that resonance is a rare phenomenon that can be described as

fascinating phenomenon in which a massive response in the community is triggered
by a minute event in the real world.

While few topics reach the state where they "resonate" through out the community/blogosphere, the echo chamber is still active with millions of posts and discussions in the form of trackbacks, comments and twitter posts, etc. The blogosphere is a huge graph and the echo chamber phenomena can be easily modeled in terms of random walks in this graph. Imagine that your post is read by your immediate neighbors and their posts (in reply to yours) is read by their neighbors and so on. Now what is the probability that this sequence would terminate with a post or a comment back on your blog? This is nothing but the commute time or the round trip time for the random walk to start on a given node (say A), traverse to a node (say B) and return back to the node A.

KarateCommute time embedding is a way to map the original graph (which lies in a high dimensional space) into a two (or any low) dimensional space in such a way that the euclidean distance in the low dimensional embedding preserves (i.e proportional to) the commute times in the original graph. Interestingly, this can be quite useful in describing how ideas or information might flow in the network. For more details about the exact method to compute the embedding please refer to the paper. It is a neat technique that can be described in terms of the eigenvectors of the graph Laplacian (and can also be related to heat diffusion in the graph).Commutetime

Consider the classic Karate club example. The original graph and the corresponding embedding in a two dimensional space is shown below. Notice how the nodes 33 and 34 are very close to each other. Compare this to node 3 which is almost equidistant to the central nodes (33/34 and 1) of both the communities.

I think that this might be an interesting way to model the echo chamber of the blogosphere. In this representation, blogs that are closer to each other in Euclidean space are also closer in terms of social distance (another term used in the literature that approximately means the same thing as commute time).

May 16, 2008

Nonnegative Matrix Factorization

Sometimes you learn about a new mathematical technique that is so intriguing that it can be only described as "beautiful". Nonnegative matrix factorization is one such method that I did not know of until quite recently. The details of the method are available in the paper "Document Clustering Based On Non-negative Matrix Factorization" by Wei Xu, Xin Liu, Yihong Gong.   

The basic idea behind this method is that you want to factorize a matrix X into two smaller matrices U and V such that, both U and V are non negative. This is achieved by using minimizing the following optimization function

Equation So if we have a matrix X that represents a Term*Document matrix: it can be factorized into the two matrices U and V such that U signifies the Term*ClusterAssociation and V transpose signifies the ClusterAssociation*Document matrix. Now since the two matrices U and V are non negative, meaning all the elements in them are >= 0, we can identify the cluster to which a document belongs by projecting the vector V onto the dimension with the highest value.

Classsic3nmfSingular Valued Decomposition(SVD), decomposes X into dense matrices that can contain negative elements and it is not always intuitive what the basis vectors really signify. However using NMF the clusters are readily and directly available from the factorization. In addition, the sparsity makes this technique quite appealing.

In the following example, I have clustered the CLASSIC3 dataset, which is a standard corpus frequently used for evaluating different clustering methods. Notice how the three datasets CISI, MEDLINE and CRANFIELD line up nicely along the three different axis.

I like this method for its simplicity and intuition and have been exploring its use in clustering blog/social data.

May 02, 2008

Leaveraging Web and Social Media for Recommendations

Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I do not have a lot of familiarity with recommendation systems literature, it seems obvious that the Web and Social Media is a great source of information that could be useful when building such systems.  Bloggers' profile pages, wishlists, netflix queues, book lists and the blog posts themselves are potential clues to learn which two items may be related to each other.

As a simple example, consider the movie "Pulp Fiction", by querying Google for all the inlinks to the IMDB homepage of Sin City Pulp Fiction and counting which are the other movies that are "co-cited" here is a list of five movies that are most likely to be related to "Pulp Fiction":

Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise though was LOTR, I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case atleast.

Just for fun, here is another example with "Sin City" another one of my favorite movies.

Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".

April 17, 2008

Notes from DEBSM Workshop: Inferring Privacy Information via Social Relations

Inferring Privacy Information via Social Relations
Wanhong Xu, Xi Zhou Lei Li (presenter)

The presentation started with a neat quote: "Your social activities tell who you are". Social Networking are part of everyday life and for many of us a primary way in which we keep in touch with friends and family. Privacy is a huge problem in such networks. Li provided some statistics from a recent survey of British social network users: about  62 percent are concerned about the security of their personal data. 31% of the users falsify information to protect identity. This is a huge figure and definitely shows the level of concern.

This paper was motivated with applications for social advertising (which is a 2.2B US market). One can target users based on location, age gender etc. and social advertising allows advertisers to choose their audience. The problem comes up where users do not wish to disclose too much of their information online. The proposed solution is to automatically infer such missing information.

The authors use an undisclosed dataset for this study. The assumption is users fake personal information but not their activities. It is well known in offline-social networks that gender preference exist in friendship. however in online social networks this is not true (Jure Leskovec, WWW,2008). The key question that authors of this paper ask is : "Users may fake their personal information...But what about social activities?"

The insight that this paper offers is that certain group membership gives hints of user gender information. Joining groups has a gender preference. So the way in which one can infer the gender is to use a bipartite graph of users->groups with missing gender information using relation between users and groups. One approach is to use the User*Group matrix and build a classifier. However, they found that Naive Bayes does not work very well. Many social groups in fact dont have any gender preference. This can really hurt the classifier accuracy. So one approach they propose is to choose the discriminating social groups. i.e. groups that dominated by males or females. One major disadvantage that this technique suffers from is that the membership for users that dont join any of these groups can not be predicted. Once you restrict to the set of discriminating groups, now Naive Bayes performs well.

For users that cant be predicted they propose the use of an iterative algorithm to combine discriminative social groups and results from Naive Bayes classifier. Testing is performed by removing or making some data missing and then predicting the missing gender information.

My concern here is that the authors might have been able to avoid this disadvantage if they had used SVD to map the Groups to a set of lower dimensions that way it would automatically "cluster" the groups based on whatever the discriminating factor is (in their case the gender information). Secondly, while the authors had access to "verifiable" ground truth data, in real world how do we know the influence of fake profiles on these discriminative groups?

Clustering Tags and Links Simultaneously

I have a large collection of blogs that I would like to cluster. For the blogs in my collection, I have the following information:

  • Text aggregated across all the posts
  • A dump of the homepage of the blog
  • Tags obtained from del.icio.us or "Feeds that matter"
  • And finally how the blogs link to each other (aggregated across all the posts)

If we want to cluster the blogs, one can take the following approaches:

  1. Cluster the blogs using only the text (Kmeans style)
  2. Cluster the blogs using only the link information (By say performing NCut)
  3. First cluster all the tags (so you would group "tech", "technology", "geek" etc) and then group the blogs that are categorized in each of these clusters.
  4. Using a bipartite graph of the form <Blog,tag>. This is also known as co-clustering.

CoclusteringThe problem with using the text alone is that the data can be huge (typically in Gigs) and it is almost essential to use some dimensionality reduction techniques like PCA etc. By performing NCut over the graph alone, you land up ignoring the text or the tag information that could be quite helpful. By clustering in only the tag space and then mapping the resulting clusters to the blog graph, we miss how the blogs might be interconnected. Finally, co-clustering technique too misses the fact that the blogs might link to each other and this information can be quite useful in grouping them together.

The question I was wondering about was how to cluster the blogs while using both the link and the tag information simultaneously. I think that the tags provide a reduce dimensional representation of the text. The tag space is of a much lower dimension than the complete text of the collection and it can be efficient to cluster the blogs using tags than the complete text.

Recently, I came across a very interesting piece of work that talked about "Classification ConstrainedMatrixlink_2 Dimensionality Reduction". The idea here is that if we have a bunch of data points for which we know the true class information, we would like to classify the remaining points such that the available class information is utilized to guide the dimensionality reduction.

Matrix_2The above paper applied very well to our problem. One simple way to combine the two different information is to use the matrix W' shown here. C is a k*n matrix where n is the number of blogs and k is the number of tags. While W is the adjacency matrix of size n*n. I is an identity matrix of size k*k. Typically, we can get away by using only the top 200-500 most popular tags. Now, if we perform NCut using the matrix W' instead of using W alone, we can combine/fold-in the tag information with the link information. The parameter \beta controls how much importance we would like to give to the graph versus the tag information.


For example for a blog graph consisting of ~3000 blogs, I fetched the tags from del.icio.us and the link graph from the WWE/Buzzmetrics collection. Figure on the right shows 20 clusters if we use just the link information and perform NCut and the same graph using the simultaneous clustering method. Notice, how the 20 odd clusters seem nicely block diagonalized in comparison with the results using link information alone. It would be interesting to look at how these techniques compare with that using the entire text from all the blogs. I am working on this and the intuition is: we might be able to show that using just tag and link information, we can do as good (or better) than using the entire text. Afterall tags are nothing but a place holder for the "topic" of the blog.

An even more interesting extension would be to use the original paper that presented the constrained classification using dimensionality reduction and apply the same technique to our problem, such that we can perform clustering using just a small set of labeled blog and labels (matrix C).

April 16, 2008

Notes from DEBSM Workshop: Monitoring the Evolution of Interest in Blogosphere

Monitoring the evolution of interests in the blogosphere
Iraklis Varlamis (Presenter), Vasilis Vassalos, Antonis Palaios

The authors present a simple (perhaps even obvious) hypothesis that "Interests of bloggers converge around real world events".

A prototype system was built to to analyze blogs and monitor the evolution of interests. Existing search engines monitor the term popularity of words in isolation and the related terms are not clustered in any way to identify the topics. A motivating example presented was around the football world cup season, where there might be terms like "football" "round" "final" "result" "world cup" and aggregated scores for topic would be helpful rather than having to find/view the histograms for each of these terms. Also one can not know all the terms for a given topic. In many cases, multilingual content and difference in vocabulary can further make it difficult to just monitor the term popularity. Other approaches require training a topic based on a given set of training/pre-annotated document pool. This is an expensive and a semi-supervised method can be strongly effected by the selection of the initial document pool.

The goal of their paper is to build an unsupervised approach to cluster the documents and identify the evolution of interest. However, the current algorithm presented is not incremental; but they claim some recent work on an incremental version. The current version of their algorithm uses the DBScan clustering technique. Term space is clustered at the post level and their evaluation is done using inter cluster, intra cluster similarity and the utility function. They have used the BuzzMetrics dataset, however they only focus on blogs that have a post everyday, which brings the dataset down to 2500 blogs. The posts are classified into topics and then the blogs are clustered. The categories are determined by using the Cyberfiber news groups to find the topics and training documents. This is an interesting dataset that I was not aware of.

Some questions that came up were the choice of using newswire documents to learn classifiers, and also the accuracy of the classifiers. While the original hypothesis seems obvious, it can be seen as a litmus test for the efficacy of the clustering algorithms. In particular the examples presented were using the "London Bombing" Event. Any clustering algorithm that groups the related terms must at the very least pass the litmus test to show that it works.

I think that in general there are perhaps two types of events, one concerning a very specific group of individuals  ("County elections", social events, conferences etc) where online communities emerge due to offline events. The other is around real word events like "London bombings" or "Hurricane Katrina": both of which are examples of events that cut across community boundaries and which invoke reaction from almost every part of the blogosphere and almost in unison. In such cases on might see that the original community affiliation of the bloggers dont matter and tech, politics, religion and every other community might be talking about the same issue.

Google Ads

Related Wikipedia Entries

Ads

Recent Readers

Search this blog


  • WWW
    socialmedia.typepad.com

July 2008

Sun Mon Tue Wed Thu Fri Sat
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    
I Love 6A

Please Support