Recently I learned about semi-supervised learning over graphs. Given a small number of labeled nodes, the task is to find the remaining nodes that belong to the same class as the labeled training data. This problem is typical in Web spam detection, where some hosts are identified as spammers (possibly manually) and the goal is to automatically tag the rest of the hosts as either spammers or non-spammers.
A similar task can be formulated for learning the labels in blog graphs. For popular blogs, it is not hard to find good-quality labels using either Bloglines or del.icio.us; these labels tend to be most reliable for popular blogs that have been tagged by many readers. To find the other blogs that share the same label, one recently suggested method is graph regularization. (For an overview of label propagation methods, here is a nice tutorial.)
Graph regularization uses the spectral properties of the graph to find a function that satisfies the following two constraints:
- Smoothness Constraint
- Fitting Constraint
According to the smoothness constraint, nodes that are close to each other in the graph should have similar labels. The fitting constraint, on the other hand, ensures that the final label assignments do not deviate too much from the training data, thus imposing a penalty for misclassifying training examples. Both constraints can be formulated in terms of the graph Laplacian; for details refer to Zhou et al.
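As a rough sketch of how the two constraints combine, Zhou et al.'s framework has a closed-form solution F* = (1 - α)(I - αS)^{-1}Y, where S is the symmetrically normalized affinity matrix and α trades smoothness against fitting. A minimal numpy version (the graph, labels, and parameter value below are illustrative, not the post's actual data):

```python
import numpy as np

def graph_regularization(W, Y, alpha=0.99):
    """Closed-form label scores per Zhou et al.'s regularization framework.

    W     : symmetric adjacency/affinity matrix (n x n), zero diagonal
    Y     : initial labels (n x c), one column per class;
            Y[i, k] = 1 if node i is a training example for class k
    alpha : in (0, 1); higher values favor smoothness over fitting
    """
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt  # normalized affinity, eigenvalues in [-1, 1]
    n = W.shape[0]
    # F* = (1 - alpha) * (I - alpha * S)^{-1} Y
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, Y)
```

Each node then gets a score per class, and unlabeled nodes can be assigned the class with the highest score.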
As an example, for a small blog graph, I obtained the del.icio.us tags corresponding to each blog. Empirically, I find that del.icio.us tags provide a very good description of a blog. Once the tags are obtained (I used the JSON API), I take just 10 blogs tagged as "sports" and use the above regularization method to find other blogs that might be similar. The final scores after regularization (see figure below, right) correspond to the community structure found using Ncut (left).
Here is a list of some blogs that were categorized under the label "sports", starting with the training set of 10 blogs.
In my opinion, two things stand out. First, the Ncut algorithm itself might be improved by using the additional tag information. Second, label propagation for blogs is more complicated than the original binary classification (spam/not spam) problem: there could be a few hundred different labels to propagate, and the labels themselves may be closely related ("tech", "technology", "programming", etc.). Adding the constraint that labels be propagated subject to some known distribution makes optimizing the above cost function more complicated. Furthermore, a blog may be equally relevant to two very different labels, such as "politics" and "sports".
Any thoughts? Other techniques that might be useful? Ideas on how inter-related, multiple labels, can be propagated over a graph?