Live blogging from WebKDD, 2008:
Query-log mining for detecting polysemy and spam
Carlos Castillo, Claudio Corsi, Debora Donato, Paolo Ferragina, and Aristides Gionis
The paper presents an approach to detecting spam using both query logs and usage data. Queries serve as a way to tap the wisdom of the crowds, and this information is used both to detect spam and to identify polysemous queries.
The sources of information used by the authors were:
- Query graph: query to matched documents
- Click graph: query to clicked documents
- View graph: query to the matched documents that were viewed
- Anti-click graph: query to documents in the top 3 that the user skipped in favor of a later result. This can be thought of as negative feedback.
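To make the four graphs concrete, here is a minimal sketch of how they might be built from a toy log. The record layout (`query`, `results`, `viewed`, `clicked`), the function name `build_graphs`, and the exact anti-click rule are my assumptions for illustration; the talk did not spell out the log format.

```python
from collections import defaultdict

def build_graphs(log):
    """Build query -> document edge sets from toy log records.

    Each record is a dict with hypothetical keys:
    query, results (ranked list), viewed (list), clicked (list).
    """
    names = ("query", "view", "click", "anti_click")
    graphs = {name: defaultdict(set) for name in names}
    for rec in log:
        q = rec["query"]
        clicked = set(rec["clicked"])
        graphs["query"][q].update(rec["results"])
        graphs["view"][q].update(rec["viewed"])
        graphs["click"][q].update(clicked)
        if clicked:
            # 0-based rank of the lowest-ranked clicked result.
            last_click = max(rec["results"].index(d) for d in clicked)
            # Assumed rule: a top-3 result ranked above a click,
            # and not clicked itself, counts as skipped.
            for rank, doc in enumerate(rec["results"][:3]):
                if rank < last_click and doc not in clicked:
                    graphs["anti_click"][q].add(doc)
    return graphs

# Toy data, invented for this example.
log = [
    {"query": "jaguar", "results": ["a.com", "b.com", "c.com", "d.com"],
     "viewed": ["a.com", "b.com", "c.com"], "clicked": ["c.com"]},
    {"query": "jaguar", "results": ["a.com", "b.com", "c.com", "d.com"],
     "viewed": ["a.com"], "clicked": ["a.com"]},
]
graphs = build_graphs(log)
```

Because edges are aggregated over sessions, the same document can end up in both the click graph and the anti-click graph for one query, as `a.com` does here.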
Further features are extracted, and topics are identified from a web directory. The intuition is
"documents that attract queries on many topics and queries have the potential of being spam."
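One possible way to quantify "many topics" is the Shannon entropy of the topic distribution of the queries a document attracts. This is only my illustration of the intuition; the notes do not say which measure the authors used, and `topic_entropy` is a made-up name.

```python
import math

def topic_entropy(topic_weights):
    """Shannon entropy (in bits) of a {topic: weight} mapping."""
    total = sum(topic_weights.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for weight in topic_weights.values():
        if weight > 0:
            p = weight / total
            entropy -= p * math.log2(p)
    return entropy

# Toy numbers, invented for this example.
focused = topic_entropy({"Autos": 10})
spread = topic_entropy({"Autos": 5, "Health": 5, "Travel": 5, "Finance": 5})
```

A document whose queries all fall under one topic scores 0 bits, while one that attracts queries evenly across four topics scores 2 bits, so a higher value would be the suspicious direction under this assumption.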
The syntactic features used were the degree of a document's node and the top query terms the document attracts.
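Both syntactic features can be read straight off the click graph. A small sketch, assuming the graph is stored as a mapping from query string to a set of clicked documents (the function name and the whitespace tokenization are my own choices, not the authors'):

```python
from collections import Counter

def syntactic_features(click_graph, doc, top_k=3):
    """Degree of `doc` in the click graph and its most frequent query terms."""
    # Queries that led to at least one click on this document.
    queries = [q for q, docs in click_graph.items() if doc in docs]
    term_counts = Counter(
        term for q in queries for term in q.lower().split()
    )
    return {
        "degree": len(queries),
        "top_terms": [t for t, _ in term_counts.most_common(top_k)],
    }

# Toy click graph, invented for this example.
click_graph = {
    "cheap flights": {"x.com"},
    "cheap hotels": {"x.com", "y.com"},
    "cheap pills": {"x.com"},
    "hotels rome": {"y.com"},
}
features = syntactic_features(click_graph, "x.com", top_k=1)
```

Here `x.com` has degree 3 and its single top term is "cheap".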
Topics are propagated in the click graph, carrying categories from dmoz to queries in the bipartite graph. Two approaches are used for the propagation: weighted average and topic-specific PageRank. The query logs were obtained from Yahoo!, and the labeled spam collection is WEBSPAM-UK2006.
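The weighted-average variant of the propagation can be sketched in a single step from labeled documents to queries. I am assuming click counts as edge weights and a dictionary layout of my own; the paper's exact scheme may differ, and topic-specific PageRank is not shown here.

```python
from collections import defaultdict

def propagate_topics(clicks, doc_topics):
    """One doc -> query propagation step by click-weighted average.

    clicks: {query: {doc: click_count}}
    doc_topics: {doc: {topic: weight}}, e.g. from directory categories.
    """
    query_topics = {}
    for query, docs in clicks.items():
        totals = defaultdict(float)
        weight_sum = 0.0
        for doc, count in docs.items():
            if doc not in doc_topics:
                continue  # unlabeled document: contributes nothing here
            weight_sum += count
            for topic, w in doc_topics[doc].items():
                totals[topic] += count * w
        if weight_sum > 0:
            query_topics[query] = {
                t: v / weight_sum for t, v in totals.items()
            }
    return query_topics

# Toy data, invented for this example.
doc_topics = {"cars.example": {"Autos": 1.0}, "zoo.example": {"Animals": 1.0}}
clicks = {"jaguar": {"cars.example": 3, "zoo.example": 1, "unknown.example": 5}}
query_topics = propagate_topics(clicks, doc_topics)
```

In this toy case the query "jaguar" ends up 0.75 Autos and 0.25 Animals, which is the kind of mixed topic profile a polysemous query would show. Running the same step in the other direction would pass topics from queries back to unlabeled documents.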
It was interesting that many polysemous queries were actually country names. For spam detection, content + link + usage features obtained the best results, which were almost the same as those for link + content alone.