Yesterday, I had the opportunity to visit JHU to attend Dr. Jon Kleinberg's talk "Social process, Information Flow, Anonymized Network Data". Following are some running notes from this talk.
Dr. Kleinberg (aka the rebel king) talks about how the convergence of social and technological network now lets us observe information at a much larger scale. We have come a long way in social network analysis. From the initial days of 34 node karate club datasets to analysis of almost 240M users of Instant Messengers. But Dr. Kleinberg points out that it isn't just a matter of scale. Some phenomena that is not visible at smaller scales can become observable when you look at a large number of users. However, this also comes at a cost: often, we understand little about each node or link means. Compare the large scale studies to those of the karate club dataset where we know the roles of individuals in the two splits.
Given the large, networked social data that we can observe, the question Dr. Kleinberg asks is "what can be said about the diffusion of information?". It has several implications like studies of epidemic propagation, information cascades, viral marketing, how rumors, opinions and fads diffuse in these networks. Recently, there has been a great deal of interest in these topics from both computer science and social sciences. One interesting and innovative approach by which Jon and his students study this is using chain letters -- those pesky emails that keep getting forwarded! And that is exactly what makes it possible to analyze how the information spreads.
D. Liben-Nowell, J. Kleinberg. Tracing Information Flow on a Global Scale Using Internet Chain-Letter Data. Proc. National Academy of Sciences, March 2008
They use two instances of widely circulated chain letter: petition against the Iraq war and a petition for increased funding for NPR. The neat trick they use is to mine the snippets from search engine results and crawl public mailing lists for these forwarded messages -- which gives them a clear picture of how the chain spreads in the network. I was really fascinated by this statement that Jon made:
Rather than thinking of it as people browsing through a network of documents, we should think of the chain letters as "documents blazing through a network of people".
Indeed, the analogy presented here was that of a document going through mutations as it was being forwarded by people. Some folks would append names to the list, some would delete them and some would change names to humorous variants like 'Donald Rumsfeld' or 'Karl Rove'. Dr. Kleinberg's slide containing the structure of this chain letter propagation was rather surprising. It consisted of long chains which is counter intuitive. One would normally have expected it to be a large number of short chains. Rather, it was like performing a depth first search traversal of the network.
There were two models suggested by the authors: one was based on temporal ideas and the other spatial homophily in networks:
- Temporal model: chain letters are like DFS and the model is based on timing. The basic idea is that people do not act at the same time. Unlike several epidemic models (which are based on the assumption of synchronization), in this model a node v participates in the chain with some probability and forwards the letter to her neighbors as some function f(t) = t^c (where c is typically -1.5 in human networks).
- Spatial model: is based on homophily, the idea that people tend to associate with similar individuals in networks. This similarity may be spatial, geographical or demographical. The cool insight that this model offers is that long range links create awareness but for participation it is the dense local structures that really motivate people to act.
The second part of this talk focused on privacy and anonymization in social networks. This is an important research problem. As computer scientists, we are always looking for more data but at the same time we have to take the privacy of a user very seriously. From personal experience, after the AOL search query log debacle, it has become almost impossible to get companies to release their datasets -- and for pretty good reasons. Given this deadlock, as researchers it becomes even more important for us to understand the type of attacks and what we can do to prevent these and ensure anonymity of the users when a dataset is released. Dr. Kleinberg points us to a few recent research papers in this direction:
- On anonymizing query logs via token-based hashing Ravi Kumar, Jasmine Novak, Bo Pang, Andrew Tomkins
- A. Narayanan, V. Shmatikov. How to Break Anonymity of the Netflix Prize Dataset. Oakland 2008. (online)
The basic attack strategy is that you can easily add yourself to the system. For example if you were to know that the email network would be released, you would signup for a new account and then add a few edges (by sending an email to known set of individuals). This way you can identify yourself in the 'anonymized' network by looking for these signatures or subgraph patters.
One would expect this problem to be NP-Hard and for very large graphs almost intractable, but in the following paper Dr. Kleinberg shows that under certain conditions this can be solved easily, thus breaching the network's privacy.
L. Backstrom, C. Dwork, J. Kleinberg. Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography. Proc. 16th Intl. World Wide Web Conference, 2007.
The basic intuition was to build a subgraph and attach it to the network. If the subgraph is small and randomly generated, it will be likely that it is unique and thus efficiently findable. Furthermore, they show that in real networks, these types of subgraphs are often quite readily available in the form of small networks or cliques. The paper presents interesting bounds on the problem and proves that you indeed need a subgraph H of the size sqrt(log n) which is reasonably small.
I thoroughly enjoyed the talk, especially since the last conference I attended, Dr. Kleinberg's talk was jam packed and I was hardly able to enter the room. It is always inspiring to hear such speakers and learn from their talks.
This post is from my running notes during the talk and hopefully benefits folks who were unable to attend.