SocialDevCamp East Fall 2008 once again invites east coast developers and business leaders to come together for a thoughtful discussion of the ideas and technologies that will drive the future of the social web.
Because of the central location, the event draws from Washington, Baltimore, Wilmington, Philadelphia, New York, Boston and beyond -- the emerging "Amtrak Corridor" technology scene.
Where is the social web going? It's going mobile, to geocentric services, and to open platforms. Join a community of like-minded developers, social media gurus and thought leaders for an unconference to discuss the future of the social web.
We're looking for thought leaders from DC to Boston to meet, forge relationships, and envision the future.
Following are a few running notes from the KDD invited talk "Internet Advertising and Optimal Auction Design" by Dr. Michael Schwarz from Yahoo! Research:
Dr. Schwarz explains that the original advertising models were quite simple, but new tweaks and optimizations are now being added, for example, to estimate click probabilities, match ads to the long tail of queries, etc. So there is a "fudge" factor in online advertising algorithms, and the question now is how to combine these corrections correctly.
In the sponsored search auction space, Overture was really the pioneer of pay-per-click. In that scheme, the bidder would pay whatever price was fixed during the auction of an advertising position. However, people soon realized that the generalized first-price auction is unstable: advertisers keep changing their bids, so the minimum bid keeps fluctuating. The situation is illustrated in the following paper and as shown in the image below:
This system was also quite prone to manipulation by automated robots.
Next, in 2002 Google came up with the Generalized Second-Price auction (GSP): upon winning an auction, the advertiser pays the price fixed by the next-highest bidder (plus a minor amount, like a cent). Yahoo! later followed. In this mechanism the assumption is that for ad position i, clicks are almost independent of the advertisement. Dr. Schwarz presents how this is related to "Generalized English Auctions": a clock shows the current price, which keeps increasing; a bid price is fixed only at the time of dropping out, and the final payments are computed according to GSP. The intuition is that the advertiser values the position just at the point where he or she drops out of the game. If we think about it, at each point the advertiser is deciding whether to move closer to the point where the position is valued highly enough, but not past the point where he or she would regret winning if the next-highest bidder dropped out of the auction. The GSP approach has been shown to have a unique equilibrium. For more details you can read the paper "Internet Advertising and the Generalized Second-Price Auction".
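To make the GSP mechanics concrete, here is a minimal sketch of the pricing rule (my own toy example with hypothetical bids and a one-cent increment, not code from the talk):

```python
# Generalized Second-Price (GSP) auction sketch: each winner pays the
# next-highest bid plus a small increment (e.g., one cent).

def gsp_allocate(bids, num_slots, increment=0.01):
    """bids: dict advertiser -> bid. Returns (advertiser, price) per slot."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for i in range(min(num_slots, len(ranked))):
        advertiser, _ = ranked[i]
        # Pay the next-highest bid plus the increment (or just the
        # increment if there is no lower bidder).
        next_bid = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((advertiser, round(next_bid + increment, 2)))
    return results

bids = {"A": 2.50, "B": 1.80, "C": 0.90}
print(gsp_allocate(bids, num_slots=2))
# → [('A', 1.81), ('B', 0.91)]: A pays B's bid + 1c, B pays C's bid + 1c
```

Note that each winner's payment depends only on the bidder below, which is exactly what makes the drop-out intuition of the Generalized English Auction work.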
Finally, the talk discusses the optimal auction design problem. If you know the probability distribution from which the bidders' values are drawn, then in theory the optimal auction is deterministic. According to further studies of these advertising models, Dr. Schwarz suggests that despite there being multiple slots, the optimal reserve price does not depend on the number of slots! This is great from a search engine's perspective, although I guess this holds only up to a certain number of slots -- or else search results would look rather different! Further, the decline in CTR from one slot to the next also has no influence on the auction. This is perhaps where clever data mining techniques can be applied to estimate the probability distribution of bidders' values for each slot and, ultimately, to set the optimal reserve price. I guess this is also the point where search engines start guarding their trade secrets closely.
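As a toy illustration of the reserve-price idea (my own sketch, not from the talk): for values drawn uniformly from [0, v_max], the standard revenue-maximizing reserve is the point where the "virtual value" r - (1 - F(r))/f(r) = 2r - v_max turns non-negative, which works out to v_max / 2:

```python
# Numeric sketch of the optimal reserve price for bidder values drawn
# uniformly from [0, v_max]. For uniform values, F(r) = r / v_max and
# f(r) = 1 / v_max, so the virtual value is r - (v_max - r) = 2r - v_max.

def optimal_reserve_uniform(v_max, steps=100000):
    """Scan candidate reserves; return the first with non-negative
    virtual value (closed form for this case: v_max / 2)."""
    for i in range(steps + 1):
        r = v_max * i / steps
        if 2 * r - v_max >= 0:
            return r
    return 0.0

print(optimal_reserve_uniform(10.0))  # → 5.0
```

The scan is deliberately naive; the point is only that the reserve comes out of the value distribution alone, with no reference to the number of slots.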
Finally, during discussions with many people at the conference, one theme that kept coming up is social advertising and the monetization of social networks -- which has typically been a difficult problem. Personally, the way I see this (more of my 2 cents here) is threefold:
KDD is an exciting venue precisely because it is at the confluence of data mining, social media, large-scale text and graph analysis, and monetization -- all the themes that have made it a great learning experience for me.
Liveblogging from WebKDD 08
Learning Preferences of New Users in Recommender Systems: An Information Theoretic Approach
Al M. Rashid, George Karypis, and John Riedl
[Update: This was the winner for the best paper award]
This paper addresses the cold-start problem in collaborative filtering: since the system does not know a new user's preferences, it cannot recommend items. The paper looks at a few different heuristics:
Evaluation was done on the MovieLens recommendation system. Overall, IGCN and Entropy0 perform the best.
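For intuition, here is a rough sketch of an Entropy0-style score (my own toy implementation, not the authors' code): rate each item by the entropy of its rating distribution, with "not rated" counted as a category of its own, so that items everyone agrees on score low and divisive, partially rated items score high:

```python
import math
from collections import Counter

def entropy0(ratings, num_users):
    """Entropy of one item's rating distribution, treating 'not rated'
    as an extra category 0. ratings: observed ratings (e.g., 1-5 stars)."""
    counts = Counter(ratings)
    counts[0] = num_users - len(ratings)  # users who did not rate the item
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

# An item everyone rated 5 carries no information about a new user's taste...
print(entropy0([5, 5, 5, 5], num_users=4))    # entropy 0
# ...while a divisive, partially rated item is worth asking about.
print(entropy0([1, 5, 1, 5], num_users=8))    # → 1.5
```

Items with the highest such scores are good candidates to show a new user during sign-up, which is the spirit of the information-theoretic approach here.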
Live blogging from WebKDD 08:
Exploring the Impact of Profile Injection Attacks in Social Tagging Systems
Maryam Ramezani, J.J. Sandvig, Runa Bhaumik, Robin Burke, and Bamshad Mobasher
Explores how folksonomies can be used for social navigation, and how to protect them from 'self-promotion tags'. In profile injection attacks, users create a number of identities and add 'tag spam' with malicious intent. Spurl.net is an example of a social tagging system that was inundated with spam and had to shut down. Across the various entities (resources, tags and users), profile injection attacks can affect each of the 'navigation contexts', like related resources, co-occurring tags, recent items, etc. The paper goes in depth into the various classes of such attacks. The study uses del.icio.us crawls split into three partitions: low frequency, medium frequency, and high frequency.
Two main types of attacks discussed:
One interesting point is that the piggyback attack is a problem mostly in the low- and medium-frequency partitions, and injecting just 3-5% fake user profiles (using relatively few tags) can cause considerable problems in social tagging systems. The authors present various results based on the number of tags, number of profiles, etc.
I think as tagging systems become more mainstream, dealing with spam in such environments is quite important.
Live blogging from WebKDD 08:
I/O efficient computation of First Order Markov Measures for Large and Evolving Graphs
Prasanna K Desikan and Jaideep Srivastava
This paper addresses the I/O bottleneck in PageRank computation.
The typical method is to bin the vectors so that they are accessed as blocks in the graph. This requires more I/O but less memory. The approach presented in this paper is to make use of the structure of the graph in a divide-and-conquer fashion (somewhat like community detection).
A related paper I recalled during this talk is:
Further, the paper presents an approach for dynamic, evolving graphs that takes advantage of the fact that subgraphs can have a faster convergence rate. Their approach provides about a 20% improvement in time efficiency.
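For context, this is the plain in-memory power iteration that such I/O-efficient schemes improve on (a toy sketch on a made-up 4-node graph, not the authors' code):

```python
# Baseline power-iteration PageRank; the paper's blocked, divide-and-conquer
# method targets graphs too large for this in-memory approach.

def pagerank(out_links, d=0.85, tol=1e-10):
    """out_links: dict node -> list of target nodes."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    while True:
        new = {v: (1 - d) / n for v in nodes}
        for v, targets in out_links.items():
            if targets:
                share = d * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:  # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += d * rank[v] / n
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new
        rank = new

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))  # "c" collects the most rank
```

Every iteration here touches the whole rank vector; the binning and subgraph-wise convergence ideas above are about restructuring exactly this loop so each pass fits in memory.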
Live blogging from WebKDD, 2008:
Query-log mining for detecting polysemy and spam
Carlos Castillo, Claudio Corsi, Debora Donato, Paolo Ferragina, and Aristides Gionis
The paper presents an approach to identify spam from both query logs and usage. The queries are a way to obtain the wisdom of the crowds. This information is used to tackle spam detection and identify polysemy.
The sources of information used by the authors were:
Further, features are extracted and topics are identified from a web directory. The intuition is that
"documents that attract queries on many different topics have the potential of being spam."
Syntactic features used were: the degree of a node; the top query terms a document attracts.
Topics are propagated in the click graph, using categories from dmoz, to the queries in the bipartite graph. Two approaches are used for the propagation: weighted average and topic-specific PageRank. The query logs were obtained from Yahoo!, and the labeled spam collection is the WEBSPAM-UK2006 collection.
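A toy sketch of the weighted-average propagation (hypothetical clicks and categories, not the authors' code): seed documents with directory topics, then push topic weight to each query in proportion to click counts. A query like "jaguar" ends up spread over several topics, which is exactly the polysemy signal:

```python
from collections import defaultdict

def propagate_topics(clicks, doc_topics):
    """clicks: (query, doc, count) triples; doc_topics: doc -> {topic: weight}.
    Returns query -> normalized topic distribution, weighted by clicks."""
    query_topics = defaultdict(lambda: defaultdict(float))
    for query, doc, count in clicks:
        for topic, w in doc_topics.get(doc, {}).items():
            query_topics[query][topic] += count * w
    # Normalize each query's topic vector into a distribution.
    result = {}
    for query, tw in query_topics.items():
        total = sum(tw.values())
        result[query] = {t: w / total for t, w in tw.items()}
    return result

clicks = [("jaguar", "cars.example.com", 8), ("jaguar", "zoo.example.org", 2)]
doc_topics = {"cars.example.com": {"Autos": 1.0},
              "zoo.example.org": {"Animals": 1.0}}
print(propagate_topics(clicks, doc_topics))
# → {'jaguar': {'Autos': 0.8, 'Animals': 0.2}}
```

The same machinery, run in the document direction, gives the "attracts queries on many topics" spam signal.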
It was interesting that many polysemous queries were actually country names. For spam detection, content + link + usage obtained the best results, which were almost the same as link + content.
Live Blogging from WebKDD 2008
An Empirical Study of Dynamic Pari-mutuel Markets: Evidence from the Tech Buzz Game
Yiling Chen, David Pennock, and Tejaswi Kasturi
Presented by Dr. Dave Pennock
The Tech Buzz game uses the wisdom of the crowds: people can 'bet' on events like "Bird flu outbreak in UK in 2008", etc. An example of a prediction market that David talks about is InTrade, and there are tons of other clones out there: the Hollywood Stock Exchange, etc. Some of the prediction accuracies shown in the examples are pretty amazing.
Parimutuel markets: all the money goes into one pool, and there is a (zero-sum) wagering game in play where winnings are shared among the winners. In dynamic parimutuel markets (DPM), the entry price changes as more people purchase shares (for either outcome) in the pie. There is a cost function associated with each purchase: C(q1, q2) = sqrt(q1^2 + q2^2). It has the nice property that the ratio of prices is the same as the ratio of shares.
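A quick sketch of that property (my own illustration, not from the talk): the instantaneous price of outcome i is the partial derivative of the cost function, q_i / sqrt(q1^2 + q2^2), so the price ratio equals the share ratio:

```python
import math

# DPM cost function C(q1, q2) = sqrt(q1^2 + q2^2); the price of outcome i
# is dC/dqi = qi / sqrt(q1^2 + q2^2).

def cost(q1, q2):
    return math.sqrt(q1 ** 2 + q2 ** 2)

def prices(q1, q2):
    c = cost(q1, q2)
    return q1 / c, q2 / c

p1, p2 = prices(30.0, 10.0)
print(round(p1 / p2, 6))  # → 3.0, the same as the share ratio 30/10
```

So as traders pile onto one outcome, its price rises continuously with its share of the pool, which is what makes the market "dynamic".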
This idea was used behind the Yahoo! Tech Buzz game, which was an amazingly cool site, if you remember it. It still amazes me how much folks will do for fake money or points! Some research strategies presented for DPM as a prediction mechanism:
Very cool! Watch out for a new prediction game called Yoopick, a sports prediction game where you can predict not just the outcome but the actual score ranges.
I will be attending the KDD 2008 conference in Las Vegas this week where I will be presenting our paper titled "Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies" at the WebKDD 2008 workshop. Please let me know if you are at KDD and I would love to sync up with you!
I also plan to make a short detour to the Bay Area, CA (August 27-31). We could try to organize a social media meetup at a coffee shop if there are enough folks interested in meeting up on August 28th or 29th. Really looking forward to this trip and to catching up with my old friends (school, ex-colleagues, academic) as well as my "LinkedIn/Facebook/Blog/Twitter/FriendFeed/..... friends":
Just for kicks, here is a funny video on this topic:
Seriously, Six Apart? I am sure you could do a lot better than having an expensive team of editors cherry-pick their favorite blogs in a handful of categories. You guys have a ton of good, high-quality blogs, and the interesting thing is that you can use the tag information to cluster these blogs into the different communities they belong to. For example, blogs that talk about business are likely to use tags like marketing, advertising, finance, etc. Algorithmic approaches can be effectively used to mine blog graphs to find high-quality feeds.
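To illustrate the kind of thing I mean, here is a deliberately naive sketch (made-up blogs and tags, threshold chosen arbitrarily) that groups blogs by tag overlap:

```python
# Greedy single-pass clustering of blogs by Jaccard similarity of their
# tag sets -- illustration only, far simpler than a real graph-mining approach.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_blogs(blog_tags, threshold=0.3):
    """blog_tags: dict blog -> set of tags. Returns lists of blog names."""
    clusters = []  # each cluster: [tag_union, [blog names]]
    for blog, tags in blog_tags.items():
        for cluster in clusters:
            if jaccard(tags, cluster[0]) >= threshold:
                cluster[0].update(tags)
                cluster[1].append(blog)
                break
        else:  # no sufficiently similar cluster found: start a new one
            clusters.append([set(tags), [blog]])
    return [members for _, members in clusters]

blog_tags = {
    "bizblog":    {"marketing", "advertising", "finance"},
    "moneytalk":  {"finance", "marketing"},
    "gadgetfeed": {"gadgets", "hardware"},
}
print(cluster_blogs(blog_tags))  # → [['bizblog', 'moneytalk'], ['gadgetfeed']]
```

The business blogs land together purely on shared tags, with no editor in the loop; a real system would of course work over the full blog graph rather than a greedy pass.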
At eBiquity, we have looked at this problem from two different approaches:
The point is, I believe that the top-10 lists offered by Six Apart and Bloglines are not really that interesting anymore. In an era of hundreds of thousands of potentially good blogs (of the millions that exist), we can rely on algorithmic approaches to find (more than just 10) high-quality blogs. This is also related to the "Feed Distillation" task that TREC ran in 2007. If you are interested in feed distillation, check out CMU's submission by Elsas et al. and the follow-up papers by Jon at other venues.
Another challenge is being able to find not just the popular blogs but also that cool blog buried somewhere down in the long tail.