A couple of months back, I mentioned a recrawl of Bloglines (publicly listed) subscriptions that I was running. Here is an excerpt from my previous post:
From this new crawl, the roughly 200K public users have 4,289,988 distinct items in their collective OPML files. These correspond to roughly a million distinct feeds that have at least one subscriber. The following graph shows the distribution of these feed subscriptions, exhibiting the hallmark power-law shape.
According to Technorati, they are tracking 133 million blogs, of which about 1.5 million have updated in the last 7 days. This is HUGE!
In comparison, it looks like there are about 1 Million feeds that really matter (in the sense that people actually subscribe to them). A few points:
- Here is a list of top feeds from this collection.
- Roughly 11% of all subscribed feeds are served via FeedBurner. I am a bit surprised by this, as I expected the fraction to be much higher.
- The number of feeds with at least one subscriber has remained relatively stable since last year, if Bloglines is any indicator.
Looking at this data, I wondered: what is the best way to index millions of feeds? Crawling is a heavy-duty job requiring huge bandwidth and lots of disk writes. It isn't easy to implement; I have written my own crawlers for a few projects in the past, indexing tens of thousands of feeds, but scaling much further takes a good deal of resources, which don't come cheap. Ping servers were supposed to be a critical component of the blogosphere infrastructure; unfortunately, they are inundated with spam right now. That is why tools like Tailrank's Spinn3r are a great alternative for small startups that want to index blogs: they take the heavy lifting off your shoulders and let you focus on building the main product.
There are a number of research challenges in crawling and keeping the index fresh. Recently, Greg pointed to a pretty remarkable paper at WWW: they crawled 6 billion pages on a single commodity server (the paper also won the best paper award). I believe that crawling feeds and blog data is quite different from crawling the Web. Keeping the index fresh means indexing new posts within minutes of publishing. Given that ping servers are inundated with spam, one option is to poll the feeds directly; the other is to ignore all pings except those from the feeds of interest.
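The second option amounts to a whitelist check on incoming pings. A toy sketch (the class and method names here are my own invention, not any ping server's API):

```python
class PingFilter:
    """Accept pings only for feeds in a known-good set.

    Everything else -- which on public ping servers is mostly spam --
    is silently dropped instead of triggering a recrawl.
    """

    def __init__(self, feeds_of_interest):
        self.feeds = set(feeds_of_interest)

    def handle_ping(self, feed_url):
        # True means the ping should enqueue the feed for a recrawl.
        return feed_url in self.feeds
```

The whitelist could simply be the set of feeds with at least one subscriber, which keeps the spam problem out of the crawl queue entirely.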
From Greg's post:
This made me think: what about feeds? If only a million feeds are actually subscribed to, perhaps most of the resources should be dedicated to keeping this core set as fresh as possible. In the context of this question, there is some recent work by Richard Sia on this topic: Efficient Monitoring Algorithm for Fast News Alert, with Junghoo Cho and Hyun-Kyu Cho, in IEEE Transactions on Knowledge and Data Engineering (PDF). They model the posting patterns of RSS feeds and optimize the schedule of fetches to minimize the delay in indexing.
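Their actual algorithm is more sophisticated than this, but one simple heuristic in the same spirit is to split a fixed polling budget across feeds in proportion to the square root of each feed's estimated posting rate, rather than linearly. A sketch (the square-root rule is a common delay-minimization heuristic in the crawl-scheduling literature, not necessarily their exact result):

```python
import math

def allocate_polls(feeds, total_polls_per_hour):
    """Split a fixed polling budget across feeds.

    feeds: dict mapping feed URL -> estimated posts per hour.
    Polling proportionally to sqrt(rate) gives busy feeds more
    attention without starving the long tail the way a linear
    allocation would.
    """
    weights = {url: math.sqrt(rate) for url, rate in feeds.items()}
    total = sum(weights.values())
    return {url: total_polls_per_hour * w / total
            for url, w in weights.items()}
```

For example, a feed posting four times as often as another gets only twice the polls, not four times as many.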
But one question that comes to mind is the tradeoff between index freshness and quality. One could index very few feeds and ensure freshness at the cost of quality; on the other hand, one could index a large number of feeds at the cost of longer indexing delays. What is the optimal balance between the two? How do you incorporate the number of subscribers a feed has into the (re-)crawl prioritization algorithm? Can recent link activity be used to prioritize the recrawl of feeds?
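One could imagine folding all of these signals into a single priority score for the recrawl queue. A toy sketch, where the field names and the particular way of combining them are entirely my own guesses:

```python
import math
import time

def recrawl_priority(feed, now=None):
    """Toy priority score for a feed recrawl queue (illustrative only).

    feed: dict with (hypothetical) keys:
      subscribers    - number of users subscribed to the feed
      posts_per_day  - estimated posting rate
      recent_inlinks - links to this feed's posts seen in the last hour
      last_crawled   - unix timestamp of the last fetch
    """
    now = now or time.time()
    staleness_hours = (now - feed["last_crawled"]) / 3600.0
    # Log-damp the subscriber count so a handful of mega-feeds don't
    # starve the long tail; boost feeds with fresh link activity.
    return (math.log1p(feed["subscribers"])
            * feed["posts_per_day"]
            * (1.0 + feed["recent_inlinks"])
            * staleness_hours)
```

A scheme like this answers both questions at once: subscribers set the baseline priority, while a burst of recent inbound links bumps a feed up the queue even if it was crawled recently.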