What is the difference in the size distribution of a news wire story vs. a blog post vs. an email message?
The three images below compare the size distributions of news wire stories (Reuters collection), blog posts (from the ICWSM dataset) and email messages (Enron Corpus). The charts show histograms of document size for each collection:
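For readers who want to build a similar histogram from their own collection, here is a minimal sketch. It assumes document sizes are available as integer byte counts and groups them into power-of-two bins, which is a common choice for heavy-tailed data. The binning used for the charts in this post is not stated, and the function name `log2_histogram` is purely illustrative.

```python
from collections import Counter


def log2_histogram(sizes):
    """Count documents in power-of-two size bins.

    `sizes` is an iterable of document sizes in bytes (positive integers).
    Returns a dict mapping the lower edge of each bin (1, 2, 4, 8, ...)
    to the number of documents whose size falls in [edge, 2 * edge).
    """
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)),
    # computed exactly with no floating-point rounding.
    counts = Counter(size.bit_length() - 1 for size in sizes if size > 0)
    return {2 ** k: counts[k] for k in sorted(counts)}


if __name__ == "__main__":
    # Hypothetical sizes in bytes, for illustration only.
    example_sizes = [350, 900, 1200, 2048, 5000, 16000, 70000]
    for edge, count in log2_histogram(example_sizes).items():
        print(f"{edge:>6} - {2 * edge - 1:>6} bytes: {count}")
```

Plotting the bin counts on log-log axes makes a heavy tail show up as a roughly straight line.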
The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the MATLAB scripts for power-law fits (thanks to Aaron Clauset).
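To give a feel for what such a fit computes, here is a minimal Python sketch of the standard continuous maximum-likelihood estimate of a power-law exponent, with the lower cutoff fixed at 2000 bytes as in the text. This is not the MATLAB code used for this post: those scripts can also estimate the cutoff and assess goodness of fit, both of which are skipped here. Treating byte counts as continuous values is a simplifying assumption, and the function names are illustrative.

```python
import math
import random


def fit_power_law_exponent(sizes, xmin=2000):
    """Estimate alpha in p(x) ~ x^(-alpha) for x >= xmin.

    Uses the continuous maximum-likelihood estimator
        alpha = 1 + n / sum(ln(x_i / xmin))
    over the n observations with x_i >= xmin.
    Returns (alpha, approximate standard error).
    """
    tail = [x for x in sizes if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum <= 0:
        raise ValueError("need at least one observation strictly above xmin")
    alpha = 1.0 + len(tail) / log_sum
    std_err = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, std_err


def sample_power_law(n, alpha, xmin, seed=0):
    """Draw n synthetic values from a continuous power law (inverse transform)."""
    rng = random.Random(seed)
    # rng.random() is in [0, 1), so 1 - u is in (0, 1] and the power is finite.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic data only: check that the estimator recovers a known exponent.
    synthetic = sample_power_law(20000, alpha=2.7, xmin=2000)
    alpha_hat, err = fit_power_law_exponent(synthetic, xmin=2000)
    print(f"estimated alpha = {alpha_hat:.2f} +/- {err:.2f}")
```

On real document sizes, the estimate is sensitive to the choice of cutoff, so it is worth trying a few values of `xmin` before reading much into the exponent.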
The linguistic properties of blogs, email and news stories are quite different, and this has already been highlighted in several research papers. While the three data sets differ in many ways, here I am analyzing just the size distributions. The important points to note are:
- News wire stories are quite short
- Blogs and emails are much longer and have heavy-tailed distributions
- The power-law exponents for the blog and email size distributions are quite similar (around 2.7)
So... what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. The initial version of a story is often reported quickly by agencies like Reuters/AP. These reports tend to be brief and to the point so that readers can get a quick gist of their contents.
In contrast, blog posts tend to be much larger than news wire stories. Citizen journalism is full of opinions, thoughts and punditry, which bloats the post. This also goes back to my previous analysis of blog homepage size vs. Web page size. Indeed, the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).
What I had not expected was the similarity in the slopes for email and blogs. One thing to note, however, is that the emails here are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs.
Recently, Jakob Nielsen, one of the industry's top usability experts, concluded that readers skim and read at most 20% of the words on a Web page. While there are millions of blog posts every day, there is very little time to read them all in detail. The volume of email is limited by a person's social network, but for blogs the task of prioritizing what to read is left entirely to the user. This essentially necessitates the use of memetrackers and explains the popularity of filtering tools like Digg, Techmeme, etc. By summarizing popular blog posts and providing blurbs for them, such tools essentially act as a "social news wire service for the blogosphere".