For many machine learning tasks it can be quite difficult to get the "ground truth" data. In some cases the best way to verify the results is the painful, laborious and mind-numbing task of labeling data manually. When Pranam and I were working on the splog detection task, we each spent a good deal of time painstakingly and independently labeling whether a blog from a random sample was legitimate or spam. Several things make this task difficult:
- Sometimes it is not clear whether something is a blog or a splog
- You need to also look at the inlinks and outlinks
- Plagiarized content makes it harder to judge authenticity
- Sploggers are getting more sophisticated in the methods they are using
On average we spent about 2-3 minutes per blog and, in the end, were only able to hand-label a small collection. Compared to many other tasks, this was still a relatively straightforward judgment. Consider the relevance judging that NIST has to perform each year for the TREC tracks. Here the goal is for the annotators to figure out if a result is relevant for a query. The guidelines are strict and NIST has many professional annotators who are trained to perform these tasks. Even more complicated are some of the annotations that might be required in certain Natural Language Processing experiments. These can range anywhere from just verifying a parse tree or an output to actually constructing gold standards or hand-crafted parse trees. Moreover, some NLP tools require tremendous amounts of linguistic resources -- be it tediously constructing an ontology, lexicon or gazetteer lists, or identifying word senses. Many of these tasks require linguists or experts whose time might be quite valuable.
Let's consider a simple case where the annotator is asked to label a URL with a tag. Let's also say that it takes roughly a minute to load the page, quickly glance over it, make a judgment and then type in the appropriate labels. I know from experience that this is not a minute but more like 1.5-2 minutes on average (try it! It is a braindead, boring task and if you are asked to do it continuously, you will slow down!). Even at the optimistic one minute per URL, if I can work 10 hours on this task without loss in quality of my annotations, it would only result in 600 URLs being tagged. Let's say UMBC pays around $10/hour for on-campus jobs. That means we would spend about $100 just to label 600 URLs. I am not so sure that is how I would like to spend a hundred bucks! Additionally, just one human annotator is never sufficient. You always need to answer questions like: "So, what was your inter-annotator agreement?" Well, then you just blew another $200 or $300 on two or three more annotators and still have just 600 URLs marked up. No wonder del.icio.us and Flickr are such amazing sources for free (yaay!), human-assessed labels and annotations. It works out great if you can use these instead.
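To make the back-of-the-envelope numbers above easy to play with, here is a minimal Python sketch. The function names (`urls_labeled`, `annotation_cost`) are made up for illustration, and the model simply assumes a constant labeling speed and a flat hourly wage, which is exactly the optimistic part.

```python
def urls_labeled(hours_worked, minutes_per_url):
    """Number of URLs one annotator gets through at a constant pace."""
    return int(hours_worked * 60 / minutes_per_url)


def annotation_cost(num_urls, minutes_per_url, hourly_wage, num_annotators=1):
    """Total wage bill when every annotator labels every URL.

    Assumes (hypothetically) a flat hourly wage and no slowdown over time.
    """
    hours_per_annotator = num_urls * minutes_per_url / 60
    return hours_per_annotator * hourly_wage * num_annotators


# The scenario from the text: 10 hours at 1 minute per URL, $10/hour.
print(urls_labeled(10, 1.0))                # 600
print(annotation_cost(600, 1.0, 10.0))      # 100.0

# Three annotators on the same 600 URLs, to measure agreement.
print(annotation_cost(600, 1.0, 10.0, 3))   # 300.0

# The more realistic pace of 2 minutes per URL halves the output.
print(urls_labeled(10, 2.0))                # 300
```

Plugging in the more realistic 1.5-2 minutes per URL shows how quickly the per-label cost climbs once the annotator slows down.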
Mechanical Turk is an attempt by Amazon to make it easy for such tasks to be distributed to people who would perform them in return for small micropayments. However, it seems to me that the incentive for completing the task (or doing it well) is so minimal that there is very little enthusiasm around this product. I suspect the only people completing these HITs are individuals in countries where the dollar still has some value. In fact, there seems to be something fishy about at least a handful of the HITs that are high paying. It looks like the system is totally getting gamed by spammers - look at this example requesting 20 backlinks to a site ($3) or creating bogus accounts ($7) and the like.
The most interesting tool for manual annotation that I have ever seen is the ESP game. It is excellent, entertaining and, I must warn you, totally addictive. The game works by showing you an image and asking you to label it. You are randomly paired with another player and both of you get a score if a word matches. This totally ingenious way of collecting annotations for images means that the annotations come absolutely free of cost!
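The matching idea is simple enough to sketch in a few lines of Python. This is only my simplified guess at the mechanic, not the game's actual implementation: the function name `first_match` is hypothetical, and I am assuming the two players take turns typing guesses and that the first word both have entered becomes the label.

```python
from itertools import zip_longest


def first_match(guesses_a, guesses_b):
    """Return the first word that both players have typed, or None.

    Guesses are processed in alternating order (player A, then player B),
    which is an assumption made for this sketch.
    """
    seen_a, seen_b = set(), set()
    for a, b in zip_longest(guesses_a, guesses_b):
        if a is not None:
            a = a.strip().lower()
            if a in seen_b:      # B already typed this word
                return a
            seen_a.add(a)
        if b is not None:
            b = b.strip().lower()
            if b in seen_a:      # A already typed this word
                return b
            seen_b.add(b)
    return None


player_a = ["dog", "grass", "ball"]
player_b = ["puppy", "Ball", "park"]
print(first_match(player_a, player_b))  # ball
```

Because neither player can see the other's guesses, a match is decent evidence that the word really describes the image, which is what makes the resulting labels useful.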
To bring the cost of manual annotation to zero or close to it, the best incentive is to provide some value to the annotator. Of course, some tasks are so specific or specialized that this might be truly difficult without actually paying someone to go through it.
From a research perspective, when we build a classifier and use a UCI dataset, we have a good "gold standard" and an accessible body of literature that has studied the very same data. But as we deal with ever larger datasets, access to ground truth or manually verified samples is becoming even more challenging. The significance of using such samples is becoming harder to judge as well. What does it really mean to annotate less than 0.1% of the data (say you have a very large collection of blogs, images, graphs -- whatever social media content you can imagine)?
[Image Courtesy http://www.socialfiction.org/]