Contact Me


  • Akshay Java's Facebook profile

Disclaimer

  • Thoughts and comments expressed here are those of the author. Creative Commons License

research

June 20, 2008

Some things are just Semi-Social

Social Media is a lot about sharing. Prior to the growth of social software, it wasn't that people did not share stuff -- they just did it offline or via email. Now we share at a massive scale and a lot more easily. 

Some things we are willing to share "openly"

  • Music playlists (Last.fm)
  • Books we read (iread, shelfari)
  • Calendars and Travel plans (google calendar)
  • Status updates (via Twitter and Microblogging)
  • Restaurant recommendations (yelp)
  • Knowledge and expertise (via Wikipedia)

As we start to experiment with social software, we realize that sharing is good and soon become open to sharing a lot more. There are some things, though, that just seem semi-social. What I mean by Semi-Social is roughly "things I would not mind sharing with a small group of trusted friends and family members".

Until just a few years back there would have been a lot more people squirming if they were asked to share such 'sensitive data' with others. I see this perception slowly eroding away. There is a small, albeit enthusiastic bunch experimenting with new tools that fall into the category of Semi-Social. 

Some cases that I can think of are as follows:

  • Investment portfolio: One example is Covestor. I have an account there, but it is under a pseudonym. I would not be that enthusiastic about revealing my pathetic attempt to bet on the stock market by watching (mostly tech) blogs. Sigh!
  • TV watching habits: I think Television as we know it today is completely broken. There is no social aspect to it whatsoever. At ICWSM, Noor Ali-Hassan presented a paper on "Social Media Scenarios for Television". What struck me about this talk was her statement that "Despite its social nature, there is a private aspect of TV that people want to preserve".
  • Income and financial information: This is something we had least anticipated. How did we get to a point where I am actually not that scared while putting all my bank details and credit card information into a site like Mint? Mint is not a social site as such. But it reflects how we are now willing to part with some really sensitive data. In contrast, there are other examples of recruitment sites like SimplyHired where people reveal their salary information and can search for companies by salary. A more recent startup that is quite similar is Glassdoor.
  • Location: Location can be an extremely sensitive piece of information. Fortunately, Yahoo's Fire Eagle provides access control for various applications, and one can set the privilege that each app has to access location information (lat/long, zip, state, country, etc.).

There will always be some who are at the extreme end of the spectrum and are quite comfortable with being completely (publicly) transparent about "sensitive data". However, most would still only dare to share some of this data with close friends and select people -- i.e. if there is enough value proposition in it for them. Some would be comfortable with aggregate analysis over the data as long as they are not personally identified or targeted in some way (advertising or otherwise).

Although it requires a great deal of courage to work with privacy-sensitive data, the opportunity to invent in the semi-social space may be substantial.

June 10, 2008

These Tweets are from Mars

I am really enjoying the tweets from @MarsPhoenix. Of course, this isn't the actual robot sending Twitter updates from millions of miles away, but the researchers tweeting on its behalf are definitely engaging in some interesting conversations. This is one fantastic example of how large organizations can engage in Social Media.

The thought that we are having a conversation with a tiny bot makes the whole experience rather exciting. It wouldn't have been half as much fun if there were a human persona at the other end. This "bot persona" is more lovable, in part due to our collective imagination and our desire to be able to have an intelligent conversation with machines -- our R2-D2s and WALL*Es.

This has been a fantastic experiment in social psychology as well as a superb publicity approach. MarsPhoenix has about 20K followers, making it one of the most popular Twitter users. Accolades to JPL researcher Veronica McGregor for this terrific idea and for posting interesting updates.

I think that there is a lot more to this story. I imagine that soon we will have more devices that we can talk to directly on Twitter and IM. One idea I had recently was to rig up our lab's coffee machine, Mr. Capresso, with a temperature sensor so that he can automatically inform us when fresh coffee is brewed in the lab.

And at the cost of sounding much like Eliza, I think that for a limited domain, we might even have the capabilities to build Natural Language Generation tools that could automatically post Tweets. I am aware that there are many bots on Twitter. But the tools I would like to see are the ones that can do more than just post a message (like a new video on qik, etc.) -- true interaction would come only from conversations. A really wacky (but simple) example would be a poetic bot (yep! people have researched that too! ;-) ) that would send intelligible rhymes in response to @ messages. It might be quite hilarious to follow!

June 07, 2008

Quantifying Social Capital

Since attending the talk by Dr. Tufekci on "The New Social Physics", I have been thinking about social capital and how we can quantify it.



Social capital is a term used to describe the intrinsic value of social networks. Robert Putnam attributes social capital to "civic engagement" and treats it as an indicator of "communal health". Pierre Bourdieu describes how social capital explains how people find jobs through their social connections. Similarly, the idea of the strength of weak links describes the importance of social relations in job searches.

In general, social capital can be described in terms of bridging capital and bonding capital. Bridging capital is the notion that the chance, long-range social connections we build as part of our social interactions can help us connect with heterogeneous groups. The idea of "bridges" in social relations is similar to the notion of connectors in the book "Tipping Point" by Malcolm Gladwell. As Gladwell describes it:

"Connectors are people who link us to the world ... people with a special gift for bringing the world together"

So how can we quantify the connectors? According to the Wikipedia entry:

"There is no widely held consensus on how to measure social capital, which is one of its weaknesses."

One simple way to explain such relations is to use the analogy of bridges. These are the individuals who link across the different departments in your office; they are the researchers who have worked with people from other fields and universities. We all know such people and run to them to find our link to the "other side of the planet". One measure that might be useful in quantifying bridging capital is "edge betweenness". It indicates the number of all-pairs shortest paths that pass through an edge. In other words, if the edge is removed, many pairs of nodes need to follow a longer path to communicate with each other. Edge betweenness is also used in some community detection techniques: by iteratively removing the edges with the highest betweenness, we can identify the communities that exist in a social network.
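To make the edge betweenness idea concrete, here is a minimal sketch in plain Python. It assumes an undirected, unweighted graph and uses a Brandes-style accumulation; the helper names (`build_graph`, `edge_betweenness`) and the toy graph of two triangles joined by a single bridge are made up for illustration and are not taken from any particular library.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def edge_betweenness(adj):
    """Return {frozenset({u, v}): score} for every edge.

    The score counts the shortest paths between node pairs that use the
    edge; when a pair has several equally short paths, each path gets an
    equal fraction of the credit.
    """
    score = defaultdict(float)
    for source in adj:
        # Breadth-first search from source, counting shortest paths.
        dist = {source: 0}
        paths = defaultdict(float)
        paths[source] = 1.0
        parents = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes, pushing credit onto edges.
        credit = defaultdict(float)
        for w in reversed(order):
            for v in parents[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Every unordered pair was counted once from each endpoint.
    return {edge: s / 2.0 for edge, s in score.items()}


if __name__ == "__main__":
    toy = build_graph([
        ("a", "b"), ("b", "c"), ("a", "c"),   # first tight group
        ("d", "e"), ("e", "f"), ("d", "f"),   # second tight group
        ("c", "d"),                           # the bridge
    ])
    ranked = sorted(edge_betweenness(toy).items(), key=lambda kv: -kv[1])
    for edge, value in ranked:
        print(sorted(edge), value)
```

In the toy graph every path between the two triangles has to cross the c-d edge, so it gets the highest score. Removing the top-ranked edge and recomputing is the core step of the iterative community detection approach mentioned above.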

Bridging capital is easier to conceptualize than bonding capital. Bonding capital is defined as

"the value assigned to social networks between homogeneous groups of people"

Another way of thinking about bonding capital is as follows: if you were in urgent need of $500, which you promised to return the next day, whom would you turn to? The person most likely to lend you the money is one with whom you share a strong bond. In my opinion, one simple way in which bonding capital can be quantified is by measuring the "strength of ties", or the number of common relations you share with another individual. If we think about business partners, it is quite likely that they share the same set of social relationships, so their bonding capital will be quite high. By identifying the social relationships whose removal would have the least effect on the overall shortest path distance between other nodes, we can identify the edges that have high bonding capital.
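The "number of common relations" measure is easy to sketch. The snippet below is only an illustration under simple assumptions: the graph is undirected, tie strength is taken to be the raw count of shared neighbors, and the people in the toy graph (two business partners `p` and `q`, their shared contacts, and a loose acquaintance `r`) are invented.

```python
def common_neighbors(adj, u, v):
    """Number of contacts that u and v share."""
    return len(adj[u] & adj[v])


def tie_strengths(adj):
    """Map each edge (u, v) with u < v to its number of common neighbors."""
    return {
        (u, v): common_neighbors(adj, u, v)
        for u in adj
        for v in adj[u]
        if u < v
    }


if __name__ == "__main__":
    # p and q are business partners who share the contacts x, y and z;
    # r is an acquaintance of q only.
    adj = {
        "p": {"q", "x", "y", "z"},
        "q": {"p", "x", "y", "z", "r"},
        "x": {"p", "q"},
        "y": {"p", "q"},
        "z": {"p", "q"},
        "r": {"q"},
    }
    for edge, strength in sorted(tie_strengths(adj).items(), key=lambda kv: -kv[1]):
        print(edge, strength)
```

The p-q edge comes out strongest (three shared contacts) while q-r shares none, which matches the intuition that the former is a bonding tie and the latter is closer to a bridge.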

Here is a simple example from the classic karate club dataset. In this graph there are 34 nodes that represent the members of a karate club. During the study the group split into two, with half the members going to the founder of the club (node 34) and the rest to the instructor (node 1). Here we can find interesting examples of bridging and bonding capital. The edge between node 33 and node 34 reflects high bonding capital: both these nodes have several friends in common. On the other hand, links such as the one from 9 to 1 are examples of bridging capital.

I would also like to point out a very interesting piece of research by Dr. Tufekci on social relations in Facebook. Her study found that women use Facebook as a means to establish bonding capital, while men are trying to increase bridging capital. This is an excellent study on Facebook from a social science perspective and I would highly recommend reading it.

June 06, 2008

The Cost of Manual Annotation

For many machine learning tasks it can be quite difficult to get the "ground truth" data. In some cases the best way to verify the results is the painful, laborious and mind-numbing task of labeling data manually. When Pranam and I were working on the Splog detection task, we spent a good deal of time painstakingly and independently labeling whether each blog in a random sample was legitimate or spam. Several things make this task difficult:

  • Sometimes it is not clear whether something is a blog or a splog
  • You need to also look at the inlinks and outlinks
  • Plagiarized content makes it harder to judge authenticity
  • Sploggers are getting more sophisticated in the methods they are using

On average we spent about 2-3 minutes per blog and, in the end, were only able to hand label a small collection. In comparison to many other tasks this was still a relatively straightforward judgment. Consider the task of relevance ranking that NIST has to perform each year for the TREC tracks. Here the goal is for the annotators to figure out if a result is relevant for a query. The guidelines are strict and NIST has many professional annotators who are trained to perform these tasks. Even more complicated are some of the annotations that might be required in certain Natural Language Processing experiments. These can range anywhere from just verifying a parse tree or an output to actually constructing gold standards or hand-crafted parse trees. Moreover, some NLP tools require tremendous amounts of linguistic resources -- be it tediously constructing an ontology, lexicon or gazetteer lists, or identifying word senses. Many of these tasks require linguists or experts whose time might be quite valuable.

Let's consider a simple case where the annotator is asked to label a URL with a tag. Let's also say that it takes roughly a minute to load the page, quickly glance over it, make a judgment and then type in the appropriate labels. I know from experience that this is not a minute but more like 1.5-2 minutes on average (try it! It is a braindead, boring task, and if you are asked to do it continuously, you will slow down!). If, say, I can work 10 hours on this task without loss in the quality of my annotations, it would only result in 600 URLs being tagged, even at the optimistic rate of one per minute. UMBC pays, let's say, around $10/hour for on-campus jobs. That means we would spend about $100 just to label 600 URLs. I am not so sure that is how I would like to spend a hundred bucks! Additionally, just one human annotator is never sufficient. You always need to answer questions like: "So, what was your inter-annotator agreement?". Well, then you just blew another $200 or $300 on this task and still have just 600 URLs marked up. No wonder del.icio.us and Flickr are such amazing sources for free (yaay!), human-assessed labels and annotations. It works out great if you can use these instead.
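The back-of-the-envelope numbers above can be written down as a small calculation. This is only a sketch: the function names are made up, the wage and timing figures are the rough ones from the paragraph, and Cohen's kappa is used here as one common way to report inter-annotator agreement between two annotators (the post does not say which agreement measure would be used).

```python
def annotation_cost(hours, minutes_per_item, hourly_wage, annotators=1):
    """Return (items labeled per annotator, total cost in dollars)."""
    items = int(hours * 60 / minutes_per_item)
    cost = hours * hourly_wage * annotators
    return items, cost


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(annotation_cost(10, 1, 10))                # optimistic: one URL per minute
    print(annotation_cost(10, 2, 10))                # closer to the 1.5-2 minute reality
    print(annotation_cost(10, 1, 10, annotators=3))  # same 600 URLs, three annotators
    a = ["spam", "spam", "blog", "blog"]
    b = ["spam", "blog", "blog", "blog"]
    print(cohen_kappa(a, b))
```

Note that adding annotators multiplies the cost but not the number of labeled URLs, which is exactly the problem described above.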

Mechanical Turk is an attempt by Amazon to make it easy for such tasks to be distributed to people who would perform them in return for small micropayments. However, it seems to me that the incentive for completing the task (or doing it well) is so minimal that there is very little enthusiasm around this product. I suspect the only people completing these HITs are individuals in countries where the dollar still has some value. In fact, there seems to be something fishy about at least a handful of HITs that are high paying. It looks like the system is totally getting gamed by spammers: look at this example requesting 20 backlinks to a site ($3) or creating bogus accounts ($7) and the like.

The most interesting tool for manual annotation that I have ever seen is the ESP game. It is excellent, entertaining and, I must warn you, totally addictive. The game works by showing you an image and asking you to label it. You are randomly paired with another player and both get a score if a word matches. This totally ingenious way of collecting annotations for images means that the annotations come absolutely free of cost!
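The matching step of such a game can be sketched in a few lines. This is a toy model of the idea as described above, not the actual game's implementation: the function name and the sample guesses are invented, and a label is accepted only when both players typed it (ignoring case and surrounding spaces).

```python
def agreed_labels(guesses_a, guesses_b):
    """Labels typed by both players, in the order player A typed them."""
    typed_by_b = {g.strip().lower() for g in guesses_b}
    agreed = []
    for guess in guesses_a:
        label = guess.strip().lower()
        if label in typed_by_b and label not in agreed:
            agreed.append(label)
    return agreed


if __name__ == "__main__":
    player_a = ["Dog", "grass", "ball"]
    player_b = ["puppy", "ball", "dog"]
    # Only the labels both players came up with become annotations.
    print(agreed_labels(player_a, player_b))
```

Because the two players cannot communicate, a matching word is a reasonably trustworthy label for the image, and the players' enjoyment is what pays for it.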

To bring the cost of manual annotation to zero, or close to it, the best incentive is to provide some value to the annotator. Of course, some tasks are so specific or specialized that this might be truly difficult without actually paying someone to go through the data.

From a research perspective, when we build a classifier and use the UCI dataset we have a good "gold standard" and accessible body of literature that has studied the very same data. But as we are dealing with ever increasing size of datasets, access to ground truth or manually verified samples is becoming even more challenging. So is the significance of using it. What does it really mean to annotate less than 0.1% of the data (say you have a very large collection of blogs, images, graph -- whatever social media content you can imagine)?

[Image Courtesy http://www.socialfiction.org/]


May 28, 2008

The Trilogy of Social Networks Research in Four Parts

Following are some of the books that I highly recommend for anyone interested in the science behind Social Networks research. I like to call this set of books "A Trilogy of Social Network Research in Four Parts".





Linked
Dr. Albert-László Barabási is a pioneer in social networks research. The concepts of preferential attachment and scale-free networks were first proposed by Dr. Barabási. This has led to our understanding of how human communication works and of fault tolerance in real-world networks, and to the discovery of several algorithms that describe the growth of networks and community formation. Linked is the story of a researcher's quest for answers to complex phenomena, from the spread of viruses to the behavior of hubs. Both Linked and Sync are books that teach us how the simplest explanation is usually the best.

Six Degrees
Dr. Duncan Watts (Ph.D. student of Dr. Strogatz) presents an excellent look into the recent discoveries in network theory. The book is a tribute to all the academic work that went behind the discovery of small world phenomena, scale free networks and the theory behind search in such complex networks. I particularly enjoyed the book because being in school and working towards a Ph.D., I can really relate to the author's narration of the trials, tribulations and all excitement (yes!) of grad school.

Sync
Written by Dr. Steven Strogatz, this book was rated as the best of 2003 by Discover magazine. It talks about how synchrony emerges from the seemingly random and chaotic nature of the universe. It's a true science thriller that touches upon complex topics with ease and finesse. It is an inspiring book that truly reflects the passion of someone who is excited about his work and research. Dr. Strogatz has the ability to engage even someone who may have very little understanding of the subject and to describe complex theories in really simple terms.

Nexus
This was an interesting read that complemented Six Degrees and Sync quite nicely. Dr. Mark Buchanan talks about how networks that seem random are actually quite closely linked. The book is a journey from the early days of social networks research and Milgram's experiment on "six degrees of separation" to the most recent discoveries in Physics, Biology and Computer Science that deal with network theory.

Of these, I am currently reading Sync. I read both Linked and Six Degrees simultaneously and really enjoyed how the two books complemented each other and showed how two scientists approach the same problem in very different, and equally exciting, ways.

[Update]
This post should really have been titled "A Researcher's Guide to Social Networks: A trilogy in five parts" with the inclusion of "Tipping Point". However, despite it being a great book, I felt that Tipping Point was not as scientifically in-depth and hence decided to leave it at "A trilogy in four parts". But feel free to include Tipping Point in this reading list, since it is a book that highlights some important ideas and in many ways has made the subject appealing to a vast audience.

Oh BTW, Trilogy here was a reference to the three main underlying themes in these books: scale-free networks, small world phenomena and emergence/Synchronization in such systems.

May 22, 2008

Social Media Conferences and Workshops

ICWSM was a great hit! And now there is a growing number of conferences (WWW 08 social networks and Web 2.0) and workshops (DEBSM) for social media research. This is fantastic for the community as a whole: as more people get excited about working in this area, we can bet there will be significant advances in research and a better understanding of online communities and social media content. What I particularly love is the fact that Social Media research is an exciting interplay of computer science, social science, psychology and other related fields.

I decided to maintain a list of upcoming conference deadlines and venues for social media research. With a little help from the community, I will try and keep this list up-to-date and accurate. Here is an initial list of upcoming venues for the next few months, while I gather and organize all the deadlines in Google Calendar or something.
Please comment below or email me if you know of any other venues and I shall make sure to add them to this list. Hope to see you at the next conference/workshop!

May 16, 2008

Nonnegative Matrix Factorization

Sometimes you learn about a new mathematical technique that is so intriguing that it can be only described as "beautiful". Nonnegative matrix factorization is one such method that I did not know of until quite recently. The details of the method are available in the paper "Document Clustering Based On Non-negative Matrix Factorization" by Wei Xu, Xin Liu, Yihong Gong.   

The basic idea behind this method is that you want to factorize a matrix X into two smaller matrices U and V such that both U and V are nonnegative. This is achieved by minimizing the following objective function:

J = 1/2 ||X - U V^T||^2, subject to U >= 0 and V >= 0

So if we have a matrix X that represents a Term*Document matrix, it can be factorized into the two matrices U and V such that U signifies the Term*ClusterAssociation matrix and V transpose signifies the ClusterAssociation*Document matrix. Now, since the two matrices U and V are nonnegative, meaning all the elements in them are >= 0, we can identify the cluster to which a document belongs by taking that document's row of V and picking the dimension with the highest value.
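As a rough illustration of the factorization, here is a small pure-Python sketch that uses the standard multiplicative update rules, which are one common way to minimize a squared-error NMF objective. It is a simplified stand-in rather than the exact procedure from the paper (for instance, no normalization of U is performed), and the tiny term-by-document matrix, the function names and the parameter values are all made up for the example.

```python
import random


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def frobenius_error(X, U, V):
    """Squared Frobenius norm of X - U V^T."""
    approx = matmul(U, transpose(V))
    return sum((x - a) ** 2 for xr, ar in zip(X, approx) for x, a in zip(xr, ar))


def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factorize X (terms x documents) as U V^T with U, V >= 0."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    V = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    Xt = transpose(X)
    for _ in range(iters):
        # U <- U * (X V) / (U V^T V), element-wise
        XV = matmul(X, V)
        UVtV = matmul(U, matmul(transpose(V), V))
        U = [[U[i][j] * XV[i][j] / (UVtV[i][j] + eps) for j in range(k)]
             for i in range(m)]
        # V <- V * (X^T U) / (V U^T U), element-wise
        XtU = matmul(Xt, U)
        VUtU = matmul(V, matmul(transpose(U), U))
        V = [[V[i][j] * XtU[i][j] / (VUtU[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return U, V


def cluster_labels(V):
    """Assign each document to the dimension with the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in V]


if __name__ == "__main__":
    # Rows are terms, columns are documents; two obvious topic blocks.
    X = [[3, 2, 0, 0],
         [2, 3, 0, 0],
         [0, 0, 2, 3],
         [0, 0, 3, 2]]
    U, V = nmf(X, k=2)
    print("error:", frobenius_error(X, U, V))
    print("document clusters:", cluster_labels(V))
```

Because the updates only ever multiply nonnegative numbers, U and V stay nonnegative throughout, which is what makes reading cluster membership directly off the rows of V possible.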

Singular Value Decomposition (SVD) decomposes X into dense matrices that can contain negative elements, and it is not always intuitive what the basis vectors really signify. With NMF, however, the clusters are readily and directly available from the factorization. In addition, the sparsity makes this technique quite appealing.

In the following example, I have clustered the CLASSIC3 dataset, which is a standard corpus frequently used for evaluating different clustering methods. Notice how the three collections CISI, MEDLINE and CRANFIELD line up nicely along the three different axes.

I like this method for its simplicity and intuition and have been exploring its use in clustering blog/social data.

The Psychology of Social Networks (KQED Talk)

Just wanted to share a quick pointer to the KQED/NPR radio talk on "The Psychology of Social Networks" (via Meghavini Shah, thanks for the pointer!).

Radio host par excellence,  Michael Krasny talks to

  • B.J. Fogg, director of the Persuasive Technology Laboratory at Stanford University and the author of an upcoming book on the psychology of Facebook
  • Sam Gosling, assistant professor of psychology at the University of Texas at Austin

They cover a wide range of topics and discuss how social networks are changing the way we interact with each other. It is a really good show and I would highly recommend listening to it if you can.

Over the past few weeks, I have come across many interesting pieces of anecdotal evidence about our online and offline behaviors and how social networks have become such an important part of the equation. I thought I would share them in the context of this talk. Here are a few noteworthy examples:

  • Teenagers in India are socially quite comfortable expressing their relationship status on Facebook/myspace/orkut -- but would not reveal this information to their parents or family.
  • Social networks have provided a socially acceptable setting for "checking out" profiles. Arranged marriages in India are still fairly common and it is not unusual for people to check out the profile, scraps and testimony pages of prospective partners before actually meeting them in person. I guess the same is true for dating in general, people judge you by not just who you are and how you look but also who your friends are (and I guess even how they look) and what they have to say about you.
  • Coaches usually "friend" athletes on Facebook so that they can keep tabs on any parties that students have been going to and to check if they have a "red cup" in their hand (indicating that they have been consuming alcohol). Cell phone cameras are the easiest way for such information to leak onto Facebook. So parties these days have a "NO CELL PHONE" policy.
  • Don't assume that your school teachers or professors don't know what Facebook is! Students found cheating on exams have been completely baffled to see that their profs actually checked their FB profiles to see whether the students are friends, despite their claims of innocence and that they don't know each other.
  • Finally, at SocialDevCamp one really cool trend was that people were exchanging their Twitter ids more frequently than business cards. I am still enjoying the conversations that this community of users is having on Twitter. What would have been a one-off meeting is now a sustained community thanks to the power of social networks.

Footnote: Please consider supporting KQED or your local public broadcasting station, who bring to you such excellent programming.

May 09, 2008

News feed vs. blog posts vs. email

What is the difference in size distribution of a news wire vs. a blog post vs. email message?

The three images below compare the size distribution of news wires (Reuters collection), blog posts (from the ICWSM dataset) and email messages (Enron Corpus). The charts show the histograms of the size of the documents in these collections:

[Figures: document size histograms for the Reuters, blog post and Enron collections]

The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the Matlab scripts for power-law fits (thanks to Aaron Clauset).

[Figures: power-law fits for the Reuters, blog post and Enron email size distributions]

The linguistic properties of blogs, email and news stories are quite different, and this has already been highlighted in several research papers. While the three data sets are quite different in many ways, here I am analyzing just the size distributions. The important points to note are:

  • News wire stories are quite short
  • Blogs and emails are much longer and have a heavy-tailed distribution
  • Power law exponents for blog size distribution and email size distribution are quite similar (around 2.7)
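For readers who want to reproduce this kind of fit without Matlab, the exponent can be estimated with the standard continuous maximum-likelihood formula, alpha = 1 + n / sum(ln(x_i / xmin)). The sketch below is an assumption-laden simplification: it fixes xmin at the 2000-byte cutoff mentioned above instead of estimating it, it treats byte counts as continuous values, and it uses synthetic data (the function names and the sample size are made up) since the original collections are not included here.

```python
import math
import random


def powerlaw_alpha(sizes, xmin):
    """Maximum-likelihood exponent for the tail x >= xmin, with its standard error."""
    tail = [x for x in sizes if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if not tail or log_sum == 0.0:
        raise ValueError("need at least one value strictly above xmin")
    alpha = 1.0 + len(tail) / log_sum
    stderr = (alpha - 1.0) / math.sqrt(len(tail))
    return alpha, stderr


def sample_powerlaw(n, alpha, xmin, seed=0):
    """Draw n synthetic sizes from a continuous power law (inverse transform)."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic "document sizes" with a known exponent, plus a few small
    # documents that fall under the cutoff and are ignored by the fit.
    sizes = sample_powerlaw(20000, alpha=2.7, xmin=2000.0, seed=42) + [150, 800, 1999]
    alpha, stderr = powerlaw_alpha(sizes, xmin=2000.0)
    print("estimated exponent: %.3f +/- %.3f" % (alpha, stderr))
```

With real data, comparing the fitted exponents of two collections (and their standard errors) is one way to check a claim like "the blog and email exponents are quite similar".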

So...what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. Sometimes the initial news story is quickly reported by agencies like Reuters/AP. These are at times brief and to the point to allow readers to get a quick gist of its contents.

In contrast, blog posts tend to be much larger than news wires. Citizen journalism is full of opinions, thoughts and punditry, thus bloating the post. This also goes back to my previous analysis of the blog homepage size vs. Web page size. Indeed, the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).

What I had not expected was the similarity in the slopes for email and blogs. One thing to note, however, is that here the emails are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs. Recently, the industry's top usability expert, Jakob Nielsen, concluded that readers skim through and read at most 20% of the words on a Webpage. While there are millions of blog posts every day, there is very little time to read them all in detail. The volume of email is limited by a person's social network, but for blogs the act of prioritizing what to read is left entirely to the user. This essentially necessitates the use of Memetrackers and explains the popularity of filtering tools like digg, techmeme, etc. By summarizing popular blog posts and providing blurbs for these, such tools essentially act as a "social news wire service for the blogosphere".

April 22, 2008

Distinguished Speaker: Jiawei Han on "Research Challenges in Data Mining"

It was an honor to have Dr. Jiawei Han on our campus today. He was hosted by the Information Systems (IS) department as part of the distinguished speaker series. Dr. Han is a pioneer in the field of data mining and has written a book that I had used for the Data Mining class I took with Dr. Kargupta. In fact, Dr. Han even obliged me by autographing my copy :-D. (I know that's geeky, but isn't that cool?)

Here is a summary of his talk on "Research Challenges in Data Mining".

Dr. Han narrated an interesting episode in which Dr. Jim Gray brought in a hard disk and asked his students to guess its capacity. This was perhaps just 10 years back, and a 2G disk evoked a jaw-dropping response from the audience. According to Dr. Jim Gray, "We are in the era of Data Science". Science used to be experimental (observe the stars), then it moved to computational (run simulations), but now we are really moving towards "Data Science" (gigabytes, terabytes and petabytes of data all around us).

The main themes of the talk were:

1) Pattern Mining: Classification by finding frequent, discriminative patterns.

Dr. Han described an interesting experiment in which they analyzed feature length vs. information gain. What they found was that you get more information by combining features (2, 3, 4), while single features don't give you much information. On the other hand, a pattern that is too long is not very informative either. This has an important implication, especially in text mining, where using NGrams (bigrams and trigrams) has generally been found to be useful, but using too long a pattern is not that helpful.

Another analysis presented was of information gain vs. pattern frequency: frequent patterns have more information gain than less frequent ones. On the other hand, extremely frequent patterns have little gain (for example, stop words).

The implication of these results is that, when using SVMs and C4.5, classification accuracy can be improved by using discriminative patterns/feature selection. Dr. Han suggests that "It is not the number of patterns, it's the quality of patterns. It's not the more the better ... instead, the better the better." A few selected discriminative patterns are more useful than a lot of features for classification.
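The relationship between pattern frequency and information gain can be checked on a toy example. The sketch below computes the information gain of a binary "pattern present / absent" feature with respect to a class label; the eight-document dataset and the three patterns are invented purely to illustrate the three regimes described in the talk (rare, frequent and discriminative, and stop-word-like).

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def information_gain(has_pattern, labels):
    """Reduction in label entropy from knowing whether the pattern occurs."""
    n = len(labels)
    present = [y for f, y in zip(has_pattern, labels) if f]
    absent = [y for f, y in zip(has_pattern, labels) if not f]
    remainder = sum(
        len(part) / n * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0, 0, 0]          # four positive, four negative documents
    rare = [True] + [False] * 7                # occurs in a single positive document
    discriminative = [True] * 4 + [False] * 4  # occurs in every positive document
    stop_word = [True] * 8                     # occurs everywhere

    for name, pattern in [("rare", rare),
                          ("discriminative", discriminative),
                          ("stop word", stop_word)]:
        print(name, round(information_gain(pattern, labels), 3))
```

The rare pattern gives a small gain, the frequent discriminative pattern gives the full one bit, and the pattern that occurs in every document gives none, which mirrors the shape of the curve described above.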

2) Stream Data Mining: Classification for rare events.

When you have multiple streams of information and you cannot store all the data, but need to remember some of the history (in order to classify a new stream), it is helpful to use sampling techniques. Dr. Han discussed results from one of their recent papers on using biased sampling. The idea here is that you don't throw away positive data (all of it is stored, since there are very few samples). Then, by selectively sampling the negative data, one can use equal amounts of positive and negative data for training a classifier. In the case of multiple streams, using an ensemble approach to classification is suggested.
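A minimal sketch of the "keep every positive, subsample the negatives" idea is shown below. It is an illustration only: the class name and parameters are made up, and reservoir sampling is used for the negatives as one convenient way to keep a bounded uniform sample of a stream, which is an assumption on my part and not necessarily the sampling scheme from the paper.

```python
import random


class BiasedStreamSampler:
    """Keep all rare positive examples and a bounded sample of negatives."""

    def __init__(self, negative_capacity, seed=0):
        self.positives = []
        self.negatives = []
        self.capacity = negative_capacity
        self.seen_negatives = 0
        self.rng = random.Random(seed)

    def observe(self, item, is_positive):
        if is_positive:
            self.positives.append(item)  # positives are rare: store them all
            return
        # Reservoir sampling: every negative seen so far has the same
        # chance of being in the reservoir.
        self.seen_negatives += 1
        if len(self.negatives) < self.capacity:
            self.negatives.append(item)
        else:
            j = self.rng.randrange(self.seen_negatives)
            if j < self.capacity:
                self.negatives[j] = item

    def training_set(self):
        """All positives plus an equal number of sampled negatives (if available)."""
        k = min(len(self.positives), len(self.negatives))
        sampled = self.rng.sample(self.negatives, k)
        return [(x, 1) for x in self.positives] + [(x, 0) for x in sampled]


if __name__ == "__main__":
    sampler = BiasedStreamSampler(negative_capacity=50)
    for i in range(1000):
        sampler.observe(i, is_positive=(i % 100 == 0))  # 1% rare events
    data = sampler.training_set()
    print(len(sampler.positives), "positives,", len(sampler.negatives), "stored negatives")
    print(len(data), "training examples")
```

Memory stays bounded by the number of positives plus the reservoir capacity, no matter how long the stream runs.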

3) Information network analysis. Distinguishing objects with identical names. (eg author names, song names)

Name disambiguation is a very difficult problem. For example, there are 14 authors named Wei Wang on DBLP, some even with the same co-authors, same conferences, etc. In such cases, textual similarity cannot be used, since they are all basically in the same field. Dr. Han discussed how, using only information from DBLP, they make links powerful. The basic idea is to group references according to their similarity. By performing random walks along different join paths, they can group the authors and resolve the different name ambiguities in the graph. Often, in computer science and other fields, collaboration behavior within the same community shares some similarity, and this can be helpful in disambiguation.

Another research paper that Dr. Han highlighted is on "Truth discovery with multiple conflicting information providers on the web". The goal here is to algorithmically determine which piece of information on the Web is more trustworthy. The proposed approach is to model it as a tripartite graph of (websites, facts, objects).

The assumption is that there is just one true fact and that false 'facts' are introduced due to random factors (and not malicious intent). The domain was restricted to identifying the right book title, which is slightly easier than something like political opinions. The basic idea behind their approach is that a site is trustworthy if it provides many facts with high confidence, and a fact has high confidence if it is provided by trustworthy sites. Once again, this can be modeled as a matrix computation problem of identifying trustworthiness.
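The mutual dependence between site trustworthiness and fact confidence can be illustrated with a small iterative computation. The sketch below is a heavily simplified version of the idea as summarized above and should not be read as the published algorithm, which is more elaborate. The assumptions here are mine: a site's trust is the average confidence of its facts, a fact's confidence is the probability that at least one of its (independent) providers is right, and the site names, the initial trust of 0.5 and the book titles are invented.

```python
def truth_discovery(claims, iters=10, initial_trust=0.5):
    """claims maps each site to the set of facts it provides.

    Returns (trust per site, confidence per fact) after a fixed number of
    alternating updates.
    """
    trust = {site: initial_trust for site in claims}
    providers = {}
    for site, facts in claims.items():
        for fact in facts:
            providers.setdefault(fact, []).append(site)

    confidence = {}
    for _ in range(iters):
        # A fact is wrong only if every site providing it is wrong.
        for fact, sites in providers.items():
            doubt = 1.0
            for site in sites:
                doubt *= 1.0 - trust[site]
            confidence[fact] = 1.0 - doubt
        # A site is as trustworthy as the facts it provides, on average.
        for site, facts in claims.items():
            if facts:
                trust[site] = sum(confidence[f] for f in facts) / len(facts)
    return trust, confidence


if __name__ == "__main__":
    right = ("book-1", "Data Mining: Concepts and Techniques")
    wrong = ("book-1", "Data Minning Concepts")
    claims = {
        "site_a": {right},
        "site_b": {right},
        "site_c": {right},
        "site_d": {wrong},
    }
    trust, confidence = truth_discovery(claims)
    for fact, value in sorted(confidence.items(), key=lambda kv: -kv[1]):
        print(fact[1], round(value, 3))
    for site, value in sorted(trust.items()):
        print(site, round(value, 3))
```

In this toy run the title backed by three sites ends up with much higher confidence than the one backed by a single site, and the sites that provide it become more trusted in turn.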

Some of the other areas that were very briefly mentioned by Dr. Han were:

4) Mining moving objects: unrestricted, restricted (cars), scattered (cell phones and RFID)
5) Spatio-temporal multimedia clustering with obstacles
6) Blog mining, etc.
7) Data cubes
8) Visual data mining
9) Biological data mining
10) Data mining for software engineering/bug mining

This was a great talk and very informative. I really liked how Dr. Han gave the basic intuition behind their methods and yet managed to give a broad overview of some of the challenges in the field.
