
July 17, 2008

Clustering Triples from Social Data

By far, the most prevalent data available in social media is tagging information. For example, in del.icio.us a user may tag a URL, or in Flickr she may tag an image. One of the questions that comes up is how to then cluster social data that is rich in tags. Some available techniques ignore the user information and use only a bipartite graph consisting of tags and URLs. Another method is to represent two pieces of evidence (user-tag; tag-URL) in a tripartite graph (where nodes are of three different types: users, tags and URLs). However, even this type of structure misses the higher-order relation between the three nodes. Note that the information available is really in triples of the type <user, tag, url>. This information is not captured by the tripartite graph model. In particular, two users may be connected via a common tag even if the actual URLs they bookmarked are vastly different.

There are some techniques using Tensor Matrix Factorization that can handle such data. However, the question of how to deal with triple (or higher) information from social data is quite interesting. Moreover, being able to do so efficiently and in an online fashion would also be important. I believe that this topic may be of significant interest in the upcoming social media and data mining conferences. The implications of these techniques would be in building better recommendation systems and personalization algorithms.
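To make the information loss concrete, here is a minimal sketch (not any particular clustering method) that projects a set of triples onto the pairwise user-tag and tag-URL edges of a tripartite graph and then tries to recover the triples from those edges. The users, tags and URLs are made up for illustration.

```python
from itertools import product

# Hypothetical bookmarking data: each record is a <user, tag, url> triple.
triples = {
    ("alice", "python", "python.org"),
    ("bob", "python", "montypython.com"),
}

# Tripartite-graph view: keep only the pairwise user-tag and tag-url edges.
user_tag = {(u, t) for u, t, _ in triples}
tag_url = {(t, r) for _, t, r in triples}

# Try to recover the triples by joining the two edge sets on the shared tag.
reconstructed = {
    (u, t, r)
    for (u, t), (t2, r) in product(user_tag, tag_url)
    if t == t2
}

# Triples the pairwise view implies but that were never actually observed.
spurious = reconstructed - triples
print(sorted(spurious))
```

The join links alice to the Monty Python site and bob to python.org, even though neither bookmark exists. Keeping the data as triples (for example as the non-zero entries of a sparse user x tag x URL tensor) avoids that ambiguity.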

[Thanks Vlad Korolev for some of the discussions related to this post]

July 07, 2008

Shiny New Toys at Searchme!

A quick note:

Searchme launches cool new Video and Flickr search tools. This is combined with the power of stacks. Stacks are a neat way to organize relevant results, be it videos, images or Webpages, into bundles. (How often have you wondered -- "if only I could recall that search term that gave me the right answer"? Well, now just stack it!) Searchme provides a feature that allows users to share stacks they have created on their blog or as a Facebook page. This is really neat. What's even better is that since the stack is AJAX based, it is dynamically updated whenever new results are added to it. Check out the stack titled "Colbert" I created as an example. The best part of visual search is that you can quickly scan for documents and media and identify the stuff you are most interested in.

Another really neat feature in Searchme is its Classification engine. For example, did you know that another celebrity who shares a similar name is Claudette Colbert, a 1930s actress? I would never have found this trivia had it not been for the movies section that Searchme identified for my query "colbert"!

July 04, 2008

Recency and Interestingness in Search

I was looking at the popular searches, as ranked on Google trends. Almost instantly one thing caught my attention -- a whole bunch of these search terms are those related to recent events or news. This should not be too surprising at all and it has also been well studied in the literature [1].


But what really struck me was that when we look at the search results, as ranked by any of the search engines -- the important factor there is relevance and PageRank (or a variant thereof). Google Trends itself shows News results, Blog Search results and Web results. News search results and blog results show what the most recent items are: it's all about what is happening right now.

To me, there seems to be a gap that we are not focusing enough on. A common reason why a user searches for a term like "Ingrid Betancourt" is to find out and read more about some recent events that may be in the News. For example, a user who is unfamiliar with the history of Betancourt's struggles might want to look up her background on Wikipedia, but then might be more keen on investigating the recent stories and the reason why her name is in the News. Once that is done, the interest shifts to knowing what people have to say about this event.

However, none of the search results really help users explore what they are actually looking for -- which is interesting stories and blog posts about recent events. What I mean by this is that we need, at the very least, 100 different techmemes constructed on the fly for popular search queries. Alongside, there can be results from Wikipedia and other sources that might inform the user about the background of the search. However, for the most part the links and snippets shown here should really be the ones that can be ranked as interesting stories in a techmeme-like fashion: not purely on the basis of relevance, not only on the basis of time (or a list of articles since the event that triggered the rise in search volume), but a compilation of recent news stories that are interesting.
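As a very rough sketch of what such a ranking could look like, the snippet below multiplies a relevance score by an exponential recency decay and a log-scaled link count. This is only one possible way to combine the three signals, not how techmeme or any search engine actually ranks. The `Story` fields, the 24-hour half-life and the sample stories are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    relevance: float  # query-match score in [0, 1], assumed to come from a search engine
    age_hours: float  # time since the story was published
    inlinks: int      # blogs/news sites linking to it, used here as a proxy for "interesting"

def interest_score(story, half_life_hours=24.0):
    """Combine relevance, recency and buzz into one score (hypothetical formula)."""
    recency = 0.5 ** (story.age_hours / half_life_hours)  # halves every half_life_hours
    buzz = math.log1p(story.inlinks)                      # diminishing returns on links
    return story.relevance * recency * buzz

stories = [
    Story("Background encyclopedia entry", relevance=0.9, age_hours=24 * 365, inlinks=500),
    Story("Breaking report, widely linked", relevance=0.8, age_hours=6, inlinks=120),
    Story("Breaking report, nobody links", relevance=0.8, age_hours=6, inlinks=1),
]

ranked = sorted(stories, key=interest_score, reverse=True)
for story in ranked:
    print(story.title)
```

With these numbers the widely linked recent story comes first and the old background page last, which is why background material such as Wikipedia would be better shown alongside the ranked list rather than inside it.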

I'm adding this to my "One more interesting thing to hack up" list.

[1] Why We Search: Visualizing and Predicting User Behavior, Eytan Adar, Daniel Weld, Brian Bershad, and Steven Gribble

June 26, 2008

Evri: Search Less, Understand More

I just received the beta invite to Evri.com (Yaaayy!). It is a really cool site that aims to help people find information. Right now they just have a browse interface. You can see the top concepts and named entities (primarily from News sources) and navigate through semantically related terms. The main idea behind their approach is that you can construct the graph of all the concepts and entities by analyzing the text. Here is an example of the top names in the news. Clicking the terms (from the graph) "Barack Obama" and "Ralph Nader", for example, would pull up all the stories related to recent controversies.

One can browse through the graph or the popular terms. I checked out what they found on Obama. Here is a snapshot on the left. I think that a really neat trick that Evri is using is the idea that working on sentence-level semantics can provide sufficient meaning to help organize information. Constructing a complete parse tree that is both syntactically and semantically accurate is a difficult problem. There are many vagaries of natural language text that make this challenging. Evri, at least for now, bypasses some of these problems by organizing information around simple questions like "what is Obama doing?" which can have easy-to-identify clues directly accessible from the text (criticizing, leading, denying, facing....). Similarly, for other entities like organizations one can ask "What is happening with Yahoo?" (bidding, reject, acquire, etc.).


This is a fascinating approach to organizing information and I think that Evri has great potential. Let's think about it for a minute. One of my favorite pastimes is to go to Wikipedia, pull up a random article and then browse through related articles. It is this serendipity and the feeling of chance discovery of something interesting that is so compelling about Evri.

Evri also reminded me of the way I had hoped to implement SemNews, a semantic search engine that analyzed RSS snippets of News articles and processed them through OntoSem, an ontological-semantics-based Natural Language Processing system. Once the semantics/meaning representations were extracted, I would store the meanings in an OWL store so that RDQL queries could be performed to find relevant news items. I believe that the way we can accomplish Dr. Tim Berners-Lee's vision of the Semantic Web is by advancing both information extraction (web scraping, entity annotation, etc.) and NLP techniques that would automatically annotate text and make it available in machine-readable format.

Although the founder claims that they are not a search engine, they surely join the group of NLP-based startups like Powerset and Hakia. Another powerful tool is Freebase, which primarily uses Wikipedia as its source of information. Finally, it is also worth mentioning that Kosmix is yet another startup that aims to "Organize the Web so that you can explore, learn and discover".

The next obvious question that comes up is regarding the monetization and business model of these startups. Of course, the story goes... the information is more focused so ads would be more relevant... and no surprise, that is indeed so TRUE. Just check out some of the advertisements on Evri. On the left is a screenshot of an advertisement on Barack Obama's info page. But I think there is an opportunity beyond simply relevant advertising!

Many companies have huge websites with lots of information -- some organized and most not quite as much. If you wanted to ensure that your customers are able to get to the exact information they need, an Evri-like approach can be ideal to help them browse through the various facets to get to what they really need. Applications in enterprises and enterprise search could be another monetization platform for Evri.

Finally, IMHO, some hurdles that Evri faces could be dealing with noisy text, especially with Social Media. Many approaches that rely on linguistic or grammatical correctness of sentences simply fail miserably when dealing with social media content. The second problem might be ensuring coverage of information. Right now, it seems to me like the News sources Evri relies on are primarily US centric. As they aim to capture more of the audience outside the US as well, they would have to concentrate on foreign languages, disambiguating named entities and location names. These are all interesting research problems and fun stuff to work on!

June 22, 2008

Searching for Math Documents on the Web

There are a number of vertical search engines and several specialized tools like Google Scholar that help academics search for scholarly works on the Web. While searching for some documents on an optimization problem, I was curious if there was a good way to search for Mathematical documents on the Web. Most search engines do not have any special means to search for equations or support for special mathematical symbols (like summation or exponent). Many times one knows the exact term to look for; however, sometimes you might be more familiar with an equation or might try to find special terms in a document. I wonder how Mathematicians and Physicists find what they are looking for?

LaTeX is a widely used authoring tool in academia and I believe that support for LaTeX queries would be extremely beneficial for search. At the very least, it would be great if Google Scholar could support this. There are a few prototypes online, but none that I am aware of have the coverage or are free or currently functional.

I see many challenges in parsing and representing mathematical data and it seems like an interesting problem that researchers in information retrieval would have definitely looked into. I would appreciate if anyone familiar with this topic can share some good references. 

(Photo courtesy darkmatter)

June 19, 2008

Email Interview: Nihaar Gupta, Youlicit

Nihaar Gupta, VP of Product Development at Youlicit, has kindly agreed to an email interview with SocialMedia Research Blog. Following are his responses to some of the questions I had for him:

1) Please describe Youlicit to us?

Youlicit at its core is a discovery engine (http://blog.youlicit.com/?p=23). We want to connect you to the most relevant and recommended information as effortlessly as possible. As of now, we are building a technology that allows a user to find the most recommended sites (recommended by people around the web) related to a given URL. We believe that people are the best judges of content and more often than not, the information you are looking for has been found by someone before. Our goal is to aggregate that information and allow the user to access it with the click of a button.


2) Tell us about your background, the team behind Youlicit and how you started it?

Youlicit came about as a result of trying to solve our own frustrations with trying to find information on the web. With the enormous amount of user-generated content and annotations on the web, we saw a huge amount of valuable data that was inaccessible and fragmented. For the sake of brevity, the team bios & background are here http://blog.youlicit.com/?page_id=6


3) Please give us a brief overview of the technology behind Youlicit?

Youlicit aggregates user annotations of websites and other user generated content and analyzes it to create a URL-URL mapping of websites based on relevance and quality. Using this mapping we are able to deliver related and recommended sites to a user with a click of a button.


4) While using Youlicit plugin, I felt that one of the challenges is the coverage -- how do you plan to address this and build your current index?

We are constantly working on improving our coverage. There are two metrics we strive to maximize for our results, quality and relevance. In regards to quality, we’re always looking to increase our database of “quality sites” by tapping into the various kinds of user annotations (denoting quality content) that exist on the web (bookmarks, tags, votes, comments). In regards to relevance, we’re always researching novel ways to extrapolate connections between websites and map URL’s back to our database of “quality sites”.


5) How do you rank the 'Enhanced Links' in the plugin? Do you also take into account how many users actually click through the suggested links?

Each result in the Youlicit More widget (and on Youlicit’s site) has a score based on the metrics above, quality of the site and relevance to the item being queried. We are looking into ways of scoring the results from implicit/explicit feedback that we get from users (clicks, recommends).


6) How do you ensure that the Enhanced links feature is non-intrusive?

The current version has manifested itself after a few weeks of alpha testing with a handful of bloggers. That said, we are still looking for feedback on the user interface and would love to hear opinions on how to make it more useful and less intrusive for bloggers/blog readers.


7) How would you compare the plugin to Sphere's related blog posts?

While Sphere focuses on related & recent blogosphere content, we, at Youlicit, are trying to provide the blog reader with more seminal information related to the blogger’s topic of conversation. For instance, if you are reading a blog entry on global warming, you are more likely to receive the most recommended articles (blogs, sites, essays) on global warming from around the web rather than recent blog entries on that topic.


8) What are the other features on Youlicit?

Youlicit’s primary product is a Firefox extension that allows a user to access our results during his/her browsing experience. We are in the process of redesigning our website and streamlining the current offering to focus on this button. Down the road we would like to be able to deliver personalized recommendations for users as well as connect users to people based on transient and long-term interests (ideally using a person’s interests to enhance his/her social graph).


9) Would the plugin be advertising supported?

We do see advertising as a very possible source of monetization. Given the fact that we are providing contextually relevant information, the search model of advertising applies nicely. We are also exploring other possible means of monetization but as of right now the priority is to build something that people find useful.


10) What are the next things to look out for at Youlicit?

As I mentioned above, we are stripping down Youlicit to bring the focus back to its core: the Youlicit More functionality via the Firefox extension and blog widget. We expect to release a newly designed website very soon. And as always, we love to hear feedback on what you think so far and how we can improve.


Youlicit: Search Less and Find More!

Youlicit is a new tool that helps you "Search less and find more". Often we forget that search is only one means to find what we are looking for. Even search by itself is not the endpoint of an information need or a query. This tool reminds me of the "berry picking model" of Information Retrieval that I first read about in my IR class. The model basically says that:

Information need is not satisfied by a single set of documents but by bits and pieces found along the way.

The paper titled "The Design of Browsing and Berrypicking Techniques for the Online Search Interface" describes a searcher as

Moving through many actions towards a general goal of satisfactory completion of research related to an information need.

What Youlicit does is provide this ability implicitly, without the reader (or more generally a searcher) having to go through the trouble of navigating and mentally processing hyperlinks or firing search queries to find related content. Youlicit takes care of all that on your behalf. By providing a simple plugin, the Youlicit widget automatically highlights some of the related, relevant links and provides useful suggestions -- all without your audience ever leaving your blog. I love the idea and the neat implementation that these guys have built. (The very same need was what led me to hack this Wikipedia related widget a few weeks earlier.)

On the Youlicit site, you have lots more interesting tools. You can discover new content that is relevant to your interests, find related users and share links with them or follow their interests. Youlicit is paving the way for social browsing tools and is a neat concept that is well implemented. Their index does not seem to be very large at the moment and I feel that it would get better as they start to seriously scale up. In the interim, I feel that there might be stopgap solutions that they could employ -- for example, the Alexa related URLs for the links that are not currently in Youlicit's index.

In relation to this plugin, one similar tool is the Sphere plugin that shows related blog posts. I feel that Sphere serves a complementary need. From what I understand, Youlicit aims to find the interesting blogs and Web URLs one might want to look into in relation to a given hyperlink.

Another plugin is the Snap plugin -- which shows a screenshot of the outlink. However, in my opinion Snap does not really serve much purpose and is a bad tool from a usability perspective.

Youlicit is non-intrusive and you are gonna enjoy the serendipity of finding interesting new links! Give it a spin!

May 09, 2008

News feed vs. blog posts vs. email

What is the difference in size distribution of a news wire vs. a blog post vs. email message?

The three images below compare the size distribution of news wires (Reuters collection), blog posts (from the ICWSM dataset) and email messages (Enron Corpus). The charts show the histograms of the size of the documents in these collections:

[Figures: document size histograms for the Reuters, blog post and Enron collections]

The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the MATLAB scripts for power-law fits (thanks to Aaron Clauset).

[Figures: power-law fits for the Reuters, blog post and email size distributions]

The linguistic properties of blogs, email and news stories are quite different and this has already been highlighted in several research papers. While the three data sets are quite different in many ways, here I am analyzing just the size distributions. The important points to note are:

  • News wire stories are quite short
  • Blogs and emails are much longer and have a heavy tail distribution
  • Power law exponents for blog size distribution and email size distribution are quite similar (around 2.7)
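As a rough illustration of how an exponent like the 2.7 above can be estimated, here is a minimal sketch of the standard maximum-likelihood estimator for a continuous power law with a fixed lower cutoff. It is not the MATLAB code used for the fits in this post. The synthetic sample, the function names and the random seed are assumptions made for the example; only the 2000-byte cutoff comes from the text.

```python
import math
import random

def sample_power_law(n, alpha, xmin, rng):
    """Draw n values from p(x) ~ x^(-alpha), x >= xmin, by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def fit_exponent(sizes, xmin):
    """Maximum-likelihood exponent: alpha = 1 + n / sum(ln(x_i / xmin)) over x_i >= xmin."""
    tail = [x for x in sizes if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

# Synthetic "document sizes" in bytes, standing in for a real corpus.
rng = random.Random(0)
sizes = sample_power_law(50_000, alpha=2.7, xmin=2000.0, rng=rng)

alpha_hat = fit_exponent(sizes, xmin=2000.0)
print(round(alpha_hat, 2))
```

On real data the cutoff itself usually has to be chosen carefully (for example by comparing the fitted and empirical distributions), and a goodness-of-fit test is needed before claiming the tail really follows a power law.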

So...what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. Sometimes the initial news story is quickly reported by agencies like Reuters/AP. These are at times brief and to the point to allow readers to get a quick gist of its contents.

In contrast, blog posts tend to be much larger than news wires. Citizen journalism is full of opinions, thoughts and punditry, thus bloating the post. This also goes back to my previous analysis of the blog homepage size vs. Web page size. Indeed the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).

What I had not expected was the similarity in the slopes for email and blogs. One thing to note however is that here the emails are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs. Recently, the industry's top usability expert Jakob Nielsen concluded that readers skim through and read at most 20% of the words on a Webpage. While there are millions of blog posts every day... there is very little time to read them all in detail. The volume of email is limited by a person's social network, but for blogs the act of prioritizing what to read is left entirely to the user. This essentially necessitates the use of Memetrackers and explains the popularity of filtering tools like digg, techmeme etc. By summarizing popular blog posts and providing blurbs for these, such tools essentially act as a "social news wire service for the blogosphere".

May 08, 2008

"Personal Brand" Monitoring Tools

Dr. Finin pointed to this interesting post on "branding yourself with a blog":

“… Certainly personal branding isn’t a new concept, but the future of personal branding could be in at your fingertips—with a blog. One of the first steps in creating a brand for yourself is to make your blog visible. Post meaningful entries, comment on your industry’s top blogs, or simply gain a regular readership. “Visibility creates opportunities,” says Schawbel, a social media specialist at EMC Corporation. He believes that when you brand yourself, the competition becomes irrelevant. “The goal of personal branding is to be recruited based on your brand, not applying for jobs,” Schawbel says. …”

Many brand monitoring startups are helping big companies keep track of what their (potential) customers have to say about them or their products. While the space of corporate brand monitoring is fiercely competitive, one area that is overlooked is that of "personal branding" tools. Most of us are highly interested in knowing what is said about us online. As the TechCareers blog points out:

“You are the chief marketing officer for the brand called you, but what others say about your brand is more impactful than what you say about yourself,” says Schawbel.

Keeping an eye on what others have to say about you is not always easy. I started thinking about these issues and outlined how I try to keep up with this information. Here is my "Personal Brand Monitoring Toolbox":

  1. Search Engines: The typical way for me to keep tabs on this is by setting up Google alerts for my name, projects, organization (University/workplace) etc. In addition, I frequently perform "ego searches" to forage for mentions of my name.
  2. Statistics and Tools: One very interesting tool that I have found useful is Lijit. It provides you stats on who is searching for you, what keywords were used to reach your blog, etc. In addition I use Google Analytics to know more information about my visitors, most visited pages and time they spent on my site. If you are an academic like me, you would like to know who has cited your papers recently (Google Scholar) and the number of downloads, who has linked to your paper (Google link: search) and/or your blog posts (Technorati searches). Yessss! I admit! I have become a total statoholic! :-)
  3. Comments and Scraps: Twitter is another important tool in our arsenal for personal branding and your replies say something interesting about you. Finally, the comments on my blog, Facebook messages, scraps and photos are all part of my "brand" and I take interest in replying to them just like I would to an email.

As our information spaces diversify, monitoring "your brand" becomes a part of everyday online activity. I don't think we have exactly cracked the nut yet -- keeping track of your profile and "your brand" is a highly addictive activity and I think that the tool(s) that make it fun and exciting will enjoy a great deal of popularity.

May 02, 2008

Leveraging Web and Social Media for Recommendations

Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I do not have a lot of familiarity with recommendation systems literature, it seems obvious that the Web and Social Media are a great source of information that could be useful when building such systems. Bloggers' profile pages, wishlists, Netflix queues, book lists and the blog posts themselves are potential clues to learn which two items may be related to each other.

As a simple example, consider the movie "Pulp Fiction". By querying Google for all the inlinks to the IMDB page of Pulp Fiction and counting which other movies are "co-cited", here is a list of five movies that are most likely to be related to "Pulp Fiction":

Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise though was LOTR. I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case at least.

Just for fun, here is another example with "Sin City", another one of my favorite movies.

Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".
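The co-citation counting described above could be sketched as follows. This assumes you already have, for each blog post or Web page, the set of movie pages it links to (gathering those links is the expensive part). The page data and movie identifiers below are placeholders, not real results.

```python
from collections import Counter

def co_cited(pages, target, top_n=5):
    """Count which items are linked from the same pages that link to `target`."""
    counts = Counter()
    for links in pages:
        if target in links:
            # Every other item on this page is co-cited with the target once.
            counts.update(item for item in set(links) if item != target)
    return counts.most_common(top_n)

# Hypothetical outlink sets, one per crawled page (placeholder identifiers).
pages = [
    {"target_movie", "movie_a", "movie_b"},
    {"target_movie", "movie_a"},
    {"movie_b", "movie_c"},  # does not cite the target, so it is ignored
]

related = co_cited(pages, "target_movie", top_n=2)
print(related)
```

Raw counts favor movies that are popular everywhere, so a real system would probably normalize by how often each movie is cited overall before treating a high count as evidence of relatedness.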
