Contact Me


  • Akshay Java's Facebook profile
    Disclaimer

    • Thoughts and comments expressed here are those of the author. Creative Commons License

    May 11, 2008

    SocialDevCamp Trip Report

    SocialDevCamp totally rocked! The event was best described as:

    SocialDevCamp East is the Unconference for Thought Leaders of the Future Social Web

    Where is the social web going? It's going mobile, to geocentric services, and to open platforms. Join a community of like minded developers, social media gurus and thought leaders for an unconference to discuss the future of the social web.

    Here is my trip report from this event:

    1. Innovation in Social Media is being fueled by brilliant people who are running some really successful startups.
    2. The "Amtrak Corridor" has a ton of talent. There were some really amazing people I met at this event -- some who came down from NY, Philadelphia and even Boston. There was a strong sense of community and entrepreneurial brotherhood, if you will.
    3. The startup scene on the East coast is quite different from that on the West. There is very little VC funding here since most VCs in this area are super conservative. In fact, among the audience members who attended the "Who Needs VCs" session, perhaps only two companies had taken funding.
    4. Location! Location! Location! Dave Troy gave a great talk on geolocation and his vision for the openlocation initiative. I think we will see a number of startups in this space and it will be exciting to see how devices like iPhone and others change how we find location relevant information.
    5. "Semantic Web" is no longer just a vision. The session "Social Media and Semantic Web", proposed and presented by our very own eBiquity alum Dr. Harry Chen, was one of the most attended sessions. It says something BIG is about to happen when a bunch of really smart entrepreneurs are interested in Semantic Web. Bear from Seesmic shared his thoughts and talked about how they were using Semantic Web technologies.
    6. Amazon EC2 and S3 offer a fantastic alternative for startups. It is the best way to scale up your product and the benefits of using EC2 outweigh the costs.
    7. Twitter is the new business card. I have said this before: business cards have a very short lifespan. SocialDevCampers instead preferred to exchange their Twitter handles and add folks by checking the #socialdevcamp interest on Twitter. And I think we may have finally convinced Harry about the utility of Twitter!
    8. Twitter was the best backchannel at the conference. In all the sessions, we talked about some great sites and shared resources we all knew collectively. The Twitter streams recorded all the highlights of each session for posterity. Even if you did not attend a session now you know a few things that were discussed there.
    9. Lots of iPhones. Photos, Videos, Twitter messages were flying everywhere. There was even a session on the iPhone development (unfortunately I missed that one; it was tough deciding which session to attend when all seemed so good).
    10. Techies know how to party hard! Thanks to the After Party sponsors, geek talk ruled at the local bar Brewers Art. This is a fantastic local microbrewery that serves beers with 7% to 10% alcohol. :-) Beer + Free Food + Techies = one hell of a party!

    Those who did not make it really missed out on a great event. But fret no more. Soon there will be an announcement for the SocialDevCamp being held in the Fall -- no excuses that time!

    A special thanks to the organizers and sponsors for supporting this event and bringing us all together. (Blogs covering this event)

    May 09, 2008

    News feed vs. blog posts vs. email

    What is the difference in size distribution of a news wire vs. a blog post vs. email message?

    The three images below compare the size distributions of news wires (Reuters collection), blog posts (from the ICWSM dataset) and email messages (Enron Corpus). The charts show histograms of the document sizes in these collections:

    [Figures: document size histograms for the Reuters, blog post and Enron collections]

    The three distributions above (ignoring documents smaller than 2000 bytes) were fitted using the MATLAB scripts for power-law fits (thanks to Aaron Clauset).

    [Figures: power-law fits for the Reuters, blog post and email size distributions]

    The linguistic properties of blogs, email and news stories are quite different, and this has already been highlighted in several research papers. While the three data sets are quite different in many ways, here I am analyzing just the size distributions. The important points to note are:

    • News wire stories are quite short
    • Blogs and emails are much longer and have a heavy tail distribution
    • Power law exponents for blog size distribution and email size distribution are quite similar (around 2.7)
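
For readers who want to try a similar fit without MATLAB, here is a minimal Python sketch of the continuous maximum-likelihood estimate of a power-law exponent, alpha = 1 + n / sum(ln(x / xmin)), computed over the values at or above a cutoff. The 2000-byte cutoff mirrors the one mentioned above; the function names and the synthetic sample are assumptions made for illustration, and this is not the script that produced the fits in this post.

```python
import math
import random


def powerlaw_alpha(sizes, xmin=2000):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Uses alpha = 1 + n / sum(ln(x / xmin)) over the values x >= xmin.
    """
    tail = [x for x in sizes if x >= xmin]
    if not tail:
        raise ValueError("no values at or above xmin")
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("all tail values equal xmin; exponent is undefined")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(n, alpha, xmin=2000, seed=0):
    """Draw n synthetic sizes from a continuous power law (inverse-transform sampling)."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic data only: real input would be the byte sizes of the documents.
    sizes = sample_powerlaw(50_000, alpha=2.7)
    print(round(powerlaw_alpha(sizes), 2))
```

On real data the estimate depends heavily on the choice of cutoff, so it is worth repeating the fit with a few different values of xmin before reading much into the exponent.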

    So...what does this mean? It is fairly obvious that news wire stories are quite short due to the nature of reporting. Sometimes the initial news story is quickly reported by agencies like Reuters/AP. These are at times brief and to the point to allow readers to get a quick gist of its contents.

    In contrast, blog posts tend to be much larger than news wires. Citizen journalism is full of opinions, thoughts and punditry, thus bloating the post. This also goes back to my previous analysis of the blog homepage size vs. Web page size. Indeed, the contribution of blogs has been reported to be 4-5 times that of edited text (like the news wires).

    What I had not expected was the similarity in the slopes for email and blogs. One thing to note, however, is that here the emails are aggregated across a number of different users. This is an important distinction. While a single user may receive a few hundred emails, they potentially have access to millions of blogs. Recently, industry's top usability expert Jakob Nielsen concluded that readers skim through and read at most 20% of the words on a Webpage. While there are millions of blog posts every day... there is very little time to read them all in detail. The volume of email is limited by a person's social network, but for blogs the act of prioritizing what to read is left entirely to the user. This essentially necessitates the use of Memetrackers and explains the popularity of filtering tools like digg, techmeme etc. By summarizing popular blog posts and providing blurbs for these, such tools essentially act as a "social news wire service for the blogosphere".

    May 08, 2008

    "Personal Brand" Monitoring Tools

    Dr. Finin pointed to this interesting post on "branding yourself with a blog":

    “… Certainly personal branding isn’t a new concept, but the future of personal branding could be in at your fingertips—with a blog. One of the first steps in creating a brand for yourself is to make your blog visible. Post meaningful entries, comment on your industry’s top blogs, or simply gain a regular readership. “Visibility creates opportunities,” says Schawbel, a social media specialist at EMC Corporation. He believes that when you brand yourself, the competition becomes irrelevant. “The goal of personal branding is to be recruited based on your brand, not applying for jobs,” Schawbel says. …”

    Many brand monitoring startups are helping big companies keep track of what their (potential) customers have to say about them or their products. While the space of corporate brand monitoring is fiercely contested, one area that is overlooked is that of "personal branding" tools. Most of us are highly interested in knowing what is said about us online. As the TechCareers blog points out:

    “You are the chief marketing officer for the brand called you, but what others say about your brand is more impactful than what you say about yourself,” says Schawbel.

    Keeping an eye on what others have to say about you is not always easy. I started thinking about these issues and outlined how I try to keep up with this information. Here is my "Personal Brand Monitoring Toolbox":

    1. Search Engines: The typical way for me to keep tabs on this is by setting up Google alerts for my name, projects, organization (University/workplace) etc. In addition, I frequently perform "ego searches" to forage for mentions of my name.
    2. Statistics and Tools: One very interesting tool that I have found useful is Lijit. It provides you stats on who is searching for you, what keywords were used to reach your blog, etc. In addition I use Google Analytics to know more information about my visitors, most visited pages and time they spent on my site. If you are an academic like me, you would like to know who has cited your papers recently (Google Scholar) and the number of downloads, who has linked to your paper (Google link: search) and/or your blog posts (Technorati searches). Yessss! I admit! I have become a total statoholic! :-)
    3. Comments and Scraps: Twitter is another important tool in our arsenal for personal branding and your replies say something interesting about you. Finally, the comments on my blog, Facebook messages, scraps and photos are all part of my "brand" and I take interest in replying to them just like I would to an email.

    As our information spaces diversify, monitoring "your brand" becomes a part of everyday online activity. I don't think we have exactly cracked the nut yet -- keeping track of your profile and "your brand" is a highly addictive activity and I think that the tool(s) that make it fun and exciting will enjoy a great deal of popularity.

    May 02, 2008

    Leveraging Web and Social Media for Recommendations

    Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

    In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I do not have a lot of familiarity with the recommendation systems literature, it seems obvious that the Web and Social Media are great sources of information that could be useful when building such systems. Bloggers' profile pages, wishlists, Netflix queues, book lists and the blog posts themselves are potential clues to learn which two items may be related to each other.

    As a simple example, consider the movie "Pulp Fiction". By querying Google for all the inlinks to the IMDB page of Pulp Fiction and counting which other movies are "co-cited", here is a list of five movies that are most likely to be related to "Pulp Fiction":

    Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise though was LOTR. I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case at least.

    Just for fun, here is another example with "Sin City", another one of my favorite movies.

    Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".
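
The co-citation counting described above can be sketched in a few lines of Python, assuming the inlinks have already been collected into a mapping from each linking page to the set of movie pages it links to. The function name, the page names and the movie identifiers below are all made up for illustration; how the inlink data is gathered (search engine queries, a crawl, a blog index) is left open.

```python
from collections import Counter


def co_cited(pages, target, top_n=5):
    """Rank items most often linked from the same pages as `target`.

    `pages` maps a page URL to the set of item identifiers it links to.
    """
    counts = Counter()
    for links in pages.values():
        if target in links:
            counts.update(item for item in links if item != target)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Toy data: three hypothetical blog pages and the movies they link to.
    pages = {
        "blog1": {"pulp-fiction", "movie-x", "movie-y"},
        "blog2": {"pulp-fiction", "movie-x"},
        "blog3": {"movie-y", "movie-z"},
    }
    print(co_cited(pages, "pulp-fiction"))
```

Raw counts favor movies that are popular everywhere, so a real system would probably normalize by how often each movie is linked overall.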

    May 01, 2008

    Guest Post by Blazej Bulka: Social Networking That Went Wrong

    (Guest post by Blazej Bulka. Thanks Blazej! Especially for translating and summarizing these articles for us non-Polish speakers..)

    nasza-klasa.pl is currently the most popular social networking site in Poland. "Nasza klasa" means "our class," and the website is similar in design to classmates.com. It allows people to join the schools from which they graduated, and classes within those schools. It also allows them to build a social network by reconnecting with old friends, and to post photographs.

    Initially, nasza-klasa lacked any privacy controls because it was meant as a small, local project. After one year of existence, a sudden surge in popularity came. The number of registered users quickly exceeded 1 million, which started exponential growth. After a few months, the site had more than 5 million users. Currently, the site has more than 11 million users. (The population of Poland is roughly 40 million.) The unexpected growth choked the website, and the utmost priority of the developers was performance, and not privacy.

    Right now, the website offers only basic privacy protection such as blacklisting and restricting the visibility of the information in the profile. However, the biggest privacy problem is the users themselves. Most of them have had no prior experience with social networks. The theme of the site strongly encourages sharing as much personal information as possible. After all, the users are surrounded by their classmates, who are the people they usually trust, and with whom they shared their personal secrets in childhood. They have not seen each other for many years; therefore, they are extremely willing to share a lot of personal information to catch up. They also want their profiles to be easily searchable, in case another old friend should join the site in the future.

    Unfortunately, many of the users are unaware that their information can actually be accessed by almost everyone, and that it is not just their friends who are interested in it. Moreover, because of their inexperience, they do not know that any information published on the internet may be copied an endless number of times, changed, and may stay there forever. Sometimes, the consequences of carelessly publishing information on the site were quite surprising.

    Below, I present an overview of five such unexpected uses of published information based on the articles published in press or on the internet (mostly in Polish).

    1. Tracking users by debt collectors and law enforcement
    (Source: "Nasza-klasa is a true treasury for debt collectors and law enforcement")
    Debt collectors in Poland make heavy use of nasza-klasa to track debtors. Nasza-klasa seems to be a very good source of information for them. Personally, I have also heard of cases where debt collectors compare the friends list of the wanted person with the lists of their classmates. A classmate who is missing from the friends list may indicate that a dislike existed between the two. Apparently, the debt collectors tried to exploit such dislikes to extract more information.

    2. Police officers expose themselves and their colleagues
    (Source: "Nasza-klasa.pl exposes police officers")

    Irresponsible officers post pictures on the nasza-klasa.pl portal from the educational institutions from which they graduated. This reveals the identities of undercover officers, experienced officers say.

    [..] The officers add themselves to classes, in which they studied [at police schools]; often, the classes specializing in police operations, investigations, or reconnaissance. And they add class pictures. "I hunt bad people" -- this is the profile information in "About me" section of Marek -- a graduate of a reconnaissance-operational police school in Pila. More experienced officers are disgusted by the behavior of the novices. At police internet forums, they write that such a database about police officers and their friends may be used to track a particular officer.

    3. Police officers brag about their authority, and how they could abuse it
    (Source:   "Give me their names, I will find them in the national ID registry")

    A police officer from Grodzisk Mazowiecki brags on nasza-klasa portal that he can determine the addresses of his former classmates because he has access to the national ID registry (PESEL). [..] On the forum for the 2nd Elementary School, Jaroslaw B. writes: "If anybody of you has telephone numbers of our former classmates, please call them so that they sign up. I also have access to PESEL database. I need last names, and I will determine the addresses."
    A former classmate asks him: "Jarek (i.e., Jaroslaw), where do you work, if you have access to PESEL database?"

    The police officer responds: "the less you know, the better."
    Another classmate writes: "(..) Now the whole school knows about Jarek's capabilities :) However, I propose not to use authority or even force to find our classmates :) (...) And maybe we should move our conversation to the class forum from the school's forum :):)"

    In his profile, Jaroslaw wrote that he graduated from the Police University in Szczytno.

    4. Intelligence agents should know better and keep a low profile ... well, apparently not!
    (Source:  "Our new secret service revealed themselves on the internet"; the article contains interesting pictures posted by the intelligence agents ...)

    Pictures of six SKW (Military Counter-intelligence Agency) officers are shown without their names or rank. According to the SKW Act, this data is considered particularly sensitive. The officers themselves did not exercise enough care -- during their mission, they posted pictures showing them armed, both in uniform and disguise. They were posted in profiles registered with their real names. They did not mention that they are SKW officers, they only mentioned that they are officers of the Polish Military Force. But they did not try to conceal the fact that the pictures were taken in Afghanistan.

    5. People brag about committing crimes and spreading hateful ideology
    (Source: "Responsibility for 'Sieg Heil' (Nazi greeting) on Nasza-klasa")
    One of the profile pictures is here and another "nazi" profile picture is here.

    In February, journalists from "Gazeta Wyborcza" noticed that the son of the mayor of Leczna is one of the millions of people who signed up with nasza-klasa. One of the photographs in his profile showed him with two friends with shaved heads, with their right hands raised in the fascist gesture "Sieg heil". The faces of the two friends were covered with scarves bearing the logo of the fans of the soccer club Gornik Leczna.

    [..] As the journal reports, the state prosecutor's office in Lublin has already contacted nasza-klasa.pl with an inquiry regarding the user profile. The investigators want to find out how long the picture was available on the site, and how many people may have viewed it. Promoting fascist ideology may lead to a fine and imprisonment of up to two years.

    April 28, 2008

    Avg Size of a Web Document Compared to A Blog

    I read this fascinating article (via Techmeme) that indicates that "the average Web page size has tripled since 2003". IMHO, average is still a problematic measure when dealing with Power-law distributions. I like the example that Clay Shirky mentions in the book "Here Comes Everybody":

    If Bill Gates walked into a bar ... we'd all be Millionaires ......... ON AVERAGE!

    The same holds when we are talking about web pages. Nevertheless, the study is quite interesting and provides a very good analysis of how Web content is changing.
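
The bar example is easy to check with a few lines of Python. The net worth figures below are hypothetical round numbers chosen only to show how a single extreme value drags the mean while leaving the median untouched.

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of ten bar patrons.
patrons = [50_000] * 10
# The same bar after one (hypothetical) billionaire walks in.
with_guest = patrons + [50_000_000_000]

print("before:", mean(patrons), median(patrons))
print("after: ", mean(with_guest), median(with_guest))
```

This is why, for heavy-tailed quantities like page sizes, the median or a full histogram is usually more informative than the average alone.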

    This made me wonder:
    "What is the contribution of Social Media content in tripling the size of an average Web page?"

    [Figure: homepage sizes of the top Web pages vs. blogs]

    While not a comprehensive study, I did a very quick back of the envelope experiment to see what this would be like. I fetched the top 400 Web pages, as ranked by Alexa. Similarly, I got a bunch of 400 blogs (from the Buzzmetrics dataset) and cached their homepages as well (wget -p <url>). Following is a graph that compares the sizes of the homepages from the two datasets.

    It looks like the size of a blog homepage is, "on average", larger than that of a regular Webpage, suggesting that a good deal of the increase in the size of a Webpage could be due to Social Media content.

    I think a more detailed study here would be insightful.

    LiveBlogging Tools: CoveritLive

    I had this idea, but turns out that CoveritLive does exactly the same thing and they have implemented it beautifully!

    The current blogging tools usually fall short when trying to put up a liveblogging / running commentary post. The blogging software itself is supposed to be a reverse chronological listing of posts. Now when writing a live blog post of an event that the blogger is covering, what is usually needed is a chronological listing of the news. (So it's like an inverted blog within a post!) Mostly, people manage their own timestamps and formatting for the liveblogging post. This tends to be tedious and a distraction from the real event that one is trying to cover. Instead, CoveritLive is a great tool that handles all this for you!

    The cool thing is that it also has features that support multiple authors, uploading media and rich formatting. It's a great idea and fulfills a surprisingly obvious need. I think I am gonna use this for the next liveblogging event I cover. Do check it out!

    April 24, 2008

    Favorite Commandline Hack

    One of my favorite commandline hacks is demonstrated by the following example:

    history | gawk -F ' ' '{print $2}' | sort | uniq -c | sort -nr | more

    What this does is take a text file (or, in our case, the history of commands used), chop it to print the right field, sort, and count the number of times a particular term occurs. For example, here are the top commands I have used on this server:

        373    ls
        268    cd
         42    more
         29    ps
         27    rm
         25    du
         24    ./bin/startup.sh
         22    exit
         17    source
         15    emacs
         14    sudo
         13    ssh

    This is immensely useful and a quick way to do anything from processing a huge file to counting the number of times a link occurs, word counting, and all sorts of processing that comes up frequently in large social media datasets. It's a really easy way to do some mundane tasks without having to write a script or much code for it. So what are your favorite commandline hacks? Share the joy! :-)
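
When the counting needs to live inside a larger program rather than a shell pipeline, roughly the same idea can be sketched in Python with collections.Counter. The function name and the sample lines are assumptions for illustration; the field index 1 corresponds to $2 in the gawk command above.

```python
from collections import Counter


def top_terms(lines, field=1, top_n=10):
    """Count how often the whitespace-separated field at index `field` occurs."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) > field:
            counts[parts[field]] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Sample lines shaped like `history` output: an entry number, then the command.
    sample = ["  1  ls -l", "  2  cd /tmp", "  3  ls"]
    for term, count in top_terms(sample):
        print(f"{count:7d}    {term}")
```

For one-off questions the shell pipeline is still quicker to type; the Python version mostly helps when the result feeds further processing.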

    April 22, 2008

    Distinguished Speaker: Jiawei Han on "Research Challenges in Data Mining"

    It was an honor to have Dr. Jiawei Han on our campus today. He was hosted by the Information Systems (IS) department as part of the distinguished speaker series. Dr. Han is a pioneer in the field of data mining and has written a book that I had used for the Data Mining class I took with Dr. Kargupta. In fact, Dr. Han even obliged me by autographing my copy :-D. (I know that's geeky, but isn't that cool?)

    Here is a summary of his talk on "Research Challenges in Data Mining".

    Dr. Han narrates an interesting episode where Dr. Jim Gray brought in a hard disk and asked his students to guess the capacity... this was perhaps just 10 years back, and a 2 GB disk evoked a jaw-dropping response from the audience. According to Dr. Jim Gray, "We are in the era of Data Science". Science used to be experimental (observe the stars), then it moved to computational (run simulations), but now we are really moving towards "Data Science" (Gigabytes, Terabytes and Petabytes of data all around us).

    The main themes of the talk were:

    1) Pattern Mining: Classification by finding frequent, discriminative patterns.

    Dr. Han described an interesting experiment in which they analyzed feature length vs. information gain. What they found was that you get more information by combining features (2, 3 or 4), while single features don't give you much information. On the other hand, a pattern that is too long is not very informative either. This has an important implication, especially in text mining, where using n-grams (bigrams and trigrams) has generally been found to be useful, but using too long a pattern is not that helpful.

    Another analysis presented was of information gain vs pattern freq: Frequent patterns have more information gain than less frequent ones. On the other hand extremely frequent patterns have little gain (for example stop words).

    The implication of these results is that, when using SVMs and C4.5, classification accuracy can be improved by using discriminative patterns/feature selection. Dr. Han suggests that "It is not the number of patterns, it's the quality of patterns. It's not the more the better ... instead, the better the better." A few selected discriminative patterns are more useful than a lot of features in the case of classification.
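
To make the notion of information gain concrete, here is a small Python sketch that scores a binary "document contains this pattern" indicator against class labels. The function names and the four-document toy corpus are assumptions for illustration; this is the textbook definition of information gain, not the specific procedure from Dr. Han's experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(has_pattern, labels):
    """Reduction in label entropy from splitting on a binary pattern indicator."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for x, y in zip(has_pattern, labels) if x == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain


if __name__ == "__main__":
    labels = ["spam", "spam", "ham", "ham"]
    # A pattern that appears only in one class is maximally informative here.
    print(information_gain([True, True, False, False], labels))
    # A pattern that appears in every document (like a stop word) tells us nothing.
    print(information_gain([True, True, True, True], labels))
```

The second call mirrors the observation about extremely frequent patterns: a pattern present in every document cannot separate the classes.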

    2) Stream Data Mining: Classification for rare events.

    When you have multiple streams of information and cannot store all the data, but need to remember some of the history (in order to classify a new stream), it is helpful to use sampling techniques. Dr. Han discusses results from one of their recent papers on using biased sampling. The idea here is that you don't throw away positive data (all of it is stored, since there are very few samples). Then, by selectively sampling the negative data, one can use equal amounts of positive and negative data for training a classifier. In the case of multiple streams, using an ensemble approach to classification is suggested.
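
One way to picture the biased sampling idea is the Python sketch below: every positive example is kept, while negatives go through standard reservoir sampling so that memory stays bounded. The class name, its parameters and the use of reservoir sampling are assumptions made for illustration; the paper discussed in the talk may select negatives quite differently.

```python
import random


class BiasedStreamSampler:
    """Keep every positive example and a bounded uniform sample of negatives."""

    def __init__(self, max_negatives, seed=None):
        self.max_negatives = max_negatives
        self.positives = []
        self.negatives = []
        self._negatives_seen = 0
        self._rng = random.Random(seed)

    def add(self, example, is_positive):
        if is_positive:
            self.positives.append(example)  # rare class: never discarded
            return
        self._negatives_seen += 1
        if len(self.negatives) < self.max_negatives:
            self.negatives.append(example)
        else:
            # Reservoir sampling: each negative seen so far stays with equal probability.
            slot = self._rng.randrange(self._negatives_seen)
            if slot < self.max_negatives:
                self.negatives[slot] = example

    def training_set(self):
        """Return all positives plus an equal-sized random subset of stored negatives."""
        k = min(len(self.positives), len(self.negatives))
        return self.positives + self._rng.sample(self.negatives, k)
```

A usage sketch: feed each arriving example through add(), then call training_set() whenever the classifier needs retraining on a balanced sample.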

    3) Information network analysis. Distinguishing objects with identical names. (eg author names, song names)

    Name disambiguation is a very difficult problem. For example, there are 14 Wei Wangs on DBLP -- some even with the same co-authors, same conferences, etc. In such cases, textual similarity cannot be used, since they are all basically in the same field. Dr. Han discusses how, using only information from DBLP, they make links powerful. The basic idea is to group references according to their similarity. By performing random walks along different join paths, they can group the authors and resolve the different name ambiguities in the graph. Often, in computer science and other fields, collaboration behavior within the same community shares some similarity, and this can be helpful in disambiguation.

    Another research paper that Dr. Han highlights is on "Truth discovery with multiple conflicting information providers on the web". The goal here is to algorithmically determine what piece on the Web is more trustworthy. The proposed approach is to model it as a tripartite graph of (website,facts,object).

    The assumption is that there is just one true fact and that false 'facts' are introduced due to random factors (and not malicious intent). The domain was restricted to identifying the right book title, which is slightly easier than something like political opinions. The basic idea behind their approach is that a site is trustworthy if it provides many facts with high confidence, and once again this can be modeled as a matrix computation problem of identifying trustworthiness.
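
The mutual reinforcement between site trust and fact confidence can be illustrated with a heavily simplified Python sketch. Here a fact's confidence is one minus the probability that every site asserting it is wrong, and a site's trust is the average confidence of its facts. The function name, the starting trust of 0.5 and the toy sites are assumptions; the sketch ignores conflicts between facts about the same object and any damping, so it should not be read as the algorithm from the paper.

```python
def truth_discovery(claims, iterations=20, initial_trust=0.5):
    """Iteratively estimate site trust and fact confidence.

    `claims` maps each site to the set of facts it asserts.
    """
    trust = {site: initial_trust for site in claims}
    facts = set().union(*claims.values()) if claims else set()
    confidence = {}
    for _ in range(iterations):
        # A fact is more credible when several trusted sites assert it.
        for fact in facts:
            disbelief = 1.0
            for site, asserted in claims.items():
                if fact in asserted:
                    disbelief *= 1.0 - trust[site]
            confidence[fact] = 1.0 - disbelief
        # A site is trustworthy when the facts it asserts are credible.
        for site, asserted in claims.items():
            if asserted:
                trust[site] = sum(confidence[f] for f in asserted) / len(asserted)
    return trust, confidence


if __name__ == "__main__":
    # Toy data: two hypothetical sites agree on a book title, a third disagrees.
    claims = {
        "site-a": {"title-1"},
        "site-b": {"title-1"},
        "site-c": {"title-2"},
    }
    trust, confidence = truth_discovery(claims)
    print(trust)
    print(confidence)
```

In this toy setup the two agreeing sites reinforce each other, while the lone dissenting site stays at its starting trust.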

    Some of the other areas that were very briefly mentioned by Dr. Han were:

    4) Mining moving objects - unrestricted, restricted (cars), scattered (cell phones and RFID)
    5) Spatio temporal multimedia clustering with obstacles
    6) blog mining, etc.
    7) data cubes.
    8) Visual data mining
    9) Biological data mining
    10) DM for software engineering/bug mining

    This was a great talk and very informative. I really liked how Dr. Han gave the basic intuition behind their methods and yet managed to give a broad overview of some of the challenges in the field.

    April 20, 2008

    SemanticHacker Challenge and Wikipedia Widgets

    When writing a blog post, we create links to Wikipedia articles because it is an easy and quick way to provide information to readers, without having to go into too many details. It is not always possible to find, link or even know about all the articles that might be related to the post's topic.

    Zareen Syed, one of my colleagues, recently presented a paper at ICWSM on "Wikipedia as an Ontology for Describing Documents". This paper talks about how Wikipedia articles can be used to associate concepts with a given document. Over at SemanticHacker a similar approach is taken to find the "Simplified Semantic Signatures" for a document. TextWise is offering up to a million dollars in funding to any idea that can

    Turn on the Power of Semantics

    I liked their demo, API and the tools they provide. Inspired by the effectiveness of Wikipedia in describing documents and my desire to play with widgets, I decided to mash these two goals together. If you look at the right column of this blog, you will find a small widget that displays the related Wikipedia entries for a given page/post and is powered by the SemanticHacker API. This was a quick hack and the system can be improved by

    • Focusing on the post content alone (right now I just pass the current page URL).
    • Adding more features, like identifying people, places etc. and highlighting them differently.

    While I don't think this by itself might qualify for a $1 Million challenge, I think there might be something interesting here. I have a couple of ideas around this and scant time to work on them. Finally, in true Web 2.0 spirit -- not sure what, if any, is the business model (I just built this for kicks; SemanticHacker is looking for a bizplan from participants).

    I don't think that the widget itself is ready for prime time -- it was an evening hack. But if you would like to play with it, leave me a note in the comments and I shall provide a link so you can install it on your blog.
