
May 08, 2008

"Personal Brand" Monitoring Tools

Dr. Finin pointed to this interesting post on "branding yourself with a blog":

“… Certainly personal branding isn’t a new concept, but the future of personal branding could be in at your fingertips—with a blog. One of the first steps in creating a brand for yourself is to make your blog visible. Post meaningful entries, comment on your industry’s top blogs, or simply gain a regular readership. “Visibility creates opportunities,” says Schawbel, a social media specialist at EMC Corporation. He believes that when you brand yourself, the competition becomes irrelevant. “The goal of personal branding is to be recruited based on your brand, not applying for jobs,” Schawbel says. …”

Many brand monitoring startups are helping big companies keep track of what their (potential) customers have to say about them or their products. While the space of corporate brand monitoring is fiercely competitive, one area that is overlooked is that of "personal branding" tools. Most of us are highly interested in knowing what is said about us online. As the TechCareers blog points out:

“You are the chief marketing officer for the brand called you, but what others say about your brand is more impactful than what you say about yourself,” says Schawbel.

Keeping an eye on what others have to say about you is not always easy. I started thinking about these issues and outlined how I try to keep up with this information. Here is my "Personal Brand Monitoring Toolbox":

  1. Search Engines: The typical way for me to keep tabs on this is by setting up Google alerts for my name, projects, organization (University/workplace) etc. In addition, I frequently perform "ego searches" to forage for mentions of my name.
  2. Statistics and Tools: One very interesting tool that I have found useful is Lijit. It provides you stats on who is searching for you, what keywords were used to reach your blog, etc. In addition I use Google Analytics to know more information about my visitors, most visited pages and time they spent on my site. If you are an academic like me, you would like to know who has cited your papers recently (Google Scholar) and the number of downloads, who has linked to your paper (Google link: search) and/or your blog posts (Technorati searches). Yessss! I admit! I have become a total statoholic! :-)
  3. Comments and Scraps: Twitter is another important tool in our arsenal for personal branding and your replies say something interesting about you. Finally, the comments on my blog, Facebook messages, scraps and photos are all part of my "brand" and I take interest in replying to them just like I would to an email.

As our information spaces diversify, monitoring "your brand" becomes a part of everyday online activity. I don't think we have exactly cracked the nut yet -- keeping track of your profile and "your brand" is a highly addictive activity, and I think that the tool(s) that make it fun and exciting will enjoy a great deal of popularity.

May 02, 2008

Leveraging Web and Social Media for Recommendations

Both Amazon and Netflix's business models rely on effective recommendation systems. The recommendations provided by such systems are based on the purchasing habits of millions of customers. As such, these systems are non-trivial and have evolved out of years of research in both academia and industry.

In addition to mining millions of customer transaction records, for many products there is a vast amount of information available online. While I do not have a lot of familiarity with the recommendation systems literature, it seems obvious that the Web and Social Media are a great source of information that could be useful when building such systems. Bloggers' profile pages, wishlists, Netflix queues, book lists and the blog posts themselves are potential clues to learn which two items may be related to each other.

As a simple example, consider the movie "Pulp Fiction". By querying Google for all the inlinks to the IMDB page of Pulp Fiction and counting which other movies are "co-cited", we get a list of the five movies that are most likely to be related to "Pulp Fiction":

Most of these look quite relevant. Some critics have claimed similarities between Pulp Fiction and Snatch. One surprise, though, was LOTR. I wouldn't have expected it to be grouped with Pulp Fiction, but I guess I like them both very much -- so it seems reasonable in my case at least.

Just for fun, here is another example with "Sin City", another one of my favorite movies.

Unless you have a large index of the Blogosphere or the Web, it would be quite inefficient to mine for such correlations (by passing queries to search engines) on a large scale. I do not know how much of the search engine information is leveraged in recommendation systems built by Amazon or Netflix.  It might also be worth looking into differences in the recommendations produced on the basis of "how people co-cite two products" vs. "how people purchase two products".
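The co-citation counting described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `pages` data is made up, the function name `co_cited` is hypothetical, and the inlink lists would in practice have to come from a search engine or a crawl.

```python
from collections import Counter

def co_cited(pages, target, top_n=5):
    """Count which items are linked from the same pages that link to `target`."""
    counts = Counter()
    for links in pages:
        items = set(links)              # count each item at most once per page
        if target in items:
            counts.update(items - {target})
    return counts.most_common(top_n)

# Hypothetical data: each inner list holds the movie pages one blog post links to.
pages = [
    ["pulp-fiction", "snatch", "sin-city"],
    ["pulp-fiction", "snatch"],
    ["pulp-fiction", "sin-city", "snatch"],
    ["sin-city", "lotr"],
    ["pulp-fiction", "lotr"],
]

print(co_cited(pages, "pulp-fiction", top_n=2))
```

With an index of the link graph at hand, the same loop runs over real inlink pages instead of issuing one search query per movie.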

May 01, 2008

Guest Post by Blazej Bulka: Social Networking That Went Wrong

(Guest post by Blazej Bulka. Thanks Blazej! Especially for translating and summarizing these articles for us non-Polish speakers..)

nasza-klasa.pl is currently the most popular social networking site in Poland. "Nasza klasa" means "our class," and the website is similar in design to classmates.com. It allows people to join the schools from which they graduated, and classes within those schools. It also lets users build a social network by reconnecting with old friends, and post photographs.

Initially, nasza-klasa lacked any privacy controls because it was meant as a small, local project. After one year of existence, a sudden surge in popularity came. The number of registered users quickly exceeded 1 million, which started exponential growth. After a few months, the site had more than 5 million users. Currently, the site has more than 11 million users. (The population of Poland is roughly 40 million.) The unexpected growth choked the website, and the utmost priority of the developers was performance, and not privacy.

Right now, the website offers only basic privacy protection such as blacklisting and restricting the visibility of the information in the profile. However, the biggest privacy problem is the users themselves. Most of them have had no prior experience with social networks. The theme of the site strongly encourages sharing as much personal information as possible. After all, the users are surrounded by their classmates, who are the people they usually trust, and with whom they shared their personal secrets in childhood. They have not seen each other for many years; therefore, they are extremely willing to share a lot of personal information to catch up. They also want their profiles to be easily searchable, in case another old friend should join the site in the future.

Unfortunately, many of the users are unaware that their information can actually be accessed by almost everyone, and that not only their friends are interested in it. Moreover, because of their inexperience, they do not know that any information published on the internet may be copied an endless number of times, changed, and may stay there forever. Sometimes, the consequences of carelessly publishing information on the site were quite surprising.

Below, I present an overview of five such unexpected uses of published information based on the articles published in press or on the internet (mostly in Polish).

1. Tracking users by debt collectors and law enforcement
(Source: "Nasza-klasa is a true treasury for debt collectors and law enforcement")
Debt collectors in Poland tend to use nasza-klasa heavily to track debtors. Nasza-klasa seems to be a very good source of information for them. Personally, I have also heard of cases where debt collectors compare the friends list of the wanted person with the list of their classmates. A classmate who is missing from the friends list may indicate that a dislike existed between the two. Apparently, the debt collectors tried to exploit such dislikes to extract more information.

2. Police officers expose themselves and their colleagues
(Source: "Nasza-klasa.pl exposes police officers")

Irresponsible officers post pictures from the educational institutions from which they graduated on the nasza-klasa.pl portal. This reveals the identities of undercover officers, experienced officers say.

[..] The officers add themselves to classes, in which they studied [at police schools]; often, the classes specializing in police operations, investigations, or reconnaissance. And they add class pictures. "I hunt bad people" -- this is the profile information in "About me" section of Marek -- a graduate of a reconnaissance-operational police school in Pila. More experienced officers are disgusted by the behavior of the novices. At police internet forums, they write that such a database about police officers and their friends may be used to track a particular officer.

3. Police officers brag about their authority, and how they could abuse it
(Source:   "Give me their names, I will find them in the national ID registry")

A police officer from Grodzisk Mazowiecki brags on nasza-klasa portal that he can determine the addresses of his former classmates because he has access to the national ID registry (PESEL). [..] On the forum for the 2nd Elementary School, Jaroslaw B. writes: "If anybody of you has telephone numbers of our former classmates, please call them so that they sign up. I also have access to PESEL database. I need last names, and I will determine the addresses."
A former classmate asks him: "Jarek (i.e., Jaroslaw), where do you work, if you have access to PESEL database?"

The police officer responds: "the less you know, the better."
Another classmate writes: "(..) Now the whole school knows about Jarek's capabilities :) However, I propose not to use authority or even force to find our classmates :) (...) And maybe we should move our conversation to the class forum from the school's forum :):)"

In his profile, Jaroslaw wrote that he graduated from the Police University in Szczytno.

4. Intelligence agents should know to keep a low profile ... well, apparently not!
(Source:  "Our new secret service revealed themselves on the internet"; the article contains interesting pictures posted by the intelligence agents ...)

Pictures of six SKW (Military Counter-intelligence Agency) officers are shown without their names or rank. According to the SKW Act, this data is considered particularly sensitive. The officers themselves did not exercise enough care -- during their mission, they posted pictures showing them armed, both in uniform and disguise. They were posted in profiles registered with their real names. They did not mention that they are SKW officers, they only mentioned that they are officers of the Polish Military Force. But they did not try to conceal the fact that the pictures were taken in Afghanistan.

5. People brag about committing crimes and spreading hateful ideology
(Source: "Responsibility for 'Sieg Heil' (Nazi greeting) on Nasza-klasa")
One of the profile pictures is here and another "nazi" profile picture is here.

In February, journalists from "Gazeta Wyborcza" noticed that the son of the mayor of Leczna is one of the millions of people who signed up with nasza-klasa. One of the photographs in his profile showed him with two shaven-headed friends, with their right hands raised in the fascist gesture "Sieg heil". The faces of the two friends were covered with scarves bearing the logo of the fans of the soccer club Gornik Leczna.

[..] As the journal reports, the state prosecutor's office in Lublin has already contacted nasza-klasa.pl with an inquiry regarding the user profile. The investigators want to find out how long the picture was available on the site, and how many people may have viewed it. Promoting fascist ideology may lead to a fine and imprisonment of up to two years.

April 28, 2008

Avg Size of a Web Document Compared to A Blog

I read this fascinating article (via Techmeme) that indicates that "the average Web page size has tripled since 2003". IMHO, average is still a problematic measure when dealing with Power-law distributions. I like the example that Clay Shirky mentions in the book "Here Comes Everybody":

If Bill Gates walked into a bar ... we'd all be Millionaires ......... ON AVERAGE!

Same holds when we are talking about web pages. Nevertheless, the study is quite interesting and provides a very good analysis of how Web content is changing.
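A tiny sketch makes the point about averages concrete. All numbers below are made up for illustration; the only claim is that a single extreme value drags the mean far away while the median barely moves, which is why the median is often the safer summary for heavy-tailed data such as page sizes.

```python
from statistics import mean, median

# Hypothetical net worths (in dollars) of ten bar patrons.
patrons = [40_000] * 10
print(mean(patrons), median(patrons))

# One extremely wealthy visitor walks in (the figure is invented for illustration).
patrons.append(50_000_000_000)
print(mean(patrons), median(patrons))
```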

This made me wonder:
"What is the contribution of Social Media content in tripling the size of an average Web page?"


While not a comprehensive study, I did a very quick back of the envelope experiment to see what this would be like. I fetched the top 400 Web pages, as ranked by Alexa. Similarly, I got a bunch of 400 blogs (from the Buzzmetrics dataset) and cached their homepages as well (wget -p <url>). Following is a graph that compares the sizes of the homepages from the two datasets.

It looks like the size of a blog is, "on average", larger than the size of a regular Web page, suggesting that a good deal of the increase in the size of a Web page could be due to Social Media content.

I think a more detailed  study here would be insightful.
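For anyone who wants to repeat the experiment, here is a minimal sketch of the measuring step. It assumes each homepage was saved with `wget -p` into its own directory (the directory layout and the function names `cached_size` and `median_size` are my assumptions, not part of the original experiment), and it reports the median rather than the mean for the reason discussed above.

```python
import os
import tempfile
from statistics import median

def cached_size(root):
    """Total bytes of all files under `root` (e.g. one `wget -p` download)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def median_size(roots):
    """Median size in bytes across several cached homepages."""
    return median(cached_size(r) for r in roots)

# Tiny self-check on a throwaway directory (a stand-in for a real download).
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "images"))
    with open(os.path.join(d, "index.html"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(d, "images", "logo.png"), "wb") as f:
        f.write(b"x" * 50)
    demo_size = cached_size(d)
    demo_median = median_size([d, os.path.join(d, "images")])

print(demo_size, demo_median)
```

Running `median_size` once over the Alexa directories and once over the blog directories would give the two numbers to compare.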

LiveBlogging Tools: CoveritLive

I had this idea, but turns out that CoveritLive does exactly the same thing and they have implemented it beautifully!

The current blogging tools usually fall short when trying to put up a liveblogging / running commentary post. The blogging software itself is supposed to be a reverse chronological listing of posts. Now when writing a live blog post of an event that the blogger is covering, what is usually needed is a chronological listing of the news. (So it's like an inverted blog within a post!) Mostly, people manage their own timestamps and formatting for the liveblogging post. This tends to be tedious and a distraction from the real event that one is trying to cover. Instead, CoveritLive is a great tool that handles all this for you!

The cool thing is that it also has features that support multiple authors, uploading media and rich formatting. It's a great idea and fulfills a surprisingly obvious need. I think I am gonna use this for the next liveblogging event I cover. Do check it out!

April 20, 2008

SemanticHacker Challenge and Wikipedia Widgets

When writing a blog post, we create links to Wikipedia articles because it is an easy and quick way to provide information to readers, without having to go into too many details. It is not always possible to find, link or even know about all the articles that might be related to the post's topic.

Zareen Syed, one of my colleagues, recently presented a paper at ICWSM on "Wikipedia as an Ontology for Describing Documents". This paper talks about how Wikipedia articles can be used to associate concepts with a given document. Over at SemanticHacker, a similar approach is taken to find the "Simplified Semantic Signatures" for a document. TextWise is offering up to a Million Dollars in funding to any idea that can

Turn on the Power of Semantics

I liked their demo, API and the tools they provide. Inspired by the effectiveness of Wikipedia in describing documents and my desire to play with widgets, I decided to mash these two goals together. If you look at the right column of this blog, you will find a small widget that displays the related Wikipedia entries for a given page/post and is powered by the SemanticHacker API. This was a quick hack and the system can be improved by

  • focusing on the post content alone (right now I just pass the current page URL).
  • Adding more features, like identifying people, places etc and highlighting them differently

While I don't think this by itself might qualify for a $1 Million challenge, I think there might be something interesting here. I have a couple of ideas around this and scant time to work on it. Finally, in true Web 2.0 spirit -- not sure what, if any, is the business model (I just built this for kicks; SemanticHacker is looking for a bizplan from participants).

I don't think that the widget itself is ready for prime time -- it was an evening hack. But if you would like to play with it, leave me a note in the comments and I shall provide a link to you so you can install it on your blog.

April 18, 2008

SocialDevCamp East: Local BarCamp Event in Baltimore

I am excited about the local BarCamp event organized by David Troy of Twittervision and Ann Bernard and Keith Casey from WhyGoSolo. It looks like a fun event and a great "UnConference" for local entrepreneurs, social media enthusiasts and researchers to get together and talk about the future and innovation of Social Media and the Web.

It promises to be an event that would

bring together forward thinkers – developers, social media gurus, bizdev types – to discuss and Chart the Next Course.

The Agenda and signup details are available on the PBWiki site. Some of the proposed session topics (From the Wiki):

  • Mobile Application Development with iPhone & Android
  • What are the implications of Geo-based web applications?
  • How will Agile Development practices affect ideation and funding processes?
  • Platforms Present & Future
  • What will matter on the web a year or two from now?
  • Where will information aggregation take us?
  • Social Media is here to stay – so where will it go next?
  • How will Google keep affecting development, the web and our lives?
  • What are the next things web users will want from the web?
  • What will be the impact of future generations to the current web we know today?
  • Social Media in the workplace vs. in life
  • How the Web is being broken apart into pieces - Justin Thorp

Fantastic! Can't wait to be there!

There are also some sponsorship opportunities (starting @ $250) available for this event. Also please note that there is no cost for participants (which is great for grad students like me!). :-D

For more details please contact Dave Troy or come and chat with us on Facebook.

The Bane of Fame

I was working on this post and read a timely piece by Erik at Techcrunch on the level of noise and our inability to keep up with the conversations online. Further comments by Alexander as well.

Clay Shirky, in his book "Here Comes Everybody", mentions how fame can come at a cost. The example he presents is that of a blogger with a 10,000-strong audience: if he interacts with each of his readers for just 1 min, it amounts to almost a full-time job (~40 hours/week). Here an interaction can mean that someone has just created a link and you go check it out; someone has written a blog post in response; posted a video/podcast; just commented on your blog; sent a tweet in response to yours; posted a photo from the conference you attended; or plain old-fashioned email/phone conversations.

I tried to extrapolate this and see how it would work in a typical scenario. Let's say you are an average blogger who spends 30 mins/post/day and your audience creates 1000 items a month for you to respond to. Perhaps on average you spend 5 mins per response (this might include reading the comment, tweet, post or email, thinking time, writing time, etc.). Then one would pretty much spend about ~25 hours/week for about 1000 responses/month.
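The back-of-the-envelope numbers above can be checked with a small sketch. The function name `weekly_hours` and the conversion of 52/12 weeks per month are my assumptions; so is reading Shirky's example as 10,000 one-minute interactions spread over a month, since the period is not stated here.

```python
def weekly_hours(posting_min_per_day, responses_per_month, min_per_response,
                 weeks_per_month=52 / 12):
    """Rough weekly time budget (in hours) for posting plus responding."""
    posting = posting_min_per_day * 7 / 60
    responding = responses_per_month * min_per_response / 60 / weeks_per_month
    return posting + responding

# Average blogger: 30 min of posting a day, 1000 responses a month at 5 min each.
print(round(weekly_hours(30, 1000, 5), 1))

# Shirky-style example: 10,000 readers at 1 min each, assumed to be per month.
print(round(weekly_hours(0, 10_000, 1), 1))
```

Under these assumptions the first case comes out at roughly 23 hours a week, in the same ballpark as the ~25 quoted above, and the second at just under 40.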

This is the bane of fame! As you get famous and receive more attention in your social network, you can't respond to everyone. I can hardly figure out how Scoble can keep up his pace on his blog and Twitter! I think Tech bloggers are slightly better at managing how they use technology to their advantage and can keep up with a much larger audience than most people. Still, Shirky's book is a great place to understand how the attention economy is changing our social interactions.

That also makes me wonder if someone has seen any research on email/usenet usage patterns? Somehow that is one area that shares similarities and I am certain that usability experts would have done some such studies already. How do you think this would compare with Facebook and Social Media in general? There are people for whom Facebook is their default -- everything from email/IM to SN -- So it does not feel like they are trying to respond to requests from many different networks! I wonder what is the cost of context switching from one network to another and perhaps that is the reason why FriendFeed seems like a neat way to manage multiple information streams?

April 17, 2008

Notes from DEBSM Workshop: Inferring Privacy Information via Social Relations

Inferring Privacy Information via Social Relations
Wanhong Xu, Xi Zhou, Lei Li (presenter)

The presentation started with a neat quote: "Your social activities tell who you are". Social networks are part of everyday life and, for many of us, a primary way in which we keep in touch with friends and family. Privacy is a huge problem in such networks. Li provided some statistics from a recent survey of British social network users: about 62% are concerned about the security of their personal data, and 31% of the users falsify information to protect their identity. This is a huge figure and definitely shows the level of concern.

This paper was motivated by applications for social advertising (which is a 2.2B US market). One can target users based on location, age, gender, etc., and social advertising allows advertisers to choose their audience. The problem comes up where users do not wish to disclose too much of their information online. The proposed solution is to automatically infer such missing information.

The authors use an undisclosed dataset for this study. The assumption is that users fake personal information but not their activities. It is well known that in offline social networks a gender preference exists in friendship. However, in online social networks this is not true (Jure Leskovec, WWW 2008). The key question that the authors of this paper ask is: "Users may fake their personal information...But what about social activities?"

The insight that this paper offers is that certain group memberships give hints about a user's gender. Joining groups has a gender preference. So the way in which one can infer the gender is to use a bipartite graph of users->groups with missing gender information, using the relation between users and groups. One approach is to use the User*Group matrix and build a classifier. However, they found that Naive Bayes does not work very well. Many social groups in fact don't have any gender preference. This can really hurt the classifier accuracy. So one approach they propose is to choose the discriminating social groups, i.e. groups that are dominated by males or females. Once you restrict to the set of discriminating groups, Naive Bayes performs well. One major disadvantage that this technique suffers from is that the gender of users who don't join any of these groups cannot be predicted.

For users that can't be predicted, they propose the use of an iterative algorithm to combine discriminative social groups and results from the Naive Bayes classifier. Testing is performed by removing or hiding some of the data and then predicting the missing gender information.
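To make the idea concrete, here is a minimal sketch of Naive Bayes restricted to discriminating groups. It is my own illustration and not the authors' implementation: the toy data, the 0.8 dominance threshold, the Laplace smoothing, and the names `discriminating_groups` and `predict` are all assumptions, and the iterative step for unpredictable users is not covered.

```python
import math
from collections import defaultdict

def discriminating_groups(memberships, genders, threshold=0.8, min_members=2):
    """Keep groups where one gender is at least `threshold` of labeled members."""
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for user, groups in memberships.items():
        label = genders.get(user)
        if label is not None:
            for grp in groups:
                counts[grp][label] += 1
    keep = set()
    for grp, c in counts.items():
        total = c["M"] + c["F"]
        if total >= min_members and max(c["M"], c["F"]) / total >= threshold:
            keep.add(grp)
    return keep, counts

def predict(user_groups, keep, counts, prior):
    """Naive Bayes over kept groups; None if the user joined none of them."""
    feats = [g for g in user_groups if g in keep]
    if not feats:
        return None
    scores = {}
    for label in ("M", "F"):
        total = sum(counts[g][label] for g in keep)
        score = math.log(prior[label])
        for g in feats:
            # Laplace-smoothed likelihood of seeing this group given the label
            score += math.log((counts[g][label] + 1) / (total + len(keep)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: u5 and u6 have no disclosed gender.
memberships = {
    "u1": ["cars", "news"], "u2": ["cars", "news"],
    "u3": ["knitting", "news"], "u4": ["knitting", "news"],
    "u5": ["cars"], "u6": ["news"],
}
genders = {"u1": "M", "u2": "M", "u3": "F", "u4": "F"}

keep, counts = discriminating_groups(memberships, genders)
prior = {"M": 0.5, "F": 0.5}
print(sorted(keep))
print(predict(memberships["u5"], keep, counts, prior))
print(predict(memberships["u6"], keep, counts, prior))
```

The last call returns `None`, which is exactly the disadvantage noted above: a user who only joins gender-neutral groups gets no prediction.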

My concern here is that the authors might have been able to avoid this disadvantage if they had used SVD to map the groups to a set of lower dimensions; that way it would automatically "cluster" the groups based on whatever the discriminating factor is (in their case, the gender information). Secondly, while the authors had access to "verifiable" ground truth data, in the real world how do we know the influence of fake profiles on these discriminative groups?

Clustering Tags and Links Simultaneously

I have a large collection of blogs that I would like to cluster. For the blogs in my collection, I have the following information:

  • Text aggregated across all the posts
  • A dump of the homepage of the blog
  • Tags obtained from del.icio.us or "Feeds that matter"
  • And finally how the blogs link to each other (aggregated across all the posts)

If we want to cluster the blogs, one can take the following approaches:

  1. Cluster the blogs using only the text (Kmeans style)
  2. Cluster the blogs using only the link information (say, by performing NCut)
  3. First cluster all the tags (so you would group "tech", "technology", "geek" etc) and then group the blogs that are categorized in each of these clusters.
  4. Using a bipartite graph of the form <Blog,tag>. This is also known as co-clustering.

The problem with using the text alone is that the data can be huge (typically in gigabytes) and it is almost essential to use some dimensionality reduction technique like PCA. By performing NCut over the graph alone, you end up ignoring the text or the tag information that could be quite helpful. By clustering in only the tag space and then mapping the resulting clusters to the blog graph, we miss how the blogs might be interconnected. Finally, the co-clustering technique too misses the fact that the blogs might link to each other, and this information can be quite useful in grouping them together.

The question I was wondering about was how to cluster the blogs while using both the link and the tag information simultaneously. I think that the tags provide a reduced-dimensional representation of the text. The tag space is of a much lower dimension than the complete text of the collection, and it can be more efficient to cluster the blogs using tags than using the complete text.

Recently, I came across a very interesting piece of work that talked about "Classification Constrained Dimensionality Reduction". The idea here is that if we have a bunch of data points for which we know the true class information, we would like to classify the remaining points such that the available class information is utilized to guide the dimensionality reduction.

The above paper applies very well to our problem. One simple way to combine the two sources of information is to use the matrix W' shown here. C is a k*n matrix, where n is the number of blogs and k is the number of tags, while W is the n*n adjacency matrix and I is a k*k identity matrix. Typically, we can get away with using only the top 200-500 most popular tags. Now, if we perform NCut using the matrix W' instead of using W alone, we can combine/fold in the tag information with the link information. The parameter \beta controls how much importance we would like to give to the graph versus the tag information.
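Here is a small sketch of how the combined matrix could be assembled and how it changes the cost of a split. The block layout [[beta*W, C^T], [C, I]] is an assumption on my part: it is one layout consistent with the dimensions given above, since the figure with the exact matrix is not reproduced in the text. The toy graph and the function names are hypothetical, and the code only evaluates the normalized cut of a given split; a real NCut run would find the split from eigenvectors.

```python
def combined_matrix(W, C, beta):
    """Build an (n+k) x (n+k) block matrix [[beta*W, C^T], [C, I]] (assumed layout)."""
    n, k = len(W), len(C)
    top = [[beta * W[i][j] for j in range(n)] + [C[t][i] for t in range(k)]
           for i in range(n)]
    bottom = [[C[t][j] for j in range(n)] + [1.0 if t == s else 0.0 for s in range(k)]
              for t in range(k)]
    return top + bottom

def ncut(M, part_a):
    """Normalized cut value of the split (part_a, rest) on weight matrix M."""
    a = set(part_a)
    b = set(range(len(M))) - a
    cut = sum(M[i][j] for i in a for j in b)
    assoc_a = sum(M[i][j] for i in a for j in range(len(M)))
    assoc_b = sum(M[i][j] for i in b for j in range(len(M)))
    return cut / assoc_a + cut / assoc_b

# Toy example: 4 blogs in a chain 0-1-2-3, and 2 tags.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
C = [[1, 1, 0, 0],   # tag 0 is used by blogs 0 and 1
     [0, 0, 1, 1]]   # tag 1 is used by blogs 2 and 3

M = combined_matrix(W, C, beta=1.0)
links_only = ncut(W, [0, 1])       # split on the link graph alone
with_tags = ncut(M, [0, 1, 4])     # same split, with tag node 4 joining blogs 0 and 1
print(round(links_only, 3), round(with_tags, 3))
```

In this toy case the split that agrees with the tags becomes cheaper once the tag nodes are folded in, which is the intuition behind using W' instead of W. Lowering `beta` shifts more of the weight toward the tag information.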


For example, for a blog graph consisting of ~3000 blogs, I fetched the tags from del.icio.us and the link graph from the WWE/Buzzmetrics collection. The figure on the right shows 20 clusters if we use just the link information and perform NCut, and the same graph using the simultaneous clustering method. Notice how the 20-odd clusters seem nicely block diagonalized in comparison with the results using link information alone. It would be interesting to look at how these techniques compare with using the entire text from all the blogs. I am working on this and the intuition is: we might be able to show that using just tag and link information, we can do as well as (or better than) using the entire text. After all, tags are nothing but a placeholder for the "topic" of the blog.

An even more interesting extension would be to take the original paper that presented the classification constrained dimensionality reduction and apply the same technique to our problem, such that we can perform clustering using just a small set of labeled blogs and labels (matrix C).
