Using SentiWordNet for Multilingual Sentiment Analysis
Kerstin Denecke
Subjectivity analysis is about identifying the opinions, emotions, and thoughts expressed in a text. As opposed to factual statements (objective text), the goal is to identify:
- whether any sentiment or emotions are expressed in a piece of text
- the type of subjectivity, i.e., whether the opinions expressed are positive or negative
The question addressed in this paper is how to analyze subjectivity and sentiment in multilingual datasets. This was a really well-motivated paper, and it was great to hear this talk, especially after attending the recent tutorial by Dr. Jan Wiebe at ICWSM. One motivation is that a company that sells products in many different countries might want to know the sentiments of its clients, and a big difficulty in aggregating across different geographies is analyzing multilingual text.
Some of the difficulties in dealing with multilingual sentiment analysis are:
- missing language-specific lexical resources
- missing linguistic tools (POS taggers, parsers)
- missing training material
The solution proposed by Kerstin in this paper is to rely on resources available in English and use existing training sets (from English text) to identify subjectivity in foreign-language text. This is such a simple and effective idea that my first reaction was surprise! I am amazed that no one had proposed it before!
Kerstin's approach is to translate the documents into English and apply sentiment analysis to the translated version of each document to determine its polarity. One key technology used in this work is SentiWordNet, a lexical resource for opinion mining built on top of WordNet. It offers a triple of polarity scores (positivity, negativity, objectivity) for each WordNet synset. This is an amazing resource for anyone working on sentiment analysis. I was unaware of SentiWordNet when we were working on the TREC Blog Track in 2006; in retrospect, I wish I had known of it back then.
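For readers new to the resource, here is a minimal look at the per-synset triple, assuming the copy of SentiWordNet that ships with NLTK (the synset chosen is just an example):

```python
# Requires: nltk.download('wordnet'); nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

good = swn.senti_synset('good.a.01')  # one specific WordNet synset
print(good.pos_score(), good.neg_score(), good.obj_score())
# The three scores of a synset always sum to 1.0.
```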
Working with the translated text, the paper identifies each word together with its POS tag and looks it up in SentiWordNet to obtain the scores per synset. Next, a score is calculated for each word class (adjectives, nouns, verbs) in a given document. Finally, the score triples are averaged across the words in the sentences, and the document is assigned a polarity score using the triples across all the sentences.
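Here is a rough sketch of that scoring pipeline, assuming NLTK's tokenizer, POS tagger, and SentiWordNet interface; the function name, the Penn-to-WordNet tag mapping, and the exact averaging details are my reconstruction, not taken from the paper:

```python
from collections import defaultdict
import nltk
from nltk.corpus import sentiwordnet as swn

# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# plus the 'wordnet' and 'sentiwordnet' corpora.
PENN_TO_SWN = {"J": "a", "N": "n", "V": "v"}  # adjectives, nouns, verbs

def score_document(text):
    """Average (pos, neg, obj) triples per word class over a document."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for sentence in nltk.sent_tokenize(text):
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            wc = PENN_TO_SWN.get(tag[0])  # map Penn tag to a word class
            if wc is None:
                continue
            synsets = list(swn.senti_synsets(word.lower(), wc))
            if not synsets:
                continue
            # Average over all senses of this word+POS (no disambiguation).
            n = len(synsets)
            sums[wc][0] += sum(s.pos_score() for s in synsets) / n
            sums[wc][1] += sum(s.neg_score() for s in synsets) / n
            sums[wc][2] += sum(s.obj_score() for s in synsets) / n
            counts[wc] += 1
    return {wc: tuple(v / counts[wc] for v in sums[wc]) for wc in counts}
```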
Two approaches were used to identify the polarity (sketched after this list):
- a rule-based approach that thresholds the averaged scores
- machine learning (WEKA) to train a logistic regression classifier
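A minimal sketch of both decision rules follows; the threshold value is invented, and scikit-learn's logistic regression stands in for WEKA's:

```python
from sklearn.linear_model import LogisticRegression

def rule_based_polarity(pos, neg, threshold=0.1):
    """Threshold the averaged document scores; the cutoff is hypothetical."""
    if pos - neg > threshold:
        return "positive"
    if neg - pos > threshold:
        return "negative"
    return "objective"

# ML variant: each document becomes a feature vector built from the averaged
# triples per word class, e.g. [adj_pos, adj_neg, noun_pos, noun_neg, ...].
def train_polarity_classifier(X_train, y_train):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf
```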
Evaluation used the IMDb archive (1,000 positive and negative reviews in English) as training data; the test sets were MPQA (English) and Amazon movie reviews (German).
Some sources of errors mentioned in the presentation:
- Statistical translation errors
- "writing errors", which I think means slang usage (scaaaaade = what a pity or aaaawful)
- The system does not consider negated structures (not bad)
- mising resolution abiguities 14 synset for "bad"
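The last point is easy to demonstrate; the snippet below (my own illustration) shows how averaging the triple over every sense of "bad" dilutes the polarity signal relative to its dominant sense:

```python
from nltk.corpus import sentiwordnet as swn

senses = list(swn.senti_synsets('bad'))
print(len(senses))  # the talk mentioned 14 synsets for "bad"
avg_neg = sum(s.neg_score() for s in senses) / len(senses)
dominant_neg = swn.senti_synset('bad.a.01').neg_score()
print(avg_neg, dominant_neg)  # averaging across senses dilutes the negativity
```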
Their results indicate that the ML-based approach is better, obtaining an accuracy of around 0.65.