Using SentiWordNet for Multilingual Sentiment Analysis
Kerstin Denecke
Subjectivity analysis is about identifying the opinions, emotions, and thoughts expressed in text. As opposed to factual statements (objective text), the goal is to determine:
- whether any sentiment or emotion is expressed in a piece of text
- the type of subjectivity, i.e., whether the opinion expressed is positive or negative
The question addressed in this paper is how to analyze subjectivity and sentiment in multilingual datasets. This was a well-motivated paper, and it was great to hear this talk, especially after attending the recent tutorial by Dr. Janyce Wiebe at ICWSM. One motivation is that a company that sells products in many different countries might want to know the sentiments of its clients. A big difficulty in aggregating across different geographies is analyzing multilingual text.
Some of the difficulties in dealing with multilingual sentiment analysis are:
- missing language-specific lexical resources
- missing linguistic tools (POS taggers, parsers)
- missing training data
The solution proposed by Kerstin in this paper is to rely on resources available in English and use existing training sets (from English text) to identify subjectivity in foreign-language text. This is such a simple and effective idea that my first reaction was one of surprise! I am amazed that no one had proposed it before!
Kerstin's approach is to translate the documents to English and then apply sentiment analysis to the translated version of each document to determine its polarity. One key technology used in this work is SentiWordNet, a manually created lexical resource for opinion mining. It offers a triple of polarity scores (positive, negative, objective) for each WordNet synset. This is an amazing resource for anyone working on sentiment analysis. I was unaware of SentiWordNet when we were working on the TREC Blog track in 2006. In retrospect, I wish I had known of it back then.
Working with the translated text, the paper identifies each word with its POS tag and does a lookup in SentiWordNet to obtain the scores per synset. Next, a score is calculated for each word class (adjectives, nouns, verbs) in a given document. Finally, the score triples are averaged across the words in each sentence, and the document is given a polarity score using the triples across all the sentences.
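As a rough sketch of that lookup-and-average procedure (this is my own illustration, not the paper's code): the tiny lexicon below is a made-up stand-in for SentiWordNet, and for brevity the per-word-class and per-sentence grouping is collapsed into one flat average over all scored words.

```python
# Hypothetical mini-lexicon standing in for SentiWordNet:
# (word, POS) -> list of (pos, neg, obj) triples, one per synset.
# These particular scores are invented for illustration.
MINI_LEXICON = {
    ("good", "a"): [(0.75, 0.0, 0.25), (0.5, 0.0, 0.5)],
    ("bad", "a"): [(0.0, 0.625, 0.375)],
    ("movie", "n"): [(0.0, 0.0, 1.0)],
}

def word_score(word, pos_tag):
    """Average the (pos, neg, obj) triples over all synsets of a word."""
    synsets = MINI_LEXICON.get((word, pos_tag))
    if not synsets:
        return None
    n = len(synsets)
    return tuple(sum(t[i] for t in synsets) / n for i in range(3))

def document_score(tagged_words):
    """Average the per-word triples over all scored words in a document."""
    triples = [s for w, t in tagged_words if (s := word_score(w, t))]
    if not triples:
        return (0.0, 0.0, 1.0)  # no subjective evidence found
    n = len(triples)
    return tuple(sum(t[i] for t in triples) / n for i in range(3))

doc = [("good", "a"), ("movie", "n")]
print(document_score(doc))  # -> (0.3125, 0.0, 0.6875)
```

In practice the lookup would go through the SentiWordNet data files (or NLTK's interface to them) rather than a hand-written dict, but the averaging structure is the point here.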
To identify the polarity, two approaches were used:
- a rule-based approach that thresholds the scores
- machine learning (WEKA) to train a logistic regression classifier
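A minimal sketch of what the rule-based thresholding could look like on a document's (pos, neg, obj) triple; the comparison rule and the margin value are my own placeholders, since the paper's actual thresholds are not given in this post.

```python
def classify_by_threshold(triple, margin=0.1):
    """Hypothetical rule: compare positive vs. negative mass.

    The margin value is a placeholder, not the paper's threshold.
    """
    pos, neg, obj = triple
    if pos - neg > margin:
        return "positive"
    if neg - pos > margin:
        return "negative"
    return "neutral"

print(classify_by_threshold((0.45, 0.10, 0.45)))  # -> positive
```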
Evaluation was done using the IMDb archive (1000 positive and negative reviews in English), with MPQA (English) and Amazon movie reviews (German) as test sets.
Some sources of errors mentioned in the presentation:
- Statistical translation errors
- "Writing errors", which I think means slang usage and letter lengthening ("scaaaaade" = what a pity, or "aaaawful")
- The system does not consider negated structures (e.g., "not bad")
- Missing resolution of ambiguities (e.g., 14 synsets for "bad")
Their results indicate that the ML-based approach is better, obtaining an accuracy of around 0.65.
I missed putting this information in the post: the triple here corresponds to (positive, negative, objective). I am still not clear how objectivity relates to positivity or negativity. Take "good" -- it is clearly positive, but I don't quite see how that says anything about objectivity. I guess I will have to read more about SentiWordNet to be able to comment further.
Posted by: Akshay Java | April 17, 2008 at 02:57 PM
You should probably also check out
@inproceedings{Mihalcea:2007lr,
Abstract = {This paper explores methods for generating subjectivity analysis resources in a new language by leveraging on the tools and resources available in English. Given a bridge between English and the selected target language (e.g., a bilingual dictionary or a parallel corpus), the methods can be used to rapidly create tools for subjectivity analysis in the new language.},
Address = {Prague},
Author = {Rada Mihalcea and Carmen Banea and Janyce Wiebe},
Booktitle = {Proceedings of the Association for Computational Linguistics (ACL 2007)},
Month = {June},
Title = {Learning Multilingual Subjective Language via Cross-Lingual Projections},
Url = {http://www.cs.unt.edu/~rada/papers/mihalcea.acl07.pdf},
Year = {2007},
}
which looks at projecting English language resources onto foreign language text.
I'm concerned about the idea of translating foreign language documents and then performing English sentiment analysis on the result because of all the subtlety that might be lost in the translation process. I think a better approach is to build systems for each target language. That's a tough task though, and it is nice to see work looking at multiple languages.
You might also be interested in the NTCIR-6 Opinion Analysis Pilot Task (http://research.nii.ac.jp/ntcir/ntcir-ws6/opinion/index-en.html) - the data from that was just recently made available. It contains Japanese, Chinese (traditional), and English annotated opinion data: sentence-level opinionated (YES|NO), whether the sentence is relevant to the topic (YES|NO), opinion polarity (POSITIVE|NEGATIVE|NEUTRAL), and opinion holder (string).
Information on obtaining that data is available at: http://research.nii.ac.jp/ntcir/permission/ntcir-6/perm-en-OPINION.html
Posted by: David Kirk Evans | April 17, 2008 at 09:06 PM
David, thanks for the fantastic links and resources. I am amazed by how interesting and challenging this area is. After Dr. Wiebe's tutorial at ICWSM, I realized that I need to catch up on some of the work that has already been done over the years. It's great that we have now come to a point where there are some really interesting resources and annotated corpora available for subjectivity and multilingual subjectivity analysis.
You mention a great point about the subtlety of translation. During the presentation, I was wondering about the same thing but could not think of a concrete example that pins down this type of loss. Of course, I was thinking in terms of Hindi/Sindhi, which happens to be my native language. If you have a good example, it would be great if you could share it with us.
Thanks for stopping by.
Posted by: | April 17, 2008 at 09:19 PM
I haven't really looked into this much, because I'm working on just applying the same techniques in Japanese that I use in English, but it is an interesting question. Just for fun, I looked at a few sentences in Japanese that are marked by 3 out of 3 annotators as "opinionated".
Japanese: 藤原「キレる」という言葉を安易に使わない方が良いと思う。
Google's translation: Fujiwara "spewing" careful not to use the word is good, I think.
My translation: Fujiwara: I don't think you should toss around the word "pissed off"
Japanese: 雑誌やテレビ、映画という伝統的なメディアを持つワーナーと、情報販売企業として飛躍したAOLが一緒になると「売れればいい」という商業至上主義がより大きくなる可能性がある。
Google's translation: Magazines and television, the movie with Warner traditional media and information company as a sales leap by AOL to be together and "do売れれ" is a commercial supremacist could be bigger.
My translation: If Warner, which has traditional media holdings like magazines and movies, and AOL, which rose rapidly in the information-based business realm, merge then the possibility that they will adopt a principle focused mainly on sales is very high.
Japanese: 原告側の伊藤幹郎弁護士も「判決が『極秘文書』に一切触れていないのに驚く。
Google's translation: Mikio Itoo plaintiff's lawyer, "The ruling is not a classified document, not to mention surprised.
My translation: The plaintiff's lawyer Mikio Itoo also said "I'm surprised that the judge's ruling did not make any reference at all to the 'secret documents'.
(and the next sentence)
Japanese: 司法が行政に屈したと判断している」と語った。
Google's translation: Yielded judicial and administrative decisions have, "he said.
My translation: The Judicial system gave in to the administration in their decision" he said.
So those are really just four random samples I checked. With a language pair like Japanese and English there are problems with the quality of the translation system. I think with, for example, Spanish and English the approach would work well, but I'm not convinced that MT is good enough at the subtleties to work well with Japanese (the only other language I am familiar with).
Still, for document-level opinionated-or-not classification, the terms alone could be sufficient. I'm thinking more of sentence- or clause-level analysis, where you would want to identify the opinion holder and opinion target, and for something like that I think translation introduces too much noise at the current state of MT performance. (Given perfect MT systems, there is no problem; given better MT systems, there is probably some interesting research to do.)
Anyway, it was interesting for me to look at this small sample.
Oh, by the way, do you know if the above paper is available online? I would like to read it.
Posted by: David Kirk Evans | April 17, 2008 at 10:31 PM
I found your blog by chance. Thanks for listening to my talk. So are you working in this area, too?
Are you aware of any other opinion analysis corpus available?
Posted by: Kerstin | May 16, 2008 at 07:41 AM
Hi Kerstin, thanks for the comments. Your talk was really interesting. I have worked on the TREC Blog track opinion retrieval task, and that is one of the datasets I am aware of. Another dataset is the Epinions movie reviews. I am also aware that Dr. Janyce Wiebe's group has some datasets. You might also be interested in getting in touch with fellow commenter David Kirk Evans, who was kind enough to share some of the interesting English/Japanese opinion translations.
This is a very interesting field and I enjoy learning more about sentiment/subjectivity analysis. Multilingual sentiment mining is a hard problem and I am really glad that people like yourself and David are working on such problems. Thanks so much for your efforts!
Posted by: Akshay Java | May 16, 2008 at 04:06 PM