Sometimes you learn about a new mathematical technique so intriguing that it can only be described as "beautiful". Nonnegative matrix factorization is one such method that I did not know of until quite recently. The details of the method are available in the paper "Document Clustering Based On Non-negative Matrix Factorization" by Wei Xu, Xin Liu, and Yihong Gong.
The basic idea behind this method is that you want to factorize a matrix X into two smaller matrices U and V such that both U and V are nonnegative. This is achieved by minimizing the reconstruction error ||X - UV^T||^2 subject to the constraint that every element of U and V is >= 0.
So if we have a matrix X that represents a Term*Document matrix, it can be factorized into two matrices U and V such that U signifies the Term*ClusterAssociation matrix and V transpose signifies the ClusterAssociation*Document matrix. Now, since U and V are nonnegative, meaning all the elements in them are >= 0, we can identify the cluster to which a document belongs by taking that document's row of V and picking the dimension (cluster) with the highest value.
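To make this concrete, here is a minimal sketch of the standard multiplicative-update algorithm (Lee and Seung's rules for the Frobenius-norm objective); the nmf function and the toy term-document matrix below are my own illustrations, not code from the paper:

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    """Factorize a nonnegative matrix X (terms x documents) into
    U (terms x k) and V (documents x k) so that X ~ U @ V.T,
    by minimizing the Frobenius norm of the reconstruction error."""
    n_terms, n_docs = X.shape
    rng = np.random.default_rng(0)
    U = rng.random((n_terms, k))
    V = rng.random((n_docs, k))
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates; eps guards against division by zero.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V

# Toy term-document matrix: 4 terms, 5 documents.
# Documents 0-1 use the first two terms, documents 2-4 the last two.
X = np.array([[3., 2., 0., 0., 1.],
              [2., 3., 0., 1., 0.],
              [0., 0., 3., 2., 2.],
              [0., 1., 2., 3., 3.]])

U, V = nmf(X, k=2)
# Each document's cluster is the largest entry in its row of V,
# e.g. [0 0 1 1 1] up to a permutation of the cluster labels.
print(V.argmax(axis=1))
```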
Singular Value Decomposition (SVD) decomposes X into dense matrices that can contain negative elements, and it is not always intuitive what the basis vectors really signify. With NMF, however, the clusters are readily and directly available from the factorization. In addition, the sparsity of the nonnegative factors makes this technique quite appealing.
In the following example, I have clustered the CLASSIC3 dataset, a standard corpus frequently used for evaluating clustering methods. Notice how the three sub-collections CISI, MEDLINE, and CRANFIELD line up nicely along the three different axes.
I like this method for its simplicity and intuitive appeal, and I have been exploring its use in clustering blog/social data.
You might want to check out this paper from CHI '08, which applies PCA (a similar technique) to categorize blogs:
Word Usage and Posting Behaviors: Modeling Blogs with Unobtrusive Data Collection Methods
Adam D. I. Kramer, University of Oregon, USA
Kerry Rodden, Google, USA
We analyzed 1.8 million Blogger blogs using LIWC, resulting in a 5-factor structure of the psychologically relevant words used in weblogs: Rantiness, Metaphysics, Social Activity, Work, and Melancholy.
http://portal.acm.org/citation.cfm?id=1357054.1357230
Posted by: Daniel Tunkelang | May 16, 2008 at 09:54 PM
Hi Daniel, thanks for sharing this interesting paper. I will surely read it, as it looks like a significantly large-scale study of this kind.
You also brought up an important point regarding PCA, which is a similar technique. From the Wikipedia entry on NMF:
"t was later shown that some types of NMF are an instance of a more general probabilistic model called "multinomial PCA".[7] When NMF is obtained by minimizing the Kullback–Leibler divergence, it is in fact equivalent to another instance of multinomial PCA, probabilistic latent semantic analysis,[8] trained by maximum likelihood estimation. That method is commonly used for analyzing and clustering textual data and is also related to the latent class model.
It was also shown[9] that when the Frobenius norm is used as a divergence, NMF is equivalent to a relaxed form of K-means clustering: matrix factor W contains cluster centroids and H contains cluster membership indicators. This also justifies the use of NMF for data clustering."
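A quick way to see this K-means connection in practice is to cluster the same nonnegative data both ways; on well-separated data the two labelings typically agree up to a permutation of the cluster indices. This sketch uses scikit-learn's NMF and KMeans on made-up blob data, not the relaxed K-means formulation itself:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF

# Two well-separated blobs of nonnegative 2-D points (one point per row).
rng = np.random.default_rng(0)
X = np.vstack([rng.random((20, 2)) + [0.0, 3.0],
               rng.random((20, 2)) + [3.0, 0.0]])

# X ~ W H: rows of H play the role of centroids, rows of W the (soft)
# cluster memberships; taking the argmax over W hardens the assignment.
W = NMF(n_components=2, init='random', random_state=0,
        max_iter=500).fit_transform(X)
print(W.argmax(axis=1))

# K-means on the same data should produce the same grouping,
# possibly with the 0/1 labels swapped.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
```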
From my brief understanding of this subject, despite the similarities of PCA and K-means to NMF, NMF is far easier to implement (it is just iterative updates). I would like to learn more and gain a better understanding of this. Thanks again for sharing your thoughts with me.
Posted by: Akshay Java | May 17, 2008 at 12:01 AM
Daniel, I will add to your knowledge of new techniques. One that has been around for over a decade is called ICA (Independent Component Analysis); there are tons of peer-reviewed publications available in different journals. Just Google. Also, there are lots of free ICA codes available on the net. I have implemented a few variants of NNMF (non-negative matrix factorisation) for my own use. There are lots of different variants of NNMF, with different performance (error-rate) characteristics. There is also a kernel version of NNMF, called kernel-NNMF, which is more robust than its non-kernel counterpart. There is also an online version of NNMF (i.e., it updates itself in real time as new data arrives), as opposed to the batch (offline) version of NNMF. There are tons of these techniques available in the literature. They are all called dimensionality reduction or feature extraction techniques, because high-dimensional data can be run through these algorithms, which reduce the many dimensions (variables) to fewer while retaining much of the original information.
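For instance, here is a minimal sketch of ICA at work using scikit-learn's FastICA (the signals are made up for illustration): it unmixes two observed mixtures back into their independent sources.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent sources: a sine wave and a sawtooth.
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), 2 * (t % 1) - 1]

# Linearly mix them into two observed signals.
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
X = S @ A.T

# FastICA recovers the sources up to scale, sign, and permutation.
S_est = FastICA(n_components=2, random_state=0).fit_transform(X)
print(S_est.shape)  # (2000, 2): the two recovered source estimates
```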
Posted by: Falafulu Fisi | May 29, 2008 at 06:59 PM