It was an honor to have Dr. Jiawei Han on our campus today. He was hosted by the Information Systems
(IS) department as part of the distinguished speaker series. Dr. Han is a pioneer in the field of data mining and wrote the book that I used for the Data Mining class I took with Dr. Kargupta. In fact, Dr. Han even obliged by autographing my copy :-D. (I know that's geeky, but isn't that cool?)
Here is a summary of his talk on "Research Challenges in Data Mining".
Dr. Han narrated an interesting episode where Dr. Jim Gray brought in a hard disk and asked his students to guess its capacity... this was perhaps just 10 years back, and a 2GB disk evoked a jaw-dropping response from the audience. According to Dr. Jim Gray, "We are in the era of Data Science". Science used to be experimental (observe the stars), then it moved to computational (run simulations), but now we are really moving towards "Data Science" (gigabytes, terabytes and petabytes of data all around us).
The main themes of the talk were:
1) Pattern Mining: Classification by finding frequent, discriminative patterns.
Dr. Han described an interesting experiment in which they analyzed feature length vs. information gain. What they found was that you get more information by combining features (patterns of length 2, 3, 4), while single features don't give you much information. On the other hand, if a pattern gets too long, it is not very informative either. This has an important implication, especially in text mining, where using n-grams (bigrams and trigrams) has generally been found useful, but too long a pattern does not help much.
Another analysis presented was of information gain vs. pattern frequency: frequent patterns have more information gain than less frequent ones, but extremely frequent patterns have little gain (for example, stop words). The sketch below illustrates both effects.
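To make the intuition concrete, here is a small sketch of my own (not from the talk) that scores word patterns by information gain on a made-up toy corpus. Note how pairs of words beat single words, while an extremely frequent word like "the" carries no gain at all.

```python
import math
from itertools import combinations

def entropy(pos, neg):
    """Binary entropy of a (pos, neg) class split."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(docs, labels, pattern):
    """Information gain of 'pattern' (a set of words that must all occur)."""
    pos = sum(labels)
    base = entropy(pos, len(labels) - pos)
    # Split documents by whether they contain every word in the pattern.
    hit = [(d, y) for d, y in zip(docs, labels) if pattern <= d]
    miss = [(d, y) for d, y in zip(docs, labels) if not pattern <= d]
    def side_entropy(side):
        p = sum(y for _, y in side)
        return entropy(p, len(side) - p)
    n = len(docs)
    return base - (len(hit) / n) * side_entropy(hit) \
                - (len(miss) / n) * side_entropy(miss)

# Toy corpus: word sets with binary labels (1 = "databases", 0 = "networking").
docs = [
    {"the", "query", "index", "join"},      # 1
    {"the", "index", "btree", "join"},      # 1
    {"the", "query", "plan", "cost"},       # 1
    {"the", "packet", "router", "tcp"},     # 0
    {"the", "tcp", "congestion", "router"}, # 0
    {"the", "packet", "latency", "query"},  # 0
]
labels = [1, 1, 1, 0, 0, 0]

# Single words vs. word pairs: combined patterns carry more gain,
# while an extremely frequent word like "the" carries none.
vocab = sorted(set().union(*docs))
singles = [frozenset([w]) for w in vocab]
pairs = [frozenset(c) for c in combinations(vocab, 2)]
for pat in sorted(singles + pairs, key=lambda p: -info_gain(docs, labels, p))[:5]:
    print(set(pat), round(info_gain(docs, labels, pat), 3))
print("gain of 'the':", info_gain(docs, labels, frozenset(["the"])))
```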
The implication of these results is that, with SVMs and C4.5, classification accuracy can be improved through discriminative pattern/feature selection. Dr. Han suggests that "It is not the number of patterns, it's the quality of patterns. It's not 'the more the better'... instead, 'the better the better'." A few selected discriminative patterns are more useful for classification than a large number of features; a sketch of that kind of pipeline follows.
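In scikit-learn terms, such a pipeline might look like the sketch below. The corpus is invented, and the chi-squared score is just a convenient stand-in for whatever discriminative measure the actual work used; the point is that only a handful of high-quality n-gram patterns reach the classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus: database papers (1) vs. networking papers (0).
texts = [
    "query optimization with index structures",
    "join ordering and query plans",
    "btree index maintenance under updates",
    "tcp congestion control in routers",
    "packet scheduling and router buffers",
    "latency of tcp flows under congestion",
]
labels = [1, 1, 1, 0, 0, 0]

# Generate unigram and bigram patterns, keep only the 10 most
# discriminative ones, and train a linear SVM on that small set.
clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SelectKBest(chi2, k=10),
    LinearSVC(),
)
clf.fit(texts, labels)
print(clf.predict(["index structures for query plans"]))
```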
2) Stream Data Mining: Classification for rare events.
When you have multiple streams of information and cannot store all the data, but need to remember some of the history (in order to classify new stream data), it is helpful to use sampling techniques. Dr. Han discussed results from one of their recent papers on biased sampling. The idea here is that you don't throw away positive data (all of it is stored, since there are very few positive samples). Then, by selectively sampling the negative data, one can use equal amounts of positive and negative data for training a classifier. In the case of multiple streams, an ensemble approach to classification is suggested; a rough sketch of the sampling idea follows.
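Here is how I picture the biased-sampling idea (my own reconstruction, not the paper's algorithm): keep every rare positive example, reservoir-sample the abundant negatives within a fixed memory budget, and draw balanced training sets from the result.

```python
import random

class BiasedSampler:
    """Keep every (rare) positive example; reservoir-sample negatives
    so the stored data stays bounded in memory."""
    def __init__(self, neg_budget=1000):
        self.pos, self.neg = [], []
        self.neg_seen = 0
        self.neg_budget = neg_budget

    def observe(self, x, y):
        if y == 1:                      # rare class: keep everything
            self.pos.append(x)
        else:                           # reservoir sampling over negatives
            self.neg_seen += 1
            if len(self.neg) < self.neg_budget:
                self.neg.append(x)
            else:
                j = random.randrange(self.neg_seen)
                if j < self.neg_budget:
                    self.neg[j] = x

    def training_set(self):
        # Downsample stored negatives to match the positives 1:1,
        # giving a balanced set for one member of the ensemble.
        k = min(len(self.pos), len(self.neg))
        return self.pos[:k] + random.sample(self.neg, k), [1] * k + [0] * k

# Example use with a made-up stream of (features, label) pairs:
sampler = BiasedSampler(neg_budget=3)
stream = [([0.1], 0), ([0.2], 0), ([0.9], 1), ([0.3], 0), ([0.8], 1), ([0.4], 0)]
for x, y in stream:
    sampler.observe(x, y)
X, y = sampler.training_set()
print(X, y)
```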
3) Information Network Analysis: Distinguishing objects with identical names (e.g., author names, song names).
Name disambiguation is a very difficult problem. For example, there are 14 authors named Wei Wang on DBLP, some even with the same co-authors, same conferences, etc. In such cases, textual similarity cannot be used, since they are all basically in the same field. Dr. Han discussed how, using only information from DBLP, they make the links do the work. The basic idea is to group references according to their similarity: by performing random walks along different join paths, they can group the authors and resolve the name ambiguities in the graph. Often, in computer science and other fields, collaboration behavior within the same community shares some similarity, and this can be helpful in disambiguation.
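To illustrate the flavor of random walks along join paths, here is a toy sketch of my own on an invented mini-DBLP graph: references that reach the same co-authors and venues end up with similar walk distributions and can be grouped as the same person.

```python
import random
from collections import Counter

# Hypothetical toy graph: each "Wei Wang" reference links to its
# co-authors and venues (the join paths in a DBLP-like network).
edges = {
    "WeiWang#1": ["alice", "bob", "KDD"],
    "WeiWang#2": ["alice", "KDD", "ICDM"],
    "WeiWang#3": ["carol", "SIGCOMM"],
    "alice": ["WeiWang#1", "WeiWang#2"],
    "bob": ["WeiWang#1"],
    "carol": ["WeiWang#3"],
    "KDD": ["WeiWang#1", "WeiWang#2"],
    "ICDM": ["WeiWang#2"],
    "SIGCOMM": ["WeiWang#3"],
}

def walk_distribution(start, steps=2, n_walks=5000):
    """Empirical distribution of random-walk endpoints from 'start'."""
    ends = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(steps):
            node = random.choice(edges[node])
        ends[node] += 1
    return {n: c / n_walks for n, c in ends.items()}

def similarity(a, b):
    """Overlap of the two walk distributions: high overlap suggests
    the two references are the same person."""
    da, db = walk_distribution(a), walk_distribution(b)
    return sum(min(da.get(n, 0), db.get(n, 0)) for n in set(da) | set(db))

print(similarity("WeiWang#1", "WeiWang#2"))  # relatively high
print(similarity("WeiWang#1", "WeiWang#3"))  # near zero
```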
Another research paper that Dr. Han highlighted is on "Truth discovery with multiple conflicting information providers on the web". The goal here is to algorithmically determine which piece of information on the Web is more trustworthy. The proposed approach is to model it as a tripartite graph of (website, fact, object).
The assumption is that there is just one true fact per object, and that false 'facts' are introduced by random factors (not malicious intent). The domain was restricted to identifying the right book title, which is somewhat easier than something like political opinions. The basic idea behind their approach is that a site is trustworthy if it provides many facts with high confidence; once again, this can be modeled as a matrix computation problem of identifying trustworthiness.
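A toy version of that mutual reinforcement might look like the sketch below (my simplification; the actual paper uses a more careful probabilistic model and matrix formulation): site trust and fact confidence are updated in turn until they stabilize, and the title backed by the more trustworthy sites wins.

```python
# Hypothetical miniature of truth discovery: websites claim
# (object, fact) pairs; trust and confidence reinforce each other.
claims = {
    "siteA": {"book1": "Data Mining: Concepts and Techniques"},
    "siteB": {"book1": "Data Mining: Concepts and Techniques"},
    "siteC": {"book1": "Data Mining Concepts"},  # a garbled title
}

trust = {site: 0.5 for site in claims}  # uniform prior over sites
for _ in range(20):
    # Fact confidence: combine the trust of all sites asserting it
    # (probability that not every supporter is wrong).
    conf = {}
    for site, facts in claims.items():
        for obj, fact in facts.items():
            conf.setdefault((obj, fact), 1.0)
            conf[(obj, fact)] *= (1.0 - trust[site])
    conf = {k: 1.0 - v for k, v in conf.items()}
    # Site trust: average confidence of the facts it provides.
    for site, facts in claims.items():
        trust[site] = sum(conf[(o, f)] for o, f in facts.items()) / len(facts)

for (obj, fact), c in sorted(conf.items(), key=lambda kv: -kv[1]):
    print(obj, repr(fact), round(c, 3))
```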
Some of the other areas Dr. Han mentioned very briefly were:
4) Mining moving objects: unrestricted, restricted (cars), scattered (cell phones and RFID)
5) Spatio-temporal multimedia clustering with obstacles
6) Blog mining, etc.
7) Data cubes
8) Visual data mining
9) Biological data mining
10) Data mining for software engineering / bug mining
This was a great talk and very informative. I really liked how Dr. Han gave the basic intuition behind the methods and yet managed to give a broad overview of some of the challenges in the field.