
Machine Learning and Statistics


Omni



[Note]: Recently, I have been reading a relatively new textbook, "Data Mining: Practical Machine Learning Tools and Techniques" by I.H. Witten and E. Frank (second edition, 2005). It is an excellent textbook overall, with plenty of hands-on examples to try. I found one short commentary section in the book very insightful, and it may fill in some gaps in my comments on machine learning downstairs. In those comments, I somewhat downplayed the difference between machine learning and statistics to help beginners get a jump start in the field. I think Witten & Frank's comments are more rigorous, so I decided to type their paragraphs here (with my occasional notes inserted to clarify some concepts) for students who might be interested.

====================================================
Witten & Frank (2005)

1.4 Machine learning and statistics

What's the difference between machine learning and statistics? Cynics, looking wryly at the explosion of commercial interest (and hype) in this area, equate data mining to statistics plus marketing. In truth, you should not look for a dividing line between machine learning and statistics, for there is a continuum, and a multidimensional one at that, of data analysis techniques. Some derive from the skills taught in standard statistics courses, and others are more closely associated with the kind of machine learning that has arisen out of computer science. Historically, the two sides have had rather different traditions. If forced to point to a single difference of emphasis, it might be that statistics has been more concerned with testing hypotheses, whereas machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses. But this is a gross oversimplification: statistics is far more than hypothesis testing, and many machine learning techniques do not involve any searching at all.

In the past, very similar methods have developed in parallel in machine learning and statistics. One is decision tree induction. Four statisticians (Breiman et al. 1984; Note: Leo Breiman is also the inventor of the famous Random Forests method) published a book on "Classification and Regression Trees" in the mid 1980s, and throughout the 1970s and early 1980s Ross Quinlan (Note: a prominent machine learning researcher) was developing a system for inferring classification trees from examples. These two independent projects produced quite similar methods for generating trees from examples. A second area in which similar methods have arisen involves the use of nearest-neighbor methods for classification. These are standard statistical techniques that have been extensively adapted by machine learning researchers, both to improve classification performance and to make the procedure more efficient computationally.

But now the two perspectives have converged. The techniques we will examine in this book incorporate a great deal of statistical thinking. From the beginning, when constructing and refining the initial example set, standard statistical methods apply: visualization of data, selection of attributes (Note: also known as "variables" or "features"), discarding outliers, and so on. Most learning algorithms use statistical tests when constructing rules or trees and for correcting models that are "overfitted" in that they depend too strongly on the details of the particular examples used to produce them (Note: the "training set" or "training data"). Statistical tests are used to validate machine learning models (Note: a closely related validation technique is the "leave-one-out cross-validation" method) and to evaluate machine learning algorithms.
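
[Note]: To make the leave-one-out idea from my note above more concrete, here is a minimal Python sketch of my own (it is not from Witten & Frank). It validates a 1-nearest-neighbor classifier by holding out each example once, training on the rest, and checking the prediction. The function names and the toy data are made up for illustration, and math.dist needs Python 3.8 or later.

```python
import math


def nearest_neighbor_predict(train, query):
    """Return the label of the training point closest to `query` (1-NN)."""
    closest = min(train, key=lambda item: math.dist(item[0], query))
    return closest[1]


def leave_one_out_accuracy(data):
    """Hold out each example once, train on the rest, and score the prediction."""
    correct = 0
    for i, (features, label) in enumerate(data):
        training_set = data[:i] + data[i + 1:]  # everything except example i
        if nearest_neighbor_predict(training_set, features) == label:
            correct += 1
    return correct / len(data)


# Hypothetical toy data: (features, label) pairs forming two separated clusters.
toy_data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((4.0, 4.0), "b"), ((4.2, 3.9), "b"), ((3.8, 4.1), "b"),
]

accuracy = leave_one_out_accuracy(toy_data)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Because every example is scored by a model that never saw it, the estimate is less likely to reward a model that merely memorized its training data. The price is one round of training per example, which gets expensive on large data sets.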




Posted: 2006-02-07, 14:01:49