Decision Trees: Do Splitting Rules Really Matter?
Posted by Armando Brito Mendes | Filed under Uncategorized
A good piece on the criteria used to split subgroups in decision trees.
Do decision-tree splitting criteria matter? Contrary to popular opinion in data mining circles, our experience indicates that splitting criteria do matter; in fact, the difference between using the right rule and the wrong rule could add up to millions of dollars of lost opportunity.
So, why haven’t the differences been noticed? The answer is simple. When data sets are small and highly accurate trees can be generated easily, the particular splitting rule does not matter. When your golf ball is one inch from the cup, which club or even which end you use is not important, because you will sink the ball in one stroke either way. Unfortunately, the previous examinations of splitting-rule performance that found no differences did not look at data-mining problems with large data sets, where obtaining a good answer is genuinely difficult.
When you are trying to detect fraud, identify borrowers who will declare bankruptcy in the next 12 months, target a direct mail campaign, or tackle other real-world business problems that do not admit of 90+ percent accuracy rates (with currently available data), the splitting rule you choose could materially affect the accuracy and value of your decision tree. Further, even when different splitting rules yield similarly accurate classifiers, the differences between them may still matter. With multiple classes, you might care how the errors are distributed across classes. Between two trees with equal overall error rates, you might prefer a tree that performs better on a particular class or classes. If the purpose of a decision tree is to yield insight into a causal process or into the structure of a database, splitting rules of similar accuracy can yield trees that vary greatly in their usefulness for interpreting and understanding the data.
This paper explores the key differences between three important splitting criteria, Gini, Twoing, and Entropy, for classification problems with three or more target classes, and suggests how to choose the right one for a particular problem type. Although we can make recommendations as to which splitting rule is best suited to which type of problem, it is good practice to use several splitting rules and compare the results; you should expect different results from each. As you work with different types of data and problems, you will begin to learn which splitting rules typically work best for specific problem types. Nevertheless, you should never rely on a single rule alone; experimentation is always wise, and a minimal version of it is sketched below.
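As a concrete illustration of that advice (not part of the original article), here is a minimal sketch that fits the same tree under each splitting rule scikit-learn exposes and compares cross-validated accuracy. Note that scikit-learn supports only 'gini' and 'entropy'; Twoing is a CART/Salford Systems option not available there. The wine dataset is a stand-in for your own multi-class problem.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)  # a small 3-class problem

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)
    print(f"{criterion:8s} mean accuracy = {scores.mean():.3f}")
```

On easy data such as this, the two rules will often score similarly; the article's point is that on large, hard problems the gap can become material.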
Gini, Twoing, and Entropy
The best known rules for binary recursive partitioning are Gini, Twoing, and Entropy. Because each rule represents a different philosophy as to the purpose of the decision tree, each grows a different style of tree.
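The article does not give the formulas, but a minimal NumPy sketch of the three criteria, following the definitions in Breiman et al.'s CART monograph, may help. The function names (`gini`, `entropy`, `twoing`, `impurity_decrease`) and the example class distributions are illustrative, not from the original text.

```python
import numpy as np

def gini(p):
    # Gini impurity: 1 - sum_k p_k^2
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    # Shannon entropy: -sum_k p_k * log2(p_k), with 0*log(0) taken as 0
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def impurity_decrease(impurity, p_parent, p_left, p_right, frac_left):
    # Goodness of a split for an impurity-based rule (Gini or Entropy):
    # parent impurity minus the weighted impurity of the two children.
    return (impurity(p_parent)
            - frac_left * impurity(p_left)
            - (1.0 - frac_left) * impurity(p_right))

def twoing(p_left, p_right, frac_left):
    # Twoing criterion: (pL * pR / 4) * (sum_k |p(k|L) - p(k|R)|)^2.
    # It rewards splits that separate the classes into two balanced
    # "super-classes" and coincides with Gini only for two classes.
    frac_right = 1.0 - frac_left
    return frac_left * frac_right / 4.0 * np.sum(np.abs(p_left - p_right)) ** 2

# Example: a 3-class node split evenly into two children.
p_parent = np.array([0.4, 0.4, 0.2])
p_left   = np.array([0.7, 0.1, 0.2])   # class distribution in left child
p_right  = np.array([0.1, 0.7, 0.2])   # class distribution in right child
f_left   = 0.5                          # fraction of cases sent left

print("Gini decrease:   ", impurity_decrease(gini, p_parent, p_left, p_right, f_left))
print("Entropy decrease:", impurity_decrease(entropy, p_parent, p_left, p_right, f_left))
print("Twoing value:    ", twoing(p_left, p_right, f_left))
```

Because each rule scores the same candidate split differently, the split chosen at each node, and hence the shape of the whole tree, can differ from rule to rule.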
Tags: data analysis, data mining