In the last post, we discussed how artificial neural networks are modeled on a human brain. There are other algorithms too which have been inspired by the real world. For instance, we have the decision tree algorithm in machine learning which is founded on the basis of a tree. Such an algorithm is used for decision analysis. The algorithm is also frequently used in data mining to derive meaningful results. So, how exactly does it work?
To understand a decision tree, let’s suppose an elementary example in which we have a dataset of passengers of a ship. It is expected that a violent storm would cause the ship to get wrecked. Now the problem at hand is to predict the survival rate of a passenger based on their characteristics. Their attributes (also known as features) are mainly their age, spch (any spouse or children with them), and age.
As you can see, a decision tree is visually represented in an upside-down approach where instead of placing the root at the top, we present it at the top. The italicized text which shows our condition represents the internal node which divides the three into edges (branches). The branch which is not divided any further is referred to as the decision (leaf).
If you analyze the above example, then you can recognize the fact that all the relevant relations are easily viewable, thereby making for strong feature importance. This approach is also called a “learning decision tree from data”. The tree in our drawn example is categorized under a classification tree as it is used to classify the survival rate or fatality rate of a passenger tree. The other category is known as regression tree which is not too dissimilar, except for the fact that they deal with continuous values. The decision tree algorithms are broadly depicted as CAT— Classification and Regression Trees.
The growth of a decision tree depends upon its features (attributes or characteristics) and the conditions which are used to divide the tree with a clear intent about the stopping point of the three. Often, the growth of tree exceeds to arbitrary levels where some trimming is required for better results.
Recursive Binary Splitting
Recursive binary splitting uses a cost function to test all the features and the split points. For instance, in the above example from the root, all features were analyzed after which groups were formed from the divisions of the training data. Our example has 3 examples which mean we require 3 splits. Subsequently, we are going to compute the cost of each split in terms of accuracy. When the least costly split is discovered, which refers to the sex feature in our example then the feature is chosen. This approach of the algorithm is naturally recursive because more groups can undergo subdivisions by repeating the same process. Therefore, the algorithm falls into the category of greedy algorithms. This also means that the most effective classifier is the root node.
Let’s try to understand cost functions more closely while working with classification and regression. Cost functions always attempt to identify the branches which exhibit similarity. Therefore, it is certain that any input which is test data is bound to adhere to the specific path.
For instance, consider the real estate industry a problem requires the prediction of house prices. In this case, the decision tree initiates the splitting processing and analyzes all the features from the training data. It calculates the input of training data to generate mean for responses which are treated as a prediction for their respective groups. The function is performed for all the data points while a cost is generated for the candidate splits. In the end, the split which consumed the smallest cost is chosen.
Classification: G = sum(pk * (1 — pk))
To determine the quality of a split, the gini score is used which assesses the mixing of the response classes in the split’s groups. In the above equation, pk refers to the proportion in which a particular group has similar class inputs. Maximum purity of a class is achieved when it established that a group encompasses the same class’ inputs. In such a scenario the value of pk maybe either 0 or 1 while G remains 0. The worst purity is established when a node gets 50-50 split for a group’s classes. In binary classification, the values of pk and G would be 0.5 each in such a scenario.
Putting a Stop to Split
There is a point at which the split of the tree must be stopped. Generally, problems have several features which means that the resulting split is also huge, thereby creating a large tree. This is an undesired scenario because such trees raise over-fitting issues. One strategy for stopping a split is to define the lowest number for training inputs which are to be assigned for all the leaves. For instance, in the above example, we can take 15 passengers to reach a consensus or decision for survival or death whereas any leaf which is bombarded with less than 15 passengers is duly rejected. Conversely, you can also define the max depth for the model. Max depth is the longest path’s total length which exists between a root and a tree.
Pruning is used to enhance the performance of a decision tree. In pruning, any branch with low or weak feature importance is eliminated, thereby minimizing the tree’s complexity and boosting its predictive strength. Pruning can either initiate from the leaves or the root. In simpler scenarios, pruning begins from the leaves where it eliminates nodes that have the most popular class of that leaf unless they are not violating accuracy. This strategy is also called as reduced error pruning.
The above-mentioned knowledge is enough to complete your initial understanding of a decision tree. You can begin its coding by using Python’s Scikit-Learn library.