One of the most popular machine learning algorithms is the random forest. In nature, a forest is often judged by the number of its trees: a larger number suggests a healthier forest. The random forest algorithm works on a similar principle. As its name suggests, it builds a forest of trees; however, these trees are the decision trees we covered in an earlier post. Together, the decision trees allow a random forest to make accurate predictions.
The forest is called random because of the randomness of its components. Each tree in the forest is trained through a procedure known as bagging (bootstrap aggregating), in which every tree learns from a random sample of the training data drawn with replacement. Combining these independently trained trees improves the final result of the algorithm.
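To make the bagging idea concrete, here is a minimal sketch of how one bootstrap sample is drawn. The arrays and the seed are purely illustrative toy data, not part of any particular dataset or API.

```python
import numpy as np

# Bagging (bootstrap aggregating): each tree is trained on a random
# sample of the data, drawn WITH replacement, of the same size as
# the original training set.
rng = np.random.default_rng(seed=42)
X = rng.normal(size=(100, 4))        # 100 rows, 4 features (toy data)
y = rng.integers(0, 2, size=100)     # toy binary labels

indices = rng.integers(0, len(X), size=len(X))  # sample with replacement
X_boot, y_boot = X[indices], y[indices]

# Roughly 63% of the original rows appear in any one bootstrap sample;
# the remaining "out-of-bag" rows can be used to estimate a tree's error.
print(f"unique rows in this sample: {len(np.unique(indices))}")
```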
The random forest classifier is convenient to work with, and the algorithm adds extra randomness to the model while the trees are growing.
In most tree-based algorithms, the single best feature is chosen when a node is split. A random forest instead searches for the best feature within a random subset of the features. This is done in order to enhance the model's diversity.
Bear in mind that, out of all the features, only a random few are considered by the random forest during a node's split. The trees can be made even more random with a further technique: rather than searching for the best possible threshold for each feature, a random threshold is used.
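As a rough sketch of these two sources of randomness in scikit-learn: the max_features parameter limits how many features each split may consider, and the separate ExtraTreesClassifier goes one step further by also drawing the split thresholds at random. The dataset here is synthetic, and the parameter values are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Random forest: at every split, only a random subset of features
# (here at most 3) is evaluated for the best split point.
rf = RandomForestClassifier(max_features=3, random_state=0).fit(X, y)

# Extremely randomized trees: in addition to random feature subsets,
# the split thresholds themselves are chosen at random.
et = ExtraTreesClassifier(max_features=3, random_state=0).fit(X, y)

print(rf.score(X, y), et.score(X, y))
```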
To illustrate this point, take the example of a man named Bob who wants to dine at a new restaurant. Bob asks his friend James for suggestions. James asks Bob questions about his likes and dislikes in food, his budget, the area, and other relevant details. In the end, James uses Bob's answers to suggest a suitable restaurant. This whole process mirrors a decision tree.
Moving forward, Bob is not happy with James's recommendation and wants to explore other dining options, so he begins asking other friends for their recommendations. He goes to five more people who act much like James: they ask him relevant questions and each provides a recommendation. In the end, Bob reviews all the answers and picks the most common one. Here, each of Bob's friends acts as a decision tree, and their combined answers form a random forest.
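Here is a minimal sketch of the same idea in code, assuming scikit-learn and the bundled iris dataset: each fitted tree in the forest gives its own answer, and the most common answer wins. (Strictly speaking, scikit-learn averages the trees' class probabilities rather than counting hard votes, but the intuition is the same.)

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

sample = X[:1]  # one "Bob" asking for a recommendation

# Ask each tree (each "friend") for its individual answer.
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("individual votes:", votes)

# The forest's answer is, in effect, the most common vote.
print("majority vote:   ", Counter(votes).most_common(1)[0][0])
print("forest predicts: ", forest.predict(sample)[0])
```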
Determining the relative importance of each feature for a prediction is easy in a random forest, especially compared with other algorithms. If you are looking for a tool to calculate such values, consider scikit-learn, a machine learning library for Python.
After training, scikit-learn assigns an importance score to every feature and scales the results so that the sum of all the importances equals one.
Assessing feature importance is crucial when deciding whether to drop a feature. Usually, a feature is dropped when it adds little of value to the prediction, because carrying too many features raises the risk of over-fitting.
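A short sketch of how this looks with scikit-learn: after fitting, the feature_importances_ attribute holds one score per feature, scaled so the scores sum to one, and low-scoring features are candidates for removal. The 0.05 cutoff below is an arbitrary illustration, not a recommended value.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# One importance score per feature; the scores sum to 1.0.
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")

# Features whose importance falls below some chosen cutoff
# (0.05 here is purely illustrative) could be dropped to
# reduce the risk of over-fitting.
keep = forest.feature_importances_ >= 0.05
X_reduced = data.data[:, keep]
print("features kept:", X_reduced.shape[1])
```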
Hyper-parameters
One of the factors that makes a random forest stand out is that it often delivers good predictions even without hyper-parameter tuning. Like a decision tree, though, a random forest carries its own hyper-parameters.
In a random forest, the hyper-parameters are used either to increase the predictive power of the model or to make it faster. The following are some of scikit-learn's hyper-parameters; the sketch after this list shows them in use.
- n_jobs
It tells the engine how many processors it is allowed to use. A value of 1 means only a single processor can be used, while a value of -1 means there is no restriction.
- n_estimators
It is the number of trees the algorithm builds before taking the majority vote or averaging the predictions. A higher number of trees generally makes the predictions more stable and reliable, but it also slows the computation.
- random_state
It makes the model's output replicable: given the same training data, the same hyper-parameters, and a fixed value for random_state, the model will produce identical output every time.
- min_samples_leaf
It sets the minimum number of samples required at a leaf: an internal node is only split if each resulting leaf keeps at least this many samples.
- max_features
It sets the maximum number of features the forest is allowed to consider when splitting an individual node.
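The sketch below puts these hyper-parameters together in a scikit-learn RandomForestClassifier; the dataset is synthetic and the specific values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # more trees: more stable, but slower
    max_features="sqrt",   # features considered at each split
    min_samples_leaf=5,    # every leaf must hold at least 5 samples
    n_jobs=-1,             # use all available processors
    random_state=42,       # fixed seed for reproducible results
)
forest.fit(X, y)
print(forest.score(X, y))
```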
Now that you understand how the random forest algorithm works, you can begin implementing it in code. For more information about machine learning, check out the other blogs on our website.