What is a random forest?
Random forest is a supervised machine learning algorithm. It is one of the most widely used algorithms due to its accuracy, simplicity, and flexibility. The fact that it can be used for both classification and regression tasks, combined with its nonlinear nature, makes it highly adaptable to a wide range of data and situations.
The term “random decision forest” was first proposed in 1995 by Tin Kam Ho, who built trees on randomly chosen subsets of the features to create predictions. Leo Breiman and Adele Cutler then extended the algorithm, and Breiman’s 2001 paper established random forests as we know them today. This means this technology, and the math and science behind it, are still relatively new.
It is called a “forest” because it grows a forest of decision trees. The predictions from these trees are then merged together to produce a more accurate overall result. While a solo decision tree has a single outcome built from a narrow set of splits, the forest produces a more reliable result from a much larger number of groups and decisions. It has the added benefit of injecting randomness into the model by finding the best split among a random subset of features at each node. Overall, these properties create a model with the wide diversity that many data scientists favor.
What is a decision tree?
A decision tree is something that you probably use every day in your life. It is like asking your friends for recommendations on what sofa to buy. Your friends will ask you what is important to you. Size? Color? Fabric or leather? Based on those answers, you can narrow things down until you find the perfect sofa. A decision tree essentially asks a series of true-or-false questions that lead to a certain answer.
Each “test” (leather or fabric?) is called a node. Each branch represents the outcome of that choice (fabric). Each leaf node is the final label reached by that chain of decisions. In real scenarios, the tree splits the observations so that the resulting subgroups are as similar as possible to each other internally and as different as possible from the other groups.
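To make the node, branch, and leaf picture concrete, here is a minimal sketch using scikit-learn. The sofa features, labels, and threshold values are made up purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical sofa features: [width_cm, is_leather (1) or fabric (0), price_usd]
X = [[180, 1, 900], [220, 0, 600], [150, 1, 1200], [200, 0, 500],
     [160, 0, 450], [240, 1, 1500]]
y = ["buy", "skip", "buy", "skip", "skip", "buy"]   # made-up labels

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned nodes (tests), branches (outcomes), and leaves (labels)
print(export_text(tree, feature_names=["width_cm", "is_leather", "price_usd"]))
print(tree.predict([[190, 1, 1000]]))               # classify a new sofa
```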
The difference between decision trees and random forests
A random forest is a group of decision trees. However, there are some differences between the two. A decision tree builds a single set of rules from the data, which it uses to make decisions. A random forest randomly chooses features and observations, builds a forest of decision trees, and then averages out their results.
The theory is that a large number of uncorrelated trees will create more accurate predictions than one individual decision tree. This is because the many trees work together to protect each other from their individual errors and from overfitting.
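As a rough illustration of that theory, the sketch below compares a single decision tree against a forest of several hundred trees using scikit-learn on synthetic data. The exact scores will vary with the data and random seed; the point is the comparison, not the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data stands in for a real dataset
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("single tree:", single_tree.score(X_te, y_te))
print("forest of 300 trees:", forest.score(X_te, y_te))  # usually the higher score
```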
For a random forest to perform well, it needs three things:
- An identifiable signal in the data, so that models are not just guessing
- Predictions from the individual trees that have low levels of correlation with one another
- Features that have some level of predictive power: garbage in, garbage out (GIGO)
How is a random forest algorithm used in business?
There are many applications for a random forest in business settings. For example, a single decision tree might classify a data set related to wine, separating various wines into light or heavy wines.
The random forest creates many trees, making the end result predictions far more sophisticated. It can take the wines and have multiple trees, comparing prices, tannins, acidity, alcohol content, sugar, availability, and a whole range of other features. Then, averaging out the results, it can make predictions about the (arguably) best wines overall, based on a huge number of criteria.
In a business, a random forest algorithm could be used in a scenario where there is a range of input data and a complex set of circumstances. For instance, identifying when a customer is going to leave a company. Customer churn is complex and usually involves a range of factors: cost of products, satisfaction with the end product, customer support efficiency, ease of payment, how long the contract is, extra features offered, as well as demographics like gender, age, and location. A random forest algorithm creates decision trees for all of these factors and can accurately predict which of the organization’s customers are at high risk of churn.
Another complex example would be trying to predict which customers will spend the most in a year. A comprehensive range of variables and attributes is analyzed, and predictions can be made about whom the marketing department needs to target that year.
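As a hedged sketch of how either example might look in code: the file name and column name below are hypothetical, not a real schema, and for the spend-prediction case a RandomForestRegressor would simply replace the classifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical churn data: a CSV of customer features plus a 0/1 "churned" column
df = pd.read_csv("churn.csv")
X = pd.get_dummies(df.drop(columns="churned"))   # one-hot encode categorical columns
y = df["churned"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# Rank held-out customers by predicted churn risk
risk = pd.Series(model.predict_proba(X_te)[:, 1], index=X_te.index)
print(risk.sort_values(ascending=False).head())
```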
Bagging in decision forests
Bagging, otherwise known as bootstrap aggregation, lets each decision tree randomly sample from the dataset with replacement, creating very different trees. This means that instead of including all the available data, each tree trains on only some of the data, with some rows drawn more than once. These individual trees then make decisions based on the data they have and predict outcomes based only on those data points.
That means that in each random forest, there are trees that are trained on different data and have used different features in order to make decisions. This provides a buffer for the trees, protecting them from errors and incorrect predictions.
Each bootstrap sample contains only about two-thirds of the unique rows, so the remaining third is “out of bag” for that tree and can be used as a built-in test set.
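A quick numerical check of that two-thirds figure, as a NumPy sketch: for large samples the in-bag fraction converges to roughly 1 − 1/e ≈ 0.632.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)      # draw n row indices with replacement
unique_rows = np.unique(sample).size

print("in-bag fraction:", unique_rows / n)           # roughly 0.63
print("out-of-bag fraction:", 1 - unique_rows / n)   # roughly 0.37
```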
Benefits of random forest
Easy to measure relative importance
It is simple to measure the relative importance of a feature by looking at how much the nodes that use that feature reduce impurity, averaged across all the trees in the forest. Alternatively, comparing the model’s performance before and after permuting a variable gives a second measure of that variable’s importance.
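Both measures are available in scikit-learn; here is a sketch on one of its built-in datasets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance: how much each feature's splits reduce impurity
print(sorted(zip(forest.feature_importances_, X.columns), reverse=True)[:5])

# Permutation importance: drop in test score after shuffling one feature at a time
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(sorted(zip(perm.importances_mean, X.columns), reverse=True)[:5])
```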
Versatile
Because a random forest can be used for both classification and regression tasks, it is very versatile. It can handle binary, numerical, and categorical features with little need for transformation or rescaling, which makes it efficient on types of data that many other models would need extensive preprocessing to use.
Low risk of overfitting
A single decision tree can easily end up overfitting its training data. Random forests reduce that risk by building many different trees from random subsets of the data and features and combining their results, so as long as there are enough trees in the forest, the risk of overfitting stays low.
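A small sketch of that behavior on synthetic data: the exact numbers will differ from run to run, but test accuracy typically climbs and then flattens as trees are added, rather than collapsing from overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Noisy synthetic data (flip_y adds label noise) to make overfitting tempting
X, y = make_classification(n_samples=3000, n_features=20, n_informative=6,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for n_trees in (1, 10, 100, 500):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    print(n_trees, "trees -> test accuracy:", round(forest.score(X_te, y_te), 3))
```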
Highly accurate
Using many trees trained on significantly different subsets of the data and features makes random forests a highly accurate prediction tool.
Reduces time spent on data management
With traditional data processing, a large proportion of valuable time is spent cleansing data. A random forest minimizes that, as it deals well with missing data: predictions made from incomplete data are often close to those made from complete data. Outliers and nonlinear features also have little effect on its performance.
Random forest techniques can also help balance errors in populations and other unbalanced data sets. Left to simply minimize the overall error rate, the model favors the larger class, which ends up with a lower error rate while the smaller class gets a higher one; class weighting or balanced sampling can correct for this.
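In scikit-learn, for example, that balancing is requested explicitly. A sketch of two common options:

```python
from sklearn.ensemble import RandomForestClassifier

# Reweight classes inversely to their overall frequency in the training data
weighted = RandomForestClassifier(n_estimators=300, class_weight="balanced")

# Or reweight inside each tree's bootstrap sample instead of globally
subsample_weighted = RandomForestClassifier(n_estimators=300,
                                            class_weight="balanced_subsample")
```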
Quick training speed
Because each split considers only a subset of the features, random forests can quickly handle data sets with hundreds of different features, and the trees can be trained in parallel. Prediction is fast as well, and a trained forest can be saved and re-used in the future without retraining.
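A sketch of that save-and-reuse workflow with joblib and a built-in dataset:

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_jobs=-1 builds the trees in parallel across all available CPU cores
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0).fit(X, y)

dump(forest, "forest.joblib")    # persist the trained forest to disk
reused = load("forest.joblib")   # reload it later; no retraining needed
print(reused.predict(X[:3]))
```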
Challenges of random forest
Slower results
Because the algorithm builds many trees, it increases the sophistication and accuracy of the predictions. However, it slows down the process, as it is building hundreds or thousands of trees. This can make it poorly suited to real-time predictions.
Solution: Out-of-bag (OOB) evaluation can be used, where each tree is assessed on the roughly one-third of the data it never saw during training, avoiding a separate validation pass. The random forest process is also parallelizable, so the work can be split over many cores or machines, running far faster than it would on a single system.
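A sketch of both ideas in scikit-learn: out-of-bag scoring and parallel tree building on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=400,
    oob_score=True,   # score each row using only the trees that never saw it
    n_jobs=-1,        # build trees across all available CPU cores
    random_state=0,
).fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)
```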
Unable to extrapolate
A random forest prediction relies on an average of previously observed labels, so its range is bounded by the lowest and highest labels in the training data. This is only a problem when the training and prediction inputs have different ranges or distributions, but in those covariate-shift situations a different model should be used.
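A small sketch that makes the limitation visible: the forest below is trained on targets between 0 and 30, so it cannot predict a value anywhere near the true answer of 60.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel()                          # targets range from 0 to 30

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(forest.predict([[20]]))              # true value is 60; prediction stays near 30
```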
Low interpretability
Random forest models are something of a black box. They are not easily explainable, so it is difficult to understand how or why they arrived at a certain decision. This impenetrability means that the model simply has to be trusted and the outcomes accepted as they are.
Alternatives to random forest
Neural networks (NN)
A neural network is a collection of algorithms that work together to identify relationships in data. It is designed to loosely replicate how the human brain works, constantly adapting to the incoming data. It has significant benefits over random forests, as it can work with data beyond tables, such as audio and images, and it can be finely adjusted through many hyperparameters that can be tweaked to suit the data and the required outcome.
However, if the data being worked with is tabular only, it is usually best to stick with a random forest, as it is simpler and still yields good results. Neural networks can be labor- and compute-intensive, and for many problems the extra complexity simply is not required. For simple tabular data, neural networks and random forests tend to perform similarly in terms of predictions.
eXtreme gradient boosting (XGBoost)
eXtreme Gradient Boosting is often said to be more accurate and more powerful than random forests. Rather than building trees independently and averaging them, it uses gradient boosting (GBM): trees are added sequentially, each one trained on the residual errors of the trees before it, which steadily strengthens the model. This often means the prediction error is lower than with random forest predictions.
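A sketch using the xgboost package’s scikit-learn-style interface (assuming the package is installed); each added tree is fit to the residual errors of the current ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are added sequentially; learning_rate controls how big each boosting step is
booster = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=4)
booster.fit(X_tr, y_tr)
print("test accuracy:", booster.score(X_te, y_te))
```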
Linear models
Linear prediction models are one of the simplest machine learning techniques. They are used widely, and when applied to the right data set, they are a powerful prediction tool. They are also easy to interpret and do not have the black-box effect that a random forest does. However, they are significantly less flexible than a random forest, as they can only capture linear relationships. If the data is non-linear, a random forest will usually yield better predictions.
Cluster models
The top five clustering methods include fuzzy clustering, density-based clustering, partitioning methods, model-based clustering, and hierarchical clustering. All of them, in some form, work by grouping similar objects together into clusters. It is a technique used in many fields of data science and is part of data mining, pattern recognition, and machine learning. While you can use clustering within a random forest, it is also a standalone technique in its own right.
Cluster models are excellent at adapting to new examples and generalizing to different cluster sizes and shapes, and their results give valuable data insights.
However, clustering does not deal well with outliers or non-Gaussian distributions, and it can have scaling issues when processing a large number of samples. Finally, it can struggle when the number of features is high, sometimes even higher than the number of samples.
Support vector machine (SVM)
Support vector machines analyze data for classification and regression analysis. SVM is a robust prediction method that reliably builds models to categorize data points. These models rely on the idea of distance between points, although this may not be meaningful in all cases. While a random forest tells you the probability of belonging to a class in a classification problem, a support vector machine gives a distance to a boundary, so it still requires a conversion to turn that into a probability.
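A sketch of that conversion in scikit-learn: with probability=True, the SVM’s signed distance to the boundary is calibrated into a class probability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)
print("distance to boundary:", svm.decision_function(X_te[:1]))
print("calibrated probability:", svm.predict_proba(X_te[:1]))
```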
Bayesian network
A Bayesian network is a graphical model that shows variables, dependencies, and probabilities. Bayesian networks are used to build models from data, predict outcomes, detect anomalies, provide reasoning, run diagnostics, and assist with decision making. They are generative, modeling the joint probability distribution of a set of random variables, and they are best suited to complex queries over those variables.
Random forests, by contrast, are discriminative models and are generally used for classification. If causality is of interest, then Bayesian networks may be a better fit than random forests. If the data pool is large, random forests are preferable.
Future of random forest
Highly effective, adaptable, and agile, the random forest is the preferred supervised machine learning model for many data scientists. It offers a range of benefits that many alternatives do not and gives accurate predictions and classifications. However, it is largely unexplainable and can be somewhat of a black box in terms of how results are achieved.
In the future, it is possible that combining classic random forest with other strategies may make predictions more accurate and optimize results even further. Also, the leap to explainable machine learning is becoming more of a reality now and may help to uncover some of the mysteries of random forest predictions.