What is unsupervised learning?
Unsupervised learning is one of the ways that machine learning (ML) ‘learns’ from data. In unsupervised learning, the algorithm is given unlabelled data that it has to try to make sense of on its own. Supervised learning is where datasets are labelled, so there’s an answer key that the machine can measure its accuracy against. If machine learning were a child learning to ride a bike, supervised learning would be the parent running behind the bike holding it upright. Unsupervised learning is handing over the bike, patting the child on the head, and saying ‘good luck’.
The goal is to simply let the machine learn without assistance or prompts from data scientists. Along the way, it should also learn to adjust the results and groupings when there are more suitable outcomes. It’s allowing the machine to understand the data and process it how it sees fit.
Unsupervised learning is used for exploring unknown data. It can reveal patterns that may have been missed or examine large data sets that would be too big for a human to tackle.
How does unsupervised learning work?
To understand unsupervised learning, it helps to understand supervised learning first. If a computer were learning to identify fruit in a supervised learning setting, it would be given example images of fruit that were labelled. These labelled examples are the training data. For instance, the labels would say that bananas are long, curved and yellow, apples are round and red, while an orange is spherical, waxy-looking and orange. After enough time, the machine should be able to confidently identify which fruit is which, based on those descriptors. If presented with an apple, for instance, it would be able to confidently say it’s not orange coloured, therefore it’s not an orange, but also that it’s not yellow and long, therefore it’s not a banana. It’s round and red, so it’s an apple.
In contrast, unsupervised learning is when there is no categorization or labelling of the data at all. The machine will have no idea about the concept of fruit, so it cannot label the objects. However, it can group them together according to their colors, sizes, shapes and differences. The machine groups things together according to similarities, finding hidden structures and patterns in unlabelled data. There is no right or wrong way, and no teacher. There are no outcomes, just a pure analysis of the data.
Unsupervised learning uses a range of algorithms, which fall into two broad groups: clustering and association.
Clustering algorithms in unsupervised learning
Clustering is when objects are grouped together into subsets called clusters. This is one of the best ways to get an overview of the structure of your data. Objects within a cluster share similar characteristics: the method looks for groups with common characteristics and assigns each data point to the most relevant cluster.
Hierarchical clustering
This is when the machine groups things that go together in a cluster tree. All the data starts as one cluster, which then breaks down into smaller and smaller clusters. Each data point belongs to a cascading set of clusters, from the most generic down to the most specific and tightly grouped. The end result is that you can see how different sub-groups relate to one another and how far apart they are.
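Below is a minimal sketch of hierarchical (agglomerative) clustering in Python, using SciPy on a handful of invented 2-D points; the data and the choice of Ward linkage are illustrative assumptions, not a prescription.

```python
# A minimal hierarchical clustering sketch with SciPy (invented data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two loose groups
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Build the cluster tree bottom-up using Ward linkage
tree = linkage(points, method="ward")

# Cut the tree into 2 clusters; each point gets a cluster label
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```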
k-Means clustering
This algorithm separates unlabelled data into a set number (k) of distinct clusters. Each data point belongs to exactly one cluster, and the closer a point lies to the centre of its cluster, the stronger its association with that cluster. A larger k means smaller groups with more granularity. Once the algorithm has run, every data point carries the label of the cluster it was assigned to.
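As a short illustration, here is k-means with scikit-learn on a tiny invented dataset; the value k = 2 and the points themselves are assumptions made purely for the example.

```python
# A short k-means sketch with scikit-learn (invented data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.8], [7.9, 8.3]])

# k = 2: ask for two clusters; each point is assigned to exactly one
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster label for every data point
print(kmeans.cluster_centers_)  # the centre of each cluster
```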
Gaussian mixture models
Based on the normal bell curve distribution, these models assume each cluster is generated by its own normal distribution with its own centre and spread, revealing sub-populations within the overall data.
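A brief sketch of a Gaussian mixture model with scikit-learn, assuming two invented sub-populations drawn from different normal distributions:

```python
# A minimal Gaussian mixture model sketch (invented data).
import numpy as np
from sklearn.mixture import GaussianMixture

# Two sub-populations drawn from different normal distributions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1.0, size=(100, 2)),
               rng.normal(5, 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)        # estimated centre of each sub-population
print(gmm.covariances_)  # estimated spread (density) of each
```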
Fuzzy clusters
These clusters can overlap, so each data point can belong to as many clusters as are relevant as opposed to hard clustering where data points can only belong to one cluster. This is the Venn diagram of the unsupervised learning world.
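scikit-learn has no fuzzy c-means implementation, so the sketch below uses a Gaussian mixture’s soft membership probabilities as a stand-in: each point receives a degree of membership in every cluster rather than a single hard label. The overlapping data is invented for illustration.

```python
# Soft cluster membership as a stand-in for fuzzy clustering:
# predict_proba gives each point a degree of membership in every cluster.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(3, 1, size=(50, 2))])  # two overlapping groups

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
memberships = gmm.predict_proba(X)
print(memberships[:3])  # each row sums to 1, e.g. [0.85, 0.15]
```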
Clustering assumes relationships between groups, so it isn’t always the best approach for customer segmentation - these algorithms don’t treat data points as individuals. You need to apply further statistical methods to analyze the data in more depth.
Association in unsupervised learning
In association rule learning, the algorithm creates rules that find associations between data points. It finds the relationships between variables, identifying items that tend to occur together. For instance, basket analysis in supermarkets can show which items people tend to buy at the same time - soup and bread rolls, say. Or, when people buy a new home, what are they likely to also buy new? This approach is excellent for identifying marketing opportunities.
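As a toy illustration of the idea, the plain-Python sketch below counts pairwise co-occurrences across a few invented shopping baskets and reports support and confidence; real basket analysis would use a dedicated algorithm such as Apriori on far larger data.

```python
# A tiny market-basket sketch: invented baskets, pairwise rules only,
# purely to illustrate support and confidence.
from itertools import combinations
from collections import Counter

baskets = [
    {"soup", "bread rolls", "butter"},
    {"soup", "bread rolls"},
    {"soup", "milk"},
    {"bread rolls", "butter"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(pair for basket in baskets
                      for pair in combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                  # how often the pair occurs together
    confidence = count / item_counts[a]  # P(b in basket | a in basket)
    if support >= 0.5:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```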
Latent variable models in unsupervised learning
Latent variable modeling shows the relationship between observable variables (or manifest variables) and those that are hidden or unobserved (latent variables). Latent variable models are mostly used in data preprocessing and cleansing, to reduce the number of features in a dataset or to break the dataset down into multiple components.
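A minimal sketch of the idea using scikit-learn's FactorAnalysis: five observed features are invented so that they are driven by two hidden factors, which the model then recovers as a reduced representation.

```python
# A brief latent-variable sketch with factor analysis (invented data).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))    # 2 hidden (latent) variables
mixing = rng.normal(size=(2, 5))      # how they show up in 5 observed features
observed = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

fa = FactorAnalysis(n_components=2).fit(observed)
reduced = fa.transform(observed)  # each row re-expressed by its 2 latent factors
print(reduced.shape)              # (200, 2)
```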
When is unsupervised learning preferred for unknown trends and patterns?
Because the machine doesn’t know that there’s a ‘correct’ answer, letting it make decisions based on the information alone (that is, without bias from the scientist) allows data scientists to learn more about the data. Algorithms may find interesting or hidden structures in the data that weren’t previously visible to data scientists. The representations the algorithm builds to describe this structure are known as feature vectors.
Data often doesn’t come with labels, so unsupervised learning saves a data scientist from having to label everything, which can be a time-consuming and often insurmountable task. Unsupervised learning algorithms also allow for more complex processing: without labels, complicated relationships and clusters in the data can still be mapped. No data labelling also means no preconceived ideas, and no bias.
The best time to use unsupervised learning is when there is no pre-existing data or preferred outcomes. Unsupervised learning can identify features that are useful for categorizing unknown data sets - for example, when a business needs to determine the target market for a brand new product.
Unsupervised learning uses a technique called dimensionality reduction. This is when the machine treats much of the data as redundant and either removes dimensions or combines parts of the data where applicable. The resulting data compression saves both time and computing power.
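A minimal dimensionality-reduction sketch with PCA in scikit-learn; the ten-feature dataset is invented and deliberately redundant, so that three components capture almost all of the information.

```python
# A minimal dimensionality-reduction sketch using PCA (invented, redundant data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))                        # only 3 "real" signals
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 correlated features

pca = PCA(n_components=3).fit(X)
X_small = pca.transform(X)                   # compressed to 3 dimensions
print(pca.explained_variance_ratio_.sum())   # close to 1: little information lost
```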
Generative models are another strong point of unsupervised learning. A generative model learns the distribution of the data, so that new samples can be drawn from it. For instance, a generative model can be given a set of images and create a set of new, fabricated images based on what it has learned.
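As a very small-scale illustration (far simpler than an image generator), the sketch below fits a kernel density model to invented data and then draws brand-new samples from the learned distribution.

```python
# A minimal generative sketch: estimate the data's distribution with
# kernel density estimation, then sample new points from it (invented data).
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
real = rng.normal(loc=5.0, scale=2.0, size=(300, 2))

density = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(real)
fake = density.sample(5, random_state=0)  # new points following the learned distribution
print(fake)
```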