Unsupervised learning is a branch of machine learning that takes unlabeled data, meaning data that has not been previously classified or categorized, and tries to extract features and patterns from it on its own. Where supervised learning is analogous to taking a multiple-choice test with a predetermined answer key, unsupervised learning is analogous to taking an open-ended test whose questions have no answer key or objective means of determining a grade.
The general goal of unsupervised learning is to gain insight into a given data set by modeling the underlying structure or distribution of the data. Unsupervised learning algorithms aren't searching for concrete correct answers or specific outputs. Rather, they are handed a data set without explicit instructions on what to do with it, and they are left to find interesting structure in the data.
The different unsupervised learning models that exist can be categorized based on the ways in which they organize data.
- Clustering - Identifying and grouping similar data points together. Variations include k-means, k-means++, hierarchical clustering, density clustering, spectral clustering, and more.
- Data compression / dimensionality reduction - Identifying and removing redundant data from a data set so that most of the important information can be represented with only a fraction of the original content, saving on computing power and storage costs. These methods include nonlinear dimensionality reduction (NDR), non-negative matrix factorization (NMF), singular value decomposition (SVD), and principal component analysis (PCA), along with variations of PCA such as kernel PCA and sparse PCA.
- Anomaly detection - Identifying unusual patterns that do not conform to expected behavior. Several types of anomaly detection can be used for different purposes, including clustering-based methods such as k-means, support vector machine-based methods, and density-based methods such as k-nearest neighbors (k-NN) or local outlier factor (LOF).
- Association - Discovering interesting relationships between variables in large data sets. Well-known association algorithms include Apriori and Eclat.
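To make the clustering category above more concrete, the following is a minimal sketch of plain k-means (Lloyd's algorithm) written with only the Python standard library. The function name `kmeans`, the toy `points` list, and the choice to seed the centroids with the first `k` points are assumptions made for illustration; practical implementations usually seed randomly or with k-means++ and run on library-backed arrays. Note that the algorithm is never told which group a point belongs to; it only sees coordinates.

```python
def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters with a bare-bones Lloyd's algorithm."""
    # Assumption for reproducibility: seed the centroids with the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point is labeled with its nearest centroid
        # (squared Euclidean distance is enough for comparing distances).
        labels = [
            min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop early once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two visibly separate groups, with no labels supplied.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points should end up sharing one label and the last three the other. The cluster numbers themselves carry no meaning, which reflects the absence of an answer key described earlier: the output is a grouping, not a set of named classes.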
The following are methods used in unsupervised machine learning.