What is Machine Learning?
Machine learning is a branch of computer science that uses statistical techniques to give computer systems the ability to "learn" from data without being explicitly programmed. It is a really big buzzword in data science right now, so it is worth getting a good understanding of it. Machine learning can be broken into two general categories: unsupervised machine learning and supervised machine learning.
Unsupervised & Supervised Machine Learning
The basic difference between the two has to do with the data we feed into the model and what we want out of it. Unsupervised machine learning is when you have unlabeled data: you have a lot of variables, but you don't know in advance how the data points fall relative to your objective. Supervised machine learning, on the other hand, is when we have a series of variables and the data is labeled.
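To make the labeled versus unlabeled distinction concrete, here is a minimal sketch of what the two kinds of data might look like in Python. The field names, the values, and the helper split_features_and_labels are all made up for illustration.

```python
# Labeled data (supervised): each row pairs its variables with a known answer.
labeled_flowers = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "A"},
    {"petal_length": 4.5, "petal_width": 1.4, "species": "B"},
    {"petal_length": 5.9, "petal_width": 2.1, "species": "C"},
]

# Unlabeled data (unsupervised): variables only, no answer column.
unlabeled_customers = [
    {"fiber": 9, "sugar": 2},
    {"fiber": 1, "sugar": 12},
]


def split_features_and_labels(rows, label_key):
    """Separate the input variables from the label a supervised model learns to predict."""
    features = [{k: v for k, v in row.items() if k != label_key} for row in rows]
    labels = [row[label_key] for row in rows]
    return features, labels


features, labels = split_features_and_labels(labeled_flowers, "species")
```

A supervised model would learn from both features and labels, while an unsupervised model only ever sees rows like the ones in unlabeled_customers.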
Supervised Machine Learning
Suppose we have a data set about flowers with many variables describing each flower's features, and every row is labeled: we know whether it is species A, B, or C. What we want is to take another data set that has the same variables but no labels and push it through a decision tree to decide which species each flower is. That is the real goal here, and the decision tree below is the end result of this kind of supervised machine learning.
During training, the algorithm iterated through all the variables and the species each flower was labeled as. It optimized where the nodes sit in the tree and which values at each node distinguish between the species. It keeps iterating until the tree is optimized, or until we feel confident about the species it assigns based on the features we feed in.
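The training process described above can be sketched in a few lines of plain Python. This is only one possible approach, not necessarily how the tree in the figure was built: it assumes Gini impurity as the measure of how mixed a node is, tries each observed value as a candidate threshold, and uses made-up petal measurements for species A, B, and C.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every variable/threshold pair and keep the one with the lowest weighted impurity."""
    best = None  # (score, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels):
    """Recursively split until a node holds one species or cannot be split further."""
    split = best_split(rows, labels) if gini(labels) > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority species
    _, f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }


def predict(tree, row):
    """Walk from the root to a leaf, going left or right at each node."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up labeled data: (petal length, petal width) in cm
rows = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2),
        (4.5, 1.4), (4.7, 1.5), (4.1, 1.3),
        (5.9, 2.1), (6.1, 2.3), (5.6, 2.0)]
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

tree = build_tree(rows, labels)
```

Once the tree is built, an unlabeled flower is classified with a call such as predict(tree, (4.4, 1.4)).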
Regression in Supervised Machine Learning
Another example of supervised machine learning is regression. For this example, we are going to talk about sleep and happiness, because we all know that the less sleep you get, the less happy you generally are.
Here the x-axis shows the number of hours of sleep and the y-axis shows a happiness scale. On the left side of the graph below, you get very little sleep and are not very happy. You get happier the more sleep you get, until you reach a point where more sleep makes you less happy.
A good practical use case would be a physician who wants to predict how happy a patient will be based on how much sleep the patient reports getting on average. What we want to do is create a function from the data that we have, where each point here represents a person in our labeled data set.
The way this works is that the machine learns iteratively: it draws a line through the data points, checks how well the line represents the data, and adjusts it. It does this as many times as we tell it to, and eventually it arrives at the best representation of the data it can find.
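Here is a rough sketch of that iterative fitting using gradient descent, where each pass nudges the fitted function a little closer to the data points. Because happiness rises and then falls, the sketch fits a curve (a quadratic) rather than a straight line. That choice, the made-up data points, the rescaling of hours, and the learning rate are all assumptions for illustration.

```python
# Made-up data: average hours of sleep and a 0-10 happiness score per person
hours = [4, 5, 6, 7, 8, 9, 10, 11, 12]
happiness = [4.0, 6.2, 7.8, 8.7, 9.0, 8.7, 7.8, 6.2, 4.0]


def scale(h):
    """Map hours to roughly [-1, 1] so the gradient steps stay stable."""
    return (h - 8) / 4


def predict(params, h):
    """Quadratic model: happiness = a + b*u + c*u^2, where u is the scaled hours."""
    a, b, c = params
    u = scale(h)
    return a + b * u + c * u * u


def fit(hours, happiness, steps=5000, lr=0.1):
    """Gradient descent on mean squared error: redraw the curve a little on every step."""
    params = [0.0, 0.0, 0.0]
    n = len(hours)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for h, y in zip(hours, happiness):
            u = scale(h)
            err = predict(params, h) - y
            grad[0] += 2 * err / n
            grad[1] += 2 * err * u / n
            grad[2] += 2 * err * u * u / n
        # Move each parameter a small step against its gradient
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


params = fit(hours, happiness)
```

The physician from the example could then call predict(params, 6.5) to estimate the happiness score of a patient who averages six and a half hours of sleep.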
Clustering in Unsupervised Machine Learning
One good example of unsupervised machine learning is clustering. It is often used in marketing for customer segmentation, where you want to understand what your customers are like so you can run targeted campaigns. Let's take an example with cereal. What you see here is the final result of an unsupervised clustering algorithm. We sell a cereal and want to learn something about the customers who buy cereal, so we are going to look at two different variables.
The two variables we are going to look at are fiber and sugar. On the x-axis we have fiber and on the y-axis we have sugar. If this were a supervised example, each data point would already be labeled with its group, but we don't have that.
We had to find the groups without knowing in advance how they would fall. The line in the graph above is what separates the two groups. The clustering algorithm ran through the data several times, redrawing that line on each pass. Eventually it found the best place for the line, so that it represents the two groups well and we can target each one in a marketing campaign.
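One common way to do this kind of clustering is k-means, sketched below in plain Python. We don't know which algorithm produced the graph, so treat k-means as an assumption, along with the made-up fiber and sugar values and the choice of starting centers. In k-means, the dividing line is implied by the two cluster centers: every point belongs to whichever center is closer, so moving the centers redraws the line.

```python
def distance_sq(p, q):
    """Squared straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Average position of a group of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, centers, max_iters=100):
    """Alternate between assigning points to the nearest center and moving the centers."""
    for _ in range(max_iters):
        # Step 1: assign each customer to the closest center
        assignments = [
            min(range(len(centers)), key=lambda k: distance_sq(p, centers[k]))
            for p in points
        ]
        # Step 2: move each center to the average of its assigned customers
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            new_centers.append(mean(members) if members else centers[k])
        if new_centers == centers:
            break  # nothing moved, so the grouping has settled
        centers = new_centers
    return centers, assignments


# Made-up unlabeled data: (fiber, sugar) in grams for the cereal each customer buys
customers = [(9, 2), (10, 3), (8, 1), (11, 2),
             (1, 12), (2, 14), (1, 11), (3, 13)]

# Start from two arbitrary customers as the initial guesses for the centers
centers, assignments = k_means(customers, [customers[0], customers[-1]])
```

After it runs, assignments holds a 0 or 1 for each customer, which is the group a marketing campaign could target.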
Dimension Reduction
Another use of machine learning is what's called dimension reduction. Data scientists largely use it as a tool to reduce the number of dimensions (variables) in their models, so that the models run faster and work better. A lot of times you have tons of variables and it pays to cut them down.
One way to do this applies when things are really obvious: sometimes you just know that a variable isn't going to matter. Staying with the problem of finding the best variables for segmenting cereal customers, we might know that customers really don't care which caking agent the cereal contains, so we'll just throw that variable out.
We dropped it because we know it doesn't really matter. However, we might know that dye really does matter. Say the cereal uses three different colors of dye: blue, yellow, and red. We can reduce these three variables to two.
We hold nearly the same amount of information, but in two variables rather than three. The machine creates two new variables and repositions them in the three-dimensional space of the original variables until they give the best representation of the three kinds of dye.
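A standard technique for this is principal component analysis (PCA), sketched below in plain Python. Using PCA here is an assumption, as are the made-up dye proportions and the use of power iteration to find the new directions. In this sketch the three dye proportions always add up to 1, which is why two new variables can hold all of the information; with messier data, a little would be lost.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_component(cov, iters=500):
    """Power iteration: keep applying cov to a vector until it settles on the main direction."""
    v = [1.0, 0.5, 0.25]  # arbitrary non-zero starting direction
    for _ in range(iters):
        w = [dot(row, v) for row in cov]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Reduce rows of d variables to 2 new variables (principal components)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(2):
        v = top_component(cov)
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Remove the direction just found so the next pass finds the next-best one
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    # Each row's position along the two new directions
    projected = [[dot(c, v) for v in components] for c in centered]
    return means, components, projected


# Made-up data: (blue, yellow, red) dye proportions per cereal, each row sums to 1
dyes = [(0.8, 0.1, 0.1), (0.7, 0.2, 0.1), (0.6, 0.3, 0.1),
        (0.3, 0.6, 0.1), (0.2, 0.7, 0.1), (0.1, 0.6, 0.3)]

means, components, projected = pca_2d(dyes)
```

Each entry of projected is a pair of numbers that stands in for the original three, and means and components are what you would need to map those pairs back to dye proportions.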
Conclusion
That is the general overview of machine learning. We talked about unsupervised and supervised machine learning, where we have unlabeled data and labeled data to work with, respectively. We can do clustering, dimension reduction, regression, and decision trees, and there are a lot of other things that we can do with data using machine learning.
Posted by Gage Peake