Defining Supervised Learning
As the name suggests, supervised learning in Machine Learning is like having a supervisor while a machine learns to carry out tasks. In the process, we train the machine with data that is already labeled correctly. After this, new data is given to the machine, and we expect it to generate the correct outcome based on what it learned from the labeled data.
Practice makes perfect! The same applies to machines as well. As the number of practice samples increases, the outcomes produced by the machine generally become more accurate, provided the samples are representative and correctly labeled.
When do we use Supervised Learning?
Supervised learning builds predictive models that make reasonable predictions in response to new data. Hence, this technique is used if we have enough known (labeled) data for the outcome we are trying to predict. In supervised learning, an algorithm learns a function that maps the input to the output.
y = f(x)
Here, x and y are input and output variables, respectively.
The goal here is to approximate the mapping function so well that it predicts the output variable accurately when we give it a new input.
So far in this blog, we have learned what supervised learning is. Now, we will go further, exploring its types, advantages and disadvantages, and more. Let’s proceed.
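To make the idea of learning y = f(x) concrete, here is a minimal sketch that fits a straight line to a handful of labeled pairs using NumPy's least-squares polynomial fit. The data values are made up for illustration, and a straight line is only one possible choice of model.

```python
import numpy as np

# Labeled training data: inputs x with known outputs y.
# These values are made up for illustration (they follow y = 2x + 1).
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Learn f by least squares: find the slope and intercept that best fit the pairs.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def f(x):
    """Learned mapping from an input x to a predicted output y."""
    return slope * x + intercept

# Predict the output for an input that was not in the training data.
prediction = f(5.0)
print(round(prediction, 2))
```

The labeled pairs play the role of the supervisor: the model is never told the rule, only shown examples of correct outputs.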
Types of Supervised Learning
There are two types of supervised learning techniques, classification and regression. These are two vastly different methods. But how do we identify which one to use and when? Let’s get into that now.
Classification is used to identify labels or groups. This technique is used when the input data can be segregated into categories or can be tagged. If we have an algorithm that is supposed to label inputs as ‘male’ or ‘female,’ ‘cats’ or ‘dogs,’ etc., we can use the classification technique. Here, each input is assigned one of a finite set of discrete labels.
A practical example of the classification technique would be the categorization of a set of financial transactions as fraudulent or non-fraudulent. Some of the common applications built around this technique are recommendations, speech recognition, medical imaging, etc.
Classification is further divided into three types:
- Binary classification: The inputs are segregated into two groups.
- Multiclass/Multinomial classification: The inputs are classified into three or more groups, with each input belonging to exactly one group.
- Multilabel classification: A generalization of multiclass classification in which each input can be assigned several labels at once.
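The difference between the three types shows up in the shape of the labels. In the usual sense of the term, a multilabel sample may carry several labels at once (or none). The following is a small sketch in plain Python. All label names and the helper `to_indicator` are hypothetical and chosen only for illustration.

```python
# Binary: each sample gets one of two labels.
binary_labels = ["fraud", "not_fraud", "not_fraud"]

# Multiclass: each sample gets exactly one of three or more labels.
multiclass_labels = ["cat", "dog", "bird"]

# Multilabel: each sample may carry several labels at once, or none.
multilabel_labels = [{"news", "sports"}, {"news"}, set()]

def to_indicator(label_sets, classes):
    """Encode label sets as rows of 0/1 flags, one column per class."""
    return [[int(c in labels) for c in classes] for labels in label_sets]

classes = ["news", "sports", "weather"]
indicator = to_indicator(multilabel_labels, classes)
print(indicator)  # [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
```

A row with more than one flag set is what separates multilabel from multiclass, where exactly one class applies per sample.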
The regression technique predicts continuous, real-valued variables. For instance, here, the quantity being predicted could be ‘height’ or ‘weight.’ This technique finds its application in algorithmic trading, electricity load forecasting, and more. A common application that uses the regression technique is time series prediction. A single numeric output is predicted using the trained model.
When to use these techniques?
Picture a line drawn through a scatter of data points, with a different class on either side of it. The line distinguishes between these classes, which represent different things. In such cases, we use the classification method.
Regression, on the other hand, is used to predict the responses of continuous variables such as stock prices, house prices, the height of a 12-year-old girl, etc.
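The contrast can be sketched in a few lines of Python: a classifier returns a discrete label depending on which side of a line a point falls, while a regressor returns a number. The weights and coefficients below are made-up placeholders rather than values learned from real data, and both function names are hypothetical.

```python
def classify(point, w=(1.0, -1.0), b=0.0):
    """Return 'A' or 'B' by which side of the line w.x + b = 0 the point is on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "A" if score >= 0 else "B"

def predict_height_cm(age_years, slope=6.0, intercept=80.0):
    """Return a continuous value; the coefficients are illustrative, not real growth data."""
    return slope * age_years + intercept

print(classify((3.0, 1.0)))     # A  (discrete label)
print(classify((1.0, 3.0)))     # B  (discrete label)
print(predict_height_cm(12))    # 152.0  (continuous value)
```

In practice, the line and the coefficients would be learned from labeled training data instead of being fixed by hand.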
Advantages and Disadvantages of Supervised Learning
Next, let us check out the pros and cons of supervised learning. Let us begin with its benefits.
- In supervised learning, we can be specific about the classes used in the training data. That is, classifiers can be trained to distinguish each class from the others and to define clear decision boundaries.
- We get a clear picture of every class defined.
- The decision boundary can be expressed as a mathematical formula for classifying future inputs. Hence, it is not necessary to keep the training samples in memory.
- We have complete control over choosing the number of classes we want in the training data.
- It is easy to understand the process when compared to unsupervised learning.
- It is found to be most helpful in classification problems.
- It is often used to predict values from the known set of data and labels.
Now, let us move on to its disadvantages.
- Supervised learning cannot handle all complex tasks in Machine Learning.
- It cannot cluster data by figuring out its features on its own.
- The decision boundary could be overtrained (overfit). If the training set is very large, or the samples used to train the classifier are not good ones, the accuracy of our model can suffer. Hence, using the classification method for big data can be very challenging.
- The computation behind the training process consumes a lot of time, and so can the classification process. This can be a real test of our patience and the machine’s efficiency.
- This learning method cannot always cope with huge amounts of data, since the machine has to learn everything from the training data itself, which must first be labeled.
- If an input that doesn’t belong to any of the classes in the training data comes in, it may be assigned a wrong class label after classification.
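Two of the points above can be seen in a small sketch: a trained classifier can be reduced to a compact formula, and an input from an unknown class is still forced into one of the known classes. The example uses a nearest-centroid rule, chosen here only because it is simple. The data and class names are made up.

```python
import numpy as np

# Made-up 2-D training samples for two known classes.
train_x = np.array([[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]])
train_y = np.array(["cat", "cat", "dog", "dog"])

# "Training": reduce each class to its centroid. After this step the
# training samples are no longer needed to classify new inputs.
classes = np.unique(train_y)
centroids = np.array([train_x[train_y == c].mean(axis=0) for c in classes])

def predict(point):
    """Assign the class whose centroid is closest to the point."""
    distances = np.linalg.norm(centroids - np.asarray(point), axis=1)
    return classes[np.argmin(distances)]

print(predict([2.0, 1.0]))       # near the 'cat' samples
print(predict([100.0, -50.0]))   # unlike either class, yet still labeled as one of them
```

The second call shows the limitation: the model has no way to say "none of the above" unless we add an explicit rule for it, such as a distance threshold.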
Data is the new oil. Hence, it is put to use in a variety of ways. We will now discuss one such interesting case: Credit card fraud detection. Here, we will see how supervised learning comes into play.
Credit Card Fraud Detection
Let us use exploratory data analysis (EDA) to get some basic insights into fraudulent transactions. EDA is an approach used to analyze data to find out its main characteristics and uncover hidden relationships between different parameters.
Digitization of the financial industry has made it vulnerable to digital fraud. As e-payments increase, the competition to provide the best user experience also increases. This nudges various service providers to turn to Machine Learning, Data Analytics, and AI-driven methods to reduce the number of steps involved in the verification process.
Let us load some data on this into Python:
# importing packages
%matplotlib inline
import scipy.stats as stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
df = pd.read_csv('creditcard.csv')
We can use different algorithms to get the results. But which one to use here? Let us try out these algorithms one by one and understand what each can offer.
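# show three decimal places in pandas output
pd.set_option('display.precision', 3)
# summary statistics for the Time and Amount features
df.loc[:, ['Time', 'Amount']].describe()

# visualization of the Time feature
plt.figure(figsize=(10, 8))
plt.title('Distribution of Time Feature')
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df.Time, kde=True)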
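Since supervised learning depends on labels, a natural next step in the EDA is to look at how the labels are distributed and how a feature differs between them. The sketch below uses a tiny made-up table instead of the real file. The `Class` column (1 for fraud, 0 otherwise) is an assumption about how the labels are stored and may differ in your data.

```python
import pandas as pd

# Tiny made-up stand-in for the transactions table.
# 'Class' (1 = fraud, 0 = non-fraud) is an assumed label column name.
sample = pd.DataFrame({
    "Time": [0, 10, 20, 30, 40],
    "Amount": [12.5, 300.0, 7.2, 45.0, 999.9],
    "Class": [0, 0, 0, 0, 1],
})

# Share of each class: shows how balanced or imbalanced the labels are.
class_share = sample["Class"].value_counts(normalize=True)
print(class_share)

# Compare the Amount feature between the two classes.
amount_by_class = sample.groupby("Class")["Amount"].describe()
print(amount_by_class)
```

If one class turns out to be very rare, plain accuracy can be misleading, so it is worth checking the class shares before choosing an algorithm or a metric.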
Let’s Wind Up!
We gained an understanding of what supervised learning is by learning its definition, types, and functionality. Further, we analyzed its pluses and minuses so that we can decide when to use supervised learning in practice. In the end, we looked at a use case that helped us see how supervised learning works. It would be great if we could discuss this technique further. Share your comments below.