In this course, you’ll learn how Machine Learning is used in many key fields and industries.
For example, in the health care industry, data scientists use machine learning to predict whether a human cell that is believed to be at risk of developing cancer is benign or malignant.
As such, machine learning can play a key role in determining a person's health and welfare. You'll also learn about the value of decision trees, and how building a good decision tree from historical data helps doctors prescribe the proper medicine for each of their patients.
You'll learn how bankers use machine learning to make decisions on whether to approve loan applications. And you'll learn how to use machine learning for bank customer segmentation, a task that is usually hard to do manually for huge volumes of varied data. In this course, you'll see how machine learning helps websites such as YouTube, Amazon, or Netflix develop recommendations for their customers about various products or services, such as which movies they might be interested in seeing or which books to buy.
There is so much that you can do with Machine Learning! Here, you'll learn how to use popular Python libraries to build your models. For example, given an automobile dataset, we use the scikit-learn (sklearn) library to estimate the CO2 emissions of cars from their engine size or number of cylinders. We can even predict what the CO2 emissions will be for a car that hasn't been produced yet! And we'll see how the telecommunications industry can predict customer churn.

You can run and practice the code for all of these samples using the built-in lab environment in this course. You don't have to install anything on your computer or do anything in the cloud. All you have to do is click a button to start the lab environment in your browser. The code for the samples is already written in Python, in Jupyter notebooks, and you can run it to see the results, or change it to understand the algorithms better.

So, what will you be able to achieve by taking this course? Well, by putting in just a few hours a week over the next few weeks, you'll get new skills to add to your resume, such as regression, classification, clustering, scikit-learn, and SciPy. You'll also get new projects that you can add to your portfolio, including cancer detection, predicting economic trends, predicting customer churn, recommendation engines, and many more. You'll also get a certificate in machine learning to prove your competency, which you can share anywhere you like, online or offline, such as on LinkedIn profiles and social media. So let's get started.
Learning Objectives
In this course you will learn about:
- How statistical modeling relates to machine learning, and how the two compare.
- Real-life examples of Machine learning and how it affects society in ways you may not have guessed!
- In the labs: Use Python libraries for Machine Learning, such as scikit-learn.
Explore many algorithms and models:
- Popular algorithms: Regression, Classification, and Clustering
- Recommender Systems: Content-Based and Collaborative Filtering
- Popular concepts: Train/Test Split, Gradient Descent, and Mean Squared Error
- Get ready to do more learning than your machine!
Syllabus
Module 1 - Machine Learning
- Python for Machine Learning
- Supervised vs Unsupervised
- Lab & Review
Module 2 - Regression
- Simple Linear Regression
- Multiple Linear Regression
- Model Evaluation in Regression Models
- Non-Linear Regression
- Lab & Review
Module 3 - Classification
- K-Nearest Neighbors
- Decision Trees
- Evaluation Metrics in Classification
- Logistic Regression vs Linear Regression
- Support Vector Machine (SVM)
- Lab & Review
Module 4 - Clustering
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Lab & Review
Module 5 - Recommender Systems
- Content-Based Recommender Systems
- Collaborative Filtering
- Lab & Review
Major machine learning techniques
- Regression / Estimation: predicting continuous values
- Classification: predicting the item class/category of a case
- Clustering: finding the structure of data; summarization
- Associations: associating frequent co-occurring items/events
- Anomaly detection: discovering abnormal and unusual cases
- Sequence mining: predicting next events, e.g. click-stream (Markov Model, HMM)
- Dimension reduction: reducing the size of data (PCA)
- Recommendation systems: recommending items
What is supervised learning?
We "teach the model," then, with that knowledge, it can predict unknown or future instances.
Learning Objectives
In this lesson you will learn about:
- Regression Algorithms
- Model Evaluation
- Model Evaluation: Overfitting & Underfitting
- Understanding Different Evaluation Models
- Simple Linear Regression
Lab 1: Simple Linear Regression
1.0.0.1 About this Notebook
In this notebook, we learn how to use scikit-learn to implement simple linear regression. We download a dataset related to the fuel consumption and carbon dioxide emissions of cars. Then we split our data into training and test sets, create a model using the training set, evaluate the model using the test set, and finally use the model to predict an unknown value.
1.0.1 Importing Needed packages
In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline
1.1 Understanding the Data
1.1.1 FuelConsumption.csv
We have downloaded a fuel consumption dataset, FuelConsumption.csv, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. (Dataset source)
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- CO2 EMISSIONS (g/km) e.g. 182
1.2 Reading the data in
In [8]:
# file path
filepath = r'C:\Users\ML Learning\FuelConsumption.csv'
In [10]:
df = pd.read_csv(filepath)
# take a look at the dataset
df.head()
Out[10]:
| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
1.2.1 Data Exploration
Let's first do some descriptive exploration of our data.
In [11]:
# summarize the data
df.describe()
Out[11]:
| | MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|
| count | 1067.0 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 |
| mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
| std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.794510 | 3.485595 | 7.468702 | 63.372304 |
| min | 2014.0 | 1.000000 | 3.000000 | 4.600000 | 4.900000 | 4.700000 | 11.000000 | 108.000000 |
| 25% | 2014.0 | 2.000000 | 4.000000 | 10.250000 | 7.500000 | 9.000000 | 21.000000 | 207.000000 |
| 50% | 2014.0 | 3.400000 | 6.000000 | 12.600000 | 8.800000 | 10.900000 | 26.000000 | 251.000000 |
| 75% | 2014.0 | 4.300000 | 8.000000 | 15.550000 | 10.850000 | 13.350000 | 31.000000 | 294.000000 |
| max | 2014.0 | 8.400000 | 12.000000 | 30.200000 | 20.500000 | 25.800000 | 60.000000 | 488.000000 |
Let's select some features to explore further.
In [12]:
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
Out[12]:
| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|
| 0 | 2.0 | 4 | 8.5 | 196 |
| 1 | 2.4 | 4 | 9.6 | 221 |
| 2 | 1.5 | 4 | 5.9 | 136 |
| 3 | 3.5 | 6 | 11.1 | 255 |
| 4 | 3.5 | 6 | 10.6 | 244 |
| 5 | 3.5 | 6 | 10.0 | 230 |
| 6 | 3.5 | 6 | 10.1 | 232 |
| 7 | 3.7 | 6 | 11.1 | 255 |
| 8 | 3.7 | 6 | 11.6 | 267 |
We can plot each of these features:
In [13]:
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
Now, let's plot each of these features against the emissions, to see how linear their relationship is:
In [14]:
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
In [15]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
1.3 Practice
Plot CYLINDERS vs. the emissions, to see how linear their relationship is:
In [ ]:
# write your code here
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()
1.3.0.1 Creating train and test dataset
Train/Test Split involves splitting the dataset into training and testing sets that are mutually exclusive. You then train with the training set and test with the testing set. This provides a more accurate evaluation of out-of-sample accuracy, because the testing set is not part of the data that has been used to train the model, which is more realistic for real-world problems.

Since we know the outcome of each data point in the testing set, it is great to test with! And because this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly out-of-sample testing.
In [16]:
# randomly assign ~80% of the rows to the training set and the rest to the test set
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
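As a side note, scikit-learn's train_test_split can do the same job with an exact split ratio, whereas the random mask above yields only approximately 80/20. A minimal sketch (the random_state value is just an arbitrary seed for reproducibility):

from sklearn.model_selection import train_test_split

# exact 80/20 split of the selected features, with a fixed seed
train, test = train_test_split(cdf, test_size=0.2, random_state=42)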
1.3.1 Simple Regression Model
Linear Regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the residual sum of squares between the actual values of the dependent variable y in the dataset and the values predicted by the linear approximation of the independent variables x.
1.3.1.1 Train data distribution
In [17]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
Using the sklearn package to model the data:
In [18]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
Coefficients: [[38.82068777]]
Intercept: [126.0416523]
As mentioned before, the Coefficient and Intercept of a simple linear regression are the parameters of the fit line. Given that it is a simple linear regression with only 2 parameters, and knowing that the parameters are the intercept and slope of the line, sklearn can estimate them directly from our data. Notice that all of the data must be available in order to traverse it and calculate the parameters.
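Because simple linear regression has a closed-form solution, we can sanity-check sklearn's estimates directly with NumPy: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the sample means. A small sketch using the train_x and train_y arrays defined above:

# closed-form ordinary least squares for a single feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x = train_x.ravel()
y = train_y.ravel()
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # should match regr.coef_ and regr.intercept_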
We can plot the fit line over the data:
In [19]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
Out[19]:
Text(0, 0.5, 'Emission')
We compare the actual values and predicted values to calculate the accuracy of a regression model. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.
There are different model evaluation metrics; let's use MSE here to calculate the accuracy of our model based on the test set:

- Mean Absolute Error (MAE): the mean of the absolute values of the errors, $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. This is the easiest metric to understand, since it is just the average error.
- Mean Squared Error (MSE): the mean of the squared errors, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. It is more popular than MAE because the focus is geared more towards large errors: squaring weights larger errors much more heavily than smaller ones.
- Root Mean Squared Error (RMSE): the square root of MSE, $\sqrt{\text{MSE}}$, which has the same units as y.
- R-squared is not an error, but a popular metric for the accuracy of your model. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
In [20]:
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y_ , test_y) )
Mean absolute error: 21.72
Residual sum of squares (MSE): 826.48
R2-score: 0.71
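For reference, these metrics (plus RMSE, which was not printed above) are also available as ready-made functions in sklearn.metrics; a quick sketch using the same test arrays:

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("MAE : %.2f" % mean_absolute_error(test_y, test_y_))
mse = mean_squared_error(test_y, test_y_)
print("MSE : %.2f" % mse)
print("RMSE: %.2f" % np.sqrt(mse))  # same units as CO2EMISSIONS (g/km)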
Multiple Linear Regression
Model Evaluation in Regression Models
Evaluation Metrics in Regression
Non-Linear Regression
Lab 2:
1 Non-Linear Regression Analysis
If the data shows a curvy trend, then linear regression will not produce very accurate results compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. Let's learn about non-linear regressions and apply an example in Python. In this notebook, we fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014.
1.0.1 Importing required libraries
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Though linear regression is very good at solving many problems, it cannot be used for all datasets. First, recall how linear regression models a dataset: it models a linear relation between a dependent variable y and an independent variable x, using a simple equation of degree 1, for example y = 2x + 3.
In [2]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = 2*(x) + 3
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
#plt.figure(figsize=(8,6))
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Non-linear regressions model a relationship between independent variables $x$ and a dependent variable $y$ that results in a non-linear function fitted to the data. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree $k$ (the maximum power of $x$):

$$y = ax^3 + bx^2 + cx + d$$

Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example:

$$y = \log(x)$$

Or even something more complicated, such as:

$$y = \log(ax^3 + bx^2 + cx + d)$$
Let's take a look at a cubic function's graph.
In [3]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = 1*(x**3) + 1*(x**2) + 1*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
As you can see, this function has $x^3$ and $x^2$ terms, and its graph is not a straight line over the 2D plane. So this is a non-linear function.
Some other types of non-linear functions are:
1.0.2 Quadratic
$$Y = X^2$$
In [4]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = np.power(x,2)
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
1.0.3 Exponential
An exponential function with base $c$ is defined by

$$Y = a + b c^X$$

where $b \neq 0$, $c > 0$, $c \neq 1$, and $X$ is any real number. The base, $c$, is constant and the exponent, $X$, is a variable.
In [5]:
X = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
Y= np.exp(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
1.0.4 Logarithmic
The response $y$ is the result of applying a logarithmic map from the input $x$ to the output variable $y$. It is one of the simplest forms:

$$y = \log(x)$$

Please consider that instead of $x$, we can use $X$, which can be a polynomial representation of the $x$'s. In general form it would be written as:

$$y = \log(X)$$

(Note that np.log is undefined for non-positive inputs, which is why the cell below emits a RuntimeWarning over the negative half of the range.)
In [6]:
X = np.arange(-5.0, 5.0, 0.1)
Y = np.log(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
<ipython-input-6-0b6a51fd782b>:3: RuntimeWarning: invalid value encountered in log
Y = np.log(X)
1.0.5 Sigmoidal/Logistic
$$Y = a + \frac{b}{1 + c^{(X-d)}}$$
In [7]:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1-4/(1+np.power(3, X-2))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
2 Non-Linear Regression example
For an example, we're going to try to fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns: the first, a year between 1960 and 2014; the second, China's corresponding annual gross domestic product in US dollars for that year.
In [8]:
import numpy as np
import pandas as pd
# read the dataset from the local file
df = pd.read_csv(r"C:\Users\ML Learning\china_gdp.csv")
df.head(10)
Out[8]:
| | Year | Value |
|---|---|---|
| 0 | 1960 | 5.918412e+10 |
| 1 | 1961 | 4.955705e+10 |
| 2 | 1962 | 4.668518e+10 |
| 3 | 1963 | 5.009730e+10 |
| 4 | 1964 | 5.906225e+10 |
| 5 | 1965 | 6.970915e+10 |
| 6 | 1966 | 7.587943e+10 |
| 7 | 1967 | 7.205703e+10 |
| 8 | 1968 | 6.999350e+10 |
| 9 | 1969 | 7.871882e+10 |
2.0.1 Plotting the Dataset
This is what the datapoints look like. It kind of looks like either a logistic or an exponential function. The growth starts off slow, then from 2005 onward it becomes very significant, and finally it decelerates slightly in the 2010s.
In [9]:
plt.figure(figsize=(8,5))
x_data, y_data = (df["Year"].values, df["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
2.0.2 Choosing a model
From an initial look at the plot, we determine that the logistic function could be a good approximation, since it starts with slow growth, grows more quickly in the middle, and then levels off again at the end, as illustrated below:
In [10]:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Indepdendent Variable')
plt.show()
The formula for the logistic function is:

$$\hat{Y} = \frac{1}{1 + e^{-\beta_1(X - \beta_2)}}$$

$\beta_1$: controls the curve's steepness.

$\beta_2$: slides the curve along the x-axis.
2.0.3 Building The Model
Now, let's build our regression model and initialize its parameters.
In [11]:
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y
Let's look at a sample sigmoid line that might fit the data:
In [12]:
beta_1 = 0.10
beta_2 = 1990.0
#logistic function
Y_pred = sigmoid(x_data, beta_1 , beta_2)
#plot initial prediction against datapoints
plt.plot(x_data, Y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
Out[12]:
[<matplotlib.lines.Line2D at 0x1ebf4a84cd0>]
Our task here is to find the best parameters for our model. Let's first normalize our x and y:
In [13]:
# Let's normalize our data
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)
2.0.3.1 How do we find the best parameters for our fit line?

We can use curve_fit, which uses non-linear least squares to fit our sigmoid function to the data. It finds optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.

popt contains our optimized parameters.
In [14]:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
#print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
beta_1 = 690.451711, beta_2 = 0.997207
Now we plot our resulting regression model.
In [15]:
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
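With the fitted parameters in hand, the model can also extrapolate. A quick sketch (2015 is just an illustrative year; the prediction must reuse the same normalization constants as the training data):

# predict GDP for a year outside the sample
year = 2015
x_new = year / max(x_data)                      # normalize the input like xdata
gdp_pred = sigmoid(x_new, *popt) * max(y_data)  # undo the y normalization
print("Predicted GDP for %d: %.3e USD" % (year, gdp_pred))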