In this course, you’ll learn how Machine Learning is used in many key fields and industries.
For example, in the health care industry, data scientists use machine learning to predict whether a human cell that is believed to be at risk of developing cancer is benign or malignant.
As such, machine learning can play a key role in determining a person's health and welfare. You'll also learn about the value of decision trees, and how building a good decision tree from historical data helps doctors prescribe the proper medicine for each of their patients.
You'll learn how bankers use machine learning to make decisions on whether to approve loan applications. And you'll learn how to use machine learning for bank customer segmentation, a task that is usually hard to do manually for huge volumes of varied data. In this course, you'll see how machine learning helps websites such as YouTube, Amazon, or Netflix develop recommendations for their customers about various products or services, such as which movies they might be interested in seeing or which books to buy.
There is so much that you can do with Machine Learning! Here, you'll learn how to use popular Python libraries to build your models. For example, given an automobile dataset, we use the scikit-learn (sklearn) library to estimate the CO2 emissions of cars from their engine size or number of cylinders. We can even predict what the CO2 emissions will be for a car that hasn't been produced yet! And we'll see how the telecommunications industry can predict customer churn.

You can run and practice the code for all of these samples using the built-in lab environment in this course. You don't have to install anything on your computer or do anything in the cloud. All you have to do is click a button to start the lab environment in your browser. The code for the samples is already written in Python, in Jupyter notebooks, and you can run it to see the results, or change it to understand the algorithms better.

So, what will you be able to achieve by taking this course? Well, by putting in just a few hours a week over the next few weeks, you'll get new skills to add to your resume, such as regression, classification, clustering, scikit-learn, and SciPy. You'll also get new projects that you can add to your portfolio, including cancer detection, predicting economic trends, predicting customer churn, recommendation engines, and many more. You'll also get a certificate in machine learning to prove your competency, which you can share anywhere you like, online or offline, such as on LinkedIn profiles and social media. So let's get started.
Learning Objectives
In this course you will learn about:
- How statistical modeling relates to machine learning, and how the two compare.
- Real-life examples of Machine learning and how it affects society in ways you may not have guessed!
- In the labs: Use Python libraries for Machine Learning, such as scikit-learn.
Explore many algorithms and models:
- Popular algorithms: Regression, Classification, and Clustering
- Recommender Systems: Content-Based and Collaborative Filtering
- Popular concepts: Train/Test Split, Gradient Descent, and Mean Squared Error
- Get ready to do more learning than your machine!
Syllabus
Module 1 - Machine Learning
- Python for Machine Learning
- Supervised vs Unsupervised
- Lab & Review
Module 2 - Regression
- Simple Linear Regression
- Multiple Linear Regression
- Model Evaluation in Regression Models
- Non-Linear Regression
- Lab & Review
Module 3 - Classification
- K-Nearest Neighbors
- Decision Trees
- Evaluation Metrics in Classification
- Logistic Regression vs Linear Regression
- Support Vector Machine (SVM)
- Lab & Review
Module 4 - Clustering
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Lab & Review
Module 5 - Recommender Systems
- Content-Based Recommender Systems
- Collaborative Filtering
- Lab & Review
Major machine learning techniques
- Regression / Estimation: predicting continuous values
- Classification: predicting the item class/category of a case
- Clustering: finding the structure of data; summarization
- Associations: associating frequent co-occurring items/events
- Anomaly detection: discovering abnormal and unusual cases
- Sequence mining: predicting next events, e.g. click-stream (Markov Model, HMM)
- Dimension reduction: reducing the size of data (PCA)
- Recommendation systems: recommending items
What is supervised learning?
We "teach the model," then, with that knowledge, it can predict unknown or future instances.
Learning Objectives
In this lesson you will learn about:
- Regression Algorithms
- Model Evaluation
- Model Evaluation: Overfitting & Underfitting
- Understanding Different Evaluation Models
- Simple Linear Regression
Lab 1: Simple Linear Regression
1.0.0.1 About this Notebook
In this notebook, we learn how to use scikit-learn to implement simple linear regression. We download a dataset related to the fuel consumption and carbon dioxide emissions of cars. Then we split our data into training and test sets, create a model using the training set, evaluate the model using the test set, and finally use the model to predict an unknown value.
1.0.1 Importing Needed packages
In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline
1.1 Understanding the Data
1.1.1 FuelConsumption.csv
We have downloaded a fuel consumption dataset, FuelConsumption.csv, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. (Dataset source)
- MODELYEAR e.g. 2014
- MAKE e.g. Acura
- MODEL e.g. ILX
- VEHICLE CLASS e.g. SUV
- ENGINE SIZE e.g. 4.7
- CYLINDERS e.g. 6
- TRANSMISSION e.g. A6
- FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
- FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
- FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
- CO2 EMISSIONS (g/km) e.g. 182
1.2 Reading the data in
In [8]:
# file path
filepath = r'C:\Users\ML Learning\FuelConsumption.csv'
In [10]:
df = pd.read_csv(filepath)
# take a look at the dataset
df.head()
Out[10]:
| | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
1.2.1 Data Exploration
Let's first do some descriptive exploration of our data.
In [11]:
# summarize the data
df.describe()
Out[11]:
| | MODELYEAR | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
|---|---|---|---|---|---|---|---|---|
| count | 1067.0 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 | 1067.000000 |
| mean | 2014.0 | 3.346298 | 5.794752 | 13.296532 | 9.474602 | 11.580881 | 26.441425 | 256.228679 |
| std | 0.0 | 1.415895 | 1.797447 | 4.101253 | 2.794510 | 3.485595 | 7.468702 | 63.372304 |
| min | 2014.0 | 1.000000 | 3.000000 | 4.600000 | 4.900000 | 4.700000 | 11.000000 | 108.000000 |
| 25% | 2014.0 | 2.000000 | 4.000000 | 10.250000 | 7.500000 | 9.000000 | 21.000000 | 207.000000 |
| 50% | 2014.0 | 3.400000 | 6.000000 | 12.600000 | 8.800000 | 10.900000 | 26.000000 | 251.000000 |
| 75% | 2014.0 | 4.300000 | 8.000000 | 15.550000 | 10.850000 | 13.350000 | 31.000000 | 294.000000 |
| max | 2014.0 | 8.400000 | 12.000000 | 30.200000 | 20.500000 | 25.800000 | 60.000000 | 488.000000 |
Let's select some features to explore further.
In [12]:
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
Out[12]:
| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|
| 0 | 2.0 | 4 | 8.5 | 196 |
| 1 | 2.4 | 4 | 9.6 | 221 |
| 2 | 1.5 | 4 | 5.9 | 136 |
| 3 | 3.5 | 6 | 11.1 | 255 |
| 4 | 3.5 | 6 | 10.6 | 244 |
| 5 | 3.5 | 6 | 10.0 | 230 |
| 6 | 3.5 | 6 | 10.1 | 232 |
| 7 | 3.7 | 6 | 11.1 | 255 |
| 8 | 3.7 | 6 | 11.6 | 267 |
We can plot each of these features:
In [13]:
viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
Now, let's plot each of these features against the emissions, to see how linear their relationship is:
In [14]:
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
In [15]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
1.3 Practice
Plot CYLINDERS vs. the emissions, to see how linear their relationship is:
In [ ]:
# write your code here
plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()
1.3.0.1 Creating train and test dataset
Train/Test Split involves splitting the dataset into training and testing sets that are mutually exclusive. You then train with the training set and test with the testing set. This provides a more accurate evaluation of out-of-sample accuracy, because the testing set is not part of the data that has been used to train the model, which is more realistic for real-world problems.

Since we know the outcome of each data point in the testing set, it is great to test with! And because this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it is truly out-of-sample testing.
In [16]:
# randomly assign ~80% of the rows to the training set and the rest to the test set
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]
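As a side note, scikit-learn's train_test_split can do the same job with an exact split ratio, whereas the random mask above yields only approximately 80/20. A minimal sketch (the random_state value is just an arbitrary seed for reproducibility):

from sklearn.model_selection import train_test_split

# exact 80/20 split of the selected features, with a fixed seed
train, test = train_test_split(cdf, test_size=0.2, random_state=42)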
1.3.1 Simple Regression Model
Linear Regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the residual sum of squares between the actual values of the dependent variable y in the dataset and the values predicted by the linear approximation of the independent variables x.
1.3.1.1 Train data distribution
In [17]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
Using the sklearn package to model the data:
In [18]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
Coefficients: [[38.82068777]]
Intercept: [126.0416523]
As mentioned before, the Coefficient and Intercept of a simple linear regression are the parameters of the fit line. Given that it is a simple linear regression with only 2 parameters, and knowing that the parameters are the intercept and slope of the line, sklearn can estimate them directly from our data. Notice that all of the data must be available in order to traverse it and calculate the parameters.
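Because simple linear regression has a closed-form solution, we can sanity-check sklearn's estimates directly with NumPy: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the sample means. A small sketch using the train_x and train_y arrays defined above:

# closed-form ordinary least squares for a single feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
x = train_x.ravel()
y = train_y.ravel()
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # should match regr.coef_ and regr.intercept_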
We can plot the fit line over the data:
In [19]:
plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
Out[19]:
Text(0, 0.5, 'Emission')
We compare the actual values and predicted values to calculate the accuracy of a regression model. Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.
There are different model evaluation metrics; let's use MSE here to calculate the accuracy of our model based on the test set:

- Mean Absolute Error (MAE): the mean of the absolute values of the errors, $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. This is the easiest metric to understand, since it is just the average error.
- Mean Squared Error (MSE): the mean of the squared errors, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. It is more popular than MAE because the focus is geared more towards large errors: squaring weights larger errors much more heavily than smaller ones.
- Root Mean Squared Error (RMSE): the square root of MSE, $\sqrt{\text{MSE}}$, which has the same units as y.
- R-squared is not an error, but a popular metric for the accuracy of your model. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
In [20]:
from sklearn.metrics import r2_score
test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y_ , test_y) )
Mean absolute error: 21.72
Residual sum of squares (MSE): 826.48
R2-score: 0.71
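For reference, these metrics (plus RMSE, which was not printed above) are also available as ready-made functions in sklearn.metrics; a quick sketch using the same test arrays:

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("MAE : %.2f" % mean_absolute_error(test_y, test_y_))
mse = mean_squared_error(test_y, test_y_)
print("MSE : %.2f" % mse)
print("RMSE: %.2f" % np.sqrt(mse))  # same units as CO2EMISSIONS (g/km)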
Multiple Linear Regression
Model Evaluation in Regression Models
Evaluation Metrics in Regression
Non-Linear Regression
Lab 2:
1 Non-Linear Regression Analysis
If the data shows a curvy trend, then linear regression will not produce very accurate results compared to a non-linear regression because, as the name implies, linear regression presumes that the data is linear. Let's learn about non-linear regressions and apply an example in Python. In this notebook, we fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014.
1.0.1 Importing required libraries
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Though linear regression is very good at solving many problems, it cannot be used for all datasets. First, recall how linear regression models a dataset: it models a linear relation between a dependent variable y and an independent variable x, using a simple equation of degree 1, for example y = 2x + 3.
In [2]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = 2*(x) + 3
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
#plt.figure(figsize=(8,6))
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
Non-linear regressions model a relationship between independent variables $x$ and a dependent variable $y$ that results in a non-linear function fitted to the data. Essentially, any relationship that is not linear can be termed non-linear, and it is usually represented by a polynomial of degree $k$ (the maximum power of $x$):

$$y = ax^3 + bx^2 + cx + d$$

Non-linear functions can have elements like exponentials, logarithms, fractions, and others. For example:

$$y = \log(x)$$

Or even something more complicated, such as:

$$y = \log(ax^3 + bx^2 + cx + d)$$
Let's take a look at a cubic function's graph.
In [3]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = 1*(x**3) + 1*(x**2) + 1*x + 3
y_noise = 20 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
As you can see, this function has $x^3$ and $x^2$ terms, and its graph is not a straight line over the 2D plane. So this is a non-linear function.
Some other types of non-linear functions are:
1.0.2 Quadratic
$$Y = X^2$$
In [4]:
x = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
y = np.power(x,2)
y_noise = 2 * np.random.normal(size=x.size)
ydata = y + y_noise
plt.plot(x, ydata, 'bo')
plt.plot(x,y, 'r')
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
1.0.3 Exponential
An exponential function with base $c$ is defined by

$$Y = a + b c^X$$

where $b \neq 0$, $c > 0$, $c \neq 1$, and $X$ is any real number. The base, $c$, is constant and the exponent, $X$, is a variable.
In [5]:
X = np.arange(-5.0, 5.0, 0.1)
##You can adjust the slope and intercept to verify the changes in the graph
Y= np.exp(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
1.0.4 Logarithmic
The response $y$ is the result of applying a logarithmic map from the input $x$ to the output variable $y$. It is one of the simplest forms:

$$y = \log(x)$$

Please consider that instead of $x$, we can use $X$, which can be a polynomial representation of the $x$'s. In general form it would be written as:

$$y = \log(X)$$

(Note that np.log is undefined for non-positive inputs, which is why the cell below emits a RuntimeWarning over the negative half of the range.)
In [6]:
X = np.arange(-5.0, 5.0, 0.1)
Y = np.log(X)
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
<ipython-input-6-0b6a51fd782b>:3: RuntimeWarning: invalid value encountered in log
Y = np.log(X)
1.0.5 Sigmoidal/Logistic
$$Y = a + \frac{b}{1 + c^{(X-d)}}$$
In [7]:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1-4/(1+np.power(3, X-2))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Independent Variable')
plt.show()
2 Non-Linear Regression example
For an example, we're going to try to fit a non-linear model to the datapoints corresponding to China's GDP from 1960 to 2014. We download a dataset with two columns: the first, a year between 1960 and 2014; the second, China's corresponding annual gross domestic product in US dollars for that year.
In [8]:
import numpy as np
import pandas as pd
# read the dataset from the local file
df = pd.read_csv(r"C:\Users\ML Learning\china_gdp.csv")
df.head(10)
Out[8]:
| | Year | Value |
|---|---|---|
| 0 | 1960 | 5.918412e+10 |
| 1 | 1961 | 4.955705e+10 |
| 2 | 1962 | 4.668518e+10 |
| 3 | 1963 | 5.009730e+10 |
| 4 | 1964 | 5.906225e+10 |
| 5 | 1965 | 6.970915e+10 |
| 6 | 1966 | 7.587943e+10 |
| 7 | 1967 | 7.205703e+10 |
| 8 | 1968 | 6.999350e+10 |
| 9 | 1969 | 7.871882e+10 |
2.0.1 Plotting the Dataset
This is what the datapoints look like. It kind of looks like either a logistic or an exponential function. The growth starts off slow, then from 2005 onward it becomes very significant, and finally it decelerates slightly in the 2010s.
In [9]:
plt.figure(figsize=(8,5))
x_data, y_data = (df["Year"].values, df["Value"].values)
plt.plot(x_data, y_data, 'ro')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
2.0.2 Choosing a model
From an initial look at the plot, we determine that the logistic function could be a good approximation, since it starts with slow growth, grows more quickly in the middle, and then levels off again at the end, as illustrated below:
In [10]:
X = np.arange(-5.0, 5.0, 0.1)
Y = 1.0 / (1.0 + np.exp(-X))
plt.plot(X,Y)
plt.ylabel('Dependent Variable')
plt.xlabel('Indepdendent Variable')
plt.show()
The formula for the logistic function is:

$$\hat{Y} = \frac{1}{1 + e^{-\beta_1(X - \beta_2)}}$$

$\beta_1$: controls the curve's steepness.

$\beta_2$: slides the curve along the x-axis.
2.0.3 Building The Model
Now, let's build our regression model and initialize its parameters.
In [11]:
def sigmoid(x, Beta_1, Beta_2):
    y = 1 / (1 + np.exp(-Beta_1 * (x - Beta_2)))
    return y
Let's look at a sample sigmoid line that might fit the data:
In [12]:
beta_1 = 0.10
beta_2 = 1990.0
#logistic function
Y_pred = sigmoid(x_data, beta_1 , beta_2)
#plot initial prediction against datapoints
plt.plot(x_data, Y_pred*15000000000000.)
plt.plot(x_data, y_data, 'ro')
Out[12]:
[<matplotlib.lines.Line2D at 0x1ebf4a84cd0>]
Our task here is to find the best parameters for our model. Let's first normalize our x and y:
In [13]:
# Let's normalize our data
xdata = x_data / max(x_data)
ydata = y_data / max(y_data)
2.0.3.1 How do we find the best parameters for our fit line?

We can use curve_fit, which uses non-linear least squares to fit our sigmoid function to the data. It finds optimal values for the parameters so that the sum of the squared residuals of sigmoid(xdata, *popt) - ydata is minimized.

popt contains our optimized parameters.
In [14]:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(sigmoid, xdata, ydata)
#print the final parameters
print(" beta_1 = %f, beta_2 = %f" % (popt[0], popt[1]))
beta_1 = 690.451711, beta_2 = 0.997207
Now we plot our resulting regression model.
In [15]:
x = np.linspace(1960, 2015, 55)
x = x/max(x)
plt.figure(figsize=(8,5))
y = sigmoid(x, *popt)
plt.plot(xdata, ydata, 'ro', label='data')
plt.plot(x,y, linewidth=3.0, label='fit')
plt.legend(loc='best')
plt.ylabel('GDP')
plt.xlabel('Year')
plt.show()
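With the fitted parameters in hand, the model can also extrapolate. A quick sketch (2015 is just an illustrative year; the prediction must reuse the same normalization constants as the training data):

# predict GDP for a year outside the sample
year = 2015
x_new = year / max(x_data)                      # normalize the input like xdata
gdp_pred = sigmoid(x_new, *popt) * max(y_data)  # undo the y normalization
print("Predicted GDP for %d: %.3e USD" % (year, gdp_pred))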