100-Days-Of-ML-Code

Atom爱疼

Article link: https://blog.csdn.net/Benanan/article/details/86064570

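This is an overview of 100 days of hands-on machine learning code, covering data preprocessing, gradient descent, simple linear regression, and multiple linear regression. Data preprocessing involves importing the required libraries, handling missing data, encoding categorical variables, and splitting the dataset into training and test sets. Next, the cost function used in gradient descent is introduced. For simple linear regression, the steps are data preprocessing, model fitting, predicting results, and visualization. Finally, multiple linear regression follows the same pattern: data preprocessing, model fitting, and predicting the test results.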

Day 1 | Data PreProcessing

Day 2 | Gradient Descent

Day 3 | Simple Linear Regression

Day 4 | Multiple Linear Regression

Day 1 | Data PreProcessing

Get the dataset from here.

Step 1 : Importing the required Libraries

These three are essential libraries which we will often import.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2 : Importing the Dataset

We use the read_csv method of the pandas library to read a local CSV file as a dataframe.

dataset = pd.read_csv("F://Data.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values
print(dataset)

'''
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
'''
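
To make the slicing in Step 2 concrete, here is a small sketch that reads a three-row excerpt of the table above from an in-memory string instead of a file on disk. The names `csv_text`, `demo`, `X_demo` and `Y_demo` are assumptions made for illustration. It shows that `iloc[:, :-1]` keeps every column except the last, while `iloc[:, 3]` selects the fourth column.

```python
import io
import pandas as pd

# Hypothetical in-memory excerpt of Data.csv, so no file on disk is needed
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
"""
demo = pd.read_csv(io.StringIO(csv_text))

X_demo = demo.iloc[:, :-1].values  # all rows, every column except the last
Y_demo = demo.iloc[:, 3].values    # all rows, the fourth column (Purchased)

print(X_demo.shape)                 # (3, 3)
print(Y_demo)
print(demo["Salary"].isna().sum())  # 1, the empty Salary field is read as NaN
```

Selecting the target with `iloc[:, -1]` would do the same thing here and keeps working if feature columns are added later.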

Step 3 : Handling the Missing Data

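Real-world data is rarely complete. We can replace missing values with the mean or median of the entire column. Older versions of this walkthrough used the Imputer class of sklearn.preprocessing, which was removed in scikit-learn 0.22; SimpleImputer from sklearn.impute does the same job.

from sklearn.impute import SimpleImputer
# Replace NaN entries in the Age and Salary columns with the column mean
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])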
print(X)

'''
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
'''
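
As a rough illustration of what mean imputation does under the hood, the sketch below fills the gaps with plain NumPy. The helper `impute_mean` is hypothetical and not part of scikit-learn; the numbers are the Age and Salary columns from the table in Step 2.

```python
import numpy as np

# Age and Salary columns from the table above, with the two missing entries
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salaries = np.array([72000, 48000, 54000, 61000, np.nan,
                     58000, 52000, 79000, 83000, 67000], dtype=float)

def impute_mean(column):
    """Return a copy of column with NaN entries replaced by the mean of the rest."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)
    return filled

ages_filled = impute_mean(ages)
salaries_filled = impute_mean(salaries)
print(ages_filled[6])      # about 38.78
print(salaries_filled[4])  # about 63777.78
```

These are the same two values that appear in rows 6 and 4 of the printed array above.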

Step 4 : Encoding Categorical Data

Values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this we import the LabelEncoder class from the sklearn.preprocessing library.

The usage of LabelEncoder and OneHotEncoder.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,0] = labelencoder.fit_transform(X[:,0])
print(X)

'''
[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
'''

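from sklearn.compose import ColumnTransformer
# One-hot encode column 0 (Country) and pass Age and Salary through unchanged.
# OneHotEncoder(categorical_features=[0]) was removed in scikit-learn 0.22.
onehotencoder = ColumnTransformer([("country", OneHotEncoder(), [0])],
                                  remainder="passthrough", sparse_threshold=0)
X = onehotencoder.fit_transform(X).astype(float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print(X,"\n=======\n",Y)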

'''
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]] 
=======
 [0 1 0 0 1 1 0 1 0 1]
'''
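
The two encoding stages can also be sketched with NumPy alone, which may help show what the encoders compute. This is only an illustration under the assumption that categories are ordered alphabetically, as in the output above; `countries`, `codes` and `one_hot` are names chosen for this example.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"])

# np.unique sorts the categories and returns each value's integer code,
# which mirrors label encoding: France=0, Germany=1, Spain=2
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i
one_hot = np.eye(len(categories))[codes]

print(categories)  # ['France' 'Germany' 'Spain']
print(codes)       # [0 2 1 2 1]
print(one_hot[1])  # [0. 0. 1.]
```

One-hot encoding avoids implying an order between countries, which the integer codes 0, 1, 2 would otherwise suggest to a linear model.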

Step 5 : Splitting the dataset into test set and train set

We import the train_test_split function from sklearn.model_selection (the sklearn.cross_validation module used in older tutorials was removed in scikit-learn 0.20). The split is generally 80/20.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 0)
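
For intuition, an 80/20 split amounts to shuffling the row indices and holding out a fifth of them. The sketch below is a simplified stand-in, not scikit-learn's implementation; `simple_split` is a hypothetical helper and will not select the same rows as `train_test_split` for a given seed.

```python
import numpy as np

def simple_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a test_size fraction (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
Xtr, Xte, ytr, yte = simple_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape)  # (8, 2) (2, 2)
```

Fixing the seed plays the same role as `random_state`: the split is random but reproducible.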

Step 6 : Feature Scaling

Done by feature standardization, also called Z-score normalization. StandardScaler of sklearn.preprocessing is imported. The scaler is fitted on the training set only, and the test set is transformed with the same mean and standard deviation.

The usage of StandardScaler.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# Reuse the training-set statistics; do not refit the scaler on the test set
X_test = sc.transform(X_test)
print(X_train,'\n=====\n',X_test)

'''
[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]] 
=====
 (test rows, rounded to three decimals)
 [[-1.     2.646 -0.775 -1.459 -0.902]
 [-1.     2.646 -0.775  1.985  2.14 ]]
'''
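
Z-score standardization itself is a one-line formula, z = (x - mean) / std. The sketch below applies it by hand with NumPy, following the common convention of computing the statistics on the training rows and reusing them for the test rows. The arrays `train` and `test` hold made-up values and are assumptions for illustration only.

```python
import numpy as np

# Made-up Age/Salary rows, not taken from the dataset above
train = np.array([[40.0, 60000.0], [30.0, 50000.0], [50.0, 70000.0]])
test = np.array([[35.0, 55000.0]])

# Fit: statistics come from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Transform: both sets are scaled with the same training statistics
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately [0. 0.]
print(test_scaled)                # roughly -0.612 in both columns
```

After scaling, Age and Salary live on comparable ranges, so neither dominates distance-based or gradient-based methods purely because of its units.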

Day 2 | Gradient Descent

Cost Function

The cost function is half the mean squared error between the model's predictions and the real values. Our goal is to minimize the cost.

import numpy as np

def compute_cost(X, y, theta):
    # Number of training examples
    m = y.size

    # Half the mean squared error between the predictions X . theta and the targets y
    cost = np.sum((np.dot(X, theta) - y) ** 2) / (2 * m)

    return cost

The usage of np.dot() is here.
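
The section title mentions gradient descent, so here is a minimal sketch of how the cost function might be minimized with batch updates. The function `gradient_descent`, the learning rate `alpha`, the iteration count and the toy data are all assumptions for illustration; they do not come from the original post.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error of the linear model X @ theta."""
    m = y.size
    return np.sum((X @ theta - y) ** 2) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat the batch update theta := theta - alpha * gradient."""
    m = y.size
    history = []
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
        history.append(compute_cost(X, y, theta))
    return theta, history

# Toy data generated from y = 1 + 2x; the column of ones is the intercept term
X_demo = np.column_stack([np.ones(5), np.arange(5.0)])
y_demo = 1 + 2 * np.arange(5.0)
theta, history = gradient_descent(X_demo, y_demo, np.zeros(2), alpha=0.1, num_iters=2000)
print(theta)  # close to [1. 2.]
```

Recording the cost at every iteration makes it easy to check that the learning rate is small enough: the values in `history` should decrease steadily.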

Day 3 | Simple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("F://studentscores.csv")
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 2 : Fitting Simple Linear Regression Model to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X_train, y_train)

Step 3 : Predicting the result

y_p = regressor.predict(X_test)

Step 4 : Visualization

Visualizing the Training result

plt.scatter(X_train, y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue")
plt.show()

Visualizing the Test result

plt.scatter(X_test, y_test, color = "red")
plt.plot(X_test, regressor.predict(X_test), color = "blue")
plt.show()
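
For a single feature, the line that LinearRegression fits has a closed-form least-squares solution, which the sketch below computes directly. The helper `fit_simple_linear` and the `hours`/`scores` arrays are made up for illustration and are not taken from studentscores.csv.

```python
import numpy as np

def fit_simple_linear(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up data lying exactly on the line y = 2 + 10x
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([12.0, 22.0, 32.0, 42.0, 52.0])
slope, intercept = fit_simple_linear(hours, scores)
print(slope, intercept)  # 10.0 2.0
```

The blue line in the plots above is this same kind of line, evaluated at the training or test inputs.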

Day 4 | Multiple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd

dataset = pd.read_csv("F://50_startups.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values

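#Transform the str label to numeric label
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder = LabelEncoder()
X[:,3] = labelencoder.fit_transform(X[:,3])

#One-hot encode column 3; the encoded columns are placed first in the output.
#OneHotEncoder(categorical_features=[3]) was removed in scikit-learn 0.22.
onehotencoder = ColumnTransformer([("category", OneHotEncoder(), [3])],
                                  remainder="passthrough", sparse_threshold=0)
X = onehotencoder.fit_transform(X).astype(float)

#Avoiding Dummy Variable Trap: drop the first dummy column
X = X[:,1:]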

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)
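
The reason for dropping one dummy column (`X = X[:,1:]`) can be checked numerically: with an intercept, the full set of dummies is linearly dependent. The sketch below uses a hypothetical three-category example and compares matrix rank against column count.

```python
import numpy as np

# One-hot columns for three hypothetical categories across four rows
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [1, 0, 0]], dtype=float)
intercept = np.ones((4, 1))

full = np.hstack([intercept, dummies])            # intercept + all three dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy dropped

# The three dummy columns sum to the intercept column, so `full` is rank deficient
print(np.linalg.matrix_rank(full), full.shape[1])        # 3 4
print(np.linalg.matrix_rank(reduced), reduced.shape[1])  # 3 3
```

When rank is lower than the number of columns, the regression coefficients are not uniquely determined; removing one dummy restores full column rank.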

Step 2 : Fitting the Multiple Linear Regression Model to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)

Step 3 : Predicting the Test result

Y_p = regressor.predict(X_test)
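
To show roughly what fit and predict amount to for several features, the sketch below solves the least-squares problem with `np.linalg.lstsq` on synthetic data. Everything here (the random features, `true_coef`, the intercept of 4.0 and the `predict` helper) is an assumption for illustration and unrelated to 50_startups.csv.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 4.0 + X_demo @ true_coef  # noiseless synthetic target

# Prepend a column of ones so the first fitted weight acts as the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
weights, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def predict(X_new):
    """Apply the fitted intercept and coefficients to new rows."""
    return weights[0] + X_new @ weights[1:]

print(np.round(weights, 3))  # intercept first, then the three coefficients
```

Because the synthetic target has no noise, the recovered weights match the generating values; on real data, comparing `Y_p` with `Y_test` is the usual way to judge the fit.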
