100-Days-Of-ML-Code

Contents

Day 1 | Data PreProcessing

Day 2 | Gradient Descent

Day 3 | Simple Linear Regression

Day 4 | Multiple Linear Regression


 

 


Day 1 | Data PreProcessing

Get the dataset from here.

Step 1 : Importing the required Libraries

These three are essential libraries which we will often import.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2 : Importing the Dataset

We use the read_csv method of the pandas library to read a local CSV file as a dataframe.

dataset = pd.read_csv("F://Data.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,3].values
print(dataset)

'''
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
'''

 

Step 3 : Handling the Missing Data

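Real-world data is rarely complete, and values can be missing for many reasons. A simple remedy is to replace each missing value with the mean or median of its column. We use the SimpleImputer class of sklearn.impute for this task (older scikit-learn releases exposed it as the Imputer class of sklearn.preprocessing, which has since been removed).

from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
# Columns 1 and 2 hold Age and Salary; column 0 (Country) is not numeric.
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])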
print(X)

'''
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
'''
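
To see what mean imputation amounts to, here is a minimal NumPy sketch (not the scikit-learn implementation) that fills each NaN with the mean of the remaining values in its column. The array holds the Age and Salary columns from the table above; the names numeric, col_means and filled are only illustrative.

```python
import numpy as np

# Age and Salary columns from the dataset printed above (NaN marks a missing value).
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Column means computed while ignoring NaN entries.
col_means = np.nanmean(numeric, axis=0)

# Replace each NaN with the mean of its own column (col_means broadcasts across rows).
filled = np.where(np.isnan(numeric), col_means, numeric)

print(col_means)
print(filled[4], filled[6])
```

The two filled-in entries are 349/9 for Age and 574000/9 for Salary, which are the 38.78 and 63777.78 values that appear in the printed X above.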

Step 4 : Encoding Categorical Data

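Values such as "Yes" and "No" cannot be used in the mathematical equations of the model, so we need to encode these variables as numbers. To achieve this we import the LabelEncoder class from the sklearn.preprocessing library. Because the integer codes 0, 1 and 2 would suggest an order among the countries that does not exist, the Country column is then expanded into one 0/1 column per country with OneHotEncoder.

See the scikit-learn documentation for the usage of LabelEncoder and OneHotEncoder.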

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,0] = labelencoder.fit_transform(X[:,0])
print(X)

'''
[[0 44.0 72000.0]
 [2 27.0 48000.0]
 [1 30.0 54000.0]
 [2 38.0 61000.0]
 [1 40.0 63777.77777777778]
 [0 35.0 58000.0]
 [2 38.77777777777778 52000.0]
 [0 48.0 79000.0]
 [1 50.0 83000.0]
 [0 37.0 67000.0]]
'''

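# OneHotEncoder(categorical_features=[0]) only works in scikit-learn versions before 0.22;
# ColumnTransformer is the current way to one-hot encode a single column.
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("country", OneHotEncoder(), [0])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
print(X,"\n=======\n",Y)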

'''
[[1.00000000e+00 0.00000000e+00 0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 3.00000000e+01 5.40000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.80000000e+01 6.10000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 4.00000000e+01 6.37777778e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.50000000e+01 5.80000000e+04]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [0.00000000e+00 1.00000000e+00 0.00000000e+00 5.00000000e+01 8.30000000e+04]
 [1.00000000e+00 0.00000000e+00 0.00000000e+00 3.70000000e+01 6.70000000e+04]] 
=======
 [0 1 0 0 1 1 0 1 0 1]
'''
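
The two encoding steps can also be written out by hand. The following is a small NumPy sketch under the assumption that the categories are ordered alphabetically, as in the output above; it is meant to show the idea, not to replace the scikit-learn encoders. The names countries, labels, codes and one_hot are illustrative.

```python
import numpy as np

# The Country column from the dataset above.
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

# np.unique sorts the distinct labels and, with return_inverse=True,
# also returns the integer code of every entry (France=0, Germany=1, Spain=2).
labels, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(labels))[codes]

print(labels)
print(codes)
print(one_hot[:3])
```

Each row of one_hot contains exactly one 1, in the column that belongs to that row's country.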

Step 5 : Splitting the dataset into training set and test set

We import the train_test_split function from sklearn.model_selection (the older sklearn.cross_validation module was removed in scikit-learn 0.20). The split is generally 80/20; here test_size=0.2 holds out 20% of the rows for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state = 0)
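
Conceptually, the split shuffles the row indices once and cuts them in two. Here is a minimal sketch of that idea in NumPy; simple_train_test_split is a hypothetical helper written for this illustration, it uses a different random generator than scikit-learn, and it will not reproduce the exact rows that train_test_split selects.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices once and cut them into a test part and a training part."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical samples, 2 features
y_demo = np.arange(10)                  # hypothetical targets

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)           # (8, 2) (2, 2)
```

Fixing the seed plays the same role as random_state above: the same rows end up in the test set on every run.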

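Step 6 : Feature Scaling

Feature scaling is done here by standardization (Z-score normalization): each column is shifted to zero mean and scaled to unit variance. We import StandardScaler from sklearn.preprocessing.

See the scikit-learn documentation for the usage of StandardScaler.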

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training set only, then reuse its mean and standard deviation for the test set.
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train,'\n=====\n',X_test)

'''
[[-1.          2.64575131 -0.77459667  0.26306757  0.12381479]
 [ 1.         -0.37796447 -0.77459667 -0.25350148  0.46175632]
 [-1.         -0.37796447  1.29099445 -1.97539832 -1.53093341]
 [-1.         -0.37796447  1.29099445  0.05261351 -1.11141978]
 [ 1.         -0.37796447 -0.77459667  1.64058505  1.7202972 ]
 [-1.         -0.37796447  1.29099445 -0.0813118  -0.16751412]
 [ 1.         -0.37796447 -0.77459667  0.95182631  0.98614835]
 [ 1.         -0.37796447 -0.77459667 -0.59788085 -0.48214934]] 
=====
 [[-1.          2.64575131 -0.77459667 -1.45882927 -0.90166297]
 [-1.          2.64575131 -0.77459667  1.98496442  2.13981082]]
'''
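
Standardization itself is a one-line formula, z = (x - mean) / std. The sketch below spells it out in NumPy on a small hypothetical array and follows the usual convention of computing the mean and standard deviation on the training rows and reusing them for the test rows. All array names and values are made up for illustration.

```python
import numpy as np

train = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])      # hypothetical training features
test = np.array([[2.0, 500.0]])       # hypothetical test features

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)               # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std     # reuse the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, while the test values are simply expressed in the same units and need not be centred.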

 

Day 2 | Gradient Descent

Cost Function

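The cost function is half the mean squared error between the model's predictions and the observed values: J(theta) = (1 / (2m)) * sum((X·theta - y)^2), where m is the number of training examples. Our goal is to find the theta that minimizes this cost.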

import numpy as np

def compute_cost(X, y, theta):
    # Initialize some useful values
    m = y.size
    cost = 0

    # ===================== Your Code Here =====================
    # Instructions : Compute the cost of a particular choice of theta.
    #                You should set the variable "cost" to the correct value.

    # Half the mean squared error between the predictions X·theta and the targets y
    cost = np.sum((np.dot(X, theta) - y) ** 2) / (2 * m)

    return cost

See the NumPy documentation for the usage of np.dot().
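
Gradient descent minimizes this cost by repeatedly moving theta a small step against the gradient: theta := theta - alpha * (1/m) * X^T (X·theta - y). The following is a minimal sketch of that loop, not code from the original exercise; the learning rate alpha, the iteration count and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error of the linear model X @ theta."""
    m = y.size
    residuals = X @ theta - y
    return np.sum(residuals ** 2) / (2 * m)

def gradient_descent(X, y, theta, alpha=0.1, num_iters=2000):
    """Batch gradient descent; alpha and num_iters are illustrative defaults."""
    m = y.size
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m   # partial derivatives of the cost
        theta = theta - alpha * gradient
    return theta

# Hypothetical data generated from y = 1 + 2x, with a column of ones for the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y, np.zeros(2))
print(theta, compute_cost(X, y, theta))
```

On this noise-free toy data theta approaches [1, 2] and the cost approaches 0. With real data, alpha usually needs tuning: too large a value makes the cost grow instead of shrink.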

Day 3 | Simple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("F://studentscores.csv")
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values

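# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split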
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step 2 : Fitting the Simple Linear Regression Model to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor = regressor.fit(X_train, y_train)
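
With a single feature, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below computes both by hand on made-up numbers (they are not taken from studentscores.csv), which is a handy way to check what a fitted model should report.

```python
import numpy as np

# Hypothetical feature/target pairs, used only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([12.0, 19.0, 33.0, 41.0, 50.0])

# Least-squares estimates for the line y = intercept + slope * x.
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```

For these five points the slope is 98 / 10 = 9.8 and the intercept is 31 - 9.8 * 3 = 1.6, up to floating-point rounding.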

Step 3 : Predicting the result

y_p = regressor.predict(X_test)

Step 4 : Visualization

Visualizing the training results

plt.scatter(X_train, y_train, color = "red")
plt.plot(X_train, regressor.predict(X_train), color = "blue")
plt.show()

Visualizing the test results

plt.scatter(X_test, y_test, color = "red")
plt.plot(X_test, regressor.predict(X_test), color = "blue")
plt.show()

 

Day 4 | Multiple Linear Regression

Step 1 : Data preprocessing

import numpy as np
import pandas as pd

dataset = pd.read_csv("F://50_startups.csv")
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,-1].values

#Transform the str label to numeric label
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:,3] = labelencoder.fit_transform(X[:,3])

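# OneHotEncoder(categorical_features=[3]) only works in scikit-learn versions before 0.22;
# ColumnTransformer is the current way to one-hot encode a single column.
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([("category", OneHotEncoder(), [3])], remainder="passthrough", sparse_threshold=0)
X = ct.fit_transform(X).astype(float)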

#Avoiding Dummy Variable Trap
X = X[:,1:]

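# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)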
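
The reason for dropping the first dummy column with X = X[:,1:] can be checked numerically. In the sketch below, which uses hypothetical category codes rather than the startup data, the full set of dummy columns always sums to the intercept column, so the design matrix has one redundant column; removing one dummy restores full column rank.

```python
import numpy as np

# Hypothetical category codes for six samples with three categories.
codes = np.array([0, 1, 2, 0, 1, 2])
dummies = np.eye(3)[codes]            # full one-hot encoding, one column per category
intercept = np.ones((len(codes), 1))  # the constant column a linear model adds

# All three dummy columns sum to the intercept column, so this matrix is rank-deficient.
full = np.hstack([intercept, dummies])
# Dropping the first dummy column removes the redundancy.
reduced = np.hstack([intercept, dummies[:, 1:]])

print(full.shape[1], np.linalg.matrix_rank(full))        # 4 columns, rank 3
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 columns, rank 3
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined, which is what the dummy variable trap refers to.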

Step 2 : Fitting the Multiple Linear Regression Model to the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,Y_train)
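
Fitting a multiple linear regression means finding the coefficients that minimize the squared error over all features at once. As a rough sketch of that computation, assuming made-up data generated from a known formula, np.linalg.lstsq solves the same least-squares problem directly; the names features, target and design are illustrative.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 features, generated from y = 3 + 2*x1 - 1*x2.
features = np.array([[1.0, 2.0],
                     [2.0, 1.0],
                     [3.0, 4.0],
                     [4.0, 3.0],
                     [5.0, 5.0]])
target = 3.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(features)), features])

# Solve the least-squares problem min ||design @ coef - target||^2.
coef, residuals, rank, singular_values = np.linalg.lstsq(design, target, rcond=None)

print(coef)
```

Because the toy target is an exact linear function of the features, the recovered coefficients are approximately [3, 2, -1]: the intercept followed by one weight per feature.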

Step 3 : Predicting the Test result

Y_p = regressor.predict(X_test)

 
