For study purposes only.
Reference: https://github.com/Avik-Jain/100-Days-Of-ML-Code
100-Days-Of-ML-Code
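day1 Data preprocessing
- Import the required libraries
- Import the dataset
- Handle missing data
- Encode categorical data
- Split the dataset into a training set and a test set
- Feature scaling
Many machine learning algorithms use the Euclidean distance between two data points in their computations, so without scaling the features with the largest magnitudes dominate the result.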
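A minimal sketch of why scaling matters for Euclidean distance, in plain Python. The toy (age, salary) rows and the helper names `euclidean` and `standardize` are made up for illustration; the scripts later in these notes use scikit-learn's `StandardScaler` for this step.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows: (age in years, salary)
rows = [[25, 30000], [45, 31000], [26, 60000]]

# Unscaled: the salary column swamps the age column
raw_near = euclidean(rows[0], rows[1])   # 20 years apart, 1000 in salary
raw_far = euclidean(rows[0], rows[2])    # 1 year apart, 30000 in salary

# Scaled: both columns contribute on a comparable scale
scaled = standardize(rows)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_near, raw_far)
print(scaled_01, scaled_02)
```

Before scaling, the second distance is about thirty times the first purely because of the salary units; after scaling, the two distances are nearly equal.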
day2 Simple linear regression
Predict the outcome from a single feature.
# coding:utf-8
'''
Simple linear regression
'''
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the data: the first column is the feature, the second is the target
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/studentscores.csv')
X = dataset.iloc[:, : 1].values
Y = dataset.iloc[:, 1].values

# Hold out a quarter of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1 / 4, random_state=0)

# Fit on the training set, then predict the held-out rows
regressor = LinearRegression()
regressor = regressor.fit(X_train, Y_train)
Y_pred = regressor.predict(X_test)

# Training points in red, test points in green (so the two sets can be told apart),
# fitted line in blue
plt.scatter(X_train, Y_train, color='red')
plt.scatter(X_test, Y_test, color='green')
plt.plot(X_train, regressor.predict(X_train), color='blue')
plt.show()
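For a single feature, the least-squares line has a closed-form solution. The sketch below computes it in plain Python; `fit_line` is a name made up for this note, and the hours/scores pairs are invented rather than taken from `studentscores.csv`.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on score = 10 * hours + 2
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [12.0, 22.0, 32.0, 42.0, 52.0]

slope, intercept = fit_line(hours, scores)
print(slope, intercept)
```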
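day3 Multiple linear regression
day4 Logistic regression
Logistic regression addresses a different kind of problem: classification. The goal is to predict which class an object belongs to. The model outputs a probability between 0 and 1, which is then mapped to a discrete class.
It uses the logistic (sigmoid) function.
Logistic regression produces discrete outcomes; linear regression produces continuous ones.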
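A minimal sketch of the sigmoid and of turning its output into a class label. The 0.5 threshold and the name `predict_label` are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(z, threshold=0.5):
    """Map the probability to a discrete class (1 or 0)."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z), predict_label(z))
```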
day5 Logistic regression
Studied how the loss function is computed and how gradient descent is used during training to reduce the loss.
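A small sketch of that loop for a one-feature model, assuming the usual log loss (binary cross-entropy). The data, the learning rate, and the step count are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the model p = sigmoid(w * x + b)."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)

def gradient_step(w, b, xs, ys, lr):
    """One gradient descent update; the gradient of the log loss is mean((p - y) * x)."""
    n = len(xs)
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical data: negative x -> class 0, positive x -> class 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0
initial_loss = log_loss(w, b, xs, ys)
for _ in range(200):
    w, b = gradient_step(w, b, xs, ys, lr=0.5)
final_loss = log_loss(w, b, xs, ys)

print(initial_loss, final_loss, w, b)
```

With all-zero parameters every prediction is 0.5, so the starting loss is ln 2; each step then moves the parameters against the gradient and the loss shrinks.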
day6 Implementing logistic regression
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%206%20Logistic%20Regression.md
'''
Implementing logistic regression
'''
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Columns 2 and 3 are the features; column 4 is the label
dataset = pd.read_csv('/Users/huihui/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
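day7 K-nearest neighbors
Finding the right value of k is not easy.
A small k means noise has more influence on the result;
a large k makes the computation expensive.
It depends on the individual case; the best approach is to run through the candidate k values and decide for yourself.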
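Trying several k values can be sketched as follows, in plain Python. The tiny dataset (with one deliberately mislabeled training point) and the helper names are invented for illustration.

```python
import math
from collections import Counter

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], point))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_X)

# Two clusters; the last training point is a noisy "b" sitting inside cluster "a"
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6), (7, 7), (2.5, 2.5)]
train_y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
test_X = [(2.4, 2.4), (1.5, 1.5), (6.5, 6.5)]
test_y = ["a", "a", "b"]

results = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in (1, 3, 9)}
for k, acc in results.items():
    print(k, acc)
```

On this toy data k=1 is misled by the noisy point, k=3 classifies every test point correctly, and k=9 (the whole training set) collapses to the majority class.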
day8 The math behind logistic regression
Study material:
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
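day9 Support vector machines (SVM)
A brief introduction to what an SVM is
How it is used to solve classification problems
day10 SVM and KNN
Took a deeper look at SVM and implemented the K-nearest neighbors algorithm
day11 Implementing K-nearest neighbors
Implemented the KNN algorithm for a classification task.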
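day12 Support vector machines
SVM can be used for both classification and regression, but it is mostly used for classification.
In this algorithm, each data item is plotted as a point in N-dimensional space, where N is the number of features.
- How does it classify?
Find a hyperplane that separates the different classes.
In other words, the algorithm outputs an optimal hyperplane, which is then used to classify new samples.
- What is the optimal hyperplane?
The one that keeps the largest margin to all the labeled points.
In other words, the hyperplane whose distance to the nearest element of each class is as large as possible.
Note:
Data can be linearly separable or not linearly separable.
- kernel
- gamma
- regularization
- margin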
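To make "largest margin" concrete, the sketch below measures the margin of two candidate hyperplanes on a toy dataset. The points, the two hyperplanes, and the name `geometric_margin` are made up for illustration; a real SVM solves an optimization problem to find the best hyperplane rather than comparing hand-picked ones.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    Labels are +1 / -1. A negative result means at least one point
    lies on the wrong side of the hyperplane.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

# Two hypothetical separating hyperplanes
margin_a = geometric_margin((1.0, 0.0), -3.0, points, labels)  # vertical line x1 = 3
margin_b = geometric_margin((1.0, 1.0), -5.5, points, labels)  # diagonal line x1 + x2 = 5.5

print(margin_a, margin_b)
```

Both lines separate the classes, but the diagonal one keeps the nearest points farther away, so it is the better of the two under the max-margin criterion.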
day13 Naive Bayes classification
Implementing SVM with scikit-learn
day14 Implementing SVM
# coding:utf-8
# 2019/10/10 15:03
# huihui
# ref:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the training-set statistics; do not refit on the test set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
day15 Naive Bayes classification and black-box machine learning
Studied the different types of naive Bayes classifiers.
https://bloomberg.github.io/foml/#home
Also started the lectures by Bloomberg. First one in the playlist was Black Box Machine Learning. It gives the whole overview about prediction functions, feature extraction, learning algorithms, performance evaluation, cross-validation, sample bias, nonstationarity, overfitting, and hyperparameter tuning.
day16 Implementing SVM with the kernel trick
Implemented the SVM algorithm with Scikit-Learn, adding a kernel that maps the data points into a higher-dimensional space.
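The point of the kernel trick is that the high-dimensional mapping never has to be computed explicitly. A minimal check in plain Python, assuming a degree-2 polynomial kernel (chosen here only because its feature map is easy to write down; the notes do not say which kernel was used):

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def phi(x):
    """Explicit feature map from 2-D into 3-D for the kernel (x . z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Same inner product, computed directly in the original 2-D space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(z))   # map first, then take the inner product
implicit = poly_kernel(x, z)     # kernel shortcut

print(explicit, implicit)
```

Both routes give the same number, which is why an SVM can work in the higher-dimensional space while only ever evaluating the kernel on the original points.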
day17 Getting started with deep learning
Completed the whole Week 1 and Week 2 on a single day. Learned Logistic regression as Neural Network.
day18 Deep learning
day21 Web scraping
[Omitted]
day22 Is learning feasible?
Lecture 2 of 18 of Caltech’s Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa. Learned about Hoeffding Inequality.
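For reference, the Hoeffding inequality bounds how far a sample frequency ν can stray from the true probability μ: P[|ν − μ| > ε] ≤ 2·exp(−2ε²N) for a sample of size N. The sketch below just evaluates that bound; the function names are made up for this note.

```python
import math

def hoeffding_bound(epsilon, n):
    """Upper bound on P[|nu - mu| > epsilon] for a sample of size n."""
    return 2.0 * math.exp(-2.0 * epsilon ** 2 * n)

def samples_needed(epsilon, delta):
    """Smallest n for which the bound drops to delta or below."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

n = samples_needed(0.1, 0.05)
print(n, hoeffding_bound(0.1, n))
```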
day23 Decision trees
ID3
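ID3 picks, at each node, the feature with the highest information gain (the drop in label entropy after splitting on it). A minimal sketch of that selection step in plain Python; the toy rows and the function names are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """The feature ID3 would split on at this node."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

# Hypothetical data: "outlook" determines the label, "windy" tells us nothing
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

print(information_gain(rows, labels, "outlook"))
print(information_gain(rows, labels, "windy"))
print(best_feature(rows, labels))
```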
day24 Introduction to statistical learning theory
Lec 3 of Bloomberg ML course introduced some of the core concepts like input space, action space, outcome space, prediction functions, loss functions, and hypothesis spaces.
day25 Implementing a decision tree
# coding:utf-8
# 2019/10/10 15:16
# huihui
# ref:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
dataset = pd.read_csv('~/git/100-Days-Of-ML-Code/datasets/Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                     np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color=ListedColormap(('red', 'green'))(i), label=j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
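day30 Calculus
day33 Random forests
A random forest is a supervised ensemble learning model used for classification and regression.
It builds multiple decision trees and combines them to get a more accurate and stable prediction.
- Two steps
  - Randomly create a forest
  - Make predictions
- Difference between a random forest and a decision tree:
In a random forest, the process of finding the root node and splitting the feature nodes is randomized.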
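The two steps can be sketched in plain Python. This is a heavy simplification made for illustration: each "tree" is a one-split stump on a single randomly chosen feature, the labels are assumed binary, and the data and function names are invented. Real random forests grow full trees and draw a random feature subset at each split.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """One-split 'tree': threshold halfway between the two class means on `feature`."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row[feature])
    if len(by_class) == 1:  # the bootstrap sample happened to contain one class only
        only = next(iter(by_class))
        return lambda row: only
    # Order the two classes by their mean value on this feature
    (c0, v0), (c1, v1) = sorted(by_class.items(), key=lambda kv: sum(kv[1]) / len(kv[1]))
    threshold = (sum(v0) / len(v0) + sum(v1) / len(v1)) / 2
    return lambda row: c1 if row[feature] > threshold else c0

def fit_forest(rows, labels, n_trees, rng):
    """Step 1: build the forest from bootstrap samples and random features."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample rows with replacement
        feature = rng.randrange(len(rows[0]))            # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Step 2: every tree votes; the majority class wins."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

rows = [(1.0, 2.0), (1.5, 1.0), (2.0, 1.5), (1.2, 1.8),
        (8.0, 9.0), (8.5, 8.0), (9.0, 8.5), (8.2, 8.8)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
print(predict(forest, (1.5, 1.5)), predict(forest, (8.5, 8.5)))
```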
day34 Implementing a random forest
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/Code/Day%2034%20Random_Forest.md
day35 What is a neural network?
https://www.youtube.com/watch?v=aircAruvnKk&t=7s
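A great way to build intuition for neural networks.
Explains the relevant concepts through a handwritten digit recognition example.
day36 Gradient descent: how do neural networks learn?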
https://www.youtube.com/watch?v=IHZwWFHWa-w
Explains the concept of gradient descent in a humorous way.
Highly recommended.
day37 What is backpropagation really doing?
https://www.youtube.com/watch?v=Ilg3gGewQ5U
Explains partial derivatives and backpropagation.
day38 Backpropagation calculus
https://www.youtube.com/watch?v=tIeHLnjs5U8
day39 Deep learning with Python, TensorFlow, and Keras tutorial
https://www.youtube.com/watch?v=wQ8BIBpya2k&t=19s&index=2&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN
day40 Loading your own data
https://www.youtube.com/watch?v=j-3vuBynnOE&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=2
Deep learning basics
day41 Convolutional neural networks
https://www.youtube.com/watch?v=WvoLTXIjBYU&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=3
day42 Analyzing models with TensorBoard
https://www.youtube.com/watch?v=BqgTU7_cBnk&list=PLQVvvaa0QuDfhTox0AjmQ6tvTgMBZBEXN&index=4
day43 K-Means clustering
Moved on to unsupervised learning and studied clustering.
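A minimal sketch of K-means (Lloyd's algorithm) in plain Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. The points and the starting centroids are made up; passing the initial centroids in explicitly is a choice made here to keep the run deterministic.

```python
def closest(point, centroids):
    """Index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: group points by nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, [closest(p, centroids) for p in points]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
# Deliberately poor start: both centroids sit inside the same group
centroids, assignment = kmeans(points, [(1, 1), (2, 1)])

print(centroids)
print(assignment)
```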
day44 Implementing K-means clustering
https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master
day45 numpy-1
https://github.com/jakevdp/PythonDataScienceHandbook
Introduction to Numpy. Covered topics like Data Types, Numpy arrays and Computations on Numpy arrays.
- Topics studied:
Introduction to NumPy
Understanding Data Types in Python
The Basics of NumPy Arrays
Computation on NumPy Arrays: Universal Functions
day46 numpy-2
Aggregations, Comparisons, and Broadcasting
Link to Notebook:
Aggregations: Min, Max, and Everything In Between
Computation on Arrays: Broadcasting
Comparisons, Masks, and Boolean Logic
day47 numpy-3
Fancy Indexing, Sorting Arrays, Structured Data
Link to Notebook:
Fancy Indexing
Sorting Arrays
Structured Data: NumPy’s Structured Arrays
day48 pandas-1
Data Manipulation with Pandas
Covered various topics such as Pandas Objects, Data Indexing and Selection, Operating on Data, Handling Missing Data, Hierarchical Indexing, and Concat and Append.
Link To the Notebooks:
Data Manipulation with Pandas
Introducing Pandas Objects
Data Indexing and Selection
Operating on Data in Pandas
Handling Missing Data
Hierarchical Indexing
Combining Datasets: Concat and Append
day49 pandas-2
Chapter 3: Completed the following topics: Merge and Join, Aggregation and Grouping, and Pivot Tables.
Combining Datasets: Merge and Join
Aggregation and Grouping
Pivot Tables
day50 pandas-3
Chapter 3: Vectorized String Operations, Working with Time Series
Links to Notebooks:
Vectorized String Operations
Working with Time Series
High-Performance Pandas: eval() and query()
day51 matplotlib-1
Visualization with Matplotlib
Learned about Simple Line Plots, Simple Scatter Plots, and Density and Contour Plots.
Links to Notebooks:
Visualization with Matplotlib
Simple Line Plots
Simple Scatter Plots
Visualizing Errors
Density and Contour Plots
day52 matplotlib-2
Visualization with Matplotlib
Learned about Histograms, how to customize plot legends and colorbars, and building Multiple Subplots.
Links to Notebooks:
Histograms, Binnings, and Density
Customizing Plot Legends
Customizing Colorbars
Multiple Subplots
Text and Annotation
day53 matplotlib-3
Three-dimensional plotting
Link to Notebook:
Three-Dimensional Plotting in Matplotlib
day54 Hierarchical Clustering
Studied hierarchical clustering.
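A minimal sketch of the bottom-up (agglomerative) variant in plain Python: start with every point in its own cluster and repeatedly merge the two closest clusters. Single linkage (distance between the closest pair of members) is an assumption made here; the notes do not say which linkage was studied. The points are invented.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(a, b):
    """Distance between two clusters = distance between their closest members."""
    return min(distance(p, q) for p in a for q in b)

def agglomerative(points, n_clusters):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerative(points, n_clusters=3)
print(clusters)
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, is what produces the usual dendrogram.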