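1. The %time and %timeit commands
These are IPython magic commands, so they only work inside IPython. (Jupyter Notebook runs on an IPython kernel, so they work there; the Python console in PyCharm also accepts them when it is backed by IPython.)
%time measures the time taken by a single execution of one statement.
%timeit runs one statement repeatedly and reports timing statistics over those runs.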
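Outside IPython the magics are not available. As a rough sketch of the same two ideas in plain Python, the standard-library `time` and `timeit` modules can be used; the workload function `build_squares` below is a made-up example, not part of the lab.

```python
import time
import timeit


def build_squares(n=10_000):
    """Toy workload, used only for illustration."""
    return [i * i for i in range(n)]


# Roughly what %time does: run the statement once and report the elapsed time.
start = time.perf_counter()
build_squares()
single_run = time.perf_counter() - start

# Roughly what %timeit does: run the statement many times and summarise.
# Here: 5 repetitions of 100 calls each, keeping the best repetition.
runs = timeit.repeat(build_squares, number=100, repeat=5)
best_per_call = min(runs) / 100

print(f"single run: {single_run:.6f} s")
print(f"best of 5 x 100 loops: {best_per_call:.6f} s per call")
```

A single run is noisy, which is why repeated measurement is usually preferred for short statements.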
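2. Using read_csv(filepath_or_buffer, header)
Reference: https://www.pypandas.cn/docs/user_guide/io.html#csv-%E6%96%87%E6%9C%AC%E6%96%87%E4%BB%B6
The pandas I/O API is a set of reader functions, such as pandas.read_csv(), that return pandas objects.
(1) The default separator of read_csv is ","
The training data used for handwritten digit recognition, for example, is comma-separated.
(2) header=None
Reference: https://blog.csdn.net/sinat_32872729/article/details/93025161
The header parameter gives the row number to use as the column names, which also marks where the data starts. By default it is inferred, which behaves like header=0 when names is not passed. If the file has no column names, set header=None.
According to the pandas documentation, when the file has a header row the first row is read as the column names by default (header=0). When the file has no header row and header is left unset, the first data row is treated as the header, so for headerless data either pass names to supply column names or set header=None.
Setting header=None on a file that does have a header row does not fail in general: the header row is simply read as the first data row. It can raise an error when the header line has fewer fields than the data lines, for example: pandas_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error() ParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 5
In short, for data without a header row, set header=None; otherwise the first data row is consumed as the header.
header=None states that the file has no column labels, and read_csv then assigns integer column labels automatically.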
%time data = pd.read_csv("../mnist/mnist_train.csv", header=None)
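To make the effect of header concrete, here is a small sketch on an in-memory CSV. The three-line text and the column names `label`, `p0`, `p1` are invented for illustration; only `read_csv`, `header` and `names` come from pandas.

```python
import io

import pandas as pd

# A tiny headerless CSV: first value is a label, the rest are "pixels".
csv_text = "5,0,0\n0,0,255\n4,128,0\n"

# Default behaviour (like header=0): the first data row becomes the column names.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data, columns get integer labels 0, 1, 2.
with_none = pd.read_csv(io.StringIO(csv_text), header=None)

# names=...: every row is data, columns get the supplied names.
with_names = pd.read_csv(io.StringIO(csv_text), names=["label", "p0", "p1"])

print(with_default.shape)        # one row fewer than the file has lines
print(with_none.shape)
print(list(with_none.columns))
print(list(with_names.columns))
```

With the default setting one sample is silently lost to the header, which is why the lab passes header=None for the MNIST files.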
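read_csv returns a DataFrame, which exposes the following four attributes (they are attributes, not methods):
.dtypes: the data type of each column
.values: the corresponding two-dimensional NumPy array
.columns: the column index, i.e. the column names
.index: the row index, i.e. the row numbers or row names
The mnist_train.csv data has 60000 rows and 785 columns (one label column followed by 784 pixel columns).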
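A minimal sketch of these attributes on a small stand-in DataFrame (the 3 x 4 table is made up; the MNIST file itself is not needed to try this):

```python
import numpy as np
import pandas as pd

# Small stand-in for the DataFrame returned by read_csv(..., header=None).
df = pd.DataFrame(np.arange(12).reshape(3, 4))

print(df.dtypes)    # data type of each column
print(df.values)    # underlying 2-D NumPy array
print(df.columns)   # column labels (integers here, as with header=None)
print(df.index)     # row labels
print(df.shape)     # (rows, columns)
```

The pandas documentation also offers `DataFrame.to_numpy()` as an alternative to `.values` for obtaining the array.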
[Plotting] Using plt.figure()
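np.mean()
mean() computes the arithmetic mean.
The parameter used most often is axis. Taking an m * n matrix as an example:
axis not set: averages all m*n values and returns a single number
axis = 0: collapses the rows, averaging each column, and returns an array of n values
axis = 1: collapses the columns, averaging each row, and returns an array of m values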
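The three cases can be checked on a small 2 x 3 array. This is an illustrative sketch; the array values are arbitrary.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # m = 2 rows, n = 3 columns

overall = np.mean(a)               # mean of all 6 values
col_means = np.mean(a, axis=0)     # one mean per column, shape (3,)
row_means = np.mean(a, axis=1)     # one mean per row, shape (2,)

# keepdims=True keeps the collapsed axis with length 1, giving a true 1 x n result.
col_means_2d = np.mean(a, axis=0, keepdims=True)

print(overall)
print(col_means, col_means.shape)
print(row_means, row_means.shape)
print(col_means_2d.shape)
```

Note that without keepdims the results are one-dimensional arrays rather than 1 x n or m x 1 matrices.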
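fit() and score(): scoring with the default evaluation method
The fit function
What score does in scikit-learn
It provides a default evaluation metric: the trained model is evaluated on the test set. For classifiers the result is the mean accuracy, a value between 0 and 1 where 1 is best.
Model training
clf.fit(X_train,Y_train)
Model evaluation
print(clf.score(X_test,Y_test))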
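As a small self-contained sketch of fit and score, the snippet below trains a classifier on four made-up 2-D points and checks that score equals the fraction of correct predictions. The data is invented, and NearestCentroid is used only because it appears later in the lab.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

# Made-up training data: two well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])

# Made-up test data; the last label is deliberately inconsistent with the clusters.
X_test = np.array([[0.5, 0.5], [5.5, 5.5], [4.0, 4.0]])
y_test = np.array([0, 1, 0])

clf = NearestCentroid()
clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)                  # mean accuracy on the test set
manual = np.mean(clf.predict(X_test) == y_test)    # same quantity computed by hand

print(score, manual)
```

Two of the three test points are classified in agreement with their labels, so both values are 2/3.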
Handwritten digit recognition
MNIST Lab
MNIST reference:
- http://yann.lecun.com/exdb/mnist/
scikit-learn reference:
- https://scikit-learn.org/stable/modules/multiclass.html
- https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
# Step 1: store training and testing data into numpy arrays
# read csv files (i.e., ../mnist/*.csv) with pandas
import numpy as np
import pandas as pd
# first look at the basic information of the training data
%time data1 = pd.read_csv("../mnist/mnist_train.csv", header=None)
print(data1.values)
print(data1.shape) # training data of shape (60000, 785): 60000 rows x 785 columns
print(data1.index)
print(data1.columns)
%time data = pd.read_csv("../mnist/mnist_train.csv", header=None).values
# %time measures the execution time of a single statement
# read_csv reads the CSV data; the file has no header row, so header=None is set, otherwise the first row would be used as the header
# read_csv returns a DataFrame; .values gives the corresponding two-dimensional numpy array
print(data)
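# extract the training image data and reshape it into a new array
img_train = data[:,1:].reshape((-1,28,28))
# MNIST is a dataset of handwritten images of the digits 0-9; each image is size-normalized and centered in a 28*28 grid
# when reshaping, the new shape must be compatible with the original one; if a dimension is given as -1, NumPy infers it from the remaining dimensions
print(data[:,1:]) # all rows of data, columns from index 1 (inclusive) onward, i.e. shape (60000, 784)
print(data[:,1:].shape)
print('xxxxxxxxxxxxxxxxxxxxx')
print(img_train) # reshape((-1,28,28)) turns the data into x matrices of 28*28
print(img_train.shape) # (60000, 28, 28), i.e. 60000 matrices of 28*28
# extract the labels of the handwritten digit training data, i.e. the first column of each row (60000 values)
lab_train = data[:,0]
print(lab_train, len(lab_train))
# create a 60000*10 all-zero matrix for 0-1 (one-hot) encoding
one_train = np.zeros((len(data), 10))
print(one_train.shape)
one_train[range(len(data)), data[:,0]] = 1
# for each row 0-59999, set the entry in the column given by the label (the first value of that row of data) to 1
print(range(len(data))) # range(0, 60000)
print(data[:,0]) # [5 0 4 ... 5 6 8]
print(one_train[0,5], one_train[1,0], one_train[2,4])
print(one_train)
data = None
# img_train, lab_train and one_train are all that is needed, so release the raw data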
CPU times: user 2 s, sys: 308 ms, total: 2.31 s
Wall time: 2.31 s
[[5 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[4 0 0 ... 0 0 0]
...
[5 0 0 ... 0 0 0]
[6 0 0 ... 0 0 0]
[8 0 0 ... 0 0 0]]
(60000, 785)
RangeIndex(start=0, stop=60000, step=1)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
775, 776, 777, 778, 779, 780, 781, 782, 783, 784],
dtype='int64', length=785)
CPU times: user 1.94 s, sys: 168 ms, total: 2.11 s
Wall time: 2.11 s
[[5 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[4 0 0 ... 0 0 0]
...
[5 0 0 ... 0 0 0]
[6 0 0 ... 0 0 0]
[8 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
(60000, 784)
xxxxxxxxxxxxxxxxxxxxx
[[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
...
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]]
(60000, 28, 28)
[5 0 4 ... 5 6 8] 60000
(60000, 10)
range(0, 60000)
[5 0 4 ... 5 6 8]
1.0 1.0 1.0
[[0. 0. 0. ... 0. 0. 0.]
[1. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 1. 0.]]
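The slicing, reshape with -1, and one-hot indexing used above can be followed more easily on a toy array. This is a sketch with invented values: 3 rows, 1 label column plus 4 "pixels", and 3 classes instead of 10.

```python
import numpy as np

# Toy stand-in for the CSV contents: column 0 is the label, the rest are pixels.
data = np.array([[2, 10, 11, 12, 13],
                 [0, 20, 21, 22, 23],
                 [1, 30, 31, 32, 33]])

# Drop the label column, then let NumPy infer the first dimension (-1 becomes 3).
images = data[:, 1:].reshape((-1, 2, 2))
labels = data[:, 0]

# One-hot encoding by integer-array indexing: row i, column labels[i] is set to 1.
one_hot = np.zeros((len(data), 3))
one_hot[np.arange(len(data)), labels] = 1

# An equivalent formulation: pick rows of the identity matrix by label.
one_hot_eye = np.eye(3)[labels]

print(images.shape)
print(one_hot)
```

The identity-matrix form is shorter, while the indexing form matches what the lab code does.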
# first look at the basic information of the test data
%time data = pd.read_csv("../mnist/mnist_test.csv", header=None).values
print(data.shape) # test data of shape (10000, 785): 10000 rows x 785 columns
# extract the test image data and reshape it into a new array
img_test = data[:,1:].reshape((-1,28,28))
print(data[:,1:]) # all rows of data, columns from index 1 (inclusive) onward, i.e. shape (10000, 784)
print(data[:,1:].shape)
print('xxxxxxxxxxxxxxxxxxxxx')
print(img_test) # reshape((-1,28,28)) turns the data into x matrices of 28*28
print(img_test.shape) # (10000, 28, 28), i.e. 10000 matrices of 28*28
# extract the labels of the handwritten digit test data, i.e. the first column of each row (10000 values)
lab_test = data[:,0]
print(lab_test, len(lab_test))
one_test = np.zeros((len(data), 10)) # create a 10000*10 all-zero matrix
print(one_test.shape)
one_test[range(len(data)), data[:,0]] = 1
# for each row 0-9999, set the entry in the column given by the label (the first value of that row of data) to 1
print(range(len(data))) # range(0, 10000)
print(data[:,0]) # [7 2 1 ... 4 5 6]
print(one_test[0,7], one_test[1,7], one_test[2,1])
print(one_test)
data = None
# img_test, lab_test and one_test are now available, so release the raw data
CPU times: user 332 ms, sys: 0 ns, total: 332 ms
Wall time: 330 ms
(10000, 785)
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
(10000, 784)
xxxxxxxxxxxxxxxxxxxxx
[[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
...
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]]
(10000, 28, 28)
[7 2 1 ... 4 5 6] 10000
(10000, 10)
range(0, 10000)
[7 2 1 ... 4 5 6]
1.0 0.0 1.0
[[0. 0. 0. ... 1. 0. 0.]
[0. 0. 1. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
# Step 2: select a small subset of training and testing data for debugging
print(len(img_train))
print(len(img_test))
60000
10000
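The cell above only prints the dataset sizes. One possible way to actually draw a small debugging subset is sketched below; the names `n_debug`, `img_small` and `lab_small` are hypothetical, and random stand-in arrays replace the real MNIST data so the snippet runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in arrays with MNIST-like shapes (1000 samples instead of 60000).
img = rng.integers(0, 256, size=(1000, 28, 28))
lab = rng.integers(0, 10, size=1000)

# Hypothetical subset size for quick debugging runs.
n_debug = 100

# Sample row indices without replacement, then index images and labels
# with the same indices so they stay aligned.
idx = rng.choice(len(img), size=n_debug, replace=False)
img_small, lab_small = img[idx], lab[idx]

print(img_small.shape, lab_small.shape)
```

Sampling at random rather than taking the first rows keeps the class mix of the subset close to that of the full set.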
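# Step 3: check loaded data by showing some key properties of the numpy arrays
# numpy.mean() returns the arithmetic mean of the array elements; if an axis is given, the mean is computed along it. The arithmetic mean is the sum of the elements along the axis divided by the number of elements.
# the parameter used most often is axis; taking an m * n matrix as an example:
# axis not set: averages all m*n values and returns a single number
# axis = 0: collapses the rows, averaging each column, and returns an array of n values
# axis = 1: collapses the columns, averaging each row, and returns an array of m values
# one_train is 60000*10 and one_test is 10000*10
# with axis = 0 the rows are collapsed, giving a one-dimensional array of 10 values (the fraction of samples in each class)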
print(img_train.shape, lab_train.shape, np.mean(one_train, axis = 0))
print(img_test.shape, lab_test.shape, np.mean(one_test, axis = 0))
(60000, 28, 28) (60000,) [0.09871667 0.11236667 0.0993 0.10218333 0.09736667 0.09035
0.09863333 0.10441667 0.09751667 0.09915 ]
(10000, 28, 28) (10000,) [0.098 0.1135 0.1032 0.101 0.0982 0.0892 0.0958 0.1028 0.0974 0.1009]
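The column means of a one-hot matrix are the class frequencies, so they can be cross-checked against a direct count of the labels. A small sketch, using the first 16 test labels printed later in this lab as sample input:

```python
import numpy as np

# First 16 test labels, as printed in Step 5 of this lab.
labels = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5])

# One-hot encode, then average over the rows (axis=0).
one_hot = np.eye(10)[labels]
freq_from_one_hot = one_hot.mean(axis=0)

# Same quantity computed directly: count each digit and divide by the sample count.
freq_from_counts = np.bincount(labels, minlength=10) / len(labels)

print(freq_from_one_hot)
print(freq_from_counts)
```

Both arrays sum to 1, and values near 0.1 for every digit indicate a roughly balanced dataset, as seen in the output above.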
# Step 4: check loaded data by showing a few images
import matplotlib.pyplot as plt
%matplotlib inline
size = 4
plt.figure(figsize=(size, size))
for i in range(size*size):
    # subplot creates a single subplot in the size x size grid
    plt.subplot(size, size, i+1)
    # imshow() has the form: matplotlib.pyplot.imshow(X, cmap=None)
    # X: the image or array to draw; cmap: the colormap applied to scalar (2-D) data, ignored when X is already RGB(A)
    plt.imshow(img_test[i], cmap='gray')
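A variant of the image grid is sketched below using `plt.subplots`, with axis ticks hidden and the label shown as the title of each panel. Random arrays stand in for `img_test` and `lab_test`, and the Agg backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(16, 28, 28))   # stand-in for img_test[:16]
labels = rng.integers(0, 10, size=16)              # stand-in for lab_test[:16]

size = 4
fig, axes = plt.subplots(size, size, figsize=(size, size))
for ax, img, lab in zip(axes.ravel(), images, labels):
    ax.imshow(img, cmap="gray")          # grayscale colormap for 2-D data
    ax.set_title(str(lab), fontsize=8)   # show the label above each image
    ax.axis("off")                       # hide ticks, which clutter small panels
fig.tight_layout()
```

Showing the label next to each image makes it easy to spot a mismatch between images and labels at a glance.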
# Step 5: check loaded data by showing a few labels
print(lab_test[:size*size])
print(one_test[:size*size])
[7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5]
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]
# Step 6: build and evaluate a classification model with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid
img_train = img_train.reshape((-1, 28*28)) / 256
img_test = img_test.reshape((-1, 28*28)) / 256
print(img_train, img_test)
print(img_train.shape, img_test.shape) # (60000, 784) (10000, 784)
model = NearestCentroid() # nearest centroid classifier (not a nearest-neighbor search)
%time model.fit(img_train, lab_train)
# score the nearest centroid classifier on the test set with the default evaluation method (mean accuracy); the closer to 1 the better
%time score = model.score(img_test, lab_test)
print(score)
# score in scikit-learn provides a default evaluation metric: it evaluates the trained model on the test set (0 to 1, where 1 is best)
# clf.fit(X_train,Y_train)
# print(clf.score(X_test,Y_test))
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]] [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
(60000, 784) (10000, 784)
CPU times: user 453 ms, sys: 35.6 ms, total: 489 ms
Wall time: 486 ms
CPU times: user 334 ms, sys: 147 ms, total: 480 ms
Wall time: 41 ms
0.8203
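To show what a nearest centroid classifier does, here is a minimal NumPy sketch under the assumption of Euclidean distance: each class is represented by the mean of its training samples, and a test sample gets the class of the closest mean. The helper names and the toy data are invented; this is not scikit-learn's implementation.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the sorted class labels and one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest_centroid(X, classes, centroids):
    """Assign each sample the class whose centroid is closest (squared Euclidean)."""
    # Broadcasting gives an array of shape (n_samples, n_classes, n_features);
    # fine for toy data, memory-hungry for large inputs.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d2, axis=1)]


X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5], [4.0, 4.0]])

classes, centroids = fit_centroids(X_train, y_train)
pred = predict_nearest_centroid(X_test, classes, centroids)

print(centroids)
print(pred)
```

Because every digit is summarised by a single average image, the model is fast to train but cannot capture different writing styles of the same digit, which is consistent with the moderate accuracy above.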
# Step 7: design a small experiment and plot a figure showing some key observations
acc = []
for i in range(10):
    idx = lab_test == i
    score = model.score(img_test[idx], lab_test[idx])
    acc.append(score)
plt.bar(range(10), acc)
# bar chart of the recognition accuracy for each digit 0-9
<BarContainer object of 10 artists>
Tell a story based on the key observations
# advanced experiments (optional)
# np.int was removed from recent NumPy releases; the built-in int is used instead
count = np.zeros((10, 10), dtype=int)
pred_test = model.predict(img_test)
# count[i, j] = number of test samples with true label i predicted as j
for i, j in zip(lab_test, pred_test):
    count[i, j] += 1
# normalize each row so that it sums to 100%
count = (count.T / np.sum(count, axis=-1)).T * 100.0
print(" ", end="")
for j in range(10):
    print(" %4d " % j, end="")
print()
for i in range(10):
    print(i, end="")
    for j in range(10):
        print(" %4.1f%%" % count[i, j], end="")
    print()
0 1 2 3 4 5 6 7 8 9
0 89.6% 0.0% 0.7% 0.2% 0.2% 5.9% 2.6% 0.1% 0.7% 0.0%
1 0.0% 96.2% 0.9% 0.3% 0.0% 0.6% 0.3% 0.0% 1.8% 0.0%
2 1.8% 6.9% 75.7% 3.2% 3.0% 0.3% 2.2% 1.7% 4.8% 0.3%
3 0.4% 2.4% 2.5% 80.6% 0.1% 4.9% 0.8% 1.5% 5.7% 1.2%
4 0.1% 2.2% 0.2% 0.0% 82.6% 0.3% 1.6% 0.1% 1.0% 11.8%
5 1.2% 7.1% 0.2% 13.2% 2.4% 68.6% 3.0% 1.1% 1.5% 1.7%
6 1.9% 2.8% 2.3% 0.0% 3.2% 3.3% 86.3% 0.0% 0.1% 0.0%
7 0.2% 5.7% 2.1% 0.1% 1.9% 0.2% 0.0% 83.3% 1.3% 5.2%
8 1.4% 4.0% 1.1% 8.5% 1.2% 3.7% 1.3% 1.0% 73.7% 3.9%
9 1.5% 2.2% 0.7% 1.0% 8.2% 1.2% 0.1% 2.7% 1.8% 80.7%
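In the table above, rows are true digits and columns are predicted digits, so the diagonal holds the per-digit accuracy plotted in Step 7. The same row-normalized matrix can be built without a Python loop; the sketch below uses a hypothetical helper `confusion_percent` and made-up labels with 3 classes.

```python
import numpy as np


def confusion_percent(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix in percent (rows: true, columns: predicted)."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (i, j) pair repeats.
    np.add.at(counts, (y_true, y_pred), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes absent from y_true.
    return 100.0 * counts / np.maximum(row_totals, 1)


y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

cm = confusion_percent(y_true, y_pred, 3)
per_class_accuracy = np.diag(cm)

print(cm)
print(per_class_accuracy)
```

Off-diagonal entries show which digits are confused with each other, for example the 4 to 9 and 5 to 3 entries in the table above.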