Machine Learning in Python
3.19
print("Zeroth Value: %d" % mylist[0])
=
print("Zeroth Value:" , mylist[0])
3.31 Line Plot
(Just playing with the difference in point size; I was bored.)
plt.plot([1, 2, 3])
plt.plot(numpy.array([1, 1, 4]))
3.32 Scatter Plot
plt.scatter(x,y)
plt.scatter(x,y,x)
plt.scatter(x,y,y)
plt.scatter(x,y,z)
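A minimal sketch of what the third positional argument does in the calls above: in matplotlib's scatter() it is s, the marker area in points squared. The arrays x, y and z are not defined in these notes, so the values below are made up for illustration.

```python
import numpy
import matplotlib.pyplot as plt

# Hypothetical sample data; x, y and z are not defined in the notes.
x = numpy.array([1, 2, 3])
y = numpy.array([2, 4, 6])
z = numpy.array([20, 80, 320])

fig, ax = plt.subplots()
# The third positional argument of scatter() is s (marker area in points**2),
# so each point is drawn with the size taken from z.
points = ax.scatter(x, y, z)
sizes = points.get_sizes()
print(sizes)
```

Call plt.show() afterwards to open the viewer and compare the marker sizes.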
3.33 Pandas Series
print(myseries[0])
print(myseries['a'])
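A small sketch of the two access styles, assuming myseries was built with string labels (the notes do not show its construction). Recent pandas versions warn when an integer key such as myseries[0] is used as a position on a label-indexed Series, so .iloc is used here for the positional case.

```python
import numpy
import pandas

# Hypothetical series; the notes do not show how myseries was built.
myseries = pandas.Series(numpy.array([1, 2, 3]), index=['a', 'b', 'c'])

first_by_position = myseries.iloc[0]  # explicit positional access
first_by_label = myseries['a']        # label access
print(first_by_position, first_by_label)
```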
————————————
Chapter 3 summary
————————————————————————————————————————————————————————
4.3 Load CSV Files with NumPy
ValueError: could not convert string to float
- Importing datasets that contain non-float data
Use dtype (numpy.str was removed in NumPy 1.24; the built-in str works):
data = loadtxt(raw_data, delimiter=",", dtype=str)
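A self-contained sketch of loading mixed text and numeric data as strings, then converting the numeric columns. The two-row CSV is invented for illustration and stands in for raw_data.

```python
from io import StringIO
from numpy import loadtxt

# Hypothetical two-row CSV with a text column, standing in for raw_data.
raw_data = StringIO("5.1,3.5,setosa\n4.9,3.0,versicolor")

# dtype=str keeps every field as text instead of converting it to float.
data = loadtxt(raw_data, delimiter=",", dtype=str)
print(data.shape)

# The numeric columns can be converted afterwards.
numeric = data[:, :2].astype(float)
print(numeric)
```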
4.5 loading a CSV URL using NumPy
ValueError: Wrong number of columns at line 161
Line 161 probably has more columns than the uniform column count of the preceding lines. The error is raised in read_data() in npyio.py, around line 1058:
    if len(vals) != N:
        line_num = i + skiprows + 1
        raise ValueError("Wrong number of columns at line %d" % line_num)
eg.
1 2
3 4 5
6 7
I don't know how to fix this yet.
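One possible way to track down and skip such lines, as a sketch rather than the book's fix: read the file with the standard csv module, compare each row's width with the first row, and keep only the well-formed rows. It assumes the first row has the correct number of columns, and it uses the ragged example above written as comma-separated text.

```python
import csv
from io import StringIO

# The ragged example from the notes, written as comma-separated text.
raw = StringIO("1,2\n3,4,5\n6,7\n")

rows = list(csv.reader(raw))
expected = len(rows[0])  # assumption: the first row has the correct width

# 1-based line numbers whose column count differs from the first row
bad_lines = [i for i, row in enumerate(rows, start=1) if len(row) != expected]
print(bad_lines)

# Keep only the well-formed rows and convert them to floats
clean = [[float(v) for v in row] for row in rows if len(row) == expected]
print(clean)
```

Whether dropping the odd rows is acceptable depends on the dataset; inspecting the reported line first is the safer step.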
4.7 loading a CSV file using Pandas
- To look at row 0, use data[0:1]. What is the resulting data type?
- With NumPy arrays and plain Python lists, use data[0]
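A short sketch that answers the type question, using an invented three-row CSV without a header: slicing rows gives a DataFrame, while .iloc[0] gives a Series.

```python
from io import StringIO
from pandas import read_csv

# Hypothetical three-row CSV without a header line.
raw = StringIO("6,148,72\n1,85,66\n8,183,64")
data = read_csv(raw, header=None)

row_slice = data[0:1]      # slicing rows returns a DataFrame with one row
row_series = data.iloc[0]  # positional row access returns a Series
print(type(row_slice).__name__, type(row_series).__name__)
```

Note that on a DataFrame, data[0] selects the column labelled 0, not the first row, which is why the slice or .iloc is needed.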
4.9 loading a CSV URL using Pandas
names gives the column labels, so a column is selected with data['preg']
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
To look at line 161: print(data[161:162])
Chapter 4 summary
————————————————————————————————————————————————————————
5.13 Skew of Univariate Distributions
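- What a skewed distribution means
Skew refers to a distribution that is assumed Gaussian (normal or bell curve) that is shifted or squashed in one direction or another. Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
- The central tendency of a skewed distribution is usually described by the median
Peak shifted left: right-skewed, positive skew
Peak shifted right: left-skewed, negative skew
Compared with the normal distribution, a skewed distribution has two characteristics:
First, it is asymmetric (hence "skewed");
Second, as the sample size grows, the distribution of its sample mean approaches a normal distribution.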
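The points above can be checked with a small sketch. The data is a made-up exponential sample (right-skewed), not the book's dataset: its skew is positive, the mean sits above the median, and the means of many samples of size 50 are much less skewed.

```python
import numpy
import pandas

rng = numpy.random.default_rng(0)
# Hypothetical right-skewed sample (exponential), not the book's dataset.
values = pandas.Series(rng.exponential(scale=1.0, size=10000))

skew = values.skew()  # positive for a right-skewed distribution
mean, median = values.mean(), values.median()
print(skew, mean, median)

# Means of 2000 samples of size 50 each: their distribution is far less skewed
sample_means = pandas.Series(
    rng.exponential(scale=1.0, size=(2000, 50)).mean(axis=1))
means_skew = sample_means.skew()
print(means_skew)
```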
Chapter 5 summary
————————————————————————————————————————————————————————
Understand Your Data With Visualization
6.3 Box and Whisker Plots
6.4
# Correlation Matrix Plot
from matplotlib import pyplot
from pandas import read_csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
correlations = data.corr()
# plot correlation matrix
fig = pyplot.figure()  # create a new figure explicitly (plotting commands would otherwise create one automatically)
ax = fig.add_subplot(111)  # "mnp": split the canvas into an m x n grid and use cell p
cax = ax.matshow(correlations, vmin=-1, vmax=1)  # plot a matrix or an array as an image
fig.colorbar(cax)  # add a colorbar to the plot
ticks = numpy.arange(0, 9, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
pyplot.show()  # open the matplotlib viewer; once it is closed, later plotting starts a new figure
6.6
from pandas.tools.plotting import scatter_matrix
Error: ModuleNotFoundError: No module named 'pandas.tools'
-
from pandas.plotting import scatter_matrix
OK
——————————————————————————————————————————————————————
7 pre-processing
rescale: the result doesn't look right
Don't understand it
Data processing
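A sketch of what rescaling computes, assuming "rescale" here means min-max rescaling of each attribute to the range [0, 1]. The data is invented, and the formula is written out with NumPy so each step can be inspected.

```python
import numpy

# Hypothetical data: two attributes on very different scales.
X = numpy.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [4.0, 1000.0]])

col_min = X.min(axis=0)  # per-column minimum
col_max = X.max(axis=0)  # per-column maximum

# Min-max rescaling: every column ends up between 0 and 1
rescaled = (X - col_min) / (col_max - col_min)
print(rescaled)
```

scikit-learn's MinMaxScaler(feature_range=(0, 1)).fit_transform(X) should give the same values, which makes this a handy cross-check when the output looks wrong.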
——————————————————————————————————————————————————————
8 Feature Selection
PCA
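Compute the covariance matrix of the data matrix, then find the eigenvalues and eigenvectors of that covariance matrix, and build a matrix from the eigenvectors that correspond to the k largest eigenvalues (that is, the directions of largest variance). Projecting the data matrix onto these vectors maps it into a new space and reduces the dimensionality of the features.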
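The steps described above can be sketched directly with NumPy. The random data and k = 2 are assumptions for illustration; in practice a library PCA implementation would normally be used.

```python
import numpy

rng = numpy.random.default_rng(0)
# Hypothetical data: 100 samples with 4 features
X = rng.normal(size=(100, 4))
k = 2  # number of components to keep (assumption)

X_centered = X - X.mean(axis=0)                # centre each feature
cov = numpy.cov(X_centered, rowvar=False)      # 4x4 covariance matrix
eigvals, eigvecs = numpy.linalg.eigh(cov)      # eigenvalues in ascending order
order = numpy.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
components = eigvecs[:, order]                 # 4xk projection matrix
X_reduced = X_centered @ components            # data in the new k-dimensional space
print(X_reduced.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why picking the largest eigenvalues keeps the most variance.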
——————————————————————————————————————————————————————
10 Machine Learning Algorithm Performance Metrics
-
Classification Metrics
-
Logarithmic Loss: the smaller, the better
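A minimal sketch of why smaller is better, using the standard binary log loss formula written out with NumPy. The function name log_loss, the eps clipping value, and the labels and probabilities are illustrative assumptions.

```python
import numpy

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary logarithmic loss; eps clipping avoids log(0)."""
    y = numpy.asarray(y_true, dtype=float)
    p = numpy.clip(numpy.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-numpy.mean(y * numpy.log(p) + (1 - y) * numpy.log(1 - p)))

# Confident, correct probabilities give a lower loss than hesitant ones
confident = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
unsure = log_loss([1, 0, 1], [0.6, 0.4, 0.5])
print(confident, unsure)
```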
————————————————————————————————————————————————————————
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$