Python Training Day 3

郭大侠笔记

Article link: https://blog.csdn.net/weixin_43813300/article/details/90746752

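1. Classification

1.1 Basic concepts

Classification assigns labels to objects according to some criterion.

1.2 Classification methods

Manual labeling
Rule-based methods (rules written by hand)
Statistical / probabilistic methods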

1.3 Classification workflow

[Figure: classification workflow]

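1.4 Feature selection

Remove noisy features to reduce overfitting.

1.4.1 Basic feature selection algorithm

For each class c, select the K features with the highest scores.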
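
A minimal sketch of this step, assuming the utility score of every feature for class c has already been computed. The helper name `select_top_k` and the example scores are made up for illustration; they do not come from the course material.

```python
def select_top_k(scores, k):
    """Return the k highest-scoring features for one class, best first.

    scores: dict mapping feature -> utility score for class c (assumed given).
    """
    # Sort by descending score; break ties alphabetically so the result is stable
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:k]]


# Hypothetical scores for a class "sports"
sports_scores = {"goal": 0.42, "team": 0.35, "the": 0.01, "match": 0.30, "bank": 0.02}
print(select_top_k(sports_scores, 3))  # ['goal', 'team', 'match']
```

Which scoring function fills `scores` is a separate choice; the selection step itself is just a ranking.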

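1.4.2 Factors to consider in feature selection

Within-class representativeness: the feature should be typical of the class.
A feature that appears only once or twice is not a good feature.
Between-class discrimination: the feature should help tell the classes apart.
For example, a feature that appears frequently in every class is not a good feature.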

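1.4.3 Different feature selection methods

Commonly used feature utility measures:
Document frequency (DF): select high-frequency terms
Information gain (IG): how much information the feature carries about the class
Chi-square: the larger the chi-square value, the less independent the feature is of the class
Comparison of methods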
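
The three measures can be sketched from a 2x2 contingency table of document counts for one term and one class. The notation below is an assumption introduced here, not taken from the notes: `n11` = documents in the class that contain the term, `n10` = documents outside the class that contain it, `n01` = documents in the class without it, `n00` = documents outside the class without it. Information gain is computed as the mutual information between term presence and class membership, which is one common formulation.

```python
import math


def document_frequency(n11, n10):
    """Number of documents that contain the term, regardless of class."""
    return n11 + n10


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of the 2x2 term/class table (0 means independent)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0  # a whole row or column is empty: no evidence of association
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def information_gain(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership."""
    n = n11 + n10 + n01 + n00
    # Each entry: (joint count, term marginal, class marginal)
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            total += joint / n * math.log2(n * joint / (term_marginal * class_marginal))
    return total


# Hypothetical counts: a term strongly tied to the class vs. an evenly spread term
print(document_frequency(40, 10), chi_square(40, 10, 10, 40), information_gain(40, 10, 10, 40))
print(document_frequency(25, 25), chi_square(25, 25, 25, 25), information_gain(25, 25, 25, 25))
```

The evenly spread term scores zero on both chi-square and information gain even though its document frequency is the same as the other term's, which is why DF alone says nothing about between-class discrimination.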

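1.4.4 Notes on feature selection

When selecting the top K features for each class, how should K be chosen?
By cross-validation (to be continued).
In practice, plot how classification performance changes as K varies.
Does the number of features selected per class have to be fixed?
No; a score threshold can be used as the cutoff instead.
Should the features selected for each class be merged into a single global feature space?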
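
A small sketch of the threshold cutoff, followed by one possible way to build a global feature space (taking the union of the per-class selections). The notes leave the last question open, so the union is shown only as an option. The function names and scores are hypothetical.

```python
def select_by_threshold(scores, threshold):
    """Keep every feature whose score is at least `threshold`."""
    return {feature for feature, score in scores.items() if score >= threshold}


def global_feature_space(per_class_features):
    """Union of the per-class selections, sorted for a stable column order."""
    merged = set()
    for features in per_class_features.values():
        merged |= set(features)
    return sorted(merged)


# Hypothetical per-class scores
scores_by_class = {
    "sports": {"goal": 0.42, "team": 0.35, "the": 0.01},
    "finance": {"bank": 0.50, "team": 0.05, "stock": 0.31},
}
selected = {c: select_by_threshold(s, 0.3) for c, s in scores_by_class.items()}
print(selected)
print(global_feature_space(selected))  # ['bank', 'goal', 'stock', 'team']
```

With a threshold, the number of features kept can differ from class to class, unlike the fixed top-K rule.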

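1.5 Evaluating text classification

Evaluation must be carried out on test data that is completely separate from the training data (the two sets share no samples).
High performance is easy to obtain on the training set, so training-set scores say little about how well the classifier generalizes.
Metrics: precision (P), recall (R), F1 score, classification accuracy, and so on.
- Precision, recall, and accuracy are only meaningful when the data is reasonably balanced (no extreme class skew)
- Discussion of precision and recall
- F-measure
- F1 strikes a balance between precision and recall
- It is the harmonic mean of P and R: F1 = 2PR / (P + R)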
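
These metrics can be computed directly from true and predicted labels. The sketch below is plain Python with made-up labels; the function names are illustrative. The second example shows why class balance matters: a classifier that always predicts the majority class reaches high accuracy while its recall is zero.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Balanced-ish example: 3 true positives, 1 false positive, 1 false negative
print(precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0]))  # (0.75, 0.75, 0.75)

# Skewed example: 5 positives, 95 negatives, classifier always predicts 0
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
print(accuracy(y_true, y_pred), precision_recall_f1(y_true, y_pred))
```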
For the plotting module, see:
https://matplotlib.org/gallery/index.html

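# numerical package
import numpy as np
# plotting
import matplotlib.pyplot as plt
# evenly spaced values: linspace(start, stop, number of points)
x = np.linspace(0, 10, 100)
y = np.sin(x)
z = np.cos(x)
# tan(x) has poles near pi/2 + k*pi, so its curve dominates the y-axis range
b = np.tan(x)
# line plots, all drawn on the same axes
plt.plot(x, y)
plt.plot(x, z)
plt.plot(x, b)
# title
plt.title('pic 1')
plt.show()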

[Figure: line plot of sin(x), cos(x), and tan(x), titled 'pic 1']

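# numerical package
import numpy as np
# plotting
import matplotlib.pyplot as plt
# evenly spaced values: linspace(start, stop, number of points)
x = np.linspace(0, 10, 100)
y = np.sin(x)
z = np.cos(x)
fig = plt.figure()
# split the canvas into a grid: add_subplot(rows, columns, index of the selected cell)
ax = fig.add_subplot(2, 2, 3)
ax.set(xlim=[0, 10], ylim=[-1, 1], title='pic 3', ylabel='Num', xlabel='time')
# draw on the axes object explicitly instead of relying on the current axes
ax.plot(x, y, 'r')
bx = fig.add_subplot(2, 2, 2)
bx.set(xlim=[0, 10], ylim=[-1, 1], title='pic 2', ylabel='Num', xlabel='time')
bx.plot(x, z)
plt.show()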

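[Figure: two subplots in a 2x2 grid, 'pic 3' (sin, red) at the lower left and 'pic 2' (cos) at the upper right]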

# numerical package
import numpy as np
# plotting
import matplotlib.pyplot as plt
# figure with a red background
fig = plt.figure(facecolor='red')
# split the canvas into a grid (rows, columns, index of the selected cell); 111 is a single cell
ax = fig.add_subplot(111)
ax.set(xlim=[0, 100], ylim=[0, 100], title='pic 3', ylabel='Num', xlabel='time')
# 100 random integer coordinates in [0, 100)
x = np.random.randint(0, 100, 100)
y = np.random.randint(0, 100, 100)
# scatter plot with red star markers
ax.scatter(x, y, facecolor='r', marker='*')
plt.show()

[Figure: scatter plot of random points drawn as red stars on a red figure background, titled 'pic 3']

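Training a classifier: support vector machine (SVM)

Training is relatively slow, but the classification results are usually good.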

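from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
 
# generate a random point data set
np.random.seed(0)  # seed the random number generator so the values are reproducible
# np.r_ stacks two arrays vertically (one on top of the other); the column counts must match, similar to pandas concat().
# np.c_ joins two arrays side by side; the row counts must match, similar to pandas concat(axis=1).
x = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
# labels: a list of 20 zeros followed by 20 ones
y = [0] * 20 + [1] * 20
print(x)
print(y)
# build and train the classifier
# kernel options: 'linear', 'poly' (polynomial), 'rbf' (radial basis function, used most often), 'sigmoid'
clf2 = svm.SVC(kernel='linear')
clf2.fit(x, y)
# indices of the support vectors in the training data
print(clf2.support_)
# draw the support vectors with larger markers
plt.scatter(clf2.support_vectors_[:, 0], clf2.support_vectors_[:, 1], s=80)
# draw all points; arguments: x, y, color, colormap, marker shape
plt.scatter(x[:, 0], x[:, 1], c=y, cmap=plt.cm.Paired, marker='o')
 
plt.axis('tight')
# plt.savefig("dd")  # save the figure
plt.show()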

[[-0.23594765 -1.59984279]
 [-1.02126202  0.2408932 ]
 [-0.13244201 -2.97727788]
 [-1.04991158 -2.15135721]
 [-2.10321885 -1.5894015 ]
 [-1.85595643 -0.54572649]
 [-1.23896227 -1.87832498]
 [-1.55613677 -1.66632567]
 [-0.50592093 -2.20515826]
 [-1.6869323  -2.85409574]
 [-4.55298982 -1.3463814 ]
 [-1.1355638  -2.74216502]
 [ 0.26975462 -3.45436567]
 [-1.95424148 -2.18718385]
 [-0.46722079 -0.53064123]
 [-1.84505257 -1.62183748]
 [-2.88778575 -3.98079647]
 [-2.34791215 -1.84365103]
 [-0.76970932 -0.79762015]
 [-2.38732682 -2.30230275]
 [ 0.95144703  0.57998206]
 [ 0.29372981  3.9507754 ]
 [ 1.49034782  1.5619257 ]
 [ 0.74720464  2.77749036]
 [ 0.38610215  1.78725972]
 [ 1.10453344  2.3869025 ]
 [ 1.48919486  0.81936782]
 [ 1.97181777  2.42833187]
 [ 2.06651722  2.3024719 ]
 [ 1.36567791  1.63725883]
 [ 1.32753955  1.64044684]
 [ 1.18685372  0.2737174 ]
 [ 2.17742614  1.59821906]
 [ 0.36980165  2.46278226]
 [ 1.09270164  2.0519454 ]
 [ 2.72909056  2.12898291]
 [ 3.13940068  0.76517418]
 [ 2.40234164  1.31518991]
 [ 1.12920285  1.42115034]
 [ 1.68844747  2.05616534]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[ 1 14 20]

[Figure: scatter plot of the two classes with the support vectors drawn as larger markers]
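
As a follow-up sketch, the trained linear SVM can be used to classify new points, and its separating line can be read from the fitted model. The two query points and the variable names below are chosen for illustration; the data set is regenerated with the same seed so that the block runs on its own.

```python
import numpy as np
from sklearn import svm

# Same data set as above: two Gaussian clusters centered at (-2, -2) and (2, 2)
np.random.seed(0)
x = np.r_[np.random.randn(20, 2) - [2, 2], np.random.randn(20, 2) + [2, 2]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear')
clf.fit(x, y)

# For a linear kernel the decision boundary is w[0]*x0 + w[1]*x1 + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)

# Classify two illustrative points, one near each cluster center
new_points = np.array([[-2.0, -2.0], [2.0, 2.0]])
pred = clf.predict(new_points)
print(pred)  # [0 1]

# decision_function returns the signed score w.x + b; its sign decides the class
scores = clf.decision_function(new_points)
print(scores)
```

`coef_` is only available for the linear kernel; with 'rbf' or 'poly' the boundary is not a straight line and has to be visualized by evaluating `decision_function` on a grid.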