Machine Learning (1)

Classification

Linear Classifiers

A linear classifier assumes a linear relationship between the features and the classification result.
Let x = <x_1, x_2, ..., x_n> denote an n-dimensional feature column vector, and let w = <w_1, w_2, ..., w_n> be the corresponding weight vector. The linear relationship can then be written as:
f(w, x, b) = w^T x + b, where f ∈ R.
For binary classification we want an output in [0, 1], so we apply the logistic function g(z) = 1 / (1 + e^(-z)).
Substituting z = f(w, x, b) into g yields the logistic regression model.
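As a quick sanity check, the decision function above can be sketched in a few lines of NumPy. The weights, bias, and input below are made-up values purely for illustration:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x, b):
    """Probability of the positive class under a linear model w^T x + b."""
    return logistic(np.dot(w, x) + b)

# Hypothetical 2-feature example
w = np.array([0.8, -0.4])
x = np.array([1.0, 2.0])
b = 0.1
p = predict_proba(w, x, b)      # here w^T x + b = 0.1, so p is just above 0.5
label = 1 if p >= 0.5 else 0
```

This is the same mapping logistic regression learns; training consists of choosing w and b to fit the data.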
Example: benign vs. malignant tumors

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.metrics import classification_report


column_names=['Sample code number','Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin','Normal Nucleoli','Mitoses','Class']
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",names=column_names)
# '?' marks missing values in this dataset: replace with NaN, then drop incomplete rows
data = data.replace(to_replace='?',value=np.nan)
data = data.dropna(how='any')
print(data.shape)
print(data)
x_train,x_test,y_train,y_test=train_test_split(data[column_names[1:10]],data[column_names[10]],test_size=0.25,random_state=33)
print(y_train.value_counts())
print(y_test.value_counts())
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)

lr = LogisticRegression()
sgdc = SGDClassifier()
lr.fit(x_train,y_train)
lr_y_predict = lr.predict(x_test)
sgdc.fit(x_train,y_train)
sgdc_y_predict = sgdc.predict(x_test)
print('Accuracy of LR Classifier:',lr.score(x_test,y_test))
print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))
print('Accuracy of SGD Classifier:',sgdc.score(x_test,y_test))
print(classification_report(y_test,sgdc_y_predict,target_names=['Benign','Malignant']))

Support Vector Machines

A support vector machine searches, based on the distribution of the training samples, for the best among all possible linear classifiers, namely the one with the largest margin between the classes.
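"Best" here means maximum margin: for a separating hyperplane w^T x + b = 0 scaled so that |w^T x_i + b| >= 1 on every training point, the margin width is 2 / ||w||, which is what the SVM maximizes. A small sketch with a made-up weight vector:

```python
import numpy as np

# Hypothetical weight vector of a trained linear SVM
w = np.array([3.0, 4.0])

# Margin width between the two supporting hyperplanes w^T x + b = +1 and -1
margin = 2.0 / np.linalg.norm(w)   # smaller ||w|| means a wider margin
```

Minimizing ||w|| subject to the training constraints is exactly how the SVM picks one classifier out of all the separating ones.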
Example: handwritten digit recognition

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

digits = load_digits()
print(digits.data.shape)
x_train,x_test,y_train,y_test=train_test_split(digits.data,digits.target,test_size=0.25,random_state=33)
print(y_train.shape)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)  # reuse the scaler fitted on the training set; do not refit on test data

lsvc = LinearSVC()
lsvc.fit(x_train,y_train)
y_p = lsvc.predict(x_test)
print(lsvc.score(x_test,y_test))
print(classification_report(y_test,y_p,target_names=digits.target_names.astype(str)))

Naive Bayes

Naive Bayes assumes that, conditioned on the class, the features along each dimension are mutually independent.
It applies Bayes' theorem:
p(y|x) = p(x|y) p(y) / p(x)
The goal is to find the y in {c_1, c_2, ..., c_k} that maximizes p(y|x).
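Since p(x) is identical for every candidate class, it can be dropped when comparing posteriors; it is enough to maximize p(x|y) p(y). A tiny numeric sketch with made-up priors and likelihoods:

```python
# Hypothetical two-class example: class priors p(y) and the
# class-conditional likelihoods p(x|y) for one observed x
priors = {'c1': 0.6, 'c2': 0.4}
likelihood = {'c1': 0.2, 'c2': 0.5}

# p(y|x) is proportional to p(x|y) * p(y); p(x) is shared by all classes,
# so the argmax over these unnormalized scores gives the prediction
scores = {c: likelihood[c] * priors[c] for c in priors}
best = max(scores, key=scores.get)   # 'c2' wins: 0.5*0.4 > 0.2*0.6
```

A real naive Bayes classifier factorizes p(x|y) into a product over feature dimensions, which is where the independence assumption enters.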
Example: news text classification

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report


news = fetch_20newsgroups(subset='all')  # download both the train and test portions
print(len(news.data))
x_train,x_test,y_train,y_test=train_test_split(news.data,news.target,test_size=0.25,random_state=33)
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)

mnb = MultinomialNB()
mnb.fit(x_train,y_train)
y_p = mnb.predict(x_test)

print(mnb.score(x_test,y_test))
print(classification_report(y_test,y_p,target_names=news.target_names))

K-Nearest Neighbors

K-nearest neighbors finds the K labeled samples closest in feature space to the sample being classified, and uses them as references to make the classification decision.
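The idea can be sketched directly with NumPy on toy data (Euclidean distance and majority vote; the data below are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(x_train - x, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 0, 1, 1])
pred = knn_predict(x_train, y_train, np.array([0.1, 0.1]), k=3)
```

Note there is no training phase at all: the "model" is the stored training set itself, which is why KNN is called a lazy learner.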
Example: Iris dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

iris = load_iris()
print(iris.data.shape)
print(iris.DESCR)
x_train,x_test,y_train,y_test=train_test_split(iris.data,iris.target,test_size=0.25,random_state=33)
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
knc = KNeighborsClassifier()
knc.fit(x_train,y_train)
y_p = knc.predict(x_test)
print(knc.score(x_test,y_test))
print(classification_report(y_test,y_p,target_names=iris.target_names))

Decision Trees

Decision trees can describe nonlinear relationships.
When building a decision tree, the learning procedure must consider the order in which feature nodes are selected.
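A common criterion for deciding which feature to split on first is impurity reduction, for example using Gini impurity. A small sketch with made-up labels, showing that a split which cleanly separates the classes drives the impurity of the child nodes to zero:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label set: 0 for a pure node, 0.5 for a 50/50 binary split."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node and one candidate split
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])

# Weighted impurity of the children; the feature with the largest
# drop from gini(parent) is chosen for the split
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
```

scikit-learn's DecisionTreeClassifier uses this Gini criterion by default (entropy is also available via the `criterion` parameter).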
Example: Titanic passenger data

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report


titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
print(titanic.head())
print(titanic.info())
x = titanic[['pclass','age','sex']].copy()
y = titanic['survived']
print(x.info())
# fill missing ages with the mean age
x['age'] = x['age'].fillna(x['age'].mean())
print(x.info())
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=33)
vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(x_train.to_dict(orient='records'))
print(vec.feature_names_)
x_test = vec.transform(x_test.to_dict(orient='records'))
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
y_p = dtc.predict(x_test)
print(dtc.score(x_test,y_test))
print(classification_report(y_test,y_p,target_names=['died','survived']))

Ensemble Classification

An ensemble model makes its final decision by combining the predictions of multiple classifiers.
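The simplest way to combine classifiers is majority voting (the random forest and gradient boosting models in this section combine trees in more elaborate ways, but the intuition is the same). A sketch with made-up predictions from three classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-sample predictions from several classifiers by majority vote."""
    # zip(*...) groups the i-th prediction of every classifier together
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

# Hypothetical predictions from three classifiers on four samples
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]
combined = majority_vote([clf_a, clf_b, clf_c])
```

Even when each individual classifier makes some mistakes, the vote can be more accurate than any single member, provided their errors are not strongly correlated.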

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report


titanic=pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')
x = titanic[['pclass','age','sex']].copy()
y = titanic['survived']

x['age'] = x['age'].fillna(x['age'].mean())
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=33)
vec = DictVectorizer(sparse=False)
x_train=vec.fit_transform(x_train.to_dict(orient='records'))
x_test = vec.transform(x_test.to_dict(orient='records'))
dtc = DecisionTreeClassifier()
dtc.fit(x_train,y_train)
dtc_p = dtc.predict(x_test)

rfc = RandomForestClassifier()
rfc.fit(x_train,y_train)
rfc_p = rfc.predict(x_test)

gbc = GradientBoostingClassifier()
gbc.fit(x_train,y_train)
gbc_p = gbc.predict(x_test)

print(classification_report(y_test,dtc_p))

print(classification_report(y_test,rfc_p))

print(classification_report(y_test,gbc_p))
