《深度学习Python实践》 Chapter 22: A Text Classification Example

Dataset link for this text-classification example: http://qwone.com/~jason/20Newsgroups/

The code is as follows:

1) Algorithm comparison

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot as plt

categories=['alt.atheism',
            'rec.sport.hockey',
            'sci.crypt',
            'comp.sys.ibm.pc.hardware',
            'sci.med',
            'comp.sys.mac.hardware',
            'sci.space',
            'comp.windows.x',
            'soc.religion.christian',
            'misc.forsale',
            'talk.politics.guns',
            'rec.autos',
            'talk.politics.mideast',
            'rec.motorcycles',
            'talk.politics.misc',
            'rec.sport.baseball',
            'talk.religion.misc']

# Load the training data
train_path='/home/duan/下载/20news-bydate/20news-bydate-train'
dataset_train=load_files(container_path=train_path,categories=categories)

# Load the test (evaluation) data
test_path='/home/duan/下载/20news-bydate/20news-bydate-test'
dataset_test=load_files(container_path=test_path,categories=categories)
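
# Quick sanity check (a minimal sketch; the exact counts depend on your local
# copy of the dataset): inspect the Bunch objects returned by load_files.
print(len(dataset_train.data))     # number of training documents
print(len(dataset_test.data))      # number of test documents
print(dataset_train.target_names)  # categories that were actually loaded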

# Data preparation and understanding

# Compute term counts
count_vect=CountVectorizer(stop_words='english',decode_error='ignore')
X_train_counts=count_vect.fit_transform(dataset_train.data)
# Inspect the data dimensions of the term-count matrix
print(X_train_counts.shape)

# Compute TF-IDF features
tf_transformer=TfidfVectorizer(stop_words='english',decode_error='ignore')
X_train_counts_tf=tf_transformer.fit_transform(dataset_train.data)
print(X_train_counts_tf.shape)

# Above, text features were extracted with two different methods and the data dimensions were inspected.
# Next, the TF-IDF features are used to train the classification models.
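
# To see the difference between the two representations, here is a minimal
# sketch on a toy corpus (the sentences are made up purely for illustration):
toy_corpus=['the cat sat','the cat sat on the mat','dogs chase cats']
print(CountVectorizer().fit_transform(toy_corpus).toarray())  # integer term counts
print(TfidfVectorizer().fit_transform(toy_corpus).toarray())  # L2-normalized TF-IDF rows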


# Evaluate the algorithms
# Set up the evaluation baseline
num_folds=10
seed=7
scoring='accuracy'

# Linear algorithm: LR
# Nonlinear algorithms: CART, SVM, MNB, KNN
models={}
models['LR']=LogisticRegression()
models['SVM']=SVC()
models['CART']=DecisionTreeClassifier()
models['MNB']=MultinomialNB()
models['KNN']=KNeighborsClassifier()

# Compare the algorithms
results=[]
for key in models:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)  # shuffle=True is required when passing random_state
    cv_result = cross_val_score(models[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' %(key, cv_result.mean(), cv_result.std()))

Output:

(7838, 77172)
(7838, 77172)
KNN: 0.824575 (0.012700)
LR: 0.920900 (0.008155)
CART: 0.703240 (0.013782)
MNB: 0.896786 (0.009055)
SVM: 0.062772 (0.004306)
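
The near-random SVM score stands out. A plausible explanation (an assumption, not verified here) is that SVC's default RBF kernel is poorly suited to this very high-dimensional, sparse TF-IDF matrix; for text, a linear kernel is the usual choice. A minimal sketch with LinearSVC, which is not part of the original comparison:

# Sketch: LinearSVC as an alternative SVM baseline for sparse text features
from sklearn.svm import LinearSVC
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
cv_result = cross_val_score(LinearSVC(), X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
print('LinearSVC: %f (%f)' % (cv_result.mean(), cv_result.std()))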

Comparing the algorithms with a box plot:

# Box plot of the 10-fold cross-validation results for each algorithm
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
plt.show()

Output:

(Figure: box plot comparing the cross-validation accuracy of the five algorithms.)

The box plot shows that the scores of the Naive Bayes classifier are the most tightly clustered, while the distribution for logistic regression is more skewed. The spread of an algorithm's cross-validation scores reflects how well it suits the data, so logistic regression and Multinomial Naive Bayes are investigated further through hyperparameter tuning.


2) Hyperparameter tuning

The analysis above shows that LR and MNB are worth optimizing further. Below, the hyperparameters of these two algorithms are tuned to improve accuracy.

(1) Tuning logistic regression

The key hyperparameter of logistic regression is C, which controls the strength of the regularization applied to the objective: the smaller C is, the stronger the regularization. To tune C, evaluate a set of candidate values; if the best value falls on the boundary of the grid, shift or extend the grid and repeat until the optimum is found.

# Hyperparameter tuning
# Tune LR
param_grid={}
param_grid['C']=[0.1,5,13,15]
model=LogisticRegression()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))

Output:
Best: 0.9393978055626435 using {'C': 15}
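
Note that the best value C=15 lies on the boundary of the grid [0.1, 5, 13, 15]. Following the procedure described above, the grid could be extended and the search repeated; a minimal sketch (the candidate values below are arbitrary):

# Sketch: extend the grid upward, since the best C fell on its boundary
param_grid={'C':[15,20,30,50]}
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=LogisticRegression(),param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))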

(2) Tuning Multinomial Naive Bayes

MultinomialNB has a single hyperparameter, alpha, a smoothing parameter with a default value of 1.0.
Tuning this parameter can improve the accuracy of the algorithm.
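
Concretely, alpha is the additive (Lidstone/Laplace) smoothing term: MultinomialNB estimates the probability of word w in class c roughly as (N_wc + alpha) / (N_c + alpha * V), where N_wc is the count of w in c, N_c the total word count in c, and V the vocabulary size. A tiny numeric sketch with made-up counts:

# Additive smoothing sketch (made-up counts): an unseen word (N_wc = 0)
# still receives a small nonzero probability that shrinks with alpha.
N_wc, N_c, V = 0, 1000, 77172  # V taken from the training matrix shape above
for alpha in [0.001, 0.01, 0.1, 1.5]:
    print(alpha, (N_wc + alpha) / (N_c + alpha * V))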

# Hyperparameter tuning
# Tune MNB
param_grid={}
param_grid['alpha']=[0.001,0.01,0.1,1.5]
model=MultinomialNB()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))
cv_results=zip(grid_result.cv_results_['mean_test_score'],
               grid_result.cv_results_['std_test_score'],
               grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r'%(mean, std, param)) 

Output:

Best: 0.934804797142128 using {'alpha': 0.01}
0.929829 (0.008380) with {'alpha': 0.001}
0.934805 (0.008096) with {'alpha': 0.01}
0.928043 (0.008024) with {'alpha': 0.1}
0.889640 (0.010375) with {'alpha': 1.5}

The best parameter for MNB is alpha=0.01, with a best score of 0.934804797142128 ({'alpha': 0.01}).
The best parameter for LR is C=15, with a best score of 0.9393978055626435 ({'C': 15}).

Tuning shows that LR achieves the best accuracy with C=15. Next, ensemble algorithms are examined.

3) Ensemble algorithms

Random forest (RF)
AdaBoost (AB)

ensembles={}
ensembles['RF']=RandomForestClassifier()
ensembles['AB']=AdaBoostClassifier()
# Compare the ensemble algorithms
results=[]
for key in ensembles:
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(ensembles[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' %(key, cv_result.mean(), cv_result.std()))

Output:

RF: 0.773795 (0.017244)
AB: 0.620055 (0.017638)

Box plot:

# Box plot of the 10-fold cross-validation results for the ensembles
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()

(Figure: box plot comparing the cross-validation accuracy of RF and AB.)

The box plot shows that the random forest scores are fairly evenly distributed, suggesting that the algorithm suits the data well and is the more promising candidate for further optimization.

4) Tuning the ensemble algorithm

# Ensemble tuning
# Tune RF
param_grid={}
param_grid['n_estimators']=[10,100,150,200]
model=RandomForestClassifier()
kfold=KFold(n_splits=num_folds,shuffle=True,random_state=seed)
grid=GridSearchCV(estimator=model,param_grid=param_grid,scoring=scoring,cv=kfold)
grid_result=grid.fit(X=X_train_counts_tf,y=dataset_train.target)
print('Best: %s using %s'%(grid_result.best_score_,grid_result.best_params_))

cv_results=zip(grid_result.cv_results_['mean_test_score'],
               grid_result.cv_results_['std_test_score'],
               grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r'%(mean, std, param)) 

Output:

Best: 0.888236795100791 using {'n_estimators': 200}
0.779025 (0.007910) with {'n_estimators': 10}
0.882496 (0.012405) with {'n_estimators': 100}
0.887982 (0.010867) with {'n_estimators': 150}
0.888237 (0.009727) with {'n_estimators': 200}
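
As with C for logistic regression, the best n_estimators (200) sits on the boundary of the grid, so the grid could be extended further; the gains are already flattening, however (0.887982 at 150 versus 0.888237 at 200), and even the tuned random forest still trails the tuned LR and MNB models.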

Finalizing the model

The tuned logistic regression (0.9394 with C=15) outperforms both the tuned MNB (0.9348) and the tuned RF (0.8882), so logistic regression with C=15 is trained on the full training set and evaluated on the held-out test set.

# Train the final model: logistic regression with the tuned parameter C=15
model=LogisticRegression(C=15)
model.fit(X=X_train_counts_tf,y=dataset_train.target)
# Transform the test data with the TF-IDF vectorizer fitted on the training set
X_test_counts=tf_transformer.transform(dataset_test.data)
predictions=model.predict(X_test_counts)
print(accuracy_score(dataset_test.target,predictions))
print(classification_report(dataset_test.target,predictions))

Output:

0.8844163312248419
             precision    recall  f1-score   support

          0       0.85      0.79      0.82       319
          1       0.78      0.84      0.81       392
          2       0.86      0.88      0.87       385
          3       0.91      0.89      0.90       395
          4       0.81      0.90      0.86       390
          5       0.91      0.91      0.91       396
          6       0.97      0.95      0.96       398
          7       0.94      0.97      0.96       397
          8       0.97      0.94      0.96       396
          9       0.92      0.89      0.91       396
         10       0.93      0.95      0.94       394
         11       0.86      0.93      0.89       398
         12       0.91      0.77      0.84       310
         13       0.70      0.62      0.65       251

avg / total       0.89      0.88      0.88      5217
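
The report above labels the classes 0 through 13 by index. To label the rows with the actual newsgroup names instead, classification_report accepts a target_names argument (a minimal sketch):

# Sketch: label the report rows with newsgroup names instead of indices
print(classification_report(dataset_test.target, predictions,
                            target_names=dataset_test.target_names))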