Scikit-Learn: Working With Text Data (Text Classification)

Latest recommended article published on 2024-07-30 11:11:01

LRita

Categories: Machine Learning, Python, Scikit-learn. Tags: python, machine learning, Sklearn

Article link: https://blog.csdn.net/LRita/article/details/48179783

With Scikit-learn installed, we can use it for text classification.

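Background:

There are many text classification algorithms today; common ones include Naïve Bayes, SVM, KNN, and logistic regression. According to the literature, SVM is widely used in both industry and academia.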

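Resources and programs

1. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html explains how the Naive Bayes method is applied to text classification.

2. http://blog.163.com/jiayouweijiewj@126/blog/static/17123217720113115027394/ analyzes in detail how Naïve Bayes is implemented in Mahout.

3. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ Libsvm is an open-source tool for SVM training and prediction. It works out of the box after download, and the author's documentation is very detailed.

4. http://www.blogjava.net/zhenandaci/category/31868.html A systematic introduction to SVM, explained in an accessible way.

5. http://blog.pluskid.org/?page_id=683 An introduction to support vector machines.

6. https://code.google.com/p/tmsvm/ Tmsvm is a program I wrote earlier for text classification with SVM; it covers the entire text classification workflow.

7. http://www.blogjava.net/zhenandaci/category/31868.html?Show=All An introductory series on text classification, fairly detailed.

8. 《文本挖掘中若干关键问题研究》 (Research on Several Key Problems in Text Mining). The book is thin but goes deep, discussing several key problems in text mining.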

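Getting to the point

This post has 4 main parts:

Downloading the data
Extracting features
Training models with Pipeline
Finding the best parameters with GridSearchCV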

1. Dataset for text classification with Sklearn: 20news-19997.tar

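from sklearn.datasets import fetch_20newsgroups

# comp.sys.mac.hardware matches the category shown in the result reports below
categories = ['alt.atheism',
              'soc.religion.christian',
              'comp.graphics',
              'comp.sys.mac.hardware',
              'sci.med']
# load the training and test subsets, restricted to the five categories
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)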

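2. Extracting features
1) A corpus can be represented as a term-document matrix, with one row per document and one column per token (i.e. word). The general process of turning text documents into numerical features is called vectorization. This particular strategy (tokenization, counting, and normalization) is called the bag-of-words or "bag of n-grams" representation. Documents are described by word frequencies, while the relative position of words within a document is ignored entirely.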

CountVectorizer implements both tokenization and counting in a single class:

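from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)  # create the vectorizer

corpus = ['This is the first document.', 'This is the second second document.', 'And the third one.', 'Is this the first document?']
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and build the count matrix
# list the learned features; only valid after fitting
# (older scikit-learn releases call this get_feature_names())
vectorizer.get_feature_names_out()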
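
To make the contents of the count matrix concrete, here is a minimal pure-Python sketch of the same bag-of-words counting on the toy corpus. It is not scikit-learn's implementation; the `tokenize` helper and the variable names are illustrative, and the regular expression is only meant to mirror CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters).

```python
import re

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
column = {tok: j for j, tok in enumerate(vocabulary)}

# Term-document matrix: one row per document, one column per token.
counts = [[0] * len(vocabulary) for _ in corpus]
for i, doc in enumerate(corpus):
    for tok in tokenize(doc):
        counts[i][column[tok]] += 1

print(vocabulary)
for row in counts:
    print(row)
```

The first and last documents contain the same words in a different order, so their rows are identical: word order is lost in this representation.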

2) TF-IDF: computing term weights

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)  # reweight the count matrix X with tf-idf
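
As a rough illustration of the weighting, the sketch below computes tf-idf by hand on a small made-up count matrix. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, which corresponds to TfidfTransformer's default settings; the function name `tfidf_by_hand` and the data are hypothetical.

```python
import math

# Hypothetical count matrix: rows are documents, columns are terms.
toy_counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]

def tfidf_by_hand(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: how many documents contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weights = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2-normalise the row so long documents do not dominate.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        weights.append([v / norm for v in weighted])
    return idf, weights

idf, weights = tfidf_by_hand(toy_counts)
print(idf)
print(weights[0])
```

The first term occurs in every document, so its idf is the minimum value 1; rarer terms receive larger weights.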

*For large text collections, a hashing vectorizer can be used to cap the number of features

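from sklearn.feature_extraction.text import HashingVectorizer
hv = HashingVectorizer(n_features=10)  # fixed output width of 10 columns
hv.transform(corpus)  # stateless, so no fit step is needed

Limitations of HashingVectorizer:

The model cannot be inverted (there is no inverse_transform method), and the original string representation of the features cannot be accessed, because the hash function that performs the mapping is one-way.
It provides no IDF weighting, since that would require introducing state into the model. If needed, a TfidfTransformer can be appended in a pipeline.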

See the official documentation for details on HashingVectorizer.
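
The idea behind hashing can be sketched in a few lines of pure Python. This is only an illustration of the trick, not scikit-learn's code: `hashing_vectorize` is a hypothetical helper, and `zlib.crc32` is a stand-in hash (scikit-learn uses MurmurHash3 and, by default, signed and normalised values).

```python
import re
import zlib

def hashing_vectorize(text, n_features=10):
    # Stateless: a token's column depends only on its hash,
    # so no vocabulary is kept in memory.
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # crc32 is a stand-in; it is NOT the hash scikit-learn uses.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
for doc in corpus:
    print(hashing_vectorize(doc))
```

Because different tokens may land in the same column, there is no way to map a column back to a single word, which is why the model cannot be inverted.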

3. A simple test: model training + prediction

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
# build the count vectors
count_vect = CountVectorizer()
X_train = count_vect.fit_transform(twenty_train.data)
# compute tf-idf weights
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train)
# train the model
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
# new documents go through transform only, reusing the fitted vocabulary and idf
X_new = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new)
# predict
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

4. Chaining processing steps with Pipeline

from sklearn.pipeline import Pipeline

# the pipeline chains 3 processing steps
def test():
    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    text_clf = Pipeline([('vect', CountVectorizer()), 
                ('tfidf', TfidfTransformer()), 
                ('clf', MultinomialNB()), 
                ])
    #train            
    text_clf.fit(twenty_train.data, twenty_train.target)
    #predict
    new_predicted = text_clf.predict(docs_new)
    
    for doc, category in zip(docs_new,new_predicted):
        # print document => category
        print ('%r => %s' %(doc, twenty_train.target_names[category]))
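
What a pipeline does can be sketched with a toy class: during fit, every step except the last transforms the data for the next step; during predict, the same transforms are replayed before the final estimator predicts. The classes below (`MiniPipeline`, `WordCounter`, `OverlapClassifier`) and the training sentences are hypothetical stand-ins written for this illustration, not scikit-learn's implementation.

```python
class MiniPipeline:
    """Toy stand-in for a pipeline; not scikit-learn's implementation."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Each intermediate step is fitted, then transforms the data.
        for _, step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # Replay the fitted transforms, then let the last step predict.
        for _, step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1][1].predict(X)


class WordCounter:
    def fit_transform(self, docs, y=None):
        self.vocab = sorted({w for d in docs for w in d.lower().split()})
        return self.transform(docs)

    def transform(self, docs):
        return [[d.lower().split().count(w) for w in self.vocab] for d in docs]


class OverlapClassifier:
    def fit(self, X, y):
        # Sum the count vectors of each class into one profile per class.
        self.profiles = {}
        for row, label in zip(X, y):
            profile = self.profiles.setdefault(label, [0] * len(row))
            for j, value in enumerate(row):
                profile[j] += value
        return self

    def predict(self, X):
        def score(row, profile):
            return sum(a * b for a, b in zip(row, profile))
        return [max(self.profiles, key=lambda c: score(row, self.profiles[c]))
                for row in X]


train_docs = ['God is love', 'faith and love',
              'OpenGL renders graphics', 'the GPU is fast']
train_labels = ['religion', 'religion', 'graphics', 'graphics']

pipe = MiniPipeline([('vect', WordCounter()), ('clf', OverlapClassifier())])
pipe.fit(train_docs, train_labels)
predicted = pipe.predict(['love and faith', 'fast graphics on the GPU'])
print(predicted)
```

The benefit is the same as with the real Pipeline: callers pass raw documents to a single object and cannot forget to apply a preprocessing step at prediction time.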

5. Model training + prediction

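import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier

def testPipline():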
   
    #1. MultinomialNB
    print('*************************\nNB\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()), 
                ('tfidf', TfidfTransformer()), 
                ('clf', MultinomialNB()), 
                ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    
    docs_test = twenty_test.data 
    nb_predicted = text_clf.predict(docs_test)
    
    accuracy=np.mean(nb_predicted == twenty_test.target)
    #print accuracy 
    print ("The accuracy of twenty_test is %s" %accuracy)
    
    print(metrics.classification_report(twenty_test.target, nb_predicted,target_names=twenty_test.target_names))
    
    #2. KNN
    print('*************************\nKNN\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()), 
                ('tfidf', TfidfTransformer()), 
                ('clf', KNeighborsClassifier()), 
                ])
    text_clf.fit(twenty_train.data, twenty_train.target)
    
    docs_test = twenty_test.data 
    knn_predicted = text_clf.predict(docs_test)
    
    accuracy=np.mean(knn_predicted == twenty_test.target)
    #print accuracy 
    print ("The accuracy of twenty_test is %s" %accuracy)
    
    print(metrics.classification_report(twenty_test.target, knn_predicted,target_names=twenty_test.target_names))
    
    #3. SVM
    print('*************************\nSVM\n*************************')
    # max_iter=5 with tol=None runs exactly 5 epochs (older releases: n_iter=5)
    text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)),])
    
    text_clf.fit(twenty_train.data, twenty_train.target)
    
    svm_predicted = text_clf.predict(docs_test)
    
    accuracy=np.mean(svm_predicted == twenty_test.target)
    #print accuracy 
    print ("The accuracy of twenty_test is %s" %accuracy)
   
    print(metrics.classification_report(twenty_test.target, svm_predicted,target_names=twenty_test.target_names))


    #4. Fewer features
    print('*************************\nHashingVectorizer\n*************************')
    # alternate_sign=False keeps hashed counts non-negative (older releases: non_negative=True)
    text_clf = Pipeline([('vect', HashingVectorizer(stop_words = 'english', alternate_sign = False,
                               n_features = 10000)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)),])
    
    text_clf.fit(twenty_train.data, twenty_train.target)
    
    svm_predicted = text_clf.predict(docs_test)
    
    accuracy=np.mean(svm_predicted == twenty_test.target)
    #print accuracy 
    print ("The accuracy of twenty_test is %s" %accuracy)
   
    print(metrics.classification_report(twenty_test.target, svm_predicted,target_names=twenty_test.target_names))

*Results analysis

*************************
NB
*************************
The accuracy of twenty_test is 0.838897721251
                        precision    recall  f1-score   support

           alt.atheism       0.97      0.58      0.73       319
         comp.graphics       0.95      0.85      0.89       389
 comp.sys.mac.hardware       0.93      0.92      0.93       385
               sci.med       0.96      0.81      0.87       396
soc.religion.christian       0.62      0.99      0.76       398

           avg / total       0.88      0.84      0.84      1887

*************************
KNN
*************************
The accuracy of twenty_test is 0.746157922629
                        precision    recall  f1-score   support

           alt.atheism       0.56      0.86      0.68       319
         comp.graphics       0.84      0.73      0.78       389
 comp.sys.mac.hardware       0.82      0.75      0.78       385
               sci.med       0.87      0.58      0.69       396
soc.religion.christian       0.75      0.84      0.79       398

           avg / total       0.78      0.75      0.75      1887

*************************
SVM
*************************
The accuracy of twenty_test is 0.912559618442
                        precision    recall  f1-score   support

           alt.atheism       0.94      0.81      0.87       319
         comp.graphics       0.89      0.92      0.91       389
 comp.sys.mac.hardware       0.92      0.96      0.94       385
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.88      0.96      0.92       398

           avg / total       0.91      0.91      0.91      1887

*************************
HashingVectorizer
*************************
The accuracy of twenty_test is 0.897191308956
                        precision    recall  f1-score   support

           alt.atheism       0.91      0.77      0.84       319
         comp.graphics       0.89      0.91      0.90       389
 comp.sys.mac.hardware       0.92      0.95      0.93       385
               sci.med       0.91      0.89      0.90       396
soc.religion.christian       0.87      0.94      0.90       398

           avg / total       0.90      0.90      0.90      1887

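Analysis: comparing CountVectorizer with HashingVectorizer, using the full feature set gives slightly better results (accuracy 0.913 vs 0.897), at the cost of more memory.
Comparing the NB, SVM, and KNN results, SVM performs best (0.913 vs 0.839 for NB and 0.746 for KNN), so the following steps continue with this algorithm.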

6. GridSearch: searching for the best parameters; see the code comments
See the official documentation for the full definition of GridSearch.

# GridSearchCV: search for the best parameters
from sklearn.model_selection import GridSearchCV

def testGridSearch():
    print('*************************\nPipeline+GridSearch+CV\n*************************')
    text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier()),])
    
    parameters = {  
      'vect__ngram_range': [(1, 1), (1, 2)],
      'vect__max_df': (0.5, 0.75),  
      'vect__max_features': (None, 5000, 10000),  
      'tfidf__use_idf': (True, False),  
    #  'tfidf__norm': ('l1', 'l2'),  
       'clf__alpha': (0.00001, 0.000001),  
    #  'clf__penalty': ('l2', 'elasticnet'),  
       'clf__max_iter': (10, 50),  # named clf__n_iter in older scikit-learn releases
    }  
    # the grid search itself; set flag to a non-zero value to run it
    flag = 0
    if flag != 0:
        grid_search = GridSearchCV(text_clf, parameters, n_jobs=1, verbose=1)
        grid_search.fit(twenty_train.data, twenty_train.target)
        print("Best score: %0.3f" % grid_search.best_score_)
        best_parameters = grid_search.best_estimator_.get_params()
        print("Out the best parameters")
        for param_name in sorted(parameters.keys()):
            print("\t%s: %r" % (param_name, best_parameters[param_name]))
    
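    # once the best parameters are known, train the model with them
    # (clf__max_iter is named clf__n_iter in older scikit-learn releases)
    text_clf.set_params(clf__alpha = 1e-05,
                    clf__max_iter = 50,
                    tfidf__use_idf = True,
                    vect__max_df = 0.5,
                    vect__max_features = None)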
    text_clf.fit(twenty_train.data, twenty_train.target)
    # predict
    pred = text_clf.predict(twenty_test.data)
    # print the results
    accuracy = np.mean(pred == twenty_test.target)
    print("The accuracy of twenty_test is %s" % accuracy)

    print(metrics.classification_report(twenty_test.target, pred, target_names=twenty_test.target_names))
    array = metrics.confusion_matrix(twenty_test.target, pred)
    print(array)
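
To get a feel for how much work such a grid implies, the sketch below enumerates every parameter combination with itertools.product. The grid mirrors the one defined in testGridSearch; the key names only act as labels here, and `candidates` is an illustrative variable name.

```python
from itertools import product

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'clf__alpha': (0.00001, 0.000001),
    'clf__max_iter': (10, 50),  # named clf__n_iter in older releases
}

# One candidate per element of the Cartesian product of all value lists.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

print(len(candidates))
print(candidates[0])
```

A grid search fits every candidate once per cross-validation fold, so the total number of fits is this count multiplied by the number of folds. That cost is the reason the search above is guarded by a flag and the best parameters are hard-coded afterwards.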

*Results analysis

*************************
Pipeline+GridSearch+CV
*************************
The accuracy of twenty_test is 0.918388977213
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.84      0.89       319
         comp.graphics       0.90      0.92      0.91       389
 comp.sys.mac.hardware       0.92      0.95      0.93       385
               sci.med       0.95      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.92       398

           avg / total       0.92      0.92      0.92      1887

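1) Each algorithm prints a classification report

In the classification report:

Precision = number of records correctly identified as the class / number of records identified as the class
Recall = number of records correctly identified as the class / total number of records of the class in the test set
F1-score = 2 * (precision * recall) / (precision + recall); F1-score is the special case of the F-measure (also called F-score) with beta = 1
Support = total number of records of the class in the test set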

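2) Confusion matrix

array = metrics.confusion_matrix(twenty_test.target, pred)
print(array)

This is the confusion matrix of the SVM classification results. With n classes, the result is an n*n matrix; the sum of each row is the total number of test-set records in that class, which equals the support value in the report.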

[[268   7   1   7  36]
 [  5 359  17   3   5]
 [  0  12 366   6   1]
 [  4  16  13 359   4]
 [  6   6   1   4 381]]

# Classes corresponding to the rows and columns (sorted label order, as in target_names)
categories = ['alt.atheism',
             'comp.graphics',
             'comp.sys.mac.hardware',
             'sci.med',
             'soc.religion.christian']

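The diagonal elements are the numbers of correctly classified documents. For example, alt.atheism has 319 documents in the test set, of which 268 were classified correctly; the rest were scattered across the other classes.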
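
The report columns can be recomputed from the confusion matrix alone, which is a useful sanity check. The sketch below uses the matrix printed above and assumes the rows and columns follow the sorted label order (the row sums match the support column of the report in that order); the variable names are illustrative.

```python
labels = ['alt.atheism', 'comp.graphics', 'comp.sys.mac.hardware',
          'sci.med', 'soc.religion.christian']
matrix = [
    [268,   7,   1,   7,  36],
    [  5, 359,  17,   3,   5],
    [  0,  12, 366,   6,   1],
    [  4,  16,  13, 359,   4],
    [  6,   6,   1,   4, 381],
]

n = len(labels)
total = sum(sum(row) for row in matrix)
# Accuracy: correctly classified documents (the diagonal) over all documents.
accuracy = sum(matrix[i][i] for i in range(n)) / total

report = {}
for i, label in enumerate(labels):
    support = sum(matrix[i])                                  # row sum: true members
    predicted_as = sum(matrix[r][i] for r in range(n))        # column sum: predicted members
    precision = matrix[i][i] / predicted_as
    recall = matrix[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, support)
    print('%-24s %.2f %.2f %.2f %d' % (label, precision, recall, f1, support))

print('accuracy: %.6f' % accuracy)
```

For alt.atheism this gives precision 268/283, recall 268/319, and support 319, in line with the Pipeline+GridSearch+CV report.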

Download all the code

Reference blog: http://blog.csdn.net/zhzhl202/article/details/8197109 (Text Classification and SVM)

That is all for now; more will be added later.