Text Classification with XGBoost

This article shows how to use XGBoost for sentiment analysis of game-review data, from data preprocessing to model training, focusing on parameter tuning and on handling imbalanced samples. A worked example demonstrates classifying game reviews with TF-IDF features and XGBoost, and how the key parameters affect model performance.

1. Data Preparation

As in the previous posts, the data keeps the same format:

sentence,label
游戏太坑,暴率太低,太克金,平民不能玩,negative
让人失望,negative
能解决一下服务器问题?网络正常老掉线,换手机也一样。。。,negative
期待,positive
一星也不想给,这特么简直龟速,炫舞老年版?,negative
衣服不好看游戏内容无特色,界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩,很喜欢呀,希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative

2. Data Preprocessing

import time
import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import xgboost as xgb


def get_stop_words():
    # Load one stop word per line from a local file.
    filename = "your stop words file path"
    stop_word_list = []
    with open(filename, encoding='utf-8') as f:
        for line in f.readlines():
            stop_word_list.append(line.strip())
    return stop_word_list

def processing_sentence(x, stop_words):
    # Segment the sentence with jieba, then drop stop words and blanks.
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)


def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Hold out 10% of the data for evaluation.
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))

    # Fit TF-IDF on the training split only, then reuse it on the test split.
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    # Densify the sparse matrices; fine for small corpora, memory-heavy for large ones.
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()

    return x_train_weight, x_test_weight, y_train, y_test

The preprocessing step still produces TF-IDF features over the word-segmented text, just as in the earlier posts.
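One note on the dense conversion: XGBoost can also consume scipy sparse matrices directly, so the toarray() step is optional, and keeping the CSR matrices is far more memory-friendly when the TF-IDF vocabulary is large. A minimal self-contained sketch (the toy documents and labels below are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
import xgboost as xgb

# Toy corpus, already segmented (space-separated tokens), with 1/0 labels.
docs = ["好玩 喜欢 漂亮", "掉线 失望", "喜欢 漂亮 衣服", "掉线 太坑 失望"]
labels = [1, 0, 1, 0]

tf = TfidfVectorizer()
x_sparse = tf.fit_transform(docs)            # scipy.sparse CSR matrix

model = xgb.XGBClassifier(n_estimators=10, max_depth=3)
model.fit(x_sparse, labels)                  # no .toarray() needed
print(model.predict(x_sparse))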

3. Model Training

def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    # `silent` is deprecated in recent XGBoost releases; use `verbosity` instead.
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=100,
                              verbosity=1, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)

    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('accuracy:', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('classification report:', metrics.classification_report(y_test, y_predict))
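One caveat: recent XGBoost releases (roughly 1.6 and later) no longer accept string labels such as positive/negative in XGBClassifier.fit, so depending on your version you may need to encode the labels first. A minimal patch, assuming the variables from the listing above:

from sklearn.preprocessing import LabelEncoder

# Encode 'negative'/'positive' as 0/1 before fitting, and decode predictions
# back to strings so the reports stay readable.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

model.fit(x_train_weight, y_train_enc)
y_predict = le.inverse_transform(model.predict(x_test_weight))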

Running the training code produces:

start time is:  1649228843.700035
end time is:  1649229253.274875
cost time is:  409.57483983039856
accuracy: 0.7524366471734892
confusion_matrix is:  [[137  80]
 [ 47 249]]
classification report:               precision    recall  f1-score   support

    negative       0.74      0.63      0.68       217
    positive       0.76      0.84      0.80       296

    accuracy                           0.75       513
   macro avg       0.75      0.74      0.74       513
weighted avg       0.75      0.75      0.75       513

4. XGBoost Parameters

XGBoost exposes quite a few parameters, and tuning them is an important part of using it in practice. Let's walk through them.

XGBoost's own parameters

    booster: string
        Specify which booster to use: gbtree, gblinear or dart.
    n_jobs : int
        Number of parallel threads used to run xgboost.  (replaces ``nthread``)
    verbosity : int
        The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
    scale_pos_weight : float
        Balancing of positive and negative weights.

booster selects the type of base learner; the default is gbtree.
scale_pos_weight handles class imbalance; the default is 1. For highly imbalanced data, say a positive-to-negative ratio of 1:100, the XGBoost docs suggest a value near sum(negative instances) / sum(positive instances) (about 100 in that case) to speed up convergence; a helper for computing it is sketched below.
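A minimal sketch of that heuristic, computed from the training labels (assumes the labels are already encoded as 0 = negative, 1 = positive):

import numpy as np
import xgboost as xgb

def suggested_scale_pos_weight(y):
    # XGBoost docs heuristic: sum(negative instances) / sum(positive instances).
    y = np.asarray(y)
    n_neg = int((y == 0).sum())
    n_pos = int((y == 1).sum())
    return n_neg / max(n_pos, 1)

# e.g. model = xgb.XGBClassifier(scale_pos_weight=suggested_scale_pos_weight(y_train))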

Tree parameters

    n_estimators : int
        Number of trees to fit.
    max_depth : int
        Maximum tree depth for base learners.
    min_child_weight : int
        Minimum sum of instance weight(hessian) needed in a child.
    gamma : float
        Minimum loss reduction required to make a further partition on a leaf node of the tree.
    max_delta_step : int
        Maximum delta step we allow each tree's weight estimation to be.
    subsample : float
        Subsample ratio of the training instance.
    colsample_bytree : float
        Subsample ratio of columns when constructing each tree.

n_estimators: number of trees
max_depth: maximum depth of each tree
min_child_weight: minimum sum of instance weights (hessian) in a leaf
gamma: minimum loss reduction required to make a further split on a leaf
max_delta_step: maximum delta step allowed for each tree's weight estimation
subsample: fraction of training samples drawn for each tree
colsample_bytree: fraction of features used when building each tree
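To make the mapping concrete, here is an illustrative (not tuned) configuration that sets each of the tree parameters above; the values are placeholders, not recommendations:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200,       # number of trees
    max_depth=6,            # maximum depth of each tree
    min_child_weight=1,     # minimum sum of instance weights in a leaf
    gamma=0.1,              # minimum loss reduction required to split
    max_delta_step=0,       # 0 disables the constraint (the default)
    subsample=0.8,          # sample 80% of rows for each tree
    colsample_bytree=0.8,   # sample 80% of columns for each tree
)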

General algorithm parameters

    learning_rate : float
        Boosting learning rate (xgb's "eta")
    objective : string or callable
        Specify the learning task and the corresponding learning objective or
        a custom objective function to be used (see note below).
    reg_alpha : float (xgb's alpha)
        L1 regularization term on weights
    reg_lambda : float (xgb's lambda)
        L2 regularization term on weights

Common values for objective:

Regression
reg:linear (the default; renamed reg:squarederror in newer XGBoost releases)
reg:logistic

Binary classification
binary:logistic (returns probabilities)
binary:logitraw (returns the raw score before the logistic transformation)

Multi-class classification (see the sketch after this list)
multi:softmax, with num_class=n (returns the predicted class)
multi:softprob, with num_class=n (returns per-class probabilities)

Ranking
rank:pairwise
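For the multi-class objectives, num_class must be set explicitly when using the native API (the sklearn wrapper infers it from the labels). A minimal sketch with made-up data:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = rng.integers(0, 3, size=60)              # three classes: 0, 1, 2

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "multi:softprob", "num_class": 3, "max_depth": 3}
booster = xgb.train(params, dtrain, num_boost_round=10)

proba = booster.predict(dtrain)              # shape (60, 3), rows sum to 1
print(proba.shape)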

5. Parameter Summary

Parameters that control tree complexity

n_estimators
max_depth
min_child_weight
gamma

Parameters that add randomness or slow down learning

subsample
colsample_bytree
learning_rate
num_round (the number of boosting rounds; equivalent to n_estimators in the sklearn wrapper)

Handling imbalanced samples

scale_pos_weight

6. Plotting Feature Importance

With a small modification to the train method, we add code to plot feature importance:

import matplotlib.pyplot as plt
from xgboost import plot_importance

def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    # `silent` is deprecated in recent XGBoost releases; use `verbosity` instead.
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=50, n_jobs=2,
                              verbosity=1, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)

    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    print('accuracy:', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    print('classification report:', metrics.classification_report(y_test, y_predict))

    # Plot the ten most important features by F score (split count).
    fig, ax = plt.subplots(figsize=(15, 15))
    plot_importance(model,
                    height=0.5,
                    ax=ax,
                    max_num_features=10)

    plt.show()

Running it produces a plot like the one below.

[Figure: XGBoost feature importance plot of the top 10 features by F score]

The plot visualizes the ten features with the highest F scores.
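One practical note: plot_importance labels features generically as f0, f1, and so on. If you also return the fitted TfidfVectorizer from data_processing(), you can map those indices back to actual words. A sketch under that assumption (tf is the fitted vectorizer, model the trained classifier):

def top_terms(model, tf, k=10):
    # Map the booster's per-feature F scores (split counts) back to vocabulary terms.
    scores = model.get_booster().get_score(importance_type='weight')  # e.g. {'f123': 7.0}
    names = tf.get_feature_names_out()  # use get_feature_names() on older sklearn
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(names[int(feat[1:])], score) for feat, score in ranked]

print(top_terms(model, tf))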
