NLP Hands-On Project: Finding Food-Safety-Related Reviews of O2O Shops with Machine Learning Algorithms (First-Stage Part)

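Article link: https://blog.csdn.net/qq_62674172/article/details/134809980

This post shares the author's first hands-on machine learning project: data preprocessing (text cleaning, word segmentation, stop-word removal), feature extraction (CountVectorizer), and text classification with logistic regression, decision tree, and XGBoost models under K-fold cross-validation.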

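This is my first blog post, so there are bound to be shortcomings and mistakes; please bear with me.

It is also my first hands-on machine learning project, and this post covers only the first step of the experiment. Some variable names may be unclear or non-standard, and some words are misspelled. This project was mostly about learning and imitating.

For students whose student ID card has a green zongzi on it: this post goes down best alongside the NLP comprehensive practice at the CSA 云行 studio (CSA云行工作室).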

I. Import the required libraries

import re  # regular-expression operations
import pandas as pd  # data handling, reading CSV files
import numpy as np  # numerical computation
import jieba  # Chinese word segmentation
import xgboost as xgb  # XGBoost model
import warnings  # warning control, mostly to keep the output tidy
warnings.filterwarnings('ignore')  # suppress warnings
from sklearn.feature_extraction.text import CountVectorizer  # text feature extraction
from sklearn.ensemble import VotingClassifier  # voting classifier
from sklearn.model_selection import StratifiedKFold, train_test_split  # cross-validation and dataset splitting
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.metrics import accuracy_score  # accuracy metric

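II. Data preprocessing

1. Text cleaning: use regular expressions to keep only Chinese characters, English letters, digits, and a few punctuation marks, and remove other special characters, HTML tags, and specific words.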

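def text_predeal(temp):
    temp = re.sub(r'<.*?>', ' ', temp)  # strip HTML tags first, while '<' and '>' are still present
    temp = re.sub('[^\u4e00-\u9fa5A-Za-z0-9，。？：！；“”]', ' ', temp)  # keep Chinese, English, digits and some punctuation
    temp = temp.replace('网站', '')  # remove a specific word ("网站", i.e. "website")
    temp = temp.strip()  # strip leading and trailing whitespace
    return temp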
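
As a quick check of what the cleaning step does, here is a minimal standalone sketch that uses only the standard `re` module. The function name `clean_text` and the sample string are made up for illustration. It removes HTML tags before filtering characters, because the character filter replaces `<` and `>` with spaces and would leave the tag names behind.

```python
import re

def clean_text(text):
    # Strip HTML tags first, while '<' and '>' are still present
    text = re.sub(r'<.*?>', ' ', text)
    # Keep Chinese characters, letters, digits and a few full-width punctuation marks
    text = re.sub('[^\u4e00-\u9fa5A-Za-z0-9，。？：！；“”]', ' ', text)
    return text.strip()

sample = '<p>味道不错！@下次还来</p>'  # hypothetical review text
print(clean_text(sample))  # 味道不错！ 下次还来
```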

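2. jieba word segmentation (Chinese word segmentation): segmentation splits a sentence into meaningful words, which makes the subsequent feature extraction and model learning easier.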

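def jiebafenci(sentences, stop_words):
    dots = jieba.cut(sentences)  # segment the Chinese text with jieba
    # drop whitespace-only tokens and any token found in the stop-word list
    dots = [word for word in dots if word.strip() and word not in stop_words]
    return " ".join(dots)  # join the tokens into one space-separated string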

3. Stop-word removal: load a stop-word list, used during segmentation to drop common words that carry little meaning.

stopfile_path = './tingyongcibiao.txt'
stop_words = set()  # a set makes the "is this a stop word?" check fast
with open(stopfile_path, 'r', encoding='utf-8') as f:
    for temp_text in f:
        temp_text = temp_text.strip('\n')
        if temp_text:  # skip blank lines
            stop_words.add(temp_text)
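
To make the filtering step concrete, the sketch below loads stop words from a few in-memory lines and removes them from a token list. The helper names `load_stop_words` and `remove_stop_words` are hypothetical, and the tokens are written by hand rather than produced by jieba, so the example runs without any third-party library.

```python
def load_stop_words(lines):
    # Strip surrounding whitespace and skip blank lines; a set gives fast lookups
    return {line.strip() for line in lines if line.strip()}

def remove_stop_words(tokens, stop_words):
    # Keep tokens that are non-empty and not in the stop-word set
    return [tok for tok in tokens if tok.strip() and tok not in stop_words]

# Stand-in for the lines of a stop-word file
stop_words = load_stop_words(['的\n', '了\n', '\n', '是\n'])
# Hand-written tokens, standing in for a segmented review
tokens = ['这家', '店', '的', '菜', '是', '馊', '了', '的']
print(' '.join(remove_stop_words(tokens, stop_words)))  # 这家 店 菜 馊
```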

4. Feature extraction: vectorize the text with CountVectorizer, turning the text data into numerical features that the models can work with.

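# min_df=10 drops words that appear in fewer than 10 documents; ngram_range=(1, 1) uses single words only;
# this token_pattern also keeps one-character words, which the default pattern would discard
vector = CountVectorizer(min_df=10, ngram_range=(1, 1), token_pattern=r'\b\w+\b')
vector.fit(train_temp + test_temp)  # build the vocabulary from the training and test texts together

lable_train = np.array(train_data['label'].tolist())  # training labels
matrix_train = vector.transform(train_temp).toarray()  # word-count matrix of the training set
matrix_test = vector.transform(test_temp).toarray()  # word-count matrix of the test set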
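
The following pure-Python sketch illustrates what count vectorization with a minimum document frequency computes. It is an illustration of the idea, not scikit-learn's implementation; the function names and the three tiny documents are made up, and `min_df=2` is used instead of 10 so that the toy corpus still has a vocabulary.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1):
    # Document frequency: in how many documents each token occurs at least once
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    # Keep tokens that reach min_df; sort for a stable column order
    return sorted(tok for tok, n in df.items() if n >= min_df)

def count_vectorize(docs, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.split():
            if tok in index:  # tokens outside the vocabulary are ignored
                row[index[tok]] += 1
        rows.append(row)
    return rows

# Already-segmented, space-separated documents (hypothetical)
docs = ['菜 里 有 头发', '菜 很 好吃', '头发 头发 恶心']
vocab = build_vocabulary(docs, min_df=2)
print(vocab)                         # ['头发', '菜']
print(count_vectorize(docs, vocab))  # [[1, 1], [0, 1], [2, 0]]
```

Each row is one document and each column one vocabulary word, which is the same layout as `matrix_train` and `matrix_test`.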

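III. Model training

For the text classification task in this post, logistic regression, decision tree, and XGBoost models are combined by voting and evaluated under cross-validation.

K-fold cross-validation splits the dataset into k parts (folds). Each fold takes one turn as the validation set while the remaining k-1 folds form the training set; the model is trained on the training set and evaluated on the held-out fold. See the following post for a detailed explanation.

[Introduction to Artificial Intelligence] K-Fold Cross-Validation, from the blog of 小白的努力探索 on CSDN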
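
To show what a stratified split looks like, here is a minimal pure-Python sketch that deals the samples of each class round-robin across the folds. It is a simplified stand-in for illustration and does not reproduce the exact assignment made by scikit-learn's `StratifiedKFold`; the function name and the toy labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs with roughly equal class ratios per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx

labels = [0] * 8 + [1] * 4  # toy dataset: 8 negatives, 4 positives
for train_idx, val_idx in stratified_k_fold(labels, n_splits=4):
    # Every validation fold holds two negatives and one positive
    print(val_idx, [labels[i] for i in val_idx])
```

Every sample lands in exactly one validation fold, and each fold keeps the 2:1 class ratio of the whole toy dataset.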

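def K_flod_cross_validation():
    # K-fold cross-validation, 8 folds here
    # stratified: each fold keeps roughly the same class ratio as the full training set
    skf = StratifiedKFold(n_splits=8, shuffle=True, random_state=2023).split(matrix_train, lable_train)
    # random_state makes the experiment reproducible
    # array that accumulates the test-set predictions
    y_test_preds = np.zeros((len(matrix_test), 1))

    # instantiate the individual classifiers
    logistic_cf = LogisticRegression(C=1.2)  # logistic regression; C, the inverse of the regularization strength, is set to 1.2
    detree_cf = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=1,
                                       ccp_alpha=0.0)  # decision tree; ccp_alpha is the cost-complexity pruning parameter: values above 0 prune branches to limit overfitting, 0.0 means no pruning
    xgboost_cf = xgb.XGBClassifier()  # XGBoost model
    vote_cf = VotingClassifier(estimators=[('lr', logistic_cf), ('drc', detree_cf), ('xgb', xgboost_cf)],
                               voting='hard')  # ensemble model: majority vote over the predicted labels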
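
Hard voting simply takes, for each sample, the label that most of the models predicted. The sketch below shows the idea in plain Python; the function name `hard_vote` and the three prediction lists are made up for illustration, and this is not how `VotingClassifier` is implemented internally.

```python
from collections import Counter

def hard_vote(predictions):
    """predictions: one list of predicted labels per model, all of the same length."""
    voted = []
    for sample_preds in zip(*predictions):
        # most_common(1) returns the label with the highest count
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted

# Hypothetical predictions of three models on four samples
lr_pred = [1, 0, 1, 0]
dt_pred = [1, 1, 0, 0]
xgb_pred = [0, 1, 1, 0]
print(hard_vote([lr_pred, dt_pred, xgb_pred]))  # [1, 1, 1, 0]
```

With three voters and two classes a tie cannot occur, so the majority label is always well defined.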

IV. K-fold cross-validation accuracy
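
Accuracy here means the fraction of validation samples whose predicted label matches the true label, which is what `accuracy_score` reports for each fold. A minimal pure-Python sketch follows; the labels and predictions are made up to show the calculation and are not the author's results.

```python
def accuracy(y_true, y_pred):
    # Count positions where the prediction equals the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical validation labels and predictions for one fold
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 0.8
```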

V. Complete code

import re  # regular-expression operations
import pandas as pd  # data handling, reading CSV files
import numpy as np  # numerical computation
import jieba  # Chinese word segmentation
import xgboost as xgb  # XGBoost model
import warnings  # warning control, mostly to keep the output tidy
warnings.filterwarnings('ignore')  # suppress warnings
from sklearn.feature_extraction.text import CountVectorizer  # text feature extraction
from sklearn.ensemble import VotingClassifier  # voting classifier
from sklearn.model_selection import StratifiedKFold, train_test_split  # cross-validation and dataset splitting
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.metrics import accuracy_score  # accuracy metric

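def text_predeal(temp):
    temp = re.sub(r'<.*?>', ' ', temp)  # strip HTML tags first, while '<' and '>' are still present
    temp = re.sub('[^\u4e00-\u9fa5A-Za-z0-9，。？：！；“”]', ' ', temp)  # keep Chinese, English, digits and some punctuation
    temp = temp.replace('网站', '')  # remove a specific word ("网站", i.e. "website")
    temp = temp.strip()  # strip leading and trailing whitespace
    return temp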

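def jiebafenci(sentences, stop_words):
    dots = jieba.cut(sentences)  # segment the Chinese text with jieba
    # drop whitespace-only tokens and any token found in the stop-word list
    dots = [word for word in dots if word.strip() and word not in stop_words]
    return " ".join(dots)  # join the tokens into one space-separated string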


def init_data():
    # read the training file (tab-separated)
    train_data = pd.read_csv('./train.csv', sep='\t')
    # read the test file (comma-separated)
    test_data = pd.read_csv('./test_new.csv', sep=',')
    # clean the text, removing non-text characters
    train_data['comment'] = train_data['comment'].apply(text_predeal)
    test_data['comment'] = test_data['comment'].apply(text_predeal)
    # load the stop-word list (the Baidu stop-word list) for segmentation and stop-word removal
    stopfile_path = './tingyongcibiao.txt'
    stop_words = set()  # a set makes the "is this a stop word?" check fast
    with open(stopfile_path, 'r', encoding='utf-8') as f:
        for temp_text in f:
            temp_text = temp_text.strip('\n')
            if temp_text:  # skip blank lines
                stop_words.add(temp_text)

    # segment the comment text and remove stop words
    train_temp = [jiebafenci(knob, stop_words) for knob in train_data['comment'].values]
    test_temp = [jiebafenci(knob, stop_words) for knob in test_data['comment'].values]

    # text feature extraction: min_df is the minimum document frequency, ngram_range the n-gram sizes considered,
    # token_pattern the pattern a token must match to enter the vocabulary
    vector = CountVectorizer(min_df=10, ngram_range=(1, 1), token_pattern=r'\b\w+\b')
    vector.fit(train_temp + test_temp)
    # convert the training and test texts into count matrices
    lable_train = np.array(train_data['label'].tolist())
    matrix_train = vector.transform(train_temp).toarray()
    print(matrix_train)  # inspect the training matrix
    matrix_test = vector.transform(test_temp).toarray()
    return matrix_train, lable_train, matrix_test, test_data


def K_flod_cross_validation():
    # stratified K-fold cross-validation, 8 folds here
    n_splits = 8
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2023).split(matrix_train, lable_train)
    # array that accumulates the test-set predictions
    y_test_preds = np.zeros((len(matrix_test), 1))

    # instantiate the individual classifiers
    logistic_cf = LogisticRegression(C=1.2)  # logistic regression; C, the inverse of the regularization strength, is set to 1.2
    detree_cf = DecisionTreeClassifier(criterion='gini', max_depth=30, min_samples_leaf=1,
                                       ccp_alpha=0.0)  # decision tree; ccp_alpha is the cost-complexity pruning parameter: values above 0 prune branches to limit overfitting, 0.0 means no pruning
    xgboost_cf = xgb.XGBClassifier()  # XGBoost model
    vote_cf = VotingClassifier(estimators=[('lr', logistic_cf), ('drc', detree_cf), ('xgb', xgboost_cf)],
                               voting='hard')  # ensemble model: majority vote over the predicted labels
    for i, (train_set, verify_set) in enumerate(skf):
        xsubset_train, ysubset_train = matrix_train[train_set], lable_train[train_set]
        # VotingClassifier.fit trains its own clones of the three estimators,
        # so they do not need to be fitted separately beforehand
        vote_model = vote_cf
        vote_model.fit(xsubset_train, ysubset_train)
        temp = vote_model.predict(matrix_test)
        # average the test predictions over the folds: divide by the number of folds
        y_test_preds += temp.reshape(-1, 1) / n_splits
        # evaluate on the held-out fold
        y_verify_pred = vote_model.predict(matrix_train[verify_set])
        accuracy = accuracy_score(lable_train[verify_set], y_verify_pred)
        print(f"Fold {i + 1} Accuracy: {accuracy:8f}")
    return y_test_preds


if __name__ == '__main__':
    matrix_train, lable_train, matrix_test, data = init_data()
    pre_test = K_flod_cross_validation()
    # threshold the averaged votes at 0.5 to get the final 0/1 label
    pre_test = [1 if x >= 0.5 else 0 for x in pre_test.ravel()]
    result = data.copy()
    result['label'] = pre_test
    result[['id', 'label']].to_csv('./final_result.csv', index=False)
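
The last step averages the per-fold test predictions and thresholds the mean at 0.5. The sketch below shows this in plain Python with made-up votes from 8 folds on 3 test samples; the function name `average_fold_votes` is hypothetical. The divisor should equal the number of folds: the third sample gets 3 positive votes out of 8 (mean 0.375, label 0), whereas dividing by a smaller constant such as 5 would give 0.6 and flip it to 1.

```python
def average_fold_votes(fold_predictions, threshold=0.5):
    n_folds = len(fold_predictions)
    final = []
    for votes in zip(*fold_predictions):
        mean_vote = sum(votes) / n_folds  # divide by the actual number of folds
        final.append(1 if mean_vote >= threshold else 0)
    return final

# Hypothetical 0/1 predictions: one row per fold, one column per test sample
fold_predictions = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
]
print(average_fold_votes(fold_predictions))  # [1, 0, 0]
```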

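Since the final evaluation was done online, the predictions are saved to a CSV file. As far as I remember, the online accuracy was above 90%, but it was so long ago that I have forgotten the exact figure.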