NLP: Text Classification Tasks

丁兰子

Last modified: 2024-02-29 10:07:43

Tags: natural language processing, classification, artificial intelligence

First published: 2024-02-27 16:29:07

Article link: https://blog.csdn.net/m0_58758292/article/details/136316439

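What is a text classification task?

Text classification is a fundamental task in natural language processing (NLP). Its goal is to automatically assign a piece of text to one or more predefined categories. Doing so usually requires understanding the content of the text and classifying it according to its meaning.

In text classification, an algorithm predicts the most suitable category label from the features of the input text. The label set may contain just two classes (for example, positive or negative sentiment) or many classes (for example, news topics such as politics, sports, and entertainment).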
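
Because a text can belong to one category or to several, labels are usually represented differently in the two cases. The sketch below shows one common encoding: a single class index for single-label classification and a 0/1 indicator vector for multi-label classification. The category list and the helper names (`to_class_index`, `to_multi_hot`) are assumptions made for illustration.

```python
# Hypothetical category list, chosen only for illustration.
CATEGORIES = ["politics", "sports", "entertainment"]


def to_class_index(label, categories=CATEGORIES):
    """Single-label case: each text gets exactly one class index."""
    return categories.index(label)


def to_multi_hot(labels, categories=CATEGORIES):
    """Multi-label case: each text gets a 0/1 flag per category."""
    label_set = set(labels)
    return [1 if category in label_set else 0 for category in categories]


single = to_class_index("sports")
multi = to_multi_hot(["politics", "entertainment"])
print(single)  # 1
print(multi)   # [1, 0, 1]
```

A single-label classifier predicts one index per text, while a multi-label classifier predicts one yes/no decision per category.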

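Text classification is widely used in practice. Common use cases include:

1. **Sentiment analysis**: determining the emotional tone a text expresses, such as positive, negative, or neutral.

2. **Spam detection**: deciding whether an email is spam.

3. **Topic identification**: identifying the topic of a document or article, for example assigning news articles to different topics.

4. **Intent recognition**: identifying the intent behind a user's input in a dialogue system, such as an order query or a weather query.

5. **Language detection**: identifying the language a text is written in.

6. **Product categorization**: assigning product descriptions to the appropriate product category in e-commerce.

7. **Content recommendation**: classifying content based on a user's reading or browsing history in order to provide personalized recommendations.

8. **Automatic tag generation**: automatically generating tags or keywords for blogs, articles, or other web content.

Text classification can be tackled with many machine learning methods, including naive Bayes, support vector machines (SVM), random forests, and gradient-boosted trees, as well as deep learning methods such as convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), and Transformer models. With the development of deep learning, methods based on pretrained language models (such as BERT and GPT) have achieved strong results on text classification tasks.

Common use cases for text classification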

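1. Sentiment analysis:

Determine the emotional tone a text expresses, such as positive, negative, or neutral. The example below is only a basic illustration; real applications need larger and more complex datasets, preprocessing steps, feature extraction methods, and model tuning. Deep learning methods, especially those based on pretrained models such as BERT, may also deliver better performance.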

# Import the required libraries
from sklearn.feature_extraction.text import CountVectorizer  # converts text into word-count features
from sklearn.model_selection import train_test_split  # splits the dataset
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes classifier
from sklearn.metrics import accuracy_score  # evaluates model performance
from sklearn.pipeline import make_pipeline  # builds the processing and classification pipeline

# 1. Example dataset: a small set of texts and labels for training and testing
data = [
    "I love this product",  # positive
    "This is a great movie",  # positive
    "I am so happy with this purchase",  # positive
    "I hate this item",  # negative
    "Worst service ever",  # negative
    "I am not happy with this product",  # negative
    "This is okay, but not the best",  # neutral
    "The movie was fine, nothing special",  # neutral
]
labels = [1, 1, 1, 0, 0, 0, 2, 2]  # 1 = positive, 0 = negative, 2 = neutral

# 2. Split the dataset into training and test sets
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.2, random_state=42)

# 3. Build a pipeline that combines text vectorization with a naive Bayes classifier
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model on the training data
text_clf.fit(train_data, train_labels)

# Predict the sentiment of the test data
predicted = text_clf.predict(test_data)

# Evaluate the model: print its accuracy on the test set
print(f"Accuracy: {accuracy_score(test_labels, predicted)}")

# Try the model: predict the sentiment of a new text and print the result
test_text = "I am not sure if I like this movie"
predicted_sentiment = text_clf.predict([test_text])[0]
print(f"The sentiment of the test text: {'positive' if predicted_sentiment == 1 else 'negative' if predicted_sentiment == 0 else 'neutral'}")
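
To make it clearer what the naive Bayes step is doing, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing. It is an illustration only, not scikit-learn's implementation: the function names (`train_nb`, `predict_nb`), the whitespace tokenizer, and the tiny training set are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_nb(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_nb(text, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    vocab_size = len(vocabulary)
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs non-zero.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


nb_texts = ["I love this product", "This is a great movie",
            "I hate this item", "Worst service ever"]
nb_labels = ["positive", "positive", "negative", "negative"]
nb_model = train_nb(nb_texts, nb_labels)
print(predict_nb("I love this movie", *nb_model))    # positive
print(predict_nb("I hate this service", *nb_model))  # negative
```

The classifier multiplies (in log space, adds) per-word likelihoods for each class, so words that occur more often in one class pull the prediction toward it. Working in log space avoids numerical underflow on longer texts.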

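2. Spam detection:

Decide whether an email is spam. In a real application you would likely need a larger dataset with real labels, and you might use more complex models and feature extraction methods to improve detection accuracy. Deep learning methods are another option, though they may require more data and compute.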

# Import the required libraries: the naive Bayes classifier, the TF-IDF vectorizer,
# the pipeline helper, the dataset splitter, and the evaluation report
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Example dataset of spam and non-spam emails with their labels
emails = [
    "Hey there, I thought you might like to see our new product range",
    "BIG DISCOUNTS on new products, buy now",
    "Dear friend, I have a business proposal for you",
    "Hello, I saw your profile online and would like to connect",
    "You have won $1,000,000! Claim your prize now",
    "Please find attached the weekly report",
    "Don't miss out on this opportunity, cheap meds for sale",
    "Your long lost relative has left you an inheritance"
]
labels = [0, 1, 1, 0, 1, 0, 1, 0]  # 0 = not spam, 1 = spam

# Split the dataset into training and test sets
train_emails, test_emails, train_labels, test_labels = train_test_split(emails, labels, test_size=0.2, random_state=42)

# Build a pipeline that first converts the text into TF-IDF feature vectors
# and then classifies them with a naive Bayes model
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Train the model in the pipeline on the training data
model.fit(train_emails, train_labels)

# Predict whether each email in the test set is spam
predictions = model.predict(test_emails)

# Print a detailed classification report for the test-set predictions
print(classification_report(test_labels, predictions))

# Try the model: predict the label of a new sample email and print the result
sample_email = "Congratulations, you've won a free ticket to Bahamas!"
predicted_label = model.predict([sample_email])[0]
print(f"The email is {'not spam' if predicted_label == 0 else 'spam'}")
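
The TF-IDF weighting used for the email features can be sketched by hand. The example below follows the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The helper name `tfidf_vectors`, the whitespace tokenizer, and the three sample sentences are assumptions for illustration, so the numbers will not match `TfidfVectorizer` exactly on real text.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get larger values.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


sample_docs = ["buy cheap meds now", "weekly report attached", "buy now"]
sample_vectors = tfidf_vectors(sample_docs)
for doc, vector in zip(sample_docs, sample_vectors):
    print(doc, {term: round(weight, 3) for term, weight in vector.items()})
```

In the first sample, "cheap" and "meds" appear in only one document, so they receive a higher weight than "buy" and "now", which appear in two.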

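3. Topic identification:

Identify the topic of a document or article, for example assigning news articles to different topics. In a real application you will probably have your own dataset, which needs suitable preprocessing and feature extraction. Depending on the complexity of the task and the size of the dataset, you may also need to adjust the model's parameters or try different classification algorithms.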

# Import the required libraries
from sklearn.datasets import fetch_20newsgroups  # example dataset loader
from sklearn.feature_extraction.text import TfidfVectorizer  # text vectorization
from sklearn.svm import SVC  # support vector machine classifier
from sklearn.pipeline import make_pipeline  # builds the processing pipeline
from sklearn.model_selection import train_test_split  # splits a dataset
from sklearn.metrics import classification_report  # evaluates model performance

# Load the dataset: four categories of the 20 Newsgroups data, a dataset commonly used
# for topic identification (scikit-learn downloads it on first use)
categories = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Build a pipeline with a TfidfVectorizer and an SVM classifier: the vectorizer converts
# text into TF-IDF feature vectors, and the SVM performs the classification
model = make_pipeline(TfidfVectorizer(), SVC())

# Train the model on the training set
model.fit(newsgroups_train.data, newsgroups_train.target)

# Predict the topic of each document in the test set
predictions = model.predict(newsgroups_test.data)

# Print a detailed classification report for the test-set predictions
print(classification_report(newsgroups_test.target, predictions, target_names=newsgroups_test.target_names))

# Try the model: predict the topic of a new sample text and print the predicted category
sample_text = "The launch of the space shuttle was successful."
predicted_category = model.predict([sample_text])[0]
print(f"The topic of the text: {newsgroups_train.target_names[predicted_category]}")

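4. Intent recognition:

Identify the intent behind a user's input in a dialogue system, such as an order query or a weather query. This example uses a simple feed-forward neural network that processes the text through word embeddings and global average pooling. In a real application you would likely need more data and a more complex architecture, such as a recurrent neural network (RNN) or a Transformer model, along with hyperparameter tuning and proper model evaluation.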

# Import TensorFlow and the Keras components used below
import tensorflow as tf
from tensorflow.keras.models import Sequential  # model container
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D  # layers
from tensorflow.keras.preprocessing.text import Tokenizer  # text preprocessing
from tensorflow.keras.preprocessing.sequence import pad_sequences  # sequence padding
from sklearn.preprocessing import LabelEncoder  # converts text labels to integers

# Example sentences and their intent labels
sentences = [
    'How long does shipping take?',
    'What are the shipping options?',
    'How do I track my order?',
    'Is there a discount available?'
]
labels = ['shipping_query', 'shipping_options', 'order_tracking', 'discounts_query']

# Encode the labels: convert the text intent labels into integers
label_encoder = LabelEncoder()
label_encoder.fit(labels)
encoded_labels = label_encoder.transform(labels)

# Prepare the text data: set the preprocessing parameters and convert the text into sequences
vocab_size = 1000
embedding_dim = 16
max_length = 20
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
# Pad the sequences so that they all have the same length
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding='post')

# Build the sequential model: an embedding layer, a pooling layer, and two dense layers
model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
    Dense(24, activation='relu'),
    Dense(len(set(labels)), activation='softmax')
])

# Compile the model: set the loss function, optimizer, and evaluation metric
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(padded_sequences, encoded_labels, epochs=500, verbose=2)

# Use the model to predict the intent of a new user input
test_sentence = ["Can I get my order expedited?"]
test_seq = tokenizer.texts_to_sequences(test_sentence)
test_padded = pad_sequences(test_seq, maxlen=max_length, padding='post')
prediction = model.predict(test_padded)
# Convert the predicted class index back into its text label
predicted_intent_label = label_encoder.inverse_transform([tf.argmax(prediction[0]).numpy()])[0]

print(f'Predicted intent: {predicted_intent_label}')
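
The core of this model is an embedding lookup followed by global average pooling, which turns a variable-length token sequence into a fixed-size vector. Here is a minimal pure-Python sketch of that step, assuming a tiny hand-written embedding table. The values and the helper name `average_pool` are made up for illustration; a real `Embedding` layer learns its values during training.

```python
# Hypothetical 2-dimensional embedding table: row i is the vector for token id i.
# Id 0 stands for the padding token added by padding the sequences.
embedding_table = [
    [0.0, 0.0],  # 0: padding
    [1.0, 3.0],  # 1
    [3.0, 1.0],  # 2
    [2.0, 2.0],  # 3
]


def average_pool(token_ids, table):
    """Look up each token's vector and average them dimension by dimension."""
    vectors = [table[token_id] for token_id in token_ids]
    dim = len(table[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# A padded sequence of length 4: two real tokens followed by two padding tokens.
padded = [1, 2, 0, 0]
pooled = average_pool(padded, embedding_table)
print(pooled)  # [1.0, 1.0]
```

Note that this sketch averages over padding positions as well. As far as I can tell the same happens in the Keras model unless the embedding layer is configured to mask padding, so heavily padded inputs are pulled toward the padding vector.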

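5. Language detection:

Identify the language a text is written in. In a real application, a larger and more diverse dataset is needed to train a model that can accurately recognize many different languages. Data preprocessing and model tuning may also need to be considered.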

# Import the required scikit-learn functions and classes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Example dataset of text samples and their language labels
texts = ['Hello, how are you?', 'Hola, cómo estás?', 'Bonjour, comment ça va?', 'Ciao, come stai?']
labels = ['English', 'Spanish', 'French', 'Italian']

# Split the dataset into training and test sets.
# With one sample per language, the single held-out sample belongs to a language the model
# never sees during training, so the report below only demonstrates the workflow.
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Build a pipeline with two steps: 1) feature extraction with CountVectorizer and
# 2) a MultinomialNB classifier. The features are character-level n-grams of length 1 to 3.
model = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char', ngram_range=(1,3))),
    ('classifier', MultinomialNB())
])

# Train the model
model.fit(X_train, y_train)

# Predict the language of the texts in the test set
predictions = model.predict(X_test)

# Print the classification report for the test-set predictions
print(classification_report(y_test, predictions))

# Detect the language of a single new text and print the result.
# German is not among the training labels, so the model can only answer with
# one of the languages it was trained on.
sample_text = 'Wie geht es dir?'
predicted_language = model.predict([sample_text])[0]
print(f'The language of the text is: {predicted_language}')
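
Character n-grams are what let a model like this pick up language-specific letter patterns. The sketch below shows how they can be extracted, roughly mirroring what a character-level vectorizer with an n-gram range of 1 to 3 counts. The helper name `char_ngrams` is hypothetical, and scikit-learn's own analyzer may differ in details such as whitespace handling.

```python
def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max, lowercased."""
    text = text.lower()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams


print(char_ngrams("Ciao", 1, 2))  # ['c', 'i', 'a', 'o', 'ci', 'ia', 'ao']
print(char_ngrams("Hola", 3, 3))  # ['hol', 'ola']
```

Because the features are short character sequences rather than whole words, the model can still produce useful features for words it has never seen.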

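6. Product categorization:

Assign product descriptions to the appropriate product category in e-commerce. In a real application you need a larger and more diverse dataset to train a model that can accurately classify many different products. You may also need data preprocessing, feature selection, model tuning, and cross-validation to improve model performance.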

# Import the required libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Example dataset of product descriptions and their category labels
product_descriptions = ['4K TV 60 inch', 'running shoes for men', 'wireless keyboard and mouse', 'bluetooth headphones']
product_categories = ['Electronics', 'Fashion', 'Computer Accessories', 'Electronics']

# Split the dataset into training and test sets
# (with only four samples, the test set contains a single description)
X_train, X_test, y_train, y_test = train_test_split(product_descriptions, product_categories, test_size=0.2, random_state=42)

# Build a pipeline with two steps: 1) feature extraction and 2) a classifier.
# TfidfVectorizer converts text into TF-IDF feature vectors, and LinearSVC performs the classification.
model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000)),
    ('classifier', LinearSVC())
])

# Train the model
model.fit(X_train, y_train)

# Predict the category of the product descriptions in the test set
predictions = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, predictions))

# Classify a single new product description and print the result
sample_description = 'smartwatch 4G LTE'
predicted_category = model.predict([sample_description])[0]
print(f'The product category is: {predicted_category}')

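7. Content recommendation:

Classify content based on a user's reading or browsing history in order to provide personalized recommendations. The example below does not train a classifier; it ranks content items by the cosine similarity between their TF-IDF vectors and the user's reading history.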

# Import TfidfVectorizer and cosine_similarity from scikit-learn, plus NumPy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example data: the content library and the user's reading history
contents = [
    'Data Science News',
    'Data Science How To',
    'Machine Learning Basics',
    'Deep Learning for Beginners',
    'Artificial Intelligence Future',
    'Deep Learning Advances'
]
user_history = ['I love reading about deep learning and artificial intelligence']

# Create the TF-IDF vectorizer and convert all contents plus the user history into TF-IDF vectors
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(contents + user_history)

# Compute the cosine similarity between the user history (last row) and every content item
cosine_sim = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Extract the similarity scores
similar_scores = cosine_sim[0]
# Sort the content indices by similarity score in descending order
similar_indices = similar_scores.argsort()[::-1]

# Recommend the num_recommendations most similar content items
num_recommendations = 2
for i in range(num_recommendations):
    content_index = similar_indices[i]
    print(f'Recommended content: {contents[content_index]} with similarity score: {similar_scores[content_index]}')
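
For readers who want to see what the similarity score measures, here is a minimal sketch of cosine similarity computed by hand on raw word counts instead of TF-IDF weights. The helper name `cosine_similarity_counts` and the whitespace tokenizer are assumptions for illustration, so the scores differ from those produced with TF-IDF vectors.

```python
import math
from collections import Counter


def cosine_similarity_counts(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


history = "deep learning and artificial intelligence"
score_related = cosine_similarity_counts(history, "Deep Learning Advances")
score_unrelated = cosine_similarity_counts(history, "Data Science News")
print(round(score_related, 4))  # 0.5164
print(score_unrelated)          # 0.0
```

The score depends only on the angle between the two vectors, not on their length, which is why a short title can still match a much longer reading history.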

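8. Automatic tag generation:

Automatically generate tags or keywords for blogs, articles, or other web content. In a real application you may need a more sophisticated tag extraction method, possibly involving part-of-speech tagging, named entity recognition, or dictionary-based keyword extraction. You may also need to handle larger document collections and consider text preprocessing steps such as lowercasing, stemming, or lemmatization. The example below is only a basic version that shows how to extract keywords as tags using TF-IDF.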

# Import NumPy and the TfidfVectorizer class
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Example data: a small collection of documents
documents = [
    'Python is a powerful programming language for data analysis',
    'Deep learning can be used for image recognition',
    'The stock market is volatile and unpredictable'
]

# Rank the terms of a document by TF-IDF score and return the top ones as tags.
# The function takes a document, a fitted TF-IDF vectorizer, and top_n,
# which controls how many keywords are extracted as tags.
def extract_tags_from_document(doc, vectorizer, top_n=3):
    # Convert the document into a TF-IDF feature matrix
    tfidf_matrix = vectorizer.transform([doc])
    # Get the feature names (vocabulary terms) from the vectorizer
    feature_array = np.array(vectorizer.get_feature_names_out())
    # Sort the feature indices by TF-IDF score in descending order
    tfidf_sorting = np.argsort(tfidf_matrix.toarray()).flatten()[::-1]
    # Select the top_n highest-scoring terms as tags
    top_n_words = feature_array[tfidf_sorting][:top_n]
    return top_n_words

# Create the TF-IDF vectorizer with English stop-word filtering
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Fit the vectorizer on the document collection
tfidf_vectorizer.fit(documents)

# Generate tags for each document and print the document with its tags
for doc in documents:
    tags = extract_tags_from_document(doc, tfidf_vectorizer)
    print(f'Document: "{doc}"')
    print(f'Tags: {tags}\n')