Scikit-LLM: Enhance Your Text Analysis with LLMs Inside the scikit-learn Framework

Article link: https://blog.csdn.net/wjjc1017/article/details/135900453

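This article introduces Scikit-LLM, a Python library that integrates large language models such as GPT-3 into the scikit-learn framework for zero-shot text classification, multi-label classification, text vectorization, translation, and summarization. It shows how to connect to OpenAI with an API key, walks through using the library inside a scikit-learn pipeline, and discusses its strengths and limitations.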

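Table of Contents

Introduction
Why Choose This Library?
Installation
Configuration
Zero-Shot Text Classification
Single-Label ZeroShotGPTClassifier
Multi-Label ZeroShotGPTClassifier
Text Vectorization
Text Translation
Text Summarization
Using Scikit-LLM in a scikit-learn Pipeline
Conclusion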

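Introduction

Scikit-LLM is a Python package that integrates large language models (LLMs), such as OpenAI's GPT-3, into the scikit-learn framework for text analysis tasks.

Scikit-LLM is designed to work inside the scikit-learn framework, so if you already know scikit-learn, Scikit-LLM will feel familiar. The library offers a range of features, of which this article covers the following:

Zero-shot text classification
Multi-label zero-shot text classification
Text vectorization
Text translation
Text summarization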

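Why Choose This Library?

The main advantage of this library is its familiar scikit-learn API. In particular:

You can use the same kind of API as in scikit-learn, such as .fit(), .fit_transform(), and .predict().
You can combine estimators from the Scikit-LLM library in a scikit-learn pipeline (see the last example in this article).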

Installation

You can install the library with pip:

# Install the scikit-llm library
pip install scikit-llm

Configuration

Before you start using Scikit-LLM, you need to pass your OpenAI API key to it. You can follow this article to set up your OpenAI API key.

# Import the SKLLMConfig class
from skllm.config import SKLLMConfig

# Your OpenAI secret key and organization ID
OPENAI_SECRET_KEY = "sk-***"
OPENAI_ORG_ID = "org-***"

# Register the key and organization ID with Scikit-LLM via SKLLMConfig's static methods
SKLLMConfig.set_openai_key(OPENAI_SECRET_KEY)
SKLLMConfig.set_openai_org(OPENAI_ORG_ID)
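
Hard-coding secrets in a notebook makes them easy to leak. A minimal sketch of an alternative, assuming the key and organization ID are stored in environment variables named OPENAI_API_KEY and OPENAI_ORG_ID (names chosen for illustration). The helper `read_openai_credentials` is hypothetical and not part of Scikit-LLM:

```python
import os


def read_openai_credentials(env=None):
    """Return (key, org_id) read from environment variables.

    The variable names OPENAI_API_KEY and OPENAI_ORG_ID are assumptions
    made for this sketch; use whatever names your setup defines.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    org_id = env.get("OPENAI_ORG_ID")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key, org_id


# Usage (requires scikit-llm to be installed):
# from skllm.config import SKLLMConfig
# key, org_id = read_openai_credentials()
# SKLLMConfig.set_openai_key(key)
# if org_id:
#     SKLLMConfig.set_openai_org(org_id)
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.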

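Note that Scikit-LLM provides a convenient interface to OpenAI's GPT-3 models. Using these models is not free and requires an API key. Although the API is relatively inexpensive, costs can add up depending on the volume of data and the frequency of calls, so it is important to plan and monitor your usage. Before you start using Scikit-LLM, be sure to review OpenAI's pricing details and terms of use.

To give you a rough idea, I ran this notebook at least five times while preparing this tutorial, and the total cost was 0.02 USD. I must say, I expected it to be higher!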
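
For a rough budget before running a larger job, the arithmetic can be sketched as follows. The request count, token count, and price per 1,000 tokens are placeholders and not OpenAI's actual rates; substitute real values from the current pricing page.

```python
def estimate_cost_usd(n_requests, avg_tokens_per_request, usd_per_1k_tokens):
    """Rough cost estimate: total tokens divided by 1,000, times the unit price."""
    total_tokens = n_requests * avg_tokens_per_request
    return total_tokens / 1000 * usd_per_1k_tokens


# Hypothetical numbers, for illustration only.
cost = estimate_cost_usd(
    n_requests=500,
    avg_tokens_per_request=200,
    usd_per_1k_tokens=0.002,
)
print(f"Estimated cost: ${cost:.2f}")
```

The estimate ignores any difference between input and output token pricing, so treat it as an order-of-magnitude check.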

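Zero-Shot Text Classification

One of Scikit-LLM's features is zero-shot text classification. The library provides two classes for this purpose:

ZeroShotGPTClassifier: for single-label classification (for example, sentiment analysis),
MultiLabelZeroShotGPTClassifier: for multi-label classification tasks.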

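Single-Label ZeroShotGPTClassifier

Let's run sentiment analysis on a few movie reviews. For fitting, we define a sentiment for each review (stored in the variable movie_review_labels). We fit the model on these reviews and labels so that we can then use it to predict the sentiment of new movie reviews. Because the classifier is zero-shot, fitting mainly tells it which candidate labels exist rather than training it in the usual sense.

The sample movie review dataset is as follows: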

# Movie review dataset

# Original movie reviews
movie_reviews = [
    "This movie was absolutely wonderful. The storyline was compelling and the characters were very realistic.",
    "I really loved the film! The plot had a few unexpected twists which kept me engaged till the end.",
    "The movie was alright. Not great, but not bad either. A decent one-time watch.",
    "I didn't enjoy the film that much. The plot was quite predictable and the characters lacked depth.",
    "This movie was not to my taste. It felt too slow and the storyline wasn't engaging enough.",
    "The film was okay. It was neither impressive nor disappointing. It was just fine.",
    "I was blown away by the movie! The cinematography was excellent and the performances were top-notch.",
    "I didn't like the movie at all. The story was uninteresting and the acting was mediocre at best.",
    "The movie was decent. It had its moments but was not consistently engaging."
]

# Labels for the movie reviews
movie_review_labels = [
    "positive", 
    "positive", 
    "neutral", 
    "negative", 
    "negative", 
    "neutral", 
    "positive", 
    "negative", 
    "neutral"
]

# New movie reviews
new_movie_reviews = [
    # A positive review
    "The movie was fantastic! I was captivated by the storyline from beginning to end.",

    # A negative review
    "I found the film to be quite boring. The plot moved too slowly and the acting was subpar.",

    # A neutral review
    "The movie was okay. Not the best I've seen, but certainly not the worst."
]

Let's fit the model and then check its prediction for each new review.

from skllm import ZeroShotGPTClassifier
# Initialize the classifier with an OpenAI model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")

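# Fit the model
clf.fit(X=movie_reviews, y=movie_review_labels)

# Use the fitted classifier to predict the sentiment of the new reviews
predicted_movie_review_labels = clf.predict(X=new_movie_reviews)

# Iterate over the new reviews and their predicted sentiments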
for review, sentiment in zip(new_movie_reviews, predicted_movie_review_labels):
    print(f"Review: {review}\nPredicted Sentiment: {sentiment}\n\n")

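Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Predicted Sentiment: positive


Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Predicted Sentiment: negative


Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Predicted Sentiment: neutral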

As shown above, the model predicted the sentiment of each movie review correctly.
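
Eyeballing three predictions works here, but with more reviews it helps to compute the agreement rate. A minimal sketch: the `expected` list comes from the comments next to each entry in `new_movie_reviews`, and the `predicted` list is hard-coded from the printed output so the snippet runs without an API call.

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy of an empty list")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


expected = ["positive", "negative", "neutral"]
predicted = ["positive", "negative", "neutral"]  # copied from the printed output
print(accuracy(expected, predicted))
```

With a fitted classifier, `predicted` would instead be `list(clf.predict(new_movie_reviews))`.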

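Multi-Label ZeroShotGPTClassifier

In the previous section we used a single-label classifier (with the labels ["positive", "negative", "neutral"]). Here we use the MultiLabelZeroShotGPTClassifier estimator to assign multiple labels to each review in a list of restaurant reviews.

The restaurant review dataset contains a set of reviews and their corresponding labels.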

restaurant_reviews = [
    "The food was delicious and the service was excellent. A wonderful dining experience!",
    "The restaurant was in a great location, but the food was just average.",
    "The service was very slow and the food was cold when it arrived. Not a good experience.",
    "The restaurant has a beautiful ambiance, and the food was superb.",
    "The food was great, but I found it to be a bit overpriced.",
    "The restaurant was conveniently located, but the service was poor.",
    "The food was not as expected, but the restaurant ambiance was really nice.",
    "Great food and quick service. The location was also very convenient.",
    "The prices were a bit high, but the food quality and the service were excellent.",
    "The restaurant offered a wide variety of dishes. The service was also very quick."
]

restaurant_review_labels = [
    ["Food", "Service"],
    ["Location", "Food"],
    ["Service", "Food"],
    ["Atmosphere", "Food"],
    ["Food", "Price"],
    ["Location", "Service"],
    ["Food", "Atmosphere"],
    ["Food", "Service", "Location"],
    ["Price", "Food", "Service"],
    ["Food Variety", "Service"]
]

new_restaurant_reviews = [
    "The food was excellent and the restaurant was located in the heart of the city.",
    "The service was slow and the food was not worth the price.",
    "The restaurant had a wonderful ambiance, but the variety of dishes was limited."
]

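Let's fit the model and then predict the labels for the new reviews.

from skllm import MultiLabelZeroShotGPTClassifier

# Initialize the classifier, allowing at most 3 labels per review
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)

# Fit the model
clf.fit(X=restaurant_reviews, y=restaurant_review_labels)

# Use the fitted classifier to predict the labels of the new reviews
predicted_restaurant_review_labels = clf.predict(X=new_restaurant_reviews)

# Iterate over the new reviews and their predicted labels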
for review, labels in zip(new_restaurant_reviews, predicted_restaurant_review_labels):
    print(f"Review: {review}\nPredicted Labels: {labels}\n\n")
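
Review: The food was excellent and the restaurant was located in the heart of the city.
Predicted Labels: ['Food', 'Location']


Review: The service was slow and the food was not worth the price.
Predicted Labels: ['Service', 'Price']


Review: The restaurant had a wonderful ambiance, but the variety of dishes was limited.
Predicted Labels: ['Atmosphere', 'Food Variety']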

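Text Vectorization

Scikit-LLM provides the GPTVectorizer class, which converts input text into a fixed-dimensional vector representation. Each resulting vector is an array of floating-point numbers that represents the corresponding sentence.

Let's get the vector representations of the following sentences.

# Import the GPTVectorizer class
from skllm.preprocessing import GPTVectorizer

# Define the list of texts
X = [
    "AI can revolutionize industries.",
    "Robotics creates automated solutions.",
    "IoT connects devices for data exchange."
]

# Instantiate the GPTVectorizer
vectorizer = GPTVectorizer()

# Convert the list of texts into vectors
vectors = vectorizer.fit_transform(X)

# Print the resulting vectors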
print(vectors)
[[-0.00818074 -0.02555227 -0.00994665 ... -0.00266894 -0.02135153
   0.00325925]
 [-0.00944166 -0.00884305 -0.01260475 ... -0.00351341 -0.01211498
  -0.00738735]
 [-0.01084771 -0.00133671  0.01582962 ...  0.01247486 -0.00829649
  -0.01012453]]

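In practice, these vectors are not inspected directly. They are fed into other machine learning models for tasks such as classification, clustering, or regression.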
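
One simple downstream use is comparing sentences by the cosine similarity of their vectors. The sketch below uses short made-up vectors so that it runs without an API call; real GPTVectorizer output has many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for illustration.
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.5]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # close to 1: similar direction
print(cosine_similarity(v1, v3))  # 0.0: orthogonal vectors
```

With real embeddings, each row of `vectors` would take the place of `v1`, `v2`, and `v3`.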

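Text Translation

GPT models can translate text from one language to another. Scikit-LLM exposes this through the GPTTranslator class, which translates the input text into the language given by output_language.

# Import GPTTranslator and get_translation_dataset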
from skllm.preprocessing import GPTTranslator
from skllm.datasets import get_translation_dataset

# Create a GPTTranslator that uses "gpt-3.5-turbo" and outputs English
translator = GPTTranslator(openai_model="gpt-3.5-turbo", output_language="English")

# Text to translate
text_to_translate = ["Je suis content que vous lisiez ce post."]
# "I am happy that you are reading this post."

# Call fit_transform to run the translation
translated_text = translator.fit_transform(text_to_translate)

# Print the original and the translated text
print(
    f"Text in French: \n{text_to_translate[0]}\n\nTranslated text in English: {translated_text[0]}"
)
# Text in French: 
# Je suis content que vous lisiez ce post.

# Translated text in English: I am glad that you are reading this post.

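Text Summarization

GPT models are well suited to text summarization. The Scikit-LLM library provides the GPTSummarizer estimator for this task. Let's see how it performs by summarizing the long reviews below.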

reviews = [
    (
        "I dined at The Gourmet Kitchen last night and had a wonderful experience. " 
        "The service was impeccable, the food was exquisite, and the ambiance was delightful. "
        "I had the seafood pasta, which was cooked to perfection. "
        "The wine list was also quite impressive. "
        "I would highly recommend this restaurant to anyone looking for a fine dining experience."
    ),
    (
        "I visited The Burger Spot for lunch today and was pleasantly surprised. "
        "Despite being a fast food joint, the quality of the food was excellent. "
        "I ordered the classic cheeseburger and it was juicy and flavorful. "
        "The fries were crispy and well-seasoned. "
        "The service was quick and the staff was friendly. "
        "It's a great place for a quick and satisfying meal."
    ),
    (
        "The Coffee Corner is my favorite spot to work and enjoy a good cup of coffee. "
        "The atmosphere is relaxed and the coffee is always top-notch. "
        "They also offer a variety of pastries and sandwiches. "
        "The staff is always welcoming and the service is fast. "
        "I enjoy their latte and the blueberry muffin is a must-try."
    )
]

# Note: adjacent string literals are concatenated automatically
# string1 = "ABC"
# string2 = ("A" "B" "C")
# print(string1 == string2)
# >>> True

Note that reviews above is a list of three items, and each item is split across several string literals so that it is easier to read.

from skllm.preprocessing import GPTSummarizer

# Create a GPTSummarizer that uses "gpt-3.5-turbo" with a maximum of 15 words per summary
gpt_summarizer = GPTSummarizer(openai_model="gpt-3.5-turbo", max_words=15)

# Summarize the reviews and store the results in summaries
summaries = gpt_summarizer.fit_transform(reviews)

# Print the summaries
print(summaries)
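
Because the summaries are generated text, it can be worth checking their length after the fact. The helper below is a hypothetical post-processing step and not part of Scikit-LLM. It assumes the model may occasionally overshoot max_words, which this article does not verify. The sample summary is invented for illustration.

```python
def enforce_max_words(text, max_words):
    """Truncate text to at most max_words whitespace-separated words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + "..."


# Invented sample, standing in for one element of `summaries`.
sample = "Great dining experience with impeccable service, exquisite food, and a delightful ambiance overall."
print(enforce_max_words(sample, 5))
```

Applied to real output, this would look like `[enforce_max_words(s, 15) for s in summaries]`.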

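Using Scikit-LLM in a scikit-learn Pipeline

So far, all the examples above have used Scikit-LLM estimators on their own. As mentioned earlier, the library's main advantage is its integration with the scikit-learn ecosystem.

The following example uses a Scikit-LLM estimator in a scikit-learn pipeline and runs an XGBoost classifier on the movie review example shown earlier.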

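from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier

# Expected labels for the new reviews (positive, negative, neutral, as noted where they were defined)
new_movie_review_labels = ["positive", "negative", "neutral"]

# Define variables to reduce confusion
X, y = movie_reviews, movie_review_labels
X_test, y_test = new_movie_reviews, new_movie_review_labels

# Encode the labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)
y_test_encoded = le.transform(y_test)

# Build the scikit-learn pipeline
steps = [("GPT", GPTVectorizer()), ("Clf", XGBClassifier())]
clf = Pipeline(steps)

clf.fit(X, y_encoded)

y_pred_encoded = clf.predict(X_test)

# Convert the encoded predictions back to the actual labels
y_pred = le.inverse_transform(y_pred_encoded)

print(f"\nEncoded labels (training set): {y_encoded}\n")
print(f"Actual labels (training set): {y}")

print(f"Encoded actual labels (test set): {y_test_encoded}\n")

print("------------------\nEvaluating the performance of the XGBoost classifier:\n")
for test_review, actual_label, predicted_label in zip(X_test, y_test, y_pred):
    print(f"Review: {test_review}\nActual label: {actual_label}\nPredicted label: {predicted_label}\n")

Encoded labels (training set): [2 2 1 0 0 1 2 0 1]

Actual labels (training set): ['positive', 'positive', 'neutral', 'negative', 'negative', 'neutral', 'positive', 'negative', 'neutral']
Encoded actual labels (test set): [2 0 1]

------------------
Evaluating the performance of the XGBoost classifier:

Review: The movie was fantastic! I was captivated by the storyline from beginning to end.
Actual label: positive
Predicted label: positive

Review: I found the film to be quite boring. The plot moved too slowly and the acting was subpar.
Actual label: negative
Predicted label: positive

Review: The movie was okay. Not the best I've seen, but certainly not the worst.
Actual label: neutral
Predicted label: neutral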

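Note that the above is only one use case, meant to illustrate that Scikit-LLM estimators can be integrated into a scikit-learn pipeline. The classifier mislabels the negative review as positive, which is not surprising given that it was trained on only nine examples.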
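
The integer codes in the pipeline output follow from how LabelEncoder assigns them: it sorts the distinct labels and uses each label's position as its code. A pure-Python sketch of that mapping, written as an illustrative re-implementation and not scikit-learn's own code:

```python
def fit_label_mapping(labels):
    """Map each distinct label to its index in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode(labels, mapping):
    """Replace each label with its integer code."""
    return [mapping[label] for label in labels]


def decode(codes, mapping):
    """Replace each integer code with its original label."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[code] for code in codes]


train_labels = [
    "positive", "positive", "neutral", "negative", "negative",
    "neutral", "positive", "negative", "neutral",
]
mapping = fit_label_mapping(train_labels)
print(mapping)
print(encode(train_labels, mapping))
print(decode([2, 0, 1], mapping))
```

This is why "negative" becomes 0, "neutral" becomes 1, and "positive" becomes 2 in the encoded training labels.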

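Conclusion

Scikit-LLM is a powerful tool that brings the capabilities of advanced language models such as GPT-3 to the well-known scikit-learn framework. In this tutorial we covered some of its most important features: 1) zero-shot text classification, 2) multi-label zero-shot text classification, 3) text vectorization, 4) text summarization, 5) language translation, and 6) integration with scikit-learn pipelines.

As mentioned earlier, the main advantage of this library is that its API resembles scikit-learn's and that it can be integrated into scikit-learn pipelines. Its main limitation, however, is its heavy dependence on OpenAI. Although integrating open-source models is on the Scikit-LLM roadmap, they are not yet available in the library. So if you use Scikit-LLM, you should be aware that: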