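Problem
We want to enhance an existing Rasa NLU model with custom components (such as sentiment analysis, spell checking, or a character-level tokenizer).
Introduction to the Rasa NLU pipeline
The pipeline defines which processing steps the input passes through on its way to the output, for example: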
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
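The components are called one after another, and each produces output. That output either becomes part of the final result or serves as input to later components.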
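The sequential hand-off between components can be sketched in plain Python. This is a simplified illustration, not Rasa's actual implementation: `run_pipeline`, the two toy components, and the dict-based message are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# A "message" here is just a dict; each toy component reads and extends it.
Message = Dict[str, object]
ComponentFn = Callable[[Message], None]


def whitespace_tokenizer(message: Message) -> None:
    """Split the raw text into tokens for later components to use."""
    message["tokens"] = str(message["text"]).split()


def token_counter(message: Message) -> None:
    """Consume the tokens produced earlier and add a new output."""
    message["token_count"] = len(message["tokens"])


def run_pipeline(components: List[ComponentFn], text: str) -> Message:
    """Call each component in order on the same message."""
    message: Message = {"text": text}
    for component in components:
        component(message)
    return message


result = run_pipeline([whitespace_tokenizer, token_counter], "Hello stupid bot")
print(result)
```

Swapping the two components would raise a `KeyError`, which is why the order of entries in the pipeline matters.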
Steps
The following uses adding a sentiment analysis component as an example:
- Install the natural language processing library spaCy
pip install spacy
python -m spacy download en_core_web_sm
- Create the project:
rasa init --no-prompt
- Add the following data at the top of nlu.md
## intent: feedback
- It’s very helpful
- I had the best experience speaking with you
- no feedback
- ok
- You are the most stupid bot I have ever seen
- the worst
- Add the labels in labels.txt, one per line, in the same order as the examples above
pos
pos
neu
neu
neg
neg
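The component built in the next step pairs labels with training examples purely by position, so the order in labels.txt has to match the order of the examples at the top of nlu.md. A minimal sketch of that pairing follows. `pair_examples_with_labels` is a hypothetical helper, and the trailing "hello" example is an assumed stand-in for the examples further down nlu.md.

```python
from typing import List, Tuple


def pair_examples_with_labels(
    examples: List[str], labels: List[str]
) -> List[Tuple[str, str]]:
    """Pair the first len(labels) examples with the labels, by position."""
    if len(labels) > len(examples):
        raise ValueError("more labels than training examples")
    # zip() stops at the shorter list, so extra examples stay unlabeled.
    return list(zip(examples, labels))


examples = [
    "It's very helpful",
    "I had the best experience speaking with you",
    "no feedback",
    "ok",
    "You are the most stupid bot I have ever seen",
    "the worst",
    "hello",  # assumed example from further down nlu.md, left unlabeled
]
labels = ["pos", "pos", "neu", "neu", "neg", "neg"]

pairs = pair_examples_with_labels(examples, labels)
for text, label in pairs:
    print(f"{label}\t{text}")
```

Because the pairing is positional, inserting a new example above the feedback block without adding a label would shift every label by one.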
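- Build the sentiment analysis component in sentiment.py. It implements the methods of the Component class it inherits from: train() for training, process() for parsing, persist() for persistence, and load() for loading.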
import os
import pickle
from typing import Any, Text, Dict

from rasa.nlu.components import Component
from nltk.classify import NaiveBayesClassifier

SENTIMENT_MODEL_FILE_NAME = "sentiment_classifier.pkl"


class SentimentAnalyzer(Component):
    """Custom sentiment analysis component."""

    name = "sentiment"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]

    def __init__(self, component_config=None):
        super(SentimentAnalyzer, self).__init__(component_config)
        # Set by train(); lets process() detect an untrained component.
        self.clf = None

    def train(self, training_data, cfg, **kwargs):
        """Load the sentiment labels from a text file, retrieve and format
        the training tokens, and fit the sentiment classifier."""
        with open("labels.txt", "r") as f:
            labels = f.read().splitlines()

        examples = training_data.training_examples  # list of Message objects
        tokens = [[tok.text for tok in ex.get("tokens")] for ex in examples]
        processed_tokens = [self.preprocessing(t) for t in tokens]
        # zip() pairs by position and stops at the shorter list.
        labeled_data = list(zip(processed_tokens, labels))
        self.clf = NaiveBayesClassifier.train(labeled_data)

    def convert_to_rasa(self, value, confidence):
        """Convert the model output into the Rasa NLU output format."""
        entity = {"value": value,
                  "confidence": confidence,
                  "entity": "sentiment",
                  "extractor": "sentiment_extractor"}
        return entity

    def preprocessing(self, tokens):
        """Create a bag-of-words representation of the tokens."""
        return {word: True for word in tokens}

    def process(self, message, **kwargs):
        """Retrieve the tokens of the new message, pass them to the
        classifier, and store the prediction on the message."""
        if not self.clf:
            print("No training!")
        else:
            tokens = [t.text for t in message.get("tokens")]
            tb = self.preprocessing(tokens)
            pred = self.clf.prob_classify(tb)
            sentiment = pred.max()
            confidence = pred.prob(sentiment)
            entity = self.convert_to_rasa(sentiment, confidence)
            message.set("entities", [entity], add_to_output=True)

    def persist(self, file_name, model_dir):
        """Persist the whole component into the model directory."""
        classifier_file = os.path.join(model_dir, SENTIMENT_MODEL_FILE_NAME)
        with open(classifier_file, "wb") as f:
            pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
        return {"classifier_file": SENTIMENT_MODEL_FILE_NAME}

    @classmethod
    def load(cls, meta: Dict[Text, Any], model_dir=None, model_metadata=None,
             cached_component=None, **kwargs):
        """Load the persisted component from the model directory."""
        file_name = meta.get("classifier_file")
        classifier_file = os.path.join(model_dir, file_name)
        with open(classifier_file, "rb") as f:
            return pickle.load(f)
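The persist()/load() pair is easiest to reason about as a round trip: whatever persist() writes, load() must be able to find and read back from the metadata alone. Below is a minimal stand-alone sketch using only the standard library. `TinyModel` and `MODEL_FILE_NAME` are hypothetical names, the class does not depend on Rasa, and unlike the component above it pickles only its state rather than the whole object.

```python
import os
import pickle
import tempfile
from typing import Any, Dict

MODEL_FILE_NAME = "tiny_model.pkl"  # hypothetical file name for this sketch


class TinyModel:
    """Hypothetical stand-in for a trained component."""

    def __init__(self, vocabulary: Dict[str, str]) -> None:
        self.vocabulary = vocabulary

    def persist(self, model_dir: str) -> Dict[str, Any]:
        """Write the state into model_dir and return metadata for load()."""
        with open(os.path.join(model_dir, MODEL_FILE_NAME), "wb") as f:
            pickle.dump(self.vocabulary, f, pickle.HIGHEST_PROTOCOL)
        return {"classifier_file": MODEL_FILE_NAME}

    @classmethod
    def load(cls, meta: Dict[str, Any], model_dir: str) -> "TinyModel":
        """Rebuild the object from the file named in the metadata."""
        with open(os.path.join(model_dir, meta["classifier_file"]), "rb") as f:
            return cls(pickle.load(f))


with tempfile.TemporaryDirectory() as model_dir:
    meta = TinyModel({"stupid": "neg"}).persist(model_dir)
    restored = TinyModel.load(meta, model_dir)

print(restored.vocabulary)
```

Storing only the file name in the metadata and joining it with the model directory at load time keeps the saved model relocatable.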
- Edit config.yml and add the custom component, in the form module_name.ClassName
language: en
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "sentiment.SentimentAnalyzer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
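The position of the custom component matters: it declares `requires = ["tokens"]`, so it has to come after a component that produces tokens. A minimal sketch of such an ordering check follows. The `DECLARATIONS` table is an illustrative assumption written for this example rather than read from Rasa, and `check_order` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Illustrative (assumed) declarations: component name -> (requires, provides).
DECLARATIONS: Dict[str, Tuple[List[str], List[str]]] = {
    "SpacyNLP": ([], ["spacy_doc"]),
    "SpacyTokenizer": (["spacy_doc"], ["tokens"]),
    "sentiment.SentimentAnalyzer": (["tokens"], ["entities"]),
}


def check_order(pipeline: List[str]) -> List[str]:
    """Return problems: requirements not provided by an earlier component."""
    available = set()
    problems = []
    for name in pipeline:
        requires, provides = DECLARATIONS[name]
        for needed in requires:
            if needed not in available:
                problems.append(f"{name} needs '{needed}' before it runs")
        available.update(provides)
    return problems


good = ["SpacyNLP", "SpacyTokenizer", "sentiment.SentimentAnalyzer"]
bad = ["SpacyNLP", "sentiment.SentimentAnalyzer", "SpacyTokenizer"]
print(check_order(good))
print(check_order(bad))
```

In the config above, the sentiment component sits directly after SpacyTokenizer, which satisfies this constraint.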
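- Test it: when a foul-mouthed user sends the "friendly" greeting below, the parse result includes a sentiment entity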
Hello stupid bot
"entities": [
  {
    "value": "neg",
    "confidence": 0.8181818181818182,
    "entity": "sentiment",
    "extractor": "sentiment_extractor"
  }
]
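Downstream code can pick the sentiment out of the entity list by filtering on the entity name. A minimal sketch, assuming the parse result is available as a Python dict with the `entities` list shown above. `extract_sentiment` is a hypothetical helper and the `text` key is included only for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def extract_sentiment(parse_result: Dict[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (label, confidence) of the first sentiment entity, or None."""
    for entity in parse_result.get("entities", []):
        if entity.get("entity") == "sentiment":
            return entity["value"], entity["confidence"]
    return None


parse_result = {
    "text": "Hello stupid bot",  # illustrative key, assumed for this sketch
    "entities": [
        {
            "value": "neg",
            "confidence": 0.8181818181818182,
            "entity": "sentiment",
            "extractor": "sentiment_extractor",
        }
    ],
}

print(extract_sentiment(parse_result))
```

Returning `None` when no sentiment entity is present lets the caller distinguish "no prediction" from a neutral prediction.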
Notes
Original tutorial: How to Enhance Rasa NLU Models with Custom Components. The author is good-looking too, don't you think?