Text Classification with Hugging Face Transformers

xuanningmeng

Category: NLP. Tags: tensorflow, deep learning, neural networks

Article link: https://blog.csdn.net/weixin_42223207/article/details/115409683

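Text classification

Text classification is a common task in practice, usually in the form of multi-class or multi-label classification. For multi-label classification, see the blog post https://blog.csdn.net/weixin_42223207/article/details/115036283. This article implements text classification with Hugging Face Transformers, using tensorflow==2.4.0 as the framework. It covers the following: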

Data processing
Model
Model training
Model prediction

Data processing

The text is tokenized character by character with a BERT tokenizer (BertTokenizer). The code is as follows:

import numpy as np


def create_inputs_targets(sentences, labels, max_len, tokenizer):
    """Build padded input_ids / attention_mask arrays and the label array."""
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        "labels": []
    }
    assert len(sentences) == len(labels)
    for i in range(len(sentences)):
        input_ids = []
        # Encode the sentence one character at a time
        for word in sentences[i]:
            ids = tokenizer.encode(word, add_special_tokens=False)
            # tokenizers' BertWordPieceTokenizer returns an Encoding with `.ids`;
            # transformers' BertTokenizer returns a plain list of ids
            input_ids.extend(getattr(ids, "ids", ids))

        # Truncate, then wrap the sentence in [CLS] (id 101) and [SEP] (id 102)
        input_ids = input_ids[:max_len - 2]
        input_ids = [101] + input_ids + [102]
        # Real tokens get mask 1, padding positions get mask 0
        attention_mask = [1] * len(input_ids)
        padding_len = max_len - len(input_ids)
        # [PAD] has id 0 in the vocab
        input_ids = input_ids + ([0] * padding_len)
        attention_mask = attention_mask + ([0] * padding_len)
        dataset_dict["input_ids"].append(input_ids)
        dataset_dict["attention_mask"].append(attention_mask)
        dataset_dict["labels"].append(labels[i])
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["attention_mask"],
    ]
    y = dataset_dict["labels"]
    return x, y

Only the input_ids and attention_mask from the BERT tokenizer output are used here.
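To make the truncation and padding step concrete, here is a minimal pure-Python sketch of what happens to a single sequence. The helper name pad_and_mask and the sample token ids are hypothetical and only serve as illustration; the special-token ids 101, 102 and 0 are the ones hard-coded in the function above.

```python
from typing import List, Tuple


def pad_and_mask(token_ids: List[int], max_len: int,
                 cls_id: int = 101, sep_id: int = 102,
                 pad_id: int = 0) -> Tuple[List[int], List[int]]:
    """Truncate, wrap in [CLS]/[SEP], then right-pad to max_len."""
    body = token_ids[:max_len - 2]            # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    padding_len = max_len - len(input_ids)
    input_ids += [pad_id] * padding_len
    attention_mask += [0] * padding_len       # 0 marks padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is truncated (made-up token ids).
short_ids, short_mask = pad_and_mask([7, 8, 9], max_len=8)
long_ids, long_mask = pad_and_mask(list(range(10, 20)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with exactly max_len entries, which is what allows the lists to be stacked into a rectangular NumPy array.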

Model

A BERT model is fine-tuned for the task. The code is as follows:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from transformers import TFBertModel


class BertTextClassifier(object):
    def __init__(self, bert_model_name, label_num):
        self.label_num = label_num
        self.bert_model_name = bert_model_name

    def get_model(self):
        bert = TFBertModel.from_pretrained(self.bert_model_name)
        input_ids = keras.Input(shape=(None,), dtype=tf.int32, name="input_ids")
        attention_mask = keras.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

        # Index 1 is the pooled [CLS] output of BERT
        outputs = bert(input_ids, attention_mask=attention_mask)[1]
        # Softmax head with one unit per class
        cla_outputs = layers.Dense(self.label_num, activation='softmax')(outputs)
        model = keras.Model(
            inputs=[input_ids, attention_mask],
            outputs=[cla_outputs])
        return model


def create_model(bert_model_name, label_nums):
    model = BertTextClassifier(bert_model_name, label_nums).get_model()
    # `learning_rate` replaces the deprecated `lr` argument
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
    # categorical_crossentropy expects one-hot encoded labels
    model.compile(optimizer=optimizer, loss='categorical_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall(),
                           tf.keras.metrics.AUC()])
    return model
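
The model is compiled with categorical_crossentropy, which expects each label as a one-hot vector rather than an integer class id. The post does not show how the labels are encoded, so the following is only a sketch under that assumption; the tag2id contents and the one_hot helper are hypothetical. In a TensorFlow project, tf.keras.utils.to_categorical performs the same conversion.

```python
from typing import List


def one_hot(label_ids: List[int], num_classes: int) -> List[List[float]]:
    """Turn integer class ids into one-hot rows."""
    rows = []
    for label_id in label_ids:
        if not 0 <= label_id < num_classes:
            raise ValueError(f"label id {label_id} outside [0, {num_classes})")
        row = [0.0] * num_classes
        row[label_id] = 1.0
        rows.append(row)
    return rows


# Hypothetical label mapping, for illustration only.
tag2id = {"sports": 0, "health": 1, "military": 2, "education": 3, "automobile": 4}
raw_labels = ["health", "sports", "automobile"]
train_y = one_hot([tag2id[tag] for tag in raw_labels], len(tag2id))
print(train_y)
```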

Model training

Training uses Keras, the high-level API of TensorFlow 2.x. The code is as follows:

import os

# args, tag2id, epoch, batch_size and the train/dev arrays are defined elsewhere in the training script
model = create_model(args["bert_model_name"], len(tag2id))
# model.summary()
model.fit(train_x,
          train_y,
          epochs=epoch,
          verbose=1,
          batch_size=batch_size,
          validation_data=(dev_x, dev_y),
          validation_batch_size=batch_size)

# Save the weights only (HDF5 format)
model_path = os.path.join(args["output_path"], "classification_model.h5")
model.save_weights(model_path, overwrite=True)

# Save the full model in SavedModel (pb) format for TensorFlow Serving
tf.keras.models.save_model(model, args["pb_path"],
                           save_format="tf",
                           overwrite=True)

Results on the Sogou dataset:

              precision    recall  f1-score   support

      sports       1.00      1.00      1.00       209
      health       0.94      0.98      0.96       180
    military       0.99      0.99      0.99       208
   education       0.98      0.94      0.96       197
  automobile       0.98      0.99      0.99       202

    accuracy                           0.98       996
   macro avg       0.98      0.98      0.98       996
weighted avg       0.98      0.98      0.98       996
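
The table has the layout of a per-class classification report, although the post does not show how it was generated. As a reminder of what each column means, here is a small pure-Python sketch that computes precision, recall, F1 and support per class. The function name per_class_report and the toy labels are hypothetical and unrelated to the Sogou results.

```python
from typing import Dict, List


def per_class_report(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, F1 and support for every label."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report


# Toy data: one "sports" example is misclassified as "health".
y_true = ["sports", "sports", "health", "health"]
y_pred = ["sports", "health", "health", "health"]
report = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, scores)
```

Support is the number of true examples of a class, which is why the support column of the table sums to the 996 evaluated examples.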

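Model prediction

First convert the data into the model's input format, that is, use the tokenizer to obtain the input_ids and attention_mask features. The code is as follows: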

import numpy as np


def create_infer_inputs(sentences, max_len, tokenizer):
    """Build padded input_ids / attention_mask arrays for a batch of sentences."""
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
    }
    for i in range(len(sentences)):
        input_ids = []
        # Encode the sentence one character at a time
        for word in sentences[i]:
            ids = tokenizer.encode(word, add_special_tokens=False)
            # tokenizers' BertWordPieceTokenizer returns an Encoding with `.ids`;
            # transformers' BertTokenizer returns a plain list of ids
            input_ids.extend(getattr(ids, "ids", ids))

        # Truncate, then wrap the sentence in [CLS] (id 101) and [SEP] (id 102)
        input_ids = input_ids[:max_len - 2]
        input_ids = [101] + input_ids + [102]
        # Real tokens get mask 1, padding positions get mask 0
        attention_mask = [1] * len(input_ids)
        padding_len = max_len - len(input_ids)
        # [PAD] has id 0 in the vocab
        input_ids = input_ids + ([0] * padding_len)
        attention_mask = attention_mask + ([0] * padding_len)
        dataset_dict["input_ids"].append(input_ids)
        dataset_dict["attention_mask"].append(attention_mask)
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["attention_mask"],
    ]

    return x

Prediction is exposed through a Flask service. The code is as follows:

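import json

import numpy as np
import requests
from flask import jsonify, request

# app, max_len, tokenizer and id2tag are assumed to be defined elsewhere in the service


@app.route("/classification", methods=['POST'])
def classification_predict():
    # json.loads no longer accepts an `encoding` argument (removed in Python 3.9)
    data = json.loads(request.get_data())
    sentence = data["context"]
    url = data["url"]
    # create_infer_inputs expects a batch, so wrap a single string in a list
    sentences = [sentence] if isinstance(sentence, str) else sentence
    input_ids, attention_mask = create_infer_inputs(sentences, max_len, tokenizer)
    print("input_ids: ", input_ids)
    print("attention_mask: ", attention_mask)
    # NumPy arrays are not JSON serializable, so convert them to lists
    data = json.dumps({"signature_name": "serving_default",
                       "inputs": {"input_ids": input_ids.tolist(),
                                  "attention_mask": attention_mask.tolist()}})
    headers = {"content-type": "application/json"}
    result = requests.post(url, data=data, headers=headers)
    print("result: ", result)
    if result.status_code == 200:
        result = json.loads(result.text)
        # The model ends in a softmax layer, so these are class probabilities
        logits = np.array(result["outputs"])
        pred = np.argmax(logits, axis=1).tolist()
        pred_label = id2tag[pred[0]]
        print(pred_label)
        return_result = {"code": 200,
                         "context": sentence,
                         "label": pred_label}
        return jsonify(return_result)
    else:
        # Report the upstream status code instead of a hard-coded 200
        return_result = {"code": result.status_code,
                         "context": sentence,
                         "label": None}
        return jsonify(return_result)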

Here url is the endpoint of the model service deployed with Docker and TensorFlow Serving. If you spot any problems, corrections are welcome.
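
For reference, the exchange with TensorFlow Serving can be sketched without a running server. The payload uses the columnar "inputs" form of the REST predict API, and the reply carries the result under "outputs". The helper names, the model name in the example URL, the fake response and the id2tag mapping below are all hypothetical; 8501 is the default REST port of TensorFlow Serving, but the actual deployment in the post is not shown.

```python
import json
from typing import Dict, List

# Hypothetical endpoint: TensorFlow Serving exposes
# /v1/models/<model_name>:predict on its REST port.
SERVING_URL = "http://localhost:8501/v1/models/classification:predict"


def build_predict_payload(input_ids: List[List[int]],
                          attention_mask: List[List[int]]) -> str:
    """Serialize one batch into the columnar request format."""
    return json.dumps({
        "signature_name": "serving_default",
        "inputs": {"input_ids": input_ids,
                   "attention_mask": attention_mask},
    })


def decode_prediction(response_text: str, id2tag: Dict[int, str]) -> str:
    """Pick the most probable class of the first example in the reply."""
    probs = json.loads(response_text)["outputs"][0]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2tag[best]


payload = build_predict_payload([[101, 7, 8, 102, 0]], [[1, 1, 1, 1, 0]])
# A made-up server reply standing in for requests.post(SERVING_URL, ...).text
fake_response = json.dumps({"outputs": [[0.05, 0.9, 0.05]]})
label = decode_prediction(fake_response, {0: "sports", 1: "health", 2: "military"})
print(label)
```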
