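Tutorial: Using the Sentiment Analysis Pre-trained Model SKEP
This project demonstrates how to use the sentiment analysis pre-trained model SKEP for sentence-level sentiment analysis, aspect-level sentiment analysis, and opinion extraction.
Using sentiment analysis as the running task, it also introduces traditional text classification models such as TextCNN, the pre-trained model SKEP, and how to use them in PaddleNLP.
The project has four parts: "Task introduction", "Common datasets", "The traditional sentiment analysis model TextCNN", and "The sentiment analysis pre-trained model SKEP".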
In [ ]
!pip install --upgrade paddlenlp
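Sentiment analysis task
Human language is rich in sentiment: it expresses emotions (such as sadness or joy), moods (such as fatigue or melancholy), preferences (such as liking or disliking), personality traits, and stances. Sentiment analysis is applied in scenarios such as product preference analysis, purchase decisions, and public opinion monitoring. Automatically analyzing these sentiment tendencies helps companies understand how consumers feel about their products, which provides a basis for product improvement, and it also helps them gauge the attitudes of business partners so that they can make better business decisions.
Sentiment analysis is usually treated as a three-class classification problem:
Positive: positive sentiment, such as happiness, joy, surprise, or anticipation.
Negative: negative sentiment, such as sadness, grief, anger, or fear.
Other: any other type of sentiment.
Sentiment analysis data
ChnSentiCorp is a public Chinese sentiment analysis dataset with two classes. It is built into PaddleNLP and can be loaded with a single call.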
In [ ]
from paddlenlp.datasets import load_dataset
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
idx = 0
for data in train_ds:
    print(data)
    idx += 1
    if idx >= 3:
        break
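The traditional sentiment classification model TextCNN
Traditional sentiment classification models use networks such as CNN, RNN, LSTM, or GRU to encode a text into a single vector. Recurrent networks such as RNN, LSTM, and GRU process tokens one step at a time and therefore cannot be parallelized across the sequence, whereas a CNN can, which makes it considerably faster and popular in industry. In 2014, Yoon Kim proposed the TextCNN network for text classification and obtained good results. Not every part of a text depends on every other part, so n-gram information can be used to capture local correlations. A CNN works in the same way: convolution kernels capture local features of the text, and several kernels of different widths capture n-grams of different lengths.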
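To make the n-gram idea concrete, here is a minimal pure-Python sketch of what a convolution over n-gram windows followed by max pooling computes. It is not the `CNNEncoder` implementation; the function name `ngram_conv_max_pool` and the toy sizes are assumptions chosen for illustration.

```python
import random

def ngram_conv_max_pool(embedded, filters):
    """Apply each n-gram filter to every window and keep its largest response.

    embedded: list of token vectors, shape (num_tokens, emb_dim)
    filters:  list of filters; a filter of width n is a list of n vectors of size emb_dim
    """
    pooled = []
    for filt in filters:
        n = len(filt)
        responses = []
        # A filter of width n is applied num_tokens - n + 1 times.
        for start in range(len(embedded) - n + 1):
            window = embedded[start:start + n]
            score = sum(w * x
                        for w_vec, x_vec in zip(filt, window)
                        for w, x in zip(w_vec, x_vec))
            responses.append(score)
        pooled.append(max(responses))  # max pooling over all window positions
    return pooled

random.seed(0)
emb_dim, num_tokens = 4, 6
embedded = [[random.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(num_tokens)]
# Two filters for each n-gram size 1, 2, 3 (a toy stand-in for num_filter=128).
filters = [[[random.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(n)]
           for n in (1, 2, 3) for _ in range(2)]
features = ngram_conv_max_pool(embedded, filters)
print(len(features))  # 6 = number of n-gram sizes * filters per size
```

The resulting feature vector has one entry per filter, which is why the encoder output in the model below has size `len(ngram_filter_sizes) * num_filter`.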
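PaddleNLP provides the sequence modeling module paddlenlp.seq2vec, which encodes a text into a vector that carries its semantics.
For more on the seq2vec module, see "What is paddlenlp.seq2vec? See how to use it for sentiment analysis": https://aistudio.baidu.com/aistudio/projectdetail/1283423
Next, let's see how to implement the TextCNN model:
paddle.nn.Embedding builds the word-embedding layer
paddlenlp.seq2vec.CNNEncoder builds the sentence encoding layer
paddle.nn.Linear builds the binary classifier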
In [ ]
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddlenlp as nlp
class TextCNNModel(nn.Layer):
    """
    This class implements the Text Convolution Neural Network model.
    At a high level, the model starts by embedding the tokens and running them through
    a word embedding. Then, we encode these representations with a `CNNEncoder`.
    The CNN has one convolution layer for each ngram filter size. Each convolution operation gives
    out a vector of size num_filter. The number of times a convolution layer will be used
    is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these
    outputs from the convolution layer and outputs the max.
    Lastly, we take the output of the encoder to create a final representation,
    which is passed through some feed-forward layers to output logits (`output_layer`).
    """
    def __init__(self,
                 vocab_size,
                 num_classes,
                 emb_dim=128,
                 padding_idx=0,
                 num_filter=128,
                 ngram_filter_sizes=(1, 2, 3),
                 fc_hidden_size=96):
        super().__init__()
        self.embedder = nn.Embedding(
            vocab_size, emb_dim, padding_idx=padding_idx)
        self.encoder = nlp.seq2vec.CNNEncoder(
            emb_dim=emb_dim,
            num_filter=num_filter,
            ngram_filter_sizes=ngram_filter_sizes)
        self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size)
        self.output_layer = nn.Linear(fc_hidden_size, num_classes)
    def forward(self, text):
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)
        # Shape: (batch_size, len(ngram_filter_sizes)*num_filter)
        encoder_out = self.encoder(embedded_text)
        encoder_out = paddle.tanh(encoder_out)
        # Shape: (batch_size, fc_hidden_size)
        fc_out = paddle.tanh(self.fc(encoder_out))
        # Shape: (batch_size, num_classes)
        logits = self.output_layer(fc_out)
        return logits
# NOTE: `vocab` is created in the data-processing cell further below; run that cell before this one.
model = TextCNNModel(
    len(vocab.idx_to_token),
    len(train_ds.label_list),
    padding_idx=vocab.to_indices('[PAD]'))
# Wrap the network with the high-level paddle.Model API.
model = paddle.Model(model)
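Building the vocabulary
TextCNN takes words as input, so the text must first be segmented into words.
We start by building a vocabulary over the whole corpus: segment the text, count word frequencies, and drop low-frequency words. We use jieba for Chinese word segmentation.
The stop-word list is taken directly from the web: https://github.com/goto456/stopwords/blob/master/baidu_stopwords.txt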
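The counting and filtering steps can be sketched on a toy corpus as follows. The helper name `build_vocab`, the toy sentences, and the way the stop-word set is applied are assumptions for illustration; the notebook cell that follows filters by frequency only.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=5, stopwords=frozenset()):
    """Return a token -> id dict: [UNK] and [PAD] first, then frequent non-stop-word tokens."""
    counts = Counter(tok for toks in tokenized_texts for tok in toks)
    vocab = {"[UNK]": 0, "[PAD]": 1}
    for token, freq in counts.most_common():
        # Skip low-frequency tokens and (hypothetically) stop words.
        if freq < min_freq or token in stopwords:
            continue
        vocab[token] = len(vocab)
    return vocab

# Toy, already-segmented corpus (hypothetical data).
toy_corpus = [["酒店", "很", "好"],
              ["酒店", "的", "服务", "很", "差"],
              ["酒店", "的", "位置", "好"]]
toy_vocab = build_vocab(toy_corpus, min_freq=2, stopwords={"的"})
print(toy_vocab)
```

Placing `[UNK]` and `[PAD]` at fixed ids mirrors the vocabulary file written below, where they occupy the first two lines.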
In [ ]
import os
from collections import Counter
from itertools import chain
import jieba
def sort_and_write_words(all_words, file_path):
    words = list(chain(*all_words))
    words_vocab = Counter(words).most_common()
    with open(file_path, "w", encoding="utf8") as f:
        f.write('[UNK]\n[PAD]\n')
        # Skip low-frequency words (count < 5).
        for word, num in words_vocab:
            if num < 5:
                continue
            f.write(word + "\n")
all_texts = [data['text'] for data in train_ds]
all_texts += [data['text'] for data in dev_ds]
all_texts += [data['text'] for data in test_ds]
all_words = []
for text in all_texts:
    words = jieba.lcut(text)
    words = [word for word in words if word.strip() != '']
    all_words.append(words)
# Write the vocabulary file (create the output directory if it does not exist yet).
os.makedirs("work", exist_ok=True)
sort_and_write_words(all_words, "work/vocab.txt")
In [ ]
# Vocabulary size
!wc -l work/vocab.txt
# Stop-word list size
!wc -l work/stopwords.txt
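The data also needs the following processing:
Convert the raw data into a format the model can read: segment the text with jieba, then map each resulting word to its id in the vocabulary.
Use the paddle.io.DataLoader interface to load the data asynchronously.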
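The two steps above, mapping words to ids and padding a batch to a common length, can be sketched without any library. The helper names `tokens_to_ids` and `pad_batch` and the toy vocabulary are assumptions for illustration; in the notebook this work is done by `JiebaTokenizer` and `Pad`.

```python
def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its vocabulary id, falling back to the [UNK] id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]

def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

toy_vocab = {"[UNK]": 0, "[PAD]": 1, "酒店": 2, "很": 3, "好": 4}  # hypothetical vocabulary
batch = [tokens_to_ids(["酒店", "很", "好"], toy_vocab),
         tokens_to_ids(["很", "差"], toy_vocab)]  # "差" is out of vocabulary
padded = pad_batch(batch, pad_id=toy_vocab["[PAD]"])
print(padded)  # [[2, 3, 4], [3, 0, 1]]
```

Padding per batch rather than to a global maximum keeps short batches small, which is the behavior the `batchify_fn` below relies on.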
In [ ]
from functools import partial
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from utils import create_dataloader, convert_example
vocab = Vocab.load_vocabulary(
    "work/vocab.txt", unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)
trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)
# Batch the examples so that the model can process them together.
# Every sentence in a batch is padded with the [PAD] id to the length of the
# longest sentence in that batch (batch_max_seq_len).
batch_size = 64
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 1)),  # word_ids
    Stack(dtype="int64")  # label
): [data for data in fn(samples)]
train_loader = create_dataloader(
    train_ds,
    trans_fn=trans_fn,
    batch_size=batch_size,
    mode='train',
    batchify_fn=batchify_fn)
dev_loader = create_dataloader(
    dev_ds,
    trans_fn=trans_fn,
    batch_size=batch_size,
    mode='validation',
    batchify_fn=batchify_fn)
Training the TextCNN model
With the data prepared, we still need to define the optimizer and the loss function. Accuracy is used as the evaluation metric.
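As a reminder of what the loss and the metric measure, here is a small pure-Python sketch of cross-entropy and accuracy computed from logits. The function names and the toy logits are assumptions for illustration; the notebook itself uses `paddle.nn.CrossEntropyLoss` and `paddle.metric.Accuracy`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    preds = [max(range(len(logits)), key=logits.__getitem__) for logits in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

toy_logits = [[2.0, 0.5], [0.1, 1.2], [1.5, 1.0]]  # hypothetical model outputs
toy_labels = [0, 1, 1]
print(cross_entropy(toy_logits, toy_labels), accuracy(toy_logits, toy_labels))
```

The third toy example is misclassified, so the accuracy is two thirds while the loss still reflects how confident each prediction was.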
In [ ]
# Define the optimizer, loss, and evaluation metric.
optimizer = paddle.optimizer.Adam(
    parameters=model.parameters(), learning_rate=5e-5)
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
model.prepare(optimizer, criterion, metric)
# Train and evaluate.
model.fit(train_loader, dev_loader, epochs=5, save_dir='./textcnn_ckpt')
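The sentiment analysis pre-trained model SKEP
In recent years, a large body of research has shown that pre-trained models (PTMs) trained on large corpora learn general-purpose language representations that benefit downstream NLP tasks and avoid training models from scratch. With growing computing power, the emergence of deep architectures such as the Transformer, and improved training techniques, PTMs have steadily evolved from shallow to deep.
SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) is a sentiment pre-training algorithm proposed by a Baidu research team. It uses an unsupervised method to mine sentiment knowledge automatically and then uses that knowledge to build pre-training objectives, so that the model learns to understand sentiment semantics. SKEP surpassed the previous state of the art on 14 typical Chinese and English sentiment analysis tasks, and the work was accepted at ACL 2020. SKEP provides a unified and powerful sentiment representation for a wide range of sentiment analysis tasks.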
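To give a rough feel for what "building a pre-training objective from sentiment knowledge" can mean, the toy sketch below masks the tokens that appear in a small sentiment lexicon and records them as prediction targets. This is only an illustration under stated assumptions: the lexicon, the function name `mask_sentiment_tokens`, and the masking rule are hypothetical, and the actual SKEP objectives are defined in the paper.

```python
def mask_sentiment_tokens(tokens, sentiment_lexicon, mask_token="[MASK]"):
    """Replace lexicon tokens with a mask token and return them as (position, token) targets."""
    masked, targets = [], []
    for position, token in enumerate(tokens):
        if token in sentiment_lexicon:
            masked.append(mask_token)
            targets.append((position, token))
        else:
            masked.append(token)
    return masked, targets

toy_lexicon = {"good", "terrible"}  # hypothetical sentiment words
masked, targets = mask_sentiment_tokens(["the", "food", "was", "good"], toy_lexicon)
print(masked)   # ['the', 'food', 'was', '[MASK]']
print(targets)  # [(3, 'good')]
```

A model trained to recover the masked words has to attend to sentiment-bearing context, which is the intuition behind using sentiment knowledge during pre-training.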
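Paper: https://arxiv.org/abs/2005.05635
The Baidu research team further validated SKEP on 14 Chinese and English datasets covering three typical sentiment analysis tasks: sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling.
The experiments show that, when initialized from the general-purpose pre-trained model ERNIE (an internal version), SKEP improves on ERNIE by about 1.2% on average and on the previous SOTA by about 2% on average. The detailed results are given in the following table:
[Table: detailed SKEP results on the 14 datasets]
Taking the sentence-level sentiment classification dataset ChnSentiCorp used earlier as an example again, let's see how SKEP performs.
Loading the SKEP model
PaddleNLP already implements the SKEP pre-trained model, which can be loaded with a single line of code.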
In [ ]
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# ChnSentiCorp has two classes; len(train_ds.label_list) can be used instead of the literal 2.
model = SkepForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=2)
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")
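SkepForSequenceClassification can be used for both sentence-level and aspect-level sentiment analysis. It obtains a representation of the input text from the pre-trained SKEP model and then classifies that representation.
pretrained_model_name_or_path: the model name. Supported values are "skep_ernie_1.0_large_ch", "skep_ernie_2.0_large_en", and "skep_roberta_large_en".
"skep_ernie_1.0_large_ch": a Chinese pre-trained model obtained by continuing SKEP pre-training from ernie_1.0_large_ch on large-scale Chinese data;
"skep_ernie_2.0_large_en": an English pre-trained model obtained by continuing SKEP pre-training from ernie_2.0_large_en on large-scale English data;
"skep_roberta_large_en": an English pre-trained model obtained by continuing SKEP pre-training from roberta_large_en on large-scale English data;
num_classes: the number of classes in the dataset.
For details of the SKEP implementation, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep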
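Data processing
As before, the raw ChnSentiCorp data must be converted into a format the model can read.
SKEP processes Chinese text at the character level. PaddleNLP's built-in SkepTokenizer handles this in a single call.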
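The character-level input format for a single sentence, `[CLS] X [SEP]`, can be illustrated with a toy encoder. This is a sketch, not SkepTokenizer: the function name `encode_single` and the tiny vocabulary are assumptions, and the real tokenizer applies its own vocabulary and text normalization.

```python
def encode_single(text, vocab, max_seq_len=128, unk_token="[UNK]"):
    """Split text into characters, add [CLS]/[SEP], and map tokens to ids."""
    chars = list(text)[:max_seq_len - 2]  # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens]
    token_type_ids = [0] * len(tokens)  # a single sequence uses segment id 0 throughout
    return tokens, input_ids, token_type_ids

# Hypothetical vocabulary; "吃" is deliberately missing to show the [UNK] fallback.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3, "很": 4, "好": 5}
tokens, input_ids, token_type_ids = encode_single("很好吃", toy_vocab)
print(tokens)          # ['[CLS]', '很', '好', '吃', '[SEP]']
print(input_ids)       # [1, 4, 5, 3, 2]
print(token_type_ids)  # [0, 0, 0, 0, 0]
```

The `input_ids` and `token_type_ids` produced here correspond to the two fields that the conversion function below reads from the tokenizer output.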
In [ ]
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed
    to be used in a sequence-pair classification task.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = example["label"]
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
batch_size = 32
max_seq_length = 128
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
dev_data_loader = create_dataloader(
    dev_ds,
    mode='dev',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
Model training and evaluation
Once the loss function, optimizer, and evaluation metric are defined, training can begin.
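The training cell applies weight decay to every parameter except bias and LayerNorm parameters, selecting them by name. The selection rule can be tried out on its own with a few made-up names; `select_decay_params` and the parameter names below are assumptions for illustration, not names taken from the SKEP model.

```python
def select_decay_params(param_names, no_decay_keywords=("bias", "norm")):
    """Keep only the names that contain none of the no-decay keywords."""
    return [name for name in param_names
            if not any(keyword in name for keyword in no_decay_keywords)]

toy_names = [  # hypothetical parameter names
    "encoder.layers.0.self_attn.q_proj.weight",
    "encoder.layers.0.self_attn.q_proj.bias",
    "encoder.layers.0.norm1.weight",
    "classifier.weight",
    "classifier.bias",
]
print(select_decay_params(toy_names))
# ['encoder.layers.0.self_attn.q_proj.weight', 'classifier.weight']
```

Matching on substrings is simple but depends on the naming convention of the model, so it is worth printing the selected names once when adapting the code to another network.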
In [13]
import time
from utils import evaluate
epochs = 1
ckpt_dir = "skep_ckpt"
num_training_steps = len(train_data_loader) * epochs
# Apply weight decay to all parameters except bias and LayerNorm parameters.
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
optimizer = paddle.optimizer.AdamW(
    learning_rate=3e-6,
    parameters=model.parameters(),
    weight_decay=0.01,
    apply_decay_param_fun=lambda x: x in decay_params)
criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                   10 / (time.time() - tic_train)))
            tic_train = time.time()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        if global_step % 100 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            evaluate(model, criterion, metric, dev_data_loader)
            model.save_pretrained(save_dir)
            tokenizer.save_pretrained(save_dir)
Model prediction
The trained model can also be used to predict the sentiment of new text.
In [ ]
from utils import predict
data = [
    '这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般',
    '怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片',
    '作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。',
]
label_map = {0: 'negative', 1: 'positive'}
results = predict(
    model, data, tokenizer, label_map, batch_size, max_seq_length)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
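Aspect-level sentiment analysis
Beyond classifying the sentiment of a whole sentence, researchers also analyze sentiment at a finer granularity, taking a specific "aspect" mentioned in the sentence as the target (aspect-level). For example:
"这个薯片口味有点咸,太辣了,不过口感很脆。" (These chips taste a bit salty and too spicy, but the texture is very crispy.)
The flavor of the chips receives a negative evaluation (salty, too spicy), whereas the texture receives a positive one (very crispy).
"我很喜欢夏威夷,就是这边的海鲜太贵了。" (I really like Hawaii, but the seafood here is too expensive.)
Hawaii receives a positive evaluation (like), whereas Hawaii's seafood receives a negative one (too expensive).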
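The train_aspect script listed later in this document passes both `text` and `text_pair` to the tokenizer, which yields the pair format `[CLS] A [SEP] B [SEP]`. The toy sketch below shows how such a pair and its segment ids can be laid out; the function name `encode_pair`, the choice of putting the aspect first, and truncating only the second segment are assumptions for illustration.

```python
def encode_pair(first, second, max_seq_len=128):
    """Build character-level tokens and segment ids for a pair of texts."""
    a, b = list(first), list(second)
    # Simplification: only the second segment is truncated to fit the length budget.
    budget = max_seq_len - 3 - len(a)  # 3 special tokens: [CLS], [SEP], [SEP]
    b = b[:max(budget, 0)]
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, token_type_ids

# Hypothetical input: the aspect "口感" (texture) paired with a review sentence.
tokens, token_type_ids = encode_pair("口感", "薯片口感很脆")
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Feeding the aspect and the sentence as two segments lets the same sequence classification head be reused for aspect-level predictions.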
SKEP also supports aspect-level sentiment analysis. Run the following commands to train the model and make predictions.
In [ ]
# Train the aspect-level sentiment analysis model
!python train_aspect.py --save_dir skep_aspect
In [ ]
# Predict with the aspect-level sentiment analysis model
!python predict_aspect.py --params_path skep_aspect/model_900/model_state.pdparams
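Opinion extraction
Given a user review, extract the opinion triples it expresses (aspect term, opinion term, sentiment polarity).
Example: "这家旅店服务还是不错的,但是房间比较简陋" (The service at this hotel is quite good, but the rooms are rather shabby)
Opinion 1: <service (服务), good (不错), positive>
Opinion 2: <rooms (房间), shabby (简陋), negative>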
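Extraction tasks like this are commonly framed as sequence labeling, and the train_opinion script below uses `no_entity_label="O"` and `ChunkEvaluator`, which is consistent with that framing. The sketch below decodes BIO-style tags into labeled text chunks; the tag names `ASPECT` and `OPINION` and the function `decode_bio` are assumptions for illustration, not the label set of the actual dataset.

```python
def decode_bio(chars, tags):
    """Group characters into (label, text) chunks according to BIO tags."""
    chunks, current_label, current_chars = [], None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new chunk starts; close the previous one if there is any.
            if current_label:
                chunks.append((current_label, "".join(current_chars)))
            current_label, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_chars.append(ch)
        else:
            # "O" or an inconsistent I- tag ends the current chunk.
            if current_label:
                chunks.append((current_label, "".join(current_chars)))
            current_label, current_chars = None, []
    if current_label:
        chunks.append((current_label, "".join(current_chars)))
    return chunks

text = "服务不错"
tags = ["B-ASPECT", "I-ASPECT", "B-OPINION", "I-OPINION"]  # hypothetical tag names
print(decode_bio(list(text), tags))  # [('ASPECT', '服务'), ('OPINION', '不错')]
```

Pairing each decoded aspect with its opinion term and a polarity label would then give triples in the form shown in the example above.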
In [ ]
# Train the opinion extraction model
!python train_opinion.py --save_dir skep_opinion
In [ ]
# Predict with the opinion extraction model
!python predict_opinion.py --params_path skep_opinion/model_900/model_state.pdparams
utils
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Pad, Tuple
def read_vocab(vocab_path):
    vocab = {}
    with open(vocab_path, "r", encoding="utf8") as f:
        for idx, line in enumerate(f):
            word = line.strip("\n")
            vocab[word] = idx
    return vocab
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None):
    """
    Creates a dataloader.
    Args:
        dataset(obj:`paddle.io.Dataset`): Dataset instance.
        trans_fn(obj:`callable`, optional, defaults to `None`): Function that converts a data sample to input ids, etc.
        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', the dataset is shuffled randomly.
        batch_size(obj:`int`, optional, defaults to 1): The number of samples in a mini-batch.
        batchify_fn(obj:`callable`, optional, defaults to `None`): Function that merges a list of samples into
            a mini-batch. If `None`, each field of the samples is stacked along axis 0
            (same as :attr::`np.stack(..., axis=0)`).
    Returns:
        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
    """
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = mode == 'train'
    if mode == "train":
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset, batch_sampler=sampler, collate_fn=batchify_fn)
    return dataloader
def convert_example(example, tokenizer, is_test=False):
    """
    Builds model inputs from a sequence for sequence classification tasks.
    It uses `jieba.cut` to tokenize text.
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
    Returns:
        input_ids(obj:`numpy.array`, data type of int64): The token ids.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    input_ids = tokenizer.encode(example["text"])
    input_ids = np.array(input_ids, dtype='int64')
    if not is_test:
        label = np.array(example["label"], dtype="int64")
        return input_ids, label
    else:
        return input_ids
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    """
    Given a dataset, it evals model and computes the metric.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        criterion(obj:`paddle.nn.Layer`): It can compute the loss.
        metric(obj:`paddle.metric.Metric`): The evaluation metric.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
    """
    model.eval()
    metric.reset()
    losses = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        logits = model(input_ids, token_type_ids)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
        accu = metric.accumulate()
    print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu))
    model.train()
    metric.reset()
@paddle.no_grad()
def predict(model, data, tokenizer, label_map, batch_size=1, max_seq_length=128):
    """
    Predicts a label for each raw text in `data` with a SKEP sequence classification model.
    """
    examples = []
    for text in data:
        # `data` holds raw strings, so tokenize them directly: the jieba-based
        # `convert_example` above does not accept `max_seq_length`.
        encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length)
        examples.append(
            (encoded_inputs["input_ids"], encoded_inputs["token_type_ids"]))
    # Separate the data into batches.
    batches = [
        examples[idx:idx + batch_size]
        for idx in range(0, len(examples), batch_size)
    ]
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token type ids
    ): [data for data in fn(samples)]
    results = []
    model.eval()
    for batch in batches:
        input_ids, token_type_ids = batchify_fn(batch)
        input_ids = paddle.to_tensor(input_ids)
        token_type_ids = paddle.to_tensor(token_type_ids)
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
train_aspect
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
import argparse
import os
import random
import time
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
parser.add_argument("--max_seq_length", default=400, type=int, help="The maximum total input sequence length after tokenization. "
"Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--learning_rate", default=3e-6, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--epochs", default=50, type=int, help="Total number of training epochs to perform.")
parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
args = parser.parse_args()
# yapf: enable
def set_seed(seed):
    """Sets random seed."""
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed
    to be used in a sequence-pair classification task.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Note: the skep_roberta_large_en model does not need token type ids.
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name(obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
if __name__ == "__main__":
    set_seed(args.seed)
    paddle.set_device(args.device)
    rank = paddle.distributed.get_rank()
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()
    train_ds = load_dataset("seabsa16", "phns", splits=["train"])
    model = SkepForSequenceClassification.from_pretrained(
        'skep_ernie_1.0_large_ch', num_classes=len(train_ds.label_list))
    tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
        Stack(dtype="int64")  # labels
    ): [data for data in fn(samples)]
    train_data_loader = create_dataloader(
        train_ds,
        mode='train',
        batch_size=args.batch_size,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)
    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
        state_dict = paddle.load(args.init_from_ckpt)
        model.set_dict(state_dict)
    model = paddle.DataParallel(model)
    num_training_steps = len(train_data_loader) * args.epochs
    # Generate parameter names needed to perform weight decay.
    # All bias and LayerNorm parameters are excluded.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=args.learning_rate,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params)
    criterion = paddle.nn.loss.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    global_step = 0
    tic_train = time.time()
    for epoch in range(1, args.epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):
            input_ids, token_type_ids, labels = batch
            logits = model(input_ids, token_type_ids)
            loss = criterion(logits, labels)
            probs = F.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()
            global_step += 1
            if global_step % 10 == 0 and rank == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, loss, acc,
                       10 / (time.time() - tic_train)))
                tic_train = time.time()
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()
            if global_step % 100 == 0 and rank == 0:
                save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
                if not os.path.exists(save_dir):
                    os.makedirs(save_dir)
                # Need better way to get inner model of DataParallel
                model._layers.save_pretrained(save_dir)
                tokenizer.save_pretrained(save_dir)
predict_aspect
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
import argparse
import os
import random
import time
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
parser.add_argument("--max_seq_length", default=400, type=int, help="The maximum total input sequence length after tokenization. "
"Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=6, type=int, help="Batch size per GPU/CPU for prediction.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
args = parser.parse_args()
# yapf: enable
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.
    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results
def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed
    to be used in a sequence-pair classification task.
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    Note: the skep_roberta_large_en model does not need token type ids.
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name(obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
if __name__ == "__main__":
    test_ds = load_dataset("seabsa16", "phns", splits=["test"])
    label_map = {0: 'negative', 1: 'positive'}
    model = SkepForSequenceClassification.from_pretrained(
        'skep_ernie_1.0_large_ch', num_classes=len(label_map))
    tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')
    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        is_test=True)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    ): [data for data in fn(samples)]
    test_data_loader = create_dataloader(
        test_ds,
        mode='test',
        batch_size=args.batch_size,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)
    if args.params_path and os.path.isfile(args.params_path):
        state_dict = paddle.load(args.params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.params_path)
    results = predict(model, test_data_loader, label_map)
    for idx, text in enumerate(test_ds.data):
        print('Data: {} \t Label: {}'.format(text, results[idx]))
train_opinion
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from functools import partial
import argparse
import os
import random
import time
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.metrics import ChunkEvaluator
from paddlenlp.transformers import SkepCrfForTokenClassification, SkepModel, SkepTokenizer
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.")
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
"Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.")
parser.add_argument("--learning_rate", default=5e-7, type=float, help="The initial learning rate for Adam.")
parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.")
parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
args = parser.parse_args()
# yapf: enable
def set_seed(seed):
    """Sets random seed."""
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)
def convert_example_to_feature(example,
                               tokenizer,
                               max_seq_len=512,
                               no_entity_label="O",
                               is_test=False):
    """
    Builds model inputs for a token classification (sequence labeling) task from a
    pre-tokenized example by adding special tokens and, for labeled data, aligning
    the label sequence with the resulting token ids.

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    Only a single sequence is encoded here, so the mask contains only 0s.

    Args:
        example(obj:`dict`): One input example with a `tokens` list and, unless `is_test` is True, a `labels` list.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        no_entity_label(obj:`int` or `str`, defaults to "O"): The label for tokens outside any entity, also assigned
            to `[CLS]` and `[SEP]`. Pass the integer label id, as this script does, because the labels are cast to int64.
        is_test(obj:`bool`, defaults to `False`): Whether the example is unlabeled test data.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        seq_len(obj:`int`): The number of token ids, including the special tokens.
        label(obj:`numpy.array`, optional): The encoded labels if not test data.
    """
    tokens = example['tokens']
    tokenized_input = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_len)

    input_ids = tokenized_input['input_ids']
    token_type_ids = tokenized_input['token_type_ids']
    seq_len = tokenized_input['seq_len']

    if is_test:
        return input_ids, token_type_ids, seq_len

    # Read labels only for labeled data; test examples may not carry them.
    # Two positions are reserved for [CLS] and [SEP], labeled as non-entity.
    labels = example['labels'][:(max_seq_len - 2)]
    encoded_label = np.array(
        [no_entity_label] + labels + [no_entity_label], dtype="int64")
    return input_ids, token_type_ids, seq_len, encoded_label
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
if __name__ == "__main__":
    paddle.set_device(args.device)
    rank = paddle.distributed.get_rank()
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()

    train_ds = load_dataset("cote", "dp", splits=['train'])
    # The COTE_DP dataset labels with "BIO" schema.
    label_map = {label: idx for idx, label in enumerate(train_ds.label_list)}
    # `no_entity_label` represents that the token isn't an entity.
    no_entity_label_idx = label_map.get("O", 2)
    # `ignore_label` is used to pad input labels.
    ignore_label = -1

    set_seed(args.seed)
    skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch')
    model = SkepCrfForTokenClassification(
        skep, num_classes=len(train_ds.label_list))
    tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

    trans_func = partial(
        convert_example_to_feature,
        tokenizer=tokenizer,
        max_seq_len=args.max_seq_length,
        no_entity_label=no_entity_label_idx,
        is_test=False)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token type ids
        Stack(dtype='int64'),  # sequence lens
        Pad(axis=0, pad_val=ignore_label)  # labels
    ): [data for data in fn(samples)]
    train_data_loader = create_dataloader(
        train_ds,
        mode='train',
        batch_size=args.batch_size,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)

    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
        state_dict = paddle.load(args.init_from_ckpt)
        model.set_dict(state_dict)
    model = paddle.DataParallel(model)

    num_training_steps = len(train_data_loader) * args.epochs

    # Generate parameter names needed to perform weight decay.
    # All bias and LayerNorm parameters are excluded.
    decay_params = [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ]
    optimizer = paddle.optimizer.AdamW(
        learning_rate=args.learning_rate,
        parameters=model.parameters(),
        weight_decay=args.weight_decay,
        apply_decay_param_fun=lambda x: x in decay_params)

    metric = ChunkEvaluator(label_list=train_ds.label_list, suffix=True)

    global_step = 0
    tic_train = time.time()
    for epoch in range(1, args.epochs + 1):
        for step, batch in enumerate(train_data_loader, start=1):
            input_ids, token_type_ids, seq_lens, labels = batch
            loss = model(
                input_ids, token_type_ids, seq_lens=seq_lens, labels=labels)
            avg_loss = paddle.mean(loss)

            global_step += 1
            # Log the average loss and the training speed every 10 steps.
            if global_step % 10 == 0 and rank == 0:
                print(
                    "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
                    % (global_step, epoch, step, avg_loss,
                       10 / (time.time() - tic_train)))
                tic_train = time.time()
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()

            # Save a checkpoint every 100 steps on the main process only.
            if global_step % 100 == 0 and rank == 0:
                save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
                if not os.path.exists(save_dir):
                    os.makedirs(save_dir)
                file_name = os.path.join(save_dir, "model_state.pdparam")
                # Need better way to get inner model of DataParallel
                paddle.save(model._layers.state_dict(), file_name)
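In `convert_example_to_feature`, the label sequence is truncated to leave room for `[CLS]` and `[SEP]`, and those two positions receive the non-entity label. Below is a minimal sketch of that alignment in plain Python. `align_labels` is a hypothetical helper name, and the label ids `B=0, I=1, O=2` are assumed for illustration.

```python
def align_labels(labels, max_seq_len, no_entity_id):
    """Truncate labels and add the labels for [CLS] and [SEP]."""
    kept = labels[:max_seq_len - 2]  # reserve two positions for special tokens
    return [no_entity_id] + kept + [no_entity_id]


B, I, O = 0, 1, 2  # label ids assumed for illustration

# Five tokens, with a two-token opinion target at the start.
labels = [B, I, O, O, O]

print(align_labels(labels, max_seq_len=128, no_entity_id=O))
# [2, 0, 1, 2, 2, 2, 2]

# With max_seq_len=5 only three token labels fit between [CLS] and [SEP].
print(align_labels(labels, max_seq_len=5, no_entity_id=O))
# [2, 0, 1, 2, 2]
```

Provided the tokenizer emits one id per input token, the aligned labels have the same length as `input_ids`, so the two can be compared position by position.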
predict_opinion
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from functools import partial
import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import SkepCrfForTokenClassification, SkepModel, SkepTokenizer
# yapf: disable
parser = argparse.ArgumentParser()
parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.")
parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. "
"Sequences longer than this will be truncated, sequences shorter will be padded.")
parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for prediction.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to run prediction on, defaults to gpu.")
args = parser.parse_args()
# yapf: enable
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    """
    Builds model inputs for a token classification (sequence labeling) task from a
    pre-tokenized example by adding special tokens. No labels are read, so this
    also works on unlabeled test data.

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    Only a single sequence is encoded here, so the mask contains only 0s.

    Args:
        example(obj:`dict`): One input example; only its `tokens` list is used.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer`
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization.
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`bool`, defaults to `False`): Not used by this function.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        seq_len(obj:`int`): The number of token ids, including the special tokens.
    """
    tokens = example["tokens"]
    encoded_inputs = tokenizer(
        tokens,
        return_length=True,
        is_split_into_words=True,
        max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    seq_len = encoded_inputs["seq_len"]
    return input_ids, token_type_ids, seq_len
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for input_ids, token_type_ids, seq_lens in data_loader:
        preds = model(input_ids, token_type_ids, seq_lens=seq_lens)
        tags = parse_predict_result(preds.numpy(), seq_lens.numpy(), label_map)
        results.extend(tags)
    return results
def parse_predict_result(predictions, seq_lens, label_map):
    """
    Parses the prediction results to the label tag.
    """
    pred_tag = []
    for idx, pred in enumerate(predictions):
        seq_len = seq_lens[idx]
        # drop the "[CLS]" and "[SEP]" token
        tag = [label_map[i] for i in pred[1:seq_len - 1]]
        pred_tag.append(tag)
    return pred_tag
def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = mode == 'train'
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)
if __name__ == "__main__":
    paddle.set_device(args.device)

    test_ds = load_dataset("cote", "dp", splits=['test'])
    # The COTE_DP dataset labels with "BIO" schema.
    label_map = {0: "B", 1: "I", 2: "O"}
    # `no_entity_label` represents that the token isn't an entity.
    no_entity_label_idx = 2

    skep = SkepModel.from_pretrained('skep_ernie_1.0_large_ch')
    model = SkepCrfForTokenClassification(
        skep, num_classes=len(test_ds.label_list))
    tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

    if args.params_path and os.path.isfile(args.params_path):
        state_dict = paddle.load(args.params_path)
        model.set_dict(state_dict)
        print("Loaded parameters from %s" % args.params_path)

    trans_func = partial(
        convert_example,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=tokenizer.vocab[tokenizer.pad_token]),  # input ids
        Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token type ids
        Stack(dtype='int64'),  # sequence lens
    ): [data for data in fn(samples)]
    test_data_loader = create_dataloader(
        test_ds,
        mode='test',
        batch_size=args.batch_size,
        batchify_fn=batchify_fn,
        trans_fn=trans_func)

    results = predict(model, test_data_loader, label_map)
    for idx, example in enumerate(test_ds.data):
        # Compare the token count with the tag count to spot truncated inputs.
        print(len(example['tokens']), len(results[idx]))
        print('Data: {} \t Label: {}'.format(example, results[idx]))
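The script prints one `B`/`I`/`O` tag per token. To read the opinion targets out of such a tag list, the tags can be grouped into spans. The sketch below shows one simple way to do this and is not part of the original scripts. `extract_spans` is a hypothetical helper, and the sample sentence and its tags are invented for illustration.

```python
def extract_spans(tokens, tags):
    """Group each B tag and the I tags that follow it into one text span."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no preceding "B", ends any open span.
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans


tokens = list("重庆老灶火锅很好吃")
tags = ["B", "I", "I", "I", "I", "I", "O", "O", "O"]
print(extract_spans(tokens, tags))  # ['重庆老灶火锅']
```

Because `zip` stops at the shorter input, a tag list that was truncated by `max_seq_length` yields spans only from the part of the sentence it covers.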