Preface
In the previous posts we covered the basic concepts and walked through a simple example to get familiar with PyTorch's components and its training workflow. In this post we will work directly with the state-of-the-art NLP library Transformers, using Albert as the backbone for our own model.
We will read through the official transformers example MM_IMDB and then try to build our own text classification model on top of it. Why use the transformers library? Because it puts many state-of-the-art models behind a unified interface, which makes them much easier to work with.
1. First Look
The program has five main parts: setting the random seed, the training loop, the evaluation loop, data loading, and the main routine. We will examine each part in turn, focusing on the ideas behind the code.
def set_seed(args)
def train(args, train_dataset, model, tokenizer, criterion)
def evaluate(args, model, tokenizer, criterion, prefix="")
def load_examples(args, tokenizer, evaluate=False)
def main()
1.1 Setting the Seed
def set_seed(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)
The seeding code is simple; its purpose is to make our experiments reproducible. It boils down to four steps:
- seed Python's random module
- seed numpy
- seed torch
- seed cuda
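The four steps above can be sketched and sanity-checked in a few lines; the CUDA call here is guarded with torch.cuda.is_available() so the sketch also runs on CPU-only machines:

```python
import random

import numpy as np
import torch

def set_all_seeds(seed):
    # Same four steps as set_seed above, guarded for CPU-only machines
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_all_seeds(42)
a = torch.rand(3)
set_all_seeds(42)
b = torch.rand(3)
print(torch.equal(a, b))  # True: re-seeding reproduces the same draws
```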
1.2 Training Process
The training function is long, so let's first look at its parameters and return values. The parameters are the experiment arguments args, the training set train_dataset, the model model, the tokenizer tokenizer, and the loss function criterion. The return values are the global step count global_step and the average loss tr_loss / global_step.
def train(args, train_dataset, model, tokenizer, criterion):
    # main body omitted
    return global_step, tr_loss / global_step
Now let's look at the details. Since there is a lot of code, we will only walk through the most important parts.
1.2.1 Training Data Loader
train_sampler = RandomSampler(train_dataset) if args.local_rank == -1 else DistributedSampler(train_dataset)
train_dataloader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=args.train_batch_size,
    collate_fn=collate_fn,
    num_workers=args.num_workers,
)
These lines set up the training sampler train_sampler and the training data loader train_dataloader. Their job is to turn the dataset into batches the model can consume. One important piece is collate_fn, discussed in the previous post: it controls how a batch of samples is assembled, a kind of data formatting. The code below shows what the original implementation actually does, which is simply preparing the batch tensors.
def collate_fn(batch):
    lens = [len(row["sentence"]) for row in batch]
    bsz, max_seq_len = len(batch), max(lens)
    mask_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
    text_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
    for i_batch, (input_row, length) in enumerate(zip(batch, lens)):
        text_tensor[i_batch, :length] = input_row["sentence"]
        mask_tensor[i_batch, :length] = 1
    img_tensor = torch.stack([row["image"] for row in batch])
    tgt_tensor = torch.stack([row["label"] for row in batch])
    img_start_token = torch.stack([row["image_start_token"] for row in batch])
    img_end_token = torch.stack([row["image_end_token"] for row in batch])
    return text_tensor, mask_tensor, img_tensor, img_start_token, img_end_token, tgt_tensor
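The heart of collate_fn is the pad-and-mask step. Here it is isolated on a tiny made-up batch of two "sentences" (token ids are toy values, not real tokenizer output):

```python
import torch

batch = [{"sentence": torch.tensor([101, 7, 8, 102])},
         {"sentence": torch.tensor([101, 9, 102])}]
lens = [len(row["sentence"]) for row in batch]
bsz, max_seq_len = len(batch), max(lens)
mask_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
text_tensor = torch.zeros(bsz, max_seq_len, dtype=torch.long)
for i_batch, (row, length) in enumerate(zip(batch, lens)):
    text_tensor[i_batch, :length] = row["sentence"]
    mask_tensor[i_batch, :length] = 1  # real tokens get 1, padding stays 0
print(text_tensor.tolist())  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask_tensor.tolist())  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```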
1.2.2 Optimizer and Schedule
# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {"params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], "weight_decay": 0.0},
]
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
)
This section configures the optimizer and its schedule: which parameter groups get weight decay, the learning rate, the warmup steps, and so on. Everything here prepares for the training loop that follows.
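A toy model makes the grouping visible. The module below deliberately names its norm layer LayerNorm so the substring match works the same way it does in transformers models (the model itself is invented for illustration):

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)  # attribute named as in transformers

model = TinyModel()
no_decay = ["bias", "LayerNorm.weight"]
decayed = [n for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)]
undecayed = [n for n, p in model.named_parameters() if any(nd in n for nd in no_decay)]
print(decayed)    # ['dense.weight']
print(undecayed)  # ['dense.bias', 'LayerNorm.weight', 'LayerNorm.bias']

# Two parameter groups, mirroring the snippet above
optimizer = torch.optim.AdamW(
    [{"params": [p for n, p in model.named_parameters() if n in decayed], "weight_decay": 0.01},
     {"params": [p for n, p in model.named_parameters() if n in undecayed], "weight_decay": 0.0}],
    lr=1e-5,
)
```

Note that only the matrix weights receive decay; biases and LayerNorm parameters are excluded, which is the standard recipe for fine-tuning transformers.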
1.2.3 Multi-GPU and Distributed Training
# multi-gpu training (should be after apex fp16 initialization)
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)
# Distributed training (should be after apex fp16 initialization)
if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank, find_unused_parameters=True
    )
This enables multi-GPU and distributed training, which will serve you well later; it only requires hardware support. If your hardware does not support it, the default arguments work fine too.
1.2.4 Training Loop
Now for the main event: the actual training loop. Since the code is fairly long, we will again break it into pieces.
1.2.4.1 Logging
# Train!
logger.info("***** Running training *****")
logger.info(" Num examples = %d", len(train_dataset))
logger.info(" Num Epochs = %d", args.num_train_epochs)
logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size)
logger.info(
    " Total train batch size (w. parallel, distributed & accumulation) = %d",
    args.train_batch_size
    * args.gradient_accumulation_steps
    * (torch.distributed.get_world_size() if args.local_rank != -1 else 1),
)
logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
logger.info(" Total optimization steps = %d", t_total)
This step could be skipped, but keeping it makes it much clearer what the run is actually doing.
1.2.4.2 Initializing Training State
global_step = 0
tr_loss, logging_loss = 0.0, 0.0
best_f1, n_no_improve = 0, 0
model.zero_grad()
train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0])
set_seed(args)  # Added here for reproducibility
Not much to say here; these lines just initialize the counters and state needed for training.
1.2.4.3 The Iteration Loop
This is the heart of the training process: a double loop over epochs and, within each epoch, over steps.
for _ in train_iterator:
    epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0])
    for step, batch in enumerate(epoch_iterator):
The batch is then fed to the model via outputs = model(**inputs), where **inputs unpacks keyword arguments. It is essentially a dict, so when we build a model we can hand it its inputs in dictionary form.
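As a quick illustration of the **inputs mechanics (with a stand-in function, not the real model):

```python
# forward here is a stand-in for the model's forward(); any function works
def forward(input_ids, attention_mask=None):
    return input_ids, attention_mask

inputs = {"input_ids": [1, 2, 3], "attention_mask": [1, 1, 1]}
out = forward(**inputs)  # same as forward(input_ids=[1, 2, 3], attention_mask=[1, 1, 1])
print(out)  # ([1, 2, 3], [1, 1, 1])
```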
model.train()
batch = tuple(t.to(args.device) for t in batch)
labels = batch[5]
inputs = {
    "input_ids": batch[0],
    "input_modal": batch[2],
    "attention_mask": batch[1],
    "modal_start_tokens": batch[3],
    "modal_end_tokens": batch[4],
}
outputs = model(**inputs)
Next the model output is consumed and the loss computed. Transformers outputs follow a convention: a tuple whose contents are described in the documentation. Here, however, the model is custom, so while the output is still a tuple, its contents differ slightly; we will see this in the model section.
logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
loss = criterion(logits, labels)
if args.n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
    loss = loss / args.gradient_accumulation_steps
if args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()
tr_loss += loss.item()
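The division by gradient_accumulation_steps is what makes the accumulated gradients match a full-batch update. A minimal check with a one-parameter least-squares loss (toy data, not MM_IMDB):

```python
import torch

w = torch.zeros(1, requires_grad=True)
data = torch.tensor([1.0, 2.0, 3.0, 4.0])

# Gradient of the full-batch mean loss
loss_full = ((data - w) ** 2).mean()
loss_full.backward()
g_full = w.grad.clone()

# Same thing in two micro-batches with gradient_accumulation_steps = 2
w.grad = None
for chunk in data.split(2):
    loss = ((chunk - w) ** 2).mean() / 2  # loss / gradient_accumulation_steps
    loss.backward()  # gradients accumulate in w.grad
print(torch.allclose(w.grad, g_full))  # True
```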
Next comes metric logging; this is also where evaluation during training happens.
if args.local_rank in [-1, 0] and args.logging_steps > 0 and global_step % args.logging_steps == 0:
    logs = {}
    if (
        args.local_rank == -1 and args.evaluate_during_training
    ):  # Only evaluate when single GPU otherwise metrics may not average well
        results = evaluate(args, model, tokenizer, criterion)
        for key, value in results.items():
            eval_key = "eval_{}".format(key)
            logs[eval_key] = value
    loss_scalar = (tr_loss - logging_loss) / args.logging_steps
    learning_rate_scalar = scheduler.get_lr()[0]
    logs["learning_rate"] = learning_rate_scalar
    logs["loss"] = loss_scalar
    logging_loss = tr_loss
    for key, value in logs.items():
        tb_writer.add_scalar(key, value, global_step)
    print(json.dumps({**logs, **{"step": global_step}}))
The last part saves model checkpoints; it also demonstrates the canonical way to save a model.
if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0:
    # Save model checkpoint
    output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step))
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    torch.save(model_to_save.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))
    torch.save(args, os.path.join(output_dir, "training_args.bin"))
    logger.info("Saving model checkpoint to %s", output_dir)
At the end of the loop there are two early-stopping checks: if you set a maximum number of steps, training stops once it is reached; otherwise training stops when the evaluation metric fails to improve for too many epochs.
if args.max_steps > 0 and global_step > args.max_steps:
    epoch_iterator.close()
    break
if args.max_steps > 0 and global_step > args.max_steps:
    train_iterator.close()
    break
if args.local_rank == -1:
    results = evaluate(args, model, tokenizer, criterion)
    if results["micro_f1"] > best_f1:
        best_f1 = results["micro_f1"]
        n_no_improve = 0
    else:
        n_no_improve += 1
    if n_no_improve > args.patience:
        train_iterator.close()
        break
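The patience logic can be isolated into a few lines of plain Python (the F1 sequence below is invented for illustration):

```python
def train_with_patience(scores, patience):
    best_f1, n_no_improve = 0.0, 0
    for epoch, f1 in enumerate(scores):
        if f1 > best_f1:
            best_f1, n_no_improve = f1, 0
        else:
            n_no_improve += 1
        if n_no_improve > patience:
            return epoch, best_f1  # stopped early
    return len(scores) - 1, best_f1  # ran all epochs

# F1 peaks at epoch 1, then fails to improve for three epochs -> stop at epoch 4
last_epoch, best = train_with_patience([0.4, 0.5, 0.49, 0.48, 0.47], patience=2)
print(last_epoch, best)  # 4 0.5
```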
1.3 Evaluation Process
The evaluation process resembles training: it loads data, evaluates it, reports results, and saves them. Let's start with the signature:
def evaluate(args, model, tokenizer, criterion, prefix="")
The main parameter is args; model is the model, tokenizer the tokenizer, criterion the loss function, and prefix an identifier prepended to output paths.
1.3.1 Loading Data
# Loop to handle MNLI double evaluation (matched, mis-matched)
eval_output_dir = args.output_dir
eval_dataset = load_examples(args, tokenizer, evaluate=True)
if not os.path.exists(eval_output_dir) and args.local_rank in [-1, 0]:
    os.makedirs(eval_output_dir)
args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
# Note that DistributedSampler samples randomly
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(
    eval_dataset, sampler=eval_sampler, batch_size=args.eval_batch_size, collate_fn=collate_fn
)
# multi-gpu eval
if args.n_gpu > 1:
    model = torch.nn.DataParallel(model)
Loading data works much the same way as before; it uses load_examples, which we will cover in detail shortly. The evaluation loop below also mirrors the training loop, except that it ends with computing the evaluation metrics.
# Eval!
logger.info("***** Running evaluation {} *****".format(prefix))
logger.info(" Num examples = %d", len(eval_dataset))
logger.info(" Batch size = %d", args.eval_batch_size)
eval_loss = 0.0
nb_eval_steps = 0
preds = None
out_label_ids = None
for batch in tqdm(eval_dataloader, desc="Evaluating"):
    model.eval()
    batch = tuple(t.to(args.device) for t in batch)
    with torch.no_grad():
        labels = batch[5]
        inputs = {
            "input_ids": batch[0],
            "input_modal": batch[2],
            "attention_mask": batch[1],
            "modal_start_tokens": batch[3],
            "modal_end_tokens": batch[4],
        }
        outputs = model(**inputs)
        logits = outputs[0]  # model outputs are always tuple in transformers (see doc)
        tmp_eval_loss = criterion(logits, labels)
        eval_loss += tmp_eval_loss.mean().item()
    nb_eval_steps += 1
    if preds is None:
        preds = torch.sigmoid(logits).detach().cpu().numpy() > 0.5
        out_label_ids = labels.detach().cpu().numpy()
    else:
        preds = np.append(preds, torch.sigmoid(logits).detach().cpu().numpy() > 0.5, axis=0)
        out_label_ids = np.append(out_label_ids, labels.detach().cpu().numpy(), axis=0)
eval_loss = eval_loss / nb_eval_steps
result = {
    "loss": eval_loss,
    "macro_f1": f1_score(out_label_ids, preds, average="macro"),
    "micro_f1": f1_score(out_label_ids, preds, average="micro"),
}
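Note that prediction here is multi-label: each logit is passed through a sigmoid and thresholded at 0.5 independently, rather than taking an argmax. On made-up logits:

```python
import torch

logits = torch.tensor([[2.0, -1.0, 0.3],
                       [-0.5, 1.5, -2.0]])
preds = (torch.sigmoid(logits) > 0.5).numpy()  # one independent decision per label
print(preds.tolist())  # [[True, False, True], [False, True, False]]
```

The resulting boolean matrix and the multi-hot targets are then fed to f1_score with average="micro" and average="macro".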
Then the results are displayed and written out:
output_eval_file = os.path.join(eval_output_dir, prefix, "eval_results.txt")
with open(output_eval_file, "w") as writer:
    logger.info("***** Eval results {} *****".format(prefix))
    for key in sorted(result.keys()):
        logger.info(" %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))
return result
1.4 Loading Data
def load_examples(args, tokenizer, evaluate=False):
    path = os.path.join(args.data_dir, "dev.jsonl" if evaluate else "train.jsonl")
    transforms = get_image_transforms()
    labels = get_mmimdb_labels()
    dataset = JsonlDataset(path, tokenizer, transforms, labels, args.max_seq_length - args.num_image_embeds - 2)
    return dataset
There is not much to this function, but JsonlDataset deserves a closer look, because it inherits from the important base class Dataset, the counterpart of DataLoader. All we need to provide are the initializer __init__, the length function __len__, and the item accessor __getitem__. __getitem__ returns a dict containing all the features of a single sample; the DataLoader then assembles these samples into a batch before feeding the model, and that assembly step is exactly the collate_fn function we saw earlier.
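A minimal Dataset of the same shape, paired with a DataLoader, shows the division of labor (toy data; here the default collate suffices because all samples share a length):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        # Each item is a dict of features, just like JsonlDataset's return value
        return {"sentence": torch.tensor(self.texts[index]),
                "label": torch.tensor(self.labels[index])}

ds = ToyDataset([[1, 2], [3, 4]], [0, 1])
loader = DataLoader(ds, batch_size=2)  # default collate stacks each field
batch = next(iter(loader))
print(batch["sentence"].shape)  # torch.Size([2, 2])
print(batch["label"].tolist())  # [0, 1]
```

With variable-length sentences the default collate would fail to stack, which is precisely why the example passes its own collate_fn.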
class JsonlDataset(Dataset):
    def __init__(self, data_path, tokenizer, transforms, labels, max_seq_length):
        self.data = [json.loads(l) for l in open(data_path)]
        self.data_dir = os.path.dirname(data_path)
        self.tokenizer = tokenizer
        self.labels = labels
        self.n_classes = len(labels)
        self.max_seq_length = max_seq_length
        self.transforms = transforms

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        sentence = torch.LongTensor(self.tokenizer.encode(self.data[index]["text"], add_special_tokens=True))
        start_token, sentence, end_token = sentence[0], sentence[1:-1], sentence[-1]
        sentence = sentence[: self.max_seq_length]
        label = torch.zeros(self.n_classes)
        label[[self.labels.index(tgt) for tgt in self.data[index]["label"]]] = 1
        image = Image.open(os.path.join(self.data_dir, self.data[index]["img"])).convert("RGB")
        image = self.transforms(image)
        return {
            "image_start_token": start_token,
            "image_end_token": end_token,
            "sentence": sentence,
            "image": image,
            "label": label,
        }

    def get_label_frequencies(self):
        label_freqs = Counter()
        for row in self.data:
            label_freqs.update(row["label"])
        return label_freqs
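The multi-hot label construction inside __getitem__ is worth isolating: a zero vector is indexed with the positions of the sample's labels (genre names below are invented):

```python
import torch

all_labels = ["Drama", "Comedy", "Horror"]  # invented label set
sample_labels = ["Comedy", "Horror"]        # labels of one sample
label = torch.zeros(len(all_labels))
label[[all_labels.index(tgt) for tgt in sample_labels]] = 1
print(label.tolist())  # [0.0, 1.0, 1.0]
```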
1.5 Main Function
The main function opens with an argparse parser containing many options; we omit them here and go straight to the body.
1.5.1 Loading the Model
Loading the model involves a few key steps: first load the original transformer_config, tokenizer, and transformer model, then build our own model and config on top of them. In Section 2 we will explain in detail how this model and config are constructed.
# Setup model
labels = get_mmimdb_labels()
num_labels = len(labels)
args.model_type = args.model_type.lower()
config_class, model_class, tokenizer_class = MODEL_CLASSES[args.model_type]
transformer_config = config_class.from_pretrained(
    args.config_name if args.config_name else args.model_name_or_path
)
tokenizer = tokenizer_class.from_pretrained(
    args.tokenizer_name if args.tokenizer_name else args.model_name_or_path,
    do_lower_case=args.do_lower_case,
    cache_dir=args.cache_dir if args.cache_dir else None,
)
transformer = model_class.from_pretrained(
    args.model_name_or_path, config=transformer_config, cache_dir=args.cache_dir if args.cache_dir else None
)
img_encoder = ImageEncoder(args)
config = MMBTConfig(transformer_config, num_labels=num_labels)
model = MMBTForClassification(config, transformer, img_encoder)
if args.local_rank == 0:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab
model.to(args.device)
logger.info("Training/evaluation parameters %s", args)
1.5.2 Training
This is where the training function above is invoked. It shows not only how to kick off training but also how to save and reload the model afterwards.
# Training
if args.do_train:
    train_dataset = load_examples(args, tokenizer, evaluate=False)
    label_frequences = train_dataset.get_label_frequencies()
    label_frequences = [label_frequences[l] for l in labels]
    label_weights = (
        torch.tensor(label_frequences, device=args.device, dtype=torch.float) / len(train_dataset)
    ) ** -1
    criterion = nn.BCEWithLogitsLoss(pos_weight=label_weights)
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, criterion)
    logger.info(" global_step = %s, average loss = %s", global_step, tr_loss)
# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()
if args.do_train and (args.local_rank == -1 or torch.distributed.get_rank() == 0):
    # Create output directory if needed
    if not os.path.exists(args.output_dir) and args.local_rank in [-1, 0]:
        os.makedirs(args.output_dir)
    logger.info("Saving model checkpoint to %s", args.output_dir)
    # Save a trained model, configuration and tokenizer using `save_pretrained()`.
    # They can then be reloaded using `from_pretrained()`
    model_to_save = (
        model.module if hasattr(model, "module") else model
    )  # Take care of distributed/parallel training
    torch.save(model_to_save.state_dict(), os.path.join(args.output_dir, WEIGHTS_NAME))
    tokenizer.save_pretrained(args.output_dir)
    # Good practice: save your training arguments together with the trained model
    torch.save(args, os.path.join(args.output_dir, "training_args.bin"))
    # Load a trained model and vocabulary that you have fine-tuned
    model = MMBTForClassification(config, transformer, img_encoder)
    model.load_state_dict(torch.load(os.path.join(args.output_dir, WEIGHTS_NAME)))
    tokenizer = tokenizer_class.from_pretrained(args.output_dir)
    model.to(args.device)
1.5.3 Evaluation
This should look familiar: it is essentially the evaluation function again, except here we train first and evaluate afterwards, whereas before we evaluated during training.
# Evaluation
results = {}
if args.do_eval and args.local_rank in [-1, 0]:
    tokenizer = tokenizer_class.from_pretrained(args.output_dir, do_lower_case=args.do_lower_case)
    checkpoints = [args.output_dir]
    if args.eval_all_checkpoints:
        checkpoints = list(
            os.path.dirname(c) for c in sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True))
        )
        logging.getLogger("transformers.modeling_utils").setLevel(logging.WARN)  # Reduce logging
    logger.info("Evaluate the following checkpoints: %s", checkpoints)
    for checkpoint in checkpoints:
        global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
        prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
        model = MMBTForClassification(config, transformer, img_encoder)
        model.load_state_dict(torch.load(checkpoint))
        model.to(args.device)
        result = evaluate(args, model, tokenizer, criterion, prefix=prefix)
        result = dict((k + "_{}".format(global_step), v) for k, v in result.items())
        results.update(result)
2. Building the Model
Using the stock transformer model unchanged is rarely ideal; we usually want to adapt it to our task. So how do you write a model tailored to your own task? This section has the answer.
2.1 Model Configuration
Alongside the model itself we write a configuration class, which mostly holds the model-specific hyperparameters.
class MMBTConfig(object):
    """Configuration class to store the configuration of a `MMBT Model`.

    Args:
        config (:obj:`~transformers.PreTrainedConfig`):
            Config of the underlying Transformer models. Its values are
            copied over to use a single config.
        num_labels (:obj:`int` or :obj:`None`, optional, defaults to `None`):
            Size of final Linear layer for classification.
        modal_hidden_size (:obj:`int`, optional, defaults to 2048):
            Embedding dimension of the non-text modality encoder.
    """

    def __init__(self, config, num_labels=None, modal_hidden_size=2048):
        self.__dict__ = config.__dict__
        self.modal_hidden_size = modal_hidden_size
        if num_labels:
            self.num_labels = num_labels
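The self.__dict__ = config.__dict__ line is the interesting trick: the wrapper shares, rather than copies, the underlying config's attribute dict. A pure-Python stand-in (class names and attribute values here are invented):

```python
class BaseConfig:
    def __init__(self):
        self.hidden_size = 768
        self.hidden_dropout_prob = 0.1

class WrapperConfig:
    def __init__(self, config, num_labels=None, modal_hidden_size=2048):
        self.__dict__ = config.__dict__  # share the attribute dict, don't copy it
        self.modal_hidden_size = modal_hidden_size
        if num_labels:
            self.num_labels = num_labels

base = BaseConfig()
cfg = WrapperConfig(base, num_labels=15)
print(cfg.hidden_size)         # 768: read through from the base config
print(base.modal_hidden_size)  # 2048: writes are visible on the base too
```

Because the dict is shared, every attribute of the transformer config is available on the wrapper, and anything the wrapper sets also appears on the base config.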
2.2 Model Architecture
This example has three model layers, from outermost to innermost: MMBTForClassification, MMBTModel, and ModalEmbeddings. We will dissect them starting from the outside.
2.2.1 The Outer Layer
The outer layer adapts the model for classification, so it mainly adds a dropout and a classification layer. Focus on the forward part, which does four things:
- run the MMBT model to get its outputs
- pass them through its own head to get the final logits
- compute the loss internally only if labels are provided
- assemble the final output tuple and return it
class MMBTForClassification(nn.Module):
    def __init__(self, config, transformer, encoder):
        super().__init__()
        self.num_labels = config.num_labels
        self.mmbt = MMBTModel(config, transformer, encoder)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(
        self,
        input_modal,
        input_ids=None,
        modal_start_tokens=None,
        modal_end_tokens=None,
        attention_mask=None,
        token_type_ids=None,
        modal_token_type_ids=None,
        position_ids=None,
        modal_position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
    ):
        outputs = self.mmbt(
            input_modal=input_modal,
            input_ids=input_ids,
            modal_start_tokens=modal_start_tokens,
            modal_end_tokens=modal_end_tokens,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            modal_token_type_ids=modal_token_type_ids,
            position_ids=position_ids,
            modal_position_ids=modal_position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
        )
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # (loss), logits, (hidden_states), (attentions)
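The loss switch at the end of forward() can be checked in isolation: num_labels == 1 means regression (MSE), anything else means classification (cross-entropy). Stand-alone tensors, not real model outputs:

```python
import torch
from torch.nn import CrossEntropyLoss, MSELoss

def pick_loss(logits, labels, num_labels):
    # num_labels == 1 -> regression, otherwise classification
    if num_labels == 1:
        return MSELoss()(logits.view(-1), labels.view(-1))
    return CrossEntropyLoss()(logits.view(-1, num_labels), labels.view(-1))

reg = pick_loss(torch.tensor([[0.5]]), torch.tensor([[1.0]]), num_labels=1)
clf = pick_loss(torch.tensor([[2.0, 0.0]]), torch.tensor([0]), num_labels=2)
print(round(reg.item(), 2))  # 0.25, i.e. (1.0 - 0.5) ** 2
```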
2.2.2 The Middle Layer
This layer is quite long, so we omit its code. It is where you build your own model, the real core: it decides how the inputs we design are processed. It plays a bridging role, serving the outer classification layer above it while also routing the inputs into a concrete pretrained model below it, such as Albert.
2.2.3 The Innermost Layer
The innermost layer is the actual encoding layer; it cannot be subdivided further and is the model's most basic building block. By analogy with parcel delivery: the innermost layer is the item you bought, the middle layer is the box that packages your particular item into a standard shape, and the outer layer is the courier who delivers the box where you want it. The complexity of your model depends on how many such layers you stack.
In general, the innermost layer is the most basic module, the middle layer assembles the pieces, and the outermost layer wraps everything for the downstream task and returns the classification result. Master these three layers and you can write very clean, well-structured code.
3. Hands-On: A Simple Classification Task
With the walkthrough above, we now have a clear picture of the overall workflow. Let's apply what we have learned and build our own classification model; here we classify the relation between sentence pairs.
Since the official transformers release has no Chinese Albert, I used another implementation. Its code closely mirrors the official one, but it is maintained by Chinese developers, so it provides Chinese Albert checkpoints, and some of its comments are in Chinese (with the occasional typo).
3.1 Data Processing
First we create our own DataProcessor for reading the data files.
class RelationProcessor(DataProcessor):
    """Processor for the Relation data set (GLUE version)."""

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["Joint", "Sequence", "Progression", "Contrast", "Supplement",
                "Cause-Result", "Result-Cause", "Background", "Behavior-Purpose",
                "Purpose-Behavior", "Elaboration", "Summary", "Evaluation",
                "Statement-Illustration", "Illustration-Statement"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training and dev sets."""
        examples = []
        for (i, line) in enumerate(lines):
            if i == 0:
                continue
            guid = "%s-%s" % (set_type, line[0])
            text_a = line[1]
            text_b = line[2]
            label = line[-1]
            examples.append(
                InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
        return examples
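To see what _create_examples produces, here is its logic run on two toy TSV rows, with a namedtuple standing in for transformers' InputExample (the rows below are invented):

```python
from collections import namedtuple

# Stand-in for transformers' InputExample (fields match the real class)
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])

def create_examples(lines, set_type):
    examples = []
    for i, line in enumerate(lines):
        if i == 0:  # skip the TSV header row
            continue
        examples.append(InputExample(guid="%s-%s" % (set_type, line[0]),
                                     text_a=line[1], text_b=line[2], label=line[-1]))
    return examples

lines = [["id", "text_a", "text_b", "label"],
         ["0", "first clause", "second clause", "Joint"]]
examples = create_examples(lines, "train")
print(examples[0].guid, examples[0].label)  # train-0 Joint
```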
3.2 Testing
The earlier example had no test code, so this is a good chance to write it ourselves.
def test(args, model, tokenizer, prefix=""):
    # Loop to handle MNLI double evaluation (matched, mis-matched)
    test_task_names = ("mnli", "mnli-mm") if args.task_name == "mnli" else (args.task_name,)
    test_outputs_dirs = (args.output_dir, args.output_dir + "-MM") if args.task_name == "mnli" else (args.output_dir,)
    results = {}
    for test_task, test_output_dir in zip(test_task_names, test_outputs_dirs):
        test_dataset = load_and_cache_examples(args, test_task, tokenizer, data_type="test")
        if not os.path.exists(test_output_dir) and args.local_rank in [-1, 0]:
            os.makedirs(test_output_dir)
        args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, args.n_gpu)
        # Note that DistributedSampler samples randomly
        test_sampler = SequentialSampler(test_dataset) if args.local_rank == -1 else DistributedSampler(test_dataset)
        test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=args.eval_batch_size,
                                     collate_fn=collate_fn)
        # Test!
        logger.info("***** Running test {} *****".format(prefix))
        logger.info(" Num examples = %d", len(test_dataset))
        logger.info(" Batch size = %d", args.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None
        pbar = ProgressBar(n_total=len(test_dataloader), desc="Testing")
        for step, batch in enumerate(test_dataloader):
            model.eval()
            batch = tuple(t.to(args.device) for t in batch)
            with torch.no_grad():
                inputs = {"input_ids": batch[0], "attention_mask": batch[1],
                          "labels": batch[3], "token_type_ids": batch[2]}
                outputs = model(**inputs)
                tmp_eval_loss, logits = outputs[:2]
                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1
            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)
            pbar(step)
        print(" ")
        if "cuda" in str(args.device):
            torch.cuda.empty_cache()
        eval_loss = eval_loss / nb_eval_steps
        if args.output_mode == "classification":
            preds = np.argmax(preds, axis=1)
        elif args.output_mode == "regression":
            preds = np.squeeze(preds)
        result = compute_metrics(test_task, preds, out_label_ids)
        results.update(result)
        logger.info("***** Test results {} *****".format(prefix))
        for key in sorted(result.keys()):
            logger.info(" %s = %s", key, str(result[key]))
        classreport = ClassReport(["Joint", "Sequence", "Progression", "Contrast",
                                   "Supplement", "Cause-Result", "Result-Cause",
                                   "Background", "Behavior-Purpose", "Purpose-Behavior",
                                   "Elaboration", "Summary", "Evaluation",
                                   "Statement-Illustration", "Illustration-Statement"])
        classreport(preds, out_label_ids)
        logger.info("%s : %s", classreport.name(), classreport.value())
    return results
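The output_mode branch near the end reduces the raw logits to final predictions: argmax over classes for classification, squeeze for single-value regression. On toy arrays:

```python
import numpy as np

logits = np.array([[0.1, 2.0], [1.5, 0.2]])
preds_clf = np.argmax(logits, axis=1)             # classification: index of best class
preds_reg = np.squeeze(np.array([[0.7], [0.3]]))  # regression: drop the extra axis
print(preds_clf.tolist())  # [1, 0]
print(preds_reg.tolist())  # [0.7, 0.3]
```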
3.3 Main Process
We only need to add the test invocation to the main routine:
# Test
results = []
if args.do_predict and args.local_rank in [-1, 0]:
    tokenizer = tokenization_albert.FullTokenizer(vocab_file=args.vocab_file,
                                                  do_lower_case=args.do_lower_case,
                                                  spm_model_file=args.spm_model_file)
    checkpoints = [(0, args.output_dir)]
    if args.predict_all_checkpoints:
        checkpoints = list(
            os.path.dirname(c) for c in
            sorted(glob.glob(args.output_dir + "/**/" + WEIGHTS_NAME, recursive=True)))
        checkpoints = [(int(checkpoint.split("-")[-1]), checkpoint) for checkpoint in checkpoints if
                       checkpoint.find("checkpoint") != -1]
        checkpoints = sorted(checkpoints, key=lambda x: x[0])
    logger.info("Test the following checkpoints: %s", checkpoints)
    for _, checkpoint in checkpoints:
        global_step = checkpoint.split("-")[-1] if len(checkpoints) > 1 else ""
        prefix = checkpoint.split("/")[-1] if checkpoint.find("checkpoint") != -1 else ""
        model = AlbertForSequenceClassification.from_pretrained(checkpoint)
        model.to(args.device)
        result = test(args, model, tokenizer, prefix=prefix)
        results.extend([(k + "_{}".format(global_step), v) for k, v in result.items()])
    output_test_file = os.path.join(args.output_dir, "checkpoint_test_results.txt")
    with open(output_test_file, "w") as writer:
        for key, value in results:
            writer.write("%s = %s\n" % (key, str(value)))
Then run the command below and the whole pipeline works end to end:
CUDA_VISIBLE_DEVICES=5 python3 run_classifier_relation.py \
--model_type=albert \
--model_name_or_path=./albert_base_zh/pytorch_model.bin \
--vocab_file=./albert_base_zh/vocab.txt \
--config_name=./albert_base_zh/config.json \
--task_name=relation \
--do_train \
--do_eval \
--do_predict \
--predict_all_checkpoints \
--do_lower_case \
--data_dir=./dataset/relation/ \
--max_seq_length=512 \
--per_gpu_train_batch_size=2 \
--per_gpu_eval_batch_size=2 \
--learning_rate=1e-5 \
--num_train_epochs=5.0 \
--logging_steps=1192 \
--save_steps=1192 \
--output_dir=./outputs/relation_output/ \
--overwrite_output_dir \
--seed=42
4. Summary
With everything above, we are now familiar with the models in transformers and how to apply them, and we built our own example by following the same pattern. Next we will go one level higher and construct more advanced models. As the saying goes, the road ahead is long.