Table of Contents
Runtime error: No module named 'transformers.modeling_bert'
PyTorch frameworks
pytorch-nlu — fairly complete set of networks
Supports BERT and ALBERT
Runtime error: No module named 'transformers.modeling_bert'
Offending code:
from transformers.modeling_bert import BertPreTrainedModel, BertModel
Change it to:
from pybert.model.albert.modeling_bert import BertPreTrainedModel, BertModel
The transformers repo the author points to:
https://github.com/huggingface/transformers
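The reason for the error is that newer transformers releases moved this module: in 4.x the path is `transformers.models.bert.modeling_bert`, while the old flat path only works before 4.0. A hedged sketch of a probe that finds whichever path is importable in the current environment (the third candidate is the local copy bundled with the pybert repo):

```python
import importlib

def locate_bert_module():
    """Return the first importable module path that provides BertModel, or None."""
    candidates = [
        "transformers.models.bert.modeling_bert",  # transformers >= 4.0
        "transformers.modeling_bert",              # transformers < 4.0
        "pybert.model.albert.modeling_bert",       # local copy inside the pybert repo
    ]
    for name in candidates:
        try:
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    return None
```

This avoids pinning the notes to one transformers version: whichever layout is installed, the probe reports the usable path.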
ALBERT v2 pretrained model and dataset
Pretrained model download:
https://github.com/lonePatient/albert_pytorch
Switching models: delete cached_train_examples_albert and similar cache files under pybert/dataset
ALBERT v2 inference
import torch
from pybert.configs.basic_config import config
from pybert.io.albert_processor import AlbertProcessor
from pybert.model.albert_for_multi_label import AlbertForMultiLable

def main(text, arch, max_seq_length, do_lower_case):
    processor = AlbertProcessor(spm_model_file=config['albert_vocab_path'], do_lower_case=do_lower_case, vocab_file=None)
    label_list = processor.get_labels()
    id2label = {i: label for i, label in enumerate(label_list)}
    model = AlbertForMultiLable.from_pretrained(config['checkpoint_dir'] / f'{arch}', num_labels=len(label_list))
    tokens = processor.tokenizer.tokenize(text)
    if len(tokens) > max_seq_length - 2:
        # reserve two positions for [CLS] and [SEP]
        tokens = tokens[:max_seq_length - 2]
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    input_ids = processor.tokenizer.convert_tokens_to_ids(tokens)
    input_ids = torch.tensor(input_ids).unsqueeze(0)  # batch size 1
    logits = model(input_ids)
    probs = logits.sigmoid()
    return probs.cpu().detach().numpy()[0]

if __name__ == "__main__":
    text = '"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!"'
    max_seq_length = 256
    do_lower_case = False
    arch = 'albert'
    probs = main(text, arch, max_seq_length, do_lower_case)
    print(probs)
'''
# output
[0.98304486 0.40958735 0.9851305  0.04566246 0.8630512  0.07316463]
'''
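To turn those sigmoid scores into predicted labels, the usual multi-label convention is a fixed threshold per label (0.5 here; the threshold and the label order shown are illustrative — the repo's `processor.get_labels()` defines the actual order):

```python
def select_labels(probs, label_list, threshold=0.5):
    """Return (label, score) pairs whose sigmoid score clears the threshold."""
    return [(label, float(p)) for label, p in zip(label_list, probs) if p >= threshold]

# Example with the six Toxic Comment labels and the scores printed above
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
scores = [0.98304486, 0.40958735, 0.9851305, 0.04566246, 0.8630512, 0.07316463]
picked = select_labels(scores, labels)
# picked -> [("toxic", 0.98...), ("obscene", 0.98...), ("insult", 0.86...)]
```

Unlike softmax classification, several labels (or none) can fire at once, which is exactly what the output above shows.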
Dataset download: sign in with a Google account and accept the competition rules, then the files can be downloaded.
Toxic Comment Classification Challenge | Kaggle
libMultiLabel
https://github.com/ASUS-AICS/LibMultiLabel
Dependencies
nltk
# wait for https://github.com/Lightning-AI/pytorch-lightning/pull/19191
lightning==2.0.9 torch torchmetrics==0.10.3
torchtext
transformers
liblinear-multicore
liblinear-multicore fails to install here; the install errors out with a missing library.
Installing torchtext reinstalls torch.
GitHub - FBI1314/textClassification: packaged methods for short-text classification and multi-label classification
'max_input_length': 20,    # maximum training text length
"src_vocab_size": 20000,   # maximum source vocabulary size
"label_vocab_size": 20,    # maximum label vocabulary size
TensorFlow-based
CLUE
https://github.com/yuanxiaosc/BERT-for-Sequence-Labeling-and-Text-Classification/tree/master
https://github.com/RandolphVI/Hierarchical-Multi-Label-Text-Classification/tree/master
Data: