Multi-Label Text Classification

AI算法网奇

Last modified 2024-04-19 16:16:58

First published 2024-04-17 18:55:27

Article link: https://blog.csdn.net/jacke121/article/details/137884342

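PyTorch frameworks

Pytorch-NLU: a fairly complete set of networks

Supports BERT and ALBERT

Runtime error: No module named 'transformers.modeling_bert'

ALBERT v2 pretrained models and dataset

ALBERT v2 inference

LibMultiLabel

liblinear cannot be installed; the install fails with a library-not-found error

Installing torchtext reinstalls torch

TensorFlow-based

CLUE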

PyTorch frameworks

Pytorch-NLU: a fairly complete set of networks

GitHub - yongzhuo/Pytorch-NLU: Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of long and short Chinese text, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.

Supports BERT and ALBERT

GitHub - lonePatient/Bert-Multi-Label-Text-Classification: This repo contains a PyTorch implementation of a pretrained BERT model for multi-label text classification.

Runtime error: No module named 'transformers.modeling_bert'

The import that fails:

from transformers.modeling_bert import BertPreTrainedModel, BertModel

Change it to the copy bundled with the repo:

from pybert.model.albert.modeling_bert import BertPreTrainedModel, BertModel
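An alternative to switching to the bundled copy is to try several module paths in order. The error above is typical of newer transformers releases, where the BERT modeling code lives under transformers.models.bert.modeling_bert instead of the old flat path. The sketch below is only an illustration: import_first_available and BERT_MODULE_CANDIDATES are hypothetical names, not part of the repo, and whether the rest of the repo runs against a newer transformers release has not been verified here.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:  # also covers ModuleNotFoundError
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Candidate locations of the BERT modeling module (hypothetical helper list).
BERT_MODULE_CANDIDATES = [
    "transformers.models.bert.modeling_bert",  # layout used by transformers 4.x
    "transformers.modeling_bert",              # older flat layout
]

# Usage, assuming transformers is installed:
# modeling_bert = import_first_available(BERT_MODULE_CANDIDATES)
# BertPreTrainedModel = modeling_bert.BertPreTrainedModel
# BertModel = modeling_bert.BertModel
```

The helper only hides the path difference; API changes inside the classes themselves would still need to be handled separately.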

The transformers repository that the repo author points to for installation:

https://github.com/huggingface/transformers

ALBERT v2 pretrained models and dataset

Pretrained model download:

https://github.com/lonePatient/albert_pytorch

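Switching models: before training with a different model, delete cached_train_examples_albert and the similar cache files in the pybert/dataset directory; otherwise the cache built for the previous model is reused.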
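The cleanup step can be scripted. This is a minimal sketch, assuming the cache files all start with the prefix cached_ (only cached_train_examples_albert is named above, so the pattern is a guess); clear_cached_features is a hypothetical helper, not something the repo provides.

```python
from pathlib import Path


def clear_cached_features(dataset_dir, pattern="cached_*"):
    """Delete files matching pattern in dataset_dir and return their names.

    The default pattern is an assumption; adjust it to match the cache
    files actually present in pybert/dataset.
    """
    removed = []
    for path in sorted(Path(dataset_dir).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed


# Usage, run from the repo root:
# print(clear_cached_features("pybert/dataset"))
```

Returning the removed names makes it easy to check that only cache files were deleted and that the raw data files were left alone.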

ALBERT v2 inference

import torch
from pybert.configs.basic_config import config
from pybert.io.albert_processor import AlbertProcessor
from pybert.model.albert_for_multi_label import AlbertForMultiLable

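def main(text, arch, max_seq_length, do_lower_case):
    # Build the SentencePiece-based processor and the label mapping
    processor = AlbertProcessor(spm_model_file=config['albert_vocab_path'], do_lower_case=do_lower_case, vocab_file=None)
    label_list = processor.get_labels()
    id2label = {i: label for i, label in enumerate(label_list)}
    # Load the fine-tuned checkpoint stored under checkpoint_dir/<arch>
    model = AlbertForMultiLable.from_pretrained(config['checkpoint_dir'] / f'{arch}', num_labels=len(label_list))
    model.eval()  # disable dropout for inference
    # Tokenize, truncate to leave room for [CLS] and [SEP], then add them
    tokens = processor.tokenizer.tokenize(text)
    if len(tokens) > max_seq_length - 2:
        tokens = tokens[:max_seq_length - 2]
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    input_ids = processor.tokenizer.convert_tokens_to_ids(tokens)
    input_ids = torch.tensor(input_ids).unsqueeze(0)  # add batch dimension: shape (1, seq_len)
    with torch.no_grad():  # no gradients needed at inference time
        logits = model(input_ids)
    # Multi-label: an independent sigmoid per label rather than a softmax
    probs = logits.sigmoid()
    return probs.cpu().detach().numpy()[0]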

if __name__ == "__main__":
    text = ''''"FUCK YOUR FILTHY MOTHER IN THE ASS, DRY!"'''
    max_seq_length = 256
    do_lower_case = False
    arch = 'albert'
    probs = main(text, arch, max_seq_length, do_lower_case)
    print(probs)
    
    '''
    #output
    [0.98304486 0.40958735 0.9851305  0.04566246 0.8630512  0.07316463]
    '''
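The six probabilities printed above are per-label scores, so each one is thresholded on its own. The sketch below turns them into label names. It assumes the label order matches the six target columns of the Toxic Comment dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate) and uses a cutoff of 0.5; both are assumptions for illustration, and in practice the order should be taken from processor.get_labels().

```python
# Assumed label order: the six target columns of the Toxic Comment dataset.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def predicted_labels(probs, labels=LABELS, threshold=0.5):
    """Return the labels whose probability is at least threshold."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return [label for label, p in zip(labels, probs) if p >= threshold]


# Probabilities copied from the sample output above.
probs = [0.98304486, 0.40958735, 0.9851305, 0.04566246, 0.8630512, 0.07316463]
print(predicted_labels(probs))  # ['toxic', 'obscene', 'insult']
```

A single global threshold is the simplest choice; per-label thresholds tuned on a validation set are a common refinement when label frequencies differ a lot.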

Dataset download: sign in with a Google account and accept the agreement; the data can then be downloaded.

Toxic Comment Classification Challenge | Kaggle