A Baseline for Classifying Large-Scale Unlabeled Text Data

西南叶孤城

Article link: https://blog.csdn.net/weixin_44305190/article/details/120112105

I. Problem Overview

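This is a text classification competition organized by Huawei. The dataset is large, and many of the articles have no category label. The base dataset has two parts: a training set and a test set. The training set provides article-quality labels for its samples, and the test set is used to measure the accuracy of the model's label predictions.
The task has two main difficulties. First, the articles are long, so this is long-text classification, while BERT accepts at most 512 tokens of input. Second, the training set contains a large amount of unlabeled data, and that data includes articles whose category is "其他" (other) without marking them as such. Classification on the test set therefore also has to handle the case where an article's category is "其他".
The label distribution of the training set is given below. The empty label '' denotes unlabeled data; such an article may belong to one of the ten categories, or its category may be "其他".
The training set has 576,454 articles, of which only 76,454 are labeled. The ten categories are 人物专栏 (people profiles), 情感解读 (emotional analysis), 科普知识文 (popular science), 攻略文 (how-to guides), 物品评测 (product reviews), 治愈系文章 (comforting articles), 推荐文 (recommendations), 深度事件 (in-depth events), 作品分析 (analysis of works), and 行业解读 (industry analysis).
{'': 500000, '人物专栏': 7242, '情感解读': 7183, '科普知识文': 6337, '攻略文': 5517, '物品评测': 4381, '治愈系文章': 3868, '推荐文': 1194, '深度事件': 16670, '作品分析': 14094, '行业解读': 9968}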
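The figures above can be checked with a few lines of Python. This is only a sanity-check sketch: the dictionary is copied from the distribution in this post, and the variable names are made up for illustration.

```python
# Label counts as reported in this post; the empty-string key stands for unlabeled articles.
label_counts = {
    "": 500000, "人物专栏": 7242, "情感解读": 7183, "科普知识文": 6337,
    "攻略文": 5517, "物品评测": 4381, "治愈系文章": 3868, "推荐文": 1194,
    "深度事件": 16670, "作品分析": 14094, "行业解读": 9968,
}

# Keep only the ten labeled categories
labeled = {name: count for name, count in label_counts.items() if name != ""}
total = sum(label_counts.values())
n_labeled = sum(labeled.values())

# Largest and smallest labeled categories, to gauge class imbalance
largest = max(labeled, key=labeled.get)
smallest = min(labeled, key=labeled.get)

print("total:", total, "labeled:", n_labeled)
print("labeled fraction: {:.1%}".format(n_labeled / total))
print("largest/smallest class ratio: {:.1f}".format(labeled[largest] / labeled[smallest]))
```

Only about 13% of the training set is labeled, and the largest labeled category (深度事件) is roughly 14 times the size of the smallest (推荐文), so the ten-class problem is imbalanced as well.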

II. Baseline Approach

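(1) Preprocess the data: lightly clean the raw training and test sets, then split the training set into a labeled (positive) sample set P and an unlabeled sample set U.
(2) Train a BERT classifier (10 classes) on P.
(3) Use the trained BERT classifier to predict on U and output a reliable negative sample set (RN), i.e. articles whose type is "其他" (other).
(4) Train a binary classifier on P and RN.
(5) Use the binary classifier and the BERT classifier together to predict the category of each test sample, and save the article id and predicted label to a file.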

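A brief note on the hard part: obtaining the reliable negative sample set (RN). The baseline uses a random forest for this; naive Bayes could be used as well. The negatives obtained in this step directly affect the training of the binary classifier, whose main job is to identify articles of type "其他" (other).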
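One common way to pick reliable negatives can be sketched as follows: score every unlabeled sample with a classifier trained to separate P from U, then keep the samples with the lowest positive-class score. This is an illustration under assumptions, not the baseline's exact procedure: the function name `select_reliable_negatives`, the 0.1 threshold, and the example scores are all hypothetical, and the baseline code later in this post fits a PU classifier directly rather than materializing an RN list.

```python
from typing import List, Sequence


def select_reliable_negatives(scores: Sequence[float], threshold: float = 0.1) -> List[int]:
    """Return the indices of unlabeled samples whose positive-class score is below `threshold`."""
    return [i for i, score in enumerate(scores) if score < threshold]


# Hypothetical scores P(labeled | x) for five unlabeled articles
unlabeled_scores = [0.92, 0.03, 0.40, 0.08, 0.65]
reliable_negative_ids = select_reliable_negatives(unlabeled_scores)
print(reliable_negative_ids)
```

A stricter threshold yields fewer but cleaner negatives; a looser one yields more negatives at the risk of mislabeling articles that actually belong to one of the ten categories.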

III. Code

1. Config.py

Specifies the directories for the data, the models, and the other files.

# Path to the directory that stores all the datasets.
# If set to "", the "dataset" sub-directory next to the scripts is used (place the data downloaded from the DIGIX website there).
BASE_DATASET_PATH = "./data"
# Path to the directory that stores all the trained model files.
# If set to "", the "model" sub-directory next to the scripts is used.
BASE_MODEL_PATH = "./model"
# Path to the pretrained BERT encoder, such as "/data/bert_base_chinese"; it can be downloaded from https://huggingface.co/bert-base-chinese/tree/main
PRETRAINED_BERT_ENCODER_PATH = "./预训练模型"
# Path to the directory where the structured result (submission.csv) of evaluating the test file is saved.
# If set to "", submission.csv is written next to the scripts.
SUMMARY_OUTPUT_PATH = "./summary"
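Each script below repeats the same "empty string means use a default directory" logic. A small helper along the following lines could centralize it; this is a sketch only, and `resolve_dir` is a hypothetical name that does not appear in the baseline.

```python
import os
import tempfile


def resolve_dir(configured: str, default_name: str, base_dir: str) -> str:
    """Return `configured` if it is non-empty, else `base_dir/default_name`; create the directory."""
    path = configured if configured != "" else os.path.join(base_dir, default_name)
    os.makedirs(path, exist_ok=True)
    return path


# Usage with a throwaway base directory
with tempfile.TemporaryDirectory() as base:
    fallback = resolve_dir("", "model", base)
    explicit = resolve_dir(os.path.join(base, "my_models"), "model", base)
    fallback_ok = os.path.isdir(fallback) and fallback.endswith("model")
    explicit_ok = os.path.isdir(explicit) and explicit.endswith("my_models")

print(fallback_ok, explicit_ok)
```

Creating the directory with `exist_ok=True` also avoids failures later on when a script tries to write into a configured directory that does not exist yet.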

2. main.py

Runs training and saves the predictions; this is the overall processing pipeline.

from Preprocess import preprocess
from Build_PU_data import build_pu_data
from Train_Bert import train_bert
from Train_PU_model import train_pu_model
from Joint_Predictor import joint_predictor

if __name__ == "__main__":
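    # Preprocess the datasets: lightly clean the raw training and test sets, and split the training set into the labeled sample set (P) and the unlabeled sample set (U)
    preprocess()
    # Train the BERT classifier (10 classes) on P
    train_bert()
    # Use the trained BERT classifier on U and output the reliable negative sample set (RN)
    build_pu_data()
    # Train the binary classifier on P and RN
    train_pu_model()
    # Use the BERT classifier and the binary classifier together to predict the categories of the test samples, and write the formatted result to a file
    joint_predictor()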

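3. Preprocess.py

Data processing: cleans the training and test data, and splits the training set into the labeled data P and the unlabeled data U. The original post showed the data format in a figure here; judging from the code, each record is one JSON line with fields such as id, title, body, category, and doctype.
The processing code follows.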

import json
import os
import pandas as pd
import re
from tqdm import tqdm
from bs4 import BeautifulSoup
import Config

if Config.BASE_DATASET_PATH == "":
    curdir = os.path.dirname(os.path.abspath(__file__))
    dataset_path = os.path.join(curdir, "dataset")
    if not os.path.exists(dataset_path):
        os.mkdir(dataset_path)
else:
    dataset_path = Config.BASE_DATASET_PATH

RAW_TRAIN_FILE_PATH = os.path.join(dataset_path, "doc_quality_data_train_1000.json")
RAW_TEST_FILE_PATH = os.path.join(dataset_path, "doc_quality_data_test_1000.json")
PREPROCESSED_TRAIN_FILE_PATH = os.path.join(dataset_path, "preprocessed_train.json")
PREPROCESSED_TEST_FILE_PATH = os.path.join(dataset_path, "preprocessed_test.json")
POSITIVE_TRAIN_FILE_PATH = os.path.join(dataset_path, "postive_train.json")
POSITIVE_TRAIN_INFO_PATH = os.path.join(dataset_path, "positive_info.json")
UNLABELED_TRAIN_FILE_PATH = os.path.join(dataset_path, "unlabeled_train.json")
# Index list of the high-quality categories
INDEX = ['人物专栏', '作品分析', '情感解读', '推荐文', '攻略文', '治愈系文章', '深度事件', '物品评测', '科普知识文', '行业解读']

# Read a dataset's label set and, optionally, its sample count
def get_label_set_and_sample_num(config_path, sample_num=False):
    with open(config_path, "r", encoding="UTF-8") as input_file:
        json_data = json.loads(input_file.readline())
        if sample_num:
            return json_data["label_list"], json_data["total_num"]
        else:
            return json_data["label_list"]


# Build and save the label set and the total sample count of a dataset
def build_label_set_and_sample_num(input_path, output_path):
    label_set = set()
    sample_num = 0
    
    with open(input_path, 'r', encoding="utf-8") as input_file:
        for line in tqdm(input_file):
            json_data = json.loads(line)
            label_set.add(json_data["label"])
            sample_num += 1
            
    with open(output_path, "w", encoding="UTF-8") as output_file:
        record = {"label_list": sorted(list(label_set)), "total_num": sample_num}
        json.dump(record, output_file, ensure_ascii=False)

        return record["label_list"], record["total_num"]


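def get_sentences_list(raw_text: str):
    # Parse the document string with the built-in HTML parser and collect every text node;
    # .strings is the public counterpart of the private _all_strings() method
    return list(BeautifulSoup(raw_text, 'html.parser').strings)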


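def check_length(length_list):
    # Total length of title and body, capped at 510
    sum_length = sum(length_list)
    if sum_length < 510:
        return sum_length
    return 510


# Remove whitespace characters (moved here from the dataset iteration code)
def remove_symbol(string: str):
    return string.replace('\t', '').replace('\n', '').replace('\r', '')

# Some titles are repeated inside the body. BERT is trained on the title plus part of the body (at most 512 tokens in total), so the repeated title is removed here.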
def check_duplicate_title(input_path, output_path):
    duplicate = 0
    no_html = 0
    no_duplicate = 0
    print("Processing File: ", input_path)
    with open(input_path, "r", encoding='utf-8') as file, open(output_path, "w", encoding="utf-8") as outfile:
        for line in tqdm(file):
            json_data = json.loads(line)
            title = json_data["title"]
            body = get_sentences_list(json_data["body"])
            title_length = len(title)

            # The body contains no HTML tags
            if len(body) == 1:
                no_html += 1
                tmp_body = body[0]
                # Note: the pattern passed to re.sub uses re.escape()
                # so that characters in the title that re would treat as metacharacters (such as "?" or "*")
                # are escaped and the title is matched as literal text
                new_body = re.sub("(原标题：)?" + re.escape(title), "", tmp_body)
                new_body_length = len(new_body)

                if new_body_length == len(tmp_body):
                    no_duplicate += 1
                else:
                    duplicate += 1

            # The body contains HTML tags
            else:
                i = 0
                # Check whether the title appears in the first two elements (a tag such as <p class=\"ori_titlesource\"> may produce "原标题: title", i.e. "original title: title")
                for sentence in body[:2]:
                    if title in sentence:
                        i += 1

                new_body = "".join(body[i:])

                if i > 0:
                    duplicate += 1
                else:
                    no_duplicate += 1

            rm_whites_body = remove_symbol(new_body)
            rm_whites_title = remove_symbol(title)

            json_data["body"] = rm_whites_body
            json_data["title"] = rm_whites_title
            json_data["length"] = check_length([len(rm_whites_body), len(rm_whites_title)])
            json.dump(json_data, outfile, ensure_ascii=False)
            outfile.write("\n")

    print("duplicate: {}\t no_html: {}, no_duplicate: {}\n".format(duplicate, no_html, no_duplicate))


def index_data_pd(index, input_path, output_path1, output_path2):
    print(input_path)
    df_data = pd.read_json(input_path, orient="records", lines=True)
    # Process the labeled data
    df_data_labeled = df_data[df_data["doctype"] != ""]
    df_data_labeled = df_data_labeled.sample(frac=1.0)

    df_data_labeled["label"] = df_data_labeled.apply(lambda x: index.index(x["doctype"]), axis=1, raw=False)
    
    print("\n\n===================   The distribution of Positive train data   ===================\n")
    print(df_data_labeled["label"].value_counts())
    print("\n\n")
    df_data_labeled = df_data_labeled.drop(columns=["category"])
    # Save the labeled data separately
    df_data_labeled.to_json(output_path1, orient="records", lines=True, force_ascii=False)

    # Process the unlabeled data
    df_data_unlabeled = df_data[df_data["doctype"] == ""]
    df_data_unlabeled = df_data_unlabeled.sample(frac=1.0)
    df_data_unlabeled = df_data_unlabeled.drop(columns=["category"])
    # Save the unlabeled data separately
    df_data_unlabeled.to_json(output_path2, orient="records", lines=True, force_ascii=False)


def preprocess():
    # Remove titles that may be repeated in the article bodies of the training and test sets
    check_duplicate_title(RAW_TRAIN_FILE_PATH, PREPROCESSED_TRAIN_FILE_PATH)
    check_duplicate_title(RAW_TEST_FILE_PATH, PREPROCESSED_TEST_FILE_PATH)
    # Convert the labels of the labeled training samples to integer indices
    index_data_pd(INDEX, PREPROCESSED_TRAIN_FILE_PATH,
                  POSITIVE_TRAIN_FILE_PATH, UNLABELED_TRAIN_FILE_PATH)

    if os.path.exists(POSITIVE_TRAIN_INFO_PATH):
        labels_set, total_num = get_label_set_and_sample_num(POSITIVE_TRAIN_INFO_PATH, True)
    else:
        labels_set, total_num = build_label_set_and_sample_num(POSITIVE_TRAIN_FILE_PATH, POSITIVE_TRAIN_INFO_PATH)
    print("Preprocess done!")
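The role of `re.escape()` in `check_duplicate_title` is easiest to see on a title that contains a regex metacharacter. The snippet below is a small illustration; the helper name `strip_title` and the sample title and body are invented for this example.

```python
import re


def strip_title(title: str, body: str) -> str:
    """Remove the title, optionally prefixed with "原标题：", from the body."""
    return re.sub("(原标题：)?" + re.escape(title), "", body)


title = "为什么天空是蓝色的?"
body = "原标题：为什么天空是蓝色的?因为瑞利散射。"

# With re.escape(), the "?" in the title is matched literally
escaped = strip_title(title, body)
# Without it, "?" makes the preceding character optional and a stray "?" is left behind
unescaped = re.sub("(原标题：)?" + title, "", body)

print(escaped)
print(unescaped)
```

Titles with characters such as `*`, `(`, or `[` are worse than this case: without escaping they can change the match entirely or raise `re.error`.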

4. Train_Bert.py

Trains a ten-class classifier on the labeled text set P extracted from the training set.
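Because BERT is limited to 512 tokens, the script below feeds it the title plus the first and last 256 characters of the body, built by `prepare_sequence`. The sketch reproduces that rule and shows one side effect: a body shorter than 256 characters ends up duplicated around the separator. `prepare_sequence_no_overlap` is a hypothetical variant added here for comparison, not part of the baseline.

```python
def prepare_sequence(title: str, body: str):
    # Same rule as in this post: first 256 and last 256 characters of the body
    return (title, body[:256] + "|" + body[-256:])


def prepare_sequence_no_overlap(title: str, body: str, half: int = 256):
    # Hypothetical variant: a body that already fits is passed through unchanged
    if len(body) <= 2 * half:
        return (title, body)
    return (title, body[:half] + "|" + body[-half:])


long_body = "x" * 1000
short_body = "短文本"

print(len(prepare_sequence("t", long_body)[1]))
print(prepare_sequence("t", short_body)[1])
print(prepare_sequence_no_overlap("t", short_body)[1])
```

For a long body the second segment alone is 513 characters, so the tokenizer still has to truncate once the title and special tokens are added; the head-plus-tail rule mainly decides which part of the article survives.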

import json
import os
import torch
import transformers as tfs
import random
from torch import nn
from torch import optim
from tqdm import tqdm
from logger import Progbar
import Config
import Preprocess


if Config.BASE_MODEL_PATH == "":
    curdir = os.path.dirname(os.path.abspath(__file__))
    model_path = os.path.join(curdir, "model")
    if not os.path.exists(model_path):
        os.mkdir(model_path)
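else:
    model_path = Config.BASE_MODEL_PATH
    # Create the configured directory if it does not exist yet
    os.makedirs(model_path, exist_ok=True)

# Paths of the BERT models and the data files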
FINETUNED_BERT_ENCODER_PATH = os.path.join(model_path, "finetuned_bert.bin")
POSITIVE_TRAIN_FILE_PATH = Preprocess.POSITIVE_TRAIN_FILE_PATH
POSITIVE_TRAIN_INFO_PATH = os.path.join(Preprocess.dataset_path, "positive_info.json")
UNLABELED_TRAIN_FILE_PATH = Preprocess.UNLABELED_TRAIN_FILE_PATH
PRETRAINED_BERT_ENCODER_PATH = Config.PRETRAINED_BERT_ENCODER_PATH
BERT_MODEL_SAVE_PATH = model_path
BATCH_SIZE = 16
EPOCH = 5


# Number of batches needed for one epoch
def get_steps_per_epoch(line_count, batch_size):
    return line_count // batch_size if line_count % batch_size == 0 else line_count // batch_size + 1


# Define the format of the text fed to BERT, i.e. how title and body are combined
def prepare_sequence(title: str, body: str):
    return (title, body[:256] + "|" + body[-256:])


# Iterator: read the data record by record and yield the text and the label
def get_text_and_label_index_iterator(input_path):
    with open(input_path, 'r', encoding="utf-8") as input_file:
        for line in input_file:
            json_data = json.loads(line)
            text = prepare_sequence(json_data["title"], json_data["body"])
            label = json_data['label']

            yield text, label


# Iterator: yield one batch of data at a time
def get_bert_iterator_batch(data_path, batch_size=32):
    keras_bert_iter = get_text_and_label_index_iterator(data_path)
    continue_iterator = True
    while continue_iterator:
        data_list = []
        for _ in range(batch_size):
            try:
                # next() fetches the next item and raises StopIteration once the file is exhausted
                data = next(keras_bert_iter)
                data_list.append(data)
            except StopIteration:
                continue_iterator = False
                break
        # Do not yield an empty batch when the sample count is a multiple of batch_size
        if not data_list:
            break
        random.shuffle(data_list)

        text_list = []
        label_list = []

        for data in data_list:
            text, label = data
            text_list.append(text)
            label_list.append(label)

        yield text_list, label_list


class BertClassificationModel(nn.Module):
    """BERT classifier model"""
    def __init__(self, model_path, predicted_size, hidden_size=768):
        super(BertClassificationModel, self).__init__()
        model_class, tokenizer_class = tfs.BertModel, tfs.BertTokenizer
        self.tokenizer = tokenizer_class.from_pretrained(model_path)
        self.bert = model_class.from_pretrained(model_path)
        self.linear = nn.Linear(hidden_size, predicted_size)
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, batch_sentences):
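        # pad_to_max_length is deprecated; padding/truncation make the 512-token limit explicit
        batch_tokenized = self.tokenizer.batch_encode_plus(batch_sentences, add_special_tokens=True,
                                                           max_length=512,
                                                           padding="max_length",
                                                           truncation=True)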

        input_ids = torch.tensor(batch_tokenized['input_ids']).cuda()
        token_type_ids = torch.tensor(batch_tokenized['token_type_ids']).cuda()
        attention_mask = torch.tensor(batch_tokenized['attention_mask']).cuda()

        bert_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        bert_cls_hidden_state = bert_output[0][:, 0, :]
        linear_output = self.dropout(self.linear(bert_cls_hidden_state).cuda()).cuda()
        return linear_output


def train_bert():
    if os.path.exists(POSITIVE_TRAIN_INFO_PATH):
        labels_set, total_num = Preprocess.get_label_set_and_sample_num(POSITIVE_TRAIN_INFO_PATH, True)
    else:
        print("Found no positive_info.json, please rerun the Preprocess.py.")
        exit()

    torch.cuda.set_device(0)

    print("Start training model...")
    # train the model
    steps = get_steps_per_epoch(total_num, BATCH_SIZE)

    bert_classifier_model = BertClassificationModel(PRETRAINED_BERT_ENCODER_PATH, len(labels_set))
    # Move the model to the GPU; model.to(device) is the more flexible form, e.g. bert_classifier_model.to(torch.device("cuda"))
    bert_classifier_model = bert_classifier_model.cuda()

    # Use different learning rates for different sub-networks: the parameters are split into the BERT encoder and the downstream classification head, and each group gets its own optimization settings
    bert_model_params = []
    downstream_params = []
    # named_parameters() yields (name, parameter) pairs for every learnable tensor in the model;
    # parameters() yields the same tensors without their names (optimizer state is not included)
    for name, param in bert_classifier_model.named_parameters():
        if "bert" in name:
            bert_model_params.append(param)
        else:
            downstream_params.append(param)
    param_groups = [{"params": bert_model_params, "lr": 1e-5},
                    {"params": downstream_params, "lr": 1e-4}]
    # weight_decay adds an L2 penalty, which helps the model generalize
    optimizer = optim.Adam(param_groups, eps=1e-7, weight_decay=0.001)
    StepLR = torch.optim.lr_scheduler.StepLR(optimizer, step_size=steps, gamma=0.6)
    criterion = nn.CrossEntropyLoss()
    # model.train() puts Dropout (and BatchNorm, if present) layers into training mode; dropout randomly deactivates units during training to reduce overfitting
    bert_classifier_model.train()
    progbar = Progbar(target=steps)

    for epoch in range(EPOCH):
        model_save_path = os.path.join(BERT_MODEL_SAVE_PATH, "model_epoch{}.pkl".format(epoch))

        dataset_iterator = get_bert_iterator_batch(POSITIVE_TRAIN_FILE_PATH, BATCH_SIZE)

        for i, iteration in enumerate(dataset_iterator):
            # Reset the gradients of all model parameters to zero
            bert_classifier_model.zero_grad()
            text = iteration[0]
            labels = torch.tensor(iteration[1]).cuda()
            optimizer.zero_grad()
            output = bert_classifier_model(text)
            loss = criterion(output, labels).cuda()
            # Backpropagate to compute the current gradients
            loss.backward()

            # Update the model parameters from the current gradients
            optimizer.step()
            # Advance the learning-rate scheduler by one step
            StepLR.step()
            progbar.update(i + 1, None, None, [("train loss", loss.item()), ("bert_lr", optimizer.state_dict()["param_groups"][0]["lr"]), ("fc_lr", optimizer.state_dict()["param_groups"][1]["lr"])])

            if i == steps - 1:
                break

        # Save the complete BERT classifier model
        torch.save(bert_classifier_model, model_save_path)
        # Save the fine-tuned BERT encoder separately
        torch.save(bert_classifier_model.bert, FINETUNED_BERT_ENCODER_PATH)
        print("epoch {} is over!\n".format(epoch))

    print("\nTraining is over!\n")
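The training loop above calls `StepLR.step()` once per batch with `step_size` equal to the number of batches per epoch, so both learning rates are multiplied by 0.6 once per epoch. The arithmetic can be sketched in plain Python; the function `step_lr` is a hypothetical closed form written for this illustration (it is not a PyTorch API), and the sample count 76454 is the number of labeled articles given earlier.

```python
import math


def get_steps_per_epoch(line_count: int, batch_size: int) -> int:
    # Ceiling division, equivalent to the helper in Train_Bert.py
    return math.ceil(line_count / batch_size)


def step_lr(base_lr: float, global_step: int, step_size: int, gamma: float = 0.6) -> float:
    # Closed form of a step schedule: multiply by gamma once every step_size scheduler steps
    return base_lr * gamma ** (global_step // step_size)


steps = get_steps_per_epoch(76454, 16)
print("batches per epoch:", steps)
for epoch in range(5):
    first_step = epoch * steps
    # Learning rates of the BERT encoder group and the classification head at the start of each epoch
    print(epoch, step_lr(1e-5, first_step, steps), step_lr(1e-4, first_step, steps))
```

By the fifth epoch the encoder learning rate has dropped to about 13% of its starting value (0.6 to the fourth power), which keeps late updates to the pretrained weights small.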

5. Build_PU_data.py

Uses the trained BERT classifier on U and outputs the reliable negative sample set RN.

import os
import random
import numpy as np
import json
import transformers as tfs
import torch
import Config
from torch import nn
from logger import Progbar
from tqdm import tqdm
import Preprocess
import Train_Bert

BATCH_SIZE = 2
POSITIVE_TRAIN_FILE_PATH = Preprocess.POSITIVE_TRAIN_FILE_PATH
POSITIVE_TRAIN_INFO_PATH = Preprocess.POSITIVE_TRAIN_INFO_PATH
UNLABELED_TRAIN_FILE_PATH = Preprocess.UNLABELED_TRAIN_FILE_PATH
BERT_TOKENZIER_PATH = Config.PRETRAINED_BERT_ENCODER_PATH
FINETUNED_BERT_ENCODER_PATH = Train_Bert.FINETUNED_BERT_ENCODER_PATH
PU_DATA_TEXT_SAVE_PATH = os.path.join(Preprocess.dataset_path, "PU_text.npy")
PU_DATA_LABEL_SAVE_PATH = os.path.join(Preprocess.dataset_path, "PU_label.npy")
STOP = False

# Number of batches needed for one epoch
def get_steps_per_epoch(line_count, batch_size):
    return line_count // batch_size if line_count % batch_size == 0 else line_count // batch_size + 1


# Read a dataset's label set and sample count, which makes it easy to split the data into batches
def get_label_set_and_sample_num(config_path, sample_num=False):
    with open(config_path, "r", encoding="UTF-8") as input_file:
        json_data = json.loads(input_file.readline())
        if sample_num:
            return json_data["label_list"], json_data["total_num"]
        else:
            return json_data["label_list"]


# Define the format of the text fed to BERT, i.e. how title and body are combined
def prepare_sequence(title: str, body: str):
    return (title, body[:256] + "|" + body[-256:])


# Iterator: read the data record by record and yield the text; a function containing yield returns a generator
def get_text_and_label_index_iterator(input_path):
    with open(input_path, 'r', encoding="utf-8") as input_file:
        for line in input_file:
            json_data = json.loads(line)
            text = prepare_sequence(json_data["title"], json_data["body"])
            yield text


# Iterator: yield one batch of data at a time
def get_bert_iterator_batch(data_path, batch_size=32):
    keras_bert_iter = get_text_and_label_index_iterator(data_path)
    while True:
        data_list = []
        for _ in range(batch_size):
            try:
                data = next(keras_bert_iter)
                data_list.append(data)
            except StopIteration:
                # The file is exhausted: end the generator with a plain return.
                # As in the original code, a trailing incomplete batch is dropped.
                return
        random.shuffle(data_list)
        yield data_list


# Build and save the label set and the total sample count of a dataset
def build_label_set_and_sample_num(input_path, output_path):
    label_set = set()
    sample_num = 0
    with open(input_path, 'r', encoding="utf-8") as input_file:
        for line in tqdm(input_file):
            json_data = json.loads(line)
            label_set.add(json_data["label"])
            sample_num += 1

    with open(output_path, "w", encoding="UTF-8") as output_file:
        record = {"label_list": sorted(list(label_set)), "total_num": sample_num}
        json.dump(record, output_file, ensure_ascii=False)

        return record["label_list"], record["total_num"]


class MyBertEncoder(nn.Module):
    """Custom BERT encoder"""
    def __init__(self, tokenizer_path, finetuned_bert_path):
        super(MyBertEncoder, self).__init__()
        model_class, tokenizer_class = tfs.BertModel, tfs.BertTokenizer
        self.tokenizer = tokenizer_class.from_pretrained(tokenizer_path)
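        # Load the fine-tuned BERT model saved earlier
        # (on PyTorch 2.6+, torch.load needs weights_only=False to unpickle a full module)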
        self.bert = torch.load(finetuned_bert_path)

    def forward(self, batch_sentences):
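        # pad_to_max_length is deprecated; padding/truncation make the 512-token limit explicit
        batch_tokenized = self.tokenizer.batch_encode_plus(batch_sentences, add_special_tokens=True,
                                                           max_length=512, padding="max_length",
                                                           truncation=True)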

        input_ids = torch.tensor(batch_tokenized['input_ids']).cuda()
        token_type_ids = torch.tensor(batch_tokenized['token_type_ids']).cuda()
        attention_mask = torch.tensor(batch_tokenized['attention_mask']).cuda()

        bert_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
        bert_cls_hidden_state = bert_output[0][:, 0, :]
        return bert_cls_hidden_state


def build_pu_data():
    print("Start building PU data...")
    pos_data_iter = get_bert_iterator_batch(POSITIVE_TRAIN_FILE_PATH, batch_size=BATCH_SIZE)
    unlabeled_data_iter = get_bert_iterator_batch(UNLABELED_TRAIN_FILE_PATH, batch_size=BATCH_SIZE*2)

    torch.cuda.set_device(0)
    encoder = MyBertEncoder(BERT_TOKENZIER_PATH, FINETUNED_BERT_ENCODER_PATH)
    # eval() disables dropout so that the whole network is used for inference
    encoder.eval()

    X_parts, y_parts = [], []
    # torch.no_grad() is a context manager that disables gradient tracking, which saves memory during inference
    with torch.no_grad():
        for pos_batch, unlabeled_batch in tqdm(zip(pos_data_iter, unlabeled_data_iter)):
            encoded_pos = encoder(pos_batch).cpu().numpy()
            encoded_unlabeled = encoder(unlabeled_batch).cpu().numpy()
            # Labeled (positive) encodings get label 1, unlabeled encodings get label 0
            X_parts += [encoded_pos, encoded_unlabeled]
            # np.int was removed in NumPy 1.24, so an explicit integer dtype is used
            y_parts += [np.ones(encoded_pos.shape[0], dtype=np.int64),
                        np.zeros(encoded_unlabeled.shape[0], dtype=np.int64)]

    # Concatenate once at the end instead of re-copying the growing arrays on every batch
    X = np.concatenate(X_parts, axis=0)
    y = np.concatenate(y_parts)
    np.save(PU_DATA_TEXT_SAVE_PATH, X)
    np.save(PU_DATA_LABEL_SAVE_PATH, y)
    print("PU data build successfully...")

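Here the previously trained BERT produces an encoding for every labeled text and every unlabeled text, one encoding per text. The encodings are concatenated, with labeled encodings given the label 1 and unlabeled encodings the label 0. X holds the encodings and y the corresponding labels (0 or 1), both of which are used by the random forest later. This setup is known as PU (positive-unlabeled) learning, and it addresses classification when a large share of the data is unlabeled.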
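One consequence of `build_pu_data` is worth spelling out: the positive and unlabeled batches are consumed in lockstep with `zip()`, the unlabeled batches are twice as large, and an incomplete final batch is dropped, so only part of U is encoded. The sketch below reproduces that counting; the function names are hypothetical, and the sample counts are the ones given earlier in this post.

```python
def full_batches(n_samples: int, batch_size: int) -> int:
    # The batch generator in Build_PU_data.py drops a trailing incomplete batch
    return n_samples // batch_size


def pu_sample_counts(n_pos: int, n_unlabeled: int, batch_size: int = 2):
    # zip() stops at the shorter iterator; each unlabeled batch holds 2 * batch_size samples
    pairs = min(full_batches(n_pos, batch_size), full_batches(n_unlabeled, batch_size * 2))
    return pairs * batch_size, pairs * batch_size * 2


n_pos_used, n_unlabeled_used = pu_sample_counts(76454, 500000)
print(n_pos_used, n_unlabeled_used)
```

With these numbers every labeled article is encoded, but only about 153,000 of the 500,000 unlabeled ones, which fixes the ratio of unlabeled to positive samples seen by the PU classifier at 2:1.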

6. Train_PU_model.py

This step trains the binary classifier that decides whether an article is of type "其他" (other).

import numpy as np
import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.exceptions import NotFittedError
import Build_PU_data
import Train_Bert

PU_DATA_TEXT_SAVE_PATH = Build_PU_data.PU_DATA_TEXT_SAVE_PATH
PU_DATA_LABEL_SAVE_PATH = Build_PU_data.PU_DATA_LABEL_SAVE_PATH
PU_MODEL_SAVE_PATH = os.path.join(Train_Bert.model_path, "pu_model.bin")


class ElkanotoPuClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimator, hold_out_ratio=0.1):
        self.estimator = estimator
        # c is the constant proba that a example is positive, init to 1
        self.c = 1.0
        self.hold_out_ratio = hold_out_ratio
        self.estimator_fitted = False

    def __str__(self):
        return 'Estimator: {}\np(s=1|y=1,x) ~= {}\nFitted: {}'.format(
            self.estimator,
            self.c,
            self.estimator_fitted,
        )

    def split_hold_out(self, data):
        # np.random.permutation returns a shuffled copy, so the result must be assigned back
        data = np.random.permutation(data)
        hold_out_size = int(np.ceil(data.shape[0] * self.hold_out_ratio))
        hold_out_part = data[:hold_out_size]
        rest_part = data[hold_out_size:]

        return hold_out_part, rest_part

    def fit(self, pos, unlabeled):
        # Shuffle each dataset and split it by ratio into a hold-out part and the rest
        pos_hold_out, pos_rest = self.split_hold_out(pos)
        unlabeled_hold_out, unlabeled_rest = self.split_hold_out(unlabeled)

        all_rest = np.concatenate([pos_rest, unlabeled_rest], axis=0)
        # np.int was removed in NumPy 1.24; the built-in int is used instead
        all_rest_label = np.concatenate([np.full(shape=pos_rest.shape[0], fill_value=1, dtype=int),
                                         np.full(shape=unlabeled_rest.shape[0], fill_value=-1, dtype=int)])

        self.estimator.fit(all_rest, all_rest_label)

        # c is calculated based on holdout set predictions
        hold_out_predictions = self.estimator.predict_proba(pos_hold_out)
        hold_out_predictions = hold_out_predictions[:, 1]
        c = np.mean(hold_out_predictions)
        self.c = c
        self.estimator_fitted = True
        return self

    def predict_proba(self, X):
        if not self.estimator_fitted:
            raise NotFittedError(
                'The estimator must be fitted before calling predict_proba().'
            )
        probabilistic_predictions = self.estimator.predict_proba(X)
        probabilistic_predictions = probabilistic_predictions[:, 1]
        return probabilistic_predictions / self.c

    def predict(self, X, threshold=0.5):
        if not self.estimator_fitted:
            raise NotFittedError(
                'The estimator must be fitted before calling predict(...).'
            )
        return np.array([
            1.0 if p > threshold else -1.0
            for p in self.predict_proba(X)
        ])


def train_pu_model():
    print("\nStart fitting...")
    estimator = RandomForestClassifier(
        n_estimators=100,
        criterion='gini',
        bootstrap=True,
        n_jobs=1,
    )
    pu_classifier = ElkanotoPuClassifier(estimator, hold_out_ratio=0.1)

    X = np.load(PU_DATA_TEXT_SAVE_PATH)
    y = np.load(PU_DATA_LABEL_SAVE_PATH)

    n_postive = (y == 1).sum()
    n_unlabeled = (y == 0).sum()
    print("total n_positive: ", n_postive)
    print("total n_unlabel:  ", n_unlabeled)
    # Randomly subsample the positive and negative samples (disabled)
    # positive_random_index = np.random.choice(n_postive, RANDOM_POSITIVE_NUM)
    # unlabeled_random_index = np.random.choice(n_unlabeled, RANDOM_NEGATIVE_NUM)
    y_unlabel = np.ones(n_unlabeled)

    X_positive = X[y == 1]
    print("len of X_positive: ", X_positive.shape)
    y_positive_train = np.ones(n_postive)

    X_unlabel = X[y == 0]
    print("len of X_unlabeled: ", X_unlabel.shape)
    pu_classifier.fit(X_positive, X_unlabel)
    joblib.dump(pu_classifier, PU_MODEL_SAVE_PATH)
    print("Fitting done!")

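This step relies on a classical machine-learning technique for PU learning, which the class name attributes to Elkan and Noto. The wrapped estimator is trained to separate labeled samples from unlabeled ones; the constant c is then estimated as the mean probability it assigns to held-out positives, and every predicted probability is divided by c before thresholding. Readers can look up the underlying algorithm for the full derivation.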
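The rescaling performed by `ElkanotoPuClassifier` can be followed with plain arithmetic. The sketch below mirrors its `fit` and `predict` logic on made-up scores: the function names and all numbers are hypothetical, and the real class obtains its scores from the random forest's `predict_proba`.

```python
from statistics import mean


def estimate_c(holdout_positive_scores):
    # c approximates p(s=1 | y=1): the average score the classifier gives to known positives
    return mean(holdout_positive_scores)


def pu_predict(score, c, threshold=0.5):
    # Rescale the raw score by c, then threshold; -1 stands for "其他" (other)
    adjusted = score / c
    label = 1 if adjusted > threshold else -1
    return label, adjusted


# Hypothetical scores of the base estimator on held-out labeled articles
c = estimate_c([0.5, 0.75, 1.0])

for raw_score in (0.3, 0.45):
    print(raw_score, pu_predict(raw_score, c))
```

Because c is at most 1, dividing by it raises every score, so an article with a raw score just under 0.5 can still be accepted as belonging to one of the ten categories. The adjusted value can exceed 1 and is best read as a score rather than a calibrated probability.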

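7. Joint_Predictor.py

This script combines the binary classifier and the ten-class BERT classifier for prediction: it first decides whether an article is of type "其他" (other), and only if it is not does it use the BERT classifier to predict the category.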

# Python 3 uses UTF-8 by default, so the Python 2 reload(sys) / sys.setdefaultencoding('utf8') workaround is not needed
import json
import os
import torch
import numpy as np
import transformers as tfs
import pandas as pd
from tqdm import tqdm
tqdm.pandas(desc='pandas bar')
from torch import nn
import joblib
from Build_PU_data import MyBertEncoder
import Config
import Train_Bert
import Preprocess
import Train_PU_model


softmax = nn.Softmax(dim=1)
# Paths of the BERT models and the data files
PRETRAINED_BERT_ENCODER_PATH = Config.PRETRAINED_BERT_ENCODER_PATH
FINETUNED_BERT_ENCODER_PATH = Train_Bert.FINETUNED_BERT_ENCODER_PATH
BERT_MODEL_SAVE_PATH = Train_Bert.BERT_MODEL_SAVE_PATH
PU_MODEL_SAVE_PATH = Train_PU_model.PU_MODEL_SAVE_PATH
TEST_FILE_PATH = Preprocess.PREPROCESSED_TEST_FILE_PATH

if Config.SUMMARY_OUTPUT_PATH == "":
    curdir = os.path.dirname(os.path.abspath(__file__))
    SUMMARY_OUTPUT_PATH = os.path.join(curdir, "submission.csv")
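else:
    # Create the output directory if needed so that writing submission.csv does not fail
    os.makedirs(Config.SUMMARY_OUTPUT_PATH, exist_ok=True)
    SUMMARY_OUTPUT_PATH = os.path.join(Config.SUMMARY_OUTPUT_PATH, "submission.csv")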

INDEX = Preprocess.INDEX
MODEL_EPOCH = 5


# Read a dataset's label set and, optionally, its sample count
def get_label_set_and_sample_num(config_path, sample_num=False):
    with open(config_path, "r", encoding="UTF-8") as input_file:
        json_data = json.loads(input_file.readline())
        if sample_num:
            return json_data["label_list"], json_data["total_num"]
        else:
            return json_data["label_list"]


# Build and save the label set and the total sample count of the datasets
def build_label_set_and_sample_num(input_paths, output_paths):
    label_set = set()
    sample_num = 0
    for input_path in input_paths:
        with open(input_path, 'r', encoding="utf-8") as input_file:
            for line in tqdm(input_file):
                json_data = json.loads(line)
                label_set.add(json_data["label"])
                sample_num += 1

    with open(output_paths, "w", encoding="UTF-8") as output_file:
        record = {"label_list": sorted(list(label_set)), "total_num": sample_num}
        json.dump(record, output_file, ensure_ascii=False)

        return record["label_list"], record["total_num"]


# Define the format of the text fed to BERT, i.e. how title and body are combined
def prepare_sequence(title: str, body: str):
    return (title, body[:256] + "|" + body[-256:])


# Read the test set with pd.read_json()
def read_test_file(input_path: str):
    test_df = pd.read_json(input_path, orient="records", lines=True)

    return test_df


def predict_with_pu(x, index, pu_classifier, bert_encoder, bert_classifier_model):
    text = prepare_sequence(x["title"], x["body"])

    encoded_pos = np.array(bert_encoder([text]).tolist())
    # First use the PU classifier to decide whether the article is "其他" (other)
    pu_result = pu_classifier.predict(encoded_pos)
    if pu_result[0] < 0:
        predicted_label = "其他"
        proba = 0.5

    else:
        output = bert_classifier_model([text])
        predicted_proba = softmax(output).tolist()[0]
        predicted_index = np.argmax(predicted_proba)
        predicted_label = index[predicted_index]

        # Predicted probability of the predicted category
        proba = predicted_proba[predicted_index]

    return [predicted_label, round(proba, 2)]


# Write the model's results on the test set in a structured form
def summary(test_df, output_path, pu_classifier, bert_encoder, bert_classifier_model):
    test_df[["predicted_label", "proba"]] = test_df.progress_apply(
        lambda x: pd.Series(predict_with_pu(x, INDEX, pu_classifier, bert_encoder, bert_classifier_model)), axis=1)

    # Keep only the id and predicted_label columns, rename them, and write them to the file
    csv_data = test_df.loc[:, ["id", "predicted_label"]]
    csv_data.columns = ["id", "predict_doctype"]
    print("\n\n===================   The distribution of predictions   ===================\n")
    print(csv_data["predict_doctype"].value_counts())
    print("\n\n")
    # line_terminator was renamed to lineterminator in pandas 1.5 and removed in 2.0; the default line ending is used here
    csv_data.to_csv(output_path, index=False)


class BertClassificationModel(nn.Module):
    """BERT classifier; the input is a sentence pair (title, body)"""
    def __init__(self, model_path, predicted_size, hidden_size=768):
        super(BertClassificationModel, self).__init__()
        model_class, tokenizer_class = tfs.BertModel, tfs.BertTokenizer
        self.tokenizer = tokenizer_class.from_pretrained(model_path)
        self.bert = model_class.from_pretrained(model_path)
        self.linear = nn.Linear(hidden_size, predicted_size)  # the default hidden size of bert-base is 768
        self.dropout = nn.Dropout(p=0.2)

    def forward(self, batch_sentences):
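        # pad_to_max_length is deprecated; padding/truncation make the 512-token limit explicit
        batch_tokenized = self.tokenizer.batch_encode_plus(batch_sentences, add_special_tokens=True,
                                                           max_length=512,
                                                           padding="max_length",
                                                           truncation=True)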

        input_ids = torch.tensor(batch_tokenized['input_ids']).cuda()
        token_type_ids = torch.tensor(batch_tokenized['token_type_ids']).cuda()
        attention_mask = torch.tensor(batch_tokenized['attention_mask']).cuda()

        bert_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)

        bert_cls_hidden_state = bert_output[0][:, 0, :]  # take the hidden state of the [CLS] token
        linear_output = self.dropout(self.linear(bert_cls_hidden_state).cuda()).cuda()
        return linear_output


def joint_predictor():
    torch.cuda.set_device(0)
    fs = os.listdir(BERT_MODEL_SAVE_PATH)
    gs = list()
    for f in fs:
        if 'model_epoch' in f:
            gs.append(f)
    MODEL_EPOCH = max([int(x.split('.')[0].split('model_epoch')[-1]) for x in gs])
    model_save_path = os.path.join(BERT_MODEL_SAVE_PATH, "model_epoch{}.pkl".format(MODEL_EPOCH))
    print("Start evaluation...")
    print("Load bert_classifier model path: ", model_save_path)
    print("Load PU_classifier model path: ", PU_MODEL_SAVE_PATH)
    test_df = read_test_file(TEST_FILE_PATH)

    # Load the BERT classifier model (on PyTorch 2.6+, torch.load needs weights_only=False to unpickle a full module)
    bert_classifier_model = torch.load(model_save_path)
    bert_classifier_model = bert_classifier_model.cuda()
    bert_classifier_model.eval()

    with torch.no_grad():
        # Load the PU model
        pu_classifier = joblib.load(PU_MODEL_SAVE_PATH)
        # Load the fine-tuned BERT encoder
        bert_encoder = MyBertEncoder(PRETRAINED_BERT_ENCODER_PATH, FINETUNED_BERT_ENCODER_PATH)
        bert_encoder.eval()
        summary(test_df, SUMMARY_OUTPUT_PATH, pu_classifier, bert_encoder, bert_classifier_model)
    
    print("Evaluation done! Result has saved to: ", SUMMARY_OUTPUT_PATH)
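The decision rule inside `predict_with_pu` can be isolated from the models and shown on its own. In the sketch below the PU label and the class probabilities are passed in directly; `joint_decision` and the sample probabilities are hypothetical, while `INDEX` and the fixed 0.5 confidence for "其他" follow the code above.

```python
from typing import List, Sequence, Tuple

INDEX = ['人物专栏', '作品分析', '情感解读', '推荐文', '攻略文',
         '治愈系文章', '深度事件', '物品评测', '科普知识文', '行业解读']


def joint_decision(pu_label: float, class_probs: Sequence[float], index: List[str]) -> Tuple[str, float]:
    # Stage 1: a negative PU label means "其他" (other); the code above reports a fixed 0.5 for it
    if pu_label < 0:
        return "其他", 0.5
    # Stage 2: otherwise take the most probable of the ten categories
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return index[best], round(class_probs[best], 2)


# Hypothetical softmax output of the ten-class BERT classifier
probs = [0.05, 0.70, 0.05, 0.02, 0.03, 0.05, 0.04, 0.02, 0.02, 0.02]

print(joint_decision(-1.0, probs, INDEX))
print(joint_decision(1.0, probs, INDEX))
```

Since stage 1 acts as a gate, any article it rejects never reaches the BERT classifier, so the PU classifier's threshold directly controls how many test articles are labeled "其他".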

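IV. Summary

My own improvements come down to two points. For filtering positive and negative samples, I use naive Bayes to train the binary classifier. For the input to the ten-class BERT classifier, I first summarize the text and then feed the first 512 tokens of the summary to BERT. This article is well suited to learning BERT and PyTorch. If you need the dataset, you can contact me at 974128464@qq.com. I am a master's student working on NLP who likes C++, and discussion is welcome.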
