Notes for the record:
Environment: Google Colab (with GPU)
Dataset: e-commerce data collected by a web crawler, 107 classes, ~350K rows
Dependencies: huggingface-hub-0.10.1 sklearn-0.0 tokenizers-0.13.1 transformers-4.23.1
!pip install scikit-learn transformers pandas tensorflow  # install scikit-learn, not the deprecated "sklearn" shim
import os
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, TFBertForSequenceClassification
The approach here is to upload the data to Google Drive first; once the drive is mounted, reading the data is straightforward.
The drive can be mounted with a command, or directly via the Colab sidebar, which is very convenient.
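The command-based mount mentioned above looks like this (it only runs inside a Colab session, so treat it as environment setup):

```python
# Runs only inside Colab: mounts Google Drive at /content/drive,
# after which files can be read with ordinary paths
from google.colab import drive

drive.mount('/content/drive')
```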
df_raw = pd.read_csv('/content/drive/MyDrive/data_use_y.csv', encoding='utf-8-sig', index_col=0)
df_raw.head()
The data looks like this:
Two models were used for training and testing. max_length is set to 20 here; judging from the data, it could reasonably be set larger. batch_size is set to 500, paired with a fairly large number of epochs.
# tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')  # XLM-RoBERTa needs its own tokenizer class, not BertTokenizer
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')
max_length = 20
batch_size = 500
def split_dataset(df):
    # 90% train; the remaining 10% is split evenly into validation and test (5%/5%),
    # stratified by label so class proportions are preserved in every split
    train_set, x = train_test_split(df,
                                    stratify=df['label'],
                                    test_size=0.1,
                                    random_state=42)
    val_set, test_set = train_test_split(x,
                                         stratify=x['label'],
                                         test_size=0.5,
                                         random_state=43)
    return train_set, val_set, test_set
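The resulting 90/5/5 proportions can be checked on a toy frame (the 100-row DataFrame and its columns below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 100 rows, two balanced classes
df = pd.DataFrame({'text': [f'item {i}' for i in range(100)],
                   'label': [i % 2 for i in range(100)]})

# Same two-step stratified split as in the post
train_set, x = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
val_set, test_set = train_test_split(x, stratify=x['label'], test_size=0.5, random_state=43)

print(len(train_set), len(val_set), len(test_set))  # 90 5 5
```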
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label
def convert_example_to_feature(review):
    # This helper was referenced but not shown in the original; a standard encode_plus version:
    # tokenize, add [CLS]/[SEP], pad/truncate to max_length, and return the attention mask
    return tokenizer.encode_plus(review,
                                 add_special_tokens=True,
                                 max_length=max_length,
                                 padding='max_length',
                                 truncation=True,
                                 return_attention_mask=True)

def encode_examples(ds, limit=-1):
    # Prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.head(limit)  # DataFrame.take expects indices, not a count; head(n) limits rows
    for index, row in ds.iterrows():
        review = row["text"]
        label = row["y"]
        bert_input = convert_example_to_feature(review)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)
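The `from_tensor_slices` + `map` pattern above can be sanity-checked with dummy token ids (toy values standing in for real tokenizer output):

```python
import tensorflow as tf

# Dummy pre-tokenized inputs: two examples of length 4
input_ids = [[101, 7, 8, 102], [101, 9, 10, 102]]
attention_masks = [[1, 1, 1, 1], [1, 1, 1, 1]]
token_type_ids = [[0, 0, 0, 0], [0, 0, 0, 0]]
labels = [[0], [1]]

def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    # Repackage each slice into the dict of named inputs that the model expects
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label

ds = tf.data.Dataset.from_tensor_slices(
    (input_ids, attention_masks, token_type_ids, labels)).map(map_example_to_dict)

features, label = next(iter(ds))
print(sorted(features.keys()))  # ['attention_mask', 'input_ids', 'token_type_ids']
```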
Split into a training set, a validation set (for hyperparameter tuning), and a test set (for checking the model's final performance):
train_data, val_data, test_data = split_dataset(df_raw)
# Encode the datasets
# train dataset
ds_train_encoded = encode_examples(train_data).shuffle(10000).batch(batch_size)
# val dataset
ds_val_encoded = encode_examples(val_data).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(test_data).batch(batch_size)
# Load the model
# model = TFXLMRobertaForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=len(set(df_raw['label'].tolist())))  # XLM-RoBERTa needs its own model class
model = TFBertForSequenceClassification.from_pretrained("hfl/chinese-bert-wwm-ext", num_labels=len(set(df_raw['label'].tolist())))
learning_rate = 2e-5
# 20 epochs turned out to overfit somewhat
number_of_epochs = 20
# Set up the optimizer, loss, and metric
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08, clipnorm=1)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# ---- Train with Chinese BERT (whole-word masking)
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_val_encoded)
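As a sanity check on the loss/metric pairing (`from_logits=True` matches a model whose head emits raw logits, as TFBertForSequenceClassification does), the same compile call runs on a tiny stand-in model with made-up data:

```python
import tensorflow as tf

# Tiny stand-in: a single Dense layer emitting 3-class logits (no softmax)
model = tf.keras.Sequential([tf.keras.layers.Dense(3)])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-08, clipnorm=1),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

x = tf.random.normal((8, 4))               # 8 fake feature vectors
y = tf.constant([0, 1, 2, 0, 1, 2, 0, 1])  # integer labels, as SparseCategorical* expects
history = model.fit(x, y, epochs=1, verbose=0)
print(sorted(history.history.keys()))  # ['accuracy', 'loss']
```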
Training log:
The model essentially plateaus around epoch 9; accuracy dips slightly afterwards but stays above 0.85.
Checking the model's performance on the test set:
model.evaluate(ds_test_encoded)
Test-set accuracy is also above 0.85, and a later run with RoBERTa reached about 0.83. Even without careful data preprocessing, BERT performs quite well here; how to further improve this multi-class task is worth exploring.
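One way to dig into where a 107-class model falls short (not covered above) is a per-class report; a sketch with scikit-learn, where the `y_true`/`y_pred` arrays are made-up stand-ins for labels collected from `model.predict`:

```python
import numpy as np
from sklearn.metrics import classification_report

# Stand-ins for real labels and argmax-ed logits from model.predict
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Per-class precision/recall/F1 highlights which categories drag accuracy down
print(classification_report(y_true, y_pred, digits=3))
```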