Multi-Label, Multi-Class Text Classification with BERT, Transformers, and Keras

This article shows how to use the BERT transformer and Keras for a multi-label, multi-class text classification task, walking through the relevant steps and technical details.


The internet is full of text classification articles, most of which combine a bag-of-words (BoW) representation with some kind of ML model, typically to solve a binary text classification problem. With the rise of NLP, and in particular BERT (take a look here, if you are not familiar with BERT) and other multilingual transformer-based models, more and more text classification problems can now be solved.


However, when it comes to solving a multi-label, multi-class text classification problem using Huggingface Transformers, BERT, and Tensorflow Keras, the number of articles is indeed very limited, and I, for one, haven't found any… yet!


Therefore, with the help and inspiration of a great many blog posts, tutorials, and GitHub code snippets relating to BERT, multi-label classification in Keras, and other useful information, I will show you how to build a working model that solves exactly that problem.


And why use Huggingface Transformers instead of Google's own BERT solution? Because with Transformers it is extremely easy to switch between different models, be it BERT, ALBERT, XLNet, GPT-2, etc., which means you more or less 'just' replace one model with another in your code.


Where to start

With data. Looking for text data I could use for a multi-label, multi-class text classification task, I stumbled upon the 'Consumer Complaint Database' from data.gov. It seems to do the trick, so that's what we'll use.


Next up is the exploratory data analysis. This is obviously crucial to get a proper understanding of what your data looks like, what pitfalls there might be, the quality of your data, and so on. But I'm skipping this step for now, simply because the aim of this article is purely to show how to build a model.


If you don't like googling around, take a look at these two articles on the subject: NLP Part 3 | Exploratory Data Analysis of Text Data and A Complete Exploratory Data Analysis and Visualization for Text Data.

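If you just want a quick feel for the data before modelling, a few lines of pandas will do. The sketch below is only illustrative (it assumes the same CSV export and column names used later in this post) and is no substitute for a proper EDA.

#######################################
### ------ Quick data peek -------- ###
# Illustrative only - assumes the same CSV used in the import step below
import pandas as pd

raw = pd.read_csv('dev/Fun with BERT/complaints.csv')
print(raw.shape)                                                  # rows x columns
print(raw['Product'].value_counts())                              # class balance for 'Product'
print(raw['Issue'].nunique())                                     # number of distinct 'Issue' labels
print(raw['Consumer complaint narrative'].str.len().describe())   # narrative length stats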

Get on with it

We have our data and now comes the coding part.


First, we’ll load the required libraries.


#######################################
### -------- Load libraries ------- ###
# Load Huggingface transformers
from transformers import TFBertModel, BertConfig, BertTokenizerFast
# Then what you need from tensorflow.keras
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical
# And pandas for data import + sklearn because you always need sklearn
import pandas as pd
from sklearn.model_selection import train_test_split

Then we will import our data and wrangle it around so it fits our needs. Nothing fancy there. Note that we will only use the columns ‘Consumer complaint narrative’, ‘Product’ and ‘Issue’ from our dataset. ‘Consumer complaint narrative’ will serve as our input for the model and ‘Product’ and ‘Issue’ as our two outputs.


#######################################
### --------- Import data --------- ###
# Import data from csv
data = pd.read_csv('dev/Fun with BERT/complaints.csv')
# Select required columns
data = data[['Consumer complaint narrative', 'Product', 'Issue']]
# Remove a row if any of the three remaining columns are missing
data = data.dropna()
# Remove rows where the label is present only once (can't be split)
data = data.groupby('Issue').filter(lambda x : len(x) > 1)
data = data.groupby('Product').filter(lambda x : len(x) > 1)
# Set your model output as categorical and save in new label col
data['Issue_label'] = pd.Categorical(data['Issue'])
data['Product_label'] = pd.Categorical(data['Product'])
# Transform your output to numeric
data['Issue'] = data['Issue_label'].cat.codes
data['Product'] = data['Product_label'].cat.codes
# Split into train and test - stratify over Issue
data, data_test = train_test_split(data, test_size = 0.2, stratify = data[['Issue']])

Next we will load a number of different Transformers classes.


#######################################
### --------- Setup BERT ---------- ###
# Name of the BERT model to use
model_name = 'bert-base-uncased'
# Max length of tokens
max_length = 100
# Load transformers config and set output_hidden_states to False
config = BertConfig.from_pretrained(model_name)
config.output_hidden_states = False
# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)
# Load the Transformers BERT model
transformer_model = TFBertModel.from_pretrained(model_name, config = config)

Here we first load a BERT config object that controls the model, tokenizer and so on.


Then we load a tokenizer, which we will use later in the script to transform our text input into BERT tokens and then pad and truncate it to our max length. The tokenizer is pretty well documented, so I won't go into detail here.

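To get a feel for what the tokenizer does, here is a small illustrative call on a made-up sentence; the sentence is just an example, and the arguments mirror the ones used on the full dataset later.

# Illustrative only: tokenize one made-up complaint and inspect the output
sample = tokenizer(
    text=['I was charged a fee that I never agreed to.'],
    add_special_tokens=True,    # adds the [CLS] and [SEP] tokens
    max_length=max_length,
    truncation=True,
    padding='max_length',       # pad up to max_length (100) tokens
    return_tensors='tf')
print(sample['input_ids'].shape)   # (1, 100) - one sequence of 100 token ids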

Lastly, we will load the BERT model itself as a BERT Transformers TF 2.0 Keras model (here we use the 12-layer bert-base-uncased).


Now for the fun part

We are ready to build our model. In the Transformers library, there are a number of different BERT classification models to use. The mother of all models is the one simply called ‘BertModel’ (PyTorch) or ‘TFBertModel’ (TensorFlow) and thus the one we want.


The Transformers library also comes with a prebuilt BERT model for sequence classification called 'TFBertForSequenceClassification'. If you take a look at the code found here, you'll see that they start by loading a clean BERT model and then simply add a dropout and a dense layer to it. Therefore, what we'll do is simply add two dense layers instead of just one.

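For comparison, here is a rough sketch of that prebuilt, single-output route; it gives you exactly one classification head, which is why it doesn't fit our two-output problem and why we build our own heads instead (the num_labels value below is just an example).

# Sketch only: the prebuilt single-head alternative (not used in this post)
from transformers import TFBertForSequenceClassification

single_head_model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=18)   # e.g. a single head for the 18 'Product' classes only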

Here is what our model looks like:


[Figure] The Multi-Label, Multi-Class Text Classification with BERT, Transformer and Keras model

And a more detailed view of the model:


Model: "BERT_MultiLabel_MultiClass"
___________________________________________________________________
Layer (type) Output Shape Param # Connected to
===================================================================
input_ids [(None, 100)] 0
(InputLayer)
___________________________________________________________________
bert ( 109482240 input_ids[0][0]
(TFBertMainLayer) (None, 100, 768),
(None, 768)
)
___________________________________________________________________
pooled_output (None, 768) 0 bert[1][1]
(Dropout)
___________________________________________________________________
issue (None, 159) 122271 pooled_output[0][0]
(Dense)
___________________________________________________________________
product (None, 18) 13842 pooled_output[0][0]
(Dense)
===================================================================
Total params: 109,618,353
Trainable params: 109,618,353
Non-trainable params: 0
___________________________________________________________________
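As a sanity check, the head sizes follow directly from the architecture: each Dense layer sees the 768-dimensional pooled output, so the 'issue' head has 768 × 159 + 159 = 122,271 parameters, the 'product' head has 768 × 18 + 18 = 13,842, and together with BERT's 109,482,240 that gives the 109,618,353 total.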

If you want to know more about BERT's architecture itself, take a look here.


Now that we have our model architecture, all we need to do is write it in code.


#######################################
### ------- Build the model ------- ###
# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model
# Load the MainLayer
bert = transformer_model.layers[0]
# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
inputs = {'input_ids': input_ids}
# Load the Transformers BERT model as a layer in a Keras model
bert_model = bert(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(bert_model, training=False)
# Then build your model output
issue = Dense(units=len(data.Issue_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='issue')(pooled_output)
product = Dense(units=len(data.Product_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='product')(pooled_output)
outputs = {'issue': issue, 'product': product}
# And combine it all in a model object
model = Model(inputs=inputs, outputs=outputs, name='BERT_MultiLabel_MultiClass')
# Take a look at the model
model.summary()

Let the magic begin

Then all that is left to do is compile our new model and fit it to our data.


#######################################
### ------- Train the model ------- ###
# Set an optimizer
optimizer = Adam(
    learning_rate=5e-05,
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)
# Set loss and metrics
loss = {'issue': CategoricalCrossentropy(from_logits = True), 'product': CategoricalCrossentropy(from_logits = True)}
metric = {'issue': CategoricalAccuracy('accuracy'), 'product': CategoricalAccuracy('accuracy')}
# Compile the model
model.compile(
    optimizer = optimizer,
    loss = loss,
    metrics = metric)
# Ready output data for the model
y_issue = to_categorical(data['Issue'])
y_product = to_categorical(data['Product'])
# Tokenize the input (takes some time)
x = tokenizer(
    text=data['Consumer complaint narrative'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = False,
    verbose = True)
# Fit the model
history = model.fit(
    x={'input_ids': x['input_ids']},
    y={'issue': y_issue, 'product': y_product},
    validation_split=0.2,
    batch_size=64,
    epochs=10)
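One small note: the EarlyStopping callback imported at the top is never actually passed to fit above. If you want it, a minimal sketch could look like this (monitoring the overall validation loss; the patience value is only an example):

# Optional sketch: stop training once the validation loss stops improving
early_stopping = EarlyStopping(
    monitor='val_loss',          # combined validation loss over both outputs
    patience=2,                  # example value
    restore_best_weights=True)

history = model.fit(
    x={'input_ids': x['input_ids']},
    y={'issue': y_issue, 'product': y_product},
    validation_split=0.2,
    batch_size=64,
    epochs=10,
    callbacks=[early_stopping])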

Once the model is fitted, we can evaluate it on our test data to see how it performs.


#######################################
### ----- Evaluate the model ------ ###
# Ready test data
test_y_issue = to_categorical(data_test['Issue'])
test_y_product = to_categorical(data_test['Product'])
test_x = tokenizer(
    text=data_test['Consumer complaint narrative'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding=True,
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = False,
    verbose = True)
# Run evaluation
model_eval = model.evaluate(
    x={'input_ids': test_x['input_ids']},
    y={'issue': test_y_issue, 'product': test_y_product}
)

As it turns out, our model performs fairly well and achieves relatively good accuracy, especially considering that the 'Product' output consists of 18 labels and the 'Issue' output of 159 different labels.


####################################################################
Classification metrics for Product

                                                     precision  recall  f1-score  support
Bank account or service                                   0.63    0.36      0.46     2977
Checking or savings account                               0.60    0.75      0.67     4685
Consumer Loan                                             0.48    0.29      0.36     1876
Credit card                                               0.56    0.42      0.48     3765
Credit card or prepaid card                               0.63    0.71      0.67     8123
Credit reporting                                          0.64    0.37      0.47     6318
Credit reporting, credit repair services,
  or other personal consumer reports                      0.81    0.85      0.83    38529
Debt collection                                           0.80    0.85      0.82    23848
Money transfer, virtual currency, or money service        0.59    0.65      0.62     1966
Money transfers                                           0.50    0.01      0.01      305
Mortgage                                                  0.89    0.93      0.91    13502
Other financial service                                   0.00    0.00      0.00       60
Payday loan                                               0.57    0.01      0.02      355
Payday loan, title loan, or personal loan                 0.46    0.40      0.43     1523
Prepaid card                                              0.82    0.14      0.24      294
Student loan                                              0.83    0.87      0.85     5332
Vehicle loan or lease                                     0.49    0.51      0.50     1963
Virtual currency                                          0.00    0.00      0.00        3

accuracy                                                                    0.76   115424
macro avg                                                 0.57    0.45      0.46   115424
weighted avg                                              0.75    0.76      0.75   115424

####################################################################
Classification metrics for Issue (only showing summarized metrics)

                                                     precision  recall  f1-score  support
accuracy                                                                    0.41   115424
macro avg                                                 0.09    0.08      0.06   115424
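The per-class numbers above were not produced by the snippets shown so far; one way to generate such a report for the 'Product' output with scikit-learn is sketched below. Note that, depending on your TensorFlow version, model.predict may return either a dict keyed by output name or a plain list, which the sketch handles defensively.

# Sketch only: per-class report for 'Product' (assumes model, test_x, data and data_test from above)
import numpy as np
from sklearn.metrics import classification_report

preds = model.predict(x={'input_ids': test_x['input_ids']}, batch_size=64)
product_logits = preds['product'] if isinstance(preds, dict) else preds[1]  # second output if a list
product_pred = np.argmax(product_logits, axis=1)

product_names = data['Product_label'].cat.categories
print(classification_report(
    data_test['Product'],                  # true category codes
    product_pred,                          # predicted category codes
    labels=np.arange(len(product_names)),
    target_names=product_names))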

What to do next?

There are, however, plenty of things you could do to increase the performance of this model. Here I have tried to keep things as simple as possible, but if you are looking for better performance, consider the following:


  • Fiddle around with the hyperparameters set in the optimizer or change the optimizer itself

  • Train a language model using the Consumer Complaint Database data, either from scratch or by fine-tuning an existing BERT model (have a look here to see how). Then load that model instead of the 'bert-base-uncased' used here.


  • Use multiple inputs. In our current setup, we only use token IDs as input. However, we could (probably) gain some performance if we added attention masks to the input. It is pretty straightforward and looks something like this:

# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
attention_mask = Input(shape=(max_length,), name='attention_mask', dtype='int32')
inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}

(remember to add attention_mask when fitting your model and set return_attention_mask to True in your tokenizer. For more info on attention masks, look here. Also I have added attention_mask to the gist below and commented it out for your inspiration.)


  • Try another model such as ALBERT, RoBERTa, XLM, or even an autoregressive model such as GPT-2 or XLNet; all of them are easily imported into your framework through the Transformers library (a minimal sketch of such a swap follows below). You can find an overview of all the directly available models here.

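As a rough sketch of how small such a swap can be, here is the setup step with 'roberta-base' as an example checkpoint; TFRobertaModel also exposes a pooled output, so the model-building code above stays essentially unchanged (the checkpoint choice is only an illustration):

# Sketch only: swap BERT for RoBERTa - the downstream Keras code stays the same
from transformers import TFRobertaModel, RobertaConfig, RobertaTokenizerFast

model_name = 'roberta-base'   # example checkpoint
config = RobertaConfig.from_pretrained(model_name)
config.output_hidden_states = False
tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
transformer_model = TFRobertaModel.from_pretrained(model_name, config=config)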

That’s it — hope you like this little walk-through of how to do a ‘Multi-Label, Multi-Class Text Classification with BERT, Transformer and Keras’. If you have any feedback or questions, fire away in the comments below.


#######################################
### -------- Load libraries ------- ###


# Load Huggingface transformers
from transformers import TFBertModel,  BertConfig, BertTokenizerFast


# Then what you need from tensorflow.keras
from tensorflow.keras.layers import Input, Dropout, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import TruncatedNormal
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import CategoricalAccuracy
from tensorflow.keras.utils import to_categorical


# And pandas for data import + sklearn because you always need sklearn
import pandas as pd
from sklearn.model_selection import train_test_split




#######################################
### --------- Import data --------- ###


# Import data from csv
data = pd.read_csv('dev/Fun with BERT/complaints.csv')


# Select required columns
data = data[['Consumer complaint narrative', 'Product', 'Issue']]


# Remove a row if any of the three remaining columns are missing
data = data.dropna()


# Remove rows where the label is present only once (can't be split)
data = data.groupby('Issue').filter(lambda x : len(x) > 1)
data = data.groupby('Product').filter(lambda x : len(x) > 1)


# Set your model output as categorical and save in new label col
data['Issue_label'] = pd.Categorical(data['Issue'])
data['Product_label'] = pd.Categorical(data['Product'])


# Transform your output to numeric
data['Issue'] = data['Issue_label'].cat.codes
data['Product'] = data['Product_label'].cat.codes


# Split into train and test - stratify over Issue
data, data_test = train_test_split(data, test_size = 0.2, stratify = data[['Issue']])




#######################################
### --------- Setup BERT ---------- ###


# Name of the BERT model to use
model_name = 'bert-base-uncased'


# Max length of tokens
max_length = 100


# Load transformers config and set output_hidden_states to False
config = BertConfig.from_pretrained(model_name)
config.output_hidden_states = False


# Load BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path = model_name, config = config)


# Load the Transformers BERT model
transformer_model = TFBertModel.from_pretrained(model_name, config = config)




#######################################
### ------- Build the model ------- ###


# TF Keras documentation: https://www.tensorflow.org/api_docs/python/tf/keras/Model


# Load the MainLayer
bert = transformer_model.layers[0]


# Build your model input
input_ids = Input(shape=(max_length,), name='input_ids', dtype='int32')
# attention_mask = Input(shape=(max_length,), name='attention_mask', dtype='int32') 
# inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}
inputs = {'input_ids': input_ids}


# Load the Transformers BERT model as a layer in a Keras model
bert_model = bert(inputs)[1]
dropout = Dropout(config.hidden_dropout_prob, name='pooled_output')
pooled_output = dropout(bert_model, training=False)


# Then build your model output
issue = Dense(units=len(data.Issue_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='issue')(pooled_output)
product = Dense(units=len(data.Product_label.value_counts()), kernel_initializer=TruncatedNormal(stddev=config.initializer_range), name='product')(pooled_output)
outputs = {'issue': issue, 'product': product}


# And combine it all in a model object
model = Model(inputs=inputs, outputs=outputs, name='BERT_MultiLabel_MultiClass')


# Take a look at the model
model.summary()




#######################################
### ------- Train the model ------- ###


# Set an optimizer
optimizer = Adam(
    learning_rate=5e-05,
    epsilon=1e-08,
    decay=0.01,
    clipnorm=1.0)


# Set loss and metrics
loss = {'issue': CategoricalCrossentropy(from_logits = True), 'product': CategoricalCrossentropy(from_logits = True)}
metric = {'issue': CategoricalAccuracy('accuracy'), 'product': CategoricalAccuracy('accuracy')}


# Compile the model
model.compile(
    optimizer = optimizer,
    loss = loss, 
    metrics = metric)


# Ready output data for the model
y_issue = to_categorical(data['Issue'])
y_product = to_categorical(data['Product'])


# Tokenize the input (takes some time)
x = tokenizer(
    text=data['Consumer complaint narrative'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = True,
    verbose = True)


# Fit the model
history = model.fit(
    # x={'input_ids': x['input_ids'], 'attention_mask': x['attention_mask']},
    x={'input_ids': x['input_ids']},
    y={'issue': y_issue, 'product': y_product},
    validation_split=0.2,
    batch_size=64,
    epochs=10)




#######################################
### ----- Evaluate the model ------ ###


# Ready test data
test_y_issue = to_categorical(data_test['Issue'])
test_y_product = to_categorical(data_test['Product'])
test_x = tokenizer(
    text=data_test['Consumer complaint narrative'].to_list(),
    add_special_tokens=True,
    max_length=max_length,
    truncation=True,
    padding=True, 
    return_tensors='tf',
    return_token_type_ids = False,
    return_attention_mask = False,
    verbose = True)


# Run evaluation
model_eval = model.evaluate(
    x={'input_ids': test_x['input_ids']},
    y={'issue': test_y_issue, 'product': test_y_product}
)

Translated from: https://towardsdatascience.com/multi-label-multi-class-text-classification-with-bert-transformer-and-keras-c6355eccb63a
