(10-3-05-01) Bank Consumer Complaint Handling Model

This article shows how to split the data with Scikit-Learn, visualise the label distribution of the training and test sets with Plotly Express, and load and process the text data with Hugging Face Transformers, covering the use of a pretrained model, GPU acceleration, and hidden-state extraction in support of a text-classification task.

10.3.5  Building the Model

(1) Use the function train_test_split from Scikit-Learn to split the data into a training set and a test set, and print basic information about the resulting sets. The implementation code is as follows.

import pandas as pd
from sklearn.model_selection import train_test_split as tts

# Stratified 90/10 split so that every label keeps the same proportion
# in the training and test sets
train_files, test_files, train_labels, test_labels = tts(
    df_data['text'], df_data['label'],
    test_size=0.1, random_state=32, stratify=df_data['label'])

# Rebuild DataFrames that pair each text with its label
train_files = pd.DataFrame(train_files)
test_files = pd.DataFrame(test_files)
train_files['label'] = train_labels
test_files['label'] = test_labels

print(type(train_files))
print('Training Data', train_files.shape)
print('Validation Data', test_files.shape)

After execution, the output is:

<class 'pandas.core.frame.DataFrame'>
Training Data (26100, 2)
Validation Data (2901, 2)
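Because stratify=df_data['label'] was passed, both splits should keep almost identical label proportions. A minimal sanity check, not part of the original code:

# Compare label proportions across the two splits; with a stratified
# split the 'train' and 'test' columns should match almost exactly
proportions = pd.concat(
    [train_files['label'].value_counts(normalize=True),
     test_files['label'].value_counts(normalize=True)],
    axis=1, keys=['train', 'test'])
print(proportions.round(3))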

(2) Use the Plotly Express library to create a bar chart that visualises the distribution of the different labels in the training and test sets. The implementation code is as follows.

import plotly.express as px

# Count samples per label in each split and stack the counts side by side
train_values = train_files['label'].value_counts()
test_values = test_files['label'].value_counts()
visual = pd.concat([train_values, test_values], axis=1)
visual = visual.T
visual.index = ['train', 'test']

# Grouped bar chart comparing the two splits
fig = px.bar(visual, template='plotly_white',
             barmode='group', text_auto=True, height=300,
             title='Train/Test Split Distribution')

fig.show("png")

The purpose of this code is to visualise how the samples of each label are distributed across the training and test sets, which tells us whether the class distribution of the dataset is balanced. If the sample counts differ greatly between classes, measures for handling class imbalance may be needed (a sketch of one common option follows the figure caption below). The execution result is shown in Figure 10-34.

Figure 10-34  Distribution of samples per label in the training and test sets
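If the figure reveals a strong imbalance, one common remedy is to weight the training loss by inverse class frequency. A minimal sketch using scikit-learn's compute_class_weight, offered as an illustration rather than part of the original pipeline:

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# 'balanced' weights are inversely proportional to class frequency,
# so rare complaint categories receive larger weights
classes = np.sort(train_files['label'].unique())
weights = compute_class_weight(class_weight='balanced',
                               classes=classes,
                               y=train_files['label'])
print(dict(zip(classes, np.round(weights, 2))))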

(3) Inspect the DataFrame train_files to display the sample texts and labels of the training set. The implementation code is as follows.

train_files

After execution, the output is:

       text                                                label
8290   I paid off all of my bills and should not have...  debt collection
1520   Several checks were issued from XXXX for possi...  other financial service
23071  In XXXX, we my husband and myself took out a l...  student loan
4123   I use an Amex Serve card ( a prepaid debit car...  prepaid card
16470  Despite YEARS of stellar credit reports and sc...  credit reporting
...    ...                                                 ...
21733  I know that I am victim of student loan scam. ...  student loan
27578  On XX/XX/XXXX I made a payment of {$380.00} to...  bank account or service
26297  XXXX XXXX XXXX XXXX XXXX, AZ XXXX : ( XXXX ) X...  bank account or service
17867  After nearly a decade of business with Bank of...  credit card
144    On XXXX XXXX, 2015, I made a purchase on EBay....  money transfers

(4) Use the Hugging Face Transformers and Datasets libraries to load and process the text dataset. The implementation code is as follows.

import transformers
transformers.logging.set_verbosity_error()
import warnings; warnings.filterwarnings('ignore')
import os; os.environ['WANDB_DISABLED'] = 'true'
from datasets import Dataset, Features, Value, ClassLabel, DatasetDict

# Convert the pandas DataFrames into Hugging Face Dataset objects and
# cast the string labels to ClassLabel features
traindts = Dataset.from_pandas(train_files)
traindts = traindts.class_encode_column("label")
testdts = Dataset.from_pandas(test_files)
testdts = testdts.class_encode_column("label")

The main purpose of this code is to prepare the text dataset for a text-classification model: it loads the data, encodes the labels, and builds the training and test sets. After execution, the output is:

Casting to class labels: 100% 27/27 [00:00<00:00, 80.11ba/s]
Casting the dataset: 100% 3/3 [00:00<00:00, 13.46ba/s]
Casting to class labels: 100% 3/3 [00:00<00:00, 44.10ba/s]
Casting the dataset: 100% 1/1 [00:00<00:00, 19.15ba/s]

(5) Create a DatasetDict object named corpus containing two subsets, "train" and "validation". Each subset is a Hugging Face Datasets Dataset object holding the training data or the validation data, respectively. Note the extra __index_level_0__ column in the output below: it is simply the original pandas index that Dataset.from_pandas carries over. The implementation code is as follows.

# Bundle the two splits into a single DatasetDict
corpus = DatasetDict({"train": traindts,
                      "validation": testdts})
corpus['train']

After execution, the output is:

Dataset({
    features: ['text', 'label', '__index_level_0__'],
    num_rows: 26100
})

(6) Use AutoTokenizer from the Hugging Face Transformers library to load a pretrained tokenizer and use it to tokenise the text dataset. The implementation code is as follows.

from transformers import AutoTokenizer

# Load the tokenizer that matches the DistilBERT checkpoint
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Pad every sequence in the batch to the same length and truncate
# anything longer than the model's maximum input size
def tokenise(batch):
    return tokenizer(batch["text"],
                     padding=True,
                     truncation=True)

# batch_size=None tokenises each split as a single batch
corpus_tokenised = corpus.map(tokenise,
                              batched=True,
                              batch_size=None)

print(corpus_tokenised["train"].column_names)

The main purpose of this code is to tokenise the text dataset so that it can be passed to the pretrained language model for training or inference.
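To make the effect of tokenisation concrete, here is a quick check on a single made-up complaint sentence (illustrative only, not part of the original code):

# Encode one sentence and inspect the pieces the model will receive
encoded = tokenizer("I paid off all of my bills.")
print(encoded["input_ids"])       # integer token ids, incl. [CLS]/[SEP]
print(encoded["attention_mask"])  # 1 for every real (non-padding) token
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'i', 'paid', 'off', 'all', 'of', 'my', 'bills', '.', '[SEP]']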

(7) Use AutoModel from the Hugging Face Transformers library to load a pretrained model and move it to the available compute device (GPU or CPU) for subsequent tasks such as text classification. The implementation code is as follows.

from transformers import AutoModel
import torch

# Load the pretrained DistilBERT body and move it to the GPU if one is
# available, otherwise fall back to the CPU
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model = AutoModel.from_pretrained(model_ckpt).to(device)

After execution, the output is:

Downloading: 100% 28.0/28.0 [00:00<00:00, 568B/s]
Downloading: 100% 483/483 [00:00<00:00, 7.12kB/s]
Downloading: 100% 226k/226k [00:00<00:00, 3.04MB/s]
Downloading: 100% 455k/455k [00:00<00:00, 4.63MB/s]
100% 1/1 [00:27<00:00, 27.89s/ba]
100% 1/1 [00:03<00:00, 3.85s/ba]
['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask']

(8) Write the function extract_hidden_states(batch), which feeds the text data into the model and retrieves the model's hidden-state representations for later use in tasks such as text classification. The implementation code is as follows.

# Extract hidden states
def extract_hidden_states(batch):

    # Place the model inputs on the GPU
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}

    # Extract the last hidden states without tracking gradients
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state

    # Return the vector for the [CLS] token (position 0)
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}

After execution, the output is:

cuda
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading: 100% 256M/256M [00:07<00:00, 34.2MB/s]
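To see why the function indexes last_hidden_state[:, 0], note that DistilBERT returns one 768-dimensional vector per token, and the [CLS] token always sits at position 0. A small shape check under the same setup (the two sample sentences are made up):

# Run two sample sentences through the model and inspect the shape of
# the returned hidden states: (batch_size, sequence_length, 768)
sample = tokenizer(["I paid off all of my bills.",
                    "Please fix my credit report."],
                   padding=True, truncation=True, return_tensors="pt")
sample = {k: v.to(device) for k, v in sample.items()}
with torch.no_grad():
    out = model(**sample).last_hidden_state
print(out.shape)        # e.g. torch.Size([2, 10, 768])
print(out[:, 0].shape)  # the two [CLS] vectors: torch.Size([2, 768])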

(9) Convert the features of the dataset corpus_tokenised to PyTorch tensor format so that they can be processed and trained on in PyTorch. The implementation code is as follows.

# Expose the tokenised columns as PyTorch tensors
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised

After execution, the output is:

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 26100
    })
    validation: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
        num_rows: 2901
    })
})
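After set_format, indexing a row returns PyTorch tensors for the selected columns only (a quick illustrative check, not in the original code):

# Fetch one formatted row and confirm its type and contents
row = corpus_tokenised["train"][0]
print(type(row["input_ids"]))  # <class 'torch.Tensor'>
print(row.keys())              # only the formatted columns are returned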

(10) Use the previously defined extract_hidden_states function to extract the dataset's hidden states from the model with GPU acceleration, adding them to the dataset as a new column for later use. The implementation code is as follows.

# Run the whole corpus through the model in batches of 32 and store
# the [CLS] hidden state of every sample as a new column
corpus_hidden = corpus_tokenised.map(extract_hidden_states,
                                     batched=True,
                                     batch_size=32)
corpus_hidden["train"].column_names

After execution, the output is:

100% 816/816 [03:56<00:00, 3.65ba/s]
100% 91/91 [00:26<00:00, 3.83ba/s]
['text',
 'label',
 '__index_level_0__',
 'input_ids',
 'attention_mask',
 'hidden_state']
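The section ends here with the hidden states stored in corpus_hidden; the classifier itself is covered in the continuation. As a sketch of the typical next step (an assumption about what follows, not the original code), the stored [CLS] vectors can be assembled into plain feature matrices:

import numpy as np

# Turn the stored hidden states and labels into numpy arrays that any
# classic classifier (e.g. logistic regression) can consume
X_train = np.array(corpus_hidden["train"]["hidden_state"])
X_valid = np.array(corpus_hidden["validation"]["hidden_state"])
y_train = np.array(corpus_hidden["train"]["label"])
y_valid = np.array(corpus_hidden["validation"]["label"])
print(X_train.shape, X_valid.shape)  # expected: (26100, 768) (2901, 768)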

To be continued.
