10.3.5 Building the Model
(1) Use the Scikit-Learn function train_test_split to split the data into a training set and a test set, and print basic information about the resulting datasets. The implementation code is as follows.
import pandas as pd
from sklearn.model_selection import train_test_split as tts

# Stratified 90/10 split so both sets keep the same label proportions
train_files, test_files, train_labels, test_labels = tts(
    df_data['text'], df_data['label'],
    test_size=0.1, random_state=32, stratify=df_data['label'])
train_files = pd.DataFrame(train_files)
test_files = pd.DataFrame(test_files)
train_files['label'] = train_labels
test_files['label'] = test_labels
print(type(train_files))
print('Training Data', train_files.shape)
print('Validation Data', test_files.shape)
After execution, the output is:
<class 'pandas.core.frame.DataFrame'>
Training Data (26100, 2)
Validation Data (2901, 2)
(2) Use the Plotly Express library to create a bar chart that visualizes the label distribution in the training and test sets. The implementation code is as follows.
import plotly.express as px

# Count the samples per label in each split and combine them into one frame
train_values = train_files['label'].value_counts()
test_values = test_files['label'].value_counts()
visual = pd.concat([train_values, test_values], axis=1)
visual = visual.T
visual.index = ['train', 'test']
fig = px.bar(visual, template='plotly_white',
             barmode='group', text_auto=True, height=300,
             title='Train/Test Split Distribution')
fig.show("png")
The purpose of this code is to visualize the per-label sample distribution in the training and test sets, which helps us judge whether the class distribution is balanced. If the sample counts differ greatly between classes, measures for handling class imbalance may be needed. The result is shown in Figure 10-34.
Figure 10-34 Label distribution of the training and test sets
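If the bar chart reveals a skewed distribution, one common remedy is to weight each class inversely to its frequency. The sketch below illustrates the idea in pure Python; the helper name and the label counts are made up for this example (the formula is the same one Scikit-Learn's class_weight='balanced' option uses):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so that the
    count-weighted average weight over all samples equals 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical, heavily imbalanced label list
labels = ["credit card"] * 80 + ["student loan"] * 15 + ["money transfers"] * 5
weights = inverse_frequency_weights(labels)
print(weights)  # the rarest class receives the largest weight
```

Such weights can then be passed to a loss function or classifier so that rare classes contribute proportionally more to training.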
(3) Inspect the DataFrame train_files to display the training samples and their labels. The implementation code is as follows.
train_files
After execution, the output is:
text label
8290 I paid off all of my bills and should not have... debt collection
1520 Several checks were issued from XXXX for possi... other financial service
23071 In XXXX, we my husband and myself took out a l... student loan
4123 I use an Amex Serve card ( a prepaid debit car... prepaid card
16470 Despite YEARS of stellar credit reports and sc... credit reporting
... ... ...
21733 I know that I am victim of student loan scam. ... student loan
27578 On XX/XX/XXXX I made a payment of {$380.00} to... bank account or service
26297 XXXX XXXX XXXX XXXX XXXX, AZ XXXX : ( XXXX ) X... bank account or service
17867 After nearly a decade of business with Bank of... credit card
144 On XXXX XXXX, 2015, I made a purchase on EBay.... money transfers
(4) Use the Hugging Face Transformers and Datasets libraries to load and process the text dataset. The implementation code is as follows.
import transformers
transformers.logging.set_verbosity_error()
import warnings; warnings.filterwarnings('ignore')
import os; os.environ['WANDB_DISABLED'] = 'true'

from datasets import Dataset, Features, Value, ClassLabel, DatasetDict

# Build Hugging Face Datasets from the DataFrames and encode labels as ids
traindts = Dataset.from_pandas(train_files)
traindts = traindts.class_encode_column("label")
testdts = Dataset.from_pandas(test_files)
testdts = testdts.class_encode_column("label")
The main purpose of this code is to prepare the text dataset for a text-classification model: it loads the data, encodes the labels, and builds the training and test sets. After execution, the output is:
Casting to class labels: 100% 27/27 [00:00<00:00, 80.11ba/s]
Casting the dataset: 100% 3/3 [00:00<00:00, 13.46ba/s]
Casting to class labels: 100% 3/3 [00:00<00:00, 44.10ba/s]
Casting the dataset: 100% 1/1 [00:00<00:00, 19.15ba/s]
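What class_encode_column does, conceptually, is replace each string label with an integer id and record the id-to-name mapping in the dataset's features. The following pure-Python sketch illustrates the idea (this is not the actual datasets implementation, just the mapping logic):

```python
def class_encode(labels):
    """Map string labels to integer ids; sorting the unique names
    keeps the ids stable across runs."""
    names = sorted(set(labels))
    str2int = {name: i for i, name in enumerate(names)}
    return [str2int[l] for l in labels], names

ids, names = class_encode(["debt collection", "credit card", "debt collection"])
print(ids)    # [1, 0, 1] — integer-encoded labels
print(names)  # ['credit card', 'debt collection'] — id -> name lookup
```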
(5) Create a DatasetDict named corpus with two subsets, "train" and "validation". Each subset is a Hugging Face Datasets Dataset object holding the training or validation data. The implementation code is as follows.
corpus = DatasetDict({"train": traindts,
                      "validation": testdts})
corpus['train']
After execution, the output is:
Dataset({
features: ['text', 'label', '__index_level_0__'],
num_rows: 26100
})
(6) Use AutoTokenizer from the Hugging Face Transformers library to load a pretrained tokenizer and use it to tokenize the text dataset. The implementation code is as follows.
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenise(batch):
    # Pad each batch to its longest sequence; truncate at the model's limit
    return tokenizer(batch["text"],
                     padding=True,
                     truncation=True)

corpus_tokenised = corpus.map(tokenise,
                              batched=True,
                              batch_size=None)
print(corpus_tokenised["train"].column_names)
The main purpose of this code is to tokenize the text dataset so that it can be passed to the pretrained language model for training or inference.
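The combined effect of padding=True and truncation=True can be illustrated with a toy whitespace tokenizer. Everything below is made up for illustration: DistilBERT actually uses WordPiece subword tokenization with a maximum length of 512, not a word-level vocabulary:

```python
def toy_tokenise(batch_texts, vocab, max_len=8, pad_id=0):
    """Whitespace-split, map words to ids (1 = unknown), truncate to
    max_len, then pad every row to the longest remaining sequence."""
    seqs = [[vocab.get(tok, 1) for tok in text.split()][:max_len]
            for text in batch_texts]
    longest = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (longest - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

vocab = {"i": 2, "paid": 3, "off": 4, "my": 5, "bills": 6}
out = toy_tokenise(["i paid off my bills", "i paid"], vocab)
print(out["input_ids"])       # both rows padded to length 5
print(out["attention_mask"])  # 0 marks padding positions
```

The attention mask is what lets the model ignore the padded positions, which is why the real tokenizer returns it alongside input_ids.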
(7) Use AutoModel from the Hugging Face Transformers library to load a pretrained model and move it to the available compute device (GPU or CPU) for subsequent tasks such as text classification. The implementation code is as follows.
from transformers import AutoModel
import torch
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model = AutoModel.from_pretrained(model_ckpt).to(device)
After execution, the output is:
Downloading: 100% 28.0/28.0 [00:00<00:00, 568B/s]
Downloading: 100% 483/483 [00:00<00:00, 7.12kB/s]
Downloading: 100% 226k/226k [00:00<00:00, 3.04MB/s]
Downloading: 100% 455k/455k [00:00<00:00, 4.63MB/s]
100% 1/1 [00:27<00:00, 27.89s/ba]
100% 1/1 [00:03<00:00, 3.85s/ba]
['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask']
(8) Write a function extract_hidden_states(batch) that feeds the text data into the model and retrieves the model's hidden-state representations for later use in tasks such as text classification. The implementation code is as follows.
# Extract hidden states
def extract_hidden_states(batch):
    # Move the model inputs to the GPU
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Extract the last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return the vector of the [CLS] token
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}
After execution, the output is:
cuda
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Downloading: 100% 256M/256M [00:07<00:00, 34.2MB/s]
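The slice last_hidden_state[:, 0] in the function above keeps, for every example in the batch, only the hidden vector of its first token (the [CLS] position). The same indexing can be demonstrated with numpy; the batch and sequence sizes below are illustrative, while 768 is DistilBERT's actual hidden size:

```python
import numpy as np

batch, seq_len, hidden = 4, 10, 768
# Stand-in for the model output: one hidden vector per token per example
last_hidden_state = np.random.rand(batch, seq_len, hidden)

cls_vectors = last_hidden_state[:, 0]  # first token of every sequence
print(cls_vectors.shape)  # (4, 768): one fixed-size vector per example
```

This is why the extracted features are suitable for a downstream classifier: regardless of how long each complaint is, every example is reduced to a single 768-dimensional vector.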
(9) Convert the features of the dataset corpus_tokenised to PyTorch tensor format so that they can be further processed and used for training in PyTorch. The implementation code is as follows.
corpus_tokenised.set_format("torch",
                            columns=["input_ids", "attention_mask", "label"])
corpus_tokenised
After execution, the output is:
DatasetDict({
train: Dataset({
features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
num_rows: 26100
})
validation: Dataset({
features: ['text', 'label', '__index_level_0__', 'input_ids', 'attention_mask'],
num_rows: 2901
})
})
(10) Use the previously defined extract_hidden_states function to extract the dataset's hidden states from the model, with GPU acceleration, and add them to the dataset for later use. The implementation code is as follows.
corpus_hidden = corpus_tokenised.map(extract_hidden_states,
                                     batched=True,
                                     batch_size=32)
corpus_hidden["train"].column_names
After execution, the output is:
100% 816/816 [03:56<00:00, 3.65ba/s]
100% 91/91 [00:26<00:00, 3.83ba/s]
['text',
'label',
'__index_level_0__',
'input_ids',
'attention_mask',
'hidden_state']
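The hidden_state column added here serves as a fixed feature vector per complaint. A typical next step is to stack these vectors into a feature matrix before fitting a simple classifier; the numpy sketch below uses toy data (random vectors and made-up labels) purely to show the shapes involved:

```python
import numpy as np

# Hypothetical per-example hidden states, as stored in the mapped dataset
hidden_states = [np.random.rand(768) for _ in range(5)]
labels = [0, 2, 1, 0, 3]

X = np.stack(hidden_states)  # feature matrix: one row per complaint
y = np.array(labels)
print(X.shape, y.shape)  # (5, 768) (5,)
```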