Notes on Solving the Out-of-Memory Problem When Loading a Very Large Dataset in PyTorch


Problem Description

For a project, I needed to fine-tune BERT on a dataset with more than 20 million samples spread over more than 20,000 classes. I took 2% of the data (about 500,000 samples) as the test set and used the remaining 20-million-plus samples as the training set.

I adapted the fine-tuning procedure for BERT on the IMDb dataset described in the Fine-tuning with custom datasets guide of the official Transformers documentation. The original code is as follows:

train_texts, train_labels = read_imdb_split('aclImdb/train')
test_texts, test_labels = read_imdb_split('aclImdb/test')

from sklearn.model_selection import train_test_split
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

# Fine tune with Trainer

from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

In that guide, train_texts, val_texts, and test_texts are all tokenized up front. I initially did the same when adapting the code, but this has several drawbacks:

  1. Tokenization is slow. The plain BertTokenizer is extremely slow; BertTokenizerFast is noticeably faster and is recommended (see the sketch after this list).
  2. Memory usage is high. Tokenizing train_texts immediately ran out of memory, and JupyterLab was killed.
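
As an aside, a minimal sketch of swapping in the fast (Rust-backed) tokenizer; the checkpoint name bert-base-uncased here is only an illustrative assumption, not necessarily the one used in my project:

from transformers import BertTokenizer, BertTokenizerFast

# Illustrative checkpoint name; substitute whatever model you are fine-tuning.
model_name = "bert-base-uncased"

# Slow, pure-Python implementation.
slow_tokenizer = BertTokenizer.from_pretrained(model_name)

# Rust-backed implementation with the same call signature, much faster on large corpora.
fast_tokenizer = BertTokenizerFast.from_pretrained(model_name)

# Both are used the same way, so the fast version can be swapped in directly.
encodings = fast_tokenizer(["a short example sentence"], truncation=True, padding=True, max_length=32)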

Solution

When preparing a dataset, there is no need to process all of the data up front, especially for a very large dataset. We can instead use a "lazy dataset": a batch is only loaded and processed when training actually needs it. The implementation is as follows:

class LazyTextMAG_Dataset(torch.utils.data.Dataset):
    """
    Works with datasets of simple lines of text. Lines are loaded and tokenized
    lazily rather than being pulled into memory up-front. This reduces the memory
    footprint when using large datasets, and also remedies a problem seen when using
    the other Datasets (above) whereby they take too long to load all
    of the data and tokenize it before doing any training.

    The file i/o work is handled within self.examples. This class just indexes
    into that object and applies the tokenization.
    """
    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        """
        :args:
            tokenizer: tokenizer.implementations.BaseTokenizer object (instantiated)
                     : This tokenizer will be directly applied to the text data
                       to prepare the data for passing through the model.
            filepath: str
                     : Path to the data file to be used.
            label2mid_dict: dict
                            : key is label string, value is label id
            block_size: int
                      : The maximum length of a sequence (truncated beyond this length).
        :returns: None.
        """
        self.texts, self.labels = self.read_mag_file(filepath, label2mid_dict)
        self.label2mid_dict = label2mid_dict
        self.tokenizer = tokenizer
        self.max_len = block_size

        
    def __len__(self):
        return len(self.labels)
    
            
    def read_mag_file(self, filepath, label2mid_dict):
        """Read the whole file into parallel lists of raw texts and label ids."""
        texts = []
        labels = []
        with open(filepath, "r", encoding="utf-8") as f:
            for line in f:
                # Each line holds the original text and its label string,
                # separated by a double tab.
                ori, nor = line.replace("\n", "").split("\t\t")
                mid = label2mid_dict[nor]
                texts.append(ori)
                labels.append(mid)

        return texts, labels

    
    def _text_to_encoding(self, item):
        """
        Defines the logic for transforming a single raw text item to a tokenized
        tensor ready to be passed into a model.

        :args:
            item: str
                : The text item as a string to be passed to the tokenizer.
        """
        return self.tokenizer(item, padding='max_length', truncation=True, max_length=self.max_len)

    
    def _text_to_item(self, text):
        """
        Convenience function to encapsulate re-used logic for converting raw
        text to the output of __getitem__ or __next__.

        :returns:
            The tokenizer output (a dict-like BatchEncoding) if no errors.
            None if any errors are encountered.
        """
        try:
            if (text is not None):
                return self._text_to_encoding(text)
            else:
                return None
        except Exception:
            return None

        
    def __getitem__(self, _id):
        """
        :returns:
            A dict mapping each encoding key (input_ids, attention_mask, ...)
            and 'label' to a torch.Tensor for the example at index _id.
        """
        text = self.texts[_id]
        label = self.labels[_id]
        encodings = self._text_to_item(text)
        
        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(label)
        return item

The datasets are then created like this:

train_dataset = LazyTextMAG_Dataset(tokenizer, train_filepath, train_label2mid_dict)
test_dataset = LazyTextMAG_Dataset(tokenizer, test_filepath, train_label2mid_dict)
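
These lazy datasets plug into the Trainer exactly like the IMDb datasets above. A minimal sketch, reusing the training_args defined earlier and assuming bert-base-uncased as an illustrative checkpoint with one output per class in train_label2mid_dict:

from transformers import BertForSequenceClassification, Trainer

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",                    # illustrative checkpoint, not necessarily the one I used
    num_labels=len(train_label2mid_dict),   # one logit per class in the label map
)

trainer = Trainer(
    model=model,
    args=training_args,            # TrainingArguments from the IMDb example above
    train_dataset=train_dataset,   # tokenization now happens lazily inside __getitem__
    eval_dataset=test_dataset,
)

trainer.train()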

Here, the class loads all of the raw text at construction time and only returns tokenized results batch by batch as they are needed. (A more ideal approach would be to also load the text itself only when needed; compared with that, loading all of the text up front avoids repeated I/O requests and may therefore be faster, although that is only a guess.)
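
For completeness, here is a rough sketch of the fully lazy variant mentioned above: it records only the byte offset of each line at construction time and reads the actual text from disk inside __getitem__. The class name LazyFileMAG_Dataset is hypothetical, and it assumes the same double-tab-separated file format as read_mag_file above:

import torch

class LazyFileMAG_Dataset(torch.utils.data.Dataset):
    """Hypothetical variant that keeps only byte offsets in memory and reads each line on demand."""

    def __init__(self, tokenizer, filepath, label2mid_dict, block_size=32):
        self.tokenizer = tokenizer
        self.filepath = filepath
        self.label2mid_dict = label2mid_dict
        self.max_len = block_size
        # One pass over the file to record where every line starts.
        self.offsets = []
        with open(filepath, "rb") as f:
            offset = f.tell()
            line = f.readline()
            while line:
                self.offsets.append(offset)
                offset = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, _id):
        # Seek to the recorded offset and read just this one line from disk.
        with open(self.filepath, "rb") as f:
            f.seek(self.offsets[_id])
            line = f.readline().decode("utf-8")
        ori, nor = line.rstrip("\r\n").split("\t\t")
        encodings = self.tokenizer(ori, padding='max_length', truncation=True, max_length=self.max_len)
        item = {key: torch.tensor(value) for key, value in encodings.items()}
        item['label'] = torch.tensor(self.label2mid_dict[nor])
        return item

Opening and seeking into the file on every lookup is exactly the extra I/O mentioned above, so whether this variant ends up faster or slower than loading all of the text up front depends on the storage and on how many DataLoader workers are used.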

References

  1. Memory error: load 200GB file in run_language_model.py, https://github.com/huggingface/transformers/issues/3083