[Thesis Reference] Cleaning and Augmenting Training Data for Generative AI with Python

Article link: https://blog.csdn.net/liuweni/article/details/144758221

Contents

I. Challenges of Generative AI Data
II. Data Cleaning
III. Data Augmentation
IV. Putting Cleaning and Augmentation Together
V. Summary

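When training generative AI models, data quality is critical. Whether the task is text generation, image generation, or generation in another modality, cleaning and augmenting the data are key steps toward a successful model. Cleaning removes noise, and augmentation gives the model more varied and richer features to learn from; together they can noticeably improve generation quality.

This article explains how to clean and augment training data for generative AI with Python tools and libraries, using code examples to help readers get started quickly.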

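I. Challenges of Generative AI Data

The main challenges with generative AI data are the following:

Data noise: raw data may contain errors, duplicates, or meaningless content.
Data imbalance: some classes or features may be over-represented or under-represented in the dataset.
Insufficient data: generative tasks need large amounts of data, but the amount actually collected may be limited.
Multimodal alignment: in multimodal generation tasks, data from different modalities may be inconsistent with one another.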
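
The first two challenges can be measured before any cleaning is done. The sketch below shows one possible way to do that, assuming the dataset is a list of (text, label) pairs. The function name `dataset_report` and the sample data are made up for illustration.

```python
from collections import Counter

def dataset_report(samples):
    """Summarize duplicates and label balance for a list of (text, label) pairs."""
    texts = [text for text, _ in samples]
    labels = [label for _, label in samples]
    # Share of samples whose text repeats an earlier sample
    duplicate_ratio = 1 - len(set(texts)) / len(texts) if texts else 0.0
    # Fraction of the dataset taken up by each label
    counts = Counter(labels)
    label_share = {label: count / len(labels) for label, count in counts.items()}
    return {"size": len(samples), "duplicate_ratio": duplicate_ratio, "label_share": label_share}

# Hypothetical sample data
samples = [
    ("a cat on a mat", "animal"),
    ("a cat on a mat", "animal"),
    ("a dog in the park", "animal"),
    ("a red sports car", "vehicle"),
]
report = dataset_report(samples)
print(report)
```

A high duplicate ratio points to the cleaning steps, while a skewed label share points to augmentation of the smaller classes.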

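II. Data Cleaning

Data cleaning is the first step of preprocessing; it removes noise and unneeded parts from the data. The methods below cover the two main modalities, text and images.

2.1 Text Data Cleaning

Text cleaning usually involves the following steps:

1. Remove meaningless symbols

Use regular expressions to strip HTML tags, special symbols, and extra whitespace.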

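import re

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove special symbols. This keeps only ASCII letters, digits, and
    # whitespace, so it also strips non-English text such as Chinese.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse repeated whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample_text = "Hello!!!   This is a <b>test</b> text...  "
cleaned_text = clean_text(sample_text)
print("Cleaned text:", cleaned_text)  # Cleaned text: Hello This is a test text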
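
The pattern `[^a-zA-Z0-9\s]` keeps only ASCII letters and digits, so it would also delete Chinese or other non-English text. For multilingual corpora, a Unicode-aware variant may be a better fit. The following is a minimal sketch under that assumption; the name `clean_text_unicode` is hypothetical and not part of the original code.

```python
import re
import unicodedata

def clean_text_unicode(text):
    # Normalize compatibility forms (e.g. full-width punctuation) first
    text = unicodedata.normalize("NFKC", text)
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # In Python 3, \w matches Unicode letters, digits, and the underscore,
    # so this removes punctuation and symbols but keeps non-English words
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse repeated whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text_unicode("你好，<b>世界</b>！！  Hello   world..."))
```

Note that `\w` also keeps underscores; add `_` to the removal pattern if they should go as well.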

2. Filter low-quality data

For generative tasks, low-quality text, such as sentences that are too short or too long, can hurt model performance.

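def filter_texts(texts, min_length=5, max_length=50):
    # Keep texts whose whitespace-separated word count lies in [min_length, max_length]
    return [text for text in texts if min_length <= len(text.split()) <= max_length]

# The last sample has 55 words; with "* 10" it would have exactly 50 and pass the filter
sample_texts = ["Short", "This is a good sample text for testing.", "This text is too long " * 11]
filtered_texts = filter_texts(sample_texts)
print("Filtered texts:", filtered_texts)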
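
Length is only one quality signal. As an additional heuristic that is not part of the original article, highly repetitive text can be detected by the share of distinct words. The sketch below assumes whitespace-separated text; the function names and the `min_ratio` threshold of 0.5 are illustrative choices, not recommended values.

```python
def unique_word_ratio(text):
    words = text.lower().split()
    # Ratio of distinct words to total words; 0.0 for empty text
    return len(set(words)) / len(words) if words else 0.0

def filter_repetitive(texts, min_ratio=0.5):
    # Keep texts in which at least min_ratio of the words are distinct
    return [t for t in texts if unique_word_ratio(t) >= min_ratio]

sample_texts = [
    "This is a good sample text for testing.",
    "buy now buy now buy now buy now",
]
print(filter_repetitive(sample_texts))
```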

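3. Deduplicate and normalize

Remove duplicate text records and normalize features such as letter case and number formats. The example here only lowercases the text.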

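def standardize_text(texts):
    # Lowercase, then drop duplicates. dict.fromkeys keeps first-seen order,
    # whereas set() would return the texts in arbitrary order.
    return list(dict.fromkeys(text.lower() for text in texts))

texts = ["Sample Text", "sample text", "Another Text"]
standardized_texts = standardize_text(texts)
print("Standardized texts:", standardized_texts)  # ['sample text', 'another text']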
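
Lowercasing every record changes the data itself, which may be unwanted when the model should learn normal capitalization. One alternative, sketched below, is to use the normalized form only as a comparison key and keep the first original spelling. `normalize_key` and `dedupe_keep_original` are hypothetical helper names.

```python
import re

def normalize_key(text):
    # Comparison key: lowercase, with runs of whitespace collapsed
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe_keep_original(texts):
    seen = set()
    unique = []
    for text in texts:
        key = normalize_key(text)
        if key not in seen:
            # First occurrence of this key: keep the text as originally written
            seen.add(key)
            unique.append(text)
    return unique

texts = ["Sample Text", "sample  text", "Another Text", "SAMPLE TEXT "]
print(dedupe_keep_original(texts))  # ['Sample Text', 'Another Text']
```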

2.2 Image Data Cleaning

1. Detect and remove corrupted files

Use the Pillow library to check whether an image file is corrupted.

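from PIL import Image

def check_image(file_path):
    try:
        # verify() checks file integrity without decoding the pixel data;
        # the file must be reopened before the image is used again
        with Image.open(file_path) as img:
            img.verify()
        return True
    except Exception:
        # Missing, unreadable, or corrupted file
        return False

image_files = ["image1.jpg", "image2.jpg"]  # example files
valid_images = [f for f in image_files if check_image(f)]
print("Valid image files:", valid_images)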
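
Because `verify()` does not decode the pixel data, some damaged files (for example truncated ones) may not be caught by it. A stricter check can reopen the file and force a full decode with `load()`. The following is a sketch of that idea; `check_image_strict` is a hypothetical name, and the demo writes temporary files so that no real dataset is needed.

```python
import os
import tempfile
from PIL import Image

def check_image_strict(file_path):
    """Return True only if the file passes verify() and can be fully decoded."""
    try:
        with Image.open(file_path) as img:
            img.verify()   # structural check, no pixel decoding
        # verify() leaves the image object unusable, so open the file again
        with Image.open(file_path) as img:
            img.load()     # force a full decode of the pixel data
        return True
    except Exception:
        return False

# Demo with temporary files
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.png")
    bad = os.path.join(tmp, "bad.png")
    Image.new("RGB", (8, 8), "white").save(good)
    with open(bad, "wb") as f:
        f.write(b"this is not an image")
    results = {"good": check_image_strict(good), "bad": check_image_strict(bad)}
print(results)
```

The full decode is slower than `verify()` alone, so it is a trade-off between speed and thoroughness on large datasets.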

2. Unify image size and format

Resize all images to the same size and save them in the same format.

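def resize_image(file_path, output_path, size=(256, 256)):
    with Image.open(file_path) as img:
        # Convert to RGB so that modes such as RGBA or P can be saved as JPEG
        img = img.convert("RGB").resize(size)
        # Pillow infers the output format from the extension of output_path
        img.save(output_path)

resize_image("input.jpg", "output.jpg", size=(256, 256))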
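
Calling `resize()` with a fixed size stretches images whose aspect ratio differs from the target. If distortion matters for the generation task, one option is to scale the image to fit and pad the rest, which Pillow offers as `ImageOps.pad`. The sketch below assumes black padding is acceptable; `resize_with_padding` is a hypothetical name, and a synthetic image stands in for a real file.

```python
from PIL import Image, ImageOps

def resize_with_padding(image, size=(256, 256), fill=(0, 0, 0)):
    # Scale the image to fit inside `size` while keeping its aspect ratio,
    # then pad the remaining area with the fill color
    return ImageOps.pad(image.convert("RGB"), size, color=fill)

# Synthetic 400x200 image so the example runs without input.jpg
wide = Image.new("RGB", (400, 200), (255, 0, 0))
padded = resize_with_padding(wide)
print(padded.size)
```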

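III. Data Augmentation

Data augmentation helps with insufficient and imbalanced data: by increasing the diversity of the data, it improves the model's ability to generalize.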
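
As an illustration of how augmentation can address imbalance, the sketch below adds augmented copies to the smaller classes until every class matches the largest one. It assumes (text, label) pairs and an `augment` function supplied by the caller; `balance_by_augmentation` and the placeholder `swap_first_two` are hypothetical names used only for this example.

```python
import random
from collections import Counter, defaultdict

def balance_by_augmentation(samples, augment, seed=0):
    """Oversample minority classes until every class matches the largest one.

    samples: list of (text, label) pairs
    augment: function mapping a text to a modified copy of it
    """
    if not samples:
        return []
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)
    target = max(len(texts) for texts in by_label.values())
    balanced = list(samples)
    for label, texts in by_label.items():
        # Add augmented copies of randomly chosen originals until the class is full
        for _ in range(target - len(texts)):
            balanced.append((augment(rng.choice(texts)), label))
    return balanced

# Placeholder augmentation for the demo: swap the first two words
def swap_first_two(text):
    words = text.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)

samples = [("a b c", "x"), ("d e f", "x"), ("g h i", "x"), ("j k l", "y")]
balanced = balance_by_augmentation(samples, swap_first_two)
print(Counter(label for _, label in balanced))
```

Oversampling from very few originals yields near-identical copies, so it works best together with augmentations that change the text in varied ways.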

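3.1 Text Data Augmentation

Text can be augmented with methods such as synonym replacement, random deletion, and random insertion.

1. Synonym replacement

Use WordNet (through NLTK) to replace some of the words in a sentence.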

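import random
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)  # fetch the WordNet data if it is not installed yet

def synonym_replacement(sentence, n=2):
    words = sentence.split()
    # Visit positions in random order and replace up to n words that have a synonym.
    # Words without a synonym are left in place instead of being dropped.
    positions = list(range(len(words)))
    random.shuffle(positions)
    replaced = 0
    for i in positions:
        if replaced >= n:
            break
        # Lemma names from all synsets, excluding the word itself
        candidates = [lemma.name().replace('_', ' ')
                      for synset in wordnet.synsets(words[i])
                      for lemma in synset.lemmas()
                      if lemma.name().lower() != words[i].lower()]
        if candidates:
            words[i] = candidates[0]
            replaced += 1
    return ' '.join(words)

sentence = "The quick brown fox jumps over the lazy dog"
augmented_sentence = synonym_replacement(sentence)
print("Augmented sentence:", augmented_sentence)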

2. Random deletion

Randomly delete some of the words in a sentence.

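import random

def random_deletion(sentence, p=0.2):
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    # Drop each word independently with probability p
    kept = [word for word in words if random.random() > p]
    # If every word was dropped, keep one random word so the result is not empty
    if not kept:
        kept = [random.choice(words)]
    return ' '.join(kept)

sentence = "The quick brown fox jumps over the lazy dog"
augmented_sentence = random_deletion(sentence)
print("Augmented sentence:", augmented_sentence)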
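
Random insertion, the third method named at the start of this section, can be sketched in the same style. To keep the example runnable without WordNet data, it takes the synonyms from a small lookup table passed in by the caller. The table `toy_synonyms`, the function name `random_insertion`, and the `rng` parameter are assumptions made for this illustration.

```python
import random

def random_insertion(sentence, synonyms, n=1, rng=random):
    """Insert synonyms of n randomly chosen words at random positions.

    synonyms: dict mapping a lowercase word to a list of candidate synonyms
    """
    words = sentence.split()
    # Only words that have an entry in the lookup table can serve as a source
    candidates = [w for w in words if synonyms.get(w.lower())]
    if not candidates:
        return sentence
    for _ in range(n):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source.lower()])
        # randint is inclusive on both ends, so the word may also be appended
        words.insert(rng.randint(0, len(words)), new_word)
    return " ".join(words)

# Hypothetical lookup table standing in for WordNet
toy_synonyms = {"quick": ["fast", "speedy"], "lazy": ["idle"]}
sentence = "The quick brown fox jumps over the lazy dog"
augmented = random_insertion(sentence, toy_synonyms, n=2, rng=random.Random(42))
print(augmented)
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which helps when comparing training runs.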

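3.2 Image Data Augmentation

1. An augmentation library: Albumentations

Albumentations is a powerful image augmentation library that supports many augmentation operations.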

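from albumentations import Compose, HorizontalFlip, RandomBrightnessContrast
from albumentations.pytorch import ToTensorV2
import cv2

transform = Compose([
    HorizontalFlip(p=0.5),             # flip left-right with probability 0.5
    RandomBrightnessContrast(p=0.2),   # adjust brightness/contrast with probability 0.2
    ToTensorV2()                       # convert the HWC NumPy array to a CHW torch tensor
])

image = cv2.imread("input.jpg")
if image is None:
    # cv2.imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError("input.jpg could not be read")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads images in BGR order
augmented = transform(image=image)
augmented_image = augmented["image"]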

2. Random cropping and rotation

Use the Pillow library to implement random cropping and rotation.

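import random
from PIL import Image

def random_crop(image, crop_size=(200, 200)):
    width, height = image.size
    # Clamp the crop to the image so that small images do not raise ValueError
    crop_w, crop_h = min(crop_size[0], width), min(crop_size[1], height)
    left = random.randint(0, width - crop_w)
    top = random.randint(0, height - crop_h)
    return image.crop((left, top, left + crop_w, top + crop_h))

def random_rotation(image):
    # Rotate by a random whole number of degrees. The output keeps the original
    # size, so corners that rotate out of the frame are cut off.
    return image.rotate(random.randint(0, 360))

img = Image.open("input.jpg")
cropped_img = random_crop(img)
rotated_img = random_rotation(img)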
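
In practice, several variants are usually generated per image, and a fixed seed makes the result reproducible. The sketch below combines a small rotation with a random crop. The name `augment_variants`, the default `max_angle` of 15 degrees, and the synthetic test image are assumptions for illustration rather than recommended settings.

```python
import random
from PIL import Image

def augment_variants(image, n=4, crop_size=(200, 200), max_angle=15, seed=None):
    """Return n randomly rotated-and-cropped variants of a PIL image."""
    rng = random.Random(seed)  # a fixed seed makes the variants reproducible
    variants = []
    for _ in range(n):
        # Small rotations keep most of the content inside the frame
        rotated = image.rotate(rng.uniform(-max_angle, max_angle))
        width, height = rotated.size
        # Clamp the crop so images smaller than crop_size still work
        crop_w, crop_h = min(crop_size[0], width), min(crop_size[1], height)
        left = rng.randint(0, width - crop_w)
        top = rng.randint(0, height - crop_h)
        variants.append(rotated.crop((left, top, left + crop_w, top + crop_h)))
    return variants

# Synthetic image so the example runs without input.jpg
img = Image.new("RGB", (320, 240), "gray")
variants = augment_variants(img, n=3, seed=0)
print([v.size for v in variants])
```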

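IV. Putting Cleaning and Augmentation Together

Combining data cleaning with augmentation improves both the quality and the diversity of the data. A complete example workflow follows.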

def preprocess_and_augment(image_paths, text_data):
    processed_images = []
    for path in image_paths:
        if check_image(path):
            # resize_image() writes to disk and returns nothing,
            # so the image is resized in memory here instead
            with Image.open(path) as img:
                processed_images.append(img.convert("RGB").resize((256, 256)))

    clean_texts = [clean_text(text) for text in text_data]
    augmented_texts = [synonym_replacement(text) for text in clean_texts]

    return processed_images, augmented_texts

image_paths = ["image1.jpg", "image2.jpg"]
text_data = ["This is a test sentence.", "Another example."]
images, texts = preprocess_and_augment(image_paths, text_data)
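
The workflow processes images and texts independently, so if a corrupted image is skipped, the two returned lists no longer line up. For multimodal data where each image has a caption, it may be safer to filter pairs as a unit. The sketch below assumes (image_path, caption) pairs; `filter_aligned_pairs` and the `min_words` parameter are hypothetical, and the validity check and cleaning function are passed in so that the example runs without image files.

```python
def filter_aligned_pairs(pairs, is_valid_image, clean, min_words=2):
    """Keep (image_path, caption) pairs whose image and caption both pass."""
    kept = []
    for image_path, caption in pairs:
        cleaned = clean(caption)
        # Drop the whole pair if either side fails, so images and captions stay aligned
        if is_valid_image(image_path) and len(cleaned.split()) >= min_words:
            kept.append((image_path, cleaned))
    return kept

# Stand-ins for the demo; in practice pass check_image and clean_text
existing = {"image1.jpg", "image3.jpg"}
pairs = [
    ("image1.jpg", "  A dog   on the beach "),
    ("image2.jpg", "A cat on a sofa"),   # image missing
    ("image3.jpg", "???"),               # caption has fewer than min_words words
]
result = filter_aligned_pairs(
    pairs,
    is_valid_image=lambda p: p in existing,
    clean=lambda s: " ".join(s.split()),
)
print(result)  # [('image1.jpg', 'A dog on the beach')]
```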