How AI Trainers Clean Text Data

Latest recommended article published on 2025-04-09 13:53:48

小宝哥Code

Category: AI Trainer (人工智能训练师) · Tags: artificial intelligence

Article link: https://blog.csdn.net/chenby186119/article/details/145693622


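1. What is text data cleaning?

Text data cleaning is a key step in natural language processing (NLP). Its main purpose is to remove irrelevant characters, special symbols, stop words, and duplicate content, and to format and standardize the text so that AI models are trained on higher-quality data.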

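2. Core steps of text data cleaning

2.1 Main cleaning tasks

| Task | Description | Example |
| --- | --- | --- |
| Remove HTML tags | Strip HTML markup | `<p>Hello</p>` → `Hello` |
| Remove special characters | Delete characters such as `@#$%^&*()` | `Hello! #AI` → `Hello AI` |
| Remove digits | Delete all digits | `Model GPT-4o` → `Model GPT-o` |
| Remove stop words | Drop low-information words (e.g. "的", "是", "and", "the") | `This is a book` → `book` |
| Lowercase | Normalize letter case | `HELLO AI` → `hello ai` |
| Lemmatization | Reduce words to their base form | `running` → `run` |
| Spelling correction | Fix misspellings | `recieve` → `receive` |
| Emoticon conversion | Map emoticons to a uniform token | `:)` → `positive_emotion` |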
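
The emoticon row is the one task in the table that ASCII-only input like `:)` needs a lookup for, since emoji libraries target Unicode emoji. A minimal sketch of such a lookup, using only the standard library; the mapping `EMOTICON_TOKENS` and the function name `normalize_emoticons` are illustrative assumptions, not part of the original article or of any library.

```python
import re

# Hypothetical mapping from ASCII emoticons to sentiment tokens;
# extend it to whatever your data actually contains.
EMOTICON_TOKENS = {
    ":)": "positive_emotion",
    ":-)": "positive_emotion",
    ":D": "positive_emotion",
    ":(": "negative_emotion",
    ":-(": "negative_emotion",
}

# Try longer emoticons first so the alternation prefers the longest match.
_EMOTICON_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_TOKENS, key=len, reverse=True))
)


def normalize_emoticons(text: str) -> str:
    """Replace known emoticons with a uniform sentiment token."""
    replaced = _EMOTICON_RE.sub(lambda m: f" {EMOTICON_TOKENS[m.group(0)]} ", text)
    # Collapse the extra spaces introduced by the padding above.
    return " ".join(replaced.split())


print(normalize_emoticons("Great model :) but slow :-("))
# Great model positive_emotion but slow negative_emotion
```

A step like this has to run before punctuation removal, otherwise the emoticons are gone before they can be mapped.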

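3. Implementing text data cleaning in Python

3.1 Install the required libraries

```bash
pip install beautifulsoup4 lxml nltk spacy textblob emoji
python -m spacy download en_core_web_sm
```

The second command downloads the English spaCy model that `spacy.load("en_core_web_sm")` expects.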

3.2 Full text-cleaning pipeline

```python
import re
import string

import emoji
import nltk
import spacy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from textblob import TextBlob

# Download the NLTK stop-word list and tokenizer data (skipped if present)
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

# Load the small English spaCy model, used here for lemmatization
nlp = spacy.load("en_core_web_sm")

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words("english"))


def clean_text(text):
    """Run the full text-cleaning pipeline on one English string."""

    # 1. Strip HTML tags
    text = BeautifulSoup(text, "lxml").text

    # 2. Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)

    # 3. Remove @mentions and #hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # 4. Replace emoji with their text names. Spaces are used as delimiters
    #    and underscores become spaces, so step 5 does not fuse the name
    #    into a single unreadable token.
    text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")

    # 5. Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    # 6. Remove digits
    text = re.sub(r'\d+', '', text)

    # 7. Lowercase
    text = text.lower()

    # 8. Lemmatization
    doc = nlp(text)
    text = " ".join(token.lemma_ for token in doc)

    # 9. Remove stop words
    text = " ".join(word for word in text.split() if word not in STOP_WORDS)

    # 10. Spelling correction (slow on long texts)
    text = str(TextBlob(text).correct())

    return text


# Quick test
raw_text = "Hello! 😊 This is a <b>test</b> message. Visit: https://example.com #AI @user123"
cleaned_text = clean_text(raw_text)
print("Raw text:", raw_text)
print("Cleaned text:", cleaned_text)
```
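
Section 1 lists removing duplicate content as a goal, but `clean_text` works on one string at a time. A minimal sketch of exact-duplicate removal across a list of texts, using only the standard library; `dedup_key` and `deduplicate` are illustrative names, and near-duplicate detection is deliberately left out.

```python
import re


def dedup_key(text: str) -> str:
    """Normalize case and whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def deduplicate(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, preserving input order."""
    seen: set[str] = set()
    unique = []
    for text in texts:
        key = dedup_key(text)
        # Skip empty strings and anything whose key was already seen.
        if key and key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


corpus = ["Hello AI", "hello   ai", "Text cleaning", "", "Hello AI"]
print(deduplicate(corpus))  # ['Hello AI', 'Text cleaning']
```

Running this after cleaning rather than before tends to catch more duplicates, because texts that differed only in markup or punctuation have already been normalized.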

4. Code walkthrough

4.1 Step by step

Remove HTML tags
```python
text = BeautifulSoup(text, "lxml").text
```
Example: `<p>Hello</p>` → `Hello`
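
If pulling in `beautifulsoup4` and `lxml` is not an option, the same idea can be approximated with the standard library. A rough sketch, assuming reasonably well-formed markup; `strip_html` and `_TextExtractor` are hypothetical names, and this is not a drop-in replacement for BeautifulSoup on messy real-world HTML.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.text()


print(strip_html("<p>Hello &amp; <b>welcome</b></p>"))  # Hello & welcome
```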

Remove URLs
```python
text = re.sub(r'http\S+|www\S+|https\S+', '', text)
```
Example: `Visit https://example.com` → `Visit ` (the space before the URL is kept)
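
The pattern above also matches ordinary words that merely start with `http` or `www`. A slightly stricter variant, offered as a sketch rather than a complete URL matcher; `URL_RE` and `strip_urls` are illustrative names.

```python
import re

# Requires "://" after the scheme and a dot after "www", so ordinary words
# such as "httpd" are left alone.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_urls(text: str) -> str:
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_RE.sub(" ", text)
    return " ".join(without_urls.split())


print(strip_urls("Visit https://example.com or www.example.org today"))
# Visit or today
print(strip_urls("restart httpd now"))
# restart httpd now
```

Because `\S+` runs to the next whitespace, punctuation glued to the end of a URL (for example a sentence-final period) is removed together with it.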

Remove @mentions and #hashtags
```python
text = re.sub(r'@\w+|#\w+', '', text)
```
Example: `@user123 #AI` → ` ` (only the space between the two tokens remains)

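
Two side effects of this pattern are worth knowing: `@\w+` also matches inside e-mail addresses (`bob@example.com` becomes `bob.com`), and `#\w+` discards the hashtag word, whereas the table in section 2.1 keeps it (`#AI` → `AI`). A minimal sketch that addresses both, assuming you want to keep hashtag words by default; `strip_social_markup` and its `keep_hashtag_words` parameter are hypothetical.

```python
import re

# "@name" only when it does not follow a word character, so the domain part
# of an e-mail address such as bob@example.com is not treated as a mention.
MENTION_RE = re.compile(r"(?<!\w)@\w+")
# Capture the hashtag body so it can be kept while "#" is dropped.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def strip_social_markup(text: str, keep_hashtag_words: bool = True) -> str:
    """Remove @mentions; drop hashtags entirely or keep only their words."""
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(r"\1" if keep_hashtag_words else "", text)
    return " ".join(text.split())


print(strip_social_markup("Thanks @user123 for the #AI demo, mail bob@example.com"))
# Thanks for the AI demo, mail bob@example.com
```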
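Replace emoji
```python
text = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")
```
Example: `😊` → ` smiling face with smiling eyes `

With the default delimiters the result would be `:smiling_face_with_smiling_eyes:`, and the punctuation step that follows would strip the colons and underscores and fuse the name into one token.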

Remove punctuation
```python
text = text.translate(str.maketrans("", "", string.punctuation))
```
Example: `Hello!` → `Hello`
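
Deleting punctuation outright joins hyphenated words (`state-of-the-art` becomes `stateoftheart`), and `string.punctuation` covers ASCII only, so marks such as `，` and `。` pass through. A sketch of one alternative, assuming you prefer spaces in place of punctuation; `strip_punctuation` is an illustrative name.

```python
import string
import unicodedata

# Map each ASCII punctuation character to a space instead of deleting it,
# so "state-of-the-art" becomes separate words rather than "stateoftheart".
_ASCII_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def strip_punctuation(text: str) -> str:
    """Replace ASCII and Unicode punctuation with spaces, then tidy spacing."""
    text = text.translate(_ASCII_PUNCT_TO_SPACE)
    # Unicode general categories starting with "P" cover punctuation such as
    # "，" and "。" that string.punctuation does not include.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(text.split())


print(strip_punctuation("state-of-the-art, really!"))  # state of the art really
print(strip_punctuation("你好，世界。"))  # 你好 世界
```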

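Remove digits
```python
text = re.sub(r'\d+', '', text)
```
Example: `GPT-4` → `GPT-` (in the full pipeline the hyphen has already been removed by the punctuation step, so the result is `GPT`)
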
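
Removing every digit also damages identifiers such as model names. If that matters for your data, one option is to drop only free-standing numbers. The following is a sketch under that assumption; `STANDALONE_NUMBER_RE` and `drop_standalone_numbers` are hypothetical names, and numbers preceded by a hyphen (including negative numbers) are intentionally left in place.

```python
import re

# A number counts as "standalone" when it is not glued to letters or hyphens,
# so the "4" in "GPT-4o" survives while "90" and "3.5" are removed.
STANDALONE_NUMBER_RE = re.compile(r"(?<![\w-])\d+(?:[.,]\d+)*(?![\w-])")


def drop_standalone_numbers(text: str) -> str:
    """Remove free-standing numbers but keep digits inside identifiers."""
    return " ".join(STANDALONE_NUMBER_RE.sub(" ", text).split())


print(drop_standalone_numbers("GPT-4o scored 90 on 3.5 million samples in 2024"))
# GPT-4o scored on million samples in
```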
Lowercase
```python
text = text.lower()
```
Example: `HELLO` → `hello`

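
For text that is not plain ASCII, `lower()` alone may leave variants that should compare equal, such as full-width letters or the German `ß`. A small sketch of a stricter normalization using the standard library; `normalize_case` is an illustrative name, and whether this extra folding is desirable depends on the corpus.

```python
import unicodedata


def normalize_case(text: str) -> str:
    """NFKC-normalize, then casefold for caseless comparison."""
    # NFKC folds compatibility forms such as full-width "ＨＥＬＬＯ" to ASCII.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower(): "ß" becomes "ss".
    return text.casefold()


print(normalize_case("ＨＥＬＬＯ AI"))  # hello ai
print("Straße".lower() == "STRASSE".lower())  # False
print(normalize_case("Straße") == normalize_case("STRASSE"))  # True
```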
Lemmatization
