Data Preprocessing (ML)

Latest recommended article published on 2025-02-13 22:07:07

Sonhhxg_柒

Article link: https://blog.csdn.net/sikh_0529/article/details/126805578

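Introduction

Data preprocessing can be split into two types of processes: preparation and transformation. We'll explore common preprocessing techniques and then walk through the relevant processes for our specific application.

Preparing

Preparing data involves organizing and cleaning it.

Joins

Perform SQL joins on existing data tables to organize all the relevant data you need into one view. This makes working with the dataset much easier.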

SELECT * FROM A
INNER JOIN B ON A.id = B.id
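
If the tables already live in pandas, the same inner join can be sketched with DataFrame.merge. The two toy tables and their column names below are assumptions made up for illustration; only the shared id key mirrors the SQL above.

```python
import pandas as pd

# Two toy tables that share an "id" key (hypothetical data)
A = pd.DataFrame({"id": [1, 2, 3], "title": ["a", "b", "c"]})
B = pd.DataFrame({"id": [2, 3, 4], "tag": ["x", "y", "z"]})

# Inner join: keep only the ids that appear in both tables
joined = A.merge(B, on="id", how="inner")
print(joined)
```

Rows whose id appears in only one of the tables (1 and 4 here) are dropped, which is the behavior of INNER JOIN.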

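Missing values

First, we have to identify the rows with missing values. Once we have, there are several approaches to dealing with them.

Omit samples with missing values (if only a small subset are missing them)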


# Drop a row (sample) by index
df.drop([4, 10, ...])
# Conditionally drop rows (samples)
df = df[df.value > 0]
# Drop samples with any missing feature (note the ~ to keep only complete rows)
df = df[~df.isnull().any(axis=1)]

Omit the entire feature (if too many samples are missing the value)


# Drop a column (feature)
df.drop(["A"], axis=1)

Fill in missing values for features (using domain knowledge, heuristics, etc.)

# Fill in missing values with mean
df.A = df.A.fillna(df.A.mean())

Values may not always look "missing" (e.g. 0, null, NA, etc.)

# Replace zeros with NaNs
import numpy as np
df.A = df.A.replace({"0": np.nan, 0: np.nan})
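
To see the approaches above side by side, here is a minimal sketch on a toy DataFrame. The data is made up, and treating 0 as a disguised missing value in column A is an assumption for illustration, not a general rule.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): column A hides a missing value as 0, B has a real NaN
df = pd.DataFrame({"A": [1.0, 0.0, 3.0, 5.0], "B": [10.0, 20.0, np.nan, 40.0]})

# 1. Make disguised missing values explicit
df["A"] = df["A"].replace({0: np.nan})

# 2. Identify the rows with at least one missing feature
missing_rows = df[df.isnull().any(axis=1)]

# 3a. Option: drop those rows
dropped = df.dropna()

# 3b. Option: fill each column with its own mean
filled = df.fillna(df.mean())

print(missing_rows, dropped, filled, sep="\n\n")
```

Which option is appropriate depends on how much data is missing and on what a missing value means in the domain.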

Outliers (anomalies)

Craft assumptions about what a "normal" expected value is


# Ex. Feature value must be within 2 standard deviations
df[np.abs(df.A - df.A.mean()) <= (2 * df.A.std())]

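Be careful not to remove important outliers (e.g. fraud)
Values may no longer be outliers once we apply a transformation (e.g. power law)
Anomalies can be global (point), contextual (conditional) or collective (individual points are not anomalous, but the group as a whole is)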
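
The point about transformations can be sketched with a small example. The feature values below are hypothetical (powers of two, chosen so the effect is easy to see), and the 2 standard deviation rule is the same assumption used in the snippet above.

```python
import numpy as np

# Hypothetical heavy-tailed feature: 1, 2, 4, ..., 512
values = 2.0 ** np.arange(10)

def within_n_std(x, n=2):
    """Boolean mask of the values within n standard deviations of the mean."""
    return np.abs(x - x.mean()) <= n * x.std()

raw_mask = within_n_std(values)          # rule applied to the raw values
log_mask = within_n_std(np.log(values))  # same rule after a log transform

print(values[~raw_mask])  # values flagged on the raw scale
print(values[~log_mask])  # values flagged on the log scale
```

On the raw scale the largest value is flagged, while on the log scale nothing is, so whether a point counts as an outlier depends on the representation we choose.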

Feature engineering

Feature engineering involves combining features in unique ways to draw out signal.


# Input (use bracket assignment; df.C = ... would not create a new column)
df["C"] = df["A"] + df["B"]

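Cleaning

Cleaning our data involves applying constraints that make it easier for our models to extract signal from the data.

Use domain expertise and EDA
Apply constraints via filters
Ensure data type consistency
Remove data points with certain or null column values
Images (crop, resize, clip, etc.)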

# Resize
import cv2
dims = (width, height)  # cv2.resize expects dsize as (width, height)
resized_img = cv2.resize(src=img, dsize=dims, interpolation=cv2.INTER_LINEAR)

Text (lower, stem, lemmatize, regex, etc.)


# Lower case the text
text = text.lower()
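
For tabular data, ensuring type consistency and applying constraints via filters might look like the sketch below. The records, the column names and the "values must be positive" rule are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw records: numbers stored as strings, one malformed entry
df = pd.DataFrame({"id": ["1", "2", "3", "4"], "value": ["10", "n/a", "-5", "30"]})

# Ensure data type consistency: coerce to numeric, turning bad entries into NaN
df["id"] = df["id"].astype(int)
df["value"] = pd.to_numeric(df["value"], errors="coerce")

# Apply constraints via filters: drop nulls and keep positive values only
cleaned = df[df["value"].notnull() & (df["value"] > 0)]
print(cleaned)
```

Coercing first and filtering second keeps the two concerns separate, so we can inspect what was turned into NaN before deciding to drop it.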

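Transforming

Transforming the data involves feature encoding and engineering.

Scaling

Required for models where the scale of the input affects the process
Learn constructs from the training split and apply them to the other splits (local)
Don't blindly scale features (e.g. categorical features)
Standardization: rescale values to mean 0, std 1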

# Standardization
import numpy as np
x = np.random.random(4) # values between 0 and 1
print ("x:\n", x)
print (f"mean: {np.mean(x):.2f}, std: {np.std(x):.2f}")
x_standardized = (x - np.mean(x)) / np.std(x)
print ("x_standardized:\n", x_standardized)
print (f"mean: {np.mean(x_standardized):.2f}, std: {np.std(x_standardized):.2f}")

x:
 [0.36769939 0.82302265 0.9891467  0.56200803]
mean: 0.69, std: 0.24
x_standardized:
 [-1.33285946  0.57695671  1.27375049 -0.51784775]
mean: 0.00, std: 1.00
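
The snippet above standardizes a single array with its own statistics. To follow the advice of learning the constructs from the training split only, a minimal sketch could look like this. The split sizes and the randomly generated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the statistics from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same statistics to every split
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The test split will not have exactly mean 0 and std 1, and that is expected: computing its own statistics would leak information about the test data into preprocessing.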

Min-max: rescale values between a min and max

# Min-max
import numpy as np
x = np.random.random(4) # values between 0 and 1
print ("x:", x)
print (f"min: {x.min():.2f}, max: {x.max():.2f}")
x_scaled = (x - x.min()) / (x.max() - x.min())
print ("x_scaled:", x_scaled)
print (f"min: {x_scaled.min():.2f}, max: {x_scaled.max():.2f}")

x: [0.20195674 0.99108855 0.73005081 0.02540603]
min: 0.03, max: 0.99
x_scaled: [0.18282479 1.         0.72968575 0.        ]
min: 0.00, max: 1.00

Binning: convert a continuous feature into a categorical one using bins

# Binning
import numpy as np
x = np.random.random(4) # values between 0 and 1
print ("x:", x)
bins = np.linspace(0, 1, 5) # bins between 0 and 1
print ("bins:", bins)
binned = np.digitize(x, bins)
print ("binned:", binned)

x: [0.54906364 0.1051404  0.2737904  0.2926313 ]
bins: [0.   0.25 0.5  0.75 1.  ]
binned: [3 1 2 2]

And many more!

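Encoding

Allows for representing data efficiently (maintaining signal) and effectively (learning patterns, e.g. one-hot vs. embeddings)

Label: a unique index for each categorical value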


# Label encoding
label_encoder.class_to_index = {
"attention": 0,
"autoencoders": 1,
"convolutional-neural-networks": 2,
"data-augmentation": 3,
... }
label_encoder.transform(["attention", "data-augmentation"])

array([0, 3])

One-hot: representation as a binary vector

# One-hot encoding
one_hot_encoder.transform(["attention", "data-augmentation"])

array([1, 0, 0, 1, 0, ..., 0])
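
The one_hot_encoder object above is not defined in this lesson, so here is a minimal NumPy sketch of the idea. The one_hot helper is hypothetical, and the mapping is truncated to the four classes that are spelled out in the label encoding example.

```python
import numpy as np

# Hypothetical mapping, truncated to the four classes listed above
class_to_index = {
    "attention": 0,
    "autoencoders": 1,
    "convolutional-neural-networks": 2,
    "data-augmentation": 3,
}

def one_hot(labels, class_to_index):
    """Return one binary row per label, with a single 1 at the label's index."""
    encoded = np.zeros((len(labels), len(class_to_index)), dtype=int)
    for row, label in enumerate(labels):
        encoded[row, class_to_index[label]] = 1
    return encoded

vectors = one_hot(["attention", "data-augmentation"], class_to_index)
multi_hot = vectors.max(axis=0)  # combine several labels into one binary vector
print(vectors)
print(multi_hot)
```

Each row has exactly one active position. Taking the element-wise maximum over the rows gives a single vector with one active position per label, which is the form of the output shown above.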

Embeddings: dense representations that are able to represent context


# Embeddings
self.embeddings = nn.Embedding(
    embedding_dim=embedding_dim, num_embeddings=vocab_size)
x_in = self.embeddings(x_in)
print (x_in.shape)

(len(X), embedding_dim)
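
Conceptually, an embedding layer is a lookup into a weight matrix with one row per token index. The NumPy sketch below only illustrates that lookup: the sizes are made up and the weights are random, whereas in nn.Embedding they are learned during training.

```python
import numpy as np

vocab_size, embedding_dim = 10, 4  # hypothetical sizes
rng = np.random.default_rng(0)

# The embedding weights: one dense row per token index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A batch of two sequences, each with three token indices
x_in = np.array([[1, 5, 2], [7, 7, 0]])

# The lookup is plain integer indexing into the weight matrix
embedded = embedding_matrix[x_in]
print(embedded.shape)
```

Every index is replaced by its row, so an input of shape (batch, sequence) becomes (batch, sequence, embedding_dim), and repeated indices map to identical vectors.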

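And many more!

Extraction

Signal extraction from existing features
Combine existing features
Transfer learning: use a pretrained model as a feature extractor and fine-tune on its results
Autoencoders: learn to encode inputs into a compressed knowledge representation

Principal component analysis (PCA): linear dimensionality reduction that projects data into a lower dimensional space.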


# PCA
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]])
pca = PCA(n_components=2)
pca.fit(X)
print (pca.transform(X))
print (pca.explained_variance_ratio_)
print (pca.singular_values_)

[[-1.44245791 -0.1744313 ]
 [-0.1148688   0.31291575]
 [ 1.55732672 -0.13848446]]
[0.96838847 0.03161153]
[2.12582835 0.38408396]
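
To make the numbers above less of a black box, here is a sketch of the same computation with a plain SVD in NumPy. This is one standard way to compute PCA and not necessarily how scikit-learn proceeds internally; in particular, the sign of each projected column may be flipped relative to the scikit-learn output.

```python
import numpy as np

X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]], dtype=float)

# Center each feature, then factor the centered matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

n_components = 2
singular_values = S[:n_components]
explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]

# Project onto the top principal directions (the rows of Vt)
projected = X_centered @ Vt[:n_components].T

print(projected)
print(explained_variance_ratio)
print(singular_values)
```

The squared singular values are proportional to the variance captured by each direction, which is where the explained variance ratio comes from.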

Counts (ngram): sparse representation of text as a matrix of token counts. Useful if the feature values have lots of meaningful, separable signal.

# Counts (ngram)
from sklearn.feature_extraction.text import CountVectorizer
y = [
    "acetyl acetone",
    "acetyl chloride",
    "chloride hydroxide",
]
vectorizer = CountVectorizer()
y = vectorizer.fit_transform(y)
print (vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print (y.toarray())
# 💡 Repeat above with char-level ngram vectorizer
# vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3)) # uni, bi and trigrams

['acetone', 'acetyl', 'chloride', 'hydroxide']
[[1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]
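
As a rough idea of what the char-level variant counts, here is a pure-Python sketch. The char_ngrams helper is hypothetical and simpler than CountVectorizer, which also handles casing and whitespace and builds a vocabulary across documents.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

counts = char_ngrams("acetyl")
print(counts)
```

Character n-grams let related tokens such as "acetyl" and "acetone" share features (for example "ace"), which word-level counts cannot do.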

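Application

For our application, we'll implement a few of these preprocessing steps that are relevant to our dataset.

Feature engineering

We can combine existing input features to create new, meaningful signal (helping the model learn). However, there's usually no simple way to know if a certain feature combination will help without empirically experimenting with the different combinations. Here, we could use a project's title and description separately as features, but we'll combine them to create one input feature.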


# Input
df["text"] = df.title + " " + df.description

Cleaning

Since we're dealing with text data, we can apply some of the common text preprocessing steps:

!pip install nltk==3.7 -q


import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

nltk.download("stopwords")
STOPWORDS = stopwords.words("english")
stemmer = PorterStemmer()

def clean_text(text, lower=True, stem=False, stopwords=STOPWORDS):
    """Clean raw text."""
    # Lower
    if lower:
        text = text.lower()

    # Remove stopwords
    if len(stopwords):
        pattern = re.compile(r'\b(' + r"|".join(stopwords) + r")\b\s*")
        text = pattern.sub('', text)

    # Remove links (before punctuation is stripped, while the URL is still intact)
    text = re.sub(r"http\S+", "", text)

    # Spacing and filters
    text = re.sub(
        r"([!\"'#$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text
    )  # add spacing between objects to be filtered
    text = re.sub("[^A-Za-z0-9]+", " ", text)  # remove non alphanumeric chars
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()  # strip white space at the ends

    # Stemming
    if stem:
        text = " ".join([stemmer.stem(word, to_lowercase=lower) for word in text.split(" ")])

    return text

Note: We could definitely try to include emojis, punctuation, etc. because they carry a lot of signal for the task, but it's best to simplify the initial feature set to just what we think is most influential. We can then slowly introduce other features and assess their utility.


# Apply to dataframe
original_df = df.copy()
df.text = df.text.apply(clean_text, lower=True, stem=False)
print (f"{original_df.text.values[0]}\n{df.text.values[0]}")

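Comparison between YOLO and RCNN on real world videos Bringing theory to experiment is cool. We can easily train models in colab and find the results in minutes.
comparison yolo rcnn real world videos bringing theory experiment cool easily train models colab find results minutes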

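Replace labels

Based on our findings from EDA, we're going to apply several constraints to the labels in our data:

If a data point has a tag that we currently don't support, we'll replace it with other
If a certain tag doesn't have enough samples, we'll replace it with other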

import json

# Accepted tags (external constraint)
ACCEPTED_TAGS = ["natural-language-processing", "computer-vision", "mlops", "graph-learning"]

# Out of scope (OOS) tags
oos_tags = [item for item in df.tag.unique() if item not in ACCEPTED_TAGS]
oos_tags

['reinforcement-learning', 'time-series']


# Samples with OOS tags
oos_indices = df[df.tag.isin(oos_tags)].index
df[df.tag.isin(oos_tags)].head()

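	id	created_on	title	description	tag
3	15	2020-02-28 23:55:26	Awesome Monte Carlo Tree Search	A curated list of Monte Carlo tree search papers...	reinforcement-learning
37	121	2020-03-24 04:56:38	Deep Reinforcement Learning in TensorFlow2	deep-rl-tf2 is an implementation...	reinforcement-learning
67	218	2020-04-06 11:29:57	Distributed Reinforcement Learning in TensorFlow2	🐳 Implementation of various distributed systems...	reinforcement-learning
74	239	2020-04-06 18:39:48	Prophet: Forecasting At Scale	Tool for producing high quality forecasts...	time-series
95	277	2020-04-07 00:30:33	Curriculum for Reinforcement Learning	Curriculum learning applied to reinforcement learning...	reinforcement-learning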

# Replace this tag with "other"
df.tag = df.tag.apply(lambda x: "other" if x in oos_tags else x)
df.loc[oos_indices].head()  # oos_indices holds index labels, so use .loc

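	id	created_on	title	description	tag
3	15	2020-02-28 23:55:26	Awesome Monte Carlo Tree Search	A curated list of Monte Carlo tree search papers...	other
37	121	2020-03-24 04:56:38	Deep Reinforcement Learning in TensorFlow2	deep-rl-tf2 is an implementation...	other
67	218	2020-04-06 11:29:57	Distributed Reinforcement Learning in TensorFlow2	🐳 Implementation of various distributed systems...	other
74	239	2020-04-06 18:39:48	Prophet: Forecasting At Scale	Tool for producing high quality forecasts...	other
95	277	2020-04-07 00:30:33	Curriculum for Reinforcement Learning	Curriculum learning applied to reinforcement learning...	other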

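We're also going to restrict the mapping to only the tags that are above a certain frequency threshold. Tags that don't have enough projects will not have enough samples for us to model their relationships.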

from collections import Counter
import ipywidgets as widgets

# Minimum frequency required for a tag
min_freq = 75
tags = Counter(df.tag.values)

# Tags that just made / missed the cut
@widgets.interact(min_freq=(0, tags.most_common()[0][1]))
def separate_tags_by_freq(min_freq=min_freq):
    tags_above_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] >= min_freq)
    tags_below_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] < min_freq)
    print ("Most popular tags:\n", tags_above_freq.most_common(3))
    print ("\nTags that just made the cut:\n", tags_above_freq.most_common()[-3:])
    print ("\nTags that just missed the cut:\n", tags_below_freq.most_common(3))

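Most popular tags:
 [('natural-language-processing', 388), ('computer-vision', 356), ('other', 87)]

Tags that just made the cut:
 [('computer-vision', 356), ('other', 87), ('mlops', 79)]

Tags that just missed the cut:
 [('graph-learning', 45)]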

def filter_tag(tag, include=None):
    """Return the tag if it is in the include list, otherwise None."""
    if tag not in (include or []):
        tag = None
    return tag

# Filter tags that have fewer than <min_freq> occurrences
tags_above_freq = Counter(tag for tag in tags.elements()
                          if (tags[tag] >= min_freq))
df.tag = df.tag.apply(filter_tag, include=list(tags_above_freq.keys()))

# Fill None with other
df.tag = df.tag.fillna("other")
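
Both constraints can be tried end to end on a toy example. The tag names, counts and threshold below are assumptions made up for illustration (the lesson itself uses min_freq = 75), and Series.where is used here as a compact alternative to the apply calls above.

```python
from collections import Counter
import pandas as pd

# Hypothetical toy data and thresholds
df = pd.DataFrame({"tag": ["nlp"] * 4 + ["cv"] * 3 + ["rl"] * 2 + ["mlops"]})
accepted_tags = ["nlp", "cv", "mlops"]
min_freq = 2

# 1. Replace out-of-scope tags with "other"
df["tag"] = df["tag"].where(df["tag"].isin(accepted_tags), "other")

# 2. Replace tags below the frequency threshold with "other"
counts = Counter(df["tag"])
frequent = {tag for tag, count in counts.items() if count >= min_freq}
df["tag"] = df["tag"].where(df["tag"].isin(frequent), "other")

print(Counter(df["tag"]))
```

Here "rl" is replaced because it is out of scope, and "mlops" is replaced because it has a single sample, so both end up in the "other" class.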

Encoding

We're going to encode our output labels by assigning each tag a unique index.

import numpy as np
import random

# Get data
X = df.text.to_numpy()
y = df.tag

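We'll write our own LabelEncoder, which is based on scikit-learn's implementation. Being able to write clean classes for the objects we want to create is a very valuable skill.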

class LabelEncoder(object):
    """Encode labels into unique indices"""
    def __init__(self, class_to_index={}):
        self.class_to_index = class_to_index or {}  # mutable defaults ;)
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def __str__(self):
        return f"<LabelEncoder(num_classes={len(self)})>"

    def fit(self, y):
        classes = np.unique(y)
        for i, class_ in enumerate(classes):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        encoded = np.zeros((len(y)), dtype=int)
        for i, item in enumerate(y):
            encoded[i] = self.class_to_index[item]
        return encoded

    def decode(self, y):
        classes = []
        for i, item in enumerate(y):
            classes.append(self.index_to_class[item])
        return classes

    def save(self, fp):
        with open(fp, "w") as fp:
            contents = {"class_to_index": self.class_to_index}
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, "r") as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)

If you're not familiar with the @classmethod decorator, learn more about it in our Python lesson.
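
As a quick illustration of the pattern, here is a minimal sketch with a hypothetical Vocabulary class (not part of the lesson's code). A class method receives the class itself as its first argument, which makes it a natural fit for alternative constructors such as load.

```python
import json
import os
import tempfile

class Vocabulary:
    """Hypothetical minimal class, used only to illustrate @classmethod."""
    def __init__(self, token_to_index=None):
        self.token_to_index = token_to_index or {}

    def save(self, fp):
        with open(fp, "w") as f:
            json.dump({"token_to_index": self.token_to_index}, f)

    @classmethod
    def load(cls, fp):
        # cls is the class itself, so no existing instance is needed
        with open(fp) as f:
            kwargs = json.load(f)
        return cls(**kwargs)

path = os.path.join(tempfile.mkdtemp(), "vocab.json")
Vocabulary({"hello": 0, "world": 1}).save(path)
restored = Vocabulary.load(path)
print(restored.token_to_index)
```

Because load builds the object through cls(**kwargs), a subclass that inherits it would get back an instance of the subclass rather than of the base class.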

# Encode
label_encoder = LabelEncoder()
label_encoder.fit(y)
num_classes = len(label_encoder)

label_encoder.class_to_index

{'computer-vision': 0,
 'mlops': 1,
 'natural-language-processing': 2,
 'other': 3}

label_encoder.index_to_class

{0: 'computer-vision',
 1: 'mlops',
 2: 'natural-language-processing',
 3: 'other'}

# Encode
label_encoder.encode(["computer-vision", "mlops", "mlops"])

array([0, 1, 1])


# Decode
label_encoder.decode(np.array([0, 1, 1]))

['computer-vision', 'mlops', 'mlops']


# Encode all our labels
y = label_encoder.encode(y)
print (y.shape)

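Many of the transformations we're going to apply to our input text features are model specific. For example, for our simple baselines we may do label encoding → tf-idf, while for the more involved architectures we may do label encoding → one-hot encoding → embeddings. So we'll cover these in the next suite of lessons as we implement our baselines.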