Python NumPy Data Cleaning: Efficiently Handling Anomalous and Missing Data

敲代码不忘补水

Last modified 2024-09-28 17:07:53

Tags: python, numpy, programming languages, data cleaning, 2D data

First published 2024-09-28 17:05:58

Article link: https://blog.csdn.net/u014394049/article/details/142618293

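This article shows how to use Python's NumPy library to clean data efficiently, focusing on handling anomalies and filling in missing values. It covers the common problems in data cleaning (missing values, outliers, format errors, and non-independent data) and gives a remedy for each. Using one semester of student grade data as a concrete example, it demonstrates how to identify a duplicated student ID, handle a missing age, clamp out-of-range scores, and fill in a missing score. NumPy functions such as np.unique, np.isnan, and np.clip do the fine-grained work. After cleaning, all of the data conforms to the expected format and rules, which gives later analysis and modeling a sound starting point.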

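I. Common Problems in Data Preprocessing

The common problems are missing values, abnormally large or small values, format errors, and non-independent data. The table below lists each problem with its description, causes, and remedies: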

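Problem type	Description	Causes	Remedies
Missing values	Some values in the dataset are empty or absent, usually represented as `NaN`, `null`, or an empty string.	Omissions during data collection, sensor failures, data transmission problems.	Delete the missing values, fill them (mean, median, etc.), or flag them.
Abnormally large or small values	Some values differ markedly from the rest of the data, typically extreme or implausible values.	Measurement or entry errors, genuine extremes in the tails of a normal distribution, faulty data acquisition equipment.	Remove the outliers, replace them (median, quantiles, etc.), or detect them algorithmically.
Format errors	The data format does not match expectations, e.g. digits mixed into strings, inconsistent date formats, or inconsistent category labels.	Inconsistent data sources, entry errors, no format specification.	Unify the format, clean the data (delete or correct malformed entries), or convert data types.
Non-independent data errors	Data points depend on one another, e.g. duplicated records or repeated measurements of the same object.	Repeated sampling of the same object, duplicated data, dependencies between objects.	Delete duplicates, aggregate (mean, maximum, etc.), or flag the dependency.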

Note: in a real project, decide which data counts as anomalous by weighing the business context and the actual situation together.
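The three remedies for missing values in the table (delete, fill, flag) can be sketched in NumPy roughly as follows. The `readings` array is made-up illustration data and is not part of the article's dataset; filling with the median is one option among several.

```python
import numpy as np

# Hypothetical readings for illustration; np.nan marks a missing value.
readings = np.array([3.0, np.nan, 5.0, 4.0, np.nan, 100.0])

missing = np.isnan(readings)                  # flag: True where a value is missing
dropped = readings[~missing]                  # delete: keep only the present values
median = np.nanmedian(readings)               # median computed while ignoring NaN
filled = np.where(missing, median, readings)  # fill: replace NaN with the median

print("missing flags:", missing)
print("after deleting:", dropped)
print("after filling:", filled)
```

Whether deleting or filling is appropriate depends on how much data is missing and on what the downstream analysis assumes.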

II. The Data to Be Cleaned

Suppose we have one semester of class grade data for a group of primary school students.

raw_data = [
    ["Name", "StudentID", "Age", "AttendClass", "Score"],
    ["小明", 20131, 10, 1, 67],
    ["小花", 20132, 11, 1, 88],
    ["小菜", 20133, None, 1, "98"],
    ["小七", 20134, 8, 1, 110],
    ["花菜", 20134, 98, 0, None],
    ["西兰花", 20136, 12, 0, 12]
]
print(raw_data)

Compare the data types.

import numpy as np

data = np.array(raw_data)
# Mixed strings, numbers, and None produce dtype object
print("data.dtype", data.dtype)
test1 = np.array([1, 2, 3])
test2 = np.array([1.1, 2.3, 3.4])
test3 = np.array([1, 2, 3], dtype=np.float64)
print("test1.dtype", test1.dtype)
print("test2.dtype", test2.dtype)
print("test3.dtype", test3.dtype)
print("test2 > 2 ", test2 > 2)
# TypeError: '>' not supported between instances of 'str' and 'int'
# print("data > 2", data > 2)  # this line would raise the error above

Output

data.dtype object
test1.dtype int64
test2.dtype float64
test3.dtype float64
test2 > 2  [False  True  True]

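NumPy arrays are designed to store elements of a single data type. Because raw_data mixes strings, numbers, and None, NumPy falls back to dtype object, and running data > 2 raises TypeError: '>' not supported between instances of 'str' and 'int'. The following sections clean the data.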
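The dtype behaviour described above can be reproduced on a much smaller array. This is a minimal sketch; `mixed`, `numeric`, and `comparison_failed` are names introduced here purely for illustration.

```python
import numpy as np

# A tiny array mixing strings, an integer, and None, like raw_data.
mixed = np.array([["Name", "Score"], ["小明", 67], ["小菜", None]])
# A purely numeric array for contrast.
numeric = np.array([[1, 67], [2, 98]])

print("mixed.dtype:", mixed.dtype)      # object: elements are stored as Python objects
print("numeric.dtype:", numeric.dtype)  # a native integer type

# Comparing an object array falls back to Python's own operators,
# so 'Name' > 2 fails exactly as it would in plain Python.
try:
    mixed > 2
    comparison_failed = False
except TypeError as exc:
    comparison_failed = True
    print("TypeError:", exc)
```

The exact integer dtype of `numeric` depends on the platform and NumPy version, which is also why `int64` in the output above may show up as `int32` on some systems.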

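III. Data Preprocessing

Drop the header row of strings and remove the first column of names.

# Skip the header row (raw_data[0]) and drop the name column (row[0])
data_process = [row[1:] for row in raw_data[1:]]
# With a float dtype, None becomes nan and the string "98" is parsed as 98.0
data = np.array(data_process, dtype=np.float64)
print("data.dtype", data.dtype)
print(data)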
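The conversion to float64 silently handles two of the problems in the raw data. The sketch below checks that on a single row shaped like the 小菜 record; the second call, with the made-up string "abc", is an assumption-labelled illustration of what happens when a string is not numeric.

```python
import numpy as np

# Same kinds of values as the 小菜 row: an int, None, an int, and a numeric string.
row = [20133, None, 1, "98"]
converted = np.array(row, dtype=np.float64)
print(converted)  # None is stored as nan, "98" as 98.0

# A non-numeric string (hypothetical bad input) cannot be converted.
try:
    np.array([20133, "abc"], dtype=np.float64)
    bad_string_rejected = False
except ValueError as exc:
    bad_string_rejected = True
    print("ValueError:", exc)
```

So a forced float dtype works as a cheap format check here, but it only helps when every malformed entry is either None or a string that parses as a number.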

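IV. Cleaning the Data

1 Inspect the first column: student IDs

Running np.unique shows that one student ID is duplicated. Looking at the data, the neighbouring ID 20135 is missing, so we set the first column of the fifth row to 20135.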

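# Inspect the first column: student IDs
sid = data[:, 0]
unique, counts = np.unique(sid, return_counts=True)
print(counts)
# IDs that occur more than once (20135 is absent from the data)
print(unique[counts > 1])
# Fix the first column of the fifth row
data[4, 0] = 20135
print(data)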
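np.unique reports which ID is duplicated but not which rows carry it. One possible way to locate those rows is sketched below on a standalone copy of the ID column; combining `np.isin` with `np.flatnonzero` is a choice made for this illustration, not something the article itself uses.

```python
import numpy as np

# Standalone copy of the student ID column before the fix.
sid = np.array([20131, 20132, 20133, 20134, 20134, 20136])

unique, counts = np.unique(sid, return_counts=True)
duplicated_ids = unique[counts > 1]  # IDs that appear more than once

# Row indices whose ID is one of the duplicated IDs.
dup_rows = np.flatnonzero(np.isin(sid, duplicated_ids))

print("duplicated IDs:", duplicated_ids)
print("rows with a duplicated ID:", dup_rows)
```

Here rows 3 and 4 share ID 20134; deciding that row 4 is the one to change still requires looking at the data, as the article does.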

2 Inspect the second column: ages

# Inspect the second column: ages
is_nan = np.isnan(data[:, 1])
print("is_nan:", is_nan)
nan_idx = np.argwhere(is_nan)
print(nan_idx)
# The ~ operator flips True/False
print(~np.isnan(data[:, 1]))
# Mean age over the entries that have data; ~ selects the non-NaN entries
print(data[~np.isnan(data[:, 1]), 1])
mean_age = data[~np.isnan(data[:, 1]), 1].mean()
print("Mean age of non-missing entries:", mean_age)

Output

is_nan: [False False  True False False False]
[[2]]
[ True  True False  True  True  True]
[10. 11.  8. 98. 12.]
Mean age of non-missing entries: 27.8

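Analysis

The mean is suspiciously high. Looking at the data, one age is 98, which we judge to be anomalous for a primary school student, so we continue processing.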
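As a side note, NumPy also ships NaN-aware reductions that give the same mean without building the mask by hand, and the median shows how strongly the single value 98 pulls the mean. This is an optional sketch on a standalone copy of the age column, not the approach the article takes.

```python
import numpy as np

# Standalone copy of the age column, with the missing value as nan.
ages = np.array([10, 11, np.nan, 8, 98, 12])

mask_mean = ages[~np.isnan(ages)].mean()  # mask-based mean, as in the article
nan_mean = np.nanmean(ages)               # same mean, NaN ignored automatically
nan_median = np.nanmedian(ages)           # median is far less affected by 98

print("mask-based mean:", mask_mean)
print("np.nanmean:", nan_mean)
print("np.nanmedian:", nan_median)
```

A large gap between the mean and the median is a quick hint that an outlier may be present.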

# ~ flips True/False; & is the element-wise logical AND of two boolean arrays
normal_age_mask = ~np.isnan(data[:, 1]) & (data[:, 1] < 20)
print("normal_age_mask:", normal_age_mask)

normal_age_mean = data[normal_age_mask, 1].mean()
print("normal_age_mean:", normal_age_mean)

data[~normal_age_mask, 1] = normal_age_mean
print("ages:", data[:, 1])

Output

normal_age_mask: [ True  True False  True False  True]
normal_age_mean: 10.25
ages: [10.   11.   10.25  8.   10.25 12.  ]

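Explanation

1) `~np.isnan(data[:, 1])`

~ is the bitwise NOT operator. Applied to a boolean array, it swaps the values: True becomes False and False becomes True.
~np.isnan(data[:, 1]) therefore inverts the result of np.isnan(data[:, 1]), marking the elements of data[:, 1] that are not NaN.

2) `data[~np.isnan(data[:, 1]), 1]`

This expression uses boolean indexing to select the non-NaN elements of the second column (index 1) of data.
~np.isnan(data[:, 1]) acts as the row condition: only rows whose second column is not NaN are selected.
1 is the column index, i.e. the second column.
The result is a one-dimensional array containing every non-NaN element of the second column of data.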
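The mask logic explained above can be checked step by step on a small standalone array. This is a sketch for illustration; `np.logical_and` and `np.logical_not` are shown only as spelled-out equivalents of `&` and `~` on boolean arrays.

```python
import numpy as np

# Standalone copy of the age column, with the missing value as nan.
ages = np.array([10, 11, np.nan, 8, 98, 12])

not_nan = ~np.isnan(ages)  # True where a value is present
under_20 = ages < 20       # comparisons with nan evaluate to False

# The parentheses matter when this is written inline: & binds more tightly
# than <, so the comparison must be wrapped, as in (data[:, 1] < 20).
combined = ~np.isnan(ages) & (ages < 20)

# The same mask written with named functions.
spelled_out = np.logical_and(np.logical_not(np.isnan(ages)), ages < 20)

print("combined mask:", combined)
print("selected ages:", ages[combined])
```

Because any comparison with nan is already False, `ages < 20` alone would give the same mask here; keeping the explicit NaN check makes the intent clearer.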

3 Inspect the last three rows

# Inspect the last three rows (attendance and score columns)
print(data[-3:, 2:])

Output

[[  1. 110.]
 [  0.  nan]
 [  0.  12.]]

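Analysis

A student who did not attend class should have no score, yet the last row shows no attendance together with a score of 12. In the third row from the end, the score is 110, which exceeds the maximum of 100. We continue processing.

# Set the score to 0 for students who did not attend class
data[data[:, 2] == 0, 3] = 0
# Clamp scores above 100 or below 0 into the valid range
data[:, 3] = np.clip(data[:, 3], 0, 100)
print(data[:, 2:])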

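Explanation

np.clip(array, min, max): a NumPy function that limits the elements of an array to the range between a given minimum and maximum. For each element:

If the element is less than min, it is set to min;
If the element is greater than max, it is set to max;
If the element lies between min and max, it is left unchanged.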
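The three rules above can be seen on a small made-up score array. The values below are illustration data only; the nan entry is included to show that np.clip leaves missing values untouched, so they need separate handling.

```python
import numpy as np

# Hypothetical scores: one too high, one too low, one missing.
scores = np.array([67.0, 110.0, -5.0, np.nan, 100.0])

clipped = np.clip(scores, 0, 100)  # clamp every element into [0, 100]

print("before:", scores)
print("after: ", clipped)
```

In the article's data the missing score belongs to a student who did not attend, so it has already been set to 0 before clipping.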

V. Complete Code Example

import numpy as np


def print_hi(name):
    print(f'Hi, {name}')
    raw_data = [
        ["Name", "StudentID", "Age", "AttendClass", "Score"],
        ["小明", 20131, 10, 1, 67],
        ["小花", 20132, 11, 1, 88],
        ["小菜", 20133, None, 1, "98"],
        ["小七", 20134, 8, 1, 110],
        ["花菜", 20134, 98, 0, None],
        ["西兰花", 20136, 12, 0, 12]
    ]
    # print(raw_data)

    data = np.array(raw_data)
    print("data.dtype", data.dtype)
    test1 = np.array([1, 2, 3])
    test2 = np.array([1.1, 2.3, 3.4])
    test3 = np.array([1, 2, 3], dtype=np.float64)
    print("test1.dtype", test1.dtype)
    print("test2.dtype", test2.dtype)
    print("test3.dtype", test3.dtype)
    print("test2 > 2 ", test2 > 2)
    # TypeError: '>' not supported between instances of 'str' and 'int'
    # print("data > 2", data > 2)  # this line would raise the error above

    # Data preprocessing
    data_process = []
    for i in range(len(raw_data)):
        if i == 0:
            continue  # skip the header row of strings
        # drop the first column (names)
        data_process.append(raw_data[i][1:])
    data = np.array(data_process, dtype=np.float64)
    print("data.dtype", data.dtype)
    # print(data)

    # Data cleaning
    # Inspect the first column: student IDs
    sid = data[:, 0]
    unique, counts = np.unique(sid, return_counts=True)
    print(counts)
    # IDs that occur more than once (20135 is absent from the data)
    print(unique[counts > 1])
    # Fix the first column of the fifth row
    data[4, 0] = 20135
    # print(data)

    # Inspect the second column: ages
    is_nan = np.isnan(data[:, 1])
    print("is_nan:", is_nan)
    nan_idx = np.argwhere(is_nan)
    print(nan_idx)
    # The ~ operator flips True/False
    print(~np.isnan(data[:, 1]))
    # Mean age over the entries that have data; ~ selects the non-NaN entries
    print(data[~np.isnan(data[:, 1]), 1])
    mean_age = data[~np.isnan(data[:, 1]), 1].mean()
    print("Mean age of non-missing entries:", mean_age)
    # The mean is suspiciously high: one age is 98, which we treat as anomalous
    # ~ flips True/False; & is the element-wise logical AND of two boolean arrays
    normal_age_mask = ~np.isnan(data[:, 1]) & (data[:, 1] < 20)
    print("normal_age_mask:", normal_age_mask)

    normal_age_mean = data[normal_age_mask, 1].mean()
    print("normal_age_mean:", normal_age_mean)

    data[~normal_age_mask, 1] = normal_age_mean
    print("ages:", data[:, 1])

    # Inspect the last three rows (attendance and score columns)
    print(data[-3:, 2:])
    # No attendance should mean no score, yet the last row has a score;
    # the third row from the end has a score above the maximum of 100
    # Set the score to 0 for students who did not attend class
    data[data[:, 2] == 0, 3] = 0

    # Clamp scores above 100 or below 0 into the valid range
    data[:, 3] = np.clip(data[:, 3], 0, 100)

    print(data[:, 2:])


if __name__ == '__main__':
    print_hi('data cleaning')

Copy the script into your main.py, replacing its contents, and run it. The output is as follows.

Hi, data cleaning
data.dtype object
test1.dtype int64
test2.dtype float64
test3.dtype float64
test2 > 2  [False  True  True]
data.dtype float64
[1 1 1 2 1]
[20134.]
is_nan: [False False  True False False False]
[[2]]
[ True  True False  True  True  True]
[10. 11.  8. 98. 12.]
Mean age of non-missing entries: 27.8
normal_age_mask: [ True  True False  True False  True]
normal_age_mean: 10.25
ages: [10.   11.   10.25  8.   10.25 12.  ]
[[  1. 110.]
 [  0.  nan]
 [  0.  12.]]
[[  1.  67.]
 [  1.  88.]
 [  1.  98.]
 [  1. 100.]
 [  0.   0.]
 [  0.   0.]]