20. Removing Outliers Using the Standard Deviation

其木王·王子

Last modified: 2024-11-16 19:46:49


First published: 2024-11-16 18:52:21

Article link: https://blog.csdn.net/u014217137/article/details/143819886


Removing Outliers Using the Standard Deviation

1. Method
2. Example Code

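I have a data log file named step_rewards.txt. It contains a single column with no header, 10 × 10,000 rows in total, and each row is a reward value. Some points are extreme peaks or troughs, so the series is very noisy. The data needs to be split into 10 segments of 10,000 rows each. After reading the raw data and before drawing the line charts, I want to add a step that filters the raw data and saves the result to a new file, filter_step_rewards.txt.

Requirement: within the data, points that fluctuate strongly should be filtered out and no longer used, and the charts should be drawn from the filtered data.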

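1. Method

Before plotting, the data is preprocessed to remove the points that fluctuate strongly. Statistical methods can be used to identify and remove these outliers. Two common choices are the standard deviation and the interquartile range (IQR).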
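The rest of the article uses the standard-deviation rule. For comparison, the IQR alternative mentioned above could look roughly like the sketch below. The function name remove_outliers_iqr and the multiplier k=1.5 (the conventional choice for IQR fences) are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]

# Tiny hand-made sample with one obvious spike (50.0)
sample = pd.DataFrame({'reward': [1.0, 1.2, 0.9, 1.1, 1.0, 50.0, 0.8, 1.3]})
filtered = remove_outliers_iqr(sample, 'reward')
print(filtered['reward'].tolist())
```

Because quartiles are barely affected by a few extreme values, the IQR fences tend to be less sensitive to the outliers themselves than the mean and standard deviation are.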

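2. Example Code

Below is a complete example that reads the data, cleans it, splits it into segments, and plots each segment.
Outliers are removed with the standard-deviation method.

2.1 Reading and Cleaning the Data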

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

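# Read the data (single column, no header row)
file_path = 'step_rewards.txt'
data = pd.read_csv(file_path, header=None, names=['reward'])

# Inspect the data structure
print(data.head())

# Data cleaning: remove outliers with the standard-deviation method
def remove_outliers(df, column, threshold=3):
    """Drop rows lying more than `threshold` standard deviations from the mean."""
    mean = df[column].mean()
    std = df[column].std()
    outliers = (df[column] - mean).abs() > threshold * std
    return df[~outliers]

# Clean the data
cleaned_data = remove_outliers(data, 'reward')

# Print the first few rows of the cleaned data
print(cleaned_data.head())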

2.2 Segmenting the Data and Plotting

# Save the cleaned data to a new file
output_file_path = 'filter_step_rewards.txt'
cleaned_data.to_csv(output_file_path, index=False, header=False)
print(f"Filtered data saved to {output_file_path}")

# Make sure the output directory exists
output_dir = 'plots'
os.makedirs(output_dir, exist_ok=True)

# Split the data into segments of 10000 rows.
# The segment count is taken from the raw data: counting the cleaned rows
# would drop the last segment as soon as a single row has been removed.
chunk_size = 10000
num_chunks = len(data) // chunk_size

for i in range(num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size
    # Select by original row label (.loc is end-inclusive) so that removed
    # rows do not shift the segment boundaries
    chunk = cleaned_data.loc[start_idx:end_idx - 1]

    # Plot the cleaned data
    plt.figure(figsize=(10, 6))
    plt.plot(chunk.index, chunk['reward'], label='Cleaned Reward')
    plt.legend()
    plt.title(f'Rewards for Chunk {i+1}')
    plt.xlabel('Step')
    plt.ylabel('Reward')

    # Save the figure
    image_path = os.path.join(output_dir, f'reward_chunk_{i+1}.png')
    plt.savefig(image_path)
    plt.close()

print(f"Generated {num_chunks} plots and saved them in {output_dir}")

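2.3 Explanation

1) Reading the data:

data = pd.read_csv(file_path, header=None, names=['reward'])

This line reads step_rewards.txt. The file has no header row (header=None), so the single column is given the name reward.

2) Checking the data:

print(data.head())

Prints the first few rows to confirm that the data loaded correctly.

3) Cleaning the data: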

def remove_outliers(df, column, threshold=3):
    mean = df[column].mean()
    std = df[column].std()
    outliers = (df[column] - mean).abs() > threshold * std
    return df[~outliers]

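The remove_outliers function removes outliers. It computes the mean and the sample standard deviation of the column (pandas' std() uses ddof=1 by default), then flags every point whose distance from the mean exceeds threshold times the standard deviation.
The threshold parameter defaults to 3, so points more than 3 standard deviations from the mean are removed. Lower it to filter more aggressively, or raise it to keep more data.

cleaned_data = remove_outliers(data, 'reward')

Calls remove_outliers to clean the data.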
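To get a feel for how threshold changes the result, the function can be tried on synthetic data. The sketch below is only an illustration: the noise, the seed, and the five injected spikes of 50.0 are made up and are not the author's data.

```python
import numpy as np
import pandas as pd

def remove_outliers(df, column, threshold=3):
    # Same rule as in the article: drop points further than
    # threshold * std from the mean
    mean = df[column].mean()
    std = df[column].std()
    outliers = (df[column] - mean).abs() > threshold * std
    return df[~outliers]

# Synthetic rewards: unit-variance noise with five injected spikes
rng = np.random.default_rng(0)
rewards = rng.normal(loc=0.0, scale=1.0, size=1000)
spike_positions = [100, 300, 500, 700, 900]
rewards[spike_positions] = 50.0
demo = pd.DataFrame({'reward': rewards})

# The spikes themselves inflate the standard deviation well above 1
print(f"std with spikes: {demo['reward'].std():.2f}")

for threshold in (3, 2, 1):
    kept = remove_outliers(demo, 'reward', threshold=threshold)
    print(f"threshold={threshold}: removed {len(demo) - len(kept)} of {len(demo)} rows")

cleaned_demo = remove_outliers(demo, 'reward')
```

Note that the mean and standard deviation are computed from data that still contains the outliers, so very large spikes widen the band they are judged against. When many spikes are present, a smaller threshold or a second pass over the cleaned data may be worth trying.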

4) Saving the cleaned data:

output_file_path = 'filter_step_rewards.txt'
cleaned_data.to_csv(output_file_path, index=False, header=False)
print(f"Filtered data saved to {output_file_path}")

Saves the cleaned data to filter_step_rewards.txt.
index=False means the row index is not written.
header=False means the column name is not written.
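One consequence of index=False is easy to miss: the saved file no longer records which original rows were removed. The sketch below illustrates this with an in-memory buffer in place of a real file; the three-row frame is made up for the example.

```python
import io
import pandas as pd

# Pretend row 1 of the raw data was removed as an outlier
cleaned = pd.DataFrame({'reward': [0.5, 1.5, 2.0]}, index=[0, 2, 3])

# Write exactly as in the article, but to a buffer instead of a file
buf = io.StringIO()
cleaned.to_csv(buf, index=False, header=False)

# Read it back the same way the raw file is read
buf.seek(0)
reloaded = pd.read_csv(buf, header=None, names=['reward'])

print(cleaned.index.tolist())   # original row labels: [0, 2, 3]
print(reloaded.index.tolist())  # renumbered on reload: [0, 1, 2]
```

If a later script needs to cut the filtered file into the original 10,000-step segments, writing the index as well (index=True) is one way to preserve that information.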

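5) Segmenting the data and plotting:

chunk_size = 10000
# Count segments from the raw data, not from the cleaned data
num_chunks = len(data) // chunk_size

for i in range(num_chunks):
    start_idx = i * chunk_size
    end_idx = (i + 1) * chunk_size
    # Label-based, end-inclusive slice: boundaries stay at the original rows
    chunk = cleaned_data.loc[start_idx:end_idx - 1]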

    plt.figure(figsize=(10, 6))
    plt.plot(chunk.index, chunk['reward'], label='Cleaned Reward')
    plt.legend()
    plt.title(f'Rewards for Chunk {i+1}')
    plt.xlabel('Step')
    plt.ylabel('Reward')

    image_path = os.path.join(output_dir, f'reward_chunk_{i+1}.png')
    plt.savefig(image_path)
    plt.close()

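chunk_size is the number of rows per segment.
num_chunks is the number of segments to plot.
The for loop goes through the segments, extracting and plotting the cleaned data of each one.
plt.savefig(image_path) saves the figure to the given path.
plt.close() closes the current figure so that figures do not accumulate in memory across iterations.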
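The code above computes one mean and one standard deviation for the whole file. If the reward level drifts between segments, a spike that is extreme within its own segment can go unnoticed globally. A possible variant, sketched below, applies the same rule separately to each segment. The function name remove_outliers_per_segment is hypothetical, and the sketch assumes the default 0..n-1 row index produced by read_csv.

```python
import numpy as np
import pandas as pd

def remove_outliers_per_segment(df, column, chunk_size=10000, threshold=3):
    # Segment id of every row, derived from its position in the raw data
    segment = np.asarray(df.index) // chunk_size
    grouped = df[column].groupby(segment)
    # transform() broadcasts each segment's statistic back to its rows
    mean = grouped.transform('mean')
    std = grouped.transform('std')
    outliers = (df[column] - mean).abs() > threshold * std
    return df[~outliers]

# Two small synthetic segments at very different reward levels
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(100.0, 1.0, 100)])
values[10] = 30.0  # a spike that is only extreme relative to segment 0
demo = pd.DataFrame({'reward': values})

per_segment = remove_outliers_per_segment(demo, 'reward', chunk_size=100)

# For comparison: would the global rule from the article flag row 10?
global_dev = (demo['reward'] - demo['reward'].mean()).abs()
flagged_globally = bool(global_dev[10] > 3 * demo['reward'].std())
print(10 in per_segment.index, flagged_globally)
```

Which variant is appropriate depends on the data: the global rule is simpler, while the per-segment rule adapts to level shifts between segments.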

6) Printing a summary:

print(f"Generated {num_chunks} plots and saved them in {output_dir}")

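Prints the number of images generated and the directory they were saved in.

After running the code above, the plots directory contains one image per segment, each showing the cleaned reward values of that segment. For the 100,000-row input this is 10 images when the segment count is taken from the raw row count; counting the cleaned rows instead would give only 9 once any row has been removed.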


2.4 `outliers`

1) Inside remove_outliers, we define a boolean Series, outliers, that marks which data points are outliers.

outliers = (df[column] - mean).abs() > threshold * std

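(df[column] - mean).abs(): computes the absolute deviation of each data point from the mean.
> threshold * std: checks whether that absolute deviation exceeds threshold times the standard deviation.
The result is a boolean Series in which True marks an outlier and False marks a normal point.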

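2) ~outliers
In Python, ~ is the bitwise NOT operator. pandas overloads it on a boolean Series as an element-wise logical NOT, so ~outliers turns every True into False and every False into True, and df[~outliers] keeps only the rows that are not outliers. This does not carry over to plain Python booleans, where ~True evaluates to -2 and ~False to -1; use not for those.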
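The difference between ~ on a boolean Series and ~ on a plain Python bool can be checked directly. The short sketch below uses a made-up three-element mask. The warning filter is there only because recent Python versions emit a DeprecationWarning for ~ on bool.

```python
import warnings
import pandas as pd

# Element-wise logical NOT on a boolean Series
outliers = pd.Series([False, True, False])
keep = ~outliers
print(keep.tolist())  # [True, False, True]

# On a plain Python bool, ~ is integer bitwise NOT
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    plain = ~True
print(plain)  # -2
```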