Self-study notes | Filtering DataFrame rows and saving them

小小白鹿

Last modified 2023-07-01 13:56:21


First published 2023-07-01 13:10:06

Article link: https://blog.csdn.net/m0_48196258/article/details/131488705


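Requirement: I want to select the rows of df that contain the strings "USA" or "CHN" and save them. The examples further down also cover the stricter case in which a row must contain both strings.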

# Read the file (tab-separated, no header row)
import pandas as pd
df = pd.read_csv('20230626.export.CSV', sep='\t', header=None)
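The effect of header=None can be checked on a small in-memory sample. This is a minimal sketch: the two-line text below is made up and only stands in for the real export file.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for the real export file (no header row).
raw = "1\tCHN\tUSA\n2\tJPN\tCAN\n"

# header=None: every line is data and the columns are labeled 0, 1, 2, ...
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
print(df.shape)          # (2, 3)
print(list(df.columns))  # [0, 1, 2]

# Without header=None, the first line is consumed as column names.
df_lost = pd.read_csv(io.StringIO(raw), sep="\t")
print(df_lost.shape)     # (1, 3)
```

If a file really does have a header row, leaving out header=None is the right choice; the point is only that the setting has to match the file.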

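# Filter by condition: keep rows in which any cell contains "CHN" or "USA"
filtered_df = df[df.apply(lambda x: x.astype(str).str.contains("CHN|USA")).any(axis=1)]
# Variant: keep rows that contain both "CHN" and "USA", possibly in different cells
#filtered_df = df[df.apply(lambda x: x.astype(str).str.contains("CHN")).any(axis=1) & df.apply(lambda x: x.astype(str).str.contains("USA")).any(axis=1)]

Explanation of the code above:

The lambda function lambda x: x.astype(str).str.contains("CHN|USA") converts a column x to strings and checks whether each element contains "CHN" or "USA" (the pattern is a regular expression, so | means "or"). df.apply calls it once per column, so the overall result is a boolean DataFrame with the same shape as df, in which each element is True or False depending on whether the element at that position satisfies the condition.

In the commented-out variant, each df.apply(...).any(axis=1) call produces one boolean value per row, and the bitwise AND operator & combines the two row-level results, so a row is kept only if "CHN" and "USA" both appear somewhere in it. Putting & inside a single lambda, as in x.astype(str).str.contains("CHN") & x.astype(str).str.contains("USA"), would instead require one and the same cell to contain both strings.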

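.any(axis=1): this part reduces each row across its columns (axis=1) and returns a boolean Series that is True for a row when at least one element in that row satisfies the condition.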

df[...]: this part is boolean indexing. Placing the boolean Series inside the square brackets returns the rows of df whose corresponding value is True; all other rows are excluded.
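One detail worth keeping in mind is that str.contains matches substrings, while a comparison against whole cell values does not. The following sketch uses a made-up Series (the value "CHNGOV" is purely illustrative) to show the difference.

```python
import pandas as pd

# Made-up values; 123 shows why astype(str) is applied first.
s = pd.Series(["CHN", "CHNGOV", "USA", "JPN", 123])
as_str = s.astype(str)

# Regex alternation, matched anywhere inside the cell.
substring = as_str.str.contains("CHN|USA")
# Whole-cell equality against a list of allowed values.
exact = as_str.isin(["CHN", "USA"])

print(substring.tolist())  # [True, True, True, False, False]
print(exact.tolist())      # [True, False, True, False, False]
```

Which of the two is appropriate depends on the data: substring matching is more permissive, exact matching avoids accidental hits on longer codes.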

An example:

   A        B        C
0  CHN    USA    Europe
1  CHN    JPN    USA
2  CAN    USA    IND
3  USA    USA    USA

as_str = df.astype(str)
has_chn = as_str.apply(lambda x: x.str.contains("CHN"))
has_usa = as_str.apply(lambda x: x.str.contains("USA"))

# has_chn is this boolean DataFrame

       A      B      C
0   True  False  False
1   True  False  False
2  False  False  False
3  False  False  False

# has_usa is this boolean DataFrame

       A      B      C
0  False   True  False
1  False  False   True
2  False   True  False
3   True   True   True

# Reduce each mask per row first, then combine. Applying & cell by cell would
# require a single cell to contain both strings and would select nothing here.
mask = has_chn.any(axis=1) & has_usa.any(axis=1)

# mask is this boolean Series

0     True
1     True
2    False
3    False
dtype: bool

filtered_df = df[mask]

# filtered_df is

     A    B       C
0  CHN  USA  Europe
1  CHN  JPN     USA
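As a cross-check of the example, the sketch below rebuilds the table and compares two ways of combining the conditions. Applying & cell by cell asks a single cell to contain both strings, which no cell in this table does; combining the two per-row results selects rows 0 and 1. The variable names are illustrative.

```python
import pandas as pd

# The example table from above.
df = pd.DataFrame({
    "A": ["CHN", "CHN", "CAN", "USA"],
    "B": ["USA", "JPN", "USA", "USA"],
    "C": ["Europe", "USA", "IND", "USA"],
})
as_str = df.astype(str)

# Cell-wise AND: a single cell would have to contain both strings.
cellwise = as_str.apply(
    lambda x: x.str.contains("CHN") & x.str.contains("USA")
).any(axis=1)

# Row-wise AND: "CHN" somewhere in the row and "USA" somewhere in the row.
rowwise = (
    as_str.apply(lambda x: x.str.contains("CHN")).any(axis=1)
    & as_str.apply(lambda x: x.str.contains("USA")).any(axis=1)
)

print(cellwise.tolist())            # [False, False, False, False]
print(rowwise.tolist())             # [True, True, False, False]
print(df[rowwise].index.tolist())   # [0, 1]
```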

Code that achieves the same result:

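# Iterate over every row.
# df.iterrows() yields one (index, row) pair per row of df: index is the row's label, row is the row's data.
matched = []
for index, row in df.iterrows():
    # Check whether the current row contains both "CHN" and "USA" as whole cell values
    if "CHN" in row.values and "USA" in row.values:
        # If so, remember the row's index
        matched.append(index)
# Build the new DataFrame in one step (DataFrame.append was removed in pandas 2.0)
new_df = df.loc[matched].reset_index(drop=True)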

Batch-process the files and save each result to a new CSV:

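import pandas as pd
import os

# Folder that holds the input files
folder_path = 'G:\\整月\\'

# Iterate over the files in the folder
for file_name in os.listdir(folder_path):
    # Only process CSV files (case-insensitive, so ".CSV" also matches)
    # and skip output files written by an earlier run
    if file_name.lower().endswith('.csv') and not file_name.endswith('_修改.csv'):
        # Build the full file path
        file_path = os.path.join(folder_path, file_name)

        # Read the tab-separated file.
        # Note: if the files have no header row, add header=None as in the
        # single-file example; otherwise the first data row becomes the column names.
        df = pd.read_csv(file_path, sep='\t')

        # Keep only the columns of interest
        df = df.iloc[:, [2, 7, 17, 29, 30, 34]]

        # Column names (left blank here; fill in your own)
        df.columns = ['', '', '', '', '', '']
        #print(df)

        # Collect the index of every row that contains both "CHN" and "USA"
        matched = []
        for index, row in df.iterrows():
            if "CHN" in row.values and "USA" in row.values:
                matched.append(index)

        # Build the new DataFrame in one step (DataFrame.append was removed in pandas 2.0)
        new_df = df.loc[matched].reset_index(drop=True)

        # Save the new DataFrame under a modified file name in the same folder
        new_file_name = os.path.splitext(file_name)[0] + '_修改.csv'
        new_file_path = os.path.join(folder_path, new_file_name)
        new_df.to_csv(new_file_path, sep='\t')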
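For comparison, here is a minimal sketch of the same batch idea written with pathlib and a vectorized mask. It rests on several assumptions that are not taken from the script above: the function names filter_file and filter_folder, the "_filtered" suffix, reading every column as text without a header row, and skipping the column selection. It runs against a temporary folder so it can be tried without touching real data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def filter_file(src: Path, dst: Path, required=("CHN", "USA")) -> int:
    """Write the rows of src that contain every value in `required` to dst."""
    # Assumption: tab-separated input without a header row, read as text.
    df = pd.read_csv(src, sep="\t", header=None, dtype=str)
    mask = pd.Series(True, index=df.index)
    for value in required:
        # True for rows in which some cell equals `value` exactly.
        mask &= (df == value).any(axis=1)
    df[mask].to_csv(dst, sep="\t", header=False, index=False)
    return int(mask.sum())


def filter_folder(folder: Path, suffix: str = "_filtered") -> dict:
    """Process every .csv file in folder (case-insensitive), skipping earlier outputs."""
    counts = {}
    for src in sorted(folder.iterdir()):
        if src.suffix.lower() != ".csv" or src.stem.endswith(suffix):
            continue
        dst = src.with_name(src.stem + suffix + ".csv")
        counts[src.name] = filter_file(src, dst)
    return counts


# Usage on a throwaway folder with one hypothetical input file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "day1.CSV").write_text(
        "1\tCHN\tUSA\n2\tJPN\tUSA\n3\tUSA\tCHN\n", encoding="utf-8"
    )
    counts = filter_folder(folder)
    result = pd.read_csv(folder / "day1_filtered.csv", sep="\t", header=None, dtype=str)
    print(counts)       # {'day1.CSV': 2}
    print(len(result))  # 2
```

Skipping files that already carry the output suffix keeps a second run from reprocessing its own results.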

Summary:

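The vectorized code shown earlier is more efficient on large datasets. Iterating row by row can become very slow there, because Python-level loops are generally slow. The row loop above can be replaced by the following vectorized equivalent: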

condition = (df == "CHN").any(axis=1) & (df == "USA").any(axis=1)
new_df = df[condition].reset_index(drop=True)
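The same pattern extends to any number of required values. The sketch below wraps it in a helper and checks it against an explicit row loop on the example table; the name rows_with_all is hypothetical.

```python
import pandas as pd


def rows_with_all(df: pd.DataFrame, values) -> pd.DataFrame:
    """Keep the rows in which every value in `values` appears as a whole cell."""
    mask = pd.Series(True, index=df.index)
    for value in values:
        mask &= (df == value).any(axis=1)
    return df[mask].reset_index(drop=True)


df = pd.DataFrame({
    "A": ["CHN", "CHN", "CAN", "USA"],
    "B": ["USA", "JPN", "USA", "USA"],
    "C": ["Europe", "USA", "IND", "USA"],
})

vectorized = rows_with_all(df, ["CHN", "USA"])

# Reference result from an explicit row loop.
kept = [i for i, row in df.iterrows() if "CHN" in row.values and "USA" in row.values]
looped = df.loc[kept].reset_index(drop=True)

print(vectorized.equals(looped))  # True
print(vectorized["A"].tolist())   # ['CHN', 'CHN']
```

The helper still loops, but only over the handful of required values, while the per-row work stays inside pandas.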
