Filtering data with pandas

LaoChen_ZeroonE

Article link: https://blog.csdn.net/qq_34356768/article/details/115679251

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author  : LaoChen

"""
import numpy as np
import pandas as pd
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
    Out[9]:
                     A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213
    2017-04-06  1.143858 -0.326720  1.425379  0.531037
Selection
Selecting with []
Select one column or several columns:
df['A']
Out:
    2017-04-01    0.522241
    2017-04-02    2.104572
    2017-04-03    0.480507
    2017-04-04    1.700309
    2017-04-05    0.526615
    2017-04-06    1.143858
df[['A','B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
    2017-04-05  0.526615 -0.417645
    2017-04-06  1.143858 -0.326720
Select one row or several rows (by slicing on the index labels):
df['2017-04-01':'2017-04-01']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
df['2017-04-01':'2017-04-03']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
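The three forms of [] shown above can be tried on a reproducible frame. This is a minimal sketch: instead of random numbers it rebuilds df from the values printed in the sample output, and the names col, sub and rows are only illustrative.

```python
import pandas as pd

# Values copied from the sample output so the results are reproducible.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

col = df['A']                          # a single label returns a Series
sub = df[['A', 'B']]                   # a list of labels returns a DataFrame
rows = df['2017-04-01':'2017-04-03']   # a slice selects rows; the end label is included

print(type(col).__name__, sub.shape, rows.shape)
```

Note that a bare string such as df['2017-04-01'] is looked up as a column label, which is why a single row is written as a slice here.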
loc: select data by row label
df.loc['2017-04-01','A']
df.loc['2017-04-01']
Out:
    A    0.522241
    B    0.495106
    C   -0.268194
    D   -0.035003
df.loc['2017-04-01':'2017-04-03']
Out:
                       A         B         C         D
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
df.loc['2017-04-01':'2017-04-04',['A','B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
df.loc[:,['A','B']]
Out:
                       A         B
    2017-04-01  0.522241  0.495106
    2017-04-02  2.104572 -0.977768
    2017-04-03  0.480507  1.215048
    2017-04-04  1.700309  0.287588
    2017-04-05  0.526615 -0.417645
    2017-04-06  1.143858 -0.326720
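The loc calls above can be checked with a short sketch. It assumes the frame is rebuilt from the printed sample values rather than from random numbers; the variable names are illustrative only.

```python
import pandas as pd

# Values copied from the sample output so the results are reproducible.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

value = df.loc['2017-04-01', 'A']                        # row label + column label: one value
row = df.loc['2017-04-01']                               # one row label: a Series indexed by column
block = df.loc['2017-04-01':'2017-04-04', ['A', 'B']]    # label slice (end included) + column list
all_rows = df.loc[:, ['A', 'B']]                         # ':' keeps every row

print(value, row.shape, block.shape, all_rows.shape)
```

Unlike ordinary Python slices, a label slice in loc includes its end label, so block has four rows.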
iloc: select data by integer position (row number)
df.iloc[2]
Out:
    A    0.480507
    B    1.215048
    C    1.313314
    D   -0.072320
df.iloc[1:3]
Out:
                       A         B         C         D
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
df.iloc[1,1]

df.iloc[1:3,1]

df.iloc[1:3,1:2]

df.iloc[[1,3],[2,3]]
Out:
                       C         D
    2017-04-02 -0.139632 -0.735926
    2017-04-04 -0.012103  0.525291

df.iloc[[1,3],:]

df.iloc[:,[2,3]]
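Several of the iloc calls above are listed without output. The sketch below shows what kind of object each one returns; it assumes a frame filled with the integers 0..23 (not the article's random data) so that every position is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Assumption: integer data instead of random numbers, so cell (i, j) holds 4*i + j.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

third_row = df.iloc[2]             # one position: a Series (8, 9, 10, 11)
two_rows = df.iloc[1:3]            # positions 1 and 2; the end position is excluded
cell = df.iloc[1, 1]               # row 1, column 1: a single value (5)
col_series = df.iloc[1:3, 1]       # integer column position: a Series
col_frame = df.iloc[1:3, 1:2]      # column slice: a one-column DataFrame
picked = df.iloc[[1, 3], [2, 3]]   # lists of positions: rows 1 and 3, columns C and D

print(type(col_series).__name__, type(col_frame).__name__)
```

Positional slices follow the usual Python rule and exclude the end, which is the opposite of the label slices used with loc.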
iat: get the value of a single cell
df.iat[1,2]
Out:
    -0.13963224781812655
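iat addresses a cell by position; pandas also provides at, its label-based counterpart. A small sketch, again assuming integer data so the expected cell is obvious:

```python
import numpy as np
import pandas as pd

# Assumption: integer data instead of random numbers, so cell (i, j) holds 4*i + j.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

by_position = df.iat[1, 2]         # second row, third column
by_label = df.at[dates[1], 'C']    # the same cell, addressed by row and column label

print(by_position, by_label)
```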
Filtering
Filtering with []
Put a boolean expression inside []; every row for which it evaluates to True is selected.

df[df.A>1]
Out:
                       A         B         C         D
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
    2017-04-06  1.143858 -0.326720  1.425379  0.531037
df[df>1]
Out:
                       A         B         C   D
    2017-04-01       NaN       NaN       NaN NaN
    2017-04-02  2.104572       NaN       NaN NaN
    2017-04-03       NaN  1.215048  1.313314 NaN
    2017-04-04  1.700309       NaN       NaN NaN
    2017-04-05       NaN       NaN       NaN NaN
    2017-04-06  1.143858       NaN  1.425379 NaN

df[df.A+df.B>1.5]
Out:
                       A         B         C         D
    2017-04-03  0.480507  1.215048  1.313314 -0.072320
    2017-04-04  1.700309  0.287588 -0.012103  0.525291
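Several conditions can be combined into one mask with & (and), | (or) and ~ (not). The sketch below assumes the frame is rebuilt from the printed sample values; each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

# Values copied from the sample output so the results are reproducible.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

mask = df['A'] > 1                           # boolean Series, one value per row
big_a = df[mask]                             # rows where A > 1
both = df[(df['A'] > 1) & (df['D'] > 0)]     # rows where both conditions hold
either = df[(df['A'] > 2) | (df['B'] > 1)]   # rows where at least one holds

print(len(big_a), len(both), len(either))
```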
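A more complex example: take the rows whose index lies between '2017-04-01' and '2017-04-04', and keep only the columns whose sum is greater than 1 (df.sum() adds up each column, so the boolean result selects columns, not rows):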

df.loc['2017-04-01':'2017-04-04',df.sum()>1]
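It is worth checking which axis the mask in this call applies to. df.sum() adds up each column, so df.sum()>1 holds one boolean per column and, placed in the second slot of loc, it picks columns. The sketch below assumes the frame is rebuilt from the printed sample values and also shows a row-wise variant using sum(axis=1); the names col_mask, window and by_row are illustrative.

```python
import pandas as pd

# Values copied from the sample output so the results are reproducible.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

col_mask = df.sum() > 1                                    # one boolean per column
by_col = df.loc['2017-04-01':'2017-04-04', col_mask]       # date window, columns with sum > 1

window = df.loc['2017-04-01':'2017-04-04']
by_row = window[window.sum(axis=1) > 1]                    # date window, rows with row sum > 1

print(list(by_col.columns), len(by_row))
```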
You can also combine [] with the apply method to build more complex filters, using any function that returns a boolean as the filter condition:

df[df.apply(lambda x: x['B'] > x['C'], axis=1)]
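A runnable sketch of the apply-based filter, assuming the frame is rebuilt from the printed sample values. Column labels are case-sensitive, so the lambda refers to 'B' and 'C'. With axis=1 the function receives one row at a time; for a plain comparison like this one, the vectorized mask gives the same rows without a Python-level loop.

```python
import pandas as pd

# Values copied from the sample output so the results are reproducible.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    [[0.522241, 0.495106, -0.268194, -0.035003],
     [2.104572, -0.977768, -0.139632, -0.735926],
     [0.480507, 1.215048, 1.313314, -0.072320],
     [1.700309, 0.287588, -0.012103, 0.525291],
     [0.526615, -0.417645, 0.405853, -0.835213],
     [1.143858, -0.326720, 1.425379, 0.531037]],
    index=dates, columns=list('ABCD'))

# axis=1: the lambda is called once per row and must return True or False.
row_mask = df.apply(lambda row: row['B'] > row['C'], axis=1)
via_apply = df[row_mask]

# Equivalent vectorized comparison of two whole columns.
via_columns = df[df['B'] > df['C']]

print(len(via_apply), via_apply.equals(via_columns))
```

apply remains useful when the condition cannot be written as column arithmetic, for example when it calls an arbitrary Python function on each row.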
Using isin
df['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df
Out:
                       A         B         C         D      E
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003    one
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926    one
    2017-04-03  0.480507  1.215048  1.313314 -0.072320    two
    2017-04-04  1.700309  0.287588 -0.012103  0.525291  three
    2017-04-05  0.526615 -0.417645  0.405853 -0.835213   four
    2017-04-06  1.143858 -0.326720  1.425379  0.531037  three

df[df.E.isin(['one'])]
Out:
                       A         B         C         D    E
    2017-04-01  0.522241  0.495106 -0.268194 -0.035003  one
    2017-04-02  2.104572 -0.977768 -0.139632 -0.735926  one
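isin accepts any list of values, and its result can be inverted with ~ to exclude them instead. A minimal sketch, assuming a reduced frame that keeps only columns A and E from the example:

```python
import pandas as pd

# Assumption: a reduced frame with only columns A and E from the example above.
dates = pd.date_range('2017-04-01', periods=6)
df = pd.DataFrame(
    {'A': [0.522241, 2.104572, 0.480507, 1.700309, 0.526615, 1.143858],
     'E': ['one', 'one', 'two', 'three', 'four', 'three']},
    index=dates)

wanted = ['one', 'three']
included = df[df['E'].isin(wanted)]     # rows whose E is 'one' or 'three'
excluded = df[~df['E'].isin(wanted)]    # all remaining rows

print(included['E'].tolist(), excluded['E'].tolist())
```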


"""
