Python | Pandas | 筛选

16 篇文章 3 订阅
12 篇文章 0 订阅

本文主要介绍如何使用 Pandas 对值进行筛选。

Updated: 2023 / 08 / 05


以下面的3个 dataframe 为例对如何针对性筛选数据的方法进行说明 12

  1. 原始数据文件参考此处,如下所示:
df1 = pd.read_csv('dtype_RawData.csv')
#    Customer Number     Customer Name         2016  ... Month   Day Active
# 0             1002  Quest industries  ¥125,000.00  ...     1  2015      Y
# 1           552278    Smith Plumbing  ¥920,000.00  ...     6  2014      Y
# 2            23477   ACME Industrial   ¥50,000.00  ...     3  2016      Y
# 3            24900        Brekke LTD  ¥350,000.00  ...    10  2015      Y
# 4           651029         Harbor Co   ¥15,000.00  ...     2  2014      N
#
# [5 rows x 9 columns]
#
# Customer Number     int64
# Customer Name      object
# 2016               object
# 2017               object
# Percent Growth     object
# Jan Units          object
# Month               int64
# Day                 int64
# Active             object
# dtype: object
df2 = {
    'number': [1, 2, 3, 4, 5],
    'sex': ['male', 'female', 'female', 'male', 'male'],
    'name': ['Jacob', 'Sam', 'Jane', 'Smith', 'Einstein'],
    'age': ['', '31.0', '', '21.0', np.nan],
    'education': ['master', 'bachelor', 'master', 'PhD', 'bachelor']
}

df2 = pd.DataFrame(df2)
#    number     sex      name   age education
# 0       1    male     Jacob          master
# 1       2  female       Sam  31.0  bachelor
# 2       3  female      Jane          master
# 3       4    male     Smith  21.0       PhD
# 4       5    male  Einstein   NaN  bachelor
#
# number        int64
# sex          object
# name         object
# age          object
# education    object
# dtype: object
df3 = pd.DataFrame(
    data={"data_provider": ["prov_1", "prov_1", "prov_2", "prov_2", "prov_3", "prov_3"],
          "indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
          "unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
          "year": ["2017", "2018","2019", "2017","2018","2019"],
          "country1": [1, 2, 3, 2, 4, 6],
          "country2": [4, 5, 6, 40, 50, 60]}
)

df3 = pd.pivot_table(data=df3,
                     index=['data_provider', 'indicator'],
                     columns=['year', 'country1'],
                     fill_value=30)
#                         country2
# year                        2017     2018     2019
# country1                       1   2    2   4    3   6
# data_provider indicator
# prov_1        ind_a            4  30    5  30   30  30
# prov_2        ind_a           30  30   30  30    6  30
#               ind_b           30  40   30  30   30  30
# prov_3        ind_b           30  30   30  50   30  60
#
# MultiIndex([('prov_1', 'ind_a'),
#             ('prov_2', 'ind_a'),
#             ('prov_2', 'ind_b'),
#             ('prov_3', 'ind_b')],
#            names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
#             ('country2', '2017', 2),
#             ('country2', '2018', 2),
#             ('country2', '2018', 4),
#             ('country2', '2019', 3),
#             ('country2', '2019', 6)],
#            names=[None, 'year', 'country1'])

数字

方法

方法说明
df.select_dtypes(include=None, exclude=None)
pd.to_numeric(arg, errors="raise", downcast=None)

示例

df.select_dtypes
int1 = df1.select_dtypes(include='int')
#    Customer Number  Month   Day
# 0             1002      1  2015
# 1           552278      6  2014
# 2            23477      3  2016
# 3            24900     10  2015
# 4           651029      2  2014

print(int1.dtypes)
# Customer Number    int64
# Month              int64
# Day                int64
# dtype: object
int2 = df2.select_dtypes(include='int')
#    number
# 0       1
# 1       2
# 2       3
# 3       4
# 4       5

print(int2.dtypes)
# number    int64
# dtype: object

pd.to_numeric
df1nums = []
for col in df1.columns:
    df1num = pd.to_numeric(df1[col], errors='coerce')
    df1nums.append(df1num)
df1nums = pd.concat(dfnums, axis=1, ignore_index=True)
df1nums.columns = df1.columns
#    Customer Number  Customer Name  2016  2017  ...  Jan Units  Month   Day  Active
# 0             1002            NaN   NaN   NaN  ...      500.0      1  2015     NaN
# 1           552278            NaN   NaN   NaN  ...      700.0      6  2014     NaN
# 2            23477            NaN   NaN   NaN  ...      125.0      3  2016     NaN
# 3            24900            NaN   NaN   NaN  ...       75.0     10  2015     NaN
# 4           651029            NaN   NaN   NaN  ...        NaN      2  2014     NaN
#
# [5 rows x 9 columns]

print(df1nums.dtypes)
# Customer Number      int64
# Customer Name      float64
# 2016               float64
# 2017               float64
# Percent Growth     float64
# Jan Units          float64
# Month                int64
# Day                  int64
# Active             float64
# dtype: object
df2nums = []
for col in df2.columns:
    df2num = pd.to_numeric(df2[col], errors='coerce', downcast='integer')
    df2nums.append(df2num)
df2nums = pd.concat(df2nums, axis=1, ignore_index=True)
#    0   1   2     3   4
# 0  1 NaN NaN   NaN NaN
# 1  2 NaN NaN  31.0 NaN
# 2  3 NaN NaN   NaN NaN
# 3  4 NaN NaN  21.0 NaN
# 4  5 NaN NaN   NaN NaN

print(df2nums.dtypes)
# 0       int8
# 1    float64
# 2    float64
# 3    float64
# 4    float64
# dtype: object

方法

前提方法说明
已有目标列标签 [^pandas读取或选择多列]pd.Dataframe(df, columns=labels)
df[['col1', 'col2']]
df.loc[:, ['col1', 'col2']]
df.loc[:, df.columns.isin(['col1', 'col2', 'col3'])]
df.filter(items=['col1', 'col2', 'col3'], axis=1)
已有符合特定模式的目标列标签df.columns.str.contains(‘keyword’)
df.filter(like='keyword', axis=1)根据指定的索引标签对数据框的行或列进行子集化。请注意,此例程不会根据其内容过滤数据框。过滤器应用于索引的标签。
df.loc[:, df.columns.str.startswith('keyword')]
已有目标列索引 [^pandas读取或选择多列]df.iloc[:, [x, y]]

df.filter
DataFrame.filter(items=None, like=None, regex=None, axis=None)

返回和输入数据同类型的数据 34

参数说明
items列表式,按 items 中给出的轴标签进行数据的保留。
like字符串,按给定的轴方向和标签中含有 like 所给出标签进行数据的保留。
regex字符串 (常规表达式),按轴方向上满足 re.search(regex, label) == True 条件的标签进行数据的保留。
axis{0 or ‘index’, 1 or ‘columns’, None}, default None,要过滤的轴,表示为索引 (int) 或轴名称 (str)。

默认情况下,这是信息轴,DataFrame。对于系列,此参数未使用,默认为无。

itemslikeregex 参数强制互斥。


示例

列标签
目标列标签
collabels = ['2016', 'Day']
df1cols = pd.DataFrame(df1, columns=collabels)
#           2016   Day
# 0  ¥125,000.00  2015
# 1  ¥920,000.00  2014
# 2   ¥50,000.00  2016
# 3  ¥350,000.00  2015
# 4   ¥15,000.00  2014
#
# Index(['2016', 'Day'], dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)
collabels = ['2016', 'Day']
df1cols = df1[collabels]
collabels = ['2016', 'Day']
df1cols = df1.loc[:, collabels]
df2cols = df2.loc[:, df2.columns.isin(['gender', 'name', 'education'])]
#        name education
# 0     Jacob    master
# 1       Sam  bachelor
# 2      Jane    master
# 3     Smith       PhD
# 4  Einstein  bachelor
df2cols = df2.filter(items=['gender', 'name', 'education'], axis=1)

符合特定模式目标列标签
    1. 含有特定关键字
df2cols = df2.loc[:, df2.columns.str.contains('ex')]
#       sex
# 0    male
# 1  female
# 2  female
# 3    male
# 4    male
df2cols = df2.filter(like='am', axis=1)
#        name
# 0     Jacob
# 1       Sam
# 2      Jane
# 3     Smith
# 4  Einstein
df2cols = df2.loc[:, df2.columns.str.startswith('n')]
#    number      name
# 0       1     Jacob
# 1       2       Sam
# 2       3      Jane
# 3       4     Smith
# 4       5  Einstein
df2cols = df2.filter(like='n', axis=1)
#    number      name education
# 0       1     Jacob    master
# 1       2       Sam  bachelor
# 2       3      Jane    master
# 3       4     Smith       PhD
# 4       5  Einstein  bachelor

列索引

目标列索引如下,

colidx = [3, 4]

根据目标列索引选取指定列的方式如下,

df1cols = df1.iloc[:, colidx]
#             2017 Percent Growth
# 0    ¥162,500.00         30.00%
# 1  ¥1,012,000.00         10.00%
# 2     ¥62,500.00         25.00%
# 3    ¥490,000.00          4.00%
# 4     ¥12,750.00        -15.00%
#
# Index(['2017', 'Percent Growth'], dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)

方法

前提方法说明
已有符合特定条件的值的所在行df.loc[df.column1.isin([val1, val2, val3])] 5col1 列中等于 val1val2 或者 val3 的值所在的行 6
已有目行索引df.iloc[[x, y], :]xy

示例

符合特定条件的值所在行
df2filtered = df2.loc[df2.age.isin(['', np.nan])]
#    number     sex      name  age education
# 0       1    male     Jacob         master
# 2       3  female      Jane         master
# 4       5    male  Einstein  NaN  bachelor

目标行索引
df2filtered = df2.iloc[[0, 3, 4], :]
#    number   sex      name   age education
# 0       1  male     Jacob          master
# 3       4  male     Smith  21.0       PhD
# 4       5  male  Einstein   NaN  bachelor
#
# Int64Index([0, 3, 4], dtype='int64')

方法

方法说明
df.loc[row_indexer,col_indexer]
df[df[col].isna()] 7选取列表签为 col 的列中值为 NaN 的所在行
df.query() 81. 写类似于 sql 语句或者表达式

df.query('Q1 > Q2 > 90')
df.query('Q1 + Q2 > 180')
df.query('Q1 == Q2')
df.query('(Q1<50) & (Q2>40) and (Q3>90)')
df.query('Q1 > Q2 > Q3 > Q4')
2. 表达式

df.query('team != "C"')
df.query('team in ["A","B"]')
df.query('team not in ("E","A","B")')
df.query('team == ["A","B"]')
df.query('team != ["A","B"]')
df.query('name.str.contains("am")')
3. 空值 9

engine = 'python' 的作用是以 python 方式解读表达式,而不是以 pandas。对于是否添加 engine = 'python' 取决于所筛选列的值的数据类型。如果是 floats, 它具备原生 NaN, 则可以不添加; 如果是 str 或者 int,则需添加。

df.query('col1.isnull()', engine='python')

df.query('col1.notnull()', engine='python')

pandasnumpy 中,NaN 不等于 NaN。所以如果想检验一个单元格是否有 NaN 值,可以通过以下方式:
cell_value != cell_value
该表达式只对 NaN 值成立
( 3 !=3False 的 但是 NaN != NaNTrue 的),所以 query 只会返回 NaNs
cell_value == cell_value
该表达式只对非 NaNs 值成立 ( 3 == 3True 的 但是 NaN == NaNFalse 的), 所以 query 只会返回非 NaNs
df.query('col1 == col')
df.query('col1 != col1')
4. 含有空格的值

df.query('B == "team name"')
5. 传入变量

a = df.Q1.mean()
df.query('Q1 > @a+40')
df.query('Q1 > Q2 + @a')
df.lt(value) 10lt 表示 less than,表示在 df 的值是否小于 value
df.gt(value) 10ge 表示 greater than,表示在 df 的值是否大于 value
df.le(value) 10le 表示 less than or equal to,表示在 df 的值是否小于等于 value
df.ge(value) 10ge 表示 greater than or equal to,表示在 df 的值是否大于等于 value
df.eq(value) 10eq 表示 equal,表示在 df 的值是否等于 value
df.ne(value) 10ne 表示 not equal to,表示在 df 的值是否不等于 value
df['col].between(value1, value2)col 列中的值在 value1value2

between

数据分箱( Data Binning )是指将数据放入离散区间或段/箱的过程。我们可以使用 between 方法来对数据进行分箱操作。

between 方法检查数据是否在两个值之间,其语法为:

between(left, right, inclusive='both')

其余更多细节,可参考这里 11


示例

特定值所在行
  • 筛选出列标签为 age 的列中值为 或者 NaN 的所有行;
idx = df2.age.isin(['', np.nan])
# 0     True
# 1    False
# 2     True
# 3    False
# 4     True
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>

df2filtered = df2.loc[df2.age.isin(['', np.nan]), :]
#    number     sex      name  age education
# 0       1    male     Jacob         master
# 2       3  female      Jane         master
# 4       5    male  Einstein  NaN  bachelor
# 
# <class 'pandas.core.frame.DataFrame'>

也可以使用 isna() 用以筛选值为 NaN 或者 None 的所有行,

idx = df2.age.isna()
# 0    False
# 1    False
# 2    False
# 3    False
# 4     True
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>

df2filtered = df2[df2.age.isna()]
#    number   sex      name  age education
# 4       5  male  Einstein  NaN  bachelor
#
# <class 'pandas.core.frame.DataFrame'>

也可以使用 query 用以筛选值为 NaN 或者 None 的所有行,

df2filtered = df2.query('age.isnull()', engine='python')
#    number   sex      name  age education
# 4       5  male  Einstein  NaN  bachelor

又或者反过来,筛选值不为 NaN 或者 None 的所在行,

idx = df2.age.notna()
# 0     True
# 1     True
# 2     True
# 3     True
# 4    False
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>

df2filtered = df2[df2.age.notna()]
#    number     sex   name   age education
# 0       1    male  Jacob          master
# 1       2  female    Sam  31.0  bachelor
# 2       3  female   Jane          master
# 3       4    male  Smith  21.0       PhD
#
# <class 'pandas.core.frame.DataFrame'>

筛选出列标签为 age 的列中值为 或者 NaN 的并以特定值进行赋值替换 1213

df2filtered = df2.loc[df2.age.isin(['', np.nan]), 'age']
# 0
# 2
# 4    NaN
# Name: age, dtype: object
#
# <class 'pandas.core.series.Series'>

df2.loc[df2.age.isin(['', np.nan]), 'age'] = ['new value']*df2.loc[df2.age.isin(['', np.nan]), 'age'].shape[0]
#    number     sex      name        age education
# 0       1    male     Jacob  new value    master
# 1       2  female       Sam       31.0  bachelor
# 2       3  female      Jane  new value    master
# 3       4    male     Smith       21.0       PhD
# 4       5    male  Einstein  new value  bachelor
#
# <class 'pandas.core.frame.DataFrame'>

符合某一范围
单索引
print(df2.number.dtypes)
# int64

idx = (5 > df2.number) & (df2.number > 2)
# 0    False
# 1    False
# 2     True
# 3     True
# 4    False
# Name: number, dtype: boo

df2filtered = df2[(5 > df2.number) & (df2.number > 2)]
#    number     sex   name   age education
# 2       3  female   Jane          master
# 3       4    male  Smith  21.0       PhD

print(df2filtered.number.values)
# [3 4]

又或者 14

idx = df2.number.between(2, 4)
# 0    False
# 1     True
# 2     True
# 3     True
# 4    False
# Name: number, dtype: bool


df2filtered = df2[df2.number.between(2, 4)]
#    number     sex   name   age education
# 1       2  female    Sam  31.0  bachelor
# 2       3  female   Jane          master
# 3       4    male  Smith  21.0       PhD

print(df2filtered.number.values)
# [2 3 4]

又或者,

idx = df2.number.isin(range(2,5))
# 0    False
# 1     True
# 2     True
# 3     True
# 4    False
# Name: number, dtype: bool

df2filtered = df2[idx]
#    number     sex   name   age education
# 1       2  female    Sam  31.0  bachelor
# 2       3  female   Jane          master
# 3       4    male  Smith  21.0       PhD

print(df2filtered.number.values)
# [2 3 4]

多索引
con = (df3[('country2', '2017', 2)] <= 30) & (df3[('country2', '2018', 2)] <= 30)
# data_provider  indicator
# prov_1         ind_a         True
# prov_2         ind_a         True
#                ind_b        False
# prov_3         ind_b         True
# dtype: bool
#
# <class 'pandas.core.series.Series'>
#
# MultiIndex([('prov_1', 'ind_a'),
#             ('prov_2', 'ind_a'),
#             ('prov_2', 'ind_b'),
#             ('prov_3', 'ind_b')],
#            names=['data_provider', 'indicator'])

df3filtered = df3[con]
#                         country2
# year                        2017     2018     2019
# country1                       1   2    2   4    3   6
# data_provider indicator
# prov_1        ind_a            4  30    5  30   30  30
# prov_2        ind_a           30  30   30  30    6  30
# prov_3        ind_b           30  30   30  50   30  60
#
# <class 'pandas.core.frame.DataFrame'>
#
# MultiIndex([('prov_1', 'ind_a'),
#             ('prov_2', 'ind_a'),
#             ('prov_3', 'ind_b')],
#            names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
#             ('country2', '2017', 2),
#             ('country2', '2018', 2),
#             ('country2', '2018', 4),
#             ('country2', '2019', 3),
#             ('country2', '2019', 6)],
#            names=[None, 'year', 'country1'])

# keys = df3filtered.columns.get_loc(key=('country2',:, 2))
# print(keys)

可以根据列标签对其对应的值进行排序,参考这里15,如下:

df3sorted = df3filtered.sort_values([('country2', '2017', 2), ('country2', '2018', 2)], ascending=[False, False])
#                         country2
# year                        2017     2018     2019
# country1                       1   2    2   4    3   6
# data_provider indicator
# prov_2        ind_a           30  30   30  30    6  30
# prov_3        ind_b           30  30   30  50   30  60
# prov_1        ind_a            4  30    5  30   30  30
#
# <class 'pandas.core.frame.DataFrame'>
#
# MultiIndex([('prov_2', 'ind_a'),
#             ('prov_3', 'ind_b'),
#             ('prov_1', 'ind_a')],
#            names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
#             ('country2', '2017', 2),
#             ('country2', '2018', 2),
#             ('country2', '2018', 4),
#             ('country2', '2019', 3),
#             ('country2', '2019', 6)],
#            names=[None, 'year', 'country1'])

符合某一大小或相等/不等的关系

此处以 df.lq 举例说明。

df.lt 等的用法与其相似,此处不再赘述


单索引

只能对 df2dtypes 为数字类型的列进行关系方面的条件筛选,如下:

print(df2.dtypes)
# number        int64
# sex          object
# name         object
# age          object
# education    object

idx = df2[['number']].le(3)
#   number
# 0    True
# 1    True
# 2    True
# 3   False
# 4   False
#
# <class 'pandas.core.frame.DataFrame'>

print(idx.dtypes)
# number    bool
# dtype: object

df2filtered = df2[idx]
#    number  sex name  age education
# 0     1.0  NaN  NaN  NaN       NaN
# 1     2.0  NaN  NaN  NaN       NaN
# 2     3.0  NaN  NaN  NaN       NaN
# 3     NaN  NaN  NaN  NaN       NaN
# 4     NaN  NaN  NaN  NaN       NaN

多索引

由于 df3 中所有列的数据类型均为数字类型,所以可以直接使用 df3.le(value),如下:

print(df3.dtypes)
#           year  country1
# country2  2017  1           int64
#                 2           int64
#           2018  2           int64
#                 4           int64
#           2019  3           int64
#                 6           int64
# dtype: object

idx = df3.le(10)
#                         country2
# year                        2017     2018     2019
# country1                       1   2    2   4    3   6
# data_provider indicator
# prov_1        ind_a          4.0 NaN  5.0 NaN  NaN NaN
# prov_2        ind_a          NaN NaN  NaN NaN  6.0 NaN
#               ind_b          NaN NaN  NaN NaN  NaN NaN
# prov_3        ind_b          NaN NaN  NaN NaN  NaN NaN
#
# <class 'pandas.core.frame.DataFrame'>

print(idx.dtypes)
#           year  country1
# country2  2017  1           bool
#                 2           bool
#           2018  2           bool
#                 4           bool
#           2019  3           bool
#                 6           bool
# dtype: object

df3filtered = df3[idx]
#                         country2
# year                        2017     2018     2019
# country1                       1   2    2   4    3   6
# data_provider indicator
# prov_1        ind_a          4.0 NaN  5.0 NaN  NaN NaN
# prov_2        ind_a          NaN NaN  NaN NaN  6.0 NaN
#               ind_b          NaN NaN  NaN NaN  NaN NaN
# prov_3        ind_b          NaN NaN  NaN NaN  NaN NaN

参考链接


  1. How to filter a Pandas DataFrame ↩︎

  2. pandas 筛选数据的 8 个骚操作 ↩︎

  3. pandas.DataFrame.filter ↩︎

  4. DataAnalysis/Pandas/Filter/Filter_RowOrCol.py ↩︎

  5. pandas通过list筛选行 ↩︎

  6. pandas通过list筛选行 ↩︎

  7. 【pandas】dataframe根据某列是否是null筛选数据 ↩︎

  8. pandas query() 表达式查询 ↩︎

  9. Querying for NaN and other names in Pandas ↩︎

  10. 常用大于等于小于的特殊标记LT、LE、EQ、NE、GE、GT ↩︎ ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

  11. pandas基础:使用between方法进行数据分箱(Binning Data) ↩︎

  12. 请教用pandas处理数据时,如何对行数据进行筛选并赋值处理? ↩︎

  13. pandas 用 .loc[,]=value 筛选并原地赋值回原来的 DataFrame ↩︎

  14. select range of values for all columns in pandas dataframe ↩︎

  15. How to Sort Multiple Columns in pandas DataFrame ↩︎

  • 2
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值