本文主要介绍如何使用 Pandas
对值进行筛选。
Updated: 2023 / 08 / 05
Python | Pandas | 筛选
以下面的3个 dataframe
为例对如何针对性筛选数据的方法进行说明 1’ 2:
- 原始数据文件参考此处,如下所示:
df1 = pd.read_csv('dtype_RawData.csv')
# Customer Number Customer Name 2016 ... Month Day Active
# 0 1002 Quest industries ¥125,000.00 ... 1 2015 Y
# 1 552278 Smith Plumbing ¥920,000.00 ... 6 2014 Y
# 2 23477 ACME Industrial ¥50,000.00 ... 3 2016 Y
# 3 24900 Brekke LTD ¥350,000.00 ... 10 2015 Y
# 4 651029 Harbor Co ¥15,000.00 ... 2 2014 N
#
# [5 rows x 9 columns]
#
# Customer Number int64
# Customer Name object
# 2016 object
# 2017 object
# Percent Growth object
# Jan Units object
# Month int64
# Day int64
# Active object
# dtype: object
df2 = {
'number': [1, 2, 3, 4, 5],
'sex': ['male', 'female', 'female', 'male', 'male'],
'name': ['Jacob', 'Sam', 'Jane', 'Smith', 'Einstein'],
'age': ['', '31.0', '', '21.0', np.nan],
'education': ['master', 'bachelor', 'master', 'PhD', 'bachelor']
}
df2 = pd.DataFrame(df2)
# number sex name age education
# 0 1 male Jacob master
# 1 2 female Sam 31.0 bachelor
# 2 3 female Jane master
# 3 4 male Smith 21.0 PhD
# 4 5 male Einstein NaN bachelor
#
# number int64
# sex object
# name object
# age object
# education object
# dtype: object
df3 = pd.DataFrame(
data={"data_provider": ["prov_1", "prov_1", "prov_2", "prov_2", "prov_3", "prov_3"],
"indicator": ["ind_a", "ind_a", "ind_a", "ind_b", "ind_b", "ind_b"],
"unit": ["EUR", "EUR", "EUR", "EUR", "EUR", "EUR"],
"year": ["2017", "2018","2019", "2017","2018","2019"],
"country1": [1, 2, 3, 2, 4, 6],
"country2": [4, 5, 6, 40, 50, 60]}
)
df3 = pd.pivot_table(data=df3,
index=['data_provider', 'indicator'],
columns=['year', 'country1'],
fill_value=30)
# country2
# year 2017 2018 2019
# country1 1 2 2 4 3 6
# data_provider indicator
# prov_1 ind_a 4 30 5 30 30 30
# prov_2 ind_a 30 30 30 30 6 30
# ind_b 30 40 30 30 30 30
# prov_3 ind_b 30 30 30 50 30 60
#
# MultiIndex([('prov_1', 'ind_a'),
# ('prov_2', 'ind_a'),
# ('prov_2', 'ind_b'),
# ('prov_3', 'ind_b')],
# names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
# ('country2', '2017', 2),
# ('country2', '2018', 2),
# ('country2', '2018', 4),
# ('country2', '2019', 3),
# ('country2', '2019', 6)],
# names=[None, 'year', 'country1'])
数字
方法
方法 | 说明 |
---|---|
df.select_dtypes(include=None, exclude=None) | |
pd.to_numeric(arg, errors="raise", downcast=None) |
示例
df.select_dtypes
int1 = df1.select_dtypes(include='int')
# Customer Number Month Day
# 0 1002 1 2015
# 1 552278 6 2014
# 2 23477 3 2016
# 3 24900 10 2015
# 4 651029 2 2014
print(int1.dtypes)
# Customer Number int64
# Month int64
# Day int64
# dtype: object
int2 = df2.select_dtypes(include='int')
# number
# 0 1
# 1 2
# 2 3
# 3 4
# 4 5
print(int2.dtypes)
# number int64
# dtype: object
pd.to_numeric
df1nums = []
for col in df1.columns:
df1num = pd.to_numeric(df1[col], errors='coerce')
df1nums.append(df1num)
df1nums = pd.concat(dfnums, axis=1, ignore_index=True)
df1nums.columns = df1.columns
# Customer Number Customer Name 2016 2017 ... Jan Units Month Day Active
# 0 1002 NaN NaN NaN ... 500.0 1 2015 NaN
# 1 552278 NaN NaN NaN ... 700.0 6 2014 NaN
# 2 23477 NaN NaN NaN ... 125.0 3 2016 NaN
# 3 24900 NaN NaN NaN ... 75.0 10 2015 NaN
# 4 651029 NaN NaN NaN ... NaN 2 2014 NaN
#
# [5 rows x 9 columns]
print(df1nums.dtypes)
# Customer Number int64
# Customer Name float64
# 2016 float64
# 2017 float64
# Percent Growth float64
# Jan Units float64
# Month int64
# Day int64
# Active float64
# dtype: object
df2nums = []
for col in df2.columns:
df2num = pd.to_numeric(df2[col], errors='coerce', downcast='integer')
df2nums.append(df2num)
df2nums = pd.concat(df2nums, axis=1, ignore_index=True)
# 0 1 2 3 4
# 0 1 NaN NaN NaN NaN
# 1 2 NaN NaN 31.0 NaN
# 2 3 NaN NaN NaN NaN
# 3 4 NaN NaN 21.0 NaN
# 4 5 NaN NaN NaN NaN
print(df2nums.dtypes)
# 0 int8
# 1 float64
# 2 float64
# 3 float64
# 4 float64
# dtype: object
列
方法
前提 | 方法 | 说明 |
---|---|---|
已有目标列标签 [^pandas读取或选择多列] | pd.Dataframe(df, columns=labels) | |
df[['col1', 'col2']] | ||
df.loc[:, ['col1', 'col2']] | ||
df.loc[:, df.columns.isin(['col1', 'col2', 'col3'])] | ||
df.filter(items=['col1', 'col2', 'col3'], axis=1) | ||
已有符合特定模式的目标列标签 | df.columns.str.contains(‘keyword’) | |
df.filter(like='keyword', axis=1) | 根据指定的索引标签对数据框的行或列进行子集化。请注意,此例程不会根据其内容过滤数据框。过滤器应用于索引的标签。 | |
df.loc[:, df.columns.str.startswith('keyword')] | ||
已有目标列索引 [^pandas读取或选择多列] | df.iloc[:, [x, y]] |
df.filter
DataFrame.filter(items=None, like=None, regex=None, axis=None)
参数 | 说明 |
---|---|
items | 列表式,按 items 中给出的轴标签进行数据的保留。 |
like | 字符串,按给定的轴方向和标签中含有 like 所给出标签进行数据的保留。 |
regex | 字符串 (常规表达式),按轴方向上满足 re.search(regex, label) == True 条件的标签进行数据的保留。 |
axis | {0 or ‘index’, 1 or ‘columns’, None}, default None ,要过滤的轴,表示为索引 (int) 或轴名称 (str)。默认情况下,这是信息轴, DataFrame 的 列 。对于系列,此参数未使用,默认为无。 |
items
、like
和 regex
参数强制互斥。
示例
列标签
目标列标签
collabels = ['2016', 'Day']
df1cols = pd.DataFrame(df1, columns=collabels)
# 2016 Day
# 0 ¥125,000.00 2015
# 1 ¥920,000.00 2014
# 2 ¥50,000.00 2016
# 3 ¥350,000.00 2015
# 4 ¥15,000.00 2014
#
# Index(['2016', 'Day'], dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)
collabels = ['2016', 'Day']
df1cols = df1[collabels]
collabels = ['2016', 'Day']
df1cols = df1.loc[:, collabels]
df2cols = df2.loc[:, df2.columns.isin(['gender', 'name', 'education'])]
# name education
# 0 Jacob master
# 1 Sam bachelor
# 2 Jane master
# 3 Smith PhD
# 4 Einstein bachelor
df2cols = df2.filter(items=['gender', 'name', 'education'], axis=1)
符合特定模式目标列标签
-
- 含有特定关键字
df2cols = df2.loc[:, df2.columns.str.contains('ex')]
# sex
# 0 male
# 1 female
# 2 female
# 3 male
# 4 male
df2cols = df2.filter(like='am', axis=1)
# name
# 0 Jacob
# 1 Sam
# 2 Jane
# 3 Smith
# 4 Einstein
df2cols = df2.loc[:, df2.columns.str.startswith('n')]
# number name
# 0 1 Jacob
# 1 2 Sam
# 2 3 Jane
# 3 4 Smith
# 4 5 Einstein
df2cols = df2.filter(like='n', axis=1)
# number name education
# 0 1 Jacob master
# 1 2 Sam bachelor
# 2 3 Jane master
# 3 4 Smith PhD
# 4 5 Einstein bachelor
列索引
目标列索引如下,
colidx = [3, 4]
根据目标列索引选取指定列的方式如下,
df1cols = df1.iloc[:, colidx]
# 2017 Percent Growth
# 0 ¥162,500.00 30.00%
# 1 ¥1,012,000.00 10.00%
# 2 ¥62,500.00 25.00%
# 3 ¥490,000.00 4.00%
# 4 ¥12,750.00 -15.00%
#
# Index(['2017', 'Percent Growth'], dtype='object')
#
# RangeIndex(start=0, stop=5, step=1)
行
方法
前提 | 方法 | 说明 |
---|---|---|
已有符合特定条件的值的所在行 | df.loc[df.column1.isin([val1, val2, val3])] 5 | col1 列中等于 val1 ,val2 或者 val3 的值所在的行 6 |
已有目行索引 | df.iloc[[x, y], :] | 第 x 、y 行 |
示例
符合特定条件的值所在行
df2filtered = df2.loc[df2.age.isin(['', np.nan])]
# number sex name age education
# 0 1 male Jacob master
# 2 3 female Jane master
# 4 5 male Einstein NaN bachelor
目标行索引
df2filtered = df2.iloc[[0, 3, 4], :]
# number sex name age education
# 0 1 male Jacob master
# 3 4 male Smith 21.0 PhD
# 4 5 male Einstein NaN bachelor
#
# Int64Index([0, 3, 4], dtype='int64')
值
方法
方法 | 说明 |
---|---|
df.loc[row_indexer,col_indexer] | |
df[df[col].isna()] 7 | 选取列表签为 col 的列中值为 NaN 的所在行 |
df.query() 8 | 1. 写类似于 sql 语句或者表达式df.query('Q1 > Q2 > 90') df.query('Q1 + Q2 > 180') df.query('Q1 == Q2') df.query('(Q1<50) & (Q2>40) and (Q3>90)') df.query('Q1 > Q2 > Q3 > Q4') |
2. 表达式df.query('team != "C"') df.query('team in ["A","B"]') df.query('team not in ("E","A","B")') df.query('team == ["A","B"]') df.query('team != ["A","B"]') df.query('name.str.contains("am")') | |
3. 空值 9engine = 'python' 的作用是以 python 方式解读表达式,而不是以 pandas 。对于是否添加 engine = 'python' 取决于所筛选列的值的数据类型。如果是 floats , 它具备原生 NaN , 则可以不添加; 如果是 str 或者 int ,则需添加。df.query('col1.isnull()', engine='python') df.query('col1.notnull()', engine='python') 在 pandas 和 numpy 中,NaN 不等于 NaN 。所以如果想检验一个单元格是否有 NaN 值,可以通过以下方式:cell_value != cell_value 该表达式只对 NaN 值成立( 3 !=3 是 False 的 但是 NaN != NaN 是 True 的),所以 query 只会返回 NaNs 。cell_value == cell_value 该表达式只对非 NaNs 值成立 ( 3 == 3 是 True 的 但是 NaN == NaN 是 False 的), 所以 query 只会返回非 NaNs 。df.query('col1 == col') df.query('col1 != col1') | |
4. 含有空格的值df.query('B == "team name"') | |
5. 传入变量a = df.Q1.mean() df.query('Q1 > @a+40') df.query('Q1 > Q2 + @a') | |
df.lt(value) 10 | lt 表示 less than ,表示在 df 的值是否小于 value |
df.gt(value) 10 | ge 表示 greater than ,表示在 df 的值是否大于 value |
df.le(value) 10 | le 表示 less than or equal to ,表示在 df 的值是否小于等于 value |
df.ge(value) 10 | ge 表示 greater than or equal to ,表示在 df 的值是否大于等于 value |
df.eq(value) 10 | eq 表示 equal ,表示在 df 的值是否等于 value |
df.ne(value) 10 | ne 表示 not equal to ,表示在 df 的值是否不等于 value |
df['col].between(value1, value2) | col 列中的值在 value1 到 value2 中 |
between
数据分箱( Data Binning
)是指将数据放入离散区间或段/箱的过程。我们可以使用 between
方法来对数据进行分箱操作。
between
方法检查数据是否在两个值之间,其语法为:
between(left, right, inclusive='both')
其余更多细节,可参考这里 11。
示例
特定值所在行
- 筛选出列标签为
age
的列中值为NaN
的所有行;
idx = df2.age.isin(['', np.nan])
# 0 True
# 1 False
# 2 True
# 3 False
# 4 True
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>
df2filtered = df2.loc[df2.age.isin(['', np.nan]), :]
# number sex name age education
# 0 1 male Jacob master
# 2 3 female Jane master
# 4 5 male Einstein NaN bachelor
#
# <class 'pandas.core.frame.DataFrame'>
也可以使用 isna()
用以筛选值为 NaN
或者 None
的所有行,
idx = df2.age.isna()
# 0 False
# 1 False
# 2 False
# 3 False
# 4 True
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>
df2filtered = df2[df2.age.isna()]
# number sex name age education
# 4 5 male Einstein NaN bachelor
#
# <class 'pandas.core.frame.DataFrame'>
也可以使用 query
用以筛选值为 NaN
或者 None
的所有行,
df2filtered = df2.query('age.isnull()', engine='python')
# number sex name age education
# 4 5 male Einstein NaN bachelor
又或者反过来,筛选值不为 NaN
或者 None
的所在行,
idx = df2.age.notna()
# 0 True
# 1 True
# 2 True
# 3 True
# 4 False
# Name: age, dtype: bool
#
# <class 'pandas.core.series.Series'>
df2filtered = df2[df2.age.notna()]
# number sex name age education
# 0 1 male Jacob master
# 1 2 female Sam 31.0 bachelor
# 2 3 female Jane master
# 3 4 male Smith 21.0 PhD
#
# <class 'pandas.core.frame.DataFrame'>
筛选出列标签为 age
的列中值为
或者 NaN
的并以特定值进行赋值替换 12’ 13;
df2filtered = df2.loc[df2.age.isin(['', np.nan]), 'age']
# 0
# 2
# 4 NaN
# Name: age, dtype: object
#
# <class 'pandas.core.series.Series'>
df2.loc[df2.age.isin(['', np.nan]), 'age'] = ['new value']*df2.loc[df2.age.isin(['', np.nan]), 'age'].shape[0]
# number sex name age education
# 0 1 male Jacob new value master
# 1 2 female Sam 31.0 bachelor
# 2 3 female Jane new value master
# 3 4 male Smith 21.0 PhD
# 4 5 male Einstein new value bachelor
#
# <class 'pandas.core.frame.DataFrame'>
符合某一范围
单索引
print(df2.number.dtypes)
# int64
idx = (5 > df2.number) & (df2.number > 2)
# 0 False
# 1 False
# 2 True
# 3 True
# 4 False
# Name: number, dtype: boo
df2filtered = df2[(5 > df2.number) & (df2.number > 2)]
# number sex name age education
# 2 3 female Jane master
# 3 4 male Smith 21.0 PhD
print(df2filtered.number.values)
# [3 4]
又或者 14,
idx = df2.number.between(2, 4)
# 0 False
# 1 True
# 2 True
# 3 True
# 4 False
# Name: number, dtype: bool
df2filtered = df2[df2.number.between(2, 4)]
# number sex name age education
# 1 2 female Sam 31.0 bachelor
# 2 3 female Jane master
# 3 4 male Smith 21.0 PhD
print(df2filtered.number.values)
# [2 3 4]
又或者,
idx = df2.number.isin(range(2,5))
# 0 False
# 1 True
# 2 True
# 3 True
# 4 False
# Name: number, dtype: bool
df2filtered = df2[idx]
# number sex name age education
# 1 2 female Sam 31.0 bachelor
# 2 3 female Jane master
# 3 4 male Smith 21.0 PhD
print(df2filtered.number.values)
# [2 3 4]
多索引
con = (df3[('country2', '2017', 2)] <= 30) & (df3[('country2', '2018', 2)] <= 30)
# data_provider indicator
# prov_1 ind_a True
# prov_2 ind_a True
# ind_b False
# prov_3 ind_b True
# dtype: bool
#
# <class 'pandas.core.series.Series'>
#
# MultiIndex([('prov_1', 'ind_a'),
# ('prov_2', 'ind_a'),
# ('prov_2', 'ind_b'),
# ('prov_3', 'ind_b')],
# names=['data_provider', 'indicator'])
df3filtered = df3[con]
# country2
# year 2017 2018 2019
# country1 1 2 2 4 3 6
# data_provider indicator
# prov_1 ind_a 4 30 5 30 30 30
# prov_2 ind_a 30 30 30 30 6 30
# prov_3 ind_b 30 30 30 50 30 60
#
# <class 'pandas.core.frame.DataFrame'>
#
# MultiIndex([('prov_1', 'ind_a'),
# ('prov_2', 'ind_a'),
# ('prov_3', 'ind_b')],
# names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
# ('country2', '2017', 2),
# ('country2', '2018', 2),
# ('country2', '2018', 4),
# ('country2', '2019', 3),
# ('country2', '2019', 6)],
# names=[None, 'year', 'country1'])
# keys = df3filtered.columns.get_loc(key=('country2',:, 2))
# print(keys)
可以根据列标签对其对应的值进行排序,参考这里15,如下:
df3sorted = df3filtered.sort_values([('country2', '2017', 2), ('country2', '2018', 2)], ascending=[False, False])
# country2
# year 2017 2018 2019
# country1 1 2 2 4 3 6
# data_provider indicator
# prov_2 ind_a 30 30 30 30 6 30
# prov_3 ind_b 30 30 30 50 30 60
# prov_1 ind_a 4 30 5 30 30 30
#
# <class 'pandas.core.frame.DataFrame'>
#
# MultiIndex([('prov_2', 'ind_a'),
# ('prov_3', 'ind_b'),
# ('prov_1', 'ind_a')],
# names=['data_provider', 'indicator'])
#
# MultiIndex([('country2', '2017', 1),
# ('country2', '2017', 2),
# ('country2', '2018', 2),
# ('country2', '2018', 4),
# ('country2', '2019', 3),
# ('country2', '2019', 6)],
# names=[None, 'year', 'country1'])
符合某一大小或相等/不等的关系
此处以 df.lq
举例说明。
df.lt
等的用法与其相似,此处不再赘述
单索引
只能对 df2
中 dtypes
为数字类型的列进行关系方面的条件筛选,如下:
print(df2.dtypes)
# number int64
# sex object
# name object
# age object
# education object
idx = df2[['number']].le(3)
# number
# 0 True
# 1 True
# 2 True
# 3 False
# 4 False
#
# <class 'pandas.core.frame.DataFrame'>
print(idx.dtypes)
# number bool
# dtype: object
df2filtered = df2[idx]
# number sex name age education
# 0 1.0 NaN NaN NaN NaN
# 1 2.0 NaN NaN NaN NaN
# 2 3.0 NaN NaN NaN NaN
# 3 NaN NaN NaN NaN NaN
# 4 NaN NaN NaN NaN NaN
多索引
由于 df3
中所有列的数据类型均为数字类型,所以可以直接使用 df3.le(value)
,如下:
print(df3.dtypes)
# year country1
# country2 2017 1 int64
# 2 int64
# 2018 2 int64
# 4 int64
# 2019 3 int64
# 6 int64
# dtype: object
idx = df3.le(10)
# country2
# year 2017 2018 2019
# country1 1 2 2 4 3 6
# data_provider indicator
# prov_1 ind_a 4.0 NaN 5.0 NaN NaN NaN
# prov_2 ind_a NaN NaN NaN NaN 6.0 NaN
# ind_b NaN NaN NaN NaN NaN NaN
# prov_3 ind_b NaN NaN NaN NaN NaN NaN
#
# <class 'pandas.core.frame.DataFrame'>
print(idx.dtypes)
# year country1
# country2 2017 1 bool
# 2 bool
# 2018 2 bool
# 4 bool
# 2019 3 bool
# 6 bool
# dtype: object
df3filtered = df3[idx]
# country2
# year 2017 2018 2019
# country1 1 2 2 4 3 6
# data_provider indicator
# prov_1 ind_a 4.0 NaN 5.0 NaN NaN NaN
# prov_2 ind_a NaN NaN NaN NaN 6.0 NaN
# ind_b NaN NaN NaN NaN NaN NaN
# prov_3 ind_b NaN NaN NaN NaN NaN NaN