Translated from: How to select rows from a DataFrame based on column values?
How do I select rows from a DataFrame based on the values in some column in Python pandas?
In SQL, I would use:
SELECT *
FROM table
WHERE column_name = some_value
I tried to look at the pandas documentation but did not immediately find the answer.
Reply #1
Reference: https://stackoom.com/question/19dAl/如何基于列值从DataFrame中选择行
Reply #2
To select rows whose column value equals a scalar, some_value, use ==:
df.loc[df['column_name'] == some_value]
To select rows whose column value is in an iterable, some_values, use isin:
df.loc[df['column_name'].isin(some_values)]
Combine multiple conditions with &:
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]
Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses,
df['column_name'] >= A & df['column_name'] <= B
is parsed as
df['column_name'] >= (A & df['column_name']) <= B
which raises a ValueError: "The truth value of a Series is ambiguous".
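To see the precedence issue concretely, here is a small sketch. The frame, the column name x, and the bounds lo and hi are made up for illustration. Series.between is a pandas method that expresses the same inclusive range check without any parentheses to forget.

```python
import pandas as pd

# Hypothetical data: a single integer column named 'x'.
df = pd.DataFrame({'x': [1, 5, 10, 15, 20]})
lo, hi = 5, 15

# Parenthesized comparisons are combined element-wise with &.
in_range = df.loc[(df['x'] >= lo) & (df['x'] <= hi)]

# Series.between is inclusive on both ends by default.
in_range_between = df.loc[df['x'].between(lo, hi)]

# Without parentheses, Python evaluates lo & df['x'] first and then chains
# the two comparisons, which needs a single True/False from a whole Series.
try:
    df.loc[df['x'] >= lo & df['x'] <= hi]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(in_range)
print(error_message)
```

If the range check is all you need, between is probably the easier form to read. The parenthesized form generalizes to conditions on different columns.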
To select rows whose column value does not equal some_value, use !=:
df.loc[df['column_name'] != some_value]
isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:
df.loc[~df['column_name'].isin(some_values)]
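As a quick illustration of the two negated forms, here is a minimal sketch. The fruit and qty columns and the values of some_value and some_values are invented for the example.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'fruit': ['apple', 'pear', 'plum', 'apple'],
                   'qty': [3, 1, 4, 1]})
some_value = 'apple'
some_values = ['apple', 'plum']

# Rows whose 'fruit' differs from a single value.
not_equal = df.loc[df['fruit'] != some_value]

# Rows whose 'fruit' is not in the collection: build the mask, then invert it.
not_in = df.loc[~df['fruit'].isin(some_values)]

print(not_equal)
print(not_in)
```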
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14
print(df.loc[df['A'] == 'foo'])
yields
     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14
If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:
print(df.loc[df['B'].isin(['one','three'])])
yields
     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14
Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:
df = df.set_index(['B'])
print(df.loc['one'])
yields
       A  C   D
B
one  foo  0   0
one  bar  1   2
one  foo  6  12
or, to include multiple values from the index, use df.index.isin:
df.loc[df.index.isin(['one','two'])]
yields
       A  C   D
B
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12
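The index-based selection and the column-based selection should pick out the same rows. The sketch below checks that on the example frame from this answer. The names wanted, by_column, indexed and by_index are mine, and no timing is measured here.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
wanted = ['one', 'two']

# Column-based: build a boolean mask from column 'B'.
by_column = df.loc[df['B'].isin(wanted)]

# Index-based: move 'B' into the index once, then mask on the index.
indexed = df.set_index('B')
by_index = indexed.loc[indexed.index.isin(wanted)]

# Both routes keep the original row order and the same values.
same_rows = by_index.equals(by_column.set_index('B'))
print(same_rows)
```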
Reply #3
Here is a simple example:
from pandas import DataFrame
# Create data set
d = {'Revenue':[100,111,222],
'Cost':[333,444,555]}
df = DataFrame(d)
# mask is True wherever the value in column "Revenue" equals 111
mask = df['Revenue'] == 111
print(mask)
# Result:
# 0    False
# 1     True
# 2    False
# Name: Revenue, dtype: bool
# Select * FROM df WHERE Revenue = 111
df[mask]
# Result:
#    Revenue  Cost
# 1      111   444
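Once the mask exists it can be reused. The following sketch, using the same Revenue/Cost data, shows it selecting whole rows, selecting a single column through .loc, and counting matches. The variable names rows, cost_only and n_matches are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Revenue': [100, 111, 222],
                   'Cost': [333, 444, 555]})
mask = df['Revenue'] == 111

rows = df[mask]                    # all columns of the matching rows
cost_only = df.loc[mask, 'Cost']   # only the 'Cost' column of those rows
n_matches = int(mask.sum())        # each True counts as 1

print(rows)
print(cost_only)
print(n_matches)
```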
Reply #4
tl;dr
The pandas equivalent to
select * from table where column_name = some_value
is
table[table.column_name == some_value]
Multiple conditions:
table[(table.column_name == some_value) | (table.column_name2 == some_value2)]
or
table.query('column_name == @some_value | column_name2 == @some_value2')
Code example
import pandas as pd
# Create data set
d = {'foo':[100, 111, 222],
'bar':[333, 444, 555]}
df = pd.DataFrame(d)
# Full dataframe:
df
# Shows:
#    foo  bar
# 0  100  333
# 1  111  444
# 2  222  555
# Output only the row(s) in df where foo is 222:
df[df.foo == 222]
# Shows:
#    foo  bar
# 2  222  555
In the above code, it is the line df[df.foo == 222] that gives the rows based on the column value, 222 in this case.
Multiple conditions are also possible:
df[(df.foo == 222) | (df.bar == 444)]
#    foo  bar
# 1  111  444
# 2  222  555
But at that point I would recommend using the query function, since it's less verbose and yields the same result:
df.query('foo == 222 | bar == 444')
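To check that the two spellings really agree, and to show how a Python variable enters a query string, here is a small sketch. The names foo_target and bar_target are hypothetical. Inside query, a bare name is looked up as a column, so local variables need the @ prefix.

```python
import pandas as pd

df = pd.DataFrame({'foo': [100, 111, 222],
                   'bar': [333, 444, 555]})

# Hypothetical targets held in ordinary Python variables.
foo_target = 222
bar_target = 444

# Boolean-mask form: each comparison needs its own parentheses.
via_mask = df[(df['foo'] == foo_target) | (df['bar'] == bar_target)]

# query form: column names are bare, local variables are prefixed with @.
via_query = df.query('foo == @foo_target | bar == @bar_target')

print(via_query)
```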
Reply #5
I find the syntax of the previous answers to be redundant and difficult to remember. Pandas introduced the query() method in v0.13 and I much prefer it. For your question, you could do df.query('col == val').
Reproduced from http://pandas.pydata.org/pandas-docs/version/0.17.0/indexing.html#indexing-query
In [167]: n = 10
In [168]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [169]: df
Out[169]:
          a         b         c
0  0.687704  0.582314  0.281645
1  0.250846  0.610021  0.420121
2  0.624328  0.401816  0.932146
3  0.011763  0.022921  0.244186
4  0.590198  0.325680  0.890392
5  0.598892  0.296424  0.007312
6  0.634625  0.803069  0.123872
7  0.924168  0.325076  0.303746
8  0.116822  0.364564  0.454607
9  0.986142  0.751953  0.561512
# pure python
In [170]: df[(df.a < df.b) & (df.b < df.c)]
Out[170]:
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607
# query
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607
You can also access variables in the environment by prepending an @.
exclude = ('red', 'orange')
df.query('color not in @exclude')
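The two lines above assume a frame with a color column. A self-contained version might look like the sketch below, where the data and the column n are invented for illustration. The isin form is included only to show that both routes select the same rows.

```python
import pandas as pd

# Hypothetical frame with a 'color' column.
df = pd.DataFrame({'color': ['red', 'green', 'orange', 'blue'],
                   'n': [1, 2, 3, 4]})
exclude = ('red', 'orange')

# @exclude refers to the Python variable, not to a column.
kept = df.query('color not in @exclude')

# Equivalent boolean-mask form.
kept_mask = df[~df['color'].isin(exclude)]

print(kept)
```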
Reply #6
To add to this well-known question (though a bit late): you can also use df.groupby('column_name').get_group('column_desired_value').reset_index() to make a new data frame containing only the rows where the specified column has a particular value. For example:
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split()})
print("Original dataframe:")
print(df)
b_is_two_dataframe = df.groupby('B').get_group('two').reset_index(drop=True)
# NOTE: drop=True discards the old row labels instead of adding them back as an 'index' column
print('Sub dataframe where B is two:')
print(b_is_two_dataframe)
Running this gives:
Original dataframe:
     A      B
0  foo    one
1  bar    one
2  foo    two
3  bar  three
4  foo    two
5  bar    two
6  foo    one
7  foo  three
Sub dataframe where B is two:
     A    B
0  foo  two
1  foo  two
2  bar  two
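For comparison, here is a short sketch that puts the get_group route next to the plain boolean mask from the earlier replies, on the same data. The names via_group and via_mask are mine.

```python
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split()})

# groupby route: split the frame by 'B', then pull out one group.
via_group = df.groupby('B').get_group('two').reset_index(drop=True)

# Boolean-mask route: keep the rows where 'B' equals 'two'.
via_mask = df[df['B'] == 'two'].reset_index(drop=True)

print(via_group)
print(via_mask)
```

For a single value the mask is usually the shorter option. The groupby object seems more useful when several groups will be pulled out of the same frame.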