filter or subset data in pandas with Boolean indexing;
Boolean indexing (also known as Boolean selection) refers to selecting rows by providing a Boolean array , a pandas Series with the same index, but a True or False for each row
informative 提供有用信息的,给予知识的;
Python evaluates False
and True
as 0 and 1 , respectively.
Calculating Boolean statistics
import pandas as pd
import numpy as np
movie=pd.read_csv('movie.csv',index_col='movie_title')
movie[['duration']].head()
Out[5]:
duration
movie_title
Avatar 178.0
Pirates of the Caribbean: At World's End 169.0
Spectre 148.0
The Dark Knight Rises 164.0
Star Wars: Episode VII - The Force Awakens NaN
# determine whether the duration of each movie is longer than 2 hours
movie_2_hours=movie['duration']>120
movie_2_hours.head(10)
Out[8]:
movie_title
Avatar True
Pirates of the Caribbean: At World's End True
Spectre True
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
John Carter True
Spider-Man 3 True
Tangled False
Avengers: Age of Ultron True
Harry Potter and the Half-Blood Prince True
Name: duration, dtype: bool
# use the above Series to determine the number of movies that are longer than two hours
movie_2_hours.sum()
Out[10]: 1039
# Wrong: to find the percentage of movie longer than 2 hours, use .mean method
movie_2_hours.mean()
Out[12]: 0.2113506916192026
# Correct: the duration column has missing values.
# so, need to drop the missing values first, then evaluate the condition and take the mean
movie['duration'].dropna().gt(120).mean()
Out[15]: 0.21199755152009794
# important takeaway: make sure you have dealt with missing values before making calculations
# the boolean condition evaluated all these comparisons against missing values as False
# prove the mean of a Boolean array represents the percentage of True values
# use the .value_counts method to count with the normalize parameter set to True to get its distribution
movie_2_hours.value_counts(normalize=True)
Out[20]:
False 0.788649
True 0.211351
Name: duration, dtype: float64
# .describe method to Boolean arrays the same way it applies to a column of object or strings
movie_2_hours
Out[22]:
movie_title
Avatar True
Pirates of the Caribbean: At World's End True
Spectre True
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
...
Signed Sealed Delivered False
The Following False
A Plague So Pleasant False
Shanghai Calling False
My Date with Drew False
Name: duration, Length: 4916, dtype: bool
movie_2_hours.describe()
Out[23]:
count 4916
unique 2
top False
freq 3877
Name: duration, dtype: object
# if you want quantile information, you could cast the Series into integers
movie_2_hours.astype(int).describe()
Out[25]:
count 4916.000000
mean 0.211351
std 0.408308
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
Name: duration, dtype: float64
# compare two columns from the same DataFrame to produce a Boolean Series
# e.g. determine the percentage of movies that actor1 with more facebook likes than actor 2
actors=movie[['actor_1_facebook_likes','actor_2_facebook_likes']].dropna()
(actors.actor_1_facebook_likes>actors.actor_2_facebook_likes).mean()
Out[29]: 0.9777687130328371
Constructing multiple Boolean conditions
precedence 优先
in Python, Boolean expression use the built-in logical operators ‘and’,‘not’ and ‘or’
the keywords do not work with Boolean indexing in pandas and are respectively replaces with & , | , ~
when combining expressions, each expression must be wrapped in parentheses, or an error will be raised(due to operator precedence)
all the movie that have an ‘’
# find all the movie that have an 'imdb_score' greater than 8, a'content_rating' of PG-13, and a 'title_year' either before 2000 or after 2009
criteria1=movie.imdb_score>8
criteria2=movie.content_rating=="PG-13"
criteria3=(movie.title_year<2000)|(movie.title_year>2009)
criteria_final=criteria1&criteria2&criteria3
criteria_final.head()
Out[35]:
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
# this Boolean array has the same index as the movie DataFrame
# pipe character '|' is used to create a logical or condition
# the ampersand character &
# the operators that pandas uses (the bitwise operators &,|,and ~) have a higher precedence than the comparison operator
# and the comparison operator has a higher precedence than 'and','or' and 'not'
5>4 and 3>4
Out[41]: False
# need parentheses to enforce operator precedence
Filtering with Boolean arrays
# create criteria for the second set of movies
criteria_b1=movie.imdb_score<5
crieria_b2=movie.content_rating=="R"
criteria_b3=(movie.title_year>=2000)&(movie.title_year<=2010)
criteria_final_b=criteria_b1 & criteria_final_b2 & criteria_b3
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-47-6b6881e93b85>", line 1, in <module>
criteria_final_b=criteria_b1 & criteria_final_b2 & criteria_b3
NameError: name 'criteria_final_b2' is not defined
criteria_final_b=criteria_b1 & criteria_b2 & criteria_b3
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-48-9c124f840703>", line 1, in <module>
criteria_final_b=criteria_b1 & criteria_b2 & criteria_b3
NameError: name 'criteria_b2' is not defined
criteria_final_b=criteria_b1 & crieria_b2 & criteria_b3
# combine the two sets of criteria using the pandas or operators
# this yields a Boolean array of all movies that are members of either set
final_crit_all=criteria_final | criteria_final_b
final_crit_all.head()
Out[53]:
movie_title
Avatar False
Pirates of the Caribbean: At World's End False
Spectre False
The Dark Knight Rises True
Star Wars: Episode VII - The Force Awakens False
dtype: bool
movie[final_crit_all]
Out[54]:
color ... movie_facebook_likes
movie_title ...
The Dark Knight Rises Color ... 164000
The Avengers Color ... 123000
Captain America: Civil War Color ... 72000
Guardians of the Galaxy Color ... 96000
Interstellar Color ... 349000
... ... ...
The Young Unknowns Color ... 4
Bled Color ... 128
Hoop Dreams Color ... 0
Death Calls Color ... 16
The Legend of God's Gun Color ... 13
[86 rows x 27 columns]
# once you have your Boolean array, you pass it to the index operator to filter the data
# we can also filter off of the .loc attribute
movie.loc[final_crit_all].head()
Out[57]:
color ... movie_facebook_likes
movie_title ...
The Dark Knight Rises Color ... 164000
The Avengers Color ... 123000
Captain America: Civil War Color ... 72000
Guardians of the Galaxy Color ... 96000
Interstellar Color ... 349000
[5 rows x 27 columns]
# filtering also works with the .loc attribute by simultaneously selecting rows(using the Boolean array) and columns
movie.loc[final_crit_all,["imdb_score","content_rating","title_year"]]
Out[59]:
imdb_score content_rating title_year
movie_title
The Dark Knight Rises 8.5 PG-13 2012.0
The Avengers 8.1 PG-13 2012.0
Captain America: Civil War 8.2 PG-13 2016.0
Guardians of the Galaxy 8.1 PG-13 2014.0
Interstellar 8.6 PG-13 2014.0
... ... ...
The Young Unknowns 4.2 R 2000.0
Bled 3.1 R 2009.0
Hoop Dreams 8.3 PG-13 1994.0
Death Calls 4.3 R 2010.0
The Legend of God's Gun 4.1 R 2007.0
[86 rows x 3 columns]
#that is , we can specify columns to select with the .loc attribute
# the .iloc attribute does not support Boolean array
# and if you pass in a Boolean Series to it, an exception will get raised
# however, it does work with Numpy arrays, so if you call the .to_numpy() method, you canfilter with it
movie.iloc[final_crit_all.to_numpy()]
Out[64]:
color ... movie_facebook_likes
movie_title ...
The Dark Knight Rises Color ... 164000
The Avengers Color ... 123000
Captain America: Civil War Color ... 72000
Guardians of the Galaxy Color ... 96000
Interstellar Color ... 349000
... ... ...
The Young Unknowns Color ... 4
Bled Color ... 128
Hoop Dreams Color ... 0
Death Calls Color ... 16
The Legend of God's Gun Color ... 13
[86 rows x 27 columns]
# you can also use one long Boolean expression in place of several other shorter ones
final_crit_a2=(
(movie.imdb_score>8)
& (movie.content_rating=="PG-13")
&((movie.title_year<2000)|(movie.title_year>2009))
)
final_crit_a2.equals(criteria_final)
Out[67]: True
Comparing row filtering and index filtering
overhead 经常费用
# in this section, select all institutions from a particular state with both Boolean indexing and index selection
# and then compare each of their performance against one another
prefer to filter by columns(using Boolean arrays) rather than on the index
File "<ipython-input-4-e9b74ae09153>", line 1
prefer to filter by columns(using Boolean arrays) rather than on the index
^
SyntaxError: invalid syntax
#prefer to filter by columns(using Boolean arrays) rather than on the index
# column filtering is more powerful as you can use other logical operators and filter on multiple columns
# the first, use Boolean array to select all insitutions from the state of Texas(TX)
import pandas as pd
college = pd.read_csv('college.csv')
college[college['STABBR']=='TX'],head()
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-06a1c3454acc>", line 1, in <module>
college[college['STABBR']=='TX'],head()
NameError: name 'head' is not defined
college[college['STABBR']=='TX'].head()
Out[11]:
INSTNM ... GRAD_DEBT_MDN_SUPP
3610 Abilene Christian University ... 25985
3611 Alvin Community College ... 6750
3612 Amarillo College ... 10950
3613 Angelina College ... PrivacySuppressed
3614 Angelo State University ... 21319.5
[5 rows x 27 columns]
# the second, use the index selection, move 'STABBR' column into the index
# then we can use label-based selection with the .loc indexer
college2=college.set_index('STABBR')
college2.loc["TX"].head()
Out[15]:
INSTNM ... GRAD_DEBT_MDN_SUPP
STABBR ...
TX Abilene Christian University ... 25985
TX Alvin Community College ... 6750
TX Amarillo College ... 10950
TX Angelina College ... PrivacySuppressed
TX Angelo State University ... 21319.5
[5 rows x 26 columns]
# compare the speeed of both methods
%timeit college[college['STABBR']=='TX']
874 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compiler time: 0.34 s
%timeit college2.loc['TX']
612 µs ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# however, for the second method, setting the index also take time
%timeit college2=college.set_index('STABBR')
1.83 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you'd better not to switch the index, just to filter by it
# although the fact that selection via the index is much faster than Boolean selection
# if you need to filter on multiple columns, you will have the overhead(and confusing code) from repeatedly switching the index
# how to select multiple states with both Boolean and index selection
states=['TX','CA',"NY"]
college[college.STABBR.isin(states)]
Out[26]:
INSTNM ... GRAD_DEBT_MDN_SUPP
192 Academy of Art University ... 35093
193 ITT Technical Institute-Rancho Cordova ... 25827.5
194 Academy of Chinese Culture and Health Sciences ... PrivacySuppressed
195 The Academy of Radio and TV Broadcasting ... 9500
196 Avalon School of Cosmetology-Alameda ... 9860
... ... ...
7528 WestMed College - Merced ... 15623.5
7529 Vantage College ... 9500
7530 SAE Institute of Technology San Francisco ... 9500
7533 Bay Area Medical Academy - San Jose Satellite ... ... PrivacySuppressed
7534 Excel Learning Center-San Antonio South ... 12125
[1704 rows x 27 columns]
college2.loc[states]
Out[27]:
INSTNM ... GRAD_DEBT_MDN_SUPP
STABBR ...
TX Abilene Christian University ... 25985
TX Alvin Community College ... 6750
TX Amarillo College ... 10950
TX Angelina College ... PrivacySuppressed
TX Angelo State University ... 21319.5
... ... ...
NY Briarcliffe College–Patchogue ... 28720.5
NY Jamestown Business College - ... 12050
NY Pratt Manhattan-A Division of Pratt Institute ... 26691
NY Saint Josephs College-Long Island ... 22143.5
NY Franklin Career Institute - Brooklyn Campus ... PrivacySuppressed
[1704 rows x 26 columns]
# pandas implements the index differently based on whether the index is unique or sorted
Selecting with unique and sorted indexes
index selection performance drastically improves when the index is unique or sorted.
an unsorted index that contained duplicates makes for relatively slow selection.
# check 'college2' : whether the index is sorted
college2.index.is_monotonic
Out[30]: False
# sort the index from college and stores it as another object
college3=college2.sort_index()
college3.index.is_monotonic
Out[33]: True
# time the selection of the state TX from three different DataFrame 'college' 'college2' 'college3'
%timeit college[college["STABBR"]=='TX']
The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1.45 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college2.loc['TX']
464 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college3.loc['TX']
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached.
117 µs ± 124 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# the sorted index performs nearly an order of magnitude faster than the Boolean selection
# let's turn toward unique indexes,which means we set the institution name 'INSTNM' as the index
college_unique=college.set_index('INSTNM')
college_unique.index.is_unique
Out[43]: True
# select Stanford University with Boolean indexing
college[college["INSTNM"]=="Stanford University"]
Out[45]:
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
4217 Stanford University Stanford ... 86000 12782
[1 rows x 27 columns]
# and it is a DataFrame
# select Stanford University with index selection, and this returns a Series
college_unique.loc["Stanford University"]
Out[48]:
CITY Stanford
STABBR CA
HBCU 0.0
MENONLY 0.0
WOMENONLY 0.0
RELAFFIL 0
SATVRMID 730.0
SATMTMID 745.0
DISTANCEONLY 0.0
UGDS 7018.0
UGDS_WHITE 0.3752
UGDS_BLACK 0.0591
UGDS_HISP 0.1607
UGDS_ASIAN 0.1979
UGDS_AIAN 0.0114
UGDS_NHPI 0.0038
UGDS_2MOR 0.1067
UGDS_NRA 0.0819
UGDS_UNKN 0.0031
PPTUG_EF 0.0
CURROPER 1
PCTPELL 0.1556
PCTFLOAN 0.1256
UG25ABV 0.0401
MD_EARN_WNE_P10 86000
GRAD_DEBT_MDN_SUPP 12782
Name: Stanford University, dtype: object
# if we want a DataFrame rather than a Series, we need to pass in a list of index values into .loc
college_unique.loc[['Stanford University']]
Out[50]:
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Stanford University Stanford CA ... 86000 12782
[1 rows x 26 columns]
# now time each approach whose results are the same
%timeit college[college['INSTNM']=='Stanford University']
741 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit college_unique.loc[['Stanford University']]
422 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
when the index is not sorted and contains duplicates, as with college2
, pandas will check every single value in the index to make the correct selection.
when the index is sorted, as with college3
, pandas takes advantage of an algorithm called binary search to improve search performance.
pandas implements unique indexes with a hash table, which makes for even faster selection . Each index location can be looked up in the same time regardless of its length.
concatenate 使连接在一起
Boolean selection gives much more flexibility than index selection as it is possible to condition on any number of columns.
we can concatenate multiple columns together to form an index.
#-----------------------
college.index=(college['CITY']+','+college['STABBR'])
college=college.sort_index()
college.head()
Out[57]:
INSTNM ... GRAD_DEBT_MDN_SUPP
ARTESIA,CA Angeles Institute ... 16850
Aberdeen,SD Presentation College ... 25000
Aberdeen,SD Northern State University ... 24847
Aberdeen,WA Grays Harbor College ... 11490
Abilene,TX Hardin-Simmons University ... 25864
[5 rows x 27 columns]
# so we can select all a particular city and state combination without Boolean indexing
college.loc['Miami,FL']
Out[59]:
INSTNM ... GRAD_DEBT_MDN_SUPP
Miami,FL New Professions Technical Institute ... 8682
Miami,FL Management Resources College ... 12182
Miami,FL Strayer University-Doral ... 36173.5
Miami,FL Keiser University- Miami ... 26063
Miami,FL George T Baker Aviation Technical College ... PrivacySuppressed
Miami,FL Paul Mitchell the School-Miami ... 11976
Miami,FL Eureka Institute of Health and Beauty ... PrivacySuppressed
Miami,FL Nouvelle Institute ... PrivacySuppressed
Miami,FL Mattia College ... 13299
Miami,FL SABER College ... 32431.5
Miami,FL Barry University ... 28250
Miami,FL Saint John Vianney College Seminary ... PrivacySuppressed
Miami,FL Florida Career College-Miami ... 9500
Miami,FL Future-Tech Institute ... PrivacySuppressed
Miami,FL Florida Vocational Institute ... 5375
Miami,FL American Medical Academy ... 7596.5
Miami,FL International Training Careers ... 7840
Miami,FL College of Business and Technology-Flagler ... 9167
Miami,FL Fortis Institute-Miami ... 9970.5
Miami,FL New Concept Massage and Beauty School ... 4280
Miami,FL Florida National University-South Campus ... 24547.5
Miami,FL Everest Institute-Kendall ... 19613
Miami,FL Praxis Institute ... 8150
Miami,FL Strayer University-Brickell ... 36173.5
Miami,FL La Belle Beauty Academy ... 7535
Miami,FL College of Business and Technology-Kendall ... 9167
Miami,FL Everest Institute-North Miami ... 19613
Miami,FL InterAmerican Technical Institute ... PrivacySuppressed
Miami,FL M-DCPS The English Center ... PrivacySuppressed
Miami,FL Florida College of Natural Health-Miami ... 8107
Miami,FL Yechanlaz Instituto Vocacional ... PrivacySuppressed
Miami,FL Polytechnic University of Puerto Rico-Miami ... 16500
Miami,FL Mattia College - ... 13299
Miami,FL Florida International University ... 17250
Miami,FL South Florida Institute of Technology ... 1409
Miami,FL Miami Ad School-Wynwood ... 21500
Miami,FL Professional Hands Institute ... PrivacySuppressed
Miami,FL Beauty Schools of America-Miami ... 9500
Miami,FL City College-Miami ... 26800
Miami,FL Celebrity School of Beauty ... 3666
Miami,FL Robert Morgan Educational Center and Technical... ... PrivacySuppressed
Miami,FL Carlos Albizu University-Miami ... 25000
Miami,FL Dade Medical College-Miami ... 25430
Miami,FL Florida Education Institute ... 9500
Miami,FL Acupuncture and Massage College ... 7600
Miami,FL Miami Dade College ... 8500
Miami,FL D A Dorsey Technical College ... PrivacySuppressed
Miami,FL AI Miami International University of Art and D... ... 31000
Miami,FL Advanced Technical Centers ... PrivacySuppressed
Miami,FL Lindsey Hopkins Technical College ... PrivacySuppressed
[50 rows x 27 columns]
%%timeit
crit1=college['CITY']=='Miami'
crit2=college['STABBR']=='FL'
college[crit1 & crit2]
The slowest run took 4.10 times longer than the fastest. This could mean that an intermediate result is being cached.
2.76 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college.loc['Miami,FL']
62.2 µs ± 337 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Translating SQL WHERE clauses
idiomatic 表达方式地道的
# Task: find all the female employee who work in the police or fire departments who have a base salary of between 80 and 120 thousand dollars
import pandas as pd
employee=pd.read_csv('employee.csv')
# before filtering out the data, it's helpful to do some manual inspection of each of the filtered columns.
employee.dtypes
Out[6]:
UNIQUE_ID int64
POSITION_TITLE object
DEPARTMENT object
BASE_SALARY float64
RACE object
EMPLOYMENT_TYPE object
GENDER object
EMPLOYMENT_STATUS object
HIRE_DATE object
JOB_DATE object
dtype: object
employee.DEPARTMENT.value_counts().head()
Out[7]:
Houston Police Department-HPD 638
Houston Fire Department (HFD) 384
Public Works & Engineering-PWE 343
Health & Human Services 110
Houston Airport System (HAS) 106
Name: DEPARTMENT, dtype: int64
employee.GENDER.value_counts()
Out[8]:
Male 1397
Female 603
Name: GENDER, dtype: int64
employee.BASE_SALARY.describe()
Out[9]:
count 1886.000000
mean 55767.931601
std 21693.706679
min 24960.000000
25% 40170.000000
50% 54461.000000
75% 66614.000000
max 275000.000000
Name: BASE_SALARY, dtype: float64
depts=['Houston Police Department-HPD','Houston Fire Department(HFD)']
criteria_dept=employee.DEPARTMENT.isin(depts)
criteria_gender=employee.GENDER=='Female'
criteria_sal=(employee.BASE_SALARY>=80000)&(employee.BASE_SALARY<=120000)
# combine all the Boolean arrays
criteria_final= criteria_dept & criteria_gender & criteria_sal
select_columns=['UNIQUE_ID','DEPARTMENT','GENDER','BASE_SALARY']
employee.loc[criteria_final,select_columns].head()
Out[17]:
UNIQUE_ID DEPARTMENT GENDER BASE_SALARY
136 136 Houston Police Department-HPD Female 81239.0
367 367 Houston Police Department-HPD Female 86534.0
474 474 Houston Police Department-HPD Female 91181.0
513 513 Houston Police Department-HPD Female 81239.0
564 564 Houston Police Department-HPD Female 81239.0
# the .isin method is equivalent to the SQL IN operator and accepts a list of all possible values that you would like to keep
crit_sal=emplyee.BASE_SALARY(80000,120000)
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-19-a4f6667784e7>", line 1, in <module>
crit_sal=emplyee.BASE_SALARY(80000,120000)
NameError: name 'emplyee' is not defined
crit_sal=employee.BASE_SALARY(80000,120000)
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-20-76ce9007d009>", line 1, in <module>
crit_sal=employee.BASE_SALARY(80000,120000)
TypeError: 'Series' object is not callable
crit_sal=employee.BASE_SALARY.between(80000,120000)
crit_sal.equals(criteria_sal)
Out[22]: True
# try to exclude the rows from the top five most frequently occurring department
top_5_depts=employee.DEPARTMENT.value_counts().index[:5]
type(top_5_depts)
Out[26]: pandas.core.indexes.base.Index
criteria=employee.DEPARTMENT.isin(top_5_depts) # wha is the difference from the next code?
criteria=~employee.DEPARTMENT.isin(top_5_depts)
employee[criteria]
Out[29]:
UNIQUE_ID POSITION_TITLE ... HIRE_DATE JOB_DATE
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2006-06-12 2012-10-13
1 1 LIBRARY ASSISTANT ... 2000-07-19 2010-09-18
4 4 ELECTRICIAN ... 1989-06-19 1994-10-22
18 18 MAINTENANCE MECHANIC III ... 2008-12-29 2008-12-29
32 32 SENIOR ACCOUNTANT ... 1991-02-11 2016-02-13
... ... ... ... ...
1976 1976 SENIOR SIDELOADER OPERATOR ... 2015-07-20 2016-01-30
1983 1983 ADMINISTRATIVE SPECIALIST ... 2006-10-16 2006-10-16
1985 1985 TRUCK DRIVER ... 2013-06-10 2015-08-01
1988 1988 SENIOR ASSISTANT CITY ATTORNEY II ... 2013-01-23 2013-03-02
1990 1990 BUILDING MAINTENANCE SUPERVISOR ... 1995-10-14 2010-03-20
[419 rows x 10 columns]
Imporving the readability of Boolean indexing with the query method
# Boolean indexing is not necessarily the most pleasant syntax to read or write, especially when using a single line to write a complex filter
# pandas has an alternative string-based syntax through the DataFrame query method that can provide more clarity
# take advantage of the .query method of the DataFrame
# Task: filter the employee data for female employee from the police or fir departments who earn a salry between 80 and 120 thousand dollars
# build the query string and execute the method
qs=(
"DEPARTMENT in @depts "
"and GENDER='Female' "
"and 80000<=BASE_SALARY<=120000"
)
emp_filtered=employee.query(qs)
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-58-4f16388941fc>", line 1, in <module>
emp_filtered=employee.query(qs)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3466, in query
res = self.eval(expr, **kwargs)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3596, in eval
return _eval(expr, inplace=inplace, **kwargs)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\eval.py", line 342, in eval
parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 798, in __init__
self.terms = self.parse()
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 817, in parse
return self._visitor.visit(self.expr)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 397, in visit
raise e
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 393, in visit
node = ast.fix_missing_locations(ast.parse(clean))
File "D:\PyCharm2020\python2020\lib\ast.py", line 50, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
DEPARTMENT in __pd_eval_local_depts and GENDER ='Female'and 80000 <=BASE_SALARY <=120000
^
SyntaxError: Python keyword not valid identifier in numexpr query
qs=(
"DEPARTMENT in @depts "
"and GENDER=='Female' "
"and 80000<=BASE_SALARY<=120000"
)
emp_filtered=employee.query(qs)
emp_filtered[select_columns].head()
Out[61]:
UNIQUE_ID DEPARTMENT GENDER BASE_SALARY
136 136 Houston Police Department-HPD Female 81239.0
367 367 Houston Police Department-HPD Female 86534.0
474 474 Houston Police Department-HPD Female 91181.0
513 513 Houston Police Department-HPD Female 81239.0
564 564 Houston Police Department-HPD Female 81239.0
# use the at symbol (@) to reference Python
# all DataFrame column names are available in the query namespace by referencing their name without extra quotes
# if a string is need, such as Female, inner quotes will need to wrap it
# another feature of the query syntax is the ability to combine Boolean operators using 'and' 'or' 'not'
# Task : find all the female employee who were not a member of the top 10 departments by frequency
top10_depts=employee.DEPARTMENT.value_counts().index[:10].tolist()
qs="DEPARTMENT not in @top10_depts and GENDER=='Female' "
employee_filtered2=employee.query(qs)
employee_filtered2.head()
Out[70]:
UNIQUE_ID POSITION_TITLE ... HIRE_DATE JOB_DATE
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2006-06-12 2012-10-13
73 73 ADMINISTRATIVE SPECIALIST ... 2011-12-19 2013-11-23
96 96 ASSISTANT CITY CONTROLLER III ... 2013-06-10 2013-06-10
117 117 SENIOR ASSISTANT CITY ATTORNEY I ... 1998-03-20 2012-07-21
146 146 SENIOR STAFF ANALYST ... 2014-03-17 2014-03-17
[5 rows x 10 columns]
Preserving Series size with the .where method
clump 聚簇,聚集
when you filter with Boolean arrays, the resulting Series or DataFrame is typically smaller.
the .where
method preserves the size of your Series or DataFrame and either sets the values that don’t meet the criteria to missing or replace them with something else.instead of dropping all these values, it is possible to keep them.
the .where
method again preserves the size and shape of the calling object and does not modify the values where the passed Boolean is True.
import pandas as pd
movie=pd.read_csv('movie.csv',index_col='movie_title')
# important to drop the missing values in the step1 as the .where method would have eventually replaced them with a valid number in future steps
fb_likes=movie["actor_1_facebook_likes"].dropna()
fb_likes.head()
Out[5]:
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 40000.0
Spectre 11000.0
The Dark Knight Rises 27000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
fb_likes.describe()
Out[6]:
count 4909.000000
mean 6494.488491
std 15106.986884
min 0.000000
25% 607.000000
50% 982.000000
75% 11000.000000
max 640000.000000
Name: actor_1_facebook_likes, dtype: float64
# plot a histogram of this Series to visually inspect the distribution
# plt.subplots to specify the figure size
import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
fb_likes.hist(ax=ax)
Out[11]: <AxesSubplot:>
fig.savefig('c7-hist.png',dpi=300)
# we can know tha data is very highly skewed to the right with a few very large observation(more than an order of magnitude greater than the median)
# the data has too many outliers for a plain histogram to make a good plot
# .where method allows us to place a ceiling and floor on our data, which reaults in a histogram with less variance
criteria_high=fb_likes<20000
criteria_high.mean().round(2)
Out[16]: 0.91
# .where method accepts a Boolean array, and its default behavior is to return a Series the same size as the original, but which has all the `False` locations replaced with a missing value
fb_likes.where(criteria_high).head()
Out[18]:
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End NaN
Spectre 11000.0
The Dark Knight Rises NaN
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
# the second parameter to the .where method , other, control the replacement value
# here, we change all the missing values to 20000
fb_likes.where(criteria_high,other=20000).head()
Out[21]:
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 131.0
Name: actor_1_facebook_likes, dtype: float64
# create criteria to put a floor on the minimum number of likes
criteria_low=fb_likes>300
fb_likes_cap=fb_likes.dropna().where(criteria_high,other=20000).where(criteria_low,other=300)
fb_likes_cap.head()
Out[26]:
movie_title
Avatar 1000.0
Pirates of the Caribbean: At World's End 20000.0
Spectre 11000.0
The Dark Knight Rises 20000.0
Star Wars: Episode VII - The Force Awakens 300.0
Name: actor_1_facebook_likes, dtype: float64
# the lengths of the original Series and the modified Series are the same
len(fb_likes)==len(fb_likes_cap)
Out[28]: True
# then we make a histogram with the modified Series
# notice now the data in a much tighter range
fig,ax=plt.subplots(figsize=(10,8))
fb_likes_cap.hist(ax=ax)
Out[32]: <AxesSubplot:>
fig.savefig('c7-hist2.png',dpi=300)
fb_likes_cap2=fb_likes.clip(lower=300,upper=20000)
fb_likes_cap2.equals(fb_likes_cap)
Out[35]: True
Masking DataFrame rows
the .mask method performs the complement of the .where method.
By default, it creates missing values wherever the Boolean condition is True
in essence, it is literally masking, or covering up, values in your dataset
#Task: mask all rows of the movie dataset that were made after 2010 and then filter all the rows with missing values
c1=movie["title_year"]>=2010
c2=movie["title_year"].isna()
criteria=c1|c2
# use the .mask method on a DataFrame to remove the values for all the values in rows with movies that were made from 2010
# any movie that originally had a missing value for title_year is also masked
movie.mask(criteria).head()
Out[43]:
color ... movie_facebook_likes
movie_title ...
Avatar Color ... 33000.0
Pirates of the Caribbean: At World's End Color ... 0.0
Spectre NaN ... NaN
The Dark Knight Rises NaN ... NaN
Star Wars: Episode VII - The Force Awakens NaN ... NaN
[5 rows x 27 columns]
# chain the .dropna method to remove rows that have all values missing
movie_mask=movie.mask(criteria).dropna(how='all')
movie_mask.head()
Out[46]:
color ... movie_facebook_likes
movie_title ...
Avatar Color ... 33000.0
Pirates of the Caribbean: At World's End Color ... 0.0
Spider-Man 3 Color ... 0.0
Harry Potter and the Half-Blood Prince Color ... 10000.0
Superman Returns Color ... 0.0
[5 rows x 27 columns]
# check the two method produce the same DataFrame
movie_mask.equals(movie[movie['title_year']<2010])
Out[48]: False
# further,do soma sanity checking--whether they are the same type
movie_mask.shape==movie[movie['title_year']<2010].shape
Out[51]: True
# when we used the preceding .mask method, it creates many missing values
# the .equal method return False if the data types of the columns are different, even if the values are same
# check the equality of the data types
movie_mask.dtypes==movie[movie['title_year']>2010].dtypes
Out[55]:
color True
director_name True
num_critic_for_reviews True
duration True
director_facebook_likes True
actor_3_facebook_likes True
actor_2_name True
actor_1_facebook_likes True
gross True
genres True
actor_1_name True
num_voted_users False
cast_total_facebook_likes False
actor_3_name True
facenumber_in_poster True
plot_keywords True
movie_imdb_link True
num_user_for_reviews True
language True
country True
content_rating True
budget True
title_year True
actor_2_facebook_likes True
imdb_score True
aspect_ratio True
movie_facebook_likes False
dtype: bool
# in pandas's 'testing' module, there is a function,assert_frame_equal, that allows you to check the equality Series and DataFrame without also checking the equality of the data types
from pandas.testing import assert_frame_equal
assert_frame_equal(movie_mask,movie[movie['title_year']<2010],check_dtype=False)
[239(260/627)]
Selecting with Booleans, integer location, and labels
we will filter both rows and columns with the .iloc
and .loc
attributes.
# create a Boolean array matching all movies with a content rating of G and an IMDB score less than 4
c1=movie['content_rating']=='G'
c2=movie['imdb_score']<4
criteria=c1&c2
# pass these criteria to .loc to filter the rows
movie_loc=movie.loc[criteria]
movie_loc.head()
Out[69]:
color ... movie_facebook_likes
movie_title ...
The True Story of Puss'N Boots Color ... 90
Doogal Color ... 346
Thomas and the Magic Railroad Color ... 663
Barney's Great Adventure Color ... 436
Justin Bieber: Never Say Never Color ... 62000
[5 rows x 27 columns]
movie_loc.equals(movie[criteria])
Out[70]: True
# check this DataFrame is equal to one generated directly from the indexing operator
# attempt the same Boolean indexing with .iloc indexer
movie_iloc=movie.iloc[criteria.to_numpy()]
# we cannot directly use a series of boolean because of the index
movie_loc.equals(movie.iloc[criteria.to_numpy()])
Out[75]: True
# select particular column with boolean indexing
# here, select all the columns that have a data type of 64-bit intergers
import numpy as np
criteria_col=movie.dtypes==np.int64
movie.loc[:,criteria_col].head()
Out[81]:
num_voted_users ... movie_facebook_likes
movie_title ...
Avatar 886204 ... 33000
Pirates of the Caribbean: At World's End 471220 ... 0
Spectre 275868 ... 85000
The Dark Knight Rises 1144337 ... 164000
Star Wars: Episode VII - The Force Awakens 8 ... 0
[5 rows x 3 columns]
# as criteria_col is a Series, which always has an index, you must use the underlying ndarry to make it work with .iloc
movie.iloc[:,criteria_col.to_numpy()].head()
Out[83]:
num_voted_users ... movie_facebook_likes
movie_title ...
Avatar 886204 ... 33000
Pirates of the Caribbean: At World's End 471220 ... 0
Spectre 275868 ... 85000
The Dark Knight Rises 1144337 ... 164000
Star Wars: Episode VII - The Force Awakens 8 ... 0
[5 rows x 3 columns]
# when using .loc, you can use a Boolean array to select rows and specify the columns you want with a list of labels
cols=['content_rating','imdb_score','title_year','gross']
movie.loc[criteria,cols]
Out[86]:
content_rating ... gross
movie_title ...
The True Story of Puss'N Boots G ... NaN
Doogal G ... 7382993.0
Thomas and the Magic Railroad G ... 15911333.0
Barney's Great Adventure G ... 11144518.0
Justin Bieber: Never Say Never G ... 73000942.0
Sunday School Musical G ... NaN
[6 rows x 4 columns]
# create the same operation with .iloc, and you need to specify the position of the columns
# use a list comprehension to loop through all th edesired column names to find their integer location with the index method .get_loc
col_index=[movie.columns.get_loc(col) for col in cols]
col_index
Out[89]: [20, 24, 22, 8]
movie.iloc[criteria.to_numpy(),col_index].sort_values("imdb_score")
Out[91]:
content_rating ... gross
movie_title ...
Justin Bieber: Never Say Never G ... 73000942.0
Sunday School Musical G ... NaN
Doogal G ... 7382993.0
Barney's Great Adventure G ... 11144518.0
The True Story of Puss'N Boots G ... NaN
Thomas and the Magic Railroad G ... 15911333.0
[6 rows x 4 columns]
# both .loc and .iloc attributes have some support filtering with
# with the caveat that .iloc cannot be passed a Series but the underlying ndarray
# take a look at the one-dimensional ndarray underlying criteria
a=criteria.to_numpy()
a
Out[96]: array([False, False, False, ..., False, False, False])
len(a),len(criteria)
Out[97]: (4916, 4916)
# the array is the same length as the Series, which is the same length as the movie DataFrame
# the integer location for the Boolean array aligns with the integer location of the DataFrame
# and the filter happens as expected
# these arrays also work with the .loc attribute as well, but they are a necessity with .iloc