【cookbook pandas】chapter7 Filtering Row

最新推荐文章于 2021-04-04 18:24:34 发布

萝卜丝皮尔

最新推荐文章于 2021-04-04 18:24:34 发布

阅读量286

点赞数

分类专栏：数据分析Python基础篇文章标签： cookbook pandas

本文链接：https://blog.csdn.net/qq_43448491/article/details/115054693

版权

数据分析Python基础篇专栏收录该内容

23 篇文章 0 订阅

订阅专栏

filter or subset data in pandas with Boolean indexing;
Boolean indexing (also known as Boolean selection) refers to selecting rows by providing a Boolean array , a pandas Series with the same index, but a True or False for each row
informative 提供有用信息的，给予知识的；
Python evaluates False and True as 0 and 1 , respectively.

Calculating Boolean statistics

import pandas as pd
import numpy as np
movie=pd.read_csv('movie.csv',index_col='movie_title')
movie[['duration']].head()
Out[5]: 
                                            duration
movie_title                                         
Avatar                                         178.0
Pirates of the Caribbean: At World's End       169.0
Spectre                                        148.0
The Dark Knight Rises                          164.0
Star Wars: Episode VII - The Force Awakens       NaN
# determine whether the duration of each movie is longer than 2 hours 
movie_2_hours=movie['duration']>120
movie_2_hours.head(10)
Out[8]: 
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool
# use the above Series to determine the number of movies that are longer than two hours
movie_2_hours.sum()
Out[10]: 1039
# Wrong: to find the percentage of movie longer than 2 hours, use .mean method
movie_2_hours.mean()
Out[12]: 0.2113506916192026
# Correct: the duration column has missing values.
# so, need to drop the missing values first, then evaluate the condition and take the mean
movie['duration'].dropna().gt(120).mean()
Out[15]: 0.21199755152009794
# important takeaway: make sure you have dealt with missing values before making calculations
# the boolean condition evaluated all these comparisons against missing values as False
# prove the mean of a Boolean array represents the percentage of True values
# use the .value_counts method to count with the normalize parameter set to True to get its distribution
movie_2_hours.value_counts(normalize=True)
Out[20]: 
False    0.788649
True     0.211351
Name: duration, dtype: float64
# .describe method to Boolean arrays the same way it applies to a column of object or strings
movie_2_hours
Out[22]: 
movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
                                              ...  
Signed Sealed Delivered                       False
The Following                                 False
A Plague So Pleasant                          False
Shanghai Calling                              False
My Date with Drew                             False
Name: duration, Length: 4916, dtype: bool
movie_2_hours.describe()
Out[23]: 
count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object
# if you want quantile information, you could cast the Series into integers
movie_2_hours.astype(int).describe()
Out[25]: 
count    4916.000000
mean        0.211351
std         0.408308
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max         1.000000
Name: duration, dtype: float64
# compare two columns from the same DataFrame to produce a Boolean Series
# e.g. determine the percentage of movies that actor1 with more facebook likes than actor 2
actors=movie[['actor_1_facebook_likes','actor_2_facebook_likes']].dropna()
(actors.actor_1_facebook_likes>actors.actor_2_facebook_likes).mean()
Out[29]: 0.9777687130328371

Constructing multiple Boolean conditions

precedence 优先
in Python, Boolean expression use the built-in logical operators ‘and’,‘not’ and ‘or’
the keywords do not work with Boolean indexing in pandas and are respectively replaces with & , | , ~
when combining expressions, each expression must be wrapped in parentheses, or an error will be raised(due to operator precedence)
all the movie that have an ‘’

# find all the movie that have an 'imdb_score' greater than 8, a'content_rating' of PG-13, and a 'title_year' either before 2000 or after 2009
criteria1=movie.imdb_score>8
criteria2=movie.content_rating=="PG-13"
criteria3=(movie.title_year<2000)|(movie.title_year>2009)
criteria_final=criteria1&criteria2&criteria3
criteria_final.head()
Out[35]: 
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool
# this Boolean array has the same index as the movie DataFrame
# pipe character '|' is used to create a logical or condition
# the ampersand character &
# the operators that pandas uses (the bitwise operators &,|,and ~) have a higher precedence than the comparison operator 
# and the comparison operator has a higher precedence than 'and','or' and 'not'
5>4 and 3>4
Out[41]: False
# need parentheses to enforce operator precedence

Filtering with Boolean arrays

# create criteria for the second set of movies
criteria_b1=movie.imdb_score<5
crieria_b2=movie.content_rating=="R"
criteria_b3=(movie.title_year>=2000)&(movie.title_year<=2010)
criteria_final_b=criteria_b1 & criteria_final_b2 & criteria_b3
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-47-6b6881e93b85>", line 1, in <module>
    criteria_final_b=criteria_b1 & criteria_final_b2 & criteria_b3
NameError: name 'criteria_final_b2' is not defined
criteria_final_b=criteria_b1 & criteria_b2 & criteria_b3
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-9c124f840703>", line 1, in <module>
    criteria_final_b=criteria_b1 & criteria_b2 & criteria_b3
NameError: name 'criteria_b2' is not defined
criteria_final_b=criteria_b1 & crieria_b2 & criteria_b3
# combine the two sets of criteria using the pandas or operators 
# this yields a Boolean array of all movies that are members of either set
final_crit_all=criteria_final | criteria_final_b
final_crit_all.head()
Out[53]: 
movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool
movie[final_crit_all]
Out[54]: 
                            color  ... movie_facebook_likes
movie_title                        ...                     
The Dark Knight Rises       Color  ...               164000
The Avengers                Color  ...               123000
Captain America: Civil War  Color  ...                72000
Guardians of the Galaxy     Color  ...                96000
Interstellar                Color  ...               349000
                           ...  ...                  ...
The Young Unknowns          Color  ...                    4
Bled                        Color  ...                  128
Hoop Dreams                 Color  ...                    0
Death Calls                 Color  ...                   16
The Legend of God's Gun     Color  ...                   13
[86 rows x 27 columns]
# once you have your Boolean array, you pass it to the index operator to filter the data
# we can also filter off of the .loc attribute
movie.loc[final_crit_all].head()
Out[57]: 
                            color  ... movie_facebook_likes
movie_title                        ...                     
The Dark Knight Rises       Color  ...               164000
The Avengers                Color  ...               123000
Captain America: Civil War  Color  ...                72000
Guardians of the Galaxy     Color  ...                96000
Interstellar                Color  ...               349000
[5 rows x 27 columns]
# filtering also works with the .loc attribute by simultaneously selecting rows(using the Boolean array) and columns
movie.loc[final_crit_all,["imdb_score","content_rating","title_year"]]
Out[59]: 
                            imdb_score content_rating  title_year
movie_title                                                      
The Dark Knight Rises              8.5          PG-13      2012.0
The Avengers                       8.1          PG-13      2012.0
Captain America: Civil War         8.2          PG-13      2016.0
Guardians of the Galaxy            8.1          PG-13      2014.0
Interstellar                       8.6          PG-13      2014.0
                                ...            ...         ...
The Young Unknowns                 4.2              R      2000.0
Bled                               3.1              R      2009.0
Hoop Dreams                        8.3          PG-13      1994.0
Death Calls                        4.3              R      2010.0
The Legend of God's Gun            4.1              R      2007.0
[86 rows x 3 columns]
#that is , we can specify columns to select with the .loc attribute
# the .iloc attribute does not support Boolean array
# and if you pass in a Boolean Series to it, an exception will get raised 
# however, it does work with Numpy arrays, so if you call the .to_numpy() method, you canfilter with it
movie.iloc[final_crit_all.to_numpy()]
Out[64]: 
                            color  ... movie_facebook_likes
movie_title                        ...                     
The Dark Knight Rises       Color  ...               164000
The Avengers                Color  ...               123000
Captain America: Civil War  Color  ...                72000
Guardians of the Galaxy     Color  ...                96000
Interstellar                Color  ...               349000
                           ...  ...                  ...
The Young Unknowns          Color  ...                    4
Bled                        Color  ...                  128
Hoop Dreams                 Color  ...                    0
Death Calls                 Color  ...                   16
The Legend of God's Gun     Color  ...                   13
[86 rows x 27 columns]
# you can also use one long Boolean expression in place of several other shorter ones
final_crit_a2=(
    (movie.imdb_score>8)
    & (movie.content_rating=="PG-13")
    &((movie.title_year<2000)|(movie.title_year>2009))
)
final_crit_a2.equals(criteria_final)
Out[67]: True

Comparing row filtering and index filtering

overhead 经常费用

# in this section, select all institutions from a particular state with both Boolean indexing and index selection
# and then compare each of their performance against one another
prefer to filter by columns(using Boolean arrays) rather than on the index
  File "<ipython-input-4-e9b74ae09153>", line 1
    prefer to filter by columns(using Boolean arrays) rather than on the index
           ^
SyntaxError: invalid syntax
#prefer to filter by columns(using Boolean arrays) rather than on the index
# column filtering is more powerful as you can use other logical operators and filter on multiple columns 
# the first, use Boolean array to select all insitutions from the state of Texas(TX)
import pandas as pd
college = pd.read_csv('college.csv')
college[college['STABBR']=='TX'],head()
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-06a1c3454acc>", line 1, in <module>
    college[college['STABBR']=='TX'],head()
NameError: name 'head' is not defined
college[college['STABBR']=='TX'].head()
Out[11]: 
                            INSTNM  ... GRAD_DEBT_MDN_SUPP
3610  Abilene Christian University  ...              25985
3611       Alvin Community College  ...               6750
3612              Amarillo College  ...              10950
3613              Angelina College  ...  PrivacySuppressed
3614       Angelo State University  ...            21319.5
[5 rows x 27 columns]
# the second, use the index selection, move 'STABBR' column into the index
# then we can use label-based selection with the .loc indexer
college2=college.set_index('STABBR')
college2.loc["TX"].head()
Out[15]: 
                              INSTNM  ... GRAD_DEBT_MDN_SUPP
STABBR                                ...                   
TX      Abilene Christian University  ...              25985
TX           Alvin Community College  ...               6750
TX                  Amarillo College  ...              10950
TX                  Angelina College  ...  PrivacySuppressed
TX           Angelo State University  ...            21319.5
[5 rows x 26 columns]
# compare the speeed of both methods
%timeit college[college['STABBR']=='TX']
874 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Compiler time: 0.34 s
%timeit college2.loc['TX']
612 µs ± 371 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# however, for the second method, setting the index also take time
%timeit college2=college.set_index('STABBR')
1.83 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you'd better not to switch the index, just to filter by it
# although the fact that selection via the index is much faster than Boolean selection
# if you need to filter on multiple columns, you will have the overhead(and confusing code) from repeatedly switching the index
# how to select multiple states with both Boolean and index selection 
states=['TX','CA',"NY"]
college[college.STABBR.isin(states)]
Out[26]: 
                                                 INSTNM  ... GRAD_DEBT_MDN_SUPP
192                           Academy of Art University  ...              35093
193              ITT Technical Institute-Rancho Cordova  ...            25827.5
194      Academy of Chinese Culture and Health Sciences  ...  PrivacySuppressed
195            The Academy of Radio and TV Broadcasting  ...               9500
196                Avalon School of Cosmetology-Alameda  ...               9860
                                                 ...  ...                ...
7528                           WestMed College - Merced  ...            15623.5
7529                                    Vantage College  ...               9500
7530         SAE Institute of Technology  San Francisco  ...               9500
7533  Bay Area Medical Academy - San Jose Satellite ...  ...  PrivacySuppressed
7534            Excel Learning Center-San Antonio South  ...              12125
[1704 rows x 27 columns]
college2.loc[states]
Out[27]: 
                                               INSTNM  ... GRAD_DEBT_MDN_SUPP
STABBR                                                 ...                   
TX                       Abilene Christian University  ...              25985
TX                            Alvin Community College  ...               6750
TX                                   Amarillo College  ...              10950
TX                                   Angelina College  ...  PrivacySuppressed
TX                            Angelo State University  ...            21319.5
                                               ...  ...                ...
NY                      Briarcliffe College–Patchogue  ...            28720.5
NY                       Jamestown Business College -  ...              12050
NY      Pratt Manhattan-A Division of Pratt Institute  ...              26691
NY                  Saint Josephs College-Long Island  ...            22143.5
NY        Franklin Career Institute - Brooklyn Campus  ...  PrivacySuppressed
[1704 rows x 26 columns]
# pandas implements the index differently based on whether the index is unique or sorted

Selecting with unique and sorted indexes

index selection performance drastically improves when the index is unique or sorted.
an unsorted index that contained duplicates makes for relatively slow selection.

# check 'college2' : whether the index is sorted
college2.index.is_monotonic
Out[30]: False
# sort the index from college and stores it as another object 
college3=college2.sort_index()
college3.index.is_monotonic
Out[33]: True
# time the selection of the state TX from three different DataFrame 'college' 'college2' 'college3'
%timeit college[college["STABBR"]=='TX']
The slowest run took 4.69 times longer than the fastest. This could mean that an intermediate result is being cached.
1.45 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college2.loc['TX']
464 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college3.loc['TX']
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached.
117 µs ± 124 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# the sorted index performs nearly an order of magnitude faster than the Boolean selection
# let's turn toward unique indexes,which means we set the institution name 'INSTNM' as the index
college_unique=college.set_index('INSTNM')
college_unique.index.is_unique
Out[43]: True
# select Stanford University with Boolean indexing
college[college["INSTNM"]=="Stanford University"]
Out[45]: 
                   INSTNM      CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
4217  Stanford University  Stanford  ...           86000               12782
[1 rows x 27 columns]
# and it is a DataFrame
# select Stanford University with index selection, and this returns a Series
college_unique.loc["Stanford University"]
Out[48]: 
CITY                  Stanford
STABBR                      CA
HBCU                       0.0
MENONLY                    0.0
WOMENONLY                  0.0
RELAFFIL                     0
SATVRMID                 730.0
SATMTMID                 745.0
DISTANCEONLY               0.0
UGDS                    7018.0
UGDS_WHITE              0.3752
UGDS_BLACK              0.0591
UGDS_HISP               0.1607
UGDS_ASIAN              0.1979
UGDS_AIAN               0.0114
UGDS_NHPI               0.0038
UGDS_2MOR               0.1067
UGDS_NRA                0.0819
UGDS_UNKN               0.0031
PPTUG_EF                   0.0
CURROPER                     1
PCTPELL                 0.1556
PCTFLOAN                0.1256
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, dtype: object
# if we want a DataFrame rather than a Series, we need to pass in a list of index values into .loc
college_unique.loc[['Stanford University']]
Out[50]: 
                         CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
INSTNM                                ...                                     
Stanford University  Stanford     CA  ...            86000               12782
[1 rows x 26 columns]
# now time each approach whose results are the same
%timeit college[college['INSTNM']=='Stanford University']
741 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit college_unique.loc[['Stanford University']]
422 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

when the index is not sorted and contains duplicates, as with college2, pandas will check every single value in the index to make the correct selection.
when the index is sorted, as with college3, pandas takes advantage of an algorithm called binary search to improve search performance.
pandas implements unique indexes with a hash table, which makes for even faster selection . Each index location can be looked up in the same time regardless of its length.

concatenate 使连接在一起
Boolean selection gives much more flexibility than index selection as it is possible to condition on any number of columns.
we can concatenate multiple columns together to form an index.

#-----------------------
college.index=(college['CITY']+','+college['STABBR'])
college=college.sort_index()
college.head()
Out[57]: 
                                INSTNM  ... GRAD_DEBT_MDN_SUPP
ARTESIA,CA           Angeles Institute  ...              16850
Aberdeen,SD       Presentation College  ...              25000
Aberdeen,SD  Northern State University  ...              24847
Aberdeen,WA       Grays Harbor College  ...              11490
Abilene,TX   Hardin-Simmons University  ...              25864
[5 rows x 27 columns]
# so we can select all a particular city and state combination without Boolean indexing
college.loc['Miami,FL']
Out[59]: 
                                                     INSTNM  ... GRAD_DEBT_MDN_SUPP
Miami,FL                New Professions Technical Institute  ...               8682
Miami,FL                       Management Resources College  ...              12182
Miami,FL                           Strayer University-Doral  ...            36173.5
Miami,FL                           Keiser University- Miami  ...              26063
Miami,FL          George T Baker Aviation Technical College  ...  PrivacySuppressed
Miami,FL                     Paul Mitchell the School-Miami  ...              11976
Miami,FL              Eureka Institute of Health and Beauty  ...  PrivacySuppressed
Miami,FL                                 Nouvelle Institute  ...  PrivacySuppressed
Miami,FL                                     Mattia College  ...              13299
Miami,FL                                      SABER College  ...            32431.5
Miami,FL                                   Barry University  ...              28250
Miami,FL                Saint John Vianney College Seminary  ...  PrivacySuppressed
Miami,FL                       Florida Career College-Miami  ...               9500
Miami,FL                              Future-Tech Institute  ...  PrivacySuppressed
Miami,FL                       Florida Vocational Institute  ...               5375
Miami,FL                           American Medical Academy  ...             7596.5
Miami,FL                     International Training Careers  ...               7840
Miami,FL         College of Business and Technology-Flagler  ...               9167
Miami,FL                             Fortis Institute-Miami  ...             9970.5
Miami,FL              New Concept Massage and Beauty School  ...               4280
Miami,FL           Florida National University-South Campus  ...            24547.5
Miami,FL                          Everest Institute-Kendall  ...              19613
Miami,FL                                   Praxis Institute  ...               8150
Miami,FL                        Strayer University-Brickell  ...            36173.5
Miami,FL                            La Belle Beauty Academy  ...               7535
Miami,FL         College of Business and Technology-Kendall  ...               9167
Miami,FL                      Everest Institute-North Miami  ...              19613
Miami,FL                  InterAmerican Technical Institute  ...  PrivacySuppressed
Miami,FL                          M-DCPS The English Center  ...  PrivacySuppressed
Miami,FL            Florida College of Natural Health-Miami  ...               8107
Miami,FL                     Yechanlaz Instituto Vocacional  ...  PrivacySuppressed
Miami,FL        Polytechnic University of Puerto Rico-Miami  ...              16500
Miami,FL                                   Mattia College -  ...              13299
Miami,FL                   Florida International University  ...              17250
Miami,FL              South Florida Institute of Technology  ...               1409
Miami,FL                            Miami Ad School-Wynwood  ...              21500
Miami,FL                       Professional Hands Institute  ...  PrivacySuppressed
Miami,FL                    Beauty Schools of America-Miami  ...               9500
Miami,FL                                 City College-Miami  ...              26800
Miami,FL                         Celebrity School of Beauty  ...               3666
Miami,FL  Robert Morgan Educational Center and Technical...  ...  PrivacySuppressed
Miami,FL                     Carlos Albizu University-Miami  ...              25000
Miami,FL                         Dade Medical College-Miami  ...              25430
Miami,FL                        Florida Education Institute  ...               9500
Miami,FL                    Acupuncture and Massage College  ...               7600
Miami,FL                                 Miami Dade College  ...               8500
Miami,FL                       D A Dorsey Technical College  ...  PrivacySuppressed
Miami,FL  AI Miami International University of Art and D...  ...              31000
Miami,FL                         Advanced Technical Centers  ...  PrivacySuppressed
Miami,FL                  Lindsey Hopkins Technical College  ...  PrivacySuppressed
[50 rows x 27 columns]
%%timeit
crit1=college['CITY']=='Miami'
crit2=college['STABBR']=='FL'
college[crit1 & crit2]
The slowest run took 4.10 times longer than the fastest. This could mean that an intermediate result is being cached.
2.76 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit college.loc['Miami,FL']
62.2 µs ± 337 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Translating SQL WHERE clauses

idiomatic 表达方式地道的

# Task: find all the female employee who work in the police or fire departments who have a base salary of between 80 and 120 thousand dollars
import pandas as pd
employee=pd.read_csv('employee.csv')
# before filtering out the data, it's helpful to do some manual inspection of each of the filtered columns.
employee.dtypes
Out[6]: 
UNIQUE_ID              int64
POSITION_TITLE        object
DEPARTMENT            object
BASE_SALARY          float64
RACE                  object
EMPLOYMENT_TYPE       object
GENDER                object
EMPLOYMENT_STATUS     object
HIRE_DATE             object
JOB_DATE              object
dtype: object
employee.DEPARTMENT.value_counts().head()
Out[7]: 
Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Name: DEPARTMENT, dtype: int64
employee.GENDER.value_counts()
Out[8]: 
Male      1397
Female     603
Name: GENDER, dtype: int64
employee.BASE_SALARY.describe()
Out[9]: 
count      1886.000000
mean      55767.931601
std       21693.706679
min       24960.000000
25%       40170.000000
50%       54461.000000
75%       66614.000000
max      275000.000000
Name: BASE_SALARY, dtype: float64
depts=['Houston Police Department-HPD','Houston Fire Department(HFD)']
criteria_dept=employee.DEPARTMENT.isin(depts)
criteria_gender=employee.GENDER=='Female'
criteria_sal=(employee.BASE_SALARY>=80000)&(employee.BASE_SALARY<=120000)
# combine all the Boolean arrays
criteria_final= criteria_dept & criteria_gender & criteria_sal
select_columns=['UNIQUE_ID','DEPARTMENT','GENDER','BASE_SALARY']
employee.loc[criteria_final,select_columns].head()
Out[17]: 
     UNIQUE_ID                     DEPARTMENT  GENDER  BASE_SALARY
136        136  Houston Police Department-HPD  Female      81239.0
367        367  Houston Police Department-HPD  Female      86534.0
474        474  Houston Police Department-HPD  Female      91181.0
513        513  Houston Police Department-HPD  Female      81239.0
564        564  Houston Police Department-HPD  Female      81239.0
# the .isin method is equivalent to the SQL IN operator and accepts a list of all possible values that you would like to keep
crit_sal=emplyee.BASE_SALARY(80000,120000)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-19-a4f6667784e7>", line 1, in <module>
    crit_sal=emplyee.BASE_SALARY(80000,120000)
NameError: name 'emplyee' is not defined
crit_sal=employee.BASE_SALARY(80000,120000)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-20-76ce9007d009>", line 1, in <module>
    crit_sal=employee.BASE_SALARY(80000,120000)
TypeError: 'Series' object is not callable
crit_sal=employee.BASE_SALARY.between(80000,120000)
crit_sal.equals(criteria_sal)
Out[22]: True
# try to exclude the rows from the top five most frequently occurring department
top_5_depts=employee.DEPARTMENT.value_counts().index[:5]

type(top_5_depts)
Out[26]: pandas.core.indexes.base.Index
criteria=employee.DEPARTMENT.isin(top_5_depts) # wha is the difference from the next code?
criteria=~employee.DEPARTMENT.isin(top_5_depts)
employee[criteria]
Out[29]: 
      UNIQUE_ID                     POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
0             0        ASSISTANT DIRECTOR (EX LVL)  ...  2006-06-12  2012-10-13
1             1                  LIBRARY ASSISTANT  ...  2000-07-19  2010-09-18
4             4                        ELECTRICIAN  ...  1989-06-19  1994-10-22
18           18           MAINTENANCE MECHANIC III  ...  2008-12-29  2008-12-29
32           32                  SENIOR ACCOUNTANT  ...  1991-02-11  2016-02-13
         ...                                ...  ...         ...         ...
1976       1976         SENIOR SIDELOADER OPERATOR  ...  2015-07-20  2016-01-30
1983       1983          ADMINISTRATIVE SPECIALIST  ...  2006-10-16  2006-10-16
1985       1985                       TRUCK DRIVER  ...  2013-06-10  2015-08-01
1988       1988  SENIOR ASSISTANT CITY ATTORNEY II  ...  2013-01-23  2013-03-02
1990       1990    BUILDING MAINTENANCE SUPERVISOR  ...  1995-10-14  2010-03-20
[419 rows x 10 columns]

Imporving the readability of Boolean indexing with the query method

# Boolean indexing is not necessarily the most pleasant syntax to read or write, especially when using a single line to write a complex filter
# pandas has an alternative string-based syntax through the DataFrame query method that can provide more clarity
# take advantage of the .query method of the DataFrame 
# Task: filter the employee data for female employee from the police or fir departments who earn a salry between 80 and 120 thousand dollars
# build the query string and execute the method
qs=(
    "DEPARTMENT in @depts "
    "and GENDER='Female' "
    "and 80000<=BASE_SALARY<=120000"
)
emp_filtered=employee.query(qs)
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-58-4f16388941fc>", line 1, in <module>
    emp_filtered=employee.query(qs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3466, in query
    res = self.eval(expr, **kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\frame.py", line 3596, in eval
    return _eval(expr, inplace=inplace, **kwargs)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\eval.py", line 342, in eval
    parsed_expr = Expr(expr, engine=engine, parser=parser, env=env)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 798, in __init__
    self.terms = self.parse()
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 817, in parse
    return self._visitor.visit(self.expr)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 397, in visit
    raise e
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\computation\expr.py", line 393, in visit
    node = ast.fix_missing_locations(ast.parse(clean))
  File "D:\PyCharm2020\python2020\lib\ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    DEPARTMENT in __pd_eval_local_depts and GENDER ='Female'and 80000 <=BASE_SALARY <=120000 
    ^
SyntaxError: Python keyword not valid identifier in numexpr query
qs=(
    "DEPARTMENT in @depts "
    "and GENDER=='Female' "
    "and 80000<=BASE_SALARY<=120000"
)
emp_filtered=employee.query(qs)
emp_filtered[select_columns].head()
Out[61]: 
     UNIQUE_ID                     DEPARTMENT  GENDER  BASE_SALARY
136        136  Houston Police Department-HPD  Female      81239.0
367        367  Houston Police Department-HPD  Female      86534.0
474        474  Houston Police Department-HPD  Female      91181.0
513        513  Houston Police Department-HPD  Female      81239.0
564        564  Houston Police Department-HPD  Female      81239.0
# use the at symbol (@) to reference Python
# all DataFrame column names are available in the query namespace by referencing their name without extra quotes
# if a string is need, such as Female, inner quotes will need to wrap it 
# another feature of the query syntax is the ability to combine Boolean operators using 'and' 'or' 'not'
# Task : find all the female employee who were not a member of the top 10 departments by frequency
top10_depts=employee.DEPARTMENT.value_counts().index[:10].tolist()
qs="DEPARTMENT not in @top10_depts and GENDER=='Female' "
employee_filtered2=employee.query(qs)
employee_filtered2.head()
Out[70]: 
     UNIQUE_ID                    POSITION_TITLE  ...   HIRE_DATE    JOB_DATE
0            0       ASSISTANT DIRECTOR (EX LVL)  ...  2006-06-12  2012-10-13
73          73         ADMINISTRATIVE SPECIALIST  ...  2011-12-19  2013-11-23
96          96     ASSISTANT CITY CONTROLLER III  ...  2013-06-10  2013-06-10
117        117  SENIOR ASSISTANT CITY ATTORNEY I  ...  1998-03-20  2012-07-21
146        146              SENIOR STAFF ANALYST  ...  2014-03-17  2014-03-17
[5 rows x 10 columns]

Preserving Series size with the .where method

clump 聚簇，聚集
when you filter with Boolean arrays, the resulting Series or DataFrame is typically smaller.
the .where method preserves the size of your Series or DataFrame and either sets the values that don’t meet the criteria to missing or replace them with something else.instead of dropping all these values, it is possible to keep them.
the .where method again preserves the size and shape of the calling object and does not modify the values where the passed Boolean is True.

import pandas as pd
movie=pd.read_csv('movie.csv',index_col='movie_title')
# important to drop the missing values in the step1 as the .where method would have eventually replaced them with a valid number in future steps
fb_likes=movie["actor_1_facebook_likes"].dropna()
fb_likes.head()
Out[5]: 
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      40000.0
Spectre                                       11000.0
The Dark Knight Rises                         27000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64
fb_likes.describe()
Out[6]: 
count      4909.000000
mean       6494.488491
std       15106.986884
min           0.000000
25%         607.000000
50%         982.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64
# plot a histogram of this Series to visually inspect the distribution
# plt.subplots to specify the figure size
import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
fig,ax=plt.subplots(figsize=(10,8))
fb_likes.hist(ax=ax)
Out[11]: <AxesSubplot:>
fig.savefig('c7-hist.png',dpi=300)
# we can know tha data is very highly skewed to the right with a few very large observation(more than an order of magnitude greater than the median)
# the data has too many outliers for a plain histogram to make a good plot
# .where method allows us to place a ceiling and floor on our data, which reaults in a histogram with less variance 
criteria_high=fb_likes<20000
criteria_high.mean().round(2)
Out[16]: 0.91
# .where method accepts a Boolean array, and its default behavior is to return a Series the same size as the original, but which has all the `False` locations replaced with a missing value
fb_likes.where(criteria_high).head()
Out[18]: 
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End          NaN
Spectre                                       11000.0
The Dark Knight Rises                             NaN
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

# the second parameter to the .where method , other, control the replacement value
# here, we change all the missing values to 20000
fb_likes.where(criteria_high,other=20000).head()
Out[21]: 
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      131.0
Name: actor_1_facebook_likes, dtype: float64

# create criteria to put a floor on the minimum number of likes
criteria_low=fb_likes>300
fb_likes_cap=fb_likes.dropna().where(criteria_high,other=20000).where(criteria_low,other=300)
fb_likes_cap.head()
Out[26]: 
movie_title
Avatar                                         1000.0
Pirates of the Caribbean: At World's End      20000.0
Spectre                                       11000.0
The Dark Knight Rises                         20000.0
Star Wars: Episode VII - The Force Awakens      300.0
Name: actor_1_facebook_likes, dtype: float64
# the lengths of the original Series and the modified Series are the same
len(fb_likes)==len(fb_likes_cap)
Out[28]: True
# then we make a histogram with the modified Series
# notice now the data in a much tighter range
fig,ax=plt.subplots(figsize=(10,8))
fb_likes_cap.hist(ax=ax)
Out[32]: <AxesSubplot:>
fig.savefig('c7-hist2.png',dpi=300)
fb_likes_cap2=fb_likes.clip(lower=300,upper=20000)
fb_likes_cap2.equals(fb_likes_cap)
Out[35]: True

Masking DataFrame rows

the .mask method performs the complement of the .where method.
By default, it creates missing values wherever the Boolean condition is True
in essence, it is literally masking, or covering up, values in your dataset

#Task: mask all rows of the movie dataset that were made after 2010 and then filter all the rows with missing values
c1=movie["title_year"]>=2010
c2=movie["title_year"].isna()
criteria=c1|c2
# use the .mask method on a DataFrame to remove the values for all the values in rows with movies that were made from 2010
# any movie that originally had a missing value for title_year is also masked 
movie.mask(criteria).head()
Out[43]: 
                                            color  ... movie_facebook_likes
movie_title                                        ...                     
Avatar                                      Color  ...              33000.0
Pirates of the Caribbean: At World's End    Color  ...                  0.0
Spectre                                       NaN  ...                  NaN
The Dark Knight Rises                         NaN  ...                  NaN
Star Wars: Episode VII - The Force Awakens    NaN  ...                  NaN
[5 rows x 27 columns]
# chain the .dropna method to remove rows that have all values missing
movie_mask=movie.mask(criteria).dropna(how='all')
movie_mask.head()
Out[46]: 
                                          color  ... movie_facebook_likes
movie_title                                      ...                     
Avatar                                    Color  ...              33000.0
Pirates of the Caribbean: At World's End  Color  ...                  0.0
Spider-Man 3                              Color  ...                  0.0
Harry Potter and the Half-Blood Prince    Color  ...              10000.0
Superman Returns                          Color  ...                  0.0
[5 rows x 27 columns]
# check the two method produce the same DataFrame
movie_mask.equals(movie[movie['title_year']<2010])
Out[48]: False
# further,do soma sanity checking--whether they are the same type    
movie_mask.shape==movie[movie['title_year']<2010].shape
Out[51]: True
# when we used the preceding .mask method, it creates many missing values
# the .equal method return False if the data types of the columns are different, even if the values are same
# check the equality of the data types 
movie_mask.dtypes==movie[movie['title_year']>2010].dtypes
Out[55]: 
color                         True
director_name                 True
num_critic_for_reviews        True
duration                      True
director_facebook_likes       True
actor_3_facebook_likes        True
actor_2_name                  True
actor_1_facebook_likes        True
gross                         True
genres                        True
actor_1_name                  True
num_voted_users              False
cast_total_facebook_likes    False
actor_3_name                  True
facenumber_in_poster          True
plot_keywords                 True
movie_imdb_link               True
num_user_for_reviews          True
language                      True
country                       True
content_rating                True
budget                        True
title_year                    True
actor_2_facebook_likes        True
imdb_score                    True
aspect_ratio                  True
movie_facebook_likes         False
dtype: bool
# in pandas's 'testing' module, there is a function,assert_frame_equal, that allows you to check the equality Series and DataFrame without also checking the equality of the data types

from pandas.testing import assert_frame_equal
assert_frame_equal(movie_mask,movie[movie['title_year']<2010],check_dtype=False)

在这里插入图片描述

在这里插入图片描述
[239(260/627)]

Selecting with Booleans, integer location, and labels

we will filter both rows and columns with the .iloc and .loc attributes.

# create a Boolean array matching all movies with a content rating of G and an IMDB score less than 4
c1=movie['content_rating']=='G'
c2=movie['imdb_score']<4
criteria=c1&c2
# pass these criteria to .loc to filter the rows 
movie_loc=movie.loc[criteria]
movie_loc.head()
Out[69]: 
                                color  ... movie_facebook_likes
movie_title                            ...                     
The True Story of Puss'N Boots  Color  ...                   90
Doogal                          Color  ...                  346
Thomas and the Magic Railroad   Color  ...                  663
Barney's Great Adventure        Color  ...                  436
Justin Bieber: Never Say Never  Color  ...                62000
[5 rows x 27 columns]
movie_loc.equals(movie[criteria])
Out[70]: True
# check this DataFrame is equal to one generated directly from the indexing operator
# attempt the same Boolean indexing with .iloc indexer
movie_iloc=movie.iloc[criteria.to_numpy()]
# we cannot directly use a series of boolean because of the index
movie_loc.equals(movie.iloc[criteria.to_numpy()])
Out[75]: True
# select particular column with boolean indexing 
# here, select all the columns that have a data type of 64-bit intergers
import numpy as np
criteria_col=movie.dtypes==np.int64
movie.loc[:,criteria_col].head()
Out[81]: 
                                            num_voted_users  ...  movie_facebook_likes
movie_title                                                  ...                      
Avatar                                               886204  ...                 33000
Pirates of the Caribbean: At World's End             471220  ...                     0
Spectre                                              275868  ...                 85000
The Dark Knight Rises                               1144337  ...                164000
Star Wars: Episode VII - The Force Awakens                8  ...                     0
[5 rows x 3 columns]
# as criteria_col is a Series, which always has an index, you must use the underlying ndarry to make it work with .iloc
movie.iloc[:,criteria_col.to_numpy()].head()
Out[83]: 
                                            num_voted_users  ...  movie_facebook_likes
movie_title                                                  ...                      
Avatar                                               886204  ...                 33000
Pirates of the Caribbean: At World's End             471220  ...                     0
Spectre                                              275868  ...                 85000
The Dark Knight Rises                               1144337  ...                164000
Star Wars: Episode VII - The Force Awakens                8  ...                     0
[5 rows x 3 columns]
# when using .loc, you can use a Boolean array to select rows and specify the columns you want with a list of labels
cols=['content_rating','imdb_score','title_year','gross']
movie.loc[criteria,cols]
Out[86]: 
                               content_rating  ...       gross
movie_title                                    ...            
The True Story of Puss'N Boots              G  ...         NaN
Doogal                                      G  ...   7382993.0
Thomas and the Magic Railroad               G  ...  15911333.0
Barney's Great Adventure                    G  ...  11144518.0
Justin Bieber: Never Say Never              G  ...  73000942.0
Sunday School Musical                       G  ...         NaN
[6 rows x 4 columns]
# create the same operation with .iloc, and you need to specify the position of the columns
# use a list comprehension to loop through all th edesired column names to find their integer location with the index method .get_loc
col_index=[movie.columns.get_loc(col) for col in cols]
col_index
Out[89]: [20, 24, 22, 8]

movie.iloc[criteria.to_numpy(),col_index].sort_values("imdb_score")
Out[91]: 
                               content_rating  ...       gross
movie_title                                    ...            
Justin Bieber: Never Say Never              G  ...  73000942.0
Sunday School Musical                       G  ...         NaN
Doogal                                      G  ...   7382993.0
Barney's Great Adventure                    G  ...  11144518.0
The True Story of Puss'N Boots              G  ...         NaN
Thomas and the Magic Railroad               G  ...  15911333.0
[6 rows x 4 columns]
# both .loc and .iloc attributes have some support filtering with 
# with the caveat that .iloc cannot be passed a Series but the underlying ndarray
# take a look at the one-dimensional ndarray underlying criteria
a=criteria.to_numpy()
a
Out[96]: array([False, False, False, ..., False, False, False])
len(a),len(criteria)
Out[97]: (4916, 4916)
# the array is the same length as the Series, which is the same length as the movie DataFrame
# the integer location for the Boolean array aligns with the integer location of the DataFrame
# and the filter happens as expected 
# these arrays also work with the .loc attribute as well, but they are a necessity with .iloc