the uninitiated — people lacking knowledge or experience; outsiders
in-depth — thorough, comprehensive
when Series or DataFrames are combined, the data automatically aligns on each axis before any computation happens.
each axis of a DataFrame or a Series has an index object that labels the values.
all index objects, except for the MultiIndex, are single-dimensional data structures that combine the functionality of Python sets and NumPy ndarrays
# index object operations have quite a bit in common with both Series and ndarrays
# indexes are immutable: their values cannot be changed once created
import pandas as pd
import numpy as np
college=pd.read_csv('college.csv')
# a variable columns holds the column index
columns=college.columns
columns
Out[7]:
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
dtype='object')
# use the .values attribute to access the underlying NumPy ndarray
columns.values
Out[9]:
array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
'GRAD_DEBT_MDN_SUPP'], dtype=object)
# indexes share many of the same methods as Series and DataFrames
type(columns)
Out[11]: pandas.core.indexes.base.Index
columns.min(),columns.isnull().sum()
Out[12]: ('CITY', 0)
# Indexes are immutable objects: after creation, trying to change an index value fails
# indexes support the set operations: union, intersection, difference, and symmetric_difference
c1=columns[:4]
c1
Out[17]: Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')
c2=columns[2:6]
c2
Out[19]: Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')
c1.union(c2)# or 'c1 | c2'
Out[20]: Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')
c1.symmetric_difference(c2)# or 'c1 ^ c2'
Out[21]: Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')
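a quick self-contained demonstration of that immutability (a sketch, not from the original session): item assignment on an Index raises an error.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
try:
    idx[0] = 'z'            # item assignment is not allowed on an Index
except TypeError as e:
    print(e)                # Index does not support mutable operations
```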
Producing Cartesian products
whenever a Series or DataFrame operates with another Series or DataFrame, the indexes (both the row index and column index) of each object align first before any operation begins.
this alignment always creates a Cartesian product between the indexes unless the indexes are identical.
A Cartesian product of two sets is the set of all pairs formed by taking one element from each set.
when this alignment is performed, an index label with no match in the other object cannot be paired, so pandas treats it as a missing value. this changes the Series dtype: even if the data are all ints, the result is stored as floats.
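a minimal sketch of this behavior (toy Series, not the college data): labels present in only one Series become NaN, and the NaN forces an integer result to float.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('ab'), data=np.arange(2))   # int data
s2 = pd.Series(index=list('bc'), data=np.arange(2))
result = s1 + s2
# 'a' and 'c' have no match in the other Series, so they become NaN,
# and the NaN forces the dtype from int to float
print(result)
# a    NaN
# b    1.0
# c    NaN
# dtype: float64
```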
# a Cartesian product is not created when both indexes are unique, or when both contain exactly the same elements in the same order
# that statement covers two situations
# first situation: no Cartesian product is created, and the indexes instead align by their position
# notice below that each element aligns exactly by position and that the data type remains an integer
s1=pd.Series(index=list('aaabb'),data=np.arange(5))
s2=pd.Series(index=list('aaabb'),data=np.arange(5))
s1+s2
Out[30]:
a 0
a 2
a 4
b 6
b 8
dtype: int32
# second situation: same elements, different order
# note: np.array(5) is a 0-d array, which pandas broadcasts as the scalar 5 to every index label
s2=pd.Series(index=list('bbaaa'),data=np.array(5))
s1+s2
Out[28]:
a 5
a 5
a 5
a 6
a 6
a 6
a 7
a 7
a 7
b 8
b 8
b 9
b 9
dtype: int32
# if the elements of the index are identical but their order differs between the Series, a Cartesian product occurs
# be aware of this, as pandas can produce drastically different outcomes for what looks like the same operation
# note: if you do a .groupby with multiple columns and one is of type categorical, you will get a Cartesian product where each outer index has every inner index value
# array-like data structures in Python and other languages do not allow operations when the operating dimensions do not contain the same number of elements
# pandas allows this to happen by aligning the indexes first before completing the operation
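a small illustration of that difference (toy data, assumed for this sketch): NumPy refuses mismatched lengths, while pandas aligns on labels first.

```python
import numpy as np
import pandas as pd

a = np.arange(3)
b = np.arange(4)
try:
    a + b                   # NumPy: shapes (3,) and (4,) do not broadcast
except ValueError as e:
    print(e)

s1 = pd.Series(np.arange(3), index=list('abc'))
s2 = pd.Series(np.arange(4), index=list('abcd'))
print(s1 + s2)              # pandas aligns on labels; 'd' becomes NaN
```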
Exploding indexes
step 2 in fact creates a single object that is referred to by two variable names.
the expression employee['BASE_SALARY'] technically creates a view, not a brand-new copy; this can be verified with the is operator.
in pandas, a view is not a new object but a reference to another object, usually some subset of a DataFrame. this shared object can cause many issues.
to ensure that the variables reference completely different objects, use the .copy method and then verify that they are different objects with the is operator.
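a sketch of the idea with a hypothetical two-row frame (column caching behavior varies across pandas versions, so only the .copy comparison is certain):

```python
import pandas as pd

employee = pd.DataFrame({'BASE_SALARY': [55000.0, 72000.0]})
salary = employee['BASE_SALARY']          # a reference into the DataFrame
salary_copy = employee['BASE_SALARY'].copy()
print(salary is employee['BASE_SALARY'])  # often True: pandas caches the column
print(salary_copy is salary)              # False: .copy() creates a new object
```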
# Filling values with unequal indexes
# when two Series are added together using the plus operator and one of the index labels does not appear in the other, the resulting value is missing
# pandas has the .add method which provides an option to fill the missing value
# we add together multiple Series from the baseball dataset with unequal (but unique) indexes, using the .add method with the `fill_value`
# parameter to ensure that there are no missing values in the result
import pandas as pd
baseball_14=pd.read_csv('baseball14.csv',index_col='playerID')
baseball_15=pd.read_csv('baseball15.csv',index_col='playerID')
baseball_16=pd.read_csv('baseball16.csv',index_col='playerID')
baseball_14.head()
Out[17]:
yearID stint teamID lgID G AB ... SO IBB HBP SH SF GIDP
playerID ...
altuvjo01 2014 1 HOU AL 158 660 ... 53.0 7.0 5.0 1.0 5.0 20.0
cartech02 2014 1 HOU AL 145 507 ... 182.0 6.0 5.0 0.0 4.0 12.0
castrja01 2014 1 HOU AL 126 465 ... 151.0 1.0 9.0 1.0 3.0 11.0
corpoca01 2014 1 HOU AL 55 170 ... 37.0 0.0 3.0 1.0 2.0 3.0
dominma01 2014 1 HOU AL 157 564 ... 125.0 2.0 5.0 2.0 7.0 23.0
[5 rows x 21 columns]
# use .difference method on the index to discover which index labels are in `baseball_14` and not in `baseball_15` and vice versa
baseball_14.index.difference(baseball_15.index)
Out[19]:
Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
dtype='object', name='playerID')
baseball_15.index.difference(baseball_14.index)
Out[20]:
Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
'rasmuco01', 'tuckepr01', 'valbulu01'],
dtype='object', name='playerID')
# there are quite a few players unique to each index
# let's find out how many hits each player has in total over the three-year period
# the H column contains the number of hits
hits_14=baseball_14['H']
hits_15=baseball_15['H']
hits_16=baseball_16["H"]
hits_14.head()
Out[27]:
playerID
altuvjo01 225
cartech02 115
castrja01 103
corpoca01 40
dominma01 121
Name: H, dtype: int64
# let's first add together two Series using the plus operator
(hits_14+hits_15).head()
Out[29]:
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 NaN
corpoca01 NaN
Name: H, dtype: float64
# use the .add method with the 'fill_value' parameter to avoid missing values
hits_14.add(hits_15,fill_value=0).head()
Out[32]:
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 46.0
corpoca01 40.0
Name: H, dtype: float64
# add hits from 2016 by chaining the .add method once more
hits_14.add(hits_15.add(hits_16,fill_value=0),fill_value=0)
Out[34]:
playerID
altuvjo01 641.0
bregmal01 53.0
cartech02 193.0
castrja01 243.0
congeha01 46.0
corpoca01 40.0
correca01 266.0
Name: H, dtype: float64
hits_total=hits_14.add(hits_15.add(hits_16,fill_value=0),fill_value=0)
# check for missing values in the result
hits_total.hasnans
Out[36]: False
the .add method works in a similar way to the plus operator, but allows for more flexibility by providing the fill_value parameter to take the place of a non-matching index label.
# there will be occasions when each Series contains index labels that correspond to missing values
# in this specific instance, when the two Series are added, the index label will still correspond to a missing value regardless of whether the `fill_value` parameter is used
import numpy as np
s = pd.Series(index=list('abcd'),data=[np.nan,3,np.nan,1])
s
Out[42]:
a NaN
b 3.0
c NaN
d 1.0
dtype: float64
s1=pd.Series(index=list('abc'),data=[np.nan,6,10])
s.add(s1,fill_value=5)
Out[44]:
a NaN
b 9.0
c 15.0
d 6.0
dtype: float64
s1.add(s,fill_value=5)
Out[45]:
a NaN
b 9.0
c 15.0
d 6.0
dtype: float64
# the examples above show how to add Series with a single-level index
# adding two DataFrames will align both the index and the columns before computation, inserting missing values for non-matching labels
# first, select a few of the columns from the 2014 baseball dataset
df_14=baseball_14[["G","AB","R","H"]]
df_14.head()
Out[51]:
G AB R H
playerID
altuvjo01 158 660 85 225
cartech02 145 507 68 115
castrja01 126 465 43 103
corpoca01 55 170 22 40
dominma01 157 564 51 121
# second, select a few of the same and a few different columns from the 2015 baseball dataset
df_15=baseball_15[["AB","R","H",'HR']]
df_15.head()
Out[54]:
AB R H HR
playerID
altuvjo01 638 86 200 15
cartech02 391 50 78 24
castrja01 337 38 71 11
congeha01 201 25 46 11
correca01 387 52 108 22
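the addition itself is not shown above; here is a toy sketch (hypothetical data, not the baseball CSVs) of how two DataFrames align on both axes:

```python
import pandas as pd

df_a = pd.DataFrame({'AB': [10, 20], 'H': [3, 5]}, index=['p1', 'p2'])
df_b = pd.DataFrame({'H': [4, 7], 'HR': [1, 2]}, index=['p2', 'p3'])
# rows and columns both align; labels missing from either side give NaN
print(df_a + df_b)
# fill_value substitutes 0 when a label exists on only one side;
# a label missing from both sides is still NaN
print(df_a.add(df_b, fill_value=0))
```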
Adding columns from different DataFrames
any DataFrame can have new columns added to it.
whenever a DataFrame gets a new column from another DataFrame or a Series, the indexes align first, and then the new column is created.
# Task: append a new column containing the maximum salary of that employee's department
import pandas as pd
employee=pd.read_csv('employee.csv')
# select the 'DEPARTMENT' and 'BASE_SALARY' in a new DataFrame
dept_sal=employee[['DEPARTMENT','BASE_SALARY']]
# sort this smaller DataFrame by salary within each department
dept_sal=dept_sal.sort_values(["DEPARTMENT","BASE_SALARY"],ascending=[True,False])
dept_sal.head(5)
Out[8]:
DEPARTMENT BASE_SALARY
1494 Admn. & Regulatory Affairs 140416.0
237 Admn. & Regulatory Affairs 130416.0
1679 Admn. & Regulatory Affairs 103776.0
988 Admn. & Regulatory Affairs 72741.0
693 Admn. & Regulatory Affairs 66825.0
# use the .drop_duplicates method to keep the first row of each DEPARTMENT
dept_sal=dept_sal.drop_duplicates(subset=["DEPARTMENT"],keep='first')
# put the 'DEPARTMENT' into the index for each DataFrame
dept_sal=dept_sal.set_index("DEPARTMENT")
employee=employee.set_index('DEPARTMENT')
# now that the indexes contain matching values, we can add a new column to the employee DataFrame
# this step works because each row index of the left DataFrame `employee` aligns with one, and only one, index of the right DataFrame
# if the right DataFrame had duplicates of any department in its index, we would get a Cartesian product, causing an error
# remember: .assign returns a new DataFrame, so assign the result back to 'employee'
employee=employee.assign(MAX_DEPT_SAL=dept_sal['BASE_SALARY'])
# validate our result with the .query method to check whether any rows exist where `BASE_SALARY` is greater than `MAX_DEPT_SAL`
employee.query("BASE_SALARY>MAX_DEPT_SAL")
Out[17]:
Empty DataFrame
Columns: [UNIQUE_ID, POSITION_TITLE, BASE_SALARY, RACE, EMPLOYMENT_TYPE, GENDER, EMPLOYMENT_STATUS, HIRE_DATE, JOB_DATE, MAX_DEPT_SAL]
Index: []
on the surface this looks like a simple assignment, but it actually performs a Cartesian-product alignment; when the right-hand DataFrame has duplicate index labels, the join cannot be completed.
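a minimal sketch of that failure mode (toy data): assigning a Series whose index contains duplicate labels raises an error.

```python
import pandas as pd

left = pd.DataFrame({'val': [1, 2]}, index=['x', 'y'])
dup = pd.Series([10, 20, 30], index=['x', 'x', 'y'])  # duplicate label 'x'
try:
    left.assign(new=dup)    # reindexing from a duplicate axis fails
except ValueError as e:
    print(e)
```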
# during the alignment, if there is nothing for the DataFrame index to align to, the resulting value will be missing
employee.assign(MAX_SALARY2=dept_sal["BASE_SALARY"].head(3)).MAX_SALARY2.value_counts()
Out[21]:
140416.0 29
100000.0 11
64251.0 5
Name: MAX_SALARY2, dtype: int64
employee.assign(MAX_SALARY2=dept_sal["BASE_SALARY"].head(3)).MAX_SALARY2.value_counts(dropna=False)
Out[22]:
NaN 1955
140416.0 29
100000.0 11
64251.0 5
Name: MAX_SALARY2, dtype: int64
# notice that .value_counts(), by default, drops NaN
###############################
# there are more methods to add a column containing the max salary of each employee's department
# the second method: clearer to read
import pandas as pd
employee=pd.read_csv('employee.csv')
max_sal=employee.groupby("DEPARTMENT").BASE_SALARY.transform("max")
max_sal
Out[5]:
0 121862.0
1 107763.0
2 199596.0
3 210588.0
4 89194.0
...
1995 199596.0
1996 210588.0
1997 199596.0
1998 199596.0
1999 210588.0
Name: BASE_SALARY, Length: 2000, dtype: float64
type(max_sal)
Out[6]: pandas.core.series.Series
employee.assign(MAX_DEPARTMENT_SALARY=max_sal)
Out[7]:
UNIQUE_ID POSITION_TITLE ... JOB_DATE MAX_DEPARTMENT_SALARY
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2012-10-13 121862.0
1 1 LIBRARY ASSISTANT ... 2010-09-18 107763.0
2 2 POLICE OFFICER ... 2015-02-03 199596.0
3 3 ENGINEER/OPERATOR ... 1991-05-25 210588.0
4 4 ELECTRICIAN ... 1994-10-22 89194.0
... ... ... ... ...
1995 1995 POLICE OFFICER ... 2015-06-09 199596.0
1996 1996 COMMUNICATIONS CAPTAIN ... 2013-10-06 210588.0
1997 1997 POLICE OFFICER ... 2015-10-13 199596.0
1998 1998 POLICE OFFICER ... 2011-07-02 199596.0
1999 1999 FIRE FIGHTER ... 2010-07-12 210588.0
[2000 rows x 11 columns]
# this works because .transform preserves the original index
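the contrast can be seen with a toy frame (hypothetical data): .transform returns one value per original row, while .max collapses to one value per group.

```python
import pandas as pd

df = pd.DataFrame({'dept': ['a', 'a', 'b'], 'salary': [100, 300, 200]})
per_row = df.groupby('dept').salary.transform('max')  # index 0, 1, 2 preserved
per_group = df.groupby('dept').salary.max()           # index is 'a', 'b'
print(per_row.tolist())      # [300, 300, 200]
print(per_group.tolist())    # [300, 200]
```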
# the third method
# if you use a .groupby that creates a new index, you can use the .merge method to combine the data
# we tell it to merge on `DEPARTMENT` for the left side and on the index for the right side
mal_sal=employee.groupby("DEPARTMENT").BASE_SALARY.max()
mal_sal
Out[14]:
DEPARTMENT
Admn. & Regulatory Affairs 140416.0
City Controller's Office 64251.0
City Council 100000.0
Convention and Entertainment 38397.0
Dept of Neighborhoods (DON) 89221.0
Finance 96272.0
Fleet Management Department 125884.0
General Services Department 89194.0
Health & Human Services 180416.0
Housing and Community Devp. 98536.0
Houston Airport System (HAS) 186192.0
Houston Emergency Center (HEC) 84456.0
Houston Fire Department (HFD) 210588.0
Houston Information Tech Svcs 102019.0
Houston Police Department-HPD 199596.0
Human Resources Dept. 110547.0
Legal Department 275000.0
Library 107763.0
Mayor's Office 120750.0
Municipal Courts Department 121862.0
Parks & Recreation 85055.0
Planning & Development 68762.0
Public Works & Engineering-PWE 178331.0
Solid Waste Management 110005.0
Name: BASE_SALARY, dtype: float64
employee.merge(mal_sal.rename('MAX_DEPARTMENT_SALARY'),how="left",left_on="DEPARTMENT",right_index=True)
Out[16]:
UNIQUE_ID POSITION_TITLE ... JOB_DATE MAX_DEPARTMENT_SALARY
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2012-10-13 121862.0
1 1 LIBRARY ASSISTANT ... 2010-09-18 107763.0
2 2 POLICE OFFICER ... 2015-02-03 199596.0
3 3 ENGINEER/OPERATOR ... 1991-05-25 210588.0
4 4 ELECTRICIAN ... 1994-10-22 89194.0
... ... ... ... ...
1995 1995 POLICE OFFICER ... 2015-06-09 199596.0
1996 1996 COMMUNICATIONS CAPTAIN ... 2013-10-06 210588.0
1997 1997 POLICE OFFICER ... 2015-10-13 199596.0
1998 1998 POLICE OFFICER ... 2011-07-02 199596.0
1999 1999 FIRE FIGHTER ... 2010-07-12 210588.0
[2000 rows x 11 columns]
merge is more like a table join: parameters control whether it performs a left, right, inner, or outer join.
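a toy sketch of that left join (hypothetical data): rows with no match on the right get NaN.

```python
import pandas as pd

emp = pd.DataFrame({'dept': ['a', 'b', 'c'], 'salary': [1, 2, 3]})
dept_max = pd.Series([10, 20], index=['a', 'b'], name='dept_max')
# how='left' keeps every row of emp; dept 'c' has no match, so NaN
merged = emp.merge(dept_max, how='left', left_on='dept', right_index=True)
print(merged)
```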
culprit — the person or thing responsible for a problem
refactor — v. to restructure code
refactor the code to make it easier to read.
# the college dataset has many numeric columns describing different metrics about each school
# Task: discover which school has the maximum value for each numeric column
# read the data and set the institution name as the index
import pandas as pd
college=pd.read_csv('college.csv',index_col="INSTNM")
college.dtypes
Out[7]:
CITY object
STABBR object
HBCU float64
MENONLY float64
WOMENONLY float64
RELAFFIL int64
SATVRMID float64
SATMTMID float64
DISTANCEONLY float64
UGDS float64
UGDS_WHITE float64
UGDS_BLACK float64
UGDS_HISP float64
UGDS_ASIAN float64
UGDS_AIAN float64
UGDS_NHPI float64
UGDS_2MOR float64
UGDS_NRA float64
UGDS_UNKN float64
PPTUG_EF float64
CURROPER int64
PCTPELL float64
PCTFLOAN float64
UG25ABV float64
MD_EARN_WNE_P10 object
GRAD_DEBT_MDN_SUPP object
dtype: object
# the last two columns are not numeric; let's examine them further
# the .value_counts() method reveals the offending characters
college.MD_EARN_WNE_P10.value_counts()
Out[9]:
PrivacySuppressed 822
38800 151
21500 97
49200 78
27400 46
...
73900 1
60200 1
65500 1
11600 1
135100 1
Name: MD_EARN_WNE_P10, Length: 598, dtype: int64
# we uncover the PrivacySuppressed values that are causing havoc
college.MD_EARN_WNE_P10.dtypes
Out[10]: dtype('O')
college.MD_EARN_WNE_P10.apply(type)
Out[11]:
INSTNM
Alabama A & M University <class 'str'>
University of Alabama at Birmingham <class 'str'>
Amridge University <class 'str'>
University of Alabama in Huntsville <class 'str'>
Alabama State University <class 'str'>
...
SAE Institute of Technology San Francisco <class 'float'>
Rasmussen College - Overland Park <class 'float'>
National Personal Training Institute of Cleveland <class 'float'>
Bay Area Medical Academy - San Jose Satellite Location <class 'float'>
Excel Learning Center-San Antonio South <class 'float'>
Name: MD_EARN_WNE_P10, Length: 7535, dtype: object
set(college.MD_EARN_WNE_P10.apply(type))
Out[12]: {float, str}
# the culprit appears to be that some schools have privacy concerns about these two columns of data
# to force these columns to be numeric, use the pandas function to_numeric
# the errors='coerce' parameter converts any unparseable values to NaN
cols=["MD_EARN_WNE_P10","GRAD_DEBT_MDN_SUPP"]
for col in cols:
college[col]=pd.to_numeric(college[col],errors='coerce')
college.dtypes.loc[cols]
Out[19]:
MD_EARN_WNE_P10 float64
GRAD_DEBT_MDN_SUPP float64
dtype: object
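the coercion can be shown in isolation (toy strings, not the college file):

```python
import pandas as pd

s = pd.Series(['38800', 'PrivacySuppressed', '21500'])
# errors='coerce' turns unparseable strings into NaN instead of raising
print(pd.to_numeric(s, errors='coerce'))
```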
# use the .select_dtypes method to filter for only numeric columns
# this will exclude the 'STABBR' and 'CITY' columns,where a maximum value doesn't make sense with this problem
college_n=college.select_dtypes("number")
college_n.head()
Out[23]:
HBCU ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Alabama A & M University 1.0 ... 33888.0
University of Alabama at Birmingham 0.0 ... 21941.5
Amridge University 0.0 ... 23370.0
University of Alabama in Huntsville 0.0 ... 24097.0
Alabama State University 1.0 ... 33118.5
[5 rows x 24 columns]
# several columns have only binary (0 or 1) values, which will not provide useful information for maximum values
# to find these columns, create a Boolean Series marking all the columns that have two unique values
# the .nunique method shows how many unique values each column has
college_n.nunique()
Out[27]:
HBCU 2
MENONLY 2
WOMENONLY 2
RELAFFIL 2
SATVRMID 163
SATMTMID 167
DISTANCEONLY 2
UGDS 2932
UGDS_WHITE 4397
UGDS_BLACK 3242
UGDS_HISP 2809
UGDS_ASIAN 1254
UGDS_AIAN 601
UGDS_NHPI 363
UGDS_2MOR 957
UGDS_NRA 920
UGDS_UNKN 1517
PPTUG_EF 3420
CURROPER 2
PCTPELL 4422
PCTFLOAN 4155
UG25ABV 4285
MD_EARN_WNE_P10 597
GRAD_DEBT_MDN_SUPP 2037
dtype: int64
binary_only=college_n.nunique()==2
binary_only.head()
Out[29]:
HBCU True
MENONLY True
WOMENONLY True
RELAFFIL True
SATVRMID False
dtype: bool
binary_only[binary_only]
Out[30]:
HBCU True
MENONLY True
WOMENONLY True
RELAFFIL True
DISTANCEONLY True
CURROPER True
dtype: bool
# use a Boolean array to create a list of the binary columns
binary_cols=binary_only[binary_only].index
binary_cols
Out[33]: Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER'], dtype='object')
# drop the binary columns using the .drop method
college_n2=college_n.drop(columns=binary_cols)
college_n2.head()
Out[36]:
SATVRMID ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Alabama A & M University 424.0 ... 33888.0
University of Alabama at Birmingham 570.0 ... 21941.5
Amridge University NaN ... 23370.0
University of Alabama in Huntsville 595.0 ... 24097.0
Alabama State University 425.0 ... 33118.5
[5 rows x 18 columns]
# use the .idxmax method to find the index label of the maximum value for each column
max_cols=college_n2.idxmax()
max_cols
Out[39]:
SATVRMID California Institute of Technology
SATMTMID California Institute of Technology
UGDS University of Phoenix-Arizona
UGDS_WHITE Mr Leon's School of Hair Design-Moscow
UGDS_BLACK Velvatex College of Beauty Culture
UGDS_HISP Thunderbird School of Global Management
UGDS_ASIAN Cosmopolitan Beauty and Tech School
UGDS_AIAN Haskell Indian Nations University
UGDS_NHPI Palau Community College
UGDS_2MOR LIU Brentwood
UGDS_NRA California University of Management and Sciences
UGDS_UNKN Le Cordon Bleu College of Culinary Arts-San Fr...
PPTUG_EF Thunderbird School of Global Management
PCTPELL MTI Business College Inc
PCTFLOAN ABC Beauty College Inc
UG25ABV Dongguk University-Los Angeles
MD_EARN_WNE_P10 Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP Southwest University of Visual Arts-Tucson
dtype: object
# call the .unique method on the 'max_cols' Series
# this returns an ndarray of the unique index values (school names) in college_n2 that hold a maximum value
unique_max_cols=max_cols.unique()
type(max_cols)
Out[43]: pandas.core.series.Series
max_cols.index[:2]
Out[46]: Index(['SATVRMID', 'SATMTMID'], dtype='object')
type(unique_max_cols)
Out[47]: numpy.ndarray
# the .unique method finds unique values rather than unique index labels, and returns an ndarray
havoc — widespread destruction; chaos
override — to reject, overrule, or disregard
Replicating idxmax with method chaining
replicate — to copy, duplicate
……a little complex…
Finding the most common maximum
Hispanic — of Spanish or Latin American descent
import pandas as pd
college=pd.read_csv('college.csv',index_col="INSTNM")
college_ugds=college.filter(like="UGDS_")
college_ugds.head()
Out[7]:
UGDS_WHITE ... UGDS_UNKN
INSTNM ...
Alabama A & M University 0.0333 ... 0.0138
University of Alabama at Birmingham 0.5922 ... 0.0100
Amridge University 0.2990 ... 0.2715
University of Alabama in Huntsville 0.6988 ... 0.0350
Alabama State University 0.0158 ... 0.0137
[5 rows x 9 columns]
# apply the .idxmax method to each row of data to find the column with the largest value
# we need to alter its default behavior with the axis parameter
highest_percentage_rage=college_ugds.idxmax(axis="columns")
highest_percentage_rage
Out[9]:
INSTNM
Alabama A & M University UGDS_BLACK
University of Alabama at Birmingham UGDS_WHITE
Amridge University UGDS_BLACK
University of Alabama in Huntsville UGDS_WHITE
Alabama State University UGDS_BLACK
...
SAE Institute of Technology San Francisco NaN
Rasmussen College - Overland Park NaN
National Personal Training Institute of Cleveland NaN
Bay Area Medical Academy - San Jose Satellite Location NaN
Excel Learning Center-San Antonio South NaN
Length: 7535, dtype: object
highest=college_ugds.idxmax()
highest.head()
Out[11]:
UGDS_WHITE Mr Leon's School of Hair Design-Moscow
UGDS_BLACK Velvatex College of Beauty Culture
UGDS_HISP Thunderbird School of Global Management
UGDS_ASIAN Cosmopolitan Beauty and Tech School
UGDS_AIAN Haskell Indian Nations University
dtype: object
# fix the misspelled variable name
highest_percentage_race=highest_percentage_rage
highest_percentage_race is highest_percentage_rage
Out[14]: True
# both names refer to the same object
# .idxmax applied along the column axis returns, for each college, the race column with the highest percentage
# use the .value_counts method to return the distribution of maximum occurrences
# pass normalize=True since we are interested in the relative frequency (the distribution sums to 1) rather than raw counts
highest_percentage_race.value_counts(normalize=True)
Out[18]:
UGDS_WHITE 0.670352
UGDS_BLACK 0.151586
UGDS_HISP 0.129473
UGDS_UNKN 0.023422
UGDS_ASIAN 0.012074
UGDS_AIAN 0.006110
UGDS_NRA 0.004073
UGDS_NHPI 0.001746
UGDS_2MOR 0.001164
dtype: float64
# for those schools with more Black students than any other race,
# what is the distribution of its second highest race percentage
college_ugds[highest_percentage_race=="UGDS_BLACK"].drop(columns="UGDS_BLACK").idxmax(axis="columns").value_counts(normalize=True)
Out[21]:
UGDS_WHITE 0.661228
UGDS_HISP 0.230326
UGDS_UNKN 0.071977
UGDS_NRA 0.018234
UGDS_ASIAN 0.009597
UGDS_2MOR 0.006718
UGDS_NHPI 0.000960
UGDS_AIAN 0.000960
dtype: float64