the uninitiated — people lacking knowledge or experience; outsiders
in-depth — thorough, comprehensive
when Series or DataFrames are combined, the data automatically aligns on each axis before any computation happens.
each axis of a DataFrame or a Series has an index object that labels the values.
all index objects, except for the MultiIndex, are single-dimensional data structures that combine the functionality of Python sets and NumPy ndarrays
# index object operations have quite a bit in common with both Series and ndarrays
# indexes are immutable: their values cannot be changed once created
import pandas as pd
import numpy as np
college=pd.read_csv('college.csv')
# a variable columns holds the column index
columns=college.columns
columns
Out[7]:
Index(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL',
'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS', 'UGDS_WHITE',
'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI',
'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF', 'CURROPER', 'PCTPELL',
'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10', 'GRAD_DEBT_MDN_SUPP'],
dtype='object')
# use the .values attribute to access the underlying NumPy ndarray
columns.values
Out[9]:
array(['INSTNM', 'CITY', 'STABBR', 'HBCU', 'MENONLY', 'WOMENONLY',
'RELAFFIL', 'SATVRMID', 'SATMTMID', 'DISTANCEONLY', 'UGDS',
'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_ASIAN', 'UGDS_AIAN',
'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA', 'UGDS_UNKN', 'PPTUG_EF',
'CURROPER', 'PCTPELL', 'PCTFLOAN', 'UG25ABV', 'MD_EARN_WNE_P10',
'GRAD_DEBT_MDN_SUPP'], dtype=object)
# indexes share many of the same methods as Series and DataFrames
type(columns)
Out[11]: pandas.core.indexes.base.Index
columns.min(),columns.isnull().sum()
Out[12]: ('CITY', 0)
# Indexes are immutable objects: after creation, trying to change an index value fails
# indexes support the set operations: union, intersection, difference, and symmetric_difference
c1=columns[:4]
c1
Out[17]: Index(['INSTNM', 'CITY', 'STABBR', 'HBCU'], dtype='object')
c2=columns[2:6]
c2
Out[19]: Index(['STABBR', 'HBCU', 'MENONLY', 'WOMENONLY'], dtype='object')
c1.union(c2)# or 'c1 | c2'
Out[20]: Index(['CITY', 'HBCU', 'INSTNM', 'MENONLY', 'STABBR', 'WOMENONLY'], dtype='object')
c1.symmetric_difference(c2)# or 'c1 ^ c2'
Out[21]: Index(['CITY', 'INSTNM', 'MENONLY', 'WOMENONLY'], dtype='object')
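a quick self-contained demonstration of that immutability (a sketch, not from the original session): item assignment on an Index raises an error.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
try:
    idx[0] = 'z'            # item assignment is not allowed on an Index
except TypeError as e:
    print(e)                # Index does not support mutable operations
```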
Producing Cartesian products
whenever a Series or DataFrame operates with another Series or DataFrame, the indexes (both the row index and column index) of each object align first before any operation begins.
this alignment always creates a Cartesian product between the indexes unless the indexes are identical.
A Cartesian product of two sets is the set of all pairs formed by taking one element from each set.
when this alignment is performed, an index label with no match in the other object cannot be paired, so pandas treats it as a missing value. this changes the Series dtype: even if the data are all ints, the result is stored as floats.
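a minimal sketch of this behavior (toy Series, not the college data): labels present in only one Series become NaN, and the NaN forces an integer result to float.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(index=list('ab'), data=np.arange(2))   # int data
s2 = pd.Series(index=list('bc'), data=np.arange(2))
result = s1 + s2
# 'a' and 'c' have no match in the other Series, so they become NaN,
# and the NaN forces the dtype from int to float
print(result)
# a    NaN
# b    1.0
# c    NaN
# dtype: float64
```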
# a Cartesian product is not created when both indexes are unique, or when both contain exactly the same elements in the same order
# that statement covers two situations
# first situation: no Cartesian product is created, and the indexes instead align by their position
# notice below that each element aligns exactly by position and that the data type remains an integer
s1=pd.Series(index=list('aaabb'),data=np.arange(5))
s2=pd.Series(index=list('aaabb'),data=np.arange(5))
s1+s2
Out[30]:
a 0
a 2
a 4
b 6
b 8
dtype: int32
# second situation: same elements, different order
# note: np.array(5) is a 0-d array, which pandas broadcasts as the scalar 5 to every index label
s2=pd.Series(index=list('bbaaa'),data=np.array(5))
s1+s2
Out[28]:
a 5
a 5
a 5
a 6
a 6
a 6
a 7
a 7
a 7
b 8
b 8
b 9
b 9
dtype: int32
# if the elements of the index are identical but their order differs between the Series, a Cartesian product occurs
# be aware of this, as pandas can produce drastically different outcomes for what looks like the same operation
# note: if you do a .groupby with multiple columns and one is of type categorical, you will get a Cartesian product where each outer index has every inner index value
# array-like data structures in Python and other languages do not allow operations when the operating dimensions do not contain the same number of elements
# pandas allows this to happen by aligning the indexes first before completing the operation
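a small illustration of that difference (toy data, assumed for this sketch): NumPy refuses mismatched lengths, while pandas aligns on labels first.

```python
import numpy as np
import pandas as pd

a = np.arange(3)
b = np.arange(4)
try:
    a + b                   # NumPy: shapes (3,) and (4,) do not broadcast
except ValueError as e:
    print(e)

s1 = pd.Series(np.arange(3), index=list('abc'))
s2 = pd.Series(np.arange(4), index=list('abcd'))
print(s1 + s2)              # pandas aligns on labels; 'd' becomes NaN
```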
Exploding indexes
step 2 in fact creates a single object that is referred to by two variable names.
the expression employee['BASE_SALARY'] technically creates a view, not a brand-new copy; this can be verified with the is operator.
in pandas, a view is not a new object but a reference to another object, usually some subset of a DataFrame. this shared object can cause many issues.
to ensure that the variables reference completely different objects, use the .copy method and then verify that they are different objects with the is operator.
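a sketch of the idea with a hypothetical two-row frame (column caching behavior varies across pandas versions, so only the .copy comparison is certain):

```python
import pandas as pd

employee = pd.DataFrame({'BASE_SALARY': [55000.0, 72000.0]})
salary = employee['BASE_SALARY']          # a reference into the DataFrame
salary_copy = employee['BASE_SALARY'].copy()
print(salary is employee['BASE_SALARY'])  # often True: pandas caches the column
print(salary_copy is salary)              # False: .copy() creates a new object
```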
# Filling values with unequal indexes
# when two Series are added together using the plus operator and one of the index labels does not appear in the other, the resulting value is missing
# pandas has the .add method which provides an option to fill the missing value
# we add together multiple Series from the baseball dataset with unequal (but unique) indexes, using the .add method with the `fill_value`
# parameter to ensure that there are no missing values in the result
import pandas as pd
baseball_14=pd.read_csv('baseball14.csv',index_col='playerID')
baseball_15=pd.read_csv('baseball15.csv',index_col='playerID')
baseball_16=pd.read_csv('baseball16.csv',index_col='playerID')
baseball_14.head()
Out[17]:
yearID stint teamID lgID G AB ... SO IBB HBP SH SF GIDP
playerID ...
altuvjo01 2014 1 HOU AL 158 660 ... 53.0 7.0 5.0 1.0 5.0 20.0
cartech02 2014 1 HOU AL 145 507 ... 182.0 6.0 5.0 0.0 4.0 12.0
castrja01 2014 1 HOU AL 126 465 ... 151.0 1.0 9.0 1.0 3.0 11.0
corpoca01 2014 1 HOU AL 55 170 ... 37.0 0.0 3.0 1.0 2.0 3.0
dominma01 2014 1 HOU AL 157 564 ... 125.0 2.0 5.0 2.0 7.0 23.0
[5 rows x 21 columns]
# use .difference method on the index to discover which index labels are in `baseball_14` and not in `baseball_15` and vice versa
baseball_14.index.difference(baseball_15.index)
Out[19]:
Index(['corpoca01', 'dominma01', 'fowlede01', 'grossro01', 'guzmaje01',
'hoeslj01', 'krausma01', 'preslal01', 'singljo02'],
dtype='object', name='playerID')
baseball_15.index.difference(baseball_14.index)
Out[20]:
Index(['congeha01', 'correca01', 'gattiev01', 'gomezca01', 'lowrije01',
'rasmuco01', 'tuckepr01', 'valbulu01'],
dtype='object', name='playerID')
# there are quite a few players unique to each index
# let's find out how many hits each player has in total over the three-year period
# the H column contains the number of hits
hits_14=baseball_14['H']
hits_15=baseball_15['H']
hits_16=baseball_16["H"]
hits_14.head()
Out[27]:
playerID
altuvjo01 225
cartech02 115
castrja01 103
corpoca01 40
dominma01 121
Name: H, dtype: int64
# let's first add together two Series using the plus operator
(hits_14+hits_15).head()
Out[29]:
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 NaN
corpoca01 NaN
Name: H, dtype: float64
# use the .add method with the 'fill_value' parameter to avoid missing values
hits_14.add(hits_15,fill_value=0).head()
Out[32]:
playerID
altuvjo01 425.0
cartech02 193.0
castrja01 174.0
congeha01 46.0
corpoca01 40.0
Name: H, dtype: float64
# add hits from 2016 by chaining the .add method once more
hits_14.add(hits_15.add(hits_16,fill_value=0),fill_value=0)
Out[34]:
playerID
altuvjo01 641.0
bregmal01 53.0
cartech02 193.0
castrja01 243.0
congeha01 46.0
corpoca01 40.0
correca01 266.0
Name: H, dtype: float64
hits_total=hits_14.add(hits_15.add(hits_16,fill_value=0),fill_value=0)
# check for missing values in the result
hits_total.hasnans
Out[36]: False
the .add method works in a similar way to the plus operator, but allows for more flexibility by providing the fill_value parameter to take the place of a non-matching index label.
# there will be occasions when each Series contains index labels that correspond to missing values
# in this specific instance, when the two Series are added, the index label will still correspond to a missing value regardless of whether the `fill_value` parameter is used
import numpy as np
s = pd.Series(index=list('abcd'),data=[np.nan,3,np.nan,1])
s
Out[42]:
a NaN
b 3.0
c NaN
d 1.0
dtype: float64
s1=pd.Series(index=list('abc'),data=[np.nan,6,10])
s.add(s1,fill_value=5)
Out[44]:
a NaN
b 9.0
c 15.0
d 6.0
dtype: float64
s1.add(s,fill_value=5)
Out[45]:
a NaN
b 9.0
c 15.0
d 6.0
dtype: float64
# the examples above show how to add Series with a single-level index
# adding two DataFrames will align both the index and the columns before computation, inserting missing values for non-matching labels
# first, select a few of the columns from the 2014 baseball dataset
df_14=baseball_14[["G","AB","R","H"]]
df_14.head()
Out[51]:
G AB R H
playerID
altuvjo01 158 660 85 225
cartech02 145 507 68 115
castrja01 126 465 43 103
corpoca01 55 170 22 40
dominma01 157 564 51 121
# second, select a few of the same and a few different columns from the 2015 baseball dataset
df_15=baseball_15[["AB","R","H",'HR']]
df_15.head()
Out[54]:
AB R H HR
playerID
altuvjo01 638 86 200 15
cartech02 391 50 78 24
castrja01 337 38 71 11
congeha01 201 25 46 11
correca01 387 52 108 22
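the addition itself is not shown above; here is a toy sketch (hypothetical data, not the baseball CSVs) of how two DataFrames align on both axes:

```python
import pandas as pd

df_a = pd.DataFrame({'AB': [10, 20], 'H': [3, 5]}, index=['p1', 'p2'])
df_b = pd.DataFrame({'H': [4, 7], 'HR': [1, 2]}, index=['p2', 'p3'])
# rows and columns both align; labels missing from either side give NaN
print(df_a + df_b)
# fill_value substitutes 0 when a label exists on only one side;
# a label missing from both sides is still NaN
print(df_a.add(df_b, fill_value=0))
```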
Adding columns from different DataFrames
any DataFrame can have new columns added to it.
whenever a DataFrame gets a new column from another DataFrame or a Series, the indexes align first, and then the new column is created.
# Task: append a new column containing the maximum salary of that employee's department
import pandas as pd
employee=pd.read_csv('employee.csv')
# select the 'DEPARTMENT' and 'BASE_SALARY' in a new DataFrame
dept_sal=employee[['DEPARTMENT','BASE_SALARY']]
# sort this smaller DataFrame by salary within each department
dept_sal=dept_sal.sort_values(["DEPARTMENT","BASE_SALARY"],ascending=[True,False])
dept_sal.head(5)
Out[8]:
DEPARTMENT BASE_SALARY
1494 Admn. & Regulatory Affairs 140416.0
237 Admn. & Regulatory Affairs 130416.0
1679 Admn. & Regulatory Affairs 103776.0
988 Admn. & Regulatory Affairs 72741.0
693 Admn. & Regulatory Affairs 66825.0
# use the .drop_duplicates method to keep the first row of each DEPARTMENT
dept_sal=dept_sal.drop_duplicates(subset=["DEPARTMENT"],keep='first')
# put the 'DEPARTMENT' into the index for each DataFrame
dept_sal=dept_sal.set_index("DEPARTMENT")
employee=employee.set_index('DEPARTMENT')
# now that the indexes contain matching values, we can add a new column to the employee DataFrame
# this step works because each row index of the left DataFrame `employee` aligns with one, and only one, index of the right DataFrame
# if the right DataFrame had duplicates of any department in its index, we would get a Cartesian product, causing an error
# remember: .assign returns a new DataFrame, so assign the result back to 'employee'
employee=employee.assign(MAX_DEPT_SAL=dept_sal['BASE_SALARY'])
# validate our result with the .query method to check whether any rows exist where `BASE_SALARY` is greater than `MAX_DEPT_SAL`
employee.query("BASE_SALARY>MAX_DEPT_SAL")
Out[17]:
Empty DataFrame
Columns: [UNIQUE_ID, POSITION_TITLE, BASE_SALARY, RACE, EMPLOYMENT_TYPE, GENDER, EMPLOYMENT_STATUS, HIRE_DATE, JOB_DATE, MAX_DEPT_SAL]
Index: []
on the surface this looks like a simple assignment, but it actually performs a Cartesian-product alignment; when the right-hand DataFrame has duplicate index labels, the join cannot be completed.
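a minimal sketch of that failure mode (toy data): assigning a Series whose index contains duplicate labels raises an error.

```python
import pandas as pd

left = pd.DataFrame({'val': [1, 2]}, index=['x', 'y'])
dup = pd.Series([10, 20, 30], index=['x', 'x', 'y'])  # duplicate label 'x'
try:
    left.assign(new=dup)    # reindexing from a duplicate axis fails
except ValueError as e:
    print(e)
```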
# during the alignment, if there is nothing for the DataFrame index to align to, the resulting value will be missing
employee.assign(MAX_SALARY2=dept_sal["BASE_SALARY"].head(3)).MAX_SALARY2.value_counts()
Out[21]:
140416.0 29
100000.0 11
64251.0 5
Name: MAX_SALARY2, dtype: int64
employee.assign(MAX_SALARY2=dept_sal["BASE_SALARY"].head(3)).MAX_SALARY2.value_counts(dropna=False)
Out[22]:
NaN 1955
140416.0 29
100000.0 11
64251.0 5
Name: MAX_SALARY2, dtype: int64
# notice that .value_counts(), by default, drops NaN
###############################
# there are more methods to add a column containing the max salary of each employee's department
# the second method: clearer to read
import pandas as pd
employee=pd.read_csv('employee.csv')
max_sal=employee.groupby("DEPARTMENT").BASE_SALARY.transform("max")
max_sal
Out[5]:
0 121862.0
1 107763.0
2 199596.0
3 210588.0
4 89194.0
...
1995 199596.0
1996 210588.0
1997 199596.0
1998 199596.0
1999 210588.0
Name: BASE_SALARY, Length: 2000, dtype: float64
type(max_sal)
Out[6]: pandas.core.series.Series
employee.assign(MAX_DEPARTMENT_SALARY=max_sal)
Out[7]:
UNIQUE_ID POSITION_TITLE ... JOB_DATE MAX_DEPARTMENT_SALARY
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2012-10-13 121862.0
1 1 LIBRARY ASSISTANT ... 2010-09-18 107763.0
2 2 POLICE OFFICER ... 2015-02-03 199596.0
3 3 ENGINEER/OPERATOR ... 1991-05-25 210588.0
4 4 ELECTRICIAN ... 1994-10-22 89194.0
... ... ... ... ...
1995 1995 POLICE OFFICER ... 2015-06-09 199596.0
1996 1996 COMMUNICATIONS CAPTAIN ... 2013-10-06 210588.0
1997 1997 POLICE OFFICER ... 2015-10-13 199596.0
1998 1998 POLICE OFFICER ... 2011-07-02 199596.0
1999 1999 FIRE FIGHTER ... 2010-07-12 210588.0
[2000 rows x 11 columns]
# this works because .transform preserves the original index
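the contrast can be seen with a toy frame (hypothetical data): .transform returns one value per original row, while .max collapses to one value per group.

```python
import pandas as pd

df = pd.DataFrame({'dept': ['a', 'a', 'b'], 'salary': [100, 300, 200]})
per_row = df.groupby('dept').salary.transform('max')  # index 0, 1, 2 preserved
per_group = df.groupby('dept').salary.max()           # index is 'a', 'b'
print(per_row.tolist())      # [300, 300, 200]
print(per_group.tolist())    # [300, 200]
```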
# the third method
# if you use a .groupby that creates a new index, you can use the .merge method to combine the data
# we tell it to merge on `DEPARTMENT` for the left side and on the index for the right side
mal_sal=employee.groupby("DEPARTMENT").BASE_SALARY.max()
mal_sal
Out[14]:
DEPARTMENT
Admn. & Regulatory Affairs 140416.0
City Controller's Office 64251.0
City Council 100000.0
Convention and Entertainment 38397.0
Dept of Neighborhoods (DON) 89221.0
Finance 96272.0
Fleet Management Department 125884.0
General Services Department 89194.0
Health & Human Services 180416.0
Housing and Community Devp. 98536.0
Houston Airport System (HAS) 186192.0
Houston Emergency Center (HEC) 84456.0
Houston Fire Department (HFD) 210588.0
Houston Information Tech Svcs 102019.0
Houston Police Department-HPD 199596.0
Human Resources Dept. 110547.0
Legal Department 275000.0
Library 107763.0
Mayor's Office 120750.0
Municipal Courts Department 121862.0
Parks & Recreation 85055.0
Planning & Development 68762.0
Public Works & Engineering-PWE 178331.0
Solid Waste Management 110005.0
Name: BASE_SALARY, dtype: float64
employee.merge(mal_sal.rename('MAX_DEPARTMENT_SALARY'),how="left",left_on="DEPARTMENT",right_index=True)
Out[16]:
UNIQUE_ID POSITION_TITLE ... JOB_DATE MAX_DEPARTMENT_SALARY
0 0 ASSISTANT DIRECTOR (EX LVL) ... 2012-10-13 121862.0
1 1 LIBRARY ASSISTANT ... 2010-09-18 107763.0
2 2 POLICE OFFICER ... 2015-02-03 199596.0
3 3 ENGINEER/OPERATOR ... 1991-05-25 210588.0
4 4 ELECTRICIAN ... 1994-10-22 89194.0
... ... ... ... ...
1995 1995 POLICE OFFICER ... 2015-06-09 199596.0
1996 1996 COMMUNICATIONS CAPTAIN ... 2013-10-06 210588.0
1997 1997 POLICE OFFICER ... 2015-10-13 199596.0
1998 1998 POLICE OFFICER ... 2011-07-02 199596.0
1999 1999 FIRE FIGHTER ... 2010-07-12 210588.0
[2000 rows x 11 columns]
merge is more like a table join: parameters control whether it performs a left, right, inner, or outer join.
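a toy sketch of that left join (hypothetical data): rows with no match on the right get NaN.

```python
import pandas as pd

emp = pd.DataFrame({'dept': ['a', 'b', 'c'], 'salary': [1, 2, 3]})
dept_max = pd.Series([10, 20], index=['a', 'b'], name='dept_max')
# how='left' keeps every row of emp; dept 'c' has no match, so NaN
merged = emp.merge(dept_max, how='left', left_on='dept', right_index=True)
print(merged)
```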
culprit — the person or thing responsible for a problem
refactor — v. to restructure code
refactor the code to make it easier to read.
# the college dataset has many numeric columns describing different metrics about each school
# Task: discover which school has the maximum value for each numeric column
# read the data and set the institution name as the index
import pandas as pd
college=pd.read_csv('college.csv',index_col="INSTNM")
college.dtypes
Out[7]:
CITY object
STABBR object
HBCU float64
MENONLY float64
WOMENONLY float64
RELAFFIL int64
SATVRMID float64
SATMTMID float64
DISTANCEONLY float64
UGDS float64
UGDS_WHITE float64
UGDS_BLACK float64
UGDS_HISP float64
UGDS_ASIAN float64
UGDS_AIAN float64
UGDS_NHPI float64
UGDS_2MOR float64
UGDS_NRA float64
UGDS_UNKN float64
PPTUG_EF float64
CURROPER int64
PCTPELL float64
PCTFLOAN float64
UG25ABV float64
MD_EARN_WNE_P10 object
GRAD_DEBT_MDN_SUPP object
dtype: object
# the last two columns are not numeric; let's examine them further
# the .value_counts() method reveals the offending characters
college.MD_EARN_WNE_P10.value_counts()
Out[9]:
PrivacySuppressed 822
38800 151
21500 97
49200 78
27400 46
...
73900 1
60200 1
65500 1
11600 1
135100 1
Name: MD_EARN_WNE_P10, Length: 598, dtype: int64
# we uncover the PrivacySuppressed values that are causing havoc
college.MD_EARN_WNE_P10.dtypes
Out[10]: dtype('O')
college.MD_EARN_WNE_P10.apply(type)
Out[11]:
INSTNM
Alabama A & M University <class 'str'>
University of Alabama at Birmingham <class 'str'>
Amridge University <class 'str'>
University of Alabama in Huntsville <class 'str'>
Alabama State University <class 'str'>
...
SAE Institute of Technology San Francisco <class 'float'>
Rasmussen College - Overland Park <class 'float'>
National Personal Training Institute of Cleveland <class 'float'>
Bay Area Medical Academy - San Jose Satellite Location <class 'float'>
Excel Learning Center-San Antonio South <class 'float'>
Name: MD_EARN_WNE_P10, Length: 7535, dtype: object
set(college.MD_EARN_WNE_P10.apply(type))
Out[12]: {float, str}
# the culprit appears to be that some schools have privacy concerns about these two columns of data
# to force these columns to be numeric, use the pandas function to_numeric
# the errors='coerce' parameter converts any unparseable values to NaN
cols=["MD_EARN_WNE_P10","GRAD_DEBT_MDN_SUPP"]
for col in cols:
college[col]=pd.to_numeric(college[col],errors='coerce')
college.dtypes.loc[cols]
Out[19]:
MD_EARN_WNE_P10 float64
GRAD_DEBT_MDN_SUPP float64
dtype: object
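the coercion can be shown in isolation (toy strings, not the college file):

```python
import pandas as pd

s = pd.Series(['38800', 'PrivacySuppressed', '21500'])
# errors='coerce' turns unparseable strings into NaN instead of raising
print(pd.to_numeric(s, errors='coerce'))
```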
# use the .select_dtypes method to filter for only numeric columns
# this will exclude the 'STABBR' and 'CITY' columns,where a maximum value doesn't make sense with this problem
college_n=college.select_dtypes("number")
college_n.head()
Out[23]:
HBCU ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Alabama A & M University 1.0 ... 33888.0
University of Alabama at Birmingham 0.0 ... 21941.5
Amridge University 0.0 ... 23370.0
University of Alabama in Huntsville 0.0 ... 24097.0
Alabama State University 1.0 ... 33118.5
[5 rows x 24 columns]
# several columns have only binary (0 or 1) values, which will not provide useful information for maximum values
# to find these columns, create a Boolean Series marking all the columns that have two unique values
# the .nunique method shows how many unique values each column has
college_n.nunique()
Out[27]:
HBCU 2
MENONLY 2
WOMENONLY 2
RELAFFIL 2
SATVRMID 163
SATMTMID 167
DISTANCEONLY 2
UGDS 2932
UGDS_WHITE 4397
UGDS_BLACK 3242
UGDS_HISP 2809
UGDS_ASIAN 1254
UGDS_AIAN 601
UGDS_NHPI 363
UGDS_2MOR 957
UGDS_NRA 920
UGDS_UNKN 1517
PPTUG_EF 3420
CURROPER 2
PCTPELL 4422
PCTFLOAN 4155
UG25ABV 4285
MD_EARN_WNE_P10 597
GRAD_DEBT_MDN_SUPP 2037
dtype: int64
binary_only=college_n.nunique()==2
binary_only.head()
Out[29]:
HBCU True
MENONLY True
WOMENONLY True
RELAFFIL True
SATVRMID False
dtype: bool
binary_only[binary_only]
Out[30]:
HBCU True
MENONLY True
WOMENONLY True
RELAFFIL True
DISTANCEONLY True
CURROPER True
dtype: bool
# use a Boolean array to create a list of the binary columns
binary_cols=binary_only[binary_only].index
binary_cols
Out[33]: Index(['HBCU', 'MENONLY', 'WOMENONLY', 'RELAFFIL', 'DISTANCEONLY', 'CURROPER'], dtype='object')
# drop the binary columns using the .drop method
college_n2=college_n.drop(columns=binary_cols)
college_n2.head()
Out[36]:
SATVRMID ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Alabama A & M University 424.0 ... 33888.0
University of Alabama at Birmingham 570.0 ... 21941.5
Amridge University NaN ... 23370.0
University of Alabama in Huntsville 595.0 ... 24097.0
Alabama State University 425.0 ... 33118.5
[5 rows x 18 columns]
# use the .idxmax method to find the index label of the maximum value for each column
max_cols=college_n2.idxmax()
max_cols
Out[39]:
SATVRMID California Institute of Technology
SATMTMID California Institute of Technology
UGDS University of Phoenix-Arizona
UGDS_WHITE Mr Leon's School of Hair Design-Moscow
UGDS_BLACK Velvatex College of Beauty Culture
UGDS_HISP Thunderbird School of Global Management
UGDS_ASIAN Cosmopolitan Beauty and Tech School
UGDS_AIAN Haskell Indian Nations University
UGDS_NHPI Palau Community College
UGDS_2MOR LIU Brentwood
UGDS_NRA California University of Management and Sciences
UGDS_UNKN Le Cordon Bleu College of Culinary Arts-San Fr...
PPTUG_EF Thunderbird School of Global Management
PCTPELL MTI Business College Inc
PCTFLOAN ABC Beauty College Inc
UG25ABV Dongguk University-Los Angeles
MD_EARN_WNE_P10 Medical College of Wisconsin
GRAD_DEBT_MDN_SUPP Southwest University of Visual Arts-Tucson
dtype: object
# call the .unique method on the 'max_cols' Series
# this returns an ndarray of the unique index values (school names) in college_n2 that hold a maximum value
unique_max_cols=max_cols.unique()
type(max_cols)
Out[43]: pandas.core.series.Series
max_cols.index[:2]
Out[46]: Index(['SATVRMID', 'SATMTMID'], dtype='object')
type(unique_max_cols)
Out[47]: numpy.ndarray
# the .unique method finds unique values rather than unique index labels, and returns an ndarray
havoc — widespread destruction; chaos
override — to reject, overrule, or disregard
Replicating idxmax with method chaining
replicate — to copy, duplicate
……a little complex…
Finding the most common maximum
Hispanic — of Spanish or Latin American descent
import pandas as pd
college=pd.read_csv('college.csv',index_col="INSTNM")
college_ugds=college.filter(like="UGDS_")
college_ugds.head()
Out[7]:
UGDS_WHITE ... UGDS_UNKN
INSTNM ...
Alabama A & M University 0.0333 ... 0.0138
University of Alabama at Birmingham 0.5922 ... 0.0100
Amridge University 0.2990 ... 0.2715
University of Alabama in Huntsville 0.6988 ... 0.0350
Alabama State University 0.0158 ... 0.0137
[5 rows x 9 columns]
# apply the .idxmax method to each row of data to find the column with the largest value
# we need to alter its default behavior with the axis parameter
highest_percentage_rage=college_ugds.idxmax(axis="columns")
highest_percentage_rage
Out[9]:
INSTNM
Alabama A & M University UGDS_BLACK
University of Alabama at Birmingham UGDS_WHITE
Amridge University UGDS_BLACK
University of Alabama in Huntsville UGDS_WHITE
Alabama State University UGDS_BLACK
...
SAE Institute of Technology San Francisco NaN
Rasmussen College - Overland Park NaN
National Personal Training Institute of Cleveland NaN
Bay Area Medical Academy - San Jose Satellite Location NaN
Excel Learning Center-San Antonio South NaN
Length: 7535, dtype: object
highest=college_ugds.idxmax()
highest.head()
Out[11]:
UGDS_WHITE Mr Leon's School of Hair Design-Moscow
UGDS_BLACK Velvatex College of Beauty Culture
UGDS_HISP Thunderbird School of Global Management
UGDS_ASIAN Cosmopolitan Beauty and Tech School
UGDS_AIAN Haskell Indian Nations University
dtype: object
# fix the misspelled variable name
highest_percentage_race=highest_percentage_rage
highest_percentage_race is highest_percentage_rage
Out[14]: True
# both names refer to the same object
# .idxmax applied along the column axis returns, for each college, the race column with the highest percentage
# use the .value_counts method to return the distribution of maximum occurrences
# pass normalize=True since we are interested in the relative frequency (the distribution sums to 1) rather than raw counts
highest_percentage_race.value_counts(normalize=True)
Out[18]:
UGDS_WHITE 0.670352
UGDS_BLACK 0.151586
UGDS_HISP 0.129473
UGDS_UNKN 0.023422
UGDS_ASIAN 0.012074
UGDS_AIAN 0.006110
UGDS_NRA 0.004073
UGDS_NHPI 0.001746
UGDS_2MOR 0.001164
dtype: float64
# for those schools with more Black students than any other race,
# what is the distribution of its second highest race percentage
college_ugds[highest_percentage_race=="UGDS_BLACK"].drop(columns="UGDS_BLACK").idxmax(axis="columns").value_counts(normalize=True)
Out[21]:
UGDS_WHITE 0.661228
UGDS_HISP 0.230326
UGDS_UNKN 0.071977
UGDS_NRA 0.018234
UGDS_ASIAN 0.009597
UGDS_2MOR 0.006718
UGDS_NHPI 0.000960
UGDS_AIAN 0.000960
dtype: float64