Selecting Subsets of Data
every dimension of data in a Series or DataFrame is labeled in the index object.
it is this index that separates pandas data structures from Numpy’s n-dimension array.
indexes provide meaningful labels for each rows and column of data.
pandas allows user to select data according to the position of the rows and columns.
Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways.
caution against doing sth 警告某人不要做某事
promptly 立刻,马上
immutable object 不可改变的,永恒不变的
a dictionary’s key (its label) must be an immutable object ,such as a string, integer, or tuple.
lists must either use integers (the position) or slice objects for selection.
dictionaries can only select one object at a time by passing the key to the indexing operator.
in this way, pandas is combining the ability to select data using integers, as with lists, and labels, as with dictionaries.
Selecting Series data
Series and DataFrame are complex data containers that have multiple attributes that use an index operation to select data in different ways.
Series and DataFrame allow selection by position (like python lists) and by label (like python dictionaries)
when we index off of the .iloc
attribute, pandas selects only by position and works similarly to Python lists
the .loc
attribute selects only by index label, which is similar to how Python dictionaries work
the .loc
and .iloc
attributes are available on both Series and DataFrame
these indexer accept scalar values, lists and slices.
an index operation is when you put brackets,[ ], following a variable
# when you slice by position, pandas uses the half-open interval
# when you slice by label, pandas uses the closed interval and includes both the start and end index
# directly use .loc .iloc on the original DataFrame
# pass in a tuple (without parentheses) of row and column leabels or positions
college.iloc["Alabama A & M University","CITY"]
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-6-a19a8d9dcc9c>", line 1, in <module>
college.iloc["Alabama A & M University","CITY"]
NameError: name 'college' is not defined
import pandas as pd
college=pd.read_csv("college.csv")
college.iloc["Alabama A & M University","CITY"]
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 723, in _has_valid_tuple
self._validate_key(k, i)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1375, in _validate_key
raise ValueError(f"Can only index by location with a [{self._valid_types}]")
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-a19a8d9dcc9c>", line 1, in <module>
college.iloc["Alabama A & M University","CITY"]
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 889, in __getitem__
return self._getitem_tuple(key)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1450, in _getitem_tuple
self._has_valid_tuple(tup)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 725, in _has_valid_tuple
raise ValueError(
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
college.loc["Alabama A & M University","CITY"]
Traceback (most recent call last):
File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-87a530e1e6f6>", line 1, in <module>
college.loc["Alabama A & M University","CITY"]
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 889, in __getitem__
return self._getitem_tuple(key)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1060, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 807, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1124, in _getitem_axis
return self._get_label(key, axis=axis)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1073, in _get_label
return self.obj.xs(label, axis=axis)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 3738, in xs
loc = index.get_loc(key)
File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\range.py", line 354, in get_loc
raise KeyError(key)
KeyError: 'Alabama A & M University'
college.head()
Out[11]:
INSTNM ... GRAD_DEBT_MDN_SUPP
0 Alabama A & M University ... 33888
1 University of Alabama at Birmingham ... 21941.5
2 Amridge University ... 23370
3 University of Alabama in Huntsville ... 24097
4 Alabama State University ... 33118.5
[5 rows x 27 columns]
college=pd.read_csv("college.csv",index_col='INSTNM')
college.loc["Alabama A & M University","CITY"]
Out[13]: 'Normal'
college.iloc[0,0]
Out[14]: 'Normal'
college.loc[["Alabama A & M University","Alabama State University"],"CITY"]
Out[15]:
INSTNM
Alabama A & M University Normal
Alabama State University Montgomery
Name: CITY, dtype: object
college.iloc[0:5,0]
Out[16]:
INSTNM
Alabama A & M University Normal
University of Alabama at Birmingham Birmingham
Amridge University Montgomery
University of Alabama in Huntsville Huntsville
Alabama State University Montgomery
Name: CITY, dtype: object
# when using slicing off of .loc , if the start index appears affter the stop index , then an empty Series is returned without an exception
city=college.CITY
city.loc['Reid State Technical College':'Alabama State University']
Out[21]: Series([], Name: CITY, dtype: object)
# selecting DataFrame rows
# the most explicit and preferred way to select DataFrame rows is with .iloc and .loc.
# They are both capable of selecting by rows or by columns
# to select an entire row at the position, pass an integer to .iloc
college.iloc[6]
Out[27]:
CITY Alexander City
STABBR AL
HBCU 0.0
MENONLY 0.0
WOMENONLY 0.0
RELAFFIL 0
SATVRMID NaN
SATMTMID NaN
DISTANCEONLY 0.0
UGDS 1592.0
UGDS_WHITE 0.7255
UGDS_BLACK 0.2613
UGDS_HISP 0.0044
UGDS_ASIAN 0.0025
UGDS_AIAN 0.0044
UGDS_NHPI 0.0
UGDS_2MOR 0.0
UGDS_NRA 0.0
UGDS_UNKN 0.0019
PPTUG_EF 0.3882
CURROPER 1
PCTPELL 0.5892
PCTFLOAN 0.3977
UG25ABV 0.3153
MD_EARN_WNE_P10 27500
GRAD_DEBT_MDN_SUPP 16127
Name: Central Alabama Community College, dtype: object
# because Python is zero-based, this is actually the 61st row
# pandas represents this row as a Series
# to get the same row as the preceding step, pass the index label to .loc
college.loc['Central Alabama Community College']
Out[32]:
CITY Alexander City
STABBR AL
HBCU 0.0
MENONLY 0.0
WOMENONLY 0.0
RELAFFIL 0
SATVRMID NaN
SATMTMID NaN
DISTANCEONLY 0.0
UGDS 1592.0
UGDS_WHITE 0.7255
UGDS_BLACK 0.2613
UGDS_HISP 0.0044
UGDS_ASIAN 0.0025
UGDS_AIAN 0.0044
UGDS_NHPI 0.0
UGDS_2MOR 0.0
UGDS_NRA 0.0
UGDS_UNKN 0.0019
PPTUG_EF 0.3882
CURROPER 1
PCTPELL 0.5892
PCTFLOAN 0.3977
UG25ABV 0.3153
MD_EARN_WNE_P10 27500
GRAD_DEBT_MDN_SUPP 16127
Name: Central Alabama Community College, dtype: object
# to select a disjointed set of rows as a DataFrame, pass a list of integers to .iloc
college.iloc[[60,99,3]]
Out[34]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
University of Alaska Anchorage Anchorage ... 19449.5
International Academy of Hair Design Tempe ... 10556
University of Alabama in Huntsville Huntsville ... 24097
[3 rows x 26 columns]
# because we passed in a list of row position, this returns a DataFrame
# the same DateFrame from the preceding step may be reproduced with .loc by passing it a list of the institution names
labels=['University of Alaska Anchorage','International Academy of Hair Design',"University of Alabama in Huntsville"]
college.loc[labels]
Out[38]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
University of Alaska Anchorage Anchorage ... 19449.5
International Academy of Hair Design Tempe ... 10556
University of Alabama in Huntsville Huntsville ... 24097
[3 rows x 26 columns]
#use slice notation with .iloc to select continuous rows of the data
college[99:102]
Out[40]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
International Academy of Hair Design Tempe ... 10556
GateWay Community College Phoenix ... 7283
Mesa Community College Mesa ... 8000
[3 rows x 26 columns]
# slice notation also works with .loc and is a closed interval
start = 'International Academy of Hair Design'
end='Mesa Community College'
college.loc[start:end]
Out[46]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
International Academy of Hair Design Tempe ... 10556
GateWay Community College Phoenix ... 7283
Mesa Community College Mesa ... 8000
[3 rows x 26 columns]
# when we pass a scalar value, a list of scalars, or a slice to .iloc or .loc,
# this causes pandas to scan the index for the appropriate rows and returned them
# if a single scalar value is passed, a Series is returned
# if a list or slice is passed, a DataFrame is returned
# the list of index labels can be selected directly from the DataFrame
college.iloc[[66,99,3]].index.tolist()
Out[52]:
['Charter College-Anchorage',
'International Academy of Hair Design',
'University of Alabama in Huntsville']
上面的报错就不一一删除了,从中可以看出:Series和DataFrame都是pandas的数据结构格式不错,但是二者的使用也是有差别的。比如说:Series 可以直接用Seriesname[0]
取到第一条记录,但是DataFrame却不行,后者必须通过iloc实现;在select rows
时,Series可以返回一个Scalar value (or Series),但是DataFrame最少也要返回一个Series,也可能返回一个DataFrame。
注意:只有将INSTNM
设置为index_col
,才能够在后续的处理中以INSTNM
中的值作为索引。
Selecting DataFrame rows and columns simultaneously
#-----
# indexing directly on a DataFrame does not allow you to select both rows and columns simultaneously
# to select rows and columns, pass both valid row and column selections separated by a comma to either .iloc or .loc
# the generic form to select rows and columns
# df.iloc[row_idxs,column_idxs] where 'row_idxs' and 'column_idxs'can be scalar integers, lists of integers, or integer slice
# df.loc[row_names,column_names] where 'row_name' and 'column_name' can be the scalar names, lists of names, or names slice, row_names can also be a Boolean array
college.iloc[:3,:4]
Out[59]:
CITY STABBR HBCU MENONLY
INSTNM
Alabama A & M University Normal AL 1.0 0.0
University of Alabama at Birmingham Birmingham AL 0.0 0.0
Amridge University Montgomery AL 0.0 0.0
college.loc[:"Amridge University",:"MENONLY"]
Out[60]:
CITY STABBR HBCU MENONLY
INSTNM
Alabama A & M University Normal AL 1.0 0.0
University of Alabama at Birmingham Birmingham AL 0.0 0.0
Amridge University Montgomery AL 0.0 0.0
# select all rows of two different columns
college.iloc[:,[4,6]].head()
Out[62]:
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
college.loc[:,["WOMENONLY","SATVRMID"]]
Out[63]:
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
... ...
SAE Institute of Technology San Francisco NaN NaN
Rasmussen College - Overland Park NaN NaN
National Personal Training Institute of Cleveland NaN NaN
Bay Area Medical Academy - San Jose Satellite L... NaN NaN
Excel Learning Center-San Antonio South NaN NaN
[7535 rows x 2 columns]
college.loc[:,["WOMENONLY","SATVRMID"]].head()
Out[64]:
WOMENONLY SATVRMID
INSTNM
Alabama A & M University 0.0 424.0
University of Alabama at Birmingham 0.0 570.0
Amridge University 0.0 NaN
University of Alabama in Huntsville 0.0 595.0
Alabama State University 0.0 425.0
# select disjointed rows and columns
college.iloc[[100,200],[7,15]]
Out[66]:
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN
rows_name=["GateWay Community College",'American Baptist Seminary of the West']
columns_name=['SATMTMID','UGDS_NHPI']
college.loc[rows_name,columns_name]
Out[69]:
SATMTMID UGDS_NHPI
INSTNM
GateWay Community College NaN 0.0029
American Baptist Seminary of the West NaN NaN
college.loc["American Baptist Seminary of the West","UGDS_NHPI"]
Out[70]: nan
# select a single scalar column as above
# slice the row and select a single column
college.iloc[90:80:-2,5]
Out[73]:
INSTNM
Empire Beauty School-Flagstaff 0
Charles of Italy Beauty College 0
Central Arizona College 0
University of Arizona 0
Arizona State University-Tempe 0
Name: RELAFFIL, dtype: int64
# colon means :, represents a slice object that returns all the values for that dimesion
# to select only rows(along with all the columns), it id necessary to use a colon following a comma
# the default behavior is to select all the clolumns if there is no comma present
college.iloc[:10].equals(college.iloc[:10,:])
Out[77]: True
selecting data with both integers and labels
infamous 臭名远扬的
# use the index method .get_loc to find the integer position of the desireed column
col_start=college.columns.get_loc('UGDS_WHITE')
col_end=college.columns.get_loc('UGDS_UNKN')+1
col_start,col_end
Out[81]: (10, 19)
# use col_start and col_end to select columns by position using .iloc
college.iloc[:5,col_start:col_end]
Out[83]:
UGDS_WHITE ... UGDS_UNKN
INSTNM ...
Alabama A & M University 0.0333 ... 0.0138
University of Alabama at Birmingham 0.5922 ... 0.0100
Amridge University 0.2990 ... 0.2715
University of Alabama in Huntsville 0.6988 ... 0.0350
Alabama State University 0.0158 ... 0.0137
[5 rows x 9 columns]
# first retrieves the column index through the .columns attribute
# .get_loc method accepts an index label and returns its integer location
# the next, show how to select the 10th through 15th(inclusive) rows, along with columns'UGDS_WHITE' through 'UGDS_UNKN'
row_start=college.index[10]
row_end=college.index[15]
college.loc[row_start:row_end,"UGDS_WHITE":"UGDS_UNKN"]
Out[89]:
UGDS_WHITE ... UGDS_UNKN
INSTNM ...
Birmingham Southern College 0.7983 ... 0.0051
Chattahoochee Valley Community College 0.4661 ... 0.0139
Concordia College Alabama 0.0280 ... 0.0000
South University-Montgomery 0.3046 ... 0.0326
Enterprise State Community College 0.6408 ... 0.0069
James H Faulkner State Community College 0.6979 ... 0.0009
[6 rows x 9 columns]
# for the same result, chaining .loc and .iloc together is a bad idea.
# because it can be slower, and it is also undetermined wheter it returns a view or a copy
# that is not problematic when viewing the data, but can be when updating data
slicing lexicographically
# for the same result, chaining .loc and .iloc together is a bad idea.
# because it can be slower, and it is also undetermined wheter it returns a view or a copy
# that is not problematic when viewing the data, but can be when updating data
# sort the index, very important
college=college.sort_index()
college.loc["Sp":"Su"]
Out[95]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Spa Tech Institute-Ipswich Ipswich ... 6333
Spa Tech Institute-Plymouth Plymouth ... 6333
Spa Tech Institute-Westboro Westboro ... 6333
Spa Tech Institute-Westbrook Westbrook ... 6333
Spalding University Louisville ... 25000
... ... ...
Studio Academy of Beauty Chandler ... 6333
Studio Jewelers New York ... PrivacySuppressed
Stylemaster College of Hair Design Longview ... 13320
Styles and Profiles Beauty College Selmer ... PrivacySuppressed
Styletrends Barber and Hairstyling Academy Rock Hill ... 9495.5
[201 rows x 26 columns]
#if using .loc, it raises a KeyError when these labels are not found in the index
# however, one special exception to this behavior exsits whenever the index is lexicographically sorted, and a slice is passed to it
# to select all colleges that begin with the letters D through S,
college.loc['D':'E']
Out[99]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
D & L Academy of Hair Design Twin Falls ... PrivacySuppressed
D A Dorsey Technical College Miami ... PrivacySuppressed
D'Jay's School of Beauty Arts and Sciences Baton Rouge ... 7327
D'Mart Institute Barranquitas ... PrivacySuppressed
D'Youville College Buffalo ... 25500
... ... ...
Durham Beauty Academy Durham ... 15332
Durham Technical Community College Durham ... 11069.5
Dutchess BOCES-Practical Nursing Program Poughkeepsie ... 9500
Dutchess Community College Poughkeepsie ... 10250
Dyersburg State Community College Dyersburg ... 7475
[224 rows x 26 columns]
# notice when lexicographically ordered, it is also half-open interval
college.loc['D':'T']
Out[101]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
D & L Academy of Hair Design Twin Falls ... PrivacySuppressed
D A Dorsey Technical College Miami ... PrivacySuppressed
D'Jay's School of Beauty Arts and Sciences Baton Rouge ... 7327
D'Mart Institute Barranquitas ... PrivacySuppressed
D'Youville College Buffalo ... 25500
... ... ...
Swedish Institute a College of Health Sciences New York ... 20333
Sweet Briar College Sweet Briar ... 25500
Sylvain Melloul International Hair Academy Lynchburg ... 12500
Syracuse City Schools Practical Nursing Program Syracuse ... 9500
Syracuse University Syracuse ... 27000
[4580 rows x 26 columns]
#the type of slicing also works when the index is sorted in the opposite direction
# determine in which direction the index is sorted wit the index attribute .is_monotonic_increasing or .is_monotonic_decreasing
# either of these must be True in oder for lexicographic slicing to work
college=college.sort_index(ascending=False)
college.index.is_monotonic_decreasing
Out[106]: True
college.loc['E':'B']
Out[108]:
CITY ... GRAD_DEBT_MDN_SUPP
INSTNM ...
Dyersburg State Community College Dyersburg ... 7475
Dutchess Community College Poughkeepsie ... 10250
Dutchess BOCES-Practical Nursing Program Poughkeepsie ... 9500
Durham Technical Community College Durham ... 11069.5
Durham Beauty Academy Durham ... 15332
... ... ...
Bacone College Muskogee ... 26350
Babson College Wellesley ... 27000
BJ's Beauty & Barber College Auburn ... PrivacySuppressed
BIR Training Center Chicago ... 15394
B M Spurr School of Practical Nursing Glen Dale ... PrivacySuppressed
[1411 rows x 26 columns]
[208/(229/627)]