【cookbook pandas】学习笔记第六章

Selecting Subsets of Data

every dimension of data in a Series or DataFrame is labeled in the index object.
it is this index that separates pandas data structures from Numpy’s n-dimension array.
indexes provide meaningful labels for each rows and column of data.
pandas allows user to select data according to the position of the rows and columns.
Python dictionaries and lists are built-in data structures that select their data in exactly one of these ways.

caution against doing sth 警告某人不要做某事
promptly 立刻,马上
immutable object 不可改变的,永恒不变的
a dictionary’s key (its label) must be an immutable object ,such as a string, integer, or tuple.
lists must either use integers (the position) or slice objects for selection.
dictionaries can only select one object at a time by passing the key to the indexing operator.
in this way, pandas is combining the ability to select data using integers, as with lists, and labels, as with dictionaries.

Selecting Series data

Series and DataFrame are complex data containers that have multiple attributes that use an index operation to select data in different ways.
Series and DataFrame allow selection by position (like python lists) and by label (like python dictionaries)
when we index off of the .iloc attribute, pandas selects only by position and works similarly to Python lists
the .loc attribute selects only by index label, which is similar to how Python dictionaries work
the .loc and .iloc attributes are available on both Series and DataFrame
these indexer accept scalar values, lists and slices.
an index operation is when you put brackets,[ ], following a variable

# when you slice by position, pandas uses the half-open interval
# when you slice by label, pandas uses the closed interval and includes both the start and end index
# directly use .loc .iloc on the original DataFrame
# pass in a tuple (without parentheses) of row and column leabels or positions
college.iloc["Alabama A & M University","CITY"]
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-a19a8d9dcc9c>", line 1, in <module>
    college.iloc["Alabama A & M University","CITY"]
NameError: name 'college' is not defined
import pandas as pd
college=pd.read_csv("college.csv")
college.iloc["Alabama A & M University","CITY"]
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 723, in _has_valid_tuple
    self._validate_key(k, i)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1375, in _validate_key
    raise ValueError(f"Can only index by location with a [{self._valid_types}]")
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-9-a19a8d9dcc9c>", line 1, in <module>
    college.iloc["Alabama A & M University","CITY"]
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1450, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 725, in _has_valid_tuple
    raise ValueError(
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
college.loc["Alabama A & M University","CITY"]
Traceback (most recent call last):
  File "D:\PyCharm2020\python2020\lib\site-packages\IPython\core\interactiveshell.py", line 3427, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-87a530e1e6f6>", line 1, in <module>
    college.loc["Alabama A & M University","CITY"]
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 807, in _getitem_lowerdim
    section = self._getitem_axis(key, axis=i)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1124, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexing.py", line 1073, in _get_label
    return self.obj.xs(label, axis=axis)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\generic.py", line 3738, in xs
    loc = index.get_loc(key)
  File "D:\PyCharm2020\python2020\lib\site-packages\pandas\core\indexes\range.py", line 354, in get_loc
    raise KeyError(key)
KeyError: 'Alabama A & M University'
college.head()
Out[11]: 
                                INSTNM  ... GRAD_DEBT_MDN_SUPP
0             Alabama A & M University  ...              33888
1  University of Alabama at Birmingham  ...            21941.5
2                   Amridge University  ...              23370
3  University of Alabama in Huntsville  ...              24097
4             Alabama State University  ...            33118.5
[5 rows x 27 columns]
college=pd.read_csv("college.csv",index_col='INSTNM')
college.loc["Alabama A & M University","CITY"]
Out[13]: 'Normal'
college.iloc[0,0]
Out[14]: 'Normal'
college.loc[["Alabama A & M University","Alabama State University"],"CITY"]
Out[15]: 
INSTNM
Alabama A & M University        Normal
Alabama State University    Montgomery
Name: CITY, dtype: object
college.iloc[0:5,0]
Out[16]: 
INSTNM
Alabama A & M University                   Normal
University of Alabama at Birmingham    Birmingham
Amridge University                     Montgomery
University of Alabama in Huntsville    Huntsville
Alabama State University               Montgomery
Name: CITY, dtype: object
# when using slicing off of .loc , if the start index  appears affter the stop index , then an empty Series is returned without an exception
city=college.CITY
city.loc['Reid State Technical College':'Alabama State University']
Out[21]: Series([], Name: CITY, dtype: object)
# selecting DataFrame rows
# the most explicit and preferred way to select DataFrame rows is with .iloc and .loc.
# They are both capable of selecting by rows or by columns 
# to select an entire row at the position, pass an integer to .iloc

college.iloc[6]
Out[27]: 
CITY                  Alexander City
STABBR                            AL
HBCU                             0.0
MENONLY                          0.0
WOMENONLY                        0.0
RELAFFIL                           0
SATVRMID                         NaN
SATMTMID                         NaN
DISTANCEONLY                     0.0
UGDS                          1592.0
UGDS_WHITE                    0.7255
UGDS_BLACK                    0.2613
UGDS_HISP                     0.0044
UGDS_ASIAN                    0.0025
UGDS_AIAN                     0.0044
UGDS_NHPI                        0.0
UGDS_2MOR                        0.0
UGDS_NRA                         0.0
UGDS_UNKN                     0.0019
PPTUG_EF                      0.3882
CURROPER                           1
PCTPELL                       0.5892
PCTFLOAN                      0.3977
UG25ABV                       0.3153
MD_EARN_WNE_P10                27500
GRAD_DEBT_MDN_SUPP             16127
Name: Central Alabama Community College, dtype: object
# because Python is zero-based, this is actually the 61st row
# pandas represents this row as a Series
# to get the same row as the preceding step, pass the index label to .loc
college.loc['Central Alabama Community College']
Out[32]: 
CITY                  Alexander City
STABBR                            AL
HBCU                             0.0
MENONLY                          0.0
WOMENONLY                        0.0
RELAFFIL                           0
SATVRMID                         NaN
SATMTMID                         NaN
DISTANCEONLY                     0.0
UGDS                          1592.0
UGDS_WHITE                    0.7255
UGDS_BLACK                    0.2613
UGDS_HISP                     0.0044
UGDS_ASIAN                    0.0025
UGDS_AIAN                     0.0044
UGDS_NHPI                        0.0
UGDS_2MOR                        0.0
UGDS_NRA                         0.0
UGDS_UNKN                     0.0019
PPTUG_EF                      0.3882
CURROPER                           1
PCTPELL                       0.5892
PCTFLOAN                      0.3977
UG25ABV                       0.3153
MD_EARN_WNE_P10                27500
GRAD_DEBT_MDN_SUPP             16127
Name: Central Alabama Community College, dtype: object
# to select a disjointed set of rows as a DataFrame, pass a list of integers to .iloc
college.iloc[[60,99,3]]
Out[34]: 
                                            CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                            ...                   
University of Alaska Anchorage         Anchorage  ...            19449.5
International Academy of Hair Design       Tempe  ...              10556
University of Alabama in Huntsville   Huntsville  ...              24097
[3 rows x 26 columns]
# because we passed in a list of row position, this returns a DataFrame
# the same DateFrame from the preceding step may be reproduced with .loc by passing it a list of the institution names
labels=['University of Alaska Anchorage','International Academy of Hair Design',"University of Alabama in Huntsville"]
college.loc[labels]
Out[38]: 
                                            CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                            ...                   
University of Alaska Anchorage         Anchorage  ...            19449.5
International Academy of Hair Design       Tempe  ...              10556
University of Alabama in Huntsville   Huntsville  ...              24097
[3 rows x 26 columns]
#use slice notation with .iloc to select continuous rows of the data
college[99:102]
Out[40]: 
                                         CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                         ...                   
International Academy of Hair Design    Tempe  ...              10556
GateWay Community College             Phoenix  ...               7283
Mesa Community College                   Mesa  ...               8000
[3 rows x 26 columns]
# slice notation also works with .loc and is a closed interval 
start = 'International Academy of Hair Design'
end='Mesa Community College'
college.loc[start:end]
Out[46]: 
                                         CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                         ...                   
International Academy of Hair Design    Tempe  ...              10556
GateWay Community College             Phoenix  ...               7283
Mesa Community College                   Mesa  ...               8000
[3 rows x 26 columns]
# when we pass a scalar value, a list of scalars, or a slice to .iloc or .loc,
# this causes pandas to scan the index for the appropriate rows and returned them
# if a single scalar value is passed, a Series is returned 
# if a list or slice is passed, a DataFrame is returned 
# the list of index labels can be selected directly from the DataFrame
college.iloc[[66,99,3]].index.tolist()
Out[52]: 
['Charter College-Anchorage',
 'International Academy of Hair Design',
 'University of Alabama in Huntsville']

上面的报错就不一一删除了,从中可以看出:Series和DataFrame都是pandas的数据结构格式不错,但是二者的使用也是有差别的。比如说:Series 可以直接用Seriesname[0]取到第一条记录,但是DataFrame却不行,后者必须通过iloc实现;在select rows时,Series可以返回一个Scalar value (or Series),但是DataFrame最少也要返回一个Series,也可能返回一个DataFrame。
注意:只有将INSTNM设置为index_col,才能够在后续的处理中以INSTNM中的值作为索引。

Selecting DataFrame rows and columns simultaneously

#-----
# indexing directly on a DataFrame does not allow you to select both rows and columns simultaneously
# to select rows and columns, pass both valid row and column selections separated by a comma to either .iloc or .loc
# the generic form to select rows and columns
# df.iloc[row_idxs,column_idxs] where 'row_idxs' and 'column_idxs'can be scalar integers, lists of integers, or integer slice
# df.loc[row_names,column_names] where 'row_name' and 'column_name' can be the scalar names, lists of names, or names slice, row_names can also be a Boolean array
college.iloc[:3,:4]
Out[59]: 
                                           CITY STABBR  HBCU  MENONLY
INSTNM                                                               
Alabama A & M University                 Normal     AL   1.0      0.0
University of Alabama at Birmingham  Birmingham     AL   0.0      0.0
Amridge University                   Montgomery     AL   0.0      0.0
college.loc[:"Amridge University",:"MENONLY"]
Out[60]: 
                                           CITY STABBR  HBCU  MENONLY
INSTNM                                                               
Alabama A & M University                 Normal     AL   1.0      0.0
University of Alabama at Birmingham  Birmingham     AL   0.0      0.0
Amridge University                   Montgomery     AL   0.0      0.0
# select all rows of two different columns
college.iloc[:,[4,6]].head()
Out[62]: 
                                     WOMENONLY  SATVRMID
INSTNM                                                  
Alabama A & M University                   0.0     424.0
University of Alabama at Birmingham        0.0     570.0
Amridge University                         0.0       NaN
University of Alabama in Huntsville        0.0     595.0
Alabama State University                   0.0     425.0
college.loc[:,["WOMENONLY","SATVRMID"]]
Out[63]: 
                                                    WOMENONLY  SATVRMID
INSTNM                                                                 
Alabama A & M University                                  0.0     424.0
University of Alabama at Birmingham                       0.0     570.0
Amridge University                                        0.0       NaN
University of Alabama in Huntsville                       0.0     595.0
Alabama State University                                  0.0     425.0
                                                       ...       ...
SAE Institute of Technology  San Francisco                NaN       NaN
Rasmussen College - Overland Park                         NaN       NaN
National Personal Training Institute of Cleveland         NaN       NaN
Bay Area Medical Academy - San Jose Satellite L...        NaN       NaN
Excel Learning Center-San Antonio South                   NaN       NaN
[7535 rows x 2 columns]
college.loc[:,["WOMENONLY","SATVRMID"]].head()
Out[64]: 
                                     WOMENONLY  SATVRMID
INSTNM                                                  
Alabama A & M University                   0.0     424.0
University of Alabama at Birmingham        0.0     570.0
Amridge University                         0.0       NaN
University of Alabama in Huntsville        0.0     595.0
Alabama State University                   0.0     425.0
# select disjointed rows and columns 
college.iloc[[100,200],[7,15]]
Out[66]: 
                                       SATMTMID  UGDS_NHPI
INSTNM                                                    
GateWay Community College                   NaN     0.0029
American Baptist Seminary of the West       NaN        NaN
rows_name=["GateWay Community College",'American Baptist Seminary of the West']
columns_name=['SATMTMID','UGDS_NHPI']
college.loc[rows_name,columns_name]
Out[69]: 
                                       SATMTMID  UGDS_NHPI
INSTNM                                                    
GateWay Community College                   NaN     0.0029
American Baptist Seminary of the West       NaN        NaN
college.loc["American Baptist Seminary of the West","UGDS_NHPI"]
Out[70]: nan
# select a single scalar column as above 
# slice the row and select a single column
college.iloc[90:80:-2,5]
Out[73]: 
INSTNM
Empire Beauty School-Flagstaff     0
Charles of Italy Beauty College    0
Central Arizona College            0
University of Arizona              0
Arizona State University-Tempe     0
Name: RELAFFIL, dtype: int64
# colon means :, represents a slice object that returns all the values for that dimesion
# to select only rows(along with all the columns), it id necessary to use a colon following a comma
# the default behavior is to select all the clolumns if there is no comma present
college.iloc[:10].equals(college.iloc[:10,:])
Out[77]: True

selecting data with both integers and labels

infamous 臭名远扬的

# use the index method .get_loc to find the integer position of the desireed column
col_start=college.columns.get_loc('UGDS_WHITE')
col_end=college.columns.get_loc('UGDS_UNKN')+1
col_start,col_end
Out[81]: (10, 19)
# use col_start and col_end to select columns by position using .iloc 
college.iloc[:5,col_start:col_end]
Out[83]: 
                                     UGDS_WHITE  ...  UGDS_UNKN
INSTNM                                           ...           
Alabama A & M University                 0.0333  ...     0.0138
University of Alabama at Birmingham      0.5922  ...     0.0100
Amridge University                       0.2990  ...     0.2715
University of Alabama in Huntsville      0.6988  ...     0.0350
Alabama State University                 0.0158  ...     0.0137
[5 rows x 9 columns]
# first retrieves the column index through the .columns attribute
# .get_loc method accepts an index label and returns its integer location
# the next, show how to select the 10th through 15th(inclusive) rows, along with columns'UGDS_WHITE' through 'UGDS_UNKN'
row_start=college.index[10]
row_end=college.index[15]
college.loc[row_start:row_end,"UGDS_WHITE":"UGDS_UNKN"]
Out[89]: 
                                          UGDS_WHITE  ...  UGDS_UNKN
INSTNM                                                ...           
Birmingham Southern College                   0.7983  ...     0.0051
Chattahoochee Valley Community College        0.4661  ...     0.0139
Concordia College Alabama                     0.0280  ...     0.0000
South University-Montgomery                   0.3046  ...     0.0326
Enterprise State Community College            0.6408  ...     0.0069
James H Faulkner State Community College      0.6979  ...     0.0009
[6 rows x 9 columns]
# for the same result, chaining .loc and .iloc together is a bad idea.
# because it can be slower, and it is also undetermined wheter it returns a view or a copy
# that is not problematic when viewing the data, but can be when updating data

slicing lexicographically

# for the same result, chaining .loc and .iloc together is a bad idea.
# because it can be slower, and it is also undetermined wheter it returns a view or a copy
# that is not problematic when viewing the data, but can be when updating data
# sort the index, very important
college=college.sort_index()
college.loc["Sp":"Su"]
Out[95]: 
                                                  CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                                  ...                   
Spa Tech Institute-Ipswich                     Ipswich  ...               6333
Spa Tech Institute-Plymouth                   Plymouth  ...               6333
Spa Tech Institute-Westboro                   Westboro  ...               6333
Spa Tech Institute-Westbrook                 Westbrook  ...               6333
Spalding University                         Louisville  ...              25000
                                                ...  ...                ...
Studio Academy of Beauty                      Chandler  ...               6333
Studio Jewelers                               New York  ...  PrivacySuppressed
Stylemaster College of Hair Design            Longview  ...              13320
Styles and Profiles Beauty College              Selmer  ...  PrivacySuppressed
Styletrends Barber and Hairstyling Academy   Rock Hill  ...             9495.5
[201 rows x 26 columns]
#if using .loc, it raises a KeyError when these labels are not found in the index 
# however, one special exception to this behavior exsits whenever the index is lexicographically sorted, and a slice is passed to it
# to select all colleges that begin with the letters D through S,  
college.loc['D':'E']
Out[99]: 
                                                    CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                                    ...                   
D & L Academy of Hair Design                  Twin Falls  ...  PrivacySuppressed
D A Dorsey Technical College                       Miami  ...  PrivacySuppressed
D'Jay's School of Beauty Arts and Sciences   Baton Rouge  ...               7327
D'Mart Institute                            Barranquitas  ...  PrivacySuppressed
D'Youville College                               Buffalo  ...              25500
                                                  ...  ...                ...
Durham Beauty Academy                             Durham  ...              15332
Durham Technical Community College                Durham  ...            11069.5
Dutchess BOCES-Practical Nursing Program    Poughkeepsie  ...               9500
Dutchess Community College                  Poughkeepsie  ...              10250
Dyersburg State Community College              Dyersburg  ...               7475
[224 rows x 26 columns]
# notice when lexicographically ordered, it is also half-open interval
college.loc['D':'T']
Out[101]: 
                                                         CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                                         ...                   
D & L Academy of Hair Design                       Twin Falls  ...  PrivacySuppressed
D A Dorsey Technical College                            Miami  ...  PrivacySuppressed
D'Jay's School of Beauty Arts and Sciences        Baton Rouge  ...               7327
D'Mart Institute                                 Barranquitas  ...  PrivacySuppressed
D'Youville College                                    Buffalo  ...              25500
                                                       ...  ...                ...
Swedish Institute a College of Health Sciences       New York  ...              20333
Sweet Briar College                               Sweet Briar  ...              25500
Sylvain Melloul International Hair Academy          Lynchburg  ...              12500
Syracuse City Schools Practical Nursing Program      Syracuse  ...               9500
Syracuse University                                  Syracuse  ...              27000
[4580 rows x 26 columns]
#the type of slicing also works when the index is sorted in the opposite direction
# determine in which direction the index is sorted wit the index attribute .is_monotonic_increasing or .is_monotonic_decreasing 
# either of these must be True in oder for lexicographic slicing to work
college=college.sort_index(ascending=False)
college.index.is_monotonic_decreasing
Out[106]: True
college.loc['E':'B']
Out[108]: 
                                                  CITY  ... GRAD_DEBT_MDN_SUPP
INSTNM                                                  ...                   
Dyersburg State Community College            Dyersburg  ...               7475
Dutchess Community College                Poughkeepsie  ...              10250
Dutchess BOCES-Practical Nursing Program  Poughkeepsie  ...               9500
Durham Technical Community College              Durham  ...            11069.5
Durham Beauty Academy                           Durham  ...              15332
                                                ...  ...                ...
Bacone College                                Muskogee  ...              26350
Babson College                               Wellesley  ...              27000
BJ's Beauty & Barber College                    Auburn  ...  PrivacySuppressed
BIR Training Center                            Chicago  ...              15394
B M Spurr School of Practical Nursing        Glen Dale  ...  PrivacySuppressed
[1411 rows x 26 columns]

[208/(229/627)]

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值