选择 DataFrame 列的方法大全

因 457079928 python草堂群,网友 @星河·璀动 问到pandas选择数据列的问题,故把自己的学习总结的笔记整理出来。

完整的ipynb文件,和文中示例数据请移步python草堂群下载。

注意:这些方法返回的都是视图。

  1. 如果你确信原始数据的一些列是无用的,可以在删除这些列后,创建 DataFrame 的一个 .copy(deep=True),然后删除原 DataFrame, 以节省内存;
  2. 如果列不是很多,在导入数据时就排除这些列。

选择列的两种思路

  1. 需要的列较少,直接选择需要的列;
  2. 不需要的列较少,删除或排除不要的列。
import numpy as np
import pandas as pd
import pprint as pp

baseball_df = pd.read_csv('baseball.csv',index_col='No.')
pp.pprint(baseball_df)
              id  year  stint team   lg    g   ab   r    h  X2b  ...   rbi  \
No.                                                              ...         
4      ansonca01  1871      1  RC1  NaN   25  120  29   39   11  ...  16.0   
44     forceda01  1871      1  WS3  NaN   32  162  45   45    9  ...  29.0   
68     mathebo01  1871      1  FW1  NaN   19   89  15   24    3  ...  10.0   
99     startjo01  1871      1  NY2  NaN   33  161  35   58    5  ...  34.0   
102    suttoez01  1871      1  CL1  NaN   29  128  35   45    3  ...  23.0   
...          ...   ...    ...  ...  ...  ...  ...  ..  ...  ...  ...   ...   
89525  benitar01  2007      2  FLO   NL   34    0   0    0    0  ...   0.0   
89526  benitar01  2007      1  SFN   NL   19    0   0    0    0  ...   0.0   
89530  ausmubr01  2007      1  HOU   NL  117  349  38   82   16  ...  25.0   
89533   aloumo01  2007      1  NYN   NL   87  328  51  112   19  ...  49.0   
89534  alomasa02  2007      1  NYN   NL    8   22   1    3    1  ...   0.0   

        sb   cs  bb    so  ibb  hbp   sh   sf  gidp  
No.                                                  
4      6.0  2.0   2   1.0  NaN  NaN  NaN  NaN   NaN  
44     8.0  0.0   4   0.0  NaN  NaN  NaN  NaN   NaN  
68     2.0  1.0   2   0.0  NaN  NaN  NaN  NaN   NaN  
99     4.0  2.0   3   0.0  NaN  NaN  NaN  NaN   NaN  
102    3.0  1.0   1   0.0  NaN  NaN  NaN  NaN   NaN  
...    ...  ...  ..   ...  ...  ...  ...  ...   ...  
89525  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89526  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89530  6.0  1.0  37  74.0  3.0  6.0  4.0  1.0  11.0  
89533  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0  
89534  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0  

[21699 rows x 22 columns]

方法一:直接指定需要的列

pp.pprint(baseball_df.query('g>50')[['year','g','ab','r','h']])
       year    g   ab   r    h
No.                           
249    1872   55  282  62   76
268    1872   51  248  66   66
276    1873   52  254  53  101
293    1873   55  245  56   62
344    1873   52  223  40   43
...     ...  ...  ...  ..  ...
89499  2007   69  189  23   48
89521  2007  126  340  75   94
89523  2007  141  517  68  130
89530  2007  117  349  38   82
89533  2007   87  328  51  112

[11704 rows x 5 columns]

方法二:获取需要的列名列表

使用列表推导排除不需要的列

cols = [i for i in baseball_df.columns if i not in ['id', 'ibb','gidp','X2b','X3b','lg']]
baseball_df[cols]
yearstintteamgabrhhrrbisbcsbbsohbpshsf
No.
418711RC1251202939016.06.02.021.0NaNNaNNaN
4418711WS3321624545029.08.00.040.0NaNNaNNaN
6818711FW119891524010.02.01.020.0NaNNaNNaN
9918711NY2331613558134.04.02.030.0NaNNaNNaN
10218711CL1291283545323.03.01.010.0NaNNaNNaN
...................................................
8952520072FLO3400000.00.00.000.00.00.00.0
8952620071SFN1900000.00.00.000.00.00.00.0
8953020071HOU1173493882325.06.01.03774.06.04.01.0
8953320071NYN87328511121349.03.00.02730.02.00.03.0
8953420071NYN8221300.00.00.003.00.00.00.0

21699 rows × 16 columns

baseball_df.query('g>100.0')[cols]
yearstintteamgabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL510745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS21064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL411047411513010.00.0NaN18NaN6.0NaNNaN
...................................................
8948120071SLN11736539921253.00.02.04175.00.02.03.0
8948920071NYN139538711392487.04.00.052118.011.00.06.0
8952120071SFN12634075942866.05.00.013254.03.00.02.0
8952320071HOU141517681301050.04.03.023112.03.07.05.0
8953020071HOU1173493882325.06.01.03774.06.04.01.0

7554 rows × 16 columns

#也可以这样:

baseball_df[cols].query('g>100')
yearstintteamlggabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2AA103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL5AA10745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS2AA1064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2AA112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL4AA11047411513010.00.0NaN18NaN6.0NaNNaN
......................................................
8948120071SLNNL11736539921253.00.02.04175.00.02.03.0
8948920071NYNNL139538711392487.04.00.052118.011.00.06.0
8952120071SFNNL12634075942866.05.00.013254.03.00.02.0
8952320071HOUNL141517681301050.04.03.023112.03.07.05.0
8953020071HOUNL1173493882325.06.01.03774.06.04.01.0

7554 rows × 17 columns

使用difference方法返回索引的差集

baseball_df.columns
Index(['id', 'year', 'stint', 'team', 'lg', 'g', 'ab', 'r', 'h', 'X2b', 'X3b',
       'hr', 'rbi', 'sb', 'cs', 'bb', 'so', 'ibb', 'hbp', 'sh', 'sf', 'gidp'],
      dtype='object')
cols = baseball_df.columns.difference(['ibb','gidp','X2b','X3b','id'],sort=False) 
#sort=False,否则返回的df的columns会重新排序
cols
Index(['year', 'stint', 'team', 'lg', 'g', 'ab', 'r', 'h', 'hr', 'rbi', 'sb',
       'cs', 'bb', 'so', 'hbp', 'sh', 'sf'],
      dtype='object')
baseball_df.query('g>100.0')[cols]
yearstintteamlggabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2AA103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL5AA10745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS2AA1064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2AA112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL4AA11047411513010.00.0NaN18NaN6.0NaNNaN
......................................................
8948120071SLNNL11736539921253.00.02.04175.00.02.03.0
8948920071NYNNL139538711392487.04.00.052118.011.00.06.0
8952120071SFNNL12634075942866.05.00.013254.03.00.02.0
8952320071HOUNL141517681301050.04.03.023112.03.07.05.0
8953020071HOUNL1173493882325.06.01.03774.06.04.01.0

7554 rows × 17 columns

pp.pprint(baseball_df)
              id  year  stint team   lg    g   ab   r    h  X2b  ...   rbi  \
No.                                                              ...         
4      ansonca01  1871      1  RC1  NaN   25  120  29   39   11  ...  16.0   
44     forceda01  1871      1  WS3  NaN   32  162  45   45    9  ...  29.0   
68     mathebo01  1871      1  FW1  NaN   19   89  15   24    3  ...  10.0   
99     startjo01  1871      1  NY2  NaN   33  161  35   58    5  ...  34.0   
102    suttoez01  1871      1  CL1  NaN   29  128  35   45    3  ...  23.0   
...          ...   ...    ...  ...  ...  ...  ...  ..  ...  ...  ...   ...   
89525  benitar01  2007      2  FLO   NL   34    0   0    0    0  ...   0.0   
89526  benitar01  2007      1  SFN   NL   19    0   0    0    0  ...   0.0   
89530  ausmubr01  2007      1  HOU   NL  117  349  38   82   16  ...  25.0   
89533   aloumo01  2007      1  NYN   NL   87  328  51  112   19  ...  49.0   
89534  alomasa02  2007      1  NYN   NL    8   22   1    3    1  ...   0.0   

        sb   cs  bb    so  ibb  hbp   sh   sf  gidp  
No.                                                  
4      6.0  2.0   2   1.0  NaN  NaN  NaN  NaN   NaN  
44     8.0  0.0   4   0.0  NaN  NaN  NaN  NaN   NaN  
68     2.0  1.0   2   0.0  NaN  NaN  NaN  NaN   NaN  
99     4.0  2.0   3   0.0  NaN  NaN  NaN  NaN   NaN  
102    3.0  1.0   1   0.0  NaN  NaN  NaN  NaN   NaN  
...    ...  ...  ..   ...  ...  ...  ...  ...   ...  
89525  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89526  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89530  6.0  1.0  37  74.0  3.0  6.0  4.0  1.0  11.0  
89533  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0  
89534  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0  

[21699 rows x 22 columns]

删除不需要的列

pp.pprint(baseball_df.drop(['ibb','gidp','X2b','X3b','id','lg'],axis=1))
       year  stint team    g   ab   r    h  hr   rbi   sb   cs  bb    so  hbp  \
No.                                                                             
4      1871      1  RC1   25  120  29   39   0  16.0  6.0  2.0   2   1.0  NaN   
44     1871      1  WS3   32  162  45   45   0  29.0  8.0  0.0   4   0.0  NaN   
68     1871      1  FW1   19   89  15   24   0  10.0  2.0  1.0   2   0.0  NaN   
99     1871      1  NY2   33  161  35   58   1  34.0  4.0  2.0   3   0.0  NaN   
102    1871      1  CL1   29  128  35   45   3  23.0  3.0  1.0   1   0.0  NaN   
...     ...    ...  ...  ...  ...  ..  ...  ..   ...  ...  ...  ..   ...  ...   
89525  2007      2  FLO   34    0   0    0   0   0.0  0.0  0.0   0   0.0  0.0   
89526  2007      1  SFN   19    0   0    0   0   0.0  0.0  0.0   0   0.0  0.0   
89530  2007      1  HOU  117  349  38   82   3  25.0  6.0  1.0  37  74.0  6.0   
89533  2007      1  NYN   87  328  51  112  13  49.0  3.0  0.0  27  30.0  2.0   
89534  2007      1  NYN    8   22   1    3   0   0.0  0.0  0.0   0   3.0  0.0   

        sh   sf  
No.              
4      NaN  NaN  
44     NaN  NaN  
68     NaN  NaN  
99     NaN  NaN  
102    NaN  NaN  
...    ...  ...  
89525  0.0  0.0  
89526  0.0  0.0  
89530  4.0  1.0  
89533  0.0  3.0  
89534  0.0  0.0  

[21699 rows x 16 columns]

使用 filter 方法

baseball_df.filter(items=['year','g'])
yearg
No.
4187125
44187132
68187119
99187133
102187129
.........
89525200734
89526200719
895302007117
89533200787
8953420078

21699 rows × 2 columns

baseball_df.filter(regex='b$') #选择列名中包含b 的列
abX2bX3bsbbbibb
No.
41201136.02NaN
44162948.04NaN
6889312.02NaN
99161514.03NaN
102128373.01NaN
.....................
895250000.000.0
895260000.000.0
895303491636.0373.0
895333281913.0275.0
8953422100.000.0

21699 rows × 6 columns

baseball_df.filter(like='ea')
yearteam
No.
41871RC1
441871WS3
681871FW1
991871NY2
1021871CL1
.........
895252007FLO
895262007SFN
895302007HOU
895332007NYN
895342007NYN

21699 rows × 2 columns


  • 2
    点赞
  • 20
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

Python草堂

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值