pandas notes: indexing and selection in pandas

-柚子皮-

http://blog.csdn.net/pipisorry/article/details/18012125

Retrieval / selection

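For indexing and selection, it is best to always use loc (indexing by index value or label, which may include integer labels) or iloc (positional indexing, as in Python), especially when modifying the original data of a df. The reason is the distinction between views and explicit copies discussed at the very end.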
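As a minimal sketch of that advice (the frame and its column names `a` and `b` are made up for illustration), a conditional update can be written as a single loc call, so that the assignment goes to the original frame rather than to an intermediate object:

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# One loc call: the row selector (a boolean mask) and the column label together.
df.loc[df["a"] > 1, "b"] = 0

updated_b = df["b"].tolist()
```

Here `updated_b` holds the values of column `b` after the in-place update.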

As with a Series, a column of a DataFrame can be retrieved with dict-style notation or as an attribute; either way a Series is returned.

DataFrame column selection

df[*] selects the column whose label is *, equivalent to df.loc[:, *]. Here * stands for a number or a string.

df[0]  # selects the column labeled 0; raises KeyError if the column labels are not the default 0..n-1 integers
df['year']  # or: df.year

Note: the returned Series has the same index as the DataFrame, and its name attribute is set correctly as well.
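A small sketch of the two access styles, using made-up data (the `year` and `state` columns are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["a", "b", "c"],
)

by_key = df["year"]   # dict-style access, works for any column label
by_attr = df.year     # attribute access, only for labels that are valid identifiers

same_values = by_key.equals(by_attr)          # both return the same Series
series_name = by_key.name                     # the column label becomes the name
shares_index = by_key.index.equals(df.index)  # the row index is carried over
```

Attribute access is convenient interactively, but dict-style access is the safer choice when a column label could clash with a DataFrame method name or contains characters such as `-`.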

Selecting multiple DataFrame columns

1 Selecting several separate columns

lines = lines[[0, 1, 4]]  # selects the columns labeled 0, 1 and 4; raises KeyError if the column labels are not the default 0..n-1 integers

lines = lines[['user', 'check-in_time', 'location_id']]

2 Selecting a contiguous range of columns

For example, to select all columns except the last one:

df.iloc[:,0:-1]

or df[df.columns[:-1]]

or df.loc[:, df.columns[:-1]]

~~or df.ix[:,0:-1]~~

df = df[[c for i, c in enumerate(df.columns) if i < a or i >= b]]  (keeps every column except those at positions a through b-1, i.e. it drops the columns between a and b)
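The variants above can be compared side by side. The following is a sketch on a made-up 4x5 frame; the column labels and the values of `a` and `b` are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with five columns labeled 'a'..'e'.
df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))

# Three ways to take every column except the last one.
by_position = df.iloc[:, 0:-1]
by_bracket = df[df.columns[:-1]]
by_label = df.loc[:, df.columns[:-1]]

# The list comprehension with `i < a or i >= b` keeps the columns whose
# position lies outside the half-open range [a, b).
a, b = 1, 3
kept = df[[c for i, c in enumerate(df.columns) if i < a or i >= b]]
kept_columns = list(kept.columns)
```

With `a, b = 1, 3` the columns at positions 1 and 2 are left out, so `kept_columns` contains `a`, `d` and `e`.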

3 Selecting the last column of a DataFrame

df.iloc[:, -1]

or df[df.columns[-1]]

or df.loc[:, df.columns[-1]]

~~or df.ix[:,-1]~~

4 Selecting columns by dtype
DataFrame.select_dtypes(include=None, exclude=None)

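e.g. df.select_dtypes(include=object, exclude=None) selects all columns of dtype object (typically the str columns). However, the result is a copy of those columns, so operating on it has no effect on the original df, which is rather awkward.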
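One possible workaround, sketched here under the assumption that only the column labels are needed from select_dtypes, is to take `.columns` from the result and then assign through the original frame. The data and column names are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame({"name": ["ann", "bob"], "age": [30, 41], "score": [1.5, 2.5]})

# Use select_dtypes only to find the labels of the numeric columns ...
num_cols = df.select_dtypes(include="number").columns

# ... and write the new values back through the original frame.
df[num_cols] = df[num_cols] * 2
```

After this, the numeric columns of `df` itself are doubled and the `name` column is untouched.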

DataFrame row selection

Rows can be retrieved by position (num) or by name (label) using several methods.

import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
A B C D
2013-01-01 2.036209 1.354010 -0.677409 -0.331978
2013-01-02 -1.403797 -1.094992 0.304359 -1.576272
2013-01-03 1.137673 0.636973 -0.746928 -0.606468
2013-01-04 0.833169 -2.575147 0.866364 1.337163

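df[num:num2] selects a range of rows. With an integer slice it is equivalent to df.iloc[num:num2] (end excluded); with a slice of label strings it behaves like df.loc and includes both endpoints. The argument must be a slice, unlike a Series, where a single number or string also works: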

>>> df[1:3]

>>> df.iloc[1:3]

>>> df['2013-01-02':'2013-01-03']

A B C D
2013-01-02 -1.403797 -1.094992 0.304359 -1.576272
2013-01-03 1.137673 0.636973 -0.746928 -0.606468
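The slicing behaviour can be checked with a deterministic frame. This sketch uses np.arange instead of random values so that the results are reproducible:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Integer slice: positional, end excluded (rows at positions 1 and 2).
positional = df[1:3]
same_with_iloc = df.iloc[1:3]

# String slice: label based, both endpoints included.
by_label = df["2013-01-02":"2013-01-03"]
same_with_loc = df.loc["2013-01-02":"2013-01-03"]
```

Both slices happen to return the same two rows here, but for different reasons: the integer slice stops before position 3, while the label slice includes its end label.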

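Series row selection

Indexing techniques for time series data

The most basic time series type in pandas is a Series whose index elements are timestamps (Timestamp).

[pandas time series analysis and processing: Timeseries]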

ix {row selection + row/column selection}. .ix is deprecated and was removed in pandas 1.0; use .loc or .iloc instead.

~~frame2.ix['three']~~

df.ix[3]
A -0.976627
B 0.766333
C -1.043501
D 0.554586
Name: 2013-01-04 00:00:00, dtype: float64

~~Suppose we need the first 5 rows of the first column:~~

df.ix[:,0].head()
>>> df.ix[1:3, 0:3]  # equivalent to df.ix[1:3, ['A', 'B', 'C']]
A B C

~~2013-01-02 -1.403797 -1.094992 0.304359~~

~~2013-01-03 1.137673 0.636973 -0.746928~~
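Since .ix is no longer available in current pandas, the calls above can be rewritten with .iloc and .loc. The following is a sketch of one possible translation, on a deterministic frame rather than the random one above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                       # positional counterpart of df.ix[3]
first_col_head = df.iloc[:, 0].head()  # counterpart of df.ix[:, 0].head()

# df.ix[1:3, 0:3] by position ...
block_by_position = df.iloc[1:3, 0:3]
# ... and the same block by label (label slices include the end label).
block_by_label = df.loc[dates[1]:dates[2], ["A", "B", "C"]]
```

The positional and the label-based block select the same rows and columns; which form to use depends on whether positions or labels are what the code is meant to express.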

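Selection by Label: selecting rows by label only, with loc

.loc is for label-based indexing. When [] contains only a single label or a list of labels, these are row labels. A missing single label raises KeyError; a list containing missing labels used to return NaN rows with a FutureWarning (as in the example further down) and raises KeyError since pandas 1.0.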

For getting a cross section using a label

In [26]: df.loc[dates[0]]
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

df = pd.DataFrame(np.random.randn(6,4), index=list('ABCEFG'), columns=list('ABCD'))
A B C D
A 0.050076 1.227450 0.164633 0.984434
B 0.033011 -0.306205 -0.937499 -0.010431
C -0.738045 0.342813 -0.793019 -0.215236
E -0.030727 2.236805 -0.059536 1.186334
F -1.721832 0.467983 1.939437 -0.786999
G 0.936548 -0.981672 1.108834 0.383627

df.loc['A']
A 0.050076
B 1.227450
C 0.164633
D 0.984434
Name: A, dtype: float64
df.loc[['A','D']]
/Applications/PyCharm.app/Contents/helpers/pydev/pydevconsole.py:1: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here:https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike '''
A B C D
A 0.050076 1.22745 0.164633 0.984434
D NaN NaN NaN NaN
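The FutureWarning above points to .reindex() as the alternative. A minimal sketch of how a possibly missing label can be handled, assuming pandas 1.0 or later (where the list lookup raises KeyError instead of warning), with a deterministic frame in place of the random one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4), index=list("ABCEFG"), columns=list("ABCD"))

# 'D' is not a row label. On pandas >= 1.0 this lookup raises KeyError.
try:
    df.loc[["A", "D"]]
    raised = False
except KeyError:
    raised = True

# reindex keeps the requested labels and fills the missing row with NaN.
reindexed = df.reindex(["A", "D"])

# Alternatively, keep only the labels that actually exist.
present = df.loc[df.index.intersection(["A", "D"])]
```

`reindexed` reproduces the NaN row shown in the old output, while `present` contains only row `A`.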

Selecting on a multi-axis by label

In [27]: df.loc[:,['A','B']]
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

[Selection by Label]

Selection by Position: iloc

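.iloc is for positional indexing: it retrieves data by integer position, and a single number or a numeric range selects a row (or a range of rows) by default. Select via the position of the passed integers. The difference from ix, [] and at is that iloc[3] selects the row at position 3 (the fourth row, counting from 0), whereas loc[3] (and ix[3] on an integer index) selects the row whose index label is 3!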

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

A B C D
2013-01-01 -1.045103 0.707092 0.144719 0.433786
2013-01-02 0.462774 -0.015817 0.257240 1.180046
2013-01-03 -1.685669 -0.213246 -1.094254 0.353149
2013-01-04 -0.529709 0.230746 0.080492 0.151858
2013-01-05 -0.313612 -0.588399 3.063299 2.316896
2013-01-06 0.715335 0.204240 0.303301 0.254586

df.iloc[3]
A -0.529709
B 0.230746
C 0.080492
D 0.151858
Name: 2013-01-04 00:00:00, dtype: float64

df.iloc[2:3]
A B C D
2013-01-03 -1.685669 -0.213246 -1.094254 0.353149

By integer slices, acting similar to numpy/python

In [33]: df.iloc[3:5,0:2]
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

By lists of integer position locations, similar to the numpy/python style

In [34]: df.iloc[[1,2,4],[0,2]]
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

[Selection by Position]
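The difference between position and label is easiest to see when the index consists of integers that are not in the default order. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame whose integer index labels run in reverse order.
df = pd.DataFrame({"v": [10, 20, 30, 40]}, index=[3, 2, 1, 0])

by_position = df.iloc[3, 0]  # row at position 3 (the last row)
by_label = df.loc[3, "v"]    # row whose index label is 3 (the first row)

# Lists of positions work the same way as in the examples above.
picked = df.iloc[[0, 2], [0]]
picked_labels = list(picked.index)
```

Here `by_position` is 40 and `by_label` is 10, even though both calls are given the number 3.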

[