pandas之基本功能

最新推荐文章于 2024-06-16 20:03:01 发布

pyStar_公众号

最新推荐文章于 2024-06-16 20:03:01 发布

阅读量5.1k

点赞数

分类专栏：数据分析 pandas基础文章标签： pandas python 重新索引 reindex 删除指定轴上的项

本文链接：https://blog.csdn.net/qq_42413820/article/details/80711211

版权

数据分析同时被 2 个专栏收录

14 篇文章 0 订阅

订阅专栏

pandas基础

7 篇文章 1 订阅

订阅专栏

pandas 的官方文档：

http://pandas.pydata.org/pandas-docs/stable/indexing.html

1. 重新索引

作用：创建一个适应新索引的新对象，会根据新索引对原数据进行重排，如果是新引入的索引，则会引入缺失值(也可用 fill_value 指定填充值)。

reindex 的函数参数：

index	New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying
method	Interpolation (fill) method, see table for options.
fill_value	Substitute value to use when introducing missing data by reindexing
limit	When forward- or backfilling, maximum size gap to fill
level	Match simple Index on level of MultiIndex, otherwise select subset of
copy	Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data).

In [49]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [50]: obj
Out[50]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [51]: obj.reindex(['a','b','c','d','e'])  # obj.reindex(['a','b','c','d','e']，fill_value=0) 
Out[51]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN     # e  0
dtype: float64

对于有序的索引序列，在重新索引时，我们可以用 method 选项进行前后填充值：

In [56]: obj1 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [57]: obj1.reindex(range(6),method='ffill')
Out[59]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

reindex 的(插值)method选项：

ffill or pad	Fill (or carry) values forward
bfill or backfill	Fill (or carry) values backward

对于 Dataframe 可以单独重新指定 index 和 columns，也可以同时指定，默认是重新索引行。

Dataframe 中的插值只能应用在行上(即轴0)。

n [64]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
    ...:                      columns=['Ohio', 'Texas', 'California'])
In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'],columns=states)
In [82]: frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=states) 
                                       # python数据分析书中这句代码是：
                                       # frme.reindex(index=['a', 'b', 'c', 'd'], method='ffill',columns=states) 
                                       # 由于版本的原因执行这句代码可能会出错

python数据分析书上利用 ix 的标签索引功能，这个在未来可能会废弃掉：

In[87]:frame.ix[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:Passing list-likes to 
.loc or [] with any missing label will raiseKeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[87]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
In [88]: frame.loc[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raiseKeyError in the future,
you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[88]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

2. 删除指定轴上的项

drop 方法返回的是一个删除指定轴上的项后的新对象。

In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])
Out[98]:
a    0.0
b    1.0
e    4.0
dtype: float64

对于 Dataframe 可以删除任意轴上的索引值：

In [99]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    ...: ....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
    ...: ....: columns=['one', 'two', 'three', 'four'])
In [100]: data
Out[100]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [101]: data.drop(['Colorado', 'Ohio'])
Out[101]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [102]: data.drop('two', axis=1)
Out[102]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
In [103]: data.drop(['two', 'four'], axis=1)
Out[103]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

3.索引、选取、过滤

Series 的类似于numpy 数组的索引：

In [102]: obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]
In [108]: obj[obj < 2]

利用标签进行索引和赋值(其末端包含)：

In [110]: obj['b':'c'] = 5

对于 Dataframe 进行索引就是选取一个或多个列：

In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
.....: index=['Ohio', 'Colorado', 'Utah', 'New York'],
.....: columns=['one', 'two', 'three', 'four'])
In [114]: data['two'] 
In [115]: data[['three', 'one']]

通过切片或布尔型数组选取行：

In [116]: data[:2]    
In [117]: data[data['three'] > 5]

通过布尔型进行索引：

In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data

用 ix 进行索引列和行(未来可能废除，改用其他方法，例：loc、iloc)：

http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated

In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]

Dataframe 的索引选项：

obj.ix[val]	Selects single row of subset of rows from the DataFrame.
obj.ix[:, val]	Selects single column of subset of columns.
obj.ix[val1, val2]	Select both rows and columns.
reindex	Conform one or more axes to new indexes.
xs	Select single row or column as a Series by label.
icol, irow	Select single column or row, respectively, as a Series by integer location.
get_value, set_value	Select single value by row and column label.

4. 算术运算

对象相加时，结果索引是每个对象的索引的并集，对于不重叠的索引，其值会填充 NA(可以指定填充值)：

In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2
Out[128]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [129]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
     ...: index=['Ohio', 'Texas', 'Colorado'])
In [130]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [131]: df1 + df2
Out[131]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

In [132]: df1.add(df2,fill_value=0)
Out[132]:
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0

Method Description

add	Method for addition (+)
sub	Method for subtraction (-)
div	Method for division (/)
mul	Method for multiplication (*)

Dataframe 和 Series 之间的运算

这两者之间的运算涉及到了广播的知识，以后会有介绍广播相关的知识。一维二维的广播都比较容易理解。

In [153]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [154]: series = frame.iloc[0]
In [155]: frame - series        # 嗯，大概就是这样，理解一下
Out[155]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

如果你希望匹配行且在列上广播，必须使用算数运算：

In [160]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [161]: series1 = frame['d']
In [162]: frame.sub(series1,axis=0)
Out[162]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

5. 函数应用于映射

numpy 的 ufuncs (元素级数组方法)也可用于操作pandas对象：

In [164]: frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [167]: frame
Out[167]:
               b         d         e
Utah   -0.637896  0.509292 -0.919939
Ohio   -0.604495  0.298296 -0.377575
Texas  -0.710751 -0.091902  0.607375
Oregon  0.576612  1.664728  0.264065
In [168]: np.abs(frame)
Out[168]:
               b         d         e
Utah    0.637896  0.509292  0.919939
Ohio    0.604495  0.298296  0.377575
Texas   0.710751  0.091902  0.607375
Oregon  0.576612  1.664728  0.264065

也可用apply方法把函数应用到由各列或各行形成的一维数组上：

In [172]: f = lambda x: x.max() - x.min()
In [173]: frame.apply(f)
Out[173]:
b    1.287363
d    1.756631
e    1.527314
dtype: float64

也可返回多个值组成的Series：

In [176]: def f(x):
     ...:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [177]: frame.apply(f)
Out[177]:
            b         d         e
min -0.710751 -0.091902 -0.919939
max  0.576612  1.664728  0.607375

元素级的python函数也是可用的，使用applymap方法：

In [179]: format = lambda x: '%.2f' % x
In [180]: frame.applymap(format)
Out[180]:
            b      d      e
Utah    -0.64   0.51  -0.92
Ohio    -0.60   0.30  -0.38
Texas   -0.71  -0.09   0.61
Oregon   0.58   1.66   0.26

6. 排序与排名

对行或列索引进行排序可以使用 sort_index 方法

对Series安值进行排序，可使用sort_values方法，若某个索引缺失值，则会被放到末尾

In [183]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [187]: obj.sort_index()    # obj.sort_index(ascending=False)  降序
In [189]: obj.sort_values()   # obj.sort_values(ascending=False)  降序

对于Dataframe 可以根据任意轴上的索引进行排序，默认是升序，也可降序排序：

In [196]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
     ...:  columns=['d', 'a', 'b', 'c'])
In [197]: frame.sort_index()
Out[197]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [198]: frame.sort_index(axis=1)
Out[198]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
In [199]: frame.sort_index(axis=1, ascending=False)
Out[199]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

在 Dataframe 上还可以使用 by 关键字，根据一或多列的值进行排序：

In [203]: frame.sort_values(by='b')    # FutureWarning: by argument to sort_index
                                       # is deprecated, please use .sort_values(by=...)
Out[203]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
In [204]: frame.sort_values(by=['a','b'])
Out[204]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

注意：对DataFrame的值进行排序的时候，我们必须要使用by指定某一行（列）或者某几行（列），
     如果不使用by参数进行指定的时候，就会报TypeError: sort_values() missing 1 required positional argument: 'by'。
     使用by参数进行某几列（行）排序的时候，以列表中的第一个为准，可能后面的不会生效，因为有的时候无法做到既对第一行（列）
     进行升序排序又对第二行（列）进行排序。在指定行值进行排序的时候，必须设置axis=1，不然会报错，因为默认指定的是列索引，
     找不到这个索引所以报错，axis=1的意思是指定行索引。

排名：

排名和排序有点类似，排名会有一个排名值（从1开始，一直到数组中有效数据的数量），它与numpy.argsort的间接排序索引差不多，只不过它可以根据某种规则破坏平级关系。

In [214]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])    # 下标对应 (0, 1, 2, 3, 4, 5, 6)
In [215]: obj.rank()        # 默认是根据值的大小进行平均排名  
Out[215]:                
0    6.5                    #  7 最大 由于有两个 7 ，所以排名为 6,7 名，平均排名 6.5
1    1.0                    # -5 最小 对应下标为 1 ，排在第一
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

根据值在数组中出现的顺序进行排名：

In [216]: obj.rank(method='first')  # 也可以按降序排名  obj.rank(ascending=False,method='max') 按照分组的最大排名排序
Out[216]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

也可以指定轴进行排名：

In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
     ...:                       'c': [-2, 5, 8, -2.5]})
In [220]: frame
Out[220]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
In [221]: frame.rank(axis=1)
Out[221]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0

排名时用于破坏平级关系的method选项：

average	Default: assign the average rank to each entry in the equal group
min	Use the minimum rank for the whole group
max	Use the maximum rank for the whole group
first	Assign ranks in the order the values appear in the data

7. 汇总和计算描述统计

可以指定对行或列进行统计。统计时默认会跳过 NA 值，也可以用 skipna 指定不跳过。

In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
     ...: [np.nan, np.nan], [0.75, -1.3]],
     ...: index=['a', 'b', 'c', 'd'],
     ...: columns=['one', 'two'])

In [225]: df
Out[225]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
In [227]: df.sum()
Out[227]:
one    9.25
two   -5.80
dtype: float64
In [228]: df.sum(axis=1)
Out[228]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [229]: df.sum(axis=1,skipna=False)
Out[229]:
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

约简方法的选项
Method Description

axis	Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna	Exclude missing values, True by default.
level	Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex).

描述和汇总统计
Method Description

count	Number of non-NA values
describe	Compute set of summary statistics for Series or each DataFrame column
min, max	Compute minimum and maximum values
argmin, argmax	Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax	Compute index values at which minimum or maximum value obtained, respectively
quantile	Compute sample quantile ranging from 0 to 1
sum	Sum of values
mean	Mean of values
median	Arithmetic median (50% quantile) of values
mad	Mean absolute deviation from mean value
var	Sample variance of values
std	Sample standard deviation of values
skew	Sample skewness (3rd moment) of values
kurt	Sample kurtosis (4th moment) of values
cumsum	Cumulative sum of values
cummin, cummax	Cumulative minimum or maximum of values, respectively
cumprod	Cumulative product of values
diff	Compute 1st arithmetic difference (useful for time series)
pct_change	Compute percent changes

8.唯一值，值计数，成员资格

In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique()      # 唯一值
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts()        # 计数值
Out[234]:
c    3
a    3
b    2
d    1
dtype: int64
In [235]: mask = obj.isin(['b', 'c'])         # 成员关系
In [236]: mask
Out[236]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [237]:  obj[mask]
Out[237]:
0    c
5    b
6    b
7    c
8    c
dtype: object

Method Description

isin	Compute boolean array indicating whether each Series value is contained in the passed sequence of values.
unique	Compute array of unique values in a Series, returned in the order observed.
value_counts	Return a Series containing unique values as its index and frequencies as its values, ordered count in descending order.

将pandas的 pandas.value_counts 传给 Dataframe 的 apply 函数：

In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
     ...: 'Qu2': [2, 3, 1, 2, 3],
     ...: 'Qu3': [1, 5, 2, 4, 4]})
In [240]: data
Out[240]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result
Out[242]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

9.缺失值的处理

pandas 中使用浮点值 NAN (Not a Number) 表示浮点和非浮点数组中的缺失数据，只是一种便于表示的标记。

NA处理的方法：

dropna	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna	Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull	Return like-type object containing boolean values indicating which values are missing / NA.
notnull	Negation of isnull.

过滤缺失数据：dropna

对于 Series ，dropna()仅仅返回一个非空数据和索引值的 Series：
In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1,NA,4,NA,5])
In [8]: data.dropna()           # 也可以通过bool索引达到此目的：data[data.notnull()]
Out[8]: 
0    1.0
2    4.0
4    5.0
dtype: float64

对于 Dataframe ，dropna 默认丢弃任何含有缺失值的行，传入参数 how='all' ，只丢弃全为NA的行。要丢弃为NA的行，传入参数 axis=1，即可。参数 thresh 可以保留部分数据。

填充缺失数据：fillna

In [9]: df = pd.DataFrame(np.random.randn(7,3))
In [10]: df
Out[10]: 
          0         1         2
0 -1.405991 -1.032070 -0.421601
1  0.878711 -0.786235  1.483527
2 -0.082090 -0.163028 -0.718293
3 -0.576532  0.229013  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [12]: df.iloc[:4,1] = NA
In [13]: df.iloc[:2,2] = NA
In [14]: df
Out[14]: 
          0         1         2
0 -1.405991       NaN       NaN
1  0.878711       NaN       NaN
2 -0.082090       NaN -0.718293
3 -0.576532       NaN  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [15]: df.fillna(0)                 # 用任意数去填充所有的缺失值
Out[15]: 
          0         1         2
0 -1.405991  0.000000  0.000000
1  0.878711  0.000000  0.000000
2 -0.082090  0.000000 -0.718293
3 -0.576532  0.000000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [20]: df.fillna({1:0.5,2:1.})    # 传入一个字典，可以实现对列中缺失值的填充
Out[20]: 
          0         1         2
0 -1.405991  0.500000  1.000000
1  0.878711  0.500000  1.000000
2 -0.082090  0.500000 -0.718293
3 -0.576532  0.500000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

fillna 默认总会产生一个新的对象，用inplace 参数实现就地修改。

用 method 参数填充数据：

In [27]: df.fillna(method='bfill')
Out[27]: 
          0         1         2
0 -1.405991  0.547743 -0.718293
1  0.878711  0.547743 -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [28]: df.fillna(method='bfill',limit=2)
Out[28]: 
          0         1         2
0 -1.405991       NaN -0.718293
1  0.878711       NaN -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

fillna 参数：

value	Scalar value or dict-like object to use to fill missing values
method	Interpolation, by default 'ffill' if function called with no other arguments
axis	Axis to fill on, default axis=0
inplace	Modify the calling object without producing a copy
limit	For forward and backward filling, maximum number of consecutive periods to fill

10. 层次化索引

能在一个轴上拥有多个索引级别，能以低维度形式处理高纬度数据。

创建一个层次化索引的 Series：

 In [31]: data = pd.Series(np.random.randn(10),
    ...: index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
    ...: [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [32]: data
Out[32]: 
a  1   -0.413827
   2    0.660228
   3    0.209686
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64

In [33]: data.index
Out[33]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

索引：

In [34]: data['a']
Out[34]: 
1   -0.413827
2    0.660228
3    0.209686
dtype: float64
In [35]: data[:,2]              # 实现内层索引
Out[35]: 
a    0.660228
b   -0.982985
c   -2.023760
d    0.989250dtype: float64
In [36]: data['b':'d']         # 实现切片索引
Out[36]: 
b  1   -0.361603   
2   -0.982985 
  3   -0.267620
c  
1   -1.130506   
2   -2.023760
d  
2    0.989250   
3    1.074886
dtype: float64
In [37]: data.loc[['b','c']] 
Out[37]: 
b  
1   -0.361603  
2   -0.982985   
3   -0.267620
c  
1   -1.130506   
2   -2.023760
dtype: float64

层次化索引可以通过 unstack 方法生成 Dataframe 数据：

In [38]: data.unstack()
Out[38]: 
          1         2         3
a -0.413827  0.660228  0.209686
b -0.361603 -0.982985 -0.267620
c -1.130506 -2.023760       NaN
d       NaN  0.989250  1.074886

In [39]: data.unstack().unstack()
Out[39]: 
1  a   -0.413827
   b   -0.361603
   c   -1.130506
   d         NaN
2  a    0.660228
   b   -0.982985
   c   -2.023760
   d    0.989250
3  a    0.209686
   b   -0.267620
   c         NaN
   d    1.074886
dtype: float64

In [42]: data.unstack().stack()        # stack 是unstack的逆运算
Out[42]: 
a  1   -0.413827
   2    0.660228
   3    0.209686
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64

对于 Dataframe 每条轴都可以有层次化索引，每个索引还都可以有名字：

In [44]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
    ...:  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    ...:  columns=[['Ohio', 'Ohio', 'Colorado'],
    ...:  ['Green', 'Red', 'Green']]
    ...: )
  
In [45]: frame
Out[45]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
In [46]: frame.index.names = ['key1', 'key2']
In [47]: frame.columns.names = ['state', 'color']
In [48]: frame
Out[48]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

swaplevel() : 调整某条轴上各级别的顺序；sort_index(): 对各级别上的数据值进行排序

In [49]: frame.swaplevel('key1','key2')
Out[49]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

In [51]: frame.swaplevel(0,1)
Out[51]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

In [53]: frame.swaplevel(0,1,axis=1)
Out[53]: 
color     Green  Red    Green
state      Ohio Ohio Colorado
key1 key2
a    1        0    1        2
     2        3    4        5
b    1        6    7        8
     2        9   10       11

In [54]: frame.sortlevel(1)
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning: sortlevel is
 deprecated, use sort_index(level= ...)
Out[54]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [55]: frame.sort_index(level=1)
Out[55]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [56]: frame.sort_index(level=1,axis=1)
Out[56]: 
state     Colorado  Ohio
color        Green Green Red
key1 key2
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10

有的时候我们想要把 Dataframe 的某列或某几列当做 Dataframe 的索引：

In [59]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
    ...:  'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    ...: 'd': [0, 1, 2, 0, 1, 2, 3]})

In [60]: frame
Out[60]: 
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

In [61]: frame1 = frame.set_index(['c','d'])

In [62]: frame1
Out[62]: 
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

默认情况下，被当做索引的列会被移除，也可通过drop=False保存下来：

In [63]: frame.set_index(['c','d'],drop=False)
Out[63]: 
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

reset_index的作用跟 set_index 正好相反：

In [64]: frame1.reset_index()
Out[64]: 
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1

pyStar_公众号

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
pandas之基本功能

pandas 的官方文档：http://pandas.pydata.org/pandas-docs/stable/indexing.html1. 重新索引作用：创建一个适应新索引的新对象，会根据新索引对原数据进行重排，如果是新引入的索引，则会引入缺失值(也可用 fill_value 指定填充值)。reindex 的函数参数：indexNew sequence to use as inde...
复制链接

扫一扫

专栏目录