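Introduction to pandas data structures
The sessions below assume the usual imports: from pandas import Series, DataFrame; import pandas as pd; import numpy as np; from numpy import nan as NA
- A Series is a one-dimensional array-like object made up of a sequence of values and an associated set of data labels (its index)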
In [3]: obj = Series([4,7,-5,3])
In [4]: obj
Out[4]:
0 4
1 7
2 -5
3 3
dtype: int64
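- If the data is stored in a Python dict, a Series can be created from that dict; when an index is passed as well, labels with no matching key get NaN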
In [11]: states = ['Californie','Ohio','Oregon','Texas']
In [14]: obj4 = Series(sdata,index=states)
In [15]: obj4
Out[15]:
Californie NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
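The session above never shows how sdata was defined. As a minimal sketch, assuming a hypothetical dict whose values match the printed output (the 'Utah' entry is an illustration only), the label matching can be reproduced like this:

```python
import pandas as pd

# Hypothetical dict: the notes never show how sdata was defined.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['Californie', 'Ohio', 'Oregon', 'Texas']

# Passing index= selects and orders the dict's entries by label.
obj4 = pd.Series(sdata, index=states)

# 'Californie' is not a key of sdata, so it becomes NaN (which forces a
# float64 dtype); 'Utah' is not in states, so it is left out.
print(obj4)
print(obj4.isnull())
```

The point of the sketch is that construction is driven by the index labels, not by the dict's own key order.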
- Both the Series object itself and its index have a name attribute, which is closely tied to other key pandas functionality
In [17]: obj4.name = 'population'
In [18]: obj4.index.name = 'state'
In [19]: obj4
Out[19]:
state
Californie NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
Name: population, dtype: float64
DataFrame
- A DataFrame holds an ordered collection of columns, and each column can have a different value type
- Building a DataFrame:
In [20]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}
In [21]: frame = DataFrame(data)
In [22]: frame
Out[22]:
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
- If a sequence of columns is specified, the DataFrame's columns are arranged in that order
In [23]: DataFrame(data,columns=['year','state','pop'])
Out[23]:
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
- A DataFrame column can be retrieved as a Series, either with dict-like notation or as an attribute
In [25]: frame2
Out[25]:
year state pop deby
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [26]: frame.columns
Out[26]: Index(['state', 'year', 'pop'], dtype='object')
In [27]: frame.year
Out[27]:
0 2000
1 2001
2 2002
3 2001
4 2002
Name: year, dtype: int64
- Columns can be modified by assignment
In [33]: frame2['bebt'] = 16.5
In [34]: frame2
Out[34]:
year state pop deby bebt
one 2000 Ohio 1.5 NaN 16.5
two 2001 Ohio 1.7 NaN 16.5
three 2002 Ohio 3.6 NaN 16.5
four 2001 Nevada 2.4 NaN 16.5
five 2002 Nevada 2.9 NaN 16.5
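- A list or array can be assigned to a column; if a Series is assigned, its labels are aligned to the DataFrame's index and any holes are filled with NaN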
In [41]: val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])
In [42]: val
Out[42]:
two -1.2
four -1.5
five -1.7
dtype: float64
In [44]: frame2['debt'] = val
In [45]: frame2
Out[45]:
year state pop deby bebt debt
one 2000 Ohio 1.5 NaN 0.0 NaN
two 2001 Ohio 1.7 NaN 1.0 -1.2
three 2002 Ohio 3.6 NaN 2.0 NaN
four 2001 Nevada 2.4 NaN 3.0 -1.5
five 2002 Nevada 2.9 NaN 4.0 -1.7
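The three kinds of column assignment (scalar, Series, list) behave differently. The following is a minimal sketch, assuming frame2 was built from the data dict with the index labels one to five (inferred from the printed output); the column names flat and short are hypothetical:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# The index labels are an assumption, inferred from the printed frame2.
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five'])

# A scalar is broadcast to every row.
frame2['flat'] = 16.5

# A Series is aligned by label; rows missing from val get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val

# A plain list must have exactly len(frame2) items, otherwise pandas raises.
try:
    frame2['short'] = [1, 2, 3]
    length_checked = False
except ValueError:
    length_checked = True

print(frame2)
```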
- Assigning to a column that does not exist creates a new column; the del keyword deletes a column
In [47]: frame2['eastern'] = frame2.state == 'Ohio'
In [48]: frame2
Out[48]:
year state pop deby bebt debt eastern
one 2000 Ohio 1.5 NaN 0.0 NaN True
two 2001 Ohio 1.7 NaN 1.0 -1.2 True
three 2002 Ohio 3.6 NaN 2.0 NaN True
four 2001 Nevada 2.4 NaN 3.0 -1.5 False
five 2002 Nevada 2.9 NaN 4.0 -1.7 False
In [49]: del frame2['eastern']
In [50]: frame2
Out[50]:
year state pop deby bebt debt
one 2000 Ohio 1.5 NaN 0.0 NaN
two 2001 Ohio 1.7 NaN 1.0 -1.2
three 2002 Ohio 3.6 NaN 2.0 NaN
four 2001 Nevada 2.4 NaN 3.0 -1.5
five 2002 Nevada 2.9 NaN 4.0 -1.7
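- Another common form of data is a nested dict of dicts: the outer keys become the columns and the inner keys become the row index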
In [51]: pop = {'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
In [52]: frame3 = DataFrame(pop)
In [53]: frame3
Out[53]:
Nevada Ohio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
- Transposing the result:
In [54]: frame3.T
Out[54]:
2000 2001 2002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
Index objects
- pandas Index objects are responsible for holding the axis labels and other metadata
In [4]: obj = Series(range(3),index=['a','b','c'])
In [5]: index = obj.index
In [6]: index
Out[6]: Index(['a', 'b', 'c'], dtype='object')
Index objects are immutable, so users cannot modify them in place
In [7]: index[1:]
Out[7]: Index(['b', 'c'], dtype='object')
In [8]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-a452e55ce13b> in <module>()
----> 1 index[1] = 'd'
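The traceback above is cut off. A small sketch of the same behavior, catching the TypeError rather than printing it (the variable name mutated is only for illustration):

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Slicing is fine: it returns a new Index and leaves the original alone.
tail = index[1:]

# Item assignment is rejected because Index objects are immutable.
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

print(tail, mutated)
```

Immutability is what makes it safe to share one Index object among several data structures.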
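Reindexing
- An important method on pandas objects is reindex, which creates a new object conformed to a new index
reindex rearranges the data according to the new index and introduces missing values for any index labels that are not already present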
In [12]: obj2 = obj.reindex(['a','b','c','d','e'])
In [13]: obj2
Out[13]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
In [14]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[14]:
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64
- The method option fills in values while reindexing; for example, ffill forward-fills
In [15]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])
In [16]: obj3.reindex(range(6),method='ffill')
Out[16]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
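The definition of obj used by the reindex calls is not shown in these notes. The sketch below assumes a Series whose values match the printed output, and walks through the three variants (default, fill_value, and method='ffill'):

```python
import pandas as pd

# Assumed definition of obj, reconstructed from the values printed above.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# New labels are introduced as NaN by default...
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
# ...or as a constant when fill_value is given.
obj_filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

# Forward fill carries the last seen value into the gaps; it needs a
# monotonically ordered index, as here.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
colors = obj3.reindex(range(6), method='ffill')

print(obj2, obj_filled, colors, sep='\n\n')
```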
Dropping entries from an axis
- The drop method returns a new object with the specified values removed from the given axis
In [31]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])
In [32]: new_obj = obj.drop('c')
In [33]: new_obj
Out[33]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
In [34]: data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
In [35]: data
Out[35]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [36]: data.drop(['Colorado','Ohio'])
Out[36]:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
Passing axis=1 drops columns instead
In [41]: data.drop('two',axis=1)
Out[41]:
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
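Indexing, selection, and filtering
- Series indexing works analogously to NumPy array indexing, and the rows of a DataFrame can be selected with a boolean array in the same way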
In [49]: data
Out[49]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [50]: data[data['three']>5]
Out[50]:
one two three four
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Indexing with a boolean DataFrame
In [51]: data <5
Out[51]:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [52]: data[data <5] = 0
In [53]: data
Out[53]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
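- The ix indexing field selects a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels; it is deprecated (and was removed in pandas 1.0), so use loc for label-based and iloc for position-based indexing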
In [55]: data.ix[['Colorado','Utah'],[3,0,1]]
D:\Python3.5\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[55]:
four one two
Colorado 7 0 5
Utah 11 8 9
In [56]: data
Out[56]:
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [57]: data.ix[data.three>5,:3]
Out[57]:
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
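Since the deprecation warning points to loc and iloc, here is one possible translation of the two ix selections. This is a sketch of an equivalent, not the only way to write it; positional column numbers are turned into labels through data.columns:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0

# ix[['Colorado','Utah'],[3,0,1]] mixed labels (rows) and positions (columns).
# With loc, convert the column positions to labels first.
sub1 = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

# The purely positional spelling of the same selection.
sub1_pos = data.iloc[[1, 2], [3, 0, 1]]

# ix[data.three>5,:3]: boolean row mask plus the first three columns.
sub2 = data.loc[data['three'] > 5, data.columns[:3]]

print(sub1, sub2, sep='\n\n')
```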
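Arithmetic and data alignment
- Arithmetic can be performed between objects with different indexes; when objects are added, the index of the result is the union of the index pairs, and labels found in only one object yield NaN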
In [61]: s1
Out[61]:
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
In [62]: s2
Out[62]:
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
In [63]: s1 + s2
Out[63]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
In [72]: df2
Out[72]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
Adding the two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two originals
In [73]: df1
Out[73]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
In [74]: df1 + df2
Out[74]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
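Arithmetic methods with fill values
- In arithmetic between differently indexed objects, a special fill value can be substituted when an axis label is found in one object but not in the other
In [83]: df1 + df2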
Out[83]:
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
Treating the non-overlapping positions as 0
In [84]: df1.add(df2,fill_value=0)
Out[84]:
a b c d e
0 0.0 2.0 4.0 6.0 4.0
1 9.0 11.0 13.0 15.0 9.0
2 18.0 20.0 22.0 24.0 14.0
3 15.0 16.0 17.0 18.0 19.0
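The notes do not show how df1 and df2 were built for this section. Assuming the definitions below (chosen because they reproduce the printed sums), the difference between the + operator and add with fill_value looks like this:

```python
import numpy as np
import pandas as pd

# Assumed definitions; they reproduce the outputs printed above.
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain + leaves NaN wherever a label exists in only one operand.
plain = df1 + df2

# add(..., fill_value=0) treats the missing side as 0 before adding, so
# values present in only one frame pass through unchanged.
filled = df1.add(df2, fill_value=0)

print(plain, filled, sep='\n\n')
```

Note that fill_value only applies when exactly one side is missing; a position missing from both operands would still be NaN.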
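Operations between DataFrame and Series
- Consider first the difference between a two-dimensional array and one of its rows (NumPy broadcasting)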
In [89]: arr
Out[89]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
In [90]: arr[0]
Out[90]: array([0., 1., 2., 3.])
In [91]: arr-arr[0]
Out[91]:
array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
- By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns, then broadcasts down the rows
In [95]: frame
Out[95]:
d b e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [96]: series
Out[96]:
d 0.0
b 1.0
e 2.0
Name: Utah, dtype: float64
In [97]: frame - series
Out[97]:
d b e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
- To match on the row index and broadcast across the columns instead, use an arithmetic method with axis=0
In [103]: frame
Out[103]:
d b e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [104]: series3
Out[104]:
Utah 0.0
Ohio 3.0
Texas 6.0
Oregon 9.0
Name: d, dtype: float64
In [105]: frame.sub(series3,axis=0)
Out[105]:
d b e
Utah 0.0 1.0 2.0
Ohio 0.0 1.0 2.0
Texas 0.0 1.0 2.0
Oregon 0.0 1.0 2.0
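Both broadcasting directions can be checked in a few lines. The frame construction below is an assumption that matches the printed values; series and series3 are taken from its first row and its d column:

```python
import numpy as np
import pandas as pd

# Assumed construction, matching the frame printed above.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=['d', 'b', 'e'],
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

series = frame.iloc[0]     # first row, indexed by column label
series3 = frame['d']       # one column, indexed by row label

# Default: match series.index to frame.columns, repeat for every row.
by_row = frame - series

# axis=0: match series3.index to frame.index, repeat for every column.
by_col = frame.sub(series3, axis=0)

print(by_row, by_col, sep='\n\n')
```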
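Function application and mapping
- A function can be applied to the one-dimensional arrays formed by each column or each row; the DataFrame apply method does this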
In [8]: frame
Out[8]:
b d e
Utah -0.319750 0.009537 -1.435092
Ohio -0.274765 -1.271983 0.215677
Texas -1.504100 -2.480888 -2.347232
Oregon 1.139005 0.906616 -1.379979
In [9]: f = lambda x:x.max() - x.min()
In [10]: frame.apply(f)
Out[10]:
b 2.643105
d 3.387504
e 2.562910
dtype: float64
In [11]: frame.apply(f,axis=1)
Out[11]:
Utah 1.444629
Ohio 1.487660
Texas 0.976788
Oregon 2.518983
dtype: float64
- The function passed to apply can also return a Series with multiple values:
In [14]: def f(x):
    ...:     return Series([x.min(),x.max()],index=['min','max'])
In [15]: frame.apply(f)
Out[15]:
b d e
min -1.504100 -2.480888 -2.347232
max 1.139005 0.906616 0.215677
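- To get a formatted string for each floating-point value in frame, use applymap, which applies a function element-wise (in pandas 2.1 and later the same operation is spelled DataFrame.map)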
In [16]: format = lambda x: '%.2f' %x
In [17]: frame.applymap(format)
Out[17]:
b d e
Utah -0.32 0.01 -1.44
Ohio -0.27 -1.27 0.22
Texas -1.50 -2.48 -2.35
Oregon 1.14 0.91 -1.38
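The frame in this section holds random numbers, so its exact values cannot be reproduced. The sketch below uses a small hypothetical frame with fixed values to show column-wise apply, row-wise apply, and element-wise formatting; the element-wise step goes through Series.map on each column, which avoids depending on whether applymap or DataFrame.map is available in the installed pandas version:

```python
import pandas as pd

# Hypothetical fixed values, so the results can be checked by hand.
frame = pd.DataFrame([[1.0, -2.5], [3.25, 0.5]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

f = lambda x: x.max() - x.min()

col_spread = frame.apply(f)           # one value per column
row_spread = frame.apply(f, axis=1)   # one value per row

# Element-wise formatting: apply hands over each column as a Series,
# and Series.map formats every element of that column.
fmt = lambda v: '%.2f' % v
formatted = frame.apply(lambda col: col.map(fmt))

print(col_spread, row_spread, formatted, sep='\n\n')
```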
Sorting and ranking
- The sort_index method returns a new, sorted object
In [3]: obj = Series(range(4),index=['d','a','b','c'])
In [4]: obj.sort_index()
Out[4]:
a 1
b 2
c 3
d 0
dtype: int64
- A DataFrame can be sorted by the index on either axis
In [8]: frame
Out[8]:
d a b c
three 0 1 2 3
one 4 5 6 7
In [9]: frame.sort_index()
Out[9]:
d a b c
one 4 5 6 7
three 0 1 2 3
In [10]: frame.sort_index(axis=1)
Out[10]:
a b c d
three 1 2 3 0
one 5 6 7 4
- A DataFrame can be sorted by the values in one or more columns; pass the column names to the by option of sort_values
In [17]: frame.sort_values(by='b')
Out[17]:
b a
2 -3 0
3 2 1
0 4 0
1 7 1
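A short sketch of sorting by values with sort_values, first by a single column and then by several. The frame definition is an assumption consistent with the output printed above:

```python
import pandas as pd

# Assumed definition, consistent with the printed result.
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort the rows by the values in column b.
by_b = frame.sort_values(by='b')

# Sort by a first; ties in a are broken by b.
by_a_then_b = frame.sort_values(by=['a', 'b'])

print(by_b, by_a_then_b, sep='\n\n')
```

Both calls return new objects; the original frame keeps its row order.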
Axis indexes with duplicate values
- The index's is_unique attribute tells you whether its labels are unique
In [26]: obj = Series(range(5),index=['a','a','b','b','c'])
In [27]: obj.index.is_unique
Out[27]: False
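Summarizing and computing descriptive statistics
pandas objects come with a set of common mathematical and statistical methods that extract a single value from a Series, or a Series of values from the rows or columns of a DataFrame
The sum method computes sums; NA values are skipped by default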
In [34]: df
Out[34]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [35]: df.sum()
Out[35]:
one 9.25
two -5.80
dtype: float64
- Some methods return indirect statistics, such as the index labels at which the minimum and maximum values occur
In [40]: df.idxmin()
Out[40]:
one d
two b
dtype: object
In [41]: df.idxmax()
Out[41]:
one b
two d
dtype: object
- Accumulation methods (cumulative sum)
In [44]: df
Out[44]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [45]: df.cumsum()
Out[45]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
- describe produces several summary statistics in one shot
In [47]: df.describe()
Out[47]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
For non-numeric data, describe produces a different set of summary statistics
In [48]: obj = Series(['a','a','b','c'] * 4)
In [49]: obj.describe()
Out[49]:
count 16
unique 3
top a
freq 8
dtype: object
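The reductions in this section all skip NaN by default. The sketch below rebuilds the df shown above (the construction itself is an assumption; the values come from the printed output) and contrasts the default with skipna=False:

```python
import numpy as np
import pandas as pd

# Assumed construction of the df printed above.
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                         # per column, NaN skipped
row_sums_strict = df.sum(axis=1, skipna=False)  # any NaN poisons the row

where_max = df.idxmax()   # index label of the maximum in each column
where_min = df.idxmin()   # index label of the minimum in each column

running = df.cumsum()     # running total; NaN positions stay NaN

print(col_sums, row_sums_strict, where_max, where_min, running, sep='\n\n')
```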
Unique values, value counts, and membership
- unique returns the unique values
In [3]: obj = Series(['c','a','d','a','a','a','b','b','c','c'])
In [4]: uniques = obj.unique()
In [5]: uniques
Out[5]: array(['c', 'a', 'd', 'b'], dtype=object)
- value_counts computes how often each value occurs in a Series
In [7]: obj.value_counts()
Out[7]:
a 4
c 3
b 2
d 1
dtype: int64
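- value_counts is also available as a top-level pandas function; by default the result is sorted by frequency in descending order, which sort=False turns off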
In [10]: pd.value_counts(obj.values,sort=False)
Out[10]:
b 2
a 4
d 1
c 3
dtype: int64
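- isin performs a vectorized set-membership check; the boolean array it returns can be used to filter a Series or a DataFrame column down to a subset of values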
In [11]: mask = obj.isin(['b','c'])
In [13]: mask
Out[13]:
0 True
1 False
2 False
3 False
4 False
5 False
6 True
7 True
8 True
9 True
dtype: bool
In [14]: obj[mask]
Out[14]:
0 c
6 b
7 b
8 c
9 c
dtype: object
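The three operations from this section fit together naturally. A compact sketch using the same obj as above:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'a', 'b', 'b', 'c', 'c'])

# Unique values, in order of first appearance (not sorted).
uniques = obj.unique()

# Frequencies, most common first by default.
counts = obj.value_counts()

# Boolean mask for membership in a set of values, then filter with it.
mask = obj.isin(['b', 'c'])
subset = obj[mask]

print(uniques, counts, subset, sep='\n\n')
```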
Handling missing data
- isnull reports which values in a pandas object are missing
In [17]: string_data = Series(['aardvark','artichkoe',np.nan,'avocado'])
In [18]: string_data
Out[18]:
0 aardvark
1 artichkoe
2 NaN
3 avocado
dtype: object
In [19]: string_data.isnull()
Out[19]:
0 False
1 False
2 True
3 False
dtype: bool
Filtering out missing data
- dropna returns a Series containing only the non-null data and their index values
In [21]: data =Series([1,NA,3.5,NA,7])
In [22]: data.dropna()
Out[22]:
0 1.0
2 3.5
4 7.0
dtype: float64
In [23]: data[data.notnull()]
Out[23]:
0 1.0
2 3.5
4 7.0
dtype: float64
- Using dropna on a DataFrame is a bit more complex, because whole rows or columns are dropped
In [25]: data
Out[25]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
In [26]: cleaned = data.dropna()
# By default, dropna drops any row that contains a missing value
In [27]: cleaned
Out[27]:
0 1 2
0 1.0 6.5 3.0
In [28]: data.dropna(how='all')
# how='all' drops only the rows that are entirely NA
Out[28]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
In [29]: data[4] = NA
In [30]: data.dropna(axis=1,how='all')
# With axis=1 and how='all', only the columns that are entirely NA are dropped
Out[30]:
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
- The thresh argument keeps only the rows that have at least that many non-NA values
In [33]: df
Out[33]:
0 1 2
0 0.481775 NaN NaN
1 -0.338072 NaN NaN
2 -0.642257 NaN NaN
3 -1.890957 NaN 1.887299
4 0.811897 NaN 0.721258
5 -0.611888 0.188227 -0.708599
6 2.256012 0.033532 0.042494
In [34]: df.dropna(thresh=3)
Out[34]:
0 1 2
5 -0.611888 0.188227 -0.708599
6 2.256012 0.033532 0.042494
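The dropna options are easy to mix up, so here they are side by side on the small 4x3 frame from earlier in this section. The construction is an assumption that matches the printed data, and NA is taken to be numpy's nan:

```python
import pandas as pd
from numpy import nan as NA

# Assumed construction of the 4x3 frame printed earlier in this section.
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

any_na = data.dropna()             # drop rows with at least one NA
all_na = data.dropna(how='all')    # drop rows where every value is NA
at_least_2 = data.dropna(thresh=2) # keep rows with >= 2 non-NA values

# Add an all-NA column, then drop columns that are entirely NA.
data[4] = NA
cols_kept = data.dropna(axis=1, how='all')

print(any_na, all_na, at_least_2, cols_kept, sep='\n\n')
```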
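Filling in missing data
- fillna is the main method for filling in missing data; calling it with a constant replaces the missing values with that constant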
In [6]: df
Out[6]:
0 1 2
0 0.384681 0.711003 1.672274
1 -0.453881 1.268871 0.688568
2 -0.272381 0.689529 1.194154
3 0.011967 -0.141345 -0.008628
4 -0.405584 0.427896 0.266322
5 0.155367 0.831051 0.040576
6 0.082301 1.629965 -0.433486
In [9]: df.loc[:4,1]=NA;df.loc[:2,2]=NA
In [10]: df
Out[10]:
0 1 2
0 0.384681 NaN NaN
1 -0.453881 NaN NaN
2 -0.272381 NaN NaN
3 0.011967 NaN -0.008628
4 -0.405584 NaN 0.266322
5 0.155367 0.831051 0.040576
6 0.082301 1.629965 -0.433486
In [11]: df.fillna(0)
# Fill NA values with 0
Out[11]:
0 1 2
0 0.384681 0.000000 0.000000
1 -0.453881 0.000000 0.000000
2 -0.272381 0.000000 0.000000
3 0.011967 0.000000 -0.008628
4 -0.405584 0.000000 0.266322
5 0.155367 0.831051 0.040576
6 0.082301 1.629965 -0.433486
In [12]: df.fillna({1:0.5,3:-1})
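# Calling fillna with a dict fills each column with its own value; columns not named in the dict (column 2 here) are left untouched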
Out[12]:
0 1 2
0 0.384681 0.500000 NaN
1 -0.453881 0.500000 NaN
2 -0.272381 0.500000 NaN
3 0.011967 0.500000 -0.008628
4 -0.405584 0.500000 0.266322
5 0.155367 0.831051 0.040576
6 0.082301 1.629965 -0.433486
- The interpolation methods available for reindex can also be used with fillna:
In [12]: df
Out[12]:
0 1 2
0 0.473072 0.586030 -2.169215
1 0.675384 -1.530588 -0.918324
2 -1.483835 NaN -1.287079
3 -1.387653 NaN 2.044451
4 0.247890 NaN NaN
5 0.151385 NaN NaN
# method='ffill' fills each missing value with the last non-missing value before it
In [13]: df.fillna(method='ffill')
Out[13]:
0 1 2
0 0.473072 0.586030 -2.169215
1 0.675384 -1.530588 -0.918324
2 -1.483835 -1.530588 -1.287079
3 -1.387653 -1.530588 2.044451
4 0.247890 -1.530588 2.044451
5 0.151385 -1.530588 2.044451
# limit=2 fills at most two consecutive NA values per gap
In [14]: df.fillna(method='ffill',limit=2)
Out[14]:
0 1 2
0 0.473072 0.586030 -2.169215
1 0.675384 -1.530588 -0.918324
2 -1.483835 -1.530588 -1.287079
3 -1.387653 -1.530588 2.044451
4 0.247890 NaN 2.044451
5 0.151385 NaN 2.044451
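The frames in this section contain random numbers, so the sketch below uses a hypothetical frame with fixed values and the same NaN pattern. It uses the ffill method, which newer pandas versions prefer over passing method='ffill' to fillna:

```python
import pandas as pd
from numpy import nan as NA

# Hypothetical fixed values with the same NaN pattern as above.
df = pd.DataFrame({0: [1., 2., 3., 4., 5., 6.],
                   1: [0.5, -1.5, NA, NA, NA, NA],
                   2: [1., 2., 3., 4., NA, NA]})

zeros = df.fillna(0)             # one constant for every NaN
per_col = df.fillna({1: 0.5})    # only column 1 is filled

forward = df.ffill()             # carry the last valid value down
limited = df.ffill(limit=2)      # at most 2 consecutive fills per gap

print(per_col, forward, limited, sep='\n\n')
```

With limit=2, the four-row gap in column 1 is only half filled, while the two-row gap in column 2 is filled completely.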
Hierarchical indexing
- Displaying a Series with a two-level index, and the index itself
In [16]: data
Out[16]:
a 1 0.091050
2 0.515947
3 -0.115889
b 1 0.165276
2 0.490398
3 -0.184115
c 1 -1.258075
2 -0.552847
d 2 -1.097689
3 0.113758
dtype: float64
In [17]: data.index
Out[17]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
- With a hierarchically indexed object, subsets can be selected by the outer level (partial indexing)
In [19]: data['b':'c']
Out[19]:
b 1 0.165276
2 0.490398
3 -0.184115
c 1 -1.258075
2 -0.552847
dtype: float64
# Selecting on the inner level of the index
In [20]: data[:,2]
Out[20]:
a 0.515947
b 0.490398
c -0.552847
d -1.097689
dtype: float64
- The data can be rearranged into a DataFrame with the unstack method
In [22]: data.unstack()
Out[22]:
1 2 3
a 0.091050 0.515947 -0.115889
b 0.165276 0.490398 -0.184115
c -1.258075 -0.552847 NaN
d NaN -1.097689 0.113758
# The inverse of unstack is stack:
In [23]: data.unstack().stack()
Out[23]:
a 1 0.091050
2 0.515947
3 -0.115889
b 1 0.165276
2 0.490398
3 -0.184115
c 1 -1.258075
2 -0.552847
d 2 -1.097689
3 0.113758
dtype: float64
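The data Series in this section holds random values. A sketch with the same two-level index but hypothetical values 0 through 9 makes the selections checkable by hand:

```python
import numpy as np
import pandas as pd

# Same index structure as above; the values 0..9 are placeholders.
data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer = data.loc['b':'c']   # partial indexing on the outer level
inner = data.loc[:, 2]      # every outer label, inner label 2

wide = data.unstack()       # inner level becomes the columns

# stack() reverses unstack(); dropna() removes the holes that unstack
# introduced, whichever stacking behavior the pandas version uses.
roundtrip = wide.stack().dropna()

print(outer, inner, wide, sep='\n\n')
```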
Reordering levels
- swaplevel takes two level numbers or names and returns a new object with the levels interchanged
In [31]: frame.swaplevel('key1','key2')
Out[31]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
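- sortlevel sorts the data by the values in a single level, which is often needed after swapping levels; it has since been deprecated and removed from pandas, and sort_index(level=...) is the replacement
In [36]: frame
Out[36]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [37]: frame.swaplevel(0,1).sort_index(level=0)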
Out[37]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
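Summary statistics by level
- Many descriptive and summary statistics accept a level option that specifies the level to aggregate by on a given axis; sums can be taken by a level of either the rows or the columns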
In [39]: frame
Out[39]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
# Sum by key2
In [40]: frame.sum(level='key2')
Out[40]:
state Ohio Colorado
color Green Red Green
key2
1 6 8 10
2 12 14 16
# Sum by color
In [41]: frame.sum(level='color',axis=1)
Out[41]:
color Green Red
key1 key2
a 1 2 1
2 8 4
b 1 14 7
2 20 10
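Recent pandas releases no longer accept level= on sum, and groupby is the usual replacement. The sketch below assumes the frame construction shown (it reproduces the printed values) and covers level-wise sums on both axes plus the swaplevel and sort_index combination:

```python
import numpy as np
import pandas as pd

# Assumed construction, matching the frame printed above.
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum over rows that share the same key2.
by_key2 = frame.groupby(level='key2').sum()

# Sum over columns that share the same color: transpose, group, transpose.
by_color = frame.T.groupby(level='color').sum().T

# Swap the row levels, then sort by the new outer level.
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

print(by_key2, by_color, swapped, sep='\n\n')
```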
Using a DataFrame's columns
- One or more of a DataFrame's columns can be used as the row index with set_index
# Use columns c and d as the row index
In [45]: frame.set_index(['c','d'])
Out[45]:
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1
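The original frame for this example is not shown. Assuming the definition below, which reproduces the printed result, set_index and its inverse reset_index can be sketched as:

```python
import pandas as pd

# Assumed definition, reconstructed from the printed result.
frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one'] * 3 + ['two'] * 4,
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns c and d move into a two-level row index (and leave the columns).
indexed = frame.set_index(['c', 'd'])

# drop=False keeps them as ordinary columns as well.
kept = frame.set_index(['c', 'd'], drop=False)

# reset_index moves the index levels back into columns.
restored = indexed.reset_index()

print(indexed, restored, sep='\n\n')
```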