Getting Started with pandas

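Introduction to pandas Data Structures
  • A Series is a one-dimensional array-like object made up of a sequence of values and an associated set of data labels (its index).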
In [3]: obj = Series([4,7,-5,3])

In [4]: obj
Out[4]:
0    4
1    7
2   -5
3    3
dtype: int64
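  • If the data is stored in a Python dict, a Series can be created from that dict. Passing an index selects and orders the entries; labels missing from the dict get NaN.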
In [11]: states = ['Californie','Ohio','Oregon','Texas']
In [14]: obj4 = Series(sdata,index=states)

In [15]: obj4
Out[15]:
Californie        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
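The transcript above uses sdata without showing how it was defined. The following is a minimal sketch, assuming a hypothetical sdata dict whose Ohio, Oregon and Texas values mirror the printed output; the Utah entry is an assumption added only to show that keys absent from the index are dropped.

```python
import pandas as pd

# Hypothetical dict: the Utah entry is an assumption, the other values
# are taken from the output shown in the transcript.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['Californie', 'Ohio', 'Oregon', 'Texas']

# Only the labels listed in `states` are kept, in that order.
obj4 = pd.Series(sdata, index=states)
print(obj4)

# 'Californie' has no entry in sdata, so it is marked as missing.
print(obj4.isnull())
```

Because one value is missing, the integers are stored as floats, which is why the transcript shows dtype float64.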
  • Both the Series object itself and its index have a name attribute, which is closely tied to other key areas of pandas functionality.
In [17]: obj4.name = 'population'

In [18]: obj4.index.name = 'state'

In [19]: obj4
Out[19]:
state
Californie        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
DataFrame
  • A DataFrame contains an ordered collection of columns, each of which can hold a different value type.
  • Constructing a DataFrame:
In [20]: data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],'year':[2000,2001,2002,2001,2002],'pop':[1.5,1.7,3.6,2.4,2.9]}

In [21]: frame = DataFrame(data)

In [22]: frame
Out[22]:
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
  • If you specify a sequence of columns, the DataFrame's columns are arranged in that order.
In [23]: DataFrame(data,columns=['year','state','pop'])
Out[23]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
  • A DataFrame column can be retrieved as a Series either with dict-like notation or as an attribute.
In [25]: frame2
Out[25]:
       year   state  pop deby
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

In [26]: frame.columns
Out[26]: Index(['state', 'year', 'pop'], dtype='object')

In [27]: frame.year
Out[27]:
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
  • Columns can be modified by assignment.
In [33]: frame2['bebt'] = 16.5

In [34]: frame2
Out[34]:
       year   state  pop deby  bebt
one    2000    Ohio  1.5  NaN  16.5
two    2001    Ohio  1.7  NaN  16.5
three  2002    Ohio  3.6  NaN  16.5
four   2001  Nevada  2.4  NaN  16.5
five   2002  Nevada  2.9  NaN  16.5
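  • A list or array can be assigned to a column. If you assign a Series instead, its labels are aligned to the DataFrame's index and any gaps are filled with NaN.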
In [41]: val = Series([-1.2,-1.5,-1.7],index=['two','four','five'])

In [42]: val
Out[42]:
two    -1.2
four   -1.5
five   -1.7
dtype: float64

In [44]: frame2['debt'] = val

In [45]: frame2
Out[45]:
       year   state  pop deby  bebt  debt
one    2000    Ohio  1.5  NaN   0.0   NaN
two    2001    Ohio  1.7  NaN   1.0  -1.2
three  2002    Ohio  3.6  NaN   2.0   NaN
four   2001  Nevada  2.4  NaN   3.0  -1.5
five   2002  Nevada  2.9  NaN   4.0  -1.7
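The index alignment that happens when a Series is assigned to a column can be sketched as follows. The construction of frame2 is an assumption (the transcript never shows it); the data values come from the earlier data dict. The extra 'short' column is a hypothetical name used only to show the length check for plain lists.

```python
import pandas as pd

# Assumed construction of frame2; the transcript only shows its output.
frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]},
    index=['one', 'two', 'three', 'four', 'five'])

# A Series is matched by label: rows 'one' and 'three' get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

# A plain list has no labels, so its length must equal the number of rows.
try:
    frame2['short'] = [1, 2, 3]
except ValueError as exc:
    print('ValueError:', exc)
```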
  • Assigning to a column that does not exist creates a new column; the del keyword deletes a column.
In [47]: frame2['eastern'] = frame2.state == 'Ohio'

In [48]: frame2
Out[48]:
       year   state  pop deby  bebt  debt  eastern
one    2000    Ohio  1.5  NaN   0.0   NaN     True
two    2001    Ohio  1.7  NaN   1.0  -1.2     True
three  2002    Ohio  3.6  NaN   2.0   NaN     True
four   2001  Nevada  2.4  NaN   3.0  -1.5    False
five   2002  Nevada  2.9  NaN   4.0  -1.7    False

In [49]: del frame2['eastern']

In [50]: frame2
Out[50]:
       year   state  pop deby  bebt  debt
one    2000    Ohio  1.5  NaN   0.0   NaN
two    2001    Ohio  1.7  NaN   1.0  -1.2
three  2002    Ohio  3.6  NaN   2.0   NaN
four   2001  Nevada  2.4  NaN   3.0  -1.5
five   2002  Nevada  2.9  NaN   4.0  -1.7
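  • Another common form of data is a nested dict of dicts. The outer keys become the columns and the inner keys become the row index.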
In [51]: pop = {'Nevada':{2001:2.4,2002:2.9},'Ohio':{2000:1.5,2001:1.7,2002:3.6}}

In [52]: frame3 = DataFrame(pop)

In [53]: frame3
Out[53]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
  • Transposing the result:
In [54]: frame3.T
Out[54]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
Index Objects
  • pandas Index objects are responsible for holding the axis labels and other metadata.
In [4]: obj = Series(range(3),index=['a','b','c'])
In [5]: index = obj.index
In [6]: index
Out[6]: Index(['a', 'b', 'c'], dtype='object')
Index objects are immutable, so users cannot modify them in place:
In [7]: index[1:]
Out[7]: Index(['b', 'c'], dtype='object')
In [8]: index[1] = 'd'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-a452e55ce13b> in <module>()
----> 1 index[1] = 'd'
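The truncated traceback above can be reproduced with a small sketch. It catches the TypeError and then shows one way to obtain different labels without mutating the Index, using rename, which returns a new object. The variable names are illustrative only.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'd'
    mutated = True
except TypeError as exc:
    mutated = False
    print('TypeError:', exc)

# rename builds a new Series with new labels; obj itself is unchanged.
relabeled = obj.rename(index={'b': 'd'})
print(relabeled.index)
print(obj.index)
```

Immutability is what makes it safe to share one Index object between several data structures.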
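Reindexing
  • An important method on pandas objects is reindex, which creates a new object conformed to a new index.
reindex rearranges the data according to the new index and introduces missing values for any labels that are not already present.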
In [12]: obj2 = obj.reindex(['a','b','c','d','e'])

In [13]: obj2
Out[13]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
In [14]: obj.reindex(['a','b','c','d','e'],fill_value=0)
Out[14]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
  • The method option fills gaps while reindexing; method='ffill' forward-fills the values.
In [15]: obj3 = Series(['blue','purple','yellow'],index=[0,2,4])
In [16]: obj3.reindex(range(6),method='ffill')
Out[16]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
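The reindex transcript above starts from an obj whose definition is not shown. Here is a minimal sketch, assuming obj was created with its labels in a different order (the ordering is an assumption; the values match the printed output).

```python
import pandas as pd

# Assumed definition of obj; only its reindexed output appears above.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Labels are rearranged; 'e' does not exist yet and becomes NaN.
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value replaces the NaN that would otherwise be introduced.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

# Forward fill: each new label takes the last value seen before it.
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
ffilled = obj3.reindex(range(6), method='ffill')

print(obj2)
print(filled)
print(ffilled)
```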
Dropping Entries from an Axis
  • The drop method returns a new object with the specified values removed from the given axis.
In [31]: obj = Series(np.arange(5.),index=['a','b','c','d','e'])

In [32]: new_obj = obj.drop('c')

In [33]: new_obj
Out[33]:
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [34]: data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two'
    ...: ,'three','four'])

In [35]: data
Out[35]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [36]: data.drop(['Colorado','Ohio'])
Out[36]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

Passing axis=1 drops columns:
In [41]: data.drop('two',axis=1)
Out[41]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
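Indexing, Selection, and Filtering
  • Series indexing works much like NumPy array indexing, and the same style of boolean filtering applies to the rows of a DataFrame.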
In [49]: data
Out[49]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [50]: data[data['three']>5]
Out[50]:
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

Indexing with a boolean DataFrame:
In [51]: data <5
Out[51]:
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

In [52]: data[data <5] = 0

In [53]: data
Out[53]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
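  • The ix indexing field selects a subset of rows and columns from a DataFrame using NumPy-like notation plus axis labels. As the warnings in the output say, ix is deprecated (and it was removed in pandas 1.0); use loc for label-based and iloc for positional indexing.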
In [55]: data.ix[['Colorado','Utah'],[3,0,1]]
D:\Python3.5\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[55]:
          four  one  two
Colorado     7    0    5
Utah        11    8    9

In [56]: data
Out[56]:
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [57]: data.ix[data.three>5,:3]
D:\Python3.5\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
Out[57]:
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
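Since the warnings above recommend loc and iloc, the two ix selections can be rewritten as in the following sketch. It rebuilds data as shown earlier in the notes; the result variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data[data < 5] = 0

# Label-based: rows and columns are both named explicitly.
by_label = data.loc[['Colorado', 'Utah'], ['four', 'one', 'two']]

# Position-based: the same selection using integer positions.
by_position = data.iloc[[1, 2], [3, 0, 1]]

# Boolean row filter combined with the first three columns.
filtered = data.loc[data['three'] > 5, data.columns[:3]]

print(by_label)
print(filtered)
```

Unlike ix, neither indexer mixes labels and positions, so the intent of each selection is unambiguous.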

DataFrame Indexing Options

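Arithmetic and Data Alignment
  • When objects with different indexes are combined arithmetically, for example added together, the index of the result is the union of the two indexes. Labels found in only one operand yield NaN.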
In [61]: s1
Out[61]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [62]: s2
Out[62]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [63]: s1 + s2
Out[63]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [72]: df2
Out[72]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

Adding the two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two operands:
In [73]: df1
Out[73]:
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [74]: df1 + df2
Out[74]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
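The transcripts above print s1, s2, df1 and df2 without showing how they were built. The sketch below reconstructs them from the printed values (the np.arange-based construction is an assumption) and checks that the result uses the union of the labels.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4.0, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s_total = s1 + s2  # 'd', 'f' and 'g' appear in only one operand -> NaN

# Assumed construction; the values match the printed frames.
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Alignment happens on both the rows and the columns.
total = df1 + df2
print(s_total)
print(total)
```

Only cells whose row label and column label exist in both frames hold a number; every other cell is NaN.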
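Arithmetic Methods with Fill Values
  • In arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not in the other.
In [83]: df1 + df2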
Out[83]:
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
Filling the non-overlapping positions with 0:
In [84]: df1.add(df2,fill_value=0)
Out[84]:
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
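The two frames used in this transcript are not defined in the notes. A minimal sketch, assuming they were built with np.arange (an assumption that reproduces the printed numbers), contrasts the + operator with add and fill_value.

```python
import numpy as np
import pandas as pd

# Assumed construction; it reproduces the values printed above.
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain addition: anything missing from df1 becomes NaN.
plain = df1 + df2

# fill_value=0 treats a label missing from one operand as 0 before adding.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```

A cell missing from both operands would still be NaN; fill_value only substitutes for a value that is absent on one side.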

Flexible Arithmetic Methods

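Operations between DataFrame and Series
  • As a motivating example, take the difference between a two-dimensional array and one of its rows: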
In [89]: arr
Out[89]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [90]: arr[0]
Out[90]: array([0., 1., 2., 3.])

In [91]: arr-arr[0]
Out[91]:
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
  • Arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and then broadcasts down the rows.
In [95]: frame
Out[95]:
          d     b     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [96]: series
Out[96]:
d    0.0
b    1.0
e    2.0
Name: Utah, dtype: float64

In [97]: frame - series
Out[97]:
          d    b    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
  • To match on the row index and broadcast across the columns instead, use an arithmetic method with axis=0:
In [103]: frame
Out[103]:
          d     b     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [104]: series3
Out[104]:
Utah      0.0
Ohio      3.0
Texas     6.0
Oregon    9.0
Name: d, dtype: float64

In [105]: frame.sub(series3,axis=0)
Out[105]:
          d    b    e
Utah    0.0  1.0  2.0
Ohio    0.0  1.0  2.0
Texas   0.0  1.0  2.0
Oregon  0.0  1.0  2.0
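Both broadcasting directions can be reproduced with the sketch below. The construction of frame is an assumption chosen to match the printed values; series and series3 are taken from the frame's first row and its 'd' column, as their Name fields in the transcript suggest.

```python
import numpy as np
import pandas as pd

# Assumed construction; columns are in the order printed above.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('dbe'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A row: its index (d, b, e) is matched against the columns,
# and the subtraction is repeated for every row.
series = frame.iloc[0]
row_diff = frame - series

# A column: axis=0 matches on the row labels instead,
# and the subtraction is repeated for every column.
series3 = frame['d']
col_diff = frame.sub(series3, axis=0)

print(row_diff)
print(col_diff)
```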
Function Application and Mapping
  • To apply a function to the one-dimensional arrays formed by each column or row, use the DataFrame's apply method.
In [8]: frame
Out[8]:
               b         d         e
Utah   -0.319750  0.009537 -1.435092
Ohio   -0.274765 -1.271983  0.215677
Texas  -1.504100 -2.480888 -2.347232
Oregon  1.139005  0.906616 -1.379979

In [9]: f = lambda x:x.max() - x.min()

In [10]: frame.apply(f)
Out[10]:
b    2.643105
d    3.387504
e    2.562910
dtype: float64

In [11]: frame.apply(f,axis=1)
Out[11]:
Utah      1.444629
Ohio      1.487660
Texas     0.976788
Oregon    2.518983
dtype: float64
  • The function passed to apply can also return a Series made up of several values:
In [14]: def f(x):
    ...:     return Series([x.min(),x.max()
    ...:     ],index=['min','max'])

In [15]: frame.apply(f)
Out[15]:
            b         d         e
min -1.504100 -2.480888 -2.347232
max  1.139005  0.906616  0.215677
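  • To get a formatted string for each floating-point value in frame, use the element-wise applymap (renamed to DataFrame.map in pandas 2.1):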
In [16]: format = lambda x: '%.2f' %x

In [17]: frame.applymap(format)
Out[17]:
            b      d      e
Utah    -0.32   0.01  -1.44
Ohio    -0.27  -1.27   0.22
Texas   -1.50  -2.48  -2.35
Oregon   1.14   0.91  -1.38
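The transcript uses random data, so the sketch below swaps in small hand-picked values (hypothetical, chosen so the results are easy to verify). For element-wise formatting it applies Series.map to each column, an alternative to applymap chosen here because it does not depend on which pandas version is installed.

```python
import pandas as pd

# Hypothetical values, not the random numbers from the transcript.
frame = pd.DataFrame([[1.0, -2.5, 0.25],
                      [3.0, 0.5, -1.75],
                      [-2.0, 4.0, 2.0]],
                     columns=list('bde'), index=['Utah', 'Ohio', 'Texas'])

# Range of each column (default axis=0) and of each row (axis=1).
spread = lambda x: x.max() - x.min()
per_column = frame.apply(spread)
per_row = frame.apply(spread, axis=1)

# Element-wise formatting: map the formatter over every column.
fmt = lambda x: '%.2f' % x
formatted = frame.apply(lambda col: col.map(fmt))

print(per_column)
print(per_row)
print(formatted)
```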
Sorting and Ranking
  • The sort_index method returns a new, sorted object.
In [3]: obj = Series(range(4),index=['d','a','b','c'])

In [4]: obj.sort_index()
Out[4]:
a    1
b    2
c    3
d    0
dtype: int64
  • A DataFrame can be sorted by the index on either axis.
In [8]: frame
Out[8]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7

In [9]: frame.sort_index()
Out[9]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3

In [10]: frame.sort_index(axis=1)
Out[10]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
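  • A DataFrame can also be sorted by the values in one or more columns by passing their names via by. As the warning in the output notes, sort_index(by=...) is deprecated in favor of sort_values(by=...).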
In [17]: frame.sort_index(by='b')
D:\Python3.5\Scripts\ipython:1: FutureWarning: by argument to sort_index is deprecated, please use .sort_values(by=...)
Out[17]:
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
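Following the FutureWarning above, the same sort can be written with sort_values. This sketch rebuilds the frame from the printed output (the dict-based construction is an assumption) and also sorts by two columns.

```python
import pandas as pd

# Reconstructed from the output above: row i holds (b, a).
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort rows by the values in column 'b'.
by_b = frame.sort_values(by='b')

# Sort by 'a' first, then break ties using 'b'.
by_a_b = frame.sort_values(by=['a', 'b'])

print(by_b)
print(by_a_b)
```

Both calls return new objects; frame keeps its original row order.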
Axis Indexes with Duplicate Values
  • The is_unique property of an index tells you whether its labels are unique.
In [26]: obj = Series(range(5),index=['a','a','b','b','c'])

In [27]: obj.index.is_unique
Out[27]: False
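Summarizing and Computing Descriptive Statistics
  • pandas has a set of common mathematical and statistical methods that extract a single value from a Series, or a Series of values from the rows or columns of a DataFrame.

  • The sum method computes sums (NaN values are skipped by default):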

In [34]: df
Out[34]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [35]: df.sum()
Out[35]:
one    9.25
two   -5.80
dtype: float64
  • Indirect statistics such as idxmin and idxmax return the index labels at which the minimum and maximum values occur:
In [40]: df.idxmin()
Out[40]:
one    d
two    b
dtype: object

In [41]: df.idxmax()
Out[41]:
one    b
two    d
dtype: object
  • Accumulation methods such as cumsum:
In [44]: df
Out[44]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [45]: df.cumsum()
Out[45]:
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
  • describe produces several summary statistics in one shot:
In [47]: df.describe()
Out[47]:
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
For non-numeric data, describe produces a different set of summary statistics:
In [48]: obj = Series(['a','a','b','c'] * 4)

In [49]: obj.describe()
Out[49]:
count     16
unique     3
top        a
freq       8
dtype: object
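The df used throughout this section is never defined in the notes. A minimal sketch that rebuilds it from the printed values (the list-based construction is an assumption) shows how the axis and skipna options change the reductions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Column sums; NaN values are skipped by default.
col_sums = df.sum()

# axis=1 reduces across the columns, giving one value per row.
row_sums = df.sum(axis=1)

# skipna=False makes any NaN in a row propagate to the result.
strict_means = df.mean(axis=1, skipna=False)

print(col_sums)
print(row_sums)
print(strict_means)
print(df.idxmax())
```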
Unique Values, Value Counts, and Membership
  • unique returns an array of the unique values in a Series:
In [3]: obj = Series(['c','a','d','a','a','a','b','b','c','c'])

In [4]: uniques = obj.unique()

In [5]: uniques
Out[5]: array(['c', 'a', 'd', 'b'], dtype=object)
  • value_counts computes how often each value occurs in a Series; by default the result is sorted by count in descending order:
In [7]: obj.value_counts()
Out[7]:
a    4
c    3
b    2
d    1
dtype: int64
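  • value_counts is also available as a top-level pandas function that works on any array or sequence. It sorts by frequency in descending order by default; passing sort=False turns that off: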
In [10]: pd.value_counts(obj.values,sort=False)
Out[10]:
b    2
a    4
d    1
c    3
dtype: int64
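  • isin performs a vectorized set-membership check. The boolean mask it returns can be used to filter a Series (or a DataFrame column) down to a subset of values: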
In [11]: mask = obj.isin(['b','c'])
In [13]: mask
Out[13]:
0     True
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool

In [14]: obj[mask]
Out[14]:
0    c
6    b
7    b
8    c
9    c
dtype: object
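The three operations in this section can be run together as one short sketch, using the same obj as the transcript. The variable names other than obj and mask are illustrative.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'a', 'b', 'b', 'c', 'c'])

# Unique values, in order of first appearance.
uniques = obj.unique()

# Frequency of each value, most frequent first.
counts = obj.value_counts()

# Boolean mask for membership, then use it to filter.
mask = obj.isin(['b', 'c'])
subset = obj[mask]

print(uniques)
print(counts)
print(subset)
```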
Handling Missing Data
  • isnull reports, element by element, which values in a pandas object are missing:
In [17]: string_data = Series(['aardvark','artichkoe',np.nan,'avocado'])

In [18]: string_data
Out[18]:
0     aardvark
1    artichkoe
2          NaN
3      avocado
dtype: object

In [19]: string_data.isnull()
Out[19]:
0    False
1    False
2     True
3    False
dtype: bool

NA Handling Methods

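Filtering Out Missing Data
  • On a Series, dropna returns a new Series containing only the non-null data and its index values:
In [20]: from numpy import nan as NA

In [21]: data = Series([1,NA,3.5,NA,7])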

In [22]: data.dropna()
Out[22]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [23]: data[data.notnull()]
Out[23]:
0    1.0
2    3.5
4    7.0
dtype: float64
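  • With DataFrame objects, dropna is a bit more involved, because you can drop rows or columns that contain any NA or only those that are entirely NA.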

In [25]: data
Out[25]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [26]: cleaned = data.dropna()
# By default, dropna drops any row that contains a missing value
In [27]: cleaned
Out[27]:
     0    1    2
0  1.0  6.5  3.0

In [28]: data.dropna(how='all')
# how='all' drops only the rows that are entirely NA
Out[28]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [29]: data[4] = NA

In [30]: data.dropna(axis=1,how='all')
# With axis=1, columns are dropped instead: here only the all-NA column
Out[30]:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
  • The thresh argument keeps only the rows that contain at least that many non-NA values:
In [33]: df
Out[33]:
          0         1         2
0  0.481775       NaN       NaN
1 -0.338072       NaN       NaN
2 -0.642257       NaN       NaN
3 -1.890957       NaN  1.887299
4  0.811897       NaN  0.721258
5 -0.611888  0.188227 -0.708599
6  2.256012  0.033532  0.042494

In [34]: df.dropna(thresh=3)
Out[34]:
          0         1         2
5 -0.611888  0.188227 -0.708599
6  2.256012  0.033532  0.042494
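The dropna options covered above can be compared side by side in a small sketch. It rebuilds the 4x3 frame from the transcript; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

# Default: drop every row that has at least one NA.
any_na = data.dropna()

# how='all': drop only rows in which every value is NA.
all_na = data.dropna(how='all')

# Add an all-NA column, then drop columns that are entirely NA.
data[4] = NA
cols = data.dropna(axis=1, how='all')

# thresh=2: keep rows that have at least two non-NA values.
at_least_two = data.dropna(thresh=2)

print(any_na)
print(all_na)
print(cols)
print(at_least_two)
```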
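Filling In Missing Data
  • fillna is the main method for this; calling it with a constant replaces the missing values with that value.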
In [6]: df
Out[6]:
          0         1         2
0  0.384681  0.711003  1.672274
1 -0.453881  1.268871  0.688568
2 -0.272381  0.689529  1.194154
3  0.011967 -0.141345 -0.008628
4 -0.405584  0.427896  0.266322
5  0.155367  0.831051  0.040576
6  0.082301  1.629965 -0.433486

In [9]: df.loc[:4,1]=NA;df.loc[:2,2]=NA
In [10]: df
Out[10]:
          0         1         2
0  0.384681       NaN       NaN
1 -0.453881       NaN       NaN
2 -0.272381       NaN       NaN
3  0.011967       NaN -0.008628
4 -0.405584       NaN  0.266322
5  0.155367  0.831051  0.040576
6  0.082301  1.629965 -0.433486

In [11]: df.fillna(0)
# Fill NA values with 0
Out[11]:
          0         1         2
0  0.384681  0.000000  0.000000
1 -0.453881  0.000000  0.000000
2 -0.272381  0.000000  0.000000
3  0.011967  0.000000 -0.008628
4 -0.405584  0.000000  0.266322
5  0.155367  0.831051  0.040576
6  0.082301  1.629965 -0.433486

In [12]: df.fillna({1:0.5,3:-1})
# Calling fillna with a dict fills each column with its own value
Out[12]:
          0         1         2
0  0.384681  0.500000       NaN
1 -0.453881  0.500000       NaN
2 -0.272381  0.500000       NaN
3  0.011967  0.500000 -0.008628
4 -0.405584  0.500000  0.266322
5  0.155367  0.831051  0.040576
6  0.082301  1.629965 -0.433486
  • The same interpolation methods that are available for reindex can be used with fillna:
In [12]: df
Out[12]:
          0         1         2
0  0.473072  0.586030 -2.169215
1  0.675384 -1.530588 -0.918324
2 -1.483835       NaN -1.287079
3 -1.387653       NaN  2.044451
4  0.247890       NaN       NaN
5  0.151385       NaN       NaN

# method='ffill' fills each gap with the last non-missing value before it
In [13]: df.fillna(method='ffill')
Out[13]:
          0         1         2
0  0.473072  0.586030 -2.169215
1  0.675384 -1.530588 -0.918324
2 -1.483835 -1.530588 -1.287079
3 -1.387653 -1.530588  2.044451
4  0.247890 -1.530588  2.044451
5  0.151385 -1.530588  2.044451
# limit=2 fills at most two consecutive NA values per gap
In [14]: df.fillna(method='ffill',limit=2)
Out[14]:
          0         1         2
0  0.473072  0.586030 -2.169215
1  0.675384 -1.530588 -0.918324
2 -1.483835 -1.530588 -1.287079
3 -1.387653 -1.530588  2.044451
4  0.247890       NaN  2.044451
5  0.151385       NaN  2.044451
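Forward filling with and without a limit can be sketched as follows. The values are hypothetical round numbers rather than the random data in the transcript, and the sketch uses the DataFrame.ffill method, which performs the same fill as fillna(method='ffill'); newer pandas releases deprecate the method argument of fillna, so ffill is the more portable spelling.

```python
import numpy as np
import pandas as pd

NA = np.nan
# Hypothetical values with the same NA pattern as the transcript.
df = pd.DataFrame({0: [0.5, 0.7, -1.5, -1.4, 0.2, 0.1],
                   1: [0.6, -1.5, NA, NA, NA, NA],
                   2: [-2.2, -0.9, -1.3, 2.0, NA, NA]})

# Every gap is filled with the last valid value above it.
filled = df.ffill()

# At most two consecutive NA values are filled in each gap.
limited = df.ffill(limit=2)

print(filled)
print(limited)
```

With limit=2, column 1 keeps NaN in its last two rows because its gap is four rows long.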

Other fillna Arguments

Hierarchical Indexing
  • Displaying a hierarchically indexed Series and its index:
In [16]: data
Out[16]:
a  1    0.091050
   2    0.515947
   3   -0.115889
b  1    0.165276
   2    0.490398
   3   -0.184115
c  1   -1.258075
   2   -0.552847
d  2   -1.097689
   3    0.113758
dtype: float64

In [17]: data.index
Out[17]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
  • With a hierarchically indexed object, selecting a subset by the outer level is straightforward:
In [19]: data['b':'c']
Out[19]:
b  1    0.165276
   2    0.490398
   3   -0.184115
c  1   -1.258075
   2   -0.552847
dtype: float64
# Selecting from the inner level of the index
In [20]: data[:,2]
Out[20]:
a    0.515947
b    0.490398
c   -0.552847
d   -1.097689
dtype: float64
  • The data can be rearranged into a DataFrame with the unstack method:
In [22]: data.unstack()
Out[22]:
          1         2         3
a  0.091050  0.515947 -0.115889
b  0.165276  0.490398 -0.184115
c -1.258075 -0.552847       NaN
d       NaN -1.097689  0.113758

# The inverse of unstack is stack:

In [23]: data.unstack().stack()
Out[23]:
a  1    0.091050
   2    0.515947
   3   -0.115889
b  1    0.165276
   2    0.490398
   3   -0.184115
c  1   -1.258075
   2   -0.552847
d  2   -1.097689
   3    0.113758
dtype: float64
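The notes do not show how the hierarchically indexed data was created. A minimal sketch, assuming it was built from two parallel lists of labels and using np.arange in place of the random values, walks through the same selections.

```python
import numpy as np
import pandas as pd

# Assumed construction: a list of two label lists gives a two-level index.
data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# Outer level: everything under 'b', indexed by the inner labels.
outer = data['b']

# Inner level: the value labelled 2 within every outer group.
inner = data.loc[:, 2]

# Inner level becomes the columns; missing combinations are NaN.
wide = data.unstack()

# Back to a Series; dropna() discards any missing combinations, so the
# result has the same entries whether or not stack keeps them.
roundtrip = wide.stack().dropna()

print(outer)
print(inner)
print(wide)
```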
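Reordering Levels

  • swaplevel takes two level numbers or names and returns a new object with those levels interchanged (the data itself is otherwise unchanged).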

In [31]: frame.swaplevel('key1','key2')
Out[31]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
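  • sortlevel sorts the data by the values in a single level, and it is often used right after swapping levels. It was removed in pandas 1.0; sort_index(level=...) does the same job in newer versions.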
In [36]: frame
Out[36]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

In [37]: frame.swaplevel(0,1).sortlevel(0)
Out[37]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
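The frame in this section is printed but never constructed. The sketch below rebuilds it under the assumption that nested lists were used for both axes, and uses sort_index(level=0) as the replacement for sortlevel(0) on current pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed construction of the two-level frame shown above.
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the two row levels; the row order itself does not change.
swapped = frame.swaplevel('key1', 'key2')

# Sort the rows by the new outermost level (key2).
sorted_swapped = swapped.sort_index(level=0)

print(swapped)
print(sorted_swapped)
```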
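Summary Statistics by Level
  • The level option specifies the level to aggregate by on a given axis, so you can sum by a level of either the rows or the columns.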
In [39]: frame
Out[39]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
# Sum by key2
In [40]: frame.sum(level='key2')
Out[40]:
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
# Sum by color
In [41]: frame.sum(level='color',axis=1)
Out[41]:
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
Using a DataFrame's Columns
  • One or more of a DataFrame's columns can be used as the row index:
# Use columns c and d as the row index
In [45]: frame.set_index(['c','d'])
Out[45]:
       a  b
c   d
one 0  0  7
    1  1  6  
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
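The frame passed to set_index is not shown in the notes. Here is a minimal sketch that rebuilds it from the printed result (the dict-based construction is an assumption) and also demonstrates drop=False and reset_index, which reverses the operation.

```python
import pandas as pd

# Assumed construction; the values match the output above.
frame = pd.DataFrame({'a': range(7),
                      'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns c and d move into a two-level row index.
frame2 = frame.set_index(['c', 'd'])

# drop=False keeps c and d as ordinary columns as well.
kept = frame.set_index(['c', 'd'], drop=False)

# reset_index moves the index levels back into columns.
restored = frame2.reset_index()

print(frame2)
print(restored)
```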