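Python 3.6 pandas: basic format and usage of Series and DataFrame, with code examples

pandas is a library built on top of NumPy. Together with NumPy, it is mainly used for scientific computing and data processing.

It is also the library that let me forget about expensive MATLAB, and that forced me to brush up on SQL.

The conventional imports: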

In [105]: from pandas import Series,DataFrame

In [106]: import pandas as pd

In [107]: import numpy as np

Series

Similar to a one-dimensional array: it consists of a set of data and an associated set of index labels.

In [68]: o2 = Series([4,-7,35,99])

In [69]: o2

Out[69]:

0 4

1 -7

2 35

3 99

dtype: int64

In [70]: o2 = Series([4,-7,35,99],index=['a','b','c','d'])

In [71]: o2

Out[71]:

a 4

b -7

c 35

d 99

dtype: int64
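
To make the "data plus index" idea concrete, here is a minimal sketch that reuses the o2 values from above. The variable names other than o2 are my own, purely for illustration.

```python
import pandas as pd

# Same data and labels as the o2 example above.
o2 = pd.Series([4, -7, 35, 99], index=['a', 'b', 'c', 'd'])

values = o2.values          # the underlying NumPy array of data
labels = list(o2.index)     # the index labels, in order

by_label = o2['c']          # select one value by its label
subset = o2[['d', 'a']]     # select several labels; order follows the list
positive = o2[o2 > 0]       # boolean filtering keeps the matching labels

print(by_label)
print(subset)
print(positive)
```

Filtering keeps each value attached to its label, which is the main difference from a plain NumPy array.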

DataFrame

A tabular data structure. It can be viewed as a dict of Series that all share the same index.

In [21]: frame = DataFrame(np.arange(9).reshape(3,3),index=['a','b','c'],

...: columns=['Ohio','Texas','Califor'])

In [22]: frame

Out[22]:

Ohio Texas Califor

a 0 1 2

b 3 4 5

c 6 7 8
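
The "dict of Series" view can be tried directly. The sketch below uses made-up data, with the second Series deliberately missing one label to show how the shared index is formed.

```python
import pandas as pd

# Two Series with partly different labels (illustrative data).
ohio = pd.Series([0, 3, 6], index=['a', 'b', 'c'])
texas = pd.Series([1, 4], index=['a', 'b'])

# The dict keys become column names; the row index is the union of labels.
frame = pd.DataFrame({'Ohio': ohio, 'Texas': texas})

col = frame['Ohio']   # a single column comes back as a Series
print(frame)
print(type(col).__name__)
```

The label 'c' has no value in texas, so that cell is NaN in the resulting frame.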

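Fill values in arithmetic methods

1.1. Adding two objects with different shapes aligns them on their labels; any position that exists in only one of them becomes NaN

1.2. Passing fill_value to an arithmetic method such as add substitutes that value for the missing side before the operation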

In [31]: df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))

In [32]: df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))

In [33]: df1

Out[33]:

a b c d

0 0.0 1.0 2.0 3.0

1 4.0 5.0 6.0 7.0

2 8.0 9.0 10.0 11.0

In [34]: df2

Out[34]:

a b c d e

0 0.0 1.0 2.0 3.0 4.0

1 5.0 6.0 7.0 8.0 9.0

2 10.0 11.0 12.0 13.0 14.0

3 15.0 16.0 17.0 18.0 19.0

In [35]: df1 + df2

Out[35]:

a b c d e

0 0.0 2.0 4.0 6.0 NaN

1 9.0 11.0 13.0 15.0 NaN

2 18.0 20.0 22.0 24.0 NaN

3 NaN NaN NaN NaN NaN

In [36]: df1.add(df2,fill_value=0)

Out[36]:

a b c d e

0 0.0 2.0 4.0 6.0 4.0

1 9.0 11.0 13.0 15.0 9.0

2 18.0 20.0 22.0 24.0 14.0

3 15.0 16.0 17.0 18.0 19.0
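
One detail is easy to miss: as far as I understand it, fill_value only stands in for a value missing on one side. Where both sides are missing, the result is still NaN. A small sketch with two hypothetical Series:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan], index=['a', 'b'])
s2 = pd.Series([10.0, np.nan, 3.0], index=['a', 'b', 'c'])

plain = s1 + s2                     # 'b' and 'c' both come out as NaN
filled = s1.add(s2, fill_value=0)   # 'c' is filled; 'b' is missing on both sides

print(plain)
print(filled)
```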

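Operations between DataFrame and Series: broadcasting

2.1. By default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows

2.2. To match on the row index and broadcast across the columns instead, use an arithmetic method (for example sub) with axis=0

The NumPy example below shows the same row-wise broadcasting on a plain array.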

In [41]: arr = np.arange(12.).reshape(3,4)

In [42]: arr

Out[42]:

array([[ 0., 1., 2., 3.],

[ 4., 5., 6., 7.],

[ 8., 9., 10., 11.]])

In [43]: arr - arr[0]

Out[43]:

array([[ 0., 0., 0., 0.],

[ 4., 4., 4., 4.],

[ 8., 8., 8., 8.]])
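
The same two cases on an actual DataFrame, as a minimal sketch. The frame and its labels are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]   # Series indexed by the column labels b, d, e
col_d = frame['d']          # Series indexed by the row labels

# 2.1: the Series index is matched on the columns, broadcast down the rows.
by_rows = frame - first_row

# 2.2: match on the row index instead, broadcasting across the columns.
by_cols = frame.sub(col_d, axis=0)

print(by_rows)
print(by_cols)
```

Writing frame - col_d with the plain operator would try to match the row labels against the column labels and give mostly NaN, which is why the method form with axis=0 is needed.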

Function application and mapping

This is usually done by passing a lambda or a named function to apply

#lambda

In [56]: frame

Out[56]:

b d e

Utah 0.073770 -0.264937 1.085603

Ohio 1.274547 0.820050 0.056422

Texas 1.346414 1.786314 -0.311222

Oregon 0.571323 -0.731404 0.502011

In [57]: f = lambda x : x.max() - x.min()

In [58]: frame.apply(f)

Out[58]:

b 1.272643

d 2.517719

e 1.396825

dtype: float64

In [59]: frame.apply(f,axis=1)

Out[59]:

Utah 1.350540

Ohio 1.218125

Texas 2.097536

Oregon 1.302727

dtype: float64

#f(x)

In [60]: def f(x):

...:     return Series([x.min(),x.max()],index=['min','max'])

In [61]: frame.apply(f)

Out[61]:

b d e

min 0.073770 -0.731404 -0.311222

max 1.346414 1.786314 1.085603
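
The frame above holds random numbers, so its output cannot be reproduced exactly. Here is a small deterministic sketch of the same apply calls, plus Series.map for the element-wise case. The data and the name spread are my own.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1, 5, 2], 'd': [10, 4, 7]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    """Range (max minus min) of a Series."""
    return x.max() - x.min()

per_column = frame.apply(spread)         # called once per column
per_row = frame.apply(spread, axis=1)    # called once per row

# Element-wise mapping on a single column.
formatted = frame['d'].map(lambda v: '%.1f' % v)

print(per_column)
print(per_row)
print(formatted)
```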

Summarizing and computing descriptive statistics

In [70]: df

Out[70]:

0 1 2

a 1.037884 0.932937 0.480702

a -1.453084 -1.039968 0.306588

b 0.352103 0.083231 -0.264383

b 0.628823 -0.454043 -0.993764

In [71]: df.describe()

Out[71]:

0 1 2

count 4.000000 4.000000 4.000000

mean 0.141432 -0.119461 -0.117714

std 1.099703 0.838233 0.665109

min -1.453084 -1.039968 -0.993764

25% -0.099194 -0.600524 -0.446728

50% 0.490463 -0.185406 0.021103

75% 0.731088 0.295658 0.350117

max 1.037884 0.932937 0.480702
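
Besides describe, the individual reductions can be called directly. A minimal sketch on a small made-up frame, showing the axis and skipna options:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, np.nan], 'two': [4.0, np.nan, np.nan]},
                  index=['a', 'b', 'c'])

col_sums = df.sum()                       # per column, NaN skipped
row_means = df.mean(axis=1)               # per row, NaN skipped
strict = df.mean(axis=1, skipna=False)    # NaN if any value in the row is missing
counts = df.count()                       # number of non-NaN values per column

print(col_sums)
print(row_means)
print(strict)
print(counts)
```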

Handling missing values

In [89]: df1

Out[89]:

0 1 2

0 1.700089 NaN NaN

1 0.209934 NaN NaN

2 -1.300037 NaN NaN

3 -0.044868 NaN 1.712725

4 0.624518 NaN -0.559871

5 -1.036317 1.075744 1.267794

6 -0.201066 0.268681 -0.356206

In [90]: df1.fillna(0)

Out[90]:

0 1 2

0 1.700089 0.000000 0.000000

1 0.209934 0.000000 0.000000

2 -1.300037 0.000000 0.000000

3 -0.044868 0.000000 1.712725

4 0.624518 0.000000 -0.559871

5 -1.036317 1.075744 1.267794

6 -0.201066 0.268681 -0.356206

In [91]: df1

Out[91]:

0 1 2

0 1.700089 NaN NaN

1 0.209934 NaN NaN

2 -1.300037 NaN NaN

3 -0.044868 NaN 1.712725

4 0.624518 NaN -0.559871

5 -1.036317 1.075744 1.267794

6 -0.201066 0.268681 -0.356206

In [92]: df1.fillna({1:0.5,2:33})

Out[92]:

0 1 2

0 1.700089 0.500000 33.000000

1 0.209934 0.500000 33.000000

2 -1.300037 0.500000 33.000000

3 -0.044868 0.500000 1.712725

4 0.624518 0.500000 -0.559871

5 -1.036317 1.075744 1.267794

6 -0.201066 0.268681 -0.356206
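
As Out[91] shows, fillna returns a new object and leaves df1 unchanged. The sketch below repeats that on a small made-up frame and adds dropna, which discards rows instead of filling them.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0],
                   1: [np.nan, np.nan, 6.0],
                   2: [np.nan, 8.0, 9.0]})

filled = df.fillna({1: 0.5, 2: 33})           # per-column fill values, new object
still_missing = int(df.isna().sum().sum())    # the original df is unchanged

complete_rows = df.dropna()                   # keep only rows with no NaN

print(filled)
print(still_missing)
print(complete_rows)
```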

Hierarchical indexing / MultiIndex

6.1. The basis is an index with more than one level

In [100]: data = Series(np.random.rand(10),index=[['a','a','a','b','b','b','c',

...: 'c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

In [101]: data

Out[101]:

a 1 0.676413

2 0.623518

3 0.414257

b 1 0.434586

2 0.905924

3 0.726079

c 1 0.693546

2 0.708168

d 2 0.667362

3 0.789808

dtype: float64

In [102]: data.index

Out[102]:

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],

labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
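
The practical benefit of a two-level index is partial selection. A minimal sketch with the same labels as data above, but with deterministic values (np.arange instead of random numbers) so the results can be checked:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer_b = data.loc['b']       # every entry under the outer label 'b'
inner_2 = data.loc[:, 2]      # inner label 2 across all outer labels
single = data.loc['c', 1]     # one value addressed by both levels

print(outer_b)
print(inner_2)
print(single)
```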

6.2. unstack pivots the inner index level into columns, turning the Series into a DataFrame

In [114]: data.unstack()

Out[114]:

1 2 3

a 0.676413 0.623518 0.414257

b 0.434586 0.905924 0.726079

c 0.693546 0.708168 NaN

d NaN 0.667362 0.789808

6.3. The inverse of unstack is stack

In [115]: data.unstack().stack()

Out[115]:

a 1 0.676413

2 0.623518

3 0.414257

b 1 0.434586

2 0.905924

3 0.726079

c 1 0.693546

2 0.708168

d 2 0.667362

3 0.789808

dtype: float64
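
A smaller round trip, as a sketch. The made-up Series below has every combination of labels, so no NaN is introduced and the values come back exactly. unstack also takes a level argument to choose which level becomes the columns.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40],
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = s.unstack()           # inner level becomes the columns
outer_wide = s.unstack(0)    # pivot the outer level instead
back = wide.stack()          # columns go back to an inner index level

print(wide)
print(outer_wide)
print(back)
```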

6.4. Either axis of a DataFrame can carry a hierarchical index

In [118]: frame =DataFrame(np.arange(12).reshape(4,3),

...: index = [['a','a','b','b'],[1,2,1,2]],

...: columns = [['city1','city1','city2'],['G','R','G']])

In [120]: frame

Out[120]:

city1 city2

G R G

a 1 0 1 2

2 3 4 5

b 1 6 7 8

2 9 10 11

In [121]: frame.index.names = ['key1','key2']

In [122]: frame.columns.names = ['citys','color']

In [123]: frame

Out[123]:

citys city1 city2

color G R G

key1 key2

a 1 0 1 2

2 3 4 5

b 1 6 7 8

2 9 10 11
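
With levels on both axes, partial selection works on columns as well as rows. A minimal sketch that rebuilds the same frame; the variable names city1, rows_a and cell are my own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['city1', 'city1', 'city2'], ['G', 'R', 'G']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['citys', 'color']

city1 = frame['city1']                        # columns G and R under city1
rows_a = frame.loc['a']                       # rows where key1 is 'a'
cell = frame.loc[('b', 2), ('city2', 'G')]    # one cell, addressed on both axes

print(city1)
print(rows_a)
print(cell)
```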

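Using DataFrame columns as the index

7.1. set_index moves one or more columns into the index; drop=False keeps the original columns as well

7.2. reset_index undoes 7.1. by moving the index levels back into columns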

#7.1. set_index

In [134]: f

Out[134]:

a b c d

0 0 7 one 0

1 1 6 one 1

2 2 5 one 2

3 3 4 two 0

4 4 3 two 1

5 5 2 two 2

6 6 1 two 3

In [135]: f.set_index(['c','d'])

Out[135]:

a b

c d

one 0 0 7

1 1 6

2 2 5

two 0 3 4

1 4 3

2 5 2

3 6 1

In [136]: f.set_index(['c','d'],drop=False)

Out[136]:

a b c d

c d

one 0 0 7 one 0

1 1 6 one 1

2 2 5 one 2

two 0 3 4 two 0

1 4 3 two 1

2 5 2 two 2

3 6 1 two 3

# 7.2. reset_index example

In [137]: frame2= f.set_index(['c','d'])

In [139]: frame2

Out[139]:

a b

c d

one 0 0 7

1 1 6

2 2 5

two 0 3 4

1 4 3

2 5 2

3 6 1

In [140]: frame2.reset_index()

Out[140]:

c d a b

0 one 0 0 7

1 one 1 1 6

2 one 2 2 5

3 two 0 3 4

4 two 1 4 3

5 two 2 5 2

6 two 3 6 1
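
The session above never shows how f was built. The construction below is my assumption, inferred from the printed table, so that the round trip can be run end to end.

```python
import pandas as pd

# Assumed construction of f, inferred from the printed table above.
f = pd.DataFrame({'a': range(7),
                  'b': range(7, 0, -1),
                  'c': ['one'] * 3 + ['two'] * 4,
                  'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = f.set_index(['c', 'd'])    # c and d move into a two-level index
restored = frame2.reset_index()     # index levels return as leading columns

print(frame2)
print(restored)
```

The data survives the round trip, but the column order changes: c and d now come first.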

Panel data / a three-dimensional DataFrame

The book mentions that this is rarely used and can usually be reduced to two dimensions.
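
One way to do that reduction, as a rough sketch: keep each 2-D slice as a DataFrame and stack the slices with pd.concat, so the third dimension becomes an outer index level. The slice names and values here are hypothetical. As far as I know, Panel itself was later deprecated and removed from pandas, so this form is also the more portable one.

```python
import numpy as np
import pandas as pd

# Two hypothetical 2-D slices that together would form a 3-D structure.
slices = {
    'item1': pd.DataFrame(np.arange(4).reshape(2, 2),
                          index=['r1', 'r2'], columns=['x', 'y']),
    'item2': pd.DataFrame(np.arange(4, 8).reshape(2, 2),
                          index=['r1', 'r2'], columns=['x', 'y']),
}

# Stack the slices into one DataFrame; the dict keys become the outer index level.
stacked = pd.concat(slices)

one_slice = stacked.loc['item2']            # recover a single 2-D slice
cell = stacked.loc[('item1', 'r2'), 'y']    # address one cell by three labels

print(stacked)
print(one_slice)
print(cell)
```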

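I find these pandas features quite similar to Excel's VB language. Languages really are much alike, since underneath it is all matrices and logic. I will look things up in a reference book when I need them.

By the way, data analysis turns out to be very handy for troubleshooting too, which I never expected.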

2018.7.20
