pandas Quick Tutorial

 


Object Creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

In [1]:

 

import pandas as pd
import numpy as np
s = pd.Series([1,3,5,np.nan,6,8])
s

Out[1]:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [4]:

 

dates = pd.date_range('20130101', periods=6)
dates

Out[4]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [15]:

 

df = pd.DataFrame(np.arange(24).reshape(6,4), index=dates, columns=list('ABCD'))
df

Out[15]:

             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23

In [16]:

 

df2 = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3] * 4,dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),
                    'F' : 'foo' })
df2

Out[16]:

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes.

In [17]:

 

df2.dtypes

Out[17]:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing Data

Here is how to view the top and bottom rows of the frame:

In [18]:

 

df.head()

Out[18]:

             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19

In [19]:

 

df.tail(3)

Out[19]:

             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23

Display the index, the columns, and the underlying NumPy data:

In [20]:

 

df.index

Out[20]:

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [21]:

 

df.columns

Out[21]:

Index(['A', 'B', 'C', 'D'], dtype='object')

In [22]:

 

df.values

Out[22]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

A quick statistic summary of your data:

In [23]:

 

df.describe()

Out[23]:

               A          B          C          D
count   6.000000   6.000000   6.000000   6.000000
mean   10.000000  11.000000  12.000000  13.000000
std     7.483315   7.483315   7.483315   7.483315
min     0.000000   1.000000   2.000000   3.000000
25%     5.000000   6.000000   7.000000   8.000000
50%    10.000000  11.000000  12.000000  13.000000
75%    15.000000  16.000000  17.000000  18.000000
max    20.000000  21.000000  22.000000  23.000000

Transposing your data:

In [24]:

 

df.T

Out[24]:

   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A           0           4           8          12          16          20
B           1           5           9          13          17          21
C           2           6          10          14          18          22
D           3           7          11          15          19          23

Sorting by an axis:

In [25]:

 

df.sort_index(axis=1, ascending=False)

Out[25]:

             D   C   B   A
2013-01-01   3   2   1   0
2013-01-02   7   6   5   4
2013-01-03  11  10   9   8
2013-01-04  15  14  13  12
2013-01-05  19  18  17  16
2013-01-06  23  22  21  20

Sorting by values:

In [27]:

 

df.sort_values(by='B',ascending=False)

Out[27]:

             A   B   C   D
2013-01-06  20  21  22  23
2013-01-05  16  17  18  19
2013-01-04  12  13  14  15
2013-01-03   8   9  10  11
2013-01-02   4   5   6   7
2013-01-01   0   1   2   3

Selection

Selecting a single column, which yields a Series, equivalent to df.A:

In [28]:

 

df['A']

Out[28]:

2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int32

In [29]:

 

df[0:3]

Out[29]:

            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11

In [30]:

 

df['20130102':'20130104']

Out[30]:

             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15

Selection by Label

Getting a cross section using a label:

In [31]:

 

df.loc[dates[0]]

Out[31]:

A    0
B    1
C    2
D    3
Name: 2013-01-01 00:00:00, dtype: int32

In [32]:

 

df.loc[:,['A','B']]

Out[32]:

             A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21

In [33]:

 

df.loc['20130102':'20130104',['A','B']]

Out[33]:

             A   B
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13

In [34]:

 

df.loc['20130102',['A','B']]

Out[34]:

A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int32

In [35]:

 

df.loc[dates[0],'A']

Out[35]:

0

For getting fast access to a scalar (equivalent to the prior method):

In [36]:

 

df.at[dates[0],'A']

Out[36]:

0

Selection by Position

Select via the position of the passed integers:

In [37]:

 

df.iloc[3]

Out[37]:

A    12
B    13
C    14
D    15
Name: 2013-01-04 00:00:00, dtype: int32

In [38]:

 

df.iloc[3:5,0:2]

Out[38]:

             A   B
2013-01-04  12  13
2013-01-05  16  17

In [39]:

 

df.iloc[[1,2,4],[0,2]]

Out[39]:

             A   C
2013-01-02   4   6
2013-01-03   8  10
2013-01-05  16  18

In [40]:

 

df.iloc[1:3,:]

Out[40]:

            A  B   C   D
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11

In [41]:

 

df.iloc[:,1:3]

Out[41]:

             B   C
2013-01-01   1   2
2013-01-02   5   6
2013-01-03   9  10
2013-01-04  13  14
2013-01-05  17  18
2013-01-06  21  22

In [42]:

 

df.iloc[1,1]

Out[42]:

5

For getting fast access to a scalar (equivalent to the prior method):

In [43]:

 

df.iat[1,1]

Out[43]:

5

Boolean Indexing

Using a single column's values to select data:

In [44]:

 

df[df.A > 0]

Out[44]:

             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23

In [45]:

 

df[df > 0]

Out[45]:

               A   B   C   D
2013-01-01   NaN   1   2   3
2013-01-02   4.0   5   6   7
2013-01-03   8.0   9  10  11
2013-01-04  12.0  13  14  15
2013-01-05  16.0  17  18  19
2013-01-06  20.0  21  22  23
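Boolean conditions can also be combined. A minimal sketch, not part of the original notebook, reusing the tutorial's df; each comparison needs its own parentheses because & binds more tightly than the comparisons:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

# Combine element-wise conditions with & (and) / | (or),
# wrapping each comparison in its own parentheses
subset = df[(df.A > 0) & (df.B < 20)]
subset
```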

Using the isin() method for filtering:

In [47]:

 

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Out[47]:

             A   B   C   D      E
2013-01-01   0   1   2   3    one
2013-01-02   4   5   6   7    one
2013-01-03   8   9  10  11    two
2013-01-04  12  13  14  15  three
2013-01-05  16  17  18  19   four
2013-01-06  20  21  22  23  three

In [48]:

 

df2[df2['E'].isin(['two','four'])]

Out[48]:

             A   B   C   D     E
2013-01-03   8   9  10  11   two
2013-01-05  16  17  18  19  four

Setting

Setting a new column automatically aligns the data by the indexes.

In [49]:

 

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20130102', periods=6))
s1

Out[49]:

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [50]:

 

df['F'] = s1
df

Out[50]:

             A   B   C   D    F
2013-01-01   0   1   2   3  NaN
2013-01-02   4   5   6   7  1.0
2013-01-03   8   9  10  11  2.0
2013-01-04  12  13  14  15  3.0
2013-01-05  16  17  18  19  4.0
2013-01-06  20  21  22  23  5.0

In [51]:

 

df.at[dates[0],'A'] = 0
df.iat[0,1] = 0
df

Out[51]:

             A   B   C   D    F
2013-01-01   0   0   2   3  NaN
2013-01-02   4   5   6   7  1.0
2013-01-03   8   9  10  11  2.0
2013-01-04  12  13  14  15  3.0
2013-01-05  16  17  18  19  4.0
2013-01-06  20  21  22  23  5.0

In [52]:

 

df.loc[:,'D'] = np.array([5] * len(df))
df

Out[52]:

             A   B   C  D    F
2013-01-01   0   0   2  5  NaN
2013-01-02   4   5   6  5  1.0
2013-01-03   8   9  10  5  2.0
2013-01-04  12  13  14  5  3.0
2013-01-05  16  17  18  5  4.0
2013-01-06  20  21  22  5  5.0

In [53]:

 

df2 = df.copy()
df2[df2 > 0] = -df2
df2

Out[53]:

              A    B    C  D    F
2013-01-01    0    0   -2 -5  NaN
2013-01-02   -4   -5   -6 -5 -1.0
2013-01-03   -8   -9  -10 -5 -2.0
2013-01-04  -12  -13  -14 -5 -3.0
2013-01-05  -16  -17  -18 -5 -4.0
2013-01-06  -20  -21  -22 -5 -5.0

Missing Data

pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. Reindexing allows you to change/add/delete the index on a specified axis; this returns a copy of the data.

In [54]:

 

df

Out[54]:

             A   B   C  D    F
2013-01-01   0   0   2  5  NaN
2013-01-02   4   5   6  5  1.0
2013-01-03   8   9  10  5  2.0
2013-01-04  12  13  14  5  3.0
2013-01-05  16  17  18  5  4.0
2013-01-06  20  21  22  5  5.0

In [55]:

 

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Out[55]:

             A   B   C  D    F    E
2013-01-01   0   0   2  5  NaN  1.0
2013-01-02   4   5   6  5  1.0  1.0
2013-01-03   8   9  10  5  2.0  NaN
2013-01-04  12  13  14  5  3.0  NaN

In [56]:

 

df1.dropna(how='any')             #To drop any rows that have missing data.

Out[56]:

            A  B  C  D    F    E
2013-01-02  4  5  6  5  1.0  1.0

In [57]:

 

df1.fillna(value=9)               #Filling missing data.

Out[57]:

             A   B   C  D    F    E
2013-01-01   0   0   2  5  9.0  1.0
2013-01-02   4   5   6  5  1.0  1.0
2013-01-03   8   9  10  5  2.0  9.0
2013-01-04  12  13  14  5  3.0  9.0

In [58]:

 

pd.isna(df1)                     #To get the boolean mask where values are nan.

Out[58]:

                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations

Operations in general exclude missing data. Performing a descriptive statistic:

In [59]:

 

df.mean()

Out[59]:

A    10.000000
B    10.833333
C    12.000000
D     5.000000
F     3.000000
dtype: float64

In [60]:

 

df.mean(1)                    #Same operation on the other axis:

Out[60]:

2013-01-01     1.75
2013-01-02     4.20
2013-01-03     6.80
2013-01-04     9.40
2013-01-05    12.00
2013-01-06    14.60
Freq: D, dtype: float64

Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension.

In [64]:

 

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(-2)
s

Out[64]:

2013-01-01    5.0
2013-01-02    NaN
2013-01-03    6.0
2013-01-04    8.0
2013-01-05    NaN
2013-01-06    NaN
Freq: D, dtype: float64
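The alignment and broadcasting described above can be sketched as follows; this example is an addition that reuses the tutorial's df and the shifted Series. Subtracting the Series row-wise aligns on the date index and propagates NaN:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(-2)

# Rows align on the date index; s is subtracted from every column,
# and rows where s is NaN become all-NaN
result = df.sub(s, axis='index')
result
```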

Apply

Applying functions to the data:

In [65]:

 

df

Out[65]:

             A   B   C  D    F
2013-01-01   0   0   2  5  NaN
2013-01-02   4   5   6  5  1.0
2013-01-03   8   9  10  5  2.0
2013-01-04  12  13  14  5  3.0
2013-01-05  16  17  18  5  4.0
2013-01-06  20  21  22  5  5.0

In [66]:

 

df.apply(np.cumsum)

Out[66]:

             A   B   C   D     F
2013-01-01   0   0   2   5   NaN
2013-01-02   4   5   8  10   1.0
2013-01-03  12  14  18  15   3.0
2013-01-04  24  27  32  20   6.0
2013-01-05  40  44  50  25  10.0
2013-01-06  60  65  72  30  15.0

In [67]:

 

df.apply(lambda x: x.max() - x.min())

Out[67]:

A    20.0
B    21.0
C    20.0
D     0.0
F     4.0
dtype: float64

Histogramming

In [68]:

 

s = pd.Series(np.random.randint(0, 7, size=10))
s

Out[68]:

0    2
1    3
2    5
3    0
4    1
5    1
6    0
7    5
8    2
9    6
dtype: int32

In [69]:

 

s.value_counts()

Out[69]:

5    2
2    2
1    2
0    2
6    1
3    1
dtype: int64

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default.

In [70]:

 

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

Out[70]:

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
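To illustrate the regex default mentioned above, a small sketch (this example is an addition, not from the original tutorial):

```python
import numpy as np
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# str.contains interprets the pattern as a regular expression by default;
# na=False treats the missing entry as a non-match
mask = s.str.contains('^[AB]', na=False)
s[mask]
```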

Merge

pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join/merge-type operations.
Concatenating pandas objects together with concat():

In [76]:

 

df = pd.DataFrame(np.arange(10,50).reshape(10,4))
df

Out[76]:

    0   1   2   3
0  10  11  12  13
1  14  15  16  17
2  18  19  20  21
3  22  23  24  25
4  26  27  28  29
5  30  31  32  33
6  34  35  36  37
7  38  39  40  41
8  42  43  44  45
9  46  47  48  49

In [77]:

 

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

Out[77]:

[    0   1   2   3
 0  10  11  12  13
 1  14  15  16  17
 2  18  19  20  21,     0   1   2   3
 3  22  23  24  25
 4  26  27  28  29
 5  30  31  32  33
 6  34  35  36  37,     0   1   2   3
 7  38  39  40  41
 8  42  43  44  45
 9  46  47  48  49]

In [79]:

 

len(pieces),type(pieces[0])

Out[79]:

(3, pandas.core.frame.DataFrame)

In [80]:

 

pd.concat(pieces)

Out[80]:

    0   1   2   3
0  10  11  12  13
1  14  15  16  17
2  18  19  20  21
3  22  23  24  25
4  26  27  28  29
5  30  31  32  33
6  34  35  36  37
7  38  39  40  41
8  42  43  44  45
9  46  47  48  49

Join

SQL-style merges.

In [81]:

 

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

Out[81]:

   key  lval
0  foo     1
1  foo     2

In [82]:

 

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

Out[82]:

   key  rval
0  foo     4
1  foo     5

In [83]:

 

pd.merge(left, right, on='key')

Out[83]:

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

In [84]:

 

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

Out[84]:

   key  lval
0  foo     1
1  bar     2

In [85]:

 

right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

Out[85]:

   key  rval
0  foo     4
1  bar     5

In [86]:

 

pd.merge(left, right, on='key')

Out[86]:

   key  lval  rval
0  foo     1     4
1  bar     2     5

Append

Appending rows to a DataFrame. See the Appending section of the docs.

In [87]:

 

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Out[87]:

          A         B         C         D
0  0.476360 -0.142900  1.306915  0.886458
1  0.685598  0.673836  0.181896  0.446095
2 -0.240541 -1.670222 -1.086695  0.024977
3  0.433192  1.776994 -0.226416 -0.376136
4  0.955323  1.655059  0.500006 -1.382097
5  0.969208  1.964340 -1.249788  1.448280
6 -1.078558 -0.989534 -1.044546  0.529323
7  1.851316  1.235936 -0.437019  0.546799

In [88]:

 

s = df.iloc[3]
df.append(s, ignore_index=True)

Out[88]:

          A         B         C         D
0  0.476360 -0.142900  1.306915  0.886458
1  0.685598  0.673836  0.181896  0.446095
2 -0.240541 -1.670222 -1.086695  0.024977
3  0.433192  1.776994 -0.226416 -0.376136
4  0.955323  1.655059  0.500006 -1.382097
5  0.969208  1.964340 -1.249788  1.448280
6 -1.078558 -0.989534 -1.044546  0.529323
7  1.851316  1.235936 -0.437019  0.546799
8  0.433192  1.776994 -0.226416 -0.376136
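Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, pd.concat gives the same result. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]

# Equivalent to the removed df.append(s, ignore_index=True):
# turn the row back into a one-row frame and concatenate
out = pd.concat([df, s.to_frame().T], ignore_index=True)
out
```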

Grouping

By "group by" we are referring to a process involving one or more of the following steps:
1. Splitting the data into groups based on some criteria
2. Applying a function to each group independently
3. Combining the results into a data structure

In [89]:

 

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Out[89]:

     A      B         C         D
0  foo    one -1.026284  2.063183
1  bar    one  0.512460 -0.363375
2  foo    two -0.484939  0.982710
3  bar  three  2.073508 -0.393659
4  foo    two -0.197030  2.044099
5  bar    two -1.369176  2.172274
6  foo    one  0.445250  1.503858
7  foo  three -1.675851  0.207408

In [90]:

 

#Grouping and then applying the sum() function to the resulting groups.
df.groupby('A').sum()

Out[90]:

            C         D
A
bar  1.216791  1.415240
foo -2.938853  6.801258

In [91]:

 

#Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
df.groupby(['A','B']).sum()

Out[91]:

                  C         D
A   B
bar one    0.512460 -0.363375
    three  2.073508 -0.393659
    two   -1.369176  2.172274
foo one   -0.581034  3.567041
    three -1.675851  0.207408
    two   -0.681968  3.026809

Reshaping

In [92]:

 

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz','foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two','one', 'two', 'one', 'two']]))
tuples

Out[92]:

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [93]:

 

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

Out[93]:

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [95]:

 

df = pd.DataFrame(np.arange(10,26).reshape(8,2), index=index, columns=['A', 'B'])
df

Out[95]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17
foo   one     18  19
      two     20  21
qux   one     22  23
      two     24  25

In [96]:

 

df2 = df[:4]
df2

Out[96]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17

In [97]:

 

#The stack() method “compresses” a level in the DataFrame’s columns.
stacked = df2.stack()
stacked

Out[97]:

first  second   
bar    one     A    10
               B    11
       two     A    12
               B    13
baz    one     A    14
               B    15
       two     A    16
               B    17
dtype: int32

With a "stacked" DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [98]:

 

stacked.unstack()

Out[98]:

               A   B
first second
bar   one     10  11
      two     12  13
baz   one     14  15
      two     16  17

In [99]:

 

stacked.unstack(1)

Out[99]:

second   one  two
first
bar   A   10   12
      B   11   13
baz   A   14   16
      B   15   17

In [100]:

 

stacked.unstack(0)

Out[100]:

first    bar  baz
second
one   A   10   14
      B   11   15
two   A   12   16
      B   13   17

Pivot Tables

In [104]:

 

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.arange(12),
                   'E' : np.arange(10,22)})
df

Out[104]:

        A  B    C   D   E
0     one  A  foo   0  10
1     one  B  foo   1  11
2     two  C  foo   2  12
3   three  A  bar   3  13
4     one  B  bar   4  14
5     one  C  bar   5  15
6     two  A  foo   6  16
7   three  B  foo   7  17
8     one  C  foo   8  18
9     one  A  bar   9  19
10    two  B  bar  10  20
11  three  C  bar  11  21

In [105]:

 

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Out[105]:

C          bar  foo
A     B
one   A    9.0  0.0
      B    4.0  1.0
      C    5.0  8.0
three A    3.0  NaN
      B    NaN  7.0
      C   11.0  NaN
two   A    NaN  6.0
      B   10.0  NaN
      C    NaN  2.0

Time Series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications.

In [107]:

 

rng = pd.date_range('1/1/2018', periods=10, freq='S')
rng

Out[107]:

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 00:00:01',
               '2018-01-01 00:00:02', '2018-01-01 00:00:03',
               '2018-01-01 00:00:04', '2018-01-01 00:00:05',
               '2018-01-01 00:00:06', '2018-01-01 00:00:07',
               '2018-01-01 00:00:08', '2018-01-01 00:00:09'],
              dtype='datetime64[ns]', freq='S')

In [108]:

 

ts = pd.Series(np.random.randint(0, 10, len(rng)), index=rng)
ts

Out[108]:

2018-01-01 00:00:00    2
2018-01-01 00:00:01    9
2018-01-01 00:00:02    4
2018-01-01 00:00:03    1
2018-01-01 00:00:04    5
2018-01-01 00:00:05    3
2018-01-01 00:00:06    7
2018-01-01 00:00:07    2
2018-01-01 00:00:08    5
2018-01-01 00:00:09    2
Freq: S, dtype: int32

In [112]:

 

ts.resample('5S').sum()

Out[112]:

2018-01-01 00:00:00    21
2018-01-01 00:00:05    19
Freq: 5S, dtype: int32

Converting between time span representations:

In [114]:

 

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

Out[114]:

2018-01-31   -0.488761
2018-02-28   -0.407915
2018-03-31    0.428891
2018-04-30   -0.306055
2018-05-31   -0.805451
Freq: M, dtype: float64

In [115]:

 

ps = ts.to_period()
ps

Out[115]:

2018-01   -0.488761
2018-02   -0.407915
2018-03    0.428891
2018-04   -0.306055
2018-05   -0.805451
Freq: M, dtype: float64

In [116]:

 

ps.to_timestamp()

Out[116]:

2018-01-01   -0.488761
2018-02-01   -0.407915
2018-03-01    0.428891
2018-04-01   -0.306055
2018-05-01   -0.805451
Freq: MS, dtype: float64

Categoricals

pandas can include categorical data in a DataFrame.

In [117]:

 

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df

Out[117]:

   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

In [118]:

 

#Convert the raw grades to a categorical data type.
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

Out[118]:

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

In [119]:

 

#Rename the categories to more meaningful names (assigning to Series.cat.categories is inplace!).
df["grade"].cat.categories = ["very good", "good", "very bad"]
df

Out[119]:

   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad

In [120]:

 

#Reorder the categories and simultaneously add the missing categories (methods under Series .cat return a new Series by default).
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

Out[120]:

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

In [121]:

 

df.sort_values(by="grade")

Out[121]:

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

In [122]:

 

#Grouping by a categorical column also shows empty categories.
df.groupby("grade").size()

Out[122]:

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

In [126]:

 

%matplotlib inline
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

Out[126]:

<matplotlib.axes._subplots.AxesSubplot at 0xacef320>

In [127]:

 

import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

Out[127]:

<matplotlib.legend.Legend at 0xada7160>
<matplotlib.figure.Figure at 0x97fc240>

Getting Data In/Out

Writing to a CSV file.

In [ ]:

 

df.to_csv('foo.csv')

In [ ]:

 

#Reading from a csv file.
pd.read_csv('foo.csv')
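Since to_csv writes the index as the first column, reading the file back as-is adds an extra "Unnamed: 0" column; passing index_col=0 restores the saved index instead. A small round-trip sketch (the file name 'foo.csv' follows the tutorial):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 2), columns=['A', 'B'])
df.to_csv('foo.csv')

# index_col=0 turns the first CSV column back into the index
back = pd.read_csv('foo.csv', index_col=0)
back
```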

Excel

Reading and writing to Excel.

In [ ]:

 

df.to_excel('foo.xlsx', sheet_name='Sheet1')

In [ ]:

 

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])