pandas快速入门相关操作和基础用法

最新推荐文章于 2022-03-17 15:42:11 发布

鹏鹏哥哥的小红帽

最新推荐文章于 2022-03-17 15:42:11 发布

阅读量511

点赞数

文章标签： pandas 入门基础

本文链接：https://blog.csdn.net/kepengs/article/details/84076789

版权

初识Pandas

pandas是以NumPy为基础，在其之上提供了更加易用的数据结构和数据分析工具。
因此即使不使用pandas，也不会使得数据分析或机器学习任务无法完成，但是pandas可以使得你的工作效率更加高。

首先，pandas重点提供了两种数据结构：

Series
序列，一维数据。是对NumPy的一维数组的封装，但是相较于NumPy使用整型下标，我们可以使用自定义(比如具有意义的字符串)的索引(index)。
DataFrame
数据框，二维数据。是对NumPy的二维数组的封装，但是相较于NumPy使用整型下标，它可以使用自定义的索引(index)和列名(column)。

这两种数据结构对于日常工作的数据工作是最常用的，而是用具有意义的字符串来访问数据相较于使用数字则更加方便。
而在使用索引、列名之外，这两个封装额外还附带了更多的趁手的方法，比如：

describe —— 快速地计算数据的各种描述性统计值(均值、总和、中位数、四分位数等)
unique —— 数据的独立值列表(比如想知道某个特征的所有取值可能)
value_count —— 各个值的计数
hist —— 直接绘制直方图
plot —— 对matplotlib进行了简单的封装，可以快速地进行简单的数据绘图

总之，pandas尽力抽象出最经常用的一维数组、二维数组的工作，将它们编写成现成的方法，为你节省时间。

其次，pandas本身还提供了很多非常有用的处理数据时的小工具，比如：

便捷的I/O —— 提供了直接读取Excel、CSV等常见的数据文件工具
媲美SQL的功能 —— 提供了groupby, join等功能
媲美Excel的功能 —— 透视表(pivot table)功能
极其方便的日期相关功能 —— 直观到像自然语言，不必费劲去理解Python自带的日期库

最后，pandas的文档非常丰富，更新频繁，社区十分活跃。

因此，对于从事数据相关工作的人，pandas应该是你工具箱中最趁手的工具之一。

这是一个面向pandas新手、简短的教程。
通过这个教程，我们将大略地领略下pandas能提供的核心功能。

首先我们来导入需要用的模块：

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

创建对象(Object Creation)

通过传入一个列表数据，我们的pandas可以创建一个使用默认整型作为索引的Series对象。

In [3]:

s = pd.Series([1,3,5,np.nan,6,8])
s

Out[3]:

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

我们可以构建一个使用日期和标签作为索引的DataFrame对象:

In [4]:

dates = pd.date_range('20171026', periods=6)
dates

Out[4]:

DatetimeIndex(['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29',
               '2017-10-30', '2017-10-31'],
              dtype='datetime64[ns]', freq='D')

In [5]:

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=tuple('ABCD')) # tuple('ABCD') is short for ('A', 'B', 'C', 'D')
df

Out[5]:

	A	B	C	D
2017-10-26	-0.721239	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776
2017-10-29	1.263669	-0.799012	0.765561	1.731729
2017-10-30	0.249711	-0.074355	-0.277041	1.138278
2017-10-31	-1.409005	0.020601	0.008270	-0.464991

我们也可以使用一个字典(dict)来创建一个DataFrame对象，而且它会自动应用NumPy的广播：

In [6]:

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20171026'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

Out[6]:

	A	B	C	D	E	F
0	1.0	2017-10-26	1.0	3	test	foo
1	1.0	2017-10-26	1.0	3	train	foo
2	1.0	2017-10-26	1.0	3	test	foo
3	1.0	2017-10-26	1.0	3	train	foo

In [7]:

df2.dtypes

Out[7]:

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [8]:

df2.C # 直接使用标签来选择列，等价于df2['C']

Out[8]:

0    1.0
1    1.0
2    1.0
3    1.0
Name: C, dtype: float32

实际上，所有的列都可以通过标签名来访问，如"A", "B", "C", "D", "E", "F"。
相较于NumPy的二维数组a[:,3]，df.D表意能力更强，而且大家就不需要计所需要的数据是第几列了。

查看数据(viewing data)

比如我们想看看一个frame的头部和尾部:

In [9]:

df.head()

Out[9]:

	A	B	C	D
2017-10-26	-0.721239	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776
2017-10-29	1.263669	-0.799012	0.765561	1.731729
2017-10-30	0.249711	-0.074355	-0.277041	1.138278

In [10]:

df.tail(3)

Out[10]:

	A	B	C	D
2017-10-29	1.263669	-0.799012	0.765561	1.731729
2017-10-30	0.249711	-0.074355	-0.277041	1.138278
2017-10-31	-1.409005	0.020601	0.008270	-0.464991

我们也可以看看索引、列名、以及底层的numpy数据都是什么样：

In [11]:

df.index

Out[11]:

DatetimeIndex(['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29',
               '2017-10-30', '2017-10-31'],
              dtype='datetime64[ns]', freq='D')

In [12]:

df.columns

Out[12]:

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [ ]:

df.values

而且我们可以通过describe()方法来快速地看看数据的概括统计：

In [13]:

df.describe()

Out[13]:

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.172821	-0.056109	0.159190	0.572441
std	1.036059	0.567592	0.613838	0.799860
min	-1.409005	-0.799012	-0.602184	-0.464991
25%	-0.478501	-0.360553	-0.205713	0.090569
50%	0.496338	-0.026877	0.032588	0.485398
75%	0.868858	0.076805	0.588397	1.049964
max	1.263669	0.876526	1.003626	1.731729

In [14]:

df.T # 转置

Out[14]:

	2017-10-26 00:00:00	2017-10-27 00:00:00	2017-10-28 00:00:00	2017-10-29 00:00:00	2017-10-30 00:00:00	2017-10-31 00:00:00
A	-0.721239	0.742965	0.910823	1.263669	0.249711	-1.409005
B	0.095539	0.876526	-0.455952	-0.799012	-0.074355	0.020601
C	0.056906	1.003626	-0.602184	0.765561	-0.277041	0.008270
D	0.785019	0.058834	0.185776	1.731729	1.138278	-0.464991

也可以以某一个轴来排序，注意这是按照轴自己的值来排序，比如我们按照列名来排序：

In [15]:

df.sort_index(axis=1, ascending=False)

Out[15]:

	D	C	B	A
2017-10-26	0.785019	0.056906	0.095539	-0.721239
2017-10-27	0.058834	1.003626	0.876526	0.742965
2017-10-28	0.185776	-0.602184	-0.455952	0.910823
2017-10-29	1.731729	0.765561	-0.799012	1.263669
2017-10-30	1.138278	-0.277041	-0.074355	0.249711
2017-10-31	-0.464991	0.008270	0.020601	-1.409005

当然我们也可以按照数据的值来排序：

In [16]:

df.sort_values(by='B')

Out[16]:

	A	B	C	D
2017-10-29	1.263669	-0.799012	0.765561	1.731729
2017-10-28	0.910823	-0.455952	-0.602184	0.185776
2017-10-30	0.249711	-0.074355	-0.277041	1.138278
2017-10-31	-1.409005	0.020601	0.008270	-0.464991
2017-10-26	-0.721239	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834

选择数据(Selection)

请注意，虽然用于选择/设置数据的标准Python和NumPy表达式很直观，而且对于交互式工作非常方便。但对于生产代码，我们还是推荐使用经过优化的pandas数据访问方法，如：.at, .iat, .loc, iloc以及.ix。

我们这里先给一下索引(index/selection)方法的概览:

操作	语法	结果类型
选择列	df[col]	Series
选择行	df.loc[label]	Series
选择列、行	df.loc[index, column]	DataFrame
使用位置选择行	df.iloc[loc]	Series
使用位置选择行、列	df.iloc[v_loc, h_loc]	DataFrame
行切片	df[5:10] / df[index1:index2]	DataFrame
使用布尔向量选择行	df[bool_vec]	DataFrame

访问数据(Getting)

选择某一列，会返回一个Series对象，等价于df.A:

In [17]:

df['A']

Out[17]:

2017-10-26   -0.721239
2017-10-27    0.742965
2017-10-28    0.910823
2017-10-29    1.263669
2017-10-30    0.249711
2017-10-31   -1.409005
Freq: D, Name: A, dtype: float64

我们也可以使用切片的方式获得某些行：

In [26]:

df[0:3]

Out[26]:

	A	B	C	D
2017-10-26	-0.721239	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776

这种方法和NumPy的二维数组没什么差别，但是在pandas中我们可以直接使用索引的值，更加自然：

In [19]:

df['20171026':'20171028'] # 注意，使用这种索引值，结束值也会被返回。因为它们并不是整形数字

Out[19]:

	A	B	C	D
2017-10-26	-0.721239	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776

使用标签选择数据

loc方法使用索引或者列标签来选择数据:

In [27]:

df.loc[dates[0]]

Out[27]:

A   -0.721239
B    0.095539
C    0.056906
D    0.785019
Name: 2017-10-26 00:00:00, dtype: float64

In [21]:

df.loc[:,['A','B']] # 指定要A, B两列

Out[21]:

	A	B
2017-10-26	-0.721239	0.095539
2017-10-27	0.742965	0.876526
2017-10-28	0.910823	-0.455952
2017-10-29	1.263669	-0.799012
2017-10-30	0.249711	-0.074355
2017-10-31	-1.409005	0.020601

In [22]:

df.loc['20171026':'20171028',['A','B']] # index是可以被切片的

Out[22]:

	A	B
2017-10-26	-0.721239	0.095539
2017-10-27	0.742965	0.876526
2017-10-28	0.910823	-0.455952

In [23]:

df.loc['20171026':'20171028','A':'B'] # columns也可以被切片

Out[23]:

	A	B
2017-10-26	-0.721239	0.095539
2017-10-27	0.742965	0.876526
2017-10-28	0.910823	-0.455952

In [24]:

df.loc[dates[0],'A'] # 获取特定位置的数据

Out[24]:

-0.7212387718995426

In [25]:

df.at[dates[0], 'A'] # 和上面一行等价

Out[25]:

-0.7212387718995426

使用位置来选择(selection by position)

iloc使用位置来选择数据，基本类似于NumPy的方法：

In [28]:

df.iloc[3]

Out[28]:

A    1.263669
B   -0.799012
C    0.765561
D    1.731729
Name: 2017-10-29 00:00:00, dtype: float64

In [29]:

df.iloc[3:5,0:2]

Out[29]:

	A	B
2017-10-29	1.263669	-0.799012
2017-10-30	0.249711	-0.074355

In [30]:

df.iloc[[1,2,4],[0,2]] # 选择1,2,4行，第0，2列

Out[30]:

	A	C
2017-10-27	0.742965	1.003626
2017-10-28	0.910823	-0.602184
2017-10-30	0.249711	-0.277041

In [31]:

df.iloc[1:3,:]

Out[31]:

	A	B	C	D
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776

In [32]:

df.iloc[:,1:3]

Out[32]:

	B	C
2017-10-26	0.095539	0.056906
2017-10-27	0.876526	1.003626
2017-10-28	-0.455952	-0.602184
2017-10-29	-0.799012	0.765561
2017-10-30	-0.074355	-0.277041
2017-10-31	0.020601	0.008270

iloc也可以某个特殊位置的数据值

In [33]:

df.iloc[1,1] # 等价于 df.iat[1,1]

Out[33]:

0.8765258678924436

布尔索引

这种方法也和NumPy很类似：

In [34]:

df[df.A > 0]

Out[34]:

	A	B	C	D
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	-0.455952	-0.602184	0.185776
2017-10-29	1.263669	-0.799012	0.765561	1.731729
2017-10-30	0.249711	-0.074355	-0.277041	1.138278

In [35]:

df[df > 0]

Out[35]:

	A	B	C	D
2017-10-26	NaN	0.095539	0.056906	0.785019
2017-10-27	0.742965	0.876526	1.003626	0.058834
2017-10-28	0.910823	NaN	NaN	0.185776
2017-10-29	1.263669	NaN	0.765561	1.731729
2017-10-30	0.249711	NaN	NaN	1.138278
2017-10-31	NaN	0.020601	0.008270	NaN

我们可以使用isin()(is in)方法来进行过滤，对于非数值型数据很有用：

In [36]:

df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']
df2

Out[36]:

	A	B	C	D	E
2017-10-26	-0.721239	0.095539	0.056906	0.785019	one
2017-10-27	0.742965	0.876526	1.003626	0.058834	one
2017-10-28	0.910823	-0.455952	-0.602184	0.185776	two
2017-10-29	1.263669	-0.799012	0.765561	1.731729	three
2017-10-30	0.249711	-0.074355	-0.277041	1.138278	four
2017-10-31	-1.409005	0.020601	0.008270	-0.464991	three

In [37]:

df2[df2['E'].isin(['two','four'])]

Out[37]:

	A	B	C	D	E
2017-10-28	0.910823	-0.455952	-0.602184	0.185776	two
2017-10-30	0.249711	-0.074355	-0.277041	1.138278	four

赋值(setting)

In [79]:

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20171027', periods=6))
s1

Out[79]:

2017-10-27    1
2017-10-28    2
2017-10-29    3
2017-10-30    4
2017-10-31    5
2017-11-01    6
Freq: D, dtype: int64

In [80]:

df['F'] = s1
df

Out[80]:

	A	B	C	D	F
2017-10-26	0.000000	0.000000	0.501078	5	NaN
2017-10-27	0.440487	0.716872	-2.013572	5	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0
2017-10-29	-0.304800	1.262876	0.739776	5	3.0
2017-10-30	0.654254	-1.002900	1.236297	5	4.0
2017-10-31	-0.938760	1.302596	0.221423	5	5.0

In [81]:

df.at[dates[0],'A'] = 0
df

Out[81]:

	A	B	C	D	F
2017-10-26	0.000000	0.000000	0.501078	5	NaN
2017-10-27	0.440487	0.716872	-2.013572	5	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0
2017-10-29	-0.304800	1.262876	0.739776	5	3.0
2017-10-30	0.654254	-1.002900	1.236297	5	4.0
2017-10-31	-0.938760	1.302596	0.221423	5	5.0

In [82]:

df.iat[0, 1] = 0
df

Out[82]:

	A	B	C	D	F
2017-10-26	0.000000	0.000000	0.501078	5	NaN
2017-10-27	0.440487	0.716872	-2.013572	5	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0
2017-10-29	-0.304800	1.262876	0.739776	5	3.0
2017-10-30	0.654254	-1.002900	1.236297	5	4.0
2017-10-31	-0.938760	1.302596	0.221423	5	5.0

In [83]:

df.loc[:, 'D'] = np.array([5]*len(df))
df

Out[83]:

	A	B	C	D	F
2017-10-26	0.000000	0.000000	0.501078	5	NaN
2017-10-27	0.440487	0.716872	-2.013572	5	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0
2017-10-29	-0.304800	1.262876	0.739776	5	3.0
2017-10-30	0.654254	-1.002900	1.236297	5	4.0
2017-10-31	-0.938760	1.302596	0.221423	5	5.0

缺失值

pandas一般使用np.nan来表示缺失值。默认情况下，它不会参与计算。

重建索引(reindexing)可以修改、增加、删除索引，而且会返回一份拷贝后的数据：

In [84]:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Out[84]:

	A	B	C	D	F	E
2017-10-26	0.000000	0.000000	0.501078	5	NaN	1.0
2017-10-27	0.440487	0.716872	-2.013572	5	1.0	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0	NaN
2017-10-29	-0.304800	1.262876	0.739776	5	3.0	NaN

删除含有缺失值的所有行：

In [85]:

df1.dropna(how='any')

Out[85]:

	A	B	C	D	F	E
2017-10-27	0.440487	0.716872	-2.013572	5	1.0	1.0

使用指定的值来替换缺失值：

In [86]:

df1.fillna(value=5)

Out[86]:

	A	B	C	D	F	E
2017-10-26	0.000000	0.000000	0.501078	5	5.0	1.0
2017-10-27	0.440487	0.716872	-2.013572	5	1.0	1.0
2017-10-28	0.563258	0.612541	1.214156	5	2.0	5.0
2017-10-29	-0.304800	1.262876	0.739776	5	3.0	5.0

获取DataFrame中缺失值的掩码布尔矩阵：

In [87]:

pd.isnull(df1)

Out[87]:

	A	B	C	D	F	E
2017-10-26	False	False	False	False	True	False
2017-10-27	False	False	False	False	False	False
2017-10-28	False	False	False	False	False	True
2017-10-29	False	False	False	False	False	True

操作(operations)

统计(stats)

操作一般都不计缺失值：

In [88]:

df.mean()

Out[88]:

A    0.069073
B    0.481998
C    0.316526
D    5.000000
F    3.000000
dtype: float64

In [89]:

df.mean(1) # 在另一个轴上

Out[89]:

2017-10-26    1.375270
2017-10-27    1.028757
2017-10-28    1.877991
2017-10-29    1.939570
2017-10-30    1.977530
2017-10-31    2.117052
Freq: D, dtype: float64

应用函数(apply)

对数据应用函数

In [90]:

df.apply(np.cumsum)

Out[90]:

	A	B	C	D	F
2017-10-26	0.000000	0.000000	0.501078	5	NaN
2017-10-27	0.440487	0.716872	-1.512494	10	1.0
2017-10-28	1.003745	1.329414	-0.298339	15	3.0
2017-10-29	0.698945	2.592289	0.441438	20	6.0
2017-10-30	1.353199	1.589389	1.677734	25	10.0
2017-10-31	0.414439	2.891985	1.899158	30	15.0

In [91]:

df.apply(lambda x: x.max() - x.min())

Out[91]:

A    1.593014
B    2.305497
C    3.249869
D    0.000000
F    4.000000
dtype: float64

计数(histogramming)

In [92]:

s = pd.Series(np.random.randint(0, 7, size=10))
s

Out[92]:

0    6
1    2
2    5
3    5
4    1
5    1
6    0
7    5
8    2
9    0
dtype: int64

In [93]:

s.value_counts()

Out[93]:

5    3
2    2
1    2
0    2
6    1
dtype: int64

合并(merge)

连接(concat)

pandas为了方便地将Series、DataFrame组合在一起，开发了各种各样的功能。

使用concat()来将pandas的对象连接在一起：

In [94]:

df = pd.DataFrame(np.random.randn(10, 4))
df

Out[94]:

	0	1	2	3
0	-0.122555	-1.577594	-0.162647	-1.223825
1	-1.660596	-2.031171	0.048468	-1.193978
2	0.604721	0.428738	-0.763315	-1.347055
3	-0.670626	-0.361797	-0.547268	-0.551849
4	0.039113	0.101693	0.886864	-1.587129
5	1.292044	-1.016282	-0.600570	-0.079083
6	0.739309	-0.626648	0.338591	-0.548946
7	-0.041177	-1.078038	-1.587588	0.086584
8	1.612034	-0.193076	-0.735807	0.335072
9	-1.049429	0.087083	0.977108	-0.596081

In [95]:

pieces = [df[:3], df[3:7], df[7:]]
pieces

Out[95]:

[          0         1         2         3
 0 -0.122555 -1.577594 -0.162647 -1.223825
 1 -1.660596 -2.031171  0.048468 -1.193978
 2  0.604721  0.428738 -0.763315 -1.347055,
           0         1         2         3
 3 -0.670626 -0.361797 -0.547268 -0.551849
 4  0.039113  0.101693  0.886864 -1.587129
 5  1.292044 -1.016282 -0.600570 -0.079083
 6  0.739309 -0.626648  0.338591 -0.548946,
           0         1         2         3
 7 -0.041177 -1.078038 -1.587588  0.086584
 8  1.612034 -0.193076 -0.735807  0.335072
 9 -1.049429  0.087083  0.977108 -0.596081]

In [96]:

pd.concat(pieces)

Out[96]:

	0	1	2	3
0	-0.122555	-1.577594	-0.162647	-1.223825
1	-1.660596	-2.031171	0.048468	-1.193978
2	0.604721	0.428738	-0.763315	-1.347055
3	-0.670626	-0.361797	-0.547268	-0.551849
4	0.039113	0.101693	0.886864	-1.587129
5	1.292044	-1.016282	-0.600570	-0.079083
6	0.739309	-0.626648	0.338591	-0.548946
7	-0.041177	-1.078038	-1.587588	0.086584
8	1.612034	-0.193076	-0.735807	0.335072
9	-1.049429	0.087083	0.977108	-0.596081

Join

SQL风格的操作。

In [97]:

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

left

Out[97]:

	key	lval
0	foo	1
1	foo	2

In [98]:

right

Out[98]:

	key	rval
0	foo	4
1	foo	5

In [99]:

pd.merge(left, right, on='key')

Out[99]:

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5

另一个可能更能演示的例子：

In [100]:

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

left

Out[100]:

	key	lval
0	foo	1
1	bar	2

In [101]:

right

Out[101]:

	key	rval
0	foo	4
1	bar	5

In [102]:

pd.merge(left, right, on='key')

Out[102]:

	key	lval	rval
0	foo	1	4
1	bar	2	5

追加(append)

向DataFrame追加行：

In [103]:

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Out[103]:

	A	B	C	D
0	-1.959949	-0.294810	0.339831	0.515830
1	-0.377887	0.988353	-0.946725	1.028688
2	1.343592	0.566358	0.933397	-0.485905
3	0.518054	1.769740	-0.301971	1.188588
4	0.389517	-0.247958	-1.144920	1.669438
5	-0.288169	-0.773408	-1.850832	0.658924
6	-0.256120	-0.936557	0.067185	-0.669268
7	1.490088	0.913402	1.236560	-0.347774

In [104]:

s = df.iloc[3]
s

Out[104]:

A    0.518054
B    1.769740
C   -0.301971
D    1.188588
Name: 3, dtype: float64

In [106]:

df.append(s, ignore_index=True)

Out[106]:

	A	B	C	D
0	-1.959949	-0.294810	0.339831	0.515830
1	-0.377887	0.988353	-0.946725	1.028688
2	1.343592	0.566358	0.933397	-0.485905
3	0.518054	1.769740	-0.301971	1.188588
4	0.389517	-0.247958	-1.144920	1.669438
5	-0.288169	-0.773408	-1.850832	0.658924
6	-0.256120	-0.936557	0.067185	-0.669268
7	1.490088	0.913402	1.236560	-0.347774
8	0.518054	1.769740	-0.301971	1.188588

聚合(grouping)

"group by"这个的含义指的是涉及如下一个或多个步骤的一种处理过程：

根据某些条件将数据切分成一些组
对每个组独立地进行某种操作
将结果组合到一个数据结构中

实际上，如果你熟悉SQL的group by，那么你就完全能理解我们这里的聚合。

In [107]:

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                           'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Out[107]:

	A	B	C	D
0	foo	one	0.416123	1.103077
1	bar	one	0.829122	1.060971
2	foo	two	-0.330427	-2.779396
3	bar	three	-0.044427	-0.916213
4	foo	two	0.400866	0.289545
5	bar	two	0.544435	-0.412748
6	foo	one	-0.670108	-0.356459
7	foo	three	0.110842	-0.378312

In [108]:

df.groupby('A').sum() # 等价于SQL中的 select A, sum(C), sum(D) from df group by A

Out[108]:

	C	D
A
bar	1.329130	-0.267990
foo	-0.072704	-2.121544

In [109]:

df.groupby(tuple('AB')).sum() # 等价于SQL中的 select A, B, sum(C), sum(D) from df group by A, B

/Users/sunkepeng/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Interpreting tuple 'by' as a list of keys, rather than a single key. Use 'by=[...]' instead of 'by=(...)'. In the future, a tuple will always mean a single key.
  """Entry point for launching an IPython kernel.

Out[109]:

		C	D
A	B
bar	one	0.829122	1.060971
	three	-0.044427	-0.916213
	two	0.544435	-0.412748
foo	one	-0.253985	0.746618
	three	0.110842	-0.378312
	two	0.070439	-2.489850

变更形状(reshaping)

堆叠(stack)

In [110]:

tuples = list(zip(['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])

df2 = df[:4]
df2

Out[110]:

		A	B
first	second
bar	one	0.148196	1.148491
bar	two	0.130791	-0.569285
baz	one	-0.498698	-0.631549
baz	two	-1.582463	-0.038371

stack()方法实际上是将DataFrame的某一级列转成行索引，它和我们在NumPy中学的stack不是一回事：

In [111]:

stacked = df2.stack()
stacked

Out[111]:

first  second   
bar    one     A    0.148196
               B    1.148491
       two     A    0.130791
               B   -0.569285
baz    one     A   -0.498698
               B   -0.631549
       two     A   -1.582463
               B   -0.038371
dtype: float64

如果一个DataFrame或者Series是堆叠的(它的索引是多重索引，即MultiIndex)，那么可以使用unstack()来将索引的某一层转成列，默认转最后一级索引：

In [112]:

stacked.unstack()

Out[112]:

		A	B
first	second
bar	one	0.148196	1.148491
bar	two	0.130791	-0.569285
baz	one	-0.498698	-0.631549
baz	two	-1.582463	-0.038371

In [113]:

stacked.unstack(1) # 指定转第二级(下标从0开始)

Out[113]:

	second	one	two
first
bar	A	0.148196	0.130791
bar	B	1.148491	-0.569285
baz	A	-0.498698	-1.582463
baz	B	-0.631549	-0.038371

In [114]:

stacked.unstack(0)

Out[114]:

	first	bar	baz
second
one	A	0.148196	-0.498698
one	B	1.148491	-0.631549
two	A	0.130791	-1.582463
two	B	-0.569285	-0.038371

数据透视表(Pivot Tables)

这个功能和Excel里面的数据透视表能够完成的功能几乎完全一样：选定维度来汇总数据，以从不同的视角来审视数据。

In [115]:

np.random.seed(1)
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

Out[115]:

	A	B	C	D	E
0	one	A	foo	1.624345	-0.322417
1	one	B	foo	-0.611756	-0.384054
2	two	C	foo	-0.528172	1.133769
3	three	A	bar	-1.072969	-1.099891
4	one	B	bar	0.865408	-0.172428
5	one	C	bar	-2.301539	-0.877858
6	two	A	foo	1.744812	0.042214
7	three	B	foo	-0.761207	0.582815
8	one	C	foo	0.319039	-1.100619
9	one	A	bar	-0.249370	1.144724
10	two	B	bar	1.462108	0.901591
11	three	C	bar	-2.060141	0.502494

In [116]:

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) # 对应于Excel，以A，B为行维度，以C为列维度，计算D列之和

Out[116]:

	C	bar	foo
A	B
one	A	-0.249370	1.624345
	B	0.865408	-0.611756
	C	-2.301539	0.319039
three	A	-1.072969	NaN
	B	NaN	-0.761207
	C	-2.060141	NaN
two	A	NaN	1.744812
	B	1.462108	NaN
	C	NaN	-0.528172

以上操作等价于在Excel中的操作：
excel pivot table

绘图(plotting)

我们可以使用matplotlib来绘图，不过pandas也对matplotlib进行一定程度的封装，使得在一些场景下更方便用：

In [118]:

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2017', periods=1000))
ts = ts.cumsum()
%matplotlib inline
ts.plot()

Out[118]:

<matplotlib.axes._subplots.AxesSubplot at 0x121e697d0>

对于DataFrame，plot()方法可以非常方便地对所有列进行绘图：

In [119]:

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot();plt.legend(loc='best')

Out[119]:

<matplotlib.legend.Legend at 0x121add690>

对于更加丰富的信息，我们推荐大家多阅读pandas官方文档。

鹏鹏哥哥的小红帽

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
pandas快速入门相关操作和基础用法

初识Pandaspandas是以NumPy为基础，在其之上提供了更加易用的数据结构和数据分析工具。因此即使不使用pandas，也不会使得数据分析或机器学习任务无法完成，但是pandas可以使得你的工作效率更加高。首先，pandas重点提供了两种数据结构： Series 序列，一维数据。是对NumPy的一维数组的封装，但是相较于NumPy使用整型下标，我们可以使用自定义(比如具有意...
复制链接

扫一扫