A note up front: if you just want to see how to read your data files into pandas objects, jump straight to the "Data input / output" section in the table of contents.
There are also a few good tutorials linked in the original post: a quick function reference, the Chinese-language docs site, the official documentation, and the project's GitHub repository.
I. Pandas overview
1. What is Pandas?
Pandas is the core data-analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational and labeled data simple and intuitive. Pandas aims to be the essential high-level building block for practical, real-world data analysis in Python; its broader goal is to become the most powerful and flexible open-source data-analysis tool available in any language, and after years of steady development it keeps getting closer to that goal.
Pandas is well suited to many kinds of data:
- Tabular data with heterogeneously typed columns, as in SQL tables or Excel spreadsheets;
- Ordered and unordered (not necessarily fixed-frequency) time-series data;
- Matrix data with row and column labels, whether homogeneously or heterogeneously typed;
- Any other form of observational or statistical data set — the data need not be labeled at all to be placed into a pandas data structure.
The two primary data structures in pandas are Series (one-dimensional) and DataFrame (two-dimensional), which together handle the vast majority of typical use cases in finance, statistics, social science, and engineering. For R users, DataFrame provides everything R's data.frame does and more. Pandas is built on top of NumPy and integrates well with other third-party scientific-computing libraries.
2. Pandas data structures
Pandas has two core data structures: Series and DataFrame.
Here is a high-level look at each:
(1) Series
(2) DataFrame
II. Basic pandas operations
Import pandas and NumPy:
import numpy as np
import pandas as pd
1. Creating objects
Creating a Series by passing a list of values lets pandas generate a default integer index:
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8])
>>> s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
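If you don't want the default integer index, you can pass an explicit index when constructing the Series. A minimal sketch (the labels 'a' through 'd' are just illustrative):

```python
import pandas as pd

# Passing an explicit index replaces the default RangeIndex
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['b'])         # label-based access
print(list(s.index))  # the custom labels
```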
Creating a DataFrame from a NumPy array, with a datetime index and labeled columns:
>>> dates = pd.date_range('20130101', periods=6)
>>> dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD')) # arguments: the values, the row index, the column labels
>>> df
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
Creating a DataFrame from a dict of Series-like objects:
>>> df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes:
>>> df2.dtypes
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
2. Viewing data
Here is how to view the top and bottom rows of a DataFrame:
>>> df2.head(2) # the first two rows; with no argument, head() defaults to the first 5 rows
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.tail(3) # the last 3 rows
A B C D E F
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
Display the index and the column names:
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>>
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
DataFrame.to_numpy()
gives a NumPy representation of the underlying data. Note that this can be an expensive operation when the DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: a NumPy array has one dtype for the entire array, while a pandas DataFrame has one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame; this may end up being object, which entails converting every value to a Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data.
>>> df # the original values of df
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df.to_numpy() # converts to a NumPy array; note that the index is not included
array([[ 0.8723507 , -1.26699091, -1.7654972 , -1.54736172],
[-0.70928725, 1.57756193, 0.65835469, 0.43060616],
[-1.34628219, -0.70250861, -1.00931852, -0.63541983],
[ 0.21530743, -0.42556926, -1.32340721, -0.97316084],
[ 1.13869393, 1.34404111, -1.20806121, 1.40164723],
[-1.16097626, -0.24270474, -1.7916191 , 0.24349455]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
>>> df2 # the original values of df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2.to_numpy() # again converts to a NumPy array, and again the index is not included
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
describe()
shows a quick statistical summary of your data:
>>> df # the original values of df
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df.describe() # statistical summary of df
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.165032 0.047305 -1.073258 -0.180032
std 1.058678 1.150697 0.903167 1.073311
min -1.346282 -1.266991 -1.791619 -1.547362
25% -1.048054 -0.633274 -1.654975 -0.888726
50% -0.246990 -0.334137 -1.265734 -0.195963
75% 0.708090 0.947355 -1.059004 0.383828
max 1.138694 1.577562 0.658355 1.401647
Transposing the data:
>>> df # the original values of df
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df.T
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.872351 -0.709287 -1.346282 0.215307 1.138694 -1.160976
B -1.266991 1.577562 -0.702509 -0.425569 1.344041 -0.242705
C -1.765497 0.658355 -1.009319 -1.323407 -1.208061 -1.791619
D -1.547362 0.430606 -0.635420 -0.973161 1.401647 0.243495
Sorting by an axis:
>>> df.sort_index(axis=1,ascending=False) # axis=1 sorts by the column labels (axis=0 would sort by the row index); ascending=False means descending order
D C B A
2013-01-01 -1.547362 -1.765497 -1.266991 0.872351
2013-01-02 0.430606 0.658355 1.577562 -0.709287
2013-01-03 -0.635420 -1.009319 -0.702509 -1.346282
2013-01-04 -0.973161 -1.323407 -0.425569 0.215307
2013-01-05 1.401647 -1.208061 1.344041 1.138694
2013-01-06 0.243495 -1.791619 -0.242705 -1.160976
Sorting by values:
>>> df.sort_values(by='C') # ascending by default
A B C D
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-02 -0.709287 1.577562 0.658355 0.430606
3. Selecting data (slicing)
While standard Python / NumPy expressions for selecting and setting are intuitive and convenient for interactive work, for production code the optimized pandas data-access methods .at, .iat, .loc and .iloc are recommended.
(1) Getting data
Selecting a single column yields a Series, equivalent to df.A:
>>> df['A']
2013-01-01 0.872351
2013-01-02 -0.709287
2013-01-03 -1.346282
2013-01-04 0.215307
2013-01-05 1.138694
2013-01-06 -1.160976
Freq: D, Name: A, dtype: float64
>>> df.A # equivalent to the above
2013-01-01 0.872351
2013-01-02 -0.709287
2013-01-03 -1.346282
2013-01-04 0.215307
2013-01-05 1.138694
2013-01-06 -1.160976
Freq: D, Name: A, dtype: float64
Slicing rows with [ ]:
>>> df[0:3] # the first three rows; like Python slices, the right endpoint is excluded
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
>>> df["20130102":"20130104"] # slicing by index labels instead; note that with labels, both endpoints are included
A B C D
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
(2) Selection by label, mainly via .loc:
Getting a row by its label:
>>> dates # defined earlier
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
>>> df.loc[dates[0]] # effectively the first row of data
A 0.872351
B -1.266991
C -1.765497
D -1.547362
Name: 2013-01-01 00:00:00, dtype: float64
Selecting multiple columns by label:
>>> df # the original data
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df.loc[:,['A','B']] # the colon before the comma slices the rows; after the comma come the column labels
A B
2013-01-01 0.872351 -1.266991
2013-01-02 -0.709287 1.577562
2013-01-03 -1.346282 -0.702509
2013-01-04 0.215307 -0.425569
2013-01-05 1.138694 1.344041
2013-01-06 -1.160976 -0.242705
Label slicing, with both endpoints included for rows and columns:
>>> df.loc['20130102':'20130104', ['A', 'B']] # the same idea as above, just operating on rows and columns at the same time
A B
2013-01-02 -0.709287 1.577562
2013-01-03 -1.346282 -0.702509
2013-01-04 0.215307 -0.425569
Reduction in the dimensions of the returned object:
>>> df.loc['20130102', ['A', 'B']]
A -0.709287
B 1.577562
Name: 2013-01-02 00:00:00, dtype: float64
Getting a scalar value:
>>> df.loc[dates[0], 'A']
0.8723507009004865
Fast access to a scalar, equivalent to the prior method:
>>> df.at[dates[0], 'A']
0.8723507009004865
(3) Selection by position, mainly via .iloc:
Selecting via an integer position:
>>> df # the original data
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df.iloc[3] # the row at position 3, i.e. the fourth row
A 0.215307
B -0.425569
C -1.323407
D -0.973161
Name: 2013-01-04 00:00:00, dtype: float64
Integer slices act similarly to NumPy / Python:
>>> df.iloc[3:5,0:2] # before the comma: rows; after the comma: columns
A B
2013-01-04 0.215307 -0.425569
2013-01-05 1.138694 1.344041
By lists of integer positions, similar to NumPy / Python style:
>>> df.iloc[[1,2,4],[1,2]]
B C
2013-01-02 1.577562 0.658355
2013-01-03 -0.702509 -1.009319
2013-01-05 1.344041 -1.208061
Slicing rows explicitly:
>>> df.iloc[1:4,:]
A B C D
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
Slicing columns explicitly:
>>> df.iloc[:,1:4]
B C D
2013-01-01 -1.266991 -1.765497 -1.547362
2013-01-02 1.577562 0.658355 0.430606
2013-01-03 -0.702509 -1.009319 -0.635420
2013-01-04 -0.425569 -1.323407 -0.973161
2013-01-05 1.344041 -1.208061 1.401647
2013-01-06 -0.242705 -1.791619 0.243495
Getting a value explicitly:
>>> df.iloc[1,1]
1.57756192701639
Fast access to a scalar, equivalent to the prior method:
>>> df.iat[1,1]
1.57756192701639
(4) Boolean indexing
Using a single column's values to select data:
>>> df[df.A>0] # keep only the rows where column A is greater than 0
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
Selecting the values in the DataFrame that satisfy a condition; the rest are filled with NaN:
>>> df[df>0]
A B C D
2013-01-01 0.872351 NaN NaN NaN
2013-01-02 NaN 1.577562 0.658355 0.430606
2013-01-03 NaN NaN NaN NaN
2013-01-04 0.215307 NaN NaN NaN
2013-01-05 1.138694 1.344041 NaN 1.401647
2013-01-06 NaN NaN NaN 0.243495
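Several boolean conditions can be combined with & (and) and | (or); note the parentheses around each condition, since these operators bind more tightly than comparisons. A minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, -2, 3, -4], 'B': [-1, 2, -3, 4]})
# rows where A > 0 AND B < 0 -- each condition must be parenthesized
mask = (df['A'] > 0) & (df['B'] < 0)
print(df[mask])
```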
Filtering with isin():
>>> df3 = df.copy()
>>> df3
A B C D
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362
2013-01-02 -0.709287 1.577562 0.658355 0.430606
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161
2013-01-05 1.138694 1.344041 -1.208061 1.401647
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495
>>> df3['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
>>> df3
A B C D E
2013-01-01 0.872351 -1.266991 -1.765497 -1.547362 one
2013-01-02 -0.709287 1.577562 0.658355 0.430606 one
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420 two
2013-01-04 0.215307 -0.425569 -1.323407 -0.973161 three
2013-01-05 1.138694 1.344041 -1.208061 1.401647 four
2013-01-06 -1.160976 -0.242705 -1.791619 0.243495 three
>>> df3[df3['E'].isin(['two','four'])]
A B C D E
2013-01-03 -1.346282 -0.702509 -1.009319 -0.635420 two
2013-01-05 1.138694 1.344041 -1.208061 1.401647 four
(5) Setting values
Adding a new column automatically aligns the data by index:
>>> s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
>>> s1
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
>>> df['F'] = s1 # the new column is aligned on df's index; 2013-01-01 gets NaN (column F appears in the outputs below)
Setting a value by label:
>>> df.at[dates[0], 'A'] = 0
Setting a value by position:
>>> df.iat[0,1]=0
Setting by assigning with a NumPy array:
>>> df.loc[:, 'D'] = np.array([5] * len(df))
Setting values with a where condition:
>>> df2 = df.copy()
>>> df2[df2 > 0] = -df2 # flip every positive value to negative
>>> df2
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 -5 NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
4. Missing data
Pandas primarily uses np.nan to represent missing data, which is by default not included in computations. Let's take a look:
Reindexing allows you to change/add/delete the index on a specified axis and returns a copy of the data, leaving the original unchanged.
>>> df1 = df.reindex(index=dates[0:4],columns=list(df.columns)+['E'])
>>> df1
A B C D E
2013-01-01 0.872351 0.000000 -1.765497 5 NaN
2013-01-02 -0.709287 1.577562 0.658355 5 NaN
2013-01-03 -1.346282 -0.702509 -1.009319 5 NaN
2013-01-04 0.215307 -0.425569 -1.323407 5 NaN
>>> df1.loc[dates[0]:dates[1], 'E'] = 1
>>> df1
A B C D E
2013-01-01 0.872351 0.000000 -1.765497 5 1.0
2013-01-02 -0.709287 1.577562 0.658355 5 1.0
2013-01-03 -1.346282 -0.702509 -1.009319 5 NaN
2013-01-04 0.215307 -0.425569 -1.323407 5 NaN
Dropping any rows that have missing data:
>>> df1.dropna(how='any') # returns a new frame; the original is unchanged
A B C D E
2013-01-01 0.872351 0.000000 -1.765497 5 1.0
2013-01-02 -0.709287 1.577562 0.658355 5 1.0
Filling missing data:
>>> df1.fillna(value=5) # NaN entries are replaced with 5
A B C D E
2013-01-01 0.872351 0.000000 -1.765497 5 1.0
2013-01-02 -0.709287 1.577562 0.658355 5 1.0
2013-01-03 -1.346282 -0.702509 -1.009319 5 5.0
2013-01-04 0.215307 -0.425569 -1.323407 5 5.0
Getting the boolean mask where values are NaN (True where NaN, False elsewhere):
>>> pd.isna(df1)
A B C D E
2013-01-01 False False False False False
2013-01-02 False False False False False
2013-01-03 False False False False True
2013-01-04 False False False False True
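Combined with sum(), that boolean mask gives a quick per-column count of missing values. A small sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'E': [1.0, np.nan, np.nan]})
# True counts as 1, so summing the mask counts the NaNs in each column
missing = df1.isna().sum()
print(missing)
```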
5. Operations
(1) Statistics
Operations in general exclude missing data.
Performing a descriptive statistic:
>>> df # the original data
A B C D
2013-01-01 1.019612 0.343168 -0.351095 0.498860
2013-01-02 -0.341544 -1.688322 0.192494 0.733620
2013-01-03 -0.092551 1.478971 1.695497 -0.269589
2013-01-04 -0.505425 0.110293 -0.904603 -1.431836
2013-01-05 1.186768 -0.659781 0.471698 1.103648
2013-01-06 -0.581067 0.153586 1.081942 0.904470
>>> df.mean()
A 0.114299
B -0.043681
C 0.364322
D 0.256529
dtype: float64
The same operation on the other axis (i.e., across the rows):
>>> df.mean(1)
2013-01-01 0.377636
2013-01-02 -0.275938
2013-01-03 0.703082
2013-01-04 -0.682893
2013-01-05 0.525583
2013-01-06 0.389733
Freq: D, dtype: float64
Operating with objects that have different dimensionality requires alignment first. In addition, pandas automatically broadcasts along the specified dimension.
>>> s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
>>> s # the series, shifted down by two
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
>>> df # the original data
A B C D
2013-01-01 1.019612 0.343168 -0.351095 0.498860
2013-01-02 -0.341544 -1.688322 0.192494 0.733620
2013-01-03 -0.092551 1.478971 1.695497 -0.269589
2013-01-04 -0.505425 0.110293 -0.904603 -1.431836
2013-01-05 1.186768 -0.659781 0.471698 1.103648
2013-01-06 -0.581067 0.153586 1.081942 0.904470
>>> df.sub(s, axis='index') # this is df - s, with s broadcast across the columns
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 -1.092551 0.478971 0.695497 -1.269589
2013-01-04 -3.505425 -2.889707 -3.904603 -4.431836
2013-01-05 -3.813232 -5.659781 -4.528302 -3.896352
2013-01-06 NaN NaN NaN NaN
(2) Apply
Applying functions to the data:
>>> df # the original data
A B C D
2013-01-01 1.019612 0.343168 -0.351095 0.498860
2013-01-02 -0.341544 -1.688322 0.192494 0.733620
2013-01-03 -0.092551 1.478971 1.695497 -0.269589
2013-01-04 -0.505425 0.110293 -0.904603 -1.431836
2013-01-05 1.186768 -0.659781 0.471698 1.103648
2013-01-06 -0.581067 0.153586 1.081942 0.904470
>>> df.apply(np.cumsum) # cumulative sum down each column: each row is the running total of itself and all rows above it
A B C D
2013-01-01 1.019612 0.343168 -0.351095 0.498860
2013-01-02 0.678068 -1.345153 -0.158601 1.232480
2013-01-03 0.585516 0.133817 1.536896 0.962891
2013-01-04 0.080091 0.244111 0.632293 -0.468945
2013-01-05 1.266859 -0.415671 1.103991 0.634703
2013-01-06 0.685792 -0.262085 2.185933 1.539173
>>> df.apply(lambda x: x.max() - x.min())
A 1.767834
B 3.167292
C 2.600100
D 2.535484
dtype: float64
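apply() can also work row-wise by passing axis=1. A sketch computing the same max-minus-min range per row, on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 4.0], 'B': [3.0, 2.0]})
# axis=1 passes each row (as a Series) to the function
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)
print(row_range)
```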
(3) Histogramming
>>> s = pd.Series(np.random.randint(0, 7, size=10))
>>> s
0 0
1 6
2 0
3 4
4 3
5 0
6 5
7 5
8 0
9 3
dtype: int64
>>> s.value_counts() # count the occurrences of each value
0 4
5 2
3 2
6 1
4 1
dtype: int64
(4) String methods
Series is equipped with a set of string-processing methods in the str attribute, as shown in the code below. Note that pattern matching in str defaults to regular expressions; see the string-methods documentation for the full list.
>>> s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
>>> s
0 A
1 B
2 C
3 Aaba
4 Baca
5 NaN
6 CABA
7 dog
8 cat
dtype: object
>>> s.str.lower() # convert to lowercase
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
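Since str pattern matching defaults to regular expressions, methods like str.contains() accept a regex pattern directly. A minimal sketch (the pattern and data are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['Aaba', 'Baca', np.nan, 'CABA', 'dog'])
# '^A' anchors the match to the start of the string; na=False treats NaN as no match
starts_with_a = s.str.contains('^A', na=False)
print(s[starts_with_a])
```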
6. Merging
(1) Concatenation (concat)
Pandas provides various facilities for combining Series and DataFrame objects, with various kinds of set logic for the indexes and relational-algebra functionality for join / merge-type operations.
Concatenating pandas objects together with concat():
>>> df = pd.DataFrame(np.random.randn(10, 4)) # the post evidently redefines df here as a 10x4 frame (see the output below)
>>> pieces = [df[:3], df[3:7], df[7:]] # split it into three pieces
>>> pieces
[ 0 1 2 3
0 0.006284 0.480507 0.250293 0.876491
1 -0.819405 1.643500 1.451623 -0.229111
2 0.247354 -0.751726 -0.856380 -1.147593,
0 1 2 3
3 -1.031575 0.510607 -0.455070 -1.002518
4 -1.700787 0.998601 0.380679 1.465612
5 -0.550172 0.542730 0.574797 2.108123
6 -1.163303 -0.090545 1.402760 -0.213998,
0 1 2 3
7 -1.730882 -0.081091 0.759767 0.375214
8 -0.098455 1.021803 -1.330572 0.639896
9 0.610294 -0.627773 -1.244307 1.651799]
>>> type(pieces)
<class 'list'>
>>> pd.concat(pieces) # glue them back together
0 1 2 3
0 0.006284 0.480507 0.250293 0.876491
1 -0.819405 1.643500 1.451623 -0.229111
2 0.247354 -0.751726 -0.856380 -1.147593
3 -1.031575 0.510607 -0.455070 -1.002518
4 -1.700787 0.998601 0.380679 1.465612
5 -0.550172 0.542730 0.574797 2.108123
6 -1.163303 -0.090545 1.402760 -0.213998
7 -1.730882 -0.081091 0.759767 0.375214
8 -0.098455 1.021803 -1.330572 0.639896
9 0.610294 -0.627773 -1.244307 1.651799
(2) Join
SQL-style merges.
>>> left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
>>> left
key lval
0 foo 1
1 foo 2
>>> right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
>>> right
key rval
0 foo 4
1 foo 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
# another example:
>>> left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
>>> left
key lval
0 foo 1
1 bar 2
>>> right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
>>> right
key rval
0 foo 4
1 bar 5
>>> pd.merge(left, right, on='key')
key lval rval
0 foo 1 4
1 bar 2 5
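merge() defaults to an inner join; the how parameter selects left, right, or outer joins instead. A sketch with keys that don't fully overlap (the data is illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
# how='left' keeps every row of `left`; unmatched rval entries become NaN
merged = pd.merge(left, right, on='key', how='left')
print(merged)
```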
(3) Append
Appending rows to a DataFrame.
>>> df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0.440431 0.493252 0.822852 0.412442
1 -2.150111 0.013618 0.075654 0.647939
2 2.394404 0.906516 0.343973 -1.538976
3 0.497830 -0.721377 2.327849 -0.179012
4 1.160945 -0.094008 0.223427 -0.167923
5 -0.933684 -0.788374 0.170846 -0.252211
6 -0.034939 -0.054579 0.472946 -0.381365
7 -1.238431 0.730380 -1.595895 0.154667
>>> s = df.iloc[3]
>>> s
A 0.497830
B -0.721377
C 2.327849
D -0.179012
Name: 3, dtype: float64
>>> df.append(s, ignore_index=True) # append, ignoring the row's original index label
A B C D
0 0.440431 0.493252 0.822852 0.412442
1 -2.150111 0.013618 0.075654 0.647939
2 2.394404 0.906516 0.343973 -1.538976
3 0.497830 -0.721377 2.327849 -0.179012
4 1.160945 -0.094008 0.223427 -0.167923
5 -0.933684 -0.788374 0.170846 -0.252211
6 -0.034939 -0.054579 0.472946 -0.381365
7 -1.238431 0.730380 -1.595895 0.154667
8 0.497830 -0.721377 2.327849 -0.179012
>>> df.append(s, ignore_index=False) # append, keeping the original label -- note the duplicated index 3 below
A B C D
0 0.440431 0.493252 0.822852 0.412442
1 -2.150111 0.013618 0.075654 0.647939
2 2.394404 0.906516 0.343973 -1.538976
3 0.497830 -0.721377 2.327849 -0.179012
4 1.160945 -0.094008 0.223427 -0.167923
5 -0.933684 -0.788374 0.170846 -0.252211
6 -0.034939 -0.054579 0.472946 -0.381365
7 -1.238431 0.730380 -1.595895 0.154667
3 0.497830 -0.721377 2.327849 -0.179012
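Worth noting: DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat() does the same job. A minimal sketch of the replacement (the data here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[1]
# wrap the row Series into a one-row frame, then concatenate
out = pd.concat([df, s.to_frame().T], ignore_index=True)
print(out)
```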
7. Grouping
By "group by" we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria;
- Applying a function to each group independently;
- Combining the results into a data structure.
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
>>> df
A B C D
0 foo one -0.509942 2.238172
1 bar one 0.812267 -0.560212
2 foo two -1.266951 1.586459
3 bar three 0.562791 -1.010264
4 foo two 1.282602 -1.753547
5 bar two 1.785235 -1.971573
6 foo one 0.557069 -0.488453
7 foo three -1.098466 0.053972
Grouping and then applying sum() to the resulting groups:
>>> df.groupby('A').sum() # group by column A, then sum each group column-wise; note that only numeric columns are summed
C D
A
bar 3.160294 -3.542048
foo -1.035689 1.636603
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function:
>>> df.groupby(['A', 'B']).sum()
C D
A B
bar one 0.812267 -0.560212
three 0.562791 -1.010264
two 1.785235 -1.971573
foo one 0.047127 1.749719
three -1.098466 0.053972
two 0.015651 -0.167088
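Beyond sum(), a groupby supports agg() for computing several statistics at once. A sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})
# one result column per aggregation function
stats = df.groupby('A')['C'].agg(['sum', 'mean', 'count'])
print(stats)
```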
8. Reshaping
(1) Stack
>>> tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', # zip pairs up corresponding elements of the two lists into tuples
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
>>> tuples
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
>>> index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
>>> index
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
>>> df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
>>> df
A B
first second
bar one 0.088624 -1.043238
two -0.802856 -0.230631
baz one 1.436134 -0.061092
two 1.393059 -0.359081
foo one 0.569854 1.479097
two -1.270516 0.625655
qux one 1.804187 0.089641
two -1.088694 2.587161
>>> df2 = df[:4]
>>> df2
A B
first second
bar one 0.088624 -1.043238
two -0.802856 -0.230631
baz one 1.436134 -0.061092
two 1.393059 -0.359081
Now for stacking: the stack() method "compresses" a level of the DataFrame's columns into the index. Below, the column labels A and B effectively become part of the index, producing a Series with three index levels:
>>> stacked = df2.stack()
>>> stacked
first second
bar one A 0.088624
B -1.043238
two A -0.802856
B -0.230631
baz one A 1.436134
B -0.061092
two A 1.393059
B -0.359081
dtype: float64
With a "stacked" DataFrame or Series (having a MultiIndex), the inverse operation of stack() is unstack(), which by default unstacks the last index level; compare the examples below:
>>> stacked.unstack() # equivalent to stacked.unstack(2)
A B
first second
bar one 0.088624 -1.043238
two -0.802856 -0.230631
baz one 1.436134 -0.061092
two 1.393059 -0.359081
>>> stacked.unstack(1)
second one two
first
bar A 0.088624 -0.802856
B -1.043238 -0.230631
baz A 1.436134 1.393059
B -0.061092 -0.359081
>>> stacked.unstack(0)
first bar baz
second
one A 0.088624 1.436134
B -1.043238 -0.061092
two A -0.802856 1.393059
B -0.230631 -0.359081
9. Pivot tables
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3, # remember: the dict keys become the column names
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)})
>>> df
A B C D E
0 one A foo -1.057104 -0.794894
1 one B foo 0.449186 0.300695
2 two C foo -1.592963 2.141752
3 three A bar 0.448007 0.750902
4 one B bar -0.044531 -0.645921
5 one C bar -0.270702 1.744061
6 two A foo 1.202951 -0.885481
7 three B foo 0.523942 0.450120
8 one C foo -0.654923 -0.935690
9 one A bar -0.648731 0.156769
10 two B bar 0.621874 -0.784878
11 three C bar -0.746672 0.210716
We can then produce pivot tables from this data very easily.
A pivot table essentially rearranges the data above: one set of columns becomes the index, another column supplies the new column labels, and a third supplies the values; combinations that don't occur are filled with NaN. Concretely:
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C bar foo
A B
one A -0.648731 -1.057104
B -0.044531 0.449186
C -0.270702 -0.654923
three A 0.448007 NaN
B NaN 0.523942
C -0.746672 NaN
two A NaN 1.202951
B 0.621874 NaN
C NaN -1.592963
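pivot_table() also accepts aggfunc (default 'mean') to control how duplicate index/column combinations are aggregated, and fill_value to replace the NaN slots. A sketch on small illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two'],
                   'C': ['bar', 'bar', 'foo'],
                   'D': [1.0, 3.0, 5.0]})
# the duplicate ('one', 'bar') pair is summed; missing combinations get fill_value
table = pd.pivot_table(df, values='D', index='A', columns='C',
                       aggfunc='sum', fill_value=0)
print(table)
```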
10. Plotting
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> ts = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
>>> ts
2000-01-01 -0.080495
2000-01-02 0.700097
2000-01-03 -1.668427
2000-01-04 -0.977912
2000-01-05 1.057130
...
2002-09-22 0.842348
2002-09-23 -1.887213
2002-09-24 -0.652051
2002-09-25 0.402222
2002-09-26 1.939206
Freq: D, Length: 1000, dtype: float64
>>> ts = ts.cumsum() # cumulative sum of the samples
>>> ts
2000-01-01 -0.080495
2000-01-02 0.619602
2000-01-03 -1.048825
2000-01-04 -2.026738
2000-01-05 -0.969608
...
2002-09-22 29.031369
2002-09-23 27.144156
2002-09-24 26.492105
2002-09-25 26.894327
2002-09-26 28.833533
Freq: D, Length: 1000, dtype: float64
>>> ts.plot()
>>> plt.show()
The result is a line plot of the cumulative series (figure omitted here).
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
>>> df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
2000-01-01 -0.080938 0.113106 -1.089391 -0.217352
2000-01-02 -0.363437 0.348288 -1.067418 -0.159933
2000-01-03 0.349426 0.203287 0.913916 1.131560
2000-01-04 -0.677355 0.217901 -0.442623 -0.128144
2000-01-05 -1.846178 1.030022 1.564095 -0.453326
... ... ... ... ...
2002-09-22 -1.985099 -1.033704 -1.177357 0.618171
2002-09-23 -0.841400 -1.140720 0.937097 2.137209
2002-09-24 -0.049451 -0.259746 -0.852339 1.843638
2002-09-25 -1.927562 -0.204336 1.237734 0.446487
2002-09-26 1.229381 -0.776085 0.811986 0.694741
[1000 rows x 4 columns]
>>> df = df.cumsum()
>>> df
A B C D
2000-01-01 -0.080938 0.113106 -1.089391 -0.217352
2000-01-02 -0.444374 0.461394 -2.156809 -0.377285
2000-01-03 -0.094948 0.664681 -1.242893 0.754275
2000-01-04 -0.772303 0.882582 -1.685516 0.626131
2000-01-05 -2.618481 1.912604 -0.121421 0.172805
... ... ... ... ...
2002-09-22 -55.622693 55.205758 -4.404934 -16.320452
2002-09-23 -56.464093 54.065039 -3.467837 -14.183243
2002-09-24 -56.513544 53.805292 -4.320176 -12.339605
2002-09-25 -58.441106 53.600956 -3.082442 -11.893118
2002-09-26 -57.211725 52.824872 -2.270456 -11.198376
[1000 rows x 4 columns]
>>> plt.figure() # create a figure
<Figure size 640x480 with 0 Axes>
>>> df.plot() # plot all four columns
<matplotlib.axes._subplots.AxesSubplot object at 0x7f423c0dabe0>
>>> plt.legend(loc='best')
<matplotlib.legend.Legend object at 0x7f423c06e070>
>>> plt.show() # display the figure
The result is one labeled line per column (figure omitted here).
11. Data input / output
(1) Importing data
In the functions below, the filename argument is the path to your file, including the file name and its extension.
1. pd.read_csv(filename): import data from a CSV file.
2. pd.read_table(filename): import data from a delimited text file. By default it splits on tabs; if your data is separated by spaces, just pass sep=" ". Another important parameter is header, which defaults to 'infer' (auto-detect the header row); header=None means the file has no header, and header=0 uses the first row as the header.
3. pd.read_excel(filename): import data from an Excel file.
4. pd.read_sql(query, connection_object): import data from a SQL table/database.
5. pd.read_json(json_string): import data from a JSON-formatted string.
6. pd.read_html(url): parse a URL, string, or HTML file and extract its tables.
7. pd.read_clipboard(): take the contents of your clipboard and pass them to read_table().
8. pd.DataFrame(dict): build a DataFrame from a dict, where the keys become the column names and the values become the data.
These are just the simplest forms; for the other parameters of these functions, see the official documentation. In most cases the defaults are fine.
(2) Exporting data
df.to_csv(filename): export data to a CSV file.
df.to_excel(filename): export data to an Excel file.
df.to_sql(table_name, connection_object): export data to a SQL table.
df.to_json(filename): export data in JSON format to a text file.
For any of these, if you don't want the index saved into the file (CSV or otherwise), just pass index=None:
df.to_csv('1.csv', index=None)
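A quick round trip ties the export and import functions together. A sketch writing to a temporary file (the filename is arbitrary):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
path = os.path.join(tempfile.mkdtemp(), 'out.csv')
df.to_csv(path, index=None)   # drop the index on export
df_back = pd.read_csv(path)   # read it straight back
print(df_back)
```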