[Python Basics] A pandas User Guide (In Depth!)

A pandas Guide

Learning goals:

  • Master the pandas Series and DataFrame
  • Data cleaning with pandas
  • Basic statistics with pandas
  • Time-series analysis
  • Common data-analysis techniques

1 Pandas Basics

  • pandas is the core data-analysis library in Python; it analyzes large amounts of data quickly and flexibly, making it an essential tool for data analysis in Python.
  • pandas supports many data-import formats as well as merging, splitting, basic statistics, time-series analysis, pivot tables, and more.

1.1 The Series data structure

Series: a one-dimensional indexed data structure (a single column)

The Series class:
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

  1. Creating a Series object

    The first column is the index, the second column holds the Series data

import pandas as pd
import numpy as np
sdata = pd.Series(np.arange(1,4), index=list('abc'))
sdata
a    1
b    2
c    3
dtype: int32
  2. Accessing Series elements

    .iloc[] guarantees that you access elements of a Series by position rather than by label.

# positional (integer) access
print(sdata.iloc[0])
# access by label [a, b, c]
print(sdata['b'])
# loc access uses labels only
print(sdata.loc['c'])
1
2
3
  3. Getting the index and the values
# get the index
sdata.index.values
array(['a', 'b', 'c'], dtype=object)
# get the values
sdata.values
array([1, 2, 3])
  4. Converting the index and the values to lists
# convert the index to a list
sdata.index.values.tolist()
['a', 'b', 'c']
# convert the values to a list
sdata.values.tolist()
[1, 2, 3]
# iterate over the Series
for item in sdata.items():
    print(item)
('a', 1)
('b', 2)
('c', 3)
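Since bracket indexing is ambiguous when the labels themselves are integers, here is a short sketch (illustrative data, not from the text above) of why .iloc and .loc make the intent explicit:

```python
import pandas as pd

# A Series whose labels are integers: s[1] could mean "position 1"
# or "label 1", so .iloc (position) and .loc (label) disambiguate.
s = pd.Series([10, 20, 30], index=[3, 2, 1])

first_by_position = s.iloc[0]   # element at position 0
first_by_label = s.loc[1]       # element labeled 1
```

With this index, `s.iloc[0]` returns 10 while `s.loc[1]` returns 30, which is why the guide recommends .iloc for positional access.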

1.2 DataFrame

DataFrame: a two-dimensional labeled data structure whose columns can hold different types (multiple columns);

The DataFrame class:

pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

  • data: one- or two-dimensional data
  • index: row labels
  • columns: column labels
1.2.1 Creating a DataFrame
  1. One column
# one-dimensional data
pd.DataFrame(data=np.arange(1,4))
   0
0  1
1  2
2  3
  2. Multiple columns
# two-dimensional data, 4x4
data = np.arange(16).reshape(4,4)
pd.DataFrame(data=data)
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
  3. Setting index and columns
# set index and columns
data = np.arange(16).reshape(4,4)
pdata = pd.DataFrame(data=data, index=list('abcd'), columns=['c1','c2','c3','c4'])
pdata
   c1  c2  c3  c4
a   0   1   2   3
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
  4. Creating from a dict
# create from a dict; the keys become the column labels
data = {'c1':[1,2,3], 'c2':[4,5,6]}
pdata = pd.DataFrame(data=data)
pdata
   c1  c2
0   1   4
1   2   5
2   3   6
  5. Setting the column labels
# set the column labels
pdata.columns = ['t1','t2']
pdata
   t1  t2
0   1   4
1   2   5
2   3   6
1.2.2 Accessing a DataFrame
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
pdata
   c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
  1. Getting specific columns
# get one column; returns a Series
pdata['c1']
0    1
1    2
2    3
Name: c1, dtype: int64
# get several columns
pdata[['c1','c2']]
   c1  c2
0   1   4
1   2   5
2   3   6
  2. loc access

    loc access: list-like indexing into the data; boolean indexing is supported;

import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
pdata
   c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
  3. Getting specific rows
# get the first row
pdata.loc[0]
c1    1
c2    4
c3    7
Name: 0, dtype: int64
# get columns c1 and c2 of the first row
pdata.loc[0,['c1','c2']]
c1    1
c2    4
Name: 0, dtype: int64
  4. Iterating over a DataFrame
# iterating directly yields the column labels
for item in pdata:
    print(item)
c1
c2
c3
# iterate column by column
for item in pdata.items():
    print(item)
('c1', 0    1
1    2
2    3
Name: c1, dtype: int64)
('c2', 0    4
1    5
2    6
Name: c2, dtype: int64)
('c3', 0    7
1    8
2    9
Name: c3, dtype: int64)
# iterate row by row
for item in pdata.iterrows():
    print(item)
(0, c1    1
c2    4
c3    7
Name: 0, dtype: int64)
(1, c1    2
c2    5
c3    8
Name: 1, dtype: int64)
(2, c1    3
c2    6
c3    9
Name: 2, dtype: int64)
1.2.3 Modifying a DataFrame
  1. Modifying elements
# modify elements
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# set every value in column c1 to 0
pdata['c1'] = 0
pdata
   c1  c2  c3
0   0   4   7
1   0   5   8
2   0   6   9
  2. Inserting a column
# insert a column into the DataFrame
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# add a new column c4
pdata['c4'] = [-1,-1,-1]
pdata
   c1  c2  c3  c4
0   1   4   7  -1
1   2   5   8  -1
2   3   6   9  -1
  3. Inserting a row
# insert a row into the DataFrame
import pandas as pd
data = {'c1':[1,2,3], 'c2':[4,5,6], 'c3':[7,8,9]}
pdata = pd.DataFrame(data=data)
# add a new row with label 3
pdata.loc[3] = [-1,-1,-1]
pdata
   c1  c2  c3
0   1   4   7
1   2   5   8
2   3   6   9
3  -1  -1  -1
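Rows can also be appended by concatenating frames, which scales better than repeated label assignment; a minimal sketch (the column names mirror the example above, the helper frame is hypothetical):

```python
import pandas as pd

pdata = pd.DataFrame({'c1': [1, 2, 3], 'c2': [4, 5, 6], 'c3': [7, 8, 9]})
new_row = pd.DataFrame({'c1': [-1], 'c2': [-1], 'c3': [-1]})

# concat stacks the frames vertically; ignore_index renumbers rows 0..n-1
out = pd.concat([pdata, new_row], ignore_index=True)
```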

2 Importing and saving data with pandas

Goals:

  • Data import: Excel and CSV files
  • Data export
  • Basic statistics
  • Handling missing data

2.1 Data import

Data is the foundation of analysis; in practice it comes from internal company data, web data, and open data sets;

Method / Description
pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, ...): read a CSV file
pd.read_excel(io, sheet_name=0, names=None, index_col=None, usecols=None, ...): read an Excel file
pd.read_json(path_or_buf=None, orient=None, typ='frame', dtype=None, ...): read a JSON file
  1. Reading an Excel file
import pandas as pd
import numpy as np
# read an Excel file
fpath = r'data\test.xlsx'
pdata = pd.read_excel(fpath)
pdata
   序号 姓名  数学  语文
0    1  …   99   87
1    2  …   88   92
2    3  …   77   73
3    4  …   66   68
  2. Reading a CSV file
# read a CSV file
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath, encoding='gbk')
pdata
    Country Name Country Code     Indicator Name  Indicator Code          1960  ...          2016  2017
0          Aruba          ABW  GDP (current US$)  NY.GDP.MKTP.CD           NaN  ...           NaN   NaN
1    Afghanistan          AFG  GDP (current US$)  NY.GDP.MKTP.CD  5.377778e+08  ...  1.946902e+10   NaN
2         Angola          AGO  GDP (current US$)  NY.GDP.MKTP.CD           NaN  ...  9.533511e+10   NaN
..           ...          ...                ...             ...           ...  ...           ...   ...
262       Zambia          ZMB  GDP (current US$)  NY.GDP.MKTP.CD  7.130000e+08  ...  2.106399e+10   NaN
263     Zimbabwe          ZWE  GDP (current US$)  NY.GDP.MKTP.CD  1.052990e+09  ...  1.661996e+10   NaN

264 rows × 62 columns

# import only specific columns
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath, usecols=['Country Name','1990'], encoding='gbk')
pdata
     Country Name          1990
0           Aruba           NaN
1     Afghanistan           NaN
2          Angola  1.002674e+10
3         Albania  2.101625e+09
4         Andorra  1.029048e+09
..            ...           ...
259        Kosovo           NaN
260   Yemen, Rep.  5.647252e+09
261  South Africa  1.155530e+11
262        Zambia  3.285217e+09
263      Zimbabwe  8.783817e+09

264 rows × 2 columns

# use a specific row as the header
fpath = r'data\GDP.csv'
pdata = pd.read_csv(fpath, header=1, encoding='gbk')
pdata
     Aruba  ABW  GDP (current US$)  NY.GDP.MKTP.CD    Unnamed: 4  ...  Unnamed: 61
0    Afghanistan  AFG  GDP (current US$)  NY.GDP.MKTP.CD  5.377778e+08  ...      NaN
1         Angola  AGO  GDP (current US$)  NY.GDP.MKTP.CD           NaN  ...      NaN
..           ...  ...                ...             ...           ...  ...      ...
261       Zambia  ZMB  GDP (current US$)  NY.GDP.MKTP.CD  7.130000e+08  ...      NaN
262     Zimbabwe  ZWE  GDP (current US$)  NY.GDP.MKTP.CD  1.052990e+09  ...      NaN

263 rows × 62 columns

Note: with header=1 the first data row ("Aruba") is consumed as the header, which is why the result has one row fewer.

# no header: pass header=None
pdata = pd.read_csv(fpath, header=None, encoding='gbk')
pdata
                0             1                  2               3             4  ...      61
0    Country Name  Country Code     Indicator Name  Indicator Code  1.960000e+03  ...  2017.0
1           Aruba           ABW  GDP (current US$)  NY.GDP.MKTP.CD           NaN  ...     NaN
2     Afghanistan           AFG  GDP (current US$)  NY.GDP.MKTP.CD  5.377778e+08  ...     NaN
..            ...           ...                ...             ...           ...  ...     ...
263        Zambia           ZMB  GDP (current US$)  NY.GDP.MKTP.CD  7.130000e+08  ...     NaN
264      Zimbabwe           ZWE  GDP (current US$)  NY.GDP.MKTP.CD  1.052990e+09  ...     NaN

265 rows × 62 columns

2.2 Saving data

Method / Description
pdata.to_csv(path_or_buf=None, sep=',', ...): save as a CSV file
pdata.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', ...): save as an Excel file
pdata.to_json(path_or_buf=None, orient=None, ...): save as a JSON file
# save files
import pandas as pd
import numpy as np
# read the CSV file
fpath = r'data\GDP.csv'
csv_path1 = r'data\new_GDP_1.csv'
csv_path2 = r'data\new_GDP_2.csv'
csv_path3 = r'data\new_GDP_3.csv'
pdata = pd.read_csv(fpath, encoding='gbk')
# save with the index
pdata.to_csv(csv_path1)
# save without the index
pdata.to_csv(csv_path2, index=False)
# save without the index, keeping only the given columns
pdata.to_csv(csv_path3, index=False, columns=['1990','1991'])
pdata = pd.read_csv(csv_path1, encoding='gbk')
pdata
     Unnamed: 0  Country Name Country Code     Indicator Name  Indicator Code  ...  2017
0             0         Aruba          ABW  GDP (current US$)  NY.GDP.MKTP.CD  ...   NaN
1             1   Afghanistan          AFG  GDP (current US$)  NY.GDP.MKTP.CD  ...   NaN
..          ...           ...          ...                ...             ...  ...   ...
262         262        Zambia          ZMB  GDP (current US$)  NY.GDP.MKTP.CD  ...   NaN
263         263      Zimbabwe          ZWE  GDP (current US$)  NY.GDP.MKTP.CD  ...   NaN

264 rows × 63 columns

Saving with the index adds an extra "Unnamed: 0" column when the file is read back.

pdata = pd.read_csv(csv_path2, encoding='gbk')
pdata
    Country Name Country Code     Indicator Name  Indicator Code          1960  ...  2017
0          Aruba          ABW  GDP (current US$)  NY.GDP.MKTP.CD           NaN  ...   NaN
1    Afghanistan          AFG  GDP (current US$)  NY.GDP.MKTP.CD  5.377778e+08  ...   NaN
..           ...          ...                ...             ...           ...  ...   ...
262       Zambia          ZMB  GDP (current US$)  NY.GDP.MKTP.CD  7.130000e+08  ...   NaN
263     Zimbabwe          ZWE  GDP (current US$)  NY.GDP.MKTP.CD  1.052990e+09  ...   NaN

264 rows × 62 columns

pdata = pd.read_csv(csv_path3, encoding='gbk')
pdata
             1990          1991
0             NaN           NaN
1             NaN           NaN
2    1.002674e+10  1.211861e+10
3    2.101625e+09  1.139167e+09
4    1.029048e+09  1.106929e+09
..            ...           ...
259           NaN           NaN
260  5.647252e+09  5.930370e+09
261  1.155530e+11  1.239430e+11
262  3.285217e+09  3.378882e+09
263  8.783817e+09  8.641482e+09

264 rows × 2 columns

3 Handling missing data

3.1 Missing values and empty values

Missing value: a value that is absent from the data set; pandas represents it with NaN / NaT

Empty value: the empty string ""

s1 = [10, 10.5, None, 11]
s2 = [7, 6.9, 7.5, None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pdata
     s1   s2
0  10.0  7.0
1  10.5  6.9
2   NaN  7.5
3  11.0  NaN

3.2 Detecting missing values

Detection methods:

  • pd.isnull(): missing entries map to True; returns a boolean Series or DataFrame
  • pd.notnull(): missing entries map to False; returns a boolean Series or DataFrame
  • pdata.isnull() / pdata.notnull(): the method forms of the above
sdata = pd.Series([1, 2, 3, np.nan])
pd.isnull(sdata)
0    False
1    False
2    False
3     True
dtype: bool
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9, 7.5, None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pd.isnull(pdata)
      s1     s2
0  False  False
1  False  False
2   True  False
3  False   True

3.3 Checking whether any value is missing

  1. Approach: combine np.all with pd.notnull
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9, 7.5, None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
# pd.notnull: missing entries map to False
# np.all: returns False if the object contains any False, True otherwise
np.all(pd.notnull(pdata))
# False means there are missing values; True means there are none
False
s1 = [10, 10.5, 11]
s2 = [7, 6.9, 7.5]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
np.all(pd.notnull(pdata))
True
  2. Approach: combine np.any with pd.isnull
s1 = [10, 10.5, 11]
s2 = [7, 6.9, 7.5]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
# isnull: missing entries map to True
# any: returns True if the object contains any True
np.any(pd.isnull(pdata))
# False means there are no missing values; True means there are some
False
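A common shortcut for counting (rather than just detecting) missing values is isnull().sum(); a small sketch on the same kind of data:

```python
import pandas as pd

pdata = pd.DataFrame({'s1': [10, 10.5, None, 11],
                      's2': [7, 6.9, 7.5, None]})

per_column = pdata.isnull().sum()             # NaN count per column
total = int(per_column.sum())                 # total NaN count
has_missing = bool(pdata.isnull().values.any())
```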

3.4 Ways to handle missing values

Handling missing values:

  • Filter out missing values (by row or column)
  • Drop missing values (by row or column)
  • Fill in values, for example:
  • the mean, median, maximum, minimum, etc.
  • a special sentinel value
  • the previous (or next) value
3.4.1 Filtering out missing values
  1. Example: one week of closing prices for two stocks; None means the stock was suspended that day

    Goal: keep only the days on which neither stock was suspended

# data: one week of closing prices for two stocks; None means suspended
s1 = [10, 10.5, None, 11]
s2 = [7, 6.9, 7.5, None]
pdata = pd.DataFrame({'s1':s1, 's2':s2})
pdata
     s1   s2
0  10.0  7.0
1  10.5  6.9
2   NaN  7.5
3  11.0  NaN
# goal: keep only the days on which neither stock was suspended
# build a boolean index
bindex = np.all(pdata.notnull(), axis=1)
bindex
0     True
1     True
2    False
3    False
dtype: bool
# select the non-suspended days
pdata[bindex]
     s1   s2
0  10.0  7.0
1  10.5  6.9
3.4.2 Dropping missing values

pdata.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Main parameters:

Parameter / Description
axis: 0 or 'index': operate on rows; 1 or 'columns': operate on columns
how: how to drop along axis; 'any': drop if any value is NaN; 'all': drop only if all values are NaN
thresh: minimum number of non-NaN values; rows/columns with at least thresh non-NaN values are kept
subset: restrict the check to a subset of labels
inplace: True: modify the original data in place

Prepare the data:

s1 = [10, 10.5, None, 11]
s2 = [7, 6.9, 7.5, None]
s3 = [7, 6.9, 7.5, 7]
s4 = [None, 6.9, None, 7.2]
pdata = pd.DataFrame({'s1':s1, 's2':s2, 's3':s3, 's4':s4})
pdata
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
2   NaN  7.5  7.5  NaN
3  11.0  NaN  7.0  7.2

Goals:

  1. Drop rows containing any missing value
  2. Drop rows containing 2 or more missing values
  3. Drop rows with missing values in specific columns
  4. Drop columns containing missing values
# drop rows containing any missing value
pdata.dropna()
     s1   s2   s3   s4
1  10.5  6.9  6.9  6.9
# to drop rows with more than 1 missing value, set thresh=3
# thresh: minimum number of non-NaN values (rows with >= thresh non-NaN values are kept)
pdata.dropna(thresh=3)
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
3  11.0  NaN  7.0  7.2
# restrict the check to columns ['s1','s4']
pdata.dropna(subset=['s1','s4'])
     s1   s2   s3   s4
1  10.5  6.9  6.9  6.9
3  11.0  NaN  7.0  7.2
# drop columns containing missing values
pdata.dropna(axis=1)
    s3
0  7.0
1  6.9
2  7.5
3  7.0

Note:

  • None of the deletions above modify the original data
  • Pass inplace=True to modify the original data in place
3.4.3 Filling missing values

Fill method:

pdata.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Parameter / Description
value: the fill value
method: fill strategy: {'backfill', 'bfill', 'pad', 'ffill', None}
axis: 0 or 'index' fills along rows, 1 or 'columns' fills along columns
limit: maximum number of consecutive fills
pdata
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
2   NaN  7.5  7.5  NaN
3  11.0  NaN  7.0  7.2

Goals:

  • Fill missing values with the constant 0
  • Fill with the previous/next value
  • Fill with the mean
# constant 0
pdata.fillna(0)
     s1   s2   s3   s4
0  10.0  7.0  7.0  0.0
1  10.5  6.9  6.9  6.9
2   0.0  7.5  7.5  0.0
3  11.0  0.0  7.0  7.2
# fill with the previous row's value
pdata.fillna(method='ffill')
# forward fill (ffill) replaces the deprecated method= argument
pdata.ffill()
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
2  10.5  7.5  7.5  6.9
3  11.0  7.5  7.0  7.2
# fill with the next row's value
pdata.fillna(method='bfill')
# backward fill (bfill) replaces the deprecated method= argument
pdata.bfill()
     s1   s2   s3   s4
0  10.0  7.0  7.0  6.9
1  10.5  6.9  6.9  6.9
2  11.0  7.5  7.5  7.2
3  11.0  NaN  7.0  7.2
# fill with the next column's value
pdata.fillna(axis=1, method='bfill')
# backward fill (bfill) replaces the deprecated method= argument
pdata.bfill(axis=1)
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
2   7.5  7.5  7.5  NaN
3  11.0  7.0  7.0  7.2

For stock data with missing values we usually prefer filling with the previous day's value

3.4.4 Filling with the mean, median, maximum, or minimum

pdata.mean/max/min/median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Main parameters:

Parameter / Description
axis: direction, 0 computes per column, 1 per row
skipna: whether to ignore NaN; True excludes NaN from the computation; default True
pdata
     s1   s2   s3   s4
0  10.0  7.0  7.0  NaN
1  10.5  6.9  6.9  6.9
2   NaN  7.5  7.5  NaN
3  11.0  NaN  7.0  7.2
  1. Filling with the mean
# fill with the column means
pdata.fillna(pdata.mean())
     s1        s2   s3    s4
0  10.0  7.000000  7.0  7.05
1  10.5  6.900000  6.9  6.90
2  10.5  7.500000  7.5  7.05
3  11.0  7.133333  7.0  7.20
  2. Filling with the median
# fill with the column medians
pdata.fillna(pdata.median())
     s1   s2   s3    s4
0  10.0  7.0  7.0  7.05
1  10.5  6.9  6.9  6.90
2  10.5  7.5  7.5  7.05
3  11.0  7.0  7.0  7.20
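fillna also accepts a dict mapping column names to fill values, so each column can get its own statistic; a sketch on similar data (the choice of mean for s1 and 0 for s2 is illustrative):

```python
import pandas as pd

pdata = pd.DataFrame({'s1': [10, 10.5, None, 11],
                      's2': [7, 6.9, 7.5, None]})

# per-column fill values: mean for s1, sentinel 0 for s2
filled = pdata.fillna({'s1': pdata['s1'].mean(), 's2': 0})
```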

4 Data cleaning

4.1 Preparing the data

Scores from one exam

import pandas as pd
import numpy as np

names = list('ABCD')
math = [90,100,50,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
  name  math  chinese
0    A    90       89
1    B   100       96
2    C    50       58
3    D    80       77

4.2 Getting a specific column

Index directly by column name

# index directly by column name
pdata['name']
0    A
1    B
2    C
3    D
Name: name, dtype: object

4.3 Getting several columns

Basic form: pdata[[col1, col2]]

Get the names and the math scores

# get several columns
# basic form: pdata[[col1, col2]]
pdata[['name', 'math']]
  name  math
0    A    90
1    B   100
2    C    50
3    D    80

4.4 Selecting rows by condition

  1. Goal 1: all records with a math score above 80;

    Approach:

    • build a boolean index from the condition
    • select rows with the boolean index
# goal 1: all records with a math score above 80
bindex = pdata['math'] > 80
pdata[bindex]
  name  math  chinese
0    A    90       89
1    B   100       96
  2. Goal 2: get student A's scores

    Approach:

    • build a boolean index from the condition
    • select rows with the boolean index
# goal 2: get student A's scores
pdata[pdata['name']=='A']
  name  math  chinese
0    A    90       89

4.5 Selecting rows by multiple conditions

  1. Goal 1: records where both math and Chinese are passing
    • condition 1: math score above 59
    • condition 2: Chinese score above 59
    • combine the two conditions with AND: &
    • basic form: pdata[condition1 & condition2]
# goal 1: records where both math and Chinese are passing
# note: each condition must be parenthesized
pdata[(pdata['math']>59) & (pdata['chinese']>59)]
  name  math  chinese
0    A    90       89
1    B   100       96
3    D    80       77
  2. Goal 2: records where math or Chinese is at least 80
    • condition 1: math score at least 80
    • condition 2: Chinese score at least 80
    • combine the two conditions with OR: |
    • basic form: pdata[condition1 | condition2]
# goal 2: records where math or Chinese is at least 80
# note: each condition must be parenthesized
pdata[(pdata['math']>=80) | (pdata['chinese']>=80)]
  name  math  chinese
0    A    90       89
1    B   100       96
3    D    80       77
  3. Selecting rows by membership in a set

    Get the records of students whose math score is 100 or 90

    • membership test: pdata.isin(values), returns a boolean index
bindex = pdata['math'].isin([100, 90])
pdata[bindex]
  name  math  chinese
0    A    90       89
1    B   100       96

4.6 Sorting

  • Method 1: sort by the index

    pdata.sort_index(axis=0, level=None, ascending=True, ...)

  • Method 2: sort by the values of given columns

    pdata.sort_values(by, axis=0, ascending=True, ...)

# sorting
import pandas as pd
import numpy as np
names = list('ABCD')
math = [90,100,80,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
  name  math  chinese
0    A    90       89
1    B   100       96
2    C    80       58
3    D    80       77
# sort by index, descending: ascending=False
pdata.sort_index(ascending=False)
  name  math  chinese
3    D    80       77
2    C    80       58
1    B   100       96
0    A    90       89
# sort by math score, descending: ascending=False
pdata.sort_values(['math'], ascending=False)
  name  math  chinese
1    B   100       96
0    A    90       89
2    C    80       58
3    D    80       77
# sort by math, then by Chinese, both descending
# pass both columns to sort_values
pdata.sort_values(['math', 'chinese'], ascending=False)
  name  math  chinese
1    B   100       96
0    A    90       89
3    D    80       77
2    C    80       58
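ascending can also be a list, one flag per sort column; a sketch on the same scores, sorting math descending but breaking ties by Chinese ascending:

```python
import pandas as pd

pdata = pd.DataFrame({'name': list('ABCD'),
                      'math': [90, 100, 80, 80],
                      'chinese': [89, 96, 58, 77]})

# math descending, chinese ascending: the math=80 tie is broken
# by the lower Chinese score first
out = pdata.sort_values(['math', 'chinese'], ascending=[False, True])
```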

5 Aggregation and descriptive statistics in pandas

Computation- and statistics-related methods in pandas:

  • maximum: pdata.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • minimum: pdata.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • mean: pdata.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • median: pdata.median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • sum: pdata.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
  • variance: pdata.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
  • standard deviation: pdata.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
  • cumulative sum: pdata.cumsum(axis=None, skipna=True, *args, **kwargs)
  • quantiles: pdata.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
  • mean absolute deviation: pdata.mad(axis=None, skipna=None, level=None)
  • percent change from the previous element: pdata.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
  • skewness: pdata.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • kurtosis: pdata.kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
  • summary statistics: pdata.describe(percentiles=None, include=None, exclude=None)
  1. Generate the data
import pandas as pd
import numpy as np

names = list('ABCD')
math = [90,100,80,80]
chinese = [89,96,58,77]
pdata = pd.DataFrame({'name':names, 'math':math, 'chinese':chinese})
pdata
  name  math  chinese
0    A    90       89
1    B   100       96
2    C    80       58
3    D    80       77
  2. Basic description and computation
# count, mean, std, min, quartiles, max
pdata.describe()
             math  chinese
count    4.000000   4.0000
mean    87.500000  80.0000
std      9.574271  16.6333
min     80.000000  58.0000
25%     80.000000  72.2500
50%     85.000000  83.0000
75%     92.500000  90.7500
max    100.000000  96.0000
# each student's total and average score
print(pdata.iloc[:, 1:].sum(axis=1))
print(pdata.iloc[:, 1:].mean(axis=1))
0    179
1    196
2    138
3    157
dtype: int64
0    89.5
1    98.0
2    69.0
3    78.5
dtype: float64
  3. Quantiles

    Compute the lower quartile, the median, and the upper quartile

# lower quartile
print(pdata.iloc[:, 1:].quantile(q=0.25))
# median
print(pdata.iloc[:, 1:].quantile(q=0.5))
# upper quartile
print(pdata.iloc[:, 1:].quantile(q=0.75))
math       80.00
chinese    72.25
Name: 0.25, dtype: float64
math       85.0
chinese    83.0
Name: 0.5, dtype: float64
math       92.50
chinese    90.75
Name: 0.75, dtype: float64
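Several statistics can be computed in one call with agg; a sketch on the numeric columns of the same data:

```python
import pandas as pd

pdata = pd.DataFrame({'math': [90, 100, 80, 80],
                      'chinese': [89, 96, 58, 77]})

# agg applies several reductions at once;
# the rows of the result are the statistic names
stats = pdata.agg(['min', 'max', 'mean'])
```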

6 Indexes and multi-level indexes

Topics:

  • Resetting the index
  • Index transformations
  • Multi-level indexes
import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)), columns=['math','chinese'])
pdata['team'] = [1]*4+[2]*4
pdata['name'] = names
pdata
   math  chinese  team name
0    58       67     1    A
1    63       97     1    B
2    55       86     1    C
3    54       34     1    D
4    75       57     2    A
5    49       64     2    B
6    83       88     2    C
7    67       75     2    D

6.1 Setting an index

  • Make name the index

  • Method: pdata.set_index(keys, drop=True, append=False, inplace=False)

Parameter / Description
keys: the label(s) to use as the index; may be several columns
drop: True: remove the column(s) from the data; False: keep them
append: True: append to the existing index; False: replace it
inplace: True: modify the original data in place; False: return a new DataFrame
# set the index
# set_index returns a new DataFrame unless inplace=True
ndata = pdata.set_index('name')
ndata
      math  chinese  team
name
A       58       67     1
B       63       97     1
C       55       86     1
D       54       34     1
A       75       57     2
B       49       64     2
C       83       88     2
D       67       75     2

6.2 Multi-level indexes

Goal: select the data of a given term through the index

pd.MultiIndex.from_arrays(arrays, sortorder=None, names=None)

import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)), columns=['math','chinese'])
pdata
   math  chinese
0    53       71
1    43       98
2    66       38
3    87       69
4    92       95
5    80       53
6    62       98
7    84       64
lv1 = [1]*4+[2]*4
lv2 = list('ABCDABCD')
# create the MultiIndex object
mindex = pd.MultiIndex.from_arrays([lv1, lv2])
data = pdata.set_index(mindex)
data
     math  chinese
1 A    53       71
  B    43       98
  C    66       38
  D    87       69
2 A    92       95
  B    80       53
  C    62       98
  D    84       64
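When the levels form a full cartesian product, pd.MultiIndex.from_product is a more compact equivalent of the two hand-written arrays; a sketch with deterministic data (the level names 'term' and 'name' are illustrative):

```python
import pandas as pd
import numpy as np

# from_product builds the cartesian product of the level values,
# equivalent to from_arrays([[1]*4+[2]*4, list('ABCDABCD')])
mindex = pd.MultiIndex.from_product([[1, 2], list('ABCD')],
                                    names=['term', 'name'])
df = pd.DataFrame(np.arange(16).reshape(8, 2),
                  index=mindex, columns=['math', 'chinese'])
```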

6.3 Selecting through the multi-level index

Goals:

  • Get the first term's data
  • Get student A's data for the first term
  • Get all of student A's data
# first term's data
data.loc[1]
   math  chinese
A    53       71
B    43       98
C    66       38
D    87       69
# student A's data for the first term
data.loc[(1,'A')]
math       53
chinese    71
Name: (1, A), dtype: int32
# how do we get all of student A's data?
# swap the index levels
data = data.swaplevel(0,1)
data
     math  chinese
A 1    53       71
B 1    43       98
C 1    66       38
D 1    87       69
A 2    92       95
B 2    80       53
C 2    62       98
D 2    84       64
data.loc['A']
   math  chinese
1    53       71
2    92       95
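An alternative to swapping levels is xs, which takes a cross-section on any index level directly; a sketch with deterministic data:

```python
import pandas as pd
import numpy as np

mindex = pd.MultiIndex.from_product([[1, 2], list('ABCD')])
data = pd.DataFrame(np.arange(16).reshape(8, 2),
                    index=mindex, columns=['math', 'chinese'])

# xs selects all rows whose second index level equals 'A',
# without reordering the levels first
a_rows = data.xs('A', level=1)
```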

6.4 Pivoting rows and columns

  • pdata.stack(level=-1, dropna=True): "rotate" columns into rows
  • pdata.unstack(level=-1, fill_value=None): "rotate" rows into columns
import pandas as pd
import numpy as np
names = list('ABCDABCD')
pdata = pd.DataFrame(np.random.randint(30,100, size=(8,2)), columns=['math','chinese'])
lv1 = [1]*4+[2]*4
lv2 = list('ABCDABCD')
mindex = pd.MultiIndex.from_arrays([lv1, lv2])
pdata = pdata.set_index(mindex)
pdata
     math  chinese
1 A    84       96
  B    75       70
  C    86       55
  D    97       98
2 A    40       37
  B    79       43
  C    72       64
  D    82       37

Goals:

  • Move the term level into the columns
  • Get every student's math scores
  • Get the first term's data
  1. Move the term level into the columns
# move the term level into the columns
tmp = pdata.unstack(level=0)
tmp
  math      chinese
     1   2       1   2
A   84  40      96  37
B   75  79      70  43
C   86  72      55  64
D   97  82      98  37
  2. Get every student's math scores
# get every student's math scores
tmp['math']
    1   2
A  84  40
B  75  79
C  86  72
D  97  82
  3. Get the first term's data
# swap the column index levels, then select the first term
tmp.swaplevel(axis=1)[1]
   math  chinese
A    84       96
B    75       70
C    86       55
D    97       98
tmp
  math      chinese
     1   2       1   2
A   84  40      96  37
B   75  79      70  43
C   86  72      55  64
D   97  82      98  37
t = tmp.stack(level=0)
t
            1   2
A chinese  96  37
  math     84  40
B chinese  70  43
  math     75  79
C chinese  55  64
  math     86  72
D chinese  98  37
  math     97  82
# student A's first-term scores
t.loc['A'][1]
chinese    96
math       84
Name: 1, dtype: int32
t.sort_index(level = 1)
            1   2
A chinese  96  37
B chinese  70  43
C chinese  55  64
D chinese  98  37
A math     84  40
B math     75  79
C math     86  72
D math     97  82

7 Time and time series

Time is an important dimension in data analysis; the main time-related topics in pandas:

  • Timestamps
  • Periods
  • Time deltas
  • Time indexes
  • Rolling time windows

7.1 Timestamps

The most common object in time handling;

Timestamp methods:

  • pd.Timestamp(ts_input=<object object at 0x00000258D4E907F0>, freq=None, tz=None, ...), see the examples below
  • pd.to_datetime(arg, errors='raise', dayfirst=False, ...)
# a bare integer is interpreted as nanoseconds since the epoch, not as a year
print(pd.Timestamp(2020))
# year, month, day
print(pd.Timestamp(2020, 6, 2))
# date string
print(pd.Timestamp('2020-05-07'))
# datetime string
print(pd.Timestamp('2020-05-07 04:02:01'))
# datetime string
print(pd.Timestamp('2017-01-01T12'))
# Unix timestamp in seconds
print(pd.Timestamp(1513393355.5, unit='s'))
# '2017-02-01' matches the format string
print(pd.to_datetime('2017-02-01', format="%Y-%m-%d"))
# '20170301' matches %Y%m%d
print(pd.to_datetime('20170301', format="%Y%m%d"))
1970-01-01 00:00:00.000002020
2020-06-02 00:00:00
2020-05-07 00:00:00
2020-05-07 04:02:01
2017-01-01 12:00:00
2017-12-16 03:02:35.500000
2017-02-01 00:00:00
2017-03-01 00:00:00
print(pd.to_datetime('2017-02-01', format="%Y-%m-%d"))
# a list produces a DatetimeIndex
print(pd.to_datetime(['2017-02-01'], format="%Y-%m-%d"))
2017-02-01 00:00:00
DatetimeIndex(['2017-02-01'], dtype='datetime64[ns]', freq=None)
pd.Timestamp.now()
Timestamp('2024-02-01 10:45:29.038280')

7.2 Time indexes

pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, ... **kwargs): generates a DatetimeIndex

Official docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html)

Parameter / Description
start: start time
end: end time
periods: number of periods to generate
freq: step, defaults to D (day); other frequencies are available, e.g. H (hour), T (minute)
closed: which side is closed: None, 'left', or 'right'

Main values of freq:

Value / Description
M: last day of each month
MS: first day of each month
D: day
H: hour
T: minute (min)
S: second
Q: quarter
# default step is one day
print(pd.date_range('2017-01-01', periods=2))
# step of one hour
print(pd.date_range('2017-01-01 02', periods=2, freq='H'))
# first day of each month
print(pd.date_range('2017-01', periods=3, freq='MS'))
# last day of each month
print(pd.date_range('2017-01', periods=3, freq='M'))
DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-01 02:00:00', '2017-01-01 03:00:00'], dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2017-01-01', '2017-02-01', '2017-03-01'], dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31'], dtype='datetime64[ns]', freq='M')
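date_range can also be driven by start and end instead of periods; a small sketch:

```python
import pandas as pd

# giving both start and end (and no periods) fills in every
# step of freq between the two endpoints, inclusive
idx = pd.date_range(start='2017-01-01', end='2017-01-05', freq='D')
```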

7.3 Periods

A period is a span of time with a fixed frequency;

Periods let you select data for a given year, month, day, etc.

  • Period: pd.Period(value=None, freq=None, ordinal=None, year=None, ...)
  • Period sequence: pd.period_range(start=None, end=None, periods=None, freq=None, name=None): generates a PeriodIndex
  • pdata.to_period(freq=None, axis=0, copy=True)
import pandas as pd
import numpy as np
print(pd.Timestamp('2017-01-02'))
print(pd.Period('2017-01-02'))
print(pd.Period('2017-01'))
print(pd.Period('2017'))
2017-01-02 00:00:00
2017-01-02
2017-01
2017
# default frequency is Day
print(pd.period_range('2017-01-01', periods=2))
# monthly frequency
print(pd.period_range('2017-01', periods=2, freq='M'))
# hourly frequency
print(pd.period_range('2017-01-02', periods=2, freq='H'))
PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]')
PeriodIndex(['2017-01', '2017-02'], dtype='period[M]')
PeriodIndex(['2017-01-02 00:00', '2017-01-02 01:00'], dtype='period[H]')
tmp = pd.to_datetime(['2017-01-02', '2017-03-04'])
tmp.to_period('M')
PeriodIndex(['2017-01', '2017-03'], dtype='period[M]')
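Period objects support integer arithmetic and expose their Timestamp bounds, which is what makes period-based selection work; a small sketch:

```python
import pandas as pd

p = pd.Period('2017-01', freq='M')

# adding an integer moves the period by that many freq units
nxt = p + 1

# start_time / end_time give the Timestamp bounds of the period
start = p.start_time
```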

7.4 Time as an index

Many data sets have a time dimension, which can be turned into a time index;

Topics:

  • Selecting data at a specific time
  • Selecting data in a time range
  • Selecting data for a specific period
# prepare the data
import pandas as pd
import numpy as np
# note: these are strings
ts = ['2019-03-25','2019-03-26','2019-03-27','2019-03-28','2019-03-29','2019-03-30','2019-03-31',
 '2019-04-01','2019-04-02','2019-04-03','2019-04-04','2019-05-01','2019-04-05',]
values = np.arange(len(ts))
pdata = pd.DataFrame({'ts':ts, 'values':values})
pdata
            ts  values
0   2019-03-25       0
1   2019-03-26       1
2   2019-03-27       2
3   2019-03-28       3
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
pdata.ts.values
array(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
       '2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
       '2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
       '2019-04-05'], dtype=object)

Goals:

  • get the March data
  • get the data from April 1 through April 5
  • get the second-quarter data
# build a datetime index
# option 1:
tindex = pd.to_datetime(pdata['ts'])
tindex
0    2019-03-25
1    2019-03-26
2    2019-03-27
3    2019-03-28
4    2019-03-29
5    2019-03-30
6    2019-03-31
7    2019-04-01
8    2019-04-02
9    2019-04-03
10   2019-04-04
11   2019-05-01
12   2019-04-05
Name: ts, dtype: datetime64[ns]
# set the datetime index
tmp = pdata.set_index(tindex)
tmp.index
DatetimeIndex(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
               '2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
               '2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
               '2019-04-05'],
              dtype='datetime64[ns]', name='ts', freq=None)
# March data: partial-string indexing by month
tmp.loc['2019-03']
                    ts  values
ts
2019-03-25  2019-03-25       0
2019-03-26  2019-03-26       1
2019-03-27  2019-03-27       2
2019-03-28  2019-03-28       3
2019-03-29  2019-03-29       4
2019-03-30  2019-03-30       5
2019-03-31  2019-03-31       6
# data from April 1 through April 5: slicing (requires a sorted index)
tmp = tmp.sort_index()
tmp.loc['2019-04-01':'2019-04-05']
                    ts  values
ts
2019-04-01  2019-04-01       7
2019-04-02  2019-04-02       8
2019-04-03  2019-04-03       9
2019-04-04  2019-04-04      10
2019-04-05  2019-04-05      12
# second-quarter data
# convert the index to quarterly periods
qindex = tmp.index.to_period('Q')
qdata = tmp.set_index(qindex)
qdata.loc['2019Q2']
                ts  values
ts
2019Q2  2019-04-01       7
2019Q2  2019-04-02       8
2019Q2  2019-04-03       9
2019Q2  2019-04-04      10
2019Q2  2019-04-05      12
2019Q2  2019-05-01      11
tmp.index.to_period('Q')
PeriodIndex(['2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1',
             '2019Q1', '2019Q2', '2019Q2', '2019Q2', '2019Q2', '2019Q2',
             '2019Q2'],
            dtype='period[Q-DEC]', name='ts')

7.5 Time differences

  • timedelta: the difference between two datetime values (in days, seconds, and microseconds)

  • pd.Timedelta(value, unit=None, **kwargs)

# subtracting two Timestamps yields a Timedelta
pd.Timestamp('2019-01-02')-pd.Timestamp('2019-01-01')
Timedelta('1 days 00:00:00')
pd.Timedelta(5, 'min')
Timedelta('0 days 00:05:00')
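Timedelta values combine naturally with Timestamps and with each other; a minimal sketch:

```python
import pandas as pd

# a Timedelta shifts a Timestamp
start = pd.Timestamp('2019-01-01')
print(start + pd.Timedelta(days=1, hours=6))   # 2019-01-02 06:00:00

# dividing one Timedelta by another gives a plain float ratio
print(pd.Timedelta('1 days') / pd.Timedelta('6 hours'))  # 4.0
```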

7.6 Resampling

Resampling is used to:

  • downsample (reduce the sampling rate)
  • upsample (increase the sampling rate)
  • method: pdata.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, ...)

Main parameters:

  • rule: the resampling rule, e.g. 'T' (minute), 'M' (month)
  • closed: which side of each bin interval is closed when downsampling, 'right' or 'left'
  • label: which bin edge labels the aggregated result, {'right', 'left'}
  • convention: for PeriodIndex input, resample at the 'start' or 'end' of each period
  • kind: aggregate to a 'period' or 'timestamp' index; defaults to the input's index type

(the old how, fill_method and loffset parameters are deprecated; fill upsampled gaps with Resampler methods such as .ffill()/.bfill() instead)

Typical applications: stock-market and other financial analysis;

import pandas as pd
import numpy as np
index = pd.date_range('1/1/2000', periods=9, freq='2T')
series = pd.Series(range(9), index=index)
series
2000-01-01 00:00:00    0
2000-01-01 00:02:00    1
2000-01-01 00:04:00    2
2000-01-01 00:06:00    3
2000-01-01 00:08:00    4
2000-01-01 00:10:00    5
2000-01-01 00:12:00    6
2000-01-01 00:14:00    7
2000-01-01 00:16:00    8
Freq: 2T, dtype: int64
# downsample from a 2-minute to a 4-minute interval; an aggregation (mean, max, ...) is then applied per bin
s = series.resample('4T')
s.groups
{Timestamp('2000-01-01 00:00:00'): 2,
 Timestamp('2000-01-01 00:04:00'): 4,
 Timestamp('2000-01-01 00:08:00'): 6,
 Timestamp('2000-01-01 00:12:00'): 8,
 Timestamp('2000-01-01 00:16:00'): 9}

Resampler object methods:

  • s.groups: dict mapping each bin's timestamp to its end position
  • s.max(): maximum of each bin
  • s.min(): minimum of each bin
  • s.first(): first value of each bin
  • s.last(): last value of each bin
  • s.mean(): mean of each bin
  • s.median(): median of each bin
s.first()
2000-01-01 00:00:00    0
2000-01-01 00:04:00    2
2000-01-01 00:08:00    4
2000-01-01 00:12:00    6
2000-01-01 00:16:00    8
Freq: 4T, dtype: int64
s.max()
2000-01-01 00:00:00    1
2000-01-01 00:04:00    3
2000-01-01 00:08:00    5
2000-01-01 00:12:00    7
2000-01-01 00:16:00    8
Freq: 4T, dtype: int64
s.mean()
2000-01-01 00:00:00    0.5
2000-01-01 00:04:00    2.5
2000-01-01 00:08:00    4.5
2000-01-01 00:12:00    6.5
2000-01-01 00:16:00    8.0
Freq: 4T, dtype: float64
# upsample to a 1-minute interval; fill the new slots by back-filling
rd = series.resample('T')
rd.bfill()
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    1
2000-01-01 00:03:00    2
2000-01-01 00:04:00    2
2000-01-01 00:05:00    3
2000-01-01 00:06:00    3
2000-01-01 00:07:00    4
2000-01-01 00:08:00    4
2000-01-01 00:09:00    5
2000-01-01 00:10:00    5
2000-01-01 00:11:00    6
2000-01-01 00:12:00    6
2000-01-01 00:13:00    7
2000-01-01 00:14:00    7
2000-01-01 00:15:00    8
2000-01-01 00:16:00    8
Freq: T, dtype: int64
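For the stock-analysis use case mentioned above, a common downsampling pattern is OHLC bars; a small sketch on invented minute-level prices (the values are made up for illustration):

```python
import pandas as pd

# hypothetical per-minute "prices", resampled into 4-minute
# open/high/low/close bars
index = pd.date_range('1/1/2000', periods=8, freq='min')
price = pd.Series([10, 11, 9, 12, 13, 12, 14, 13], index=index)
bars = price.resample('4min').ohlc()
print(bars)
```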

7.7 Shifting

  • a common time-series operation: shift the data along the time axis
  • shift the values: df.shift(periods=1, freq=None, axis=0, fill_value=None)
  • shift the index: df.tshift(periods=1, freq=None, axis=0) (deprecated; use df.shift with freq instead)
index = pd.date_range('1/1/2000', periods=9, freq='D')
series = pd.Series(range(9), index=index)
series
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
2000-01-08    7
2000-01-09    8
Freq: D, dtype: int64
series.shift(1, freq='T')
2000-01-01 00:01:00    0
2000-01-02 00:01:00    1
2000-01-03 00:01:00    2
2000-01-04 00:01:00    3
2000-01-05 00:01:00    4
2000-01-06 00:01:00    5
2000-01-07 00:01:00    6
2000-01-08 00:01:00    7
2000-01-09 00:01:00    8
Freq: D, dtype: int64
series.shift(1)
2000-01-01    NaN
2000-01-02    0.0
2000-01-03    1.0
2000-01-04    2.0
2000-01-05    3.0
2000-01-06    4.0
2000-01-07    5.0
2000-01-08    6.0
2000-01-09    7.0
Freq: D, dtype: float64
series.shift(1, freq='D')
2000-01-02    0
2000-01-03    1
2000-01-04    2
2000-01-05    3
2000-01-06    4
2000-01-07    5
2000-01-08    6
2000-01-09    7
2000-01-10    8
Freq: D, dtype: int64
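shift is the usual way to compute day-over-day changes: subtract the series shifted by one period; a minimal sketch:

```python
import pandas as pd

index = pd.date_range('1/1/2000', periods=5, freq='D')
s = pd.Series([10, 12, 15, 14, 18], index=index)

# day-over-day change: today's value minus yesterday's value;
# the first slot has no predecessor, so it becomes NaN
change = s - s.shift(1)
print(change.tolist())  # [nan, 2.0, 3.0, -1.0, 4.0]
```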

8 Data cleaning

Common cleaning operations:

  • unique values of a column: Series.unique()
  • occurrence count of each value: Series.value_counts()
  • drop rows or columns: pdata.drop(labels=None,axis=0,index=None,columns=None, level=None,inplace=False,errors='raise')
  • drop duplicates: pdata.drop_duplicates(subset=None, keep='first', inplace=False)
  1. Build sample data
# prepare sample data
import pandas as pd
import numpy as np
# note: these are strings
ts = ['2019-03-25','2019-03-26','2019-03-26','2019-03-26','2019-03-29','2019-03-30','2019-03-31',
 '2019-04-01','2019-04-02','2019-04-03','2019-04-04','2019-05-01','2019-04-05',]
values = np.arange(len(ts))
pdata = pd.DataFrame({'ts':ts, 'values':values})
pdata
            ts  values
0   2019-03-25       0
1   2019-03-26       1
2   2019-03-26       2
3   2019-03-26       3
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
  2. Unique values and value counts
# unique values
pdata.ts.unique()
array(['2019-03-25', '2019-03-26', '2019-03-29', '2019-03-30',
       '2019-03-31', '2019-04-01', '2019-04-02', '2019-04-03',
       '2019-04-04', '2019-05-01', '2019-04-05'], dtype=object)
# occurrence count of each value
pdata.ts.value_counts()
ts
2019-03-26    3
2019-03-25    1
2019-03-29    1
2019-03-30    1
2019-03-31    1
2019-04-01    1
2019-04-02    1
2019-04-03    1
2019-04-04    1
2019-05-01    1
2019-04-05    1
Name: count, dtype: int64
  3. Drop rows and columns
# drop rows/columns
# drop a single row by index label
print(pdata.drop(0))
            ts  values
1   2019-03-26       1
2   2019-03-26       2
3   2019-03-26       3
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
# drop multiple rows
print(pdata.drop(index=[1,2,3]))
            ts  values
0   2019-03-25       0
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
# drop a column
print(pdata.drop(columns='ts'))
    values
0        0
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
10      10
11      11
12      12
# drop the row with index 0 and the 'ts' column in one call
pdata.drop(index=0, columns='ts')
    values
1        1
2        2
3        3
4        4
5        5
6        6
7        7
8        8
9        9
10      10
11      11
12      12
  4. Drop duplicates
# drop duplicates
# keep the first occurrence
pdata.drop_duplicates('ts')
            ts  values
0   2019-03-25       0
1   2019-03-26       1
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
# keep the last occurrence
pdata.drop_duplicates(subset='ts', keep='last')
            ts  values
0   2019-03-25       0
3   2019-03-26       3
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
# duplicated() returns a boolean mask; True marks the rows that would be dropped
# (keeping the first occurrence)
bindex = pdata.duplicated(subset='ts')
pdata[~bindex]
            ts  values
0   2019-03-25       0
1   2019-03-26       1
4   2019-03-29       4
5   2019-03-30       5
6   2019-03-31       6
7   2019-04-01       7
8   2019-04-02       8
9   2019-04-03       9
10  2019-04-04      10
11  2019-05-01      11
12  2019-04-05      12
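The boolean-mask idiom above can be written compactly with `~`, and `keep=False` marks every copy of a duplicated key; a small sketch on a reduced version of the data:

```python
import pandas as pd

pdata = pd.DataFrame({'ts': ['2019-03-25', '2019-03-26', '2019-03-26'],
                      'values': [0, 1, 2]})

# ~duplicated(): keep first occurrences, same result as drop_duplicates('ts')
first = pdata[~pdata.duplicated(subset='ts')]

# keep=False marks *all* rows whose key appears more than once
all_dups = pdata[pdata.duplicated(subset='ts', keep=False)]

print(first['values'].tolist())     # [0, 1]
print(all_dups['values'].tolist())  # [1, 2]
```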

9 Combining data

Goal: combine multiple datasets as required

  • merge: SQL-style joins on key columns
  • join: index-based joining
  • concat: stacking along an axis

9.1 merge

Merge datasets on specified fields to produce a new dataset.

pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False, right_index=False,sort=False,suffixes=('_x', '_y'),copy=True,indicator=False,validate=None)

Main parameters:

  • left: left dataset
  • right: right dataset
  • how: join type
  • on: index or column name(s) used to join the datasets
  • left_on: join column(s) in the left dataset
  • right_on: join column(s) in the right dataset
  • left_index: use the left dataset's index as its join key
  • right_index: use the right dataset's index as its join key
  • sort: sort the result by the join keys
  • suffixes: suffixes added to overlapping column names after the merge

Join types (how):

  • left: left join
  • right: right join
  • outer: full outer join
  • inner: inner join
  1. Build sample data
import pandas as pd
import numpy as np
n1 = list('ABCE')
n2 = list('ABCD')
d1 = [90,80,100,69]
d2 = [95,78,96,72]
c1 = ['001','001','002','002']
c2 = ['001','001','002','002']
df1 = pd.DataFrame({'name':n1,'math':d1, 'class':c1})
df2 = pd.DataFrame({'name':n2,'chinese':d2, 'class':c2,'pname':n1})
df1
  name  math class
0    A    90   001
1    B    80   001
2    C   100   002
3    E    69   002
df2
  name  chinese class pname
0    A       95   001     A
1    B       78   001     B
2    C       96   002     C
3    D       72   002     E
  2. Merge the data
# by default, merge joins on all shared columns (here: name and class)
pd.merge(df1, df2)
  name  math class  chinese pname
0    A    90   001       95     A
1    B    80   001       78     B
2    C   100   002       96     C
# joining on class pairs every row combination within a class;
# for these grades we would rather join on name
pd.merge(df1, df2, on = 'class')
  name_x  math class name_y  chinese pname
0      A    90   001      A       95     A
1      A    90   001      B       78     B
2      B    80   001      A       95     A
3      B    80   001      B       78     B
4      C   100   002      C       96     C
5      C   100   002      D       72     E
6      E    69   002      C       96     C
7      E    69   002      D       72     E
# outer join: the union of keys; unmatched rows are filled with NaN
pd.merge(df1, df2, on = 'name', how='outer')
  name   math class_x  chinese class_y pname
0    A   90.0     001     95.0     001     A
1    B   80.0     001     78.0     001     B
2    C  100.0     002     96.0     002     C
3    E   69.0     002      NaN     NaN   NaN
4    D    NaN     NaN     72.0     002     E
# left join: keep every row of df1
pd.merge(df1, df2, on = 'name', how='left')
  name  math class_x  chinese class_y pname
0    A    90     001     95.0     001     A
1    B    80     001     78.0     001     B
2    C   100     002     96.0     002     C
3    E    69     002      NaN     NaN   NaN
# right join: keep every row of df2
pd.merge(df1, df2, on = 'name', how='right')
  name   math class_x  chinese class_y pname
0    A   90.0     001       95     001     A
1    B   80.0     001       78     001     B
2    C  100.0     002       96     002     C
3    D    NaN     NaN       72     002     E
# in practice the key columns may have different names;
# use left_on/right_on to join on them
pd.merge(df1, df2, left_on = 'name',right_on='pname')
  name_x  math class_x name_y  chinese class_y pname
0      A    90     001      A       95     001     A
1      B    80     001      B       78     001     B
2      C   100     002      C       96     002     C
3      E    69     002      D       72     002     E
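When checking an outer merge, `indicator=True` adds a `_merge` column recording which side each row came from; a small sketch reusing the name columns of df1/df2 (reduced here to two columns):

```python
import pandas as pd

df1 = pd.DataFrame({'name': list('ABCE'), 'math': [90, 80, 100, 69]})
df2 = pd.DataFrame({'name': list('ABCD'), 'chinese': [95, 78, 96, 72]})

# _merge is 'both', 'left_only', or 'right_only' per row
merged = pd.merge(df1, df2, on='name', how='outer', indicator=True)
print(merged[['name', '_merge']])
# E appears only in df1 (left_only), D only in df2 (right_only)
```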

9.2 join

join: a DataFrame method similar to merge, except that it joins on the other frame's index and how defaults to 'left'

df1.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Main parameters:

  • on: column(s) of the caller to join on the other frame's index
  • how: join type
  • lsuffix: suffix for overlapping columns of the left DataFrame
  • rsuffix: suffix for overlapping columns of the right DataFrame
df1
  name  math class
0    A    90   001
1    B    80   001
2    C   100   002
3    E    69   002
df2
  name  chinese class pname
0    A       95   001     A
1    B       78   001     B
2    C       96   002     C
3    D       72   002     E
# when the frames share column names, lsuffix/rsuffix are required to disambiguate the merged columns
df1.join(df2, lsuffix='_x', rsuffix='_y')
  name_x  math class_x name_y  chinese class_y pname
0      A    90     001      A       95     001     A
1      B    80     001      B       78     001     B
2      C   100     002      C       96     002     C
3      E    69     002      D       72     002     E
df1.join(df2.set_index('class'), lsuffix='_x', rsuffix='_y', on ='class')
  name_x  math class name_y  chinese pname
0      A    90   001      A       95     A
0      A    90   001      B       78     B
1      B    80   001      A       95     A
1      B    80   001      B       78     B
2      C   100   002      C       96     C
2      C   100   002      D       72     E
3      E    69   002      C       96     C
3      E    69   002      D       72     E

9.3 concat

concat: concatenate two or more datasets along a chosen axis

pd.concat(objs,axis=0,join='outer',ignore_index=False,keys=None,levels=None,names=None...)

Parameters:

  • objs: a sequence or mapping of Series/DataFrame objects
  • axis: axis to concatenate along; 0 stacks rows (index), 1 stacks columns
  • join: 'inner' keeps the intersection of the other axis, 'outer' the union
  • ignore_index: if True, discard the original labels along the concatenation axis and renumber 0..n-1
  • keys: sequence used to build a hierarchical index (MultiIndex)
  • levels: specific levels to use when building the MultiIndex
  • names: names for the levels of the resulting MultiIndex

(the old join_axes parameter was removed in pandas 1.0)
# axis=0 (the default): stack the rows; sort=True orders the columns
pd.concat([df1, df2], sort= True)
   chinese class   math name pname
0      NaN   001   90.0    A   NaN
1      NaN   001   80.0    B   NaN
2      NaN   002  100.0    C   NaN
3      NaN   002   69.0    E   NaN
0     95.0   001    NaN    A     A
1     78.0   001    NaN    B     B
2     96.0   002    NaN    C     C
3     72.0   002    NaN    D     E
# axis=0 with keys: the stacked rows get a hierarchical (MultiIndex) row index
pd.concat([df1, df2], sort= True, keys=['p1', 'p2'])
      chinese class   math name pname
p1 0      NaN   001   90.0    A   NaN
   1      NaN   001   80.0    B   NaN
   2      NaN   002  100.0    C   NaN
   3      NaN   002   69.0    E   NaN
p2 0     95.0   001    NaN    A     A
   1     78.0   001    NaN    B     B
   2     96.0   002    NaN    C     C
   3     72.0   002    NaN    D     E
# axis=1: concatenate side by side, aligned on the index
pd.concat([df1, df2], axis=1)
  name  math class name  chinese class pname
0    A    90   001    A       95   001     A
1    B    80   001    B       78   001     B
2    C   100   002    C       96   002     C
3    E    69   002    D       72   002     E
# axis=1 with keys: adds an outer level to the column index
pd.concat([df1, df2], axis=1, keys=['s1','s2'])
    s1              s2
  name  math class name  chinese class pname
0    A    90   001    A       95   001     A
1    B    80   001    B       78   001     B
2    C   100   002    C       96   002     C
3    E    69   002    D       72   002     E
# concatenate aligned on name by making it the index first
t1 = df1.set_index('name')
t2 = df2.set_index('name')
pd.concat([t1, t2], axis=1, sort=True)
       math class  chinese class pname
name
A      90.0   001     95.0   001     A
B      80.0   001     78.0   001     B
C     100.0   002     96.0   002     C
D       NaN   NaN     72.0   002     E
E      69.0   002      NaN   NaN   NaN
# same, but join='inner' keeps only the names present in both
t1 = df1.set_index('name')
t2 = df2.set_index('name')
pd.concat([t1, t2], axis=1, sort=True, join='inner')
      math class  chinese class pname
name
A       90   001       95   001     A
B       80   001       78   001     B
C      100   002       96   002     C
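When the original row labels are meaningless after stacking, `ignore_index=True` renumbers the result; a minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# without ignore_index the labels would be 0,1,0,1;
# ignore_index=True rebuilds a clean 0..n-1 index
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```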

10 Common data-processing functions

Goal: master apply, agg, str and related functions for flexible data handling

10.1 apply

  • applies a function over a DataFrame's columns/rows or a Series' values
  • apply method: df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds) (the old broadcast and reduce parameters are deprecated)
  • func: the function applied to each column or row
  • axis: 0 applies func to each column, 1 to each row
import pandas as pd
import numpy as np
n = list('ABCD')
math = [90,80,47,69]
chinese = [95,78,96,59]
nclass = ['001','001','002','002']
df = pd.DataFrame({'name':n,'math':math, 'chinese':chinese,'class':nclass})
df
  name  math  chinese class
0    A    90       95   001
1    B    80       78   001
2    C    47       96   002
3    D    69       59   002
name =df['name']
name
0    A
1    B
2    C
3    D
Name: name, dtype: object
# transform each name by doubling it
name.apply(lambda x:x+x)
0    AA
1    BB
2    CC
3    DD
Name: name, dtype: object
# convert scores to True (pass) / False (fail)
df[['math','chinese']].apply(lambda x : x>59)
    math  chinese
0   True     True
1   True     True
2  False     True
3   True    False

10.2 What func receives

With the default axis=0, apply calls func once per column, passing the column as a Series:
def func(value):
    print(type(value))
    return np.mean(value)
df[['math','chinese']].apply(func)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>





math       71.5
chinese    82.0
dtype: float64
def func1(x):
    return x-x.mean()
df[['math','chinese']].apply(func1)
   math  chinese
0  18.5     13.0
1   8.5     -4.0
2 -24.5     14.0
3  -2.5    -23.0
def func_min(x):
    return x-x.min()

sdata = df[['math','chinese']]
sdata.apply(func_min)
   math  chinese
0    43       36
1    33       19
2     0       37
3    22        0

10.3 map

  • applies to a Series object (or a single DataFrame column)
  • map method: Series.map(arg, na_action=None)
  • transforms every value in the Series via a function, dict, or Series lookup
df['math'].map(lambda x: 'pass' if x > 59 else 'failed')
0      pass
1      pass
2    failed
3      pass
Name: math, dtype: object
# task: convert the score table into 'pass' / 'failed'
func = lambda x: 'pass' if x > 59 else 'failed'
df[['math','chinese']].apply(lambda x : x.map(func))
     math chinese
0    pass    pass
1    pass    pass
2  failed    pass
3    pass  failed

10.4 replace

  • replaces specified values in the dataset
  • method: df.replace(to_replace=None,value=None,inplace=False,limit=None,regex=False,method='pad')

Main parameters:

  • to_replace: the value(s) to replace; a scalar, list, dict, or regular expression
  • value: the replacement value
  • limit: maximum number of consecutive fills (for method-based replacement)
  • inplace: if True, modify the original data in place
  • regex: set to True to interpret to_replace as a regular expression
# replace 'A' with 'a'
df.replace('A', 'a')
  name  math  chinese class
0    a    90       95   001
1    B    80       78   001
2    C    47       96   002
3    D    69       59   002
# replace every value in ['A','B'] with '*'
df.replace(['A','B'], '*')
  name  math  chinese class
0    *    90       95   001
1    *    80       78   001
2    C    47       96   002
3    D    69       59   002
# replace one list of values with another
df.replace(list('ABCD'), list('abcd'))
  name  math  chinese class
0    a    90       95   001
1    b    80       78   001
2    c    47       96   002
3    d    69       59   002
# regex: replace every uppercase letter with '*'
df.replace(r'[A-Z]','*', regex=True)
  name  math  chinese class
0    *    90       95   001
1    *    80       78   001
2    *    47       96   002
3    *    69       59   002

10.5 agg: aggregation

  • aggregates data along the chosen axis (mean, max, ...)
  • df.agg(func, axis=0, *args, **kwargs)
  • axis: 0 applies func to each column, 1 to each row
df[['chinese', 'math']].agg(['mean', 'std'])
        chinese       math
mean  82.000000  71.500000
std   17.416467  18.448125
df[['chinese', 'math']].describe()
         chinese       math
count   4.000000   4.000000
mean   82.000000  71.500000
std    17.416467  18.448125
min    59.000000  47.000000
25%    73.250000  63.500000
50%    86.500000  74.500000
75%    95.250000  82.500000
max    96.000000  90.000000
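agg also accepts a dict that maps each column to its own function; a minimal sketch on the score data:

```python
import pandas as pd

df = pd.DataFrame({'math': [90, 80, 47, 69], 'chinese': [95, 78, 96, 59]})

# a dict applies a different aggregation to each column
out = df.agg({'math': 'max', 'chinese': 'mean'})
print(out)
```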

10.6 transform

  • applies a function to the data and returns a result of the same shape
  • df.transform(func, axis=0, *args, **kwargs)
  • example below: subtract each subject's mean score from its values
df[['chinese', 'math']]
   chinese  math
0       95    90
1       78    80
2       96    47
3       59    69
df[['chinese', 'math']].transform(lambda x:x-x.mean())
   chinese  math
0     13.0  18.5
1     -4.0   8.5
2     14.0 -24.5
3    -23.0  -2.5

10.7 filter

  • selects rows or columns whose labels match a condition
  • df.filter(items=None, like=None, regex=None, axis=None)
  • items: exact list of labels to keep
  • like: keep labels that contain this substring
  • regex: keep labels matching this regular expression
  • axis: the axis to filter on; defaults to the columns for a DataFrame
df.filter(['name'])
  name
0    A
1    B
2    C
3    D
# like: keep columns whose label contains 'n'
df.filter(like = 'n')
  name  chinese
0    A       95
1    B       78
2    C       96
3    D       59
# regex: keep columns matching the pattern (here, labels ending in 'e')
df.filter(regex=r'.*e')
  name  chinese
0    A       95
1    B       78
2    C       96
3    D       59

11 Grouping

Goal: split data into groups by a given criterion, then compute per group

Examples: daily active user counts, per-user daily online time, average spend per user, etc.

import pandas as pd
classname = ['001','001','002','002','003','003']
name = ['sun','li','zhou','wang','zao','wu']
height = [169, 172,180,170,165,175]
weights = [61,53,75,64,50,58]
df = pd.DataFrame({'cname':classname, 'user':name, 'height':height, 'weights':weights})
df
  cname  user  height  weights
0   001   sun     169       61
1   001    li     172       53
2   002  zhou     180       75
3   002  wang     170       64
4   003   zao     165       50
5   003    wu     175       58

11.1 groupby

Method: df.groupby(by=None,axis=0,level=None,as_index=True,sort=True,group_keys=True,observed=False,**kwargs)

Main parameters:

  • by: grouping key(s)
  • axis: 0 or 'index' groups rows, 1 or 'columns' groups columns; default 0
  • group_keys: use the group labels as the index of aggregated output; default True
  • sort: sort by the group labels; default True
  • level: for a MultiIndex, the level(s) to group by

(the squeeze parameter is deprecated)
# create a DataFrameGroupBy object
dfg = df.groupby('cname')
dfg.groups
{'001': [0, 1], '002': [2, 3], '003': [4, 5]}
# per-group counts
dfg.count()
       user  height  weights
cname
001       2       2        2
002       2       2        2
003       2       2        2
# group by multiple columns:
dfg = df.groupby(['cname','height'])
dfg.groups
{('001', 169): [0], ('001', 172): [1], ('002', 170): [3], ('002', 180): [2], ('003', 165): [4], ('003', 175): [5]}
# the result has a MultiIndex
dfg.count()
              user  weights
cname height
001   169        1        1
      172        1        1
002   170        1        1
      180        1        1
003   165        1        1
      175        1        1
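A single group can also be pulled out of a GroupBy object with get_group; a minimal sketch on a reduced version of the class data:

```python
import pandas as pd

df = pd.DataFrame({'cname': ['001', '001', '002'],
                   'height': [169, 172, 180]})

# get_group extracts the sub-DataFrame for one key
dfg = df.groupby('cname')
print(dfg.get_group('001'))
```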

11.2 Aggregation

Grouping yields a GroupBy object; aggregation functions then reduce each group to summary values;

Put plainly: process the data group by group;

import pandas as pd
import numpy as np
classname = ['001','001','002','002','003','003']
name = ['sun','li','zhou','wang','zao','wu']
height = [169, 172,180,170,165,175]
weights = [61,53,75,64,50,58]
df = pd.DataFrame({'cname':classname, 'user':name, 'height':height, 'weights':weights})
df_cname = df.drop('user', axis=1) # drop the non-numeric 'user' column
dfg = df_cname.groupby('cname')
df_cname
  cname  height  weights
0   001     169       61
1   001     172       53
2   002     180       75
3   002     170       64
4   003     165       50
5   003     175       58
dfg.agg(['max', 'mean', 'std', 'count'])
      height                          weights
         max   mean       std count      max  mean       std count
cname
001      172  170.5  2.121320     2       61  57.0  5.656854     2
002      180  175.0  7.071068     2       75  69.5  7.778175     2
003      175  170.0  7.071068     2       58  54.0  5.656854     2

11.3 transform

  • applies a function per group and returns a new result aligned with the original rows
  • method: dfg.transform(func, *args, **kwargs)
  • typical use: within each group, the deviation of each value from the group mean
# mean height and weight of each class, broadcast back to every row
dfg.transform('mean')
   height  weights
0   170.5     57.0
1   170.5     57.0
2   175.0     69.5
3   175.0     69.5
4   170.0     54.0
5   170.0     54.0
# each student's height/weight minus the class average
dfg.transform(lambda x: x-x.mean())
   height  weights
0    -1.5      4.0
1     1.5     -4.0
2     5.0      5.5
3    -5.0     -5.5
4    -5.0     -4.0
5     5.0      4.0

11.5 filter

  • filters out entire groups
  • dfg.filter(func, dropna=True, *args, **kwargs): returns the rows of the groups for which func returns True
# return the rows of groups whose mean height is greater than 171
dfg['height'].filter(lambda x: np.mean(x)> 171)
2    180
3    170
Name: height, dtype: int64
dfg.count()
       height  weights
cname
001         2        2
002         2        2
003         2        2

11.6 Binning with cut

  • buckets values into configured intervals, then summarizes per bucket
  • method: pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False,duplicates='raise')
# sample data: count how many scores fall into 70~79, 80~89 and 90~100
df = pd.Series(np.random.randint(70,100, size = 10))
# bin edges [70, 80, 90, 101); right=False makes each interval closed on the left
r = pd.cut(df, [70,80,90,101], right=False)
r
0     [70, 80)
1     [80, 90)
2     [70, 80)
3    [90, 101)
4     [70, 80)
5     [80, 90)
6    [90, 101)
7     [70, 80)
8    [90, 101)
9     [80, 90)
dtype: category
Categories (3, interval[int64, left]): [[70, 80) < [80, 90) < [90, 101)]
# same bins, now with labels
r = pd.cut(df, [70,80,90,101], right=False, labels=['70+','80+','100+'])
r.groupby(r, observed=False).count()
70+     4
80+     3
100+    3
dtype: int64
# retbins=True also returns the computed bin edges
r,bins= pd.cut(df, [70,80,90,101], right=False, labels=['70+','80+','100+'],retbins=True)
bins
array([ 70,  80,  90, 101])
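A related function, pd.qcut, picks the bin edges from quantiles instead of fixed values, so the bins hold roughly equal counts; a minimal sketch:

```python
import pandas as pd

s = pd.Series(range(10))

# four quantile-based bins labelled Q1..Q4
q = pd.qcut(s, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(q.value_counts().sort_index().tolist())  # [3, 2, 2, 3]
```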

11.7 Pivot tables

A pivot table dynamically rearranges and summarizes data by category; it is used much like groupby, but is simpler;

Method: pd.pivot_table(data,values=None,index=None,columns=None,aggfunc='mean',fill_value=None,margins=False,...)

Main parameters:

  • data: the data
  • values: the column(s) whose values are aggregated
  • index: row grouping keys (column names, Grouper, arrays, ...)
  • columns: column grouping keys (column names, Grouper, arrays, ...)
  • aggfunc: an aggregation function, list of functions, or dict of functions
  • fill_value: value used in place of missing results
  • dropna: drop columns that are entirely NaN; default True
  • margins: add row/column totals; default False
  • margins_name: name of the totals row/column
import pandas as pd
path = r'data\titanic.csv'
df = pd.read_csv(path)
df
     survived  pclass     sex   age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone
0           0       3    male  22.0      1      0   7.2500        S   Third    man        True  NaN  Southampton    no  False
1           1       1  female  38.0      1      0  71.2833        C   First  woman       False    C    Cherbourg   yes  False
2           1       3  female  26.0      0      0   7.9250        S   Third  woman       False  NaN  Southampton   yes   True
3           1       1  female  35.0      1      0  53.1000        S   First  woman       False    C  Southampton   yes  False
4           0       3    male  35.0      0      0   8.0500        S   Third    man        True  NaN  Southampton    no   True
..        ...     ...     ...   ...    ...    ...      ...      ...     ...    ...         ...  ...          ...   ...    ...
886         0       2    male  27.0      0      0  13.0000        S  Second    man        True  NaN  Southampton    no   True
887         1       1  female  19.0      0      0  30.0000        S   First  woman       False    B  Southampton   yes   True
888         0       3  female   NaN      1      2  23.4500        S   Third  woman       False  NaN  Southampton    no  False
889         1       1    male  26.0      0      0  30.0000        C   First    man        True    C    Cherbourg   yes   True
890         0       3    male  32.0      0      0   7.7500        Q   Third    man        True  NaN   Queenstown    no   True

891 rows × 15 columns

# survival rate by sex and passenger class
df.pivot_table(values='survived',index='sex', columns=['class'])
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
# same, with margins=True adding overall rates, renamed via margins_name
df.pivot_table(values='survived',index='sex', columns=['class'],margins=True, margins_name='all_mean')
class        First    Second     Third  all_mean
sex
female    0.968085  0.921053  0.500000  0.742038
male      0.368852  0.157407  0.135447  0.188908
all_mean  0.629630  0.472826  0.242363  0.383838
# survival rate by sex, age band, and class
cuts = pd.cut(df.age,[0,18,100])
df.pivot_table(values='survived',index=['sex',cuts], columns=['class'])
class                First    Second     Third
sex    age
female (0, 18]    0.909091  1.000000  0.511628
       (18, 100]  0.972973  0.900000  0.423729
male   (0, 18]    0.800000  0.600000  0.215686
       (18, 100]  0.375000  0.071429  0.133663
# survival rate and count by sex and class, with overall totals
df.pivot_table(values='survived',index=['sex'], columns=['class'], aggfunc=['mean', 'count'],margins=True)
            mean                                   count
class      First    Second     Third       All  First Second Third  All
sex
female  0.968085  0.921053  0.500000  0.742038     94     76   144  314
male    0.368852  0.157407  0.135447  0.188908    122    108   347  577
All     0.629630  0.472826  0.242363  0.383838    216    184   491  891

11.8 String (str) methods

str methods are built into Series objects for string handling, and mirror Python's own string methods;

  • cat(others=None, sep=None, na_rep=None, join=None): concatenate strings
  • split(pat=None, n=-1, expand=False): split strings
  • get(i): the element at position i of each value
  • join(sep): join list elements with sep
  • find(sub, start=0, end=None): position of the first occurrence of sub
  • contains(pat, case=True, flags=0, na=nan, regex=True): whether each value contains pat
  • replace(pat, repl, n=-1, case=None, flags=0, regex=True): substitution
  • repeat(repeats): repeat each value repeats times
  • startswith(pat, na=nan): whether each value starts with pat
  • endswith(pat, na=nan): whether each value ends with pat
  • strip(to_strip=None): trim the given characters from both ends
sf = pd.Series(['a_1', 'b_2', 'c_3'])
sf.str.cat(list('ABC'), sep = '-')
0    a_1-A
1    b_2-B
2    c_3-C
dtype: object
sf.str.split('_', expand=True)
   0  1
0  a  1
1  b  2
2  c  3
# extract the letter part with a capture group
sf.str.extract(r'([a-zA-Z]+)')
   0
0  a
1  b
2  c
# match against the start of each string
sf.str.match(r'([a-zA-Z]+)')
0    True
1    True
2    True
dtype: bool
# find all runs of letters and digits
sf.str.findall(r'([a-z0-9]+)')
0    [a, 1]
1    [b, 2]
2    [c, 3]
dtype: object
df = pd.DataFrame([ ["张三", "abcd1234"], ["李四", "a2b4c3r5"]], columns=["name", "pwd"]) 
df
  name       pwd
0   张三  abcd1234
1   李四  a2b4c3r5
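The name/pwd frame above can illustrate str.replace with a regular expression, e.g. masking the digits (the masking rule is invented here purely for illustration):

```python
import pandas as pd

df = pd.DataFrame([["张三", "abcd1234"], ["李四", "a2b4c3r5"]],
                  columns=["name", "pwd"])

# str.replace with regex=True: replace every digit with '*'
df['masked'] = df['pwd'].str.replace(r'\d', '*', regex=True)
print(df['masked'].tolist())  # ['abcd****', 'a*b*c*r*']
```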

12 Visualization with pandas

pandas can plot charts directly; this is built on matplotlib and used in much the same way.

Method: df.plot(*args, **kwargs)

Main parameters:

  • data: the Series or DataFrame to plot
  • x: label or position of the x-axis data
  • y: label or position of the y-axis data
  • kind: plot style, e.g. line, bar, barh, hist
  • figsize: figure size
  • use_index: use the index for the x-axis ticks; default True
  • grid: show grid lines; default False
  • legend: show the legend; default True

12.1 Basic usage

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# sample data
df = pd.DataFrame(np.random.randint(30,60, size=(6,4)), columns=list('ABCD'), index=range(6))
df
    A   B   C   D
0  41  34  35  53
1  40  53  34  50
2  59  44  47  38
3  59  50  59  40
4  55  52  49  45
5  45  57  38  54
# line plot
# x: the index; y: all columns by default
df.plot()
<AxesSubplot: >



# column A on the left y-axis
df.A.plot()
# column B on the right y-axis, in green
df.B.plot(secondary_y=True, style='g')
<AxesSubplot: >



12.2 Plot styles under df.plot

  • line chart: df.plot.line(x=None, y=None, **kwargs)
  • bar chart: df.plot.bar(x=None, y=None, **kwargs)
  • horizontal bar chart: df.plot.barh(x=None, y=None, **kwargs)
  • histogram: df.plot.hist(by=None, bins=10, **kwargs)
  • KDE plot: df.plot.kde(bw_method=None, ind=None, **kwargs)
  • pie chart: df.plot.pie(**kwargs)
  • scatter plot: df.plot.scatter(x, y, s=None, c=None, **kwargs)
  • box plot: df.plot.box(by=None, **kwargs)
  • area plot: df.plot.area(x=None, y=None, **kwargs)
# bar chart
df.plot.bar()
<AxesSubplot: >



# horizontal bar chart
df.plot.barh()
<AxesSubplot: >



# pie chart
df.plot.pie(y='A')
<AxesSubplot: ylabel='A'>


