pandas指南
学习目的:
掌握pandas中series与dataframe pandas数据清洗 使用pandas进行数据基本统计 时间序列分析 数据分析常用方法
1 Pandas基础
pandas是python中数据分析核心库,能够快速,灵活的对大量数据进行分析,是Python进行数据分析的必要利器; pandas支持多种数据导入,支持数据合并,拆分,基本统计,时间序列分析,透视表等多种操作;
1.1 Series数据结构
Series:一维的带索引数据结构(单列)
Series类: pd.Series(data=None,index=None,dtype=None,name=None,copy=False,fastpath=Fals
e)
创建Series对象
第一列为索引,第二列为Series数据
import pandas as pd
import numpy as np
sdata = pd. Series( np. arange( 1 , 4 ) , index= list ( 'abc' ) )
sdata
a 1
b 2
c 3
dtype: int32
Series对象访问
.iloc[]
确保你按位置而不是标签访问Series对象的元素。
print ( sdata. iloc[ 0 ] )
print ( sdata[ 'b' ] )
print ( sdata. loc[ 'c' ] )
1
2
3
获取index与value
sdata. index. values
array(['a', 'b', 'c'], dtype=object)
sdata. values
array([1, 2, 3])
将index与value转成列表
sdata. index. values. tolist( )
['a', 'b', 'c']
sdata. values. tolist( )
[1, 2, 3]
for item in sdata. items( ) :
print ( item)
('a', 1)
('b', 2)
('c', 3)
1.2 dataframe
DataFrame:多种类型的列构成的二维标签数据结构(多列);
DataFrame类:
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
data:一维数据,二维数据 index:行标签 columns:列标签
1.2.1 Dataframe创建
一行一列
pd. DataFrame( data= np. arange( 1 , 4 ) )
多列
data = np. arange( 16 ) . reshape( 4 , 4 )
pd. DataFrame( data= data)
0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 3 12 13 14 15
设置index与columns
data = np. arange( 16 ) . reshape( 4 , 4 )
pdata = pd. DataFrame( data= data, index= list ( 'abcd' ) , columns= [ 'c1' , 'c2' , 'c3' , 'c4' ] )
pdata
c1 c2 c3 c4 a 0 1 2 3 b 4 5 6 7 c 8 9 10 11 d 12 13 14 15
设置index与columns
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] }
pdata = pd. DataFrame( data= data)
pdata
设置列标签
pdata. columns = [ 't1' , 't2' ]
pdata
1.2.2 DataFrame对象访问
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] , 'c3' : [ 7 , 8 , 9 ] }
pdata = pd. DataFrame( data= data)
pdata
获取指定列
pdata[ 'c1' ]
0 1
1 2
2 3
Name: c1, dtype: int64
pdata[ [ 'c1' , 'c2' ] ]
loc操作
loc操作:使用类似列表方式去对数据进行访问,支持bool索引;
import pandas as pd
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] , 'c3' : [ 7 , 8 , 9 ] }
pdata = pd. DataFrame( data= data)
pdata
获取指定行
pdata. loc[ 0 ]
c1 1
c2 4
c3 7
Name: 0, dtype: int64
pdata. loc[ 0 , [ 'c1' , 'c2' ] ]
c1 1
c2 4
Name: 0, dtype: int64
遍历DataFrame对象
for item in pdata:
print ( item)
c1
c2
c3
for item in pdata. items( ) :
print ( item)
('c1', 0 1
1 2
2 3
Name: c1, dtype: int64)
('c2', 0 4
1 5
2 6
Name: c2, dtype: int64)
('c3', 0 7
1 8
2 9
Name: c3, dtype: int64)
for item in pdata. iterrows( ) :
print ( item)
(0, c1 1
c2 4
c3 7
Name: 0, dtype: int64)
(1, c1 2
c2 5
c3 8
Name: 1, dtype: int64)
(2, c1 3
c2 6
c3 9
Name: 2, dtype: int64)
1.2.3 DataFrame修改
修改元素
import pandas as pd
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] , 'c3' : [ 7 , 8 , 9 ] }
pdata = pd. DataFrame( data= data)
pdata[ 'c1' ] = 0
pdata
DataFrame插入列
import pandas as pd
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] , 'c3' : [ 7 , 8 , 9 ] }
pdata = pd. DataFrame( data= data)
pdata[ 'c4' ] = [ - 1 , - 1 , - 1 ]
pdata
c1 c2 c3 c4 0 1 4 7 -1 1 2 5 8 -1 2 3 6 9 -1
DataFrame插入行
import pandas as pd
data = { 'c1' : [ 1 , 2 , 3 ] , 'c2' : [ 4 , 5 , 6 ] , 'c3' : [ 7 , 8 , 9 ] }
pdata = pd. DataFrame( data= data)
pdata. loc[ 3 ] = [ - 1 , - 1 , - 1 ]
pdata
c1 c2 c3 0 1 4 7 1 2 5 8 2 3 6 9 3 -1 -1 -1
2 pandas数据导入与保存
目的:
数据导入:excel, csv文件 数据导出 基本统计 缺省数据处理
2.1 数据导入
数据是分析基础,实际工作中,数据来自于企业内部数据,网络数据,开源数据集;
方法 说明 pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, ...)
读取CSV文件 pd.read_excel(io, sheet_name=0, names=None, index_col=None, usecols=None, ...)
读取Excel文件 pd.read_json(path_or_buf=None, orient=None, typ='frame', dtype=None, ...)
读取JSON文件
读取excel文件
import pandas as pd
import numpy as np
fpath = r'data\test.xlsx'
pdata = pd. read_excel( fpath)
pdata
序号 姓名 数学 语文 0 1 赵 99 87 1 2 钱 88 92 2 3 孙 77 73 3 4 李 66 68
读取csv文件
fpath = r'data\GDP.csv'
pdata = pd. read_csv( fpath, encoding= 'gbk' )
pdata
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN 2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN 3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN 4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN 260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN 261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN 262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN 263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN
264 rows × 62 columns
fpath = r'data\GDP.csv'
pdata = pd. read_csv( fpath, usecols = [ 'Country Name' , '1990' ] , encoding= 'gbk' )
pdata
Country Name 1990 0 Aruba NaN 1 Afghanistan NaN 2 Angola 1.002674e+10 3 Albania 2.101625e+09 4 Andorra 1.029048e+09 ... ... ... 259 Kosovo NaN 260 Yemen, Rep. 5.647252e+09 261 South Africa 1.155530e+11 262 Zambia 3.285217e+09 263 Zimbabwe 8.783817e+09
264 rows × 2 columns
fpath = r'data\GDP.csv'
pdata = pd. read_csv( fpath, header= 1 , encoding= 'gbk' )
pdata
Aruba ABW GDP (current US$) NY.GDP.MKTP.CD Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 ... 2791960894 2498932961 2467703911 2584463687 Unnamed: 56 Unnamed: 57 Unnamed: 58 Unnamed: 59 Unnamed: 60 Unnamed: 61 0 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN 1 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN 2 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN 3 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN 4 Arab World ARB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.078120e+12 1.795820e+12 2.109650e+12 2.501550e+12 2.741240e+12 2.839630e+12 2.906620e+12 2.563300e+12 2.504700e+12 NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 258 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN 259 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN 260 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN 261 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN 262 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN
263 rows × 62 columns
pdata = pd. read_csv( fpath, header= None , encoding= 'gbk' )
pdata
0 1 2 3 4 5 6 7 8 9 ... 52 53 54 55 56 57 58 59 60 61 0 Country Name Country Code Indicator Name Indicator Code 1.960000e+03 1.961000e+03 1.962000e+03 1.963000e+03 1.964000e+03 1.965000e+03 ... 2.008000e+03 2.009000e+03 2.010000e+03 2.011000e+03 2.012000e+03 2.013000e+03 2.014000e+03 2.015000e+03 2.016000e+03 2017.0 1 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN 2 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN 3 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN 4 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 260 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN 261 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN 262 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN 263 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN 264 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN
265 rows × 62 columns
2.2 数据保存
方法 说明 pdata.to_csv(path_or_buf=None, sep=',', ...)
保存为CSV文件 pdata.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', ...)
保存为Excel文件 pdata.to_json(path_or_buf=None, orient=None, ...)
保存为JSON格式文件
import pandas as pd
import numpy as np
fpath = r'data\GDP.csv'
csv_path1 = r'data\new_GDP_1.csv'
csv_path2 = r'data\new_GDP_2.csv'
csv_path3 = r'data\new_GDP_3.csv'
pdata = pd. read_csv( fpath, encoding= 'gbk' )
pdata. to_csv( csv_path1)
pdata. to_csv( csv_path2, index= False )
pdata. to_csv( csv_path3, index= False , columns= [ '1990' , '1991' ] )
pdata = pd. read_csv( csv_path1, encoding= 'gbk' )
pdata
Unnamed: 0 Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 0 0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN 1 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN 2 2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN 3 3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN 4 4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 259 259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN 260 260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN 261 261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN 262 262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN 263 263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN
264 rows × 63 columns
pdata = pd. read_csv( csv_path2, encoding= 'gbk' )
pdata
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 0 Aruba ABW GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.791961e+09 2.498933e+09 2.467704e+09 2.584464e+09 NaN NaN NaN NaN NaN NaN 1 Afghanistan AFG GDP (current US$) NY.GDP.MKTP.CD 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 ... 1.019053e+10 1.248694e+10 1.593680e+10 1.793024e+10 2.053654e+10 2.004633e+10 2.005019e+10 1.921556e+10 1.946902e+10 NaN 2 Angola AGO GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 8.417803e+10 7.549238e+10 8.247091e+10 1.041160e+11 1.153980e+11 1.249120e+11 1.267770e+11 1.029620e+11 9.533511e+10 NaN 3 Albania ALB GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 1.288135e+10 1.204421e+10 1.192695e+10 1.289087e+10 1.231978e+10 1.277628e+10 1.322824e+10 1.133526e+10 1.186387e+10 NaN 4 Andorra AND GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 4.007353e+09 3.660531e+09 3.355695e+09 3.442063e+09 3.164615e+09 3.281585e+09 3.350736e+09 2.811489e+09 2.858518e+09 NaN ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 259 Kosovo XKX GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 5.687488e+09 5.653793e+09 5.829934e+09 6.649291e+09 6.473725e+09 7.072092e+09 7.386891e+09 6.440501e+09 6.649889e+09 NaN 260 Yemen, Rep. YEM GDP (current US$) NY.GDP.MKTP.CD NaN NaN NaN NaN NaN NaN ... 2.691085e+10 2.513027e+10 3.090675e+10 3.272642e+10 3.539315e+10 4.041523e+10 4.322858e+10 3.773392e+10 2.731761e+10 NaN 261 South Africa ZAF GDP (current US$) NY.GDP.MKTP.CD 7.575248e+09 7.972841e+09 8.497830e+09 9.423212e+09 1.037379e+10 1.133417e+10 ... 2.871000e+11 2.972170e+11 3.752980e+11 4.168780e+11 3.963330e+11 3.668100e+11 3.511190e+11 3.176110e+11 2.954560e+11 NaN 262 Zambia ZMB GDP (current US$) NY.GDP.MKTP.CD 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 ... 1.791086e+10 1.532834e+10 2.026556e+10 2.346010e+10 2.550337e+10 2.804546e+10 2.715063e+10 2.115439e+10 2.106399e+10 NaN 263 Zimbabwe ZWE GDP (current US$) NY.GDP.MKTP.CD 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 ... 4.415703e+09 8.621574e+09 1.014186e+10 1.209845e+10 1.424249e+10 1.545177e+10 1.589105e+10 1.630467e+10 1.661996e+10 NaN
264 rows × 62 columns
pdata = pd. read_csv( csv_path3, encoding= 'gbk' )
pdata
1990 1991 0 NaN NaN 1 NaN NaN 2 1.002674e+10 1.211861e+10 3 2.101625e+09 1.139167e+09 4 1.029048e+09 1.106929e+09 ... ... ... 259 NaN NaN 260 5.647252e+09 5.930370e+09 261 1.155530e+11 1.239430e+11 262 3.285217e+09 3.378882e+09 263 8.783817e+09 8.641482e+09
264 rows × 2 columns
3 缺失数据处理
3.1 缺失值与空值
缺省值:数据集中数值为空的值, pandas使用Nan
/ NaT
表示
空值:空字符串 ""
s1 = [ 10 , 10.5 , None , 11 ]
s2 = [ 7 , 6.9 , 7.5 , None ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
pdata
s1 s2 0 10.0 7.0 1 10.5 6.9 2 NaN 7.5 3 11.0 NaN
3.2 缺失值判断
判断方法:
pd.isnull()
:缺省值对应的值为True,返回值为Boolean的Series或者DataFrame对象pd.notnull()
:缺省值对应的值为False,返回值为Boolean的Series或者DataFrame对象pdata.isnull()
/ pdata.notnull()
:同上
sdata = pd. Series( [ 1 , 2 , 3 , np. NaN] )
pd. isnull( sdata)
0 False
1 False
2 False
3 True
dtype: bool
s1 = [ 10 , 10.5 , None , 11 ]
s2 = [ 7 , 6.9 , 7.5 , None ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
pd. isnull( pdata)
s1 s2 0 False False 1 False False 2 True False 3 False True
3.3 判断是否有缺失值
方式:np.all
与 pd.notnull
结合
s1 = [ 10 , 10.5 , None , 11 ]
s2 = [ 7 , 6.9 , 7.5 , None ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
np. all ( pd. notnull( pdata) )
False
s1 = [ 10 , 10.5 , 11 ]
s2 = [ 7 , 6.9 , 7.5 ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
np. all ( pd. notnull( pdata) )
True
方式:np.any
与 pd.isnull
结合
s1 = [ 10 , 10.5 , 11 ]
s2 = [ 7 , 6.9 , 7.5 ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
np. any ( pd. isnull( pdata) )
False
3.4 缺省值处理方式
缺省值处理:
过滤缺省值(按行列) 删除缺省值(按行列) 填充值,填充值方式: 插入均值,中位数,最大值,最小值等 插入特殊值 插入前(后)值入前(后)值
3.4.1 缺省值过滤
举例:某两只股票1周收盘值,None表示当前停盘
需求:获取两只股票都没有停牌的数据
s1 = [ 10 , 10.5 , None , 11 ]
s2 = [ 7 , 6.9 , 7.5 , None ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2} )
pdata
s1 s2 0 10.0 7.0 1 10.5 6.9 2 NaN 7.5 3 11.0 NaN
bindex = np. all ( pdata. notnull( ) , axis= 1 )
bindex
0 True
1 True
2 False
3 False
dtype: bool
pdata[ bindex]
3.4.2 删除缺省值
pdata.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
主要参数:
参数 说明 axis
0
或 'index'
:按行操作,1
或 'columns'
:按列操作how
根据 axis
指定操作方式,any
:只要有一个 Na 就删除,all
:全部为 Na 才删除 thresh
指定非 Na 值的数量,非 Na 数量大于等于 thresh
时不删除 subset
指定操作的列子集 inplace
True
:在原始数据中进行修改
准备数据:
s1 = [ 10 , 10.5 , None , 11 ]
s2 = [ 7 , 6.9 , 7.5 , None ]
s3 = [ 7 , 6.9 , 7.5 , 7 ]
s4 = [ None , 6.9 , None , 7.2 ]
pdata = pd. DataFrame( { 's1' : s1, 's2' : s2, 's3' : s3, 's4' : s4} )
pdata
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 2 NaN 7.5 7.5 NaN 3 11.0 NaN 7.0 7.2
需求:
删除包含缺省值的行 删除包含2个缺省值行 删除指定列包含缺省值 删除包含缺省值的列
pdata. dropna( )
pdata. dropna( thresh= 3 )
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 3 11.0 NaN 7.0 7.2
pdata. dropna( subset= [ 's1' , 's4' ] )
s1 s2 s3 s4 1 10.5 6.9 6.9 6.9 3 11.0 NaN 7.0 7.2
pdata. dropna( axis= 1 )
注意:
以上数据删除都不对原始数据进行修改 指定inplace为True,在原始数据中进行修改
3.4.3 缺失值填充
填充方法:
pdata.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
参数 说明 value
填充值 method
填充方式:{'backfill', 'bfill', 'pad', 'ffill', None}
axis
指定行列:0
或 'index'
表示按行,1
或 'columns'
表示按列 limit
插入数量限制
pdata
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 2 NaN 7.5 7.5 NaN 3 11.0 NaN 7.0 7.2
需求:
缺省值填充固定值0 使用前/后面数据填充 使用均值填充 插入均值插入均值
pdata. fillna( 0 )
s1 s2 s3 s4 0 10.0 7.0 7.0 0.0 1 10.5 6.9 6.9 6.9 2 0.0 7.5 7.5 0.0 3 11.0 0.0 7.0 7.2
pdata. fillna( method= 'ffill' )
pdata. ffill( )
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 2 10.5 7.5 7.5 6.9 3 11.0 7.5 7.0 7.2
pdata. fillna( method= 'bfill' )
pdata. bfill( )
s1 s2 s3 s4 0 10.0 7.0 7.0 6.9 1 10.5 6.9 6.9 6.9 2 11.0 7.5 7.5 7.2 3 11.0 NaN 7.0 7.2
pdata. fillna( axis= 1 , method= 'bfill' )
pdata. bfill( axis= 1 )
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 2 7.5 7.5 7.5 NaN 3 11.0 7.0 7.0 7.2
对于股票缺省值,我们倾向于,使用前一天数据填充缺失值
3.4.4 插入均值,中位数,最大值,最小值
pdata.mean/max/min/median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
主要参数:
参数 说明 axis
方向,0
表示按列,1
表示按行 skipna
是否忽略 NaN,True
表示不计算 NaN,默认为 True
pdata
s1 s2 s3 s4 0 10.0 7.0 7.0 NaN 1 10.5 6.9 6.9 6.9 2 NaN 7.5 7.5 NaN 3 11.0 NaN 7.0 7.2
插入均值
pdata. fillna( pdata. mean( ) )
s1 s2 s3 s4 0 10.0 7.000000 7.0 7.05 1 10.5 6.900000 6.9 6.90 2 10.5 7.500000 7.5 7.05 3 11.0 7.133333 7.0 7.20
插入中位数
pdata. fillna( pdata. median( ) )
s1 s2 s3 s4 0 10.0 7.0 7.0 7.05 1 10.5 6.9 6.9 6.90 2 10.5 7.5 7.5 7.05 3 11.0 7.0 7.0 7.20
4 数据清洗
4.1 准备数据
某次考试成绩
import pandas as pd
import numpy as np
names = list ( 'ABCD' )
math = [ 90 , 100 , 50 , 80 ]
chinese = [ 89 , 96 , 58 , 77 ]
pdata = pd. DataFrame( { 'name' : names, 'math' : math, 'chinese' : chinese} )
pdata
name math chinese 0 A 90 89 1 B 100 96 2 C 50 58 3 D 80 77
4.2 获取指定列
通过列名,直接获取值
pdata[ 'name' ]
0 A
1 B
2 C
3 D
Name: name, dtype: object
4.3 获取指定多列
基本方法:pdata[[col1, col2]]
获取姓名与数学成绩
pdata[ [ 'name' , 'math' ] ]
name math 0 A 90 1 B 100 2 C 50 3 D 80
4.4 根据指定条件获取数据
需求1:数学成绩大于80的所有成绩;
实现思路:
根据条件生成boolean索引 通过boolean索引获取数据
bindex = pdata[ 'math' ] > 80
pdata[ bindex]
name math chinese 0 A 90 89 1 B 100 96
需求2:获取同学A的成绩
实现思路:
根据条件生成boolean索引 通过boolean索引获取数据
pdata[ pdata[ 'name' ] == 'A' ]
4.5 根据指定多个条件获取数据
需求1:获取数学语文都及格成绩
条件1:数学成绩大于59, 条件2:语文成绩大于59 条件3:两个条件与操作:& 基本语法:pdata[condition1&condition2]
pdata[ ( pdata[ 'math' ] > 59 ) & ( pdata[ 'chinese' ] > 59 ) ]
name math chinese 0 A 90 89 1 B 100 96 3 D 80 77
需求2:获取数学语文有一门大于等于80分
条件1:数学成绩大于等于80, 条件2:语文成绩大于等于80 条件3:两个条件与操作:| 基本语法:pdata[condition1|condition2]
pdata[ ( pdata[ 'math' ] >= 80 ) | ( pdata[ 'chinese' ] >= 80 ) ]
name math chinese 0 A 90 89 1 B 100 96 3 D 80 77
bindex = pdata[ 'math' ] . isin( [ 100 , 90 ] )
pdata[ bindex]
name math chinese 0 A 90 89 1 B 100 96
根据集合获取数据
获取数学成绩为100或者90的学生成绩
多个值判断: pdata.isin(values),返回boolean索引
bindex = pdata[ 'math' ] . isin( [ 100 , 90 ] )
pdata[ bindex]
name math chinese 0 A 90 89 1 B 100 96
4.6 根据数据排序
排序方式1:根据索引排序
pdata.sort_index(axis=0,level=None,ascending=True,...)
排序方式2:根据指定列内容排序
pdata.sort_values(by,axis=0,ascending=True,...)
import pandas as pd
import numpy as np
names = list ( 'ABCD' )
math = [ 90 , 100 , 80 , 80 ]
chinese = [ 89 , 96 , 58 , 77 ]
pdata = pd. DataFrame( { 'name' : names, 'math' : math, 'chinese' : chinese} )
pdata
name math chinese 0 A 90 89 1 B 100 96 2 C 80 58 3 D 80 77
pdata. sort_index( ascending= False )
name math chinese 3 D 80 77 2 C 80 58 1 B 100 96 0 A 90 89
pdata. sort_values( [ 'math' ] , ascending= False )
name math chinese 1 B 100 96 0 A 90 89 2 C 80 58 3 D 80 77
pdata. sort_values( [ 'math' , 'chinese' ] , ascending= False )
name math chinese 1 B 100 96 0 A 90 89 3 D 80 77 2 C 80 58
5 pandas汇总与描述性统计
pandas计算与统计相关方法:
最大值:pdata.max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
最小值:pdata.min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
均值:pdata.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
中位数:pdata.median(axis=None,skipna=None,level=None,numeric_only=None, **kwargs)
求和:pdata.sum(axis=None,skipna=None,level=None,numeric_only=None,min_count=0,**kwargs)
方差:pdata.var(axis=None,skipna=None,level=None,ddof=1,numeric_only=None,**kwargs)
标准差:pdata.std(axis=None,skipna=None,level=None,ddof=1,numeric_only=None,**kwargs)
累加和:pdata.cumsum(axis=None, skipna=True, *args, **kwargs)
分位数:pdata.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
每个数值到均值的平均差:pdata.mad(axis=None, skipna=None, level=None)
元素与先前元素的相差百分比:pdata.pct_change(periods=1,fill_method='pad',limit=None,freq=None,**kwargs)
偏度:pdata.skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
峰度:pdata.kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
数据描述:pdata.describe(percentiles=None, include=None, exclude=None)
产生数据
import pandas as pd
import numpy as np
names = list ( 'ABCD' )
math = [ 90 , 100 , 80 , 80 ]
chinese = [ 89 , 96 , 58 , 77 ]
pdata = pd. DataFrame( { 'name' : names, 'math' : math, 'chinese' : chinese} )
pdata
name math chinese 0 A 90 89 1 B 100 96 2 C 80 58 3 D 80 77
基本描述与计算
pdata. describe( )
math chinese count 4.000000 4.0000 mean 87.500000 80.0000 std 9.574271 16.6333 min 80.000000 58.0000 25% 80.000000 72.2500 50% 85.000000 83.0000 75% 92.500000 90.7500 max 100.000000 96.0000
print ( pdata. iloc[ : , 1 : ] . sum ( axis= 1 ) )
print ( pdata. iloc[ : , 1 : ] . mean( axis= 1 ) )
0 179
1 196
2 138
3 157
dtype: int64
0 89.5
1 98.0
2 69.0
3 78.5
dtype: float64
分位数
计算:下四分位数,中位数,上四分位数
print ( pdata. iloc[ : , 1 : ] . quantile( q= 0.25 ) )
print ( pdata. iloc[ : , 1 : ] . quantile( q= 0.5 ) )
print ( pdata. iloc[ : , 1 : ] . quantile( q= 0.75 ) )
math 80.00
chinese 72.25
Name: 0.25, dtype: float64
math 85.0
chinese 83.0
Name: 0.5, dtype: float64
math 92.50
chinese 90.75
Name: 0.75, dtype: float64
6 索引/多级索引
主要内容:
import pandas as pd
import numpy as np
names = list ( 'ABCDABCD' )
pdata = pd. DataFrame( np. random. randint( 30 , 100 , size= ( 8 , 2 ) ) , columns= [ 'math' , 'chinese' ] )
pdata[ 'team' ] = [ 1 ] * 4 + [ 2 ] * 4
pdata[ 'name' ] = names
pdata
math chinese team name 0 58 67 1 A 1 63 97 1 B 2 55 86 1 C 3 54 34 1 D 4 75 57 2 A 5 49 64 2 B 6 83 88 2 C 7 67 75 2 D
6.1 设置索引
参数 说明 keys
指定索引名称,可以是多列 drop
True
:删除列数据,False
:保留列数据append
True
:在原有列基础上追加,False
:替代原有列inplace
True
:在原数据中进行修改,False
:返回新数据
ndata = pdata. set_index( 'name' )
ndata
math chinese team name A 58 67 1 B 63 97 1 C 55 86 1 D 54 34 1 A 75 57 2 B 49 64 2 C 83 88 2 D 67 75 2
6.2 多级索引
需求:通过索引获取指定学期数据
pd.MultiIndex.from_arrays(arrays, sortorder=None, names=None)
import pandas as pd
import numpy as np
names = list ( 'ABCDABCD' )
pdata = pd. DataFrame( np. random. randint( 30 , 100 , size= ( 8 , 2 ) ) , columns= [ 'math' , 'chinese' ] )
pdata
math chinese 0 53 71 1 43 98 2 66 38 3 87 69 4 92 95 5 80 53 6 62 98 7 84 64
lv1 = [ 1 ] * 4 + [ 2 ] * 4
lv2 = list ( 'ABCDABCD' )
mindex = pd. MultiIndex. from_arrays( [ lv1, lv2] )
data = pdata. set_index( mindex)
data
math chinese 1 A 53 71 B 43 98 C 66 38 D 87 69 2 A 92 95 B 80 53 C 62 98 D 84 64
6.3 通过多级索引取值
需求:
获取第一学期数据 获取第一学期A同学数据 获取A同学所有数据有数据
data. loc[ 1 ]
math chinese A 53 71 B 43 98 C 66 38 D 87 69
data. loc[ ( 1 , 'A' ) ]
math 53
chinese 71
Name: (1, A), dtype: int32
data = data. swaplevel( 0 , 1 )
data
math chinese A 1 53 71 B 1 43 98 C 1 66 38 D 1 87 69 A 2 92 95 B 2 80 53 C 2 62 98 D 2 84 64
data. loc[ 'A' ]
6.4 行列变换
pdata.stack(level=-1, dropna=True)
:将列“旋转”为行pdata.unstack(level=-1, fill_value=None)
:将行“旋转”为列
import pandas as pd
import numpy as np
names = list ( 'ABCDABCD' )
pdata = pd. DataFrame( np. random. randint( 30 , 100 , size= ( 8 , 2 ) ) , columns= [ 'math' , 'chinese' ] )
lv1 = [ 1 ] * 4 + [ 2 ] * 4
lv2 = list ( 'ABCDABCD' )
mindex = pd. MultiIndex. from_arrays( [ lv1, lv2] )
pdata = pdata. set_index( mindex)
pdata
math chinese 1 A 84 96 B 75 70 C 86 55 D 97 98 2 A 40 37 B 79 43 C 72 64 D 82 37
需求:
将学期转到列 获取所有学生数学成绩 获取第一学期数据数据
将学期转到列
tmp = pdata. unstack( level= 0 )
tmp
math chinese 1 2 1 2 A 84 40 96 37 B 75 79 70 43 C 86 72 55 64 D 97 82 98 37
获取所有学生数学成绩
tmp[ 'math' ]
获取第一学期数据
tmp. swaplevel( axis= 1 ) [ 1 ]
math chinese A 84 96 B 75 70 C 86 55 D 97 98
tmp
math chinese 1 2 1 2 A 84 40 96 37 B 75 79 70 43 C 86 72 55 64 D 97 82 98 37
t = tmp. stack( level= 0 )
t
1 2 A chinese 96 37 math 84 40 B chinese 70 43 math 75 79 C chinese 55 64 math 86 72 D chinese 98 37 math 97 82
t. loc[ 'A' ] [ 1 ]
chinese 96
math 84
Name: 1, dtype: int32
t. sort_index( level = 1 )
1 2 A chinese 96 37 B chinese 70 43 C chinese 55 64 D chinese 98 37 A math 84 40 B math 75 79 C math 86 72 D math 97 82
7 时间与时间序列
时间是数据分析重要维度,pandas中时间主要知识点:
7.1 时间戳
时间处理中常见的对象;
时间戳方法:
pd.Timestamp(ts_input=<object object at 0x00000258D4E907F0>,freq=None, tz=None,...)
,详情见说明案例pd.to_datetime(arg,errors='raise', dayfirst=False,...)
print ( pd. Timestamp( 2020 ) )
print ( pd. Timestamp( 2020 , 6 , 2 ) )
print ( pd. Timestamp( '2020-05-07' ) )
print ( pd. Timestamp( '2020-05-07 04:02:01' ) )
print ( pd. Timestamp( '2017-01-01T12' ) )
print ( pd. Timestamp( 1513393355.5 , unit= 's' ) )
print ( pd. to_datetime( '2017-02-01' , format = "%Y-%m-%d" ) )
print ( pd. to_datetime( '20170301' , format = "%Y%m%d" ) )
1970-01-01 00:00:00.000002020
2020-06-02 00:00:00
2020-05-07 00:00:00
2020-05-07 04:02:01
2017-01-01 12:00:00
2017-12-16 03:02:35.500000
2017-02-01 00:00:00
2017-03-01 00:00:00
print ( pd. to_datetime( '2017-02-01' , format = "%Y-%m-%d" ) )
print ( pd. to_datetime( [ '2017-02-01' ] , format = "%Y-%m-%d" ) )
2017-02-01 00:00:00
DatetimeIndex(['2017-02-01'], dtype='datetime64[ns]', freq=None)
pd. Timestamp. now( )
Timestamp('2024-02-01 10:45:29.038280')
7.2 时间索引
pd.date_range(start=None,end=None,periods=None,freq=None,tz=None, ... **kwargs)
:生成DatetimeIndex
官方文档:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html)
参数 说明 start
开始时间 end
结束时间 periods
产生周期数量 freq
间隔,默认为 D
(天),可选其他频率,例如:H
(小时)、M
(分钟)等 closed
闭区间:None
、'left'
、'right'
freq的主要参数:
参数 说明 M
每月最后一天 MS
每月第一天 D
天 H
小时 T
分钟(min) S
秒 Q
季度
print ( pd. date_range( '2017-01-01' , periods= 2 ) )
print ( pd. date_range( '2017-01-01 02' , periods= 2 , freq= 'H' ) )
print ( pd. date_range( '2017-01' , periods= 3 , freq= 'MS' ) )
print ( pd. date_range( '2017-01' , periods= 3 , freq= 'M' ) )
DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2017-01-01 02:00:00', '2017-01-01 03:00:00'], dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2017-01-01', '2017-02-01', '2017-03-01'], dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2017-01-31', '2017-02-28', '2017-03-31'], dtype='datetime64[ns]', freq='M')
7.3 周期
通过时间段与固定时间间隔一系列时间;
周期作用:可以获取指定年,指定月,指定日等的数据
周期:pd.Period(value=None,freq=None,ordinal=None,year=None,...)
周期序列:pd.period_range(start=None, end=None, periods=None, freq=None, name=None)
:生成PeriodIndex pdata.to_period(freq=None, axis=0, copy=True)
import pandas as pd
import numpy as np
print ( pd. Timestamp( '2017-01-02' ) )
print ( pd. Period( '2017-01-02' ) )
print ( pd. Period( '2017-01' ) )
print ( pd. Period( '2017' ) )
2017-01-02 00:00:00
2017-01-02
2017-01
2017
print ( pd. period_range( '2017-01-01' , periods= 2 ) )
print ( pd. period_range( '2017-01' , periods= 2 , freq= 'M' ) )
print ( pd. period_range( '2017-01-02' , periods= 2 , freq= 'H' ) )
PeriodIndex(['2017-01-01', '2017-01-02'], dtype='period[D]')
PeriodIndex(['2017-01', '2017-02'], dtype='period[M]')
PeriodIndex(['2017-01-02 00:00', '2017-01-02 01:00'], dtype='period[H]')
tmp = pd. to_datetime( [ '2017-01-02' , '2017-03-04' ] )
tmp. to_period( 'M' )
PeriodIndex(['2017-01', '2017-03'], dtype='period[M]')
7.4 时间索引
很多数据及数据集中都会有时间维度,可以将其设置为时间索引;
内容:
获取指定时间数据 获取时间段数据 获取某个时期数据时期数据
import pandas as pd
import numpy as np
ts = [ '2019-03-25' , '2019-03-26' , '2019-03-27' , '2019-03-28' , '2019-03-29' , '2019-03-30' , '2019-03-31' ,
'2019-04-01' , '2019-04-02' , '2019-04-03' , '2019-04-04' , '2019-05-01' , '2019-04-05' , ]
values = np. arange( len ( ts) )
pdata = pd. DataFrame( { 'ts' : ts, 'values' : values} )
pdata
ts values 0 2019-03-25 0 1 2019-03-26 1 2 2019-03-27 2 3 2019-03-28 3 4 2019-03-29 4 5 2019-03-30 5 6 2019-03-31 6 7 2019-04-01 7 8 2019-04-02 8 9 2019-04-03 9 10 2019-04-04 10 11 2019-05-01 11 12 2019-04-05 12
pdata. ts. values
array(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
'2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
'2019-04-05'], dtype=object)
需求:
获取三月份数据 获取4月1号到4号数据 获取第2季度数据
tindex = pd. to_datetime( pdata[ 'ts' ] )
tindex
0 2019-03-25
1 2019-03-26
2 2019-03-27
3 2019-03-28
4 2019-03-29
5 2019-03-30
6 2019-03-31
7 2019-04-01
8 2019-04-02
9 2019-04-03
10 2019-04-04
11 2019-05-01
12 2019-04-05
Name: ts, dtype: datetime64[ns]
tmp = pdata. set_index( tindex)
tmp. index
DatetimeIndex(['2019-03-25', '2019-03-26', '2019-03-27', '2019-03-28',
'2019-03-29', '2019-03-30', '2019-03-31', '2019-04-01',
'2019-04-02', '2019-04-03', '2019-04-04', '2019-05-01',
'2019-04-05'],
dtype='datetime64[ns]', name='ts', freq=None)
tmp. loc[ '2019-03' ]
ts values ts 2019-03-25 2019-03-25 0 2019-03-26 2019-03-26 1 2019-03-27 2019-03-27 2 2019-03-28 2019-03-28 3 2019-03-29 2019-03-29 4 2019-03-30 2019-03-30 5 2019-03-31 2019-03-31 6
tmp = tmp. sort_index( )
tmp. loc[ '2019-04-01' : '2019-04-05' ]
ts values ts 2019-04-01 2019-04-01 7 2019-04-02 2019-04-02 8 2019-04-03 2019-04-03 9 2019-04-04 2019-04-04 10 2019-04-05 2019-04-05 12
qindex = tmp. index. to_period( 'Q' )
qdata = tmp. set_index( qindex)
qdata. loc[ '2019Q2' ]
ts values ts 2019Q2 2019-04-01 7 2019Q2 2019-04-02 8 2019Q2 2019-04-03 9 2019Q2 2019-04-04 10 2019Q2 2019-04-05 12 2019Q2 2019-05-01 11
tmp. index. to_period( 'Q' )
PeriodIndex(['2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1', '2019Q1',
'2019Q1', '2019Q2', '2019Q2', '2019Q2', '2019Q2', '2019Q2',
'2019Q2'],
dtype='period[Q-DEC]', name='ts')
7.5 时间差值
timedalte:两个datetime值之间的差(如日,秒和微妙)的类型
pd.Timedelta(value=<object object at 0x0000023DD1424CC0>, unit=None, **kwargs
)
pd. Timestamp( '2019-01-02' ) - pd. Timestamp( '2019-01-01' )
Timedelta('1 days 00:00:00')
pd. Timedelta( 5 , 'T' )
Timedelta('0 days 00:05:00')
7.6 重采样
重采样作用:
降低采样率 提升采样率 方法:pdata.resample(rule,how=None,axis=0,fill_method=None,closed=None,label=None,convention='start'...)
参数 说明 rule
规则,例如:'T'
(分钟)、'M'
(月份)等 fill_method
提升采样率填充方式,'ffill'
(向前填充)、'bfill'
(向后填充)等 closed
降低采样率时的闭合方式,'right'
或 'left'
,默认为 'right'
label
降低采样率时的聚合值标签,{'right', 'left'}
loffset
时间偏差,用于调整聚合后的时间索引,可以是 timedelta
类型 kind
聚合方式,'period'
或 'timestamp'
,默认聚合到时间索引
应用场景:股票分析,金融等;
import pandas as pd
import numpy as np
index = pd. date_range( '1/1/2000' , periods= 9 , freq= '2T' )
series = pd. Series( range ( 9 ) , index= index)
series
2000-01-01 00:00:00 0
2000-01-01 00:02:00 1
2000-01-01 00:04:00 2
2000-01-01 00:06:00 3
2000-01-01 00:08:00 4
2000-01-01 00:10:00 5
2000-01-01 00:12:00 6
2000-01-01 00:14:00 7
2000-01-01 00:16:00 8
Freq: 2T, dtype: int64
s = series. resample( '4T' )
s. groups
{Timestamp('2000-01-01 00:00:00'): 2,
Timestamp('2000-01-01 00:04:00'): 4,
Timestamp('2000-01-01 00:08:00'): 6,
Timestamp('2000-01-01 00:12:00'): 8,
Timestamp('2000-01-01 00:16:00'): 9}
Resampler对象相关方法:
方法 说明 s.groups
返回 Resampler 对象,字典 s.max()
降频分组后的最大值 s.min()
降频分组后的最小值 s.first()
降频分组后的第一个值 s.last()
降频分组后的最后一个值 s.mean()
降频分组后的均值 s.median()
降频分组后的中位数
s. first( )
2000-01-01 00:00:00 0
2000-01-01 00:04:00 2
2000-01-01 00:08:00 4
2000-01-01 00:12:00 6
2000-01-01 00:16:00 8
Freq: 4T, dtype: int64
s. max ( )
2000-01-01 00:00:00 1
2000-01-01 00:04:00 3
2000-01-01 00:08:00 5
2000-01-01 00:12:00 7
2000-01-01 00:16:00 8
Freq: 4T, dtype: int64
s. mean( )
2000-01-01 00:00:00 0.5
2000-01-01 00:04:00 2.5
2000-01-01 00:08:00 4.5
2000-01-01 00:12:00 6.5
2000-01-01 00:16:00 8.0
Freq: 4T, dtype: float64
rd = series. resample( 'T' )
rd. bfill( )
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 1
2000-01-01 00:03:00 2
2000-01-01 00:04:00 2
2000-01-01 00:05:00 3
2000-01-01 00:06:00 3
2000-01-01 00:07:00 4
2000-01-01 00:08:00 4
2000-01-01 00:09:00 5
2000-01-01 00:10:00 5
2000-01-01 00:11:00 6
2000-01-01 00:12:00 6
2000-01-01 00:13:00 7
2000-01-01 00:14:00 7
2000-01-01 00:15:00 8
2000-01-01 00:16:00 8
Freq: T, dtype: int64
7.7 时间迁移
时间序列常用操作:对数据按照时间进行迁移 迁移数据:df.shift(periods=1, freq=None, axis=0, fill_value=None)
迁移索引:df.tshift(periods=1, freq=None, axis=0)
——该方法已被弃用
index = pd. date_range( '1/1/2000' , periods= 9 , freq= 'D' )
series = pd. Series( range ( 9 ) , index= index)
series
2000-01-01 0
2000-01-02 1
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 5
2000-01-07 6
2000-01-08 7
2000-01-09 8
Freq: D, dtype: int64
series. shift( 1 , freq= 'T' )
2000-01-01 00:01:00 0
2000-01-02 00:01:00 1
2000-01-03 00:01:00 2
2000-01-04 00:01:00 3
2000-01-05 00:01:00 4
2000-01-06 00:01:00 5
2000-01-07 00:01:00 6
2000-01-08 00:01:00 7
2000-01-09 00:01:00 8
Freq: D, dtype: int64
series. shift( 1 )
2000-01-01 NaN
2000-01-02 0.0
2000-01-03 1.0
2000-01-04 2.0
2000-01-05 3.0
2000-01-06 4.0
2000-01-07 5.0
2000-01-08 6.0
2000-01-09 7.0
Freq: D, dtype: float64
series. shift( 1 , freq= 'D' )
2000-01-02 0
2000-01-03 1
2000-01-04 2
2000-01-05 3
2000-01-06 4
2000-01-07 5
2000-01-08 6
2000-01-09 7
2000-01-10 8
Freq: D, dtype: int64
8 数据清洗
数据清洗方式:
获取某列的唯一值:Series.unique()
每个值出现次数:Series.value_counts()
删除指定行列:pdata.drop(labels=None,axis=0,index=None,columns=None, level=None,inplace=False,errors='raise')
去重:pdata.drop_duplicates(subset=None, keep='first', inplace=False)
产生数据
import pandas as pd
import numpy as np
ts = [ '2019-03-25' , '2019-03-26' , '2019-03-26' , '2019-03-26' , '2019-03-29' , '2019-03-30' , '2019-03-31' ,
'2019-04-01' , '2019-04-02' , '2019-04-03' , '2019-04-04' , '2019-05-01' , '2019-04-05' , ]
values = np. arange( len ( ts) )
pdata = pd. DataFrame( { 'ts' : ts, 'values' : values} )
pdata
ts values 0 2019-03-25 0 1 2019-03-26 1 2 2019-03-26 2 3 2019-03-26 3 4 2019-03-29 4 5 2019-03-30 5 6 2019-03-31 6 7 2019-04-01 7 8 2019-04-02 8 9 2019-04-03 9 10 2019-04-04 10 11 2019-05-01 11 12 2019-04-05 12
唯一值与数值出现次数
pdata. ts. unique( )
array(['2019-03-25', '2019-03-26', '2019-03-29', '2019-03-30',
'2019-03-31', '2019-04-01', '2019-04-02', '2019-04-03',
'2019-04-04', '2019-05-01', '2019-04-05'], dtype=object)
pdata. ts. value_counts( )
ts
2019-03-26 3
2019-03-25 1
2019-03-29 1
2019-03-30 1
2019-03-31 1
2019-04-01 1
2019-04-02 1
2019-04-03 1
2019-04-04 1
2019-05-01 1
2019-04-05 1
Name: count, dtype: int64
删除指定行列
print ( pdata. drop( 0 ) )
ts values
1 2019-03-26 1
2 2019-03-26 2
3 2019-03-26 3
4 2019-03-29 4
5 2019-03-30 5
6 2019-03-31 6
7 2019-04-01 7
8 2019-04-02 8
9 2019-04-03 9
10 2019-04-04 10
11 2019-05-01 11
12 2019-04-05 12
print ( pdata. drop( index= [ 1 , 2 , 3 ] ) )
ts values
0 2019-03-25 0
4 2019-03-29 4
5 2019-03-30 5
6 2019-03-31 6
7 2019-04-01 7
8 2019-04-02 8
9 2019-04-03 9
10 2019-04-04 10
11 2019-05-01 11
12 2019-04-05 12
print ( pdata. drop( columns= 'ts' ) )
values
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11
12 12
pdata. drop( index= 0 , columns= 'ts' )
values 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12
去重
pdata. drop_duplicates( 'ts' )
ts values 0 2019-03-25 0 1 2019-03-26 1 4 2019-03-29 4 5 2019-03-30 5 6 2019-03-31 6 7 2019-04-01 7 8 2019-04-02 8 9 2019-04-03 9 10 2019-04-04 10 11 2019-05-01 11 12 2019-04-05 12
pdata. drop_duplicates( subset= 'ts' , keep= 'last' )
ts values 0 2019-03-25 0 3 2019-03-26 3 4 2019-03-29 4 5 2019-03-30 5 6 2019-03-31 6 7 2019-04-01 7 8 2019-04-02 8 9 2019-04-03 9 10 2019-04-04 10 11 2019-05-01 11 12 2019-04-05 12
bindex = pdata. duplicated( subset= 'ts' )
pdata[ bindex== False ]
ts values 0 2019-03-25 0 1 2019-03-26 1 4 2019-03-29 4 5 2019-03-30 5 6 2019-03-31 6 7 2019-04-01 7 8 2019-04-02 8 9 2019-04-03 9 10 2019-04-04 10 11 2019-05-01 11 12 2019-04-05 12
9 数据合并
目的:根据需求,合并多个数据集
9.1 merge
将不同数据集根据指定字段进行合并得到新的数据集。
pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False, right_index=False,sort=False,suffixes=('_x', '_y'),copy=True,indicator=False,validate=None)
主要参数:
参数 说明 left
左侧数据集 right
右侧数据集 how
合并方式 on
索引或者列名,用于连接数据集 left_on
数据连接,左侧数据集的列名 right_on
数据连接,右侧数据集的列名 left_index
使用左侧数据集的索引进行连接 right_index
使用右侧数据集的索引进行连接 sort
是否根据连接键排序 suffixes
合并后相同列名的后缀
合并方式:
参数 说明 left
左连接 right
右连接 outer
外连接 inner
内连接
准备数据
import pandas as pd
import numpy as np
n1 = list ( 'ABCE' )
n2 = list ( 'ABCD' )
d1 = [ 90 , 80 , 100 , 69 ]
d2 = [ 95 , 78 , 96 , 72 ]
c1 = [ '001' , '001' , '002' , '002' ]
c2 = [ '001' , '001' , '002' , '002' ]
df1 = pd. DataFrame( { 'name' : n1, 'math' : d1, 'class' : c1} )
df2 = pd. DataFrame( { 'name' : n2, 'chinese' : d2, 'class' : c2, 'pname' : n1} )
df1
name math class 0 A 90 001 1 B 80 001 2 C 100 002 3 E 69 002
df2
name chinese class pname 0 A 95 001 A 1 B 78 001 B 2 C 96 002 C 3 D 72 002 E
数据合并
pd. merge( df1, df2)
name math class chinese pname 0 A 90 001 95 A 1 B 80 001 78 B 2 C 100 002 96 C
pd. merge( df1, df2, on = 'class' )
name_x math class name_y chinese pname 0 A 90 001 A 95 A 1 A 90 001 B 78 B 2 B 80 001 A 95 A 3 B 80 001 B 78 B 4 C 100 002 C 96 C 5 C 100 002 D 72 E 6 E 69 002 C 96 C 7 E 69 002 D 72 E
pd. merge( df1, df2, on = 'name' , how= 'outer' )
name math class_x chinese class_y pname 0 A 90.0 001 95.0 001 A 1 B 80.0 001 78.0 001 B 2 C 100.0 002 96.0 002 C 3 E 69.0 002 NaN NaN NaN 4 D NaN NaN 72.0 002 E
pd. merge( df1, df2, on = 'name' , how= 'left' )
name math class_x chinese class_y pname 0 A 90 001 95.0 001 A 1 B 80 001 78.0 001 B 2 C 100 002 96.0 002 C 3 E 69 002 NaN NaN NaN
pd. merge( df1, df2, on = 'name' , how= 'right' )
name math class_x chinese class_y pname 0 A 90.0 001 95 001 A 1 B 80.0 001 78 001 B 2 C 100.0 002 96 002 C 3 D NaN NaN 72 002 E
pd. merge( df1, df2, left_on = 'name' , right_on= 'pname' )
name_x math class_x name_y chinese class_y pname 0 A 90 001 A 95 001 A 1 B 80 001 B 78 001 B 2 C 100 002 C 96 002 C 3 E 69 002 D 72 002 E
9.2 join方法
join方法:DateFrame对象方法,与megre方法类似,how的方式默认为left
df1.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
主要参数:
参数 说明 on
指定连接列 how
拼接方式 lsuffix
左侧 DataFrame 的列名后缀 rsuffix
右侧 DataFrame 的列名后缀
df1
name math class 0 A 90 001 1 B 80 001 2 C 100 002 3 E 69 002
df2
name chinese class pname 0 A 95 001 A 1 B 78 001 B 2 C 96 002 C 3 D 72 002 E
df1. join( df2, lsuffix= '_x' , rsuffix= '_y' )
name_x math class_x name_y chinese class_y pname 0 A 90 001 A 95 001 A 1 B 80 001 B 78 001 B 2 C 100 002 C 96 002 C 3 E 69 002 D 72 002 E
df1. join( df2. set_index( 'class' ) , lsuffix= '_x' , rsuffix= '_y' , on = 'class' )
name_x math class name_y chinese pname 0 A 90 001 A 95 A 0 A 90 001 B 78 B 1 B 80 001 A 95 A 1 B 80 001 B 78 B 2 C 100 002 C 96 C 2 C 100 002 D 72 E 3 E 69 002 C 96 C 3 E 69 002 D 72 E
9.3 concat
concat:根据设置轴与条件,将两个数据进行拼接
pd.concat(objs,axis=0,join='outer',join_axes=None,ignore_index=False,keys=None,levels=None,names=None...)
参数说明:
参数 说明 objs
一个包含 Series 列表、DataFrame 列表或字典的列表 axis
指定拼接轴,0
表示按索引(index),1
表示按列(columns) join
连接方式,inner
表示交集,outer
表示并集 ignore_index
是否不使用并置轴上的索引值,True
表示不使用 join_axes
用于指定连接轴的 Index 对象列表 keys
序列,用于构建层次化索引(MultiIndex) levels
多级索引的特定值 names
列表,用于指定层级索引的名称
pd. concat( [ df1, df2] , sort= True )
chinese class math name pname 0 NaN 001 90.0 A NaN 1 NaN 001 80.0 B NaN 2 NaN 002 100.0 C NaN 3 NaN 002 69.0 E NaN 0 95.0 001 NaN A A 1 78.0 001 NaN B B 2 96.0 002 NaN C C 3 72.0 002 NaN D E
pd. concat( [ df1, df2] , axis= 1 )
name math class name chinese class pname 0 A 90 001 A 95 001 A 1 B 80 001 B 78 001 B 2 C 100 002 C 96 002 C 3 E 69 002 D 72 002 E
pd. concat( [ df1, df2] , sort= True , keys= [ 'p1' , 'p2' ] )
chinese class math name pname p1 0 NaN 001 90.0 A NaN 1 NaN 001 80.0 B NaN 2 NaN 002 100.0 C NaN 3 NaN 002 69.0 E NaN p2 0 95.0 001 NaN A A 1 78.0 001 NaN B B 2 96.0 002 NaN C C 3 72.0 002 NaN D E
pd. concat( [ df1, df2] , axis= 1 )
name math class name chinese class pname 0 A 90 001 A 95 001 A 1 B 80 001 B 78 001 B 2 C 100 002 C 96 002 C 3 E 69 002 D 72 002 E
pd. concat( [ df1, df2] , axis= 1 , keys= [ 's1' , 's2' ] )
s1 s2 name math class name chinese class pname 0 A 90 001 A 95 001 A 1 B 80 001 B 78 001 B 2 C 100 002 C 96 002 C 3 E 69 002 D 72 002 E
t1 = df1. set_index( 'name' )
t2 = df2. set_index( 'name' )
pd. concat( [ t1, t2] , axis= 1 , sort= True )
math class chinese class pname name A 90.0 001 95.0 001 A B 80.0 001 78.0 001 B C 100.0 002 96.0 002 C D NaN NaN 72.0 002 E E 69.0 002 NaN NaN NaN
t1 = df1. set_index( 'name' )
t2 = df2. set_index( 'name' )
pd. concat( [ t1, t2] , axis= 1 , sort= True , join= 'inner' )
math class chinese class pname name A 90 001 95 001 A B 80 001 78 001 B C 100 002 96 002 C
10 pandas数据处理常用函数
目的:掌握apply, agg,str等函数,对数据灵活处理
10.1 apply函数
对pandas中DataFrame或者Series中每个数据进行处理 apply方法:df.apply(func,axis=0,broadcast=None, raw=False,reduce=None, result_type=None,args=(), **kwds)
参数 说明 func
处理函数,用于处理一系列值 axis
轴设置,指定处理函数的轴,0
表示按列,1
表示按行
import pandas as pd
import numpy as np
n = list ( 'ABCD' )
math = [ 90 , 80 , 47 , 69 ]
chinese = [ 95 , 78 , 96 , 59 ]
nclass = [ '001' , '001' , '002' , '002' ]
df = pd. DataFrame( { 'name' : n, 'math' : math, 'chinese' : chinese, 'class' : nclass} )
df
name math chinese class 0 A 90 95 001 1 B 80 78 001 2 C 47 96 002 3 D 69 59 002
name = df[ 'name' ]
name
0 A
1 B
2 C
3 D
Name: name, dtype: object
name. apply ( lambda x: x+ x)
0 AA
1 BB
2 CC
3 DD
Name: name, dtype: object
df[ [ 'math' , 'chinese' ] ] . apply ( lambda x : x> 59 )
math chinese 0 True True 1 True True 2 False True 3 True False
10.2 func处理对象
def func ( value) :
print ( type ( value) )
return np. mean( value)
df[ [ 'math' , 'chinese' ] ] . apply ( func)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
math 71.5
chinese 82.0
dtype: float64
def func1 ( x) :
return x- x. mean( )
df[ [ 'math' , 'chinese' ] ] . apply ( func1)
math chinese 0 18.5 13.0 1 8.5 -4.0 2 -24.5 14.0 3 -2.5 -23.0
def func_min ( x) :
return x- x. min ( )
sdata = df[ [ 'math' , 'chinese' ] ]
sdata. apply ( func_min)
math chinese 0 43 36 1 33 19 2 0 37 3 22 0
10.3 map
适用于Series对象或Df的某一列 map方法:Series.map(arg, na_action=None)
对Series中的每个数值进行处理处理
df[ 'math' ] . map ( lambda x: 'pass' if x > 59 else 'failed' )
0 pass
1 pass
2 failed
3 pass
Name: math, dtype: object
func = lambda x: 'pass' if x > 59 else 'failed'
df[ [ 'math' , 'chinese' ] ] . apply ( lambda x : x. map ( func) )
math chinese 0 pass pass 1 pass pass 2 failed pass 3 pass failed
10.4 replace:替换
对当前数据集中指定数据进行替换 方法:df.replace(to_replace=None,value=None,inplace=False,limit=None,regex=False,method='pad')
主要参数:
参数 说明 to_replace
替换值,可以是字符串、正则表达式、列表等 value
替换目标值 limit
替换次数的限制 inplace
是否替换原数据,True
表示替换原数据 regex
是否使用正则表达式进行替换,需要设置为 True
df. replace( 'A' , 'a' )
name math chinese class 0 a 90 95 001 1 B 80 78 001 2 C 47 96 002 3 D 69 59 002
df. replace( [ 'A' , 'B' ] , '*' )
name math chinese class 0 * 90 95 001 1 * 80 78 001 2 C 47 96 002 3 D 69 59 002
df. replace( list ( 'ABCD' ) , list ( 'abcd' ) )
name math chinese class 0 a 90 95 001 1 b 80 78 001 2 c 47 96 002 3 d 69 59 002
df. replace( r'[A-Z]' , '*' , regex= True )
name math chinese class 0 * 90 95 001 1 * 80 78 001 2 * 47 96 002 3 * 69 59 002
10.5 agg:聚合操作
按照设置axis对数据进行聚合操作(mean, max,…) df.agg(func, axis=0, *args, **kwargs)
axis: 0:func应用到column, 1:func应用到rowow
df[ [ 'chinese' , 'math' ] ] . agg( [ 'mean' , 'std' ] )
chinese math mean 82.000000 71.500000 std 17.416467 18.448125
df[ [ 'chinese' , 'math' ] ] . describe( )
chinese math count 4.000000 4.000000 mean 82.000000 71.500000 std 17.416467 18.448125 min 59.000000 47.000000 25% 73.250000 63.500000 50% 86.500000 74.500000 75% 95.250000 82.500000 max 96.000000 90.000000
10.6 transform:处理数据
根据设置方法,对数据进行处理,得到新的数据 df.transform(func, axis=0, *args, **kwargs)
每个科目数值减去均值去均值
df[ [ 'chinese' , 'math' ] ]
chinese math 0 95 90 1 78 80 2 96 47 3 59 69
df[ [ 'chinese' , 'math' ] ] . transform( lambda x: x- x. mean( ) )
chinese math 0 13.0 18.5 1 -4.0 8.5 2 14.0 -24.5 3 -23.0 -2.5
10.7 filter:过滤
根据标签过滤符合条件数据 df.filter(items=None, like=None, regex=None, axis=None)
参数 说明 items
labels 值,类似于列表 like
标签模糊匹配 regex
正则表达式匹配 axis
设置轴,指定操作的轴,例如 0
(按行)、1
(按列)
df. filter ( [ 'name' ] )
df. filter ( like = 'n' )
name chinese 0 A 95 1 B 78 2 C 96 3 D 59
df. filter ( regex= r'.*e' )
name chinese 0 A 95 1 B 78 2 C 96 3 D 59
11 分组处理
目的:根据指定条件,对数据进行分组,然后在依据分组进行计算
例如:统计每天活跃用户总数,用户每天在线时长,用户平均消费水平等;
import pandas as pd
classname = [ '001' , '001' , '002' , '002' , '003' , '003' ]
name = [ 'sun' , 'li' , 'zhou' , 'wang' , 'zao' , 'wu' ]
height = [ 169 , 172 , 180 , 170 , 165 , 175 ]
weights = [ 61 , 53 , 75 , 64 , 50 , 58 ]
df = pd. DataFrame( { 'cname' : classname, 'user' : name, 'height' : height, 'weights' : weights} )
df
cname user height weights 0 001 sun 169 61 1 001 li 172 53 2 002 zhou 180 75 3 002 wang 170 64 4 003 zao 165 50 5 003 wu 175 58
11.1 groupby分组
方法:df.groupby(by=None,axis=0,level=None,as_index=True,sort=True,group_keys=True,squeeze=False,observed=False,**kwargs)
主要参数:
参数 说明 by
分组依据 axis
轴设置,0
或 'index'
表示按行,1
或 'columns'
表示按列,默认为 0
group_keys
聚合输出,以组标签作为索引,默认为 True
sort
根据分组标签排序,默认为 True
level
多级索引,指定索引
dfg = df. groupby( 'cname' )
dfg. groups
{'001': [0, 1], '002': [2, 3], '003': [4, 5]}
dfg. count( )
user height weights cname 001 2 2 2 002 2 2 2 003 2 2 2
dfg = df. groupby( [ 'cname' , 'height' ] )
dfg. groups
{('001', 169): [0], ('001', 172): [1], ('002', 170): [3], ('002', 180): [2], ('003', 165): [4], ('003', 175): [5]}
dfg. count( )
user weights cname height 001 169 1 1 172 1 1 002 170 1 1 180 1 1 003 165 1 1 175 1 1
11.2 聚合
分组得到groupby对象,可以通过聚合函数对其操作,获取聚合结果;
直白理解:对分组数据,进行处理;
import pandas as pd
import numpy as np
classname = [ '001' , '001' , '002' , '002' , '003' , '003' ]
name = [ 'sun' , 'li' , 'zhou' , 'wang' , 'zao' , 'wu' ]
height = [ 169 , 172 , 180 , 170 , 165 , 175 ]
weights = [ 61 , 53 , 75 , 64 , 50 , 58 ]
df = pd. DataFrame( { 'cname' : classname, 'user' : name, 'height' : height, 'weights' : weights} )
df_cname = df. drop( 'user' , axis= 1 )
dfg = df_cname. groupby( 'cname' )
df_cname
cname height weights 0 001 169 61 1 001 172 53 2 002 180 75 3 002 170 64 4 003 165 50 5 003 175 58
dfg. agg( [ 'max' , 'mean' , 'std' , 'count' ] )
height weights max mean std count max mean std count cname 001 172 170.5 2.121320 2 61 57.0 5.656854 2 002 180 175.0 7.071068 2 75 69.5 7.778175 2 003 175 170.0 7.071068 2 58 54.0 5.656854 2
11.3 transform
作用:将分组数据处理,得到一组新的数据 方法:dfg.transform(func, *args, **kwargs)
场景:分组数据中,与分组均值差,差,
dfg. transform( 'mean' )
height weights 0 170.5 57.0 1 170.5 57.0 2 175.0 69.5 3 175.0 69.5 4 170.0 54.0 5 170.0 54.0
dfg. transform( lambda x: x- x. mean( ) )
height weights 0 -1.5 4.0 1 1.5 -4.0 2 5.0 5.5 3 -5.0 -5.5 4 -5.0 -4.0 5 5.0 4.0
11.5 filter
过滤数据 dfg.filter(func, dropna=True, *args, **kwargs)
:根据过滤条件返回分组数据据
dfg[ 'height' ] . filter ( lambda x: np. mean( x) > 171 )
2 180
3 170
Name: height, dtype: int64
dfg. count( )
height weights cname 001 2 2 002 2 2 003 2 2
11.6 cut分组
根据设置区间进行分段汇总 方法:pd.cut(x,bins,right=True,labels=None,retbins=False,precision=3,include_lowest=False,duplicates='raise')
df = pd. Series( np. random. randint( 70 , 100 , size = 10 ) )
r = pd. cut( df, [ 70 , 80 , 90 , 101 ] , right= False )
r
0 [70, 80)
1 [80, 90)
2 [70, 80)
3 [90, 101)
4 [70, 80)
5 [80, 90)
6 [90, 101)
7 [70, 80)
8 [90, 101)
9 [80, 90)
dtype: category
Categories (3, interval[int64, left]): [[70, 80) < [80, 90) < [90, 101)]
r = pd. cut( df, [ 70 , 80 , 90 , 101 ] , right= False , labels= [ '70+' , '80+' , '100+' ] )
r. groupby( r, observed= False ) . count( )
70+ 4
80+ 3
100+ 3
dtype: int64
r, bins= pd. cut( df, [ 70 , 80 , 90 , 101 ] , right= False , labels= [ '70+' , '80+' , '100+' ] , retbins= True )
bins
array([ 70, 80, 90, 101])
11.7 透视表
透视表是一种可以对数据动态排布并且分类汇总的表格格式,使用方式与groupby类似,但是比其简单;
方法:pd.pivot_table(data,values=None,index=None,columns=None,aggfunc='mean',fill_value=None,margins=False,...)
主要参数:
参数 说明 data
数据 values
用于计算数据项的列 index
行分组键,可以是列名、Grouper、数组等 columns
列分组键,可以是列名、Grouper、数组等 aggfunc
聚合函数或函数列表,或者字典形式的函数 fill_value
设置缺省值 dropna
删除缺省值,True
表示删除,默认为 True
margins
是否增加统计列,True
表示增加,默认为 False
margins_name
新增统计列的列名
import pandas as pd
path = r'data\titanic.csv'
df = pd. read_csv( path)
df
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True 887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True 888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False 889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True 890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True
891 rows × 15 columns
df. pivot_table( values= 'survived' , index= 'sex' , columns= [ 'class' ] )
class First Second Third sex female 0.968085 0.921053 0.500000 male 0.368852 0.157407 0.135447
df. pivot_table( values= 'survived' , index= 'sex' , columns= [ 'class' ] , margins= True , margins_name= 'all_mean' )
class First Second Third all_mean sex female 0.968085 0.921053 0.500000 0.742038 male 0.368852 0.157407 0.135447 0.188908 all_mean 0.629630 0.472826 0.242363 0.383838
cuts = pd. cut( df. age, [ 0 , 18 , 100 ] )
df. pivot_table( values= 'survived' , index= [ 'sex' , cuts] , columns= [ 'class' ] )
class First Second Third sex age female (0, 18] 0.909091 1.000000 0.511628 (18, 100] 0.972973 0.900000 0.423729 male (0, 18] 0.800000 0.600000 0.215686 (18, 100] 0.375000 0.071429 0.133663
cuts = pd. cut( df. age, [ 0 , 18 , 100 ] )
df. pivot_table( values= 'survived' , index= [ 'sex' ] , columns= [ 'class' ] , aggfunc= [ 'mean' , 'count' ] , margins= True )
mean count class First Second Third All First Second Third All sex female 0.968085 0.921053 0.500000 0.742038 94 76 144 314 male 0.368852 0.157407 0.135447 0.188908 122 108 347 577 All 0.629630 0.472826 0.242363 0.383838 216 184 491 891
11.8 str相关方法
str方法使Series对象内置方法,用于对字符串处理,与字符串相关方法类似;
cat(others=None, sep=None, na_rep=None, join=None)
:拼接字符串split(pat=None, n=-1, expand=False)
:切分字符串get(i)
:获取指定位置的字符串join(sep)
:字符串拼接find(sub, start=0, end=None)
:查找,返回第一次出现子集位置contains(pat, case=True, flags=0, na=nan, regex=True)
:判断是否包含指定的值replace(pat, repl, n=-1, case=None, flags=0, regex=True)
:替换sf.str.repeat(repeats)
:重复repeats次startswith(pat, na=nan)
:判断是否已pat开头sf.str.endswith(pat, na=nan)
:判断是否已pat结尾sf.str.strip(to_strip=None)
:根据指定字符掐头去尾
sf = pd. Series( [ 'a_1' , 'b_2' , 'c_3' ] )
sf. str . cat( list ( 'ABC' ) , sep = '-' )
0 a_1-A
1 b_2-B
2 c_3-C
dtype: object
sf. str . split( '_' , expand= True )
sf. str . extract( r'([a-zA-Z]+)' )
sf. str . match ( r'([a-zA-Z]+)' )
0 True
1 True
2 True
dtype: bool
sf. str . findall( r'([a-z0-9]+)' )
0 [a, 1]
1 [b, 2]
2 [c, 3]
dtype: object
df = pd. DataFrame( [ [ "张三" , "abcd1234" ] , [ "李四" , "a2b4c3r5" ] ] , columns= [ "name" , "pwd" ] )
df
name pwd 0 张三 abcd1234 1 李四 a2b4c3r5
12 pandas可视化
pandas可以直接绘制图表,实现基于matplotlib,使用方式与其类似,
方法:df.plot(*args, **kwargs)
主要参数:
参数 说明 data
Series 或 DataFrame 对象 x
标签或索引,用于指定 x 轴数据 y
标签或索引,用于指定 y 轴数据 kind
绘制图像的样式,例如:line
、bar
、barh
、hist
等 figsize
图像大小 use_index
是否使用 index 作为 x 轴刻度,默认为 True
grid
是否使用栅格,默认为 False
legend
是否显示图例,默认为 True
12.1 基本使用
import numpy as np
import pandas as pd
import matplotlib. pyplot as plt
% matplotlib inline
df = pd. DataFrame( np. random. randint( 30 , 60 , size= ( 6 , 4 ) ) , columns= list ( 'ABCD' ) , index= range ( 6 ) )
df
A B C D 0 41 34 35 53 1 40 53 34 50 2 59 44 47 38 3 59 50 59 40 4 55 52 49 45 5 45 57 38 54
df. plot( )
<AxesSubplot: >
df. A. plot( )
df. B. plot( secondary_y= True , style= 'g' )
<AxesSubplot: >
12.2 plot中可视化方法
折线图:df.plot.line(x=None, y=None, **kwargs) 柱状图:df.plot.bar(x=None, y=None, **kwargs) 条形图:df.plot.barh(x=None, y=None, **kwargs) 直方图:df.plot.hist(by=None, bins=10, **kwargs) KDE图:df.plot.kde(bw_method=None, ind=None, **kwargs) 饼状图:df.plot.pie(**kwargs) 散点图:df.plot.scatter(x, y, s=None, c=None, **kwargs) 箱状图:df.plot.box(by=None, **kwargs) 区域块状图:df.plot.area(x=None, y=None, **kwargs)*kwargs)
df. plot. bar( )
<AxesSubplot: >
df. plot. barh( )
<AxesSubplot: >
df. plot. pie( y= 'A' )
<AxesSubplot: ylabel='A'>