Data Whale Team Study, Round 20: Learning Pandas (Basics)
1. Introduction to DataFrames
1.1 Common pandas data structures
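Data structure | Dimensions | Description |
---|---|---|
Series | 1 | One-dimensional labeled array; comparable to a Python list or a one-dimensional NumPy array |
DataFrame | 2 | Two-dimensional tabular data structure |
Panel | 3 | Three-dimensional array, i.e. a container of DataFrames (removed in pandas 0.25) |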
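A Series generally consists of four parts: the values (data), the index (index), the storage type (dtype), and the name of the series (name). The index can itself be given a name, which defaults to None.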
import pandas as pd
Ser=pd.Series(data=[200,'w',{'dic1':6}],
index=pd.Index(['ID1',50,'six'],name='my_index'),
dtype='object',
name='Hello')
print("Ser=",Ser)
# Ser= my_index
# ID1 200
# 50 w
# six {'dic1': 6}
# Name: Hello, dtype: object
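The four components can be read back as attributes of the Series. A minimal sketch that reuses the values from the example above (the variable name `ser` is only illustrative):

```python
import pandas as pd

ser = pd.Series(
    data=[200, "w", {"dic1": 6}],
    index=pd.Index(["ID1", 50, "six"], name="my_index"),
    dtype="object",
    name="Hello",
)

print(ser.values)      # the underlying data as a NumPy array
print(ser.index)       # the index, including its name
print(ser.dtype)       # object
print(ser.name)        # Hello
print(ser.index.name)  # my_index
print(ser["six"])      # label-based lookup: {'dic1': 6}
```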
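1.2 Creating a DataFrame
Common ways to create a DataFrame include:
1) creating an empty DataFrame
2) building one from a dict
3) reading from another data source (for example an Excel file or a URL)
The code is as follows: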
import pandas as pd
# 1. Create an empty DataFrame
df=pd.DataFrame()
print("df=",df)
# df= Empty DataFrame
# Columns: []
# Index: []
# 2. Create from a dict (named data_dict so the built-in dict is not shadowed)
data_dict={'name':["zhangsan","lisi","wangwu","yangliu"],
'age':[25,31,28,19],
'city':["guangzhou","shenzheng","hangzhou","chengdu"]}
df1=pd.DataFrame(data_dict)
print(df1)
# name age city
# 0 zhangsan 25 guangzhou
# 1 lisi 31 shenzheng
# 2 wangwu 28 hangzhou
# 3 yangliu 19 chengdu
# 2 (continued). Create from explicit index, data and columns
index=pd.Index(["zhangsan","lisi","wangwu","yangliu"],name='person')
cols=['age','city']
data = [[25,"guangzhou"],
[31,"shenzheng"],
[28,"hangzhou"],
[19,"chengdu"]]
df2=pd.DataFrame(index=index,data=data,columns=cols)
print(df2)
# age city
# person
# zhangsan 25 guangzhou
# lisi 31 shenzheng
# wangwu 28 hangzhou
# yangliu 19 chengdu
# 3. Read from another data source (for example an Excel file or a URL); here a local CSV file
report_2020_df = pd.read_csv('./2020.csv',
usecols=['Country', 'Happiness Rank', 'Happiness Score', 'Region'])
# Preview the data
print('==============================================')
print(report_2020_df.head())
print(report_2020_df.values[:2, :])
print('==============================================')
print(report_2020_df[['Region', 'Happiness Rank']].values[:2, :])
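Since `./2020.csv` is not included here, the sketch below uses a small in-memory CSV to show what `usecols` and `.values` do. The rows and the extra `Year` column are invented placeholders, not real report data.

```python
import io
import pandas as pd

# Placeholder data standing in for ./2020.csv; the rows are invented for illustration.
csv_text = """Country,Region,Happiness Rank,Happiness Score,Year
A,North,1,7.5,2020
B,South,2,7.1,2020
C,East,3,6.9,2020
"""

report_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Country", "Happiness Rank", "Happiness Score", "Region"],
)

print(report_df.columns.tolist())  # 'Year' is not loaded
print(report_df.head(2))           # first two rows as a DataFrame
print(report_df.values[:2, :])     # first two rows as a NumPy array
```

`pd.read_csv` accepts a file path, a URL, or a file-like object such as `io.StringIO`, so the same call works once a real file is available.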
2. Basic DataFrame operations
2.1 Adding a column
import pandas as pd
# Rebuild df2 from explicit index, data and columns (same as in 1.2)
index=pd.Index(["zhangsan","lisi","wangwu","yangliu"],name='person')
cols=['age','city']
data = [[25,"guangzhou"],
[31,"shenzheng"],
[28,"hangzhou"],
[19,"chengdu"]]
df2=pd.DataFrame(index=index,data=data,columns=cols)
print(df2)
# age city
# person
# zhangsan 25 guangzhou
# lisi 31 shenzheng
# wangwu 28 hangzhou
# yangliu 19 chengdu
df2['country']='China'
print(df2)
# age city country
# person
# zhangsan 25 guangzhou China
# lisi 31 shenzheng China
# wangwu 28 hangzhou China
# yangliu 19 chengdu China
df2['Adress']=df2['country']
print(df2)
# age city country Adress
# person
# zhangsan 25 guangzhou China China
# lisi 31 shenzheng China China
# wangwu 28 hangzhou China China
# yangliu 19 chengdu China China
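Besides a scalar or an existing column, a new column can also be built from a list of matching length or computed from other columns. A minimal sketch on a small stand-in DataFrame; the column names `height` and `age_next_year` and their values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 31], "city": ["guangzhou", "hangzhou"]},
    index=pd.Index(["zhangsan", "wangwu"], name="person"),
)

df["height"] = [172, 180]            # list: one value per row, in row order
df["age_next_year"] = df["age"] + 1  # computed element-wise from another column

# assign() returns a new DataFrame and leaves df untouched
df_with_country = df.assign(country="China")

print(df)
print(df_with_country)
```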
2.2 Modifying the values of a column
# Continues from the code above
df2['Adress']='hebeisheng'
print(df2)
# age city country Adress
# person
# zhangsan 25 guangzhou China hebeisheng
# lisi 31 shenzheng China hebeisheng
# wangwu 28 hangzhou China hebeisheng
# yangliu 19 chengdu China hebeisheng
df2['city']='shijiazhuangshi'
df2['Adress']=df2['Adress']+','+df2['city']
print(df2)
# age city country Adress
# person
# zhangsan 25 shijiazhuangshi China hebeisheng,shijiazhuangshi
# lisi 31 shijiazhuangshi China hebeisheng,shijiazhuangshi
# wangwu 28 shijiazhuangshi China hebeisheng,shijiazhuangshi
# yangliu 19 shijiazhuangshi China hebeisheng,shijiazhuangshi
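2.3 Deleting a column
A column can be deleted with the del statement or the drop method. When using drop, pass axis=1. The axis argument specifies the axis to operate on: the default axis=0 refers to the row labels (index), and axis=1 refers to the column labels.
print(df2.drop('age',axis=1)) # drop returns a new DataFrame; df2 itself is unchanged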
# city country Adress
# person
# zhangsan shijiazhuangshi China hebeisheng,shijiazhuangshi
# lisi shijiazhuangshi China hebeisheng,shijiazhuangshi
# wangwu shijiazhuangshi China hebeisheng,shijiazhuangshi
# yangliu shijiazhuangshi China hebeisheng,shijiazhuangshi
del df2['Adress']
print(df2)
# age city country
# person
# zhangsan 25 shijiazhuangshi China
# lisi 31 shijiazhuangshi China
# wangwu 28 shijiazhuangshi China
# yangliu 19 shijiazhuangshi China
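The difference between the two can be checked directly: `drop` returns a new object, while `del` changes the DataFrame in place. A short sketch on a small stand-in DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 31], "city": ["guangzhou", "hangzhou"], "country": ["China", "China"]}
)

dropped = df.drop("age", axis=1)   # new DataFrame without 'age'
same = df.drop(columns=["age"])    # equivalent spelling using the columns keyword
print(df.columns.tolist())         # df still has all three columns

del df["country"]                  # removes the column from df itself
print(df.columns.tolist())
```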
2.4 Selecting columns
print(df2['city'])
# person
# zhangsan shijiazhuangshi
# lisi shijiazhuangshi
# wangwu shijiazhuangshi
# yangliu shijiazhuangshi
# Name: city, dtype: object
print(df2.age)
# person
# zhangsan 25
# lisi 31
# wangwu 28
# yangliu 19
# Name: age, dtype: int64
print(df2[['age','city']]) # select multiple columns
# age city
# person
# zhangsan 25 shijiazhuangshi
# lisi 31 shijiazhuangshi
# wangwu 28 shijiazhuangshi
# yangliu 19 shijiazhuangshi
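The three spellings above return different types, which the following sketch makes explicit on a small stand-in DataFrame. Attribute access such as `df.age` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 31], "city": ["guangzhou", "hangzhou"]})

single = df["city"]    # one label -> Series
frame = df[["city"]]   # list with one label -> DataFrame with one column

print(type(single).__name__)     # Series
print(type(frame).__name__)      # DataFrame
print(df.age.equals(df["age"]))  # attribute access returns the same column
```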
2.5 Renaming columns
There are three basic ways to rename columns:
1) assign a list
2) pass a dict
3) pass axis
# 1) Assign a list
df2.columns=['Age','Name','Adress']
print(df2)
# Age Name Adress
# person
# zhangsan 25 shijiazhuangshi China
# lisi 31 shijiazhuangshi China
# wangwu 28 shijiazhuangshi China
# yangliu 19 shijiazhuangshi China
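# 2) Pass a dict (shown on df2 with its original columns age, city, country;
# keys that match no column, such as 'name', are ignored)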
print(df2.rename(index=str,columns={'age':'Age','name':'Name','country':'Adress'}))
# Age city Adress
# person
# zhangsan 25 shijiazhuangshi China
# lisi 31 shijiazhuangshi China
# wangwu 28 shijiazhuangshi China
# yangliu 19 shijiazhuangshi China
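The third method in the list, passing `axis`, might look like the sketch below. It uses a small stand-in DataFrame with the original column names; this is one possible reading of "pass axis", not necessarily the exact call the author had in mind.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 31], "city": ["guangzhou", "hangzhou"], "country": ["China", "China"]}
)

# Mapping plus axis: only the listed labels on the chosen axis are renamed
renamed = df.rename({"age": "Age", "country": "Adress"}, axis="columns")

# Function plus axis: applied to every label on that axis
upper = df.rename(str.upper, axis="columns")

print(renamed.columns.tolist())
print(upper.columns.tolist())
print(df.columns.tolist())  # rename returns a new DataFrame; df is unchanged
```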
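2.6 The loc indexer
The loc indexer operates on rows: it selects row data by the actual label values in the row index. Its counterpart iloc selects by integer position instead.
import numpy as np
data=pd.DataFrame(np.arange(25).reshape(5,5),index=list('hello'),columns=list('WORLD'))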
print("data=",data)
# data= W O R L D
# h 0 1 2 3 4
# e 5 6 7 8 9
# l 10 11 12 13 14
# l 15 16 17 18 19
# o 20 21 22 23 24
#
# Select the row whose index label is 'e'
print("data.loc['e']=",data.loc['e'])
# data.loc['e']= W 5
# O 6
# R 7
# L 8
# D 9
# Name: e, dtype: int32
#
# Select the first row by position; the row labelled 'h' is the first row, so data.loc['h'] returns the same result
print("data.iloc[0]=",data.iloc[0])
# data.iloc[0]= W 0
# O 1
# R 2
# L 3
# D 4
# Name: h, dtype: int32
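A few more selections on the same `data` frame, as a sketch. Note that the label 'l' occurs twice in the index, so `loc` returns both matching rows.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(25).reshape(5, 5), index=list("hello"), columns=list("WORLD")
)

print(data.loc["l"])                     # 'l' appears twice -> 2-row DataFrame
print(data.loc["e", "R"])                # single cell by row and column label
print(data.loc[["h", "o"], ["W", "D"]])  # several rows and columns by label
print(data.iloc[1:3, 0:2])               # rows 1-2 and columns 0-1 by position
```

Label slices with `loc` include both endpoints, whereas position slices with `iloc` exclude the end, as in ordinary Python slicing.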
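3. Window objects
pandas provides three kinds of window objects: the sliding window (rolling), the expanding window (expanding), and the exponentially weighted window (ewm).
3.1 Sliding window: rolling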
Ser1=pd.Series([9,8,7,6,5])
rolled=Ser1.rolling(window=3)
print("rolled=",rolled)
# rolled= Rolling [window=3,center=False,axis=0]
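# Once the rolling object is obtained, apply an aggregation function to it. Note that each window includes the element at the current row:
# for example, the mean at the fourth position is (8+7+6)/3, not (9+8+7)/3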
print("rolled.mean()=",rolled.mean())
# rolled.mean()= 0 NaN
# 1 NaN
# 2 8.0
# 3 7.0
# 4 6.0
# dtype: float64
print("rolled.sum()=",rolled.sum())
# rolled.sum()= 0 NaN
# 1 NaN
# 2 24.0
# 3 21.0
# 4 18.0
# dtype: float64
# Compute the rolling covariance and the rolling correlation coefficient
Ser2=pd.Series([5,9,11,13,52,8,41])
print("rolling covariance=",rolled.cov(Ser2))
# rolling covariance= 0 NaN
# 1 NaN
# 2 -3.0
# 3 -2.0
# 4 -20.5
# 5 NaN
# 6 NaN
# dtype: float64
print("rolling correlation=",rolled.corr(Ser2))
# rolling correlation= 0 NaN
# 1 NaN
# 2 -0.981981
# 3 -1.000000
# 4 -0.886845
# 5 NaN
# 6 NaN
# dtype: float64
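The window alignment can be verified by hand. The sketch below recomputes one rolling mean manually and also shows `min_periods`, which controls how many observations a window needs before it yields a value (by default it equals the window size, hence the leading NaN values).

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6, 5])

# A window of 3 ending at position 3 covers positions 1, 2 and 3
manual = (s[1] + s[2] + s[3]) / 3
rolled_mean = s.rolling(window=3).mean()
print(manual, rolled_mean[3])  # both are 7.0

# min_periods=1 lets the first, incomplete windows produce a value instead of NaN
print(s.rolling(window=3, min_periods=1).mean())
```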
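3.2 Expanding window: expanding
# An expanding window, also called a cumulative window, can be understood as a window of dynamic length that runs from the start of the sequence to the position being processed,
# and the aggregation function is applied to each of these progressively growing windows. Concretely, for a sequence a1, a2, a3, a4,
# the windows at the successive positions are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4].
print("expanding mean:",Ser2.expanding().mean())
# expanding mean: 0 5.000000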
# 1 7.000000
# 2 8.333333
# 3 9.500000
# 4 18.000000
# 5 16.333333
# 6 19.857143
# dtype: float64
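As a cross-check, the expanding mean equals the running total divided by the number of elements seen so far. A small sketch using the same values as `Ser2`:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 9, 11, 13, 52, 8, 41])

expanding_mean = s.expanding().mean()
# Same quantity by hand: cumulative sum divided by the count 1, 2, ..., n
manual_mean = s.cumsum() / np.arange(1, len(s) + 1)

print(np.allclose(expanding_mean, manual_mean))  # True
print(s.expanding().max().tolist())              # running maximum
```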
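3.3 Exponentially weighted window: ewm
Mean and variance are common statistical measures that summarize the distribution of all samples. In time-series settings, however, we often want statistics over the full data while giving more weight to recent observations and less weight to observations further from the current time. The exponentially weighted moving average (EWMA) is a standard tool for this kind of problem in finance and other fields. It assigns each sample a weight that decays exponentially with time, so that the resulting average reflects the most recent behavior of the data.
pandas provides the window method ewm(), with mean and variance operations built on top of it. ewm is defined as follows (the exact parameter list varies between pandas versions):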
def ewm(self, com=None, span=None, halflife=None, alpha=None, min_periods=0,adjust=True, ignore_na=False, axis=0)
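Here com, span, halflife, and alpha determine the value of the decay coefficient α. The alpha parameter specifies α directly, while com, span, and halflife specify it through the center of mass, the span, and the half-life, respectively.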
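According to the pandas documentation, the conversions are α = 1/(1 + com), α = 2/(span + 1), and α = 1 − exp(−ln 2 / halflife). The sketch below checks the span conversion numerically and recomputes the last EWMA value by hand for the default `adjust=True`, where position t is a weighted average with weights (1 − α)^i for the observation i steps back. The short series and `span = 3` are illustrative choices.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0])
span = 3
alpha = 2 / (span + 1)  # span -> alpha conversion; here alpha = 0.5

by_span = s.ewm(span=span).mean()
by_alpha = s.ewm(alpha=alpha).mean()
print(np.allclose(by_span, by_alpha))  # True: two spellings of the same decay

# With adjust=True (the default), the weights (1 - alpha)**i decay with
# the distance i from the current point.
t = len(s) - 1
weights = (1 - alpha) ** np.arange(t + 1)  # weights for x_t, x_{t-1}, ..., x_0
manual_last = np.sum(weights * s.values[::-1]) / np.sum(weights)
print(manual_last, by_span.iloc[-1])
```

A larger span (smaller α) spreads the weight over more past observations and gives a smoother curve.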
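import numpy as np
# Random walk over 2000 days starting on 15 December 2020 (ISO date string avoids day/month ambiguity)
Ser4=pd.Series(np.random.randn(2000),
index=pd.date_range('2020-12-15',periods=2000)).cumsum()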
print("Ser4=",Ser4)
# Ser4= 2020-12-15 -2.552990
# 2020-12-16 -1.899371
# 2020-12-17 -1.034935
# 2020-12-18 -1.777100
# 2020-12-19 0.492655
# ...
# 2026-06-02 -47.202255
# 2026-06-03 -46.493395
# 2026-06-04 -46.070576
# 2026-06-05 -49.187433
# 2026-06-06 -48.542981
# Freq: D, Length: 2000, dtype: float64
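import matplotlib.pyplot as plt
ewma_s=Ser4.ewm(span=30).mean()
ax=Ser4.plot(label='Ser4')               # raw series
ewma_s.plot(ax=ax,label='EWMA, span=30') # smoothed series on the same axes
ax.legend()
plt.figure()                             # separate figure, so the integer x positions of the bars do not distort the date axis
plt.bar(range(len(Ser4)), Ser4)
plt.show()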
The smoothed series reflects the trend of the data.