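Series: pd.Series
(The attachment contains the code execution output.)
1. Creating a Series
A Series is a one-dimensional array with a labeled index.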
In [88]:
import pandas as pd
Create a Series
In [41]:
a = pd.Series([9.1,9.5,10.0,11])  # no index given, so it defaults to 0, 1, 2, 3, ...
a
Out[41]:
0     9.1
1     9.5
2    10.0
3    11.0
dtype: float64
An index can be specified when the Series is created
In [42]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'])
a
Out[42]:
a     9.1
b     9.5
c    10.0
d    11.0
dtype: float64
A name can also be assigned to the Series
In [43]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'],name='t')
a
Out[43]:
a     9.1
b     9.5
c    10.0
d    11.0
Name: t, dtype: float64
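2. Datetime index
The index of a pd.Series can hold numbers, strings, or timestamps. Use pd.to_datetime() to convert a sequence of strings into a timestamp index. The resulting timestamps are timezone-naive; this tutorial treats them as Coordinated Universal Time (UTC). BJT below stands for Beijing Time.
UTC -> BJT: +8h
BJT -> UTC: -8h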
In [44]:
a = pd.Series([9.1,9.5,10.0,11],index=pd.to_datetime(['2010-01-01','2010-01-02','2010-01-03','2010-01-04']),name='t')
a
Out[44]:
2010-01-01     9.1
2010-01-02     9.5
2010-01-03    10.0
2010-01-04    11.0
Name: t, dtype: float64
Suppose the times above are in BJT and now need to be converted to UTC. This can be done with datetime.timedelta(), as follows:
In [45]:
import datetime as dt
time_index = pd.to_datetime(['2010-01-01','2010-01-02','2010-01-03','2010-01-04'])
time_index = time_index - dt.timedelta(hours = 8)  # BJT -> UTC: subtract 8 hours
a = pd.Series([9.1,9.5,10.0,11],index=time_index,name='t')
a
Out[45]:
2009-12-31 16:00:00     9.1
2010-01-01 16:00:00     9.5
2010-01-02 16:00:00    10.0
2010-01-03 16:00:00    11.0
Name: t, dtype: float64
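The same BJT to UTC shift can also be expressed with timezone-aware timestamps instead of a manual 8-hour offset. The following is a minimal sketch, assuming the zone name 'Asia/Shanghai' stands in for BJT; the variable names are illustrative only.

```python
import pandas as pd

# Timestamps parsed from plain strings carry no timezone (tz-naive).
naive_index = pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03', '2010-01-04'])

# Assumption: 'Asia/Shanghai' is used as the zone name for BJT (UTC+8).
bjt_index = naive_index.tz_localize('Asia/Shanghai')

# Convert the same instants to UTC; the clock values move back by 8 hours.
utc_index = bjt_index.tz_convert('UTC')

a = pd.Series([9.1, 9.5, 10.0, 11], index=utc_index, name='t')
print(a)
```

Unlike the timedelta approach, the resulting index records that it is in UTC, so later conversions do not depend on remembering which offset was applied.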
3. Arithmetic on pd.Series objects
In [46]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'])
b = a -10
b
Out[46]:
a   -0.9
b   -0.5
c    0.0
d    1.0
dtype: float64
When one pd.Series is combined with another in an arithmetic operation, the values are matched by index label, not by position
In [47]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'])
b = pd.Series([9.1,9.5,10.0,11],index=['b','a','c','d'])
b-a
Out[47]:
a    0.4
b   -0.4
c    0.0
d    0.0
dtype: float64
If the two pd.Series objects do not have the same index, the result is NaN for every label that is not present in both
In [48]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'])
b = pd.Series([9.1,9.5,10.0,11],index=['b','e','c','d'])
b-a
Out[48]:
a    NaN
b   -0.4
c    0.0
d    0.0
e    NaN
dtype: float64
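When the NaN values from mismatched labels are not wanted, two common options are to supply a default for the missing side or to keep only the shared labels. The sketch below reuses the two Series from the cell above; treating a missing value as 0 is an assumption made for illustration and is not always appropriate.

```python
import pandas as pd

a = pd.Series([9.1, 9.5, 10.0, 11], index=['a', 'b', 'c', 'd'])
b = pd.Series([9.1, 9.5, 10.0, 11], index=['b', 'e', 'c', 'd'])

# Labels missing from one side are treated as 0 instead of producing NaN.
diff_filled = b.sub(a, fill_value=0)

# Alternatively, keep only the labels that both Series share.
diff_common = (b - a).dropna()

print(diff_filled)
print(diff_common)
```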
4. Common attributes of pd.Series objects
.dtype data type
In [49]:
a = pd.Series([9.1,9.5,10.0,11],index=['a','b','c','d'])
a.dtype
Out[49]:
dtype('float64')
.ndim number of dimensions
In [50]:
a.ndim
Out[50]:
1
.shape shape of the data
In [51]:
a.shape
Out[51]:
(4,)
.at access a single element by label
In [52]:
a.at['a']
Out[52]:
9.1
.iat access a single element by integer position
In [53]:
a.iat[0]
Out[53]:
9.1
.loc access one or more elements by label or by a boolean sequence
In [54]:
a.loc[['a','b']]
Out[54]:
a    9.1
b    9.5
dtype: float64
In [23]:
a.loc['a']
Out[23]:
9.1
In [55]:
a.loc[[True,False,True,False]]  # the number of booleans must equal the number of elements in the Series
Out[55]:
a     9.1
c    10.0
dtype: float64
.iloc access one or more elements by integer position
In [56]:
a.iloc[1]
Out[56]:
9.5
In [27]:
a.iloc[[0,1]]
Out[27]:
a    9.1
b    9.5
dtype: float64
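Label-based and position-based access also differ in how slices treat the end point, which is easy to overlook. The short sketch below uses the same Series as above; the variable names are illustrative.

```python
import pandas as pd

a = pd.Series([9.1, 9.5, 10.0, 11], index=['a', 'b', 'c', 'd'])

by_label = a.loc['a':'c']      # label slice: the end label 'c' is included
by_position = a.iloc[0:2]      # position slice: the end position 2 is excluded
by_condition = a.loc[a > 9.5]  # boolean Series built from a comparison

print(by_label)
print(by_position)
print(by_condition)
```

Building the boolean mask from a comparison guarantees that it has the same length and index as the Series, so it does not have to be written out by hand.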
.values get the underlying data as an np.ndarray
In [57]:
a.values
Out[57]:
array([ 9.1, 9.5, 10. , 11. ])
5. Common methods of pd.Series objects
dropna() remove NaN values
In [58]:
import numpy as np
c = pd.Series([1,2,3,np.nan,5],index=['a','b','c','d','e'],name='t')
print(c)
d = c.dropna()
print('-'*10)
print(d)
a    1.0
b    2.0
c    3.0
d    NaN
e    5.0
Name: t, dtype: float64
----------
a    1.0
b    2.0
c    3.0
e    5.0
Name: t, dtype: float64
groupby() group and aggregate. It returns a GroupBy object, which provides methods such as min(), max(), and mean()
In [59]:
a = pd.Series([1,2,3,4],index=['sta1','sta2','sta1','sta2'])
b = a.groupby(level=0).mean()  # group by the index labels
print(b)
sta1    2.0
sta2    3.0
dtype: float64
In [60]:
# groups can also be formed from a threshold condition
a = pd.Series([1,2,3,4],index=['sta1','sta2','sta1','sta2'])
b = a.groupby(a>=2).mean()
print(b)
False    1.0
True     3.0
dtype: float64
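Several statistics can be computed per group in one pass with agg(). The sketch below applies it to the station example above; the name summary is illustrative.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['sta1', 'sta2', 'sta1', 'sta2'])

# One row per group, one column per requested statistic.
summary = a.groupby(level=0).agg(['min', 'max', 'mean'])
print(summary)
```

The result is a DataFrame rather than a Series, because each group now has more than one value.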
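interpolate() / fillna() fill NaN values
interpolate() fills the NaN values in the array by interpolation. It supports, among others, the following methods:
linear: linear interpolation that ignores the index and treats the values as equally spaced
time: linear interpolation based on a datetime index; it works for data at daily or finer resolution
index: linear interpolation based on a numeric index; it works when the index is numeric
In [63]:
import numpy as np
import pandas as pd
a = pd.Series([1,2,np.nan,4],index=['a','b','c','d'],name='t')
b = a.interpolate(method = 'linear')
b
Out[63]:
a    1.0
b    2.0
c    3.0
d    4.0
Name: t, dtype: float64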
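The difference between the linear and time methods only shows up when the timestamps are unevenly spaced. The following sketch uses a small made-up Series (the dates and values are illustrative) with a gap of one day before the missing value and two days after it.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04'])
s = pd.Series([1.0, np.nan, 4.0], index=idx)

by_order = s.interpolate(method='linear')  # ignores the dates: midpoint of 1 and 4
by_time = s.interpolate(method='time')     # weights by elapsed time: 1 day out of 3

print(by_order)
print(by_time)
```

With method='linear' the missing value becomes 2.5, while method='time' gives 2.0 because 2020-01-02 lies one third of the way from the first timestamp to the last.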
In [64]:
# fill with a specific value
b = a.fillna(value=999)
b
Out[64]:
a      1.0
b      2.0
c    999.0
d      4.0
Name: t, dtype: float64
In [65]:
# fill with a neighboring value
# b = a.fillna(method = 'bfill')  # older spelling, deprecated in recent pandas versions
a.bfill()  # fill with the next valid value
Out[65]:
a    1.0
b    2.0
c    4.0
d    4.0
Name: t, dtype: float64
In [67]:
# b = a.fillna(method = 'ffill')  # older spelling, deprecated in recent pandas versions
a.ffill()  # fill with the previous valid value
Out[67]:
a    1.0
b    2.0
c    2.0
d    4.0
Name: t, dtype: float64
resample() resample a time series. It only works when the index is a datetime, period, or timedelta index
Downsampling: the resampling period is longer than the time step of the data
Upsampling: the resampling period is shorter than the time step of the data
In [68]:
# downsampling with aggregation
import pandas as pd
a = pd.Series([1,2,3,4],index=pd.to_datetime(['2020-01-01','2020-01-02','2020-01-04','2020-01-05']),name='t')
b = a.resample('2D').max()
b
Out[68]:
2020-01-01    2
2020-01-03    3
2020-01-05    4
Freq: 2D, Name: t, dtype: int64
In [69]:
# upsampling with filling
a = pd.Series([1,2,3,4],index=pd.to_datetime(['2020-01-01','2020-01-02','2020-01-04','2020-01-05']),name='t')
b = a.resample('1D').asfreq()  # asfreq() leaves new time slots as NaN by default; alternatives: bfill(), ffill(), nearest(), interpolate()
b
Out[69]:
2020-01-01    1.0
2020-01-02    2.0
2020-01-03    NaN
2020-01-04    3.0
2020-01-05    4.0
Freq: D, Name: t, dtype: float64
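To show how the fill choice changes the upsampled result, the sketch below applies two of the alternatives to asfreq() to the same Series. The variable names are illustrative.

```python
import pandas as pd

a = pd.Series(
    [1, 2, 3, 4],
    index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04', '2020-01-05']),
    name='t',
)

filled_forward = a.resample('1D').ffill()       # copy the previous observation
filled_linear = a.resample('1D').interpolate()  # linear interpolation between neighbors

print(filled_forward)
print(filled_linear)
```

For the missing day 2020-01-03, forward filling gives 2 and linear interpolation gives 2.5.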
reindex() / reindex_like() rearrange the data to match a given index
In [70]:
# reindex() reorders an existing pd.Series according to a given sequence of labels
a = pd.Series([1,2,3,4],index=['a','b','c','d'],name='t')
c = a.reindex(['a','c','b','d'])
c
Out[70]:
a    1
c    3
b    2
d    4
Name: t, dtype: int64
In [71]:
# reindex_like() reorders one pd.Series to match the index of another
a = pd.Series([1,2,3,4],index=['a','b','c','d'],name='t')
b = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t1')
c = a.reindex_like(b)
c
Out[71]:
a    1
c    3
b    2
d    4
Name: t, dtype: int64
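reindex() does more than reorder: labels that are not present in the original Series are added with missing values. The sketch below illustrates this with a made-up label 'z'; using 0 as the fill value is an assumption for the example.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'], name='t')

with_nan = a.reindex(['a', 'b', 'z'])                 # 'z' is new, so its value is NaN
with_zero = a.reindex(['a', 'b', 'z'], fill_value=0)  # use 0 for new labels instead

print(with_nan)
print(with_zero)
```

Labels left out of the new index ('c' and 'd' here) are dropped, and introducing NaN changes the dtype from integer to float.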
rename() rename the Series
In [72]:
a = pd.Series([1,2,3,4],index=['a','b','c','d'],name='t')
b = a.rename('m')
b
Out[72]:
a    1
b    2
c    3
d    4
Name: m, dtype: int64
rolling() rolling-window calculation. rolling() returns a Rolling object. Like a GroupBy object, it provides methods such as sum(), mean(), max(), and min()
In [73]:
a = pd.Series([1,2,3,4,5,6,7,8,9,10,11])
b = a.rolling(3).mean()  # moving average over a trailing window of 3
b
Out[73]:
0      NaN
1      NaN
2      2.0
3      3.0
4      4.0
5      5.0
6      6.0
7      7.0
8      8.0
9      9.0
10    10.0
dtype: float64
In [74]:
a = pd.Series([1,2,3,4,5,6,7,8,9,10,11])
b = a.rolling(3, center = True).mean()  # centered moving average
b
Out[74]:
0      NaN
1      2.0
2      3.0
3      4.0
4      5.0
5      6.0
6      7.0
7      8.0
8      9.0
9     10.0
10     NaN
dtype: float64
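The NaN values at the edges appear because a window of 3 is not yet full there. If partial windows are acceptable, min_periods lowers the number of observations required. A minimal sketch with the same data:

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# Accept windows with at least 1 observation, so the first rows are not NaN.
b = a.rolling(3, min_periods=1).mean()
print(b)
```

The first two results are then averages over 1 and 2 values (1.0 and 1.5), which is worth keeping in mind when comparing them with the full-window values.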
sort_index() sort by index
In [75]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
b = a.sort_index()  # ascending
c = a.sort_index(ascending=False)  # descending
print(b)
print('-'*10)
print(c)
a    1
b    3
c    2
d    4
Name: t, dtype: int64
----------
d    4
c    2
b    3
a    1
Name: t, dtype: int64
sort_values() sort by value
In [77]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
b = a.sort_values()  # ascending
c = a.sort_values(ascending=False)  # descending
print(b)
print('-'*10)
print(c)
a    1
c    2
b    3
d    4
Name: t, dtype: int64
----------
d    4
b    3
c    2
a    1
Name: t, dtype: int64
max() min()
In [78]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
print(a.max())
print(a.min())
4
1
argmax() argmin() integer positions of the maximum and minimum values
In [79]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
print(a.argmax())
print(a.argmin())
3
0
idxmax() idxmin() index labels of the maximum and minimum values
In [80]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
print(a.idxmax())
print(a.idxmin())
d
a
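The two pairs of methods describe the same element in different ways: argmax() returns an integer position for use with .iloc, while idxmax() returns a label for use with .loc. A small sketch of how they relate (the names pos and label are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'c', 'b', 'd'], name='t')

pos = a.argmax()    # integer position of the maximum
label = a.idxmax()  # index label of the maximum

# Both routes lead to the same value.
print(a.iloc[pos], a.loc[label], a.index[pos] == label)
```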
std() / var() sample standard deviation / unbiased sample variance
In [81]:
a = pd.Series([1,2,3,4,5,6,7,8,9,10,11])
print(a.std())
print(a.var())
3.3166247903554
11.0
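Both methods divide by n - 1 by default (ddof=1), which is why the variance is described as unbiased. The sketch below shows how to obtain the population versions, which divide by n, for the same data.

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

sample_var = a.var()            # ddof=1 by default: divide by n - 1
population_var = a.var(ddof=0)  # divide by n
population_std = a.std(ddof=0)

print(sample_var, population_var, population_std)
```

Here the sum of squared deviations from the mean is 110, so the sample variance is 110 / 10 = 11.0 and the population variance is 110 / 11 = 10.0.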
cov() covariance
In [82]:
a = pd.Series([1,2,3,4,5,6,7,8,9,10,11])
b = pd.Series([11,10,9,8,7,6,5,4,3,2,1])
print(a.cov(b))
-11.0
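The size of a covariance depends on the scale of the data, so it is often easier to read the normalized version, the Pearson correlation returned by corr(). A brief sketch with the same two Series:

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
b = pd.Series([11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1])

# Covariance divided by the product of the two standard deviations.
r = a.corr(b)
manual = a.cov(b) / (a.std() * b.std())

print(r, manual)
```

Because b decreases exactly as a increases, the correlation is -1 up to floating-point rounding.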
sum() mean()
In [83]:
a = pd.Series([1,2,3,4,5,6,7,8,9,10,11])
print(a.sum())
print(a.mean())
66
6.0
abs() absolute value
In [84]:
a = pd.Series([-1,2,3,-4,5,6,7,8,-9,10,11])
print(a.abs())
0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
dtype: int64
to_csv() save to a CSV file
In [ ]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
a.to_csv('xxx.csv')
a.to_csv('xxx.csv',index = False, header=False)  # omit the Series name (header) and the index
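To check what to_csv() writes, the CSV text can be produced in memory and read back. The sketch below avoids touching the file system by using io.StringIO; the names csv_text and restored are illustrative.

```python
import io

import pandas as pd

a = pd.Series([1, 2, 3, 4], index=['a', 'c', 'b', 'd'], name='t')

# With no path argument, to_csv() returns the CSV text as a string.
csv_text = a.to_csv()
print(csv_text)

# Read it back: the first column is the index, the column 't' holds the values.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)['t']
print(restored)
```

With index=False and header=False, as in the cell above, only the values are written, so the labels and the name cannot be recovered from the file.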
to_list() convert to a list
In [85]:
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
b=a.to_list()
b
Out[85]:
[1, 2, 3, 4]
astype() convert to a given data type
In [86]:
import numpy as np
a = pd.Series([1,2,3,4],index=['a','c','b','d'],name='t')
print(a)
b = a.astype(np.int32)
print('-'*10)
print(b)
a    1
c    2
b    3
d    4
Name: t, dtype: int64
----------
a    1
c    2
b    3
d    4
Name: t, dtype: int32
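One case where astype() needs care is a Series that contains NaN, since NaN cannot be stored in a NumPy integer type. The sketch below shows the failure and one possible workaround, the nullable 'Int64' dtype; the example data and variable names are illustrative.

```python
import numpy as np
import pandas as pd

c = pd.Series([1, 2, np.nan], name='t')  # NaN forces dtype float64

try:
    c.astype(np.int64)                   # NaN has no int64 representation
    cast_failed = False
except ValueError:
    cast_failed = True

d = c.astype('Int64')                    # nullable integer dtype keeps the gap as <NA>

print(cast_failed)
print(d)
```

Another option is to remove or fill the NaN values first with dropna() or fillna() and convert afterwards.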