pd.DataFrame(): data frames
(The attachment contains the output from running the code.)
1. Creating a DataFrame
pd.DataFrame(data,index,columns,dtype=None)
In [1]:
# No index or columns specified: both default to 0, 1, 2, ...
import pandas as pd
a = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]])
a
Out[1]:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |
In [2]:
# index sets the row labels, columns sets the column labels
a = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]],
                 index = ['a','b','c'],
                 columns =['t','p','q'])
a
Out[2]:
| | t | p | q |
|---|---|---|---|
| a | 1 | 2 | 3 |
| b | 4 | 5 | 6 |
| c | 7 | 8 | 9 |
In [3]:
# Selecting a single column returns a pd.Series
b = a['t']
b
Out[3]:
a    1
b    4
c    7
Name: t, dtype: int64
In [5]:
# Each column of a DataFrame can have its own data type
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = ['a','b','c'],
                 columns =['t','p','q'])
a.dtypes
Out[5]:
t    float64
p      int64
q      int64
dtype: object
In [7]:
# Using timestamps as the index
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
a
Out[7]:
| | t | p | q |
|---|---|---|---|
| 2020-01-01 | 1.1 | 2 | 3 |
| 2020-01-02 | 4.1 | 5 | 6 |
| 2020-01-03 | 7.1 | 8 | 9 |
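The `data` argument does not have to be a list of rows. As a minimal sketch (the values and labels below are made up for illustration), a dict maps column names to column values, and the `dtype` argument forces a single type on every column:

```python
import pandas as pd

# A dict of columns: keys become column names, values become the column data.
d = pd.DataFrame({'t': [1.1, 4.1, 7.1], 'p': [2, 5, 8]},
                 index=['a', 'b', 'c'])
print(d.dtypes)   # 't' is float, 'p' is integer

# dtype=float converts every column to floating point at construction time.
f = pd.DataFrame([[1, 2], [3, 4]], columns=['t', 'p'], dtype=float)
print(f.dtypes)
```

Without `dtype`, pandas infers a type per column, which is why `t` and `p` differ in the first frame.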
2. Reading CSV files
In [ ]:
# CSV with both an index column and a header row
a = pd.read_csv('xxx.csv',index_col = 0) # index_col: which column to use as the row index
In [ ]:
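# CSV with an index column but no header row
a = pd.read_csv('xxx.csv',index_col = 0,header=None) # no header row
# header only accepts row numbers; column names are supplied through names
b = pd.read_csv('xxx.csv',index_col = 0,header=None,names=['t','p','q','rh'])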
In [ ]:
# CSV without an index column
a = pd.read_csv('xxx.csv',index_col = None) # no index column: set index_col to None
b = pd.read_csv('xxx.csv') # or simply omit index_col (None is the default)
In [ ]:
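# Reading date/time columns. A date column in a CSV file is read as strings by default.
# Use the parse_dates parameter to list the columns that should be parsed as dates.
a = pd.read_csv('xxx.csv',parse_dates=[0]) # parse the first column as dates
a = pd.read_csv('xxx.csv',parse_dates=[0],index_col=[0]) # also use the date column as the index
a = pd.read_csv('xxx.csv',parse_dates=[0,1]) # parse columns 0 and 1 as two separate date columns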
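The `read_csv` options above can be tried without a file on disk by reading from an in-memory string. This is only a sketch: the CSV text, the column names `time`, `t`, `p` and the values are invented for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content with a header row and a date column.
csv_text = "time,t,p\n2020-01-01,1.1,2\n2020-01-02,4.1,5\n"

# Parse the first column as dates and use it as the row index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=[0], index_col=0)

# The same data without a header row: pass header=None and supply names.
no_header = "2020-01-01,1.1,2\n2020-01-02,4.1,5\n"
df2 = pd.read_csv(io.StringIO(no_header), header=None,
                  names=['time', 't', 'p'], parse_dates=[0], index_col=0)

print(type(df.index).__name__)
print(df2.columns.tolist())
```

Both calls should produce a frame indexed by timestamps with the data columns `t` and `p`.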
3. Arithmetic on pd.DataFrame
In [8]:
# Arithmetic with a scalar
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(a)
print('-'*10)
b = a-100
print(b)

              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
2020-01-03  7.1  8  9
----------
               t   p   q
2020-01-01 -98.9 -98 -97
2020-01-02 -95.9 -95 -94
2020-01-03 -92.9 -92 -91
In [10]:
# Arithmetic between a pd.DataFrame and a pd.Series
# The operation works row by row: the Series is added to every row of the DataFrame
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
b = pd.Series([1,2,3],index=['t','p','q'])
# Note: a row taken out of a DataFrame is a Series whose index is the DataFrame's columns.
# In DataFrame/Series arithmetic, the DataFrame's columns are matched against the Series index.
print(a+b)

              t   p   q
2020-01-01  2.1   4   6
2020-01-02  5.1   7   9
2020-01-03  8.1  10  12
In [11]:
# Arithmetic between two pd.DataFrame objects
# Values are aligned on both the index and the column names, then combined element by element
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
b = pd.DataFrame([[2.1,2,3], [3.1,5,6], [4.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(b-a)

              t  p  q
2020-01-01  1.0  0  0
2020-01-02 -1.0  0  0
2020-01-03 -3.0  0  0
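Alignment matters most when the two frames do not share all of their labels. The sketch below uses two small made-up frames (`x` and `y` are illustrative names): plain `-` yields NaN wherever a label exists in only one frame, while `sub(..., fill_value=0)` treats the missing side as 0.

```python
import pandas as pd

x = pd.DataFrame({'t': [1.0, 2.0]}, index=['a', 'b'])
y = pd.DataFrame({'t': [10.0, 20.0]}, index=['b', 'c'])

diff = y - x                      # only label 'b' exists in both frames
filled = y.sub(x, fill_value=0)   # a missing entry on one side counts as 0

print(diff)
print(filled)
```

The result index is the union of both indexes, so it has the rows `a`, `b` and `c` in either case.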
In [1]:
# Select the rows that satisfy a condition
import pandas as pd
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
b = a[a['t']>2]
print(b)

              t  p  q
2020-01-02  4.1  5  6
2020-01-03  7.1  8  9
In [3]:
b = a[(a['t']>2) & (a['q']<8)]
print(b)

              t  p  q
2020-01-02  4.1  5  6
In [4]:
b = a[(a['t']>2) | (a['q']<8)]
print(b)

              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
2020-01-03  7.1  8  9
In [7]:
# Select rows by a condition on the time index
b = a[a.index.day == 1]
print(b)
print('-'*10)
c = a[a.index.month == 1]
print(c)
print('-'*10)
d = a[a.index.year == 2020]
print(d)

              t  p  q
2020-01-01  1.1  2  3
----------
              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
2020-01-03  7.1  8  9
----------
              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
2020-01-03  7.1  8  9
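Besides boolean masks built from `index.day`, `index.month` and `index.year`, a sorted `DatetimeIndex` can also be queried through `.loc` with date strings. A small sketch with made-up data (the third date is moved to February so the two selections differ):

```python
import pandas as pd

a = pd.DataFrame({'t': [1.1, 4.1, 7.1]},
                 index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-02-01']))

jan = a.loc['2020-01']        # partial string: every row in January 2020
later = a.loc['2020-01-02':]  # label slice from 2020-01-02 onwards

print(jan)
print(later)
```

This form is often shorter than a mask when the condition is a contiguous date range.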
4. Common pd.DataFrame attributes
In [8]:
# .dtypes (the Series counterpart is .dtype)
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
a.dtypes
Out[8]:
t    float64
p      int64
q      int64
dtype: object
In [9]:
# .ndim
a.ndim
Out[9]:
2
In [10]:
# .shape
a.shape
Out[10]:
(3, 3)
In [12]:
# .at: access a single element by row label and column label
a.at['2020-01-01','t']
Out[12]:
1.1
In [13]:
# .iat: access a single element by integer row position and column position
a.iat[0,0]
Out[13]:
1.1
In [21]:
# .loc: access multiple elements by row/column labels or by a boolean sequence
# Rows
b = a.loc['2020-01-01'] # a single label returns a Series
print(b)
print('-'*10)
c = a.loc[['2020-01-01']] # a list of labels returns a DataFrame
print(c)
d = a.loc[['2020-01-01','2020-01-02']]
print('-'*10)
print(d)

t    1.1
p    2.0
q    3.0
Name: 2020-01-01 00:00:00, dtype: float64
----------
              t  p  q
2020-01-01  1.1  2  3
----------
              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
In [25]:
# Columns
b = a.loc[:,'t'] # Series
print(b)
print('-'*10)
c = a.loc[:, ['t']] # DataFrame
print(c)
print('-'*10)
d = a.loc[:,['t','p']]
print(d)

2020-01-01    1.1
2020-01-02    4.1
2020-01-03    7.1
Name: t, dtype: float64
----------
              t
2020-01-01  1.1
2020-01-02  4.1
2020-01-03  7.1
----------
              t  p
2020-01-01  1.1  2
2020-01-02  4.1  5
2020-01-03  7.1  8
In [26]:
# Select rows and columns with a boolean sequence
b = a.loc[[True,False,True]]
print(b)
print('-'*10)
c = a.loc[:,[True,False,True]]
print(c)

              t  p  q
2020-01-01  1.1  2  3
2020-01-03  7.1  8  9
----------
              t  q
2020-01-01  1.1  3
2020-01-02  4.1  6
2020-01-03  7.1  9
In [31]:
# .iloc: access multiple elements by integer row/column position
# Rows
b = a.iloc[0] # Series
print(b)
print('-'*10)
c = a.iloc[[0]] # DataFrame
print(c)
print('-'*10)
d = a.iloc[[0,1]]
print(d)

t    1.1
p    2.0
q    3.0
Name: 2020-01-01 00:00:00, dtype: float64
----------
              t  p  q
2020-01-01  1.1  2  3
----------
              t  p  q
2020-01-01  1.1  2  3
2020-01-02  4.1  5  6
In [33]:
# Columns
b = a.iloc[:,0] # Series
print(b)
print('-'*10)
c = a.iloc[:,[0]] # DataFrame
print(c)
print('-'*10)
d = a.iloc[:,[0,1]]
print(d)

2020-01-01    1.1
2020-01-02    4.1
2020-01-03    7.1
Name: t, dtype: float64
----------
              t
2020-01-01  1.1
2020-01-02  4.1
2020-01-03  7.1
----------
              t  p
2020-01-01  1.1  2
2020-01-02  4.1  5
2020-01-03  7.1  8
In [34]:
# .values: get the underlying data as an np.ndarray
a.values
Out[34]:
array([[1.1, 2. , 3. ],
       [4.1, 5. , 6. ],
       [7.1, 8. , 9. ]])
5. Common pd.DataFrame methods
In [39]:
# dropna(): remove NaN values
import numpy as np
# Drop rows or columns that contain any NaN
a = pd.DataFrame([[1.1,np.nan,3], [np.nan,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(a)
print('-'*10)
print(a.dropna()) # drop rows containing NaN
print('-'*10)
print(a.dropna(axis=1)) # drop columns containing NaN

              t    p  q
2020-01-01  1.1  NaN  3
2020-01-02  NaN  5.0  6
2020-01-03  7.1  8.0  9
----------
              t    p  q
2020-01-03  7.1  8.0  9
----------
            q
2020-01-01  3
2020-01-02  6
2020-01-03  9
In [42]:
# Drop only the rows or columns that consist entirely of NaN
a = pd.DataFrame([[np.nan,np.nan,np.nan], [np.nan,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(a)
print('-'*10)
print(a.dropna(how = 'all')) # drop rows in which every value is NaN
print('-'*10)
print(a.dropna(axis=1, how = 'all')) # drop columns in which every value is NaN (none here)

              t    p    q
2020-01-01  NaN  NaN  NaN
2020-01-02  NaN  5.0  6.0
2020-01-03  7.1  8.0  9.0
----------
              t    p    q
2020-01-02  NaN  5.0  6.0
2020-01-03  7.1  8.0  9.0
----------
              t    p    q
2020-01-01  NaN  NaN  NaN
2020-01-02  NaN  5.0  6.0
2020-01-03  7.1  8.0  9.0
In [44]:
# groupby(): split into groups and aggregate; the returned object supports sum(), mean(), max(), min(), var(), size()
# Group by the original index labels (level = 0)
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
print(a)
print('-'*10)
print(a.groupby(level = 0 ).sum())
print('-'*10)
print(a.groupby(level = 0 ).size()) # number of rows in each group

             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-03  7.1   8   9
----------
             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-03  7.1   8   9
----------
2020-01-01    1
2020-01-02    1
2020-01-03    1
dtype: int64
In [50]:
# Pass a column name to the by parameter to group on that column
a = pd.DataFrame([[1.1,2,3,'k1'], [4.1,5,6,'k1'], [7.1,8,9,'k2']],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3','kind'])
print(a)
print('-'*10)
print(a.groupby(by = 'kind').sum())

             s1  s2  s3 kind
2020-01-01  1.1   2   3   k1
2020-01-02  4.1   5   6   k1
2020-01-03  7.1   8   9   k2
----------
       s1  s2  s3
kind
k1    5.2   7   9
k2    7.1   8   9
In [56]:
# Group with a function
# The function is called on each index label, not on the 'kind' column.
# No timestamp equals 'k2', so every row lands in the single group False.
a = pd.DataFrame([[1.1,2,3,'k1'], [4.1,5,6,'k1'], [7.1,8,9,'k2']],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3','kind'])
print(a)
print('-'*10)
print(a.groupby(by = lambda x: x=='k2').sum())

             s1  s2  s3 kind
2020-01-01  1.1   2   3   k1
2020-01-02  4.1   5   6   k1
2020-01-03  7.1   8   9   k2
----------
         s1  s2  s3    kind
False  12.3  15  18  k1k1k2
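Because a function passed to `by` receives index labels, the call above cannot separate `k1` from `k2`. One way to group on a condition over a column, shown here as a sketch rather than the only option, is to build a boolean Series from that column and pass it as the key (`is_k2` is an illustrative name):

```python
import pandas as pd

a = pd.DataFrame([[1.1, 2, 3, 'k1'], [4.1, 5, 6, 'k1'], [7.1, 8, 9, 'k2']],
                 index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
                 columns=['s1', 's2', 's3', 'kind'])

# Build the grouping key from the column itself rather than from the index.
is_k2 = a['kind'] == 'k2'

# Restrict the aggregation to the numeric columns.
result = a.groupby(is_k2)[['s1', 's2', 's3']].sum()
print(result)
```

The result has two groups, `False` (the two `k1` rows) and `True` (the `k2` row).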
In [57]:
# Group by a component of the timestamp index
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
print(a)
print('-'*10)
print(a.groupby(by = lambda x: x.day).sum())

             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-03  7.1   8   9
----------
    s1  s2  s3
1  1.1   2   3
2  4.1   5   6
3  7.1   8   9
In [59]:
# applymap() / map(): apply a function to every element
# DataFrame.map() requires pandas 2.1 or later; older versions use applymap() instead
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
print(a)
print('-'*10)
print(a.map(lambda x:x/10))

             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-03  7.1   8   9
----------
              s1   s2   s3
2020-01-01  0.11  0.2  0.3
2020-01-02  0.41  0.5  0.6
2020-01-03  0.71  0.8  0.9
In [61]:
# resample(): change the sampling frequency
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-04']),
                 columns =['s1','s2','s3'])
print(a)
print('-'*10)
print(a.resample('2D').mean()) # downsampling: average over 2-day bins
print('-'*10)
print(a.resample('1D').mean()) # regular daily grid: the missing 2020-01-03 appears as NaN

             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-04  7.1   8   9
----------
             s1   s2   s3
2020-01-01  2.6  3.5  4.5
2020-01-03  7.1  8.0  9.0
----------
             s1   s2   s3
2020-01-01  1.1  2.0  3.0
2020-01-02  4.1  5.0  6.0
2020-01-03  NaN  NaN  NaN
2020-01-04  7.1  8.0  9.0
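Resampling to a finer frequency than the data creates new rows that have no source value. The following sketch, on a made-up two-day frame, resamples to 12-hour steps (the `'12h'` frequency is chosen only for illustration) and contrasts `asfreq()`, which leaves the new rows as NaN, with `ffill()`, which repeats the previous value:

```python
import pandas as pd

a = pd.DataFrame({'s1': [1.1, 4.1]},
                 index=pd.to_datetime(['2020-01-01', '2020-01-02']))

up = a.resample('12h').asfreq()        # the new 12:00 row is NaN
up_filled = a.resample('12h').ffill()  # the new row repeats the previous value

print(up)
print(up_filled)
```

Both results have three rows: 2020-01-01 00:00, 2020-01-01 12:00 and 2020-01-02 00:00.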
In [68]:
# interpolate() / fillna(): fill NaN values
# interpolate() fills NaN by interpolation, working down each column by default.
# fillna() fills NaN with a specified value, or with the previous or next value.
a = pd.DataFrame([[2,3,4], [5,np.nan,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(a)
print('-'*10)
b = a.interpolate(method='linear') # linear interpolation down each column
print(b)
print('-'*10)
print(a.interpolate(method='linear',axis = 1)) # linear interpolation along each row

              t    p  q
2020-01-01  2.0  3.0  4
2020-01-02  5.0  NaN  6
2020-01-03  7.1  8.0  9
----------
              t    p  q
2020-01-01  2.0  3.0  4
2020-01-02  5.0  5.5  6
2020-01-03  7.1  8.0  9
----------
              t    p    q
2020-01-01  2.0  3.0  4.0
2020-01-02  5.0  5.5  6.0
2020-01-03  7.1  8.0  9.0
In [72]:
a = pd.DataFrame([[2,3,4], [5,np.nan,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['t','p','q'])
print(a)
print('-'*10)
print(a.fillna(value=999)) # fill with a fixed value
print('-'*10)
print(a.bfill()) # backward fill: use the next valid value in the column
print('-'*10)
print(a.ffill()) # forward fill: use the previous valid value in the column

              t    p  q
2020-01-01  2.0  3.0  4
2020-01-02  5.0  NaN  6
2020-01-03  7.1  8.0  9
----------
              t      p  q
2020-01-01  2.0    3.0  4
2020-01-02  5.0  999.0  6
2020-01-03  7.1    8.0  9
----------
              t    p  q
2020-01-01  2.0  3.0  4
2020-01-02  5.0  8.0  6
2020-01-03  7.1  8.0  9
----------
              t    p  q
2020-01-01  2.0  3.0  4
2020-01-02  5.0  3.0  6
2020-01-03  7.1  8.0  9
In [75]:
# reindex() / reindex_like(): reorder rows and columns to match a given index and column list
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
print(a)
print('-'*10)
print(a.reindex(['2020-01-02','2020-01-01','2020-01-03']))
print('-'*10)
print(a.reindex(index=['2020-01-02','2020-01-01','2020-01-03'],columns=['s2','s1','s3'])) # reorder both rows and columns

             s1  s2  s3
2020-01-01  1.1   2   3
2020-01-02  4.1   5   6
2020-01-03  7.1   8   9
----------
             s1  s2  s3
2020-01-02  4.1   5   6
2020-01-01  1.1   2   3
2020-01-03  7.1   8   9
----------
            s2   s1  s3
2020-01-02   5  4.1   6
2020-01-01   2  1.1   3
2020-01-03   8  7.1   9
In [77]:
# reindex_like(): reorder using the index and columns of an existing DataFrame
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
b = pd.DataFrame(data = None,index=['2020-01-02','2020-01-01','2020-01-03'],columns=['s2','s1','s3'])
print(a.reindex_like(b))

            s2   s1  s3
2020-01-02   5  4.1   6
2020-01-01   2  1.1   3
2020-01-03   8  7.1   9
In [78]:
# reset_index(): reset the index; the old index becomes an ordinary data column
print(a.reset_index())

       index   s1  s2  s3
0 2020-01-01  1.1   2   3
1 2020-01-02  4.1   5   6
2 2020-01-03  7.1   8   9
In [80]:
# rolling(): sliding-window calculations; the returned object supports max(), min(), sum(), mean()
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-01','2020-01-02','2020-01-03']),
                 columns =['s1','s2','s3'])
print(a.rolling(2).mean()) # window defined as a number of rows
# Rolling over a time-based window
print('-'*10)
print(a.rolling('2D').mean()) # with a time-offset string as the window, the minimum window size at the edge is 1

             s1   s2   s3
2020-01-01  NaN  NaN  NaN
2020-01-02  2.6  3.5  4.5
2020-01-03  5.6  6.5  7.5
----------
             s1   s2   s3
2020-01-01  1.1  2.0  3.0
2020-01-02  2.6  3.5  4.5
2020-01-03  5.6  6.5  7.5
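The difference between the two results above comes from the minimum number of observations a window needs before it yields a value. With an integer window this can be set explicitly through `min_periods`. A minimal sketch on made-up values:

```python
import pandas as pd

s = pd.DataFrame({'s1': [1.0, 3.0, 5.0]},
                 index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

strict = s.rolling(2).mean()                   # first row is NaN: only one value available
relaxed = s.rolling(2, min_periods=1).mean()   # first row uses the single available value

print(strict)
print(relaxed)
```

Setting `min_periods=1` makes the row-count window behave like the time-offset window at the start of the series.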
In [82]:
# sort_index(): sort by index
a = pd.DataFrame([[1.1,2,3], [4.1,5,6], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(a)
print('-'*10)
print(a.sort_index(ascending=True)) # ascending
print('-'*10)
print(a.sort_index(ascending=False)) # descending

             s2  s1  s3
2020-01-02  1.1   2   3
2020-01-01  4.1   5   6
2020-01-03  7.1   8   9
----------
             s2  s1  s3
2020-01-01  4.1   5   6
2020-01-02  1.1   2   3
2020-01-03  7.1   8   9
----------
             s2  s1  s3
2020-01-03  7.1   8   9
2020-01-02  1.1   2   3
2020-01-01  4.1   5   6
In [85]:
# sort_values(): sort by the data in a column
a = pd.DataFrame([[8.1,9,3], [4.1,5,1], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(a)
print('-'*10)
print(a.sort_values('s2')) # sort by column s2, ascending
print('-'*10)
print(a.sort_values('s2',ascending=False)) # descending

             s2  s1  s3
2020-01-02  8.1   9   3
2020-01-01  4.1   5   1
2020-01-03  7.1   8   9
----------
             s2  s1  s3
2020-01-01  4.1   5   1
2020-01-03  7.1   8   9
2020-01-02  8.1   9   3
----------
             s2  s1  s3
2020-01-02  8.1   9   3
2020-01-03  7.1   8   9
2020-01-01  4.1   5   1
In [88]:
# max() / min()
print(a.max()) # maximum of each column
print('-'*10)
print(a.max(axis=1)) # maximum of each row

s2    8.1
s1    9.0
s3    9.0
dtype: float64
----------
2020-01-02    9.0
2020-01-01    5.0
2020-01-03    9.0
dtype: float64
In [90]:
# idxmax() / idxmin(): labels of the maximum and minimum values
print(a.idxmax()) # row label of the maximum in each column
print('-'*10)
print(a.idxmax(axis=1)) # column label of the maximum in each row

s2   2020-01-02
s1   2020-01-02
s3   2020-01-03
dtype: datetime64[ns]
----------
2020-01-02    s1
2020-01-01    s1
2020-01-03    s3
dtype: object
In [92]:
# std() / var()
print(a.std())
print('-'*10)
print(a.var())

s2    2.081666
s1    2.081666
s3    4.163332
dtype: float64
----------
s2     4.333333
s1     4.333333
s3    17.333333
dtype: float64
In [93]:
# corr() / corrwith(): correlation coefficients
# method: 'pearson' (default), 'kendall' or 'spearman'
print(a.corr(method='kendall'))

          s2        s1        s3
s2  1.000000  1.000000  0.333333
s1  1.000000  1.000000  0.333333
s3  0.333333  0.333333  1.000000
In [95]:
a = pd.DataFrame([[8.1,9,3], [4.1,5,1], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
b = pd.DataFrame([[1.1,9,3], [2.1,5,5], [3.1,5,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(a.corrwith(b))

s2   -0.240192
s1    0.693375
s3    0.838628
dtype: float64
In [3]:
# cov(): covariance matrix
import pandas as pd
a = pd.DataFrame([[8.1,9,3], [4.1,5,1], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(a.cov())

          s2        s1         s3
s2  4.333333  4.333333   4.333333
s1  4.333333  4.333333   4.333333
s3  4.333333  4.333333  17.333333
In [ ]:
#sum() mean() abs()
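The cell above only names `sum()`, `mean()` and `abs()`. As a brief sketch on a small made-up frame: `sum()` and `mean()` reduce along an axis (columns by default, rows with `axis=1`), while `abs()` works element by element and keeps the shape.

```python
import pandas as pd

a = pd.DataFrame([[-1, 2], [3, -4]], columns=['s1', 's2'])

col_sum = a.sum()           # one value per column
row_mean = a.mean(axis=1)   # one value per row
absolute = a.abs()          # same shape, element-wise absolute value

print(col_sum)
print(row_mean)
print(absolute)
```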
In [ ]:
# to_csv(): write the DataFrame to a CSV file
a.to_csv('xxx.csv')
In [11]:
# astype(): convert data types
import numpy as np
print(a.dtypes)
print('-'*10)
b = a.astype(np.float32) # convert every column
print(b.dtypes)
print('-'*10)
b = a.astype({'s1':np.float32, 's2':np.float64}) # per-column conversion: one dict, one entry per column
print(b.dtypes)

s2    float64
s1      int64
s3      int64
dtype: object
----------
s2    float32
s1    float32
s3    float32
dtype: object
----------
s2    float64
s1    float32
s3      int64
dtype: object
6. Common pandas functions
In [13]:
# to_numeric(): convert a sequence to a numeric type
a = pd.DataFrame([[8.1,9,'3'], [4.1,5,1], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(a.dtypes)
print('-'*10)
a['s3'] = pd.to_numeric(a['s3'])
print(a.dtypes)

s2    float64
s1      int64
s3     object
dtype: object
----------
s2    float64
s1      int64
s3      int64
dtype: object
In [15]:
# to_datetime(): convert a sequence of date strings to timestamps
a = pd.to_datetime(['20200228','20200301'])
print(a)

DatetimeIndex(['2020-02-28', '2020-03-01'], dtype='datetime64[ns]', freq=None)
In [17]:
# to_timedelta(): convert durations to the timedelta type
a = pd.to_timedelta(['1 day','10 min','20 s'])
print(a)

TimedeltaIndex(['1 days 00:00:00', '0 days 00:10:00', '0 days 00:00:20'], dtype='timedelta64[ns]', freq=None)
In [22]:
# date_range(): generate an evenly spaced sequence of timestamps
# periods sets the number of elements; freq sets the step and defaults to one day ('D') when omitted
a = pd.date_range(start = '2020-03-01', periods = 3, freq = '3h')
print(a)
b = pd.date_range(start = '2020-03-01', end = '2020-03-03',freq = 'D')
print(b)
c = pd.date_range(start = '2020-03-01', end = '2020-03-03')
print(c)

DatetimeIndex(['2020-03-01 00:00:00', '2020-03-01 03:00:00',
               '2020-03-01 06:00:00'],
              dtype='datetime64[ns]', freq='3h')
DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03'], dtype='datetime64[ns]', freq='D')
In [ ]:
# merge(): join two pd.DataFrame objects on matching values, similar to VLOOKUP in Excel
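The `merge()` cell has no code, so here is a hedged sketch of a VLOOKUP-style lookup. The frames `obs` and `info`, the key column `station` and all values are invented for the example; `how='left'` keeps every row of the left frame and fills unmatched lookups with NaN.

```python
import pandas as pd

obs = pd.DataFrame({'station': ['A', 'B', 'C'], 't': [1.1, 4.1, 7.1]})
info = pd.DataFrame({'station': ['A', 'B'], 'name': ['north', 'south']})

# Look up 'name' for each row of obs by matching on the 'station' column.
merged = pd.merge(obs, info, on='station', how='left')
print(merged)
```

Station `C` has no entry in `info`, so its `name` is NaN in the result. With the default `how='inner'` that row would be dropped instead.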
In [24]:
# concat(): combine several pd.DataFrame objects
a = pd.DataFrame([[8.1,9,'3'], [4.1,5,1], [7.1,8,9]],
                 index = pd.to_datetime(['2020-01-02','2020-01-01','2020-01-03']),
                 columns =['s2','s1','s3'])
print(pd.concat([a,a])) # stack rows
print('-'*10)
print(pd.concat([a,a],axis=1)) # place side by side

             s2  s1 s3
2020-01-02  8.1   9  3
2020-01-01  4.1   5  1
2020-01-03  7.1   8  9
2020-01-02  8.1   9  3
2020-01-01  4.1   5  1
2020-01-03  7.1   8  9
----------
             s2  s1 s3   s2  s1 s3
2020-01-02  8.1   9  3  8.1   9  3
2020-01-01  4.1   5  1  4.1   5  1
2020-01-03  7.1   8  9  7.1   8  9
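Stacking frames with `concat()` repeats the original row labels, as the first output above shows. Two common ways to deal with that, sketched here on a small made-up frame, are `ignore_index=True` (discard the old labels) and `keys=` (add an outer index level naming each source; the key names are illustrative):

```python
import pandas as pd

a = pd.DataFrame({'s1': [1, 2]}, index=['x', 'y'])

stacked = pd.concat([a, a], ignore_index=True)            # fresh 0..3 index
labelled = pd.concat([a, a], keys=['first', 'second'])    # two-level index

print(stacked)
print(labelled)
```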