[Python] Feature Derivation

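This article covers feature derivation methods used in data preprocessing: recoding and higher-order polynomials for a single variable; arithmetic operations and polynomial derivation for two variables; cross-combination features; and grouped-statistics features such as group means and quantiles. It also covers datetime fields, including time format conversion and time-difference derivation.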

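1. Single-variable feature derivation

1.1 Data recoding

  • Continuous variables
    Standardization: 0-1 (min-max) scaling, Z-Score standardization
    Discretization: equal-width binning, equal-frequency binning, clustering-based binning
  • Discrete variables
    Non-numeric -> numeric: natural-number encoding, dictionary encoding
    Column -> new columns: one-hot encoding, dummy-variable encoding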
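
The recoding options listed above can be sketched with pandas alone. This is a minimal illustration rather than a recommended pipeline; the frame `df_demo`, its columns `age` and `city`, and the sample values are made up for the example, and clustering-based binning is left out.

```python
import pandas as pd

# Hypothetical toy data: one continuous and one discrete column
df_demo = pd.DataFrame({'age': [18, 25, 40, 60],
                        'city': ['A', 'B', 'A', 'C']})
age = df_demo['age']

# Continuous variable: 0-1 (min-max) scaling and Z-Score standardization
df_demo['age_minmax'] = (age - age.min()) / (age.max() - age.min())
df_demo['age_zscore'] = (age - age.mean()) / age.std(ddof=0)

# Continuous variable: equal-width and equal-frequency binning (bin index as label)
df_demo['age_bin_width'] = pd.cut(age, bins=2, labels=False)
df_demo['age_bin_freq'] = pd.qcut(age, q=2, labels=False)

# Discrete variable: natural-number encoding and one-hot encoding
df_demo['city_code'] = df_demo['city'].astype('category').cat.codes
city_onehot = pd.get_dummies(df_demo['city'], prefix='city')
```

Passing `drop_first=True` to `pd.get_dummies` would give the dummy-variable form, with one fewer column per original variable.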

1.2 Higher-order polynomials

  • Principle
    X >>> X^2, X^3, X^4, …
  • Implementation
    This can be done by hand or with the PolynomialFeatures transformer in sklearn.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x1 = np.array([1, 2, 3])
# (3,) >>> (3,1)
x1.reshape(-1, 1)
'''
array([[1],
       [2],
       [3]])
'''
# Powers 0 through 5 (include_bias=True by default, which adds the 0th-power column of ones)
PolynomialFeatures(degree=5).fit_transform(x1.reshape(-1, 1))
'''
array([[  1.,   1.,   1.,   1.,   1.,   1.],
       [  1.,   2.,   4.,   8.,  16.,  32.],
       [  1.,   3.,   9.,  27.,  81., 243.]])
'''

2. Two-variable feature derivation

2.1 Arithmetic operations

X1, X2 >>> X1+X2, X1-X2, X1*X2, X1/X2
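
As a minimal sketch of the four operations on two columns (the frame `df_arith` and its values are made up for illustration), the only step that needs care is division, where a zero denominator is mapped to NaN here instead of producing inf:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; X2 contains a zero to show the division guard
df_arith = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 4, 0]})

df_arith['X1+X2'] = df_arith['X1'] + df_arith['X2']
df_arith['X1-X2'] = df_arith['X1'] - df_arith['X2']
df_arith['X1*X2'] = df_arith['X1'] * df_arith['X2']
# Division: replace a zero denominator with NaN so the result is NaN, not inf
df_arith['X1/X2'] = df_arith['X1'] / df_arith['X2'].replace(0, np.nan)
```

Whether NaN, a fill value, or a small epsilon is the right treatment for zero denominators depends on the downstream model; NaN is used here only to keep the example explicit.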

2.2 Polynomial derivation

2.2.1 Imports & data

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

df = pd.DataFrame({'X1':[1,2,3], 'X2':[2,3,4]})
df
'''
	X1	X2
0	1	2
1	2	3
2	3	4
'''

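2.2.2 Second-order derivation

'''
Polynomial derivation
X1,X2 >>> X1,X2 | X1^2,X1*X2,X2^2
X1,X2,X3 >>> X1,X2,X3 | X1^2,X1*X2,X1*X3,X2^2,X2*X3,X3^2

include_bias: default True; adds the 0th-power column (all ones)
interaction_only: default False; if True, powers of a single feature (e.g. X1^2) are dropped and only products of distinct features are created
'''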
PolynomialFeatures(degree=2, include_bias=False).fit_transform(df)
'''
array([[ 1.,  2.,  1.,  2.,  4.],
       [ 2.,  3.,  4.,  6.,  9.],
       [ 3.,  4.,  9., 12., 16.]])
'''

2.2.3 Third-order derivation

'''
X1,X2 >>> X1,X2 | X1^2,X1*X2,X2^2 | X1^3,X1^2*X2,X1*X2^2,X2^3
'''
PolynomialFeatures(degree=3, include_bias=False).fit_transform(df)
'''
array([[ 1.,  2.,  1.,  2.,  4.,  1.,  2.,  4.,  8.],
       [ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.],
       [ 3.,  4.,  9., 12., 16., 27., 36., 48., 64.]])
'''

Creating column names alongside the features

df = pd.DataFrame({'X1':[1,2,3]
                  ,'X2':[2,3,4]
                  ,'X3':[1,0,0]})
df
'''
	X1	X2	X3
0	1	2	1
1	2	3	0
2	3	4	0
'''
# Select X1 and X2 for third-order derivation
colNames = ['X1', 'X2']
degree = 3
colNames_new = []
# Generate the new column names
for deg in range(2, degree+1):
    for i in range(deg+1):
        col_temp = colNames[0] + '^' + str(deg-i) + '*' + colNames[1] + '^'  + str(i)
        colNames_new.append(col_temp)
colNames_new
'''
['X1^2*X2^0',
 'X1^1*X2^1',
 'X1^0*X2^2',
 'X1^3*X2^0',
 'X1^2*X2^1',
 'X1^1*X2^2',
 'X1^0*X2^3']
'''
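
The loop above only produces the names. A minimal sketch that computes the matching feature values in the same loop, using plain pandas arithmetic instead of PolynomialFeatures, could look like this (the names `df_poly`, `new_cols` and `features_poly` are introduced here for illustration):

```python
import pandas as pd

df_poly = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 3, 4], 'X3': [1, 0, 0]})
col_a, col_b = 'X1', 'X2'
degree = 3

new_cols = {}
for deg in range(2, degree + 1):
    for i in range(deg + 1):
        # Same naming scheme as above: X1^(deg-i)*X2^i
        name = f'{col_a}^{deg - i}*{col_b}^{i}'
        new_cols[name] = df_poly[col_a] ** (deg - i) * df_poly[col_b] ** i

# One DataFrame holding the derived features under their generated names
features_poly = pd.DataFrame(new_cols)
```

The column order matches the order of the terms of degree 2 and 3 in the PolynomialFeatures output shown earlier, so the two results can be compared column by column.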

3. Cross-combination feature derivation

3.1 Imports & data

df = pd.DataFrame({'SeniorCitizen':[0,0,0,0,0]
                  ,'Partner':['Yes','No','No','No','No']
                  ,'Dependents':['No','No','No','No','No']})
df
'''

SeniorCitizen	Partner	Dependents
0	0	Yes	No
1	0	No	No
2	0	No	No
3	0	No	No
4	0	No	No
'''

3.2 Generating the derived columns and their names

colNames = ['SeniorCitizen', 'Partner', 'Dependents']
colNames_new_l = []
features_new_l = []

for col_index, col_name in enumerate(colNames):
    print(col_index, col_name)
'''
0 SeniorCitizen
1 Partner
2 Dependents
'''
# Names of the derived columns
for col_index, col_name in enumerate(colNames):
    for col_sub_index in range(col_index+1, len(colNames)):
        newNames = col_name + ' & ' + colNames[col_sub_index]
        print(newNames)
'''
SeniorCitizen & Partner
SeniorCitizen & Dependents
Partner & Dependents
'''
# Names of the derived columns and the features themselves
for col_index, col_name in enumerate(colNames):
    for col_sub_index in range(col_index+1, len(colNames)):
        newNames = col_name + '&' + colNames[col_sub_index]
        colNames_new_l.append(newNames)
        newDF = pd.Series(df[col_name].astype('str')
                         + '&'
                         + df[colNames[col_sub_index]].astype('str')
                         ,name=newNames)
        features_new_l.append(newDF)
        
features_new = pd.concat(features_new_l, axis=1)
features_new
'''

SeniorCitizen&Partner	SeniorCitizen&Dependents	Partner&Dependents
0	0&Yes	0&No	Yes&No
1	0&No	0&No	No&No
2	0&No	0&No	No&No
3	0&No	0&No	No&No
4	0&No	0&No	No&No
'''
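
The crossed columns are still strings, so most models need them encoded before use. One possible follow-up step, sketched here with `pd.get_dummies` on a made-up two-column frame (`df_cross` and its values are assumptions for the example), is to one-hot encode each combination:

```python
import pandas as pd

# Hypothetical toy data with two discrete columns
df_cross = pd.DataFrame({'Partner': ['Yes', 'No', 'No'],
                         'Dependents': ['No', 'No', 'Yes']})

# Cross the two columns into a single string-valued feature
crossed = (df_cross['Partner'] + '&' + df_cross['Dependents']).rename('Partner&Dependents')

# One indicator column per observed combination
crossed_onehot = pd.get_dummies(crossed, prefix=crossed.name)
```

Only combinations that actually occur in the data get a column, so the width of the result depends on the data and not on the full Cartesian product of levels.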

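4. Grouped-statistics feature derivation

4.1 How grouped statistics work

'''
Feature B is grouped by the distinct values of feature A and then summarized.
The statistic can be the mean, variance, etc. (typical for continuous variables) or the mode, quantiles, etc. (typical for discrete variables).
Example: Monthly_Charges summarized within each tenure value
ID tenure Monthly_Charges | mean min max
1  1      3               | 5    3   7
2  2      10              | 11   10  12
3  3      15              | 15   15  15
4  2      12              | 11   10  12
5  1      7               | 5    3   7
'''
'''
1. B may be discrete or continuous; A must be discrete.
   A should preferably be a discrete variable with many distinct values (or a continuous variable that takes a fixed set of values); otherwise many rows receive identical derived values.
2. The statistics computed for B are not restricted by its type.
   Discrete features can also use the mean, variance, skewness, kurtosis, etc., and continuous features can also use the mode, quantiles, etc.
3. Grouped statistics can be used when joining multiple tables.
   ID amount category      ID amount_mean amount_total category_mode
   1  11     A             1  10.6667     32           C
   2  5      B       >>>   2  7.33333     22           A
   3  1      A             3  1           2            A
4. Arithmetic derivation can be applied on top of the grouped statistics.
   ID tenure Monthly_Charges | mean | Monthly_Charges-mean  Monthly_Charges/mean
   1  1      3               | 5    |  -2                    0.6
   2  2      10              | 11   |  -1                    0.909
   3  3      15              | 15   |  0                     1
'''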
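
The worked example above (group statistics plus the difference and ratio from point 4) can be reproduced in a few lines. This sketch uses `groupby(...).transform`, which returns one value per original row and so needs no merge; the frame `df_grp` and the new column names are chosen here for illustration:

```python
import pandas as pd

# Same five rows as the worked example above
df_grp = pd.DataFrame({'tenure': [1, 2, 3, 2, 1],
                       'Monthly_Charges': [3, 10, 15, 12, 7]})

# Group statistics of Monthly_Charges within each tenure value, aligned to the rows
for stat in ['mean', 'min', 'max']:
    df_grp[f'Monthly_Charges_tenure_{stat}'] = (
        df_grp.groupby('tenure')['Monthly_Charges'].transform(stat))

# Arithmetic derivation on top of the group mean
df_grp['diff_from_mean'] = df_grp['Monthly_Charges'] - df_grp['Monthly_Charges_tenure_mean']
df_grp['ratio_to_mean'] = df_grp['Monthly_Charges'] / df_grp['Monthly_Charges_tenure_mean']
```

`transform` is convenient for a handful of statistics; the `agg` plus `merge` route used in the next section scales better when many statistics and columns are involved.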

4.2 Procedure

4.2.1 Data preparation

data = pd.DataFrame({'tenure': [1, 3, 2, 4, 2, 3, 4]
             ,'SeniorCitizen': [0,0,2,0,0,1,3]
             ,'MonthlyCharges': [29.85,56.95,53.85,42.30,70.70,73.12,76.37]
             ,'gender': ['F','M','M','F','M','M','F']})
# Extract the target columns
colNames = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
features_temp = data[colNames]
features_temp
'''
	tenure	SeniorCitizen	MonthlyCharges
0	1	0	29.85
1	3	0	56.95
2	2	2	53.85
3	4	0	42.30
4	2	0	70.70
5	3	1	73.12
6	4	3	76.37
'''

4.2.2 Derivation with a single statistic

# Group means of the other variables for each tenure value
features_temp.groupby('tenure').mean()
'''
	SeniorCitizen	MonthlyCharges
tenure
1	0.0	29.850
2	1.0	62.275
3	0.5	65.035
4	1.5	59.335
'''

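4.2.3 Derivation with multiple statistics (new data + new column names)

colNames = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
# Columns to aggregate within each group
colNames_sub = ['SeniorCitizen', 'MonthlyCharges']

# Statistics to compute for each column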
aggs = {}
for col in colNames_sub:
    aggs[col] = ['mean', 'min', 'max']
aggs
'''
    {'SeniorCitizen': ['mean', 'min', 'max'],
     'MonthlyCharges': ['mean', 'min', 'max']}
'''
# Build the new column names
cols = ['tenure']
for key in aggs.keys():
    cols.extend([key+'_'+'tenure'+'_'+stat for stat in aggs[key]])
cols
'''
    ['tenure',
     'SeniorCitizen_tenure_mean',
     'SeniorCitizen_tenure_min',
     'SeniorCitizen_tenure_max',
     'MonthlyCharges_tenure_mean',
     'MonthlyCharges_tenure_min',
     'MonthlyCharges_tenure_max']
'''
'''
df.agg('mean') computes column means
df.agg('mean', axis=1) computes row means
df.agg({'row/column name 1': ['mean', 'min', 'max']
       ,'row/column name 2': ['func1', 'func2', 'func3']})

df.reset_index() resets the index so that 'tenure' becomes an ordinary column
'''
features_new = features_temp.groupby('tenure').agg(aggs).reset_index()
features_new
'''
	tenure	SeniorCitizen			MonthlyCharges
		mean	min	max	mean	min	max
0	1	0.0	0	0	29.850	29.85	29.85
1	2	1.0	0	2	62.275	53.85	70.70
2	3	0.5	0	1	65.035	56.95	73.12
3	4	1.5	0	3	59.335	42.30	76.37
'''
# Reset the column names
features_new.columns = cols
features_new
'''
	tenure	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_min	SeniorCitizen_tenure_max	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_min	MonthlyCharges_tenure_max
0	1	0.0	0	0	29.850	29.85	29.85
1	2	1.0	0	2	62.275	53.85	70.70
2	3	0.5	0	1	65.035	56.95	73.12
3	4	1.5	0	3	59.335	42.30	76.37
'''
# Left join: attach the columns of features_new to data, matching on tenure
data_new = pd.merge(data, features_new, how='left', on='tenure')
data_new
'''
	tenure	SeniorCitizen	MonthlyCharges	gender	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_min	SeniorCitizen_tenure_max	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_min	MonthlyCharges_tenure_max
0	1	0	29.85	F	0.0	0	0	29.850	29.85	29.85
1	3	0	56.95	M	0.5	0	1	65.035	56.95	73.12
2	2	2	53.85	M	1.0	0	2	62.275	53.85	70.70
3	4	0	42.30	F	1.5	0	3	59.335	42.30	76.37
4	2	0	70.70	M	1.0	0	2	62.275	53.85	70.70
5	3	1	73.12	M	0.5	0	1	65.035	56.95	73.12
6	4	3	76.37	F	1.5	0	3	59.335	42.30	76.37
'''

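4.2.4 Common statistics

# Statistics
'''
- mean/var: mean, variance
- max/min: maximum, minimum
- skew: skewness of the distribution; negative means left-skewed, positive means right-skewed
- median: median
- count: number of values
- nunique: number of distinct values
- quantile: quantile (a string alias in .agg() cannot carry the quantile level, so levels such as 0.25 or 0.75 need a custom function)
'''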
a = np.array([[1,2,3,2,4,1]
             ,[0,0,0,1,1,1]])
df = pd.DataFrame(a.T, columns=['x1','x2'])
df
'''
	x1	x2
0	1	0
1	2	0
2	3	0
3	2	1
4	4	1
5	1	1
'''
aggs = {'x1': ['mean', 'var', 'max', 'min', 'skew', 'median', 'count', 'nunique']}
df.groupby('x2').agg(aggs).reset_index()
'''
	x2	x1
		mean	var	max	min	skew	median	count	nunique
0	0	2.000000	1.000000	3	1	0.00000	2.0	3	3
1	1	2.333333	2.333333	4	1	0.93522	2.0	3	3
'''

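4.2.5 Quantile statistics

# quantile
def q1(x):
    '''
    Lower quartile (25th percentile)
    '''
    return x.quantile(0.25)
def q2(x):
    '''
    Upper quartile (75th percentile)
    '''
    return x.quantile(0.75)
'''
aggs = {'x2': ['q1', 'q2']}
The first definition (strings) is used to build the column names
aggs = {'x2': [q1, q2]}
The second definition (function objects) is what gets passed to .agg()
'''
# Summarize x2 within groups of x1
colNames = ['x2']
keyCol = 'x1'

# First definition of aggs: helps build the column names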
aggs = {}
for col in colNames:
    aggs[col] = ['q1', 'q2']
aggs
'''
    {'x2': ['q1', 'q2']}
'''
# Names of the new columns
cols = [keyCol]
for key in aggs.keys():
    cols.extend([key+'_'+keyCol+'_'+stat for stat in aggs[key]])
cols
'''
    ['x1', 'x2_x1_q1', 'x2_x1_q2']
'''
# Second definition of aggs: used with groupby for the grouped computation
aggs = {}
for col in colNames:
    aggs[col] = [q1, q2]
aggs
'''
    {'x2': [<function __main__.q1(x)>, <function __main__.q2(x)>]}
'''
d2 = df.groupby(keyCol).agg(aggs).reset_index()
d2
'''
	x1	x2
		q1	q2
0	1	0.25	0.75
1	2	0.25	0.75
2	3	0.00	0.00
3	4	1.00	1.00
'''
d2.columns = cols
d2
'''
	x1	x2_x1_q1	x2_x1_q2
0	1	0.25	0.75
1	2	0.25	0.75
2	3	0.00	0.00
3	4	1.00	1.00
'''
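
Defining `aggs` twice can be avoided if each quantile function carries its own label. The sketch below is one possible variant, not the approach used above: `make_quantile` is a hypothetical helper introduced here that sets `__name__` on the returned function, so the same list serves both the `.agg()` call and the column names.

```python
import pandas as pd

def make_quantile(q, name):
    """Hypothetical helper: return a function computing the q-th quantile, labelled `name`."""
    def _quantile(x):
        return x.quantile(q)
    _quantile.__name__ = name
    return _quantile

# Same data as the x1/x2 example above
df_q = pd.DataFrame({'x1': [1, 2, 3, 2, 4, 1], 'x2': [0, 0, 0, 1, 1, 1]})
key_col, target = 'x1', 'x2'
funcs = [make_quantile(0.25, 'q1'), make_quantile(0.75, 'q2')]

# Grouped quantiles of x2 within x1, then the target_key_stat naming scheme
d_q = df_q.groupby(key_col)[target].agg(funcs).reset_index()
d_q.columns = [key_col] + [f'{target}_{key_col}_{f.__name__}' for f in funcs]
```

Adding another level, for example the 10th percentile, then only requires one more `make_quantile(0.1, 'p10')` entry in `funcs`.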

4.3 Wrapping the grouped statistics in a function

# Grouped-statistics function
def Binary_Group_Statistics(keyCol
                           ,features
                           ,col_num=None
                           ,col_cat=None
                           ,num_stat=['mean', 'var', 'max', 'min', 'skew', 'median']
                           ,cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique']
                           ,quant=True):
    """
    Two-variable grouped-statistics feature derivation.
    
    :param keyCol: key column used for grouping
    :param features: original dataset
    :param col_num: list of continuous columns to derive from
    :param col_cat: list of discrete columns to derive from
    :param num_stat: grouped statistics for continuous columns
    :param cat_stat: grouped statistics for discrete columns
    :param quant: whether to compute quartiles (for col_num when given, otherwise for col_cat)
    
    :return: the derived features (one row per row of `features`) and their names
    """
    # Without any column to aggregate, colNames below would be undefined
    if col_num is None and col_cat is None:
        raise ValueError('Provide at least one of col_num and col_cat')
    # Case: continuous columns are provided
    if col_num is not None:
        aggs_num = {}
        colNames = col_num
        # Build the dict required by agg
        for col in col_num:
            aggs_num[col] = num_stat
        # Build the list of derived feature names
        cols_num = [keyCol]
        for key in aggs_num.keys():
            cols_num.extend([key+'_'+keyCol+'_'+stat for stat in aggs_num[key]])
        # Build the derived-feature DataFrame
        features_num_new = features[col_num+[keyCol]].groupby(keyCol).agg(aggs_num).reset_index()
        features_num_new.columns = cols_num
        # Case: both continuous and discrete columns are provided
        if col_cat is not None:
            aggs_cat = {}
            # Build the dict required by agg
            for col in col_cat:
                aggs_cat[col] = cat_stat
            # Build the list of derived feature names
            cols_cat = [keyCol]
            for key in aggs_cat.keys():
                cols_cat.extend([key+'_'+keyCol+'_'+stat for stat in aggs_cat[key]])
            # Build the derived-feature DataFrame
            features_cat_new = features[col_cat+[keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            
            # Merge the continuous and discrete results, then align them with the original rows
            df_temp = pd.merge(features_num_new, features_cat_new, how='left', on=keyCol)
            features_new = pd.merge(features[keyCol], df_temp, how='left', on=keyCol)
            colNames_new = cols_num + cols_cat
            # keyCol appears once in each name list, so it is removed twice
            colNames_new.remove(keyCol)
            colNames_new.remove(keyCol)
        # Case: only continuous columns
        else:
            # Align the continuous results with the original rows
            features_new = pd.merge(features[keyCol], features_num_new, how='left', on=keyCol)
            colNames_new = cols_num
            colNames_new.remove(keyCol)
    # Case: only discrete columns
    else:
        aggs_cat = {}
        colNames = col_cat
        for col in col_cat:
            aggs_cat[col] = cat_stat
        cols_cat = [keyCol]
        for key in aggs_cat.keys():
            cols_cat.extend([key+'_'+keyCol+'_'+stat for stat in aggs_cat[key]])
        features_cat_new = features[col_cat+[keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
        features_cat_new.columns = cols_cat
        features_new = pd.merge(features[keyCol], features_cat_new, how='left', on=keyCol)
        colNames_new = cols_cat
        colNames_new.remove(keyCol)
            
    if quant:
        # Quartile functions
        def q1(x):
            return x.quantile(0.25) # lower quartile
        def q2(x):
            return x.quantile(0.75) # upper quartile
            
        # Column names first (strings), then the functions passed to agg
        cols = [keyCol]
        for col in colNames:
            cols.extend([col+'_'+keyCol+'_'+stat for stat in ['q1', 'q2']])
        aggs = {}
        for col in colNames:
            aggs[col] = [q1, q2]
        features_temp = features[colNames+[keyCol]].groupby(keyCol).agg(aggs).reset_index()
        features_temp.columns = cols
            
        features_new = pd.merge(features_new, features_temp, how='left', on=keyCol)
        colNames_new = colNames_new + cols
        colNames_new.remove(keyCol)
    # Drop the key column so that only newly derived columns are returned
    features_new.drop([keyCol], axis=1, inplace=True)
    return features_new, colNames_new
# Usage example
features = pd.DataFrame({'tenure': [1, 3, 2, 4, 2, 3, 4]
             ,'SeniorCitizen': [0,0,2,0,0,1,3]
             ,'MonthlyCharges': [29.85,56.95,53.85,42.30,70.70,73.12,76.37]
             ,'gender': ['F','M','M','F','M','M','F']})
col_num = ['MonthlyCharges']
col_cat = ['SeniorCitizen']
keyCol = 'tenure'
df, col = Binary_Group_Statistics(keyCol, features, col_num, col_cat)

df # tenure is not included; every column is newly derived
'''
	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_var	MonthlyCharges_tenure_max	MonthlyCharges_tenure_min	MonthlyCharges_tenure_skew	MonthlyCharges_tenure_median	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_var	SeniorCitizen_tenure_max	SeniorCitizen_tenure_min	SeniorCitizen_tenure_median	SeniorCitizen_tenure_count	SeniorCitizen_tenure_nunique	MonthlyCharges_tenure_q1	MonthlyCharges_tenure_q2
0	29.850	NaN	29.85	29.85	NaN	29.850	0.0	NaN	0	0	0.0	1	1	29.8500	29.8500
1	65.035	130.73445	73.12	56.95	NaN	65.035	0.5	0.5	1	0	0.5	2	2	60.9925	69.0775
2	62.275	141.96125	70.70	53.85	NaN	62.275	1.0	2.0	2	0	1.0	2	2	58.0625	66.4875
3	59.335	580.38245	76.37	42.30	NaN	59.335	1.5	4.5	3	0	1.5	2	2	50.8175	67.8525
4	62.275	141.96125	70.70	53.85	NaN	62.275	1.0	2.0	2	0	1.0	2	2	58.0625	66.4875
5	65.035	130.73445	73.12	56.95	NaN	65.035	0.5	0.5	1	0	0.5	2	2	60.9925	69.0775
6	59.335	580.38245	76.37	42.30	NaN	59.335	1.5	4.5	3	0	1.5	2	2	50.8175	67.8525
'''
col
'''
    ['MonthlyCharges_tenure_mean',
     'MonthlyCharges_tenure_var',
     'MonthlyCharges_tenure_max',
     'MonthlyCharges_tenure_min',
     'MonthlyCharges_tenure_skew',
     'MonthlyCharges_tenure_median',
     'SeniorCitizen_tenure_mean',
     'SeniorCitizen_tenure_var',
     'SeniorCitizen_tenure_max',
     'SeniorCitizen_tenure_min',
     'SeniorCitizen_tenure_median',
     'SeniorCitizen_tenure_count',
     'SeniorCitizen_tenure_nunique',
     'MonthlyCharges_tenure_q1',
     'MonthlyCharges_tenure_q2']
'''

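5. Datetime field feature derivation

  • In essence, this adds more grouping variables
  • It helps reveal patterns, e.g. user churn concentrated in a particular quarter
  • Derive features from natural and business cycles, e.g. seasons and peak/off-peak travel seasons
  • Derive features from the time difference to key time points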
5.1 Time format conversion: pd.to_datetime()

t = pd.DataFrame()
t['time'] = ['2022-01-03;02:31:52'
            ,'2022-07-01;14:22:01'
            ,'2022-08-22;08:02:31'
            ,'2022-04-30;11:41:31'
            ,'2022-05-02;22:01:27']
'''
Alternatively, pass a dict
t = pd.DataFrame({'time': ['2022-01-03;02:31:52'
                          ,'2022-07-01;14:22:01'
                          ,'2022-08-22;08:02:31'
                          ,'2022-04-30;11:41:31'
                          ,'2022-05-02;22:01:27']})
'''
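# Date and time are separated by ';', so the format is stated explicitly
t['time'] = pd.to_datetime(t['time'], format='%Y-%m-%d;%H:%M:%S')
t['time'], t['time'][0] # each element is a Timestamp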
'''
    (0   2022-01-03 02:31:52
     1   2022-07-01 14:22:01
     2   2022-08-22 08:02:31
     3   2022-04-30 11:41:31
     4   2022-05-02 22:01:27
     Name: time, dtype: datetime64[ns],
     Timestamp('2022-01-03 02:31:52'))

'''
# Convert the precision: D, h, s, ms, ns
t['time'].values, t['time'].values.astype('datetime64[D]')
'''
    (array(['2022-01-03T02:31:52.000000000', '2022-07-01T14:22:01.000000000',
            '2022-08-22T08:02:31.000000000', '2022-04-30T11:41:31.000000000',
            '2022-05-02T22:01:27.000000000'], dtype='datetime64[ns]'),
     array(['2022-01-03', '2022-07-01', '2022-08-22', '2022-04-30',
            '2022-05-02'], dtype='datetime64[D]'))
'''
t['time-D'] = t['time'].values.astype('datetime64[D]')
t
'''
	time	time-D
0	2022-01-03 02:31:52	2022-01-03
1	2022-07-01 14:22:01	2022-07-01
2	2022-08-22 08:02:31	2022-08-22
3	2022-04-30 11:41:31	2022-04-30
4	2022-05-02 22:01:27	2022-05-02
'''

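5.2 Extracting information from datetime fields: dt.year/month/day…

'''
- Year, month, day, hour, minute, second
dt.year dt.month dt.day dt.hour dt.minute dt.second
- Quarter, week of the year, day of the week
dt.quarter dt.isocalendar().week dt.dayofweek/dt.weekday
(dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week replaces it)
'''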
t['time'].dt.day
'''
    0     3
    1     1
    2    22
    3    30
    4     2
    Name: time, dtype: int64
'''
t['year'] = t['time'].dt.year
t['month'] = t['time'].dt.month
t['day'] = t['time'].dt.day
t['hour'] = t['time'].dt.hour
t['minute'] = t['time'].dt.minute
t['second'] = t['time'].dt.second

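# dayofweek counts from 0 (Monday), so +1 gives 1 = Monday ... 7 = Sunday
t['dayofweek'] = t['time'].dt.dayofweek + 1
t['quarter'] = t['time'].dt.quarter
# dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the ISO week number
t['weekofyear'] = t['time'].dt.isocalendar().week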
t
'''
	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18
'''
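# Whether the day falls on a weekend
t['weekend'] = (t['dayofweek'] > 5).astype(int)
# Part of the day (early morning, morning, afternoon, evening): 6-hour blocks, i.e. integer division by 6
t['hour_section'] = (t['hour'] // 6).astype(int)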
t
'''
	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear	weekend	hour_section
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1	0	0
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26	0	2
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34	0	1
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17	1	1
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18	0	3
'''

5.3 Time-difference derivation

5.3.1 Column-to-column operations

t['time'] - t['time']
'''
    0   0 days
    1   0 days
    2   0 days
    3   0 days
    4   0 days
    Name: time, dtype: timedelta64[ns]
'''

5.3.2 Column-to-timestamp operations

p1 = '2022-01-03;02:31:52'
pd.Timestamp(p1), pd.Timestamp(p1).year
'''
    (Timestamp('2022-01-03 02:31:52'), 2022)
'''
t['time_diff'] = t['time'] - pd.Timestamp(p1)
t['time_diff']
'''
    0     0 days 00:00:00
    1   179 days 11:50:09
    2   231 days 05:30:39
    3   117 days 09:09:39
    4   119 days 19:29:35
    Name: time_diff, dtype: timedelta64[ns]
'''
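'''
- Extracting information from timedelta values
months      days     seconds
dt.days/30  dt.days  dt.seconds

dt.seconds is only the sub-day remainder (whole days are excluded);
dt.days is the whole-day count (hours, minutes and seconds are excluded).
dt.days/30 is only an approximate number of months.
'''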
t['time_diff'].dt.seconds
'''
    0        0
    1    42609
    2    19839
    3    32979
    4    70175
    Name: time_diff, dtype: int64
'''
# True difference in seconds, including the days, hours and minutes
t['time_diff_s'] = t['time_diff'].dt.total_seconds().astype('int64')
t['time_diff_s']
'''
    0           0
    1    15508209
    2    19978239
    3    10141779
    4    10351775
    Name: time_diff_s, dtype: int64
'''
t
'''
	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear	weekend	hour_section	time_diff	time_diff_s
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1	0	0	0 days 00:00:00	0
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26	0	2	179 days 11:50:09	15508209
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34	0	1	231 days 05:30:39	19978239
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17	1	1	117 days 09:09:39	10141779
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18	0	3	119 days 19:29:35	10351775
'''

5.3.3 Choosing the reference timestamp

t['time'].max(), t['time'].min()
'''
    (Timestamp('2022-08-22 08:02:31'), Timestamp('2022-01-03 02:31:52'))
'''
import datetime

# Current time, with microsecond precision
print(datetime.datetime.now())
# Formatted as year-month-day
print('-')
print(datetime.datetime.now().strftime('%Y-%m-%d'))
print(pd.Timestamp(datetime.datetime.now().strftime('%Y-%m-%d')))
# Formatted as year-month-day hour:minute:second
print('-')
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
print(pd.Timestamp(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
'''
    2023-02-21 10:35:29.455932
    -
    2023-02-21
    2023-02-21 00:00:00
    -
    2023-02-21 10:35:29
    2023-02-21 10:35:29
'''
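
Once a reference timestamp is chosen, the difference features follow the same pattern as in 5.3.2. The sketch below uses the earliest and latest timestamps of the column itself as the two references, on three of the timestamps from above; the variable names are introduced here for illustration.

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['2022-01-03 02:31:52',
                                  '2022-07-01 14:22:01',
                                  '2022-08-22 08:02:31']))

# Time elapsed since the earliest record and time remaining until the latest one
since_first = times - times.min()
until_last = times.max() - times

# Whole days, and the full difference expressed in seconds
days_since_first = since_first.dt.days
seconds_since_first = since_first.dt.total_seconds().astype('int64')
days_until_last = until_last.dt.days
```

A fixed business date or the current time from `datetime.datetime.now()` can be used in place of `times.min()` and `times.max()` in exactly the same way, although a reference based on the current time makes the feature values depend on when the code is run.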