[Python] Feature Derivation

rejudge

Last modified 2023-03-29 17:42:39

First published 2023-02-23 21:17:13

Article link: https://blog.csdn.net/qq_45249685/article/details/128737162

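This article introduces feature derivation methods used in data preprocessing: re-encoding and higher-order polynomials for a single variable; arithmetic and polynomial derivation for two variables; cross-combination feature derivation; and grouped-statistics derivation such as group means and quantiles. It also covers time-series fields, including time format conversion and time-difference derivation.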


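1. Univariate feature derivation

1.1 Data re-encoding

Continuous variables
Standardization: 0-1 (min-max) scaling, Z-score standardization
Discretization: equal-width binning, equal-frequency binning, clustering-based binning
Discrete variables
Non-numeric -> numeric: natural-number (ordinal) encoding, dictionary encoding
Column -> new columns: one-hot encoding, dummy-variable transformation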
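
The re-encoding options listed above can be sketched as follows. This is only an illustration: the toy columns `charges` and `contract` are made up for the example, and clustering-based binning is left out.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OrdinalEncoder

# Hypothetical toy data: one continuous and one discrete column
demo = pd.DataFrame({'charges': [10.0, 20.0, 30.0, 40.0],
                     'contract': ['month', 'year', 'month', 'two-year']})

# Continuous variable: 0-1 scaling and Z-score standardization
demo['charges_minmax'] = MinMaxScaler().fit_transform(demo[['charges']]).ravel()
demo['charges_zscore'] = StandardScaler().fit_transform(demo[['charges']]).ravel()

# Continuous variable: equal-width and equal-frequency binning into 2 bins
demo['charges_width_bin'] = pd.cut(demo['charges'], bins=2, labels=False)
demo['charges_freq_bin'] = pd.qcut(demo['charges'], q=2, labels=False)

# Discrete variable: natural-number encoding (categories are sorted alphabetically)
demo['contract_code'] = (OrdinalEncoder()
                         .fit_transform(demo[['contract']])
                         .ravel().astype(int))

# Discrete variable: one-hot encoding, one new column per category
onehot = pd.get_dummies(demo['contract'], prefix='contract')
```
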

1.2 Higher-order polynomials

Principle
X >>> X², X³, X⁴ … …
Implementation
This can be done by hand or with the PolynomialFeatures transformer in sklearn.

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x1 = np.array([1, 2, 3])
# (3,) >>> (3,1)
x1.reshape(-1, 1)
'''
array([[1],
       [2],
       [3]])
'''

# Powers 0 through 5 (include_bias defaults to True, which adds the 0th-power column of ones)
PolynomialFeatures(degree=5).fit_transform(x1.reshape(-1, 1))
'''
array([[  1.,   1.,   1.,   1.,   1.,   1.],
       [  1.,   2.,   4.,   8.,  16.,  32.],
       [  1.,   3.,   9.,  27.,  81., 243.]])
'''
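
For the "by hand" route, one possible sketch uses NumPy broadcasting instead of the sklearn transformer. The names `powers` and `manual` are introduced here only for illustration.

```python
import numpy as np

x1 = np.array([1, 2, 3])

# Exponents 0..5; a (3, 1) column raised to a (6,) row broadcasts to (3, 6)
powers = np.arange(0, 6)
manual = x1.reshape(-1, 1).astype(float) ** powers
```

Each row of `manual` holds one sample raised to the powers 0 to 5, which is the same layout as the PolynomialFeatures output shown above.
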

2. Bivariate feature derivation

2.1 Arithmetic operations

X1, X2 >>> X1+X2, X1-X2, X1*X2, X1/X2
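
A minimal sketch of the four operations with pandas. The toy frame `pair` and the choice to turn division by zero into NaN are assumptions made for the example, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the zero in X2 shows the division edge case
pair = pd.DataFrame({'X1': [1.0, 2.0, 3.0], 'X2': [2.0, 0.0, 4.0]})

arith = pd.DataFrame({
    'X1+X2': pair['X1'] + pair['X2'],
    'X1-X2': pair['X1'] - pair['X2'],
    'X1*X2': pair['X1'] * pair['X2'],
    # A zero denominator yields inf; replace it with NaN so it can be imputed later
    'X1/X2': (pair['X1'] / pair['X2']).replace([np.inf, -np.inf], np.nan),
})
```
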

2.2 Polynomial derivation

2.2.1 Imports & data

from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

df = pd.DataFrame({'X1':[1,2,3], 'X2':[2,3,4]})
df
'''
	X1	X2
0	1	2
1	2	3
2	3	4
'''

2.2.2 Second-order derivation

'''
Polynomial derivation
X1,X2 >>> X1,X2 | X1^2,X1*X2,X2^2
X1,X2,X3 >>> X1,X2,X3 | X1^2,X1*X2,X1*X3,X2^2,X2*X3,X3^2

include_bias: default True, adds the 0th-power (all-ones) column
interaction_only: default False; if True, only the cross terms are created
'''
PolynomialFeatures(degree=2, include_bias=False).fit_transform(df)
'''
array([[ 1.,  2.,  1.,  2.,  4.],
       [ 2.,  3.,  4.,  6.,  9.],
       [ 3.,  4.,  9., 12., 16.]])
'''
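
To make the effect of `interaction_only` concrete, here is a small sketch on the same two columns. The names `df2` and `cross_only` are chosen only for this example.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df2 = pd.DataFrame({'X1': [1, 2, 3], 'X2': [2, 3, 4]})

# interaction_only=True keeps X1, X2 and X1*X2 and drops the squared terms
cross_only = PolynomialFeatures(degree=2,
                                include_bias=False,
                                interaction_only=True).fit_transform(df2)
```
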

2.2.3 Third-order derivation

'''
X1,X2 >>> X1,X2 | X1^2,X1*X2,X2^2 | X1^3,X1^2*X2,X1*X2^2,X2^3
'''
PolynomialFeatures(degree=3, include_bias=False).fit_transform(df)
'''
array([[ 1.,  2.,  1.,  2.,  4.,  1.,  2.,  4.,  8.],
       [ 2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.],
       [ 3.,  4.,  9., 12., 16., 27., 36., 48., 64.]])
'''

Creating column names together with the features

df = pd.DataFrame({'X1':[1,2,3]
                  ,'X2':[2,3,4]
                  ,'X3':[1,0,0]})
df
'''
	X1	X2	X3
0	1	2	1
1	2	3	0
2	3	4	0
'''

# Select X1 and X2 for third-order derivation
colNames = ['X1', 'X2']
degree = 3
colNames_new = []
# Generate the new column names
for deg in range(2, degree+1):
    for i in range(deg+1):
        col_temp = colNames[0] + '^' + str(deg-i) + '*' + colNames[1] + '^'  + str(i)
        colNames_new.append(col_temp)
colNames_new
'''
['X1^2*X2^0',
 'X1^1*X2^1',
 'X1^0*X2^2',
 'X1^3*X2^0',
 'X1^2*X2^1',
 'X1^1*X2^2',
 'X1^0*X2^3']
'''

3. Cross-combination feature derivation

3.1 Imports & data

df = pd.DataFrame({'SeniorCitizen':[0,0,0,0,0]
                  ,'Partner':['Yes','No','No','No','No']
                  ,'Dependents':['No','No','No','No','No']})
df
'''

SeniorCitizen	Partner	Dependents
0	0	Yes	No
1	0	No	No
2	0	No	No
3	0	No	No
4	0	No	No
'''

3.2 Generating the derived columns and their names

colNames = ['SeniorCitizen', 'Partner', 'Dependents']
colNames_new_l = []
features_new_l = []

for col_index, col_name in enumerate(colNames):
    print(col_index, col_name)
'''
0 SeniorCitizen
1 Partner
2 Dependents
'''

# Names of the derived columns
for col_index, col_name in enumerate(colNames):
    for col_sub_index in range(col_index+1, len(colNames)):
        newNames = col_name + ' & ' + colNames[col_sub_index]
        print(newNames)
'''
SeniorCitizen & Partner
SeniorCitizen & Dependents
Partner & Dependents
'''

# Names of the derived columns and the features themselves
for col_index, col_name in enumerate(colNames):
    for col_sub_index in range(col_index+1, len(colNames)):
        newNames = col_name + '&' + colNames[col_sub_index]
        colNames_new_l.append(newNames)
        newDF = pd.Series(df[col_name].astype('str')
                         + '&'
                         + df[colNames[col_sub_index]].astype('str')
                         ,name=newNames)
        features_new_l.append(newDF)
        
features_new = pd.concat(features_new_l, axis=1)
features_new
'''

SeniorCitizen&Partner	SeniorCitizen&Dependents	Partner&Dependents
0	0&Yes	0&No	Yes&No
1	0&No	0&No	No&No
2	0&No	0&No	No&No
3	0&No	0&No	No&No
4	0&No	0&No	No&No
'''
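
The crossed columns are strings, so a model usually needs them encoded. The sketch below pairs the columns with `itertools.combinations` and then one-hot encodes the result. The toy frame `churn` uses slightly different values from the article's data so that more than one combination appears; that change is an assumption of the example.

```python
from itertools import combinations
import pandas as pd

# Hypothetical toy data with a little more variety than the article's frame
churn = pd.DataFrame({'SeniorCitizen': [0, 0, 1],
                      'Partner': ['Yes', 'No', 'No'],
                      'Dependents': ['No', 'No', 'Yes']})

# Pair every column with every later column, as the nested loops do
crossed = pd.DataFrame(index=churn.index)
for left, right in combinations(churn.columns, 2):
    crossed[left + '&' + right] = churn[left].astype(str) + '&' + churn[right].astype(str)

# One-hot encode so each observed combination becomes its own 0/1 column
crossed_onehot = pd.get_dummies(crossed, dtype=int)
```
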

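4. Grouped-statistics feature derivation

4.1 How grouped statistics work

'''
Feature B is grouped by the values of feature A and summarized within each group.
The statistic can be a mean, variance, etc. (typical for continuous variables) or a mode, quantile, etc. (typical for discrete variables).
Example: Monthly Charges summarized by tenure
ID tenure Monthly_Charges | mean min max
1  1      3               | 5    3   7
2  2      10              | 11   10  12
3  3      15              | 15   15  15
4  2      12              | 11   10  12
5  1      7               | 5    3   7
'''

'''
1. B may be discrete or continuous; A must be discrete.
   A should preferably be a discrete variable with many distinct values (or a continuous variable that takes a fixed set of values); otherwise many rows end up with identical derived values.
2. The statistics computed for B are not restricted by its type.
   A discrete feature can also use mean, variance, skewness, kurtosis, etc.; a continuous feature can also use mode, quantiles, etc.
3. Grouped statistics can be used when joining multiple tables.
   ID amount category        ID amount_mean amount_sum category_mode
   1  11     A               1  10.6667     32         C
   2  5      B         >>>   2  7.33333     22         A
   3  1      A               3  1           2          A
4. Further arithmetic derivation on top of the grouped statistics
   ID tenure Monthly_Charges | mean | Monthly_Charges-mean  Monthly_Charges/mean
   1  1      3               | 5    |  -2                    0.6
   2  2      10              | 11   |  -1                    0.909
   3  3      15              | 15   |  0                     1
'''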
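
Point 4 can be sketched with `groupby(...).transform`, which returns the group statistic aligned to the original rows so no merge is needed. The data is the small tenure table from the comment above; the new column names are chosen for this example.

```python
import pandas as pd

demo_grp = pd.DataFrame({'tenure': [1, 2, 3, 2, 1],
                         'Monthly_Charges': [3, 10, 15, 12, 7]})

# Group mean broadcast back to every row of its group
group_mean = demo_grp.groupby('tenure')['Monthly_Charges'].transform('mean')

demo_grp['mean'] = group_mean
demo_grp['charges_minus_mean'] = demo_grp['Monthly_Charges'] - group_mean
demo_grp['charges_div_mean'] = demo_grp['Monthly_Charges'] / group_mean
```
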

4.2 Procedure

4.2.1 Data preparation

data = pd.DataFrame({'tenure': [1, 3, 2, 4, 2, 3, 4]
             ,'SeniorCitizen': [0,0,2,0,0,1,3]
             ,'MonthlyCharges': [29.85,56.95,53.85,42.30,70.70,73.12,76.37]
             ,'gender': ['F','M','M','F','M','M','F']})

# Extract the target columns
colNames = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
features_temp = data[colNames]
features_temp

	tenure	SeniorCitizen	MonthlyCharges
0	1	0	29.85
1	3	0	56.95
2	2	2	53.85
3	4	0	42.30
4	2	0	70.70
5	3	1	73.12
6	4	3	76.37

4.2.2 Deriving a single statistic

# Group means of the other variables for each value of tenure
features_temp.groupby('tenure').mean()

	SeniorCitizen	MonthlyCharges
tenure
1	0.0	29.850
2	1.0	62.275
3	0.5	65.035
4	1.5	59.335

4.2.3 Deriving multiple statistics (new data + new column names)

colNames = ['tenure', 'SeniorCitizen', 'MonthlyCharges']
# Columns to aggregate
colNames_sub = ['SeniorCitizen', 'MonthlyCharges']

# Statistics to compute for each column
aggs = {}
for col in colNames_sub:
    aggs[col] = ['mean', 'min', 'max']
aggs
'''
    {'SeniorCitizen': ['mean', 'min', 'max'],
     'MonthlyCharges': ['mean', 'min', 'max']}
'''

# Build the new column names
cols = ['tenure']
for key in aggs.keys():
    cols.extend([key+'_'+'tenure'+'_'+stat for stat in aggs[key]])
cols
'''
    ['tenure',
     'SeniorCitizen_tenure_mean',
     'SeniorCitizen_tenure_min',
     'SeniorCitizen_tenure_max',
     'MonthlyCharges_tenure_mean',
     'MonthlyCharges_tenure_min',
     'MonthlyCharges_tenure_max']
'''

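'''
df.agg('mean') computes column means
df.agg('mean', axis=1) computes row means
df.agg({'row/column name 1': ['mean', 'min', 'max']
       ,'row/column name 2': ['func1', 'func2', 'func3']})

df.reset_index() resets the index, so 'tenure' becomes an ordinary feature column
'''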
features_new = features_temp.groupby('tenure').agg(aggs).reset_index()
features_new

	tenure	SeniorCitizen			MonthlyCharges
		mean	min	max	mean	min	max
0	1	0.0	0	0	29.850	29.85	29.85
1	2	1.0	0	2	62.275	53.85	70.70
2	3	0.5	0	1	65.035	56.95	73.12
3	4	1.5	0	3	59.335	42.30	76.37

# Reset the column names
features_new.columns = cols
features_new

	tenure	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_min	SeniorCitizen_tenure_max	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_min	MonthlyCharges_tenure_max
0	1	0.0	0	0	29.850	29.85	29.85
1	2	1.0	0	2	62.275	53.85	70.70
2	3	0.5	0	1	65.035	56.95	73.12
3	4	1.5	0	3	59.335	42.30	76.37

# Left join: attach features_new to data on 'tenure'
data_new = pd.merge(data, features_new, how='left', on='tenure')
data_new

	tenure	SeniorCitizen	MonthlyCharges	gender	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_min	SeniorCitizen_tenure_max	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_min	MonthlyCharges_tenure_max
0	1	0	29.85	F	0.0	0	0	29.850	29.85	29.85
1	3	0	56.95	M	0.5	0	1	65.035	56.95	73.12
2	2	2	53.85	M	1.0	0	2	62.275	53.85	70.70
3	4	0	42.30	F	1.5	0	3	59.335	42.30	76.37
4	2	0	70.70	M	1.0	0	2	62.275	53.85	70.70
5	3	1	73.12	M	0.5	0	1	65.035	56.95	73.12
6	4	3	76.37	F	1.5	0	3	59.335	42.30	76.37

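4.2.4 Summary of common statistics

# Statistics
'''
- mean/var: mean, variance
- max/min: maximum, minimum
- skew: skewness of the distribution; below 0 is left-skewed, above 0 is right-skewed
- median: median
- count: number of values
- nunique: number of distinct values
- quantile: quantiles (a specific level such as 0.25 or 0.75 cannot be requested by a string name in .agg(), so a custom function is needed)
'''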

a = np.array([[1,2,3,2,4,1]
             ,[0,0,0,1,1,1]])
df = pd.DataFrame(a.T, columns=['x1','x2'])
df

	x1	x2
0	1	0
1	2	0
2	3	0
3	2	1
4	4	1
5	1	1

aggs = {'x1': ['mean', 'var', 'max', 'min', 'skew', 'median', 'count', 'nunique']}
df.groupby('x2').agg(aggs).reset_index()

	x2	x1
		mean	var	max	min	skew	median	count	nunique
0	0	2.000000	1.000000	3	1	0.00000	2.0	3	3
1	1	2.333333	2.333333	4	1	0.93522	2.0	3	3

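4.2.5 Deriving quantile statistics

# quantile
def q1(x):
    '''
    Lower quartile
    '''
    return x.quantile(0.25)
def q2(x):
    '''
    Upper quartile
    '''
    return x.quantile(0.75)

'''
aggs = {'x2': ['q1', 'q2']}
First definition: used to build the column names
aggs = {'x2': [q1, q2]}
Second definition: passes the functions themselves to .agg()
'''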
# Summarize x2 within groups of x1
colNames = ['x2']
keyCol = 'x1'

# First definition of aggs: helps build the column names
aggs = {}
for col in colNames:
    aggs[col] = ['q1', 'q2']
aggs
'''
    {'x2': ['q1', 'q2']}
'''

# Names of the new columns
cols = [keyCol]
for key in aggs.keys():
    cols.extend([key+'_'+keyCol+'_'+stat for stat in aggs[key]])
cols
'''
    ['x1', 'x2_x1_q1', 'x2_x1_q2']
'''

# Second definition of aggs: used in the groupby computation
aggs = {}
for col in colNames:
    aggs[col] = [q1, q2]
aggs
'''
    {'x2': [<function __main__.q1(x)>, <function __main__.q2(x)>]}
'''

d2 = df.groupby(keyCol).agg(aggs).reset_index()
d2

	x1	x2
		q1	q2
0	1	0.25	0.75
1	2	0.25	0.75
2	3	0.00	0.00
3	4	1.00	1.00

d2.columns = cols
d2

	x1	x2_x1_q1	x2_x1_q2
0	1	0.25	0.75
1	2	0.25	0.75
2	3	0.00	0.00
3	4	1.00	1.00
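
Defining `aggs` twice can be avoided with pandas named aggregation, where each keyword is the output column name and its value is an `(input column, function)` pair. This is an alternative sketch, not the article's method; `df_q` and `d2_named` are names chosen for the example.

```python
import pandas as pd

df_q = pd.DataFrame({'x1': [1, 2, 3, 2, 4, 1],
                     'x2': [0, 0, 0, 1, 1, 1]})

def q1(x):
    return x.quantile(0.25)  # lower quartile

def q2(x):
    return x.quantile(0.75)  # upper quartile

# Output names and functions are declared together, so no renaming step is needed
d2_named = (df_q.groupby('x1')
                .agg(x2_x1_q1=('x2', q1), x2_x1_q2=('x2', q2))
                .reset_index())
```
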

4.3 Wrapping grouped statistics in a function

# Wrapper function for grouped statistics
def Binary_Group_Statistics(keyCol
                           ,features
                           ,col_num=None
                           ,col_cat=None
                           ,num_stat=['mean', 'var', 'max', 'min', 'skew', 'median']
                           ,cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique']
                           ,quant=True):
    """
    Bivariate grouped-statistics feature derivation function
    
    :param keyCol: key variable used for grouping
    :param features: original dataset
    :param col_num: continuous variables to derive from
    :param col_cat: discrete variables to derive from
    :param num_stat: grouped statistics for continuous variables
    :param cat_stat: grouped statistics for discrete variables
    :param quant: whether to compute quartiles
    
    :return: the derived features and their names
    """
    # When continuous features are supplied
    if col_num is not None:
        aggs_num = {}
        colNames = col_num
        # Build the dict required by agg
        for col in col_num:
            aggs_num[col] = num_stat
        # Build the list of derived feature names
        cols_num = [keyCol]
        for key in aggs_num.keys():
            cols_num.extend([key+'_'+keyCol+'_'+stat for stat in aggs_num[key]])
        # Build the derived feature df
        features_num_new = features[col_num+[keyCol]].groupby(keyCol).agg(aggs_num).reset_index()
        features_num_new.columns = cols_num
        # When both continuous and discrete features are supplied
        if col_cat is not None:
            aggs_cat = {}
            # colNames stays equal to col_num here, so the quartile step below
            # covers only the continuous columns (as in the sample output)
            # Build the dict required by agg
            for col in col_cat:
                aggs_cat[col] = cat_stat
            # Build the list of derived feature names
            cols_cat = [keyCol]
            for key in aggs_cat.keys():
                cols_cat.extend([key+'_'+keyCol+'_'+stat for stat in aggs_cat[key]])
            # Build the derived feature df
            features_cat_new = features[col_cat+[keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            
            # Merge the continuous and discrete results
            df_temp = pd.merge(features_num_new, features_cat_new, how='left', on=keyCol)
            features_new = pd.merge(features[keyCol], df_temp, how='left', on=keyCol)
            # Drop duplicated columns (the result must be assigned back to take effect)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            colNames_new = cols_num + cols_cat
            colNames_new.remove(keyCol)
            colNames_new.remove(keyCol)
        # When only continuous variables are supplied
        else:
            # Merge the continuous results with the key column, then drop duplicated columns
            features_new = pd.merge(features[keyCol], features_num_new, how='left', on=keyCol)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            colNames_new = cols_num
            colNames_new.remove(keyCol)
    # When no continuous variables are supplied
    else:
        # ...but categorical variables are, i.e. only categorical variables
        if col_cat is not None:
            aggs_cat = {}
            colNames = col_cat
            for col in col_cat:
                aggs_cat[col] = cat_stat
            cols_cat = [keyCol]
            for key in aggs_cat.keys():
                cols_cat.extend([key+'_'+keyCol+'_'+stat for stat in aggs_cat[key]])
            features_cat_new = features[col_cat+[keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            features_new = pd.merge(features[keyCol], features_cat_new, how='left', on=keyCol)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            colNames_new = cols_cat
            colNames_new.remove(keyCol)
            
    if quant:
        # Quartile helper functions
        def q1(x):
            return x.quantile(0.25) # lower quartile
        def q2(x):
            return x.quantile(0.75) # upper quartile
            
        # First pass: string names, used only to build the column names
        aggs = {}
        for col in colNames:
            aggs[col] = ['q1', 'q2']
        cols = [keyCol]
        for key in aggs.keys():
            cols.extend([key+'_'+keyCol+'_'+stat for stat in aggs[key]])
        # Second pass: the functions themselves, passed to agg
        aggs = {}
        for col in colNames:
            aggs[col] = [q1, q2]
        features_temp = features[colNames+[keyCol]].groupby(keyCol).agg(aggs).reset_index()
        features_temp.columns = cols
            
        features_new = pd.merge(features_new, features_temp, how='left', on=keyCol)
        features_new = features_new.loc[:, ~features_new.columns.duplicated()]
        colNames_new = colNames_new + cols
        colNames_new.remove(keyCol)
    features_new.drop([keyCol], axis=1, inplace=True)
    return features_new, colNames_new

# Usage example
features = pd.DataFrame({'tenure': [1, 3, 2, 4, 2, 3, 4]
             ,'SeniorCitizen': [0,0,2,0,0,1,3]
             ,'MonthlyCharges': [29.85,56.95,53.85,42.30,70.70,73.12,76.37]
             ,'gender': ['F','M','M','F','M','M','F']})
col_num = ['MonthlyCharges']
col_cat = ['SeniorCitizen']
keyCol = 'tenure'
df, col = Binary_Group_Statistics(keyCol, features, col_num, col_cat)

df # tenure is not included; every column is newly generated

	MonthlyCharges_tenure_mean	MonthlyCharges_tenure_var	MonthlyCharges_tenure_max	MonthlyCharges_tenure_min	MonthlyCharges_tenure_skew	MonthlyCharges_tenure_median	SeniorCitizen_tenure_mean	SeniorCitizen_tenure_var	SeniorCitizen_tenure_max	SeniorCitizen_tenure_min	SeniorCitizen_tenure_median	SeniorCitizen_tenure_count	SeniorCitizen_tenure_nunique	MonthlyCharges_tenure_q1	MonthlyCharges_tenure_q2
0	29.850	NaN	29.85	29.85	NaN	29.850	0.0	NaN	0	0	0.0	1	1	29.8500	29.8500
1	65.035	130.73445	73.12	56.95	NaN	65.035	0.5	0.5	1	0	0.5	2	2	60.9925	69.0775
2	62.275	141.96125	70.70	53.85	NaN	62.275	1.0	2.0	2	0	1.0	2	2	58.0625	66.4875
3	59.335	580.38245	76.37	42.30	NaN	59.335	1.5	4.5	3	0	1.5	2	2	50.8175	67.8525
4	62.275	141.96125	70.70	53.85	NaN	62.275	1.0	2.0	2	0	1.0	2	2	58.0625	66.4875
5	65.035	130.73445	73.12	56.95	NaN	65.035	0.5	0.5	1	0	0.5	2	2	60.9925	69.0775
6	59.335	580.38245	76.37	42.30	NaN	59.335	1.5	4.5	3	0	1.5	2	2	50.8175	67.8525

col
'''
    ['MonthlyCharges_tenure_mean',
     'MonthlyCharges_tenure_var',
     'MonthlyCharges_tenure_max',
     'MonthlyCharges_tenure_min',
     'MonthlyCharges_tenure_skew',
     'MonthlyCharges_tenure_median',
     'SeniorCitizen_tenure_mean',
     'SeniorCitizen_tenure_var',
     'SeniorCitizen_tenure_max',
     'SeniorCitizen_tenure_min',
     'SeniorCitizen_tenure_median',
     'SeniorCitizen_tenure_count',
     'SeniorCitizen_tenure_nunique',
     'MonthlyCharges_tenure_q1',
     'MonthlyCharges_tenure_q2']
'''

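5. Time-series field feature derivation

In essence, this adds more grouping keys
It helps reveal patterns, e.g. user churn concentrated in a particular quarter
Derive from natural cycles and business cycles, e.g. seasons and tourism peak/off-peak seasons
Derive from time differences relative to key points in time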
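
As a rough sketch of cycle-based derivation: the season mapping below assumes Northern Hemisphere meteorological seasons, and `peak_months` is a hypothetical business calendar invented for the example.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2022-01-03', '2022-07-01', '2022-08-22',
                               '2022-04-30', '2022-05-02']))

# Natural cycle: month -> season (assumed Northern Hemisphere convention)
season_of_month = {12: 'winter', 1: 'winter', 2: 'winter',
                   3: 'spring', 4: 'spring', 5: 'spring',
                   6: 'summer', 7: 'summer', 8: 'summer',
                   9: 'autumn', 10: 'autumn', 11: 'autumn'}
season = ts.dt.month.map(season_of_month)

# Business cycle: hypothetical peak months for a tourism business
peak_months = [7, 8]
is_peak = ts.dt.month.isin(peak_months).astype(int)
```
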

5.1 Time format conversion: pd.to_datetime()

t = pd.DataFrame()
t['time'] = ['2022-01-03;02:31:52'
            ,'2022-07-01;14:22:01'
            ,'2022-08-22;08:02:31'
            ,'2022-04-30;11:41:31'
            ,'2022-05-02;22:01:27']
'''
Or pass a dict
t = pd.DataFrame({'time': ['2022-01-03;02:31:52'
                          ,'2022-07-01;14:22:01'
                          ,'2022-08-22;08:02:31'
                          ,'2022-04-30;11:41:31'
                          ,'2022-05-02;22:01:27']})
'''

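# The strings use ';' between date and time, so the format is given explicitly
t['time'] = pd.to_datetime(t['time'], format='%Y-%m-%d;%H:%M:%S')
t['time'], t['time'][0] # each element is a Timestamp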
'''
    (0   2022-01-03 02:31:52
     1   2022-07-01 14:22:01
     2   2022-08-22 08:02:31
     3   2022-04-30 11:41:31
     4   2022-05-02 22:01:27
     Name: time, dtype: datetime64[ns],
     Timestamp('2022-01-03 02:31:52'))

'''

# Convert the precision: D, h, s, ms, ns
t['time'].values, t['time'].values.astype('datetime64[D]')
'''
    (array(['2022-01-03T02:31:52.000000000', '2022-07-01T14:22:01.000000000',
            '2022-08-22T08:02:31.000000000', '2022-04-30T11:41:31.000000000',
            '2022-05-02T22:01:27.000000000'], dtype='datetime64[ns]'),
     array(['2022-01-03', '2022-07-01', '2022-08-22', '2022-04-30',
            '2022-05-02'], dtype='datetime64[D]'))
'''

t['time-D'] = t['time'].values.astype('datetime64[D]')
t

	time	time-D
0	2022-01-03 02:31:52	2022-01-03
1	2022-07-01 14:22:01	2022-07-01
2	2022-08-22 08:02:31	2022-08-22
3	2022-04-30 11:41:31	2022-04-30
4	2022-05-02 22:01:27	2022-05-02

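5.2 Extracting information from time fields: dt.year/month/day…

'''
- Year, month, day, hour, minute, second
dt.year dt.month dt.day dt.hour dt.minute dt.second
- Quarter, week of the year, day of the week
dt.quarter dt.isocalendar().week dt.dayofweek/dt.weekday
'''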
t['time'].dt.day
'''
    0     3
    1     1
    2    22
    3    30
    4     2
    Name: time, dtype: int64
'''

t['year'] = t['time'].dt.year
t['month'] = t['time'].dt.month
t['day'] = t['time'].dt.day
t['hour'] = t['time'].dt.hour
t['minute'] = t['time'].dt.minute
t['second'] = t['time'].dt.second

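# dayofweek counts from 0 for Monday; +1 gives 1 = Monday ... 7 = Sunday
t['dayofweek'] = t['time'].dt.dayofweek + 1
t['quarter'] = t['time'].dt.quarter
# dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week replaces it
t['weekofyear'] = t['time'].dt.isocalendar().week.astype(int)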
t

	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18

# Weekend flag
t['weekend'] = (t['dayofweek'] > 5).astype(int)
# Part of the day (early morning, morning, afternoon, evening): 6-hour blocks via integer division by 6
t['hour_section'] = (t['hour'] // 6).astype(int)
t

	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear	weekend	hour_section
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1	0	0
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26	0	2
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34	0	1
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17	1	1
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18	0	3

5.3 Time-difference derivation

5.3.1 Column-to-column operations

t['time'] - t['time']
'''
    0   0 days
    1   0 days
    2   0 days
    3   0 days
    4   0 days
    Name: time, dtype: timedelta64[ns]
'''

5.3.2 Column-to-timestamp operations

p1 = '2022-01-03;02:31:52'
pd.Timestamp(p1), pd.Timestamp(p1).year
'''
    (Timestamp('2022-01-03 02:31:52'), 2022)
'''

t['time_diff'] = t['time'] - pd.Timestamp(p1)
t['time_diff']
'''
    0     0 days 00:00:00
    1   179 days 11:50:09
    2   231 days 05:30:39
    3   117 days 09:09:39
    4   119 days 19:29:35
    Name: time_diff, dtype: timedelta64[ns]
'''

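'''
- Extracting information from timedelta values
months     days    seconds
dt.days/30 dt.days dt.seconds

dt.seconds is only the sub-day remainder (whole days are not included),
dt.days is only the whole days (hours, minutes and seconds are not included)
'''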
t['time_diff'].dt.seconds
'''
    0        0
    1    42609
    2    19839
    3    32979
    4    70175
    Name: time_diff, dtype: int64
'''

# True difference in seconds, including the days, hours and minutes
t['time_diff_s'] = t['time_diff'].values.astype('timedelta64[s]').astype('int')
t['time_diff_s']
'''
    0           0
    1    15508209
    2    19978239
    3    10141779
    4    10351775
    Name: time_diff_s, dtype: int64
'''

	time	time-D	year	month	day	hour	minute	second	dayofweek	quarter	weekofyear	weekend	hour_section	time_diff	time_diff_s
0	2022-01-03 02:31:52	2022-01-03	2022	1	3	2	31	52	1	1	1	0	0	0 days 00:00:00	0
1	2022-07-01 14:22:01	2022-07-01	2022	7	1	14	22	1	5	3	26	0	2	179 days 11:50:09	15508209
2	2022-08-22 08:02:31	2022-08-22	2022	8	22	8	2	31	1	3	34	0	1	231 days 05:30:39	19978239
3	2022-04-30 11:41:31	2022-04-30	2022	4	30	11	41	31	6	2	17	1	1	117 days 09:09:39	10141779
4	2022-05-02 22:01:27	2022-05-02	2022	5	2	22	1	27	1	2	18	0	3	119 days 19:29:35	10351775
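
A possibly simpler route to the full difference in seconds is `dt.total_seconds()`, which avoids the NumPy dtype casts. The sketch below reuses two of the timestamps above; `stamps`, `diff` and `total_s` are names chosen for the example.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(['2022-01-03 02:31:52', '2022-07-01 14:22:01']))
diff = stamps - pd.Timestamp('2022-01-03 02:31:52')

# Whole difference expressed in seconds (days included)
total_s = diff.dt.total_seconds().astype(int)

# The two partial views described above
whole_days = diff.dt.days          # whole days only
remainder_s = diff.dt.seconds      # sub-day remainder only
```

The parts recombine as `whole_days * 86400 + remainder_s`, which equals `total_s`.
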

5.3.3 Choosing the reference timestamp

t['time'].max(), t['time'].min()
'''
    (Timestamp('2022-08-22 08:02:31'), Timestamp('2022-01-03 02:31:52'))
'''

import datetime

# Current time, to microsecond precision
print(datetime.datetime.now())
# Format as year-month-day
print('-')
print(datetime.datetime.now().strftime('%Y-%m-%d'))
print(pd.Timestamp(datetime.datetime.now().strftime('%Y-%m-%d')))
# Format as year-month-day hour:minute:second
print('-')
print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
print(pd.Timestamp(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
'''
    2023-02-21 10:35:29.455932
    -
    2023-02-21
    2023-02-21 00:00:00
    -
    2023-02-21 10:35:29
    2023-02-21 10:35:29
'''
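
Once a reference timestamp is chosen, the difference features follow directly. The sketch below uses the earliest and latest records as references; `since_first` and `until_last` are names invented for the example, and a fixed business date or the current time could be used in the same way.

```python
import pandas as pd

records = pd.Series(pd.to_datetime(['2022-01-03 02:31:52',
                                    '2022-07-01 14:22:01',
                                    '2022-08-22 08:02:31',
                                    '2022-04-30 11:41:31',
                                    '2022-05-02 22:01:27']))

# Whole days elapsed since the earliest record
since_first = (records - records.min()).dt.days
# Whole days remaining until the latest record
until_last = (records.max() - records).dt.days
```
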