Feature Derivation Engineering


Higher-Order Polynomial Feature Derivation

import pandas as pd
import numpy as np
# import warnings
# warnings.filterwarnings('ignore')

diabetes = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\diabetes.csv")
titanic = pd.read_csv(r"D:\本科\kaggle数据挖掘\titanic\train.csv")
titanic.fillna(0, inplace = True)
titanic.head()

diabetes.head()

 Processing a diabetes feature

from sklearn.preprocessing import PolynomialFeatures
tmp = PolynomialFeatures(degree=5).fit_transform(diabetes.iloc[:,0:1])
tmp = pd.DataFrame(tmp)
tmp.rename(columns = {0:'preg$^0$', 1:'preg$^1$', 2:'preg$^2$', 3:'preg$^3$', 4:'preg$^4$', 5:'preg$^5$'}, inplace = True)
new_diabetes = pd.concat([diabetes, tmp], axis = 1, join = 'inner')
new_diabetes.head()

 

 Processing a titanic feature

temp = PolynomialFeatures(degree = 5).fit_transform(titanic.iloc[:,9:10])
temp = pd.DataFrame(temp)
temp.rename(columns = {0:'fare$^0$', 1:'fare$^1$', 2:'fare$^2$', 3:'fare$^3$', 4:'fare$^4$', 5:'fare$^5$'}, inplace = True)
new_titanic = pd.concat([titanic, temp], axis = 1, join = 'inner')
new_titanic.iloc[0:5, 6:]

Second-degree polynomial derivation on diabetes

data = diabetes.iloc[[0,1], [0,1]]
data

 

PolynomialFeatures(degree = 2, include_bias = False).fit_transform(data)
array([[6.0000e+00, 1.4800e+02, 3.6000e+01, 8.8800e+02, 2.1904e+04],
       [1.0000e+00, 8.5000e+01, 1.0000e+00, 8.5000e+01, 7.2250e+03]])

x1   x2    x1^2   x1*x2   x2^2
6    148   36     888     21904
1    85    1      85      7225

 Third-degree polynomial derivation on diabetes

PolynomialFeatures(degree = 3, include_bias = False).fit_transform(data)

array([[6.000000e+00, 1.480000e+02, 3.600000e+01, 8.880000e+02,
        2.190400e+04, 2.160000e+02, 5.328000e+03, 1.314240e+05,
        3.241792e+06],
       [1.000000e+00, 8.500000e+01, 1.000000e+00, 8.500000e+01,
        7.225000e+03, 1.000000e+00, 8.500000e+01, 7.225000e+03,
        6.141250e+05]])
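The column order in these arrays follows a fixed rule: terms are grouped by degree, and within each degree the products run over index combinations with replacement. Below is a small pure-Python sketch that reproduces the two outputs above for the same two rows; `polynomial_terms` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_terms(row, degree, include_bias=False):
    """Expand one sample into polynomial terms, grouped by increasing degree."""
    terms = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        # e.g. for two features and d=2: (0, 0), (0, 1), (1, 1)
        for combo in combinations_with_replacement(range(len(row)), d):
            terms.append(float(prod(row[i] for i in combo)))
    return terms

rows = [[6, 148], [1, 85]]
degree2 = [polynomial_terms(r, 2) for r in rows]
degree3 = [polynomial_terms(r, 3) for r in rows]
print(degree2)
print(degree3)
```

For two features x1 and x2 the degree-3 block is therefore x1^3, x1^2*x2, x1*x2^2, x2^3, which matches the last four columns of the array above.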

 Cross-combination feature derivation

  

 5. Introduction to cross-combination feature derivation methods (video on bilibili)
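The post gives no code for this section. One common reading of cross-combination derivation is to join the levels of two categorical features into a single combined label and then one-hot encode it. The sketch below works under that assumption, using hypothetical stand-in rows with Titanic-style column names; the combined column name `Sex_Pclass` is also made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows; column names follow the Titanic train.csv
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                   'Pclass': [3, 1, 3, 1]})

# Join the two category labels into one combined label per row
df['Sex_Pclass'] = df['Sex'] + '_' + df['Pclass'].astype(str)

# One-hot encode the combined label: one 0/1 column per observed pair
crossed = pd.get_dummies(df['Sex_Pclass'], dtype=int)
result = pd.concat([df, crossed], axis=1)
print(result)
```

The number of derived columns grows with the product of the level counts, so this is usually applied to features with few distinct values.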

Time-series features: feature derivation for time data

I. Converting time data

1. Define the raw data (load it into a table)

import pandas as pd
t = pd.DataFrame()
t['time'] = ['2022-04-05;13:34:03',
             '1949-10-03;14:01:06',
             '1945-08-15;09:00:00']
t

 

2. Convert the data (into a real datetime type)

t['time'] = pd.to_datetime(t['time'])
t['time']

0   2022-04-05 13:34:03
1   1949-10-03 14:01:06
2   1945-08-15 09:00:00
Name: time, dtype: datetime64[ns]
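The strings here use `;` between date and time, so the conversion relies on pandas inferring the layout. A more defensive sketch passes the layout explicitly through `format` and turns unparseable entries into `NaT` with `errors='coerce'`. The third entry below is a made-up bad value added only to show that behaviour.

```python
import pandas as pd

raw = pd.Series(['2022-04-05;13:34:03',
                 '1949-10-03;14:01:06',
                 'not a date'])          # hypothetical bad value

# Explicit layout: year-month-day, a literal ';', then hour:minute:second
parsed = pd.to_datetime(raw, format='%Y-%m-%d;%H:%M:%S', errors='coerce')
print(parsed)
print(parsed.isna().sum(), 'value(s) could not be parsed')
```

Counting the `NaT` entries afterwards is a cheap check that the format string matches the data.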

 Another example

t1 = pd.DataFrame()
t1['time'] = ['1997-07-01',
              '1999-12-20']
t1['time'] = pd.to_datetime(t1['time'])
t1['time']
0   1997-07-01
1   1999-12-20
Name: time, dtype: datetime64[ns]
t1['time'].values.astype('datetime64[D]')
t1['time']

0   1997-07-01
1   1999-12-20
Name: time, dtype: datetime64[ns]
t1['time'].values.astype('datetime64[h]')
t1['time']
0   1997-07-01
1   1999-12-20
Name: time, dtype: datetime64[ns]
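  Common time data types
          datetime64[ns] (nanoseconds)
          datetime64[D]  (days)
          datetime64[h]  (hours)
          datetime64[s]  (seconds)
          datetime64[ms] (milliseconds)
     Before pandas 2.0, a Series or DataFrame column could only hold datetime64[ns]; pandas 2.0 and later also accept the s, ms, and us units. The astype calls above return a new NumPy array and leave t1['time'] unchanged, which is why the printed Series still shows datetime64[ns].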
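The units are easier to see on a plain NumPy array, outside pandas. A short sketch with the same two dates:

```python
import numpy as np

days = np.array(['1997-07-01', '1999-12-20'], dtype='datetime64[D]')
hours = days.astype('datetime64[h]')   # returns a new array; `days` is unchanged

print(days.dtype, days)
print(hours.dtype, hours)

# Subtracting two datetime64 values gives a timedelta64 in the same unit
gap = days[1] - days[0]
print(gap)
```

Converting to a coarser unit truncates, and converting to a finer one pads with zeros, so the unit works as a resolution setting rather than a display format.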

t1['time'].dt.year

0    1997
1    1999
Name: time, dtype: int64
t1['time'].dt.quarter

0    3
1    4
Name: time, dtype: int64
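The same `.dt` accessor exposes the other calendar fields, so several derived columns can be built in one pass. A minimal sketch on the same two dates; the output column names and the `is_weekend` flag are illustrative choices, not taken from the post.

```python
import pandas as pd

t1 = pd.DataFrame({'time': pd.to_datetime(['1997-07-01', '1999-12-20'])})

features = pd.DataFrame({
    'year': t1['time'].dt.year,
    'month': t1['time'].dt.month,
    'day': t1['time'].dt.day,
    'quarter': t1['time'].dt.quarter,
    'dayofweek': t1['time'].dt.dayofweek,   # Monday=0 ... Sunday=6
})
# Illustrative binary flag derived from the weekday
features['is_weekend'] = (features['dayofweek'] >= 5).astype(int)
print(features)
```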

Target encoding

import numpy as np
a = np.array([[1,2] * 5, [0, 1, 1, 1, 1, 0, 0, 0, 1, 0]]).T
train = pd.DataFrame(a, columns = ['tenure', 'Churn'])
train

 

from sklearn.model_selection import KFold
kf = KFold(n_splits = 5)
for train_idx, test_idx in kf.split(a):   # separate names so the `train` DataFrame is not overwritten
    print('train: %s, test: %s' % (train_idx, test_idx))
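The post stops at printing the fold indices. One way the folds are typically used for target encoding is out-of-fold mean encoding: each row's `tenure` is replaced by the mean `Churn` of that `tenure` value computed on the other folds only, so a row never sees its own label. The sketch below works under that assumption; the column name `tenure_te` and the fallback to the global mean are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

a = np.array([[1, 2] * 5, [0, 1, 1, 1, 1, 0, 0, 0, 1, 0]]).T
df = pd.DataFrame(a, columns=['tenure', 'Churn'])

global_mean = df['Churn'].mean()
df['tenure_te'] = np.nan                  # hypothetical name for the encoded column

kf = KFold(n_splits=5)
for fit_idx, enc_idx in kf.split(df):
    # Mean Churn per tenure value, computed on the other four folds
    fold_means = df.iloc[fit_idx].groupby('tenure')['Churn'].mean()
    # Encode the held-out rows; unseen tenure values fall back to the global mean
    encoded = df.iloc[enc_idx]['tenure'].map(fold_means).fillna(global_mean)
    df.loc[df.index[enc_idx], 'tenure_te'] = encoded.values

print(df)
```

With only ten rows the fold means are noisy; on real data a smoothing term toward the global mean is often added as well.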
