Feature engineering tricks for LightGBM and other GBDTs

A model is only as good as its data, i.e. as the features engineered from that data. Below are some feature-engineering tricks, mainly aimed at GBDT models.

Source: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575

NaN Handling

If you give np.nan to LGBM, then at each tree node split it will split the non-NaN values and then send all the NaNs to either the left child or the right child, depending on what's best. NaNs therefore get special treatment at every node and can lead to overfitting. By simply converting all NaNs to a negative number lower than all non-NaN values (such as -999),

df[col].fillna(-999, inplace=True)
then LGBM will no longer give NaN special treatment. Instead it will give it the same attention as any other number. Try both ways and see which gives the highest CV.
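A minimal sketch of the idea (the DataFrame and column name `val` are invented for illustration); rather than hard-coding -999, the sentinel can be chosen so it is guaranteed to sit below every observed value:

```python
import numpy as np
import pandas as pd

# Toy frame; 'val' is a hypothetical numeric feature with missing entries.
df = pd.DataFrame({'val': [3.0, np.nan, -5.0, 7.0, np.nan]})

# Pick a sentinel guaranteed to be lower than all non-NaN values,
# so the fill never collides with real data (-999, or min - 1 if lower).
sentinel = min(-999, df['val'].min() - 1)
df['val'] = df['val'].fillna(sentinel)
```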

Reduce Memory Usage: Label Encode / Factorize / Memory Reduction

df[col], _ = df[col].factorize()

if df[col].max() < 128: df[col] = df[col].astype('int8')
elif df[col].max() < 32768: df[col] = df[col].astype('int16')
else: df[col] = df[col].astype('int32')

for col in df.columns:
    if df[col].dtype == 'float64': df[col] = df[col].astype('float32')
    if df[col].dtype == 'int64': df[col] = df[col].astype('int32')

Handling Categorical Features

With categorical variables, you have the choice of telling LGBM that they are categorical, or you can tell LGBM to treat them as numerical (if you label encode them first). Either way, LGBM can extract the category classes. Try both ways and see which gives the highest CV. After label encoding, do the following to treat a column as categorical, or leave it as int to treat it as numeric:

df[col] = df[col].astype('category')

Splitting

A single (string or numeric) column can be made into two columns by splitting. For example, a string column id_30 such as "Mac OS X 10_9_5" can be split into an operating system "Mac OS X" and a version "10_9_5". Or a numeric column TransactionAmt such as "1230.45" can be split into dollars "1230" and cents "45". LGBM cannot see these pieces on its own; you need to split them.
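One way to sketch both splits with plain pandas (the toy frame below is invented; the column names follow the competition's id_30 and TransactionAmt):

```python
import pandas as pd

df = pd.DataFrame({
    'id_30': ['Mac OS X 10_9_5', 'Windows 10'],
    'TransactionAmt': [1230.45, 50.00],
})

# Split the OS string at its last space into name and version.
df[['OS', 'Version']] = df['id_30'].str.rsplit(' ', n=1, expand=True)

# Split the amount into dollar and cent parts.
df['Dollars'] = df['TransactionAmt'].astype(int)
df['Cents'] = (df['TransactionAmt'] * 100).round().astype(int) % 100
```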

Combining Columns / Transforming / Interactions

Two (string or numeric) columns can be combined into one column. For example card1 and card2 can become a new column with

df['uid'] = df['card1'].astype(str) + '_' + df['card2'].astype(str)
This helps LGBM because by themselves card1 and card2 may not correlate with the target, so LGBM won't split on them at a tree node. But the interaction uid = card1_card2 may correlate with the target, and now LGBM will split on it. Numeric columns can be combined by adding, subtracting, multiplying, etc. A numeric example is

df['x1_x2'] = df['x1'] * df['x2']

Frequency Encoding

Frequency encoding is a powerful technique that allows LGBM to see whether column values are rare or common. For example, if you want LGBM to “see” which credit cards are used infrequently, try

temp = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(temp)

Aggregations / Group Statistics

Providing LGBM with group statistics allows LGBM to determine if a value is common or rare for a particular group. You calculate group statistics by providing pandas with 3 variables. You give it the group, variable of interest, and type of statistic. For example,

temp = df.groupby('card1')['TransactionAmt'].agg(['mean']).rename(
    {'mean': 'TransactionAmt_card1_mean'}, axis=1).reset_index()
df = pd.merge(df, temp, on='card1', how='left')
The feature here adds to each row what the average TransactionAmt is for that row’s card1 group. Therefore LGBM can now tell if a row has an abnormal TransactionAmt for their card1 group.
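If you prefer a one-liner, pandas `groupby(...).transform` broadcasts the group statistic back onto every row directly, which should be equivalent to the agg/rename/merge recipe above (the toy data below is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'card1': [1, 1, 2],
                   'TransactionAmt': [10.0, 30.0, 5.0]})

# transform returns a Series aligned with df's index, so the group
# mean is attached to each row without an explicit merge.
df['TransactionAmt_card1_mean'] = (
    df.groupby('card1')['TransactionAmt'].transform('mean')
)
```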

Normalize / Standardize

You can normalize columns against themselves. For example

df[col] = (df[col] - df[col].mean()) / df[col].std()
Or you can normalize one column against another column. For example, if you create a group statistic (described above) giving the mean value of D3 for each week, you can then remove the time dependence with

df['D3_remove_time'] = df['D3'] - df['D3_week_mean']
The new variable D3_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.

Outlier Removal / Relax / Smooth / PCA

Normally you want to remove anomalies from your data because they confuse your models. However, in this competition we want to find anomalies, so use smoothing techniques carefully. The idea behind these methods is to determine and remove uncommon values. For example, using the frequency encoding of a variable, you can remove all values that appear in less than 0.1% of rows by replacing them with a new value like -9999 (note that you should use a different value than what you used for NaN).
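A sketch of that replacement, building on the frequency encoding above (the toy card1 column and the 0.1% threshold are illustrative):

```python
import pandas as pd

# Toy column: 3333 appears once in 2000 rows (0.05%), so it is "rare".
df = pd.DataFrame({'card1': [1111] * 1200 + [2222] * 799 + [3333]})

# Frequency-encode, then replace values seen in under 0.1% of rows
# with a sentinel distinct from the NaN fill value used earlier (-999).
freq = df['card1'].value_counts(normalize=True)
rare = freq[freq < 0.001].index
df['card1'] = df['card1'].where(~df['card1'].isin(rare), -9999)
```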
