A model is only as good as its handling of the data, i.e. its feature engineering. Below are some feature-engineering techniques, mainly aimed at GBDT models.
Source: https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575
NaN Handling
If you give np.nan to LGBM, then at each tree node split it will first split on the non-NaN values and then send all the NaNs to either the left or right child, whichever is best. Therefore NaNs get special treatment at every node and can cause overfitting. Simply convert all NaNs to a negative number lower than all non-NaN values (such as -999):
df[col].fillna(-999, inplace=True)
Then LGBM will no longer over-process NaNs. Instead, it will give them the same attention as other numbers. Try both ways and see which gives the highest CV.
Label Encode / Factorize / Memory Reduction
df[col], _ = df[col].factorize()
if df[col].max() < 128: df[col] = df[col].astype('int8')
elif df[col].max() < 32768: df[col] = df[col].astype('int16')
else: df[col] = df[col].astype('int32')
for col in df.columns:
    if df[col].dtype == 'float64': df[col] = df[col].astype('float32')
    if df[col].dtype == 'int64': df[col] = df[col].astype('int32')
Categorical Features
With categorical variables, you have the choice of telling LGBM that they are categorical, or you can tell LGBM to treat them as numerical (if you label encode them first). Either way, LGBM can extract the category classes. Try both ways and see which gives the highest CV. After label encoding, do the following to treat a column as categorical, or leave it as int to treat it as numeric:
df[col] = df[col].astype('category')
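A minimal pandas sketch of the two options side by side (the toy column name `col` is assumed for illustration):

```python
import pandas as pd

# Hypothetical toy frame; 'col' stands in for any categorical column.
df = pd.DataFrame({"col": ["a", "b", "a", None, "c"]})

# Option 1: label encode and keep as int (LGBM treats it as numeric;
# factorize maps missing values to -1).
df["col_int"], _ = df["col"].factorize()

# Option 2: mark as pandas 'category' so LGBM applies its
# categorical split-finding to this column.
df["col_cat"] = df["col"].astype("category")
```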
Splitting
A single (string or numeric) column can be made into two columns by splitting. For example, a string column id_30 such as "Mac OS X 10_9_5" can be split into Operating System "Mac OS X" and Version "10_9_5". Or a numeric column TransactionAmt "1230.45" can be split into Dollars "1230" and Cents "45". LGBM cannot see these pieces on its own; you need to split them.
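One way to sketch both splits in pandas (the last-token-is-version heuristic is an assumption, not the only way to parse id_30):

```python
import pandas as pd

df = pd.DataFrame({
    "id_30": ["Mac OS X 10_9_5", "Windows 10"],
    "TransactionAmt": [1230.45, 57.95],
})

# Split the OS string on the last space: everything before it is the
# OS name, the final token is the version.
df["OS"] = df["id_30"].str.rsplit(" ", n=1).str[0]
df["Version"] = df["id_30"].str.rsplit(" ", n=1).str[-1]

# Split the amount into whole dollars and cents.
df["Dollars"] = df["TransactionAmt"].astype(int)
df["Cents"] = ((df["TransactionAmt"] - df["Dollars"]) * 100).round().astype(int)
```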
Combining / Transforming / Interaction
Two (string or numeric) columns can be combined into one column. For example card1 and card2 can become a new column with
df['uid'] = df['card1'].astype(str) + '_' + df['card2'].astype(str)
This helps LGBM because, by themselves, card1 and card2 may not correlate with the target, so LGBM won't split on them at a tree node. But the interaction uid = card1_card2 may correlate with the target, and now LGBM will split on it. Numeric columns can be combined by adding, subtracting, multiplying, etc. A numeric example is
df['x1_x2'] = df['x1'] * df['x2']
Frequency Encoding
Frequency encoding is a powerful technique that allows LGBM to see whether column values are rare or common. For example, if you want LGBM to “see” which credit cards are used infrequently, try
temp = df['card1'].value_counts().to_dict()
df['card1_counts'] = df['card1'].map(temp)
Aggregations / Group Statistics
Providing LGBM with group statistics allows LGBM to determine if a value is common or rare for a particular group. You calculate group statistics by providing pandas with 3 variables. You give it the group, variable of interest, and type of statistic. For example,
temp = df.groupby('card1')['TransactionAmt'].agg(['mean']) \
    .rename({'mean': 'TransactionAmt_card1_mean'}, axis=1).reset_index()
df = pd.merge(df, temp, on='card1', how='left')
The feature here adds to each row what the average TransactionAmt is for that row’s card1 group. Therefore LGBM can now tell if a row has an abnormal TransactionAmt for their card1 group.
Normalize / Standardize
You can normalize columns against themselves. For example
df[col] = (df[col] - df[col].mean()) / df[col].std()
Or you can normalize one column against another column. For example, if you create a group statistic (described above) giving the mean value of D3 each week, then you can remove the time dependence with
df['D3_remove_time'] = df['D3'] - df['D3_week_mean']
The new variable D3_remove_time no longer increases as we advance in time, because we have normalized it against the effects of time.
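The weekly mean itself can be computed with a groupby-transform, which broadcasts the group statistic back to every row. A minimal sketch with toy data (the `week` column is assumed to exist; in the competition it would be derived from the transaction timestamp):

```python
import pandas as pd

# Toy frame: D3 drifts upward over the weeks.
df = pd.DataFrame({
    "week": [1, 1, 2, 2],
    "D3":   [10.0, 12.0, 20.0, 24.0],
})

# transform('mean') returns the per-week mean aligned to each row.
df["D3_week_mean"] = df.groupby("week")["D3"].transform("mean")

# Subtracting the weekly mean removes the time trend.
df["D3_remove_time"] = df["D3"] - df["D3_week_mean"]
```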
Outlier Removal / Relax / Smooth / PCA
Normally you want to remove anomalies from your data because they confuse your models. However, in this competition we want to find anomalies, so use smoothing techniques carefully. The idea behind these methods is to identify and remove uncommon values. For example, using the frequency encoding of a variable, you can remove all values that appear less than 0.1% of the time by replacing them with a new value like -9999 (note that you should use a different value than what you used for NaN).
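The frequency-based "relax" step above can be sketched as follows (toy data; the 0.1% threshold and -9999 sentinel follow the text, but both are tuning choices):

```python
import pandas as pd

# Toy column: value 3 appears once out of 1101 rows (< 0.1%).
df = pd.DataFrame({"card1": [1] * 600 + [2] * 500 + [3]})

# Relative frequency of each value.
freq = df["card1"].value_counts(normalize=True)

# Replace rare values with a sentinel distinct from the NaN fill value.
rare = freq[freq < 0.001].index
df.loc[df["card1"].isin(rare), "card1"] = -9999
```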