Code walkthrough
- First, import the packages
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
- Then create a `feature_df`
feature_df = pd.DataFrame({"fea1":[0,1,2,3], "fea2":[4,5,6,7], "fea3":[8,9,10,11]})
feature_df
- Select the columns to cross and apply polynomial (poly) feature crossing; here we use `fea1` and `fea2`
crossed_feas = ["fea1", "fea2"]
ct = ColumnTransformer([('poly', PolynomialFeatures(degree=2, include_bias=False), crossed_feas)])
crossed_features = ct.fit_transform(feature_df)
- Get the names of the crossed features; note that these include the original feature names
poly_feature_names = ct.named_transformers_['poly'].get_feature_names_out(crossed_feas)
print(f"poly_feature_names: {poly_feature_names}")
poly_feature_names: ['fea1' 'fea2' 'fea1^2' 'fea1 fea2' 'fea2^2']
- Append the crossed features to the original `feature_df`
feature_df_crossed = pd.DataFrame(crossed_features, columns=poly_feature_names, index=feature_df.index)
# Keep only the newly generated features; the original ones are already in feature_df, so they do not need to be concatenated again
feature_df_crossed_only = feature_df_crossed.iloc[:, len(crossed_feas):]
feature_df = pd.concat([feature_df, feature_df_crossed_only], axis=1)
feature_df
Full code
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd
feature_df = pd.DataFrame({"fea1":[0,1,2,3], "fea2":[4,5,6,7], "fea3":[8,9,10,11]})
# Polynomial feature crossing
crossed_feas = ["fea1", "fea2"]
ct = ColumnTransformer([('poly', PolynomialFeatures(degree=2, include_bias=False), crossed_feas)])
# Build the DataFrame of new features
crossed_features = ct.fit_transform(feature_df)
poly_feature_names = ct.named_transformers_['poly'].get_feature_names_out(crossed_feas)
print(f"poly_feature_names: {poly_feature_names}")
feature_df_crossed = pd.DataFrame(crossed_features, columns=poly_feature_names, index=feature_df.index)
feature_df_crossed_only = feature_df_crossed.iloc[:, len(crossed_feas):]
# Append to the original DataFrame
feature_df = pd.concat([feature_df, feature_df_crossed_only], axis=1)
feature_df
Using PolynomialFeatures
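`PolynomialFeatures` is a transformer that generates polynomial features. Its parameters are:
- `degree`: the maximum degree of the generated polynomial features. Defaults to 2.
- `include_bias`: whether to include a bias column. If True, a column of all ones (the degree-0 term) is added to the generated feature matrix. Defaults to True.
- `interaction_only`: whether to generate only interaction terms. If True, only products of distinct input features are produced (e.g. `fea1 fea2`), and powers of a single feature (e.g. `fea1^2`) are omitted. Defaults to False.
- `order`: the memory layout of the output array in the dense case, either 'C' (row-major) or 'F' (column-major). It does not change which features are generated or the order of the columns. 'F' is faster to compute but may slow down subsequent estimators. Defaults to 'C'.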