1. Converting string variables to integer codes
The goal is to convert the string values shown above into the integer codes [0, 1, 2].
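from pandas.api.types import CategoricalDtype
# Ordered categories: each value's position in this list becomes its integer code (0, 1, 2)
Vehicle_Age_dtype = CategoricalDtype(categories=['< 1 Year', '1-2 Year', '> 2 Years'], ordered=True)
# .cat.codes returns the integer code of each row; missing values get -1
train['Vehicle_Age'] = train['Vehicle_Age'].astype(Vehicle_Age_dtype).cat.codes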
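As a quick check of how the codes are assigned, here is a minimal sketch on toy data. The Series `toy` is a hypothetical stand-in for train['Vehicle_Age'] and is not taken from the real dataset.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy values standing in for train['Vehicle_Age']; None is a missing value
toy = pd.Series(['1-2 Year', '< 1 Year', '> 2 Years', '< 1 Year', None])

age_dtype = CategoricalDtype(
    categories=['< 1 Year', '1-2 Year', '> 2 Years'], ordered=True
)

# The code of each value is its position in the categories list
code_list = toy.astype(age_dtype).cat.codes.tolist()
print(code_list)  # [1, 0, 2, 0, -1]
```

The -1 for the missing value is worth checking for before training, since it looks like an ordinary integer afterwards.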
2. Converting a continuous variable to a categorical variable
First bin the data, then convert the bins to integer codes.
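import pandas as pd
# Split Age into 7 equal-width intervals; the result is an ordered categorical Series
age_bins = pd.cut(train['Age'], bins=7)
age_bins
# Replace each interval with its integer code (0 to 6)
train['Age_bins'] = age_bins.cat.codes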
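The binning step can be tried on a small example. This is only a sketch: the ages in `ages` are made up for illustration. It also shows that passing labels=False to pd.cut returns the integer codes directly, which is an alternative to calling .cat.codes.

```python
import pandas as pd

# Hypothetical ages, not taken from the real dataset
ages = pd.Series([20, 25, 33, 41, 58, 64, 85])

# 7 equal-width bins over the range [20, 85]; intervals are closed on the right
binned = pd.cut(ages, bins=7)
bin_codes = binned.cat.codes.tolist()
print(bin_codes)  # [0, 0, 1, 2, 4, 4, 6]

# labels=False skips the interval labels and returns the bin index of each value
codes_direct = pd.cut(ages, bins=7, labels=False).tolist()
```

Note that equal-width bins can end up empty (bins 3 and 5 here), so the codes that actually appear may not cover every value from 0 to 6.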
3. One-hot encoding
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
coltrans = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['Age_bins', 'Vehicle_Age', 'Gender', 'Vehicle_Damage']),
        ('minmax', MinMaxScaler(), ['Policy_Sales_Channel'])
    ],
    remainder='passthrough'  # Keeps columns not specified in transformers
)
train_x_trans = coltrans.fit_transform(train_x)
train_x_trans
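The 'Age_bins', 'Vehicle_Age', 'Gender', and 'Vehicle_Damage' columns are one-hot encoded and concatenated. The 'Policy_Sales_Channel' column, treated here as a continuous variable, is MinMax-scaled and appended as one column after the one-hot columns. Because remainder='passthrough', every column not named in a transformer is passed through unchanged and appended after that. The resulting train_x_trans is no longer a DataFrame: it is a NumPy array, or a SciPy sparse matrix when the combined output is sparse enough (OneHotEncoder returns sparse output by default, and ColumnTransformer's sparse_threshold defaults to 0.3).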