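In an earlier blog post we wrote a function that reports the distribution of data types in a dataset.
However, when building a machine-learning workflow on top of sklearn's pipeline mechanism, we would like to split the columns of the dataset into the following groups: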
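1. Columns with a large share of missing values (e.g. more than 70% of the column is null)
2. Numerical columns that do not have a large share of missing values
3. Low-cardinality categorical columns, e.g. columns with only two categories, that do not have a large share of missing values
4. Medium-cardinality categorical columns, e.g. columns with 3-10 categories, that do not have a large share of missing values
5. High-cardinality categorical columns whose values may be close to uniformly distributed, e.g. 70% or more of the values in the column are unique, that do not have a large share of missing values (the function below uses a simpler rule: more than `cardinality` distinct values)
6. All categorical columns that do not have a large share of missing values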
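The two statistics these groups rely on, the share of missing values and the number of distinct values per column, can be read straight off a DataFrame. Here is a minimal sketch on a made-up toy frame (the column names and values are hypothetical and chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two statistics.
toy = pd.DataFrame({
    "area": [50.0, 80.0, None, 120.0],
    "street": ["Pave", "Grvl", "Pave", "Pave"],
    "pool": [None, None, None, "Gd"],
})

missing_ratio = toy.isnull().mean()  # fraction of null values per column
n_unique = toy.nunique()             # distinct non-null values per column

# Columns whose missing share exceeds the 70% threshold from the list above.
high_missing = missing_ratio[missing_ratio > 0.7].index.tolist()

print(missing_ratio)
print(n_unique)
print(high_missing)
```

Here `pool` is 75% null, so it is the only column that crosses the 70% threshold, while `street` has two distinct values and would count as a two-category column.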
Based on this idea, let's write a function that does the split.
import pandas as pd


def cols_spliting(df: pd.DataFrame, y, cardinality=10, high_missing_per=0.7, drop_y=True):
    # def cols_spliting(df:pd.DataFrame, y_col, cardinality = 10, apply_onehot = True, high_missing_per= 0.7, drop_high_missing = True):
    assert len(df.index) != 0
    if drop_y:
        # Drop the target column on a new frame so the caller's df is left untouched
        df = df.drop(y, axis=1)

    binary_categorical_cols = []
    thin_categorical_cols = []
    uniform_categorical_cols = []
    categorical_cols = []
    numerical_cols = []
    other_cols = []
    high_missing_cols = []
    small_missing_cols = []
    missing_cols = []

    count = df.shape[0]
    for col in df.columns:
        unique = df[col].nunique()  # number of distinct non-null values
        dtype = df[col].dtype
        missing_count = df[col].isnull().sum()
        per = missing_count / count  # fraction of missing values in this column

        # Determine the type group
        if unique <= 2 and dtype == 'object':
            binary_categorical_cols.append(col)
            categorical_cols.append(col)
        elif unique > 2 and unique <= cardinality and dtype == 'object':
            thin_categorical_cols.append(col)
            categorical_cols.append(col)
        elif unique > cardinality and dtype == 'object':
            uniform_categorical_cols.append(col)
            categorical_cols.append(col)
        elif dtype in ['int64', 'float64']:
            numerical_cols.append(col)
        else:
            # Any other dtype (bool, datetime, int32, ...) ends up here
            other_cols.append(col)

        # Determine the missing-value group
        if per > 0 and per <= high_missing_per:
            small_missing_cols.append(col)
            missing_cols.append(col)
        elif per > high_missing_per:
            high_missing_cols.append(col)
            missing_cols.append(col)

    joint_list = ['binary_categorical_without_high_missing = ',
                  'thin_categorical_without_high_missing = ',
                  'uniform_categorical_without_high_missing = ',
                  'numerical_without_high_missing = ',
                  'categorical_without_high_missing = ']
    data_type_list = [binary_categorical_cols, thin_categorical_cols, uniform_categorical_cols,
                      numerical_cols, categorical_cols]

    def get_complementary_set(l1, l2):  # column list first, then the missing list
        # Set difference, so the original column order is not preserved
        return list(set(l1) - set(l2))

    # Print each type group with the high-missing columns removed
    for label, cols in zip(joint_list, data_type_list):
        result = get_complementary_set(cols, high_missing_cols)
        print(label + str(result) + '\n')
    print('high_missing = ' + str(high_missing_cols))
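Because `cols_spliting` only prints its result, the groups cannot be reused directly in later code. Below is a rough sketch of a variant that returns them as a dict instead. The name `split_columns`, the dict keys, and the toy data are all hypothetical. It follows the same dtype and threshold rules as the function above, but it keeps the original column order and ignores columns of other dtypes.

```python
import pandas as pd


def split_columns(df, cardinality=10, high_missing_per=0.7):
    """Hypothetical returning variant: same rules as cols_spliting, no printing."""
    groups = {
        "binary_categorical": [],
        "thin_categorical": [],
        "uniform_categorical": [],
        "numerical": [],
        "high_missing": [],
    }
    for col in df.columns:
        s = df[col]
        if s.isnull().mean() > high_missing_per:
            # High-missing columns are excluded from every other group.
            groups["high_missing"].append(col)
        elif s.dtype == "object":
            n = s.nunique()
            if n <= 2:
                groups["binary_categorical"].append(col)
            elif n <= cardinality:
                groups["thin_categorical"].append(col)
            else:
                groups["uniform_categorical"].append(col)
        elif s.dtype in ("int64", "float64"):
            groups["numerical"].append(col)
    # All categorical columns, regardless of cardinality.
    groups["categorical"] = (groups["binary_categorical"]
                             + groups["thin_categorical"]
                             + groups["uniform_categorical"])
    return groups


# Hypothetical toy data; object dtype is set explicitly to match the dtype check.
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "central_air": pd.Series(["Y", "N", "Y", "Y"], dtype=object),
    "zone": pd.Series(["A", "B", "C", "A"], dtype=object),
    "pool": pd.Series([None, None, None, "Gd"], dtype=object),
})
groups = split_columns(toy)
print(groups)
```

Returning a dict means the lists can be handed to whatever comes next, at the cost of no longer matching the printed format shown in this post.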
With small modifications to the earlier code, we tested this on the house-price regression dataset and got the following result:
binary_categorical_without_high_missing = ['Utilities', 'Street', 'CentralAir']
thin_categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Electrical', 'BsmtCond', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'PavedDrive', 'SaleType']
uniform_categorical_without_high_missing = ['Exterior1st', 'Exterior2nd', 'Neighborhood']
numerical_without_high_missing = ['FullBath', 'MoSold', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF', 'LotFrontage', 'GarageArea', '2ndFlrSF', 'OverallCond', '1stFlrSF', 'HalfBath', 'MasVnrArea', 'BsmtFinSF1', 'GarageCars', 'TotRmsAbvGrd', 'WoodDeckSF', 'Fireplaces', 'OpenPorchSF', 'OverallQual', 'ScreenPorch', 'BsmtFullBath', 'GarageYrBlt', 'MSSubClass', 'YrSold', 'BedroomAbvGr', 'GrLivArea', 'KitchenAbvGr', 'PoolArea', '3SsnPorch', 'TotalBsmtSF', 'YearRemodAdd', 'BsmtFinSF2', 'EnclosedPorch', 'BsmtHalfBath', 'MiscVal', 'YearBuilt']
categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Utilities', 'Electrical', 'BsmtCond', 'Street', 'Exterior2nd', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'CentralAir', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'Exterior1st', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'Neighborhood', 'PavedDrive', 'SaleType']
high_missing = ['Alley', 'PoolQC', 'Fence', 'MiscFeature']
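Since the motivation was sklearn's pipeline mechanism, here is a rough sketch of how column groups like these might be wired into a `ColumnTransformer`. The toy values are made up, only three of the column names above are used, and the choice of imputers and encoder is an assumption for illustration, not something the original code prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "CentralAir": pd.Series(["Y", "Y", "N", "Y"], dtype=object),
    "PoolQC": pd.Series([None, None, None, "Gd"], dtype=object),
})

# Column groups as the split would report them for this toy frame.
numerical_cols = ["LotArea"]
categorical_cols = ["CentralAir"]
high_missing_cols = ["PoolQC"]  # not listed in any transformer, so it is dropped

categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), numerical_cols),
        ("cat", categorical_steps, categorical_cols),
    ],
    remainder="drop",  # columns in no group, such as PoolQC, are discarded
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Each group gets its own preprocessing branch, which is the main reason for splitting the columns up front.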