Splitting pandas dataset columns by data type, Part II

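This post presents a function that splits the features of a dataset according to their share of missing values, whether they are numerical, and their number of categories. The function sorts columns into 5 groups: columns with many missing values, numerical columns, and low-, medium- and high-cardinality categorical columns. It is then applied to a house price dataset, and the columns that fall into each group are listed.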

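In a previous blog post we wrote a function that reports the distribution of data types in a dataset.

Splitting dataset columns by data type, Part I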

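However, when working with the pipeline mechanism in sklearn, we would like to split the columns of a dataset into the following groups:

1. Columns with a large share of missing values (for example, more than 70% of the column is null)

2. Numerical data, without a large share of missing values

3. Low-cardinality categorical data, for example a column with only two categories, without a large share of missing values

4. Medium-cardinality categorical data, for example a column with 3-10 categories, without a large share of missing values

5. High-cardinality categorical data, whose values may be spread almost uniformly (for example, more than 70% of the values in the column are unique), without a large share of missing values. In the function below this is implemented as "more than `cardinality` distinct values".

6. All categorical data, without a large share of missing values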
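As a minimal sketch of the quantities these criteria rely on, the snippet below computes each column's dtype, share of missing values and number of distinct values. The helper name `column_profile` and the toy DataFrame are assumptions made for illustration; they do not come from the original code.

```python
import numpy as np
import pandas as pd


def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, share of missing values and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_ratio": df.isnull().mean(),  # fraction of null entries per column
        "n_unique": df.nunique(),             # nulls are not counted as a value
    })


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "street": ["Pave", "Grvl", "Pave", "Pave"],
    "pool": [np.nan, np.nan, np.nan, "Gd"],
})
profile = column_profile(toy)
print(profile)
```

Comparing `missing_ratio` against a threshold such as 0.7 and `n_unique` against a cardinality limit such as 10 gives the groups listed above.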

Based on this idea, let us write a function to solve the problem.

import pandas as pd

def cols_spliting(df: pd.DataFrame, y, cardinality=10, high_missing_per=0.7, drop_y=True):
    assert len(df.index) != 0
    
    if drop_y:
        # drop the target column on a copy so the caller's DataFrame is not modified
        df = df.drop(y, axis=1)
    
    binary_categorical_cols = []
    thin_categorical_cols = []
    uniform_categorical_cols = []
    categorical_cols = []
    numerical_cols = []
    other_cols = []
    
    high_missing_cols = []
    small_missing_cols = []
    missing_cols = []
    
    count = df.shape[0]
    
    for col in df.columns:
        unique = df[col].nunique()
        dtype = df[col].dtype
        
        missing_count = df[col].isnull().sum()
            
        per = missing_count/count
        
        # determine the column type
        if unique <= 2 and dtype == 'object':
            binary_categorical_cols.append(col)
            categorical_cols.append(col)
        
        elif unique > 2 and unique <= cardinality and dtype == 'object':
            thin_categorical_cols.append(col)
            categorical_cols.append(col)
    
        elif unique > cardinality and dtype == 'object':
            uniform_categorical_cols.append(col)
            categorical_cols.append(col)
        
        elif dtype in ['int64', 'float64']:
            numerical_cols.append(col)
        
        else:
            other_cols.append(col)    
               
        # determine the missing-value level
        if  per > 0 and per <= high_missing_per:
            small_missing_cols.append(col)
            missing_cols.append(col)
            
        elif per > high_missing_per:
            high_missing_cols.append(col)
            missing_cols.append(col)
    
    # labels printed in front of each result list
    joint_list = ['binary_categorical_without_high_missing = ',
                    'thin_categorical_without_high_missing = ',
                    'uniform_categorical_without_high_missing = ',
                    'numerical_without_high_missing = ',
                    'categorical_without_high_missing = ']
    
    data_type_list = [binary_categorical_cols,thin_categorical_cols,uniform_categorical_cols,numerical_cols,categorical_cols]
    
    def get_complementary_set(l1,l2): # l1: columns of one type, l2: high-missing columns; returns l1 minus l2 (order not preserved)
        set1 = set(l1)
        set2 = set(l2)
        set3 = set1 - set2      
        return list(set3)
    
    for label, l in zip(joint_list, data_type_list):
        result = get_complementary_set(l, high_missing_cols)
        print(label + str(result) + '\n')
    
    print('high_missing = ' + str(high_missing_cols))
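
Because the function above only prints its result, the lists cannot be reused directly in later code. The sketch below is one possible variant, not the author's code: the name `split_columns` and the dictionary keys are assumptions. It follows the same thresholds but returns the groups and keeps the original column order. It also swaps the literal dtype names for pandas' dtype helpers, so other numeric widths such as `int32` (and booleans) are treated as numerical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def split_columns(df, cardinality=10, high_missing_per=0.7):
    """Return column groups as a dict of lists (illustrative sketch)."""
    groups = {name: [] for name in
              ("binary", "thin", "uniform", "numerical", "other", "high_missing")}
    for col in df.columns:
        s = df[col]
        # columns with too many missing values form their own group
        if s.isnull().mean() > high_missing_per:
            groups["high_missing"].append(col)
            continue
        n_unique = s.nunique()
        if s.dtype == "object" or is_string_dtype(s):
            if n_unique <= 2:
                groups["binary"].append(col)
            elif n_unique <= cardinality:
                groups["thin"].append(col)
            else:
                groups["uniform"].append(col)
        elif is_numeric_dtype(s):
            groups["numerical"].append(col)
        else:
            groups["other"].append(col)
    # all categorical columns that are not high-missing
    groups["categorical"] = groups["binary"] + groups["thin"] + groups["uniform"]
    return groups


# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "area": [50.0, 80.0, np.nan, 120.0],
    "street": ["Pave", "Grvl", "Pave", "Pave"],
    "zone": ["RL", "RM", "FV", "RL"],
    "pool": [np.nan, np.nan, np.nan, "Gd"],
})
groups = split_columns(toy)
print(groups)
```

Returning a dictionary keeps every group addressable by name, which is convenient when the lists are later handed to a preprocessing step.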

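With a small modification to the earlier code, we test the function on the house price regression dataset and get the following result: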

binary_categorical_without_high_missing = ['Utilities', 'Street', 'CentralAir']

thin_categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Electrical', 'BsmtCond', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'PavedDrive', 'SaleType']

uniform_categorical_without_high_missing = ['Exterior1st', 'Exterior2nd', 'Neighborhood']

numerical_without_high_missing = ['FullBath', 'MoSold', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF', 'LotFrontage', 'GarageArea', '2ndFlrSF', 'OverallCond', '1stFlrSF', 'HalfBath', 'MasVnrArea', 'BsmtFinSF1', 'GarageCars', 'TotRmsAbvGrd', 'WoodDeckSF', 'Fireplaces', 'OpenPorchSF', 'OverallQual', 'ScreenPorch', 'BsmtFullBath', 'GarageYrBlt', 'MSSubClass', 'YrSold', 'BedroomAbvGr', 'GrLivArea', 'KitchenAbvGr', 'PoolArea', '3SsnPorch', 'TotalBsmtSF', 'YearRemodAdd', 'BsmtFinSF2', 'EnclosedPorch', 'BsmtHalfBath', 'MiscVal', 'YearBuilt']

categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Utilities', 'Electrical', 'BsmtCond', 'Street', 'Exterior2nd', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'CentralAir', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'Exterior1st', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'Neighborhood', 'PavedDrive', 'SaleType']

high_missing = ['Alley', 'PoolQC', 'Fence', 'MiscFeature']
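
Since the motivation for this split is sklearn's pipeline mechanism, here is a hedged sketch of how such column lists might be wired into a `ColumnTransformer`. The choice of imputers, scaler and encoder is an assumption made for illustration, not something the post prescribes. The column names are a few taken from the lists above, and the rows are invented only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A few column names taken from the printed groups
numerical_cols = ["LotArea", "OverallQual"]
categorical_cols = ["CentralAir", "KitchenQual"]

# Assumed preprocessing: median imputation + scaling for numbers
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Assumed preprocessing: most-frequent imputation + one-hot for categories
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numerical_cols),
        ("cat", categorical_pipe, categorical_cols),
    ],
    remainder="drop",  # columns not listed (e.g. high-missing ones) are dropped
)

# Hypothetical rows, not real records from the dataset
toy = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan, 9550],
    "OverallQual": [7, 6, 7, 7],
    "CentralAir": ["Y", "Y", "N", "Y"],
    "KitchenQual": ["Gd", "TA", np.nan, "Gd"],
    "PoolQC": [np.nan, np.nan, np.nan, "Ex"],
})
features = preprocess.fit_transform(toy)
print(features.shape)
```

Each group gets its own sub-pipeline, so a different encoder could be assigned to the high-cardinality columns without touching the rest.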
