Splitting a pandas Dataset's Columns by Data Type, Part II

21岁害怕编程

Last modified: 2022-03-06 20:28:31

First published: 2022-03-06 19:45:46

Article link: https://blog.csdn.net/RuGe_Lee/article/details/123315743

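This post presents a function that partitions a dataset's features by missing-value ratio, numeric type, and number of categories. It groups the columns into five kinds: columns with a large share of missing values, numerical columns, low-cardinality categorical columns, medium-cardinality categorical columns, and high-cardinality categorical columns. The function is then applied to a house price dataset, and the columns that fall into each group are listed.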

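In a previous post we wrote a function that reports how the data types in a dataset are distributed.

Splitting Dataset Columns by Data Type, Part I

However, when working with the pipeline machinery in sklearn, we would like to split the columns of a dataset into the following groups: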

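1. Columns with a large share of missing values (for example, more than 70% of the column is null)

2. Numerical data, without a large share of missing values

3. Low-cardinality categorical data, such as columns with only two categories, without a large share of missing values

4. Medium-cardinality categorical data, such as columns with 3 to 10 categories, without a large share of missing values

5. High-cardinality categorical data, whose values may be spread almost uniformly (for example, more than 70% of the values in the column are unique), without a large share of missing values. The function below uses a simpler test: more than `cardinality` distinct values.

6. All categorical data, without a large share of missing values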
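
The three quantities these groups depend on (dtype, number of distinct values, share of missing values) can be inspected per column before any splitting is done. The snippet below is only a small sketch on made-up toy data; the column names and values are assumptions for illustration and are not taken from the house price dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "area": [50.0, 72.5, 64.0, 80.0],
    "street": ["Pave", "Grvl", "Pave", "Pave"],
    "alley": [None, None, None, "Grvl"],
})

summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),                 # NaN is not counted as a category
    "unique_ratio": toy.nunique() / len(toy),  # share of distinct values per column
    "missing_ratio": toy.isnull().mean(),      # share of null entries per column
})
print(summary)
```

Here `alley` has a missing ratio of 0.75, so with a 70% threshold it would land in the first group regardless of its dtype.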

With this in mind, let's write a function that does the job.

import pandas as pd

def cols_spliting(df: pd.DataFrame, y, cardinality=10, high_missing_per=0.7, drop_y=True):
    # An empty DataFrame cannot be classified
    assert len(df.index) != 0

    if drop_y:
        # Drop the target column y on a copy, so the caller's DataFrame is not modified
        df = df.drop(y, axis=1)

    binary_categorical_cols = []
    thin_categorical_cols = []
    uniform_categorical_cols = []
    categorical_cols = []
    numerical_cols = []
    other_cols = []
    
    high_missing_cols = []
    small_missing_cols = []
    missing_cols = []
    
    count = df.shape[0]
    
    for col in df.columns:
        unique = df[col].nunique()
        dtype = df[col].dtype
        
        missing_count = df[col].isnull().sum()
            
        per = missing_count/count
        
        # Classify by dtype and number of distinct values (nunique() ignores NaN)
        if unique <= 2 and dtype == 'object':
            binary_categorical_cols.append(col)
            categorical_cols.append(col)

        elif 2 < unique <= cardinality and dtype == 'object':
            thin_categorical_cols.append(col)
            categorical_cols.append(col)

        elif unique > cardinality and dtype == 'object':
            uniform_categorical_cols.append(col)
            categorical_cols.append(col)

        elif dtype in ['int64', 'float64']:
            numerical_cols.append(col)

        else:
            # Any dtype other than object, int64, or float64
            other_cols.append(col)

        # Classify by share of missing values
        if 0 < per <= high_missing_per:
            small_missing_cols.append(col)
            missing_cols.append(col)

        elif per > high_missing_per:
            high_missing_cols.append(col)
            missing_cols.append(col)
    
    # Labels printed in front of each group, in the same order as data_type_list
    joint_list = ['binary_categorical_without_high_missing = ',
                    'thin_categorical_without_high_missing = ',
                    'uniform_categorical_without_high_missing = ',
                    'numerical_without_high_missing = ',
                    'categorical_without_high_missing = ']
    
    data_type_list = [binary_categorical_cols,thin_categorical_cols,uniform_categorical_cols,numerical_cols,categorical_cols]
    
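    def get_complementary_set(l1, l2):  # column list first, then the missing list
        # Set difference: the order of the returned columns is not preserved
        return list(set(l1) - set(l2))

    # Print each group with the high-missing columns removed
    for label, cols in zip(joint_list, data_type_list):
        result = get_complementary_set(cols, high_missing_cols)
        print(label + str(result) + '\n')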
    
    print('high_missing = ' + str(high_missing_cols))
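
The function above only prints its groups. If the lists are needed by later code, one option is to return them instead. The sketch below is a hypothetical variant (the name `split_columns`, the dict keys, and the toy data are my own assumptions, not the author's code) that applies the same dtype and cardinality rules, keeps the original column order, and returns a dict. The string columns are created with an explicit `object` dtype so that the `dtype == "object"` test behaves the same as in the post.

```python
import pandas as pd


def split_columns(df, cardinality=10, high_missing_per=0.7):
    """Hypothetical variant of cols_spliting that returns the groups as a dict."""
    groups = {
        "binary_categorical": [],
        "thin_categorical": [],
        "uniform_categorical": [],
        "numerical": [],
        "other": [],
        "high_missing": [],
    }
    for col in df.columns:
        # Mostly-empty columns go to their own group and nowhere else
        if df[col].isnull().mean() > high_missing_per:
            groups["high_missing"].append(col)
            continue

        unique = df[col].nunique()  # NaN is not counted
        dtype = df[col].dtype
        if dtype == "object":
            if unique <= 2:
                groups["binary_categorical"].append(col)
            elif unique <= cardinality:
                groups["thin_categorical"].append(col)
            else:
                groups["uniform_categorical"].append(col)
        elif dtype in ("int64", "float64"):
            groups["numerical"].append(col)
        else:
            groups["other"].append(col)

    # All categorical columns that are not mostly empty
    groups["categorical"] = (
        groups["binary_categorical"]
        + groups["thin_categorical"]
        + groups["uniform_categorical"]
    )
    return groups


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "area": [50.0, 72.5, 64.0, 80.0, 91.0],
    "street": pd.Series(["Pave", "Grvl", "Pave", "Pave", "Grvl"], dtype="object"),
    "shape": pd.Series(["Reg", "IR1", "IR2", "Reg", "IR1"], dtype="object"),
    "hood": pd.Series(["A", "B", "C", "D", "E"], dtype="object"),
    "alley": pd.Series([None, None, None, None, "Grvl"], dtype="object"),
})

groups = split_columns(toy, cardinality=3)
for name, cols in groups.items():
    print(name, "=", cols)
```

Unlike the printing version, this sketch does not use a set difference, so the order of columns inside each group follows the order in the DataFrame.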

After making small changes to the earlier code, we test it on the house price regression dataset and get the following result:

binary_categorical_without_high_missing = ['Utilities', 'Street', 'CentralAir']

thin_categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Electrical', 'BsmtCond', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'PavedDrive', 'SaleType']

uniform_categorical_without_high_missing = ['Exterior1st', 'Exterior2nd', 'Neighborhood']

numerical_without_high_missing = ['FullBath', 'MoSold', 'LotArea', 'BsmtUnfSF', 'LowQualFinSF', 'LotFrontage', 'GarageArea', '2ndFlrSF', 'OverallCond', '1stFlrSF', 'HalfBath', 'MasVnrArea', 'BsmtFinSF1', 'GarageCars', 'TotRmsAbvGrd', 'WoodDeckSF', 'Fireplaces', 'OpenPorchSF', 'OverallQual', 'ScreenPorch', 'BsmtFullBath', 'GarageYrBlt', 'MSSubClass', 'YrSold', 'BedroomAbvGr', 'GrLivArea', 'KitchenAbvGr', 'PoolArea', '3SsnPorch', 'TotalBsmtSF', 'YearRemodAdd', 'BsmtFinSF2', 'EnclosedPorch', 'BsmtHalfBath', 'MiscVal', 'YearBuilt']

categorical_without_high_missing = ['GarageFinish', 'RoofMatl', 'LotShape', 'BsmtFinType2', 'MasVnrType', 'Condition1', 'LotConfig', 'KitchenQual', 'BsmtExposure', 'LandSlope', 'FireplaceQu', 'Utilities', 'Electrical', 'BsmtCond', 'Street', 'Exterior2nd', 'BsmtFinType1', 'GarageCond', 'LandContour', 'BldgType', 'CentralAir', 'Condition2', 'RoofStyle', 'Functional', 'GarageQual', 'GarageType', 'ExterQual', 'SaleCondition', 'HeatingQC', 'MSZoning', 'HouseStyle', 'Exterior1st', 'BsmtQual', 'Heating', 'ExterCond', 'Foundation', 'Neighborhood', 'PavedDrive', 'SaleType']

high_missing = ['Alley', 'PoolQC', 'Fence', 'MiscFeature']
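
Since the motivation was sklearn's pipeline machinery, here is one possible way such groups could be wired into a `ColumnTransformer`. This is only a sketch under assumptions: the column names are picked from the lists printed above, the values are invented, and the choice of imputers, scaler, and encoder is mine rather than anything the post prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A few column names taken from the printed groups; the values are made up
numerical = ["LotArea", "GrLivArea"]
categorical = ["Street", "CentralAir"]

toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "GrLivArea": [1710.0, 1262.0, 1786.0, 1717.0],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "Y", "N", "Y"],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],  # high-missing column, left out
})

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, numerical),
        ("cat", categorical_steps, categorical),
    ],
    remainder="drop",  # columns not listed above (here: Alley) are discarded
)

X = preprocess.fit_transform(toy)
print(X.shape)
```

Each group gets its own preprocessing branch, and the high-missing columns are dropped simply by not listing them. In a real setup the binary, medium-cardinality, and high-cardinality groups could each be given a different encoder.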