pandas获取数据集数据类型分布（更细粒度的分割）

21岁害怕编程

已于 2022-03-10 22:27:10 修改

阅读量2.2k

点赞数 1

分类专栏： pandas pandas-profiling python 文章标签： python 数据分析人工智能

于 2022-03-05 21:58:07 首次发布

本文链接：https://blog.csdn.net/RuGe_Lee/article/details/123301924

版权

pandas 同时被 3 个专栏收录

14 篇文章 2 订阅

订阅专栏

python

11 篇文章 0 订阅

订阅专栏

pandas-profiling

2 篇文章 0 订阅

订阅专栏

方法一：使用pandas内置接口

在pandas中，获取数据类型有几个方法，以泰坦尼克号数据集为例，

1.拿到numerical数据类型

df.select_dtypes('number').columns

Index(['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

2. 拿到categorical数据类型

df.select_dtypes('object').columns

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

还有

df.select_dtypes('category').columns

Index([], dtype='object')

方法二：pandas_profiling输出分析

以泰坦尼克号数据集为例，如果只是康康然后疯狂复制粘贴的话那没啥，但是如果想全流程自动化，就要把ProfileReport的结果用到接下来的数据处理中，不妨把结果输出到json文件。

from pandas_profiling import ProfileReport
import pandas as pd

df = pd.read_csv('train.csv',index_col=['PassengerId'])
report = ProfileReport(df,dark_mode = True,explorative=True)

report.to_file('result.json')

在这个json文件中，有如下的结构，取自己需要的统计即可，但是有亿点麻烦(doge)

方法三：自己动手写函数（相当于又把数据集分析了一遍doge）

def cols_spliting(df:pd.DataFrame, cardinality = 10, high_missing_per= 0.7, drop_high_missing = True):
    assert len(df.index) != 0
																
    binary_categorical_cols = []
    thin_categorical_cols = []
    uniform_categorical_cols = []
    categorical_cols = []
    numerical_cols = []
    other_cols = []
    
    high_missing_cols = []
    small_missing_cols = []
    missing_cols = []
    
    count = df.shape[0]
    
    for col in df.columns:
        unique = df[col].nunique()
        dtype = df[col].dtype
        
        missing_count = df[col].isnull().sum()          
        per = missing_count/count
        
        # type determine
        if unique <= 2 and dtype == 'object':
            binary_categorical_cols.append(col)
            categorical_cols.append(col)
        
        elif unique > 2 and unique <= cardinality and dtype == 'object':
            thin_categorical_cols.append(col)
            categorical_cols.append(col)
    
        elif unique > cardinality and dtype == 'object':
            uniform_categorical_cols.append(col)
            categorical_cols.append(col)
        
        elif dtype in ['int64', 'float64']:
            numerical_cols.append(col)
        
        else:
            other_cols.append(col)    
        
        
        # missing determine
        if  per > 0 and per <= high_missing_per:
            small_missing_cols.append(col)
            missing_cols.append(col)
            
        elif per > high_missing_per:
            high_missing_cols.append(col)
            missing_cols.append(col)
    
    print('--------------col types---------------')        
    print('categorical cols with 1-2 distinct values: ' + str(binary_categorical_cols))
    print('categorical cols with 3-{} distinct values: '.format(cardinality) + str(thin_categorical_cols))
    print('categorical cols with more than {} distinct values: '.format(cardinality) + str(uniform_categorical_cols))
    print('categorical cols: ' + str(categorical_cols))
    print('numerical cols : ' + str(numerical_cols))
    print('-------------missing cols-------------')
    # print('categorical cols with 2-3 distinct valuses: ' + str(other_cols))
    print('missing cols with more than {} :'.format(high_missing_per) + str(high_missing_cols))
    print('missing cols with less than {} :'.format(high_missing_per) + str(small_missing_cols))
    print('missing cols: ' + str(missing_cols))

以泰坦尼克号为例：

cols_spliting(df)

结果为

--------------col types---------------
categorical cols with 2-3 distinct values: ['Sex']
categorical cols with 3-10 distinct values: ['Pclass', 'SibSp', 'Parch', 'Embarked']
categorical cols with more than 10 distinct values: ['Name', 'Ticket', 'Cabin']
categorical cols: ['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']
numerical cols : ['Age', 'Fare']
-------------missing cols-------------
missing cols with more than 0.7 : ['Cabin']
missing cols with less than 0.7 : ['Age', 'Embarked']
missing cols: ['Age', 'Cabin', 'Embarked']

如果我们对输出的数据还要要求，让他更适用于机器学习，不妨康康这个

pandas数据集类型划分II

21岁害怕编程

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
pandas获取数据集数据类型分布（更细粒度的分割）

方法一：pandas_profiling输出分析以泰坦尼克号数据集为例，如果只是康康然后疯狂复制粘贴的话那没啥，但是如果想全流程自动化，就要把ProfileReport的结果用到接下来的数据处理中，不妨把结果输出到json文件。from pandas_profiling import ProfileReportimport pandas as pddf = pd.read_csv('train.csv',index_col=['PassengerId'])report = ProfileR
复制链接

扫一扫

专栏目录