Data Mining: Data Exploration

Latest recommended article published 2022-08-06 18:06:06

宋应

Categories: Machine Learning, Python. Tags: data mining, python, data exploration, features

Article link: https://blog.csdn.net/songying2012/article/details/51024122

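This article covers:
1. Exploring categorical features: counting how many categories each one has
2. Exploring numerical features: ways to discretize them
3. Removing features whose values are almost all the same
4. Handling time features
Required Python packages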

from pandas import Series, DataFrame
import pandas as pd

I. Counting the categories of each categorical feature

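def FindNumOfCatFeacture(data, feacture_cols, Flag_dropcat=50):
    '''
    Count the distinct categories in each categorical feature and flag
    the features that have too many categories.
    Input:  data: the full dataset, including Index and target
            feacture_cols: names of the categorical feature columns
            Flag_dropcat: drop a categorical feature when its number of
                          categories exceeds this value
    Output: name_len: list of tuples, [('feacture1', len), ('feacture2', len)]
            dropCat_cols: list of column names to drop (too many categories)
    '''
    # Number of distinct categories in one column
    def num_cat(x):
        return len(x.value_counts())

    CatData = data[feacture_cols]
    lenCatData = CatData.apply(num_cat, axis=0)
    CatData_cols = list(CatData.columns)
    # Materialize the pairs as a list: in Python 3, zip() returns a
    # one-shot iterator that cannot be indexed or reused
    name_len = list(zip(CatData_cols, lenCatData))
    dropCat_cols = [name for name, n in name_len if n > Flag_dropcat]

    return name_len, dropCat_cols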
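
As a quick illustration of the same idea, the sketch below counts distinct values per column with pandas' `nunique` on a small made-up DataFrame. The column names (`city`, `user_id`) and the threshold of 3 are hypothetical and chosen only for the example; they are not from the original article.

```python
import pandas as pd

# Toy data (hypothetical): 'city' has few categories, 'user_id' has many.
toy = pd.DataFrame({
    'city': ['A', 'B', 'A', 'B', 'A'],
    'user_id': ['u1', 'u2', 'u3', 'u4', 'u5'],
})

# nunique() counts distinct non-null values per column,
# which is the same quantity as len(col.value_counts()).
n_categories = toy[['city', 'user_id']].nunique()

threshold = 3  # assumed cut-off, playing the role of Flag_dropcat
too_many = list(n_categories[n_categories > threshold].index)

print(n_categories.to_dict())
print(too_many)
```

A column such as an ID, where nearly every row has its own category, is the typical case this kind of threshold is meant to catch.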

II. Discretizing numerical features

# Binning:
def binning(col, cut_points, labels=None):
    '''
    Discretize a continuous feature into a small number of bins.
    '''
    # Minimum and maximum of the column
    minval = col.min()
    maxval = col.max()
    # Build the list of bin edges from the min, the cut points, and the max
    break_points = [minval] + list(cut_points) + [maxval]
    # If no labels are given, use the default labels 0 ... (n-1)
    if labels is None:
        labels = list(range(len(cut_points) + 1))
    # Bin with pandas' cut
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin
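
To make the bin construction concrete, here is a minimal sketch that applies the same steps directly with `pd.cut`. The sample values, the cut points `[30, 50]`, and the labels are assumptions made up for the example.

```python
import pandas as pd

ages = pd.Series([18, 25, 37, 52, 70])  # hypothetical values
cut_points = [30, 50]                   # assumed interior cut points

# Same construction as in binning(): pad the cut points with min and max.
break_points = [ages.min()] + cut_points + [ages.max()]
labels = ['low', 'mid', 'high']         # one label per bin

# include_lowest=True makes the first bin closed on the left,
# so the minimum value itself is assigned to a bin.
age_bin = pd.cut(ages, bins=break_points, labels=labels, include_lowest=True)
print(age_bin.tolist())
```

Note that the cut points must lie strictly between the column minimum and maximum; otherwise the edge list contains duplicates or is not increasing, and `pd.cut` raises an error.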

III. Removing features dominated by a single value

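def DropHighValueOfFeacture(data, feacture_cols, prob_HighValue=0.95):
    '''
    Find features in which the vast majority of rows share one value (HighValue).
    Input:  data: the full dataset, including Index and target
            feacture_cols: names of the categorical feature columns
            prob_HighValue: ratio at or above which a feature is dropped
    Output: newData: DataFrame with the count and ratio of the most
                     frequent value of each feature
            dropFeacture_cols: list of column names to drop
    '''
    # Count of the most frequent value (HighValue) in each feature
    def num_HigeValue(x):
        return x.value_counts().max()
    newData = data[feacture_cols].apply(num_HigeValue, axis=0)
    newData = newData.to_frame('num_HigeValue')

    nExample = data.shape[0]
    # Vectorized ratio; in Python 3, map() returns a one-shot iterator
    # and cannot be assigned to a column directly
    newData['probHighValue'] = (newData['num_HigeValue'] / nExample).round(4)

    # Features whose ratio is at least prob_HighValue
    dropFeacture = newData[newData['probHighValue'] >= prob_HighValue]
    dropFeacture_cols = list(dropFeacture.index)

    return newData, dropFeacture_cols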
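
The same check can be sketched in a few lines with `value_counts(normalize=True)`, which returns shares instead of counts. The toy columns `flag` and `group` are hypothetical; the threshold reuses the default of 0.95 from the function above.

```python
import pandas as pd

# Toy data (hypothetical): 'flag' is almost constant, 'group' is evenly split.
toy = pd.DataFrame({
    'flag': [0] * 49 + [1],
    'group': ['a', 'b'] * 25,
})

# Share of the most frequent value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).max())

threshold = 0.95
near_constant = list(top_share[top_share >= threshold].index)

print(top_share.to_dict())
print(near_constant)
```

One difference to keep in mind: `value_counts` ignores missing values by default, so the shares here are relative to the non-null rows, whereas the function above divides by the total number of rows.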

IV. Handling time features

from datetime import datetime
def turnTimeToDayWeekMonth(listingInfo):
    '''
    Convert dates such as 2014/03/05 into Day, Week, and Month.
    Input:  listingInfo: the time feature (a Series of date strings)
    Output: DataFrame with columns 'day', 'week', 'month'
            (strings, as returned by strftime)
    '''
    # Alternative parser for a single string; see the commented-out line below
    def strToTime(x):
        cday = datetime.strptime(x, '%Y/%m/%d')
        return cday

    def TimeToDayWeekMonth(x):
        day = x.strftime('%j')    # day of the year
        week = x.strftime('%w')   # day of the week (0 = Sunday)
        month = x.strftime('%m')  # month
        return Series([day, week, month])

    #dateOflisting = listingInfo.apply(strToTime)
    dateOflisting = pd.to_datetime(listingInfo)
    DayWeekMonthOfList = dateOflisting.apply(TimeToDayWeekMonth)
    # Rebuilding from .values resets the index to 0 ... (n-1)
    DayWeekMonthOfList = DataFrame(DayWeekMonthOfList.values,
                                   columns=['day', 'week', 'month'])

    return DayWeekMonthOfList
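
As an alternative sketch (not the author's method), the same three parts can be read from the `.dt` accessor, which yields integers rather than strings. The two sample dates are made up, and the explicit `format` argument assumes every input follows the `YYYY/MM/DD` pattern.

```python
import pandas as pd

dates = pd.Series(['2014/03/05', '2014/12/31'])  # hypothetical inputs
parsed = pd.to_datetime(dates, format='%Y/%m/%d')

parts = pd.DataFrame({
    'day': parsed.dt.dayofyear,             # day of the year, like %j
    # dayofweek is Monday=0 ... Sunday=6; shift to Sunday=0 to match %w
    'week': (parsed.dt.dayofweek + 1) % 7,
    'month': parsed.dt.month,               # month number, like %m
})
print(parts)
```

Integer columns are usually easier to feed into a model than the zero-padded strings produced by `strftime`, but either form carries the same information.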
