Red Hat: Mining Valuable Users

oifengo

Published 2019-08-01 14:46:50

Category: Data Mining

Article link: https://blog.csdn.net/weixin_39381833/article/details/98043474

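Data cleaning

Reading the data

# sep: str, default ','. The field delimiter.
# header: row number(s) to use as column names, default 0; a list of rows gives multi-level column names.
# keep_default_na: whether to keep pandas' default NaN markers. If na_values is given and
#   keep_default_na=False, the defaults are overridden; otherwise na_values is appended to them.
# parse_dates: columns to parse as dates.

import numpy as np
import pandas as pd

people=pd.read_csv("./people.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])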
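
The effect of these parameters can be checked on a small in-memory file. The sketch below is only an illustration: the three-row CSV text is made up and merely mimics the column layout of people.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for people.csv
csv_text = (
    "people_id,date,char_1\n"
    "ppl_1,2021-01-05,type 2\n"
    "ppl_2,2021-02-10,NA\n"
    "ppl_3,2021-03-15,type 1\n"
)

sample = pd.read_csv(io.StringIO(csv_text), sep=',', header=0,
                     keep_default_na=True, parse_dates=['date'])
print(sample.dtypes)          # the date column is parsed as a datetime type
print(sample.char_1.isna())   # "NA" is one of the default NaN markers

# With keep_default_na=False and no na_values, "NA" stays a plain string
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw.char_1.tolist())
```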

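Setting the index

# set_index: use people_id as the index
# drop: default True; remove the column(s) used as the new index. With drop=False they are also kept as ordinary columns.
# append: default False; whether to append the column(s) to the existing index
# inplace: default False; modify the DataFrame in place instead of creating a new object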

people.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

act_train=pd.read_csv("./act_train.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])
act_train.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

act_train.head(10)

act_test=pd.read_csv("./act_test.csv",sep=',',header=0,keep_default_na=True,parse_dates=['date'])
act_test.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

act_test.head(10)
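
As a quick check of what drop and append do, here is a minimal sketch on a made-up two-row frame (the values are hypothetical; only the column names follow the dataset).

```python
import pandas as pd

df = pd.DataFrame({'people_id': ['ppl_1', 'ppl_2'],
                   'activity_id': ['act_1', 'act_2'],
                   'value': [1, 2]})

dropped = df.set_index(keys=['people_id'], drop=True)    # people_id leaves the columns
kept = df.set_index(keys=['people_id'], drop=False)      # people_id is index and column
multi = dropped.set_index(keys=['activity_id'], append=True)  # two-level index

print(list(dropped.columns))
print(list(kept.columns))
print(list(multi.index.names))
```

Because inplace is left at its default of False here, each call returns a new DataFrame and df itself is unchanged.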


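Merging the data

Use DataFrame.merge() with people_id as the key to join the activity data with the people data.

# Merge act_train and people on people_id
# Left join; columns with the same name get the suffixes _act and _people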

train_data=act_train.merge(people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
train_data.head(10)

test_data=act_test.merge(people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
test_data.head(10)
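
To see how the left join and the suffixes behave, the following sketch merges two tiny made-up frames indexed by people_id. The rows are hypothetical; ppl_9 is deliberately missing from the people table.

```python
import pandas as pd

people = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'], 'char_38': [10, 20]},
                      index=pd.Index(['ppl_1', 'ppl_2'], name='people_id'))
acts = pd.DataFrame({'date': ['2022-05-01', '2022-05-02', '2022-05-03'],
                     'activity_id': ['act_1', 'act_2', 'act_3']},
                    index=pd.Index(['ppl_1', 'ppl_1', 'ppl_9'], name='people_id'))

# Left join on the index: every activity row is kept, even without a matching person
merged = acts.merge(people, how='left', left_index=True, right_index=True,
                    suffixes=('_act', '_people'))
print(merged)
```

Both inputs have a date column, so the result carries date_act and date_people; the ppl_9 row is kept with NaN in the people columns.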


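Splitting the data

According to the official description, type 1 activities have a different format from the other activity types,
so we split the data by activity type.

Checking the types and their counts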

train_data.activity_category.value_counts()

type 2    904683
type 5    490710
type 3    429408
type 4    207465
type 1    157615
type 6      4253
type 7      3157
Name: activity_category, dtype: int64

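Splitting by type

types=['type %d'%i for i in range(1,8)]
train_datas={}
test_datas={}
for _type in types:
    # DataFrame.dropna(how='all') drops rows (axis=0) or columns (axis=1) that are entirely NaN.
    # Recent pandas versions no longer accept axis=(0,1), so the two calls are chained.
    train_datas[_type]=train_data[train_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
    test_datas[_type]=test_data[test_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')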
    print(train_datas[_type].activity_category.unique())
    print(test_datas[_type].activity_category.unique())
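
The point of dropna(how='all') after filtering is that columns used only by other activity types become entirely NaN and disappear. A minimal sketch on made-up rows follows; it chains one call per axis, since pandas 1.0 and later do not accept a tuple for axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'activity_category': ['type 1', 'type 2', 'type 1'],
                   'char_1_act': ['type 3', np.nan, 'type 5'],
                   'char_10_act': [np.nan, 'type 76', np.nan]})

subset = df[df.activity_category == 'type 1']
# Drop all-NaN rows first, then all-NaN columns
cleaned = subset.dropna(axis=0, how='all').dropna(axis=1, how='all')
print(list(cleaned.columns))
```

In the type 1 subset, char_10_act holds only NaN, so it is removed while char_1_act survives.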


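Inspecting the split data

train_datas['type 1'].head(2)

Within each subset, activity_category takes a single value,
so the column can be dropped. (The split in the previous step is effectively a natural clustering, with activity_category playing the role of the cluster id.)

Dropping the activity_category column

# Drop the activity_category column
# DataFrame.drop removes rows or columns:
#   frame.drop(['a'])              drops the row labelled 'a'
#   frame.drop(['Ohio'], axis=1)   drops the column 'Ohio'
# drop removes rows by default; pass axis=1 for columns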

types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_datas[_type].drop(['activity_category'],axis=1,inplace=True)
    test_datas[_type].drop(['activity_category'],axis=1,inplace=True)

Inspecting the data after dropping activity_category

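Removing unique-valued identifiers

First, move the activity_id column into the index.

# Use activity_id as an index level
# append=True keeps the existing people_id index, producing a two-level row index
# inplace=True modifies the data in place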

types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
    test_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)

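Inspecting the data

Each row now corresponds to a unique index pair, the tuple (people_id, activity_id);
that is, the result contains no duplicate index entries.
The following code verifies this.

Verifying that the index has no duplicates

# Verify that the index values are unique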

types=['type %d'%i for i in range(1,8)]
for _type in types:
    print(train_datas[_type].index.is_unique,end=',')
    print(test_datas[_type].index.is_unique,end=',' )
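
Index.is_unique tests the full (people_id, activity_id) pair, not each level separately. The sketch below uses a made-up three-row index to show the difference.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('ppl_1', 'act_1'), ('ppl_1', 'act_2'), ('ppl_2', 'act_3')],
    names=['people_id', 'activity_id'])
df = pd.DataFrame({'char_38': [1, 2, 3]}, index=idx)

pair_unique = df.index.is_unique                                     # whole tuples
people_unique = df.index.get_level_values('people_id').is_unique    # one level only
print(pair_unique, people_unique)   # True False
```

A person can appear in many rows, so people_id alone repeats; only the combination with activity_id identifies a row.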

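This means people_id and activity_id can be ignored in later training:
as index values they are unique for every sample,
so they carry no information for prediction.
Attributes like these are usually generated artificially;
for example, most ids in a database exist only to serve as indexes.

Data type conversion

Many columns in the current result are strings, some of which contain text.
For example, the date_act column holds values such as

	2022-07-27

so they need further processing.

Checking the data type of each column

# Check the data type of each column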


pd.DataFrame({'train_1':train_datas['type 1'].dtypes,'train_2':train_datas['type 2'].dtypes,
              'train_3':train_datas['type 3'].dtypes,'train_4':train_datas['type 4'].dtypes,
              'train_5':train_datas['type 5'].dtypes,'train_6':train_datas['type 6'].dtypes,
              'train_7':train_datas['type 7'].dtypes,
              'test_1':test_datas['type 1'].dtypes,'test_2':test_datas['type 2'].dtypes,
              'test_3':test_datas['type 3'].dtypes,'test_4':test_datas['type 4'].dtypes,
              'test_5':test_datas['type 5'].dtypes,'test_6':test_datas['type 6'].dtypes,
              'test_7':test_datas['type 7'].dtypes,})

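Here:

type 1 data has no char_10_act column;
type 2~7 data has no char_1_act~char_9_act columns;

the test data has no outcome column.


These columns need to be converted to np.float64 for later use.

Looking at the columns, the following patterns emerge:

group_1 is a string of the form "group xxx"
date_act/date_people are datetime64; each date is converted to the number of days since 1970-01-01 (as a float)
char_1_act~char_10_act and char_1_people~char_9_people are strings of the form "type x"
char_10_people and char_11~char_37 are boolean; they are converted to 0 and 1
outcome and char_38 are integers: outcome is the label (0 or 1), and char_38 is a continuous value

Data cleaning

The vectorized string methods .str.replace() and .str.strip() each return a pandas.Series;
Series.astype() then converts the numeric strings to floats.

# Data cleaning
# The vectorized string methods .str.replace() and .str.strip() each return a pandas.Series;
# Series.astype() then converts the numeric strings to floats.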

str_col_list=['group_1']+['char_%d_act'%i for i in range(1,11)]+['char_%d_people'%i for i in range(1,10)]
bool_col_list=['char_10_people']+['char_%d'%i for i in range(11,38)]
types=['type %d'%i for i in range(1,8)]
for _type in types:
    for data_set in [train_datas,test_datas]:
        data_set[_type].date_act= (data_set[_type].date_act- np.datetime64('1970-01-01'))/ np.timedelta64(1, 'D')
        data_set[_type].date_people= (data_set[_type].date_people- np.datetime64('1970-01-01'))/ np.timedelta64(1,'D') 
        data_set[_type].group_1=data_set[_type].group_1.str.replace("group",'').str.strip().astype(np.float64)
        for col in bool_col_list:
               if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].astype(np.float64)
        for col in str_col_list[1:]:
               if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].str.replace("type",'').str.strip().astype(np.float64) 

        data_set[_type]= data_set[_type].astype(np.float64)
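
The same conversions can be tried on a toy frame. This is a minimal sketch with made-up values; only the column names and the "group x" / "type x" string formats are taken from the text above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'date_act': pd.to_datetime(['1970-01-02', '1970-01-11']),
                    'group_1': ['group 17304', 'group 8688'],
                    'char_1_people': ['type 2', 'type 5'],
                    'char_10_people': [True, False]})

# Dates: days elapsed since 1970-01-01, as floats
toy['date_act'] = (toy['date_act'] - np.datetime64('1970-01-01')) / np.timedelta64(1, 'D')
# Strings: strip the text prefix, then convert the remaining digits
toy['group_1'] = toy['group_1'].str.replace('group', '').str.strip().astype(np.float64)
toy['char_1_people'] = toy['char_1_people'].str.replace('type', '').str.strip().astype(np.float64)
# Booleans: True/False become 1.0/0.0
toy['char_10_people'] = toy['char_10_people'].astype(np.float64)

print(toy)
```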

Checking the data types again

# Check that every column is now float64

types=['type %d'%i for i in range(1,8)]
for _type in types:
    print((train_datas[_type].dtypes==np.float64).all(),end=',')
    print((test_datas[_type].dtypes==np.float64).all(),end=',')

True,True,True,True,True,True,True,True,True,True,True,True,True,True,

Inspecting the resulting data

The Data_Cleaner class

Based on the analysis above,
we write a Data_Cleaner class
whose .load_data() method returns the cleaned data.

import numpy as np
import pandas as pd
import  pickle
import  time
import os
def current_time():
    '''
    Format the current local time as a string.

    :return: the current time as a string
    '''
    return time.strftime('%Y-%m-%d %X', time.localtime())
class Data_Cleaner:
    '''
    Data cleaner.

    It is initialized with the names of three files and exposes a single public method, load_data(), which returns the cleaned data.
    If the cleaned data already exists it is returned directly; otherwise the cleaning steps are run first.
    '''
    def __init__(self,people_file_name,act_train_file_name,act_test_file_name):
        '''

        :param people_file_name: file path of people.csv
        :param act_train_file_name: file path of act_train.csv
        :param act_test_file_name: file path of act_test.csv
        :return:
        '''
        self.p_fname=people_file_name
        self.train_fname=act_train_file_name
        self.test_fname=act_test_file_name
        self.types=['type %d'%i for i in range(1,8)]
        self.fname='output/cleaned_data'
    def load_data(self):
        '''
        Load the cleaned data.

        If the data already exists, load and return it. Otherwise load the csv files, merge the data, split it into type 1 ~ type 7,
        convert the data types, then save and return the data.

        :return: a tuple (self.train_datas, self.test_datas)
        '''
        if(self._is_ready()):
            print("cleaned data is available!\n")
            self._load_data()
        else:
            self._load_csv()
            self._merge_data()
            self._split_data()
            self._typecast_data()
            self._save_data()
        return self.train_datas,self.test_datas

    def _load_csv(self):
        '''
        Load the csv files.

        :return:
        '''
        print("----- Begin run load_csv at %s -------"%current_time())
        self.people=pd.read_csv(self.p_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])
        self.act_train=pd.read_csv(self.train_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])
        self.act_test=pd.read_csv(self.test_fname,sep=',',header=0,keep_default_na=True,parse_dates=['date'])

        self.people.set_index(keys=['people_id'],drop=True,append=False,inplace=True)
        self.act_train.set_index(keys=['people_id'],drop=True,append=False,inplace=True)
        self.act_test.set_index(keys=['people_id'],drop=True,append=False,inplace=True)

        print("----- End run load_csv at %s -------"%current_time())
    def _merge_data(self):
        '''
        Merge the people data with the activity data.

        :return:
        '''
        print("----- Begin run merge_data at %s -------"%current_time())
        self.train_data=self.act_train.merge(self.people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
        self.test_data=self.act_test.merge(self.people,how='left',left_index=True,right_index=True,suffixes=('_act', '_people'))
        print("----- End run merge_data at %s -------"%current_time())
    def _split_data(self):
        '''
        Split the data into type 1 ~ type 7.

        :return:
        '''
        print("----- Begin run split_data at %s -------"%current_time())
        self.train_datas={}
        self.test_datas={}
        for _type in self.types:
            # Split; drop all-NaN rows, then all-NaN columns (recent pandas versions reject axis=(0,1))
            self.train_datas[_type]=self.train_data[self.train_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
            self.test_datas[_type]=self.test_data[self.test_data.activity_category==_type].dropna(axis=0, how='all').dropna(axis=1, how='all')
            # Drop the activity_category column
            self.train_datas[_type].drop(['activity_category'],axis=1,inplace=True)
            self.test_datas[_type].drop(['activity_category'],axis=1,inplace=True)
            # Use the activity_id column as an additional index level
            self.train_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
            self.test_datas[_type].set_index(keys=['activity_id'], drop=True, append=True, inplace=True)
        print("----- End run split_data at %s -------"%current_time())

    def _typecast_data(self):
        '''
        Convert the data types so that every column is a float.

        :return:
        '''
        print("----- Begin run typecast_data at %s -------"%current_time())
        str_col_list=['group_1']+['char_%d_act'%i for i in range(1,11)]+['char_%d_people'%i for i in range(1,10)]
        bool_col_list=['char_10_people']+['char_%d'%i for i in range(11,38)]

        for _type in self.types:
            for data_set in [self.train_datas,self.test_datas]:
                # Date columns: days since 1970-01-01
                data_set[_type].date_act= (data_set[_type].date_act- np.datetime64('1970-01-01'))/ np.timedelta64(1, 'D')
                data_set[_type].date_people= (data_set[_type].date_people- np.datetime64('1970-01-01'))/ np.timedelta64(1,'D')
                # group column
                data_set[_type].group_1=data_set[_type].group_1.str.replace("group",'').str.strip().astype(np.float64)
                # Boolean columns
                for col in bool_col_list:
                    if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].astype(np.float64)
                # Remaining string columns
                for col in str_col_list[1:]:
                    if col in data_set[_type]:data_set[_type][col]=data_set[_type][col].str.replace("type",'').str.strip().astype(np.float64)

                # Cast everything to float64 (inside the inner loop, so both train and test are converted)
                data_set[_type]= data_set[_type].astype(np.float64)
        print("----- End run typecast_data at %s -------"%current_time())
    def _is_ready(self):
        if(os.path.exists(self.fname)):
            return True
        else :
            return False
    def _save_data(self):
        print("----- Begin run save_data at %s -------"%current_time())
        # Create the output directory if it does not exist yet
        os.makedirs(os.path.dirname(self.fname),exist_ok=True)
        with open(self.fname,"wb") as file:
            pickle.dump([self.train_datas,self.test_datas],file=file)
        print("----- End run save_data at %s -------"%current_time())
    def _load_data(self):
        print("----- Begin run _load_data at %s -------"%current_time())
        with open(self.fname,"rb") as file:
            self.train_datas,self.test_datas=pickle.load(file)
        print("----- End run _load_data at %s -------"%current_time())

if __name__=='__main__':
    cleaner=Data_Cleaner("./Data/people.csv",'./Data/act_train.csv','./Data/act_test.csv')
    result=cleaner.load_data()
    for key,item in result[0].items():
        for col in item.columns:
            unique_value=item[col].unique()

            if(len(unique_value)<=100):
                print(col,':len=',len(unique_value),'\t;data=',unique_value)
            else:print(col,':len=',len(unique_value))

        print("\n=======\n")
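
The caching idea behind load_data() (return the pickled result if it exists, otherwise build and save it) can be isolated into a small helper. This is only a sketch of the pattern: load_or_build and expensive_clean are hypothetical names, not part of the class above, and a temporary directory stands in for output/.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the pickled object at `path`; build and save it first if absent."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []
def expensive_clean():
    # Stand-in for the csv loading, merging, splitting and type conversion
    calls.append(1)
    return {'type 1': [1.0, 2.0]}

cache_path = os.path.join(tempfile.mkdtemp(), 'output', 'cleaned_data')
first = load_or_build(cache_path, expensive_clean)    # builds and saves
second = load_or_build(cache_path, expensive_clean)   # loads from disk
print(len(calls), first == second)   # 1 True
```

One consequence of this design is that a stale cache file is never refreshed; deleting the file forces the cleaning steps to run again.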

One-hot encoding

# One-hot encoding
# Inspect the set of values taken by each column


lambda_len=lambda x:len(x.unique())
lambda_data=lambda x:str(x.unique()) if(len(x.unique())<=3) else str(x.unique()[:3])+'...'
train_results={}
test_results={}
types=['type %d'%i for i in range(1,8)]
for _type in types:
    train_results[_type[-1]]=pd.DataFrame({'len':train_datas[_type].apply(lambda_len),
                        'data':train_datas[_type].apply(lambda_data)},
                        index=train_datas[_type].columns) 
    test_results[_type[-1]]=pd.DataFrame({'len':test_datas[_type].apply(lambda_len),
                        'data':test_datas[_type].apply(lambda_data)},
                        index=test_datas[_type].columns) 

train_12=train_results['1'].merge(train_results['2'],how='outer',left_index=True,right_index=True,suffixes=('_ta_1', '_ta_2')) 
train_34=train_results['3'].merge(train_results['4'],how='outer',left_index=True,right_index=True,suffixes=('_ta_3', '_ta_4')) 
train_56=train_results['5'].merge(train_results['6'],how='outer',left_index=True,right_index=True,suffixes=('_ta_5', '_ta_6')) 
train_test_77=train_results['7'].merge(test_results['7'],how='outer',left_index=True,right_index=True,suffixes=('_ta_7', '_tt_7')) 
test_12=test_results['1'].merge(test_results['2'],how='outer',left_index=True,right_index=True,suffixes=('_tt_1', '_tt_2')) 
test_34=test_results['3'].merge(test_results['4'],how='outer',left_index=True,right_index=True,suffixes=('_tt_3', '_tt_4')) 
test_56=test_results['5'].merge(test_results['6'],how='outer',left_index=True,right_index=True,suffixes=('_tt_5', '_tt_6')) 

train_12.merge(train_34,how='outer',left_index=True,right_index=True)\
    .merge(train_56,how='outer',left_index=True,right_index=True)  \
    .merge(train_test_77,how='outer',left_index=True,right_index=True)\
    .merge(test_12,how='outer',left_index=True,right_index=True) \
    .merge(test_34,how='outer',left_index=True,right_index=True) \
    .merge(test_56,how='outer',left_index=True,right_index=True)

ta: train
tt: test
suffix 1: type 1

Reordering the columns

# Reorder the columns and one-hot encode the categorical ones


from scipy.sparse import hstack,csr_matrix
from sklearn.preprocessing  import OneHotEncoder
def onehot_encode(train_datas,test_datas): 

    train_results={}
    test_results={}
    types=['type %d'%i for i in range(1,8)]
    for _type in types:
        if _type=='type 1':
            one_hot_cols=['char_%d_act'%i for i in range(1,10)]+\
            ['char_%d_people'%i for i in range(1,10)]
            train_end_cols=['group_1','date_act','date_people','char_38','outcome']
            test_end_cols=['group_1','date_act','date_people','char_38']
        else:
            one_hot_cols=['char_%d_people'%i for i in range(1,10)]
            train_end_cols=['group_1','char_10_act','date_act','date_people','char_38','outcome']
            test_end_cols=['group_1','char_10_act','date_act','date_people','char_38']
        
        train_front_array=train_datas[_type][one_hot_cols].values # head array: the columns to one-hot encode
        train_end_array=train_datas[_type][train_end_cols].values # tail array
        train_middle_array=train_datas[_type].drop(train_end_cols+one_hot_cols,axis=1,inplace=False).values # middle array
        
        test_front_array=test_datas[_type][one_hot_cols].values # head array: the columns to one-hot encode
        test_end_array=test_datas[_type][test_end_cols].values # tail array
        test_middle_array=test_datas[_type].drop(test_end_cols+one_hot_cols,axis=1,inplace=False).values # middle array

        # OneHotEncoder returns a sparse matrix by default; the categorical_features argument no longer exists in recent scikit-learn versions
        encoder=OneHotEncoder()
        train_result=hstack([encoder.fit_transform(train_front_array),csr_matrix(train_middle_array),csr_matrix(train_end_array)])
        test_result=hstack([encoder.transform(test_front_array),csr_matrix(test_middle_array),csr_matrix(test_end_array)])
        train_results[_type]=train_result
        test_results[_type]=test_result
    return train_results,test_results
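
The core of onehot_encode() is: fit the encoder on the training categories, reuse it on the test set, and stack the encoded block next to the untouched numeric columns. A minimal sketch on made-up arrays, assuming a scikit-learn version where OneHotEncoder() returns a sparse matrix by default:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import OneHotEncoder

# One categorical column (values 1, 2, 3) and one numeric column; values are made up
train_cat = np.array([[1.0], [2.0], [3.0]])
train_num = np.array([[0.5], [1.5], [2.5]])
test_cat = np.array([[2.0], [1.0]])
test_num = np.array([[9.0], [8.0]])

encoder = OneHotEncoder()
# Fit on train only, then apply the same category-to-column mapping to test
train_x = hstack([encoder.fit_transform(train_cat), csr_matrix(train_num)])
test_x = hstack([encoder.transform(test_cat), csr_matrix(test_num)])

print(train_x.shape, test_x.shape)
print(test_x.toarray())
```

The categorical column expands into three indicator columns, so both results have four columns. With the default settings, a test category that never occurred in training raises an error in transform().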

Checking the number of features

# Check the number of features

types=['type %d'%i for i in range(1,8)]

print('before encode:\n')
for _type in types:
    print('train(type=%s):shape='%_type,train_datas[_type].shape)
    print('test(type=%s):shape='%_type,test_datas[_type].shape)
print('==============\n\n')    
train_results,test_results=onehot_encode(train_datas,test_datas)
print('after encode:\n')
for _type in types:
    print('train(type=%s):shape='%_type,train_results[_type].shape)
    print('test(type=%s):shape='%_type,test_results[_type].shape)
print('==============\n\n')


Normalization

# Normalization

from sklearn.preprocessing  import MaxAbsScaler
def scale(train_datas,test_datas): 
    train_results={}
    test_results={}
    types=['type %d'%i for i in range(1,8)]
    
    for _type in types:
        if _type=='type 1':
            train_last_index=5 # the last 5 columns are group_1/date_act/date_people/char_38/outcome
            test_last_index=4 # the last 4 columns are group_1/date_act/date_people/char_38
        else:
            train_last_index=6 # the last 6 columns are group_1/char_10_act/date_act/date_people/char_38/outcome
            test_last_index=5 # the last 5 columns are group_1/char_10_act/date_act/date_people/char_38
        
        scaler=MaxAbsScaler()
        train_array=train_datas[_type].toarray()        
        train_front=train_array[:,:-train_last_index]
        train_mid=scaler.fit_transform(train_array[:,-train_last_index:-1]) # outcome is not scaled
        train_end=train_array[:,-1].reshape((-1,1)) # outcome
        train_results[_type]=np.hstack((train_front,train_mid,train_end))
        
        test_array=test_datas[_type].toarray()
        test_front=test_array[:,:-test_last_index]
        test_end=scaler.transform(test_array[:,-test_last_index:])
        test_results[_type]=np.hstack((test_front,test_end))

    return train_results,test_results
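
MaxAbsScaler divides each column by the largest absolute value seen during fit. The sketch below, on made-up numbers, shows that the test set reuses the training maxima and can therefore fall outside [-1, 1].

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

train = np.array([[2.0, -10.0],
                  [4.0, 5.0]])
test = np.array([[8.0, 5.0]])

scaler = MaxAbsScaler()
train_scaled = scaler.fit_transform(train)  # column maxima of |x| are 4 and 10
test_scaled = scaler.transform(test)        # divided by the same 4 and 10

print(train_scaled)
print(test_scaled)
```

Here the first test value becomes 8 / 4 = 2.0, which is worth keeping in mind when checking the min/max of the scaled test data.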

Checking the normalized result

# Check the result after normalization

ta_results,tt_results=scale(train_results,test_results)
types=['type %d'%i for i in range(1,8)]
for _type in types:
    print("Train(type=%s):"%_type,np.unique(ta_results[_type].max(axis=1)),np.unique(ta_results[_type].min(axis=1)))
    print("Test(type=%s):"%_type,np.unique(tt_results[_type].max(axis=1)),np.unique(tt_results[_type].min(axis=1)))
