Kaggle: Costa Rican Household Poverty Level Prediction (1) DEA

DEA: Data Exploration Analysis

Costa Rican Household Poverty Level Prediction


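  The goal of this competition is to predict household poverty levels from historical data using machine learning. The dataset is small, so it is well suited to working on a single machine.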

Data Explanation

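The competition provides two datasets, train.csv and test.csv:

  • Each row holds the data of one household member (a household can consist of several members)
  • train.csv has 143 columns: Id, Target and 141 features
  • test.csv has 142 columns, without Target
  • The head of household represents the household; only heads of household are scored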

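Key fields:

  • Id : a unique identifier for each sample (row)
  • idhogar : a unique identifier for each household; rows sharing the same value belong to the same household
  • parentesco1 : marks whether the member is the head of household
  • Target : the poverty level of the household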

    • 1 = extreme poverty
    • 2 = moderate poverty
    • 3 = vulnerable households
    • 4 = non vulnerable households

Note: the raw data holds one record per household member, while the final submission is the poverty level of each household.
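
Since labels are defined per household, one simple way to obtain household-level rows is to keep only the head-of-household records. The sketch below runs on made-up toy data; the frame `members` and its values are hypothetical, and only the column names `idhogar`, `parentesco1` and `Target` come from the dataset.

```python
import pandas as pd

# Toy member-level data; the values are illustrative, not taken from train.csv.
members = pd.DataFrame({
    "Id": ["ID_a", "ID_b", "ID_c", "ID_d"],
    "idhogar": ["h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1],
    "Target": [4, 4, 2, 2],
})

# One row per household: the record of the household head.
heads = members.loc[members["parentesco1"] == 1, ["idhogar", "Target"]]
household_target = heads.set_index("idhogar")["Target"]
print(household_target.to_dict())
```

This assumes every household has exactly one head; households without a head would simply be missing from `household_target`.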

That completes the short introduction to the problem. Next, we explore the data.

# Python libraries used for the DEA: pandas, numpy, matplotlib, seaborn, sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# set default params
%matplotlib inline
plt.rcParams['font.size'] = 18
pd.options.display.max_columns = 10

import warnings
warnings.filterwarnings('ignore')
Data overview
train = pd.read_csv('../train.csv')
test = pd.read_csv('../test.csv')
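# on_bad_lines='skip' replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
feature_desp = pd.read_csv('../feature_description.csv', on_bad_lines='skip')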
feature_desp = feature_desp.set_index('F_name')
train.head()
             Id      v2a1  hacdor  rooms  hacapo  v14a  refrig  ...  SQBedjefe  SQBhogar_nin  SQBovercrowding  SQBdependency  SQBmeaned  agesq  Target
0  ID_279628684  190000.0       0      3       0     1       1  ...        100             0         1.000000            0.0      100.0   1849       4
1  ID_f29eb3ddd  135000.0       0      4       0     1       1  ...        144             0         1.000000           64.0      144.0   4489       4
2  ID_68de51c94       NaN       0      8       0     1       1  ...          0             0         0.250000           64.0      121.0   8464       4
3  ID_d671db89c  180000.0       0      5       0     1       1  ...        121             4         1.777778            1.0      121.0    289       4
4  ID_d56d6f5f5  180000.0       0      5       0     1       1  ...        121             4         1.777778            1.0      121.0   1369       4

5 rows × 143 columns

# Feature descriptions
feature_desp.tail()
     F_name           description
136  SQBhogar_nin     hogar_nin squared
137  SQBovercrowding  overcrowding squared
138  SQBdependency    dependency squared
139  SQBmeaned        square of the mean years of education of adul…
140  agesq            Age squared
train.info()
test.info()

Unique-value counts of the [int] features

train.select_dtypes(['int']).nunique().sort_values().plot(kind='barh',figsize=(12, 35))
plt.grid(axis='x', color='r', linestyle='-.', linewidth=1)
# Tick locations and labels
locations = np.array([0,2,20,40,60,80,100])
labels = [0,2,20,40,60,80,100]
plt.xticks(locations, labels)
plt.xlabel('Number of Unique Values')
plt.ylabel('Feature Name')
plt.title('Count of Unique Values in Integer Columns')

[Figure: Count of Unique Values in Integer Columns]

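  We can see that some features have count(unique_value) = 2: they take only the two values 0/1. For example, parentesco1 marks whether the person is the head of household (0: no, 1: yes).

  Next we look at the kernel density estimate (KDE) of the float-typed features, with the curves colored by Target.
Kernel density estimation uses no prior knowledge about the data distribution and imposes no assumptions on it; it studies the distribution starting from the data sample itself.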
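
To make the idea concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE: the estimate is the average of Gaussian bumps centered on the samples. The function name `gaussian_kde_1d`, the sample values and the fixed bandwidth of 0.8 are all assumptions for illustration; seaborn chooses its bandwidth automatically, so its curves will not match this sketch exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

samples = [1.0, 2.0, 2.5, 6.0]            # made-up values
grid = np.linspace(-5.0, 12.0, 1701)
density = gaussian_kde_1d(samples, grid, bandwidth=0.8)

# Trapezoidal rule: a density should integrate to (approximately) 1.
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))
print(round(area, 4))
```

A smaller bandwidth gives a spikier curve that follows individual samples; a larger one smooths them together.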

# KDE of the float features
from collections import OrderedDict

plt.figure(figsize = (28, 12))

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
label_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 
                               4: 'non vulnerable'})

# float columns
for i, col in enumerate(train.select_dtypes(['float'])):
    ax = plt.subplot(4, 3, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Kernel density estimate
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = label_mapping[poverty_level])

    plt.title('%s Kernel Density Estimate'%(col.capitalize()))
    plt.xlabel('%s'%col)
    plt.ylabel('Density')

plt.subplots_adjust(top = 2)

[Figure: kernel density estimates of the float features, one curve per poverty level]

(object)-typed features

# (object)
train.select_dtypes(['object']).head()
 
             Id    idhogar dependency edjefe edjefa
0  ID_279628684  21eb7fcc1         no     10     no
1  ID_f29eb3ddd  0e5d7a658          8     12     no
2  ID_68de51c94  2c7317ea8          8     no     11
3  ID_d671db89c  2b58d945f        yes     11     no
4  ID_d56d6f5f5  2b58d945f        yes     11     no

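Id and idhogar are the identifiers of the household member and of the household, respectively.
Below are the descriptions of the other three features: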

pd.options.display.max_colwidth = 200
pd.options.display.large_repr = 'truncate'

feature_desp.loc[['dependency','edjefe','edjefa']]
 
F_name      description
dependency  Dependency rate calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
edjefe      years of education of male head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0
edjefa      years of education of female head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0

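  We now have a basic picture of the data. Some features describe the household, others describe individual members. A few features mix types: the three above contain both numbers and the strings yes/no, which is why pandas reads them as object. According to the descriptions of edjefe and edjefa, yes=1 and no=0. The description of dependency does not say what yes/no stand for, but since the feature is the ratio of dependents (members younger than 19 or older than 64) to members aged 19 to 64, we tentatively apply yes=1, no=0 there as well.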

train.dependency = train.dependency.replace({"yes": 1, "no": 0}).astype(np.float64)
train.edjefa = train.edjefa.replace({"yes": 1, "no": 0}).astype(np.float64)
train.edjefe = train.edjefe.replace({"yes": 1, "no": 0}).astype(np.float64)
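
The same mapping is needed again for the test set, so it can be convenient to wrap it in a helper. The sketch below is one possible variant, not the code used in this post: the name `fix_yes_no` and the toy frame are hypothetical, and it replaces strings with strings and then parses with `pd.to_numeric` instead of calling `replace(...).astype(...)` directly.

```python
import numpy as np
import pandas as pd

MIXED_COLS = ["dependency", "edjefe", "edjefa"]

def fix_yes_no(df, cols=MIXED_COLS):
    """Return a copy with 'yes'/'no' mapped to 1/0 and the columns cast to float."""
    out = df.copy()
    for col in cols:
        # Replace strings with strings first, then parse everything as numbers.
        as_text = out[col].replace({"yes": "1", "no": "0"})
        out[col] = pd.to_numeric(as_text).astype(np.float64)
    return out

# Toy data that mimics the mixed columns; the values are made up.
toy = pd.DataFrame({
    "dependency": ["no", "8", "yes", "0.5"],
    "edjefe": ["10", "12", "no", "11"],
    "edjefa": ["no", "no", "11", "yes"],
})
fixed = fix_yes_no(toy)
print(fixed.dtypes.to_dict())
```

Because the helper works on a copy, the input frame keeps its original strings, which makes it easy to compare before and after.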

KDE(kernel density estimate)

plt.figure(figsize = (21, 4))
for i, col in enumerate(['dependency','edjefe','edjefa']):
    ax = plt.subplot(1, 3, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # Kernel density estimate
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = label_mapping[poverty_level])

    plt.title('%s KDE'%(col.capitalize()))
    plt.xlabel('%s'%col)
    plt.ylabel('Density')

plt.subplots_adjust()

[Figure: KDE of dependency, edjefe and edjefa, one curve per poverty level]

Distribution of Target

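  Target is the label of each household. Next we look at its distribution. We need to select the rows with parentesco1 == 1: parentesco1 = 1 means the member is the head of household (a household has only one head), so the Target of that member is the Target of the household.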

target_counts = train.loc[train.parentesco1 == 1].Target.value_counts()
target_counts.plot.bar(figsize = (8, 6),rot = 0)
plt.xlabel('Poverty Level')
plt.ylabel('Target Counts')
plt.xticks(np.arange(0, 4, 1), ['non vulnerable','moderate','vulnerable','extreme'])
plt.title('Poverty Level Distribution')
Text(0.5,1,'Poverty Level Distribution')

[Figure: Poverty Level Distribution]

The classes are clearly imbalanced.
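
A quick way to put a number on the imbalance is to look at class shares rather than raw counts. The sketch below uses a made-up label series (the counts are not the real distribution); with the real data one would pass the Target of the household heads instead.

```python
import pandas as pd

# Toy household-level labels; the counts are illustrative only.
household_target = pd.Series([4, 4, 4, 4, 4, 4, 2, 2, 3, 1], name="Target")

# Share of each class, ordered by poverty level.
share = household_target.value_counts(normalize=True).sort_index()

# Ratio between the most and the least frequent class.
imbalance_ratio = share.max() / share.min()
print(share.to_dict())
print(imbalance_ratio)
```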

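Addressing Wrong Target (label correction)

  Normally, if a household's Target is extreme, i.e. the member who is the head has Target extreme, then every member of that household should carry the same Target. However, there are anomalies where members of the same household have different Targets; this may be due to human error or other factors.
  Strictly speaking, this anomaly does not have to be fixed as long as the model is trained on the head's Target only. If we do want to fix it, we first group the data by idhogar and check whether Target is consistent within each group.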

# True for households whose members all share the same Target
norm = train.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique() == 1)
# Problematic households:
# num_unique != 1
unnorm = norm[~norm]

Check out one unnorm example, household ID: 0172ab1d9

train[train['idhogar'] == unnorm.index[0]][['idhogar', 'parentesco1', 'Target']]
 
        idhogar  parentesco1  Target
7651  0172ab1d9            0       3
7652  0172ab1d9            0       2
7653  0172ab1d9            0       3
7654  0172ab1d9            1       3
7655  0172ab1d9            0       2

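The member with parentesco1 == 1 is the head, and the head's Target stands for the whole household. Normally the other members should have the same Target as the head, here Target = 3. So the other members' Targets can be corrected from the head's Target. That works in most cases, but there is one possible exception: a household with no head whose members have inconsistent Targets. How would we decide that household's Target then? Let us first check whether any households have no head.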

#Check head of household
(train.groupby(by = 'idhogar')['parentesco1'].sum() < 1).value_counts()
False    2973
True       15
Name: parentesco1, dtype: int64

15 households have no head. Below is part of the information for these 15 households.

# Group by idhogar and count the heads (sum of parentesco1)
count_head_of_household = train.groupby(by = 'idhogar')['parentesco1'].sum()

# Select the members of households that have no head
households_no_head = train.loc[train.idhogar.isin(count_head_of_household[count_head_of_household == 0].index),:]

households_no_head[['idhogar', 'parentesco1', 'Target']].sort_values('idhogar')
 
        idhogar  parentesco1  Target
7706  03c6bdf85            0       3
7705  03c6bdf85            0       3
4935  09b195e7a            0       3
7086  1367ab31d            0       3
9497  1bc617b23            0       3
5396  374ca5a19            0       3
5391  61c10e099            0       3
7438  6b1b2405f            0       4
7439  6b1b2405f            0       4
7440  6b1b2405f            0       4
4975  896fe6d3e            0       3
8636  a0812ef17            0       3
7756  ad687ad89            0       3
7757  b1f4d89d7            0       3
6444  bfd5067c2            0       3
6443  bfd5067c2            0       3
8431  c0c8a5013            0       3
8432  c0c8a5013            0       3
8433  c0c8a5013            0       3
9489  d363d9183            0       3
7463  f2bfa75c4            0       3
7461  f2bfa75c4            0       3
7462  f2bfa75c4            0       3

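  Above are the 23 members of these 15 households. Although the households have no marked head, the members' Targets are consistent within each household. Below is the number of unique Target values per household for the 15 households without a head.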

households_no_head.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique())
idhogar
03c6bdf85    1
09b195e7a    1
1367ab31d    1
1bc617b23    1
374ca5a19    1
61c10e099    1
6b1b2405f    1
896fe6d3e    1
a0812ef17    1
ad687ad89    1
b1f4d89d7    1
bfd5067c2    1
c0c8a5013    1
d363d9183    1
f2bfa75c4    1
Name: Target, dtype: int64

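  So the worry was unnecessary: in these 15 households without a head, all members carry the same Target. That leaves only the households that do have a head but whose members have several different Targets. The rule is that every member's Target should equal the head's.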

Fix wrong target

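# unnorm.index : idhogar of the inconsistent households
for household_id in unnorm.index:
    # Find the head's Target (.iloc[0] picks the single head row explicitly)
    real_target = int(train.loc[(train.idhogar == household_id) & (train.parentesco1 == 1), 'Target'].iloc[0])
    # Overwrite the Target of every member of the household
    train.loc[train.idhogar == household_id, 'Target'] = real_target
# Verify: no household should have more than one Target left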
(train.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique()) > 1).value_counts()
False    2988
Name: Target, dtype: int64
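
The loop is fine for a dataset of this size. As an alternative, the same rule can be written without a loop by mapping each member to the head's label. The sketch below is a variant on toy data, not the code used above; the frame `toy` is made up, and it assumes at most one head per household so that `head_target` has a unique index.

```python
import pandas as pd

# Toy data: h1 is inconsistent, h2 is consistent, h3 has no head.
toy = pd.DataFrame({
    "idhogar": ["h1", "h1", "h1", "h2", "h2", "h3"],
    "parentesco1": [0, 1, 0, 1, 0, 0],
    "Target": [2, 3, 3, 4, 4, 1],
})

# Head's label per household (households without a head are absent here).
head_target = toy.loc[toy["parentesco1"] == 1].set_index("idhogar")["Target"]

# Map every member to the head's label; keep the original label where there is no head.
toy["Target"] = toy["idhogar"].map(head_target).fillna(toy["Target"]).astype(int)
print(toy["Target"].tolist())
```

Unlike the loop, this touches every household, but consistent households are simply rewritten with the value they already have.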

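Relationship between meaneduc, Target and gender

Below, box plots show the relationship between education and poverty level, split by the gender of the household head.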

# Extract the labels
label_df = train[train['parentesco1'] == 1].copy()

# Create a gender mapping
label_df['gender'] = label_df['male'].replace({1: 'M', 0: 'F'})

plt.figure(figsize = (8, 6))

# Boxplot
sns.boxplot(x = 'Target', y = 'meaneduc', hue = 'gender', data = label_df)
plt.title('Mean Education vs Target by Gender')
plt.grid(axis = 'y', color = 'green', linestyle ='-.')

[Figure: Mean Education vs Target by Gender]

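  Households at level 1 (extreme poverty) have low average adult education (meaneduc), and level 4 (non vulnerable) households clearly have the highest, for both male and female heads: education and poverty are negatively related. For male-headed households the mean rises steadily from level 1 to level 4; for female-headed households levels 1 to 3 lie close together and only level 4 stands out. At every poverty level, households with a female head have a slightly higher mean than households with a male head. Below are the grouped statistics by gender and poverty level.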

label_df.groupby(['gender', 'Target'])['meaneduc'].agg(['mean', 'count'])
                    mean  count
gender Target
F      1        7.332627    118
       2        7.371026    195
       3        7.075253    132
       4       10.575066    717
M      1        5.965545    104
       2        6.704251    247
       3        6.983483    223
       4       10.149090   1234

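A violin plot combines a box plot with the density of the data. Box plots are limited in what they show: their visual simplicity tends to hide important details about how the values are distributed. For example, a box plot cannot show whether a distribution is bimodal or multimodal. In the violin, the thick black bar in the center is the interquartile range, the thin black line extending from it shows the whiskers (with seaborn's default box-style interior, the range of points within 1.5 times the interquartile range of the quartiles), and the white dot in the middle is the median.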

plt.figure(figsize = (8,6))
sns.violinplot(x = 'Target', y = 'meaneduc',
               hue = 'gender', data = label_df);

[Figure: violin plot of meaneduc by Target, split by gender]

Relationship between dependency, edjefe, edjefa and Target

plt.figure(figsize = (21, 4))

# Iterate through the variables
for i, col in enumerate(['dependency', 'edjefa', 'edjefe']):
    ax = plt.subplot(1, 3, i+ 1)
    # Violinplot colored by `Target`
    sns.violinplot(x = 'Target', y = col, ax = ax, data = label_df, hue = 'gender')
    plt.title('%s by Target'%col.capitalize())
    plt.grid(axis = 'y', color='green', linestyle='-.')

[Figure: violin plots of dependency, edjefa and edjefe by Target, split by gender]

Define Variable Categories

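Before aggregating the data, we need to categorize the feature variables (see the feature descriptions in feature_desp):

  • 1. Features describing an individual
    • boolean
    • continuous
  • 2. Features describing the household
    • boolean
    • continuous
  • 3. id & target

Descriptions of the float features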

# .ix was removed in pandas 1.0; .loc does the same label-based lookup
feature_desp.loc[list(train.select_dtypes(['float']).columns)]

F_name           description
v2a1             Monthly rent payment
v18q1            number of tablets household owns
rez_esc          Years behind in school
meaneduc         average years of education for adults (18+)
overcrowding     # persons per room
SQBovercrowding  overcrowding squared
SQBdependency    dependency squared
SQBmeaned        square of the mean years of education of adults (>=18) in the household
# These features within a single household ('001ff74ca')
indexs = list(train.select_dtypes(['float']).columns)
train.loc[train.idhogar == norm.index[5],indexs]
 
          v2a1  v18q1  rez_esc  meaneduc  overcrowding  SQBovercrowding  SQBdependency  SQBmeaned
3775  180000.0    2.0      NaN      15.5           2.0              4.0            1.0     240.25
3776  180000.0    2.0      NaN      15.5           2.0              4.0            1.0     240.25
3777  180000.0    2.0      0.0      15.5           2.0              4.0            1.0     240.25
3778  180000.0    2.0      0.0      15.5           2.0              4.0            1.0     240.25
# int features: there are many int-typed features; they fall into two groups, boolean (only 0/1) and non-boolean.
# Names of the int columns
int_cols = list(train.select_dtypes(['int']).columns)

# int columns that are boolean (exactly two unique values)
bool_cols = list(train[int_cols].nunique()[train[int_cols].nunique() == 2].index)

# int columns that are not boolean
non_bool_cols = list(train[int_cols].nunique()[train[int_cols].nunique() !=2].index)

# elimbasu5 is a special case: it is boolean, but it never takes the value 1 in the training data (all 0)
non_bool_cols.remove('elimbasu5')
bool_cols.append('elimbasu5')

Descriptions of the non_bool_cols features

# .ix was removed in pandas 1.0; reindex is used instead of .loc because
# Target has no entry in feature_desp and should show up as NaN
feature_desp.reindex(non_bool_cols)
F_name          description
rooms           number of all rooms in the house
r4h1            Males younger than 12 years of age
r4h2            Males 12 years of age and older
r4h3            Total males in the household
r4m1            Females younger than 12 years of age
r4m2            Females 12 years of age and older
r4m3            Total females in the household
r4t1            persons younger than 12 years of age
r4t2            persons 12 years of age and older
r4t3            Total persons in the household
tamhog          size of the household
tamviv          number of persons living in the household
escolari        years of schooling
hhsize          household size
hogar_nin       Number of children 0 to 19 in household
hogar_adul      Number of adults in household
hogar_mayor     # of individuals 65+ in the household
hogar_total     # of total individuals in the household
bedrooms        number of bedrooms
qmobilephone    # of mobile phones
age             Age in years
SQBescolari     escolari squared
SQBage          age squared
SQBhogar_total  hogar_total squared
SQBedjefe       edjefe squared
SQBhogar_nin    hogar_nin squared
agesq           Age squared
Target          NaN
# The same columns for a single household
pd.options.display.max_columns = 15
train.loc[train.idhogar == norm.index[5], non_bool_cols]
      rooms  r4h1  r4h2  r4h3  r4m1  r4m2  r4m3  r4t1  r4t2  r4t3  tamhog  tamviv  escolari  hhsize  hogar_nin
3775      4     1     1     2     1     1     2     2     2     4       4       4        15       4          2
3776      4     1     1     2     1     1     2     2     2     4       4       4        16       4          2
3777      4     1     1     2     1     1     2     2     2     4       4       4         2       4          2
3778      4     1     1     2     1     1     2     2     2     4       4       4         4       4          2

      hogar_adul  hogar_mayor  hogar_total  bedrooms  qmobilephone  age  SQBescolari  SQBage  SQBhogar_total
3775           2            0            4         2             2   39          225    1521              16
3776           2            0            4         2             2   36          256    1296              16
3777           2            0            4         2             2    8            4      64              16
3778           2            0            4         2             2   11           16     121              16

      SQBedjefe  SQBhogar_nin  agesq  Target
3775        225             4   1521       4
3776        225             4   1296       4
3777        225             4     64       4
3778        225             4    121       4

Individual features: escolari, age, SQBescolari, SQBage, agesq; the rest are household features.

Boolean features stored as int

feature_desp.loc[bool_cols]
F_name           description
hacdor           =1 Overcrowding by bedrooms
hacapo           =1 Overcrowding by rooms
v14a             =1 has bathroom in the household
refrig           =1 if the household has refrigerator
v18q             owns a tablet
paredblolad      =1 if predominant material on the outside wall is block or brick
paredzocalo      =1 if predominant material on the outside wall is socket wood zinc or absbesto
paredpreb        =1 if predominant material on the outside wall is prefabricated or cement
pareddes         =1 if predominant material on the outside wall is waste material
paredmad         =1 if predominant material on the outside wall is wood
paredzinc        =1 if predominant material on the outside wall is zink
paredfibras      =1 if predominant material on the outside wall is natural fibers
paredother       =1 if predominant material on the outside wall is other
pisomoscer       =1 if predominant material on the floor is mosaic ceramic terrazo
pisocemento      =1 if predominant material on the floor is cement
pisoother        =1 if predominant material on the floor is other
pisonatur        =1 if predominant material on the floor is natural material
pisonotiene      =1 if no floor at the household
pisomadera       =1 if predominant material on the floor is wood
techozinc        =1 if predominant material on the roof is metal foil or zink
techoentrepiso   =1 if predominant material on the roof is fiber cement mezzanine
techocane        =1 if predominant material on the roof is natural fibers
techootro        =1 if predominant material on the roof is other
cielorazo        =1 if the house has ceiling
abastaguadentro  =1 if water provision inside the dwelling
abastaguafuera   =1 if water provision outside the dwelling
abastaguano      =1 if no water provision
public           =1 electricity from CNFL ICE ESPH/JASEC
planpri          =1 electricity from private plant
noelec           =1 no electricity in the dwelling
coopele          =1 electricity from cooperative
sanitario1       =1 no toilet in the dwelling
sanitario2       =1 toilet connected to sewer or cesspool
sanitario3       =1 toilet connected to septic tank
sanitario5       =1 toilet connected to black hole or letrine
sanitario6       =1 toilet connected to other system
energcocinar1    =1 no main source of energy used for cooking (no kitchen)
energcocinar2    =1 main source of energy used for cooking electricity
energcocinar3    =1 main source of energy used for cooking gas
energcocinar4    =1 main source of energy used for cooking wood charcoal
elimbasu1        =1 if rubbish disposal mainly by tanker truck
elimbasu2        =1 if rubbish disposal mainly by botan hollow or buried
elimbasu3        =1 if rubbish disposal mainly by burning
elimbasu4        =1 if rubbish disposal mainly by throwing in an unoccupied space
elimbasu6        =1 if rubbish disposal mainly other
epared1          =1 if walls are bad
epared2          =1 if walls are regular
epared3          =1 if walls are good
etecho1          =1 if roof are bad
etecho2          =1 if roof are regular
etecho3          =1 if roof are good
eviv1            =1 if floor are bad
eviv2            =1 if floor are regular
eviv3            =1 if floor are good
dis              =1 if disable person
male             =1 if male
female           =1 if female
estadocivil1     =1 if less than 10 years old
estadocivil2     =1 if free or coupled uunion
estadocivil3     =1 if married
estadocivil4     =1 if divorced
estadocivil5     =1 if separated
estadocivil6     =1 if widow/er
estadocivil7     =1 if single
parentesco1      =1 if household head
parentesco2      =1 if spouse/partner
parentesco3      =1 if son/doughter
parentesco4      =1 if stepson/doughter
parentesco5      =1 if son/doughter in law
parentesco6      =1 if grandson/doughter
parentesco7      =1 if mother/father
parentesco8      =1 if father/mother in law
parentesco9      =1 if brother/sister
parentesco10     =1 if brother/sister in law
parentesco11     =1 if other family member
parentesco12     =1 if other non family member
instlevel1       =1 no level of education
instlevel2       =1 incomplete primary
instlevel3       =1 complete primary
instlevel4       =1 incomplete academic secondary level
instlevel5       =1 complete academic secondary level
instlevel6       =1 incomplete technical secondary level
instlevel7       =1 complete technical secondary level
instlevel8       =1 undergraduate and higher education
instlevel9       =1 postgraduate higher education
tipovivi1        =1 own and fully paid house
tipovivi2        =1 own paying in installments
tipovivi3        =1 rented
tipovivi4        =1 precarious
tipovivi5        =1 other(assigned borrowed)
computer         =1 if the household has notebook or desktop computer
television       =1 if the household has TV
mobilephone      =1 if mobile phone
lugar1           =1 region Central
lugar2           =1 region Chorotega
lugar3           =1 region Pacífico central
lugar4           =1 region Brunca
lugar5           =1 region Huetar Atlántica
lugar6           =1 region Huetar Norte
area1            =1 zona urbana
area2            =2 zona rural
elimbasu5        =1 if rubbish disposal mainly by throwing in river creek or sea
# (object)-typed features
feature_desp.loc[['dependency','edjefe','edjefa']]
F_name      description
dependency  Dependency rate calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
edjefe      years of education of male head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0
edjefa      years of education of female head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0

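Feature categories

  • ind_bool : boolean individual (household member) features
  • ind_non_bool : non-boolean individual features
  • hh_bool : boolean household features
  • hh_non_bool : non-boolean household features
  • ids : Id, idhogar, Target
  • hh_cont : household statistics read as object (dependency rate, years of education of the household head)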
ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone']

ind_non_bool = ['rez_esc', 'escolari', 'age','SQBescolari','SQBage','agesq']
hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2']

hh_non_bool = ['v2a1', 'v18q1', 'meaneduc', 'SQBovercrowding', 'SQBdependency',
               'SQBmeaned', 'overcrowding', 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1',
               'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'tamhog', 'tamviv', 'hhsize',
               'hogar_nin', 'hogar_adul', 'hogar_mayor', 'hogar_total',  'bedrooms',
               'qmobilephone', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin']

hh_cont = [ 'dependency', 'edjefe', 'edjefa']


ids = ['Id', 'idhogar', 'Target']
# Sanity check: the six lists as written above add up to the 143 columns of train
len(ind_bool)+len(ind_non_bool)+len(hh_bool)+len(hh_non_bool)+len(hh_cont)+len(ids)
143
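
Summing the list lengths only checks the total; a misspelled or duplicated name can still slip through. A stricter check compares the names themselves. The helper below is a hypothetical sketch (`check_partition` is not part of the original notebook), shown on a toy column list; in the notebook one would pass `train.columns` and the six lists.

```python
from collections import Counter

def check_partition(all_columns, *groups):
    """Return (missing, duplicated, unknown) column names for the given groups."""
    counts = Counter(name for group in groups for name in group)
    missing = sorted(set(all_columns) - set(counts))       # columns in no group
    duplicated = sorted(n for n, c in counts.items() if c > 1)
    unknown = sorted(set(counts) - set(all_columns))       # names not in the data
    return missing, duplicated, unknown

# Toy example with one misspelling ('mael') and one duplicate ('rooms').
columns = ["Id", "idhogar", "Target", "age", "rooms", "male"]
result = check_partition(
    columns,
    ["Id", "idhogar", "Target"],
    ["age"],
    ["rooms", "rooms", "mael"],
)
print(result)
```

Three empty lists mean that every column is assigned to exactly one category.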

Fix test set dependency, edjefe, edjefa

Apply the same treatment to dependency, edjefe and edjefa in the test data.

test.dependency = test.dependency.replace({"yes": 1, "no": 0}).astype(np.float64)
test.edjefa = test.edjefa.replace({"yes": 1, "no": 0}).astype(np.float64)
test.edjefe = test.edjefe.replace({"yes": 1, "no": 0}).astype(np.float64)
train.to_csv('../fix_train.csv',index = False)
test.to_csv('../fix_test.csv',index = False)

That concludes the DEA. Next up: feature preprocessing and finding a baseline.
