kaggle:Costa Rican Household Poverty Level Prediction(1)DEA

今晚打佬虎

Category: kaggle. Tags: kaggle

Article link: https://blog.csdn.net/u014281392/article/details/81512614

DEA: Data Exploration Analysis

Costa Rican Household Poverty Level Prediction

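The goal of this competition is to predict a household's poverty level from historical data using machine learning. The data set is small, so it can comfortably be handled on a single machine.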

Data Explanation

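The competition provides two data sets, train.csv and test.csv:

Each row holds the data of one household member (a household can have several members).
train.csv has 143 columns: Id, Target and 141 features.
test.csv has 142 columns; it has no Target.
The head of household represents the household, and only heads of household are scored.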

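Key fields:

Id : unique identifier of each sample (one household member)
idhogar : unique identifier of each household; rows with the same value belong to the same household
parentesco1 : flags whether the member is the head of household
Target : poverty level of the household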
- 1 = extreme poverty
- 2 = moderate poverty
- 3 = vulnerable households
- 4 = non vulnerable households

Note: the raw data describes individual household members, while the submitted result is the poverty level of each household.
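To make the member-level versus household-level distinction concrete, here is a minimal sketch on a toy table. The column names match the competition data, but the rows are made up for illustration; it simply keeps the head of household as the one row that represents each household.

```python
import pandas as pd

# Toy member-level table (hypothetical values), using the real column names.
members = pd.DataFrame({
    "Id": ["ID_a", "ID_b", "ID_c", "ID_d", "ID_e"],
    "idhogar": ["h1", "h1", "h2", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "Target": [4, 4, 2, 2, 2],
})

# One row per household: keep only the head of household.
heads = members.loc[members["parentesco1"] == 1, ["idhogar", "Target"]]
households = heads.set_index("idhogar").copy()

# Household size, computed from the member-level rows and aligned on idhogar.
households["n_members"] = members.groupby("idhogar").size()
print(households)
```

Any other household-level summary (mean age, number of children, and so on) can be attached in the same way, because the group-by result is indexed by idhogar.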

That completes the brief introduction to the problem. Next comes the exploratory analysis of the data.

Python libraries used in this DEA:
- pandas
- numpy
- matplotlib
- seaborn
- sklearn

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# set default params
%matplotlib inline
plt.rcParams['font.size'] = 18
pd.options.display.max_columns = 10

import warnings
warnings.filterwarnings('ignore')

Data Overview

train = pd.read_csv('../train.csv')
test = pd.read_csv('../test.csv')
# on_bad_lines='skip' (pandas >= 1.3) replaces the removed error_bad_lines=False
feature_desp = pd.read_csv('../feature_description.csv', on_bad_lines='skip')
feature_desp = feature_desp.set_index('F_name')
train.head()

	Id	v2a1	rooms	v14a	refrig	…	SQBedjefe	SQBhogar_nin	SQBovercrowding	SQBdependency	SQBmeaned	agesq	Target
0	ID_279628684	190000.0	3	1	1	…	100	0	1.000000	0.0	100.0	1849	4
1	ID_f29eb3ddd	135000.0	4	1	1	…	144	0	1.000000	64.0	144.0	4489	4
2	ID_68de51c94	NaN	8	1	1	…	0	0	0.250000	64.0	121.0	8464	4
3	ID_d671db89c	180000.0	5	1	1	…	121	4	1.777778	1.0	121.0	289	4
4	ID_d56d6f5f5	180000.0	5	1	1	…	121	4	1.777778	1.0	121.0	1369	4

5 rows × 143 columns

# feature descriptions
feature_desp.tail()

	F_name	description
136	SQBhogar_nin	hogar_nin squared
137	SQBovercrowding	overcrowding squared
138	SQBdependency	dependency squared
139	SQBmeaned	square of the mean years of education of adul…
140	agesq	Age squared

train.info()

test.info()

Unique value counts of the [int] features

train.select_dtypes(['int']).nunique().sort_values().plot(kind='barh',figsize=(12, 35))
plt.grid(axis='x', color='r', linestyle='-.', linewidth=1)
# tick locations & labels
locations = np.array([0,2,20,40,60,80,100])
labels = [0,2,20,40,60,80,100]
plt.xticks(locations, labels)
plt.xlabel('Number of Unique Values')
plt.ylabel('Feature Name')
plt.title('Count of Unique Values in Integer Columns')

[Figure: Count of Unique Values in Integer Columns]

Some features have exactly two unique values, i.e. they are 0/1 flags. For example, parentesco1 marks whether a person is the head of household (0: no, 1: yes).

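Next, look at the kernel density estimates (KDE) of the float features, with the color of each curve mapped from Target.
Kernel density estimation uses no prior knowledge about the distribution of the data and places no assumptions on it; it studies the shape of the distribution starting from the data sample itself.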
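As a rough illustration of what a kernel density estimate computes, the sketch below evaluates a Gaussian KDE by hand with NumPy. The sample values and the fixed bandwidth are assumptions chosen for the example; seaborn picks the bandwidth automatically, so its curves will not match this sketch exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    f(x) = 1 / (n * h) * sum_i K((x - x_i) / h), with K the standard normal pdf.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise scaled distances, shape (len(grid), len(samples)).
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical sample, e.g. mean years of education for a handful of households.
samples = [2.0, 5.0, 6.0, 6.5, 11.0, 12.0]
grid = np.linspace(-10.0, 25.0, 701)
density = gaussian_kde_1d(samples, grid, bandwidth=1.5)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Each observation contributes one small bump; the estimate is the average of those bumps, which is why no distributional assumption is needed.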

# KDE of the float features

from collections import OrderedDict

plt.figure(figsize = (28, 12))

# Color mapping
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'green'})
label_mapping = OrderedDict({1: 'extreme', 2: 'moderate', 3: 'vulnerable', 
                               4: 'non vulnerable'})

# float columns
for i, col in enumerate(train.select_dtypes(['float'])):
    ax = plt.subplot(4, 3, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # kernel density estimate
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = label_mapping[poverty_level])

    plt.title('%s Kernel Density Estimate'%(col.capitalize()))
    plt.xlabel('%s'%col)
    plt.ylabel('Density')

plt.subplots_adjust(top = 2)

[Figure: kernel density estimates of the float features, colored by Target]

(object) features

# (object)
train.select_dtypes(['object']).head()

	Id	idhogar	dependency	edjefe	edjefa
0	ID_279628684	21eb7fcc1	no	10	no
1	ID_f29eb3ddd	0e5d7a658	8	12	no
2	ID_68de51c94	2c7317ea8	8	no	11
3	ID_d671db89c	2b58d945f	yes	11	no
4	ID_d56d6f5f5	2b58d945f	yes	11	no

Id and idhogar are the identifiers of the household member and of the household.
The descriptions of the other three features are:

pd.options.display.max_colwidth = 200
pd.options.display.large_repr = 'truncate'

feature_desp.loc[['dependency','edjefe','edjefa']]

	description
F_name
dependency	Dependency rate calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
edjefe	years of education of male head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0
edjefa	years of education of female head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0

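That gives a first impression of the data. Some features describe the household, others describe individual members. A few columns mix representations: the three features above contain both numeric values and the strings yes/no. According to the descriptions, yes=1 and no=0 for edjefe and edjefa. The description of dependency does not say what yes/no mean; the feature itself is the ratio of household members younger than 19 or older than 64 to members aged 19 to 64. In the rows shown above, dependency = yes coincides with SQBdependency = 1.0 and dependency = no with SQBdependency = 0.0, so yes=1 and no=0 is used for dependency as well.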

train.dependency = train.dependency.replace({"yes": 1, "no": 0}).astype(np.float64)
train.edjefa = train.edjefa.replace({"yes": 1, "no": 0}).astype(np.float64)
train.edjefe = train.edjefe.replace({"yes": 1, "no": 0}).astype(np.float64)
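The yes=1 / no=0 reading of dependency can be sanity-checked against SQBdependency (dependency squared). The sketch below does this on the five rows displayed earlier; running the same comparison on the full training frame is left as an assumption about how one might extend it.

```python
import numpy as np
import pandas as pd

# The five rows shown in the tables above (dependency and SQBdependency).
sample = pd.DataFrame({
    "dependency": ["no", "8", "8", "yes", "yes"],
    "SQBdependency": [0.0, 64.0, 64.0, 1.0, 1.0],
})

# yes -> 1, no -> 0 (as strings here, so the column stays homogeneous
# before the cast), then convert to float.
dep = sample["dependency"].replace({"yes": "1", "no": "0"}).astype(np.float64)

# If the mapping is right, dependency squared should reproduce SQBdependency.
consistent = np.isclose(dep ** 2, sample["SQBdependency"])
print(consistent.all())
```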

KDE(kernel density estimate)

plt.figure(figsize = (21, 4))
for i, col in enumerate(['dependency','edjefe','edjefa']):
    ax = plt.subplot(1, 3, i + 1)
    # Iterate through the poverty levels
    for poverty_level, color in colors.items():
        # kernel density estimate
        sns.kdeplot(train.loc[train['Target'] == poverty_level, col].dropna(), 
                    ax = ax, color = color, label = label_mapping[poverty_level])

    plt.title('%s KDE'%(col.capitalize()))
    plt.xlabel('%s'%col)
    plt.ylabel('Density')

plt.subplots_adjust()

[Figure: KDE of dependency, edjefe and edjefa, colored by Target]

Distribution of Target

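Target is the label of each household. To look at its distribution, select the rows with parentesco1 == 1. parentesco1 = 1 means the member is the head of household (a household is expected to have exactly one head), so the head's Target stands for the Target of the household.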

# sort by poverty level (1..4) so the tick labels below line up with the bars
target_counts = train.loc[train.parentesco1 == 1].Target.value_counts().sort_index()
target_counts.plot.bar(figsize = (8, 6),rot = 0)
plt.xlabel('Poverty Level')
plt.ylabel('Target Counts')
plt.xticks(np.arange(0, 4, 1), ['extreme','moderate','vulnerable','non vulnerable'])
plt.title('Poverty Level Distribution')

Text(0.5,1,'Poverty Level Distribution')

[Figure: Poverty Level Distribution]

The classes are clearly imbalanced.
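One way to quantify the imbalance is to look at class shares and at the inverse-frequency ("balanced") weights they imply. The sketch below uses made-up labels; the weighting formula is one common convention, not something the competition prescribes.

```python
import pandas as pd

# Hypothetical household-level labels (one per head of household).
targets = pd.Series([4, 4, 4, 4, 4, 4, 2, 2, 3, 1], name="Target")

counts = targets.value_counts().sort_index()
share = counts / counts.sum()

# "Balanced" weights: n_samples / (n_classes * count), so rare classes weigh more.
weights = counts.sum() / (len(counts) * counts)

summary = pd.DataFrame({"count": counts, "share": share, "weight": weights})
print(summary)
```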

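Addressing Wrong Target (label correction)

Normally, if a household's Target is extreme, i.e. the member who is head of household has Target extreme, then every member of that household should carry the same Target. There are some anomalies, however: members of the same household with different Targets, possibly caused by data entry mistakes or other factors.
Strictly speaking the anomaly does not have to be fixed, as long as the model is trained on the head of household's Target. To fix it anyway, group the data by idhogar and check whether Target is consistent within each group.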

# True for households whose members all share one Target
norm = train.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique() == 1)
# problematic households: more than one distinct Target
unnorm = norm[~norm]

Check one unnorm example, household id: 0172ab1d9

train[train['idhogar'] == unnorm.index[0]][['idhogar', 'parentesco1', 'Target']]

	idhogar	parentesco1	Target
7651	0172ab1d9	0	3
7652	0172ab1d9	0	2
7653	0172ab1d9	0	3
7654	0172ab1d9	1	3
7655	0172ab1d9	0	2

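The member with parentesco1 == 1 is the head of household, and the head's Target stands for the whole household. The other members should therefore have the same Target as the head, here Target = 3, so their Target can be corrected from the head's. That covers the usual case. The worrying case is a household that has no head and whose members disagree on Target: there would be no obvious way to decide the household's Target. So first check whether any households lack a head.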

#Check head of household

(train.groupby(by = 'idhogar')['parentesco1'].sum() < 1).value_counts()

False    2973
True       15
Name: parentesco1, dtype: int64

15 households have no head of household. Some information on these 15 households:

# groupby idhogar , count parentesco1
count_head_of_household = train.groupby(by = 'idhogar')['parentesco1'].sum()

# select the members of households that have no head of household
households_no_head = train.loc[train.idhogar.isin(count_head_of_household[count_head_of_household == 0].index),:]

households_no_head[['idhogar', 'Target']].sort_values('idhogar')

	idhogar	Target
7706	03c6bdf85	3
7705	03c6bdf85	3
4935	09b195e7a	3
7086	1367ab31d	3
9497	1bc617b23	3
5396	374ca5a19	3
5391	61c10e099	3
7438	6b1b2405f	4
7439	6b1b2405f	4
7440	6b1b2405f	4
4975	896fe6d3e	3
8636	a0812ef17	3
7756	ad687ad89	3
7757	b1f4d89d7	3
6444	bfd5067c2	3
6443	bfd5067c2	3
8431	c0c8a5013	3
8432	c0c8a5013	3
8433	c0c8a5013	3
9489	d363d9183	3
7463	f2bfa75c4	3
7461	f2bfa75c4	3
7462	f2bfa75c4	3

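Above are the 23 members of these 15 households. Although no head is marked for these households, the members' Targets agree within each household. The number of unique Target values per household confirms this: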

households_no_head.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique())

idhogar
03c6bdf85    1
09b195e7a    1
1367ab31d    1
1bc617b23    1
374ca5a19    1
61c10e099    1
6b1b2405f    1
896fe6d3e    1
a0812ef17    1
ad687ad89    1
b1f4d89d7    1
bfd5067c2    1
c0c8a5013    1
d363d9183    1
f2bfa75c4    1
Name: Target, dtype: int64

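The concern was unfounded: in the 15 households without a head, all members carry the same Target. So only households that do have a head and whose members carry different Targets need correcting. The rule is that every member's Target should equal the head's.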

Fix wrong target

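# unnorm.index : idhogar of the inconsistent households
for household_id in unnorm.index:
    # Target of the head of household (.iloc[0] extracts the scalar;
    # int() on a one-element Series is deprecated in recent pandas)
    real_target = int(train.loc[(train.idhogar == household_id) & (train.parentesco1 == 1), 'Target'].iloc[0])
    # overwrite the Target of every member of this household
    train.loc[train.idhogar == household_id, 'Target'] = real_target
# verify: no household should have more than one distinct Target
(train.groupby(by = 'idhogar')['Target'].apply(lambda x:x.nunique()) > 1).value_counts()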

False    2988
Name: Target, dtype: int64
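The same correction can also be written without an explicit loop. The sketch below is one possible vectorized variant on toy data modelled on household 0172ab1d9; it assumes at most one head per household (a duplicated head would make the lookup index non-unique and the map call fail).

```python
import pandas as pd

# Toy data mirroring household 0172ab1d9 above, plus one consistent household.
df = pd.DataFrame({
    "idhogar": ["0172ab1d9"] * 5 + ["h2"] * 2,
    "parentesco1": [0, 0, 0, 1, 0, 1, 0],
    "Target": [3, 2, 3, 3, 2, 4, 4],
})

# Target of each household's head, indexed by household id.
head_target = df.loc[df["parentesco1"] == 1].set_index("idhogar")["Target"]

# Look up the head's Target for every member; households without a head
# come back as NaN and keep their original Target.
df["Target"] = df["idhogar"].map(head_target).fillna(df["Target"]).astype(int)

print(df.groupby("idhogar")["Target"].nunique())
```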

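Relationship between meaneduc, Target and gender

The box plot below shows how mean years of education relate to the poverty level, split by the gender of the head of household.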

# Extract the labels
label_df = train[train['parentesco1'] == 1].copy()

# Create a gender mapping
label_df['gender'] = label_df['male'].replace({1: 'M', 0: 'F'})

plt.figure(figsize = (8, 6))

# Boxplot
sns.boxplot(x = 'Target', y = 'meaneduc', hue = 'gender', data = label_df)
plt.title('Mean Education vs Target by Gender')
plt.grid(axis = 'y', color = 'green', linestyle ='-.')

[Figure: Mean Education vs Target by Gender]

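Households at level 1 (extreme poverty) have the lowest mean education when the head is male; with a female head, levels 1 to 3 are all close to 7 years. Level 4 households clearly have the highest mean education for both genders, so education and poverty are negatively associated. At every poverty level, households headed by women have a slightly higher mean education than those headed by men. The grouped statistics by gender and poverty level are: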

label_df.groupby(['gender', 'Target'])['meaneduc'].agg(['mean', 'count'])

		mean	count
gender	Target
F	1	7.332627	118
	2	7.371026	195
	3	7.075253	132
	4	10.575066	717
M	1	5.965545	104
	2	6.704251	247
	3	6.983483	223
	4	10.149090	1234

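A violin plot combines a box plot with a density estimate of the data. Box plots are limited in what they can show: their visual simplicity tends to hide important details about how the values are distributed. With a box plot, for example, you cannot tell whether a distribution is bimodal or multimodal. In the violin plot, the thick black bar in the centre is the interquartile range, the thin black line extending from it is the whisker range (in seaborn's default inner box this reaches up to 1.5 times the interquartile range; it is not a confidence interval), and the white dot in the middle is the median.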
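To see why the density outline matters, the sketch below builds two synthetic samples whose quartiles are nearly identical but whose shapes differ. The data and the bin edges are assumptions for illustration only; a coarse histogram stands in for the outline a violin plot would draw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical samples with nearly the same quartiles but different shape.
unimodal = rng.normal(loc=8.0, scale=3.0, size=2000)
bimodal = np.concatenate([
    rng.normal(loc=6.0, scale=1.0, size=1000),
    rng.normal(loc=10.0, scale=1.0, size=1000),
])

edges = np.arange(-1, 18, 2)  # bins [-1,1), [1,3), ..., [15,17]
summary = {}
for name, values in [("unimodal", unimodal), ("bimodal", bimodal)]:
    # What a box plot shows: the three quartiles.
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    # What a violin plot adds: the shape of the distribution.
    counts, _ = np.histogram(values, bins=edges)
    summary[name] = {"quartiles": (q1, median, q3), "hist": counts}
    print(f"{name}: Q1={q1:.1f} median={median:.1f} Q3={q3:.1f} hist={counts.tolist()}")
```

The box summaries of the two samples are almost interchangeable, while the histogram of the second sample dips in the middle bin, which is exactly the detail a box plot cannot show.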

plt.figure(figsize = (8,6))
sns.violinplot(x = 'Target', y = 'meaneduc',
               hue = 'gender', data = label_df);

[Figure: violin plot of meaneduc by Target and gender]

Relationship between dependency, edjefe, edjefa and Target

plt.figure(figsize = (21, 4))

# Iterate through the variables
for i, col in enumerate(['dependency', 'edjefa', 'edjefe']):
    ax = plt.subplot(1, 3, i+ 1)
    # Violinplot colored by `Target`
    sns.violinplot(x = 'Target', y = col, ax = ax, data = label_df, hue = 'gender')
    plt.title('%s by Target'%col.capitalize())
    plt.grid(axis = 'y', color='green', linestyle='-.')

[Figure: violin plots of dependency, edjefa and edjefe by Target and gender]

Define Variable Categories

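Before aggregating the data, the features need to be categorised (see the feature descriptions in feature_desp):

1. Features describing an individual
- boolean
- continuous
2. Features describing the household
- boolean
- continuous
3. id & target

Descriptions of the float features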

# .loc replaces .ix, which was removed in pandas 1.0
feature_desp.loc[list(train.select_dtypes(['float']).columns)]

	description
F_name
v2a1	Monthly rent payment
v18q1	number of tablets household owns
rez_esc	Years behind in school
meaneduc	average years of education for adults (18+)
overcrowding	# persons per room
SQBovercrowding	overcrowding squared
SQBdependency	dependency squared
SQBmeaned	square of the mean years of education of adults (>=18) in the household

# values of these features within a single household (001ff74ca)
indexs = list(train.select_dtypes(['float']).columns)
train.loc[train.idhogar == norm.index[5],indexs]

	v2a1	v18q1	rez_esc	meaneduc	overcrowding	SQBovercrowding	SQBdependency	SQBmeaned
3775	180000.0	2.0	NaN	15.5	2.0	4.0	1.0	240.25
3776	180000.0	2.0	NaN	15.5	2.0	4.0	1.0	240.25
3777	180000.0	2.0	0.0	15.5	2.0	4.0	1.0	240.25
3778	180000.0	2.0	0.0	15.5	2.0	4.0	1.0	240.25

int features

There are many int features. They fall into two groups: boolean (only 0/1) and non-boolean.

# names of the int columns
int_cols = list(train.select_dtypes(['int']).columns)

# int columns that behave like booleans (exactly two unique values)
bool_cols = list(train[int_cols].nunique()[train[int_cols].nunique() == 2].index)

# int columns that are not boolean
non_bool_cols = list(train[int_cols].nunique()[train[int_cols].nunique() !=2].index)

# elimbasu5 is a special case: it is a boolean flag, but it never takes the value 1 in the training data (all 0)
non_bool_cols.remove('elimbasu5')
bool_cols.append('elimbasu5')
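The elimbasu5 exception arises because counting unique values cannot tell an all-zero flag from a constant. An alternative, sketched here on a toy frame with hypothetical column names, is to test whether a column only ever contains 0 or 1. Note the opposite caveat: a count column that happens to contain only 0 and 1 would then be classified as boolean.

```python
import pandas as pd

# Toy frame: `flag_const` is a 0/1 flag that happens to be all zeros here,
# like elimbasu5 in the training data.
df = pd.DataFrame({
    "flag": [0, 1, 1, 0],
    "flag_const": [0, 0, 0, 0],
    "rooms": [3, 4, 8, 5],
    "Target": [4, 4, 2, 1],
})

# A column is treated as boolean if it only ever contains 0 or 1,
# regardless of how many distinct values actually occur.
is_bool = df.isin([0, 1]).all()
bool_cols = list(is_bool[is_bool].index)
non_bool_cols = list(is_bool[~is_bool].index)
print(bool_cols, non_bool_cols)
```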

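Feature descriptions for non_bool_cols

# .ix was removed in pandas 1.0; reindex keeps labels without a description (Target) as NaN
feature_desp.reindex(non_bool_cols)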

	description
F_name
rooms	number of all rooms in the house
r4h1	Males younger than 12 years of age
r4h2	Males 12 years of age and older
r4h3	Total males in the household
r4m1	Females younger than 12 years of age
r4m2	Females 12 years of age and older
r4m3	Total females in the household
r4t1	persons younger than 12 years of age
r4t2	persons 12 years of age and older
r4t3	Total persons in the household
tamhog	size of the household
tamviv	number of persons living in the household
escolari	years of schooling
hhsize	household size
hogar_nin	Number of children 0 to 19 in household
hogar_adul	Number of adults in household
hogar_mayor	# of individuals 65+ in the household
hogar_total	# of total individuals in the household
bedrooms	number of bedrooms
qmobilephone	# of mobile phones
age	Age in years
SQBescolari	escolari squared
SQBage	age squared
SQBhogar_total	hogar_total squared
SQBedjefe	edjefe squared
SQBhogar_nin	hogar_nin squared
agesq	Age squared
Target	NaN

# the same features for one specific household
pd.options.display.max_columns = 15
train.loc[train.idhogar == norm.index[5], non_bool_cols]

	rooms	r4h1	r4h2	r4h3	r4m1	r4m2	r4m3	r4t1	r4t2	r4t3	tamhog	tamviv	escolari	hhsize	hogar_nin
3775	4	1	1	2	1	1	2	2	2	4	4	4	15	4	2
3776	4	1	1	2	1	1	2	2	2	4	4	4	16	4	2
3777	4	1	1	2	1	1	2	2	2	4	4	4	2	4	2
3778	4	1	1	2	1	1	2	2	2	4	4	4	4	4	2

	hogar_adul	hogar_total	bedrooms	qmobilephone	age	SQBescolari	SQBage	SQBhogar_total
3775	2	4	2	2	39	225	1521	16
3776	2	4	2	2	36	256	1296	16
3777	2	4	2	2	8	4	64	16
3778	2	4	2	2	11	16	121	16

	SQBedjefe	SQBhogar_nin	agesq	Target
3775	225	4	1521	4
3776	225	4	1296	4
3777	225	4	64	4
3778	225	4	121	4

Individual features: escolari, age, SQBescolari, SQBage, agesq. The remaining ones are household features.
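Reading one household by eye works, but the same test can be run over all households at once: a feature is household-level if it never takes more than one distinct value inside any household. The sketch below shows the idea on made-up rows; nunique ignores missing values, so columns with NaN may need a closer look.

```python
import pandas as pd

# Toy member-level data (hypothetical values) with the article's column names.
df = pd.DataFrame({
    "idhogar": ["h1", "h1", "h1", "h2", "h2"],
    "rooms": [4, 4, 4, 3, 3],
    "hogar_nin": [2, 2, 2, 0, 0],
    "age": [39, 36, 8, 70, 68],
    "escolari": [15, 16, 2, 6, 6],
})

feature_cols = [c for c in df.columns if c != "idhogar"]

# Number of distinct values each feature takes inside each household.
per_household = df.groupby("idhogar")[feature_cols].nunique()

# Household-level: never more than one distinct value within any household.
is_household_level = per_household.max() <= 1
hh_cols = list(is_household_level[is_household_level].index)
ind_cols = list(is_household_level[~is_household_level].index)
print(hh_cols, ind_cols)
```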

Boolean-valued int features

# .loc replaces .ix, which was removed in pandas 1.0
feature_desp.loc[bool_cols]

	description
F_name
hacdor	=1 Overcrowding by bedrooms
hacapo	=1 Overcrowding by rooms
v14a	=1 has bathroom in the household
refrig	=1 if the household has refrigerator
v18q	owns a tablet
paredblolad	=1 if predominant material on the outside wall is block or brick
paredzocalo	=1 if predominant material on the outside wall is socket wood zinc or absbesto
paredpreb	=1 if predominant material on the outside wall is prefabricated or cement
pareddes	=1 if predominant material on the outside wall is waste material
paredmad	=1 if predominant material on the outside wall is wood
paredzinc	=1 if predominant material on the outside wall is zink
paredfibras	=1 if predominant material on the outside wall is natural fibers
paredother	=1 if predominant material on the outside wall is other
pisomoscer	=1 if predominant material on the floor is mosaic ceramic terrazo
pisocemento	=1 if predominant material on the floor is cement
pisoother	=1 if predominant material on the floor is other
pisonatur	=1 if predominant material on the floor is natural material
pisonotiene	=1 if no floor at the household
pisomadera	=1 if predominant material on the floor is wood
techozinc	=1 if predominant material on the roof is metal foil or zink
techoentrepiso	=1 if predominant material on the roof is fiber cement mezzanine
techocane	=1 if predominant material on the roof is natural fibers
techootro	=1 if predominant material on the roof is other
cielorazo	=1 if the house has ceiling
abastaguadentro	=1 if water provision inside the dwelling
abastaguafuera	=1 if water provision outside the dwelling
abastaguano	=1 if no water provision
public	=1 electricity from CNFL ICE ESPH/JASEC
planpri	=1 electricity from private plant
noelec	=1 no electricity in the dwelling
coopele	=1 electricity from cooperative
sanitario1	=1 no toilet in the dwelling
sanitario2	=1 toilet connected to sewer or cesspool
sanitario3	=1 toilet connected to septic tank
sanitario5	=1 toilet connected to black hole or letrine
sanitario6	=1 toilet connected to other system
energcocinar1	=1 no main source of energy used for cooking (no kitchen)
energcocinar2	=1 main source of energy used for cooking electricity
energcocinar3	=1 main source of energy used for cooking gas
energcocinar4	=1 main source of energy used for cooking wood charcoal
elimbasu1	=1 if rubbish disposal mainly by tanker truck
elimbasu2	=1 if rubbish disposal mainly by botan hollow or buried
elimbasu3	=1 if rubbish disposal mainly by burning
elimbasu4	=1 if rubbish disposal mainly by throwing in an unoccupied space
elimbasu6	=1 if rubbish disposal mainly other
epared1	=1 if walls are bad
epared2	=1 if walls are regular
epared3	=1 if walls are good
etecho1	=1 if roof are bad
etecho2	=1 if roof are regular
etecho3	=1 if roof are good
eviv1	=1 if floor are bad
eviv2	=1 if floor are regular
eviv3	=1 if floor are good
dis	=1 if disable person
male	=1 if male
female	=1 if female
estadocivil1	=1 if less than 10 years old
estadocivil2	=1 if free or coupled uunion
estadocivil3	=1 if married
estadocivil4	=1 if divorced
estadocivil5	=1 if separated
estadocivil6	=1 if widow/er
estadocivil7	=1 if single
parentesco1	=1 if household head
parentesco2	=1 if spouse/partner
parentesco3	=1 if son/doughter
parentesco4	=1 if stepson/doughter
parentesco5	=1 if son/doughter in law
parentesco6	=1 if grandson/doughter
parentesco7	=1 if mother/father
parentesco8	=1 if father/mother in law
parentesco9	=1 if brother/sister
parentesco10	=1 if brother/sister in law
parentesco11	=1 if other family member
parentesco12	=1 if other non family member
instlevel1	=1 no level of education
instlevel2	=1 incomplete primary
instlevel3	=1 complete primary
instlevel4	=1 incomplete academic secondary level
instlevel5	=1 complete academic secondary level
instlevel6	=1 incomplete technical secondary level
instlevel7	=1 complete technical secondary level
instlevel8	=1 undergraduate and higher education
instlevel9	=1 postgraduate higher education
tipovivi1	=1 own and fully paid house
tipovivi2	=1 own paying in installments
tipovivi3	=1 rented
tipovivi4	=1 precarious
tipovivi5	=1 other(assigned borrowed)
computer	=1 if the household has notebook or desktop computer
television	=1 if the household has TV
mobilephone	=1 if mobile phone
lugar1	=1 region Central
lugar2	=1 region Chorotega
lugar3	=1 region Pacífico central
lugar4	=1 region Brunca
lugar5	=1 region Huetar Atlántica
lugar6	=1 region Huetar Norte
area1	=1 zona urbana
area2	=2 zona rural
elimbasu5	=1 if rubbish disposal mainly by throwing in river creek or sea

(object) features

feature_desp.loc[['dependency','edjefe','edjefa']]

	description
F_name
dependency	Dependency rate calculated = (number of members of the household younger than 19 or older than 64)/(number of member of household between 19 and 64)
edjefe	years of education of male head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0
edjefa	years of education of female head of household based on the interaction of escolari (years of education) head of household and gender yes=1 and no=0

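Feature categories

ind_bool : boolean individual (household member) features
ind_non_bool : non-boolean individual features
hh_bool : boolean household features
hh_non_bool : non-boolean household features
ids : Id, idhogar, Target
hh_cont : household-level continuous features (dependency rate, head of household's years of education)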

ind_bool = ['v18q', 'dis', 'male', 'female', 'estadocivil1', 'estadocivil2', 'estadocivil3', 
            'estadocivil4', 'estadocivil5', 'estadocivil6', 'estadocivil7', 
            'parentesco1', 'parentesco2',  'parentesco3', 'parentesco4', 'parentesco5', 
            'parentesco6', 'parentesco7', 'parentesco8',  'parentesco9', 'parentesco10', 
            'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3', 
            'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8', 
            'instlevel9', 'mobilephone']

ind_non_bool = ['rez_esc', 'escolari', 'age','SQBescolari','SQBage','agesq']

hh_bool = ['hacdor', 'hacapo', 'v14a', 'refrig', 'paredblolad', 'paredzocalo', 
           'paredpreb','pisocemento', 'pareddes', 'paredmad',
           'paredzinc', 'paredfibras', 'paredother', 'pisomoscer', 'pisoother', 
           'pisonatur', 'pisonotiene', 'pisomadera',
           'techozinc', 'techoentrepiso', 'techocane', 'techootro', 'cielorazo', 
           'abastaguadentro', 'abastaguafuera', 'abastaguano',
            'public', 'planpri', 'noelec', 'coopele', 'sanitario1', 
           'sanitario2', 'sanitario3', 'sanitario5',   'sanitario6',
           'energcocinar1', 'energcocinar2', 'energcocinar3', 'energcocinar4', 
           'elimbasu1', 'elimbasu2', 'elimbasu3', 'elimbasu4', 
           'elimbasu5', 'elimbasu6', 'epared1', 'epared2', 'epared3',
           'etecho1', 'etecho2', 'etecho3', 'eviv1', 'eviv2', 'eviv3', 
           'tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5', 
           'computer', 'television', 'lugar1', 'lugar2', 'lugar3',
           'lugar4', 'lugar5', 'lugar6', 'area1', 'area2']

hh_non_bool = ['v2a1', 'v18q1', 'meaneduc', 'SQBovercrowding', 'SQBdependency',
               'SQBmeaned', 'overcrowding', 'rooms', 'r4h1', 'r4h2', 'r4h3', 'r4m1',
               'r4m2', 'r4m3', 'r4t1', 'r4t2', 'r4t3', 'tamhog', 'tamviv', 'hhsize',
               'hogar_nin', 'hogar_adul', 'hogar_mayor', 'hogar_total',  'bedrooms',
               'qmobilephone', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin']

hh_cont = [ 'dependency', 'edjefe', 'edjefa']


ids = ['Id', 'idhogar', 'Target']

len(ind_bool)+len(ind_non_bool)+len(hh_bool)+len(hh_non_bool)+len(hh_cont)+len(ids)
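Summing the lengths only confirms the count; it cannot catch a misspelled name or a column listed twice. A small helper along the following lines (the function name and the toy lists are hypothetical) compares the groups against the actual column names.

```python
from collections import Counter

def check_partition(all_columns, groups):
    """Compare named column groups against the full column list.

    Returns (missing, unknown, duplicated):
      missing    - columns not assigned to any group
      unknown    - grouped names that are not actual columns (e.g. typos)
      duplicated - names that appear more than once across the groups
    """
    assigned = [col for cols in groups.values() for col in cols]
    counts = Counter(assigned)
    missing = sorted(set(all_columns) - set(assigned))
    unknown = sorted(set(assigned) - set(all_columns))
    duplicated = sorted(col for col, n in counts.items() if n > 1)
    return missing, unknown, duplicated

# Toy example with a deliberately misspelled name.
columns = ["Id", "idhogar", "Target", "age", "rooms", "SQBhogar_nin"]
groups = {
    "ids": ["Id", "idhogar", "Target"],
    "ind_non_bool": ["age"],
    "hh_non_bool": ["rooms", "SQBhoagr_nin"],  # typo on purpose
}
result = check_partition(columns, groups)
print(result)
```

With the real data the call would take train.columns and a dictionary of the six lists defined above; three empty lists in the result mean the groups form a clean partition.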

Fix test set `dependency`, `edjefe`, `edjefa`

Apply the same conversion to dependency, edjefe and edjefa in the test data.

test.dependency = test.dependency.replace({"yes": 1, "no": 0}).astype(np.float64)
test.edjefa = test.edjefa.replace({"yes": 1, "no": 0}).astype(np.float64)
test.edjefe = test.edjefe.replace({"yes": 1, "no": 0}).astype(np.float64)

train.to_csv('../fix_train.csv',index = False)

test.to_csv('../fix_test.csv',index = False)
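The yes/no conversion is written out twice, once for train and once for test. One way to keep the two in sync is a small helper such as the sketch below; the function name, the constant and the toy frames are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

MIXED_COLS = ["dependency", "edjefe", "edjefa"]

def fix_yes_no(df, cols=MIXED_COLS):
    """Return a copy of `df` with yes/no in `cols` mapped to 1/0 and cast to float."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].replace({"yes": "1", "no": "0"}).astype(np.float64)
    return out

# Toy frames standing in for train and test (hypothetical values).
toy_train = pd.DataFrame({"dependency": ["no", "8", "yes"],
                          "edjefe": ["10", "12", "no"],
                          "edjefa": ["no", "no", "11"]})
toy_test = pd.DataFrame({"dependency": ["yes", ".5"],
                         "edjefe": ["no", "6"],
                         "edjefa": ["9", "no"]})

toy_train, toy_test = fix_yes_no(toy_train), fix_yes_no(toy_test)
print(toy_train.dtypes.tolist(), toy_test["dependency"].tolist())
```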

This concludes the DEA. Next: feature preprocessing and finding a baseline.
