Hung-yi Lee Machine Learning 2021 Series
Homework 2 - Annual Income Classification
Project Description
Binary classification is one of the most fundamental problems in machine learning. In this assignment, you will learn how to implement a linear binary classifier that predicts, from a person's attributes, whether their annual income exceeds 50,000 USD. We will approach this with two methods, logistic regression and a generative model; try to understand and compare the design ideas behind the two and how they differ.
The binary classification task:
- Does a person's annual income exceed 50,000 USD?
Dataset
This dataset is derived from the Census-Income (KDD) Data Set in the UCI Machine Learning Repository, with some preprocessing applied. To make training easier, we removed some unnecessary attributes and slightly rebalanced the ratio of positive to negative labels. In fact, only the three processed files X_train, Y_train, and X_test are used during training; the raw files train.csv and test.csv can provide some additional information.
- Unnecessary attributes have been removed.
- The ratio of positive to negative labels has been balanced.
Feature Format
- train.csv, test_no_label.csv
- Text-based raw data
- Unnecessary attributes removed; positive/negative ratio balanced
- X_train, Y_train, X_test (testing)
- Discrete features in train.csv => one-hot encoded in X_train (education, marital status, ...)
- Continuous features in train.csv => kept unchanged in X_train (age, capital loss, ...)
- X_train, X_test: each row contains one 510-dim feature vector representing one sample.
- Y_train: label = 0 means "<=50K", label = 1 means ">50K".
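To make the encoding above concrete, here is a minimal sketch of how a discrete feature becomes one-hot columns. The `education` values below are made up for illustration; the actual column layout in X_train may differ.

```python
import numpy as np

# Hypothetical discrete feature: education level for four people.
education = np.array(["Masters", "Some college", "7th-8th grade", "Masters"])

# One-hot encoding: one column per distinct category.
categories = sorted(set(education))  # fixed, stable column order
onehot = (education[:, None] == np.array(categories)).astype(float)

print(categories)
print(onehot)  # shape (4, 3): each row has exactly one 1
```

Continuous features such as age are simply copied into their own single column, which is why the final feature vector ends up 510-dimensional.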
Project Requirements
- Implement logistic regression with gradient descent by hand.
- Implement the probabilistic generative model by hand.
- Each code block should run in under five minutes.
- Do not use any open-source code (e.g., a decision tree implementation found on GitHub).
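For orientation, the gradient-descent update asked for in the first requirement looks roughly like this. This is a minimal sketch on toy data, not the assignment's reference implementation; the helper names and the learning rate are placeholders.

```python
import numpy as np

def _sigmoid(z):
    # Clip the output to avoid overflow and log(0) in the loss.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - 1e-8)

def gradient(X, Y_label, w, b):
    # Gradient of the cross-entropy loss w.r.t. weights and bias.
    y_pred = _sigmoid(np.matmul(X, w) + b)
    pred_error = Y_label - y_pred
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad

# One gradient-descent step on a tiny synthetic problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = (X[:, 0] > 0).astype(float)  # label depends only on feature 0
w, b = np.zeros(3), 0.0
learning_rate = 0.1
w_grad, b_grad = gradient(X, Y, w, b)
w, b = w - learning_rate * w_grad, b - learning_rate * b_grad
```

In the full assignment this update runs for many iterations over mini-batches, with the learning rate typically decayed over time.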
Data Preparation
The project data is stored under the work/data/ directory.
Environment Setup / Installation
None
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('work/data/train.csv', header=None, encoding='big5')
print(df.shape)
df.head()
(54257, 42)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | id | age | class of worker | detailed industry recode | detailed occupation recode | education | wage per hour | enroll in edu inst last wk | marital stat | major industry code | ... | country of birth father | country of birth mother | country of birth self | citizenship | own business or self employed | fill inc questionnaire for veteran's admin | veterans benefits | weeks worked in year | year | y |
1 | 0 | 33 | Private | 34 | 26 | Masters degree(MA MS MEng MEd MSW MBA) | 0 | Not in universe | Married-civilian spouse present | Finance insurance and real estate | ... | China | China | Taiwan | Foreign born- Not a citizen of U S | 2 | Not in universe | 2 | 52 | 95 | 50000+. |
2 | 1 | 63 | Private | 7 | 22 | Some college but no degree | 0 | Not in universe | Never married | Manufacturing-durable goods | ... | ? | ? | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
3 | 2 | 71 | Not in universe | 0 | 0 | 7th and 8th grade | 0 | Not in universe | Married-civilian spouse present | Not in universe or children | ... | Germany | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
4 | 3 | 43 | Local government | 43 | 10 | Bachelors degree(BA AB BS) | 0 | Not in universe | Married-civilian spouse present | Education | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
5 rows × 42 columns
df = pd.read_csv('work/data/X_train', header=None, encoding='big5')
print(df.shape)
df.head()
(54257, 511)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | id | age | Private | Self-employed-incorporated | State government | Self-employed-not incorporated | Not in universe | Without pay | Federal government | Never worked | ... | 1 | Not in universe | Yes | No | 2 | 0 | 1 | weeks worked in year | 94 | 95 |
1 | 0 | 33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
2 | 1 | 63 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
3 | 2 | 71 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 3 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
5 rows × 511 columns
Preparing Data
Normalize the features, then split the data into a training set and a validation set.
np.random.seed(0)
X_train_fpath = 'work/data/X_train'
Y_train_fpath = 'work/data/Y_train'
X_test_fpath = 'work/data/X_test'
output_fpath = 'work/output_{}.csv'
# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
# Normalization
def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of the training data are reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)

    return X, X_mean, X_std
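The train/validation split mentioned above can be done with a small helper. This is a sketch on synthetic data; the helper name `_train_dev_split` and the `dev_ratio` default are illustrative, not prescribed by the assignment, and in the real script the split would be applied to the normalized X_train and Y_train.

```python
import numpy as np

def _train_dev_split(X, Y, dev_ratio = 0.25):
    # Split the data into a training set and a development (validation) set.
    # The first (1 - dev_ratio) fraction of rows goes to training.
    train_size = int(len(X) * (1 - dev_ratio))
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

# Toy check on synthetic data.
X = np.arange(20).reshape(10, 2).astype(float)
Y = np.arange(10).astype(float)
X_tr, Y_tr, X_dev, Y_dev = _train_dev_split(X, Y, dev_ratio = 0.2)
```

Note this split is not shuffled; with pre-shuffled data that is fine, otherwise the rows should be permuted first.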