Hung-yi Lee Machine Learning 2021 Series
Homework 2 - Annual Income Classification
Project Description
Binary classification is one of the most fundamental problems in machine learning. In this assignment, you will learn how to implement a linear binary classifier that predicts, from a person's attributes, whether their annual income exceeds 50,000 USD. We will approach this with two methods, logistic regression and a generative model; try to understand and compare the design ideas behind the two and how they differ.
The binary classification task:
- Does a person's annual income exceed 50,000 USD?
Dataset
This dataset is derived from the Census-Income (KDD) Data Set in the UCI Machine Learning Repository, with some preprocessing applied. To make training easier, we removed some unnecessary attributes and slightly rebalanced the ratio of positive to negative labels. In fact, only the three processed files X_train, Y_train, and X_test are used during training; the raw files train.csv and test.csv can provide some additional information.
- Unnecessary attributes have been removed.
- The ratio of positive to negative labels has been balanced.
Feature Format
- train.csv, test_no_label.csv
- Text-based raw data
- Unnecessary attributes removed; positive/negative ratio balanced
- X_train, Y_train, X_test (testing)
- Discrete features in train.csv => one-hot encoded in X_train (education, marital status, ...)
- Continuous features in train.csv => kept unchanged in X_train (age, capital loss, ...)
- X_train, X_test: each row contains one 510-dim feature vector representing one sample.
- Y_train: label = 0 means "<=50K", label = 1 means ">50K".
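To make the encoding above concrete, here is a minimal sketch of how a discrete feature becomes one-hot columns. The `education` values below are made up for illustration; the actual column layout in X_train may differ.

```python
import numpy as np

# Hypothetical discrete feature: education level for four people.
education = np.array(["Masters", "Some college", "7th-8th grade", "Masters"])

# One-hot encoding: one column per distinct category.
categories = sorted(set(education))  # fixed, stable column order
onehot = (education[:, None] == np.array(categories)).astype(float)

print(categories)
print(onehot)  # shape (4, 3): each row has exactly one 1
```

Continuous features such as age are simply copied into their own single column, which is why the final feature vector ends up 510-dimensional.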
Project Requirements
- Implement logistic regression with gradient descent by hand.
- Implement the probabilistic generative model by hand.
- Each code block should run in under five minutes.
- Do not use any open-source code (e.g., a decision tree implementation found on GitHub).
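For orientation, the gradient-descent update asked for in the first requirement looks roughly like this. This is a minimal sketch on toy data, not the assignment's reference implementation; the helper names and the learning rate are placeholders.

```python
import numpy as np

def _sigmoid(z):
    # Clip the output to avoid overflow and log(0) in the loss.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - 1e-8)

def gradient(X, Y_label, w, b):
    # Gradient of the cross-entropy loss w.r.t. weights and bias.
    y_pred = _sigmoid(np.matmul(X, w) + b)
    pred_error = Y_label - y_pred
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad

# One gradient-descent step on a tiny synthetic problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = (X[:, 0] > 0).astype(float)  # label depends only on feature 0
w, b = np.zeros(3), 0.0
learning_rate = 0.1
w_grad, b_grad = gradient(X, Y, w, b)
w, b = w - learning_rate * w_grad, b - learning_rate * b_grad
```

In the full assignment this update runs for many iterations over mini-batches, with the learning rate typically decayed over time.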
Data Preparation
The project data is stored under the work/data/ directory.
Environment Setup / Installation
None
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('work/data/train.csv', header=None, encoding='big5')
print(df.shape)
df.head()
(54257, 42)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | id | age | class of worker | detailed industry recode | detailed occupation recode | education | wage per hour | enroll in edu inst last wk | marital stat | major industry code | ... | country of birth father | country of birth mother | country of birth self | citizenship | own business or self employed | fill inc questionnaire for veteran's admin | veterans benefits | weeks worked in year | year | y |
1 | 0 | 33 | Private | 34 | 26 | Masters degree(MA MS MEng MEd MSW MBA) | 0 | Not in universe | Married-civilian spouse present | Finance insurance and real estate | ... | China | China | Taiwan | Foreign born- Not a citizen of U S | 2 | Not in universe | 2 | 52 | 95 | 50000+. |
2 | 1 | 63 | Private | 7 | 22 | Some college but no degree | 0 | Not in universe | Never married | Manufacturing-durable goods | ... | ? | ? | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
3 | 2 | 71 | Not in universe | 0 | 0 | 7th and 8th grade | 0 | Not in universe | Married-civilian spouse present | Not in universe or children | ... | Germany | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 0 | 95 | - 50000. |
4 | 3 | 43 | Local government | 43 | 10 | Bachelors degree(BA AB BS) | 0 | Not in universe | Married-civilian spouse present | Education | ... | United-States | United-States | United-States | Native- Born in the United States | 0 | Not in universe | 2 | 52 | 95 | - 50000. |
5 rows × 42 columns
df = pd.read_csv('work/data/X_train', header=None, encoding='big5')
print(df.shape)
df.head()
(54257, 511)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 501 | 502 | 503 | 504 | 505 | 506 | 507 | 508 | 509 | 510 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | id | age | Private | Self-employed-incorporated | State government | Self-employed-not incorporated | Not in universe | Without pay | Federal government | Never worked | ... | 1 | Not in universe | Yes | No | 2 | 0 | 1 | weeks worked in year | 94 | 95 |
1 | 0 | 33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
2 | 1 | 63 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
3 | 2 | 71 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 3 | 43 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 52 | 0 | 1 |
5 rows × 511 columns
Preparing Data
Normalize the features, then split the data into a training set and a validation set.
np.random.seed(0)
X_train_fpath = 'work/data/X_train'
Y_train_fpath = 'work/data/Y_train'
X_test_fpath = 'work/data/X_test'
output_fpath = 'work/output_{}.csv'
# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
# Normalization
def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of the training data are reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)

    return X, X_mean, X_std
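The train/validation split mentioned above can be done with a small helper. This is a sketch on synthetic data; the helper name `_train_dev_split` and the `dev_ratio` default are illustrative, not prescribed by the assignment, and in the real script the split would be applied to the normalized X_train and Y_train.

```python
import numpy as np

def _train_dev_split(X, Y, dev_ratio = 0.25):
    # Split the data into a training set and a development (validation) set.
    # The first (1 - dev_ratio) fraction of rows goes to training.
    train_size = int(len(X) * (1 - dev_ratio))
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

# Toy check on synthetic data.
X = np.arange(20).reshape(10, 2).astype(float)
Y = np.arange(10).astype(float)
X_tr, Y_tr, X_dev, Y_dev = _train_dev_split(X, Y, dev_ratio = 0.2)
```

Note this split is not shuffled; with pre-shuffled data that is fine, otherwise the rows should be permuted first.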