Li Hongyi Machine Learning 2021 Series: HW2 - Income Prediction

This post walks through HW2 of Li Hongyi's machine learning course: a binary classification problem of predicting, from personal attributes, whether a person's annual income exceeds $50,000. It covers the dataset, implementations of logistic regression and a probabilistic generative model, and plots of the training loss and accuracy curves.

Li Hongyi Machine Learning 2021 Series

HW2: Income Prediction

Project Description

Binary classification is one of the most fundamental problems in machine learning. In this tutorial you will implement a linear binary classifier that predicts, from a person's attributes, whether their annual income exceeds $50,000. We will do this with two methods, logistic regression and a generative model; try to understand and compare the design ideas behind the two.
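As a preview of the generative approach: if each class is modeled as a Gaussian with a shared covariance matrix, the posterior P(class | x) reduces to a sigmoid of a linear function of x, so the weights have a closed form and no iterative training is needed. A minimal sketch (the function name and details here are illustrative, not the course's reference code):

```python
import numpy as np

def generative_fit(X, y):
    """Closed-form binary classifier: Gaussian class-conditionals with a
    shared covariance matrix. X: (n, d) float array, y: (n,) array of 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Shared covariance: weighted average of the two in-class covariances
    cov0 = np.cov(X0, rowvar=False, bias=True)
    cov1 = np.cov(X1, rowvar=False, bias=True)
    cov = (len(X0) * cov0 + len(X1) * cov1) / len(X)
    inv = np.linalg.pinv(cov)  # pseudo-inverse is more stable than inv here
    w = inv @ (mu0 - mu1)
    b = (-0.5 * mu0 @ inv @ mu0 + 0.5 * mu1 @ inv @ mu1
         + np.log(len(X0) / len(X1)))
    # sigmoid(x @ w + b) gives the posterior probability of class 0
    return w, b
```

Because w and b are computed in one shot from class means and the shared covariance, the generative model trains much faster than gradient descent, at the cost of its Gaussian assumption.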
The binary classification task:

  • Does a person's annual income exceed $50,000?

Dataset

This dataset is derived from the Census-Income (KDD) Data Set in the UCI Machine Learning Repository, with some preprocessing. To make training easier, we removed some unnecessary attributes and slightly rebalanced the ratio of positive to negative labels. During training, only the three processed files X_train, Y_train, and X_test are actually used; the raw files train.csv and test.csv can provide you with some extra information.

  • Unnecessary attributes have been removed.
  • The ratio of positive to negative labels has been balanced.

Feature Format

  1. train.csv, test_no_label.csv
  • Raw text-based data
  • Unnecessary attributes removed; positive/negative ratio balanced
  2. X_train, Y_train, X_test
  • Discrete features in train.csv => one-hot encoded in X_train (education, marital status, ...)
  • Continuous features in train.csv => kept as-is in X_train (age, capital losses, ...)
  • X_train, X_test: each row contains one 510-dim feature vector representing one sample
  • Y_train: label = 0 means "<=50K", label = 1 means ">50K"
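The one-hot expansion of discrete features can be illustrated with `pandas.get_dummies` (a toy example with made-up columns; it is not the exact preprocessing that produced X_train):

```python
import pandas as pd

# Toy frame with one discrete and one continuous column (illustrative only)
df = pd.DataFrame({
    'education': ['Masters', 'Some college', 'Masters'],
    'age': [33, 63, 71],
})
# Discrete columns expand into one 0/1 indicator column per category;
# continuous columns pass through unchanged.
encoded = pd.get_dummies(df, columns=['education'])
print(encoded.columns.tolist())
```

Applied to every discrete attribute in train.csv, this kind of expansion is how 42 raw columns grow into 510 feature dimensions.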

Requirements

  1. Implement logistic regression with gradient descent by hand.
  2. Implement the probabilistic generative model by hand.
  3. Each code cell should run in under five minutes.
  4. Do not use any open-source code (e.g., a decision tree implementation you found on GitHub).

Data Preparation

The project data is stored under the work/data/ directory.

Environment Setup / Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('work/data/train.csv', header=None, encoding='big5')
print(df.shape)
df.head()
(54257, 42)


(pandas emits a DtypeWarning: columns 0, 1, 3, 4, 6, 17, 18, 19, 30, 36, 38, 39, 40 have mixed types. Because the file is read with header=None, the header row of column names is parsed as data; specifying dtype on import or setting low_memory=False suppresses the warning.)
df.head() shows the first five rows of the raw data: columns include id, age, class of worker, detailed industry recode, detailed occupation recode, education, wage per hour, marital stat, major industry code, ..., country of birth, citizenship, weeks worked in year, year, and the income label y ("50000+." or "- 50000.").

5 rows × 42 columns

df = pd.read_csv('work/data/X_train', header=None, encoding='big5')
print(df.shape)
df.head()
(pandas again emits a DtypeWarning: nearly all of the 511 columns have mixed types, because the header row is read as data; setting low_memory=False suppresses the warning.)


(54257, 511)
df.head() shows the first five rows of the processed features: column 0 is id, column 1 is age, and most remaining columns are 0/1 one-hot indicators (Private, Self-employed-incorporated, State government, ...); continuous columns such as weeks worked in year keep their numeric values.

5 rows × 511 columns

Preparing Data

Normalize the features, then split the result into training and validation sets.

np.random.seed(0)
X_train_fpath = 'work/data/X_train'
Y_train_fpath = 'work/data/Y_train'
X_test_fpath = 'work/data/X_test'
output_fpath = 'work/output_{}.csv'

# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)

# Normalization
def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specified columns of X.
    # The mean and standard deviation of the training data are reused when
    # processing the testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None',
    #         all columns will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data

    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)

    return X, X_mean, X_std