Hung-yi Lee Machine Learning (NTU) HW2 winner or loser: annotated solution

Homework 2 - Classification

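The homework code is the reference solution provided by the course.
Course homepage: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html
The notes on file reading and writing draw on a post about file I/O (https://www.cnblogs.com/ymjyqsx/p/6554817.html)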

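Binary classification is one of the most fundamental problems in machine learning. In this tutorial, you will learn how to implement a linear binary classifier that predicts, from a person's personal information, whether their annual income exceeds 50,000 US dollars. We will use two methods to do this, logistic regression and a generative model, and you can try to understand and analyze the design ideas behind each and how they differ.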

First, load the data

Downloading : https://drive.google.com/uc?id=1KSFIRh0-_Vr7SdiSCZP1ItV7bXPxMD92

Extract the downloaded archive: tar -zxvf data.tar.gz. This is a Linux command; on Windows you can run it in WSL.
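
If neither a Linux shell nor WSL is at hand, Python's standard tarfile module can unpack the same kind of archive. The sketch below is one possible way to do it, not part of the course code; the helper name extract_archive is hypothetical.

```python
import tarfile

def extract_archive(archive_path, dest_dir="."):
    # Hypothetical helper: unpack a gzip-compressed tar archive into dest_dir,
    # roughly what `tar -zxf` does. Returns the member names it found.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names
```

Assuming the download was saved as data.tar.gz in the working directory, calling extract_archive("data.tar.gz") would unpack it in place.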

Logistic Regression

Set up the file paths for the data

import numpy as np

np.random.seed(0)
X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'

Read the data. The first row holds the column names and the first column holds the id, so both are dropped.

# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)  # next() takes an iterator; here it skips the first (header) line
    # Strip the trailing '\n' from each line, then split the fields on ','
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
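
To see what the list comprehension does to each line, the same parsing can be run on a tiny in-memory CSV. This is only an illustration: the column names and values are made up, and io.StringIO stands in for the real file.

```python
import io
import numpy as np

# Made-up sample: a header row, then an id column followed by two features.
sample = io.StringIO("id,age,hours\n0,25,40\n1,38,50\n")

next(sample)  # skip the header row
# Drop the id column ([1:]) and convert the remaining fields to float.
X = np.array([line.strip('\n').split(',')[1:] for line in sample], dtype=float)

print(X.shape)  # (2, 2): two data rows, two feature columns
print(X)
```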

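Files can also be read and written with a plain f = open(...). However, any read or write may raise an IOError, and once that happens the later f.close() is never called. To make sure the file is closed whether or not an error occurs, we can wrap the work in try ... finally. Python introduced the with statement to call close() for us automatically.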
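
The two forms described above can be put side by side. This is a minimal sketch with hypothetical helper names and a throwaway temporary file; both functions return the first line of a file and close it even if reading fails.

```python
import os
import tempfile

def first_line_try_finally(path):
    # Explicit form: close() runs in finally, even if readline() raises.
    f = open(path)
    try:
        return f.readline()
    finally:
        f.close()

def first_line_with(path):
    # Equivalent form: the with statement closes the file on exit.
    with open(path) as f:
        return f.readline()

# Demo on a temporary file with made-up content.
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("id,label\n0,1\n")

print(first_line_try_finally(demo_path) == first_line_with(demo_path))
os.remove(demo_path)
```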

Normalize the data

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of training data will be reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data

    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        # Per-column mean. Shape before reshape: (510,); after reshape(1, -1): (1, 510)
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        # Per-column standard deviation (not the variance)
        X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)
     
    return X, X_mean, X_std
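
The function returns X_mean and X_std so that test data can be scaled with the training statistics rather than its own. A compact sketch of that idea on made-up numbers follows; the helper standardize is hypothetical and simpler than _normalize above (it always uses every column and does not modify its input).

```python
import numpy as np

def standardize(X, mean=None, std=None):
    # Hypothetical, simplified stand-in for _normalize: all columns, no in-place edit.
    if mean is None:
        mean = X.mean(axis=0, keepdims=True)
    if std is None:
        std = X.std(axis=0, keepdims=True)
    # The small constant guards against division by zero for constant columns.
    return (X - mean) / (std + 1e-8), mean, std

# Made-up data: 3 training rows and 1 test row, 2 features each.
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_te = np.array([[2.0, 40.0]])

X_tr_n, mean, std = standardize(X_tr)
X_te_n, _, _ = standardize(X_te, mean, std)  # reuse the training statistics

print(mean.shape)  # (1, 2): one row, so it broadcasts across samples
print(X_tr_n)
print(X_te_n)
```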

Split the dataset

def _train_dev_split(X, Y, dev_ratio = 0.25):
    # This function splits data into training set and development set.
    train_size = int(len(X) * <