Homework 2 - Classification
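The homework code is the reference solution provided by the course.
See the course homepage at http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html
The notes on file reading and writing draw on this post about file I/O (https://www.cnblogs.com/ymjyqsx/p/6554817.html)
Binary classification is one of the most fundamental problems in machine learning. In this tutorial you will learn how to implement a linear binary classifier that predicts, from a person's personal data, whether their annual income exceeds 50,000 US dollars. We will use two approaches, logistic regression and a generative model, and you can try to understand and analyze the design ideas behind each and the differences between them.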
First, load the data
Downloading : https://drive.google.com/uc?id=1KSFIRh0-_Vr7SdiSCZP1ItV7bXPxMD92
Extract the downloaded archive: tar -zxvf data.tar.gz. This is a Linux command; on Windows you can run it under WSL.
Logistic Regression
Read the data
import numpy as np
np.random.seed(0)
X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'
Read the data. The first row holds the column names and the first column holds the ids, so both are dropped.
# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)  # next() takes an iterator; here it skips the header row
    # Strip the trailing '\n' from each line, split on ',', and drop the id column
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
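To see what the parsing step does to a single file, here is a minimal sketch that runs the same logic on a tiny in-memory CSV. The column names and values are made up for illustration and are not taken from the homework data.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for a file such as ./data/X_train.
sample_csv = io.StringIO(
    "id,age,hours\n"
    "0,39,40\n"
    "1,50,13\n"
)

next(sample_csv)  # skip the header row
# Drop the first field (the id) of every remaining row and convert to float.
features = np.array(
    [line.strip('\n').split(',')[1:] for line in sample_csv],
    dtype=float,
)
print(features.shape)
print(features)
```

Only the two feature columns survive; the header row and the id column are gone.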
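Files can also be read and written with a plain f = open(). However, file I/O may raise an IOError, and if that happens a later f.close() is never called. To make sure the file is closed whether or not an error occurs, we can wrap the work in try ... finally. Python also provides the with statement, which calls close() for us automatically.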
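The two styles described above can be compared side by side in a short sketch. The scratch file path is hypothetical and exists only for this illustration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Manual version: close() runs in the finally clause even if write() raises.
f = open(demo_path, "w")
try:
    f.write("hello\n")
finally:
    f.close()

# with-statement version: the file is closed when the block exits.
with open(demo_path) as g:
    first_line = g.readline()

print(f.closed, g.closed, first_line.strip())
```

Both file objects report closed afterwards; the with form simply needs less code to get there.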
Normalize the data
def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # This function normalizes specific columns of X.
    # The mean and standard deviation of training data will be reused when processing testing data.
    #
    # Arguments:
    #     X: data to be processed
    #     train: 'True' when processing training data, 'False' for testing data
    #     specified_column: indexes of the columns that will be normalized. If 'None', all columns
    #         will be normalized.
    #     X_mean: mean value of training data, used when train = 'False'
    #     X_std: standard deviation of training data, used when train = 'False'
    # Outputs:
    #     X: normalized data
    #     X_mean: computed mean value of training data
    #     X_std: computed standard deviation of training data
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        # Column-wise mean. Shape before reshape: (510,);
        # reshape(1, -1) turns it into a row vector of shape (1, 510)
        X_mean = np.mean(X[:, specified_column] ,0).reshape(1, -1)
        # Column-wise standard deviation (not the variance)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)
    # The 1e-8 term avoids division by zero for constant columns
    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)
    return X, X_mean, X_std
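The key point of the normalization step is that the mean and standard deviation are computed on the training data and then reused for the test data. The sketch below shows that idea on made-up toy arrays; it repeats the same arithmetic inline rather than calling _normalize, and all variable names here are hypothetical.

```python
import numpy as np

# Hypothetical toy data: 3 training rows and 1 test row, 2 features each.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Column statistics come from the training data only.
mean = np.mean(train, 0).reshape(1, -1)  # shape (1, 2)
std = np.std(train, 0).reshape(1, -1)    # shape (1, 2)

# Broadcasting applies the (1, 2) statistics to every row.
train_norm = (train - mean) / (std + 1e-8)
test_norm = (test - mean) / (std + 1e-8)  # reuse the training statistics

print(mean.shape, std.shape)
print(train_norm.mean(axis=0))  # approximately 0 for each column
print(test_norm)
```

Because the test row is scaled with the training statistics, its second feature lands well above zero instead of being recentered, which is the intended behaviour.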
Split the dataset
def _train_dev_split(X, Y, dev_ratio = 0.25):
    # This function splits data into training set and development set.
    train_size = int(len(X) * <