Homework 2 - Classification
1. Data Cleaning
Reading the files
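Only three files are used in this assignment: X_train, X_test, and Y_train. You can open them in Notepad++ to inspect the data.
In X_train and X_test, the first line is the header and the data starts on the second line. The header names the personal attributes recorded for each person.
Y_train has only two columns: the first is the person's ID and the second is a label. The label is 1 if the annual income is > 50K USD and 0 if it is ≤ 50K USD.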
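To make the file layout concrete, here is a minimal sketch that parses two tiny in-memory files in the same shape as described above. The column names, the values, and the helper names `load_features` and `load_labels` are made up for illustration; they are not taken from the real data set.

```python
import io
import numpy as np

# Hypothetical miniature files: a header row, then one row per person.
# Column names and values are invented for this example.
x_text = "id,age,hours_per_week\n0,39,40\n1,50,13\n2,38,40\n"
y_text = "id,label\n0,0\n1,1\n2,0\n"

def load_features(f):
    """Skip the header, drop the ID column, keep the rest as floats."""
    next(f)
    return np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)

def load_labels(f):
    """Skip the header and keep only the label column (column 1)."""
    next(f)
    return np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)

X_demo = load_features(io.StringIO(x_text))
Y_demo = load_labels(io.StringIO(y_text))
print(X_demo.shape, Y_demo.shape)  # (3, 2) (3,)
```

The feature array ends up two-dimensional (one row per person), while the labels form a one-dimensional array of 0s and 1s.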
Store the data in NumPy arrays
Normalization
Splitting into training and validation sets
Use 10% of the training set as the validation set
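The normalization and split steps can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the `_normal` function from this write-up: the helper names `normalize` and `train_dev_split`, the z-score formula with a small epsilon, and the choice to take the last 10% of rows without shuffling are all assumptions made for the example.

```python
import numpy as np

def normalize(X, mean=None, std=None):
    """Z-score each column. Statistics are computed from X only when
    none are passed in, so training statistics can be reused on test data."""
    if mean is None:
        mean = X.mean(axis=0)
    if std is None:
        std = X.std(axis=0)
    # The small constant avoids division by zero for constant columns.
    return (X - mean) / (std + 1e-8), mean, std

def train_dev_split(X, Y, dev_ratio=0.1):
    """Hold out the last dev_ratio fraction of rows as the validation set."""
    dev_size = int(round(len(X) * dev_ratio))
    train_size = len(X) - dev_size
    return X[:train_size], Y[:train_size], X[train_size:], Y[train_size:]

# Synthetic stand-in for the real data (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
Y = (X[:, 0] > 5.0).astype(float)

X_norm, mean, std = normalize(X)
X_tr, Y_tr, X_dev, Y_dev = train_dev_split(X_norm, Y, dev_ratio=0.1)
print(X_tr.shape, X_dev.shape)  # (90, 3) (10, 3)
```

Returning the mean and standard deviation lets the same statistics be applied to the test set, for example `normalize(X_test, mean, std)`, so that test data is scaled the same way as the training data.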
import numpy as np
np.random.seed(0)
# Set the file paths
X_train_fpath = './hwdata/hw2/data/X_train'
X_test_fpath = './hwdata/hw2/data/X_test'
Y_train_fpath = './hwdata/hw2/data/Y_train'
output_fpath = './output_{}.csv'
# Load the features: skip the header row and drop the ID column (column 0)
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype=float)
# Load the labels: keep only column 1 (column 0 is the ID)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype=float)
# Normalization
def _normal(X, train = True, specificed_column = None, X_mean = None, X_std = None):
if specifice