Example 2: Estimating Income Bracket
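This section builds a classifier that estimates a person's income bracket from 14 attributes. The possible output classes are "higher than 50K" and "lower than or equal to 50K". This dataset is somewhat more complex than usual, because every data point is a mix of numbers and strings. The numeric values are meaningful as they are, so in this case they cannot be encoded with a label encoder. We need a system that can handle both numeric and non-numeric data. We will use data from the US Census Income dataset: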
https://archive.ics.uci.edu/ml/datasets/Census+Income
Detailed steps
Import the necessary packages. We use a naive Bayes classifier.
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
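For readers who have not used GaussianNB before, the following is a minimal sketch of its fit/predict cycle. The toy arrays `X_toy` and `y_toy` are made up for illustration and have nothing to do with the census data; they only show that the classifier expects purely numeric input, which is why the string attributes need to be encoded later.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up toy data (an assumption for illustration only):
# two numeric features per sample, two well-separated classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [8.0, 9.0], [8.2, 8.8], [7.8, 9.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Fit the Gaussian naive Bayes model and classify two new points.
clf = GaussianNB()
clf.fit(X_toy, y_toy)
toy_predictions = clf.predict(np.array([[1.1, 2.1], [7.9, 9.1]]))
print(toy_predictions)
```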
Load the data from the text file. The records look like this:
Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K
54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K
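To make the record layout concrete, here is a small sketch that splits the second sample record shown above into its fields. It assumes the fields are separated by a comma followed by a single space, and it treats a field as numeric when it consists only of digits; both are simplifying assumptions for this illustration.

```python
# Sample record copied from the text above.
record = ("54, Private, 302146, HS-grad, 9, Separated, Other-service, "
          "Unmarried, Black, Female, 0, 0, 20, United-States, <=50K")

# The last field is the income label; everything before it is an attribute.
fields = record.split(', ')
attributes, label = fields[:-1], fields[-1]

# Positions of attributes that consist only of digits (treated as numeric here).
numeric_positions = [i for i, value in enumerate(attributes) if value.isdigit()]

print(len(attributes))
print(label)
print(numeric_positions)
```

The record yields 14 attributes plus the label, and the numeric and string attributes are interleaved, which is why the columns have to be handled one by one.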
input_file = 'adult.data.txt'

# Reading the data
X = []
y = []
count_lessthan50k = 0
count_morethan50k = 0
# Maximum number of data points to keep per class
num_images_threshold = 30000

with open(input_file, 'r') as f:
    for line in f:
        # Skip records with missing values, which are marked with '?'
        if '?' in line:
            continue

        # Strip the trailing newline and split the record into fields
        data = line.strip().split(', ')

        # Keep the record only while its class is below the threshold
        if data[-1] == '<=50K' and count_lessthan50k < num_images_threshold:
            X.append(data)
            count_lessthan50k = count_lessthan50k + 1
        elif data[-1] == '>50K' and count_morethan50k < num_images_threshold:
            X.append(data)
            count_morethan50k = count_morethan50k + 1

        # Stop reading once both classes have reached the threshold
        if count_lessthan50k >= num_images_threshold and count_morethan50k >= num_images_threshold:
            break

X = np.array(X)
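The introduction calls for a system that handles both numeric and non-numeric attributes. One possible way to do this, sketched below, is to keep digit-only columns as integers and fit a separate `LabelEncoder` for every other column. The `rows` array is a made-up, shortened stand-in for the loaded data (age, work class, hours per week, label), and the names `encoded`, `label_encoders`, `X_encoded` and `y_encoded` are hypothetical; this is an illustration under those assumptions, not necessarily the exact procedure the text goes on to use.

```python
import numpy as np
from sklearn import preprocessing

# Made-up, shortened rows (an assumption): age, work class, hours, label.
rows = np.array([
    ['54', 'Private', '20', '<=50K'],
    ['39', 'State-gov', '40', '<=50K'],
    ['52', 'Private', '60', '>50K'],
])

label_encoders = {}  # column index -> fitted LabelEncoder
encoded = np.empty(rows.shape, dtype=int)

for col in range(rows.shape[1]):
    column = rows[:, col]
    if all(value.isdigit() for value in column):
        # Numeric column: keep the values, just convert them to integers.
        encoded[:, col] = column.astype(int)
    else:
        # String column: map each distinct string to an integer code.
        encoder = preprocessing.LabelEncoder()
        encoded[:, col] = encoder.fit_transform(column)
        label_encoders[col] = encoder

# The last column is the label; the rest are the features.
X_encoded = encoded[:, :-1]
y_encoded = encoded[:, -1]
print(X_encoded)
print(y_encoded)
```

Keeping the fitted encoders in a dictionary keyed by column index makes it possible to encode a new, unseen record later with the same string-to-integer mapping.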