Python 3 Machine Learning Cookbook, Study Notes 11: Classification Algorithms

wiliken

Category: Machine Learning

Article link: https://blog.csdn.net/wangsai07/article/details/79821847

Example 2: Estimating the Income Bracket

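In this section we build a classifier that estimates a person's income bracket from 14 attributes. The two possible output classes are "more than 50K" and "less than or equal to 50K". This dataset is somewhat complex, because each data point is a mix of numbers and strings. The numerical values carry real information, so we cannot simply run a label encoder over every column. We need a system that can handle both numerical and non-numerical data. We will use the US Census Income dataset: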
https://archive.ics.uci.edu/ml/datasets/Census+Income
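To make the mixed-type problem concrete, here is a minimal sketch of one possible approach: columns whose values are all digits stay numeric, and every other column is mapped to integer codes. Plain dictionaries stand in for a label encoder here so the example has no dependencies. The function name `encode_mixed` and the three sample rows are invented for illustration and are not taken from the dataset.

```python
def encode_mixed(rows):
    """Encode a table of strings column by column.

    Columns whose values are all digits are converted to int.
    Every other column is mapped to integer codes 0..k-1, assigned
    in sorted order of the distinct values.
    Returns the encoded rows and the per-column code tables.
    """
    n_cols = len(rows[0])
    encoded = [[None] * n_cols for _ in rows]
    code_tables = {}

    for col in range(n_cols):
        values = [row[col] for row in rows]
        if all(v.isdigit() for v in values):
            # Numeric column: keep the actual magnitude
            for i, v in enumerate(values):
                encoded[i][col] = int(v)
        else:
            # Non-numeric column: one integer code per distinct string
            table = {v: code for code, v in enumerate(sorted(set(values)))}
            code_tables[col] = table
            for i, v in enumerate(values):
                encoded[i][col] = table[v]

    return encoded, code_tables


# Hypothetical rows: age, workclass, hours per week, income label
rows = [
    ['39', 'State-gov', '40', '<=50K'],
    ['52', 'Private', '60', '>50K'],
    ['28', 'Private', '20', '<=50K'],
]
encoded, code_tables = encode_mixed(rows)
print(encoded)
```

Keeping the code tables around matters in practice: a new data point has to be encoded with the same string-to-integer mapping that was used for the training data.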

Detailed steps

Import the necessary packages. We will use a naive Bayes classifier.

import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB

Load the data from the text file. The records look like this:

Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 60, United-States, >50K

54, Private, 302146, HS-grad, 9, Separated, Other-service, Unmarried, Black, Female, 0, 0, 20, United-States, <=50K
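As a reading aid, the sketch below splits the complete sample record above into named fields. It assumes the column order given in the UCI description of this dataset. The names `COLUMNS` and `parse_record` are hypothetical helpers, and `income` is simply our own name for the label column.

```python
# Column names in the order given by the UCI dataset description;
# 'income' is our own name for the final label column.
COLUMNS = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num',
    'marital-status', 'occupation', 'relationship', 'race', 'sex',
    'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
    'income',
]


def parse_record(line):
    """Split one comma-separated record into a dict keyed by column name."""
    fields = line.strip().split(', ')
    if len(fields) != len(COLUMNS):
        raise ValueError('expected %d fields, got %d'
                         % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))


sample = ('54, Private, 302146, HS-grad, 9, Separated, Other-service, '
          'Unmarried, Black, Female, 0, 0, 20, United-States, <=50K')
record = parse_record(sample)
print(record['age'], record['occupation'], record['income'])
```

Every field is still a string at this point, which is why the numeric and non-numeric columns need different treatment before training.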

input_file = 'adult.data.txt'

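# Reading the data
X = []
y = []
count_lessthan50k = 0
count_morethan50k = 0
# Maximum number of records to keep per class
num_images_threshold = 30000
with open(input_file, 'r') as f:
    for line in f:
        # Skip records with missing values, which are marked with '?'
        if '?' in line:
            continue

        # strip() removes the line ending without cutting a character
        # off a final line that has no trailing newline
        data = line.strip().split(', ')

        # Keep the record only while its class is below the threshold
        if data[-1] == '<=50K' and count_lessthan50k < num_images_threshold:
            X.append(data)
            count_lessthan50k = count_lessthan50k + 1

        elif data[-1] == '>50K' and count_morethan50k < num_images_threshold:
            X.append(data)
            count_morethan50k = count_morethan50k + 1

        # Stop reading once both classes have reached the threshold
        if count_lessthan50k >= num_images_threshold and count_morethan50k >= num_images_threshold:
            break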

X = np.array(X)
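The loading loop caps the number of records kept per class and skips records containing `?`. The sketch below isolates that logic in a small function and runs it on an in-memory sample, so the behavior can be checked without the data file. The name `load_capped` and the seven three-field sample records are invented for illustration.

```python
import io


def load_capped(lines, max_per_class):
    """Collect records, keeping at most max_per_class per income label."""
    counts = {'<=50K': 0, '>50K': 0}
    records = []
    for line in lines:
        # Skip records with missing values, marked by '?'
        if '?' in line:
            continue
        data = line.strip().split(', ')
        label = data[-1]
        if label in counts and counts[label] < max_per_class:
            records.append(data)
            counts[label] += 1
        # Stop early once both classes are full
        if all(c >= max_per_class for c in counts.values()):
            break
    return records, counts


# Invented sample with three fields per record: age, workclass, label
sample_file = io.StringIO(
    '39, State-gov, <=50K\n'
    '50, ?, <=50K\n'
    '38, Private, <=50K\n'
    '53, Private, <=50K\n'
    '52, Self-emp-inc, >50K\n'
    '31, Private, >50K\n'
    '42, Private, >50K\n'
)
records, counts = load_capped(sample_file, max_per_class=2)
print(counts)
print(len(records))
```

Note that the cap only balances the classes if each class has at least that many usable records. When one class is smaller than the cap, the loop reads the whole file and the result stays imbalanced.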