Data Mining Example 2: Classification Rules with the OneR Method

00勇士王子

Category: python. Tags: data mining, sklearn, python

Article link: https://blog.csdn.net/qq_45813980/article/details/120606045

Preface

This example runs in a Python 3 environment, with Jupyter Notebook as the editor.

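Example Overview

Task: we use the well-known Iris plant classification dataset. It contains 150 samples, each described by four features: sepal length, sepal width, petal length, and petal width (all in cm). The dataset has three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. The goal is to predict a plant's species from its features. More abstractly, given the known feature vectors (4 values each) and their class labels (0, 1, 2), we want to output a class label for a new 4-value feature vector.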

Code and Comments

import numpy as np  # import NumPy under the conventional alias np

# Load the dataset; scikit-learn ships with it, so it can be imported directly
from sklearn.datasets import load_iris
# Alternative: read the data from text files with NumPy's loadtxt
#X, y = np.loadtxt("X_classification.txt"), np.loadtxt("y_classification.txt")
dataset = load_iris()
X = dataset.data  # feature matrix: each row holds the four features of one plant
y = dataset.target  # class labels: 0, 1 and 2 stand for the three species
# Print the dataset description stored under DESCR
print(dataset.DESCR)
# X.shape describes the array: rows are samples, columns are features
n_samples, n_features = X.shape  # dimensions of the data

print(dataset)  # print the whole dataset object

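# Compute the mean of each feature (one mean per column)
attribute_means = X.mean(axis=0)  # axis=0 averages down each column, giving an array of shape (n_features,)
assert attribute_means.shape == (n_features,)  # fail fast if the shape is not what we expect
X_d = np.array(X >= attribute_means, dtype='int')  # discretize: 1 if the value is at or above the feature mean, else 0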

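# Build the training and test sets: all data comes from the Iris dataset.
# With its default 25% test size, train_test_split yields 112 training samples and 38 test samples.
from sklearn.model_selection import train_test_split

# Set the random state to the same value as in the book to get the same results
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape[0]))
print("There are {} test samples".format(y_test.shape[0]))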
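To make the discretization step concrete, here is a small sketch on a made-up 3x2 array. The names `X_toy`, `toy_means` and `X_toy_d` and all values are illustrative assumptions, not taken from Iris. Each column is compared with its own mean, so every feature becomes 0 or 1.

```python
import numpy as np

# Toy data (hypothetical values): 3 samples, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [6.0, 30.0]])

# One threshold per feature: the column mean
toy_means = X_toy.mean(axis=0)

# Broadcasting compares every row against the per-feature thresholds
X_toy_d = np.array(X_toy >= toy_means, dtype='int')
print(toy_means)
print(X_toy_d)
```

Note that the code in this post computes the means on all 150 samples before splitting, so the thresholds also see the test rows; computing them on the training split only would be a stricter setup.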

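# Import defaultdict and itemgetter from the standard library
from collections import defaultdict
from operator import itemgetter
# The training function is called once per feature: for each feature it returns
# a mapping from feature value to predicted class, plus the total error count
def train(X, y_true, feature):
   
    # Check that the feature index is valid
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Collect all unique values this feature takes
    values = set(X[:,feature])
    # Maps each feature value to the class predicted for it
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Total number of errors made when classifying with this feature
    total_error = sum(errors)
    # Return the predictors and the total error
    return predictors, total_error

# Compute what the predictors say for each sample, based on its feature value
#y_predicted = np.array([predictors[sample[feature]] for sample in X])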
    
# For one feature value, find the most frequent class and count the errors
def train_feature_value(X, y_true, feature, value):
    # A simple dictionary counting how often each class occurs
    class_counts = defaultdict(int)
    # Iterate over the samples and count the classes of those with this feature value
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Sort by count (highest first) and take the first item as the best class
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that have this feature value
    # *and* do not belong to the most frequent class
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error

# Train a predictor for every feature
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now pick the best one and save it as "model"
# Sort by error, lowest first
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))

# Keep the best model
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)
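The selection logic above can be traced by hand on a tiny dataset. The following is a compact sketch of the same idea using only the standard library; the helper `one_rule_for_feature` and the toy `rows` and `labels` are hypothetical names and values introduced for illustration, not part of the original code.

```python
from collections import Counter, defaultdict

def one_rule_for_feature(rows, labels, feature):
    """Return (feature value -> majority class, total errors) for one feature."""
    counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        counts[row[feature]][label] += 1
    rule, errors = {}, 0
    for value, class_counts in counts.items():
        majority, majority_count = class_counts.most_common(1)[0]
        rule[value] = majority
        # Every sample with this value that is not in the majority class is an error
        errors += sum(class_counts.values()) - majority_count
    return rule, errors

# Toy binary data (hypothetical): feature 0 separates the classes better than feature 1
rows = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1, 1, 0]

results = {f: one_rule_for_feature(rows, labels, f) for f in range(2)}
best_feature = min(results, key=lambda f: results[f][1])
print(best_feature, results[best_feature])
```

Feature 0 makes one error (the last sample), while feature 1 makes two, so the rule built on feature 0 is kept.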

# Predict by looking up each sample's value for the chosen feature
def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted

# Use the function above to predict the class of every sample in the test set
y_predicted = predict(X_test, model)
print(y_predicted)
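One caveat: `predict` looks up each feature value in the `predictor` dictionary, so a value that never appeared in the training data would raise a `KeyError`. With binary features this is unlikely, but a fallback is easy to sketch. The function `predict_with_default`, its `default_class` parameter and the toy model below are assumptions added for illustration.

```python
import numpy as np

def predict_with_default(X_new, model, default_class):
    """Like predict(), but falls back to default_class for unseen feature values."""
    variable = model['variable']
    predictor = model['predictor']
    return np.array([predictor.get(int(sample[variable]), default_class)
                     for sample in X_new])

# Hypothetical model that only saw the value 0 for feature 1 during training
toy_model = {'variable': 1, 'predictor': {0: 2}}
X_new = np.array([[1, 0], [0, 1]])
toy_predictions = predict_with_default(X_new, toy_model, default_class=0)
print(toy_predictions)
```

A reasonable choice for `default_class` would be the most frequent class in the training labels.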

# Accuracy is the fraction of predictions in y_predicted that equal y_test
accuracy = np.mean(y_predicted == y_test) * 100
# Print the accuracy as a percentage
print("The test accuracy is {:.1f}%".format(accuracy))
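To judge whether an accuracy figure is any good, it helps to compare it with a trivial baseline that always predicts the most frequent class. The sketch below uses made-up label arrays (`y_true_toy`, `y_pred_toy`), not the actual Iris results.

```python
import numpy as np

# Hypothetical true and predicted labels for 8 samples
y_true_toy = np.array([0, 1, 2, 2, 1, 0, 2, 2])
y_pred_toy = np.array([0, 1, 2, 1, 1, 0, 2, 0])

# Element-wise comparison gives booleans; their mean is the fraction correct
accuracy_toy = np.mean(y_pred_toy == y_true_toy) * 100

# Baseline: always predict the most frequent class in y_true_toy
majority_class = np.bincount(y_true_toy).argmax()
baseline_toy = np.mean(y_true_toy == majority_class) * 100

print("model: {:.1f}%, baseline: {:.1f}%".format(accuracy_toy, baseline_toy))
```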
