Python Data Mining, Introduction and Practice: The OneR Classification Algorithm


路人张的鱼生


Article link: https://blog.csdn.net/zhangdy12307/article/details/88921341


The OneR algorithm

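OneR (One Rule) classifies an individual by looking at the existing data and asking which class the individuals with the same feature value most often belong to.
In this example, we only need to pick the one feature, out of the four Iris features, that classifies best and use it as the classification criterion.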
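
The idea can be sketched on a tiny hand-made example before touching the real data. The feature values and labels below are made up for illustration and are not taken from the Iris dataset; `samples`, `rule`, and `errors` are names chosen only for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (feature value, class label) pairs for ONE feature.
samples = [(0, "setosa"), (0, "setosa"), (0, "versicolor"),
           (1, "virginica"), (1, "virginica"), (1, "versicolor")]

# Count how often each class occurs for each feature value.
counts = defaultdict(Counter)
for value, label in samples:
    counts[value][label] += 1

# The rule maps each feature value to its most frequent class.
rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}

# Samples outside the majority class of their feature value are errors.
errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())

print(rule)    # {0: 'setosa', 1: 'virginica'}
print(errors)  # 2
```

OneR repeats this for every feature and keeps the feature whose rule makes the fewest errors.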

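Classifying plants with OneR

Discretization

The features in this dataset are continuous values. Converting continuous values into categorical ones is called discretization.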

1. Preparing the dataset

Each sample in the dataset has four features: sepal length, sepal width, petal length, and petal width. The dataset can be loaded from scikit-learn.

from sklearn.datasets import load_iris
import numpy as np
dataset = load_iris()
print(dataset.DESCR)
X = dataset.data
y = dataset.target

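2. Processing the data

Discretize the data with a threshold: feature values below the threshold become 0, and values at or above it become 1. The threshold for each feature is that feature's mean over all samples, computed as follows:

attribute_means = X.mean(axis=0)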

Convert the boolean comparison result to integers:

X_d = np.array(X >= attribute_means, dtype='int')
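
A quick check on made-up numbers shows why it matters which mean is used as the threshold. `X_toy` and the other names here are illustrative only. When features live on different scales, a single global mean can flatten a whole column to one value, while per-feature means split every column.

```python
import numpy as np

# Made-up measurements: 3 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

global_mean = X_toy.mean()           # a single number for the whole array
feature_means = X_toy.mean(axis=0)   # one mean per column (feature)

# Thresholding against the single global mean.
X_toy_global = np.array(X_toy >= global_mean, dtype='int')
# Broadcasting compares each column against its own mean.
X_toy_d = np.array(X_toy >= feature_means, dtype='int')

print(global_mean)     # 16.0
print(feature_means)   # [ 2. 30.]
print(X_toy_global)    # first column is all zeros
print(X_toy_d)         # both columns contain 0s and 1s
```

With the global threshold the first feature carries no information after discretization, so it could never be chosen as a useful rule.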

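3. Implementing OneR

With OneR, we compute the error of classifying by each feature and then pick the feature with the lowest error as the classification rule.
First, create a function whose parameters are the dataset, the array of classes, the index of the chosen feature, and the feature value. It returns the most frequent class for that value and the number of samples that class mislabels.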

from collections import defaultdict
from operator import itemgetter
def train_feature_value(X, y_true, feature_index, value):
    # Count how often each class occurs among samples with this feature value
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # The most frequent class becomes the prediction for this value
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Every sample belonging to any other class counts as one error
    incorrect_predictions = [class_count for class_value, class_count in class_counts.items()
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error

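For any given feature, iterate over each of its values, use the function above to compute the error for that value, and sum the results.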

def train_on_feature(X, y_true, feature_index):
    values = set(X[:,feature_index])
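    # predictors maps each feature value to its predicted class
    # errors collects the error count for each feature value
    predictors = {}
    errors = []
    # For each feature value, store its most likely class in predictors and record its error count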
    for current_value in values:
        most_frequent_class, error = train_feature_value(X,y_true, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    total_error = sum(errors)
    return predictors, total_error

4. Testing the algorithm

Splitting the dataset

from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, random_state=14)

Next, compute the target class for every value of every feature (the predictors):

all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictors, total_error = train_on_feature(Xd_train,y_train,feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error

Find the feature with the lowest error and use it as the classification rule:

best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
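
The selection step can be tried in isolation. The error counts below are hypothetical and are not the real Iris results; `toy_errors` is a name used only for this sketch. It also shows that `min` gives the same answer without sorting the whole dictionary.

```python
from operator import itemgetter

# Hypothetical error counts per feature index (not the real Iris results).
toy_errors = {0: 41, 1: 58, 2: 37, 3: 37}

# Sort by error count and take the first item, as in the text.
best_feature, best_error = sorted(toy_errors.items(), key=itemgetter(1))[0]

# Equivalent selection without a full sort.
best_feature_min, best_error_min = min(toy_errors.items(), key=itemgetter(1))

print(best_feature, best_error)          # 2 37
print(best_feature_min, best_error_min)  # 2 37
```

When two features tie, both approaches return the one that was inserted into the dictionary first, because `sorted` is stable and `min` returns the first minimal item.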

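Build the model from the best feature and its predictors, then define a function that makes predictions by looping over every sample in the dataset.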

model = {'feature': best_feature,'predictor': all_predictors[best_feature]}

def predict(X_test, model):
    feature = model['feature']
    predictor = model['predictor']
    y_predicted = [predictor[int(sample[feature])] for sample in X_test]
    return y_predicted
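
One caveat: `predict` looks values up with `predictor[...]`, so a feature value that never appeared in the training split would raise a `KeyError`. That should not happen here, since the 0/1 features will normally both show up in training, but a more defensive variant could look like the sketch below. `predict_with_default` and `default_class` are hypothetical names introduced for this illustration, not part of the original code.

```python
def predict_with_default(X_test, model, default_class):
    """Like predict, but returns default_class for feature values unseen in training."""
    feature = model['feature']
    predictor = model['predictor']
    return [predictor.get(int(sample[feature]), default_class) for sample in X_test]

# Hypothetical model: feature 2 was chosen; value 0 -> class 0, value 1 -> class 2.
toy_model = {'feature': 2, 'predictor': {0: 0, 1: 2}}
# The last row carries the value 5 for feature 2, which the model has never seen.
toy_samples = [[1, 0, 0, 1], [0, 1, 1, 0], [0, 0, 5, 0]]

toy_predictions = predict_with_default(toy_samples, toy_model, default_class=1)
print(toy_predictions)  # [0, 2, 1]
```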

Compare the predictions with the actual classes to get the accuracy:

y_predicted = predict(Xd_test, model)
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))
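
To put an accuracy figure in context, one common sanity check is to compare it against a baseline that always predicts the most frequent class in the training set. The sketch below is an addition for illustration, using made-up labels. `majority_baseline_accuracy` is a hypothetical helper, not something from the original post or from scikit-learn.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy (in percent) of always predicting the most common training class."""
    majority_class = Counter(y_train).most_common(1)[0][0]
    hits = sum(1 for label in y_test if label == majority_class)
    return 100.0 * hits / len(y_test)

# Made-up labels: class 2 is the most common class in the training part.
toy_y_train = [0, 0, 1, 2, 2, 2]
toy_y_test = [2, 2, 0, 1]

baseline = majority_baseline_accuracy(toy_y_train, toy_y_test)
print("The baseline accuracy is {:.1f}%".format(baseline))  # 50.0%
```

A OneR model is only worth keeping if it clearly beats this kind of baseline on the same test split.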

Running the code prints the test accuracy in the format given in the print statement.

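Closing thoughts

The idea behind OneR is simple, but the whole programming process is fairly complex for a beginner, and I still need more practice with algorithms and data structures.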
