Data Mining Algorithms in Machine Learning (1): The OneR Algorithm

I. A First Look at the OneR Algorithm

1. In data mining we meet many fairly complex classification algorithms, such as kNN and decision trees. Is there a simpler classification algorithm? There is: the OneR algorithm.

2. The idea: OneR stands for "One Rule", i.e. a classifier built from a single rule. For each feature, the algorithm iterates over every value that feature takes, counts how many times each class appears with that value, takes the most frequent class as the prediction for that value, and sums the counts of the remaining classes as that value's error. In plain terms, OneR uses the training set to find the single feature that classifies best (the one whose rule makes the fewest errors) and uses that feature alone as the basis for classification. Its accuracy is not high, but the simplicity of the idea makes it surprisingly handy at times; a minimal sketch of the whole procedure follows.
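Before the step-by-step implementation, here is a compact sketch of the whole idea in plain Python. The name one_r and its arguments are placeholders for this sketch only; X is assumed to already hold discrete feature values and y the class labels.

from collections import Counter

def one_r(X, y):
    # Try each feature in turn and keep the one whose rule errs least
    best_feature, best_rule, best_error = None, None, float('inf')
    for f in range(len(X[0])):
        rule, error = {}, 0
        for v in set(row[f] for row in X):
            # Count the classes seen among samples where feature f equals v
            counts = Counter(label for row, label in zip(X, y) if row[f] == v)
            majority_class, majority_count = counts.most_common(1)[0]
            rule[v] = majority_class                        # predict the majority class
            error += sum(counts.values()) - majority_count  # the rest are errors
        if error < best_error:
            best_feature, best_rule, best_error = f, rule, error
    return best_feature, best_rule, best_error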

3. Preparation: we use the Iris dataset (load_iris) and implement the algorithm in Python 3. The official description of the dataset follows:

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

If scikit-learn is installed on your machine, you can view this description with the following code:

# Load the dataset and print its official description
from sklearn.datasets import load_iris

dataset = load_iris()
print(dataset.DESCR)

II. The Algorithm, Step by Step

First, we load the data:

from sklearn.datasets import load_iris

dataset = load_iris()

# Get the feature matrix X and the class labels y
X = dataset.data
y = dataset.target

Then get the number of samples and the number of features:

n_samples, n_features = X.shape

print(n_samples, n_features)
150 4

Because the features in this dataset are continuous, we need to discretize them. That means choosing a threshold for each feature; the simplest choice is the feature's mean. Values greater than or equal to the threshold become 1, and values below it become 0.

import numpy as np

# Use each feature's mean as its threshold
attribute_means = X.mean(axis=0)

# Discretize: 1 where a value is >= its feature's mean, 0 otherwise
X_d = np.array(X >= attribute_means, dtype='int')

We now have the discretized dataset: a 150 x 4 array whose entries are all 0s and 1s. This simplified representation of the features makes the data much easier to work with.
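To see what the discretization did, you can peek at the first few rows; with the mean thresholds above, the output should look something like this:

# First five rows of the discretized data (exact values depend on the thresholds)
print(X_d[:5])
# [[0 1 0 0]
#  [0 0 0 0]
#  [0 1 0 0]
#  [0 1 0 0]
#  [0 1 0 0]]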

Next we create a function that predicts the class from a single feature value and reports how many training samples that prediction gets wrong. First, import defaultdict and itemgetter:

from collections import defaultdict
from operator import itemgetter

Now declare the function that trains on one feature value (the features have already been discretized to 0 and 1). Its parameters are the dataset, the array of class labels, the index of the chosen feature, and the feature value.

def train_feature_value(X, y_true, feature_index, value):
    # Count how often each class appears among samples whose
    # chosen feature equals the given value
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # The most frequent class becomes the prediction for this value
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Samples of every other class count as errors
    incorrect_predictions = [class_count for class_value, class_count
                             in class_counts.items()
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error

Now train over an entire feature. (Note: the function above trains on one specific feature value, e.g. 0; the function below calls it for both values, 0 and 1, to build the complete predictor for that feature.)

def train_on_feature(X, y_true, feature_index):
    # All values this feature takes (0 and 1 after discretization)
    values = set(X[:, feature_index])
    predictors = {}
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # The feature's total error is the sum over all of its values
    total_error = sum(errors)
    return predictors, total_error

This gives us, for each of the feature values 0 and 1, the class it is most strongly associated with, along with the total error, which is the sum of the errors for both values.
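As a quick sanity check, you could run it on a single feature, say index 2 (petal length); the call below uses the full discretized dataset purely for illustration:

# Train a one-feature rule on petal length (index 2); predictors maps each
# discretized value (0 or 1) to its majority class, and total_error counts
# the samples the rule gets wrong
predictors, total_error = train_on_feature(X_d, y, 2)
print(predictors, total_error)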

Now let's test it. We split the dataset into a training set and a test set; scikit-learn provides a function for exactly this (since version 0.18 it lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed):

from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, random_state=14)

Next, compute the best-matching class for every feature, remembering to use only the training set: iterate over each feature, train a predictor for it with the train_on_feature() function defined above, and record its error.

all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictors, total_error = train_on_feature(Xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error
# The best feature is the one whose rule makes the fewest errors
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
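It can be instructive to print the per-feature errors and see which feature wins:

# Each feature's total training error; the smallest belongs to best_feature
print(errors)
print("Best feature: {0} (error: {1})".format(best_feature, best_error))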

We then build the model: a dict recording the best feature and its predictor.

model = {'feature': best_feature, 'predictor': all_predictors[best_feature]}

Now verify it on the test set. To classify a single sample (sample below stands for one row of Xd_test):

variable = model['feature']
predictor = model['predictor']
# Look up the class predicted for this sample's discretized feature value
prediction = predictor[int(sample[variable])]

Next, we wrap this in a prediction function:

def predict(X_test, model):
    feature = model['feature']
    predictor = model['predictor']
    # Look up the predicted class for each sample's discretized feature value
    y_predicted = [predictor[int(sample[feature])] for sample in X_test]
    return y_predicted

We use this function to predict the class of every sample in the test set, then compute the accuracy:

y_predicted = predict(Xd_test, model)
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))

The test accuracy is 65.8%

Finally, print a classification report:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97        17
          1       0.00      0.00      0.00        13
          2       0.40      1.00      0.57         8

avg / total       0.51      0.66      0.55        38

Note: on this test split the learned rule only ever predicts two of the three classes (class 1 is never predicted), which is one of OneR's limitations. Still, for such a simple rule, an accuracy of about 65% is quite respectable.
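You can see this limitation directly by printing the model's predictor: the discretized feature takes only the values 0 and 1, so the rule can map to at most two distinct classes.

# The rule has only two branches, so at most two classes can be predicted
print(model['predictor'])   # e.g. {0: 0, 1: 2}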
