Data Mining Algorithms in Machine Learning (1): The OneR Algorithm

I. A First Look at the OneR Algorithm

1. In data mining we meet many fairly complex classification algorithms, such as kNN and decision trees. Is there a simpler classification algorithm? There is: the OneR algorithm.

2. The idea: OneR stands for "One Rule", i.e. a classifier built from a single rule. For each feature, the algorithm iterates over every value that feature takes, counts how many times each class appears with that value, takes the most frequent class as the prediction for that value, and sums the counts of the remaining classes as that value's error. In plain terms, OneR uses the training set to find the single feature that classifies best (the one whose rule makes the fewest errors) and uses that feature alone as the basis for classification. Its accuracy is not high, but the simplicity of the idea makes it surprisingly handy at times; a minimal sketch of the whole procedure follows.
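Before the step-by-step implementation, here is a compact sketch of the whole idea in plain Python. The name one_r and its arguments are placeholders for this sketch only; X is assumed to already hold discrete feature values and y the class labels.

from collections import Counter

def one_r(X, y):
    # Try each feature in turn and keep the one whose rule errs least
    best_feature, best_rule, best_error = None, None, float('inf')
    for f in range(len(X[0])):
        rule, error = {}, 0
        for v in set(row[f] for row in X):
            # Count the classes seen among samples where feature f equals v
            counts = Counter(label for row, label in zip(X, y) if row[f] == v)
            majority_class, majority_count = counts.most_common(1)[0]
            rule[v] = majority_class                        # predict the majority class
            error += sum(counts.values()) - majority_count  # the rest are errors
        if error < best_error:
            best_feature, best_rule, best_error = f, rule, error
    return best_feature, best_rule, best_error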

3. Preparation: we use the Iris dataset (load_iris) and implement the algorithm in Python 3. The official description of the dataset follows:

Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

If scikit-learn is installed on your machine, you can view this description with the following code:

# Load the dataset and print its official description
from sklearn.datasets import load_iris

dataset = load_iris()
print(dataset.DESCR)

II. The Algorithm, Step by Step

First, we load the data:

from sklearn.datasets import load_iris

dataset = load_iris()

# Get the feature matrix X and the class labels y
X = dataset.data
y = dataset.target

Then get the number of samples and the number of features:

n_samples, n_features = X.shape

print(n_samples, n_features)
150 4

Because the features in this dataset are continuous, we need to discretize them. That means choosing a threshold for each feature; the simplest choice is the feature's mean. Values greater than or equal to the threshold become 1, and values below it become 0.

import numpy as np

# Use each feature's mean as its threshold
attribute_means = X.mean(axis=0)

# Discretize: 1 where a value is >= its feature's mean, 0 otherwise
X_d = np.array(X >= attribute_means, dtype='int')

We now have the discretized dataset: a 150 x 4 array whose entries are all 0s and 1s. This simplified representation of the features makes the data much easier to work with.
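To see what the discretization did, you can peek at the first few rows; with the mean thresholds above, the output should look something like this:

# First five rows of the discretized data (exact values depend on the thresholds)
print(X_d[:5])
# [[0 1 0 0]
#  [0 0 0 0]
#  [0 1 0 0]
#  [0 1 0 0]
#  [0 1 0 0]]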

Next we create a function that predicts the class from a single feature value and reports how many training samples that prediction gets wrong. First, import defaultdict and itemgetter:

from collections import defaultdict
from operator import itemgetter

Now declare the function that trains on one feature value (the features have already been discretized to 0 and 1). Its parameters are the dataset, the array of class labels, the index of the chosen feature, and the feature value.

def train_feature_value(X, y_true, feature_index, value):
    # Count how often each class appears among samples whose
    # chosen feature equals the given value
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # The most frequent class becomes the prediction for this value
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # Samples of every other class count as errors
    incorrect_predictions = [class_count for class_value, class_count
                             in class_counts.items()
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error

Now train over an entire feature. (Note: the function above trains on one specific feature value, e.g. 0; the function below calls it for both values, 0 and 1, to build the complete predictor for that feature.)

def train_on_feature(X, y_true, feature_index):
    # All values this feature takes (0 and 1 after discretization)
    values = set(X[:, feature_index])
    predictors = {}
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature_index, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # The feature's total error is the sum over all of its values
    total_error = sum(errors)
    return predictors, total_error

This gives us, for each of the feature values 0 and 1, the class it is most strongly associated with, along with the total error, which is the sum of the errors for both values.
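As a quick sanity check, you could run it on a single feature, say index 2 (petal length); the call below uses the full discretized dataset purely for illustration:

# Train a one-feature rule on petal length (index 2); predictors maps each
# discretized value (0 or 1) to its majority class, and total_error counts
# the samples the rule gets wrong
predictors, total_error = train_on_feature(X_d, y, 2)
print(predictors, total_error)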

Now let's test it. We split the dataset into a training set and a test set; scikit-learn provides a function for exactly this (since version 0.18 it lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed):

from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, y, random_state=14)

Next, compute the best-matching class for every feature, remembering to use only the training set: iterate over each feature, train a predictor for it with the train_on_feature() function defined above, and record its error.

all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictors, total_error = train_on_feature(Xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictors
    errors[feature_index] = total_error
# The best feature is the one whose rule makes the fewest errors
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
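It can be instructive to print the per-feature errors and see which feature wins:

# Each feature's total training error; the smallest belongs to best_feature
print(errors)
print("Best feature: {0} (error: {1})".format(best_feature, best_error))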

We then build the model: a dict recording the best feature and its predictor.

model = {'feature': best_feature, 'predictor': all_predictors[best_feature]}

Now verify it on the test set. To classify a single sample (sample below stands for one row of Xd_test):

variable = model['feature']
predictor = model['predictor']
# Look up the class predicted for this sample's discretized feature value
prediction = predictor[int(sample[variable])]

Next, we wrap this in a prediction function:

def predict(X_test, model):
    feature = model['feature']
    predictor = model['predictor']
    # Look up the predicted class for each sample's discretized feature value
    y_predicted = [predictor[int(sample[feature])] for sample in X_test]
    return y_predicted

We use this function to predict the class of every sample in the test set, then compute the accuracy:

y_predicted = predict(Xd_test, model)
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))

The test accuracy is 65.8%

Finally, print a classification report:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_predicted))

             precision    recall  f1-score   support

          0       0.94      1.00      0.97        17
          1       0.00      0.00      0.00        13
          2       0.40      1.00      0.57         8

avg / total       0.51      0.66      0.55        38

Note: on this test split the learned rule only ever predicts two of the three classes (class 1 is never predicted), which is one of OneR's limitations. Still, for such a simple rule, an accuracy of about 65% is quite respectable.
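You can see this limitation directly by printing the model's predictor: the discretized feature takes only the values 0 and 1, so the rule can map to at most two distinct classes.

# The rule has only two branches, so at most two classes can be predicted
print(model['predictor'])   # e.g. {0: 0, 1: 2}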
