Study Notes on 《Python数据挖掘入门与实践》 (Python Data Mining: Introduction and Practice), Part 1

lazy_wzyuan

Category: data mining study notes
Tags: data mining, python, notes

Article link: https://blog.csdn.net/qq_34190232/article/details/96150451

These notes follow the book 《Python数据挖掘入门与实践》. The datasets and source code can be downloaded from the 图灵社区 (Turing community) site.

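I. Affinity Analysis

1. Dataset analysis

1) Affinity analysis determines how closely related individuals are, based on the similarity between them.

2) The original dataset has shape (100, 5). The five columns stand for bread, milk, cheese, apples and bananas.
Each row is an individual and each column is a feature. The following code inspects the dataset.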

import numpy as np

dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
print("The dataset contains {0} samples and {1} features".format(n_samples, n_features))
print(X[:5])
# Column names, in the same order as the columns of X
features = ["bread", "milk", "cheese", "apples", "bananas"]

Output:
[[ 0. 0. 1. 1. 1.]
 [ 1. 1. 0. 1. 0.]
 [ 1. 0. 1. 1. 0.]
 [ 0. 0. 1. 1. 1.]
 [ 0. 1. 0. 0. 1.]]
Here 0 means the item was not bought and 1 means it was bought.

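2. Algorithm implementation

3) A simple way to rank rules. A rule has the form "if a customer buys A, they will also buy B".
Support and confidence are used here to measure how good a rule is. Support is the number of times the rule holds in the dataset (it can be divided by the total number of samples to give a proportion, but these notes use the raw count). Confidence measures how accurate the rule is in the cases where its premise applies.
support = number of samples in which the rule holds
confidence = support / (number of samples in which the rule holds + number of samples in which it fails)

4) Compute the support and confidence of the rule "if a customer buys apples, they will also buy bananas".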

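# Does a customer who buys apples also buy bananas?
# Count how many times the rule holds and how many times it fails.
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # premise: the customer bought apples
        if sample[4] == 1:
            # bought both apples and bananas
            rule_valid += 1
        else:
            # bought apples but not bananas
            rule_invalid += 1
print("The rule held {0} times".format(rule_valid))
print("The rule failed {0} times".format(rule_invalid))
support = rule_valid  # support
confidence = rule_valid / (rule_valid + rule_invalid)  # confidence
print("Support is {0}, confidence is {1:.3f}.".format(support, confidence))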
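The same two numbers can also be computed without an explicit loop. The following is a minimal sketch, assuming a small hand-made 0/1 matrix `X_demo` (hypothetical data, not the book's dataset) with the same column order; `rule_stats` is a helper name introduced here only for illustration.

```python
import numpy as np

# Hypothetical toy data: columns are bread, milk, cheese, apples, bananas
X_demo = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])


def rule_stats(X, premise, conclusion):
    """Return (support, confidence) for the rule premise -> conclusion."""
    bought_premise = X[:, premise] == 1
    bought_both = bought_premise & (X[:, conclusion] == 1)
    support = int(bought_both.sum())        # times the rule holds
    n_premise = int(bought_premise.sum())   # times the premise applies
    confidence = support / n_premise if n_premise else float("nan")
    return support, confidence


support, confidence = rule_stats(X_demo, premise=3, conclusion=4)
print("support = {0}, confidence = {1:.3f}".format(support, confidence))
```

In this toy matrix four customers bought apples and two of them also bought bananas, so the rule has a support of 2 and a confidence of 0.5.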

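5) Collect the statistics for every rule in the dataset.
This step uses Python's sorted function together with operator.itemgetter. An introduction:
https://www.cnblogs.com/100thMountain/p/4719503.html
Official Python wiki:
https://wiki.python.org/moin/HowTo/Sorting/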
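As a quick illustration of how sorted and itemgetter work together, here is a minimal sketch on a small hand-written dictionary (`demo_support` and its contents are made up for the example and are not produced by the code in these notes).

```python
from operator import itemgetter

# Hypothetical support counts, keyed by (premise, conclusion)
demo_support = {(2, 4): 27, (3, 4): 21, (2, 3): 25}

# items() yields ((premise, conclusion), count) pairs;
# itemgetter(1) selects the count as the sort key, reverse=True sorts descending
ranked = sorted(demo_support.items(), key=itemgetter(1), reverse=True)
print(ranked)

# The key of the highest-ranked entry is the (premise, conclusion) tuple
top_rule = ranked[0][0]
print(top_rule)
```

itemgetter(1) behaves like `lambda pair: pair[1]`, so the same call can sort by key instead by switching to itemgetter(0).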

# Split each rule into premise -> conclusion
from collections import defaultdict

# Store the counts in dictionaries
valid_rules = defaultdict(int)  # keys are (premise, conclusion) tuples such as (1, 2) or (3, 4)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)  # keys are the premise indices 0, 1, 2, 3, 4
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue  # the premise must hold first
        # the premise holds
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:  # skip rules whose premise and conclusion are the same
                continue
            if sample[conclusion] == 1:
                # premise and conclusion both hold: the rule holds
                valid_rules[(premise, conclusion)] += 1
            else:
                # the rule fails
                invalid_rules[(premise, conclusion)] += 1
support = valid_rules
confidence = defaultdict(float)  # dictionary with float values
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]
for premise, conclusion in confidence.keys():
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

def print_rule(premise, conclusion, support, confidence, features):
    # Print one rule together with its confidence and support
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")

from operator import itemgetter

# items() returns ((premise, conclusion), support) pairs; key=itemgetter(1) sorts by
# support, and reverse=True sorts in descending order
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):  # take the top 5
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]  # key of the sorted entry
    print_rule(premise, conclusion, support, confidence, features)  # helper defined above

# Sort by confidence
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)

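Output (sorted by support):
Rule #1
Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.659
 - Support: 27

Rule #2
Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.458
 - Support: 27

Rule #3
Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.610
 - Support: 25

Rule #4
Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.694
 - Support: 25

Rule #5
Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.583
 - Support: 21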
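The support and confidence of all rules can also be obtained in one step from a co-occurrence matrix. This is only a sketch of an alternative, not the book's approach, and it assumes a small hand-made 0/1 matrix `X_demo` (hypothetical data) in which every item is bought at least once.

```python
import numpy as np

# Hypothetical toy data: columns are bread, milk, cheese, apples, bananas
X_demo = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

# co_counts[i, j] = number of samples that contain both item i and item j,
# which is the support of the rule i -> j
co_counts = X_demo.T @ X_demo

# The diagonal holds how often each item was bought on its own terms
item_counts = np.diag(co_counts)

# conf_matrix[i, j] = support(i -> j) / occurrences of premise i
# (diagonal entries, where premise == conclusion, are not meaningful rules)
conf_matrix = co_counts / item_counts[:, None]

print(co_counts[3, 4], conf_matrix[3, 4])  # rule: apples -> bananas
```

If some item never appears in the data, its row would divide by zero, so real code would need to handle that case.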

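II. The OneR Algorithm and Its Implementation

1. The OneR algorithm

OneR is short for One Rule: only a single rule is used. Normally several features are combined to predict which class an individual belongs to. OneR instead picks the one feature that classifies most accurately and predicts from that feature alone. It is as simple and blunt as that.

The well-known Iris dataset is used next to check how well this algorithm classifies.

2. The Iris dataset

The dataset contains 150 plant samples. Each sample has four features: sepal length, sepal width, petal length and petal width, all measured in cm. The samples belong to three species: Iris Setosa, Iris Versicolour and Iris Virginica. An algorithm can therefore try to infer the species of a plant from its features.
This classic dataset ships with the scikit-learn library.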

import numpy as np
# Load the dataset
from sklearn.datasets import load_iris
#X, y = np.loadtxt("X_classification.txt"), np.loadtxt("y_classification.txt")
dataset = load_iris()
X = dataset.data
y = dataset.target  # class of each sample: 0, 1 or 2
#print(dataset.DESCR)  # show the description of the dataset
n_samples, n_features = X.shape
print(X[:5])

Output:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
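The feature values are clearly continuous, which does not suit the algorithm, so they have to be discretized. One way is to take the mean of each feature as a threshold: values greater than or equal to the threshold become 1, all others become 0.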

# Discretization
# Compute the mean of each feature
attribute_means = X.mean(axis=0)  # axis=0 averages over rows, giving one mean per column
assert attribute_means.shape == (n_features,)  # one mean for each of the 4 features
print(attribute_means)
X_d = np.array(X >= attribute_means, dtype='int')  # 1 if value >= mean, otherwise 0
print(X_d[:5])

Output:
[5.84333333 3.05733333 3.758 1.19933333]
[[0 1 0 0]
[0 0 0 0]
[0 1 0 0]
[0 1 0 0]
[0 1 0 0]]
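The discretized result looks familiar: it resembles the dataset used by the first algorithm.
Next, split the dataset into a training set and a test set, using train_test_split from scikit-learn.
A short usage guide:
https://blog.csdn.net/samsam2013/article/details/80702582
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)
train_data: the feature set to split
train_target: the labels to split
test_size: the proportion of samples placed in the test set; if it is an integer, it is the absolute number of test samples
random_state: the seed of the random number generator.
Random seed: it identifies one particular sequence of random numbers, so that repeated experiments produce the same split. If you pass the same integer every time (1, for example, or 0) and keep the other arguments unchanged, you get the same split. Only when random_state is left unset (None) does the split differ from run to run.

stratify keeps the class distribution the same as before the split. Suppose there are 100 samples, 80 of class A and 20 of class B. With train_test_split(… test_size=0.4, stratify = y_all), the data after the split looks like this:
training: 60 samples, of which 48 belong to class A and 12 to class B.
testing: 40 samples, of which 32 belong to class A and 8 to class B.

With stratify, the class ratio in both the training set and the test set is A:B = 4:1, the same as before the split (80:20). stratify is typically used when the class distribution is imbalanced like this.
The array passed to stratify is treated as the class labels whose proportions are preserved, so in practice it is the label array:
stratify=y splits according to the class proportions in y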
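The effect of stratify can be checked on artificial data. The following sketch assumes 100 made-up samples (80 of class "A", 20 of class "B") and test_size=0.4; the names `X_all`, `y_all`, `X_tr`, `X_te`, `y_tr` and `y_te` are introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class "A", 20 of class "B"
y_all = np.array(["A"] * 80 + ["B"] * 20)
X_all = np.arange(100).reshape(-1, 1)  # dummy single-feature data

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.4, random_state=0, stratify=y_all
)

# Count each class in both parts of the split
train_counts = {label: int((y_tr == label).sum()) for label in ("A", "B")}
test_counts = {label: int((y_te == label).sum()) for label in ("A", "B")}
print("train:", train_counts)
print("test:", test_counts)
```

Both parts keep the 4:1 ratio of the full label array. Dropping the stratify argument leaves the class counts to chance, so they generally deviate from that ratio.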

# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
random_state = 14
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("Training samples: {}".format(y_train.shape[0]))
print("Test samples: {}".format(y_test.shape[0]))

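3. Algorithm implementation

Every feature has only two values. For each value (0 or 1) of a feature, count how often it occurs in each class, find the class in which it occurs most often, and count its occurrences in the other classes. For example, suppose that for some feature, among the individuals with value 0, 20 belong to class 0, 70 to class 1 and 10 to class 2. An individual with value 0 then most likely belongs to class 1, and this prediction has an error rate of 30%.
Counting every feature value in every class this way gives an error for each feature, and the feature with the lowest error is chosen as the one rule.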
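The arithmetic of the example above (20, 70 and 10 individuals in classes 0, 1 and 2) can be written out as a short sketch; the dictionary `class_counts` is filled in by hand here rather than computed from the Iris data.

```python
from operator import itemgetter

# Hand-written class counts for individuals whose feature value is 0
class_counts = {0: 20, 1: 70, 2: 10}

# The predicted class is the one with the largest count
most_frequent_class, _ = max(class_counts.items(), key=itemgetter(1))

# Every individual in another class is a misclassification
error = sum(count for cls, count in class_counts.items()
            if cls != most_frequent_class)
error_rate = error / sum(class_counts.values())

print(most_frequent_class, error, error_rate)
```

The predicted class is 1, with 30 misclassified individuals out of 100, which is the 30% error rate mentioned above.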

from collections import defaultdict
from operator import itemgetter

def train_feature_value(X, y_true, feature, value):
    # feature is the feature index (0, 1, 2 or 3); value is the feature value (0 or 1)
    # Count, per class, the individuals that have the given feature value
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Sort by count to find the class in which this feature value occurs most often
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    # sorted_class_counts looks like [(0, 33), (1, 15), (2, 4)] (feature 0, value 0),
    # i.e. value 0 of the first feature occurs most often in class 0
    most_frequent_class = sorted_class_counts[0][0]  # class with the most occurrences
    # The error is the number of occurrences in all other classes
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error

def train(X, y_true, feature):
    """
    Parameters
    ----------
    X: array [n_samples, n_features], the dataset
    y_true: array [n_samples,], the class labels

    feature: int
        index of the chosen feature
        0 <= feature < n_features

    Returns
    -------
    predictors: dict mapping each feature value to its predicted class
    total_error: number of training samples the predictors misclassify
    """
    # Check that feature is a valid index
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Collect the distinct values of this feature in a set
    values = set(X[:, feature])  # a set has no duplicates, so only 0 and 1 remain
    predictors = {}  # key: feature value, value: predicted class
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Sum the errors over all feature values
    total_error = sum(errors)
    return predictors, total_error

4. Testing the algorithm

# Train a predictor for every feature
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}  # mapping is the predictors dict
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]  # ascending: smallest error is the best model
print("The best model uses feature {0} with an error of {1:.2f}".format(best_variable, best_error))
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}
print(model)  # the model has two entries: the feature used for classification and the predictor

def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[sample[variable]] for sample in X_test])
    return y_predicted

y_predicted = predict(X_test, model)
print(y_predicted)

Output:
[0 0 0 2 2 2 0 2 0 2 2 0 2 2 0 2 0 2 2 2 0 0 0 2 0 2 0 2 2 0 0 0 2 0 2 0 2 2]

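# Compute the prediction accuracy
accuracy = np.mean(y_predicted == y_test) * 100
print("Accuracy is {:.1f}%".format(accuracy))

5. Reflections

This exercise showed me how powerful dictionaries are. The discretization used for OneR in the book reduces every feature to two values, so each feature becomes binary. What if the features were discretized into three values instead? Would the accuracy improve? And how should the discretization be done?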
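One possible answer to the "how" question is to cut each feature at its 1/3 and 2/3 quantiles. The sketch below is an assumption of mine rather than something from the book; `discretize_three` and `X_demo` are hypothetical names, and the demo array is made up so the result is easy to check by hand.

```python
import numpy as np


def discretize_three(X):
    """Map each column to 0/1/2 using its 1/3 and 2/3 quantiles as cut points."""
    cuts = np.quantile(X, [1 / 3, 2 / 3], axis=0)  # shape (2, n_features)
    X_d3 = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # np.digitize gives 0 below the first cut, 1 between the cuts,
        # and 2 at or above the second cut
        X_d3[:, j] = np.digitize(X[:, j], cuts[:, j])
    return X_d3


# Hypothetical data with two features
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                   [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
X_d3 = discretize_three(X_demo)
print(X_d3)
```

Because train collects the feature values with set(X[:, feature]), it would accept three values per feature without modification. Whether the accuracy actually improves on the Iris data has to be checked by running the experiment.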
