1. How the Apriori algorithm works
For the theory behind Apriori, see:
Python -- 深入浅出Apriori关联分析算法(一) (a plain-language introduction to Apriori association analysis, part 1), www.cnblogs.com
2. Using the Apriori algorithm in Python
First, view the help documentation for the apriori function:
from mlxtend.frequent_patterns import apriori
help(apriori)
Help on function apriori in module mlxtend.frequent_patterns.apriori:
apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
Get frequent itemsets from a one-hot DataFrame
Parameters
-----------
df : pandas DataFrame
pandas DataFrame the encoded format. Also supports
DataFrames with sparse data;
Please note that the old pandas SparseDataFrame format
is no longer supported in mlxtend >= 0.17.2.
The allowed values are either 0/1 or True/False.
For example,
# The apriori function is picky about its input data type!
# It must be a pandas DataFrame, one-hot encoded so that the item names are the column names and the values are True/False.
# Each row represents one customer's single shopping basket.
# Record 0 is [Apple, Beer, Chicken, Rice], record 1 is [Apple, Beer, Rice], and so on.
```
Apple Bananas Beer Chicken Milk Rice
0 True False True True False True
1 True False True False False True
2 True False True False False False
3 True True False False False False
4 False False True True True True
5 False False True False True True
6 False False True False True False
7 True True False False False False
```
min_support : float (default: 0.5)  # minimum support
A float between 0 and 1 for minimum support of the itemsets returned.
The support is computed as the fraction
`transactions_where_item(s)_occur / total_transactions`.
use_colnames : bool (default: False)
#设置为True,则返回的关联规则、频繁项集会使用商品名称,而不是商品所在列的索引值
If `True`, uses the DataFrames' column names in the returned DataFrame
instead of column indices.
max_len : int (default: None)
Maximum length of the itemsets generated. If `None` (default) all
possible itemsets lengths (under the apriori condition) are evaluated.
verbose : int (default: 0)
Shows the number of iterations if >= 1 and `low_memory` is `True`. If
>=1 and `low_memory` is `False`, shows the number of combinations.
low_memory : bool (default: False)
If `True`, uses an iterator to search for combinations above
`min_support`.
Note that `low_memory=True` should only be used for large datasets when memory
resources are limited, because this implementation is approx. 3-6x slower than the default.
Returns
-----------
pandas DataFrame with columns ['support', 'itemsets'] of all itemsets
that are >= `min_support` and of length <= `max_len`
(if `max_len` is not None).
Each itemset in the 'itemsets' column is of type `frozenset`,
which is a Python built-in type that behaves similarly to
sets except that it is immutable.
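To make the required input format concrete, here is a minimal runnable sketch (using the toy data from the docstring above) that builds the one-hot DataFrame by hand and runs apriori on it:
import pandas as pd
from mlxtend.frequent_patterns import apriori
toy = pd.DataFrame(
    [[True, False, True, True, False, True],
     [True, False, True, False, False, True],
     [True, False, True, False, False, False],
     [True, True, False, False, False, False],
     [False, False, True, True, True, True],
     [False, False, True, False, True, True],
     [False, False, True, False, True, False],
     [True, True, False, False, False, False]],
    columns=['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice'])
# Beer appears in 6 of the 8 transactions, so its support is 0.75
print(apriori(toy, min_support=0.5, use_colnames=True))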
Practice dataset:
Link: https://pan.baidu.com/s/1VLgpdc2N6ZEDvbaB5UnU1w
Extraction code: 6mbg
Partial data screenshot: (image omitted)
Import the data:
import pandas as pd
path = r'C:\Users\Cara\Desktop\store_data.csv'  # raw string so the backslashes are not treated as escapes
records = pd.read_csv(path, header=None, encoding='utf-8')
print(records)
The result (screenshot omitted):
Use TransactionEncoder to one-hot encode the transaction data.
First, view the help documentation for TransactionEncoder:
from mlxtend.preprocessing import TransactionEncoder
help(TransactionEncoder)
Help on class TransactionEncoder in module mlxtend.preprocessing.transactionencoder:
class TransactionEncoder(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin)
| Encoder class for transaction data in Python lists
|
| Parameters
| ------------
| None
|
| Attributes
| ------------
| columns_: list
| List of unique names in the `X` input list of lists
|
| Examples
| ------------
| Method resolution order:
| TransactionEncoder
| sklearn.base.BaseEstimator
| sklearn.base.TransformerMixin
| builtins.object
|
| Methods defined here:
|
| __init__(self)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self, X)
| Learn unique column names from transaction DataFrame
|
| Parameters
| ------------
| X : list of lists  # i.e., a nested list: the outer list contains one inner list per transaction, e.g. [[], [], []]
| A python list of lists, where the outer list stores the
| n transactions and the inner list stores the items in each
| transaction.
|
| For example,
| [['Apple', 'Beer', 'Rice', 'Chicken'],
| ['Apple', 'Beer', 'Rice'],
| ['Apple', 'Beer'],
| ['Apple', 'Bananas'],
| ['Milk', 'Beer', 'Rice', 'Chicken'],
| ['Milk', 'Beer', 'Rice'],
| ['Milk', 'Beer'],
| ['Apple', 'Bananas']]
|
| fit_transform(self, X, sparse=False)
| Fit a TransactionEncoder encoder and transform a dataset.
|
| inverse_transform(self, array)  # decodes the array back into transactions
| Transforms an encoded NumPy array back into transactions.
|
| Parameters
| ------------
| array : NumPy array [n_transactions, n_unique_items]
| The NumPy one-hot encoded boolean array of the input transactions,
| where the columns represent the unique items found in the input
| array in alphabetic order
|
| For example,
| ```
| array([[True , False, True , True , False, True ],
| [True , False, True , False, False, True ],
| [True , False, True , False, False, False],
| [True , True , False, False, False, False],
| [False, False, True , True , True , True ],
| [False, False, True , False, True , True ],
| [False, False, True , False, True , False],
| [True , True , False, False, False, False]])
| ```
| The corresponding column labels are available as self.columns_,
| e.g., ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
|
| Returns
| ------------
| X : list of lists
| A python list of lists, where the outer list stores the
| n transactions and the inner list stores the items in each
| transaction.
|
| For example,
| ```
| [['Apple', 'Beer', 'Rice', 'Chicken'],
| ['Apple', 'Beer', 'Rice'],
| ['Apple', 'Beer'],
| ['Apple', 'Bananas'],
| ['Milk', 'Beer', 'Rice', 'Chicken'],
| ['Milk', 'Beer', 'Rice'],
| ['Milk', 'Beer'],
| ['Apple', 'Bananas']]
| ```
|
| transform(self, X, sparse=False)
| Transform transactions into a one-hot encoded NumPy array.
|
| Parameters
| ------------
| X : list of lists  # nested list, same format as above
| A python list of lists, where the outer list stores the
| n transactions and the inner list stores the items in each
| transaction.
|
| For example,
| [['Apple', 'Beer', 'Rice', 'Chicken'],
| ['Apple', 'Beer', 'Rice'],
| ['Apple', 'Beer'],
| ['Apple', 'Bananas'],
| ['Milk', 'Beer', 'Rice', 'Chicken'],
| ['Milk', 'Beer', 'Rice'],
| ['Milk', 'Beer'],
| ['Apple', 'Bananas']]
|
| sparse: bool (default=False)
| If True, transform will return Compressed Sparse Row matrix
| instead of the regular one.
|
| Returns
| ------------
| array : NumPy array [n_transactions, n_unique_items]
| if sparse=False (default).
| Compressed Sparse Row matrix otherwise
| The one-hot encoded boolean array of the input transactions,
| where the columns represent the unique items found in the input
| array in alphabetic order. Exact representation depends
| on the sparse argument
|
| For example,
| array([[True , False, True , True , False, True ],
| [True , False, True , False, False, True ],
| [True , False, True , False, False, False],
| [True , True , False, False, False, False],
| [False, False, True , True , True , True ],
| [False, False, True , False, True , True ],
| [False, False, True , False, True , False],
| [True , True , False, False, False, False]])
| The corresponding column labels are available as self.columns_, e.g.,
| ['Apple', 'Bananas', 'Beer', 'Chicken', 'Milk', 'Rice']
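As a quick sanity check before working with the real dataset, here is a small sketch (toy transactions borrowed from the docstring above) of the encode/decode round trip, including inverse_transform:
from mlxtend.preprocessing import TransactionEncoder
toy_transactions = [['Apple', 'Beer', 'Rice', 'Chicken'],
                    ['Apple', 'Beer', 'Rice'],
                    ['Milk', 'Beer']]
te = TransactionEncoder()
encoded = te.fit(toy_transactions).transform(toy_transactions)
print(te.columns_)                    # unique items, in alphabetical order
print(encoded)                        # boolean one-hot array, one row per transaction
print(te.inverse_transform(encoded))  # recovers the original transactions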
From the TransactionEncoder help documentation, the imported DataFrame must first be converted into a nested list; only then can the encoder's fit() and transform() methods be called to perform the encoding.
from mlxtend.preprocessing import TransactionEncoder
import numpy as np
lst_records = np.array(records).tolist()  # first convert the DataFrame to a nested list
print(lst_records)
TE = TransactionEncoder()  # instantiate the class
one_hot_records = TE.fit(lst_records).transform(lst_records)
Part of the printed nested list:
[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], ['chutney', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], ['turkey', 'avocado', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea', nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], ……]
As you can see, cells that were blank in the file were read in as NaN, and the NaN values were carried over into the nested list. Feeding data like this into the transform() call above raises an error:
(error traceback screenshot omitted)
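Incidentally, an alternative fix (just a sketch, not the route taken below) is to strip the NaN padding out of each transaction in pure Python before encoding:
# keep only the string entries; the float NaN padding is dropped
clean_records = [[item for item in transaction if isinstance(item, str)]
                 for transaction in lst_records]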
The route taken here: re-import the data with na_filter=False, which makes pandas keep blank cells as empty strings instead of converting them to NaN.
# Import the data
import pandas as pd
path = r'C:\Users\Cara\Desktop\store_data.csv'
records = pd.read_csv(path, header=None, encoding='utf-8', na_filter=False)
print(records)
The result (screenshot omitted):
Convert the DataFrame to a nested list:
import numpy as np
lst_records = np.array(records).tolist()
print(lst_records)
Part of the result:
[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['chutney', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''], ……]
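Incidentally, the same conversion can be done without importing numpy, since a DataFrame already exposes its data via .values (a sketch):
# equivalent pandas-only conversion to a nested list
lst_records = records.values.tolist()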
One-hot encode the transaction data:
from mlxtend.preprocessing import TransactionEncoder
TE = TransactionEncoder()  # instantiate the class
one_hot_records = TE.fit(lst_records).transform(lst_records)
print(one_hot_records)
The result (screenshot omitted):
Run the Apriori algorithm to mine frequent itemsets and association rules:
Since apriori requires a DataFrame as input, the one-hot encoded array from the previous step has to be converted into a DataFrame:
df_records = pd.DataFrame(data=one_hot_records, columns=TE.columns_)
# TE.columns_ is populated by the fit() method; see the TransactionEncoder documentation above
print(df_records)
The result (screenshot omitted):
Mine the frequent itemsets:
from mlxtend.frequent_patterns import apriori
freq_items = apriori(df_records, min_support=0.05, use_colnames=True)
# The default min_support is 0.5; it is set as low as 0.05 here because with a larger value no frequent itemsets are found at all...
# use_colnames=True makes the frequent itemsets display item names instead of the column index values
print(freq_items)
The result:
support itemsets
0 0.999867 ()
1 0.087188 (burgers)
2 0.081056 (cake)
3 0.059992 (chicken)
4 0.163845 (chocolate)
5 0.080389 (cookies)
6 0.051060 (cooking oil)
7 0.179709 (eggs)
8 0.079323 (escalope)
9 0.170911 (french fries)
10 0.063325 (frozen smoothie)
11 0.095321 (frozen vegetables)
12 0.052393 (grated cheese)
13 0.132116 (green tea)
14 0.098254 (ground beef)
15 0.076523 (low fat yogurt)
16 0.129583 (milk)
17 0.238368 (mineral water)
18 0.065858 (olive oil)
19 0.095054 (pancakes)
20 0.071457 (shrimp)
21 0.050527 (soup)
22 0.174110 (spaghetti)
23 0.068391 (tomatoes)
24 0.062525 (turkey)
25 0.058526 (whole wheat rice)
26 0.087188 (, burgers)
27 0.081056 (, cake)
28 0.059992 (, chicken)
29 0.163845 (, chocolate)
30 0.080389 (, cookies)
31 0.051060 (, cooking oil)
32 0.179709 (, eggs)
33 0.079323 (, escalope)
34 0.170911 (, french fries)
35 0.063192 (, frozen smoothie)
36 0.095321 (, frozen vegetables)
37 0.052393 (, grated cheese)
38 0.131982 (, green tea)
39 0.098254 (, ground beef)
40 0.076390 (, low fat yogurt)
41 0.129583 (, milk)
42 0.238235 (, mineral water)
43 0.065725 (, olive oil)
44 0.095054 (, pancakes)
45 0.071324 (, shrimp)
46 0.050527 (, soup)
47 0.174110 (, spaghetti)
48 0.068391 (, tomatoes)
49 0.062525 (, turkey)
50 0.058526 (, whole wheat rice)
51 0.052660 (mineral water, chocolate)
52 0.050927 (eggs, mineral water)
53 0.059725 (spaghetti, mineral water)
54 0.052660 (, mineral water, chocolate)
55 0.050927 (, eggs, mineral water)
56 0.059725 (, spaghetti, mineral water)
From the output we can see single items that are bought at high frequency, such as burgers, cake, chicken, ..., as well as combinations that are frequently bought together, such as (mineral water, chocolate) and (eggs, mineral water). Notice, however, that the empty string introduced by the na_filter=False padding is treated as an "item" in its own right: it shows up as the near-universal itemset () with support 0.999867 and duplicates almost every other row, e.g. (, burgers).
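A cleaner approach (a sketch, assuming the padding column is literally named with the empty string '') is to drop that column before mining, which removes the () itemset and the duplicated (, ...) rows:
# drop the empty-string padding column, then mine again
df_clean = df_records.drop(columns=[''], errors='ignore')
freq_items_clean = apriori(df_clean, min_support=0.05, use_colnames=True)
print(freq_items_clean)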
Next, use the association_rules() function to mine the association rules.
View the help documentation for association_rules():
from mlxtend.frequent_patterns import association_rules
help(association_rules)
Help on function association_rules in module mlxtend.frequent_patterns.association_rules:
association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)
Generates a DataFrame of association rules including the
metrics 'score', 'confidence', and 'lift'
Parameters
-----------
df : pandas DataFrame
pandas DataFrame of frequent itemsets  # the input is the frequent itemsets, i.e., the DataFrame returned by apriori
with columns ['support', 'itemsets']
metric : string (default: 'confidence')  # the default evaluation metric is confidence
Metric to evaluate if a rule is of interest.
**Automatically set to 'support' if `support_only=True`.**
Otherwise, supported metrics are 'support', 'confidence', 'lift',
'leverage', and 'conviction'
These metrics are computed as follows:
- support(A->C) = support(A+C) [aka 'support'], range: [0, 1]  # support ranges from 0 to 1
- confidence(A->C) = support(A+C) / support(A), range: [0, 1]  # confidence ranges from 0 to 1
- lift(A->C) = confidence(A->C) / support(C), range: [0, inf]  # lift ranges from 0 to positive infinity
- leverage(A->C) = support(A->C) - support(A)*support(C),
range: [-1, 1]
- conviction = [1 - support(C)] / [1 - confidence(A->C)],
range: [0, inf]
min_threshold : float (default: 0.8)
Minimal threshold for the evaluation metric,
via the `metric` parameter,
to decide whether a candidate rule is of interest.
support_only : bool (default: False)
Only computes the rule support and fills the other
metric columns with NaNs. This is useful if:
a) the input DataFrame is incomplete, e.g., does
not contain support values for all rule antecedents
and consequents
b) you simply want to speed up the computation because
you don't need the other metrics.
Returns
----------
pandas DataFrame with columns "antecedents" and "consequents"
that store itemsets, plus the scoring metric columns:
"antecedent support", "consequent support",
"support", "confidence", "lift",
"leverage", "conviction"
of all rules for which
metric(rule) >= min_threshold.
Each entry in the "antecedents" and "consequents" columns are
of type `frozenset`, which is a Python built-in type that
behaves similarly to sets except that it is immutable.
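To see how these metrics play out on the data above, here is a hand calculation for the candidate rule spaghetti -> mineral water, plugging in the supports from the frequent-itemset table:
# supports taken from the freq_items output above
support_A = 0.174110    # support(spaghetti)
support_C = 0.238368    # support(mineral water)
support_AC = 0.059725   # support(spaghetti, mineral water)
confidence = support_AC / support_A   # ≈ 0.343
lift = confidence / support_C         # ≈ 1.44, i.e., a mild positive association
print(confidence, lift)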
Mine the association rules:
from mlxtend.frequent_patterns import association_rules
association_rules_1 = association_rules(freq_items, metric='confidence', min_threshold=0.2)
# The default min_threshold is 0.8, but at that level no rules are found, so the threshold was lowered step by step...
print(association_rules_1)
Part of the result (screenshot omitted):
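To surface the strongest rules first, the result can be sorted by lift before inspecting it (a sketch):
# sort the rules by lift, descending
top_rules = association_rules_1.sort_values(by='lift', ascending=False)
print(top_rules.head(10))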
Since the console output is truncated, you can use the to_csv() method to export the results to a CSV file (which can then be opened in Excel):
association_rules_1.to_csv(path_or_buf=r'C:\Users\Cara\Desktop\association_rules.csv')
The result (screenshot omitted):
Appendix: the full Python code
# Import the data
import pandas as pd
path = r'C:\Users\Cara\Desktop\store_data.csv'
records = pd.read_csv(path, header=None, encoding='utf-8', na_filter=False)
# na_filter=False: blank cells are read in as empty strings rather than NaN
print(records)
# Convert the DataFrame to a nested list
import numpy as np
lst_records = np.array(records).tolist()
print(lst_records)
# One-hot encode the transaction data
from mlxtend.preprocessing import TransactionEncoder
TE = TransactionEncoder()  # instantiate the class
one_hot_records = TE.fit(lst_records).transform(lst_records)
print(one_hot_records)
# Convert back to a DataFrame
df_records = pd.DataFrame(data=one_hot_records, columns=TE.columns_)
print(df_records)
# Mine the frequent itemsets
from mlxtend.frequent_patterns import apriori
freq_items = apriori(df_records, min_support=0.05, use_colnames=True)
print(freq_items)
# Mine the association rules
from mlxtend.frequent_patterns import association_rules
association_rules_1 = association_rules(freq_items, metric='confidence', min_threshold=0.2)
print(association_rules_1)
# Export the association rule mining results
association_rules_1.to_csv(path_or_buf=r'C:\Users\Cara\Desktop\association_rules.csv')
References:
Python -- 深入浅出Apriori关联分析算法(一) (a plain-language introduction to Apriori association analysis, part 1), www.cnblogs.com
