用Python在Mysql的帮助下简单实现Apriori

最新推荐文章于 2021-02-08 00:37:37 发布

JUMBO-K

最新推荐文章于 2021-02-08 00:37:37 发布

阅读量445

点赞数

分类专栏： Von Jumbo 文章标签： Apriori

本文链接：https://blog.csdn.net/zhubozhen/article/details/80205210

版权

Von Jumbo 专栏收录该内容

14 篇文章 0 订阅

订阅专栏

这篇博客介绍了如何使用Python和Mysql实现Apriori算法。博主首先讲解了建立数据库表的过程，接着展示了查询表的内容。在分析部分，博主分享了代码，并解释了如何计算频繁项和支持度，以及如何生成关联规则。最后，博主给出了运行结果的部分截图和参考资料。

摘要由CSDN通过智能技术生成

尽管这个方法很老，但是很适合入门先体验一下XD 而且效率还是可以的，结果也令人满意。

1 建表

首先打开Mysql


create table if not exists fc_project_tags(
    project_id int(11) not null default '0',
    tag_name varchar(50) not null default '0',
    primary key (project_id,tag_name)
)engine=MyISAM default Charset=latin1;

这里用到了一个engine=MyISAM。即指定了储存引擎

这里写图片描述
这里可以详见https://blog.csdn.net/redbloodbody/article/details/58185240，写的十分详细了。
数据库的具体文件在：
https://github.com/megansquire/masteringDM/blob/master/ch2/fc_project_tags.sql.gz
这里书中作者推荐用source命令导入。懒得打字，详见百度。

2 检阅一下这个表

可以先简单查询一下

select count(*)
from fc_project_tags;

353400

select count(DISTINCT project_id)
from fc_project_tags;

46510

select count(DISTINCT tag_name)
from fc_project_tags;

11006

select tag_name ,count(project_id)
from fc_project_tags
group by 1
having count(project_id) >= 2325
order by 2 desc;

注意这里的having，不能用where，见本博客的某篇MYSQL文章。
具体来说，where子句中是不能用聚集函数作为条件表达式的。
以及，这里的group by 1是指以选择的第一列进行groupby聚集
以下是运行的结果的截取

tag_name, count(project_id)

‘GPL’, ‘21182’
‘POSIX’, ‘16875’
‘Linux’, ‘16288’
‘C’, ‘10292’
‘OS Independent’, ‘10180’
‘Software Development’, ‘9619’
……
Scientific/Engineering 2679
Games/Entertainment 2528
BSD 2497
Desktop Environment 2335

3 开始分析

准备工作：
代码在https://github.com/megansquire/masteringDM/tree/master/ch2
这段程序是为了计算总的篮子的数量(可以通过最小支持阈值的有关联的组合，bzw.有序对，的数量）
多一嘴，这段程序和书中有些许不同的是他更新了了一小部分，导入了getpass标准库。

同时之前的pip重新下了一个版本，所以path忘记重新配置了，提醒一定要加入系统变量。为此装第三方库的时候折腾了好久。
关于用到的itertools库可以看这个。https://blog.csdn.net/c465869935/article/details/51598388
提供了包括但不限于count容器，十分好用。

开始运行：
1）首先应该输入的参数是MINSUPPORTPCT，default是5%，这里的支持度设置的够高的话可以省掉很多时间，同时也会忽略一些不是特别明显的组合，换句话说可能会漏掉一些有价值的信息。

# set threshold as a percent (example, 5% of Freecode baskets is about 2325)
MINSUPPORTPCT = 5

2）链接数据库需要的相关参数

dbhost = 'localhost'
dbschema = 'test'
dbuser = 'root'
dbpasswd = 'xxxxxx'
dbport = 3306
dbcharset = 'utf8mb4'

port注意一下，如果没有特殊需求就选3306就好了。
3）这里定义了找出size=2的频繁项的函数，用到了之前通过最低阈值的size=1的频繁项。这里的数学原理是：如果一个项的任何一个子项不是频繁项，那么它也一定不是一个频繁项。所以只需要check size=1里已经保存好的项并将其组合就可以了，并且计算、保存他的support和confidence。同样的，这里应该删去一部分confidence不高/由于各种原因并不可靠的关联，这个confidence的阈值同样可以调整。这里用到了一些intertoos里的函数，具体作用见上面的链接里的文章。

def findDoubletons():
    print("======")
    print("Frequent doubletons found:")
    print("======")
    # use the list of allSingletonTags to make the doubleton candidates
    doubletonCandidates = list(itertools.combinations(allSingletonTags, 2))
    for (index, candidate) in enumerate(doubletonCandidates):
        # figure out if this doubleton candidate is frequent
        tag1 = candidate[0]
        tag2 = candidate[1]
        getDoubletonFrequencyQuery = "SELECT count(fpt1.project_id) \
                                     FROM fc_project_tags fpt1 \
                                     INNER JOIN fc_project_tags fpt2 \
                                     ON fpt1.project_id = fpt2.project_id \
                                     WHERE fpt1.tag_name = %s \
                                     AND fpt2.tag_name = %s"
        insertPairQuery = "INSERT INTO test.fc_project_tag_pairs \
                                (tag1, tag2, num_projs) \
                                VALUES (%s,%s,%s)"
        cursor.execute(getDoubletonFrequencyQuery, (tag1, tag2))

    到这一步为止，使用cursor游标从mysql的数据库中读出需要的size=1的频繁项集并保存tag_pairs。注意本文所有的pairs都是有序对。

        count = cursor.fetchone()[0]
        # add frequent doubleton to database                
        if count > minsupport:
            print (tag1,tag2,"[",count,"]")


            cursor.execute(insertPairQuery,(tag1, tag2, count))

            # save the frequent doubleton to our final list
            doubletonSet.add(candidate)         
            # add terms to a set of all doubleton terms (no duplicates)
            allDoubletonTags.add(tag1)
            allDoubletonTags.add(tag2)

这里将结果找到的tag(s),bzw.,count>minsupport（就是开头用百分比定义的最低值）的tag对保存到allDoubletonTags当中。注意重复。
4）找出size=3的频繁项，和size=2的这个基本没区别。

def findTripletons():
    print("======")
    print("Frequent tripletons found:")
    print("======")
    # use the list of allDoubletonTags to make the tripleton candidates
    tripletonCandidates = list(itertools.combinations(allDoubletonTags,3))

    # sort each candidate tuple and add these to a new sorted candidate list    
    tripletonCandidatesSorted = []
    for tc in tripletonCandidates:
        tripletonCandidatesSorted.append(sorted(tc))

    # figure out if this tripleton candidate is frequent
    for (index, candidate) in enumerate(tripletonCandidatesSorted):          
        # all doubletons inside this tripleton candidate MUST also be frequent
        doubletonsInsideTripleton = list(itertools.combinations(candidate,2))
        tripletonCandidateRejected = 0
        for (index, doubleton) in enumerate(doubletonsInsideTripleton):
            if doubleton not in doubletonSet:
                tripletonCandidateRejected = 1
                break
        # set up queries
        getTripletonFrequencyQuery = "SELECT count(fpt1.project_id) \
                                        FROM fc_project_tags fpt1 \
                                        INNER JOIN fc_project_tags fpt2 \
                                        ON fpt1.project_id = fpt2.project_id \
                                        INNER JOIN fc_project_tags fpt3 \
                                        ON fpt2.project_id = fpt3.project_id \
                                        WHERE (fpt1.tag_name = %s \
                                        AND fpt2.tag_name = %s \
                                        AND fpt3.tag_name = %s)"
        insertTripletonQuery = "INSERT INTO test.fc_project_tag_triples \
                                (tag1, tag2, tag3, num_projs) \
                                VALUES (%s,%s,%s,%s)"
        # insert frequent tripleton into database
        if tripletonCandidateRejected == 0:
            cursor.execute(getTripletonFrequencyQuery, (candidate[0],
                                                        candidate[1],
                                                        candidate[2]))
            count = cursor.fetchone()[0]
            if count > minsupport:
                print (candidate[0],",",
                       candidate[1],",",
                       candidate[2],
                       "[",count,"]")
                cursor.execute(insertTripletonQuery,
                                (candidate[0],
                                 candidate[1],
                                 candidate[2],
                                 count))`

5）生成规则
这里生成三个tag之间的support和confidence并且保存。就是简单的读到频繁项的集然后生成规则（及由2个tag关联到第3个tag）。用到了calcSCAV函数，在下面介绍。

def generateRules():
    print("======")    
    print("Association Rules:")
    print("======")

    # pull final list of tripletons to make the rules
    getFinalListQuery = "SELECT tag1, tag2, tag3, num_projs \
                   FROM test.fc_project_tag_triples"
    cursor.execute(getFinalListQuery)
    triples = cursor.fetchall()
    for(triple) in triples:
        tag1 = triple[0]
        tag2 = triple[1]
        tag3 = triple[2]
        ruleSupport = triple[3]

        calcSCAV(tag1, tag2, tag3, ruleSupport)
        calcSCAV(tag1, tag3, tag2, ruleSupport)
        calcSCAV(tag2, tag3, tag1, ruleSupport)
        print("*")

6）calcSCAV() 计算support和confidence，保存，打印结果。

def calcSCAV(tagA, tagB, tagC, ruleSupport):
    # Support
    ruleSupportPct = round((ruleSupport/baskets),2)

    # Confidence    
    queryConf = "SELECT num_projs \
              FROM test.fc_project_tag_pairs \
              WHERE (tag1 = %s AND tag2 = %s) \
              OR    (tag2 = %s AND tag1 = %s)"
    cursor.execute(queryConf, (tagA, tagB, tagA, tagB))
    pairSupport = cursor.fetchone()[0]
    confidence = round((ruleSupport / pairSupport),2)

    # Added Value
    queryAV = "SELECT count(*) \
              FROM test.fc_project_tags \
              WHERE tag_name= %s"
    cursor.execute(queryAV, tagC)
    supportTagC = cursor.fetchone()[0]
    supportTagCPct = supportTagC/baskets
    addedValue = round((confidence - supportTagCPct),2)

    # Result
    print(tagA,",",tagB,"->",tagC,
          "[S=",ruleSupportPct,
          ", C=",confidence,
          ", AV=",addedValue,
          "]")

7）‘正文’部分

# Open local database connection
db = pymysql.connect(host=dbhost,
                     db=dbschema,
                     user=dbuser,
                     passwd=dbpasswd,
                     port=dbport,
                     charset=dbcharset,
                     autocommit=True)
cursor = db.cursor()

# calculate number of baskets
queryBaskets = "SELECT count(DISTINCT project_id) \
                FROM fc_project_tags;"
cursor.execute(queryBaskets)
baskets = cursor.fetchone()[0]

# calculate minimum number of baskets based on minimum support threshold
minsupport = baskets*(MINSUPPORTPCT/100)
print("Minimum support count:",minsupport,"(",MINSUPPORTPCT,"% of",baskets,")")

# get tags that meet our minimum support threshold
tagNameQuery = "SELECT DISTINCT tag_name \
                FROM fc_project_tags \
                GROUP BY 1 \
                HAVING COUNT(project_id) >= %s \
                ORDER BY tag_name"
cursor.execute(tagNameQuery,(minsupport))
singletons = cursor.fetchall()

for singleton in singletons:
    allSingletonTags.append(singleton[0])

findDoubletons()
findTripletons()
generateRules()

db.close()

8）小结
内容会直接打印在console，但是生成频繁项会保存在mysql中。

4 运行结果只显示console上的部分

Minimum support count: 2325.5 ( 5 % of 46510 )
======
Frequent doubletons found:
======
C GPL [ 5543 ]
C Linux [ 5653 ]
C POSIX [ 6956 ]
C++ GPL [ 2914 ]
C++ Linux [ 3428 ]
C++ POSIX [ 3502 ]
Communications GPL [ 2578 ]
Dynamic Content Internet [ 3173 ]
Dynamic Content Web [ 3171 ]
English Linux [ 2662 ]
GPL Internet [ 4038 ]
GPL Linux [ 8038 ]
GPL multimedia [ 2883 ]
GPL OS Independent [ 4405 ]
GPL PHP [ 2376 ]
GPL POSIX [ 10069 ]
GPL Software Development [ 3319 ]
GPL Web [ 2901 ]
GPL Windows [ 2605 ]
Internet OS Independent [ 3007 ]
Internet POSIX [ 2832 ]
Internet Web [ 5978 ]
Java OS Independent [ 3436 ]
Java Software Development [ 2360 ]
Libraries Software Development [ 5638 ]
Linux Mac OS X [ 2974 ]
Linux POSIX [ 11903 ]
Linux Software Development [ 2336 ]
Linux Unix [ 2494 ]
Linux Windows [ 5281 ]
Mac OS X Windows [ 3131 ]
multimedia POSIX [ 2539 ]
OS Independent Software Development [ 3566 ]
OS Independent Web [ 2605 ]
POSIX Software Development [ 3503 ]
POSIX Unix [ 2326 ]
POSIX Windows [ 4467 ]
======
Frequent tripletons found:
======
Internet , OS Independent , Web [ 2519 ]
Dynamic Content , Internet , Web [ 3166 ]
GPL , Internet , Web [ 2878 ]
C++ , Linux , POSIX [ 2622 ]
Linux , POSIX , Windows [ 3315 ]
C , Linux , POSIX [ 4629 ]
C , GPL , POSIX [ 4364 ]
GPL , Linux , POSIX [ 7384 ]
C , GPL , Linux [ 3299 ]
======
Association Rules:
======
Dynamic Content , Internet -> Web [S= 0.07 , C= 1.0 , AV= 0.87 ]
Dynamic Content , Web -> Internet [S= 0.07 , C= 1.0 , AV= 0.83 ]
Internet , Web -> Dynamic Content [S= 0.07 , C= 0.53 , AV= 0.46 ]
*
Internet , OS Independent -> Web [S= 0.05 , C= 0.84 , AV= 0.71 ]
Internet , Web -> OS Independent [S= 0.05 , C= 0.42 , AV= 0.2 ]
OS Independent , Web -> Internet [S= 0.05 , C= 0.97 , AV= 0.8 ]
*
GPL , Internet -> Web [S= 0.06 , C= 0.71 , AV= 0.58 ]
GPL , Web -> Internet [S= 0.06 , C= 0.99 , AV= 0.82 ]
Internet , Web -> GPL [S= 0.06 , C= 0.48 , AV= 0.02 ]
*
Linux , POSIX -> Windows [S= 0.07 , C= 0.28 , AV= 0.12 ]
Linux , Windows -> POSIX [S= 0.07 , C= 0.63 , AV= 0.27 ]
POSIX , Windows -> Linux [S= 0.07 , C= 0.74 , AV= 0.39 ]
*
C , GPL -> Linux [S= 0.07 , C= 0.6 , AV= 0.25 ]
C , Linux -> GPL [S= 0.07 , C= 0.58 , AV= 0.12 ]
GPL , Linux -> C [S= 0.07 , C= 0.41 , AV= 0.19 ]
*
GPL , Linux -> POSIX [S= 0.16 , C= 0.92 , AV= 0.56 ]
GPL , POSIX -> Linux [S= 0.16 , C= 0.73 , AV= 0.38 ]
Linux , POSIX -> GPL [S= 0.16 , C= 0.62 , AV= 0.16 ]
*
C , GPL -> POSIX [S= 0.09 , C= 0.79 , AV= 0.43 ]
C , POSIX -> GPL [S= 0.09 , C= 0.63 , AV= 0.17 ]
GPL , POSIX -> C [S= 0.09 , C= 0.43 , AV= 0.21 ]
*
C , Linux -> POSIX [S= 0.1 , C= 0.82 , AV= 0.46 ]
C , POSIX -> Linux [S= 0.1 , C= 0.67 , AV= 0.32 ]
Linux , POSIX -> C [S= 0.1 , C= 0.39 , AV= 0.17 ]
*
C++ , Linux -> POSIX [S= 0.06 , C= 0.76 , AV= 0.4 ]
C++ , POSIX -> Linux [S= 0.06 , C= 0.75 , AV= 0.4 ]
Linux , POSIX -> C++ [S= 0.06 , C= 0.22 , AV= 0.09 ]
*
Internet , OS Independent -> Web [S= 0.05 , C= 0.84 , AV= 0.71 ]
Internet , Web -> OS Independent [S= 0.05 , C= 0.42 , AV= 0.2 ]
OS Independent , Web -> Internet [S= 0.05 , C= 0.97 , AV= 0.8 ]
*
Dynamic Content , Internet -> Web [S= 0.07 , C= 1.0 , AV= 0.87 ]
Dynamic Content , Web -> Internet [S= 0.07 , C= 1.0 , AV= 0.83 ]
Internet , Web -> Dynamic Content [S= 0.07 , C= 0.53 , AV= 0.46 ]
*
GPL , Internet -> Web [S= 0.06 , C= 0.71 , AV= 0.58 ]
GPL , Web -> Internet [S= 0.06 , C= 0.99 , AV= 0.82 ]
Internet , Web -> GPL [S= 0.06 , C= 0.48 , AV= 0.02 ]
*
C++ , Linux -> POSIX [S= 0.06 , C= 0.76 , AV= 0.4 ]
C++ , POSIX -> Linux [S= 0.06 , C= 0.75 , AV= 0.4 ]
Linux , POSIX -> C++ [S= 0.06 , C= 0.22 , AV= 0.09 ]
*
Linux , POSIX -> Windows [S= 0.07 , C= 0.28 , AV= 0.12 ]
Linux , Windows -> POSIX [S= 0.07 , C= 0.63 , AV= 0.27 ]
POSIX , Windows -> Linux [S= 0.07 , C= 0.74 , AV= 0.39 ]
*
C , Linux -> POSIX [S= 0.1 , C= 0.82 , AV= 0.46 ]
C , POSIX -> Linux [S= 0.1 , C= 0.67 , AV= 0.32 ]
Linux , POSIX -> C [S= 0.1 , C= 0.39 , AV= 0.17 ]
*
C , GPL -> POSIX [S= 0.09 , C= 0.79 , AV= 0.43 ]
C , POSIX -> GPL [S= 0.09 , C= 0.63 , AV= 0.17 ]
GPL , POSIX -> C [S= 0.09 , C= 0.43 , AV= 0.21 ]
*
GPL , Linux -> POSIX [S= 0.16 , C= 0.92 , AV= 0.56 ]
GPL , POSIX -> Linux [S= 0.16 , C= 0.73 , AV= 0.38 ]
Linux , POSIX -> GPL [S= 0.16 , C= 0.62 , AV= 0.16 ]
*
C , GPL -> Linux [S= 0.07 , C= 0.6 , AV= 0.25 ]
C , Linux -> GPL [S= 0.07 , C= 0.58 , AV= 0.12 ]
GPL , Linux -> C [S= 0.07 , C= 0.41 , AV= 0.19 ]
*

5 reference：

数据挖掘原理与算法毛国君清华大学出版社 P66-P79
Python数据挖掘概念、方法与实践 Megan Squire
数据库系统概论（第5版） P97
Python数据科学实践指南纪路机械工业出版社 #个人认为是非常好的入门书籍

以及一篇比较详尽的理论文https://blog.csdn.net/GarfieldEr007/article/details/51051491

JUMBO-K

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
用Python在Mysql的帮助下简单实现Apriori

尽管这个方法很老，但是很适合入门先体验一下XD 而且效率还是可以的，结果也令人满意。1 建表首先打开Mysqlcreate table if not exists fc_project_tags( project_id int(11) not null default '0', tag_name varchar(50) not null default '0',...
复制链接

扫一扫