用Python在Mysql的帮助下简单实现Apriori

这篇博客介绍了如何使用Python和Mysql实现Apriori算法。博主首先讲解了建立数据库表的过程,接着展示了查询表的内容。在分析部分,博主分享了代码,并解释了如何计算频繁项和支持度,以及如何生成关联规则。最后,博主给出了运行结果的部分截图和参考资料。
摘要由CSDN通过智能技术生成

尽管这个方法很老,但是很适合入门先体验一下XD 而且效率还是可以的,结果也令人满意。

1 建表

首先打开Mysql


create table if not exists fc_project_tags(
    project_id int(11) not null default '0',
    tag_name varchar(50) not null default '0',
    primary key (project_id,tag_name)
)engine=MyISAM default Charset=latin1;

这里用到了一个engine=MyISAM。即指定了储存引擎

这里写图片描述
这里可以详见https://blog.csdn.net/redbloodbody/article/details/58185240,写的十分详细了。
数据库的具体文件在:
https://github.com/megansquire/masteringDM/blob/master/ch2/fc_project_tags.sql.gz
这里书中作者推荐用source命令导入。懒得打字,详见百度。

2 检阅一下这个表

可以先简单查询一下

select count(*)
from fc_project_tags;

353400

select count(DISTINCT project_id)
from fc_project_tags;

46510

select count(DISTINCT tag_name)
from fc_project_tags;

11006

select tag_name ,count(project_id)
from fc_project_tags
group by 1
having count(project_id) >= 2325
order by 2 desc;

注意这里的having,不能用where,见本博客的某篇MYSQL文章。
具体来说,where子句中是不能用聚集函数作为条件表达式的。
以及,这里的group by 1是指以选择的第一列进行groupby聚集
以下是运行的结果的截取

tag_name, count(project_id)

‘GPL’, ‘21182’
‘POSIX’, ‘16875’
‘Linux’, ‘16288’
‘C’, ‘10292’
‘OS Independent’, ‘10180’
‘Software Development’, ‘9619’
……
Scientific/Engineering 2679
Games/Entertainment 2528
BSD 2497
Desktop Environment 2335

3 开始分析

准备工作:
代码在https://github.com/megansquire/masteringDM/tree/master/ch2
这段程序是为了计算总的篮子的数量(可以通过最小支持阈值的有关联的组合,bzw.有序对,的数量)
多一嘴,这段程序和书中有些许不同的是他更新了了一小部分,导入了getpass标准库。

同时之前的pip重新下了一个版本,所以path忘记重新配置了,提醒一定要加入系统变量。为此装第三方库的时候折腾了好久。
关于用到的itertools库可以看这个。https://blog.csdn.net/c465869935/article/details/51598388
提供了包括但不限于count容器,十分好用。

开始运行:
1)首先应该输入的参数是MINSUPPORTPCT,default是5%,这里的支持度设置的够高的话可以省掉很多时间,同时也会忽略一些不是特别明显的组合,换句话说可能会漏掉一些有价值的信息。

# set threshold as a percent (example, 5% of Freecode baskets is about 2325)
MINSUPPORTPCT = 5

2)链接数据库需要的相关参数

dbhost = 'localhost'
dbschema = 'test'
dbuser = 'root'
dbpasswd = 'xxxxxx'
dbport = 3306
dbcharset = 'utf8mb4'

port注意一下,如果没有特殊需求就选3306就好了。
3)这里定义了找出size=2的频繁项的函数,用到了之前通过最低阈值的size=1的频繁项。这里的数学原理是:如果一个项的任何一个子项不是频繁项,那么它也一定不是一个频繁项。所以只需要check size=1里已经保存好的项并将其组合就可以了,并且计算、保存他的support和confidence。同样的,这里应该删去一部分confidence不高/由于各种原因并不可靠的关联,这个confidence的阈值同样可以调整。这里用到了一些intertoos里的函数,具体作用见上面的链接里的文章。

def findDoubletons():
    print("======")
    print("Frequent doubletons found:")
    print("======")
    # use the list of allSingletonTags to make the doubleton candidates
    doubletonCandidates = list(itertools.combinations(allSingletonTags, 2))
    for (index, candidate) in enumerate(doubletonCandidates):
        # figure out if this doubleton candidate is frequent
        tag1 = candidate[0]
        tag2 = candidate[1]
        getDoubletonFrequencyQuery = "SELECT count(fpt1.project_id) \
                                     FROM fc_project_tags fpt1 \
                                     INNER JOIN fc_project_tags fpt2 \
                                     ON fpt1.project_id = fpt2.project_id \
                                     WHERE fpt1.tag_name = %s \
                                     AND fpt2.tag_name = %s"
        insertPairQuery = "INSERT INTO test.fc_project_tag_pairs \
                                (tag1, tag2, num_projs) \
                                VALUES (%s,%s,%s)"
        cursor.execute(getDoubletonFrequencyQuery, (tag1, tag2))
    到这一步为止,使用cursor游标从mysql的数据库中读出需要的size=1的频繁项集并保存tag_pairs。注意本文所有的pairs都是有序对。
        count = cursor.fetchone()[0]
        # add frequent doubleton to database                
        if count > minsupport:
            print (tag1,tag2,"[",count,"]")


            cursor.execute(insertPairQuery,(tag1, tag2, count))

            # save the frequent doubleton to our final list
            doubletonSet.add(candidate)         
            # add terms to a set of all doubleton terms (no duplicates)
            allDoubletonTags.add(tag1)
            allDoubletonTags.add(tag2)

这里将结果找到的tag(s),bzw.,count>minsupport(就是开头用百分比定义的最低值)的tag对保存到allDoubletonTags当中。注意重复。
4)找出size=3的频繁项,和size=2的这个基本没区别。

def findTripletons():
    print("======")
    print("Frequent tripletons found:")
    print("======")
    # use the list of allDoubletonTags to make the tripleton candidates
    tripletonCandidates = list(itertools.combinations(allDoubletonTags,3))

    # sort each candidate tuple and add these to a new sorted candidate list    
    tripletonCandidatesSorted = []
    for tc in tripletonCandidates:
        tripletonCandidatesSorted.append(sorted(tc))

    # figure out if this tripleton candidate is frequent
    for (index, candidate) in enumerate(tripletonCandidatesSorted):          
        # all doubletons inside this tripleton candidate MUST also be frequent
        doubletonsInsideTripleton = list(itertools.combinations(candidate,2))
        tripletonCandidateRejected = 0
        for (index, doubleton) in enumerate(doubletonsInsideTripleton):
            if doubleton not in doubletonSet:
                tripletonCandidateRejected = 1
                break
        # set up queries
        getTripletonFrequencyQuery = "SELECT count(fpt1.project_id) \
                                        FROM fc_project_tags fpt1 \
                                        INNER JOIN fc_project_tags fpt2 \
                                        ON fpt1.project_id = fpt2.project_id \
                                        INNER JOIN fc_project_tags fpt3 \
                                        ON fpt2.project_id = fpt3.project_id \
                                        WHERE (fpt1.tag_name = %s \
                                        AND fpt2.tag_name = %s \
                                        AND fpt3.tag_name = %s)"
        insertTripletonQuery = "INSERT INTO test.fc_project_tag_triples \
                                (tag1, tag2, tag3, num_projs) \
                                VALUES (%s,%s,%s,%s)"
        # insert frequent tripleton into database
        if tripletonCandidateRejected == 0:
            cursor.execute(getTripletonFrequencyQuery, (candidate[0],
                                                        candidate[1],
                                                        candidate[2]))
            count = cursor.fetchone()[0]
            if count > minsupport:
                print (candidate[0],",",
                       candidate[1],",",
                       candidate[2],
                       "[",count,"]")
                cursor.execute(insertTripletonQuery,
                                (candidate[0],
                                 candidate[1],
                                 candidate[2],
                                 count))`

5)生成规则
这里生成三个tag之间的support和confidence并且保存。就是简单的读到频繁项的集然后生成规则(及由2个tag关联到第3个tag)。用到了calcSCAV函数,在下面介绍。

def generateRules():
    print("======")    
    print("Association Rules:")
    print("======")

    # pull final list of tripletons to make the rules
    getFinalListQuery = "SELECT tag1, tag2, tag3, num_projs \
                   FROM test.fc_project_tag_triples"
    cursor.execute(getFinalListQuery)
    triples = cursor.fetchall()
    for(triple) in triples:
        tag1 = triple[0]
        tag2 = triple[1]
        tag3 = triple[2]
        ruleSupport = triple[3]

        calcSCAV(tag1, tag2, tag3, ruleSupport)
        calcSCAV(tag1, tag3, tag2, ruleSupport)
        calcSCAV(tag2, tag3, tag1, ruleSupport)
        print("*")

6)calcSCAV() 计算support和confidence,保存,打印结果。

def calcSCAV(tagA, tagB, tagC, ruleSupport):
    # Support
    ruleSupportPct = round((ruleSupport/baskets),2)

    # Confidence    
    queryConf = "SELECT num_projs \
              FROM test.fc_project_tag_pairs \
              WHERE (tag1 = %s AND tag2 = %s) \
              OR    (tag2 = %s AND tag1 = %s)"
    cursor.execute(queryConf, (tagA, tagB, tagA, tagB))
    pairSupport = cursor.fetchone()[0]
    confidence = round((ruleSupport / pairSupport),2)

    # Added Value
    queryAV = "SELECT count(*) \
              FROM test.fc_project_tags \
              WHERE tag_name= %s"
    cursor.execute(queryAV, tagC)
    supportTagC = cursor.fetchone()[0]
    supportTagCPct = supportTagC/baskets
    addedValue = round((confidence - supportTagCPct),2)

    # Result
    print(tagA,",",tagB,"->",tagC,
          "[S=",ruleSupportPct,
          ", C=",confidence,
          ", AV=",addedValue,
          "]")

7)‘正文’部分

# Open local database connection
db = pymysql.connect(host=dbhost,
                     db=dbschema,
                     user=dbuser,
                     passwd=dbpasswd,
                     port=dbport,
                     charset=dbcharset,
                     autocommit=True)
cursor = db.cursor()

# calculate number of baskets
queryBaskets = "SELECT count(DISTINCT project_id) \
                FROM fc_project_tags;"
cursor.execute(queryBaskets)
baskets = cursor.fetchone()[0]

# calculate minimum number of baskets based on minimum support threshold
minsupport = baskets*(MINSUPPORTPCT/100)
print("Minimum support count:",minsupport,"(",MINSUPPORTPCT,"% of",baskets,")")

# get tags that meet our minimum support threshold
tagNameQuery = "SELECT DISTINCT tag_name \
                FROM fc_project_tags \
                GROUP BY 1 \
                HAVING COUNT(project_id) >= %s \
                ORDER BY tag_name"
cursor.execute(tagNameQuery,(minsupport))
singletons = cursor.fetchall()

for singleton in singletons:
    allSingletonTags.append(singleton[0])

findDoubletons()
findTripletons()
generateRules()

db.close()

8)小结
内容会直接打印在console,但是生成频繁项会保存在mysql中。

4 运行结果 只显示console上的部分

Minimum support count: 2325.5 ( 5 % of 46510 )
======
Frequent doubletons found:
======
C GPL [ 5543 ]
C Linux [ 5653 ]
C POSIX [ 6956 ]
C++ GPL [ 2914 ]
C++ Linux [ 3428 ]
C++ POSIX [ 3502 ]
Communications GPL [ 2578 ]
Dynamic Content Internet [ 3173 ]
Dynamic Content Web [ 3171 ]
English Linux [ 2662 ]
GPL Internet [ 4038 ]
GPL Linux [ 8038 ]
GPL multimedia [ 2883 ]
GPL OS Independent [ 4405 ]
GPL PHP [ 2376 ]
GPL POSIX [ 10069 ]
GPL Software Development [ 3319 ]
GPL Web [ 2901 ]
GPL Windows [ 2605 ]
Internet OS Independent [ 3007 ]
Internet POSIX [ 2832 ]
Internet Web [ 5978 ]
Java OS Independent [ 3436 ]
Java Software Development [ 2360 ]
Libraries Software Development [ 5638 ]
Linux Mac OS X [ 2974 ]
Linux POSIX [ 11903 ]
Linux Software Development [ 2336 ]
Linux Unix [ 2494 ]
Linux Windows [ 5281 ]
Mac OS X Windows [ 3131 ]
multimedia POSIX [ 2539 ]
OS Independent Software Development [ 3566 ]
OS Independent Web [ 2605 ]
POSIX Software Development [ 3503 ]
POSIX Unix [ 2326 ]
POSIX Windows [ 4467 ]
======
Frequent tripletons found:
======
Internet , OS Independent , Web [ 2519 ]
Dynamic Content , Internet , Web [ 3166 ]
GPL , Internet , Web [ 2878 ]
C++ , Linux , POSIX [ 2622 ]
Linux , POSIX , Windows [ 3315 ]
C , Linux , POSIX [ 4629 ]
C , GPL , POSIX [ 4364 ]
GPL , Linux , POSIX [ 7384 ]
C , GPL , Linux [ 3299 ]
======
Association Rules:
======
Dynamic Content , Internet -> Web [S= 0.07 , C= 1.0 , AV= 0.87 ]
Dynamic Content , Web -> Internet [S= 0.07 , C= 1.0 , AV= 0.83 ]
Internet , Web -> Dynamic Content [S= 0.07 , C= 0.53 , AV= 0.46 ]
*
Internet , OS Independent -> Web [S= 0.05 , C= 0.84 , AV= 0.71 ]
Internet , Web -> OS Independent [S= 0.05 , C= 0.42 , AV= 0.2 ]
OS Independent , Web -> Internet [S= 0.05 , C= 0.97 , AV= 0.8 ]
*
GPL , Internet -> Web [S= 0.06 , C= 0.71 , AV= 0.58 ]
GPL , Web -> Internet [S= 0.06 , C= 0.99 , AV= 0.82 ]
Internet , Web -> GPL [S= 0.06 , C= 0.48 , AV= 0.02 ]
*
Linux , POSIX -> Windows [S= 0.07 , C= 0.28 , AV= 0.12 ]
Linux , Windows -> POSIX [S= 0.07 , C= 0.63 , AV= 0.27 ]
POSIX , Windows -> Linux [S= 0.07 , C= 0.74 , AV= 0.39 ]
*
C , GPL -> Linux [S= 0.07 , C= 0.6 , AV= 0.25 ]
C , Linux -> GPL [S= 0.07 , C= 0.58 , AV= 0.12 ]
GPL , Linux -> C [S= 0.07 , C= 0.41 , AV= 0.19 ]
*
GPL , Linux -> POSIX [S= 0.16 , C= 0.92 , AV= 0.56 ]
GPL , POSIX -> Linux [S= 0.16 , C= 0.73 , AV= 0.38 ]
Linux , POSIX -> GPL [S= 0.16 , C= 0.62 , AV= 0.16 ]
*
C , GPL -> POSIX [S= 0.09 , C= 0.79 , AV= 0.43 ]
C , POSIX -> GPL [S= 0.09 , C= 0.63 , AV= 0.17 ]
GPL , POSIX -> C [S= 0.09 , C= 0.43 , AV= 0.21 ]
*
C , Linux -> POSIX [S= 0.1 , C= 0.82 , AV= 0.46 ]
C , POSIX -> Linux [S= 0.1 , C= 0.67 , AV= 0.32 ]
Linux , POSIX -> C [S= 0.1 , C= 0.39 , AV= 0.17 ]
*
C++ , Linux -> POSIX [S= 0.06 , C= 0.76 , AV= 0.4 ]
C++ , POSIX -> Linux [S= 0.06 , C= 0.75 , AV= 0.4 ]
Linux , POSIX -> C++ [S= 0.06 , C= 0.22 , AV= 0.09 ]
*
Internet , OS Independent -> Web [S= 0.05 , C= 0.84 , AV= 0.71 ]
Internet , Web -> OS Independent [S= 0.05 , C= 0.42 , AV= 0.2 ]
OS Independent , Web -> Internet [S= 0.05 , C= 0.97 , AV= 0.8 ]
*
Dynamic Content , Internet -> Web [S= 0.07 , C= 1.0 , AV= 0.87 ]
Dynamic Content , Web -> Internet [S= 0.07 , C= 1.0 , AV= 0.83 ]
Internet , Web -> Dynamic Content [S= 0.07 , C= 0.53 , AV= 0.46 ]
*
GPL , Internet -> Web [S= 0.06 , C= 0.71 , AV= 0.58 ]
GPL , Web -> Internet [S= 0.06 , C= 0.99 , AV= 0.82 ]
Internet , Web -> GPL [S= 0.06 , C= 0.48 , AV= 0.02 ]
*
C++ , Linux -> POSIX [S= 0.06 , C= 0.76 , AV= 0.4 ]
C++ , POSIX -> Linux [S= 0.06 , C= 0.75 , AV= 0.4 ]
Linux , POSIX -> C++ [S= 0.06 , C= 0.22 , AV= 0.09 ]
*
Linux , POSIX -> Windows [S= 0.07 , C= 0.28 , AV= 0.12 ]
Linux , Windows -> POSIX [S= 0.07 , C= 0.63 , AV= 0.27 ]
POSIX , Windows -> Linux [S= 0.07 , C= 0.74 , AV= 0.39 ]
*
C , Linux -> POSIX [S= 0.1 , C= 0.82 , AV= 0.46 ]
C , POSIX -> Linux [S= 0.1 , C= 0.67 , AV= 0.32 ]
Linux , POSIX -> C [S= 0.1 , C= 0.39 , AV= 0.17 ]
*
C , GPL -> POSIX [S= 0.09 , C= 0.79 , AV= 0.43 ]
C , POSIX -> GPL [S= 0.09 , C= 0.63 , AV= 0.17 ]
GPL , POSIX -> C [S= 0.09 , C= 0.43 , AV= 0.21 ]
*
GPL , Linux -> POSIX [S= 0.16 , C= 0.92 , AV= 0.56 ]
GPL , POSIX -> Linux [S= 0.16 , C= 0.73 , AV= 0.38 ]
Linux , POSIX -> GPL [S= 0.16 , C= 0.62 , AV= 0.16 ]
*
C , GPL -> Linux [S= 0.07 , C= 0.6 , AV= 0.25 ]
C , Linux -> GPL [S= 0.07 , C= 0.58 , AV= 0.12 ]
GPL , Linux -> C [S= 0.07 , C= 0.41 , AV= 0.19 ]
*

5 reference:

数据挖掘原理与算法 毛国君 清华大学出版社 P66-P79
Python数据挖掘 概念、方法与实践 Megan Squire
数据库系统概论(第5版) P97
Python数据科学实践指南 纪路 机械工业出版社 #个人认为是非常好的入门书籍

以及一篇比较详尽的理论文https://blog.csdn.net/GarfieldEr007/article/details/51051491

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值