Reading Notes on the Stanford Data Mining Textbook (3rd Edition, English), Chapter 6: Frequent Itemsets

Source: the publicly available English book and slides for the Stanford data mining textbook, 3rd edition.

Chapter 6 Frequent Itemsets

The market-basket model of data is used to describe a common form of many-many relationship between two kinds of objects. On the one hand, we have items, and on the other we have baskets, sometimes called “transactions.” Each basket consists of a set of items (an itemset), and usually we assume that the number of items in a basket is small – much smaller than the total number of items. The number of baskets is usually assumed to be very large, bigger than what can fit in main memory. The data is assumed to be represented in a file consisting of a sequence of baskets. In terms of the distributed file system, the baskets are the objects of the file, and each basket is of type “set of items.”

We assume there is a number s, called the support threshold. If I is a set of items, the support for I is the number of baskets for which I is a subset. We say I is frequent if its support is s or more.

Suppose that we set our threshold at s = 3. Then there are five frequent singleton itemsets: {dog}, {cat}, {and}, {a}, and {training}.


A doubleton cannot be frequent unless both items in the set are frequent by themselves.
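As a minimal sketch of support counting, the following uses a small collection of invented baskets (chosen so that the same five singletons come out frequent at s = 3; these are illustrative, not the book's exact example). It also shows the monotonicity point: the doubleton {cat, dog} can only be frequent because both {cat} and {dog} are.

```python
# Toy baskets (invented for illustration): each basket is a set of words.
baskets = [
    {"cat", "and", "dog", "bites"},
    {"yahoo", "news", "claims", "a", "cat", "mated", "with", "a", "dog"},
    {"cat", "killer", "likely", "is", "a", "big", "dog"},
    {"professional", "free", "advice", "on", "dog", "training"},
    {"cat", "and", "kitten", "training", "and", "behavior"},
    {"dog", "and", "cat", "training", "blog"},
    {"dog", "training", "a", "dog"},
    {"bring", "your", "kitten", "or", "cat", "to", "our", "training", "session"},
]

def support(itemset, baskets):
    """Number of baskets of which `itemset` is a subset."""
    return sum(1 for b in baskets if itemset <= b)

s = 3  # support threshold
items = set().union(*baskets)
frequent_singletons = {i for i in items if support({i}, baskets) >= s}
print(frequent_singletons)        # {'dog', 'cat', 'and', 'a', 'training'}
print(support({"cat", "dog"}, baskets))  # 4 -- the doubleton is also frequent
```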

Applications of frequent-itemset analysis are not limited to market baskets. The same model can be used to mine many other kinds of data. Some examples are:

  1. Related concepts: Let items be words, and let baskets be documents. A basket/document contains those items/words that are present in the document. If we look for sets of words that appear together in many documents, the sets will be dominated by the most common words (stop words). In the example above, even though the intent was to find snippets that talked about cats and dogs, the stop words “and” and “a” were prominent among the frequent itemsets. However, if we ignore all the most common words, then we would hope to find among the frequent pairs some pairs of words that represent a joint concept.
  2. Plagiarism: Let the items be documents and the baskets be sentences. An item/document is “in” a basket/sentence if the sentence is in the document. This arrangement appears backwards, but it is exactly what we need, and we should remember that the relationship between items and baskets is an arbitrary many-many relationship. That is, “in” need not have its conventional meaning: “part of.” In this application, we look for pairs of items that appear together in several baskets. If we find such a pair, then we have two documents that share several sentences in common. In practice, even one or two sentences in common is a good indicator of plagiarism.
  3. Biomarkers: Let the items be of two types – biomarkers such as genes or blood proteins, and diseases. Each basket is the set of data about a patient: their genome and blood-chemistry analysis, as well as their medical history of disease. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.

The confidence of an association rule I → j is the fraction of baskets containing I that also contain j. Thus, we define the interest of I → j to be the difference between its confidence and the fraction of baskets that contain j.
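These two definitions are short enough to compute directly; here is a self-contained sketch with hypothetical baskets (the item names are invented for illustration):

```python
def support(itemset, baskets):
    """Number of baskets of which `itemset` is a subset."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """Fraction of baskets containing I that also contain j."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """Confidence minus the fraction of all baskets containing j."""
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

# Hypothetical baskets for illustration:
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"bread", "butter"}, {"milk", "butter"}]
# Rule {milk} -> bread: confidence 2/3; bread appears in 3/4 of baskets,
# so interest = 2/3 - 3/4 = -1/12 (milk slightly discourages bread here).
```

A rule with interest near zero carries no information: knowing I is in a basket does not change the chance that j is too.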

There is another approach to storing counts that may be more appropriate, depending on the fraction of the possible pairs of items that actually appear in some basket. We can store counts as triples [i, j, c], meaning that the count of pair {i, j}, with i < j, is c. A data structure, such as a hash table with i and j as the search key, is used so we can tell whether there is a triple for a given i and j and, if so, find it quickly. We call this approach the triples method of storing counts.
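In Python, a dictionary keyed on the ordered pair plays the role of that hash table; this sketch (with invented single-letter items) stores a count only for pairs that actually occur:

```python
from collections import defaultdict

# Triples method: store only pairs that actually occur, keyed by (i, j) with i < j.
pair_counts = defaultdict(int)  # hash table mapping (i, j) -> c

def count_basket(basket):
    items = sorted(basket)  # sorting guarantees i < j in each key
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            pair_counts[(items[a], items[b])] += 1

for basket in [{"x", "y"}, {"x", "y", "z"}, {"y", "z"}]:
    count_basket(basket)

print(pair_counts[("x", "y")])  # 2 -- pairs that never occur use no space at all
```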

In the A-Priori Algorithm, one pass is taken for each set-size k. If no frequent itemsets of a certain size are found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop.

The pattern of moving from one size k to the next size k + 1 can be summarized as follows. For each size k, there are two sets of itemsets:

  1. C_k is the set of candidate itemsets of size k – the itemsets that we must count in order to determine whether they are in fact frequent.
  2. L_k is the set of truly frequent itemsets of size k. The pattern of moving from one set to the next and one size to the next is suggested by Fig. 6.4.

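For the pair case (k = 2), the C_1 → L_1 → C_2 → L_2 pattern can be sketched in a few lines. This is only an in-memory illustration; real implementations stream baskets from disk rather than holding them in a list:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs."""
    # Pass 1: count all singletons (C1) and keep the frequent items (L1).
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Pass 2: count only candidate pairs (C2) whose items are both in L1;
    # monotonicity says no other pair can possibly be frequent.
    pair_counts = Counter()
    for b in baskets:
        for pair in combinations(sorted(i for i in b if i in frequent_items), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= s}  # L2

baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
print(apriori_pairs(baskets, s=3))  # {('a', 'b')}
```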

Summary of Chapter 6

  • Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many–many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets.
  • Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Itemsets with support that is at least some threshold are called frequent itemsets.
  • Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
  • The Pair-Counting Bottleneck: To find frequent itemsets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
  • Triangular Matrices: While one could use a two-dimensional array to count pairs, doing so wastes half the space, because there is no need to count pair {i, j} in both the i-j and j-i array elements. By arranging the pairs (i, j) for which i < j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted space, and yet be able to access the count for any pair efficiently.
  • Storage of Pair Counts as Triples: If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples (i, j, c), where c is the count of the pair {i, j}, and i < j. An index structure such as a hash table allows us to find the triple for (i, j) efficiently.
  • Monotonicity of Frequent Itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if an itemset is not frequent, then neither are its supersets.
  • The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets. On the first pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other pairs.
  • Finding Larger Frequent Itemsets: A-Priori and many other algorithms allow us to find frequent itemsets larger than pairs, if we make one pass over the baskets for each size itemset, up to some limit. To find the frequent itemsets of size k, monotonicity lets us restrict our attention to only those itemsets such that all their subsets of size k − 1 have already been found frequent.
  • The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
  • The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes.
  • The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
  • Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion. We can then find the frequent itemsets for the sample, and hope that it is a good representation of the data as a whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample).
  • The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a MapReduce setting.
  • Toivonen’s Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also the negative border – itemsets that have not been found frequent, but all their immediate subsets are. If no member of the negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
  • Frequent Itemsets in Streams: If we use a decaying window with constant c, then we can start counting an item whenever we see it in a basket. We start counting an itemset if we see it contained within the current basket, and all its immediate proper subsets are already being counted. As the window decays, we multiply all counts by 1 − c and eliminate those that fall below 1/2.
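The triangular-matrix layout in the summary has a closed-form index. With items numbered 1 through n and i < j, the book stores the count of {i, j} at position k = (i − 1)(n − i/2) + j − i of a one-dimensional array; a small sketch of that formula:

```python
def triangular_index(i, j, n):
    """Position of pair {i, j} (1 <= i < j <= n) in a 1-dimensional array
    holding all n(n-1)/2 pair counts with no wasted slots.
    Integer form of k = (i-1)(n - i/2) + j - i; the product (i-1)(2n-i)
    is always even, so the division is exact."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + j - i

# For n = 4 the six pairs map one-to-one onto positions 1..6:
positions = [triangular_index(i, j, 4) for i in range(1, 4) for j in range(i + 1, 5)]
print(positions)  # [1, 2, 3, 4, 5, 6]
```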
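The PCY first pass described in the summary can also be sketched compactly. Here `num_buckets` stands in for however many integer counts fit in the main memory left over after item counting, and Python's built-in `hash` stands in for the pair hash function; both are illustrative choices, not the book's specifics:

```python
from collections import Counter
from itertools import combinations

def pcy_pass1(baskets, s, num_buckets):
    """First pass of PCY: count items and, in leftover memory,
    hash each pair to a bucket and count how often each bucket is hit."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for b in baskets:
        items = sorted(b)
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Bitmap of frequent buckets; on pass 2 a candidate pair is counted only
    # if both items are frequent AND the pair hashed to a frequent bucket.
    bitmap = [c >= s for c in bucket_counts]
    return frequent_items, bitmap
```

A pair that is frequent in the data necessarily hashes to a frequent bucket, so the bitmap can only prune pairs, never lose one.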
