This is the first time I have come across the FP-tree item-set mining algorithm, in my Data Mining course. I was intrigued by how its design reduces the cost of scanning the transactions in a database. The Apriori algorithm scans the whole database up to k times when the largest transaction holds k items, whereas FP-tree reads the data only twice: once to count the frequency of each item and once to build the tree.
Making full use of each transaction is the key to reducing the cost of reading from the database. Initially I thought a tree might work, but my design was closer to a dictionary tree (trie). Built that way, the tree would contain many redundant nodes. Hence we need to abstract the information and eliminate redundancy in the tree.
First, compared with Apriori, FP-tree is more transaction-oriented: it does not generate candidate item sets level by level. Instead it uses the frequency record of single items across all transactions and produces the item sets at the end.
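To make the frequency record concrete, here is a minimal Python sketch of the first scan. The nine transactions and the threshold of 2 are my own assumptions for illustration (a small textbook-style data set), and the names `count_items` and `sort_transaction` are hypothetical helpers, not part of any library.

```python
from collections import Counter

# Assumed example data: nine transactions over items i1..i5.
transactions = [
    ["i1", "i2", "i5"],
    ["i2", "i4"],
    ["i2", "i3"],
    ["i1", "i2", "i4"],
    ["i1", "i3"],
    ["i2", "i3"],
    ["i1", "i3"],
    ["i1", "i2", "i3", "i5"],
    ["i1", "i2", "i3"],
]
MIN_SUPPORT = 2  # assumed support threshold


def count_items(transactions):
    """First scan: count in how many transactions each item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # set() so an item counts once per transaction
    return counts


def sort_transaction(transaction, counts, min_support):
    """Drop infrequent items and order the rest by descending frequency."""
    frequent = [i for i in set(transaction) if counts[i] >= min_support]
    # Ties are broken by item name so the order is deterministic.
    return sorted(frequent, key=lambda i: (-counts[i], i))


counts = count_items(transactions)
ordered = [sort_transaction(t, counts, MIN_SUPPORT) for t in transactions]
```

Sorting every transaction by the same global order is what later lets transactions share prefixes in the tree.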
As the graph above illustrates, each path in this tree represents one or more transactions read from the database. It is worth mentioning that the value of each node is a count: the number of transactions whose path passes through that node, not the number of child nodes it has. Intuitively, the more frequently a single item appears in transactions, the more other items it co-occurs with. Thus, if we put such an item (e.g. i2) in a higher position, it will have more descendants and more transactions can share the same prefix.
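The construction can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the `Node` class, the `build_fp_tree` function and the `header` table are names I made up, and the nine transactions are an assumed example data set rather than the exact data behind the graph.

```python
from collections import Counter


class Node:
    """One tree node: an item, a transaction count, and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> Node


def build_fp_tree(transactions, min_support):
    # First scan: frequency of every single item.
    counts = Counter()
    for t in transactions:
        counts.update(set(t))

    root = Node(None)
    header = {}  # item -> list of nodes labeled with that item

    # Second scan: insert each sorted transaction as a path.
    for t in transactions:
        items = [i for i in set(t) if counts[i] >= min_support]
        items.sort(key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # one more transaction passes through here
            node = child
    return root, header


# Assumed example data, threshold 2.
transactions = [
    ["i1", "i2", "i5"], ["i2", "i4"], ["i2", "i3"],
    ["i1", "i2", "i4"], ["i1", "i3"], ["i2", "i3"],
    ["i1", "i3"], ["i1", "i2", "i3", "i5"], ["i1", "i2", "i3"],
]
root, header = build_fp_tree(transactions, 2)
```

With this data, i2 ends up directly under the root and most transactions share it as a prefix, which is exactly the redundancy the tree is meant to remove.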
Once the tree is constructed, we can find the longest common frequent prefix for a target item. For example, to obtain the frequent item sets containing item ‘i5’, we look up the nodes labeled ‘i5’ and sum their values. If the sum reaches the support threshold (we set the threshold to 2), the common ancestors of those nodes in the tree, i2 and i1, are the result.
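Here is a small Python sketch of that lookup. To keep it short it does not walk a real tree: it takes the prefix paths of the ‘i5’ nodes as input, and the two paths below are assumed values for illustration. The function name `frequent_prefix` is hypothetical.

```python
from collections import Counter


def frequent_prefix(prefix_paths, min_support):
    """Return the ancestors that stay frequent together with the target.

    prefix_paths is a list of (path, count) pairs, one per target node:
    the path holds the ancestors from the root down, and count is the
    value stored in the target node.
    """
    # Support of the target item itself: sum of its node values.
    target_support = sum(count for _, count in prefix_paths)
    if target_support < min_support:
        return []

    # Each ancestor inherits the count of the target node below it.
    ancestor_counts = Counter()
    for path, count in prefix_paths:
        for item in path:
            ancestor_counts[item] += count

    return [item for item, c in ancestor_counts.items() if c >= min_support]


# Assumed prefix paths of the two 'i5' nodes, each with value 1.
i5_paths = [(["i2", "i1"], 1), (["i2", "i1", "i3"], 1)]
prefix = frequent_prefix(i5_paths, min_support=2)
```

In this assumed input i3 occurs on only one of the two paths, so it falls below the threshold and only i2 and i1 remain.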
In the final step, all we need to do is combine the prefix and its subsets with the target item. In this example, we finally get (i1,i5), (i2,i5) and (i1,i2,i5). This step works because every non-empty subset of a frequent item set must itself be frequent.
All in all, with the powerful "tree" data structure, we can rearrange and record information from the database efficiently, and we can access the desired records easily and at lower cost.
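The combination step is easy to sketch with `itertools.combinations` from the Python standard library. The helper name `combine_with_target` is my own, and the shortcut assumes the frequent prefix forms a single path, as it does in this example.

```python
from itertools import combinations


def combine_with_target(prefix, target):
    """Pair every non-empty subset of the prefix with the target item."""
    result = []
    for size in range(1, len(prefix) + 1):
        for subset in combinations(prefix, size):
            result.append(subset + (target,))
    return result


item_sets = combine_with_target(("i1", "i2"), "i5")
for s in item_sets:
    print(s)
```

For a prefix of n items this yields 2^n - 1 item sets, so it stays cheap only while the prefix is short.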
More details to be added in the coming days!