In this chapter, we will look at affinity analysis, which determines when objects occur frequently together. This is also colloquially called market basket analysis, after one of the common use cases: determining when items are frequently purchased together in a store.
In m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客, Predicting Sports Winners with Decision Trees, we looked at an object as a focus and used features to describe that object. In this chapter, the data has a different form. We have transactions where the objects of interest (movies, in this chapter) are used within those transactions in some way. The aim is to discover when objects occur simultaneously. In a case where we wish to work out when two movies are recommended by the same reviewers, we can use affinity analysis.
The key concepts of this chapter are as follows:
- Affinity analysis
- Feature association mining using the Apriori algorithm
- Movie recommendations
- Sparse data formats
Affinity analysis
Affinity analysis is the task of determining when objects are used in similar ways. In the m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客, we focused on whether the objects themselves are similar. The data for affinity analysis is often described in the form of a transaction. Intuitively, this comes from a transaction at a store—determining when objects are purchased together.
However, affinity analysis can be applied to many processes that do not use transactions in this sense:
- Fraud detection
- Customer segmentation
- Software optimization
- Product recommendations
Affinity analysis is usually much more exploratory than classification. At the very least, we often simply rank the results and choose the top five recommendations (or some other number), rather than expect the algorithm to give us a specific answer.
Furthermore, we often don't have the complete dataset we expect for many classification tasks. For instance, in movie recommendation, we have reviews from different people on different movies. However, it is highly unlikely we have each reviewer review all of the movies in our dataset. This leaves an important and difficult question in affinity analysis. If a reviewer hasn't reviewed a movie, is that an indication that they aren't interested in the movie (and therefore wouldn't recommend it) or simply that they haven't reviewed it yet?
Thinking about gaps in your datasets can lead to questions like this. In turn, that can lead to answers that may help improve the efficacy of your approach. As a budding data miner, knowing where your models and methodologies need improvement is key to creating great results.
Algorithms for affinity analysis
We introduced a basic method for affinity analysis in m01_DataMining_Crawling_download file_xpath_contains(text(),“{}“)_sort dict by key_discrete_continu_Linli522362242的专栏-CSDN博客, Getting Started with Data Mining, which tested all of the possible rule combinations. We computed the confidence (how accurate a rule is when it can be applied, computed as the percentage of times the rule holds when its premise applies) and the support (the number of times a rule occurs in the dataset, computed by simply counting the number of samples for which the rule is valid) for each rule, which in turn allowed us to rank them to find the best rules.
from collections import defaultdict
from pprint import pprint

# Now compute for all possible rules
valid_rules = defaultdict( int )
invalid_rules = defaultdict( int )
num_occurences = defaultdict( int )

# X is the purchase matrix from that chapter: X.shape == (100, 5)
# n_samples, n_features = X.shape
for sample in X:                              # rows (transactions)
    for premise in range( n_features ):       # columns (products)
        if sample[premise] == 0:              # sample[col_idx]
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        # map the relationship premise --> conclusion
        for conclusion in range( n_features ):
            if premise == conclusion:         # it makes little sense to measure apple --> apple
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[ (premise, conclusion) ] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[ (premise, conclusion) ] += 1

# num_occurences, sorted by product index
various_foods_bought = dict( sorted( num_occurences.items(),
                                     key=lambda item: item[0],
                                     reverse=False
                                   ) )   # sorted(...) returns [(0, 28), (1, 52), (2, 39), (3, 43), (4, 57)]
various_foods_bought

support = valid_rules
confidence = defaultdict( float )
for premise, conclusion in valid_rules.keys():
    confidence[(premise, conclusion)] = valid_rules[(premise, conclusion)] / num_occurences[premise]

from operator import itemgetter
sorted_support = sorted( support.items(), key=itemgetter(1), reverse=True )
pprint( sorted_support )
However, this approach is not efficient, and that was with just five items for sale. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). With naive rule creation, such as our previous algorithm, the time needed to compute these rules grows exponentially. As we add more items, the time it takes to compute all rules increases significantly faster. Specifically, the total possible number of rules is 2^n - 1. For our five-item dataset, there are 31 possible rules. For 10 items, it is 1023. For just 100 items, the number has 31 digits. Even drastic increases in computing power couldn't possibly keep up with the increase in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder.
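To get a feel for this growth, here is a small sketch (my own, not part of the original code) that counts the 2^n - 1 non-empty item combinations a naive approach would have to consider:

# count the non-empty item combinations for a few store sizes
for n_items in (5, 10, 100):
    n_combinations = 2 ** n_items - 1
    print( "{} items -> {:,} possible combinations ({} digits)".format(
               n_items, n_combinations, len(str(n_combinations)) ) )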
The classic algorithm for affinity analysis is called the Apriori algorithm. It addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward, which we will see later in the chapter.
The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset. Defining a minimum support level is the key parameter for Apriori. To build a frequent itemset we combine smaller frequent itemsets. For itemset (A, B) to have a support of at least 30, both A and B must occur at least 30 times in the database. This property extends to larger sets as well. For an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D).
These frequent itemsets can be built and possible itemsets that are not frequent (of which there are many) will never be tested. This saves significant time in testing new rules, as the number of frequent itemsets is expected to be significantly fewer than the total number of possible itemsets.
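As a toy illustration of this pruning (the transactions and the min_support value below are made up for this example, not taken from the chapter's dataset), candidate pairs are only ever built from items that are themselves frequent:

from collections import Counter
from itertools import combinations

transactions = [ {"A", "B", "C"},
                 {"A", "B", "D"},
                 {"A", "C"},
                 {"B", "C"},
                 {"A", "B", "C"} ]
min_support = 3

# Step 1: count single items and keep only the frequent ones
item_counts = Counter( item for transaction in transactions for item in transaction )
frequent_items = { item for item, count in item_counts.items() if count >= min_support }
print( frequent_items )   # "D" appears only once, so it is dropped here

# Step 2: candidate pairs are built only from frequent single items,
# so any pair containing "D" is never even generated or counted
pair_counts = { pair: sum( 1 for transaction in transactions
                           if set(pair).issubset(transaction) )
                for pair in combinations( sorted(frequent_items), 2 ) }
frequent_pairs = { pair: count for pair, count in pair_counts.items() if count >= min_support }
print( frequent_pairs )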
Other example algorithms for affinity analysis build on this, or similar concepts, including the Eclat and FP-growth algorithms. There are many improvements to these algorithms in the data mining literature that further improve the efficiency of the method. In this chapter, we will focus on the basic Apriori algorithm.
Overall methodology and choosing parameters
To perform association rule mining for affinity analysis, we first use the Apriori algorithm to generate frequent itemsets. Next, we create association rules (for example, if a person recommended movie X, they would also recommend movie Y) by testing combinations of premises and conclusions within those frequent itemsets.
- 1. For the first stage, the Apriori algorithm needs a value for the minimum support that an itemset needs to be considered frequent. Any itemsets with less support will not be considered. Setting this minimum support too low will cause Apriori to test a larger number of itemsets, slowing the algorithm down. Setting it too high will result in fewer itemsets being considered frequent.
- 2. In the second stage, after the frequent itemsets have been discovered, association rules are tested based on their confidence. We could choose a minimum confidence level, a number of rules to return, or simply return all of them and let the user decide what to do with them.
In this chapter, we will return only rules above a given confidence level. Therefore, we need to set our minimum confidence level. Setting this too low will result in rules that have a high support, but are not very accurate. Setting this higher will result in only more accurate rules being returned, but with fewer rules being discovered overall.
Dealing with the movie recommendation problem
Product recommendation is a big business. Online stores use it to up-sell to customers by recommending other products that they could buy. Making better recommendations leads to better sales. With online stores selling to millions of customers every year, there is a lot of potential money to be made by selling more items to those customers.
Product recommendations, including movies and books, have been researched for many years; however, the field gained a significant boost when Netflix ran the Netflix Prize between 2007 and 2009. That competition aimed to determine whether anyone could predict a user's rating of a film better than Netflix's existing system. The prize went to a team whose solution was just over 10 percent better than the incumbent. While this may not seem like a large improvement, such a gain would net Netflix millions in additional revenue from better movie recommendations over the following years.
Obtaining the dataset
Since the inception of the Netflix Prize, Grouplens, a research group at the University of Minnesota, has released several datasets that are often used for testing algorithms in this area. They have released several versions of a movie rating dataset, which have different sizes. There is a version with 100,000 reviews, one with 1 million reviews and one with 10 million reviews.
The datasets are available from MovieLens | GroupLens and the dataset we are going to use in this chapter is the MovieLens 100K Dataset (with 100,000 reviews). Download this dataset and unzip it in your data folder. Start a new Jupyter Notebook and type the following code:
https://files.grouplens.org/datasets/movielens/ml-100k-README.txt
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC
Websites mainly rely on the following anti-crawler checks:
1. IP blocking: the site monitors how quickly an IP address makes requests; once the rate reaches a configured threshold, the IP is blocked and the crawler can no longer fetch data.
2. Request-header checks: a crawler is not an ordinary user and often lacks normal browser characteristics, so the site inspects the request headers to decide whether the client is a real user or a crawler.
3. CAPTCHA checks: a login CAPTCHA blocks access unless the correct code is entered. Because crawlers can use other tools to recognise CAPTCHAs, sites keep raising the difficulty, from plain numeric codes to mixed codes, slider CAPTCHAs, image CAPTCHAs, and so on.
4. Cookie checks: browsers store cookies, so a site can inspect them to tell whether you are a real user; a poorly disguised crawler will trigger access restrictions.
(Clear cookies when the visit ends?)
5. Trap links: when a crawler fetches a page it identifies and follows every URL on that page, especially crawlers without a clear target. Some sites place links inside CSS or JavaScript that normal users would never visit; these act as traps designed to catch crawlers, and it is easy to fall into them.
import requests
import zipfile

url = 'https://files.grouplens.org/datasets/movielens/ml-100k.zip'
r = requests.get(url)
with open('ml-100k.zip', 'wb') as f:
    f.write( r.content )

# extract the downloaded archive into the 'dataset' folder
with zipfile.ZipFile('ml-100k.zip') as zip_file:
    zip_file.extractall( 'dataset' )
import os
import pandas as pd
ratings_filename = os.path.join( 'dataset/ml-100k', 'u.data' )
Ensure that ratings_filename points to the u.data file in the unzipped folder('dataset/ml-100k').
Loading with pandas
The MovieLens dataset is in good shape; however, there are some changes from the default options of pandas.read_csv that we need to make. To start with, the data is separated by tabs, not commas. Next, there is no heading line. This means the first line in the file is actually data, and we need to manually set the column names.
First, let's see what happens if we load the file without specifying the delimiter; because the data is tab-separated, each whole row ends up as a single string in the first column:
all_ratings = pd.read_csv( ratings_filename,
                           header=None,
                           names=["UserID", "MovieID", "Rating", "Datetime"]
                         )
all_ratings.head()
When loading the file, we set the delimiter parameter to the tab character, tell pandas not to read the first row as the header (with header=None) and to set the column names with given values. Let's look at the following code:
all_ratings = pd.read_csv( ratings_filename, delimiter='\t',
                           header=None,
                           names=["UserID", "MovieID", "Rating", "Datetime"]
                         )
all_ratings.head()
While we won't use it here, you can properly parse the date timestamp using the following line. Dates for reviews can be an important feature in recommendation prediction, as movies that are rated together often have more similar rankings than movies rated separately. Accounting for this can improve models significantly.
all_ratings['Datetime'] = pd.to_datetime( all_ratings['Datetime'], unit='s')
all_ratings.head()
The format given here represents the full matrix, but in a more compact way. The first row indicates that user number 196 reviewed movie number 242, giving it a ranking of 3 (out of five) on December 4, 1997.
Sparse data formats
This dataset is in a sparse format. Each row can be thought of as a cell in a large feature matrix of the type used in m03 Predicting Sports Winners with Decision Trees_NBA_TP_metric_OneHotEncoder_bias_colab_Linli522362242的专栏-CSDN博客, where rows are users and columns are individual movies. The first column would be each user's review of the first movie, the second column would be each user's review of the second movie, and so on.
len( set( all_ratings['UserID'] ) ), len( set( all_ratings['MovieID'] ) )
There are around 1,000 users and 1,700 movies in this dataset, which means that the full matrix would be quite large (nearly 2 million entries). We may run into issues storing the whole matrix in memory, and computing on it would be troublesome. However, this matrix has the property that most cells are empty, that is, there is no review for most movies by most users. For example, there is no review of movie number 675 by user number 213, nor for most other combinations of user and movie.
all_ratings[ (all_ratings['MovieID']==675) &
(all_ratings['UserID']==213)
]
Any combination of user and movie that isn't in this database is assumed to not exist. This saves significant space, as opposed to storing a bunch of zeroes in memory. This type of format is called a sparse matrix format. As a rule of thumb, if you expect about 60 percent or more of your dataset to be empty or zero, a sparse format will take less space to store.
When computing on sparse matrices, the focus isn't usually on the data we don't have—comparing all of the zeroes. We usually focus on the data we have and compare those.
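As a rough, hedged sketch of the difference (this assumes the scipy library is available; it is not used elsewhere in this chapter), we can build a sparse matrix from the ratings we just loaded and compare its memory footprint with that of the equivalent dense matrix:

import numpy as np
from scipy.sparse import coo_matrix

n_users = all_ratings['UserID'].max() + 1
n_movies = all_ratings['MovieID'].max() + 1

# only the 100,000 observed ratings are stored, as (row, column, value) triples
sparse_ratings = coo_matrix( ( all_ratings['Rating'].values,
                               ( all_ratings['UserID'].values, all_ratings['MovieID'].values ) ),
                             shape=(n_users, n_movies) )

dense_bytes = n_users * n_movies * sparse_ratings.dtype.itemsize
sparse_bytes = ( sparse_ratings.data.nbytes + sparse_ratings.row.nbytes + sparse_ratings.col.nbytes )
print( "Dense matrix: about {:,} bytes; sparse COO matrix: about {:,} bytes".format(
           dense_bytes, sparse_bytes ) )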
Understanding the Apriori algorithm and its implementation
The goal of this chapter is to produce rules of the following form: if a person recommends this set of movies, they will also recommend this movie. We will also discuss extensions where a person who recommends a set of movies is likely to recommend another particular movie.
To do this, we first need to determine if a person recommends a movie. We can do this by creating a new feature Favorable, which is True if the person gave a favorable review to a movie:
all_ratings['Favorable'] = all_ratings['Rating']>3
We can see the new feature by viewing the dataset:
all_ratings[10:15]
We will sample our dataset to form training data. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster. We obtain all reviews from the first 200 users:
ratings = all_ratings[ all_ratings['UserID'].isin( range(200) ) ]
ratings
Next, we can create a dataset of only the favorable reviews in our sample:
favorable_ratings_mask = ratings['Favorable']
favorable_ratings = ratings[ favorable_ratings_mask ]
favorable_ratings
We will be searching the user's favorable reviews for our itemsets. So, the next thing we need is the set of movies to which each user has given a favorable rating. We can compute this by grouping the dataset by the UserID and iterating over the movies in each group:
for k, v in favorable_ratings.groupby('UserID'):
    print('groupby user id : ', k)
    print(v)
... ...
for k, v in favorable_ratings.groupby('UserID')['MovieID']:
    print('Group By user_id: ', k)
    print(v)
... ...   # ( {user_id: movie_id_list}, ..., {user_id: movie_id_list} )
favorable_reviews_by_users = dict( ( k, frozenset(v.values) )
                                   for k, v in favorable_ratings.groupby('UserID')['MovieID']
                                 )
favorable_reviews_by_users
# { 'user_id': reviewed_movie_id_set, ..., 'user_id': reviewed_movie_id_set }
In the preceding code, we stored the values as a frozenset, allowing us to quickly check if a movie has been rated by a user.
Sets are much faster than lists for this type of operation, and we will use them in later code.
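As a quick made-up illustration (not part of the chapter's pipeline), a frozenset is hashable, so it can be used as a dictionary key, and membership and subset tests on it are fast:

liked_movies = frozenset( [1, 5, 7] )
support_counts = { liked_movies: 3 }         # works because frozenset is hashable; a plain set would raise TypeError
print( 5 in liked_movies )                   # fast, hash-based membership test -> True
print( frozenset([1, 5]) <= liked_movies )   # subset test, used heavily by the Apriori code below -> True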
Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:
num_favoriable_by_movie_df = ratings[['MovieID', 'Favorable']].groupby('MovieID').sum()
num_favoriable_by_movie_df
We can see the top five movies by running the following code:
num_favoriable_by_movie_df.sort_values( by='Favorable', ascending=False ).head()
Looking into the basics of the Apriori algorithm
The Apriori algorithm is part of our affinity analysis methodology and deals specifically with finding frequent itemsets within the data. The basic procedure of Apriori builds up new candidate itemsets from previously discovered frequent itemsets. These candidates are tested to see if they are frequent, and then the algorithm iterates as explained here:
- 1. Create initial frequent itemsets by placing each item in its own itemset. Only items with at least the minimum support are used in this step.
- 2. New candidate itemsets are created from the most recently discovered frequent itemsets by finding supersets of the existing frequent itemsets.
- 3. All candidate itemsets are tested to see if they are frequent. If a candidate is not frequent then it is discarded. If there are no new frequent itemsets from this step, go to the last step.
- 4. Store the newly discovered frequent itemsets and go to the second step.
- 5. Return all of the discovered frequent itemsets.
This process is outlined in the following workflow:
Implementing the Apriori algorithm
On the first iteration of Apriori, the newly discovered itemsets will have a length of 2, as they will be supersets of the initial itemsets created in the first step. On the second iteration (after applying the fourth step and going back to step 2), the newly discovered itemsets will have a length of 3. This allows us to quickly identify the newly discovered itemsets, as needed in the second step.
We can store our discovered frequent itemsets in a dictionary, where the key is the length of the itemsets. This allows us to quickly access the itemsets of a given length, and therefore the most recently discovered frequent itemsets, with the help of the following code:
itemsets_frequent_dict = {}
We also need to define the minimum support needed for an itemset to be considered frequent. This value is chosen based on the dataset, but try different values to see how they affect the result. I recommend only changing it by 10 percent at a time though, as the time the algorithm takes to run will change significantly! Let's set a minimum support value:
min_support = 50
To implement the first step of the Apriori algorithm, we create an itemset with each movie individually and test if the itemset is frequent. We use frozenset, as they allow us to perform faster set-based operations later on, and they can also be used as keys in our counting dictionary (normal sets cannot).
Let's look at the following example of frozenset code:
####################
for movie_id, row in num_favoriable_by_movie_df.iterrows():
    print( movie_id, row )
# each iteration yields the group key ('MovieID') and a Series with index=['Favorable']
# Note: we can't use df.iteritems() here
#################### step 1
itemsets_frequent_dict[1] = dict( ( frozenset( (movie_id,) ), row['Favorable'] )
                                  for movie_id, row in num_favoriable_by_movie_df.iterrows()
                                  if row['Favorable'] > min_support
                                )
itemsets_frequent_dict[1]   # { frozenset({movie_id}): Favorable_frequency, ... }
We implement the 2nd and 3rd steps together for efficiency by creating a function that takes the newly discovered frequent itemsets, creates the supersets, and then tests if they are frequent. First, we set up the function to perform these steps:
from collections import defaultdict

# k_1_itemsets : itemsets_frequent_dict[k-1]
# { frozenset({movie_id}): favorable_frequency, ..., frozenset({movie_id}): favorable_frequency }
def find_frequent_itemsets( favorable_reviews_by_users, k_1_itemsets, min_support ):
    counts = defaultdict( int )
    # favorable_reviews_by_users :
    # { user_id: reviewed_movie_id_set, ..., user_id: reviewed_movie_id_set }
    for user_id, reviewed_movie_id_set in favorable_reviews_by_users.items():
        # k_1_itemsets are the most recently discovered frequent itemsets
        for itemset in k_1_itemsets:
            if itemset.issubset( reviewed_movie_id_set ):
                # '-' : set difference
                for other_reviewed_movie in reviewed_movie_id_set - itemset:
                    # '|' : set union -- extend the itemset with one extra movie
                    current_superset = itemset | frozenset( (other_reviewed_movie,) )
                    counts[current_superset] += 1
    return dict( [ (itemset, frequency)
                   for itemset, frequency in counts.items()
                   if frequency >= min_support
                 ]
               )
- In keeping with our rule of thumb of reading through the data as little as possible, we iterate over the dataset once per call to this function. While this doesn't matter too much in this implementation (our dataset is relatively small), it is a good practice to get into for larger applications. We iterate over all of the users and their reviews:
for user_id, reviewed_movie_id_set in favorable_reviews_by_users.items():
This traverses every user_id and its corresponding reviewed_movie_id_set.
- Next, we go through each of the previously discovered itemsets (k_1_itemsets) and check whether it is a subset of the current user's set of reviews. If it is, the user has reviewed every movie in the itemset. Let's look at the code:
for itemset in k_1_itemsets:
    # 'user_id': reviewed_movie_id_set
    if itemset.issubset( reviewed_movie_id_set ):
- We can then go through each individual movie the user has reviewed that isn't in the itemset, create a superset from it, and record in our counting dictionary that we saw this particular itemset. Let's look at the code:
for other_reviewed_movie in reviewed_movie_id_set - itemset:   # '-' : set difference
    # create a union set: itemset U {each element of reviewed_movie_id_set - itemset}
    current_superset = itemset | frozenset( (other_reviewed_movie,) )   # '|' : set union
    counts[current_superset] += 1
- We end our function by testing which of the candidate itemsets have enough support to be considered frequent, and return only those:
return dict( [ (itemset, frequency)
               for itemset, frequency in counts.items()
               if frequency >= min_support
             ]
           )
To run our code, we now create a loop that iterates over the steps of the Apriori algorithm, storing the new itemsets as we go. In this loop, k represents the length of the soon-to-be discovered frequent itemsets (which will be stored in itemsets_frequent_dict), allowing us to access the most recently discovered ones by looking in our itemsets_frequent_dict dictionary using the key k - 1. We create the frequent itemsets and store them in our dictionary by their length. Let's look at the code:
import sys

itemsets_frequent_dict = {}   # frequent itemsets, keyed by length: { frozenset({movie_id}): Favorable_frequency }
min_support = 50

# k=1 candidates are the movies with more than min_support favourable reviews
itemsets_frequent_dict[1] = dict( ( frozenset( (movie_id,) ), row['Favorable'] )
                                  for movie_id, row in num_favoriable_by_movie_df.iterrows()
                                  if row['Favorable'] > min_support
                                )
print( "There are {} movies with more than {} favorable reviews".format(
           len(itemsets_frequent_dict[1]), min_support ) )
sys.stdout.flush()

# favorable_reviews_by_users :
# { user_id: favorable_movie_id_set, ..., user_id: favorable_movie_id_set }
for k in range( 2, 20 ):
    # Generate candidates of length k, using the frequent itemsets of length k-1
    # Only store the frequent itemsets
    cur_itemsets_frequent_dict = find_frequent_itemsets( favorable_reviews_by_users,
                                                         itemsets_frequent_dict[k-1],
                                                         min_support
                                                       )
    if len(cur_itemsets_frequent_dict) == 0:
        print( "Did not find any frequent itemsets of length {}".format(k) )
        sys.stdout.flush()
        break
    else:
        print( "I found {} frequent itemsets of length {}".format(
                   len(cur_itemsets_frequent_dict), k ) )
        sys.stdout.flush()
        itemsets_frequent_dict[k] = cur_itemsets_frequent_dict

# We aren't interested in the itemsets of length 1, so remove those
del itemsets_frequent_dict[1]
itemsets_frequent_dict
... ...
Sort nested dictionaries
1. Sort the internal dictionary first
itemsets_frequent_dict = { length: dict( sorted( itemset_counts.items(),
                                                 key=lambda item: item[1],
                                                 reverse=True
                                               ) )
                           for length, itemset_counts in itemsets_frequent_dict.items()
                         }
itemsets_frequent_dict
... ...
... ...
2. Sort the external dictionary
itemsets_frequent_dict = dict( sorted( itemsets_frequent_dict.items(),
                                       key=lambda x: max( x[1].values() ),
                                       reverse=True
                                     )
                             )   # sorted() returns a list, so wrap it back into a dict
itemsets_frequent_dict
... ...
... ...
... ...
- We want to break out of the preceding loop if we didn't find any new frequent itemsets (and also print a message to let us know what is going on):
if len(cur_itemsets_frequent_dict) == 0:
    print( "Did not find any frequent itemsets of length {}".format(k) )
    sys.stdout.flush()
    break
We use sys.stdout.flush() to ensure that the printouts happen while the code is still running. Sometimes, in large loops in particular cells, the printouts will not happen until the code has completed. Flushing the output in this way ensures that the printout happens when we want. Don't do it too much though—the flush operation carries a computational cost (as does printing) and this will slow down the program.
- If we do find frequent itemsets, we print out a message to let us know the loop will be running again. This algorithm can take a while to run, so it is helpful to know that the code is still running while you wait for it to complete! Let's look at the code:
else:
    print( "I found {} frequent itemsets of length {}".format(
               len(cur_itemsets_frequent_dict), k ) )
    sys.stdout.flush()
Finally, after the end of the loop, we are no longer interested in the first set of itemsets: these are itemsets of length one, which won't help us create association rules, since we need at least two items to form a rule. Let's delete them:
# We aren't interested in the itemsets of length 1, so remove those
del itemsets_frequent_dict[1]
You can now run this code. It may take a few minutes, more if you have older hardware. If you find you are having trouble running any of the code samples, take a look at using an online cloud provider for additional speed. Details about using the cloud to do the work are given in Appendix, Next Steps.
num_vary_set = 0
for k in itemsets_frequent_dict:
    num_vary_set += len( itemsets_frequent_dict[k] )
num_vary_set
The preceding code returns 2968 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because of the increasing number of possible combinations. After a while, the large number of combinations no longer has the support necessary to be considered frequent, so the count shrinks. This shrinking is the benefit of the Apriori algorithm. If we searched all possible itemsets (not just the supersets of frequent ones), we would be testing thousands of times more itemsets to see if they are frequent.
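To see that growth and shrinkage directly, a short snippet (my own, not from the original text) can print the number of frequent itemsets found at each length:

for length in sorted( itemsets_frequent_dict ):
    print( "Length {}: {} frequent itemsets".format( length, len(itemsets_frequent_dict[length]) ) )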
Extracting association rules
After the Apriori algorithm has completed, we have a list of frequent itemsets. These aren't exactly association rules, but they can easily be converted into rules. A frequent itemset is a set of items with a minimum support, while an association rule has a premise and a conclusion; the underlying data is the same for both.
We can make an association rule from a frequent itemset by taking one of the movies in the itemset and denoting it as the conclusion. The other movies in the itemset will be the premise. This will form rules of the following form: if a reviewer recommends all of the movies in the premise, they will also recommend the conclusion.
For each itemset, we can generate a number of association rules by setting each movie to be the conclusion and the remaining movies as the premise.
In code, we first generate a list of all of the rules from each of the frequent itemsets, by iterating over the discovered frequent itemsets of each length:
# Now we create the association rules.
# First, they are candidates until the confidence has been tested
candidate_rules = []
for itemset_length, itemset_counts_dict in itemsets_frequent_dict.items():
    for itemset in itemset_counts_dict.keys():
We then iterate over every movie in this itemset, using it as our conclusion. The remaining movies in the itemset are the premise. We save the premise and conclusion as our candidate rule:
# Now we create the association rules.
# First, they are candidates until the confidence has been tested
candidate_rules = []
for itemset_length, itemset_counts_dict in itemsets_frequent_dict.items():
    for itemset in itemset_counts_dict.keys():
        for conclusion in itemset:
            premise = itemset - set( (conclusion,) )
            candidate_rules.append( (premise, conclusion) )
print( "There are {} candidate rules".format( len(candidate_rules) ) )
This returns a very large number of candidate rules. We can see some by printing out the first few rules in the list:
print( candidate_rules[:5] )
For comparison, this is the result when the itemsets_frequent_dict dictionary has not been sorted:
print( candidate_rules[:5] )
In these rules, the first part (the frozenset) is the set of movies in the premise, while the number after it is the conclusion. In the first case, if a reviewer recommends movie_id 7, they are also likely to recommend movie_id 1.
Next, we compute the confidence of each of these rules. This is performed much like in m01_DataMining_Crawling_download file_xpath_contains(text(),“{}“)_sort dict by key_discrete_continu_Linli522362242的专栏-CSDN博客, Getting Started with Data Mining, with the only changes being those necessary for computing using the new data format.
# Now, we compute the confidence of each of these rules.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user_id, reviewed_movie_id_set in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset( reviewed_movie_id_set ):
            if conclusion in reviewed_movie_id_set:
                correct_counts[ candidate_rule ] += 1
            else:
                incorrect_counts[ candidate_rule ] += 1

rule_confidence = { candidate_rule:
                        correct_counts[candidate_rule] / float( correct_counts[candidate_rule] +
                                                                incorrect_counts[candidate_rule] )
                    for candidate_rule in candidate_rules
                  }
- The process starts by creating dictionaries to store how many times we see the premise leading to the conclusion (a correct example of the rule) and how many times it doesn't (an incorrect example). Let's look at the code:
# Now, we compute the confidence of each of these rules.
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
- We iterate over all of the users, their favorable reviews, and over each candidate association rule:
for user_id, reviewed_movie_id_set in favorable_reviews_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
- We then test to see if the premise is applicable to this user. In other words, did the user favorably review all of the movies in the premise? Let's look at the code:
if premise.issubset( reviewed_movie_id_set ):
- If the premise applies, we see if the conclusion movie was also rated favorably. If so, the rule is correct in this instance. If not, it is incorrect. Let's look at the code:
if conclusion in reviewed_movie_id_set:
    correct_counts[ candidate_rule ] += 1
else:
    incorrect_counts[ candidate_rule ] += 1
- We then compute the confidence for each rule by dividing the correct count by the total number of times the rule was seen:
rule_confidence = { candidate_rule:
                        correct_counts[candidate_rule] / float( correct_counts[candidate_rule] +
                                                                incorrect_counts[candidate_rule] )
                    for candidate_rule in candidate_rules
                  }
# Choose only rules above a minimum confidence level
min_confidence = 0.9
# Filter out the rules with poor confidence
rule_confidence_dict = { rule: confidence
                         for rule, confidence in rule_confidence.items()
                         if confidence > min_confidence
                       }
print( len(rule_confidence_dict) )
Now we can print the top five rules by sorting this confidence dictionary and printing the results:
from operator import itemgetter
sorted_confidence = sorted( rule_confidence_dict.items(),
                            key=itemgetter(1),
                            reverse=True
                          )
for index in range(5):
    print( "Rule #{0}".format(index + 1) )
    # candidate_rules.append( (premise, conclusion) )
    ( premise, conclusion ) = sorted_confidence[index][0]   # rule
    print( "Rule: If a person recommends {0} they will also recommend {1}".format( premise,
                                                                                   conclusion ) )
    print( " - Confidence: {0:.3f}".format( rule_confidence_dict[ (premise, conclusion) ] ) )
    print("")
... ...
... ...
... ...
... ...
... ...
... ...
The resulting printout shows only the movie IDs, which isn't very helpful without the names of the movies as well. The dataset came with a file called u.item, which stores the movie names and their corresponding MovieID (as well as other information, such as the genre).
We can load the titles from this file using pandas. Additional information about the file and categories is available in the README that came with the dataset. The data in the files is in CSV format, but with data separated by the | symbol; it has no header and the encoding is important to set. The column names were found in the README file.
u.item file
README file
movie_name_filename = os.path.join( 'dataset/ml-100k', "u.item" )
movie_name_data = pd.read_csv( movie_name_filename,
delimiter="|",
header=None,
encoding = "mac-roman"
)
movie_name_data.columns = [ "MovieID", "Title", "Release Date", "Video Release",
"IMDB", "<UNK>", "Action", "Adventure", "Animation",
"Children's", "Comedy", "Crime", "Documentary", "Drama",
"Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
"Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movie_name_data.head(n=5)
Getting the movie title is important, so we will create a function that will return a movie's title from its MovieID, saving us the trouble of looking it up each time. Let's look at the code:
def get_movie_name( movie_id ):
    title_object = movie_name_data[ movie_name_data['MovieID'] == movie_id ]['Title']
    title = title_object.values[0]
    return title

get_movie_name(4)
- We look up the movie_name_data DataFrame for the given MovieID and keep only the Title column.
- We use the values attribute to get the actual value (and not the pandas Series object that is currently stored in title_object). We are only interested in the first value; there should only be one title for a given MovieID anyway!
- We end the function by returning the title as needed.
In a new IPython Notebook cell, we adjust our previous code for printing out the top rules to also include the titles:
from operator import itemgetter
sorted_confidence = sorted( rule_confidence_dict.items(),
                            key=itemgetter(1),
                            reverse=True
                          )
for index in range(5):
    print( "Rule #{0}".format(index + 1) )
    # candidate_rules.append( (premise, conclusion) )
    ( premise, conclusion ) = sorted_confidence[index][0]   # rule
    premise_names = ', '.join( get_movie_name(idx)
                               for idx in sorted(premise, reverse=False) )
    conclusion_names = get_movie_name( conclusion )
    print( "Rule: If a person recommends {0} they will also recommend {1}".format( premise_names,
                                                                                   conclusion_names ) )
    print( " - Confidence: {0:.3f}".format( rule_confidence_dict[ (premise, conclusion) ] ) )
    print("")
The result is much more readable (there are still some issues, but we can ignore them for now):
Evaluation
In a broad sense, we can evaluate the association rules using the same concept as for classification. We use a test set of data that was not used for training, and evaluate our discovered rules based on their performance in this test set.
To do this, we will compute the test set confidence, that is, the confidence of each rule on the testing set.
We won't apply a formal evaluation metric in this case; we simply examine the rules and look for good examples.
First, we extract the test dataset, which is all of the records we didn't use in the training set. We used the first 200 users (by ID value) for the training set, and we will use all of the rest for the testing dataset. As with the training set, we will also get the favorable reviews for each of the users in this dataset. Let's look at the code:
# Evaluation using test data
test_dataset = all_ratings[~all_ratings['UserID'].isin( range(200) ) ]
test_favorable = test_dataset[ test_dataset["Favorable"] ]
test_favorable_by_users = dict( ( k, frozenset(v.values) )
for k, v in test_favorable.groupby("UserID")["MovieID"]
)
test_favorable_by_users
#{'user_id':reviewed_movie_id_set,... ,'user_id':reviewed_movie_id_set}
... ...
We then count the correct instances where the premise leads to the conclusion, in the same way we did before. The only change here is the use of the test data instead of the training data. Let's look at the code:
correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user_id, reviewed_movie_id_set in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        if premise.issubset( reviewed_movie_id_set ):
            if conclusion in reviewed_movie_id_set:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
Next, we compute the confidence of each rule from the correct counts. Let's look at the code:
test_confidence = { candidate_rule:
                        correct_counts[candidate_rule] / float( correct_counts[candidate_rule] +
                                                                incorrect_counts[candidate_rule] )
                    for candidate_rule in rule_confidence
                  }
print( len(test_confidence) )
Finally, we print out the best association rules with the titles instead of the movie IDs.
for index in range(10):
    print( "Rule #{0}".format(index + 1) )
    (premise, conclusion) = sorted_confidence[index][0]
    premise_names = ", ".join( get_movie_name(idx)
                               for idx in premise )
    conclusion_name = get_movie_name( conclusion )
    print( "Rule: If a person recommends {0} they will also recommend {1}".format( premise_names,
                                                                                   conclusion_name ) )
    # a value of -1 indicates that the rule was not found in that confidence dictionary at all
    print( " - Train Confidence: {0:.3f}".format( rule_confidence.get( (premise, conclusion), -1 ) ) )
    print( " - Test Confidence: {0:.3f}".format( test_confidence.get( (premise, conclusion), -1 ) ) )
    print("")
We can now see which rules are most applicable in new unseen data:
The 8th rule, for instance, has a perfect confidence in the training data, but it is only accurate in 68.2 percent of cases for the test data. Many of the other rules in the top 10 have high confidences in test data though, making them good rules for making recommendations.
If you look through the rest of the rules, some will have a test confidence of -1. Confidence values are always between 0 and 1; a value of -1 indicates that the particular rule wasn't found in the test dataset at all.
Summary
In this chapter, we performed affinity analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets.
The use of the Apriori algorithm was necessary due to the size of the dataset. While in Cp1m01_DataMining_Crawling_download file_xpath_contains(text(),“{}“)_sort dict by key_discrete_continu_Linli522362242的专栏-CSDN博客, Getting Started With Data Mining, we used a brute-force approach, the exponential growth in the time needed to compute those rules required a smarter approach. This is a common pattern for data mining: we can solve many problems in a brute force manner, but smarter algorithms allow us to apply the concepts to larger datasets.
We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data—a testing set. From what we discussed in the previous chapters, we could extend this concept to use cross-fold validation to better evaluate the rules. This would lead to a more robust evaluation of the quality of each rule.
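As a hedged sketch of that extension (the five-fold split, the confidence_on_users helper, and the fold loop below are my own illustration, not code from the chapter), we could split users into folds and recompute each rule's confidence on every held-out fold:

import numpy as np
from collections import defaultdict

def confidence_on_users( favorable_by_users, rules ):
    # re-count correct / total applications of each rule for one group of users
    correct = defaultdict(int)
    total = defaultdict(int)
    for reviewed_movie_id_set in favorable_by_users.values():
        for rule in rules:
            premise, conclusion = rule
            if premise.issubset( reviewed_movie_id_set ):
                total[rule] += 1
                correct[rule] += int( conclusion in reviewed_movie_id_set )
    return { rule: correct[rule] / float(total[rule]) for rule in rules if total[rule] > 0 }

user_folds = np.array_split( np.sort( all_ratings['UserID'].unique() ), 5 )   # 5 user-based folds
for fold_number, fold_users in enumerate( user_folds ):
    fold_ratings = all_ratings[ all_ratings['UserID'].isin( fold_users ) ]
    fold_favorable = fold_ratings[ fold_ratings['Favorable'] ]
    favorable_by_users = dict( ( k, frozenset(v.values) )
                               for k, v in fold_favorable.groupby('UserID')['MovieID'] )
    fold_confidence = confidence_on_users( favorable_by_users, candidate_rules )
    print( "Fold {}: confidence computed for {} applicable rules".format(
               fold_number + 1, len(fold_confidence) ) )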
So far, all of our datasets have been in terms of features. However, not all datasets are "pre-defined" in this way. In the next chapter, we will look at scikit-learn's transformers (they were introduced in Chapter 3, Predicting Sports Winners with Decision Trees) as a way to extract features from data. We will discuss how to implement our own transformers, extend existing ones, and concepts we can implement using them.