【机器学习】建立基于GitHub库的推荐系统引擎

最新推荐文章于 2024-06-07 09:34:46 发布

ChenVast

最新推荐文章于 2024-06-07 09:34:46 发布

阅读量8k

点赞数 1

分类专栏： Machine Learning 机器学习算法理论与实战文章标签： Python 机器学习推荐系统

本文链接：https://blog.csdn.net/ChenVast/article/details/79257689

版权

机器学习算法理论与实战同时被 2 个专栏收录

156 篇文章 27 订阅

订阅专栏

Machine Learning

132 篇文章 28 订阅

订阅专栏

如果不熟悉协同过滤算法的可以查看我的一篇文章：【推荐系统】协同过滤浅入（基于用户/项目/内容/混合方式）

代码存放在我的GitHub：https://github.com/935048000/GitHubRecommendationSystem

开始

该推荐引擎是用于GitHub的库推荐

这里使用GitHub的API，基于协同过滤的推荐系统。

这个推荐系统的任务是获得我所有标星的资料库，然后得到这些库的全部创作者。

再获取这些作者的标星资料库。然后比较已加星的资料库，找到和我最相似的用户。发现最相似的GitHub用户后，把他标星的所有资料库生成一组推荐。

准备

# 导入需要的库
import pandas as pd
import numpy as np
import requests
import json

获取令牌，令牌的获取方法：

访问该URL，进入到Personalaccess tokens（个人访问令牌）页面。

https://github.com/settings/tokens

点击生成新的令牌，勾上repo和user即可，点击生成令牌。

然后得到一串随机串，填入上面的mypw变量里即可。

# 需要自己的GitHub账户和令牌
myun = '935048000'
mypw = 'Tokens'

创建获取资料库函数

该实验的前提是你的标星的资料库有一定的数量，才能有意义。

# 创建一个能拉取我已标星资料库名称的函数
my_starred_repos = []
def get_starred_by_me():
    resp_list = []
    last_resp = ''
    first_url_to_get = 'https://api.github.com/user/starred'
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw))
    last_resp = first_url_resp
    resp_list.append(json.loads(first_url_resp.text))
    
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw))
        last_resp = next_url_resp
        resp_list.append(json.loads(next_url_resp.text))
        
    for i in resp_list:
        for j in i:
            msr = j['html_url']
            my_starred_repos.append(msr)

# 调用函数，获取我已标星的资料库名字,并输出名字列表
get_starred_by_me()
print(my_starred_repos)

共有45个

# 获取每个标星库的用户名，并输出
my_starred_users = []
for ln in my_starred_repos:
    right_split = ln.split('.com/')[1]
    starred_usr = right_split.split('/')[0]
    my_starred_users.append(starred_usr)

print(my_starred_users)

# 去除重复的用户
print(len(set(my_starred_users)))
# 剩下40个

创建标星资料库检索函数

# 构建一个可以检索他们所有标星的资料库的函数
starred_repos = {k:[] for k in set(my_starred_users)}
def get_starred_by_user(user_name):
    starred_resp_list = []
    last_resp = ''
    first_url_to_get = 'https://api.github.com/users/'+ user_name +'/starred'
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw))
    last_resp = first_url_resp
    starred_resp_list.append(json.loads(first_url_resp.text))
    
    while last_resp.links.get('next'):
        next_url_to_get = last_resp.links['next']['url']
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw))
        last_resp = next_url_resp
        starred_resp_list.append(json.loads(next_url_resp.text))
        
    for i in starred_resp_list:
        for j in i:
            sr = j['html_url']
            starred_repos.get(user_name).append(sr)

# 调用函数
for usr in list(set(my_starred_users)):
    print(usr)
    try:
        get_starred_by_user(usr)
    except:
        print('failed for user', usr)

# 为所有被标星的资料库，构建一个特征集，并去除重复的资料库
repo_vocab = [item for sl in list(starred_repos.values()) for item in sl]
repo_set = list(set(repo_vocab))

# 这里就有700多个库了
# 对每位用户和每个资料库建立一个二进制向量，标星=1，不标星=0.
all_usr_vector = []
for k,v in starred_repos.items():
    usr_vector = []
    for url in repo_set:
        if url in v:
            usr_vector.extend([1])
        else:
            usr_vector.extend([0])
    all_usr_vector.append(usr_vector)

# 把这些数据放入一个pandas的DataFrame里
df = pd.DataFrame(all_usr_vector, columns=repo_set, index=starred_repos.keys())

# 查看DataFrame
print(df)

# 需要将自己和他们比较，需要把自己添加进数据框里。
my_repo_comp = []
for i in df.columns:
    if i in my_starred_repos:
        my_repo_comp.append(1)
    else:
        my_repo_comp.append(0)
mrc = pd.Series(my_repo_comp).to_frame(myun).T

# 输出
print(mrc)

# 添加列名并连接到数据框中
mrc.columns = df.columns
fdf = pd.concat([df, mrc])
print(fdf)

# 计算自己和其他用户的相似性，使用scipy库的pearsonr函数
from sklearn.metrics import jaccard_similarity_score
from scipy.stats import pearsonr

计算相似度

# 将数据框中最后一个向量和其他向量进行比较，并生成中心化余弦相似度。
sim_score = {}
for i in range(len(fdf)):
    ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:])
    sim_score.update({i: ss[0]})
sf = pd.Series(sim_score).to_frame('similarity')
print(sf)

相似度排序

# 有一些值为NaN，因为他们没有给任何项目标星，导致计算除以0.
# 对值进行排序，返回最相似的索引列表
sf.sort_values('similarity', ascending=False)

# 第一个是自己，相似度100%，查询和自己最相似的用户,索引等于31的用户。
print(fdf.index[31])

# 查看他的标星的库
fdf.iloc[31,:][fdf.iloc[31,:]==1]

验证

去GitH上验证一下，准确无误

# 创建一个数据框，内容为和我最相似的三位用户已标星的资料库
all_recs = fdf.iloc[[31,17,36,40],:][fdf.iloc[[31,17,36,40],:]==1].fillna(0).T
all_recs[(all_recs==1).all(axis=1)]

str_recs_tmp = all_recs[all_recs[myun]==0].copy()
str_recs = str_recs_tmp.iloc[:,:-1].copy()
str_recs

# 看看是否存在两位共同标星的资料库
str_recs[(str_recs==1).all(axis=1)]
str_recs[str_recs.sum(axis=1)>1]

实验进行到此结束。

更进一步的话，安装每个被推荐的项目的标星数量进行排序

为了改进结果，可以添加基于内容的过滤。

为自己的库建立特征，这些特征可以表明这是个人兴趣。

生成一组单词特征，可以用于生成基于协同过滤的推荐。

如python、machine等等单词标签。确保与我们不太相似的用户可以提供基于自身兴趣的推荐。

ChenVast

关注

1
点赞
踩
9

收藏

觉得还不错? 一键收藏
6
评论
【机器学习】建立基于GitHub库的推荐系统引擎

如果不熟悉协同过滤算法的可以查看我的一篇文章：【推荐系统】协同过滤浅入（基于用户/项目/内容/混合方式）代码存放在我的GitHub：https://github.com/935048000/GitHubRecommendationSystem 开始该推荐引擎是用于GitHub的库推荐这里使用GitHub的API，基于协同过滤的推荐系统。这个推荐系统的任务是获得我所有标星的资...
复制链接

扫一扫