GitHub repository: https://github.com/iwtbs/user_searchquery_analyse
Overall architecture
The pipeline is easiest to follow from the commands that run it:
#python get_novel_info_from_feed_monitor.py ./data/novel_info.txt
#python get_video_info_from_video_film.py ./data/video_info.txt
#python get_star_info_from_video_film.py ./data/star_info.txt
#python stat_searchquery_times.py ./data/mid_searchquerys_20190331_31 ./data/searchquery_times
##############
#python analyse_searchquery.py ./data/novel_info.txt ./data/video_info.txt ./data/game_info.txt ./data/qingse_keyword.txt ./data/searchquery_times ./data/searchquery_times_analyse
#python stat_entity_searchquerynumber_searchquerytimes.py ./data/searchquery_times_analyse ./data/entity_searchquerynumber_searchquerytimes
#python cal_mid_entity_info.py ./data/searchquery_times_analyse ./data/mid_searchquerys_20190331_31 ./data/mid_searchquerys_entitys
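Step-by-step walkthrough
- Obtain novel_info.txt, video_info.txt, and star_info.txt
novel_info.txt: fetched from MySQL; fields: title + hot
video_info.txt: fetched from MySQL; fields: dockey + doctype + hit_count + name + alias_name + serial + alais_serial
star_info.txt: fetched from MySQL; fields: star_id + name + alias_name + hit_count
- stat_searchquery_times.py counts how many times each search query was issued and sorts the result. Input file format: mid + searchquery + times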
searchquery = items3[0]
times = int(items3[2])
searchquery_times_dict[searchquery] = searchquery_times_dict.get(searchquery, 0) + times
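To make the counting and sorting step concrete, here is a minimal sketch. It assumes each input line is simply tab-separated as mid, searchquery, times; the real file layout may well be more nested (the excerpt above indexes into items3), and the function name count_query_times is made up for illustration.

```python
from collections import defaultdict


def count_query_times(lines):
    """Sum the times column per search query and sort by total, descending.

    Assumes tab-separated lines of the form: mid, searchquery, times.
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip rows that do not have exactly three columns
        _mid, searchquery, times = fields
        totals[searchquery] += int(times)
    # Highest count first; ties are broken alphabetically by query.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because the function accepts any iterable of lines, it can be fed an open file object directly.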
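- External files: game_info.txt and qingse_keyword.txt
analyse_searchquery.py combines the files above to analyse search behaviour.
Every category is handled the same way: build a keyword index, then query it. See my earlier blog post "Sensitive-word matching: implementing an Aho-Corasick automaton in Python with esmre". Taking the pornographic-keyword (qingse) category as an example: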
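import esm
import tqdm

def gen_qingse_index(file_path):
    # Build an esm keyword index from a file holding one keyword per line.
    qingse_index = esm.Index()
    # First pass: count the lines so that tqdm can show overall progress.
    with open(file_path, "r") as f:
        line_num = sum(1 for _ in f)
    with tqdm.tqdm(total=line_num) as progress:
        valid_num = 0
        with open(file_path, "r") as f:
            for line in f:
                progress.update(1)
                qingse_index.enter(line.strip())
                valid_num += 1
    print(valid_num)
    qingse_index.fix()  # finalise the index before querying
    return qingse_index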
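def get_match_entity(index, searchquery):
    # Return the distinct matched keywords joined by commas, or '' if none match.
    index_result = index.query(searchquery)
    match_entity_dict = {}
    for (st_end, match_entity) in index_result:
        # Keep only matches that start at an even offset (presumably so that
        # matches stay aligned with 2-byte encoded characters).
        if st_end[0] % 2 == 0:
            match_entity_dict[match_entity] = True
    ret = ''
    if len(match_entity_dict) > 0:
        ret = ','.join(match_entity_dict.keys())
    return ret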
qingse_index = gen_qingse_index(sys.argv[4])
qingse_result = get_match_entity(qingse_index, searchquery)
if len(qingse_result) > 0:
    output += 'qingse'
fw.write(searchquery + '\t' + str(times) + '\t' + output + '\n')
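For readers who want to see what the keyword index does under the hood, below is a rough pure-Python sketch of an Aho-Corasick automaton. It is not esmre's implementation. SimpleIndex and match_entities are hypothetical names; the class only mimics the enter / fix / query calls used above and returns matches in the same ((start, end), keyword) shape that the code above unpacks. It works on characters rather than bytes, so the even-offset check is left out.

```python
from collections import deque


class SimpleIndex(object):
    """Hypothetical pure-Python stand-in for the enter/fix/query calls used above."""

    def __init__(self):
        self.goto = [{}]   # goto[state][char] -> next state; state 0 is the root
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # keywords recognised on reaching each state

    def enter(self, keyword):
        if not keyword:
            return  # ignore empty lines
        state = 0
        for ch in keyword:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        if keyword not in self.out[state]:
            self.out[state].append(keyword)

    def fix(self):
        # Breadth-first pass that fills in failure links and merges outputs.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def query(self, text):
        # Returns [((start, end), keyword), ...] with end exclusive.
        results = []
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for keyword in self.out[state]:
                results.append(((i - len(keyword) + 1, i + 1), keyword))
        return results


def match_entities(index, searchquery):
    # Comma-joined distinct keywords found in the query, '' if none match.
    found = sorted(set(keyword for _, keyword in index.query(searchquery)))
    return ",".join(found)
```

The point of the automaton is that a query is scanned once, no matter how many keywords were entered, which is what makes it practical to run several large keyword lists (novels, videos, games, qingse) over every search query.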
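- stat_entity_searchquerynumber_searchquerytimes.py computes, for each category, how many distinct search queries fall into it, how many times they were searched in total, and what share of the overall totals each figure represents.
fw.write(entity + '\t' + str(searchquerynumber) + '\t' + str(searchquerytimes) + '\t' + str(searchquerynumber*1.0/total_searchquerynumber) + '\t' + str(searchquerytimes*1.0/total_searchquerytimes) + '\n')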
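The aggregation behind that output line can be sketched roughly as follows. This is an illustration under assumptions: entity_stats is a made-up name, the input is taken to be already parsed into (searchquery, times, entities) tuples, and the totals are assumed to count every query once regardless of how many categories it matched.

```python
def entity_stats(records):
    """Per-category query count, search count, and their shares of the totals.

    records: iterable of (searchquery, times, entities) tuples, where
    entities is a list of category labels (possibly empty).
    """
    per_entity = {}
    total_number = 0
    total_times = 0
    for _searchquery, times, entities in records:
        total_number += 1
        total_times += times
        for entity in set(entities):
            number, summed = per_entity.get(entity, (0, 0))
            per_entity[entity] = (number + 1, summed + times)
    rows = []
    for entity in sorted(per_entity):
        number, summed = per_entity[entity]
        rows.append((entity, number, summed,
                     float(number) / total_number,
                     float(summed) / total_times))
    return rows
```

Each returned row has the same five columns as the line written by fw.write above.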
- cal_mid_entity_info.py aggregates entity information per user (mid)
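The script itself is not shown here, so the following is only one plausible reading of this step: join each user's search queries with the categories assigned to those queries, and add up the search counts per user and category. The function name user_entity_info and both input shapes are assumptions made for illustration.

```python
def user_entity_info(user_queries, query_entities):
    """Sum search counts per user and category.

    user_queries: iterable of (mid, searchquery, times) tuples.
    query_entities: dict mapping searchquery -> list of category labels.
    """
    info = {}
    for mid, searchquery, times in user_queries:
        for entity in query_entities.get(searchquery, ()):
            per_user = info.setdefault(mid, {})
            per_user[entity] = per_user.get(entity, 0) + times
    return info
```

Users whose queries match no category do not appear in the result at all.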