Study Notes: A Text-Based Recommender System

在平凡生活中挣扎

Categories: python, recommender systems. Tags: learning, data mining, artificial intelligence

Article link: https://blog.csdn.net/qq_29220369/article/details/125163835


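I. Tasks to complete:

1. Data preprocessing (most of the 1.8 GB dataset is irrelevant to this task, so the unrelated entries have to be removed)

2. Text cleaning (raw text cannot be used as-is; stop-word removal, text filtering, and regular-expression cleanup are all needed)

3. Matrix factorization (SVD vs. NMF; which one works better has to be tested. SVD may do better on other tasks, but on this project NMF happened to be stronger)

4. LDA topic model (an unsupervised workhorse that comes up often in text analysis; because it needs no labels, it applies widely)

5. Building the recommendation engine (in essence, computing similarities and returning the closest articles as recommendations)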

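## Toolkits used

- numpy and pandas: the essentials, no introduction needed.

- gensim: a go-to library for text processing and topic modeling; its preprocessing utilities and its LDA model can be called directly.

- sklearn: provides NMF and SVD out of the box; one of the most widely used machine-learning packages.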

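1. Text preprocessing

This step is mostly a handful of dictionary operations.

import pandas as pd

# Load only the first 1000 rows to keep experiments fast
medium = pd.read_csv('Medium_AggregatedData.csv', nrows = 1000)

medium.head()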

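### Beyond the standard routine, preprocessing needs rules designed for the data at hand

- Most of the text is in English, with a small amount in other languages; keep only the English articles

- Recommended articles should be of reasonable quality, so articles with few claps are dropped for now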

# Keep only English articles with at least 25 claps
medium = medium[medium['language'] == 'en']

medium = medium[medium['totalClapCount'] >= 25]

titles = medium['title'].unique()                   # all article titles

tag_dict = {'title': [], 'tags': []}               # tags for each article

# findTags is a helper whose definition is not shown in this post
for title in titles:
    tag_dict['title'].append(title)
    tag_dict['tags'].append(findTags(title))

tag_df = pd.DataFrame(tag_dict)                     # convert to a DataFrame

# Drop duplicate titles
medium = medium.drop_duplicates(subset = 'title', keep = 'first')

# Add the tags to the original DataFrame (addTags is also not shown in this post)
medium['allTags'] = medium['title'].apply(addTags)

# Keep only the columns we need
keep_cols = ['title', 'url', 'allTags', 'readingTime', 'author', 'text']
medium = medium[keep_cols]

# Drop rows with a missing title, then renumber the index
medium = medium.dropna(subset = ['title']).reset_index(drop = True)

print(medium.shape)
medium.head()
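
The helpers `findTags` and `addTags` are called above but never defined in the post. The following is only a guess at what they might look like, assuming the raw table has one row per (article, tag) pair and stores the tag in a column named `tag_name`. Both the column name and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw table: one row per (article, tag) pair.
# The column name 'tag_name' is an assumption, not confirmed by the post.
medium = pd.DataFrame({
    'title':    ['Intro to NMF', 'Intro to NMF', 'Chatbots 101'],
    'tag_name': ['Machine Learning', 'Topic Modeling', 'Chatbots'],
})

def findTags(title):
    """Collect every tag attached to the rows that share this title."""
    rows = medium[medium['title'] == title]
    return rows['tag_name'].unique().tolist()

# Build the title -> tags lookup before de-duplicating by title,
# otherwise all but the first tag of each article would be lost.
tag_df = pd.DataFrame({'title': medium['title'].unique()})
tag_df['tags'] = tag_df['title'].apply(findTags)

def addTags(title):
    """Return the tags of a title as one comma-separated string."""
    tags = tag_df.loc[tag_df['title'] == title, 'tags']
    return ', '.join(tags.iloc[0]) if len(tags) else ''

medium = medium.drop_duplicates(subset='title', keep='first').copy()
medium['allTags'] = medium['title'].apply(addTags)
print(medium[['title', 'allTags']])
```

The order matters in this sketch: the lookup table is built from the full table first, and only then are duplicate titles dropped.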

2. Text cleaning

### Clean the text with regular expressions

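import re
import string

def clean_text(text):
    # Remove URLs; this pattern also removes any token with an inner dot, such as "end.Start"
    text = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', '', text)
    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', ' ', text)
    # Replace punctuation with spaces and convert everything to lowercase
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())
    # Replace newlines with spaces
    text = text.replace('\n', ' ')
    return text

medium['text'] = medium['text'].apply(clean_text)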
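
To see what each regular expression in `clean_text` does, here is a small check on two made-up sentences. The block repeats the function so that it runs on its own; the sample strings are illustrative only.

```python
import re
import string

def clean_text(text):
    # Remove URLs (and, as a side effect, any token with an inner dot)
    text = re.sub(r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', '', text)
    # Remove words that contain digits
    text = re.sub(r'\w*\d\w*', ' ', text)
    # Replace punctuation with spaces and lowercase everything
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())
    # Replace newlines with spaces
    return text.replace('\n', ' ')

sample = "Check https://example.com/post?id=1 now!\nGPT3 and word2vec are Great."
tokens = clean_text(sample).split()
print(tokens)   # → ['check', 'now', 'and', 'are', 'great']

# A missing space after a full stop makes the URL pattern swallow both words.
glued = clean_text("A sentence.Another one").split()
print(glued)    # → ['a', 'one']
```

The second case shows that the URL pattern is aggressive: text with missing spaces after full stops will lose words, which may or may not matter for topic modeling.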

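### Stop-word removal

- A ready-made stop-word list is the usual starting point, but it rarely covers everything a specific task needs, so it has to be extended

- Words can be added by hand one at a time, or chosen statistically, for example the 100 most frequent words in the corpus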

# Assumed source of STOPWORDS; the post does not show the import
from gensim.parsing.preprocessing import STOPWORDS

# Extend the built-in list with task-specific stop words
stop_list = STOPWORDS.union(set(['data', 'ai', 'learning', 'time', 'machine', 'like', 'use', 'new', 'intelligence', 'need', "it's", 'way',
                                 'artificial', 'based', 'want', 'know', 'learn', "don't", 'things', 'lot', "let's", 'model', 'input',
                                 'output', 'train', 'training', 'trained', 'it', 'we', 'don', 'you', 'ce', 'hasn', 'sa', 'do', 'som',
                                 'can']))

# Remove stop words and tokens of two characters or fewer
def remove_stopwords(text):
    kept_words = []
    for word in text.split(' '):
        if word not in stop_list and (len(word) > 2):
            kept_words.append(word)
    return ' '.join(kept_words)

medium['text'] = medium['text'].apply(remove_stopwords)
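
The list above was assembled by hand. The frequency-based alternative mentioned in the bullets could be sketched as follows. The helper name `top_frequent_words` and the toy documents are hypothetical, and counting raw term frequency is only one possible choice (document frequency would be another).

```python
from collections import Counter

def top_frequent_words(texts, n=100):
    """Return the n most frequent whitespace-separated tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return {word for word, _ in counts.most_common(n)}

# Toy corpus: 'data' appears 4 times, 'model' twice, everything else once.
docs = ['data model data train', 'model data chatbot', 'data career']
extra_stops = top_frequent_words(docs, n=2)
print(extra_stops)
```

In the post's setting, the result could be merged into the existing list with something like `stop_list = stop_list.union(top_frequent_words(medium['text'], n=100))`, after checking by eye that no topic-defining word ends up in the set.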

### Stemming

- English words appear in many inflected forms; stemming maps them to a common stem

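# Assumed import; the post does not show it. nltk.stem.PorterStemmer offers the same .stem(word) method
from gensim.parsing.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    # Stem each word and join the results back into one string
    word_list = []
    for word in text.split(' '):
        word_list.append(stemmer.stem(word))
    return ' '.join(word_list)

medium['text'] = medium['text'].apply(stem_text)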

### Preprocessing usually takes a long time, so save the result

medium.to_csv('pre-processed.csv')

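3. TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# stop_list is a frozenset; newer scikit-learn releases require stop_words to be a list
vectorizer = TfidfVectorizer(stop_words = list(stop_list), ngram_range = (1,1))
doc_word = vectorizer.fit_transform(medium['text'])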

# Print the top words of each topic of a fitted model (anything with a components_ attribute)
def display_topics(model, feature_names, no_top_words, no_top_topics, topic_names=None):
    count = 0
    for ix, topic in enumerate(model.components_):
        if count == no_top_topics:
            break
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", (ix + 1))
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        count += 1

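4. LDA topic model

### Next, let's try the famous LDA

LDA groups the words of a corpus into topics without supervision and, for each topic, indicates which words carry the most weight.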


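5. Results summary

import numpy as np

# Column names, used below
column_names = ['title', 'url', 'allTags', 'readingTime', 'author', 'Tech',
                'Modeling', 'Chatbots', 'Deep Learning', 'Coding', 'Business',
                'Careers', 'NLP', 'sum']
# Sum of the topic weights of each document
# (docs_nmf is the document-topic matrix from NMF; the author's fitting code is not shown in this post)
topic_sum = pd.DataFrame(np.sum(docs_nmf, axis = 1))
# Topic weights as a DataFrame
doc_topic_df = pd.DataFrame(data = docs_nmf)
# Combine article metadata, topic weights, and the sum into one table
doc_topic_df = pd.concat([medium[['title', 'url', 'allTags', 'readingTime', 'author']], doc_topic_df, topic_sum], axis = 1)
doc_topic_df.columns = column_names
# Drop documents whose topic weights are all zero, then drop the helper column
doc_topic_df = doc_topic_df[doc_topic_df['sum'] != 0].drop(columns = 'sum')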
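
The task list describes the recommendation engine as a similarity computation, but the post stops before showing it. One common way to do this, sketched here under the assumption that each article is represented by its row of topic weights, is to rank the other articles by cosine similarity. The function name `recommend` and the toy matrix are hypothetical.

```python
import numpy as np

def recommend(doc_topic, titles, query_index, top_n=3):
    """Rank the other articles by cosine similarity of their topic vectors."""
    vectors = np.asarray(doc_topic, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero for empty rows
    unit = vectors / norms                   # rows scaled to unit length
    scores = unit @ unit[query_index]        # cosine similarity to the query article
    scores[query_index] = -np.inf            # never recommend the query itself
    best = np.argsort(scores)[::-1][:top_n]
    return [(titles[i], float(scores[i])) for i in best]

# Toy document-topic weights: 4 articles, 3 topics
titles = ['NMF basics', 'SVD basics', 'Build a chatbot', 'Career advice']
doc_topic = [[0.9, 0.1, 0.0],
             [0.8, 0.2, 0.0],
             [0.1, 0.9, 0.0],
             [0.0, 0.1, 0.9]]

recs = recommend(doc_topic, titles, query_index=0, top_n=2)
print(recs)
```

In the post's setting, `doc_topic` would be the eight topic columns of `doc_topic_df` (for example `doc_topic_df[column_names[5:-1]].to_numpy()`) and `titles` its `title` column.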