Big Data Analysis: Tags
This article continues the analysis started here.
We are going to have a look at the tags used in our 60,000 questions from StackOverflow with Quality Rating. It should give us a better understanding of the situation and, with a bit of work, we might already be able to spot some trends.
Introduction
In this article, we want to do a few things using the Tags field. We want to look at what the bulk of the questions are about, and we also want to see whether some tag combinations are common. All of this will eventually be compared against the quality of the post to try and identify trends.
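As a rough illustration of what "common combinations" could mean in practice, the sketch below counts how often two tags appear on the same question. The tag lists are made up for the example, and counting unordered pairs is only one possible way to define a combination.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tag lists, one per question (not taken from the dataset).
questions = [
    ["python", "pandas"],
    ["python", "pandas", "numpy"],
    ["java", "android"],
]

pair_counts = Counter()
for tags in questions:
    # Sort so that (pandas, python) and (python, pandas) count as the same pair.
    pair_counts.update(combinations(sorted(set(tags)), 2))

# Most frequent pair of co-occurring tags.
print(pair_counts.most_common(1))
```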
To that end, we are going to use lambda functions, build cleaning functions, build a bag of words, create a wordcloud and use nltk's FreqDist.
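As a minimal sketch of how these pieces fit together, the snippet below cleans a few tag strings with a lambda, flattens them into a bag of words and counts frequencies. The sample strings are invented for the example, and collections.Counter stands in for nltk's FreqDist (which subclasses Counter and offers the same most_common method) so that the snippet runs without extra dependencies.

```python
from collections import Counter

# Hypothetical raw values in the "<tag1><tag2>" format used by the Tags field.
raw_tags = [
    "<python><pandas>",
    "<java><android>",
    "<python><numpy><pandas>",
]

# Lambda that strips the angle brackets and splits a string into a list of tags.
split_tags = lambda text: text.replace("><", " ").strip("<>").split()

# Bag of words: one flat list containing every tag occurrence.
bag_of_words = [tag for text in raw_tags for tag in split_tags(text)]

# Counter is used here in place of nltk.FreqDist.
tag_counts = Counter(bag_of_words)
top_two = tag_counts.most_common(2)
print(top_two)
```

With the real data, the same lambda could be applied to each row's Tags value before building the bag of words.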
Imports and cleaning functions
There is nothing too fancy about the cleaning functions, but the one we are going to use for our wordclouds is a little more invasive, to try and get rid of some noise.
from nltk import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (30, 30)


def wc(text):
    """
    Cleaning function to be used with our first wordcloud
    """
    if text:
        tags = text.replace('><', ' ')
        tags = tags.replace('-', '')
        tags = tags.replace('.', 'DOT')
        tags = tags.replace('c++', 'Cpp')
        tags = tags.replace('c#', 'Csharp')
        tags = tags.replace('>', '')
        return tags.replace('<', '')
    else:
        return 'None'


def clean_tags(text):
    """
    Cleaning function for tags
    """
    if text:
        tags = text.replace('><', ' ')
        tags = tags.replace('>', '')
        return tags.replace('<', '')
    else:
        return 'None'
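To see why the wordcloud cleaner rewrites some tags, here is a small check on one made-up tag string. It assumes a tokenizer that splits on non-word characters, approximated here with the regular expression \w+; the actual tokenizer used downstream may differ. The wc function is restated in compact form only so that the snippet runs on its own.

```python
import re


def wc(text):
    """Compact restatement of the wordcloud cleaning function above."""
    if not text:
        return 'None'
    tags = text.replace('><', ' ').replace('-', '').replace('.', 'DOT')
    tags = tags.replace('c++', 'Cpp').replace('c#', 'Csharp')
    return tags.replace('>', '').replace('<', '')


sample = '<c++><asp.net><ruby-on-rails>'  # hypothetical tag string

# Light cleaning only strips the brackets, so punctuation survives.
lightly_cleaned = sample.replace('><', ' ').strip('<>')

# A word-character tokenizer then breaks these tags into fragments.
fragments = re.findall(r'\w+', lightly_cleaned)

# After wc(), each tag survives as a single token.
tokens = re.findall(r'\w+', wc(sample))

print(fragments)
print(tokens)
```

The fragments list splits tags such as c++ and asp.net into pieces, while the tokens list keeps one entry per tag, which is the noise reduction the more invasive cleaner is after.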
Wordclouds
wordcloud() needs a document of space-separated words. We are going to create a list of words then use the