Data Science Tweet Analysis – What Tools Are People Talking About?

By Chris Musselle PhD, Mango UK

At Mango we use a variety of tools in-house to address our clients’ business needs, and when these fall within the data science arena, the main candidates we turn to are the R and Python programming languages.

The question of which is the “best” language for doing data science is hotly debated ([link] [link] [link] [link]), with both languages having their pros and cons. However, the capabilities of each are expanding all the time thanks to continuous open source development in both areas.

With both languages becoming increasingly popular for data analysis, we thought it would be interesting to track current trends and see what people are saying about these and other tools for data science on Twitter.

This post is the first of three that will look into the results of our analysis, but first a bit of background.

 

Twitter Analysis

Today many companies are routinely drawing on social media data sources such as Twitter and Facebook to enhance their business decision making in a number of ways. This type of analysis can be a component of market research, an avenue for collecting customer feedback or a way to promote campaigns and conduct targeted advertising.

To facilitate this type of analysis, Twitter offers a variety of Application Programming Interfaces (APIs) that enable an application to programmatically interact with the services provided by Twitter. These APIs currently come in three main flavours.

  • REST API – Allows automated access to searching, reading and writing tweets
  • Streaming API – Allows tracking of multiple users and/or search terms in near real time, though results may only be a sample
  • Twitter Firehose – Allows tracking of all tweets past and future, no limits on search results returned.

These different approaches have different trade-offs. The REST API can only search past tweets, and is limited in how far back you can search as Twitter only keeps the last couple of weeks of data. The Streaming API tracks tweets as they happen, but Twitter only guarantees a sample of all current tweets will be collected [link]. This means that if your search term is very generic and matches a lot of tweets, then not all of these tweets will be returned [link].

The Twitter Firehose addresses the shortcomings of the previous two APIs, but at quite a substantial cost, whereas the other two are free to use. There are also a growing number of third party intermediaries that have access to the Twitter Firehose, and sell on the Twitter data they collect [link].

 

Our Approach

We chose to use the Streaming API to collect tweets containing the hashtags “python” and/or “rstats” and/or “datascience” over a 10-day period.

To harvest the data, we wrote a Python script that uses the API and appends tweets to a single file. Command line tools such as csvkit and jq were then used to clean and preprocess the data, with the analysis done in Python using the pandas library.
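The original collection script is not shown in this post. As a rough sketch of the "append tweets to a single file" step, the snippet below writes each tweet as one JSON line and reads the hashtags back. The file name, the helper names `append_tweet` and `load_hashtags`, and the sample tweets are all hypothetical, and it assumes each tweet carries its hashtags under `entities.hashtags[].text`, as in Twitter's classic JSON payload. It does not call the Streaming API itself.

```python
import json
import tempfile
from pathlib import Path


def append_tweet(path, tweet):
    """Append one tweet (a dict) to the file as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(tweet) + "\n")


def load_hashtags(path):
    """Return one set of lowercased hashtags per stored tweet."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            tweet = json.loads(line)
            tags = tweet.get("entities", {}).get("hashtags", [])
            rows.append({t["text"].lower() for t in tags})
    return rows


# Hypothetical sample tweets standing in for what a stream listener would receive.
sample = [
    {"text": "Loving #Python and #Django",
     "entities": {"hashtags": [{"text": "Python"}, {"text": "Django"}]}},
    {"text": "#rstats for #DataScience",
     "entities": {"hashtags": [{"text": "rstats"}, {"text": "DataScience"}]}},
]

out_path = Path(tempfile.mkdtemp()) / "tweets.jsonl"  # hypothetical file name
for tweet in sample:
    append_tweet(out_path, tweet)

tag_sets = load_hashtags(out_path)
print(tag_sets)
```

Keeping one JSON object per line means the file can be appended to indefinitely and later filtered with a line-oriented tool such as jq before loading it for analysis.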

 

Preliminary Results: Hashtag Counts and Co-occurrence

From Figure 1, it is immediately obvious that “python” and “datascience” were more popular hashtags than “rstats” over the time period sampled. Interestingly, though, there was little overlap between these groups.

Figure 1: Venn diagram of tweet counts by hashtag
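The counts behind a Venn diagram like this one can be obtained by grouping tweets according to which combination of tracked hashtags they contain. The sketch below is one possible way to do that with the standard library. The function name `venn_regions` and the sample data are hypothetical and are not the figures from our analysis.

```python
from collections import Counter

TRACKED = frozenset({"python", "rstats", "datascience"})


def venn_regions(tag_sets):
    """Count tweets per exact combination of tracked hashtags."""
    regions = Counter()
    for tags in tag_sets:
        present = frozenset(tags) & TRACKED
        if present:  # ignore tweets with no tracked hashtag
            regions[present] += 1
    return regions


# Hypothetical hashtag sets, one per tweet.
tag_sets = [
    {"python", "django"},
    {"python"},
    {"rstats", "datascience"},
    {"datascience", "bigdata"},
    {"python", "rstats", "datascience"},
]

regions = venn_regions(tag_sets)
for combo, n in sorted(regions.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(" & ".join(sorted(combo)), n)
```

Each key corresponds to one region of the diagram: single-element keys are tweets that used only one tracked hashtag, and larger keys are the overlaps.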

 

This suggests that the majority of tweets that mentioned these subjects either did so in isolation or alongside other hashtags that were not tracked. We can get a sense of which is the case by counting the unique hashtags that occurred alongside each tracked hashtag, as shown in Table 1.

Table 1: Total unique hashtags used per tracked subset

 

These counts show that the “python” hashtag is mentioned alongside many more topics/hashtags than “rstats” and “datascience”. This makes sense when you consider that Python is a general purpose programming language, and as such spans a broader range of application domains than R, which is more statistically focused. In between these is the “datascience” hashtag, a term that relates to many different skillsets and technologies, and so we would expect the number of unique hashtag co-occurrences to be quite high.
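A count of this kind can be sketched as follows: for each tracked hashtag, collect every other hashtag seen in the same tweet and report the size of that set. The function name `unique_cooccurring` and the sample data are hypothetical. The sketch also assumes that other tracked hashtags count as co-occurring, which may differ from how Table 1 was produced.

```python
from collections import defaultdict

TRACKED = ("python", "rstats", "datascience")


def unique_cooccurring(tag_sets, tracked=TRACKED):
    """Number of distinct hashtags seen alongside each tracked hashtag."""
    seen = defaultdict(set)
    for tags in tag_sets:
        for tag in tracked:
            if tag in tags:
                seen[tag] |= set(tags) - {tag}
    return {tag: len(seen[tag]) for tag in tracked}


# Hypothetical hashtag sets, one per tweet.
tag_sets = [
    {"python", "django", "hiring"},
    {"python", "flask"},
    {"rstats", "datascience"},
    {"datascience", "bigdata", "analytics"},
]

unique_counts = unique_cooccurring(tag_sets)
print(unique_counts)
```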

 

So what are people mentioning alongside these hashtags if not these technologies?

Table 2 shows the top hashtags mentioned alongside the three tracked hashtags. Here the numbers in the header are the total number of tweets that contained the tracked hashtag term, plus at least one other hashtag. So the vast majority of tweets occur with multiple hashtags. As can be seen, all three subjects were commonly mentioned alongside other hashtags.

Table 2: Most frequent hashtags co-occurring with the tracked keywords. Numbers in the header are the total number of tweets containing at least one other hashtag in addition to the one tracked.
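A table of this shape can be approximated with `collections.Counter`: for one tracked hashtag, count the tweets that carry at least one additional hashtag (the header number) and tally which hashtags those are. This is a minimal sketch. The function name `top_cooccurring` and the sample data are hypothetical, and it is not necessarily how the table above was built.

```python
from collections import Counter


def top_cooccurring(tag_sets, tracked_tag, n=3):
    """Return (tweets with at least one other hashtag, top-n co-occurring hashtags)."""
    counts = Counter()
    multi = 0
    for tags in tag_sets:
        if tracked_tag not in tags:
            continue
        others = set(tags) - {tracked_tag}
        if others:
            multi += 1
            counts.update(others)
    return multi, counts.most_common(n)


# Hypothetical hashtag sets, one per tweet.
tag_sets = [
    {"rstats", "datascience"},
    {"rstats", "datascience", "bigdata"},
    {"rstats"},
    {"python", "rstats"},
]

multi, top = top_cooccurring(tag_sets, "rstats", n=1)
print(multi, top)
```

In this toy data three of the four "rstats" tweets carry another hashtag, and "datascience" is the most frequent companion.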

As we might expect, many co-occurring hashtags are closely related, though in general it’s interesting to see that “datascience” frequently co-occurs with more general concepts and/or ‘buzzwords’, with technologies mentioned further down the list.

Python, on the other hand, occurs frequently alongside other web technologies, as well as “careers” and “hiring”, which may reflect a high demand for jobs that use Python and these related technologies for web development. Alternatively, it may simply be that many good web developers are active on Twitter, and as such recruitment companies favor this medium of advertising when trying to fill web development positions.

It’s interesting that tweets with the “Rstats” hashtag mentioned “datascience” and “bigdata” more than any other, likely reflecting the growing use of R in this arena. The other co-occurring hashtags for R can be grouped into: those that relate to its domain specific use (“statistics”, “analytics”, “machinelearning” etc.); possible ways of integrating it with other languages and tools (“python”, “excel”, “d3js”); and other ways of referencing R itself (“r”, “rlang”)!

 

Summary

So from looking at the counts of hashtags and their co-occurrences, it looks like:

  • Tweets containing Python or data science were roughly 5 times more frequent than those containing Rstats. There was also little relative overlap in the three hashtags tracked.
  • Tweets containing Python also mention a broader range of other topics, while R is more focused around data science, statistics and analytics.
  • Tweets mentioning data science most commonly include hashtags for general analytics concepts and ‘buzzwords’, with specific technologies only occasionally mentioned.
  • Tweets mentioning Python most commonly include hashtags for web development technologies and are likely the result of a high volume of recruitment advertising.

 

Future Work

So far we have only looked at the hashtag contents of the tweets, but there is much more data within them that can be analysed. Two other key components are the user mentions and the URLs in the message. Future posts will look into both of these to investigate the content being shared, along with who is retweeting/being retweeted by whom.

Translated from: https://www.pybloggers.com/2015/05/data-science-tweet-analysis-what-tools-are-people-talking-about/
