Chatter Charts Methodology

I. Introducing Chatter Charts

Chatter Charts is a sports visualization that mixes statistics with social media data to create a storyboard retelling of the game through the collective fanbase’s perspective.

Chatter Charts splits a sports game’s social media comments into two-minute intervals and treats these comments as if they are part of a book telling a linear story where each interval is a chapter.

With this approach, it can leverage a statistical method called TF-IDF to rank all words at every interval and filter for the best performing one.

II. Understanding TF-IDF

For stats nerds, TF-IDF calculates the relative word count in an interval — Term Frequency — and weights it based on how often that word appears throughout the entire game — Inverse-Document Frequency.

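To make the math concrete, here is a minimal toy example in R. It is an illustration only, not the Chatter Charts pipeline, and it assumes the standard definition where idf is the natural log of the number of intervals divided by the number of intervals containing the word (the same definition {tidytext}'s bind_tf_idf uses later on).

### TOY TF-IDF EXAMPLE (illustration only)
library(dplyr)
library(tibble)

toy_counts <- tribble(
  ~interval, ~word,     ~n,
  1,         "goal",     5,
  1,         "hooking",  3,
  2,         "goal",     4,
  2,         "save",     6
)

n_intervals <- n_distinct(toy_counts$interval)

toy_tf_idf <- toy_counts %>%
  group_by(interval) %>%
  mutate(tf = n / sum(n)) %>%                                # share of the interval's words
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_intervals / n_distinct(interval))) %>%  # rarer across the game = heavier weight
  ungroup() %>%
  mutate(tf_idf = tf * idf)

With this toy data, "goal" appears in every interval, so its idf is 0 and it drops out, while "hooking" keeps a positive tf_idf. That is exactly the behaviour described next.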

In short, TF-IDF does two things well: punishing generic words such as “the”, and allowing modifying words to outperform their subjects, so “hooking” or “dive” can outrank “penalty” if multiple penalties happen in a game.

The two data frames below demonstrate the difference between raw counts and the TF-IDF approach. The left data frame ranks words using a raw count, while the right ranks using TF-IDF. My charts use the top result in every interval.

[Image: On the left, ranking by raw count. On the right, ranking by TF-IDF.]

At a glance, this showcases the strength of TF-IDF, particularly how effective IDF weighting is at punishing common words and letting more reactive, event-specific wording outperform them.

After applying this technique to each interval, here’s a sample of what a final data frame for Chatter Charts looks like:

[Image: sample of the final Chatter Charts data frame]
  • interval is the rounded two-minute interval

  • interval_volume is the number of full text comments inside that interval

  • n is the word’s raw count inside that interval

  • tf is the percentage of that word relative to the total number of words in that interval — ex. 2.4% of words were “scorianov”

  • idf is a weighting based on how often the word occurs throughout the entire corpus — 3.5x is quite high because “scorianov” rarely occurs throughout the entire game

  • tf_idf is tf multiplied by idf

You can read more about TF-IDF from the Wikipedia page.

III. Getting Quality Hockey Comments

The challenge for most data science problems is having quality data. For Chatter Charts, I combine two sources where fans gather to talk about sports during the game.

Reddit Game Threads

[Image: r/Canucks game thread]

A game thread is a dedicated forum where subreddit members can talk about a specific game. Every sports team has a subreddit. Some are larger than others, but the members are obviously hardcore fans.

There is no shortage of live reactions and hockey camaraderie. But, if you know Reddit, it can be quite crass — like r/Canucks’ “WIN DA TURD” second intermission chant.

For Reddit, I scrape all the comments off game threads using Python’s {PRAW} package. All you need is the thread’s URL. If you want the code, let me know in the comments!

Note: You’ll need to create a Reddit web app to use {PRAW}.

Twitter Accounts, Hashtags, and Keywords

[Image: a tweet that uses #GoStars]

Twitter has a much larger professional presence than Reddit. Pundits, fan blogs, and beat reporters share their insights on the game here. So to find quality team-related tweets, I leverage a few techniques.

First, I store a list of team-specific keywords, accounts, and hashtags to search every game. For example, these are the terms I use to find Toronto Maple Leafs tweets.

@MapleLeafs #TMLTalk #LeafsForever #GoLeafsGo #MapleLeafs #LeafsNation #leafs leafs TML

This will cover tweets that mention the team’s main account, use popular hashtags, or have a team-specific keyword in them like “leafs”.

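As a rough sketch of how that search can be issued with {rtweet}, something like the following works; the OR-joined query string and the 18,000-tweet cap are my assumptions, not the exact code behind Chatter Charts.

### TEAM TERM SEARCH (sketch)
library(rtweet)

leafs_terms <- c("@MapleLeafs", "#TMLTalk", "#LeafsForever", "#GoLeafsGo",
                 "#MapleLeafs", "#LeafsNation", "#leafs", "leafs")

leafs_query <- paste(leafs_terms, collapse = " OR ")  # one OR-joined search string

leafs_tweets <- search_tweets(q = leafs_query,
                              n = 18000,              # a single standard search maxes out at 18,000 tweets
                              include_rts = FALSE)    # skip retweets to cut down on duplicates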

Further, I search a game-specific hashtag like #TORvsDAL. I can’t hard-code these, so I write a dynamic line of code to create it every game.

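A sketch of that line; the abbreviation variables are hypothetical stand-ins, since the workflow's real variable names for team abbreviations aren't shown.

### GAME HASHTAG (sketch)
main_abbr <- "TOR"   # assumed abbreviation for main_team
opp_abbr  <- "DAL"   # assumed abbreviation for opponent
game_hashtag <- paste0("#", main_abbr, "vs", opp_abbr)  # "#TORvsDAL"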

Lastly, I have a VIP list of Twitter accounts for each team. It is made up of accounts that tweet a lot about their team, but might not use keywords or hashtags all the time.

To build the VIP list, I search team hashtags and dig into suggested accounts. I collect some active tweeters in the community and simply ask them to nominate the other fan accounts they like to follow.

[Image: building my VIP list by asking the communities for nominations]

I take each VIP user’s first 50 tweets and add them into the Reddit and Twitter data — removing duplicates of course.

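A sketch of that step using {rtweet}'s get_timeline(); the handles in vip_accounts are placeholders, and the de-duplication key assumes the pre-1.0 {rtweet} column naming (status_id) that comes up again later.

### VIP TIMELINES (sketch)
library(rtweet)
library(dplyr)

vip_accounts <- c("fan_account_1", "fan_account_2")    # hypothetical handles from the VIP list

vip_tweets <- get_timeline(vip_accounts, n = 50)       # each account's 50 most recent tweets

all_tweets <- bind_rows(leafs_tweets, vip_tweets) %>%  # leafs_tweets from the search sketch above
  distinct(status_id, .keep_all = TRUE)                # drop tweets the keyword search already caught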

All of this is possible using {rtweet}’s search_tweets() and get_timeline() functions.

Note: You’ll need a Twitter developer account to get access to API calls and make requests using {rtweet}.

Together, these sources net me enough quality comments to produce quality results.

IV. Creating an Efficient Workflow

I’ve built out a workflow where I only need to provide a few details about a game and everything else will populate. It is written 97% in R and 3% in Python — Python only fetches the Reddit comments for me.

This is my command center:

[Image: Canucks POV example: I just need to fill this out and scripts will do the rest.]

In the first chunk, team-specific data is pulled using the main_team and opponent variables. My script looks up colours, logos, and social media info, and tracks down a list of fans who tweet about the team from a metadata.csv I’ve created, seen below.

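The lookup itself can be as simple as filtering that file on the two variables; a minimal sketch, with the column name team being my assumption since only the file is pictured.

### TEAM METADATA LOOKUP (sketch; column names are assumed)
library(readr)
library(dplyr)

metadata <- read_csv("metadata.csv")

main_meta <- metadata %>% filter(team == main_team)   # colours, logos, hashtags, VIP list, etc.
opp_meta  <- metadata %>% filter(team == opponent)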

[Image: metadata.csv]

In the second chunk of my workflow, I paste event-based markers. I have to open Twitter and manually copy the links of tweets marking game start/end, goals, and intermissions. What you see pasted is the string of numbers at the end of a tweet.

https://twitter.com/<account>/status/1300474445925167104

My script looks up those numbers, also known as the status_id, then grabs their timestamps and plots them in the correct positions with the correct colours and markers.

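In {rtweet} that lookup is a single call; a minimal sketch, assuming the pre-1.0 function name lookup_statuses() and reusing the id from the example URL above.

### EVENT MARKERS FROM TWEET IDS (sketch)
library(rtweet)
library(dplyr)

marker_ids <- c("1300474445925167104")        # status ids pasted into the command center

markers <- lookup_statuses(marker_ids) %>%
  select(status_id, created_at)               # timestamps used to place goal/intermission markers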

[Image: a plot with goal markers, intermissions, and game start/end]

I went with this workflow because it is flexible enough to build Chatter Charts for any sport. For instance, football, soccer, and baseball all have large events that define a game — touchdowns, goals, and RBIs respectively.

V. Tokenizing and Performing TF-IDF

So now I have my raw comments pulled from Reddit and Twitter, as well as some markers. Let’s walk through how the data is being processed.

First, group the comments into two-minute intervals. I do this with the round_date function from {lubridate}. Super easy to use.

### ROUND DATES (requires {dplyr} and {lubridate})
rounded_interval_df <- raw_df %>%
  mutate(interval = round_date(created_at, unit = "2 mins"))
[Image: rounded interval output, see how `created_at` rounds into `interval`]

Why two minutes, you ask? Hockey is fast. Things happen quickly. Anything longer can drown out events.

Why not one minute? There’s usually not enough volume to satisfy TF-IDF, especially with smaller fan bases. However, I can use one-minute intervals for ad-hoc charts — like a third-period collapse.

Next, calculate the comment volume for each interval.

This allows me to plot the line in the chart, which acts as the y-axis guide for the animated words to follow.

### CALCULATE VOLUME
interval_volume_df <- rounded_interval_df %>%
  count(interval, name = "interval_volume")
[Image: interval volume output]

Next, tokenize. This means I take data that is currently structured as one comment per row and break it up so each word in a comment has its own row.

{tidytext} does this with unnest_tokens.

### TOKENIZE
unnested_df <- rounded_interval_df %>%
  unnest_tokens(word, text, token = "tweets")

Also, remove stop-words. These are words like “I” and “the” — the stop_words variable is made available when you load {tidytext}. TF-IDF does discount these, but I find it’s friendlier to remove them straight up.

### TOKENIZED AND PROCESSED
processed_df <- unnested_df %>%
  anti_join(stop_words, by = "word")
[Image: a tokenized data frame, see how each non stop-word is pulled from the sentence]

Next, count the words in each interval as the last data preparation before TF-IDF.

### COUNT TOKENS
counted_token_df <- processed_df %>%
  count(word, interval)
[Image: word count of a single interval, excuse the cussing!]

Lastly, apply TF-IDF. I use the bind_tf_idf function from {tidytext}. You can also try using log-odds from {tidylo} for some variation!

### TF-IDF
important_word_df <- counted_token_df %>%
  bind_tf_idf(word, interval, n) %>%       # one line!
  filter(n >= 3) %>%                       # number of occurrences to be considered
  filter(idf < 4) %>%                      # limit VERY random words (typically noise)
  arrange(interval, desc(tf_idf)) %>%
  distinct(interval, .keep_all = TRUE)     # take the top term

### COMBINE WITH VOLUME
full_data <- interval_volume_df %>%
  full_join(important_word_df, by = "interval") %>%
  filter(interval >= min_hour,
         interval <= max_hour) %>%
  arrange(interval) %>%
  fill(word, .direction = "down")

Note: full_join and fill make sure that any interval which does not meet the minimum number of occurrences for TF-IDF is instead forward-filled with the previous interval’s word.

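As an aside, if you want to try the {tidylo} variation mentioned above, the swap is roughly one line; this sketch assumes the weighted log-odds column (log_odds_weighted) simply replaces tf_idf as the ranking column.

### LOG-ODDS VARIATION (sketch)
library(tidylo)
library(dplyr)

important_word_lo_df <- counted_token_df %>%
  bind_log_odds(set = interval, feature = word, n = n) %>%  # adds log_odds_weighted
  filter(n >= 3) %>%
  arrange(interval, desc(log_odds_weighted)) %>%
  distinct(interval, .keep_all = TRUE)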

And that is how I get the output we looked at above.

VI. Plotting & Animating

The Chatter Chart actually looks like this before animation.

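A minimal sketch of a base plot along these lines, assuming just a comment-volume line plus {shadowtext} labels for the top word; the real chart layers on far more styling than this.

### BASE PLOT (sketch; geoms and styling are assumed)
library(ggplot2)
library(shadowtext)

base_plot <- ggplot(full_data, aes(x = interval, y = interval_volume)) +
  geom_line(colour = "grey40") +                        # comment volume over time
  geom_shadowtext(aes(label = word),                    # top tf-idf word follows the line
                  colour = "black", bg.colour = "white") +
  theme_minimal()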

[Image: base plot before animation]

At this stage, I animate the plot using {gganimate}.

animated_plot <- base_plot +
  transition_reveal(interval)  # animate over interval

animate(plot = animated_plot,
        fps = 25, duration = 38,
        height = 608, width = 1080,
        units = 'px', type = "cairo", res = 144,
        renderer = av_renderer("file-name.mp4"))

By using transition_reveal(interval), I can build dynamic features into the chart. For instance, my scoreboard grows when someone scores. It’s quite similar to creating markers in Adobe After Effects.

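The exact implementation isn't shown here, but the idea can be sketched as a size column that bumps up for the two-minute interval containing a goal and is then mapped onto the scoreboard text; the column names and bump value are assumptions, apart from the baseline size of 12 mentioned below.

### SCOREBOARD SIZE BUMP (sketch; goal_intervals and score_label are assumed)
library(dplyr)
library(ggplot2)

full_data <- full_data %>%
  mutate(goal_scored = interval %in% goal_intervals,   # goal_intervals: intervals where a goal happened
         board_size  = if_else(goal_scored, 16, 12))   # grow for that interval, then back to 12

# ...then map it inside the plot, for example:
# geom_text(aes(label = score_label, size = board_size)) +
#   scale_size_identity()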

[Image: when the score changes, the board size increases for two minutes, then returns to 12]

The rest is intermediate styling in ggplot. Some of the packages I leverage include {ggtext} for adding HTML styling to the title, {shadowtext} for adding a white background behind the words, and {extrafont} for importing custom fonts.

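For a flavour of how those pieces fit together, here is a small sketch; the title text, colour, and font below are mine, not taken from the actual charts.

### STYLING (sketch)
library(ggplot2)
library(ggtext)     # element_markdown() lets the title contain HTML/markdown
library(extrafont)  # loadfonts() registers custom fonts for plotting

loadfonts(quiet = TRUE)

styled_plot <- base_plot +
  labs(title = "<span style='color:#00205B'>**Canucks**</span> vs Leafs") +
  theme(plot.title = element_markdown(family = "Roboto Condensed"))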

Thanks for Reading

I hope that gives you a better grasp on what’s going on behind the scenes.

Of course, I invite you to follow along with me on Twitter or join r/ChatterCharts. My DMs are open for feedback.

Finally, I’m looking for sponsors and affiliates, and I’d be happy to come on your podcast. Email: chattercharts@gmail.com.

Cheers!

Originally published at: https://medium.com/swlh/chatter-charts-methodology-5f82a405a673
