Survivor Entity Extraction and Network Graphs in Python

李_涛

Original article: https://medium.com/swlh/survivor-entity-extraction-and-network-graphs-in-python-2811d5f68427

During the lockdown, I watched and re-watched copious amounts of television. I turned around and saw that people were *gasp* learning new skills. So it occurred to me — why not combine the two for some enjoyable productivity?

Survivor is one of my favourite shows. In this hit US reality show, a group of people live under primitive conditions, compete in challenges, and periodically have to vote someone out. In the end, the winner is decided by a jury of the booted contestants. It’s such a great mix of survival, strategy and social dynamics!

Can we understand Survivor social dynamics based on what contestants say about each other? I will be using confessionals, which are basically when players speak to the camera in private. Specifically, I am going to analyse how often players mention each other.

In this article, I lay out my exploratory analysis step-by-step:

1. Identify data sources
2. Prep confessional data
3. Visualise by plotting network graphs
4. Draw insights

Since this is very free-form, I will be working in a Jupyter notebook. Key packages include pandas for ETL, spaCy for entity extraction, and networkx for visualising.

I’m conscious that the snippets here don’t always adhere to the highest coding standards; bear in mind that this is a fun side project. Plus, I also want to show the real working process. Any comments on the thinking, coding and analysis are welcome!

1. Identify data sources

The first step — finding usable data.

At this point, I have some ideas but haven’t yet decided what to do with the Survivor data. This step is crucial because the availability of good data shapes my problem statement. A little extra research can also save lots of data handling time!

Luckily, some googling quickly identified sources of Survivor data:

Performance data, including challenges, jury votes and tribal council records
Transcribed confessionals

Personally, I’m more interested in doing some natural language processing (NLP) on the confessionals, so I am going to focus on those and put the performance data to one side.

The confessionals data comes in Google Sheets, organised by season. I picked Millennials vs. Gen X (S33, filmed in 2016) because it was the latest in my re-watch.

One sheet looks like this. Each row in column B is one confessional. The numbers, like (1/5), indicate that it’s the 1st of 5 confessionals that episode. Note that they’re not labelled with the actual episode number.
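
Counters like (1/5) are easy to pull apart with a regular expression. This is a minimal sketch, assuming each transcribed confessional ends with such a marker; the sample rows and the helper name `parse_counter` are made up for illustration and are not taken from the real sheets.

```python
import re

# Matches a trailing counter such as "(1/5)": position within the episode / total.
COUNTER_RE = re.compile(r"\((\d+)/(\d+)\)\s*$")

def parse_counter(confessional):
    """Split a confessional into (text, index, total); index and total are None if absent."""
    match = COUNTER_RE.search(confessional)
    if match is None:
        return confessional.strip(), None, None
    text = confessional[:match.start()].strip()
    return text, int(match.group(1)), int(match.group(2))

# Hypothetical rows, not copied from the real data.
rows = [
    "I need to find an idol before the merge. (1/5)",
    "Zeke is running the show right now. (2/5)",
    "No marker on this one.",
]
parsed = [parse_counter(row) for row in rows]
```

Because the sheets do not carry episode numbers, a counter that resets to 1 for the same contestant could serve as a rough signal that a new episode has started.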

In a different file, there is a summary showing the number of confessionals by contestant and episode. If only this also included their finish — who won, who was a finalist, and who made the jury?

At a glance the data looks imperfect but well maintained (there are even guidelines on what gets transcribed). Kudos to the maintainers, I am definitely using these!

2. Prep confessional data

Now I need to read and clean this data so it’s ready to be analysed! There are two parts: ‘data’, i.e. the text of the confessionals, and ‘summary’ which has metadata on how the contestant did.

Since I don’t own these Google Sheets, I decided to save the relevant tabs as CSV files. In ‘summary’, I also manually added a ‘Finish’ column to indicate ‘winner’, ‘finalist’ etc., as there is no way to work this out from the data alone.
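
Loading the two files with pandas might look roughly like the sketch below. The column names (`Contestant`, `Confessional`, `Total`) and all values are assumptions for illustration; only the ‘Finish’ column and its ‘winner’ / ‘finalist’ labels come from the description above. In-memory strings stand in for the saved CSV files so the snippet runs on its own.

```python
import io
import pandas as pd

# Stand-ins for the two saved CSV files; in practice these would be file paths.
# Column names and values are hypothetical.
data_csv = io.StringIO(
    "Contestant,Confessional\n"
    "Adam,I trust Zeke for now. (1/2)\n"
    "Zeke,David is a threat. (1/1)\n"
)
summary_csv = io.StringIO(
    "Contestant,Total,Finish\n"
    "Adam,2,winner\n"
    "Zeke,1,jury\n"
)

data = pd.read_csv(data_csv)
summary = pd.read_csv(summary_csv)

# Attach each speaker's finish to their confessionals (left join keeps every row of data).
confessionals = data.merge(summary[["Contestant", "Finish"]], on="Contestant", how="left")
```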

Now we’re ready to do some analysis. In order to understand how players talk about each other, I need to identify the names mentioned in confessionals.
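
To make the idea concrete, here is a deliberately simplified stand-in for the entity extraction step. It does not use spaCy; it just matches a known roster of first names with a regular expression, which is a swapped-in technique and not the method used in the notebook. The roster and confessional texts are invented.

```python
import re
from collections import Counter

# Hypothetical roster and confessionals as (speaker, text) pairs.
players = ["Adam", "David", "Zeke"]
confessionals = [
    ("Adam", "Zeke and David are getting close. I need Zeke on my side."),
    ("Zeke", "David keeps finding idols."),
    ("David", "I have to remind myself: David, stay calm."),
]

# One pattern that matches any roster name as a whole word.
name_re = re.compile(r"\b(" + "|".join(map(re.escape, players)) + r")\b")

# (speaker, mentioned) -> number of mentions, ignoring self-mentions.
mentions = Counter()
for speaker, text in confessionals:
    for name in name_re.findall(text):
        if name != speaker:
            mentions[(speaker, name)] += 1
```

A roster match like this misses nicknames and cannot tell two players with the same first name apart, which is where a proper named entity recogniser earns its keep.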

3. Visualise by plotting network graphs

My first instinct is to plot a heatmap so we can easily spot where the high mention counts are. This is easily done with seaborn.
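
A heatmap needs a speaker-by-mentioned matrix first. The sketch below builds one with pandas from made-up counts; the column names are assumptions. The seaborn call is left as a comment so the snippet runs without a plotting backend.

```python
import pandas as pd

# Hypothetical mention counts: (speaker, mentioned, count).
mentions = pd.DataFrame(
    [("Adam", "Zeke", 2), ("Adam", "David", 1), ("Zeke", "David", 1)],
    columns=["speaker", "mentioned", "count"],
)
players = ["Adam", "David", "Zeke"]

# Rows are speakers, columns are the players they mention.
# Reindexing adds all-zero rows and columns for players missing on either side.
matrix = (
    mentions.pivot_table(index="speaker", columns="mentioned", values="count",
                         aggfunc="sum", fill_value=0)
    .reindex(index=players, columns=players, fill_value=0)
)

# With seaborn installed, the heatmap itself is then one call:
# import seaborn as sns
# sns.heatmap(matrix, annot=True, cmap="Blues")
```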

Unfortunately the heatmap is a bit hard to read… even if we filter down to only the 20 players, it’s still difficult to match labels to the values.

Instead, I think we need a network graph showing who spoke about who. The players will be nodes, with arrows in between for mentions. More mentions = bigger arrows!
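
In networkx terms, that description maps onto a directed graph with a weight on each edge. This is a minimal sketch with invented counts, not the notebook's actual code; the drawing calls are shown as comments because they need matplotlib.

```python
import networkx as nx

# Hypothetical mention counts: (speaker, mentioned) -> count.
mentions = {("Adam", "Zeke"): 2, ("Adam", "David"): 1, ("Zeke", "David"): 1}
players = ["Adam", "David", "Zeke", "Mari"]

G = nx.DiGraph()
G.add_nodes_from(players)  # keeps players who neither mention nor get mentioned
for (speaker, mentioned), count in mentions.items():
    G.add_edge(speaker, mentioned, weight=count)

# More mentions = bigger arrows: one width per edge, in edge order.
widths = [G[u][v]["weight"] for u, v in G.edges()]

# With matplotlib installed, drawing is then:
# pos = nx.spring_layout(G, seed=42)
# nx.draw_networkx(G, pos, width=widths, arrows=True)
```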

There’s a bit of fidgeting with formatting, as always… but this is in the right direction now.

It’s immediately noticeable that David and Zeke are very central, alongside winner Adam. For Zeke, this is probably a reflection of the infamous incident where another player outed him as transgender in front of his whole tribe. With David, it probably reflects his popularity (and therefore more of his confessionals getting airtime).

Just looking at the graph, it’s obvious that Zeke is more central than, say, Mari (bottom left, yellow dot). But it’s hard to tell if he is more central than David or Adam. Let’s calculate this properly; then I’ll be able to easily check hypotheses like ‘the winner is always the most central of all finalists’:
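
The text does not say which centrality measure is used, so the sketch below assumes plain degree centrality from networkx as one reasonable choice. The edges and mention counts are invented, and the `finish` mapping mirrors the manually added ‘Finish’ column (the ‘jury’ label is an assumption).

```python
import networkx as nx

# Hypothetical season: edges are (speaker, mentioned, mentions).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Adam", "Zeke", 2), ("Adam", "David", 1), ("Zeke", "David", 1),
    ("David", "Adam", 1), ("David", "Zeke", 1),
    ("Hannah", "Adam", 1), ("Ken", "David", 1),
])
finish = {"Adam": "winner", "Hannah": "finalist", "Ken": "finalist",
          "David": "jury", "Zeke": "jury"}

# Degree centrality: incoming plus outgoing edges (weights ignored), divided by n - 1.
centrality = nx.degree_centrality(G)
ranking = sorted(centrality, key=centrality.get, reverse=True)

# Hypothesis: the winner is the most central of everyone who reached the final.
finalists = [p for p, f in finish.items() if f in ("winner", "finalist")]
winner = next(p for p, f in finish.items() if f == "winner")
winner_most_central = max(finalists, key=centrality.get) == winner
winner_rank = ranking.index(winner) + 1  # rank among all players
```

Swapping in a weighted measure, for example summing `G.degree(weight="weight")`, would let the number of mentions count and not just who mentions whom.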

The winner, Adam, is indeed the most central of all finalists. However, he only ranks 4th out of 20 overall.

Clearly, I need to do this with multiple seasons to find generic Survivor patterns (if they exist). So I tidied up my code a little and made it slightly more flexible.

In my final notebook, I set things up so that new graphs can be drawn with one line of code. It’s a little slow, but it does the trick for what I need!

4. Draw insights

Here are networks from 3 different seasons. Winners are in red, finalists in blue, jury members green, and the rest yellow.

Winners

Winners (red) are always the most central of all finalists (blue). I wonder if this holds through all 40 seasons!

By design, winners are in the game for the maximum number of days, so there are more episodes where their confessionals can be aired. It might be interesting to see this analysis by episode, doing something like this.

However, it’s not easy to identify winners just based on mentions. Although Rob (RI, middle) and Tony (WaW, right) are both the most central by far in their winning seasons, Adam (MvG, left) breaks the pattern. Rob and Tony are both big characters — and that’s not true of all winners.

Finalists

This is much less consistent. Finalists (blue) aren’t even always in the upper half — look at Natalie in WaW!

Natalie’s non-centrality is driven by the Edge of Extinction twist. She was actually the first player voted out and only re-joined the game at the final six. Obviously, that meant fewer mentions from fewer players.

In RI, Phillip is much more central than Natalie, driven by many mentions from Rob. However, in MvG, the non-winning finalists have very similar levels of centrality. In all three seasons, prominent jurors (green) come out more central: David, Andrea, Ben.

My conclusion: finalists as a group are not very homogeneous.

So… what did I learn?

I realised that confessionals are telling, but you can’t draw hard conclusions based on this alone.

For starters, confessionals are heavily edited by production to sculpt interesting characters and build a storyline. Players also have very different styles, and even winners come in many shapes and forms. So confessionals paint an incomplete picture, in the same way that you can’t analyse the game using only challenge wins.

Having said that, this is a very basic piece of analysis that only scratches the surface.

Some ideas for further analyses:

Full picture: analysing centrality for all 40 seasons, possibly segmenting by the twists in play (with vs without immunity idols, with vs without potential redemption after being voted out)
Episode-by-episode view: this could show anomalies like controversial incidents, medical evacuations, and classic moments such as Erik giving away his individual immunity.
Signal spotting: which way will a challenge or a vote go? Fans in early seasons noticed that during suspenseful votes, host Jeff Probst tended to first read the name of the player not going home. Perhaps there are also tell-tale signs in confessionals.
Sentiment analysis: differentiating between positive, neutral and negative mentions might show us alliances and voting blocs. This can be done using tools like this.
Returnees: is there any relationship between confessionals and whether players come back in a future season?
Predictive power: combining this with challenge performance and voting outcomes, how well can we predict players’ finish? How important a predictor would the confessionals be?

I had fun picking up networkx for the first time while thinking about one of my favourite shows. What cool Survivor network graphs would you like to see? What about for other TV shows, movies, and books?
