PCA and Content-Based Modelling for Champion Recommendations in League of Legends

Note: All the code for the below can be found here.

Previously I wrote an article on how we can use graph networks to help provide Champion recommendations in the game League of Legends (LoL). The technique is known as “User-user collaborative filtering”, where we utilise the information we know about a person to find similar users and then base our recommendation on what we know they like.

To help illustrate this, we’ll use the classic Amazon example. Imagine that you have added a PS4 and the latest FIFA game to your Amazon basket. The algorithm looks at all users who have previously bought a PS4 and FIFA together and finds which other items they tend to have in their basket, e.g. the latest NFL game, Madden, which is then recommended to you.

Today, we’re looking at a different form of recommendation algorithm known as a “Content Based Model”. This technique instead connects items together based on their similarities, e.g. if you’re buying a PS4 sports game produced by EA then here are some other PS4 sports games produced by EA. This technique is useful when you have no information about user preferences, such as when a product has just launched.

However, there are almost 150 LoL Champions and we don’t want to spend all our time labeling them with all the various attributes we would need to make this work. So instead, what we are going to do is “describe” the Champions using their in-game statistics, such as their average kills per game or how much objective damage they do.

To do this, we can analyse 150,000 Diamond games. Note that I’ve limited this to Top, Middle and ADC players only, because supports and junglers have inherently different statistics (e.g. low gold from minions).

After averaging the data for each Champion, the first thing to note is that many of the statistics are strongly correlated. It shouldn’t be a surprise that attributes such as “killingSprees” and “kills” are almost perfectly correlated (the former counts how many times a player has been on a killing spree, the latter how many kills they got in total that game).
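As a rough sketch of the averaging and correlation step, the snippet below builds a small synthetic table of per-game rows, averages it per Champion and computes the pairwise correlations. The kill rates, the rule deriving killingSprees from kills and the column values are all made up for illustration; only the column names echo the statistics mentioned in the article.

```python
import numpy as np
import pandas as pd

# Hypothetical average kill rates per champion (invented numbers, not real data).
kill_rate = {"Fiora": 8, "Riven": 7, "Katarina": 9, "Karthus": 6, "Sion": 3, "Maokai": 2}
games_per_champion = 200

rng = np.random.default_rng(0)
games = pd.DataFrame({
    "champion": np.repeat(list(kill_rate), games_per_champion),
    "kills": rng.poisson(np.repeat(list(kill_rate.values()), games_per_champion)),
})
# Assumption for the sketch: sprees are derived from kills, so the two are linked.
games["killingSprees"] = games["kills"] // 3
# An unrelated statistic, drawn independently of kills.
games["damageSelfMitigated"] = rng.normal(20000, 5000, size=len(games))

# One row per champion: the mean of every statistic over that champion's games.
champion_stats = games.groupby("champion").mean()

# Pairwise Pearson correlations between the averaged statistics.
corr = champion_stats.corr()
print(corr.round(2))
```

Pairs of columns with a correlation close to 1 or -1 are the ones that carry largely the same information.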

Figure: Graph illustrating the multicollinearity issue that occurs with such a large number of attributes.

A common approach to dealing with this level of multicollinearity is either exclusion (pick kills, delete killingSprees) or aggregation (kills * killingSprees). However, there is a better solution known as Principal Component Analysis (PCA), which is able to extract the core relationship between these attributes without manual intervention or the removal of potential key drivers.

PCA is a fairly complex subject that requires an understanding of eigenvectors and eigenvalues, and there are plenty of great articles on it, so I won’t labour the subject here. Instead, I will say that what PCA is trying to do is capture as much of the variance in the data as possible whilst minimising the number of variables used.

After fitting PCA to the dataset, we find that well over 30% of the variance of the data is captured by a single component, just over 16% by the second component, 11% or so by the third, and so on.
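For readers who want to reproduce this kind of breakdown, here is a minimal sketch using scikit-learn. The data is synthetic (150 rows standing in for Champions, 12 columns standing in for averaged statistics), and standardising the columns before fitting is my assumption rather than something the article states, so the percentages it prints will not match the ones quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_champions, n_stats = 150, 12

# Synthetic stand-in for the averaged statistics: three hidden "play-style"
# factors mixed into twelve correlated columns, plus some noise.
latent = rng.normal(size=(n_champions, 3))
mixing = rng.normal(size=(3, n_stats))
X = latent @ mixing + 0.5 * rng.normal(size=(n_champions, n_stats))

# Assumption: put every statistic on the same scale so that large-valued
# columns (gold, damage) do not dominate purely because of their units.
X_scaled = StandardScaler().fit_transform(X)

# Keep all components so we can see how the variance is distributed.
pca = PCA()
pca.fit(X_scaled)

ratios = pca.explained_variance_ratio_
for i, ratio in enumerate(ratios, start=1):
    print(f"component {i}: {ratio:.1%} of the variance")
```

The ratios are sorted from largest to smallest and sum to 1 when every component is kept, which is what makes the "first component holds X%, second holds Y%" reading possible.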

But what are these components? To help understand what they are made of and where they have come from, take a look at the graph below illustrating which variables are part of the first component. It’s clear that goldEarned is the largest contributor to this component, alongside objective damage, the largest multi-kill achieved, the number of killing sprees, damage dealt and total kills. It’s safe to say that this component is capturing the variables relating to stomping lane. If we add on the fact that “physical” damage is specified, you can almost see the Fiora/Riven/Trynd one tricks appearing in front of your eyes.
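One way to see what a component is made of is to look at its loadings, which scikit-learn exposes as `components_`. The sketch below uses synthetic data with two hidden factors that I have named `carry` and `tank` purely for illustration; the statistic names are borrowed from the article, but the values are invented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

stat_names = ["goldEarned", "kills", "killingSprees", "turretKills", "damageSelfMitigated"]

rng = np.random.default_rng(1)
n = 150
carry = rng.normal(size=n)  # hypothetical hidden "carry" factor
tank = rng.normal(size=n)   # hypothetical hidden "tank" factor

# Three statistics follow the carry factor closely, two follow the tank factor.
X = np.column_stack([
    carry + 0.2 * rng.normal(size=n),
    carry + 0.2 * rng.normal(size=n),
    carry + 0.2 * rng.normal(size=n),
    tank + 0.5 * rng.normal(size=n),
    tank + 0.5 * rng.normal(size=n),
])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Each row of components_ holds one weight (loading) per original statistic.
loadings = pca.components_[0]

# Rank statistics by the size of their contribution to the first component.
ranked = sorted(zip(stat_names, loadings), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:>20}: {weight:+.2f}")
```

Note that the sign of a principal component is arbitrary: flipping every loading gives an equally valid component, so only the relative signs and magnitudes carry meaning.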

The 2nd component comprises two main attributes: towers taken and damage self-mitigated (blocked/parried/immune/reduced etc.). However, you may be wondering how this all relates to content based recommendation models! Well, what we now have are two components that contain over 50% of the variance between the Champions. These can be considered as proxies for descriptions, where instead of “sports game” we have “Champion who kills everyone” and “produced by EA” becomes “high turret damage”! We can then plot these descriptive components out in a 2D space and we can start to see how it all comes together (warning, big old graph coming at you for visibility):

Note: Although “Support” champions are shown here in yellow, the data is actually derived from farming lanes only. I.e. the Zilean data you see above is from when the Champion is played in either Top, Mid or as the APC.

Those of you paying attention will note that component 1 is inverted, so high damage/kills scores low on the X-axis. Component 2 is not inverted, so a high number on the Y-axis indicates lots of turret taking and damage mitigation. To make sure it has worked as expected, take a look at the Champions in the top left (i.e. those that do lots of physical damage, take towers and mitigate damage): Fiora & Tryndamere (Trynd’s ult counts as damage mitigation). How about the bottom center, where we see Katarina and Karthus, who score relatively high on damage and kills but aren’t smashing turrets and mitigating damage? Sounds right to me.

The next step is simple: the recommendation is the Champion with the shortest Euclidean (straight line) distance from the Champion they currently play. You play a lot of Taric? Try Maokai. Akali? How about Fizz. Unkillable Dr. Mundo? You’ll love our boy Sion.
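The lookup itself can be sketched in a few lines. The coordinates below are invented placeholders for each Champion’s position on the two components (they are not read off the real graph), and `recommend` is a hypothetical helper name.

```python
import math

# Hypothetical (component 1, component 2) scores, made up for illustration.
coords = {
    "Taric": (2.1, 1.8),
    "Maokai": (2.3, 1.6),
    "Akali": (-1.0, -1.5),
    "Fizz": (-0.8, -1.7),
    "Dr. Mundo": (1.0, 3.0),
    "Sion": (1.2, 3.3),
}

def recommend(champion, coords, k=1):
    """Return the k champions closest to `champion` by Euclidean distance."""
    origin = coords[champion]
    distances = [
        (math.dist(origin, point), name)
        for name, point in coords.items()
        if name != champion  # never recommend the champion they already play
    ]
    return [name for _, name in sorted(distances)[:k]]

for main in ("Taric", "Akali", "Dr. Mundo"):
    print(main, "->", recommend(main, coords))
```

Raising `k` returns a short list of suggestions rather than a single Champion.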

If we wanted to expand on this, we’d move to higher dimensions. If you go back to the graph showing how much variance is captured in each component, I’d say there’s an argument to build the model based on 3, maybe even 5 dimensions. The rest works the same, but given the visualisation becomes tricky we’ll leave it there for now!
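Moving to more dimensions changes very little in code, even if it can no longer be plotted easily. The sketch below keeps three components and finds nearest neighbours in that 3D space; the data is synthetic, the standardisation step is my assumption, and `nearest` is a hypothetical helper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n_champions, n_stats = 150, 12

# Synthetic stand-in for the per-champion averaged statistics.
X = rng.normal(size=(n_champions, 4)) @ rng.normal(size=(4, n_stats))
X += 0.3 * rng.normal(size=(n_champions, n_stats))

n_dims = 3  # could just as well be 5; the rest of the code is unchanged
scores = PCA(n_components=n_dims).fit_transform(StandardScaler().fit_transform(X))

def nearest(index, scores, k=3):
    """Indices of the k rows closest to row `index` in component space."""
    distances = np.linalg.norm(scores - scores[index], axis=1)
    distances[index] = np.inf  # exclude the champion itself
    return np.argsort(distances)[:k]

print(nearest(0, scores))
```

The Euclidean distance is computed the same way regardless of how many components are kept, so the only real cost of extra dimensions is losing the 2D picture.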

I hope this provides another insight into potential recommendation types that may be worth exploring and the benefits PCA provides. Although I use League of Legends as my domain, these can easily be applied to any other field. I recommend going back up to the large graph, finding your main and seeing whether you’d agree that the ones surrounding it have a similar play-style. Let me know below in the comments!

Thanks for getting to the bottom of my article! My name is Jack J. and I’m a professional Data Scientist, writer and founder of the League of Legends analytics site JUNG.GG. You can also find me on my blog LeagueOfData, where I post less Data Science intense articles; it’s also the best place to get in contact with me.
