The Power of Network Analysis

The most common way to store data is in what we call relational form. Most systems get analyzed as collections of independent data points. It looks something like this:

Whether you’re a spreadsheet user or a machine learning master, you’re probably used to seeing your data that way. Rows and columns representing different categories and metrics.

However, this approach makes it very difficult to capture information about the relationships that are fundamental to so much of our world. When we stop to think about some of the common systems around us — systems which we care about understanding, optimizing and predicting — we start to see how treating these systems as independent data points misses crucial information.
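
To make the contrast concrete, here is a minimal sketch of the same records viewed first as independent rows and then as a network. The trade records, the field names (`seller`, `buyer`, `amount`) and the helper `to_adjacency` are all hypothetical, invented for illustration only.

```python
from collections import defaultdict

# Hypothetical trade records in relational form: one row per transaction.
rows = [
    {"seller": "A", "buyer": "B", "amount": 120},
    {"seller": "A", "buyer": "C", "amount": 75},
    {"seller": "B", "buyer": "C", "amount": 40},
    {"seller": "C", "buyer": "A", "amount": 10},
]


def to_adjacency(rows):
    """Group rows by seller so each party's trading partners are explicit."""
    adjacency = defaultdict(dict)
    for row in rows:
        adjacency[row["seller"]][row["buyer"]] = row["amount"]
    return dict(adjacency)


network = to_adjacency(rows)
print(network)
# {'A': {'B': 120, 'C': 75}, 'B': {'C': 40}, 'C': {'A': 10}}
```

The data is identical in both views; the second simply makes "who is connected to whom" a first-class question instead of something buried across rows.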

Figure: Trade networks

Economies are defined by relationships and transactions, not just by individual players operating independently.

Figure: A power grid

Infrastructure we use every day is highly connected. We have transportation systems linking cities and people, and communication systems linking electronic devices.

Figure: Gene interactions

In biology, life doesn’t emerge from cells/proteins/genes working separately, but from those components coming together and performing many interactions to make the cell alive. Even our thoughts are hidden and encoded in the connections and wiring between billions of neurons.

Figure: Social networks

And, of course, we have social networks. This has become a term to describe social media platforms, but to be more specific, it is the data underlying these platforms, recording friendships and followers, that is useful to model as a literal network.

We could keep going with endless examples. What do all of these systems have in common? Highly connected data. Essentially, anything that involves humans is highly connected. Our world isn’t just a collection of individuals isolated from everyone else, but a network of billions of members who are constantly interacting with each other. Therefore, data that describes these systems will be useful insofar as it captures those connections.

In short, behind many systems there is an intricate wiring diagram, a network, that defines the connections between the components.

Although the traditional relational data model has served many domains well, highly connected systems can never be fully modeled or used for prediction unless we understand the networks behind them.

Google’s PageRank Algorithm

To further understand the transformative potential of network analysis upon its introduction to a new domain, I think it’d be useful to explain its role on a platform that you likely use every day.

In the late 1990s and early 2000s, there were many search engines on the web. The internet was a vast, ever-evolving terrain whose users desperately needed navigation help. Many understood this need, and the field of search engines and directories was crowded.

Despite being a latecomer, Google managed to surpass the competition only a few years after its founding by Larry Page and Sergey Brin in 1998. What made Google different? It modeled the internet as a network.

The basic problem that search engines were faced with was: how do you measure the relevance or importance of different pages in order to determine what results to show after a user searches? There was no obvious answer. Most search engines attempted to measure the importance of a website by analyzing the content on that website itself. Similar to the way we use Excel, this analysis entailed rows and columns, where each row was a page and each column was a variable or metric about the content on that page.

However, this approach is easy to game. If you want to look up how to bake cupcakes, for example, and the search engine you’re using determines which results to return based only on website content, I could create a cupcake-baking website in 30 minutes with all the right content to make your search engine deem my page “relevant.” But my site is unlikely to be the most relevant or highest quality for your needs. In the early days of the internet, when everyone was trying to cash in on this new medium, cupcake con-men abounded.

Page and Brin invented a different approach to search. They realized that they could dramatically improve the results they showed users if they first modeled the internet as a network of domains and pages that reference each other. More specifically, they developed an algorithm to detect the “role” or the “importance” of a node in a network, now called the PageRank algorithm. Once understood, the PageRank algorithm seems very simple, but it’s very powerful.

It looks like this. Given a collection of web pages, we can keep track of all of the links or references that pages make to each other. In our model, when one page references another, we can add these references as arrows, or “edges,” pointing from the first page to the second page. We can do this across all of the pages in our collection of interest, and Google did it across all of the pages on the internet. What we end up with might look something like this:

Figure: Some pages are referenced more than others.
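
The bookkeeping described above, recording each reference as an arrow from one page to another, can be sketched in a few lines. The page names and links below are hypothetical, and `count_incoming` is an illustrative helper rather than anything from Google's system.

```python
from collections import Counter

# Hypothetical pages and the links between them, stored as (source, target).
links = [
    ("blog", "recipes"),
    ("forum", "recipes"),
    ("news", "recipes"),
    ("recipes", "news"),
    ("spam", "spam-shop"),
]


def count_incoming(links):
    """Count how many links point at each page (its in-degree)."""
    pages = {page for link in links for page in link}
    incoming = Counter(target for _, target in links)
    return {page: incoming[page] for page in sorted(pages)}


in_degree = count_incoming(links)
for page, count in in_degree.items():
    print(f"{page}: referenced {count} time(s)")
```

Counting incoming arrows is only the crudest signal of how often a page is referenced, but it already cannot be changed by editing the page's own content.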

As you can imagine, and as the simple illustrated example shows, some web pages are referenced far more often than others. A web page that is authoritative and relevant, unlike one that an impostor created 30 minutes ago, will be one of the nodes referenced more often. Without going into the mathematical details of the actual algorithm, we can think of PageRank as essentially measuring the “importance” or “influence” of a page based on its role in the network. By crawling the web and recording the references that pages make to each other, Google was able to calculate this importance for each page, weed out the irrelevant ones, and subsequently return higher quality search results to its users.
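
For readers who do want a feel for the mechanics, the following is a simplified sketch of the PageRank idea using repeated averaging (power iteration). It is not Google's production algorithm. The link data is hypothetical, and the damping factor of 0.85 and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: each page shares its rank among the pages it links to."""
    pages = sorted({page for link in links for page in link})
    outgoing = {page: [] for page in pages}
    for source, target in links:
        outgoing[source].append(target)

    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[page] for page in pages if not outgoing[page])
        new_rank = {
            page: (1 - damping) / n + damping * dangling / n for page in pages
        }
        for source in pages:
            for target in outgoing[source]:
                new_rank[target] += damping * rank[source] / len(outgoing[source])
        rank = new_rank
    return rank


# Hypothetical link data: (source page, target page).
links = [
    ("blog", "recipes"),
    ("forum", "recipes"),
    ("news", "recipes"),
    ("recipes", "news"),
    ("spam", "spam-shop"),
]

rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(f"{page}: {rank[page]:.3f}")
```

In this toy data the page referenced by several others ends up ranked first, while the freshly created page that only links to its own shop stays near the bottom, no matter what its content says.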

What else can we do with networks?

A description of the potential of networks could fill (and has filled) many books, and more recently this modeling approach has garnered much interest in machine learning, particularly deep learning. But all of these fancy applications still depend on the basic advantage of modeling your data as a network, similar to the way that Google grasped it: networks allow you to calculate entirely new metrics to describe and understand your data that you never would have been able to calculate previously. These metrics are many, and they are derived from various algorithms, like Google’s PageRank, that can be run once you model your data as a network.

Figure: Nodes highlighted in light blue are more connected/central.

There are various measures of centrality, similar to PageRank, and these centrality measures correspond to many concepts that we already think about and would be interested in measuring. In a social network, for example, the most central nodes might be the members who have many friends and whose opinions are highly regarded. The role of those authorities/influencers would become clear after modeling the relationships between people as a network.
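
One of the simplest centrality measures is degree centrality: the share of other members a person is directly connected to. The sketch below uses a hypothetical friendship list and assumes friendships are mutual and that nobody is listed as their own friend.

```python
from collections import defaultdict

# Hypothetical mutual friendships in a small social network.
friendships = [
    ("ana", "ben"),
    ("ana", "cho"),
    ("ana", "dev"),
    ("ben", "cho"),
    ("dev", "eli"),
]


def degree_centrality(edges):
    """Fraction of the other members each person is directly connected to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    # Assumes no self-friendships, so there are always n - 1 possible friends.
    return {person: len(friends) / (n - 1) for person, friends in neighbors.items()}


centrality = degree_centrality(friendships)
print(centrality)
# {'ana': 0.75, 'ben': 0.5, 'cho': 0.5, 'dev': 0.5, 'eli': 0.25}
```

Degree centrality only captures "has many friends"; measures in the PageRank family also account for how well connected those friends are.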

Figure: Networks can be used to model movement within a network, which can lead to different measures of flow/directedness.

We can also measure directionality or flow within our networks, where the connections between components are essentially arrows. These might allow you to uncover patterns of movement. For example, you might be able to notice confusion or inefficiencies in transportation networks where there are a lot of cycles or zig-zags, and there are many ways to calculate that numerically given the relationships between the nodes in your network.
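
As one small example of such a calculation, the sketch below checks whether a set of directed routes contains a cycle, using a depth-first search. The route data and stop names are hypothetical, and the recursive approach is meant for small graphs only.

```python
def has_cycle(edges):
    """Return True if the directed graph described by (source, target) pairs has a cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {node: UNVISITED for node in graph}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in graph[node]:
            # Reaching a node that is still being explored means we looped back.
            if state[nxt] == IN_PROGRESS:
                return True
            if state[nxt] == UNVISITED and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state[node] == UNVISITED and visit(node) for node in graph)


# Hypothetical one-way routes between stops.
loop_routes = [("depot", "north"), ("north", "east"), ("east", "depot")]
line_routes = [("depot", "north"), ("north", "east")]

print(has_cycle(loop_routes))  # True
print(has_cycle(line_routes))  # False
```

A yes/no answer is the simplest possible flow metric; counting or measuring the cycles would be a natural next step.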

Figure: A network’s connectedness can be used to measure the robustness of a system or detect communities.

Another aspect that modelers are often interested in quantifying is a network’s connectedness. Again, there are several ways to do this, but it’s useful in many different applications.

For example, if you were modeling any type of infrastructure — transportation, trade, IT — you could use this as a measure of your infrastructure’s robustness. Or, as an e-commerce vendor, you could use this approach to find communities or clusters of customers.
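
A rough way to probe robustness is to remove one node and see whether the network falls apart into separate pieces (connected components). The sketch below does this with a breadth-first search. The node names, the links, and the helper `without_node` are hypothetical.

```python
from collections import deque


def connected_components(nodes, edges):
    """Group nodes into components where every node can reach every other."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbors[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


def without_node(nodes, edges, removed):
    """Return the network with one node and all of its links taken out."""
    kept_nodes = [node for node in nodes if node != removed]
    kept_edges = [(a, b) for a, b in edges if removed not in (a, b)]
    return kept_nodes, kept_edges


# Hypothetical infrastructure: three sites that mostly connect through one hub.
nodes = ["a", "b", "c", "hub"]
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("a", "b")]

intact = connected_components(nodes, edges)
damaged = connected_components(*without_node(nodes, edges, "hub"))
print(intact)   # [['a', 'b', 'c', 'hub']]
print(damaged)  # [['a', 'b'], ['c']]
```

The same component-finding step is also the simplest form of community detection: each component is a group of customers or sites connected to each other but not to the rest.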

In Conclusion

Networks are an extremely exciting and useful domain of analysis, and one that is increasingly garnering interest from a wide variety of fields. In particular, the possibility of performing network analysis at scale, with datasets of billions of nodes/edges, is seen by many as one of the next big challenges in prediction and machine learning. I plan to write more about all of that in future stories, but hopefully this article gives a helpful brief introduction to the idea of networks and why they can be so powerful in many domains.

Translated from: https://medium.com/analytics-vidhya/the-power-of-network-analysis-8a245633a36
