# 可视化编程语言_可视化编程语言影响图

#### Gephi和Sigma.js的网络可视化教程 (A network visualization tutorial with Gephi and Sigma.js)

Here’s a preview of what we’ll be making today: the programming languages influence graph. Check out the link to explore the “design influence” relationships between over 250 programming languages past and present!

In today’s hyper-connected world, networks are an ubiquitous aspect of modern life.

Take the start of my day so far — I used London’s transport network to travel into town. Then I went into a branch of my favourite coffee shop and used my Chromebook to connect to their Wi-Fi network. Next, I logged in to the various social networking sites I frequent.

It’s no secret that some of the most influential companies of the last few decades owe their success to the power of networks.

Facebook, Twitter, Instagram, LinkedIn and other social media platforms rely on the small-world properties of social networks. This lets them connect their users with each other (and advertisers) effectively.

Google owes much of its current success to their early dominance of the search engine market — enabled in part through their ability to return relevant results with the help of their Page Rank network algorithm.

Amazon’s efficient distribution network allows them to offer same-day delivery in some major cities.

Networks are also super-important in fields such as Artificial Intelligence and Machine Learning. Neural networks are a very active field of research. Many feature detection algorithms, essential in Computer Vision, rely heavily on using networks to model different parts of images.

A wide range of scientific phenomena can also be understood in terms of network models. This includes quantum mechanics, biochemical pathways, and ecological and socio-economic systems.

Given their undeniable importance, then, how can we better understand networks and their properties?

The mathematical study of networks is known as “graph theory”, and is one of the more accessible branches of mathematics. This article aims to provide an introduction, assuming little prior knowledge or experience.

We’ll be using Python 3.x and some awesome open-source software called Gephi to put together a network visualization of how a range of programming languages past and present are linked by influence.

### 但首先… (But first…)

What exactly is a network?

The examples described above give us some clues. Transport networks are made up of destinations connected by routes. Social networks are made up of individuals, connected through their relationships to one another. Google’s search engine algorithms evaluate the “rank” of different webpages by looking at which pages link out to others.

More generally, a network is any system that can be described in terms of nodes and edges, or in colloquial terms, “dots and lines”.

Some systems are readily abstracted in this manner. Social networks are perhaps the most obvious example. Computer filesystems are another — folders and files are linked by their “parent” and “child” relationships.

But the real power of networks comes from the fact that many, many systems can be abstracted and modelled in network terms, even if at first it isn’t obvious how.

### 代表网络 (Representing networks)

We need to go a little beyond pen-and-paper sketches to analyze and describe networks mathematically. How can we turn pictures of dots and lines into numbers we can crunch?

One solution is to draw up an adjacency matrix to represent our network.

Matrices are one of those concepts that might sound a little intimidating if you’re not familiar with them, but fear not. Think of them as grids of numbers which can be used to perform many calculations all at once. Here’s an example below:

Python Java Scala C#
Python     0    1     0  0
Java       0    0     0  1
Scala      0    1     0  0
C#         0    1     0  0

In this matrix, the intersection of each row and column is either 0 or 1, depending on whether or not the respective languages are linked. You can check this against the illustration above!

For most purposes, the adjacency matrix is a good way of representing a network mathematically. From a computational perspective, however, it can sometimes be a bit cumbersome.

For instance, with even a relatively modest number of nodes (say 1000), there will be a much larger number of elements in the matrix (e.g., 1000² = 1,000,000).

Many real-world systems yield sparse networks. In these networks, most nodes only connect to a small proportion of all the others.

If we represented a 1000-node sparse network in computer memory as an adjacency matrix, we’d have 1,000,000 bytes of data stored in RAM. Most will be zeros. There’s got to be a more efficient way of going about this.

An alternative approach is to work with edge lists instead. These are exactly what they say they are. They are simply a list of which node pairs link to each other.

For example, the programming languages network above can be represented as follows:

Java, Python
Java, Scala
Java, C#
C#, Java

For larger networks, this is a much more computationally efficient means of representing them. It is of course possible to generate an adjacency matrix from an edge list (and vice versa). It’s not like we have to pick one or the other.

Another means of representing networks are adjacency lists. This lists every node followed by the nodes it links to. For example:

Java: Python, Scala, C#
C#: Java

### 收集数据，建立连接 (Collecting data, making connections)

Any network model and visualisation will only be as good as the data used to construct it. This means, as well as ensuring the data is both accurate and complete, we also need to justify a means of inferring edges between nodes.

In many respects, this is the critical step. Any subsequent analysis and inferences made about the network depend on being able to justify the “linkage criterion”.

For example, in social network analysis, you might link people based upon whether they follow one another on social media. In molecular biology, you might link genes based upon their co-expression.

Often, the method used to link nodes will allow for weights to be assigned to the edges, giving a measure of “strength”.

For instance, in the context of online retail, you could link products based upon how often they are purchased together. Products that are frequently bought together would be linked by a higher weighted edge than products which are only sometimes bought together. Products that are bought together no more often than would be expected by chance wouldn’t be linked at all.

As you might imagine, the methods for linking nodes to one another can be as sophisticated as you like.

However, for this tutorial we’ll be using a simpler means of connecting programming languages. We’re gonna rely on the accuracy of Wikipedia.

For our purposes, this should be fine. Wikipedia’s success is testament that it must be doing something right. The open-source, collaborative method by which articles are written should ensure some degree of objectivity.

Also, its relatively consistent page structure makes it a convenient playground for trying out web-scraping techniques.

Another bonus is the extensive, well-documented Wikipedia API, which makes information retrieval easier still. Let’s get started.

### 第1步-安装Gephi (Step 1 — Installing Gephi)

Gephi is available on Linux, Mac and Windows. You can download it here.

Gephi在Linux，Mac和Windows上可用。 您可以在此处下载。

For this project, I was using Lubuntu. If you’re on Ubuntu/Debian, then you can follow the steps below to get Gephi up and running. Otherwise, the installation process will likely be much the same as whatever you’re familiar with.

cd Downloads
tar -xvzf gephi-0.9.1-linux.tar.gz
cd gephi-0.9.1/bin./gephi

You may need to check your version of the Java JRE. Gephi requires a recent version. On my relatively fresh install of Lubuntu, I simply installed the default-jre, and everything worked from there.

apt install default-jre
./gephi

There’s one more step before you’re ready to get underway. In order to export the graph to the Web, you can use the Sigma.js plugin for Gephi.

From Gephi’s menu bar, choose the “Tools” option, and select “Plugins”.

Click on the “Available Plugins” tab and select “SigmaExporter” (I also installed JSON Exporter, because it’s another useful plugin to have around).

Hit the “Install” button and you’ll be walked through the process. You’ll need to restart Gephi once you’re done.

### 第2步-编写Python脚本 (Step 2 — Writing the Python script)

This tutorial will use Python 3.x, plus a few modules to make life easier. Using the pip module installer, run the following command:

pip3 install wikipedia

Now, in a new directory, create a file called something like script.py, and open it up in your favourite code editor/IDE. Below is an outline of the main logic:

1. First, you’ll need a list of programming languages to include.

首先，您需要包含一系列编程语言

2. Next, go through that list and retrieve the HTML of the relevant Wikipedia article.

接下来，浏览该列表并检索相关Wikipedia文章HTML。
3. From this, extract a list of programming languages that each language has influenced. This will be a rough-and-ready linkage criterion.

从中，提取每种语言影响的编程语言列表。 这将是一个粗略的关联标准。
4. While you’re at it, it’d be nice to grab some metadata about each language.

当您使用它时，最好能获取有关每种语言的一些元数据。
5. Finally, you’ll want to write all the data you’ve collected to a .csv file

最后，您需要将收集的所有数据写入.csv文件

The full script can be found in this gist.

#### 导入一些模块 (Import some modules)

In script.py, start by importing a few modules which will make things easier:

script.py ，首先导入一些模块，这将使事情变得更容易：

import csv
import wikipedia
import urllib.request
from bs4 import BeautifulSoup as BS
import re

OK — begin by making a list of nodes to include. This is where the Wikipedia module comes in handy. It makes accessing the Wikipedia API super-easy.

pageTitle = "List of programming languages"
print(nodes)

If you save and run this script, you’ll see it prints out all the links from the “List of programming languages” Wikipedia article. Nice!

However, it’s always sensible to manually inspect any automatically collected data. A quick glance will reveal that, as well as many actual programming languages, the script has also picked up a few extra links.

For example, you might see “List of markup languages”, “Comparison of programming languages” and others in there.

Although Gephi lets you remove nodes you’d rather not include, it wouldn’t hurt to “clean” the data before proceeding. If anything, this will save time later on.

removeList = [
"List of",
"Lists of",
"Timeline",
"Comparison of",
"History of",
"Esoteric programming language"
]

nodes = [i for i in nodes if not any(r in i for r in removeList)]

These lines define a list of substrings to be removed from the data. The script then goes through the data, removing any elements that contain any of the unwanted substrings.

In Python, this requires just one line of code!

#### 一些辅助功能 (Some helper functions)

Now you can start scraping Wikipedia to build up an edge list (and collect any metadata). To make this easier, first define a few functions.

#### 抓HTML (Grabbing HTML)

The first function uses the BeautifulSoup module to get hold of the HTML for each language’s Wikipedia page.

base = "https://en.wikipedia.org/wiki/"

def getSoup(n):
try:
with urllib.request.urlopen(base+n) as response:
table = soup.find_all("table",class_="infobox vevent")[0]                return table
except:
pass

This function uses the urllib.request module to get hold of the HTML for the page at “https://en.wikipedia.org/wiki/” + “programming language”.

This is then passed to BeautifulSoup, which reads and parses the HTML into an object we can use to search for information.

Next, use the find_all() method to extract the HTML element you’re interested in.

Here, this will be the summary table at the top of each programming language article. How can these be identified?

The easiest way is to visit one of the programming language pages. Here, you can simply use the browser’s Developer Tools to inspect the elements of interest.

The summary table has the HTML tag <table> and the CSS classes "infobox" and "vevent", so you can use these to identify the table in the HTML.

Specify this with the arguments:

• "table" and

"table"

• class_="infobox vevent"

class_="infobox vevent"

find_all() returns a list of all elements that match the criteria. In order to actually specify the element you’re interested in, add the index [0]. If the function is successful, it returns the table object. Otherwise, it returns None.

find_all()返回符合条件的所有元素的列表。 为了实际指定您感兴趣的元素，请添加索引[0] 。 如果函数成功，则返回table对象。 否则，它返回None

With any automated data collection procedure, it’s always important to handle exceptions thoroughly. If not, then in the best case scenario the script crashes and you’ll need to start over.

In the worst case, you’ll end up with a data set riddled with inconsistencies and errors. This will make it a nightmare to work with down the line.

The next function uses the table object to look for some metadata. Here, it searches the table for the year the language first appeared.

def getYear(t):
try:
t = t.get_text()
year = t[t.find("appear"):t.find("appear")+30]
year = re.match(r'.*([1-3][0-9]{3})',year).group(1)
return int(year)
except:
return "Could not determine"

This short function takes the table object as its argument, and uses BeautifulSoup’s get_text() function to produce a string.

The next step is to create a substring called year. This takes the 30 characters after the first appearance of the word "appear". This string should contain the year the language first appeared.

In order to extract just the year, use a regular expression (courtesy of the re module) to match any characters that begin with a digit between 1 and 3, and are followed by three digits.

re.match(r'.*([1-3][0-9]{3})',year)

If this is successful, the function returns year as an integer. Otherwise, it returns a sad-looking “Could not determine”. You might wish to scrape further metadata — such as paradigm, designer or typing discipline.

One more function for you — this time, you’ll feed in the table object for a given language, and hopefully receive out a list of other programming languages.

def getLinks(t):
try:
table_rows = t.find_all("tr")
for i in range(0,len(table_rows)-1):
try:
if table_rows[i].get_text() == "\nInfluenced\n":
out = []
for j in table_rows[i+1].find_all("a"):
try:
out.append(j['title'])
except:
continue
return out
except:
continue
return
except:
return

Woah, look at all that nesting… What is actually going on here then?

This function makes use of the fact that the table objects have a consistent structure. The information in the table is stored in rows (the relevant HTML tag is <tr> ). One of these rows will contain the text "\nInfluenced\n". The first part of the function finds which row this is.

Once this row has been found, you can then be pretty sure the next row contains links to each of the programming languages influenced by the current one. Find these links using find_all("a") — where the argument "a" corresponds to the HTML tag <a>.

For each link j, append its ["title"] attribute to a list called out. The reason to be interested in the ["title"] attribute is because this will match exactly the language’s name as stored in nodes.

For example, Java is stored in nodes as "Java (programming language)", so you need to use this exact name throughout the data set.

If successful, getLinks() returns a list of programming languages. The rest of the function deals with exception handling, in case something should go wrong at any stage.

#### 收集资料 (Collecting the data)

At last, you’re almost ready to sit back and let the script do its thing. It will collect the data and store it in two list objects.

edgeList = [["Source,Target"]]
meta = [["Id","Year"]]

Now write a loop that will apply the functions defined earlier to every item in nodes, and store the outputs in edgeList and meta.

for n in nodes:
try:
temp = getSoup(n)
except:
continue
try:
except:
continue
year = getYear(temp)
meta.append([n,year])

This function takes each language in nodes and attempts to retrieve the summary table from its Wikipedia page.

Then, it retrieves all the languages the table lists as having been influenced by the language in question.

For each language that also appears in the nodes list, append an element to edgeList in the form of ["source,target"]. In this way, you’ll build up an edge list to feed into Gephi.

For debugging purposes, print each element added to edgeList — just to be sure everything’s working as it should. If you were being extra thorough, you could add print statements to the except clauses, too.

Next, get the language’s name and year, and append these to the meta list.

#### 写入CSV (Writing to CSV)

Once the loop has run, the final step is to write the contents of edgeList and meta to comma separated value (CSV) files. This is easily done with the csv module imported earlier.

with open("edge_list.csv","w") as f:
wr = csv.writer(f)
for e in edgeList:
wr.writerow(e)

wr = csv.writer(f2)
for m in meta:
wr.writerow(m)

Done! Save the script, and from the terminal run:

$python3 script.py $ python3 script.py

You should see the script printing out each source-target pair as it builds up the edge list. Make sure your internet connection is steady, and sit back while the script does its magic.

### 步骤3 —使用Gephi构建图 (Step 3 — Graph building with Gephi)

Hopefully you got Gephi installed and running earlier. Now you can create a new project and use the data you gathered to build a directed graph. This will show how different programming languages have influenced one another!

Start by creating a new project in Gephi, and switch to the “Data Laboratory” view. This provides a spreadsheet-like interface for handling data in Gephi. The first thing to do is import the edge list.

点击“导入电子表格”。
• Choose the edge_list.csv file generated by the Python script. Ensure that Gephi knows to use the commas as the separator.

选择Python脚本生成的edge_list.csv文件。 确保Gephi知道使用逗号作为分隔符。

• Choose “Edge List” from the List type.

从列表类型中选择“边缘列表”。
• Click “Next” and check that you are importing both Source and Target columns as strings.

单击“下一步”，并检查您是否正在将源列和目标列都导入为字符串。

This should update the Data Lab with a list of nodes. Now, import the metadata.csv file. This time, make sure to choose “Nodes list” from the List type.

Switch over to the “Preview” tab, and see how the network looks.

Ah… It’s just a little bit… monochrome. And messy. Like a plate of spaghetti. Let’s fix this.

#### 使它漂亮 (Making it pretty)

There are all sorts of ways you can work on the presentation, and here’s where a little bit of creative freedom comes in. With network visualisations, there are essentially three things to take into consideration:

1. Positioning There are several algorithms which can generate layout patterns for a network. A popular choice is the Fruchterman-Reingold algorithm, which is available in Gephi.

定位有几种算法可以生成网络的布局模式。 流行的选择是Gephi中可用的Fruchterman-Reingold算法

2. Sizing The size of nodes in a graph can be used to represent some interesting property. Often, this is a centrality measure. There are many ways of measuring centrality, but they all reflect the “importance” of a given node, in terms of how well-connected it is to the rest of the network.

大小调整图中节点的大小可用于表示一些有趣的属性。 通常，这是一项中心性措施 。 有许多方法可以衡量中心性 ，但它们都反映了给定节点与网络其余部分的连接程度，这一点“很重要”。

3. Coloring It is also possible to use color to show some property of a node. Often, color is used to indicate community structure. This is broadly defined as a “group of nodes which are more connected with each other than with the rest of the graph”. In a social network, this can reveal friendship, family or professional groups. There are several algorithms which can detect community structure. Gephi comes with the Louvain method built-in.

着色也可以使用颜色显示节点的某些属性。 通常，颜色用于指示社区结构 。 这被广泛定义为“一组节点，彼此之间的联系比图的其余部分更多”。 在社交网络中，这可以显示友谊，家庭或专业团体。 有几种算法可以检测社区结构 。 Gephi内置了Louvain方法

To make these changes, you will need to calculate some statistics. Switch to the “Overview” window. Here you will see a panel on the right. It should contain a “Statistics” tab. Open this, and you will see a range of options.

Gephi comes with many inbuilt statistical capabilities. For each of them, clicking “Run” will generate a report that will reveal insights about the network.

Gephi具有许多内置的统计功能。 对于每个用户，单击“运行”将生成一个报告，该报告将揭示有关网络的见解。

Some useful ones to know include:

• Average degree The average language is connected to about four others. The report also shows a degree distribution graph. This reveals that most languages have very few connections, while a small proportion have many. This suggests that this is a scale-free network. Much research has been done on scale-free networks, and the processes that generate them.

平均程度平均语言与大约四种其他语言相关。 该报告还显示了学位分布图。 这表明大多数语言的联系很少，而一小部分则有很多。 这表明这是一个无规模 网络 。 关于无标度网络及其生成过程已经进行了许多研究。

• Diameter This network has a diameter of 12 — meaning this is the “widest” number of connections between any two languages. The average path length is just under four. This means that, on average, any two languages are separated by four edges. These figures give a measure of the “size” of the network.

直径此网络的直径为12，这意味着这是任何两种语言之间“最大”的连接数。 平均路径长度不到4。 这意味着，平均而言，任何两种语言都由四个边分开。 这些数字可以衡量网络的“规模”。

• Modularity This is a score that shows how “compartmentalized” the network is. Here, the modularity score is about 0.53. This is relatively high, suggesting there are distinct modules within this network. Again, this indicates something interesting about the underlying system. Languages tend to fall into distinct “influence groups”.

模块化这是一个分数，显示了网络的“分隔”程度。 在这里，模块化得分约为0.53。 这相对较高，表明该网络中存在不同的模块。 同样，这表明底层系统有一些有趣之处。 语言倾向于分为不同的“影响力群体”。

Anyhow, to modify the appearance of the network, head over to the left panel.

In the “Layout” tab, you can select which layout algorithm to use. Hit “Run” and watch the graph shift about in real-time! See which layout algorithm you think works best.

Above the Layout tab is the “Appearance” tab. Here, you can play with different settings for the node and edge colors, sizes and labels. These can be configured based upon attributes (including the stats you get Gephi to calculate).

“布局”标签上方是“外观”标签。 在这里，您可以对节点和边缘颜色，大小和标签使用不同的设置。 可以根据属性(包括让Gephi计算的统计信息)进行配置。

As a suggestion, you could:

• Color the nodes by their Modularity attribute. This colors them according to their community membership.

通过其模块属性为节点着色。 这根据他们的社区成员身份为其着色。
• Size the nodes by their Degree. Better connected nodes will appear larger than less connected ones.

根据节点的大小调整节点的大小。 连通性更好的节点将看起来比连通性较小的节点更大。

However, you should experiment and come up with a layout you like best.

Once you’re happy with the appearance of your graph, it is time to move on to the final step — exporting to Web!

### 第4步-Sigma.js (Step 4 — Sigma.js)

Already you have built a network visualisation that can be explored in Gephi. You could choose to take a screenshot, or save the graph in SVG, PDF or PNG format.

However, if you installed the Sigma.js plugin earlier, then why not export the graph to HTML? This will create an interactive visualisation that you can host online, or upload to GitHub and share with others.

To do this, select “Export > Sigma.js template…” from Gephi’s menu bar.

Fill in the details as required. Make sure to choose which directory you export the project to. You can change the title, legend, description, hover behavior and many other details. When you’re ready, click “OK”.

Now, if you navigate to the directory you exported the project to, you will see a folder containing all the files generated by Sigma.js.

Open up index.html in your favorite browser. Ta-da! There’s your network! If you know a little CSS and JavaScript, you can dive into the various generated files to tweak the output as you wish.

And that concludes this tutorial!

### 摘要 (Summary)

• Many systems can be modelled and visualised as networks. Graph theory is a branch of math that provides tools to help understand network structures and properties.

许多系统可以建模并可视化为网络。 图论是数学的一个分支，提供了有助于理解网络结构和属性的工具。
• You used Python to scrape data from Wikipedia to build a programming languages influence graph. The linkage criterion was whether a given language was listed as an influence on another’s design.

您使用Python从Wikipedia抓取数据来构建编程语言影响图。 链接标准是是否将一种给定语言列为对另一种设计的影响。
• Gephi and Sigma.js are open-source tools that allow you to analyze and visualize networks. They allow you to export the network in image, PDF or Web formats.

Gephi和Sigma.js是开放源代码工具，可让您分析和可视化网络。 它们允许您以图像，PDF或Web格式导出网络。

Thanks for reading — I look forward to any comments or questions you might have! For a fantastic resource to learn more about graph theory, see Albert-László Barabási’s interactive online book.

The full code for this tutorial can be found here.

• 0
点赞
• 0
评论
• 0
收藏
• 一键三连
• 扫一扫，分享海报

01-19 3931

04-12
08-25
12-16
09-02
12-20
08-08

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、C币套餐、付费专栏及课程。