Web Crawlers (2). Application: Scraping and Plotting Beijing's Rail Lines

LS_learner

Category: Crawlers. Tags: regular expressions, python

Article link: https://blog.csdn.net/qq_39777550/article/details/105577405

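The previous article covered regular expressions, which are very effective for pulling fields out of scraped JSON text.
Setting aside how crawlers work for now, here is a simple example that fetches some data and then processes it.
The data (Beijing subway data) comes from this URL: http://map.amap.com/service/subway?_1469083453978&srhdata=1100_drw_beijing.json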

import requests
import re
r = requests.get('http://map.amap.com/service/subway?_1469083453978&srhdata=1100_drw_beijing.json')

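requests is an HTTP library that makes it easy to send HTTP requests and to handle the responses.
re is the standard-library module for regular expressions.
requests.get() sends a GET request to the URL and returns the response.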

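# Collect each station and its position:
# {station name: (longitude, latitude)}
places = re.findall(r'"n":"\w+"', r.text)
lat_lon = re.findall(r'"sl":"(\d+\.\d+,\d+\.\d+)"', r.text)
stations_info = {}
for i in range(len(places)):
    # Pull the bare station name out of the matched '"n":"..."' fragment
    place_name = re.findall(r'"n":"(\w+)"', places[i])[0]
    # Assumes the i-th name and the i-th coordinate pair belong to the same station
    stations_info[place_name] = tuple(map(float, lat_lon[i].split(',')))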

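r.text returns the response body decoded as a string, so the whole payload is one large string. Regular expressions can then be used to extract the information we want from it.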

In the code above:

re.findall(r'"n":"\w+"', r.text)

searches r.text for the regular expression '"n":"\w+"'. It matches text such as
"n":"稻香湖路".

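re.findall(r'"sl":"(\d+\.\d+,\d+\.\d+)"', r.text)

matches text in r.text such as "sl":"116.188145,40.068936". Because the pattern contains a pair of parentheses (a capture group), findall returns only the part inside them: the whole match is "sl":"116.188145,40.068936", but only 116.188145,40.068936 is extracted.
(Running the code and comparing its output with r.text makes this easy to see.)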
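To make the difference between the two patterns concrete, here is a minimal sketch that runs them on a single hand-written fragment. The `sample` string is a hypothetical stand-in that only imitates the shape of the response text; it is not the real payload.

```python
import re

# Hypothetical fragment imitating one station entry in the response text
sample = '{"n":"稻香湖路","sl":"116.188145,40.068936"}'

# No capture group: findall returns the whole matched text
whole = re.findall(r'"n":"\w+"', sample)

# One capture group: findall returns only what the parentheses matched
inner = re.findall(r'"n":"(\w+)"', sample)
coords = re.findall(r'"sl":"(\d+\.\d+,\d+\.\d+)"', sample)

# Split the captured text into two floats (longitude first, then latitude)
lon, lat = map(float, coords[0].split(','))

print(whole, inner, coords, (lon, lat))
```

In Python 3, `\w` also matches Chinese characters in a `str` pattern, which is why the station name is captured in full.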
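The code below takes a similar approach. Here | means alternation ("or"): a piece of the string is matched if it satisfies either of the two patterns. Because the pattern contains two capture groups, findall returns a 2-tuple for every match, with an empty string in the position of the group that did not take part in the match.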

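# Build {line name: [station names]}
# Each match is a tuple (station name, line name); exactly one element is non-empty
kn = re.findall(r'"n":"(\w+)"|"kn":"(\w+)"', r.text)
kn.reverse()  # reversed so that each line name is seen before its stations
lines_info = {}
for i in kn:
    if i[0] == '':
        # "kn" matched: start a new line
        tem_key = i[1]
        lines_info[tem_key] = []
    else:
        # "n" matched: add the station to the current line
        lines_info[tem_key].append(i[0])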
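The grouping step can be tried on a small string first. The sketch below assumes, as the reversal in the code above suggests, that each line's name appears after its stations in the text. The `sample` string and the helper `group_by_line` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical fragment: two lines, each listing its stations before its "kn" field
sample = ('"n":"苹果园","n":"古城","kn":"地铁1号线",'
          '"n":"西直门","n":"积水潭","kn":"地铁2号线"')

# Two capture groups: every match is a (station, line) tuple with one empty slot
pairs = re.findall(r'"n":"(\w+)"|"kn":"(\w+)"', sample)


def group_by_line(pairs):
    """Group station names under the line name that follows them in the text."""
    lines = {}
    current = None
    # Walk backwards so that a line name is always seen before its stations
    for station, line in reversed(pairs):
        if line:
            current = line
            lines[current] = []
        elif current is not None:
            lines[current].append(station)
    return lines


grouped = group_by_line(pairs)
print(pairs)
print(grouped)
```

Because the list is walked backwards, the stations of each line come out in the reverse of their order in the text. That does not matter for drawing, since only neighbouring pairs are used.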

The code below reorganizes the data so that the graph can be drawn.

# Build the adjacency dict: {line name: [(station, next station), ...]}
neighbor_info = {}
for i in lines_info:
    neighbor_info[i] = []
    for j in range(len(lines_info[i]) - 1):
        neighbor_info[i].append((lines_info[i][j], lines_info[i][j + 1]))
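The same pairing of consecutive stations can be written with zip. The sketch below also shows one possible way to turn the per-line station lists into a station-to-neighbours mapping. The helpers `consecutive_pairs` and `station_adjacency` and the toy data are hypothetical, not part of the original code.

```python
def consecutive_pairs(stations):
    """Return [(s0, s1), (s1, s2), ...] for an ordered list of stations."""
    return list(zip(stations, stations[1:]))


def station_adjacency(lines):
    """Map each station to the sorted list of stations directly connected to it."""
    adjacency = {}
    for stations in lines.values():
        for a, b in consecutive_pairs(stations):
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return {name: sorted(neighbors) for name, neighbors in adjacency.items()}


# Hypothetical toy data, not taken from the real response
toy_lines = {'L1': ['A', 'B', 'C'], 'L2': ['B', 'D']}

# Same shape as neighbor_info: {line name: [(station, next station), ...]}
toy_edges = {name: consecutive_pairs(st) for name, st in toy_lines.items()}

# Station-level view: which stations touch which
toy_adjacency = station_adjacency(toy_lines)

print(toy_edges)
print(toy_adjacency)
```

The per-line edge lists are what the drawing step needs, since each line is drawn in its own color. The station-level mapping would be the more natural form for anything that walks the network.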

import networkx as nx
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams['font.sans-serif'] = ['SimHei']  # default font, so that Chinese station names render
plt.figure(figsize=(30, 30))
city_graph = nx.Graph()  # create a new graph instance
city_graph.add_nodes_from(list(stations_info.keys()))  # add the nodes

# Draw the nodes (nodelist defaults to all nodes in G); stations_info doubles as the position dict
nx.draw_networkx_nodes(city_graph, stations_info, node_size=20, node_color='red')
nx.draw_networkx_labels(city_graph, stations_info, font_size=7)
col_list = ['#b45b1f','#1fb4a6','#1f2db4','#b4a61f','#78b41f','#b41f78','#b41f78','#a61fb4','#b45b1f','#2db41f','#5b1fb4','#78b41f',
            '#b45b1f','#1fb4a6','#1f2db4','#b4a61f','#78b41f','#2aa930','#b41f78','#a61fb4','#b45b1f','#2db41f','#5b1fb4','#78b41f']
for i, index in enumerate(neighbor_info):
    # Draw one line at a time; cycle through the colors in case there are more lines than colors
    nx.draw_networkx_edges(city_graph, stations_info, edgelist=neighbor_info[index],
                           width=1.5, edge_color=col_list[i % len(col_list)])
plt.show()