Fetching Beijing Bus Route Information with urllib

This article is for learning and exchange only. Do not attack other people's servers or sell the scraped data commercially; you bear the consequences of any violation.

Environment:

Python 3.7

PyCharm

urllib ---> ships with Python

BeautifulSoup ----> needs to be installed yourself (pip install bs4; BeautifulSoup is bundled inside the bs4 package)

Note: 1. The full source code is given at the end.
2. This assumes you know how to install packages with pip; if not, search for "pip installation and usage", there are plenty of detailed tutorials online.


Scraping approach:


Target address:

Target site: https://beijing.8684.cn/

1. Open the URL; the page looks as follows:

2. Click into the bus routes that start with a digit ----> open a few of them and watch how the URL changes.

It is easy to spot the pattern: the URL becomes https://beijing.8684.cn/list8, and only the number after 'list' changes each time.
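
For example, the list-page URLs we will request can be generated like this (a minimal sketch; the page range 1 to 3 is arbitrary):

for page in range(1, 4):
    print('https://beijing.8684.cn/list%d' % page)
# https://beijing.8684.cn/list1
# https://beijing.8684.cn/list2
# https://beijing.8684.cn/list3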

3. Next, inspect how the current page presents its data (press F12):

Press Ctrl + Shift + C and then click the element you are interested in; the matching source code is highlighted in the developer tools.

Here, select route 8 and click it to see its source code.

After clicking route 8, you can see its URL matches the href we just found in the source, so the plan is to first scrape the route links from the list page and then fetch each route's details from its own page.


Fetching and parsing the first target URL:

The goal of this step is to collect the URLs of every bus route that starts with the current digit, so that each route's details can then be parsed from its own page.

1. Set up the request headers and write out the base URL so they can be reused later.

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
}

url = 'https://beijing.8684.cn/'

url_list = url + 'list%d'  # url already ends with '/', so avoid a double slash

2. There is no need to write a separate function to build this first target URL, because the number after 'list' is simply the page number, so we can just substitute it inside a for loop. Decide the number of pages to crawl yourself, or change it to read from console input (a sketch of that variant follows the loop below).

if __name__ == '__main__':
    for k in range(1, 2):    # crawl list page 1 only; widen the range as needed
        urls = url_list % k
        time.sleep(3)        # pause between list pages to go easy on the server
        get_page_url(urls)
        print(f'Finished page {k}......')
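
If you would rather enter the page count at the console, as mentioned above, a minimal variant looks like this (just a sketch; the prompt text and the max_page name are my own):

if __name__ == '__main__':
    max_page = int(input('How many list pages to crawl? '))  # hypothetical prompt
    for k in range(1, max_page + 1):
        urls = url_list % k
        time.sleep(3)
        get_page_url(urls)
        print(f'Finished page {k}......')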

3. Write a function that parses the first target page and extracts the corresponding route links.

From the screenshot above, we only need to find the div tag with class='list clearfix' to get the corresponding links (if jumping straight to that div feels fragile, you can also narrow down step by step with CSS class selectors via select; a sketch of that follows the function below).

def get_page_url(urls):
    rep = urllib.request.Request(urls, headers=headers)
    html = urllib.request.urlopen(rep)
    btsoup = bs(html.read(), 'html.parser')
    lu = btsoup.find('div', class_='list clearfix')
    hrefs = lu.find_all('a')
    for i in hrefs:
        print(i)
        print(i['href'])
        urls = urljoin(url, i['href'])  # join the relative href onto the base URL
        get_page_info(urls)             # defined below
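
As mentioned above, the same links can also be collected with select and CSS class selectors. A minimal sketch (it reuses the headers and url defined earlier; get_page_url_by_select is just a name I made up):

def get_page_url_by_select(urls):
    rep = urllib.request.Request(urls, headers=headers)
    html = urllib.request.urlopen(rep)
    btsoup = bs(html.read(), 'html.parser')
    # step down through the classes: the div carries both 'list' and 'clearfix'
    for a in btsoup.select('div.list.clearfix a'):
        get_page_info(urljoin(url, a['href']))  # same downstream handling as above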

Visiting the second target URL and extracting its data:

1. Visit each route URL collected from the first target and parse its data.

2. After parsing, store the fields in a result list so the data can be saved afterwards.

3. Write one function that implements both steps above.

def get_page_info(urls):
    rep = urllib.request.Request(url=urls, headers=headers)
    html = urllib.request.urlopen(rep)
    soup = bs(html.read(), 'html.parser')
    bus_name = soup.select('div.info > h1.title > span')[0].string
    bus_type = soup.select('div.info > h1.title > a.category')[0].string

    time_select = soup.select('div.info > ul.bus-desc > li')
    bus_time = time_select[0].string              # operating hours
    bus_ticket = time_select[1].string            # ticket price
    gongsi = time_select[2].find('a').string      # operating company
    gengxin = time_select[3].find('span').string  # last-updated date

    try:
        licheng = soup.find('div', class_="change-info mb20").string  # route length
    except AttributeError:
        licheng = None

    # outbound direction ---> index 0
    wang_info1 = bus_name
    wang_info2 = soup.select('div > div > div.trip')[0].string
    wang_total = soup.select('div > div > div.total')[0].string
    wang_road_ol = soup.find_all('div', class_='bus-lzlist mb15')[0].find_all('ol')
    wang_road = get_page_wangFan(wang_road_ol)

    # return direction ---> index 1 (may be missing, hence the IndexError guard)
    try:
        fan_info1 = bus_name
        fan_info2 = soup.select('div > div > div.trip')[1].string
        fan_total = soup.select('div > div > div.total')[1].string
        fan_road_ol = soup.find_all('div', class_='bus-lzlist mb15')[1].find_all('ol')
        fan_road = get_page_wangFan(fan_road_ol)
    except IndexError:
        fan_info1 = None
        fan_info2 = None
        fan_total = None
        fan_road = None

    result_lst = [bus_name, bus_type, bus_time, bus_ticket, gongsi, gengxin, licheng, wang_info1, wang_info2,
                  wang_total, wang_road, fan_info1, fan_info2, fan_total, fan_road]

    with open('BeiJing_Bus_Info.txt', 'a', newline="", encoding='utf-8') as cs:
        writer = csv.writer(cs)
        # empty-value handling could be done here (skipped)
        # write one row of data
        writer.writerow(result_lst)
    print("Downloaded another route ^_^")
    time.sleep(5)
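
If you want column names in the output file, a header row could be written once before the crawl starts; a minimal sketch, assuming these English labels for the fields in result_lst (the labels are my own, and the 'w' mode truncates the file, so run it only once):

import csv

with open('BeiJing_Bus_Info.txt', 'w', newline="", encoding='utf-8') as f:
    csv.writer(f).writerow(['name', 'type', 'hours', 'ticket', 'company', 'updated',
                            'length', 'out_name', 'out_trip', 'out_total', 'out_stops',
                            'ret_name', 'ret_trip', 'ret_total', 'ret_stops'])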

Note: the outbound/return stop lists are fetched in a separate function, because the page hides part of the data, so keep an eye out for that.
def get_page_wangFan(wangFan_road_ol):
    wangFan_road_tmp = wangFan_road_ol[0].find_all('li')

    wangFan_road_lst = []
    for road in wangFan_road_tmp:
        temp = road.find('a')
        if temp is None:
            continue
        else:
            wangFan_road_lst.append(temp)

    # drop the last entry
    wangFan_road_lst.pop()
    try:
        # look for the hidden data in the second <ol>
        wangFan_road_tmp = wangFan_road_ol[1].find_all('li')
    except IndexError:
        wangFan_road_tmp = None

    if wangFan_road_tmp is not None:
        for road in wangFan_road_tmp:
            temp = road.find('a')
            if temp is None:
                continue
            else:
                wangFan_road_lst.append(temp)

    # format the field as one comma-separated string
    wangFan_road = ""
    for r in wangFan_road_lst:
        wangFan_road += r.string + ', '

    return wangFan_road
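
Because ', ' is appended after every stop name, the returned string ends with a trailing separator. A join-based variant of the same formatting step avoids that (just a sketch; format_stops is a name introduced here):

def format_stops(road_lst):
    # road_lst holds the <a> Tag objects collected above
    return ', '.join(a.string for a in road_lst)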

Full source code

import csv
import time
import urllib.request
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36'
}
url = 'https://beijing.8684.cn/'

url_list = url + 'list%d'  # url already ends with '/', so avoid a double slash


def get_page_wangFan(wangFan_road_ol):
    wangFan_road_tmp = wangFan_road_ol[0].find_all('li')

    wangFan_road_lst = []
    for road in wangFan_road_tmp:
        temp = road.find('a')
        if temp is None:
            continue
        else:
            wangFan_road_lst.append(temp)

    # drop the last entry
    wangFan_road_lst.pop()
    try:
        # look for the hidden data in the second <ol>
        wangFan_road_tmp = wangFan_road_ol[1].find_all('li')
    except IndexError:
        wangFan_road_tmp = None

    if wangFan_road_tmp is not None:
        for road in wangFan_road_tmp:
            temp = road.find('a')
            if temp is None:
                continue
            else:
                wangFan_road_lst.append(temp)

    # format the field as one comma-separated string
    wangFan_road = ""
    for r in wangFan_road_lst:
        wangFan_road += r.string + ', '

    return wangFan_road


def get_page_info(urls):
    rep = urllib.request.Request(url=urls, headers=headers)
    html = urllib.request.urlopen(rep)
    soup = bs(html.read(), 'html.parser')
    bus_name = soup.select('div.info > h1.title > span')[0].string
    bus_type = soup.select('div.info > h1.title > a.category')[0].string

    time_select = soup.select('div.info > ul.bus-desc > li')
    bus_time = time_select[0].string              # operating hours
    bus_ticket = time_select[1].string            # ticket price
    gongsi = time_select[2].find('a').string      # operating company
    gengxin = time_select[3].find('span').string  # last-updated date

    try:
        licheng = soup.find('div', class_="change-info mb20").string  # route length
    except AttributeError:
        licheng = None

    # outbound direction ---> index 0
    wang_info1 = bus_name
    wang_info2 = soup.select('div > div > div.trip')[0].string
    wang_total = soup.select('div > div > div.total')[0].string
    wang_road_ol = soup.find_all('div', class_='bus-lzlist mb15')[0].find_all('ol')
    wang_road = get_page_wangFan(wang_road_ol)

    # return direction ---> index 1 (may be missing, hence the IndexError guard)
    try:
        fan_info1 = bus_name
        fan_info2 = soup.select('div > div > div.trip')[1].string
        fan_total = soup.select('div > div > div.total')[1].string
        fan_road_ol = soup.find_all('div', class_='bus-lzlist mb15')[1].find_all('ol')
        fan_road = get_page_wangFan(fan_road_ol)
    except IndexError:
        fan_info1 = None
        fan_info2 = None
        fan_total = None
        fan_road = None

    result_lst = [bus_name, bus_type, bus_time, bus_ticket, gongsi, gengxin, licheng, wang_info1, wang_info2,
                  wang_total, wang_road, fan_info1, fan_info2, fan_total, fan_road]

    with open('BeiJing_Bus_Info.txt', 'a', newline="", encoding='utf-8') as cs:
        writer = csv.writer(cs)
        # empty-value handling could be done here (skipped)
        # write one row of data
        writer.writerow(result_lst)
    print("Downloaded another route ^_^")
    time.sleep(5)


def get_page_url(urls):
    rep = urllib.request.Request(urls, headers=headers)
    html = urllib.request.urlopen(rep)
    btsoup = bs(html.read(), 'html.parser')
    lu = btsoup.find('div', class_='list clearfix')
    hrefs = lu.find_all('a')
    for i in hrefs:
        # i is a Tag object, so the href attribute can be read directly via i['href']
        print(i)
        ####################################################
        print(i['href'])
        urls = urljoin(url, i['href'])  # join the relative href onto the base URL
        get_page_info(urls)


if __name__ == '__main__':
    for k in range(1, 2):
        urls = url_list % k
        time.sleep(3)
        get_page_url(urls)
        print(f'Finished page {k}......')