
Article link: https://blog.csdn.net/qq_46240318/article/details/121272215

title: Implementing a Resumable Web Crawler (Part 1)
author: LiSoul
date: 2021-11-11

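Web crawling is one of the most commonly used techniques in software development. When a large amount of data has to be crawled, however, the program can fail for all sorts of reasons, and we do not want to start again from the beginning. For convenience, here is one approach to a crawler that can resume from where it stopped. If I have overlooked anything, suggestions are welcome. Thanks.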

1. Libraries used

bs4

bs4 (Beautiful Soup 4) makes it quick and easy to extract the content we need from a web page.
- Install
```
python -m pip install beautifulsoup4
```
requests

requests is one of the simplest and most user-friendly HTTP libraries for Python.
- Install
```
python -m pip install requests
```
openpyxl

openpyxl is a tool for reading and writing XLSX spreadsheet files.
- Install
```
python -m pip install openpyxl
```
json

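The json module lets us store the program's runtime data as JSON. It is part of the Python standard library, so nothing needs to be installed.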
fake_useragent

fake_useragent is a tool that generates random browser User-Agent request headers.
- Install
```
python -m pip install fake_useragent
```

2. Implementing each part

A crawler consumes a lot of network resources, so please keep your request rate under control while crawling.
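
One simple way to limit the request rate is to enforce a minimum interval between consecutive requests. The sketch below is an illustration, not part of the original program: the `Throttle` class and its `min_interval` parameter are hypothetical names, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls to wait() (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds that must pass between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first one

    def wait(self):
        """Sleep just long enough that calls are at least min_interval apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would create one instance, for example `throttle = Throttle(1.0)`, and call `throttle.wait()` right before every `requests.get(...)`.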

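This crawl uses the National Bureau of Statistics' codes for statistical administrative divisions and urban-rural classification as its example.
Address: http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html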

1. Implementing JSON data storage

Loading the JSON data

First create a file named data.json in the directory you run the program from, then put the following JSON data into it.

{
  "data": [
    {
      "father_id": "0",
      "deep": 1,
      "url": "2020/index.html"
    }
  ]
}

Then create your main program, for example main.py:

import json

# The with block closes the file as soon as reading is finished
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
print(data)     # data now holds the contents of your JSON file
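
On the very first run `data.json` may not exist yet. A minimal sketch, assuming we want to fall back to the seed task from the JSON shown earlier in that case; `load_state` and `DEFAULT_STATE` are hypothetical names, not part of the original program:

```python
import copy
import json
import os

# Seed task: the same content as the initial data.json
DEFAULT_STATE = {"data": [{"father_id": "0", "deep": 1, "url": "2020/index.html"}]}


def load_state(path="data.json"):
    """Return the saved crawl state, or a fresh copy of the seed if no file exists."""
    if not os.path.exists(path):
        # Copy so that later changes never modify the module-level default
        return copy.deepcopy(DEFAULT_STATE)
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

With a helper like this the program starts from the top-level page on a clean machine and continues from the saved state otherwise.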

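Saving the JSON data

Each time a record has been crawled, we save the data the program has produced at runtime to the JSON file. Otherwise that data would be lost if the program stopped unexpectedly, and we would have to crawl from the beginning again.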

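······
# data is the runtime state left after the previous crawl step;
# json.dump serializes the dict, because file.write() only accepts strings
with open('data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False)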
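
If the program is killed while `data.json` is being written, the file can be left truncated and the checkpoint is lost. One common way to reduce that risk is to write to a temporary file first and then rename it over the real one. A sketch under that assumption; `save_state` is a hypothetical helper name:

```python
import json
import os
import tempfile


def save_state(data, path="data.json"):
    """Write data as JSON to path by replacing the file in a single rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        # Remove the temporary file if writing or renaming failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers of `data.json` then see either the previous complete checkpoint or the new complete one, rather than a half-written file.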

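2. Storing data in an Excel workbook

An Excel workbook lets us store the crawled data in a file. You could also store the data in a database, or save it as a JSON data file.
Tip: database operations can be done with pymysql.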
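
As an illustration of the database option, the sketch below uses Python's built-in `sqlite3` module instead of pymysql so that it runs without a database server. This is a swapped-in stand-in, not the setup used in this post, and the table name `province`, its columns, and the `save_rows` helper are assumptions made for the example.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert crawled records (dicts with id, name, url) into the province table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS province (id TEXT PRIMARY KEY, name TEXT, url TEXT)"
    )
    # INSERT OR REPLACE keeps a re-run after a restart from duplicating stored rows
    conn.executemany(
        "INSERT OR REPLACE INTO province (id, name, url) VALUES (:id, :name, :url)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for demonstration only
save_rows(conn, [{"id": "11", "name": "北京市", "url": "11.html"}])
print(conn.execute("SELECT id, name, url FROM province").fetchall())
```

Keying the table on the region ID is a design choice that suits a resumable crawler: repeating a page after a crash overwrites the same rows instead of adding copies.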

Loading the Excel workbook

import openpyxl

# address.xlsx must already exist; load_workbook raises an error if the file is missing
book = openpyxl.load_workbook('address.xlsx')
L = book.sheetnames     # names of all sheets currently in the workbook
# The sheets our workbook needs to contain
address_list = ["province", "municipality", "district", "township", "village"]

# On the first run the program creates whichever of these sheets are missing
for item in address_list:
    if item not in L:
        book.create_sheet(item)
        # Each sheet has different column names, so write each header row separately
        if item == "province":
            book[item].append(("ID", "NAME", "URL"))
        elif item == "municipality":
            book[item].append(("ID", "NAME", "URL", "PROVINCEID"))
        elif item == "district":
            book[item].append(("ID", "NAME", "URL", "MUNICIPALITYID"))
        elif item == "township":
            book[item].append(("ID", "NAME", "URL", "DISTRICTID"))
        elif item == "village":
            book[item].append(("ID", "CITYID", "NAME", "TOWNSHIPID"))

book.save('address.xlsx')   # write the newly created sheets back to the file

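3. Implementing the crawler for each level

First, let's analyze the data we need to crawl. Looking at the pages, the data has five levels, and the relationship between the levels closely resembles the tree structure from a data structures course. To reach the data at every level we will therefore need some algorithmic techniques. Which ones? Let's keep that a surprise for now and implement the crawler for each level first.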
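
Without anticipating which algorithm the later parts of this series use, one possible way to walk such a tree while staying resumable is a depth-first traversal over an explicit list of pending tasks, where each task records its parent, its depth, and its URL like the entries in `data.json`. In the sketch below, `crawl`, `fetch_children`, and `on_checkpoint` are hypothetical names, and the tiny in-memory `TREE` stands in for real pages.

```python
def crawl(pending, fetch_children, on_checkpoint, max_deep=5):
    """Process tasks from the pending list until it is empty.

    pending        -- list of dicts with 'father_id', 'deep' and 'url' keys
    fetch_children -- callable(task) -> list of dicts with 'id', 'name', 'url'
    on_checkpoint  -- callable(pending), called after every finished task
    """
    results = []
    while pending:
        task = pending.pop()  # take the most recently added task (depth-first)
        for child in fetch_children(task):
            results.append((task["deep"], task["father_id"], child["id"], child["name"]))
            # Queue the child's own page unless we are at the last level or it has no link
            if task["deep"] < max_deep and child.get("url"):
                pending.append(
                    {"father_id": child["id"], "deep": task["deep"] + 1, "url": child["url"]}
                )
        on_checkpoint(pending)  # e.g. write the pending list back to data.json
    return results


# Stand-in for real pages: maps a URL to the records found on that page
TREE = {
    "2020/index.html": [{"id": "11", "name": "北京市", "url": "2020/11.html"}],
    "2020/11.html": [{"id": "1101", "name": "市辖区", "url": None}],
}

snapshots = []
rows = crawl(
    [{"father_id": "0", "deep": 1, "url": "2020/index.html"}],
    fetch_children=lambda task: TREE.get(task["url"], []),
    on_checkpoint=lambda pending: snapshots.append(len(pending)),
)
print(rows)
```

Because the checkpoint is written only after a task's children have been queued, a crash in the middle of a task means that task is repeated after a restart rather than lost.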

Crawling the level-1 regions (provinces)

# Fetch the level-1 regions
import bs4
import requests

global_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/"

def fetch_province_list(url): # url is the province index page, relative to global_url
    # For example, "2020/index.html" is requested as
    # "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"
    data = []
    response = requests.get(global_url + url)
    response.encoding = "gbk"   # decode the page as GBK
    demo = response.text
    soup = bs4.BeautifulSoup(demo, "html.parser")
    rows = soup.body.find_all(name='tr', attrs={'class': 'provincetr'})
    for item in rows:
        # Each province row holds several links, one per province
        for item2 in item.find_all(name='a'):
            # The file name without its extension serves as the ID, e.g. "11.html" -> "11"
            data.append({'id': item2['href'].split('.')[0],
                         'name': item2.text, 'url': item2['href']})
    for item in data:
        print(item)
    return data

Crawling the level-2 to level-4 regions

def fetch_district_list(url, deep):     # url is the page for a level-2 to level-4 region, relative to global_url; deep is the level being crawled
    # Examples of the full URLs that get requested:
    #   City: "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/11.html"
    #   District: "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/11/1101.html"
    #   Subdistrict: "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/11/01/110101.html"
    response = requests.get(global_url + url)
    response.encoding = "gbk"   # decode the page as GBK
    demo = response.text
    soup = bs4.BeautifulSoup(demo, "html.parser")
    # The row class name differs from level to level, so choose it by depth
    if deep == 2:
        rows = soup.body.find_all(name='tr', attrs={'class': 'citytr'})
    elif deep == 3:
        rows = soup.body.find_all(name='tr', attrs={'class': 'countytr'})
    elif deep == 4:
        rows = soup.body.find_all(name='tr', attrs={'class': 'towntr'})
    else:
        raise ValueError("deep must be 2, 3 or 4, got %r" % (deep,))
    data = []
    for item in rows:
        temp = item.find_all(name='a')
        # Rows without links have no child page and are skipped
        if temp:
            # The first link holds the code, the last link holds the name
            data.append(
                {'id': temp[0].text, 'name': temp[-1].text, 'url': temp[-1]['href']})
    for item in data:
        print(item)
    return data
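
The example URLs in the comments above sit one directory deeper at each level. Assuming the `href` values on each page are relative to that page (an assumption about the site, not something this post states), the standard library's `urllib.parse.urljoin` can resolve them against the URL of the page they were found on. `resolve` and `BASE` are hypothetical names for this sketch.

```python
from urllib.parse import urljoin

BASE = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/"


def resolve(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Follow one chain of links from the index page down to a subdistrict page
city_page = resolve(BASE + "2020/index.html", "11.html")
county_page = resolve(city_page, "11/1101.html")
town_page = resolve(county_page, "01/110101.html")
print(town_page)
```

Storing the resolved absolute URL with each pending task would avoid having to rebuild the directory prefix by hand at every level.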

Crawling the level-5 regions (villages)

def fetch_village_list(url):    # url is the page for a level-5 region, relative to global_url
    # Example of the full URL: "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/11/01/01/110101001.html"
    response = requests.get(global_url + url)
    response.encoding = "gbk"   # decode the page as GBK
    demo = response.text
    soup = bs4.BeautifulSoup(demo, "html.parser")
    rows = soup.body.find_all(name='tr', attrs={'class': 'villagetr'})
    data = []
    for item in rows:
        # Village rows contain plain cells instead of links
        village = item.find_all('td')
        data.append(
            {'id': village[0].text, 'city_id': village[1].text, 'name': village[-1].text})
    for item in data:
        print(item)
    return data