Python Web Scraping for Beginners: Fetching Epidemic Data the Easy Way

Latest recommended article published on 2024-05-12 18:12:38

Ma Sizhou

Category column: Python web crawlers

Article link: https://blog.csdn.net/weixin_45901519/article/details/110295813

Notes from the 黑马程序员 (Itheima) video course, for personal use.

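Contents

I. Basics
1. Introduction to web crawlers
1.1 How a web crawler differs from a browser
1.2 What a web crawler is
1.3 What web crawlers are for

2. The requests library
2.1 About requests
2.2 Installing requests
2.3 Basic usage of requests

3. The Beautiful Soup parsing library
3.1 About Beautiful Soup
3.2 Installing Beautiful Soup
3.3 The BeautifulSoup object: introduction and creation
3.4 The find method of the BeautifulSoup object
3.5 Case study: extracting each country's latest epidemic data from the home page

4. Regular expressions
4.1 Concept and purpose of regular expressions
(1) Concept:
(2) Purpose:

4.2 Common regular expression syntax
4.3 The re.findall() method
(1) API:
(2) Characteristics of findall()

4.4 Using raw strings (r prefix) in regular expressions
4.5 Extracting the JSON string of the latest epidemic data
4.6 Summary

5. The json module
5.1 About the json module
5.2 Converting JSON to Python
5.3 Converting Python to JSON
(1) Converting Python data to a JSON string:
(2) Writing Python data to a file as JSON:

5.4 Parsing the JSON string of the latest epidemic data
5.5 Summary

II. The epidemic crawler project
1. Collecting the latest day's epidemic data for every country
2. Collecting every country's epidemic data since January 23
3. Collecting the latest day's epidemic data for every province in China
4. Collecting every Chinese province's epidemic data since January 22
5. Summary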

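I. Basics

1. Introduction to web crawlers

1.1 How a web crawler differs from a browser

A browser works like this: send a request -> the server responds -> the response data comes back and is rendered.
A web crawler works like this: send a request -> the server responds -> the response data comes back.

In short: a crawler makes the same kind of request as a browser but skips the rendering step and works directly with the returned data.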

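1.2 What a web crawler is

A web crawler is a program that sends requests the way a browser would and automatically collects the data that comes back.

1.3 What web crawlers are for

So how do we request data? Read on: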

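2. The requests library

2.1 About requests

requests is a third-party Python HTTP library for sending requests and reading the responses.

2.2 Installing requests

Open a terminal and run: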

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple

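2.3 Basic usage of requests

Here is an example that requests the Baidu home page:

# 1. Import the module
import requests

# 2. Send the request and get the response
response = requests.get('http://www.baidu.com')
print(response)  # <Response [200]> means success

# 3. Read the response data
# print(response.encoding)  # Show the encoding requests assumed: ISO-8859-1

## Option 1: response.text
response.encoding = 'utf-8'  # Change the encoding
print(response.text)

## Option 2: response.content (recommended)
print(response.content.decode())  # decode() uses utf-8 by default
# print(response.content.decode(encoding='gbk'))  # Decode as gbk instead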
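To see why the encoding matters when reading a response, here is a minimal offline sketch. The bytes value `body` is a hypothetical stand-in for `response.content`; no request is sent.

```python
# Hypothetical stand-in for response.content: UTF-8 bytes of a small page.
body = '<title>百度一下</title>'.encode('utf-8')

# Decoding with the right codec recovers the original text.
good = body.decode()  # bytes.decode() defaults to utf-8
print(good)

# Decoding as ISO-8859-1 never raises, but non-ASCII text turns into mojibake.
bad = body.decode('iso-8859-1')
print(bad == good)
```

This is the same situation as reading response.text while the assumed encoding is still ISO-8859-1.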

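Note: response.text decodes the body with the encoding that requests inferred (ISO-8859-1 in this example), so set response.encoding before reading it. response.content is the raw bytes, which you decode yourself.

The data has been fetched. How do we extract the part we want from it? Read on: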

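3. The Beautiful Soup parsing library

3.1 About Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML documents.

3.2 Installing Beautiful Soup

Beautiful Soup 3 is no longer maintained, so install Beautiful Soup 4:

pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple

Also install the lxml parser:

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple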

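3.3 The BeautifulSoup object: introduction and creation

(1) Introduction:
A BeautifulSoup object represents the whole parsed document and supports searching the document tree.

(2) Creation:

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Create the BeautifulSoup object
soup = BeautifulSoup('<html>data</html>', 'lxml')  # First argument: the HTML document string; second argument: the parser to use
print(soup)  # BeautifulSoup repairs the HTML automatically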

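3.4 The find method of the BeautifulSoup object

find searches the document tree, in which tags nest inside other tags, and returns the first element that matches.

The find method's API: find(name, attrs, recursive, string, **kwargs). name is a tag name, attrs is a dictionary of attributes, recursive controls whether all descendants are searched, and string matches text content (older versions call this parameter text). It returns the first match, or None if nothing matches.

Now some examples:

(1) Find by tag name:

Task: get the title tag and the a tags in the document.

Code:

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Prepare the document string
html = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and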
            <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>

 """

# 3. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')  # First argument: the HTML document string; second argument: the parser to use

# 4. Find the title tag
title = soup.find('title')
print(title)

# 5. Find an a tag
a = soup.find('a')
print(a)  # Only the first one

## Find all a tags
a_s = soup.find_all('a')  # Collects every a tag into a list and returns it
print(a_s)

(2) Find by attribute:

Task: get the tag whose id is link1.

Code:

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Prepare the document string
html = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
            <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>

 """

# 3. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# 4. Get the tag whose id is link1
## Option 1: pass the attribute as a keyword argument
a = soup.find(id="link1")
print(a)

## Option 2: pass an attribute dictionary through attrs
a = soup.find(attrs={'id': 'link1'})
print(a)

(3) Find by text (rarely used):

Task: get the text node whose content is Elsie.

Code:

# 1. Import the module
from bs4 import BeautifulSoup

# 2. Prepare the document string
html = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head> 
    <body>
        <p class="title">
            <b>The Dormouse's story</b>
        </p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>.
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
            <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>

 """

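# 3. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# 4. Find the text node whose content is Elsie
text = soup.find(string='Elsie')  # string= is the current name of the older text= argument
print(text)

When find matches a tag, as in examples (1) and (2), what it returns is a Tag object. Here is a brief look at the Tag object:

A Tag object corresponds to one tag in the original document. Its commonly used attributes are name (the tag name), attrs (a dictionary of all attributes) and text (the tag's text content).

Example:

soup = BeautifulSoup(html, 'lxml')
a = soup.find(id="link1")  # a is a Tag object

# The Tag object
print(type(a))  # <class 'bs4.element.Tag'>
print('Tag name:', a.name)
print('All attributes:', a.attrs)
print('Text content:', a.text)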
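Besides name, attrs and text, a Tag can be indexed like a dictionary to read a single attribute. The sketch below is a minimal illustration on a one-line document made up for this purpose; it uses Python's built-in 'html.parser' only so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

html = '<p><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
# html.parser ships with Python, so this sketch needs no extra parser package.
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

href = a['href']          # index like a dict; raises KeyError if the attribute is missing
missing = a.get('title')  # .get returns None when the attribute is absent
classes = a['class']      # class can hold several values, so it comes back as a list
print(href, missing, classes)
```

Using .get is the safer choice when a page may omit the attribute on some tags.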

3.5 Case study: extracting each country's latest epidemic data from the home page

Code:

# 1. Import the modules
import requests
from bs4 import BeautifulSoup

# 2. Send the request and get the epidemic home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()  # utf-8 by default
print(home_page)  # Print to check that the request worked

# 3. Use BeautifulSoup to extract the epidemic data
soup = BeautifulSoup(home_page, 'html5lib')  # Create the BeautifulSoup object
script = soup.find(id="getListByCountryTypeService2true")  # Find by attribute, option 1: keyword argument
# script = soup.find(attrs={'id':'getListByCountryTypeService2true'})  # Find by attribute, option 2: attrs dictionary
print(script)
text = script.text  # The text content of this tag
print(text)

Note: when I used the lxml parser, text came back empty; switching to html5lib fixed it.
Install: pip install html5lib

The data we want has been located, but how do we match it precisely? Read on:

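4. Regular expressions

4.1 Concept and purpose of regular expressions

(1) Concept:

A regular expression is a pattern, written in a dedicated syntax, that describes a set of strings.

(2) Purpose:

It is used to match, search for and extract the parts of a string that follow a given rule.

4.2 Common regular expression syntax

Example:

# Import the regex module
import re

# Character matching (patterns containing a backslash use raw strings, covered in 4.4)
rs = re.findall('abc', 'abc')
rs = re.findall('a.c', 'abc')
rs = re.findall(r'a\.c', 'a.c')  # \ is the escape character
rs = re.findall('a[bc]d', 'acd')  # [] is a character set; any one character inside it matches

# Predefined character classes
rs = re.findall(r'\d', '123')
rs = re.findall(r'\w', 'Az123_我爱中国')  # \w matches letters, digits, underscore and Chinese characters

# Quantifiers
rs = re.findall('a*', 'adc')  # a* means a occurs 0, 1, 2, ... n times
rs = re.findall('a+', 'abc')  # a+ means a occurs 1, 2, ... n times
rs = re.findall('a?', 'abc')  # a? means a occurs 0 times or once
rs = re.findall(r'a\d{2}', 'a123')  # \d{2} means \d occurs exactly twice

print(rs)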
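The example above overwrites rs on every line, so only the last result is printed. A small sketch that keeps every result can make the syntax easier to compare; the input strings here are chosen for illustration and differ slightly from the ones above.

```python
import re

# (pattern, text) pairs, one per syntax element.
cases = [
    (r'a.c', 'abc'),             # . matches any character except a newline
    (r'a\.c', 'abc'),            # an escaped dot matches only a literal dot
    (r'a[bc]d', 'abd acd aed'),  # character set: b or c in the middle
    (r'\d', 'a1b22'),            # one digit per match
    (r'a+', 'caab'),             # one or more a
    (r'a\d{2}', 'a123'),         # a followed by exactly two digits
]
results = [re.findall(pattern, text) for pattern, text in cases]
for (pattern, text), found in zip(cases, results):
    print(f'{pattern!r} on {text!r} -> {found}')
```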

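4.3 The re.findall() method

(1) API:

re.findall(pattern, string, flags=0) scans string and returns a list of every non-overlapping match of pattern.

(2) Characteristics of findall()

If the pattern contains no group, findall returns the full matches. If it contains a group, findall returns only what the group matched, and the rest of the pattern serves to locate the match.

Example:

import re

# 1. findall returns a list of the matches
rs = re.findall(r'\d+', 'chuan13zhi24')
# print(rs)

# 2. What the flags argument of findall does
rs = re.findall('a.bc', 'a\nbc')  # Here . cannot match \n
rs = re.findall('a.bc', 'a\nbc', re.DOTALL)  # With re.DOTALL, . matches \n as well
rs = re.findall('a.bc', 'a\nbc', re.S)  # re.S is the short name for re.DOTALL
# print(rs)

# 3. Using a group with findall
rs = re.findall('a.+bc', 'a\nbc', re.DOTALL)
print(rs)  # ['a\nbc']

rs = re.findall('a(.+)bc', 'a\nbc', re.DOTALL)  # Returns only what the parentheses matched; the other characters locate the match
print(rs)  # ['\n']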
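The group example uses a single pair of parentheses. As a small extension, this sketch shows what findall returns with two groups and with non-capturing groups; the input string is made up for illustration.

```python
import re

text = 'confirmed=120, cured=95'

# Two groups: findall returns one (name, number) tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)

# Non-capturing groups (?:...) group without changing what findall returns.
whole = re.findall(r'(?:\w+)=(?:\d+)', text)
print(whole)
```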

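4.4 Using raw strings (r prefix) in regular expressions

Purpose: in a raw string a backslash is an ordinary character, so the pattern reaches the regex engine exactly as written.
Example:

import re

# 1. Handling escape characters without a raw string
rs = re.findall('a\nbc', 'a\nbc')
print(rs)  # ['a\nbc']

rs = re.findall('a\\bc', 'a\\bc')  # No match: the regex engine sees a\bc, where \b means a word boundary
print(rs)  # []

rs = re.findall('a\\\\bc', 'a\\bc')  # The fix is four backslashes (tedious)
print(rs)  # ['a\\bc']

# 2. A raw string removes the effect of Python's own escaping from the pattern
rs = re.findall(r'a\\nbc', 'a\\nbc')
print(rs)  # ['a\\nbc']

# Bonus: a raw string also avoids the style (PEP 8) warning about invalid escapes such as '\d'
rs = re.findall(r'\d', 'a123')
print(rs)  # ['1', '2', '3']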
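It may help to see that the r prefix only changes how Python reads the literal, not how the re module behaves. The following sketch is illustrative; the Windows-style path is a made-up value.

```python
import re

# '\n' is one newline character; r'\n' is a backslash followed by n.
print(len('\n'), len(r'\n'))

# Both literals below are the same three characters: backslash, d, plus.
same = r'\d+' == '\\d+'

# Matching a literal backslash needs \\ in the regex; a raw string keeps that readable.
path = 'C:\\data\\test.json'  # the string C:\data\test.json
parts = re.findall(r'[^\\]+', path)
print(same, parts)
```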

4.5 Extracting the JSON string of the latest epidemic data

Code:

# 1. Import the modules
import requests
from bs4 import BeautifulSoup
import re

# 2. Send the request and get the epidemic home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()  # utf-8 by default
# print(home_page)  # Print to check that the request worked

# 3. Use BeautifulSoup to extract the epidemic data
soup = BeautifulSoup(home_page, 'html5lib')  # Create the BeautifulSoup object
script = soup.find(id="getListByCountryTypeService2true")  # Find by attribute, option 1: keyword argument
# script = soup.find(attrs={'id':'getListByCountryTypeService2true'})  # Find by attribute, option 2: attrs dictionary
# print(script)

text = script.text  # The text content of this tag
# print(text)

# 4. Use a regular expression to extract the JSON string
json_str = re.findall(r'\[.+\]', text)[0]  # [ and ] are special in a regex, so they are escaped
print(json_str)
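The pattern `\[.+\]` relies on `.+` being greedy, which is what lets it capture the whole outer list even when the JSON contains nested lists. Here is an offline sketch of that behavior; the script text is a hypothetical example, and the real page content may be shaped differently.

```python
import re

# Hypothetical script text, standing in for script.text from the page.
text = 'try { window.demo = [{"name": "A", "tags": ["x"]}, {"name": "B"}]}catch(e){}'

greedy = re.findall(r'\[.+\]', text)[0]  # greedy: runs to the last ]
lazy = re.findall(r'\[.+?\]', text)[0]   # lazy: stops at the first ]
print(greedy)
print(lazy)
```

Only the greedy result is a complete JSON array here; the lazy one is cut off inside the first object.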

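4.6 Summary

re.findall with a raw-string pattern pulls the JSON string out of the script text.

We now have the JSON string. How do we convert it to Python types and then save it to a file? Read on:

5. The json module

5.1 About the json module

json is a module in Python's standard library for converting between JSON strings and Python data.

5.2 Converting JSON to Python

json.loads converts a JSON string to Python data, and json.load does the same for a file object.
Code example:

import json

# 1. Convert a JSON string to Python data
# 1.1 Prepare the JSON string
json_str = """[{"provinceName":"美国", "currentConfirmedCount":1179041, "confirmedCount":1643499},
{"provinceName":"英国", "currentConfirmedCount":222227, "confirmedCount":259559}]"""
# 1.2 Convert the JSON string to Python data
rs = json.loads(json_str)
print(rs)
print(type(rs))  # <class 'list'>
print(type(rs[0]))  # <class 'dict'>

# 2. Convert a JSON file to Python data
# 2.1 Open a file object for the file (it must already exist)
with open('data/test.json', encoding='utf-8') as fp:
    # 2.2 Load the file object and convert it to Python data
    python_list = json.load(fp)
    print(python_list)
    print(type(python_list))  # <class 'list'>
    print(type(python_list[0]))  # <class 'dict'>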
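The type comments above show list and dict. As a fuller picture, this minimal sketch prints the Python type that json.loads produces for each JSON value kind; the document string is invented for illustration.

```python
import json

doc = '{"name": "demo", "count": 3, "rate": 0.5, "ok": true, "note": null, "items": [1, 2]}'
data = json.loads(doc)

# object -> dict, array -> list, string -> str, number -> int or float,
# true/false -> True/False, null -> None
for key, value in data.items():
    print(key, type(value).__name__)
```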

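5.3 Converting Python to JSON

(1) Converting Python data to a JSON string:

json.dumps(obj) returns the JSON string for a Python object.

(2) Writing Python data to a file as JSON:

json.dump(obj, fp) writes the JSON for a Python object to an open file object.
Code:

import json

# 1. Convert Python data to a JSON string
# 1.1 The Python data
json_str = """[{"provinceName":"美国", "currentConfirmedCount":1179041, "confirmedCount":1643499},
{"provinceName":"英国", "currentConfirmedCount":222227, "confirmedCount":259559}]"""
rs = json.loads(json_str)  # rs is the Python data
# 1.2 Convert the Python data to a JSON string
json_str = json.dumps(rs, ensure_ascii=False)
print(json_str)

# 2. Store Python data in a file as JSON
# 2.1 Open the file object to write to (utf-8, because ensure_ascii=False writes Chinese characters as-is)
with open('data/test1.json', 'w', encoding='utf-8') as fp:
    # 2.2 Write the Python data to test1.json as JSON
    json.dump(rs, fp, ensure_ascii=False)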
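The effect of ensure_ascii is easiest to see side by side. This sketch also round-trips the data through a temporary file, so it does not depend on a data/ directory existing; the file name demo.json is arbitrary.

```python
import json
import os
import tempfile

rows = [{"provinceName": "美国", "confirmedCount": 1643499}]

escaped = json.dumps(rows)                       # non-ASCII becomes \uXXXX escapes
readable = json.dumps(rows, ensure_ascii=False)  # keeps the characters as they are
print(escaped)
print(readable)

# Round trip through a temporary file, using the same encoding to write and read.
path = os.path.join(tempfile.mkdtemp(), 'demo.json')
with open(path, 'w', encoding='utf-8') as fp:
    json.dump(rows, fp, ensure_ascii=False, indent=2)
with open(path, encoding='utf-8') as fp:
    restored = json.load(fp)
print(restored == rows)
```

Both forms load back to the same Python data; ensure_ascii=False only makes the stored text human-readable.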

5.4 Parsing the JSON string of the latest epidemic data

Code:

# 1. Import the modules
import requests
from bs4 import BeautifulSoup
import re
import json

# 2. Send the request and get the epidemic home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()  # utf-8 by default
# print(home_page)  # Print to check that the request worked

# 3. Use BeautifulSoup to extract the epidemic data
soup = BeautifulSoup(home_page, 'html5lib')  # Create the BeautifulSoup object
script = soup.find(id="getListByCountryTypeService2true")  # Find by attribute, option 1: keyword argument
# script = soup.find(attrs={'id':'getListByCountryTypeService2true'})  # Find by attribute, option 2: attrs dictionary
# print(script)

text = script.text  # The text content of this tag
# print(text)

# 4. Use a regular expression to extract the JSON string
json_str = re.findall(r'\[.+\]', text)[0]  # [ and ] are special in a regex, so they are escaped
# print(json_str)

# 5. Convert the JSON string to Python data
last_day_corona_virus = json.loads(json_str)
print(last_day_corona_virus)
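Once the JSON is a list of dictionaries, ordinary Python is enough to work with it. The sketch below uses the two sample rows from the json example in 5.2 rather than live data; the real records carry more fields.

```python
# Sample rows copied from the json example in 5.2; real data has more fields.
countries = [
    {"provinceName": "英国", "currentConfirmedCount": 222227, "confirmedCount": 259559},
    {"provinceName": "美国", "currentConfirmedCount": 1179041, "confirmedCount": 1643499},
]

# Sort by cumulative confirmed count, largest first.
ranked = sorted(countries, key=lambda c: c["confirmedCount"], reverse=True)
total = sum(c["confirmedCount"] for c in countries)

for c in ranked:
    # .get avoids a KeyError if a field is absent from some rows
    print(c["provinceName"], c.get("currentConfirmedCount", 0))
print(total)
```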

5.5 Summary

json.loads and json.load turn JSON into Python data; json.dumps and json.dump go the other way.

II. The epidemic crawler project

1. Collecting the latest day's epidemic data for every country

Code:

import requests
from bs4 import BeautifulSoup
import re
import json

# 1. Send the request and get the epidemic home page
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()

# 2. Extract the latest per-country epidemic data from the home page
soup = BeautifulSoup(home_page, 'html5lib')
script = soup.find(id='getListByCountryTypeService2true')
text = script.text
# print(text)

# 3. Pull the JSON string out of the script text
json_str = re.findall(r'\[.+\]', text)[0]
# print(json_str)

# 4. Convert the JSON string to Python data
last_corona_virus = json.loads(json_str)
# print(last_corona_virus)

# 5. Save the latest per-country epidemic data as JSON
with open('data/last_corona_virus.json', 'w', encoding="utf-8") as fp:
    json.dump(last_corona_virus, fp, ensure_ascii=False)
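The script writes into data/, and open() raises FileNotFoundError if that directory does not exist. One possible way to guard against that is a small helper like the one below. The name save_json is hypothetical and not part of the original notes; the sketch writes to a temporary directory so it leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def save_json(data, path):
    """Write data as UTF-8 JSON, creating parent directories as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', encoding='utf-8') as fp:
        json.dump(data, fp, ensure_ascii=False)
    return path


# Usage with a temporary directory whose data/ subfolder does not exist yet.
with tempfile.TemporaryDirectory() as tmp:
    out = save_json([{"provinceName": "美国"}], Path(tmp) / 'data' / 'demo.json')
    saved_text = out.read_text(encoding='utf-8')
print(saved_text)
```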

2. Collecting every country's epidemic data since January 23

Code:

import requests
from bs4 import BeautifulSoup
import re
import json
from tqdm import tqdm


class CoronaVirusSpider(object):
    def __init__(self):
        self.home_url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

    def get_content_from_url(self, url):
        """
        Request a URL and return the response body as a string.
        :param url: the URL to request
        :return: the response body as a string
        """
        response = requests.get(url)
        return response.content.decode()

    def parse_home_page(self, home_page):
        """
        Parse the home page and return the extracted Python data.
        :param home_page: the home page content
        :return: the parsed Python data
        """
        # 2. Extract the latest per-country epidemic data from the home page
        soup = BeautifulSoup(home_page, 'html5lib')
        script = soup.find(id='getListByCountryTypeService2true')
        text = script.text
        # print(text)

        # 3. Pull the JSON string out of the script text
        json_str = re.findall(r'\[.+\]', text)[0]
        # print(json_str)

        # 4. Convert the JSON string to Python data
        data = json.loads(json_str)
        # print(data)
        return data

    def save(self, data, path):
        # 5. Save the data as JSON
        with open(path, 'w', encoding="utf-8") as fp:
            json.dump(data, fp, ensure_ascii=False)

    def crawl_last_day_corona_virus(self):
        """
        Collect the latest day's epidemic data for every country.
        :return:
        """
        # 1. Send the request and get the home page
        home_page = self.get_content_from_url(self.home_url)
        # 2. Parse the home page to get the latest day's data
        last_day_corona_virus = self.parse_home_page(home_page)
        # 3. Save the data
        self.save(last_day_corona_virus, 'data/last_corona_virus.json')

    def crawl_corona_virus(self):
        """
        Collect every country's epidemic data since January 23.
        :return:
        """
        # 1. Load the per-country data (read with utf-8, the encoding save() writes)
        with open('data/last_corona_virus.json', 'r', encoding='utf-8') as fp:
            last_day_corona_virus = json.load(fp)
        # print(last_day_corona_virus)

        # List that holds every country's data since January 23
        corona_virus = []
        # 2. Loop over the countries and get each statistics URL
        for country in tqdm(last_day_corona_virus, 'Collecting per-country data since January 23'):  # tqdm shows a progress bar
            # 3. Request the country's JSON data from January 23 to now
            statistics_data_url = country['statisticsData']
            statistics_data_json_str = self.get_content_from_url(statistics_data_url)
            # 4. Convert the JSON to Python data and add it to the list
            statistics_data = json.loads(statistics_data_json_str)['data']
            # print(statistics_data)
            for one_day in statistics_data:
                one_day['provinceName'] = country['provinceName']
                one_day['countryShortCode'] = country['countryShortCode']
            # print(statistics_data)
            corona_virus.extend(statistics_data)
        # 5. Save the list to a file as JSON
        self.save(corona_virus, 'data/corona_virus.json')

    def run(self):
        # Run crawl_last_day_corona_virus() once first; it creates the file loaded below
        # self.crawl_last_day_corona_virus()
        self.crawl_corona_virus()


if __name__ == "__main__":
    spider = CoronaVirusSpider()
    spider.run()

Note: the home page only provides the most recent day's data. In that data, each country has a URL (the statisticsData field) that points to its full history, so the historical data has to be fetched through that URL.
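The merge step can be tried without any network access by passing the fetch function in as a parameter. This is only a sketch: collect_history, the fake URLs and the dateId field are hypothetical stand-ins, not the site's confirmed format.

```python
import json


def collect_history(countries, fetch):
    """Merge each country's daily records into one flat list.

    fetch is any callable that maps a URL to the JSON text behind it.
    """
    merged = []
    for country in countries:
        days = json.loads(fetch(country['statisticsData']))['data']
        for one_day in days:
            # Tag every daily record with the country it belongs to.
            one_day['provinceName'] = country['provinceName']
        merged.extend(days)
    return merged


# Hypothetical stand-ins for the home-page data and the per-country responses.
fake_pages = {
    'url-a': '{"data": [{"dateId": 20200123, "confirmedCount": 1}]}',
    'url-b': '{"data": [{"dateId": 20200123, "confirmedCount": 5},'
             ' {"dateId": 20200124, "confirmedCount": 8}]}',
}
countries = [
    {'provinceName': 'A', 'statisticsData': 'url-a'},
    {'provinceName': 'B', 'statisticsData': 'url-b'},
]
history = collect_history(countries, fake_pages.__getitem__)
print(len(history), history[-1])
```

In the spider itself, the fetch argument would correspond to get_content_from_url.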

3. Collecting the latest day's epidemic data for every province in China

Code:

import requests
from bs4 import BeautifulSoup
import re
import json
from tqdm import tqdm


class CoronaVirusSpider(object):
    def __init__(self):
        self.home_url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

    def get_content_from_url(self, url):
        """
        Request a URL and return the response body as a string.
        :param url: the URL to request
        :return: the response body as a string
        """
        response = requests.get(url)
        return response.content.decode()

    def parse_home_page(self, home_page, tag_id):
        """
        Parse the home page and return the extracted Python data.
        :param home_page: the home page content
        :param tag_id: id of the script tag that holds the data
        :return: the parsed Python data
        """
        # 2. Extract the latest epidemic data from the home page
        soup = BeautifulSoup(home_page, 'html5lib')
        script = soup.find(id=tag_id)
        text = script.text
        # print(text)

        # 3. Pull the JSON string out of the script text
        json_str = re.findall(r'\[.+\]', text)[0]
        # print(json_str)

        # 4. Convert the JSON string to Python data
        data = json.loads(json_str)
        # print(data)
        return data

    def save(self, data, path):
        # 5. Save the data as JSON
        with open(path, 'w', encoding="utf-8") as fp:
            json.dump(data, fp, ensure_ascii=False)

    def crawl_last_day_corona_virus(self):
        """
        Collect the latest day's epidemic data for every country.
        :return:
        """
        # 1. Send the request and get the home page
        home_page = self.get_content_from_url(self.home_url)
        # 2. Parse the home page to get the latest day's data
        last_day_corona_virus = self.parse_home_page(home_page, tag_id='getListByCountryTypeService2true')
        # 3. Save the data
        self.save(last_day_corona_virus, 'data/last_corona_virus.json')

    def crawl_corona_virus(self):
        """
        Collect every country's epidemic data since January 23.
        :return:
        """
        # 1. Load the per-country data (read with utf-8, the encoding save() writes)
        with open('data/last_corona_virus.json', 'r', encoding='utf-8') as fp:
            last_day_corona_virus = json.load(fp)
        # print(last_day_corona_virus)

        # List that holds every country's data since January 23
        corona_virus = []
        # 2. Loop over the countries and get each statistics URL
        for country in tqdm(last_day_corona_virus, 'Collecting per-country data since January 23'):  # tqdm shows a progress bar
            # 3. Request the country's JSON data from January 23 to now
            statistics_data_url = country['statisticsData']
            statistics_data_json_str = self.get_content_from_url(statistics_data_url)
            # 4. Convert the JSON to Python data and add it to the list
            statistics_data = json.loads(statistics_data_json_str)['data']
            # print(statistics_data)
            for one_day in statistics_data:
                one_day['provinceName'] = country['provinceName']
                one_day['countryShortCode'] = country['countryShortCode']
            # print(statistics_data)
            corona_virus.extend(statistics_data)
        # 5. Save the list to a file as JSON
        self.save(corona_virus, 'data/corona_virus.json')

    def crawl_last_day_corona_virus_of_china(self):
        """
        Collect the latest day's epidemic data for every province.
        :return:
        """
        # 1. Send the request and get the epidemic home page
        home_page = self.get_content_from_url(self.home_url)
        # 2. Parse the home page to get the latest per-province data
        last_day_corona_virus_of_china = self.parse_home_page(home_page, tag_id='getAreaStat')

        # 3. Save the epidemic data
        self.save(last_day_corona_virus_of_china, 'data/last_day_corona_virus_of_china.json')

    def run(self):
        self.crawl_last_day_corona_virus()
        # self.crawl_corona_virus()
        self.crawl_last_day_corona_virus_of_china()


if __name__ == "__main__":
    spider = CoronaVirusSpider()
    spider.run()

4. Collecting every Chinese province's epidemic data since January 22

Code:

import requests
from bs4 import BeautifulSoup
import re
import json
from tqdm import tqdm


class CoronaVirusSpider(object):
    def __init__(self):
        self.home_url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

    def get_content_from_url(self, url):
        """
        Request a URL and return the response body as a string.
        :param url: the URL to request
        :return: the response body as a string
        """
        response = requests.get(url)
        return response.content.decode()

    def parse_home_page(self, home_page, tag_id):
        """
        Parse the home page and return the extracted Python data.
        :param home_page: the home page content
        :param tag_id: id of the script tag that holds the data
        :return: the parsed Python data
        """
        # 2. Extract the latest epidemic data from the home page
        soup = BeautifulSoup(home_page, 'html5lib')
        script = soup.find(id=tag_id)
        text = script.text
        # print(text)

        # 3. Pull the JSON string out of the script text
        json_str = re.findall(r'\[.+\]', text)[0]
        # print(json_str)

        # 4. Convert the JSON string to Python data
        data = json.loads(json_str)
        # print(data)
        return data

    def parse_corona_virus(self, last_day_corona_virus_of_china, desc):
        # List that accumulates the historical records
        corona_virus = []
        # 2. Loop over the latest-day records and get each statistics URL
        for country in tqdm(last_day_corona_virus_of_china, desc):  # tqdm shows a progress bar
            # 3. Request the JSON string for this country or province
            statistics_data_url = country['statisticsData']
            statistics_data_json_str = self.get_content_from_url(statistics_data_url)
            # 4. Parse the JSON string and add the records to the list
            statistics_data = json.loads(statistics_data_json_str)['data']
            # print(statistics_data)
            for one_day in statistics_data:
                one_day['provinceName'] = country['provinceName']
                # Province records have no countryShortCode, so copy it only when present
                if country.get('countryShortCode'):
                    one_day['countryShortCode'] = country['countryShortCode']

            # print(statistics_data)
            corona_virus.extend(statistics_data)
        return corona_virus

    def load(self, path):
        """
        Load JSON data from the given path.
        """
        # Read with utf-8, the encoding save() writes
        with open(path, 'r', encoding='utf-8') as fp:
            data = json.load(fp)
        return data

    def save(self, data, path):
        # 5. Save the data as JSON
        with open(path, 'w', encoding="utf-8") as fp:
            json.dump(data, fp, ensure_ascii=False)

    def crawl_last_day_corona_virus(self):
        """
        Collect the latest day's epidemic data for every country.
        :return:
        """
        # 1. Send the request and get the home page
        home_page = self.get_content_from_url(self.home_url)
        # 2. Parse the home page to get the latest day's data
        last_day_corona_virus = self.parse_home_page(home_page, tag_id='getListByCountryTypeService2true')
        # 3. Save the data
        self.save(last_day_corona_virus, 'data/last_corona_virus.json')

    def crawl_corona_virus(self):
        """
        Collect every country's epidemic data since January 23.
        :return:
        """
        # 1. Load the per-country data
        last_day_corona_virus = self.load('data/last_corona_virus.json')
        # print(last_day_corona_virus)

        # 2-4. Fetch and merge every country's data since January 23
        corona_virus = self.parse_corona_virus(last_day_corona_virus, 'Collecting per-country data since January 23')
        # 5. Save the list to a file as JSON
        self.save(corona_virus, 'data/corona_virus.json')

    def crawl_last_day_corona_virus_of_china(self):
        """
        Collect the latest day's epidemic data for every province.
        :return:
        """
        # 1. Send the request and get the epidemic home page
        home_page = self.get_content_from_url(self.home_url)
        # 2. Parse the home page to get the latest per-province data
        last_day_corona_virus_of_china = self.parse_home_page(home_page, tag_id='getAreaStat')

        # 3. Save the epidemic data
        self.save(last_day_corona_virus_of_china, 'data/last_day_corona_virus_of_china.json')

    def crawl_corona_virus_of_china(self):
        """
        Collect every province's epidemic data since January 22.
        :return:
        """
        # 1. Load the latest day's per-province data
        last_day_corona_virus_of_china = self.load('data/last_day_corona_virus_of_china.json')

        # 2-4. Fetch and merge every province's data since January 22
        corona_virus = self.parse_corona_virus(last_day_corona_virus_of_china, 'Collecting per-province data since January 22')

        # 5. Save the epidemic data as JSON
        self.save(corona_virus, 'data/corona_virus_of_china.json')

    def run(self):
        # Run the two last-day crawls once first; they create the files loaded below
        # self.crawl_last_day_corona_virus()
        self.crawl_corona_virus()
        # self.crawl_last_day_corona_virus_of_china()
        self.crawl_corona_virus_of_china()


if __name__ == "__main__":
    spider = CoronaVirusSpider()
    spider.run()
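save() writes its files as UTF-8, so they need to be read back with the same codec. The following sketch shows, on a single country name, what a mismatched codec does; it touches no files.

```python
name = '美国'
raw = name.encode('utf-8')  # the bytes a UTF-8 file holds on disk

same_codec = raw.decode('utf-8')
# A different codec with errors='ignore' raises nothing but returns the wrong text.
other_codec = raw.decode('gb18030', errors='ignore')
print(same_codec == name, other_codec == name)
```

Because errors='ignore' hides the failure, a mismatch like this tends to surface only later as garbled province names in the saved data.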

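5. Summary

The spider fetches the home page with requests, locates the script tag with Beautiful Soup, extracts the JSON string with a regular expression, converts it with the json module, and saves the result to a file.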
