Scraping COVID-19 Data with a Python Crawler

Latest recommended article published on 2022-10-05 20:53:08

abysswatcher_1

Category: Deep Learning. Tags: python, data mining, json

Article link: https://blog.csdn.net/abysswatcher1/article/details/113953363

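This article describes the most basic way to use a web crawler.
Target site: https://news.qq.com/zt2020/page/feiyan.htm#/global
Goal: fetch the COVID-19 infection figures for a chosen foreign country from that site.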

1. Site analysis

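Press F12 to open the developer tools and inspect the target site; Google Chrome is used as the example here.
In the Network panel, select XHR. The Name column lists many requests. Click the target country (Russia) on the page to open its details, and a new entry appears at the bottom of the Name column; selecting it shows the details of that request.
The HTTP request method is POST, the parameter is country, and the request URL is shown there as well.
The garbled-looking string at the end of the URL is the country name after percent-encoding. To fetch data for another country, change the URL to
https://api.inews.qq.com/newsqa/v1/query/pubished/daily/list?country=<country name>&
Raw Chinese characters cannot be used when the URL is assembled, so encode the name with the quote method.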

from urllib import parse  # urllib is Python's URL-handling package; its parse submodule works with URLs

URL = "https://api.inews.qq.com/newsqa/v1/query/pubished/daily/list?country=%s&"  # URL template

country = input("Enter the country to query (e.g. 俄罗斯): ")

print(URL % parse.quote(country))  # quote() percent-encodes the country name

Entering the target country prints the URL with the country name percent-encoded.
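To make the encoding step concrete, here is a small sketch that uses only the standard library. It percent-encodes the example country name 俄罗斯 (Russia) with quote, reverses the encoding with unquote, and builds the same query string with urlencode. The variable names are illustrative and not part of the original code.

```python
from urllib import parse

country = "俄罗斯"  # example value from the article (Russia)

encoded = parse.quote(country)                 # UTF-8 bytes, each written as %XX
decoded = parse.unquote(encoded)               # reverses the encoding
query = parse.urlencode({"country": country})  # builds "key=value" with encoding applied

print(encoded)  # %E4%BF%84%E7%BD%97%E6%96%AF
print(decoded)  # 俄罗斯
print(query)    # country=%E4%BF%84%E7%BD%97%E6%96%AF
```

urlencode is handy once a URL carries more than one parameter, because it joins and encodes all of them in one call.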

2. Fetching the data

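requests is an HTTP library for Python. It is a third-party package and must be installed first (for example with pip install requests).
A browser sends extra information (request headers) with every request. To imitate a browser, the crawler should send the same headers; they can be copied directly from the request headers shown in the browser's developer tools.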

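Call format of requests.post:

requests.post(url, data=None, json=None, **kwargs)

-url: the address of the server endpoint.
-data: the data sent in the request body.
-headers: the HTTP request headers (passed as a keyword argument).
-proxies: the proxy addresses to route the request through (passed as a keyword argument).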
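How these arguments end up in the actual HTTP request can be inspected without contacting any server. The sketch below builds a request with requests.Request and prepare(), then prints what would be sent. The shortened User-Agent value is a placeholder assumption, and nothing goes over the network.

```python
import requests
from urllib import parse

country = "俄罗斯"  # example value (Russia)
url = "https://api.inews.qq.com/newsqa/v1/automation/foreign/daily/list?country=%s&" % parse.quote(country)

# Build the request object but do not send it
prepared = requests.Request(
    "POST",
    url,
    data={"country": country},              # form data, goes into the request body
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder; copy the real value from your browser
).prepare()

print(prepared.method)                 # HTTP method
print(prepared.url)                    # final URL including the query string
print(prepared.headers["User-Agent"])  # header taken from the headers argument
print(prepared.body)                   # form-encoded body built from the data argument
```

Printing the prepared request is a quick way to compare the crawler's request with the one shown in the browser before sending anything.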

The program that fetches the data is as follows:

from urllib import parse  # parse is urllib's submodule for working with URLs
import requests

def get_response(country):
    # URL template; %s is replaced with the percent-encoded country name
    URL = "https://api.inews.qq.com/newsqa/v1/automation/foreign/daily/list?country=%s&"
    # Headers copied from the browser so the request looks like a normal page visit
    headers = {
        "Referer": "https://news.qq.com/",
        "Host": "api.inews.qq.com",
        "Origin": "https://news.qq.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
    }
    url = URL % parse.quote(country)
    data = {
        "country": country,
    }
    # timeout keeps the script from hanging if the server does not answer
    response = requests.post(url, data=data, headers=headers, timeout=10)
    print(url)
    return response

# Inspect the fetched content
country = input("Enter the country to query (e.g. 俄罗斯): ")
res = get_response(country)
print(res.text)  # response.text is the response body as text

3. Processing and saving the data

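The raw data is hard to read and needs processing. It is a dictionary in string form (JSON): convert it to a dictionary, process it with pandas, and finally save it to an Excel file.
The complete code is as follows: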

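import json
from urllib import parse  # parse is urllib's submodule for working with URLs

import pandas as pd  # third-party, must be installed; "as" gives the package an alias
import requests

def get_response(country):
    # URL template; %s is replaced with the percent-encoded country name
    URL = "https://api.inews.qq.com/newsqa/v1/automation/foreign/daily/list?country=%s&"
    # Headers copied from the browser so the request looks like a normal page visit
    headers = {
        "Referer": "https://news.qq.com/",
        "Host": "api.inews.qq.com",
        "Origin": "https://news.qq.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36"
    }
    url = URL % parse.quote(country)
    data = {
        "country": country,
    }
    # timeout keeps the script from hanging if the server does not answer
    response = requests.post(url, data=data, headers=headers, timeout=10)
    print(url)
    return response

# Fetch the content
country = input("Enter the country to query (e.g. 俄罗斯): ")
res = get_response(country)
# print(res.text)  # response.text is the response body as text
with open("yiqing.txt", "a", encoding="utf-8") as f:
    f.write(res.text)
# Parse the JSON text into a dictionary; json.loads is safer than eval
# and also handles JSON literals such as null, true and false
data = json.loads(res.text)
# Only the value under the "data" key holds the records
data = data["data"]
df = pd.DataFrame(data)  # DataFrame is a pandas data structure similar to a 2-D table
# Writing .xlsx needs the openpyxl package; recent pandas versions can no longer
# write the legacy .xls format. to_csv writes a CSV file instead.
df.to_excel("yiqing.xlsx", index=False)
df = pd.read_excel("yiqing.xlsx", index_col=None)
print(df)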
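
The parsing step can be tried offline on a small hand-written payload. The sketch below assumes the response has the shape {"data": [...]}, with the records stored under the "data" key. The field names and numbers are invented for illustration; they are not real figures and not a confirmed schema of the API. It also shows why json.loads is a better fit than eval for this text: JSON's null is not a defined name in Python. The table is written as CSV to an in-memory buffer so that no Excel engine is needed.

```python
import io
import json

import pandas as pd

# Hypothetical payload: field names and values are made up for illustration
text = '{"ret": 0, "data": [{"date": "01.28", "confirm": 10, "heal": null}, {"date": "01.29", "confirm": 15, "heal": 2}]}'

try:
    eval(text)  # fails because "null" is not defined in Python
except NameError as err:
    print("eval failed:", err)

payload = json.loads(text)  # JSON null becomes None
records = payload["data"]   # list of dicts, one per day
df = pd.DataFrame(records)  # one row per dict, one column per key

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # same idea as to_excel, but plain text
print(buffer.getvalue())
```

Swapping the buffer for a file name writes the same table to disk.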