requests+json爬取腾讯新闻feiyan实时数据并实现持续更新

最新推荐文章于 2022-04-30 11:24:28 发布

Lizytzh

最新推荐文章于 2022-04-30 11:24:28 发布

阅读量1.5k

点赞数 1

分类专栏： python

本文链接：https://blog.csdn.net/qq_39051660/article/details/104761559

版权

python 专栏收录该内容

11 篇文章 0 订阅

订阅专栏

博客更新

requests+bs4爬取丁香园feiyan实时数据
 selenium爬取腾讯新闻feiyan页面实时数据

2020/04/01 更新

腾讯新闻的feiyan数据接口总是更改，前几天url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_other' 中的foreignList已经没有了，查看页面代码后发现多了一个专门提供全球数据的接口：

国外数据：url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_foreign'

具体的操作类似，这里就不赘诉了。不过昨天又发现这个接口爬取到的数据又不更新了，很多国家的数据停留在3月30日。但腾讯新闻界面上的数据却是有更新的，于是又深究了一下，发现腾讯新闻实时界面是连接到app的js界面。想要解析看看内部长什么样，直接爬页面内容，但却解析了很久都没解析好，跑去看丁香园的界面，发现要比腾讯的方便一些，于是转战爬取丁香园实时数据，直接爬取页面内容。具体操作后面再更新~

前言

开启2020，简直有如人间炼狱，各种不好的事情频频发生。自1月份以来，大家估计每天都很关注疫情的发展，微博上各种令人糟心的事情就不说了，这里写下在做东西的时候用python爬取腾讯的疫情数据。本文参考了：
利用Python爬取新冠肺炎疫情实时数据，Pyecharts画2019-nCoV疫情地图

国际数据爬取

由于国内数据爬取在上述参考博客中已详尽介绍，这里就不再介绍国内数据的爬取了。

在上述博客中，其实也有涉及国际数据的爬取，但我爬取之后发现，在上述博客的url下爬取的国际数据并不是腾讯新闻显示的实时数据，所以自己去看了下腾讯新闻网页，重新爬了下。

本文爬取包含中国在内的国际数据（以国家为单位爬取数据）。在爬取的时候发现国内跟国外的数据需要分开从两个不同的url中爬取才可以得到到腾讯实时更新的数据：

国内数据：url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
国外数据：url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_other'

从以上两个链接爬取到的数据其实都包含了国内跟国外数据，但一个包含了国内实时更新的数据，国外的数据不是很清楚是存储什么时候的数据。而另一个链接这包含了国外实时更新的数据，但国内数据又会有一点迟滞。

加载包

import requests
import json
import time
import os
import pandas as pd

这里先介绍用pandas处理的代码，会比较直接跟容易理解。但由于服务器一些配置的原因，pandas一直报错，所以后面不用pandas重新写了一个。

国内疫情数据

首先是国内疫情数据：

path = 'xxx/xxx/xxx' # 这里该路径
os.chdir(path)
history_df = pd.read_csv('xx.csv') # 历史数据读入, 历史数据自备
data_list = [] # 用于存放更新的数据

######################
# 中国更新数据爬取
# 获取数据
url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
reponse = requests.get(url=url).json() # 从网页加载
data = json.loads(reponse['data']) # 返回数据字典

我们查看data的内容，参考博客用的就是这里面的areeTree，但这里我们不用这部分数据，只用到chinaTotal中的confirm，heal跟dead，lastUpdateTime是腾讯新闻上显示的最后更新时间。 data字典数据
接下来我们继续，因为要实现持续更新，每天只保持更新一条记录，所以这里需要将前面读取的历史数据清除当天的数据：

# 提取所需数据
china_dict = {} # 用于存储中国更新的数据

date = time.strptime(data['lastUpdateTime'], "%Y-%m-%d %H:%M:%S")
date = time.strftime("%Y/%m/%d", date) # 提取更新日期 

# 将历史数据中包含更新日期的数据清除，确保每天只保存一条数据
history_df['date'] = time.strftime("%Y/%m/%d", time.strptime(history_df['date'], "%Y/%m/%d"))
history_df = history_df.drop(list(history_df[history_df['date'].isin([date])].index))
list(history_df[history_df['date'].isin([str(date)])].index)

这里要注意进行时间格式的统一，因为当有时不小心或者是有需要对原始csv进行人工修改时，此时读进来的时间列格式会不一致，这样每天进行多次实时更新时，会导致保存多条当天更新的数据。需要保存具体到秒的数据也可以根据需求自行更改代码。

下面将提取中国的实时数据并加入data_list：

china_dict['date'] = date
china_dict['time'] = dateTime
china_dict['place'] = '中国'
china_dict['confirmed_num'] = data['chinaTotal']['confirm']
china_dict['die_num'] = data['chinaTotal']['dead']
china_dict['cure_num'] = data['chinaTotal']['heal']
data_list.append(china_dict)

国外数据爬取

过程与前面国内数据爬取基本一致，就是要更改url：

#####################################
# 国际数据
# 获取数据
url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_other'
reponse = requests.get(url=url).json() # 从网页加载
data = json.loads(reponse['data']) # 返回数据字典

这时候我们再来看下data长什么样： data国外数据
这里面还有包含了一些新闻之类的东西，看大家需求自行爬。我们所需要的国外数据就在foreignList中，我们来看下foreignList长什么样：
在这里插入图片描述
里面包含了海外国家的数据，我们随便点入一个查看字典结构，比如韩国：

我们可以发现韩国里也有一个chirldren的结构，这里面存放了韩国具体的地区数据。不过不是所有国家都这么具体，目前只有中、日、韩三国是有具体到对方的数据。我们这里爬以国家为单位的数据，所以只需要爬取name，confirm，dead，跟heal，其他的需要也可以自行爬取：

foreignList = data['foreignList'] # 返回国际更新数据
for i in range(len(foreignList)):
    place_dict = {}
    place_dict['date'] = date
    place_dict['place'] = foreignList[i]['name']
    place_dict['confirmed_num'] = foreignList[i]['confirm']
    place_dict['die_num'] = foreignList[i]['dead']
    place_dict['cure_num'] = foreignList[i]['heal']
    data_list.append(place_dict)

实现数据持续更新

接着我们将历史数据跟爬下来的数据进行合并，继续保存为原来的那个csv文件：

# 将更新数据存为所需的dataframe
data_list = pd.DataFrame(data_list)
data_list = data_list[["date", "place", "confirmed_num", "die_num", "cure_num"]]

# 合并历史与更新数据，保存
df_out = history_df.append(data_list)
df_out.to_csv('xxx.csv', index = False, encoding = 'utf_8_sig')

这里设置encoding = 'utf_8_sig'可以解决中文乱码的问题，到这里就完成了数据的爬取跟持续更新的实现啦。接下来将不用pandas处理怎么办。

不用pandas处理数据

使用pandas处理的时候，我们可以很简单的使用DataFrame的结构，由此可以很方便的进行晒出部分信息、合并等操作。但如果急着用，在服务器上又还没有配好环境的情况下，该怎么做呢？下面就介绍直接用list进行操作的方法。以下不赘述过程直接贴代码:

import requests
import json
import time
import os
import csv

######### 路径在这里修改
path = 'xxx'
os.chdir(path)

# 历史数据读入
history_list = []
with open('xxx.csv', 'r', encoding = 'utf_8_sig') as f:
    csvReader = csv.reader(f)
    for row in csvReader:
        history_list.append(row)
        
data_list = [] # 用于存放更新的数据
######################
# 中国更新数据爬取
# 获取数据
url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
reponse = requests.get(url=url).json() # 从网页加载
data = json.loads(reponse['data']) # 返回数据字典
        
# 提取所需数据
china_dict = {} # 用于存储中国更新的数据

date = time.strptime(data['lastUpdateTime'], "%Y-%m-%d %H:%M:%S")
date = time.strftime("%Y/%m/%d", date) # 提取更新日期 
# 将历史数据中包含更新日期的数据清除，确保每天只保存一条数据
data_list.append(history_list[0])
for row in history_list[1:]:
    row[0] = time.strptime(row[0], "%Y/%m/%d")
    row[0] = time.strftime("%Y/%m/%d", row[0])
    row[1] = ''
    if date not in row:
        data_list.append(row)
        
china_dict['date'] = date
china_dict['place'] = '中国'
china_dict['confirmed_num'] = data['chinaTotal']['confirm']
china_dict['die_num'] = data['chinaTotal']['dead']
china_dict['cure_num'] = data['chinaTotal']['heal']
data_list.append([china_dict['date'],  china_dict['place'],  
                  china_dict['confirmed_num'], china_dict['die_num'], china_dict['cure_num']])

#####################################
# 国际数据
# 获取数据
url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_other'
reponse = requests.get(url=url).json() # 从网页加载
data = json.loads(reponse['data']) # 返回数据字典
foreignList = data['foreignList'] # 返回国际更新数据
for i in range(len(foreignList)):
    place_dict = {}
    place_dict['date'] = date
    place_dict['place'] = foreignList[i]['name']
    place_dict['confirmed_num'] = foreignList[i]['confirm']
    place_dict['die_num'] = foreignList[i]['dead']
    place_dict['cure_num'] = foreignList[i]['heal']
    data_list.append([place_dict['date'], place_dict['place'], 
                      place_dict['confirmed_num'], place_dict['die_num'], place_dict['cure_num']])

# 保存数据
with open('xxx.csv', 'w', newline = "", encoding = 'utf_8_sig') as f:
    csvwriter = csv.writer(f)
    csvwriter.writerows(data_list)

Lizytzh

关注

1
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
requests+json爬取腾讯新闻feiyan实时数据并实现持续更新

前言开启2020，简直有如人间炼狱，各种不好的事情频频发生。自1月份以来，大家估计每天都很关注疫情的发展，微博上各种令人糟心的事情就不说了，这里写下在做东西的时候用python爬取腾讯的疫情数据。本文参考了：利用Python爬取新冠肺炎疫情实时数据，Pyecharts画2019-nCoV疫情地图国际数据爬取由于国内数据爬取在上述参考博客中已详尽介绍，这里就不再介绍国内数据的爬取了。在上述...
复制链接

扫一扫