Web Crawlers from Beginner to Practice (3): Scraping Dynamic Web Pages

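This article explains how to scrape dynamically loaded web pages, focusing on AJAX and on using Selenium to drive a browser and collect data. Using the dynamically loaded comments on Zhihu as an example, it shows how to locate the real data address and extract the JSON content, and how Selenium can be used for sites that are too complex for that approach.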

Scraping Dynamic Web Pages

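Before scraping dynamic web pages, we need to understand an asynchronous loading and updating technique: AJAX (Asynchronous JavaScript and XML).

Its value is that, by exchanging a small amount of data with the server in the background, it can update part of a web page.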

 

1. A Dynamic Scraping Example

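Compared with a traditional web page, a page that uses AJAX does not need to reload in full, which makes web applications smaller, faster, and friendlier. For a crawler, however, it makes the job much harder, because the data is typically not in the HTML returned by the initial request.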

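We can scrape pages loaded dynamically through AJAX in either of two ways:

(1) Use the browser's Inspect tool to work out the real address of the data

(2) Use Selenium to simulate a browser and scrape the rendered page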

 

2. Scraping by Resolving the Real Address

Below we use the comments on a Zhihu answer as an example:

https://www.zhihu.com/question/22913650

Open the Inspect tool in Chrome to find the real address of the data: click the "Network" tab on the page, then click the XHR button.

Among the files ending in formats such as js and json, the one highlighted in red in the screenshot is the real comment file. Click "Preview" to view its data.

The code is as follows:

# coding: UTF-8
import requests

# Real address of the comment data, found in the Network panel
url = "https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B*%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right&order=normal&limit=20&offset=0&status=open"
# Send a browser-like User-Agent with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
r = requests.get(url, headers=headers)
print(r.text)  # raw JSON text of the response
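
The address in the script above looks opaque, but it is an ordinary URL whose query string controls which comments come back. As a rough sketch (not part of the original script), the standard-library `urllib.parse` module can split it apart and rebuild it with a different `offset`. The URL below is shortened by dropping the long `include` parameter, and `with_offset` is a hypothetical helper name introduced only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# Shortened form of the comment API address; the long `include` parameter
# is left out here purely to keep the example readable (an assumption).
url = ("https://www.zhihu.com/api/v4/answers/270916221/comments"
       "?order=normal&limit=20&offset=0&status=open")

# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlsplit(url).query)
print(params["limit"], params["offset"])


def with_offset(url, offset):
    """Return the same URL with its `offset` query parameter replaced.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["offset"] = [str(offset)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


next_url = with_offset(url, 20)
print(next_url)
```

With `limit=20`, moving `offset` from 0 to 20 should correspond to the next batch of comments, which is consistent with the `paging.next` address in the response shown in this section.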

Running the request script prints the following result:

{"featured_counts":16,"common_counts":1125,"collapsed_counts":2,"reviewing_counts":0,"paging":{"is_end":false,"is_start":true,"next":"https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B%2A%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right\u0026limit=20\u0026offset=20\u0026order=normal\u0026status=open","previous":"https://www.zhihu.com/api/v4/answers/270916221/comments?include=data%5B%2A%5D.author%2Ccollapsed%2Creply_to_author%2Cdisliked%2Ccontent%2Cvoting%2Cvote_count%2Cis_parent_author%2Cis_author%2Calgorithm_right\u0026limit=20\u0026offset=0\u0026order=normal\u0026status=open","totals":1123},"data":[{"id":367943240,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943240","content":"\u003cp\u003e越长大越觉得,生命的寄托,最终都会落脚到爱与责任\u003c/p\u003e","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421298,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"79ed1b3de8d0df2a309cdabe895198a9","url_token":"qi-yu-50-87","name":"迟嘉澍","avatar_url":"https://pic4.zhimg.com/v2-ee7c28e69de49593eae38e7c973d8d7b_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-ee7c28e69de49593eae38e7c973d8d7b_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/79ed1b3de8d0df2a309cdabe895198a9","user_type":"people","headline":"寒鸦赴水,渴马奔泉","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"vote_count":2388,"voting":false,"disliked":false},{"id":367943256,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943256","content":"我真喜欢你那句 奶奶的答案是你 爸妈的答案是你 
你的答案是他们","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421335,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"763c1ed453279d6e3c39124ecedec96a","url_token":"bo-chong-jing","name":"冲静","avatar_url":"https://pic3.zhimg.com/v2-7cdc882e0c764b0b22087663bad8ea8f_r.jpg","avatar_url_template":"https://pic3.zhimg.com/v2-7cdc882e0c764b0b22087663bad8ea8f_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/763c1ed453279d6e3c39124ecedec96a","user_type":"people","headline":"上下求索","badge":[],"gender":0,"is_advertiser":false}},"is_parent_author":false,"vote_count":2673,"voting":false,"disliked":false},{"id":367943603,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943603","content":"夜班真累,好想睡觉","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512421934,"resource_type":"answer","reviewing":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"11c4f0962c4d1d183757322b5bf45da1","url_token":"xu-hui-peng-43","name":"媚俗","avatar_url":"https://pic4.zhimg.com/v2-0889a738c3cdb7086308b103e12606b8_r.jpg","avatar_url_template":"https://pic4.zhimg.com/v2-0889a738c3cdb7086308b103e12606b8_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/11c4f0962c4d1d183757322b5bf45da1","user_type":"people","headline":"","badge":[],"gender":-1,"is_advertiser":false}},"is_parent_author":false,"vote_count":236,"voting":false,"disliked":false},{"id":367943699,"type":"comment","url":"https://www.zhihu.com/api/v4/comments/367943699","content":"可以说是石总到目前为止最好的答案么?","featured":true,"collapsed":false,"is_author":false,"is_delete":false,"created_time":1512422087,"resource_type":"answer","reviewi
ng":false,"allow_like":true,"allow_delete":false,"allow_reply":true,"allow_vote":true,"can_recommend":false,"can_collapse":false,"author":{"role":"normal","member":{"id":"e20f754c6ab428339ba081105af9114b","url_token":"bu-zhi-dao-21-38-16","name":"不知道","avatar_url":"https://pic4.zhimg.com/da8e974dc_r.jpg","avatar_url_template":"https://pic4.zhimg.com/da8e974dc_{size}.jpg","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e20f754c6ab428339ba081105af9114b","user_type":"people","headline":"经济学博士/收入分配/地方政治","badge":[],"gender":1,"is_advertiser":false}},"is_parent_author":false,"
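
The response is JSON, so the comment text can be pulled out with the standard `json` module rather than an HTML parser. The following is a minimal sketch under assumptions: `sample_text` is a heavily trimmed stand-in for `r.text` that keeps only two of the comments and a few of the fields seen above, and `extract_comments` is a hypothetical helper, not something from the original post.

```python
import json

# Trimmed stand-in for r.text: same field names as the real response,
# but only two comments and a handful of fields (an assumption for brevity).
sample_text = r'''
{"paging": {"is_end": false, "is_start": true, "totals": 1123},
 "data": [
   {"id": 367943240,
    "content": "\u003cp\u003e越长大越觉得,生命的寄托,最终都会落脚到爱与责任\u003c/p\u003e",
    "author": {"member": {"name": "迟嘉澍"}},
    "vote_count": 2388},
   {"id": 367943603,
    "content": "夜班真累,好想睡觉",
    "author": {"member": {"name": "媚俗"}},
    "vote_count": 236}
 ]}
'''


def extract_comments(text):
    """Parse the JSON text and keep author, vote count, and content.

    Hypothetical helper for illustration only.
    """
    payload = json.loads(text)
    return [
        {
            "author": item["author"]["member"]["name"],
            "votes": item["vote_count"],
            "content": item["content"],
        }
        for item in payload["data"]
    ]


comments = extract_comments(sample_text)
for c in comments:
    print(c["author"], c["votes"], c["content"])
```

With a live response, `r.json()` returns the same structure as `json.loads(r.text)`. Note that some `content` values contain HTML tags such as `<p>`, so a further cleaning step may be needed before storing the text.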