Web Scraping (1): The Zhihu Hot List
-
Target: the Zhihu hot list, https://www.zhihu.com/billboard
-
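Target elements
- Rank
- Title
- Summary
- Heat (changes in real time; an instantaneous value is generally of little use, though the peak value might be somewhat useful)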
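
Since the peak heat is the value most likely to be useful, here is a minimal sketch of tracking it across several scrapes. It assumes each scrape has already been turned into a list of records and that the heat has been converted to a number. The `HotItem` class, its field names, and the `update_peaks` helper are made up for illustration and are not part of the script in this post.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HotItem:
    """One row of the hot list (hypothetical field names)."""
    rank: int
    title: str
    summary: str
    heat: int


def update_peaks(peaks: Dict[str, int], snapshot: List[HotItem]) -> Dict[str, int]:
    """Record the highest heat seen so far for each title."""
    for item in snapshot:
        peaks[item.title] = max(peaks.get(item.title, 0), item.heat)
    return peaks


# Two made-up snapshots taken at different times.
first = [
    HotItem(1, "Question A", "summary A", 1200),
    HotItem(2, "Question B", "summary B", 900),
]
second = [
    HotItem(1, "Question B", "summary B", 1500),
    HotItem(2, "Question A", "summary A", 800),
]

peaks: Dict[str, int] = {}
update_peaks(peaks, first)
update_peaks(peaks, second)
print(peaks)  # {'Question A': 1200, 'Question B': 1500}
```

Keying on the title is the simplest choice here; a stable question ID would be more robust if one can be extracted from the page.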
-
Writing the code
Prerequisites: Python 3 with the requests and lxml libraries installed.

    import time

    import requests
    from lxml import etree

    # Send a browser-like User-Agent with the request.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'}
    url = 'https://www.zhihu.com/billboard'

    # verify=False disables HTTPS certificate verification.
    text = requests.get(url=url, headers=headers, verify=False).text
    selector = etree.HTML(text)

    # Walk through entries 1 to 50; XPath positions start at 1.
    for i in range(1, 51):
        title = selector.xpath('//a[@class="HotList-item"][{}]/div[2]/div[{}]/text()'.format(i, 1))
        content = selector.xpath('//a[@class="HotList-item"][{}]/div[2]/div[{}]/text()'.format(i, 2))
        print("{}: ".format(i), end='')
        print(title[0])
        print(content[0])
        print('__________________________', end='\n\n')

    # Pause for five seconds before the script exits.
    time.sleep(5)

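4. Known issues
- HTTPS certificate problem, which I have not figured out yet (the script works around it with verify=False, which turns certificate verification off).
- The HTML returned to the script is laid out differently from the page shown in the browser, so XPath expressions written against the browser version only find the title and the heat value, even though the returned HTML does contain the question summary.
- Too much of the code is hard-coded.
- The code is not very reusable.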
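
One possible way to reduce the hard-coding and make the parsing reusable is sketched below. It is not the script from this post: the `parse_hot_list` function, the `ITEM_XPATH` and `FIELD_XPATHS` names, and the sample HTML are all made up for illustration. The XPath expressions are the same ones used above, only moved into one place. The function iterates over every matching element instead of assuming exactly 50, and returns an empty string rather than raising `IndexError` when a field is missing.

```python
from lxml import etree

# XPath expressions kept in one place so they can be adjusted if the page changes.
ITEM_XPATH = '//a[@class="HotList-item"]'
FIELD_XPATHS = {
    "title": "./div[2]/div[1]/text()",
    "content": "./div[2]/div[2]/text()",
}


def parse_hot_list(html_text):
    """Return one dict per hot-list entry found in html_text."""
    selector = etree.HTML(html_text)
    items = []
    for rank, node in enumerate(selector.xpath(ITEM_XPATH), start=1):
        row = {"rank": rank}
        for name, path in FIELD_XPATHS.items():
            found = node.xpath(path)
            # Fall back to an empty string when the field is not present.
            row[name] = found[0].strip() if found else ""
        items.append(row)
    return items


# Made-up HTML that only imitates the structure the XPath expects;
# it is not a copy of the real page.
SAMPLE_HTML = """
<html><body>
<a class="HotList-item"><div>1</div><div><div>Title one</div><div>100</div></div></a>
<a class="HotList-item"><div>2</div><div><div>Title two</div></div></a>
</body></html>
"""

for row in parse_hot_list(SAMPLE_HTML):
    print(row)
```

Because the function takes the HTML text as an argument, it can be tried against a saved copy of the response, which also makes it easier to compare what the script actually receives with what the browser shows.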