Web Scraping (1): Zhihu Hot List

Latest recommended article published on 2023-10-30 21:19:34

令人疑惑的


Category: Web Scraping series
Tags: web scraping, python

Article link: https://blog.csdn.net/qq_40548097/article/details/102828511


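Target: the Zhihu Hot List at https://www.zhihu.com/billboard
Target elements:
- Rank
- Title
- Summary
- Heat (changes quickly in real time; an instantaneous value is generally of little use, though the peak value may be somewhat useful)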
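
Since only the peak heat is likely to be useful, one way to keep it is to compare each new snapshot of the list against the highest value seen so far. The following is a minimal sketch, not part of the original script: `HotItem` and `update_peaks` are hypothetical names, and the heat is assumed to have already been converted to an integer (parsing it out of the page text is not shown).

```python
from dataclasses import dataclass


@dataclass
class HotItem:
    """One hot-list entry with the four target elements (hypothetical container)."""
    rank: int
    title: str
    summary: str
    heat: int  # assumed to be a plain integer already


def update_peaks(peaks, snapshot):
    """Record, per title, the highest heat seen across snapshots."""
    for item in snapshot:
        if item.heat > peaks.get(item.title, 0):
            peaks[item.title] = item.heat
    return peaks
```

Calling `update_peaks` once per scrape with the same `peaks` dictionary leaves the dictionary holding the maximum heat observed for every title, regardless of how the rank moved between scrapes.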

Writing the code

Prerequisites: Python 3, the requests library, and the lxml library are installed.

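import time

import requests
from lxml import etree

# Browser-like User-Agent so the request looks like it comes from a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'}
url = 'https://www.zhihu.com/billboard'

# verify=False skips HTTPS certificate verification (see the certificate issue noted at the end)
text = requests.get(url=url, headers=headers, verify=False).text

selector = etree.HTML(text)

# Up to 50 entries; XPath positions are 1-based
for i in range(1, 51):
    # First inner div of the entry: the title
    title = selector.xpath('//a[@class="HotList-item"][{}]/div[2]/div[{}]/text()'.format(i, 1))
    # Second inner div: meant to be the summary, but in the fetched HTML it is the heat
    content = selector.xpath('//a[@class="HotList-item"][{}]/div[2]/div[{}]/text()'.format(i, 2))

    # Stop at the first missing entry instead of raising IndexError
    if not title:
        break

    print("{}:  ".format(i), end='')
    print(title[0])
    if content:
        print(content[0])
    print('__________________________', end='\n\n')

    # Pause 5 seconds before printing the next entry
    time.sleep(5)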
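
The loop above re-runs an absolute XPath for every position. As a possible alternative, the entries can be selected once and the fields read with relative XPaths, wrapped in a function that can be reused on any HTML string. This is only a sketch: `SAMPLE_HTML` is an invented fragment that mirrors the structure the XPaths above assume (the real page may differ), and `first_text` and `parse_hot_list` are hypothetical helper names.

```python
from lxml import etree

# Invented sample that mimics the assumed structure: a.HotList-item > div[2] > div[1], div[2]
SAMPLE_HTML = """
<html><body>
<a class="HotList-item"><div>1</div><div><div>First title</div><div>120 heat</div></div></a>
<a class="HotList-item"><div>2</div><div><div>Second title</div></div></a>
</body></html>
"""

ITEM_XPATH = '//a[@class="HotList-item"]'


def first_text(node, xpath):
    """Return the first text result of a relative XPath, or None if nothing matches."""
    result = node.xpath(xpath)
    return result[0].strip() if result else None


def parse_hot_list(html):
    """Parse an HTML string into a list of dicts, one per hot-list entry."""
    selector = etree.HTML(html)
    entries = []
    for rank, item in enumerate(selector.xpath(ITEM_XPATH), start=1):
        entries.append({
            "rank": rank,
            "title": first_text(item, "./div[2]/div[1]/text()"),
            "heat": first_text(item, "./div[2]/div[2]/text()"),
        })
    return entries
```

Passing the text returned by `requests.get(...)` to `parse_hot_list` would give the same fields as the loop, with missing fields reported as `None` rather than raising an error.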

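Known issues

- There is an HTTPS certificate problem that I have not figured out yet.
- The HTML returned to the scraper is laid out differently from the page shown in the browser, so the XPath written against the browser version finds only the title and the heat, even though the returned HTML does contain the question summary.
- Too much of the code is hard-coded.
- The code is not very reusable.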
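
On the certificate issue: requests verifies HTTPS certificates by default, and `verify=False` turns that check off. A hedged sketch of the alternative is to leave verification on and, if the failure comes from a certificate authority the machine does not trust, point `verify` at a CA bundle file instead. The cause of the error in this post is unknown, so this may not apply; `make_session` and its `ca_bundle` parameter are hypothetical names.

```python
import requests


def make_session(ca_bundle=None):
    """Return a Session that keeps HTTPS certificate verification enabled.

    ca_bundle: optional path to a PEM file of trusted CA certificates
    (hypothetical; only needed if the default trust store is not enough).
    """
    session = requests.Session()
    # True means "verify against the default CA bundle"; a string is a path to a custom bundle
    session.verify = ca_bundle if ca_bundle else True
    session.headers.update({"User-Agent": "Mozilla/5.0"})
    return session
```

A call such as `make_session().get('https://www.zhihu.com/billboard')` would then raise `requests.exceptions.SSLError` if the certificate cannot be verified, which makes the underlying problem visible instead of hiding it.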
