Web scraping 36kr with XPath, part 1: extracting the number starting with 6 that the API needs and building the next API URL

Latest recommended article published on 2023-07-09 00:49:39

鲸鱼编程pyhui

Article link: https://blog.csdn.net/ifubing/article/details/102605869

import requests
from lxml import etree

class Spider():
    def __init__(self):
        # Start page
        self.start_url = "https://36kr.com/"
        # Request headers (a desktop browser user agent)
        self.headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
        # Site root, used to turn relative links into absolute URLs
        self.host = "https://36kr.com"
    def start(self):
        # Fetch the start page and keep its HTML
        index_html = self.parse_url(self.start_url)
        print(index_html)

        # Find the link to the next page in the start page's HTML.
        # The link text "查看更多资讯" means "View more news".
        eobj = etree.HTML(index_html)
        res = eobj.xpath('//a[text()="查看更多资讯"]/@href')[0]
        print(res)
        next_url = self.host + res

        # First "load more" click
        first_more_html = self.parse_url(next_url)
        # first_more_html is the HTML of https://36kr.com/information/web_news
        # All article container divs:  //div[@class="information-flow-item"]
        # The last article's div:      //div[@class="information-flow-item"][last()]

        # Parse the list page returned by the first "load more" request
        first_more_obj = etree.HTML(first_more_html)
        last_div = first_more_obj.xpath('//div[@class="information-flow-item"][last()]')[0]
        print(last_div)

        anchor_div_class = last_div.xpath("./div[1]/@class")[0]
        # The class value looks like: anchor - 69141
        print(anchor_div_class)

        # Pull out the number (69141 in this example)
        import re
        b_id = re.search(r"\d+", anchor_div_class).group()
        # .group() returns the text matched by the regex
        print(b_id)

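        # Build the API URL from the extracted id
        next_data_api = "https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id={}&per_page=30".format(b_id)
        print(next_data_api)

        # Data for the following page:
        # take the id field of the last news item returned by next_data_api;
        # that id (a number starting with 6) is the b_id for the next API URL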


        # Alternative approach: extract the link with a regex
        # Pattern: re.compile(r'<a class="kr-home-flow-see-more" href="(.*?)">查看更多资讯</a>')
        # This yields "/information/web_news?anchor=68962"
        # URL of the next page:  https://36kr.com/information/web_news?anchor=68962
        # https://36kr.com/information/web_news?anchor=68957

        # Page 3
        # Loaded with an AJAX request
        # https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=68892&per_page=30
        # https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=68848&per_page=30

        # Page 4 (inferred)
        # https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=68847&per_page=30

        # Page 5
        # https://36kr.com/pp/api/aggregation-entity?type=web_latest_article&b_id=68817&per_page=30

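    def parse_url(self, url):
        # Send a GET request and return the body decoded as UTF-8.
        # The timeout keeps the script from hanging if the server never responds.
        res = requests.get(url, headers=self.headers, timeout=10)
        return res.content.decode()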


s = Spider()
s.start()
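
The id extraction and URL building steps in the script above can be pulled out into two small helpers. This is only a sketch: the names `extract_b_id`, `build_api_url` and `API_BASE` are made up for illustration and do not appear in the original script. The endpoint and query parameters are copied from the script, and the sample class string follows the `anchor - 69141` example in its comments. Unlike the inline version, `extract_b_id` returns `None` instead of raising when the class attribute contains no digits.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the script above; the constant name is hypothetical.
API_BASE = "https://36kr.com/pp/api/aggregation-entity"


def extract_b_id(class_attr):
    """Return the first run of digits in class_attr, or None if there is none."""
    match = re.search(r"\d+", class_attr)
    return match.group() if match else None


def build_api_url(b_id, per_page=30):
    """Build the paging URL with the same parameters, in the same order, as the script."""
    query = urlencode({
        "type": "web_latest_article",
        "b_id": b_id,
        "per_page": per_page,
    })
    return f"{API_BASE}?{query}"


if __name__ == "__main__":
    b_id = extract_b_id("anchor - 69141")
    if b_id is not None:
        print(build_api_url(b_id))
```

Using `urlencode` rather than string formatting keeps the URL valid even if a parameter value ever contains characters that need escaping.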

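
The code comments in the script also mention a regex-based alternative to the XPath lookup for the "查看更多资讯" link. A minimal sketch of that idea follows, using the pattern quoted in those comments. The function name `find_next_url` and the sample HTML are invented for illustration; the real page markup may differ, in which case the pattern would need adjusting.

```python
import re
from urllib.parse import urljoin

HOST = "https://36kr.com"

# Pattern quoted in the script's comments; group 1 captures the link's href.
SEE_MORE_RE = re.compile(
    r'<a class="kr-home-flow-see-more" href="(.*?)">查看更多资讯</a>'
)


def find_next_url(html):
    """Return the absolute URL of the "see more" link, or None if it is absent."""
    match = SEE_MORE_RE.search(html)
    if match is None:
        return None
    # urljoin resolves the relative href against the site root.
    return urljoin(HOST, match.group(1))


# Hypothetical HTML fragment, shaped to match the pattern above.
sample_html = (
    '<div><a class="kr-home-flow-see-more" '
    'href="/information/web_news?anchor=68962">查看更多资讯</a></div>'
)

if __name__ == "__main__":
    print(find_next_url(sample_html))
```

A regex like this is tied to the exact attribute order and spacing of the tag, so the XPath version used in the script is usually the more robust choice.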