[Python] Crawler: Weibo Find-People Page Crawler (Part 3)

杨jun坚

Category: Python  Tags: URL queue, crawler

Article link: https://blog.csdn.net/yangjjuan/article/details/99817572


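With the login problem solved, we can start downloading pages and parsing them. As mentioned earlier, there are two types of pages: list pages and article pages. A list page contains the URLs of its article pages and the URL of the next list page, so an article page can only be downloaded and parsed after the list page that links to it has been downloaded and parsed. For this reason I built two URL queues with different priorities on top of Redis lists: the high-priority queue stores list-page URLs and the low-priority queue stores article-page URLs.
The whole store-and-fetch process is as follows:
1. Insert the starting list-page URL into highlevel.
2. Take a URL from highlevel, extract the URL of the next list page and store it in highlevel, then extract the article-page URLs on the current list page and store them in lowlevel.
3. Repeat step 2 until highlevel contains no more list-page URLs.
4. After step 3, take article-page URLs from lowlevel, download each page, parse it and store the result in the database.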
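
The four steps above can be sketched without Redis, using two in-memory deques as stand-ins for the highlevel and lowlevel lists. The `FAKE_SITE` dictionary, its URLs and the `crawl` function are hypothetical and exist only to make the flow visible; the real crawler downloads and parses pages instead of looking them up.

```python
from collections import deque

# Hypothetical site: each list page maps to
# (URL of the next list page or None, article URLs found on the page).
FAKE_SITE = {
    "list/1": ("list/2", ["article/a", "article/b"]),
    "list/2": (None, ["article/c"]),
}


def crawl(start_url):
    """Drain list pages first, then article pages; return the visit order."""
    highlevel = deque([start_url])   # step 1: seed with the first list page
    lowlevel = deque()
    visited = []

    # Steps 2 and 3: process list pages until none are left.
    while highlevel:
        url = highlevel.popleft()
        visited.append(url)
        next_page, articles = FAKE_SITE[url]
        if next_page:
            highlevel.append(next_page)
        lowlevel.extend(articles)

    # Step 4: only now are article pages taken (the real crawler would
    # download, parse and store each one in the database here).
    while lowlevel:
        visited.append(lowlevel.popleft())
    return visited


print(crawl("list/1"))
```

In this sketch every list page is visited before any article page, which is the ordering the two priority levels are meant to guarantee.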

I. Building the URL queues
1. The queues are built on Redis lists. Instantiating the client with different arguments creates different queues:

self.highlevel_db = RedisClent('highlevel', self.website)
self.lowlevel_db = RedisClent('lowlevel', self.website)

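The RedisClent class used here lives in dblink.py. Passing a different type creates a different queue, and the website argument lets the same code be extended to other websites.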

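import redis

# REDIS_HOST, REDIS_POST and REDIS_PASSWORD are connection settings defined elsewhere in the project.
class RedisClent(object):
    def __init__(self, type, website, host=REDIS_HOST, port=REDIS_POST, password=REDIS_PASSWORD):
        """
        Initialize the Redis connection.
        :param type: queue type, e.g. 'highlevel' or 'lowlevel'
        :param website: name of the website
        :param host: Redis host
        :param port: Redis port
        :param password: Redis password
        """
        self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
        self.type = type
        self.website = website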

2. The queues are FIFO (first in, first out) and rely on a few of the built-in Redis list commands:

    def Listname(self):
        """
        Return the Redis key of this URL queue, in the form "type:website".
        """
        return "{type}:{website}".format(type=self.type, website=self.website)

    def allurl(self, start=0, end=-1):
        """
        Return the URLs in the list between start and end (the whole list by default).
        """
        return self.db.lrange(self.Listname(), start, end)

    def addUrl(self, url):
        """
        Add a URL to the queue.
        :param url: url
        """
        self.db.lrem(self.Listname(), 0, url)    # remove any copies of this url already queued
        self.db.lpush(self.Listname(), url)      # push onto the head of the list

    def popurl(self):
        """
        Pop the oldest URL from the queue.
        :return: the url, or None if the queue is empty
        """
        print(self.Listname(), ':', self.db.llen(self.Listname()))    # log the queue name and its current length
        return self.db.rpop(self.Listname())

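Before adding a URL we have to consider duplicates. I took a shortcut here: first delete any identical URL already in the queue, then push the URL onto the queue.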
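
To see what this shortcut does to the ordering, here is a small in-memory model of the three list commands involved (LREM, LPUSH, RPOP). `ListQueueModel` and its method names are hypothetical; it only mimics the behaviour of addUrl and popurl so the effect can be checked without a Redis server.

```python
class ListQueueModel:
    """In-memory model of the Redis list operations used by addUrl/popurl."""

    def __init__(self):
        self.items = []          # index 0 is the head (left end) of the list

    def add_url(self, url):
        # LREM key 0 url: remove every occurrence of url
        self.items = [u for u in self.items if u != url]
        # LPUSH key url: insert at the head
        self.items.insert(0, url)

    def pop_url(self):
        # RPOP key: take from the tail, or None when the list is empty
        return self.items.pop() if self.items else None


q = ListQueueModel()
for u in ["a", "b", "c", "a"]:   # "a" is queued twice
    q.add_url(u)

order = [q.pop_url() for _ in range(4)]
print(order)
```

Two things follow from this model. A re-added URL keeps only one copy but moves to the back of the queue, and the removal only covers URLs that are still queued, so a URL that has already been popped can be added (and crawled) again.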

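II. URL queue operations
When taking a URL, the high-priority queue is tried first; only when it is empty is a URL taken from the low-priority queue. This logic is wrapped in a URL repository.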

class UrlRepository(object):
    def __init__(self, website):
        self.website = website
        self.highlevel_db = RedisClent('highlevel', self.website)
        self.lowlevel_db = RedisClent('lowlevel', self.website)

    # Take a url: high-priority queue first, low-priority queue as fallback
    def urlPop(self):
        url = self.highlevel_db.popurl()
        if not url:
            url = self.lowlevel_db.popurl()
        return url

    # Add a url to the high-priority (list page) queue
    def addHigh(self, url):
        self.highlevel_db.addUrl(url)

    # Add a url to the low-priority (article page) queue
    def addlow(self, url):
        self.lowlevel_db.addUrl(url)
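
The priority rule in urlPop can be tried out with stand-in queues. In this sketch `InMemoryQueue` and `DemoRepository` are hypothetical names: the first exposes the same addUrl/popurl interface as RedisClent (without the de-duplication), the second copies the pop logic of UrlRepository but takes its queues as arguments so that no Redis server is needed.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in exposing the same addUrl/popurl interface."""

    def __init__(self):
        self._items = deque()

    def addUrl(self, url):
        self._items.append(url)

    def popurl(self):
        return self._items.popleft() if self._items else None


class DemoRepository:
    """Same pop logic as UrlRepository, with the queues passed in."""

    def __init__(self, high, low):
        self.high, self.low = high, low

    def urlPop(self):
        # Always try the high-priority queue first.
        return self.high.popurl() or self.low.popurl()


repo = DemoRepository(InMemoryQueue(), InMemoryQueue())
repo.low.addUrl("article/1")
repo.low.addUrl("article/2")
repo.high.addUrl("list/1")

first = repo.urlPop()        # the list page wins although it was added last
second = repo.urlPop()       # high is empty, so fall back to low
repo.high.addUrl("list/2")   # a new list page arrives mid-crawl
third = repo.urlPop()        # it is served before the remaining article
fourth = repo.urlPop()
fifth = repo.urlPop()        # both queues are empty
print(first, second, third, fourth, fifth)
```

Because the high-priority queue is checked on every call, a list page added while articles are being processed is served ahead of the remaining articles.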

The code has been uploaded to GitHub and is provided for reference only:
https://github.com/yangjunjians/Crawlers
