Web Crawling: Using Proxies to Crawl WeChat Official Account Articles (Part 1)

chengqiuming

Article link: https://blog.csdn.net/chengqiuming/article/details/86897000

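I. Goal

Use proxies to crawl WeChat official account articles and extract the body text, publication date, official account name, and other fields. The source is Sogou Weixin (https://weixin.sogou.com/). The crawled results are then saved to a MySQL database.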

II. Prepare the Proxy Pool

III. Crawl Analysis

1 Searching for NBA returns the latest articles

2 The search URL is: https://weixin.sogou.com/weixin?type=2&query=NBA&ie=utf8&s_from=input&_sug_=y&_sug_type_=&w=01019900&sut=1512&sst0=1549769076479&lkt=4%2C1549769074864%2C1549769076377. After dropping the irrelevant parameters, we only need to request https://weixin.sogou.com/weixin?type=2&query=NBA
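The parameter trimming described above can be sketched with the standard library. This is only an illustration: the helper name strip_params and the KEEP tuple are hypothetical and not part of the original project, and the only parameters kept are the two the article keeps (type and query).

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# Parameters the article keeps; everything else is treated as irrelevant.
KEEP = ("type", "query")


def strip_params(url, keep=KEEP):
    """Return url with only the query parameters listed in keep (hypothetical helper)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)  # blank values such as "_sug_type_=" are dropped
    kept = [(name, params[name][0]) for name in keep if name in params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


full_url = (
    "https://weixin.sogou.com/weixin?type=2&query=NBA&ie=utf8&s_from=input"
    "&_sug_=y&_sug_type_=&w=01019900&sut=1512&sst0=1549769076479"
    "&lkt=4%2C1549769074864%2C1549769076377"
)
short_url = strip_params(full_url)
print(short_url)
```

Run against the long URL above, this should print the same short URL the article arrives at by hand.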

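3 Pagination list

Without logging in, only 10 pages are visible; after logging in, more results are available.

4 Sogou's anti-crawling measures

If you refresh the page repeatedly, the site shows an anti-crawler verification page, which means the IP has made too many requests and has been blocked.

5 Implementation approach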

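Change the proxy pool's test URL to the Sogou Weixin site
Build a Redis crawl queue and use it to store and retrieve requests
Handle exceptions and put failed requests back into the queue
Implement pagination and article-list extraction, and add the resulting requests to the queue
Extract the information from each WeChat article
Save the extracted information to MySQL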
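The "put failed requests back into the queue" step can be sketched as below. This is a simplified illustration rather than the project's scheduler: an in-memory deque stands in for the Redis queue, and the names Job, run, MAX_FAILED_TIME, and make_flaky_fetch are all hypothetical.

```python
from collections import deque

MAX_FAILED_TIME = 3  # hypothetical retry limit, not taken from the article


class Job:
    """Minimal stand-in for a queued request (hypothetical)."""

    def __init__(self, url, fail_time=0):
        self.url = url
        self.fail_time = fail_time


def run(queue, fetch, max_failed=MAX_FAILED_TIME):
    """Drain the queue; re-queue failed jobs until they reach max_failed failures."""
    done, dropped = [], []
    while queue:
        job = queue.popleft()
        try:
            done.append((job.url, fetch(job.url)))
        except Exception:
            job.fail_time += 1
            if job.fail_time < max_failed:
                queue.append(job)  # failed: goes to the back of the queue
            else:
                dropped.append(job.url)  # too many failures: give up
    return done, dropped


def make_flaky_fetch():
    """Fake fetcher: 'bad' URLs always fail, 'flaky' URLs fail only once."""
    calls = {}

    def fetch(url):
        calls[url] = calls.get(url, 0) + 1
        if url.endswith("bad"):
            raise IOError("always fails")
        if url.endswith("flaky") and calls[url] == 1:
            raise IOError("first attempt fails")
        return "ok"

    return fetch


queue = deque(Job("http://example.com/" + name) for name in ("good", "flaky", "bad"))
done, dropped = run(queue, make_flaky_fetch())
print([url for url, _ in done], dropped)
```

Tracking the failure count on the request itself is what lets the count survive a round trip through the queue, which is presumably why WeixinRequest carries a fail_time attribute.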

IV. Constructing the Request

Subclass Request to define WeixinRequest

from weixin.config import TIMEOUT
from requests import Request


class WeixinRequest(Request):
    def __init__(self, url, callback, method='GET', headers=None, need_proxy=False, fail_time=0, timeout=TIMEOUT):
        Request.__init__(self, method, url, headers)
        # Callback used to handle the response
        self.callback = callback
        # Whether this request must go through a proxy
        self.need_proxy = need_proxy
        # Number of times this request has failed
        self.fail_time = fail_time
        # Timeout for the request
        self.timeout = timeout

V. Implementing the Request Queue

1 Code

from redis import StrictRedis
from weixin.config import *
from pickle import dumps, loads
from weixin.request import WeixinRequest


class RedisQueue():
    def __init__(self):
        """
        Initialize the StrictRedis connection
        """
        self.db = StrictRedis(host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD)

    def add(self, request):
        """
        Append a serialized Request to the queue
        :param request: request object
        :return: result of rpush, or False if request is not a WeixinRequest
        """
        if isinstance(request, WeixinRequest):
            return self.db.rpush(REDIS_KEY, dumps(request))
        return False

    def pop(self):
        """
        Pop the next Request and deserialize it
        :return: Request, or False if the queue is empty
        """
        if self.db.llen(REDIS_KEY):
            return loads(self.db.lpop(REDIS_KEY))
        else:
            return False

    def clear(self):
        self.db.delete(REDIS_KEY)

    def empty(self):
        return self.db.llen(REDIS_KEY) == 0


if __name__ == '__main__':
    db = RedisQueue()
    start_url = 'http://www.baidu.com'
    weixin_request = WeixinRequest(url=start_url, callback='hello', need_proxy=True)
    db.add(weixin_request)
    request = db.pop()
    print(request)
    print(request.callback, request.need_proxy)

2 Test result

E:\Python\Weixin\venv\Scripts\python.exe E:/Python/Weixin/weixin/db.py
<Request [GET]>
hello True
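To try the same add/pop round trip without a running Redis server, an in-memory stand-in with the same interface can help. The sketch below is an assumption-laden illustration, not part of the original project: MemoryQueue is a hypothetical class, a deque replaces the Redis list, and types.SimpleNamespace stands in for WeixinRequest so the block needs only the standard library.

```python
from collections import deque
from pickle import dumps, loads
from types import SimpleNamespace


class MemoryQueue:
    """In-memory stand-in for RedisQueue (hypothetical, for local experiments only)."""

    def __init__(self):
        self._items = deque()

    def add(self, request):
        # Store pickled bytes, mirroring what RedisQueue pushes to Redis.
        if isinstance(request, SimpleNamespace):
            self._items.append(dumps(request))
            return len(self._items)
        return False

    def pop(self):
        # FIFO order: items are appended on the right and popped from the left.
        if self._items:
            return loads(self._items.popleft())
        return False

    def clear(self):
        self._items.clear()

    def empty(self):
        return len(self._items) == 0


queue = MemoryQueue()
original = SimpleNamespace(url="http://www.baidu.com", callback="hello", need_proxy=True)
queue.add(original)
restored = queue.pop()
print(restored.callback, restored.need_proxy)
```

The restored object is a copy rebuilt from bytes rather than the original instance, which is the same behaviour the Redis-backed queue shows in the test above.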
