Web Crawler Basics (11): Scraping Article Content from the cnblog Blog Site

Latest recommended article published on 2024-08-22 04:10:15

shelleyHLX


Category: Crawlers. Tags: crawler

Article link: https://blog.csdn.net/qq_27009517/article/details/108704642


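This article shows how to use Python to scrape article content from the CNBlog blog site. It analyzes the page's request parameters to construct the URL, which solves the problem of only ever getting the first page of data. It also discusses using multithreading and multiprocessing in crawlers and lists related references.

Summary generated by CSDN using AI.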

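Contents

1. Introduction
2. Page analysis
3. Results
4. Multiprocessing and multithreading
- 4.1. Multithreaded implementation
- 4.2. Multiprocess implementation
5. References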

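1. Introduction

Start from the page link: https://www.cnblogs.com/#p4
In the browser, changing p4 to anything from p1 to p200 brings up the corresponding list page. Doing the same thing in code does not work, though: it only ever returns the links to the posts on page p1. The reason is that the part after # is a URL fragment. The browser uses it to load the page with JavaScript, but it is never sent to the server, so every request from code fetches the same home page.
The code is as follows: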

# coding: utf-8
# Author: shelley
# 2020/9/21,9:57

import requests
import re

# Wrong: this produces the same URLs on every iteration,
# because the "#p{}" fragment is never sent to the server.

def get_articles():
    url = 'https://www.cnblogs.com/#p{}'
    headers = {
        'Referer': 'https://www.cnblogs.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    for i in range(1, 200 + 1):
        # Fill the page number into the URL template.
        url_ = url.format(i)

        text = requests.get(url_, headers=headers).text
        text = text.replace('\xbb', '')

        # Match the anchor tag of each post title on the list page.
        pattern = r'<a class="post-item-title" href="https://www.cnblogs.com/.*?/p/.*?\.html" target="_blank">'
        articles = re.findall(pattern, text)
        for article in articles:
            # Pull the post URL out of the matched anchor tag.
            pat = r'https://www.cnblogs.com/.*?/p/.*?\.html'
            article = re.findall(pat, article)
            print(article[0])
            # text = requests.get(article[0]).text


if __name__ == '__main__':
    get_articles()
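
Why every iteration ends up fetching the same page can be checked offline, without touching the site. The sketch below uses only the standard library's urllib.parse; the helper name server_side_url is made up for illustration and is not part of the original code.

```python
from urllib.parse import urldefrag, urlsplit


def server_side_url(url):
    """Return the URL without its fragment, i.e. the part an HTTP client requests."""
    # urldefrag splits off everything after '#'.
    return urldefrag(url).url


for page in (1, 4, 200):
    full = 'https://www.cnblogs.com/#p{}'.format(page)
    # Show the fragment next to the URL that is actually requested.
    print(urlsplit(full).fragment, '->', server_side_url(full))
```

All three page numbers reduce to the same request URL, which is why the loop above keeps printing the links from the first page.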

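2. Page analysis

Open the home page, go to the Network tab of the browser's developer tools, and inspect the AggSitePostList request. Its Request Payload shows the request parameters: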

CategoryId: 808
CategoryType: "SiteHome"
ItemListActionName: "AggSitePostList"
PageIndex: 4
ParentCategoryId: 0
TotalPostCount: 4000
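
The captured payload can be written down as a Python dict in which only PageIndex changes from page to page. This is only a sketch: build_payload is a hypothetical helper, and the values are the ones captured above, which may differ on the live site.

```python
import json


def build_payload(page_index):
    """Build the request parameters for one list page (values copied from the capture above)."""
    return {
        'CategoryId': 808,
        'CategoryType': 'SiteHome',
        'ItemListActionName': 'AggSitePostList',
        'PageIndex': page_index,  # the only field that varies between pages
        'ParentCategoryId': 0,
        'TotalPostCount': 4000,
    }


# Serialize the parameters for page 4, as they would appear in a JSON body.
print(json.dumps(build_payload(4)))
```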

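Extract the request parameters and build a URL for page 4 of the list:
http://www.cnblogs.com/?CategoryId=808&CategoryType=%22SiteHome%22&ItemListActionName=%22AggSitePostList%22&PageIndex=4&ParentCategoryId=0

In a URL, ? marks the start of the query parameters, & separates multiple parameters, and string parameters need to be wrapped in %22 (the URL-encoded double quote).
Opened in a browser, this link still shows the home page rather than the requested page. Testing it in code gives the same result: the links to the 20 posts on the home page.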
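
Rather than typing the %22 escapes by hand, the query string can be generated from the parameters. A minimal sketch with the standard library's urllib.parse.urlencode; the names BASE_URL and build_list_url are assumptions made for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.cnblogs.com/'


def build_list_url(page_index):
    """Build the list-page URL for a given page number."""
    params = {
        'CategoryId': 808,
        # The literal double quotes are percent-encoded to %22 by urlencode.
        'CategoryType': '"SiteHome"',
        'ItemListActionName': '"AggSitePostList"',
        'PageIndex': page_index,
        'ParentCategoryId': 0,
    }
    # '?' starts the query string; urlencode joins the parameters with '&'.
    return BASE_URL + '?' + urlencode(params)


print(build_list_url(4))
```

For page 4 this reproduces the URL shown above; passing a different page number only changes the PageIndex value.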

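I had no idea how to fix this at first, but I noticed that other people put a lot more into their headers, so I added whatever headers I could from the AggSitePostList request, and it worked! Hahaha.

The code is as follows: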

# coding: utf-8
# Author: shelley
# 2020/9/21,9:57

import requests
import re

# The commented-out '#p{}' URL returns the same URLs for every page (wrong).

def get_articles():
    # url = 'https://www.cnblogs.com/#p{}'
    url = 'http://www.cnblogs.com/?CategoryId=808&CategoryType=%22SiteHome%22&ItemListActionName=%22AggSitePostList%22&PageIndex={}&ParentCategoryId=0'
    headers = {
   
        'Referer': '