Web Crawler Example: Scraping cnblogs (博客园)

qiukapi

Published 2024-01-25 21:08:38

Category: Python. Tags: web crawler

Article link: https://blog.csdn.net/qiukapi/article/details/135853990

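This article shows how to use Python's urllib3 library to download a web page and parse its HTML, extracting the link and title from each tag whose class is titlelnk in order to crawl the blog list on the cnblogs home page.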

# General form: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple urllib3
from urllib3 import PoolManager, disable_warnings
from re import findall, search

http = PoolManager()
# Suppress urllib3 warning messages
disable_warnings()

# Download the web page at the given URL
def download(url):
    result = http.request('GET', url)
    # Decode the response body into the page's HTML source
    htmlStr = result.data.decode('utf-8')
    return htmlStr

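# Parse the HTML
def analyse(htmlStr):
    # Use a regular expression to find every <a> node whose tag contains "titlelnk" (the class value)
    aList = findall(r'<a[^>]*titlelnk[^>]*>[^<]*</a>', htmlStr)
    result = []
    # Extract the URL and title from each <a> node
    for a in aList:
        # Use a regular expression to extract the href value
        g = search(r'href\s*=\s*[\'"]([^>\'"]*)[\'"]', a)
        # Skip nodes without an href so a missing or stale url is never stored
        if g is None:
            continue
        url = g.group(1)
        # The title is the text between the first ">" and the last "<"
        index1 = a.find(">")
        index2 = a.rfind("<")
        title = a[index1 + 1:index2]
        result.append({'url': url, 'title': title})
    # Return a list of dicts, each holding a blog title and URL
    return result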

# Crawl the blog list
def crawler(url):
    html = download(url)
    blogList = analyse(html)
    # Print the title and URL of every blog found on the cnblogs home page
    for blog in blogList:
        print("title:", blog["title"])
        print("url:", blog["url"])

# Start crawling the blog list
crawler('https://www.cnblogs.com')
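
The regular expression in analyse only matches an <a> node whose body contains no nested tags, and it treats titlelnk as a plain substring of the tag. As a possible alternative, here is a minimal sketch of the same parsing step built on the standard library's html.parser module. The names TitleLinkParser, analyse_with_parser, and sample_html are hypothetical and not part of the original script, and the sample markup is invented for illustration; the live page's structure may differ, so the class name may need adjusting.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect the href and text of every <a> whose class list contains target_class."""

    def __init__(self, target_class="titlelnk"):
        super().__init__()
        self.target_class = target_class
        self.links = []        # finished {'url': ..., 'title': ...} entries
        self._current = None   # entry being built while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        # Only start collecting when the class matches and an href is present
        if self.target_class in classes and href:
            self._current = {"url": href, "title": ""}

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <b>
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["title"] = self._current["title"].strip()
            self.links.append(self._current)
            self._current = None


def analyse_with_parser(html_str, target_class="titlelnk"):
    """Return a list of {'url', 'title'} dicts, like analyse() in the script above."""
    parser = TitleLinkParser(target_class)
    parser.feed(html_str)
    parser.close()
    return parser.links


# Hypothetical markup used only to exercise the parser offline
sample_html = """
<div class="post_item">
  <a class="titlelnk" href="https://www.cnblogs.com/example/p/1.html" target="_blank">First <b>post</b></a>
  <a class="other" href="https://www.cnblogs.com/example/p/2.html">Not a title link</a>
  <a href="https://www.cnblogs.com/example/p/3.html" class="titlelnk">Second post</a>
</div>
"""

for blog in analyse_with_parser(sample_html):
    print("title:", blog["title"])
    print("url:", blog["url"])
```

Because analyse_with_parser returns the same list-of-dicts shape, it could stand in for analyse inside crawler without other changes. The trade-off is a little more code in exchange for handling nested tags and arbitrary attribute order.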