Python 3 crawler: scraping a blog's article list and URLs with regular expressions

Latest recommended article published 2024-05-03 11:07:56

采蘑菇的老姑娘


Category: 爬虫-pyhon · Tags: python

Article link: https://blog.csdn.net/u011093930/article/details/108296309


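Task: scrape the article titles and URLs from the blog site https://www.cnblogs.com/ and print each title and URL to the console.

The code is as follows; the approach and notes are in the comments: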

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
__author__ = 'km'
import re
import urllib.request


def download(url):
    # Fetch the page and decode the response body as UTF-8.
    result = urllib.request.urlopen(url=url)
    content = result.read()
    htmlStr = content.decode("utf-8")
    return htmlStr


def analyes(htmlStr):
    # Collect every <a> tag whose attributes contain "post-item-title".
    aList = re.findall(r'<a[^>]*post-item-title[^>]*>[^<]*</a>', htmlStr)
    result = []
    for a in aList:
        # The search() attempt below did not work: g was never extracted and I have
        # not found the cause yet, so I switched to findall().
        # g = search('herf[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)
        # if g != None:
        #     url = g.group(1)
        #     print('2', url)
        # findall() returns a list; its first item is the link's full href value.
        g = re.findall(r'href\s*=\s*[\'"]([^\'"]*)[\'"]', a)
        # The title is the text between the opening tag's ">" and the closing tag's "<".
        index1 = a.find(">")
        index2 = a.rfind("<")
        title = a[index1 + 1:index2]
        d = {}
        d['url'] = g[0] if g else ''
        d['title'] = title
        result.append(d)
    return result


def crawler(url):
    # Download the page, parse it, and print each title with its URL.
    html = download(url)
    blogList = analyes(html)
    for blog in blogList:
        print("title:", blog["title"])
        print("url:", blog["url"])


if __name__ == '__main__':
    crawler('https://www.cnblogs.com/')
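
To check the parsing step without hitting the network, the same idea can be tried on a hard-coded snippet. What follows is only a sketch under assumptions: the name `extract_posts`, the two compiled patterns, and the sample markup are all made up for illustration and are not part of the code above, and the live page's markup may differ.

```python
import re
from html import unescape

# Hypothetical sample markup, modelled on anchors with class "post-item-title".
SAMPLE_HTML = '''
<a class="post-item-title" href="https://www.cnblogs.com/demo/p/1.html" target="_blank">First post</a>
<a class="other" href="https://www.cnblogs.com/skip">Not a post</a>
<a href="https://www.cnblogs.com/demo/p/2.html" class="post-item-title">Tom &amp; Jerry</a>
'''

# Group 1: the tag's attribute text; group 2: the link text.
ANCHOR_RE = re.compile(r'<a([^>]*post-item-title[^>]*)>([^<]*)</a>')
# Group 1: the value of the href attribute, in single or double quotes.
HREF_RE = re.compile(r'href\s*=\s*[\'"]([^\'"]*)[\'"]')


def extract_posts(html_str):
    """Return a list of {'url', 'title'} dicts for matching anchors."""
    posts = []
    for attrs, title in ANCHOR_RE.findall(html_str):
        match = HREF_RE.search(attrs)
        if match is None:
            continue  # anchor without an href: skip it
        posts.append({
            'url': match.group(1),
            # Decode HTML entities such as &amp; in the title text.
            'title': unescape(title.strip()),
        })
    return posts


if __name__ == '__main__':
    for post in extract_posts(SAMPLE_HTML):
        print("title:", post['title'])
        print("url:", post['url'])
```

Taking the HTML as a string argument keeps the parsing separate from the download, so the patterns can be adjusted against saved markup before running against the real site.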

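Output (screenshot not reproduced here):

Open question:

When I use `search` in the regex step, `g` comes back as `None`, so no URL is extracted. I haven't found the cause yet; if anyone knows, please explain.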
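
One likely explanation, offered as a sketch rather than a confirmed diagnosis: the commented-out pattern spells the attribute `herf`, while the HTML uses `href`, so `search` has nothing to match and returns `None`. The snippet below reproduces that on a made-up anchor tag; the sample string and the variable names `misspelled` and `corrected` are hypothetical.

```python
import re

# Made-up anchor tag, for illustration only.
a = '<a class="post-item-title" href="https://www.cnblogs.com/demo/p/1.html">Demo</a>'

# Pattern as written in the commented-out attempt: note the spelling "herf".
misspelled = re.search(r'herf[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)

# The same pattern with the attribute name spelled "href".
corrected = re.search(r'href[\s]*=[\s]*[\'"]([^>\'""]*)[\'"]', a)

print(misspelled)          # None
print(corrected.group(1))  # https://www.cnblogs.com/demo/p/1.html
```

If that is indeed the cause, `search` with the corrected spelling returns a match object whose `group(1)` is the full link.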
