Python: Applying Regular Expressions

Latest recommended article published on 2021-10-01 20:05:53

damant

Categories: regular expressions, python, crawlers. Tags: python, regular expressions

Article link: https://blog.csdn.net/damant/article/details/47612975

Improving a simple crawler with regular expressions

A crawler program

This is a small Python program that extracts the image references ending in .jpg from a web page's source code and downloads them.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # A small crawler that downloads all the images on a web page
    # getJpg.py
    import re
    import urllib.request

    # Get the source code of a web page
    def getHtml(url):
        print('Getting html source code...')
        page = urllib.request.urlopen(url)
        # Assumes the page is UTF-8; undecodable bytes are replaced
        html = page.read().decode('utf-8', errors='replace')
        return html

    # Open the page, inspect the image addresses,
    # and use their common features to decide the regex rule
    def getImageAddrList(html):
        print('Getting all addresses of images...')
        rule = r"src=\"(.+\.jpg)\" pic_ext"
        imReg = re.compile(rule)
        imList = imReg.findall(html)
        return imList

    # Download every image, naming the files 1.jpg, 2.jpg, ...
    def getImage(imList):
        print('Downloading...')
        for name, imgurl in enumerate(imList, start=1):
            urllib.request.urlretrieve(imgurl, '%s.jpg' % name)
        print('Got', len(imList), 'images!')

    # main
    htmlAddr = "http://tieba.baidu.com/p/2510089409"
    html = getHtml(htmlAddr)
    imList = getImageAddrList(html)
    getImage(imList)

Problem:

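The rule used here is "src=\"(.+\.jpg)\" pic_ext", but it runs into trouble.
For example, the page source contains strings like the following.
Case one: a tag referencing a non-jpg image is immediately followed by a tag referencing a jpg image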

src="http://static.tieba.baidu.com/tb/editor/images/face/i_f07.png" pic_ext="png"  width="30" height="30"><br><br><br><img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=5db4e4ccbf096b6381195e583c318733/ef59ccbf6c81800a4e2aaaedb03533fa838b4702.jpg" pic_ext="jpeg"

Matched url:
http://static.tieba.baidu.com/tb/editor/images/face/i_f07.png" pic_ext="png" width="30" height="30"> <img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=5db4e4ccbf096b6381195e583c318733/ef59ccbf6c81800a4e2aaaedb03533fa838b4702.jpg
Case two: two tags referencing jpg images sit next to each other

src="http://imgsrc.baidu.com/forum/w%3D580/sign=2d3c6fb835a85edffa8cfe2b795509d8/bc27cffc1e178a82280c7948f703738da977e823.jpg" pic_ext="jpeg"  height="350" width="560"><br><img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=a18181c99213b07ebdbd50003cd69113/8713c8fcc3cec3fda670a19dd788d43f86942788.jpg" pic_ext="bmp"

Extracted url:
http://imgsrc.baidu.com/forum/w%3D580/sign=0955e2d1b2de9c82a665f9875c8080d2/8d1ab051f8198618ad77e96c4bed2e738bd4e623.jpg" pic_ext="jpeg" height="315" width="560"> <img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=2d3c6fb835a85edffa8cfe2b795509d8/bc27cffc1e178a82280c7948f703738da977e823.jpg" pic_ext="jpeg" height="350" width="560"> <img pic_type="0" class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=a18181c99213b07ebdbd50003cd69113/8713c8fcc3cec3fda670a19dd788d43f86942788.jpg
In both cases, adjacent img tags make the rule match several tags merged into one string, so the extracted url is clearly not what we want.
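The merging can be reproduced offline on a shortened stand-in for the case-two string. This is only a sketch: the example.com URLs and the variable names are made up for illustration and do not come from the page.

```python
import re

# Shortened stand-in for two adjacent jpg tags (hypothetical URLs)
sample = ('<img src="http://example.com/a.jpg" pic_ext="jpeg">'
          '<br><img src="http://example.com/b.jpg" pic_ext="jpeg">')

greedy_rule = re.compile(r"src=\"(.+\.jpg)\" pic_ext")
greedy_matches = greedy_rule.findall(sample)

# The greedy .+ runs from the first src=" to the last .jpg" pic_ext,
# so one merged string comes back instead of two urls
print(len(greedy_matches))
print(greedy_matches[0])
```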

Improvement:

After some reading, I found that regular expressions have two matching modes: greedy and non-greedy (lazy).

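Greedy matching: a regular expression normally tends toward the longest possible match, which is what greedy matching means. For example, matching the string abcaxc with the pattern ab.*c gives abcaxc.

Non-greedy matching: the match stops as soon as it succeeds, consuming as few characters as possible. Matching the same string abcaxc with the pattern ab.*?c gives abc.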
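The difference is easy to check with Python's re module. The sketch below uses the string abcaxc, with the wildcard written as .* for greedy matching and .*? for lazy matching; the variable names are only illustrative.

```python
import re

text = "abcaxc"

# Greedy: .* grabs as much as it can, so the match ends at the last c
greedy = re.search(r"ab.*c", text).group()

# Lazy: .*? grabs as little as it can, so the match ends at the first c
lazy = re.search(r"ab.*?c", text).group()

print(greedy)  # abcaxc
print(lazy)    # abc
```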

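So the regex above can be improved to "src=\"(.+?\.jpg)\" pic_ext": adding ? after + makes it match in non-greedy mode. With this change, when two tags referencing jpg images are adjacent, each url is extracted separately.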
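A quick offline check of the lazy rule on shortened stand-ins for both cases. The example.com URLs are hypothetical, chosen only to keep the strings short. Case two now splits cleanly, while case one still comes back as one merged string.

```python
import re

lazy_rule = re.compile(r"src=\"(.+?\.jpg)\" pic_ext")

# Case two: two adjacent jpg tags (hypothetical URLs)
two_jpgs = ('<img src="http://example.com/a.jpg" pic_ext="jpeg">'
            '<br><img src="http://example.com/b.jpg" pic_ext="jpeg">')
case_two = lazy_rule.findall(two_jpgs)
print(case_two)

# Case one: a png tag followed by a jpg tag; the match still starts
# at the png's src=" and only stops at the first .jpg" pic_ext
png_then_jpg = ('<img src="http://example.com/face.png" pic_ext="png">'
                '<br><img src="http://example.com/c.jpg" pic_ext="jpeg">')
case_one = lazy_rule.findall(png_then_jpg)
print(case_one)
```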

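So how do we handle the first case?
Looking closer, what we actually want is the content between the double quotes after src=, and those quotes must be a matching pair. In other words, no other " should appear between the two quotes. That makes the fix easy: replace . with [^\"], which matches any character except ".
After the change, the regex becomes: rule = r"src=\"([^\"]+?\.jpg)\" pic_ext"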
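As a sanity check, here is the final rule applied to a shortened string that combines both cases. The example.com URLs and variable names are hypothetical stand-ins, not the page's real content.

```python
import re

final_rule = re.compile(r"src=\"([^\"]+?\.jpg)\" pic_ext")

# A png tag followed by two adjacent jpg tags (hypothetical URLs)
html = ('<img src="http://example.com/face.png" pic_ext="png">'
        '<br><img src="http://example.com/a.jpg" pic_ext="jpeg">'
        '<br><img src="http://example.com/b.jpg" pic_ext="jpeg">')

# [^"] cannot cross a closing quote, so the png tag is skipped
# and each jpg url is captured on its own
urls = final_rule.findall(html)
print(urls)
```

Because the character class already stops at the first closing quote, a match can no longer spill over into the next tag.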

Now we can easily grab the images on the page!