python re正则匹配网页中图片url地址

最近写了个python抓取必应搜索首页http://cn.bing.com/的背景图片并将此图片更换为我的电脑桌面的程序,在正则匹配图片url时遇到了匹配失败问题。

要抓取的图片地址如图所示:src

首先,使用这个pattern

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

无论怎么匹配都匹配不到,后来把网页源码抓下来放在notepad++中查看,并用notepad++的正则匹配查找,很轻易就匹配到了,如图:

后来我写了个测试代码,把图片地址在的那一行保存在一个字符串中,很快就匹配到了,如下面代码所示,data是匹配不到的,然而line是可以匹配到的。

# -*-coding:utf-8-*-
import os
import re

f = open('bing.html','r')

line = r'''Bnp.Internal.Close(0,0,60056); } });;g_img={url: "https://az12410.vo.msecnd.net/homepage/app/2016hw/BingHalloween_BkgImg.jpg",id:'bgDiv',d:'200',cN'''
data = f.read().decode('utf-8','ignore').encode('gbk','ignore')

print " "

reg = re.compile('.*g_img={url: "(http.*?jpg)"')

if re.match(reg, data):
            m1 = reg.findall(data)
            print m1[0]
else:
            print("data Not match .")
            
print 20*'-'
#print line
if re.match(reg, line):
            m2 = reg.findall(line)
            print m2[0]
else:
            print("line Not match .")

由此可见line和data是有区别的,什么区别呢?那就是data是多行的,包含换行符,而line是单行的,没有换行符。我有在字符串line中加了换行符,结果line没有匹配到。

 到这了原因就清楚了。原因就在这句话 
re.compile('.*g_img={url: "(http.*?jpg)"')。
后来翻阅python文档,发现re.compile()这个函数的第二个可选参数flags。这个参数是re中定义的常量,有如下常量
re.DEBUG Display debug information about compiled expression.
re.I 
re.IGNORECASE Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale.
re.L 
re.LOCALE Make \w, \W, \b, \B, \s and \S dependent on the current locale.
re.M 
re.MULTILINE When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
re.S 
re.DOTALL Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.re.U re.UNICODE Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.New in version 2.0.
re.X 
re.VERBOSE This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class or when preceded by an unescaped backslash. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.


这里我们需要的就是re.S 让'.'匹配所有字符,包括换行符。修改正则表达式为

reg = re.compile('.*g_img={url: "(http.*?jpg)"', re.S)

即可完美解决问题。
 
 

  • 1
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
Python,re正则表达式模块用于匹配和操作字符串。引用提到了一些重要的概念和用法。其,r'\n'表示不对字符串的特殊字符进行转义,r'\\n'可以用于匹配字符串的'\\n'。而在re模块,可以使用re.sub()函数进行替换操作,并且通过re.subn()函数还可以返回替换的次数。例如,re.subn('\d', 'S', 'abc12jh45li78', 3)将数字替换成'S',最多替换3次,返回结果为('abcSSjhS5li78', 3)。另外,re.escape()函数可以对字符串的特殊字符进行转义。例如,re.escape('python.exe\n')将结果转义为'python\\.exe\\\n'。引用还提到了search()和match()方法,用于在字符串搜索匹配的内容。引用提到了使用re正则表达式匹配网页图片URL地址的方法。总的来说,re正则表达式Python是一个强大的工具,可以用于字符串的匹配和替换等操作。<span class="em">1</span><span class="em">2</span><span class="em">3</span> #### 引用[.reference_title] - *1* *2* [Python之re模块实现正则表达式](https://blog.csdn.net/zong596568821xp/article/details/123802518)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_1"}}] [.reference_item style="max-width: 50%"] - *3* [python re正则匹配网页图片url地址的方法](https://download.csdn.net/download/weixin_38752074/12865483)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v93^chatsearchT3_1"}}] [.reference_item style="max-width: 50%"] [ .reference_list ]
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值