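Python: search the content of an HTML file for specific URL strings, save them in a list, download the image at each URL, and save it to disk, using the re regular-expression module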

Article link: https://blog.csdn.net/xwbk12/article/details/79018641


Reference: http://blog.csdn.net/xwbk12/article/details/72734930

1. Target URL: https://xianzhi.aliyun.com/forum/topic/1805/
The page content is shown in the figure below.
[Image: screenshot of the target page]

From the body of the response, extract strings like the following:
https://xianzhi.aliyun.com/forum/media/upload/picture/20171215230019-ab0e46aa-e1a8-1.png

2. Python script
Run on Kali Linux:

root@kali:~/python# cat downloadxianzhi-re.py 

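#coding=utf-8
# Written for Python 3 (urllib.request replaces the Python 2 urllib calls)
import re
import urllib.request

def getHtml(url):
    # Fetch the page and decode the response body to text
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8', 'ignore')
    return html

def getImg(html):
    # Capture every .png URL inside src="..." that is directly followed by ></p>
    reg = r'src="(.+?\.png)"></p>'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        # Save the images as 0100.png, 1100.png, ... (the files are PNG, not JPEG)
        urllib.request.urlretrieve(imgurl, '%s100.png' % x)
        x += 1
    return imglist

html = getHtml("https://xianzhi.aliyun.com/forum/topic/1805/")

print(getImg(html))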
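
The script does the extraction and the download in one function, so the regex can only be checked by hitting the network. A minimal sketch of one way to separate the two steps follows, assuming Python 3. The names `extract_png_urls`, `local_name` and `download_all` are hypothetical and not part of the original script, and the second image in the sample HTML is made up to show how a relative path would be handled.

```python
import os
import re
import urllib.request
from urllib.parse import urljoin, urlparse

# Same pattern as in the script: capture a .png URL inside src="..."
# that is immediately followed by ></p>.
IMG_RE = re.compile(r'src="(.+?\.png)"></p>')


def extract_png_urls(html, base_url):
    """Return the list of absolute .png URLs found in html."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against base_url.
    return [urljoin(base_url, src) for src in IMG_RE.findall(html)]


def local_name(url, index):
    """Build a local file name from the index and the URL's base name."""
    base = os.path.basename(urlparse(url).path)
    return '%03d-%s' % (index, base)


def download_all(urls, out_dir='.'):
    """Download every URL into out_dir (needs network access)."""
    saved = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, local_name(url, index))
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved


# Offline check on a hand-written sample that imitates the target page.
# The second image path is hypothetical.
sample = ('<p><img src="https://xianzhi.aliyun.com/forum/media/upload/picture/'
          '20171215230019-ab0e46aa-e1a8-1.png"></p>'
          '<p><img src="/forum/media/upload/picture/example-2.png"></p>')
urls = extract_png_urls(sample, 'https://xianzhi.aliyun.com/forum/topic/1805/')
print(urls)
print([local_name(u, i) for i, u in enumerate(urls)])
```

Calling `download_all(urls)` would then fetch the files. Naming each file after the URL's base name keeps the real extension, so a PNG is not saved under a .jpg name.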

3. Run results
[Image: screenshots of the script run]


src="(.+?\.png)"></p>
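Explanation:
src="           # matches the literal text src="
(.+?\.png)
# The parentheses form a capturing group: whatever matches inside them is captured into the group.
# .+ matches one or more of any character; the trailing ? makes the match lazy, so it consumes as few characters as possible.
# Together, .+?\.png matches as few characters as possible up to .png, which keeps the match from running past the end of the src attribute.
# This group therefore captures the URL of each image in the page.
"></p>          # matches the literal text "></p>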