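Batch Downloading Web Images with Python_Batch Downloading the Images in a Web Page with a Python Script + A Brief Introduction to Regular Expressions...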

Latest recommended article published on 2023-03-02 09:32:46

weixin_39750190

Article tags: batch downloading web images with Python

Let's start with something simple: downloading one specific web page.

import urllib2

def downURL(url, filename):
    try:
        # open the given url
        fp = urllib2.urlopen(url)
    except Exception:
        print 'cannot open the url'
        print url
        return 0
    op = open(filename, 'wb')
    while 1:
        # read() returns an empty string once the page has been fully read
        s = fp.read()
        if not s:
            break
        op.write(s)
    fp.close()
    op.close()
    return 1

# The r prefix marks a raw string, so the backslash is not treated as an escape character
downURL('http://www.sina.com', r'C:\1.htm')

After running the script, a file named 1.htm appears in the root directory of drive C. Opening it shows the Sina home page.
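For readers on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. It is not the author's script: the function name download_url and the chunk_size parameter are my own choices, and it returns True/False instead of 1/0.

```python
import shutil
import urllib.request


def download_url(url, filename, chunk_size=64 * 1024):
    """Save the resource at `url` to `filename`; return True on success."""
    try:
        # The with-statement closes both the response and the file,
        # even if copying fails halfway through.
        with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
            # copyfileobj copies chunk_size bytes at a time, so a large
            # page never has to fit in memory at once.
            shutil.copyfileobj(response, out, chunk_size)
    except (OSError, ValueError) as exc:
        # URLError is a subclass of OSError; a malformed URL raises ValueError.
        print("cannot open the url:", url, exc)
        return False
    return True


# Example call (needs network access, so it is left commented out):
# download_url("http://www.sina.com", r"C:\1.htm")
```

Opening the output file only after urlopen succeeds means a failed request does not leave an empty file behind.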

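One problem: some URLs point to a particular position inside a page. In http://deerchao.net/tutorials/regex/regex.htm#mission, for example, the trailing #mission refers to the "Goals of this article" section of that same page. The urlopen function cannot open a URL like this.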
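One possible workaround, sketched here in Python 3, is to drop the part after # before fetching, since the fragment only selects a position inside the page and is not needed to retrieve it. The helper name strip_fragment is hypothetical; urldefrag is the standard-library function that does the splitting.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return `url` without its trailing #fragment, if it has one."""
    # urldefrag returns a (url, fragment) named tuple.
    return urldefrag(url).url


print(strip_fragment("http://deerchao.net/tutorials/regex/regex.htm#mission"))
# http://deerchao.net/tutorials/regex/regex.htm
```

The cleaned URL can then be passed to urlopen as usual.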

With a few improvements:

import urllib2
import re

def downURL(url, filepath):
    try:
        fp = urllib2.urlopen(url)
    except Exception:
        print 'cannot open the URL:'
        print url
        return 0
    #pattern=re.compile(r'http://hiphotos\.baidu\.com/bluemonster/pic/item/.*\.jpg')
    pattern = re.compile(r'http://hiphotos\.baidu\.com/bluemonster/pic/item/\w*\.jpg')
    # Build the regular expression object. In the commented-out version,
    # .* means any character repeated any number of times.
    # read() with no argument returns the whole page, so one call is enough
    s = fp.read()
    fp.close()
    # Find every string that matches pattern; the result is returned as a list
    urls = pattern.findall(s)
    cnt = 1
    for item in urls:
        print item
        try:
            fp0 = urllib2.urlopen(item)
        except Exception:
            print 'cannot open the URL:'
            print item
            return 0
        # '\\' uses the escape character: it stands for a single backslash
        filename = filepath + '\\' + str(cnt) + '.jpg'
        op0 = open(filename, 'wb')
        while 1:
            s0 = fp0.read()
            if not s0:
                break
            op0.write(s0)
        fp0.close()
        op0.close()
        cnt = cnt + 1
    return 1

downURL('http://tieba.baidu.com/f?kz=861150368', r'C:\123')

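After running the script, every image on the page http://tieba.baidu.com/f?kz=861150368 whose address matches the regular expression http://hiphotos.baidu.com/bluemonster/pic/item/.*.jpg is downloaded to the directory 123 on drive C and named 1.jpg, 2.jpg, and so on. (Baidu Tieba displays images by reading external link addresses.) The folder 123 must be created beforehand, because the script above cannot create folders.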
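The requirement to create the folder by hand could be lifted with os.makedirs. The following is a small Python 3 sketch, not part of the original script; the helper name image_path and its ext parameter are assumptions made for illustration.

```python
import os


def image_path(folder, index, ext=".jpg"):
    """Create `folder` if it is missing and return the path of file number `index`."""
    # exist_ok=True makes the call a no-op when the folder already exists.
    os.makedirs(folder, exist_ok=True)
    # os.path.join inserts the right separator for the current platform,
    # so no hand-written '\\' is needed.
    return os.path.join(folder, str(index) + ext)


# Example: image_path(r"C:\123", 1) creates C:\123 if needed
# and returns the path of 1.jpg inside it.
```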

Today, while downloading course slides (http://grid.hust.edu.cn/zyshao/OSEngineering.htm), I modified the script above slightly:

import urllib2
import re

def downURL(url, filepath):
    try:
        fp = urllib2.urlopen(url)
    except Exception:
        print 'cannot open the URL:'
        print url
        return 0
    # The links in this page are relative, so the pattern has no host part
    pattern = re.compile(r'Teaching_Material/OSEngineering/Chapter\d\.pdf')
    # read() with no argument returns the whole page, so one call is enough
    s = fp.read()
    fp.close()
    urls = pattern.findall(s)
    print urls
    cnt = 1
    for item in urls:
        # Prepend the page's directory to turn the relative link into a full URL
        append_item = 'http://grid.hust.edu.cn/zyshao/' + item
        print append_item
        try:
            fp0 = urllib2.urlopen(append_item)
        except Exception:
            print 'cannot open the URL:'
            print append_item
            return 0
        filename = filepath + '\\' + 'Chapter' + str(cnt) + '.pdf'
        op0 = open(filename, 'wb')
        while 1:
            s0 = fp0.read()
            if not s0:
                break
            op0.write(s0)
        fp0.close()
        op0.close()
        cnt = cnt + 1
    return 1

downURL('http://grid.hust.edu.cn/zyshao/OSEngineering.htm', r'C:\123')

Only one thing deserves attention: the links in an HTML file may be relative paths. For example, I expected the PDF links in http://grid.hust.edu.cn/zyshao/OSEngineering.htm to match the regular expression http://grid.hust.edu.cn/zyshao/Teaching_Material/OSEngineering/Chapter\d.pdf, but nothing in the page matched it. It turned out that the page uses relative paths, so only Teaching_Material/OSEngineering/Chapter\d.pdf appears in it.
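Instead of concatenating the host prefix by hand, a relative link can be resolved against the address of the page it came from. Here is a minimal Python 3 sketch using urljoin from the standard library; the helper name absolute_links is hypothetical, and the file name Chapter1.pdf is only an illustrative value that fits the pattern above.

```python
from urllib.parse import urljoin


def absolute_links(page_url, links):
    """Resolve each link in `links` against the URL of the page it was found on."""
    # urljoin leaves links that are already absolute unchanged.
    return [urljoin(page_url, link) for link in links]


page = "http://grid.hust.edu.cn/zyshao/OSEngineering.htm"
found = ["Teaching_Material/OSEngineering/Chapter1.pdf"]
print(absolute_links(page, found))
# ['http://grid.hust.edu.cn/zyshao/Teaching_Material/OSEngineering/Chapter1.pdf']
```

This also copes with pages that mix relative and absolute links, which a fixed prefix does not.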

An introduction to commonly used regular expressions

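Table 1. Common metacharacters

Code    Description
.       Matches any character except a newline
\w      Matches a letter, digit, underscore, or Chinese character
\s      Matches any whitespace character
\d      Matches a digit
\b      Matches the beginning or end of a word
^       Matches the beginning of the string
$       Matches the end of the string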
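The metacharacters in Table 1 can be tried out directly with Python's re module. The sample strings below are made up for illustration, and the snippet is written for Python 3, where \w also matches Chinese characters in ordinary (str) patterns.

```python
import re

# "." matches any character except a newline
print(re.findall(r".", "a\nb"))                    # ['a', 'b']
# "\w" matches word characters; "\w+" grabs runs of them
print(re.findall(r"\w+", "a_1 b-2"))               # ['a_1', 'b', '2']
print(bool(re.fullmatch(r"\w+", "汉字")))           # True
# "\s" matches whitespace, here a space and a tab
print(re.split(r"\s", "a b\tc"))                   # ['a', 'b', 'c']
# "\d" matches a single digit
print(re.findall(r"\d", "room 42"))                # ['4', '2']
# "\b" matches a word boundary, so the "cat" inside "concat" is skipped
print(re.findall(r"\bcat\b", "cat concat cat."))   # ['cat', 'cat']
# "^" and "$" anchor the match to the start and end of the string
print(bool(re.search(r"^abc$", "abc")))            # True
print(bool(re.search(r"^b", "abc")))               # False
```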

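Table 2. Common quantifiers

Code/Syntax    Description
*              Repeats zero or more times
+              Repeats one or more times
?              Repeats zero or one time
{n}            Repeats n times
{n,}           Repeats n or more times
{n,m}          Repeats n to m times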
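A short Python 3 sketch of the quantifiers in Table 2 follows. The helper full and all sample strings are my own. The last part illustrates why the image script uses \w* rather than .* in its pattern: .* is greedy and can swallow everything between the first address and the last .jpg on a line. The file names aaa.jpg and bbb.jpg are made up.

```python
import re


def full(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


print(full(r"ab*", "a"), full(r"ab*", "abbb"))                 # True True
print(full(r"ab+", "a"), full(r"ab+", "abbb"))                 # False True
print(full(r"colou?r", "color"), full(r"colou?r", "colour"))   # True True
print(full(r"\d{3}", "123"), full(r"\d{3}", "1234"))           # True False
print(full(r"\d{2,}", "1"), full(r"\d{2,}", "12345"))          # False True
print(full(r"\d{2,4}", "123"), full(r"\d{2,4}", "12345"))      # True False

# Greedy .* versus the narrower \w* on one line holding two image addresses
BASE = r"http://hiphotos\.baidu\.com/bluemonster/pic/item/"
html = ('<img src="http://hiphotos.baidu.com/bluemonster/pic/item/aaa.jpg"> '
        '<img src="http://hiphotos.baidu.com/bluemonster/pic/item/bbb.jpg">')
greedy = re.findall(BASE + r".*\.jpg", html)
strict = re.findall(BASE + r"\w*\.jpg", html)
print(len(greedy), len(strict))                                # 1 2
```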

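Suppose you want to match the following three addresses

The regular expression should be

\d* means any number of digits. \? means the literal character ? (because ? is a quantifier, matching a literal ? requires the escape character).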
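To illustrate \d* and \? together, here is a Python 3 sketch built on the Tieba address used earlier in this article. The pattern is my own guess at the kind of expression being described, and the second and third addresses are made up.

```python
import re

# \? matches a literal question mark; \d* matches any number of digits
pattern = re.compile(r"http://tieba\.baidu\.com/f\?kz=\d*")

urls = [
    "http://tieba.baidu.com/f?kz=861150368",  # address used earlier in the article
    "http://tieba.baidu.com/f?kz=1",          # made-up example
    "http://tieba.baidu.com/f?kz=12345",      # made-up example
]
for url in urls:
    print(url, bool(pattern.fullmatch(url)))  # each line ends with True

# Without the backslash, "f?" means "an optional f", so the literal ? in
# the address is never matched and the whole match fails.
unescaped = re.compile(r"http://tieba\.baidu\.com/f?kz=\d*")
print(bool(unescaped.fullmatch(urls[0])))     # False
```

Since \d* also allows zero digits, this pattern would accept an address that ends right after kz= as well. Use \d+ if at least one digit is required.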

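Regex Tester is a very good tool for testing regular expressions (I have uploaded it to ishare). It was designed for .NET, but the regular expression rules of .NET and Python do not seem to differ much.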

The htm file has also been uploaded to ishare, in case the website goes down and the page can no longer be found.

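Batch downloading album photos from Xiaonei (校内) is really the same as downloading from Baidu Tieba above, except that you first have to log in to Xiaonei, otherwise the albums are not visible. I did not figure that out today and will get to it when I have time.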
