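Reposted from: http://tech.it168.com/a2009/0707/601/000000601879_1.shtml
2. Extracting Images from an HTML Document
When processing HTML documents, we often need to extract all the images they reference. The HTMLParser module makes this task straightforward. First, define a new class derived from HTMLParser that overrides the handle_starttag() method; the override looks for img tags and saves the file that each src attribute points to.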
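# Python 2 code: the HTMLParser module and urllib.urlopen were reorganized in Python 3
import HTMLParser
import urllib

urlString = "http://www.python.org"

# Download the image at addr (the complete example further down also saves it to disk)
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    # Called by the parser once for every opening tag it encounters
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    # Plain concatenation: assumes src is a path on the same site
                    getImage(urlString + "/" + value)

# Create the parser, fetch the page, and hand its contents to the parser
lParser = parseImages()
u = urllib.urlopen(urlString)
lParser.feed(u.read())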
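To try the handle_starttag() idea without touching the network, the following minimal sketch records each src value instead of downloading it. It is written for Python 3, where the module is named html.parser; that port, the class name ImageSrcCollector, the helper collect_image_sources, and the sample markup are all assumptions made for illustration and do not come from the original article.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Hypothetical helper: records the src attribute of every img tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def collect_image_sources(html_text):
    """Parse html_text and return the img src values in document order."""
    parser = ImageSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    # Sample markup invented for this sketch.
    sample = '<p><img src="logo.gif"><IMG SRC="images/trans.gif" alt=""></p>'
    print(collect_image_sources(sample))
```

Collecting the values first and downloading later keeps the parsing step easy to test on its own.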
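Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then open the HTML document with urllib.urlopen(url) and read its contents.
To parse the HTML content and process the images it contains, pass the data to the parser object with feed(data). The feed() method accepts the data and parses it, calling the handler methods defined on the subclass as it goes. Here is a complete example: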
import HTMLParser
import urllib

urlString = "http://www.python.org"

# Save the image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    # Use the last path segment of the address as the local file name
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

# Define the HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

# Create an instance of the HTML parser
lParser = parseImages()

# Open the HTML file
u = urllib.urlopen(urlString)
print "Opening URL\n===================="
print u.info()

# Pass the HTML file to the parser
lParser.feed(u.read())
lParser.close()
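A note on feed(): the standard library documentation states that incomplete data is buffered until more data is fed or close() is called, so the document does not have to arrive in a single call. The sketch below illustrates this by feeding a string in small pieces. It assumes Python 3's html.parser; the class ChunkedImageParser, the function parse_in_chunks, and the default chunk size are hypothetical names and values chosen for this illustration.

```python
from html.parser import HTMLParser


class ChunkedImageParser(HTMLParser):
    """Hypothetical parser that remembers the src of each img tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def parse_in_chunks(html_text, chunk_size=8):
    """Feed html_text to the parser chunk_size characters at a time."""
    parser = ChunkedImageParser()
    for start in range(0, len(html_text), chunk_size):
        # A tag split across two chunks is held back until it is complete.
        parser.feed(html_text[start:start + chunk_size])
    # close() forces processing of anything still buffered.
    parser.close()
    return parser.sources


if __name__ == "__main__":
    page = '<html><body><img src="python-logo.gif"><img src="trans.gif"></body></html>'
    print(parse_in_chunks(page, chunk_size=5))
```

The same loop shape could be used with data read from a response object in blocks rather than with a single read() call.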
Running the complete image-download example against http://www.python.org produces the following output:
Opening URL
====================
Date: Fri, 26 Jun 2009 10:54:49 GMT
Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
Last-Modified: Thu, 25 Jun 2009 09:44:54 GMT
ETag: "105800d-46e7-46d29136f7180"
Accept-Ranges: bytes
Content-Length: 18151
Connection: close
Content-Type: text/html
Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving afnic.fr.png
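In the output, trans.gif is saved twice, presumably because more than one img tag on the page points to it, and the example builds every address by plain string concatenation. One possible refinement, sketched here in Python 3 and assuming repeated images should be fetched only once, is to resolve each src with urllib.parse.urljoin and skip addresses already seen. The function name unique_image_urls and the sample src values are hypothetical.

```python
from urllib.parse import urljoin


def unique_image_urls(base_url, src_values):
    """Resolve each src against base_url and drop repeats, keeping order."""
    seen = set()
    result = []
    for src in src_values:
        # urljoin handles relative paths, root-relative paths, and full URLs.
        absolute = urljoin(base_url, src)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


if __name__ == "__main__":
    # Sample src values invented for this sketch.
    srcs = ["/images/python-logo.gif", "/images/trans.gif", "/images/trans.gif"]
    for url in unique_image_urls("http://www.python.org", srcs):
        print(url)
```

Each address in the returned list could then be passed to a download function such as getImage().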