Python practice - scraping images and data from a web page

Latest recommended article published on 2024-04-17 22:34:13

weixin_30809333


Tags: python

Original link: http://www.cnblogs.com/uwebs/archive/2009/04/30/1446797.html


This script mainly relies on SGMLParser (from the sgmllib module) and the urllib module.

#!/usr/bin/env python2
# getimg.py
# Python 2 only: sgmllib and urllib.urlopen were removed in Python 3.
import sys
import urllib
from sgmllib import SGMLParser
from urlparse import urljoin

# Encoding used to re-encode titles for the console and the output file.
fs_encoding = sys.getfilesystemencoding()


class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.is_content = False  # True while inside <div class="posttitle">
        self.titles = []
        self.imgs = []

    def start_div(self, attrs):
        # attrs is a list of (name, value) pairs.
        classes = [v for k, v in attrs if k == 'class']
        if classes and classes[0] == 'posttitle':
            self.is_content = True

    def end_div(self):
        self.is_content = False

    def start_img(self, attrs):
        # Collect the src attribute of every <img> tag.
        self.imgs.extend(v for k, v in attrs if k == 'src')

    def handle_data(self, text):
        if self.is_content:
            # The page is assumed to be UTF-8; re-encode for the local system.
            text = text.decode('UTF-8').encode(fs_encoding)
            self.titles.append(text)


if __name__ == "__main__":
    u = 'http://www.cnblogs.com'
    usock = urllib.urlopen(u)
    parser = URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    # Binary mode so that '\r\n' is written exactly as given.
    f = open('result.txt', 'wb')
    for title in parser.titles:
        print title
        f.write(title + '\r\n')
    f.close()

    for img in parser.imgs:
        # urljoin resolves relative paths and leaves absolute URLs unchanged.
        urllib.urlretrieve(urljoin(u, img), 'd:/tmp/' + img.split('/')[-1])

Running getimg.py saves the post titles to result.txt in the current directory.
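Since sgmllib is no longer available in Python 3, the title-collecting step could be sketched there with the standard html.parser.HTMLParser instead. This is only a sketch under assumptions, not the author's code: TitleLister, save_titles, and the sample HTML string are hypothetical names introduced for illustration, and whitespace-only text is skipped, which getimg.py does not do.

```python
from html.parser import HTMLParser


class TitleLister(HTMLParser):
    """Collect the text found inside <div class="posttitle"> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "div" and dict(attrs).get("class") == "posttitle":
            self.in_title = True

    def handle_endtag(self, tag):
        # Any closing </div> ends the title region (nested divs are not tracked).
        if tag == "div":
            self.in_title = False

    def handle_data(self, data):
        # Assumption: whitespace-only text nodes are not worth keeping.
        if self.in_title and data.strip():
            self.titles.append(data.strip())


def save_titles(titles, path="result.txt"):
    # An explicit encoding avoids depending on the platform default.
    with open(path, "w", encoding="utf-8") as f:
        for title in titles:
            f.write(title + "\n")


# Hypothetical sample markup, parsed offline instead of fetching a page.
sample = '<div class="posttitle"><a href="/p/1">First post</a></div><div>other</div>'
parser = TitleLister()
parser.feed(sample)
parser.close()
print(parser.titles)  # ['First post']
```

Calling save_titles(parser.titles) would then write one title per line to result.txt. Python 3 strings are already Unicode, so the decode/encode round trip from the Python 2 script is not needed here.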

getimg.py saves all downloaded images to the d:/tmp/ directory, which must already exist.
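The image step comes down to two things: turning each src value into an absolute URL, and choosing a local file name from its last path segment. A minimal Python 3 sketch of that idea follows. ImageLister, download_plan, the sample markup, and the "tmp" directory are hypothetical names chosen for illustration, and the code only computes what would be downloaded rather than touching the network.

```python
import os
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ImageLister(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.imgs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.imgs.extend(v for k, v in attrs if k == "src" and v)


def download_plan(page_url, srcs, dest_dir):
    """Return (absolute_url, local_path) pairs without downloading anything."""
    plan = []
    for src in srcs:
        # urljoin resolves relative paths and leaves absolute URLs unchanged.
        absolute = urljoin(page_url, src)
        # Use only the URL path, so a query string never ends up in the name.
        name = posixpath.basename(urlparse(absolute).path) or "index"
        plan.append((absolute, os.path.join(dest_dir, name)))
    return plan


# Hypothetical sample markup with one relative and one absolute image URL.
sample = '<img src="/images/logo.gif"><img src="http://example.com/a/b.png?x=1">'
parser = ImageLister()
parser.feed(sample)
parser.close()
for url, path in download_plan("http://www.cnblogs.com", parser.imgs, "tmp"):
    print(url, "->", path)
    # To actually fetch the file: urllib.request.urlretrieve(url, path)
```

Compared with testing whether the src starts with 'http://', urljoin also copes with https URLs and with paths that lack a leading slash.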

Reposted from: https://www.cnblogs.com/uwebs/archive/2009/04/30/1446797.html
