Parsing HTML with Python

iteye_8719

Category: python. Tags: python, parsing HTML

Article link: https://blog.csdn.net/iteye_8719/article/details/82331984


try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> also matches.
        if tag == "a":
            for (variable, value) in attrs:
                if variable == "href":
                    self.links.append(value)

if __name__ == "__main__":
    html_code = """
    <a href="www.google.com"> google.com</a>
    <A Href="www.pythonclub.org"> PythonClub </a>
    <A HREF = "www.sina.com.cn"> Sina </a>
    """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)

Here is some related material from someone else's blog that looks good. Noting it here for reference:
http://www.lovelucy.info/python-crawl-pages.html

I have not verified myself that the code below runs correctly.


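# Python 2 only: urllib2 and sgmllib were removed in Python 3.
import urllib2
from sgmllib import SGMLParser

class ListName(SGMLParser):
    """Collect the text found inside <h4> elements."""

    def __init__(self):
        SGMLParser.__init__(self)
        self.is_h4 = False  # True while the parser is inside an <h4> element
        self.name = []

    def start_h4(self, attrs):
        self.is_h4 = True

    def end_h4(self):
        self.is_h4 = False

    def handle_data(self, text):
        if self.is_h4:
            self.name.append(text)

# Download the page and feed the raw bytes to the parser.
content = urllib2.urlopen('http://list.taobao.com/browse/cat-0.htm').read()
listname = ListName()
listname.feed(content)
for item in listname.name:
    # The code assumes the page is GBK-encoded and re-encodes each name as UTF-8.
    print item.decode('gbk').encode('utf8')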
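
Since sgmllib and urllib2 are not available in Python 3, the same idea can probably be expressed with the standard html.parser module. What follows is only a minimal sketch under stated assumptions, not the code from the linked blog: the class name H4TextParser, the helper extract_h4_text, and the sample HTML string are all made up for illustration, and the download step is left out so the example runs offline.

```python
from html.parser import HTMLParser


class H4TextParser(HTMLParser):
    """Hypothetical Python 3 counterpart of ListName: collects text inside <h4>."""

    def __init__(self):
        super().__init__()
        self.in_h4 = False  # True while the parser is inside an <h4> element
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <H4> is handled as well.
        if tag == "h4":
            self.in_h4 = True

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False

    def handle_data(self, data):
        # Unlike the Python 2 version, whitespace-only text is skipped
        # and surrounding whitespace is stripped.
        if self.in_h4 and data.strip():
            self.names.append(data.strip())


def extract_h4_text(html):
    """Return the list of text fragments found inside <h4> elements."""
    parser = H4TextParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Made-up sample input; a real page would first be downloaded and decoded to str.
sample_html = "<h4>Books</h4><p>ignored</p><H4> Music </H4>"
print(extract_h4_text(sample_html))
```

To use this on a live page in Python 3, the bytes returned by urllib.request.urlopen(...).read() would need to be decoded to str (the original code suggests GBK for that page) before being passed to feed.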
