例子:解析网页中的连接地址:
from html.parser import HTMLParser
page='''<a href="http://click.union.360buy.com/JdClick/?unionId=75" class="f1" style="padding-left:13px; padding-right:14px">京东商城</a></td><td><a href="http://www.letao.com/?source=hao123" class="f1">乐淘网上鞋城</a></td><td><a href="http://www.lashou.com/cl_today/w_3001" class="f2">拉手网团购</a></td><td><a href="http://www.amazon.cn/?tag=2009hao123famousdaohang" class="f2">卓越网上购物</a></td><td><a href="http://www.vancl.com/?source=hao123mp" class="f1">凡客诚品购物</a></td><td><a href="http://reg.jiayuan.com/st/?id=3237&url=/st/main.php" class="f1">世纪佳缘交友</a>'''
class hp(HTMLParser):
def handle_starttag(self,tag,attr):
if tag=='a':
for (att,value) in attr:
if att=='href':
print(value)
yk=hp()
yk.feed(page)
返回结果是:
http://click.union.360buy.com/JdClick/?unionId=75
http://www.letao.com/?source=hao123
http://www.lashou.com/cl_today/w_3001
http://www.amazon.cn/?tag=2009hao123famousdaohang
http://www.vancl.com/?source=hao123mp
http://reg.jiayuan.com/st/?id=3237&url=/st/main.php
--------------------------------------------------------------------------------------------------------------------------------
python 官网文档的例子:
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
def handle_starttag(self, tag, attrs):
print("Encountered a {} start tag".format(tag))
def handle_endtag(self, tag):
print("Encountered a {} end tag".format(tag))
page = """<html><h1>Title</h1><p>I'm a paragraph!</p></html>"""
myparser = MyHTMLParser()
myparser.feed(page)
Encountered a html start tag
Encountered a h1 start tag
Encountered a h1 end tag
Encountered a p start tag
Encountered a p end tag
Encountered a html end tag
---------------------------------------------------------------------------
显示html中<a></a>之间的文字例子:
from html.parser import HTMLParser
page='''<sada>啊阿什顿啊</sada><a href="http://click.union.360buy.com/JdClick/?unionId=75" class="f1" style="padding-left:13px; padding-right:14px">京东商城</a></td><td><a href="http://www.letao.com/?source=hao123" class="f1">乐淘网上鞋城</a></td><td><a href="http://www.lashou.com/cl_today/w_3001" class="f2">拉手网团购</a></td><td><a href="http://www.amazon.cn/?tag=2009hao123famousdaohang" class="f2">卓越网上购物</a></td><td><a href="http://www.vancl.com/?source=hao123mp" class="f1">凡客诚品购物</a></td><td><a href="http://reg.jiayuan.com/st/?id=3237&url=/st/main.php" class="f1">世纪佳缘交友</a>'''
class hp(HTMLParser):
a_txt=False
def handle_starttag(self,tag,attr):
if tag=='a':
self.a_txt=True
#print (dict(attr))
def handle_endtag(self,tag):
if tag=='a':
self.a_txt=False
def handle_data(self,data):
if self.a_txt:
print (data)
yk=hp()
yk.feed(page)
yk.close()
运行后:
京东商城
乐淘网上鞋城
拉手网团购
卓越网上购物
凡客诚品购物
世纪佳缘交友