HTMLParser in Python

As the name suggests, HTMLParser is used to parse HTML files. Python 2 has two
classes with that name. One is the HTMLParser class defined in the htmllib
module (htmllib.HTMLParser); the other is the HTMLParser class defined in the
HTMLParser module (HTMLParser.HTMLParser). Let's look at them separately.

htmllib.HTMLParser

The htmllib module has been deprecated since Python 2.6 and was removed in
Python 3, but it is still worth knowing how it works. This parser is not
directly concerned with I/O: it must be provided with input in string form via
a method, and it makes calls to methods of a "formatter" object in order to
produce output. So to instantiate it, you have to write something like this:

>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>

This is annoying. All you want to do is parse an HTML file, but now you also
have to know about unrelated things such as formatters and I/O streams.

HTMLParser.HTMLParser

In Python 3 this module is renamed to html.parser. It does the same job as
htmllib.HTMLParser, but you do not need to import modules like formatter and
cStringIO. For more information, see:

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser

Here is a brief introduction to this module. The sample code further below
uses it, and you will notice that it needs neither a formatter class nor a
string I/O class.
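To make the callback model concrete, here is a minimal sketch that uses the Python 3 name of the module, html.parser. The class name EventLogger and the sample markup are assumptions made up for illustration; only HTMLParser and its handle_* methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parse event so the order of callbacks is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log short
        if data.strip():
            self.events.append(("data", data.strip()))


logger = EventLogger()
# Input is passed as a plain string; no formatter or StringIO is involved
logger.feed('<p>Visit <A HREF="/docs">the docs</A></p>')
logger.close()
for event in logger.events:
    print(event)
```

The parser itself produces no output. It only calls the handler methods, so whatever you want to keep has to be stored by your subclass.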

There is one more difference. htmllib.HTMLParser had the following two
methods:

HTMLParser.anchor_bgn(href, name, type)

This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()

This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
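The behaviour described for these two methods can be approximated on top of Python 3's html.parser. The sketch below is not the htmllib implementation: the class AnchorParser, its text list and the _in_anchor flag are hypothetical names, and the "[n]" marker format is an assumption based on the description above.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Hypothetical stand-in for the anchor hooks described above."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []      # collected href values
        self.text = []            # visible text plus footnote markers
        self._in_anchor = False   # True while inside an <a> that has an href

    def anchor_bgn(self, href, name, type):
        # Parameter names mirror the documented signature, hence "type"
        self._in_anchor = bool(href)
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Footnote marker: 1-based index of the link just recorded
        if self._in_anchor:
            self.text.append("[%d]" % len(self.anchorlist))
            self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            self.anchor_bgn(attrs.get("href"), attrs.get("name"), attrs.get("type"))

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchor_end()

    def handle_data(self, data):
        self.text.append(data)


parser = AnchorParser()
parser.feed('<a href="http://example.com/">home</a> and <a href="/duty/">duty</a>')
parser.close()
print(parser.anchorlist)
print("".join(parser.text))
```

Each footnote marker is the position of the link in anchorlist, so the text and the list can be read side by side.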

With these two methods, htmllib.HTMLParser can easily retrieve the URL links
from an HTML file. For example:

>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     parser.feed(open(file).read())
...     parser.close()
...     return parser.anchorlist
...
>>> file = '/tmp/a.ttt'
>>> parseAndGetLinks()

['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

HTMLParser.HTMLParser does not have these two methods, but that does not
matter: we can define our own.

 1 >>> from HTMLParser import HTMLParser
 2 >>> class myHtmlParser(HTMLParser):
 3 ...     def __init__(self):
 4 ...         HTMLParser.__init__(self)
 5 ...         self.anchorlist = []
 6 ...     def handle_starttag(self, tag, attrs):
 7 ...         if tag == 'a' or tag == 'A':
 8 ...             for t in attrs:
 9 ...                 if t[0] == 'href' or t[0] == 'HREF':
10 ...                     self.anchorlist.append(t[1])
11 ...
12 >>> file = '/tmp/a.ttt'
13 >>> parser = myHtmlParser()
14 >>> parser.feed(open(file).read())
15 >>> parser.anchorlist
16 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
17 >>>

Let's look at the second listing.

Lines 3 to 5 override the __init__ method. The point of the override is to add
a new attribute, anchorlist, to our instance.

Lines 6 to 10 override the handle_starttag method. It first checks what the
tag is. If it is 'a', it loops over the tag's attributes, finds the href
attribute, and appends its value to anchorlist. HTMLParser converts tag and
attribute names to lower case before calling handle_starttag, so the extra
checks against 'A' and 'HREF' are harmless but never match.

That's all.
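For readers on Python 3, where the module is named html.parser, the same idea can be sketched as follows. The names LinkCollector, get_links and the sample string are assumptions for illustration; the sample stands in for the contents of /tmp/a.ttt, which is not reproduced here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, like anchorlist above."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are already lowercased by the parser
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value is not None:
                self.anchorlist.append(value)


def get_links(html_text):
    """Parse a string of HTML and return the list of link targets."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.anchorlist


# Hypothetical sample input
sample = (
    '<A HREF="http://news.baidu.com">news</A> '
    '<a name="top">top</a> '
    '<a href="/duty/">duty</a>'
)
print(get_links(sample))
```

Anchors without an href, such as the named anchor in the sample, are skipped, so the result contains only real link targets.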
