HTMLParser in Python


weixin_39718460


As the name suggests, HTMLParser is used to parse HTML files. In Python 2 there are two classes named HTMLParser: one defined in the htmllib module (htmllib.HTMLParser) and one defined in the HTMLParser module (HTMLParser.HTMLParser). Let's look at them separately.

htmllib.HTMLParser

This class has been deprecated since Python 2.6, and the htmllib module was removed in Python 3. Still, it is worth knowing a little about it. This parser is not directly concerned with I/O: it must be provided with input in string form via a method, and it makes calls to methods of a "formatter" object in order to produce output. So you have to instantiate it as follows.

>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>

This is quite annoying. All you want to do is parse an HTML file, yet you have to know about many other things, such as formatters and I/O streams.

HTMLParser.HTMLParser

In Python 3 this module is renamed to html.parser. It serves the same purpose as htmllib.HTMLParser. The good thing is that you do not need to import modules like formatter and cStringIO. For more information, see:

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser
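Because of the rename, code that has to run on both Python 2 and Python 3 often wraps the import in a try/except. The following is a minimal sketch of that common idiom; it is not taken from the original post, and the one-line HTML string is only a placeholder.

```python
# Import HTMLParser under whichever name the running interpreter provides.
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module name

# No formatter or StringIO object is needed to create the parser.
parser = HTMLParser()
parser.feed("<p>hello</p>")  # placeholder input; the base class ignores every event
parser.close()
```

The base class does nothing with the events it sees, so useful work only happens once you subclass it and override the handler methods.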

Here is a brief introduction to this module. The sample code further below uses it; you will notice that it needs neither a formatter class nor a string I/O class.
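To make the basic workflow concrete, here is a small sketch written for Python 3 (where the module is html.parser). The class name EventLogger and the sample HTML string are made up for illustration; only HTMLParser and its handle_starttag, handle_endtag, handle_data, feed and close methods come from the standard library.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class EventLogger(HTMLParser):
    """Hypothetical helper that records every parsing event in a list."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip runs of text that contain only whitespace
        if data.strip():
            self.events.append(("data", data))


logger = EventLogger()
logger.feed('<p>Hello <a href="/x">link</a></p>')  # input is a plain string
logger.close()
for event in logger.events:
    print(event)
```

The parser is fed a string directly and reports what it finds by calling the handler methods, which is why no formatter or writer object appears anywhere.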

One more point: htmllib.HTMLParser had the following two methods.

HTMLParser.anchor_bgn(href, name, type)

This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()

This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().
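The behavior described for anchor_bgn() and anchor_end() can be imitated on top of html.parser in Python 3. The sketch below is only an approximation of the description above, not htmllib's actual code: the class name AnchorFootnoter, the "[n]" marker format and the sample HTML are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class AnchorFootnoter(HTMLParser):
    """Hypothetical class imitating the described anchor_bgn()/anchor_end()."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []     # hyperlinks in document order
        self.text = []           # output text, including footnote markers
        self._in_anchor = False  # True while inside an <a href=...> region

    def handle_starttag(self, tag, attrs):
        # Plays the role of anchor_bgn(): remember the link target.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.anchorlist.append(href)
                self._in_anchor = True

    def handle_data(self, data):
        self.text.append(data)

    def handle_endtag(self, tag):
        # Plays the role of anchor_end(): emit a marker that indexes anchorlist.
        if tag == "a" and self._in_anchor:
            self.text.append("[%d]" % len(self.anchorlist))  # assumed marker format
            self._in_anchor = False


p = AnchorFootnoter()
p.feed('See <a href="http://example.com/">this page</a> and <a href="/faq">the FAQ</a>.')
p.close()
print("".join(p.text))  # See this page[1] and the FAQ[2].
print(p.anchorlist)     # ['http://example.com/', '/faq']
```

Each marker is the 1-based position of the link in anchorlist, so the text and the list can be read together like footnotes.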

With these two methods, htmllib.HTMLParser can easily retrieve the URL links from an HTML file. For example:

>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     parser.feed(open(file).read())
...     parser.close()
...     return parser.anchorlist
...
>>> file = '/tmp/a.ttt'
>>> parseAndGetLinks()

['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']

HTMLParser.HTMLParser does not have these two methods, but that does not matter: we can define our own.

 1 >>> from HTMLParser import HTMLParser
 2 >>> class myHtmlParser(HTMLParser):
 3 ...     def __init__(self):
 4 ...         HTMLParser.__init__(self)
 5 ...         self.anchorlist = []
 6 ...     def handle_starttag(self, tag, attrs):
 7 ...         if tag == 'a' or tag == 'A':
 8 ...             for t in attrs:
 9 ...                 if t[0] == 'href' or t[0] == 'HREF':
10 ...                     self.anchorlist.append(t[1])
11 ...
12 >>> file = '/tmp/a.ttt'
13 >>> parser = myHtmlParser()
14 >>> parser.feed(open(file).read())
15 >>> parser.anchorlist
16 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
17 >>>

Let's look at the second example more closely.

Lines 3 to 5 override the __init__ method. The point of this override is to add a new attribute, anchorlist, to our instance.

Lines 6 to 10 override the handle_starttag method. It first uses an if statement to check what the tag is. If the tag is 'a' or 'A', it loops over the tag's attributes with a for loop, picks out the href attribute, and appends its value to anchorlist.

And that's it.
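For readers on Python 3, the same idea might look like the sketch below. The names LinkCollector and get_links, and the short sample string, are made up for illustration. One simplification is possible because the parser passes tag and attribute names to handle_starttag already converted to lower case, so the extra checks for 'A' and 'HREF' are not needed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical Python 3 counterpart of myHtmlParser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive in lower case
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.anchorlist.append(value)


def get_links(html_text):
    """Return the href values of all <a> tags in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.anchorlist


# Made-up sample input; upper-case markup is handled as well.
sample = '<A HREF="http://www.baidu.com/more/">More</A> <a href="/duty/">Duty</a>'
print(get_links(sample))  # ['http://www.baidu.com/more/', '/duty/']
```

Passing the HTML in as a string, instead of reading a global file name inside the function, keeps the helper easy to test; reading a file first and handing its contents to get_links works the same way.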
