lxml.etree.HTML(text): parsing HTML documents

0. Reference

http://lxml.de/tutorial.html#the-xml-function

There is also a corresponding function HTML() for HTML literals.

>>> root = etree.HTML("<p>data</p>")
>>> etree.tostring(root)
b'<html><body><p>data</p></body></html>'

1. Basic usage

from lxml import etree
# Parses an HTML document from a string constant. Returns the root node.
# r is assumed to be a requests.Response object fetched earlier.
root = etree.HTML(r.text)  # <Element html at 0x7bb8208>
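
The snippet above depends on an HTTP response `r` that is not shown. As a minimal self-contained sketch, the same call can be made on a literal string; the markup and the name `html_text` are made up for illustration and only mimic the button used in the examples that follow.

```python
from lxml import etree

# Hypothetical markup standing in for r.text.
html_text = (
    "<html><body>"
    "<button id='btn_requests'>Requests Generator</button>"
    "</body></html>"
)

# etree.HTML() returns the root element of the parsed document.
root = etree.HTML(html_text)
print(root.tag)

# The root can then be queried, for example with XPath.
buttons = root.xpath('//button')
print(len(buttons))
```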

 

1.1 Getting text and attributes with xpath and cssselect

In [83]: for item in root.xpath('//button')[:1]:
    ...:     print(item)
    ...:     print(item.text)                           # get the text
    ...:     print(item.xpath('./@id'))
    ...:
<Element button at 0x84277c8>
Requests Generator
['btn_requests']
###
In [84]: for item in root.cssselect('button')[:1]:
    ...:     print(item)
    ...:     print(item.text)
    ...:     print(item.cssselect('::attr(id)'))        # pseudo-element syntax is not supported
    ...:
    ...:
<Element button at 0x84277c8>
Requests Generator
ExpressionError: Pseudo-elements are not supported.
###
In [92]: for item in root.cssselect('button')[:1]:
    ...:     print(item.get('id', ''))                  # get an attribute

btn_requests
###
In [93]: for item in root.cssselect('button')[:1]:
    ...:     print(item.xpath('./@id'))                 # xpath nested inside a cssselect result
    ...:
['btn_requests']
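
The sessions above can be condensed into a standalone sketch. The markup is hypothetical and only reproduces the one button seen in the output. It sticks to XPath and `Element.get()`, so it does not need the separate `cssselect` package.

```python
from lxml import etree

# Hypothetical markup reproducing the button from the sessions above.
root = etree.HTML(
    '<div><button id="btn_requests">Requests Generator</button></div>'
)

button = root.xpath('//button')[0]

text = button.text                        # text directly inside the element
ids = button.xpath('./@id')               # XPath attribute query returns a list
button_id = button.get('id', '')          # single attribute, with a default
missing = button.get('data-missing', '')  # absent attribute falls back to ''
texts = root.xpath('//button/text()')     # text nodes via XPath, also a list

print(text, ids, button_id, repr(missing), texts)
```

Note that `xpath()` always returns a list, even for a single match, while `get()` returns one value or the default.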

 

1.2 Pretty printing

print(etree.tostring(root, pretty_print=True).decode('utf-8'))      # pretty-print
# You can also serialise to a Unicode string without declaration by
# passing the ``unicode`` function as encoding (or ``str`` in Py3),
# or the name 'unicode'.  This changes the return value from a byte
# string to an unencoded unicode string.
print(etree.tostring(root, encoding=str, pretty_print=True))        # Py3: returns text (str)
print(etree.tostring(root, encoding=unicode, pretty_print=True))    # Py2: returns unicode
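
To make the difference in return types concrete, here is a small Python 3 sketch that reuses the `<p>data</p>` literal from the tutorial quote in section 0. The variable names are illustrative only.

```python
from lxml import etree

root = etree.HTML("<p>data</p>")

# Default: tostring() returns a byte string.
as_bytes = etree.tostring(root)

# encoding='unicode' (or encoding=str on Python 3) returns text instead.
as_text = etree.tostring(root, encoding='unicode')

# pretty_print=True adds indentation and line breaks for readability.
pretty = etree.tostring(root, encoding='unicode', pretty_print=True)

print(type(as_bytes).__name__, type(as_text).__name__)
print(pretty)
```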

 

1.3 Automatic completion of missing tags

In [109]: rt = etree.HTML('<html><p>123</p></html>')            # missing <body> is added automatically
In [110]: print(etree.tostring(rt, encoding=str, pretty_print=True))
<html>
  <body>
    <p>123</p>
  </body>
</html>
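
The same completion can be checked programmatically rather than by reading the printed output. A minimal sketch, using an unclosed `<p>` fragment chosen here for illustration:

```python
from lxml import etree

# A bare, unclosed fragment: no <html>, no <body>, no closing </p>.
rt = etree.HTML('<p>123')

# Walk the tree in document order to see which elements the parser created.
tags = [el.tag for el in rt.iter()]
print(tags)

# The original content is still reachable inside the completed tree.
p_text = rt.xpath('//p')[0].text
print(p_text)
```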

 

1.4 fromstring does not accept broken fragments and does not auto-complete

In [115]: rt = etree.fromstring('<html><p>456</html>')           # fromstring rejects broken fragments and does not auto-complete
XMLSyntaxError: Opening and ending tag mismatch: p line 1 and html, line 1, column 20
In [116]: rt = etree.fromstring('<html><p>456</p></html>')
In [117]: print(etree.tostring(rt, encoding=str, pretty_print=True))
<html>
  <p>456</p>
</html>
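
`etree.fromstring` uses the XML parser by default, which explains the strictness shown above. As a sketch of one way to handle broken input, the code below catches the error and then retries with an explicit `etree.HTMLParser()`. The variable names are illustrative.

```python
from lxml import etree

broken = '<html><p>456</html>'

# The default XML parser rejects the mismatched tags.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError as exc:
    strict_ok = False
    print('XML parser failed:', exc)

# Passing an HTML parser makes fromstring() tolerant of the same input.
lenient = etree.fromstring(broken, parser=etree.HTMLParser())
print(etree.tostring(lenient, encoding='unicode', pretty_print=True))
```

With the HTML parser, the result has the same completed html/body structure that `etree.HTML()` produces.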

 


Reposted from: https://www.cnblogs.com/my8100/p/parse_html_with_lxml.html
