Converting HTML text between representations with lxml:
Source code:
from lxml import etree
from lxml.html import tostring
html = '''
<html>
<body>
中文
</body>
</html>
'''
print('---------------------------------- original string --------------------------------------')
print(html)
print('-------------------------------- parsed element (for XPath) -------------------------------------')
# etree.HTML parses the string and returns the root <html> element
html = etree.HTML(html)
print(html)
print('----------------------------------- bytes -------------------------------------')
# With no encoding argument, tostring returns ASCII bytes and writes
# non-ASCII characters as numeric character references
print(tostring(html))
print('----------------------------------- utf-8 -------------------------------------')
# Pass encoding='utf-8' so the bytes carry the real characters before decoding
print(tostring(html, encoding='utf-8').decode('utf-8'))
print('----------------------------------- extracted text -------------------------------------')
# text() returns the raw text nodes, surrounding whitespace included
print(html.xpath('./body/text()'))
print('------------------------------------------------------------------------')
Output:
---------------------------------- original string --------------------------------------
<html>
<body>
中文
</body>
</html>
-------------------------------- parsed element (for XPath) -------------------------------------
<Element html at 0x210abbc9d88>
----------------------------------- bytes -------------------------------------
b'<html>\n<body>\n&#20013;&#25991;\n</body>\n</html>'
----------------------------------- utf-8 -------------------------------------
<html>
<body>
中文
</body>
</html>
----------------------------------- extracted text -------------------------------------
['\n中文\n']
------------------------------------------------------------------------
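The serialization and text-extraction steps above can also be compared side by side. The following is only a minimal sketch, assuming the same sample markup as the example. The names `doc`, `ascii_bytes`, `utf8_bytes`, `text`, `raw_nodes` and `clean_text` are illustrative and not part of the original script. It shows the three output forms `tostring` can produce, plus one way to get the body text without the surrounding newlines.

```python
from lxml import etree
from lxml.html import tostring

# Same sample markup as the example above; `doc` is an illustrative name.
doc = etree.HTML('<html><body>\n中文\n</body></html>')

# Default call: ASCII bytes, with non-ASCII characters written as
# numeric character references.
ascii_bytes = tostring(doc)

# encoding='utf-8': bytes that hold the UTF-8 encoding of the characters.
utf8_bytes = tostring(doc, encoding='utf-8')

# encoding='unicode': a str is returned, so no decode step is needed.
text = tostring(doc, encoding='unicode')

# text() yields the raw text nodes; normalize-space() trims and collapses
# the whitespace and returns a single string.
raw_nodes = doc.xpath('./body/text()')
clean_text = doc.xpath('normalize-space(./body)')

print(ascii_bytes)
print(utf8_bytes)
print(text)
print(raw_nodes, repr(clean_text))
```

If only the decoded string is needed, `encoding='unicode'` avoids the separate `decode('utf-8')` call. The exact whitespace in the serialized markup may differ between libxml2 versions.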
If you have any questions, feel free to leave a comment to discuss. ~_~