Parsing HTML with lxml: the HTML and parse approaches

Article link: https://blog.csdn.net/zdryn/article/details/105889479


Parsing HTML code with lxml:

If the code to parse is a string

Use lxml.etree.HTML to parse it, for example:

from lxml import etree

text="""
<div class="login_corp" >
<div class="Third-partyi-login">
	<a title="微信" class="login-item weixin" href="http://www.renren.com/api/jump?src=wx" id="login_weixin" stats="loginPage_weixin_link"></a>
	 <a title="QQ" class="login-item qq" href="http://www.renren.com/api/jump?src=qq" id="login_qq" stats="loginPage_qq_link"></a>
	 <a title="微博" class="login-item weibo" href="http://www.renren.com/api/jump?src=wb" id="login_weibo" stats="loginPage_weibo_link"></a>
</div>
</div>
<div class="other-login clearfix">
<div class="login-word login-item">其它账号登录：</div>
<a  title="移动" class="login-item yidong" href="https://open.mmarket.com:443/omee-aus/services/oauth/authorize?responseType=code&scope=getUserInfo&clientId=300007884008&redirectUri=http%3A%2F%2Fwww.renren.com%2Fbind%2Fcnmobile%2FloginCallBack&clientState=9" id="login_cnmobile" stats="loginPage_baidu_link"></a>
<a title="天翼" class="login-item tianyi" id="login_tianyi" href="https://oauth.api.189.cn/emp/oauth2/authorize?app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tyLoginCallBack" stats="loginPage_tianyi_link"></a>
<a title="360" class="login-item lo360" id="login_360" href="https://openapi.360.cn/oauth2/authorize?client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default" stats="loginPage_360_link"></a>
 <a title="百度" class="login-item baidu" href="https://openapi.baidu.com/oauth/2.0/authorize?response_type=code&client_id=foRRWjPq8In3SIhmKQw1Pep3&redirect_uri=http%3A%2F%2Fwww.renren.com%2Fbind%2Fbaidu%2FbaiduLoginCallBack" id="login_baidu" stats="loginPage_baidu_link"></a>
</div>
"""  # this markup has already been tidied up
html = etree.HTML(text)
print(etree.tostring(html,encoding='utf-8').decode('utf-8'))

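Look at the first line of the output and you will see that something has been added: etree.HTML wraps the fragment in <html> and <body> elements that the original string did not contain.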
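The added wrapper is easy to confirm on a smaller input. The following is a minimal sketch, assuming a short made-up fragment (the `fragment` string is illustrative, not the renren.com markup above):

```python
from lxml import etree

# Illustrative fragment with no <html> or <body> wrapper (not from the article).
fragment = '<div class="login_corp"><a title="QQ" href="#"></a></div>'

# etree.HTML returns the root element of the repaired document.
root = etree.HTML(fragment)

print(root.tag)                        # tag of the root element
print([child.tag for child in root])   # tags of its direct children

# Serialize the whole tree to see where the original <div> ended up.
serialized = etree.tostring(root, encoding='utf-8').decode('utf-8')
print(serialized)
```

The root element is `html` and its only child is `body`, so the original `<div>` now sits inside a complete document skeleton.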

When parsing an HTML file

Use lxml.etree.parse to parse it:

from lxml import etree
html = etree.parse('renren.html')
print(etree.tostring(html,encoding='utf-8').decode('utf-8'))

If this raises an error (typically an lxml.etree.XMLSyntaxError), change the code to the following:

from lxml import etree
parser = etree.HTMLParser(encoding='utf-8')  # an HTML parser we create ourselves
html = etree.parse('renren.html',parser=parser)
print(etree.tostring(html,encoding='utf-8').decode('utf-8'))

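This is because parse uses an XML parser by default. When it meets HTML that is not well-formed, parsing fails, so you have to create an HTML parser yourself and pass it in.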
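The difference between the two parsers can be reproduced without the original renren.html. The sketch below is an illustration under assumptions: the `broken_html` string and the temporary file are made up for the example, and the exact error message depends on the installed libxml2 version.

```python
import os
import tempfile
from lxml import etree

# Illustrative malformed HTML: unquoted attribute, unclosed <p> and <br>.
broken_html = '<html><body><p class=intro>hello<br><div>world</div></body></html>'

# Write it to a temporary file so etree.parse can read it from disk.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as f:
    f.write(broken_html)
    path = f.name

try:
    # 1) Default XML parser: rejects markup that is not well-formed XML.
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError as exc:
        xml_ok = False
        print('XML parser failed:', exc)

    # 2) Explicit HTML parser: tolerates and repairs the same markup.
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.parse(path, parser=parser)
    div_text = tree.find('.//div').text
    print('HTML parser found:', div_text)
finally:
    os.remove(path)  # clean up the temporary file
```

Note that etree.parse returns an ElementTree object rather than an element; call `tree.getroot()` when the root element itself is needed.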