Suppose the HTML you want to scrape is wrapped in a comment, like this:
html="""
<!-- <div class="forum_content clearfix">
<div class="main" id="content_wrap">
<div id="pagelet_frs-list/pagelet/content"></div> </div>
<div class="aside" id="aside">
<div id="pagelet_frs-aside/pagelet/aside"></div> </div>
</div>
-->
"""
If you pass this string straight to pyquery and try to select the div:
from pyquery import PyQuery as pq
response = pq(html)("div.forum_content")
print(response)
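it fails with lxml.etree.ParserError: Document is empty. The whole string is a single comment, so the parser finds no element to build a document from.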
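To see why the parser considers the document empty, it can help to count what a parser actually encounters. The following is only an illustrative sketch using the standard library's html.parser rather than lxml; the NodeCounter class and the shortened sample string are made up for this example.

```python
from html.parser import HTMLParser


class NodeCounter(HTMLParser):
    """Hypothetical helper: counts element start tags and collects comments."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Called once for every real element the parser sees.
        self.start_tags += 1

    def handle_comment(self, data):
        # Called with the raw text between "<!--" and "-->".
        self.comments.append(data)


# Shortened stand-in for the commented-out fragment above.
sample = '<!-- <div class="forum_content clearfix"><div id="aside"></div></div> -->'

counter = NodeCounter()
counter.feed(sample)
counter.close()

print(counter.start_tags)     # 0
print(len(counter.comments))  # 1
```

The divs never show up as elements; they are just text inside one comment node, which is why a selector has nothing to match until that text is parsed on its own.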
Fix: use bs4 to pull out the commented-out markup first, then parse and query that text with pyquery
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(html, 'html.parser')
res = ''.join(soup.find_all(string=lambda text: isinstance(text, Comment)))  # collect the text of every comment node
response = pq(res)("div.forum_content")
print(response)
Result: the div is now extracted as expected
<div class="forum_content clearfix">
<div class="main" id="content_wrap">
<div id="pagelet_frs-list/pagelet/content"/> </div>
<div class="aside" id="aside">
<div id="pagelet_frs-aside/pagelet/aside"/> </div>
</div>