Suppose the HTML you want is wrapped in a comment, as in the following fragment:
html="""
"""
If you pass this fragment straight to pyquery and try to select from it:
from pyquery import PyQuery as pq
response = pq(html)("div.forum_content")
print(response)
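it raises an error: lxml.etree.ParserError: Document is empty. The markup sits entirely inside a comment, so the parser finds no element to build a document from.
Fix: use bs4 to pull out the commented-out markup first, then hand that string to pyquery and select from it: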
from pyquery import PyQuery as pq
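from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(html, 'html.parser')
# Collect every comment node and join their text; find_all(string=...) replaces the deprecated findAll(text=...)
res = ''.join(soup.find_all(string=lambda text: isinstance(text, Comment)))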
response = pq(res)("div.forum_content")
print(response)
Result: the element is now extracted as expected.
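
To see why the first attempt finds nothing, here is a minimal, dependency-free sketch of the same comment-extraction step. It swaps bs4 for the standard library's html.parser.HTMLParser, so it is not the method used above, only an illustration of the idea. The sample markup, the CommentCollector class and the parse helper are all hypothetical names made up for this example, since the original fragment is not shown.

```python
from html.parser import HTMLParser

# Assumed sample markup: the target div exists only inside an HTML comment.
SAMPLE_HTML = """
<div id="wrapper">
<!--
<div class="forum_content"><p>hello</p></div>
-->
</div>
"""


class CommentCollector(HTMLParser):
    """Record the start tags the parser sees and the raw text of each comment."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)


def parse(html):
    """Feed html through a fresh CommentCollector and return it."""
    collector = CommentCollector()
    collector.feed(html)
    collector.close()
    return collector


outer = parse(SAMPLE_HTML)
print(outer.tags)  # only the wrapper div is a real element

# Join the comment bodies, then parse that text as markup in its own right.
commented = "".join(outer.comments).strip()
inner = parse(commented)
print(commented)
print(inner.tags)  # the commented-out div and p are now visible as elements
```

The first pass sees the commented-out div only as comment text, which is why a selector run on the original document comes back empty. Once the comment body is parsed on its own, the same tags show up as ordinary elements, which is what the bs4 plus pyquery approach above relies on.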