While working on a web scraping project today, I ran into a problem: pyquery could not retrieve an element.
from pyquery import PyQuery as pq
html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>TEST</title>
</head>
<body>
<div class="warp">
<ul class="goodsList">
<li>this is the test1</li>
<li>this is the test2</li>
<li>this is the test3</li>
<li>this is the test4</li>
</ul>
</div>
</body>
</html>
'''
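doc = pq(html)
element = doc('.warp ul li:first-child')
# An empty PyQuery result is falsy and prints as an empty line, so report 'None' explicitly
if element:
    print(element)
else:
    print('None')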
Output:
None
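The selector itself is correct, yet the result is always None. Why? After reading the relevant documentation I found the cause: the selector is written for plain HTML, but the string above is XHTML. When no parser is specified, pyquery first tries to parse the string as XML and falls back to the HTML parser only if that fails. This document is well-formed XML, so it is parsed as XML and every element ends up in the XHTML namespace declared by xmlns, where a selector without a namespace matches nothing. Passing parser="html" when calling pq() tells pyquery to parse the string as HTML, which resolves the problem.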
from pyquery import PyQuery as pq
html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>TEST</title>
</head>
<body>
<div class="warp">
<ul class="goodsList">
<li>this is the test1</li>
<li>this is the test2</li>
<li>this is the test3</li>
<li>this is the test4</li>
</ul>
</div>
</body>
</html>
'''
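# parser="html" forces the HTML parser, so the xmlns declaration no longer puts elements in a namespace
doc = pq(html, parser="html")
element = doc('.warp ul li:first-child')
if element:
    print(element)
else:
    print('None')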
Output:
<li>this is the test1</li>
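
To make the namespace effect easier to see, here is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in XML parser. This is an assumption made for illustration: pyquery itself is built on lxml, so the sketch shows only the general XML behavior, not pyquery's internals. The variable names and the shortened document are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A shortened version of the document above, still carrying the xmlns declaration
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="warp">
<ul class="goodsList">
<li>this is the test1</li>
<li>this is the test2</li>
</ul>
</div>
</body>
</html>'''

root = ET.fromstring(xhtml)

# An XML parser folds the default namespace into every tag name
root_tag = root.tag

# A lookup without the namespace finds nothing
plain_matches = root.findall(".//li")

# The same lookup with the namespace-qualified name finds the elements
qualified_matches = root.findall(".//{%s}li" % XHTML_NS)

print(root_tag)                                   # {http://www.w3.org/1999/xhtml}html
print(len(plain_matches), len(qualified_matches)) # 0 2
```

The tag is no longer just li but the namespace-qualified name, which is why a plain li selector comes back empty once the document has been parsed as XML. An HTML parser treats xmlns as an ordinary attribute, so the plain selector works again.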