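While fixing a bug today, I ran into a new one. When matching with XPath, narrow each match to the smallest possible scope. Otherwise an element with no data produces no text node, the match silently skips it, and you do not get an empty string "" back in its place.
Requirement: collect the parameter attribute value of each li together with its text, and the two must correspond one to one.
All of my test data had text, so I wrote the code below. Then the bug showed up, and that is how this post came about.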
from lxml import etree


def go_baidu():
    str_html = """
<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<body class="body">
<div>
<ul class="ppt">
<li id="1" parameter="BV8809">苹果</li>
<li id="2" parameter="KP6776">雪梨</li>
<li id="3" parameter="KI9980">芒果</li>
<li id="4" parameter="QW9890">西瓜</li>
<li id="5" parameter="QY7777">荔枝</li>
<li id="6" parameter="WW7877">香蕉</li>
<li id="7" parameter="WG4456">橘子</li>
<li id="8" parameter="SD1123">菠萝</li>
</ul>
</div>
<div>
<ul class="ppt">
<li id="21" parameter="WWGH345">超级大苹果</li>
<li id="22" parameter="FGHT212"></li>
<li id="23" parameter="EERT446">超级大芒果</li>
<li id="24" parameter="TTYH332"></li>
<li id="25" parameter="FVCG437"></li>
<li id="26" parameter="VCFQ353">超级大香蕉</li>
<li id="27" parameter="WERF555">超级大橘子</li>
<li id="28" parameter="RRTY676">超级大菠萝</li>
</ul>
</div>
</body>
</html>
"""
    etr = etree.HTML(str_html)
    # Select every element with class="ppt" (the two <ul> lists)
    lis = etr.xpath('//*[@class="ppt"]')
    for i in lis:
        # Each query runs over all <li> children of the <ul> at once
        print(i.xpath('li/@parameter'))
        print(i.xpath('li/text()'))


if __name__ == '__main__':
    go_baidu()
Output:
D:\Python36\python.exe C:/Users/17653/Desktop/测试11.py
['BV8809', 'KP6776', 'KI9980', 'QW9890', 'QY7777', 'WW7877', 'WG4456', 'SD1123']
['苹果', '雪梨', '芒果', '西瓜', '荔枝', '香蕉', '橘子', '菠萝']
['WWGH345', 'FGHT212', 'EERT446', 'TTYH332', 'FVCG437', 'VCFQ353', 'WERF555', 'RRTY676']
['超级大苹果', '超级大芒果', '超级大香蕉', '超级大橘子', '超级大菠萝']
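The first ul is extracted correctly, no problem there.
For the second ul, several texts are missing, so the texts no longer line up with the parameters. The cause is that the match scope is too wide: li/text() returns only the text nodes that actually exist, so an empty li contributes nothing to the list. The match should be done within a narrower scope.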
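To make the misalignment concrete, here is a minimal sketch on a made-up three-item snippet (the A1/B2/C3 values are hypothetical and not part of the data above). It shows that the text query returns fewer items than the attribute query, and that pairing the two flat lists with zip attaches a text to the wrong parameter.

```python
from lxml import etree

# Hypothetical snippet: the middle <li> has no text
root = etree.HTML(
    '<ul class="ppt">'
    '<li parameter="A1">apple</li>'
    '<li parameter="B2"></li>'
    '<li parameter="C3">mango</li>'
    '</ul>'
)

params = root.xpath('//li/@parameter')  # one entry per <li>
texts = root.xpath('//li/text()')       # one entry per existing text node only

# zip pairs by position, so 'mango' ends up next to 'B2' instead of 'C3'
pairs = list(zip(params, texts))

print(params)  # ['A1', 'B2', 'C3']
print(texts)   # ['apple', 'mango']
print(pairs)   # [('A1', 'apple'), ('B2', 'mango')]
```

Nothing raises an error here, which is why the problem is easy to miss when every test item happens to have text.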
Solution:
from lxml import etree


def go_baidu():
    str_html = """
<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<body class="body">
<div>
<ul class="ppt">
<li id="1" parameter="BV8809">苹果</li>
<li id="2" parameter="KP6776">雪梨</li>
<li id="3" parameter="KI9980">芒果</li>
<li id="4" parameter="QW9890">西瓜</li>
<li id="5" parameter="QY7777">荔枝</li>
<li id="6" parameter="WW7877">香蕉</li>
<li id="7" parameter="WG4456">橘子</li>
<li id="8" parameter="SD1123">菠萝</li>
</ul>
</div>
<div>
<ul class="ppt">
<li id="21" parameter="WWGH345">超级大苹果</li>
<li id="22" parameter="FGHT212"></li>
<li id="23" parameter="EERT446">超级大芒果</li>
<li id="24" parameter="TTYH332"></li>
<li id="25" parameter="FVCG437"></li>
<li id="26" parameter="VCFQ353">超级大香蕉</li>
<li id="27" parameter="WERF555">超级大橘子</li>
<li id="28" parameter="RRTY676">超级大菠萝</li>
</ul>
</div>
</body>
</html>
"""
    etr = etree.HTML(str_html)
    lis = etr.xpath('//*[@class="ppt"]')
    for i in lis:
        parameter_list = []
        name_list = []
        # Iterate over the <li> children and query each one on its own,
        # so every <li> adds exactly one entry to both lists
        for u in i:
            parameter_list.append(u.xpath('@parameter'))
            name_list.append(u.xpath('text()'))
        print(parameter_list)
        print(name_list)


if __name__ == '__main__':
    go_baidu()
Output:
D:\Python36\python.exe C:/Users/17653/Desktop/测试11.py
[['BV8809'], ['KP6776'], ['KI9980'], ['QW9890'], ['QY7777'], ['WW7877'], ['WG4456'], ['SD1123']]
[['苹果'], ['雪梨'], ['芒果'], ['西瓜'], ['荔枝'], ['香蕉'], ['橘子'], ['菠萝']]
[['WWGH345'], ['FGHT212'], ['EERT446'], ['TTYH332'], ['FVCG437'], ['VCFQ353'], ['WERF555'], ['RRTY676']]
[['超级大苹果'], [], ['超级大芒果'], [], [], ['超级大香蕉'], ['超级大橘子'], ['超级大菠萝']]
Process finished with exit code 0
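With the XPath match narrowed to one li at a time, no data is lost: an li without text yields an empty list [] in its own position instead of disappearing from the result, so every parameter stays aligned with its text.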
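The per-li approach returns nested lists, with [] for an empty item. If an actual empty string "" is wanted, as mentioned at the start, one possible variation is sketched below. The helper name extract_pairs and the shortened sample are my own assumptions for illustration. It relies on the XPath string(.) function, which always returns a string (empty when the element has no text), and on the element's get() method for the attribute.

```python
from lxml import etree


def extract_pairs(html):
    """Return one (parameter, text) tuple per <li>, with "" for empty items."""
    root = etree.HTML(html)
    pairs = []
    for li in root.xpath('//ul[@class="ppt"]/li'):
        # get() returns the attribute value, or the default "" if it is missing
        parameter = li.get("parameter", "")
        # string(.) yields "" for an element without text, never a missing entry
        name = li.xpath("string(.)").strip()
        pairs.append((parameter, name))
    return pairs


# Shortened sample taken from the second <ul> above
sample = (
    '<ul class="ppt">'
    '<li id="21" parameter="WWGH345">超级大苹果</li>'
    '<li id="22" parameter="FGHT212"></li>'
    '<li id="23" parameter="EERT446">超级大芒果</li>'
    '</ul>'
)

result = extract_pairs(sample)
print(result)
# [('WWGH345', '超级大苹果'), ('FGHT212', ''), ('EERT446', '超级大芒果')]
```

Keeping the parameter and its text in the same tuple means the pairing can no longer drift, whatever the items contain.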