XPath: locate a node by its text and extract the text of its sibling node
<div class="lef-bd">
<ul class="dot2">
<li>
<dl>
<dt>联系人:</dt>
<dd>李先生</dd>
</dl>
</li>
<li>
<dl>
<dt>电 话:</dt>
<dd>0371-88888888</dd>
</dl>
</li>
<li>
<dl>
<dt>手 机:</dt>
<dd>18188888888</dd>
</dl>
</li>
<li>
<dl>
<dt>传 真:</dt>
<dd>0371-88888888</dd>
</dl>
</li>
<li>
<dl>
<dt>地 址:</dt>
<dd>河南 郑州 中原区 大学科技园X座XX层</dd>
</dl>
</li>
<li>
<dl>
<dt>客 服:</dt>
<dd class="a_kf">
<a onclick="chinacn.openQQ(888888888,32045);" rel="nofollow" target="_blank" title="点击这里给我发消息" class="qqjt"></a>
</dd>
</dl>
</li>
</ul>
<div class="map_dt">
</div>
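response.xpath("//div[@class='lef-bd']//dt[contains(text(),'联系人')]/following-sibling::dd/text()").extract_first("")
Returns 李先生, the text of the dd that is a sibling of the dt containing "联系人" (contact person). Without the trailing /text(), the whole <dd>李先生</dd> element is returned instead of its text.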
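To try the label/value pairing without a live crawl, here is a minimal sketch using only the standard library. It is not the Scrapy expression above: ElementTree's XPath subset has no contains() and no following-sibling axis, so this version assumes an exact match on the dt text and goes through the parent dl. The SNIPPET string is a trimmed, well-formed copy of the sample HTML.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the contact block (values taken from the sample above).
SNIPPET = """
<div class="lef-bd">
  <ul class="dot2">
    <li><dl><dt>联系人:</dt><dd>李先生</dd></dl></li>
    <li><dl><dt>手 机:</dt><dd>18188888888</dd></dl></li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# Select the <dl> whose <dt> text equals the label, then take its <dd> child.
contact = root.find(".//dl[dt='联系人:']/dd").text

# Collect every label/value pair in one pass over the <dl> elements.
fields = {dl.findtext("dt"): dl.findtext("dd") for dl in root.iter("dl")}

print(contact)
print(fields["手 机:"])
```

Building the whole dict once is handy when several fields from the same block are needed, since it avoids writing one XPath per label.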
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
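Quickly test XPath from CMD: in a virtual environment with Scrapy installed, run:
scrapy shell http://xxx.xxx.com
Then run commands in the shell to check the extracted result:
>>> title = response.xpath('//div[@class="entry-header"]/h1/text()').extract()
>>> title
['5 款 Linux 街机游戏']
extract() returns the result as a list, so members can be read by index:
>>> title[0]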
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Extract the text of the h1 tag
title = response.xpath('//div[@class="entry-header"]/h1/text()')
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
When the date, section and category sit together in one element, as in date · section · category (http://blog.jobbole.com/114636/)
<p class="entry-meta-hide-on-mobile">
2019/01/11 · <a href="http://blog.jobbole.com/category/it-tech/" rel="category tag">IT技术</a>
· <a href="http://blog.jobbole.com/tag/linux/">Linux</a>
</p>
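Use the command:
response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].replace("·","").strip()
==Command explanation==
extract()[0] = take the first member of the list
.replace("·","") = replace "·" with an empty string
strip() = remove leading and trailing whitespace; it runs last so the space that preceded "·" is removed too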
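The cleanup chain can be checked offline on a plain string. This is a small sketch: the raw value below is a hypothetical stand-in for what extract()[0] might return for the first text node, and the date format is assumed from the sample.

```python
from datetime import datetime

# Hypothetical raw value of the first text node of the <p> element.
raw = "\r\n\r\n            2019/01/11 ·  "

# Drop the separator first, then trim. Stripping first would leave the
# space that sat between the date and "·" at the end of the string.
cleaned = raw.replace("·", "").strip()

# Optional: parse the cleaned string into a date (format assumed from the sample).
published = datetime.strptime(cleaned, "%Y/%m/%d").date()

print(cleaned)
print(published)
```

Parsing with strptime doubles as a sanity check, since it raises ValueError if the page layout changes and the text is no longer a date.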
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Use the contains() function to match an element whose attribute value contains a given string
<span data-post-id="114636" class=" btn-bluet-bigger href-style vote-post-up register-user-only "><i class="fa fa-thumbs-o-up"></i> <h10 id="114636votetotal">1</h10> 赞</span>
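Command:
response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()")
Convert the value directly to int:
int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract()[0])
==Command explanation==
//span[contains(@class,'vote-post-up')]/h10/text() = the text of the h10 inside any span whose class attribute contains 'vote-post-up'
To convert the value to int, wrap the extracted string in int(.....)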
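Calling int(...extract()[0]) raises IndexError when the XPath matches nothing and ValueError when the text is not numeric. A small guard can make the conversion safer; first_int is a hypothetical helper name, not part of Scrapy, and the fallback value 0 is an assumption.

```python
def first_int(values, default=0):
    """Convert the first string in a list to int.

    Returns `default` if the list is empty or the first item is not numeric.
    """
    if not values:
        return default
    try:
        return int(values[0].strip())
    except ValueError:
        return default


# The lists below stand in for the result of .extract().
print(first_int(["1"]))
print(first_int([]))
print(first_int(["赞"]))
```

In a spider, this would be called with the list returned by .extract(), so a missing vote counter falls back to the default rather than stopping the parse.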
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
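For values like "156 收藏" (a number followed by text or letters), use an re regular expression to pull out the number (CMD command: ipython)
In [5]: import re
In [6]: m = re.match(r".*?(\d+).*","156 收藏")
In [7]: if m:
   ...:     print(m.group(1))
   ...:
156
==Command explanation==
m = re.match("regular expression", "content")
if m: (check that the pattern matched)
print(m.group(1)) (get the first capture group)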
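The same extraction can be wrapped in a reusable function. This sketch swaps in re.search with a bare \d+ pattern, a simpler alternative to the re.match call above that finds the same first run of digits. extract_number is a hypothetical name, and returning 0 when no digits are present (for example when the counter is blank) is an assumption.

```python
import re


def extract_number(text, default=0):
    """Return the first run of digits in `text` as an int, or `default` if none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default


print(extract_number("156 收藏"))
print(extract_number(" 收藏"))
```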
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------