Advanced XPath Usage

风华浪浪

Category: p爬虫 (web crawlers)

Article link: https://blog.csdn.net/a6864657/article/details/82222357

Pattern 1: read an attribute from an element matched by its text

<button class="btn-check-phone click_btn" data-phone="1341243342334">查看联系方式</button>

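The following XPath finds the element whose text is "查看联系方式" ("View contact info") and returns its data-phone attribute: '//*[text()="查看联系方式"]/@data-phone'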
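To make the expression easy to try outside a spider, here is a minimal sketch that runs it against the sample button. It assumes the third-party lxml package is installed and uses it directly instead of Scrapy's response object; the helper name phone_from_button is made up for illustration.

```python
from lxml import html

# The sample markup from this case, wrapped in a string for the demo.
SNIPPET = (
    '<button class="btn-check-phone click_btn" '
    'data-phone="1341243342334">查看联系方式</button>'
)


def phone_from_button(markup):
    """Return the data-phone value of the element whose text is 查看联系方式."""
    tree = html.document_fromstring(markup)
    matches = tree.xpath('//*[text()="查看联系方式"]/@data-phone')
    # xpath() returns a list; an empty list means nothing matched.
    return matches[0] if matches else None


phone = phone_from_button(SNIPPET)
print(phone)
```

Returning None for an empty result mirrors what extract_first() does in Scrapy when nothing matches.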

Pattern 2: select the sibling that follows a label

    <dl>
        <dt class="field-name">位置：</dt>
        <dd class="field-detail">
            <a>北京</a>
            <a class="region">东城</a>
            <a class="region">东直门</a>
        </dd>
    </dl>

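XPath: //*[text()="位置："]/following-sibling::dd[1]/a/text()
This finds the element whose text is "位置：" ("Location:"), steps to its first following dd sibling, and returns the text of every a inside it. Join the results to get 北京-东城-东直门:
'-'.join(response.xpath('//*[text()="位置："]/following-sibling::dd[1]/a/text()').extract())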
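The label-then-sibling lookup can be checked in isolation with a short sketch. It assumes the third-party lxml package is available and stands in for Scrapy's response.xpath(...).extract(); the function name join_location is hypothetical.

```python
from lxml import html

# The sample markup from this case.
SNIPPET = '''
<dl>
    <dt class="field-name">位置：</dt>
    <dd class="field-detail">
        <a>北京</a>
        <a class="region">东城</a>
        <a class="region">东直门</a>
    </dd>
</dl>
'''


def join_location(markup, sep='-'):
    """Join the link texts in the first dd that follows the 位置： label."""
    tree = html.document_fromstring(markup)
    parts = tree.xpath(
        '//*[text()="位置："]/following-sibling::dd[1]/a/text()'
    )
    # Strip each fragment in case the real page pads the link text.
    return sep.join(part.strip() for part in parts)


location = join_location(SNIPPET)
print(location)
```

Anchoring on the label text rather than on class names keeps the expression working when several dd elements share the same class.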

Pattern 3: skip the first element with position()

<div class="keyword-wrap">
    <strong class="">关键词：</strong>
    <strong class="keyword">信用贷款</strong>
    <strong class="keyword">不用什么抵押贷款</strong>
    <strong class="keyword">身份证贷款</strong>
</div>

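The following XPath skips the first strong (the "关键词：" / "Keywords:" label) and returns the text of the rest: //div[@class="keyword-wrap"]/strong[position()>1]/text()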
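As a quick check of the position() filter, the sketch below runs it on the sample markup and compares it with an alternative that filters on the class attribute instead. It assumes the third-party lxml package is installed; the variable names are illustrative only.

```python
from lxml import html

# The sample markup from this case.
SNIPPET = '''
<div class="keyword-wrap">
    <strong class="">关键词：</strong>
    <strong class="keyword">信用贷款</strong>
    <strong class="keyword">不用什么抵押贷款</strong>
    <strong class="keyword">身份证贷款</strong>
</div>
'''

tree = html.document_fromstring(SNIPPET)

# Positional filter: every strong after the first one.
by_position = tree.xpath(
    '//div[@class="keyword-wrap"]/strong[position()>1]/text()'
)

# Attribute filter: every strong whose class is exactly "keyword".
by_class = tree.xpath(
    '//div[@class="keyword-wrap"]/strong[@class="keyword"]/text()'
)

print(by_position)
print(by_class)
```

On this sample both expressions select the same three keywords. The positional form is the one to reach for when the items carry no distinguishing attribute.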

Common XPath patterns

Pagination (1)

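    # Extract the "next page" link; extract_first() returns None when nothing matches.
    next_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_url is not None:
        # The href is assumed to be relative, so prepend the site root.
        next_url = 'https://hr.tencent.com/' + next_url
        yield scrapy.Request(url=next_url, callback=self.parse)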

Pagination (2)

    # Skip placeholder links such as 'javascript:;' that do not lead to a real page.
    if next_url and 'javascript:;' not in next_url:
        yield scrapy.Request(next_url, callback=self.parse)
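Both pagination checks can be folded into one small helper. The sketch below uses only the standard library; the name resolve_next_url and the example hrefs are hypothetical. It swaps string concatenation for urljoin, which also copes with hrefs that start with a slash or are already absolute. Inside a spider, Scrapy's response.urljoin(href) performs the same kind of resolution against the response URL.

```python
from urllib.parse import urljoin

BASE_URL = 'https://hr.tencent.com/'  # site root used in the pagination snippets


def resolve_next_url(base_url, href):
    """Return an absolute next-page URL, or None when there is no real next page."""
    # No link at all, or a placeholder such as 'javascript:;' on the last page.
    if not href or 'javascript:;' in href:
        return None
    return urljoin(base_url, href)


# Hypothetical href values, for illustration only.
print(resolve_next_url(BASE_URL, 'position.php?start=10'))
print(resolve_next_url(BASE_URL, '/position.php?start=10'))
print(resolve_next_url(BASE_URL, 'javascript:;'))
```

A spider would then yield a request only when the helper returns something other than None.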

Issues when iterating over an XPath result list

Common data cleaning

    introduction = (
        ''.join(response.xpath('//div[@class="content-wrap"]/text()').extract())
        .replace('\t', '')      # remove horizontal tabs
        .replace('\xa0', '')    # remove non-breaking spaces
        .replace('\n', '')      # remove line feeds
        .replace('\u200c', '')  # remove zero-width non-joiners
        .replace(' ', '')       # remove ordinary spaces
        .strip()                # trim any remaining leading/trailing whitespace
    )
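The chain of replace calls can also be written as a single regular-expression pass. This is a minimal sketch, not the author's method: the helper name clean_text and the sample fragments are made up. In Python 3, \s in a str pattern matches Unicode whitespace, including \xa0, while the zero-width non-joiner \u200c is not whitespace and has to be listed explicitly.

```python
import re

# One or more whitespace characters or zero-width non-joiners.
_JUNK = re.compile(r'[\s\u200c]+')


def clean_text(fragments):
    """Join text fragments and drop all whitespace and zero-width non-joiners."""
    return _JUNK.sub('', ''.join(fragments))


# Hypothetical fragments, shaped like the list that .extract() returns.
sample = ['\t 信用\xa0贷款\n', '\u200c 北京 ']
print(clean_text(sample))
```

Like the original chain, this removes spaces inside the text as well, which suits Chinese content but would glue English words together.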

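Chinese-character encoding issues in XPath

On Python 2, prefix an XPath string that contains Chinese characters with u so that it is a Unicode string:

u'//*[text()="位置："]/following-sibling::dd[1]/a/text()'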

Scrapy terminal output appears as ASCII escape sequences

{'address': '\xe5\x8c\x97\xe47',
 'district': '\xe5\x8c\x97\'}
 
Scrapy is probably running under Python 2. On Linux, reinstall it for Python 3 with `pip3 install scrapy`.
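The escaped sequences are not corruption: \xe5\x8c\x97 is simply the UTF-8 encoding of 北, printed byte by byte. The Python 3 sketch below shows the round trip. The byte string is a hypothetical complete value (the output shown above is truncated), and decode_item is an illustrative helper name.

```python
def decode_item(item, encoding='utf-8'):
    """Return a copy of item with every bytes value decoded to str."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in item.items()
    }


# Hypothetical item: the UTF-8 bytes for 北京, as a Python 2 str would hold them.
raw_item = {'address': b'\xe5\x8c\x97\xe4\xba\xac'}

decoded = decode_item(raw_item)
print(raw_item)   # shows the escaped bytes
print(decoded)    # shows readable Chinese text
```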

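Extracting a JSON object or dict from surrounding text
Given input such as fjdgsdfgpoi{'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}asfdsfjkladsfjqleiifqe;oufperhnfu, a greedy regular expression captures everything from the first { to the last }: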

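import re

Regular = re.compile(r'\{.*\}', re.S)  # re.S lets . match newlines, so multi-line dicts are captured
adaption = Regular.findall(str(receiveData))[0]  # receiveData is the raw input; raises IndexError if nothing matches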
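The regular expression only yields a string, so one more step is needed to get a usable dict. A minimal sketch, assuming the embedded text is a Python-style literal with single quotes as in the sample; extract_dict is a hypothetical helper name.

```python
import ast
import re

# Greedy match from the first '{' to the last '}'; re.S lets '.' span newlines.
DICT_PATTERN = re.compile(r'\{.*\}', re.S)


def extract_dict(text):
    """Return the dict embedded in text, or None when no braces are found."""
    match = DICT_PATTERN.search(text)
    if match is None:
        return None
    # literal_eval parses Python literals without executing arbitrary code.
    return ast.literal_eval(match.group(0))


noisy = "fjdgsdfgpoi{'k1': 'v1', 'k2': 'v2', 'k3': 'v3'}asfdsfjkladsfjqleiifqe;oufperhnfu"
result = extract_dict(noisy)
print(result)
```

When the payload is strict JSON with double-quoted keys and values, json.loads(match.group(0)) is the better fit. Because the pattern is greedy, input that contains several separate objects is captured as one span and will fail to parse.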
