A Few Notes on Crawling with Scrapy

ChigoStar

Category: Programming. Tags: Python, Scrapy

Article link: https://blog.csdn.net/lettmefly/article/details/79255820

These notes collect a few problems I ran into while learning to write crawlers.

An XPath copied from Chrome returns nothing

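Crawling some sites can be frustrating. Take, for example, scraping the financial data of Kweichow Moutai (贵州茅台) from NetEase Finance (网易财经):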

scrapy shell http://quotes.money.163.com/f10/zcfzb_600519.html#01c05

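In Google Chrome, select the cash and cash equivalents (货币资金) value in Kweichow Moutai's balance sheet for 2017-09-30 and copy its XPath. Tested in the Scrapy shell, that XPath expression returns an empty list.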

In [1]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tbody/tr[4]/td[1]/text()').extract()
Out[1]: []
In [2]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tbody/tr[4]/td[1]/text()')
Out[2]: []
In [3]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tbody/tr[4]/td[1]')
Out[3]: []

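The likely culprit is tbody. Chrome copies the XPath from the DOM it built, and browsers insert a tbody element into tables whose HTML source has none, whereas Scrapy queries the raw HTML as downloaded. Removing tbody from the XPath fixes it.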

In [4]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tr[4]/td[1]')
Out[4]: [<Selector xpath='//*[@id="scrollTable"]/div[4]/table/tr[4]/td[1]' data='<td>8,096,468</td>'>]
In [5]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tr[4]/td[1]/text()').extract()
Out[5]: ['8,096,468']
In [6]: response.xpath('//*[@id="scrollTable"]/div[4]/table/tr[4]/td[1]/text()').extract_first()
Out[6]: '8,096,468'
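Removing tbody by hand gets tedious when many XPaths are copied from Chrome. A minimal sketch of a helper that does the same edit follows; `strip_tbody` is a hypothetical name, not part of Scrapy, and it is a plain textual rewrite that assumes every tbody step in the path was added by the browser.

```python
import re

# Matches a "/tbody" location step, with an optional positional predicate
# such as "/tbody[1]", when it is followed by another step or the end.
_TBODY_STEP = re.compile(r"/tbody(?:\[\d+\])?(?=/|$)")


def strip_tbody(xpath: str) -> str:
    """Return the XPath with every tbody step removed (hypothetical helper)."""
    return _TBODY_STEP.sub("", xpath)


copied = '//*[@id="scrollTable"]/div[4]/table/tbody/tr[4]/td[1]/text()'
print(strip_tbody(copied))
# In the Scrapy shell one could then run:
#     response.xpath(strip_tbody(copied)).extract_first()
```

If a table really has several tbody sections in its source, dropping the step changes which rows the path selects, so the result is worth checking in the shell.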

However, tbody does not always cause this problem. For example:

scrapy shell http://www.cnindex.com.cn/zstx/quote/index.htm

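This XPath also contains tbody, yet the expression below returns the correct selector list, presumably because this page's own HTML source includes the tbody element.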

In [1]: response.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]')
Out[1]: [<Selector xpath='/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]' data='<td align="left" bgcolor="#FFFFFF">深证成指<'>]
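One way to tell the two cases apart is to look for tbody in the raw HTML rather than in Chrome's element inspector. The sketch below is a rough check only: `source_has_tbody` is a hypothetical helper, the two HTML strings are made-up stand-ins for the pages above, and a regular expression can be fooled by tbody appearing inside comments or scripts.

```python
import re

# An opening <tbody> tag as literal markup, matched case-insensitively.
_TBODY_TAG = re.compile(r"<tbody[\s>]", re.IGNORECASE)


def source_has_tbody(html: str) -> bool:
    """True if the raw HTML text itself contains a <tbody> tag."""
    return _TBODY_TAG.search(html) is not None


# Made-up snippets: the first mimics a source without tbody, the second with it.
raw_without = "<table><tr><td>8,096,468</td></tr></table>"
raw_with = "<table><tbody><tr><td>深证成指</td></tr></tbody></table>"

print(source_has_tbody(raw_without), source_has_tbody(raw_with))  # False True
# In the Scrapy shell: source_has_tbody(response.text)
```

When the check returns False, any tbody in an XPath copied from Chrome was added by the browser and should be dropped.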

extract() and extract_first()

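Comparing the following four XPath expressions and their results gives a glimpse of the difference between extract() and extract_first(): extract() returns a list of strings, one per matched node, while extract_first() returns only the first match as a single string.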

In [1]: response.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]')
Out[1]: [<Selector xpath='/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]' data='<td align="left" bgcolor="#FFFFFF">深证成指<'>]

In [2]: response.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]').extract()
Out[2]: ['<td align="left" bgcolor="#FFFFFF">深证成指</td>']

In [3]: response.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]/text()').extract()
Out[3]: ['深证成指']

In [4]: response.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]/table[1]/tbody/tr[2]/td[1]/text()').extract_first()
Out[4]: '深证成指'

The normalize-space function (XPath)

Official description:

White space is normalized by stripping leading and trailing white space and replacing sequences of white space characters with a single space.

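In other words, leading and trailing whitespace is removed and each run of whitespace is replaced by a single space. If the argument is omitted, the string-value of the context node is normalized and returned.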

For a text string containing unnormalized whitespace (tabs, leading and trailing spaces, and multiple spaces between words), normalize-space produces the normalized form.

In [1]: response.xpath('//div[@class="title"]/text()').extract()
Out[1]: ['\n          2018-02-02\n        ']

In [2]: response.xpath('normalize-space(//div[@class="title"]/text())').extract_first()
Out[2]: '2018-02-02'
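The same cleanup can also be done in Python after extraction. This is a minimal sketch, not Scrapy's own code: `normalize_space` is a hypothetical helper that follows the XPath definition of whitespace (space, tab, carriage return, line feed).

```python
import re

# XPath whitespace is limited to space, tab, carriage return and line feed.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends."""
    collapsed = _WS_RUN.sub(" ", text)
    # After collapsing, any leading or trailing whitespace is a single space.
    return collapsed.strip(" ")


raw = "\n          2018-02-02\n        "
print(repr(normalize_space(raw)))  # '2018-02-02'
# In the Scrapy shell, applied to every match:
#     [normalize_space(t) for t in response.xpath('//div[@class="title"]/text()').extract()]
```

Doing it in Python helps when an expression matches several nodes: in XPath 1.0, normalize-space() applied to a node-set uses only the first node, while the list comprehension above cleans each extracted string.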
