Scrapy Framework Study Notes: An XPath Selector Problem

艾渃曼丶

Category: Python

Article link: https://blog.csdn.net/qq_39395755/article/details/96425873

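While working through the Scrapy framework today, I noticed that my spider was scraping only a single page. The log file contained an error.

Its literal meaning is easy to understand: a str cannot be concatenated with a value that is not a str.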

Rereading the code, I could not find anything obviously wrong.

So I kept going through the log and spotted something unusual: in the entries for scraped data, each value was a selector instead of a list of data.

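After some reading, I found that in Scrapy the XPath selector still returns selector objects (a SelectorList of Selector objects) rather than the extracted strings, even when the expression targets an attribute (for example, //li[@class='next']/a/@href) or an element's text (for example, div[@class='tags']/a/text()).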

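In my earlier Python crawlers I had always used the XPath support in the lxml module to pull data out of pages. With lxml, an expression that targets an attribute or an element's text returns a list containing the matched values.

For example: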

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import requests
from lxml import html

url = 'http://quotes.toscrape.com/'

req = requests.get(url).content
page = html.fromstring(req)

quotes = page.xpath('//div[@class="quote"]')

element = quotes[0]
item = {}

author = element.xpath("span/small/text()")
tags = element.xpath("div[@class='tags']/a/text()")
text = element.xpath("span[@class='text']/text()")

print(author)
print(tags)
print(text)

item['author'] = element.xpath("span/small/text()")[0]
item['tags'] = element.xpath("div[@class='tags']/a/text()")[0]
item['text'] = element.xpath("span[@class='text']/text()")[0]

print(item)

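The console output of this example shows clearly that lxml's XPath returns a list containing the matched values for attribute and text queries, unlike the selector returned by Scrapy's XPath selector.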

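With that worked out, both the cause of the problem and the fix became clear.

All that is needed is to pull the data out of the selector that Scrapy's XPath selector returns. After some more reading, I found two methods on the selector for this: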

extract_first()
extract()

The names are self-explanatory: extract_first() returns the first matched value, and extract() returns all matched values as a list.

So I changed the code:

for element in quotes:
    item = ScrapyTestItem()
    item['text'] = element.xpath("span[@class='text']/text()").extract_first()
    item['author'] = element.xpath("span/small/text()").extract_first()
    item['tags'] = element.xpath("div[@class='tags']/a/text()").extract()
    yield item

Running it again:

2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'author': 'Mark Twain',
 'tags': ['misattributed-mark-twain', 'truth'],
 'text': '“A lie can travel half way around the world while the truth is '
         'putting on its shoes.”'}
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'author': 'C.S. Lewis',
 'tags': ['christianity', 'faith', 'religion', 'sun'],
 'text': '“I believe in Christianity as I believe that the sun has risen: not '
         'only because I see it, but because by it I see everything else.”'}
2019-07-18 11:07:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'author': 'J.K. Rowling',
 'tags': ['truth'],
 'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible '
         'thing, and should therefore be treated with great caution.”'}
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'author': 'Jimi Hendrix',
 'tags': ['death', 'life'],
 'text': "“I'm the one that's got to die when it's time for me to die, so let "
         'me live my life the way I want to.”'}

Aha, problem solved.

Next, I need to figure out how to make this damn thing stop (╬￣皿￣)