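While learning the Scrapy framework today, I found that my spider could only scrape a single page. Looking through the log file, I found the following error:
The literal meaning is easy enough to understand: "a str cannot be concatenated with a non-str value".
I went through the code and could not see anything wrong with it.
So I kept reading the log and noticed something different from usual: in the log entries for the scraped data, every value was a selector instead of a plain list of data.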
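After some reading, I found that in the Scrapy framework the xpath selector still returns selectors, not strings, even when the expression selects an attribute (for example: //li[@class='next']/a/@href) or a tag's text (for example: div[@class='tags']/a/text()). More precisely, the return value is a SelectorList, which is a list of Selector objects.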
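In my earlier Python crawlers, though, I had always used the xpath selector from the lxml module for page scraping. When that selector picks out attributes or tag text, it returns a plain list containing the extracted results.
For example: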
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import requests
from lxml import html
url = 'http://quotes.toscrape.com/'
# Fetch the page and parse the raw bytes into an lxml element tree
req = requests.get(url).content
page = html.fromstring(req)
# Each quote on the page sits in its own <div class="quote">
quotes = page.xpath('//div[@class="quote"]')
element = quotes[0]
item = {}
# With lxml, text() queries return plain lists of strings
author = element.xpath("span/small/text()")
tags = element.xpath("div[@class='tags']/a/text()")
text = element.xpath("span[@class='text']/text()")
print(author)
print(tags)
print(text)
# So indexing with [0] gives a string directly (for tags, only the first tag)
item['author'] = element.xpath("span/small/text()")[0]
item['tags'] = element.xpath("div[@class='tags']/a/text()")[0]
item['text'] = element.xpath("span[@class='text']/text()")[0]
print(item)
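Console output:
The example makes it clear that the lxml xpath selector returns a list containing the extracted results when it selects attributes or tag text, unlike Scrapy's xpath selector, which returns selectors.
With that worked out, both the cause of the problem and the fix became clear.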
All that is needed is to pull the data out of the selector that Scrapy's xpath selector returns. After some reading, I found two methods on it for this:
extract_first()
extract()
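The names pretty much explain themselves: extract_first() returns the first extracted value as a string (or None when nothing matches), and extract() returns all extracted values as a list of strings.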
So I changed the code:
for element in quotes:
    item = ScrapyTestItem()
    item['text'] = element.xpath("span[@class='text']/text()").extract_first()
    item['author'] = element.xpath("span/small/text()").extract_first()
    item['tags'] = element.xpath("div[@class='tags']/a/text()").extract()
    yield item
Run it again:
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'author': 'Mark Twain',
'tags': ['misattributed-mark-twain', 'truth'],
'text': '“A lie can travel half way around the world while the truth is '
'putting on its shoes.”'}
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/9/>
{'author': 'C.S. Lewis',
'tags': ['christianity', 'faith', 'religion', 'sun'],
'text': '“I believe in Christianity as I believe that the sun has risen: not '
'only because I see it, but because by it I see everything else.”'}
2019-07-18 11:07:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: http://quotes.toscrape.com/page/9/)
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'author': 'J.K. Rowling',
'tags': ['truth'],
'text': '“The truth." Dumbledore sighed. "It is a beautiful and terrible '
'thing, and should therefore be treated with great caution.”'}
2019-07-18 11:07:41 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'author': 'Jimi Hendrix',
'tags': ['death', 'life'],
'text': "“I'm the one that's got to die when it's time for me to die, so let "
'me live my life the way I want to.”'}
Oh ho, problem solved.
Next up: figuring out how to make this damn thing stop (╬ ̄皿 ̄)