Project Exercise
Project Requirements
Target site to scrape:
http://books.toscrape.com
Scrape the book information on the site.
Each book's information includes:
- Title
- Price
- Review rating
- UPC (product code)
- Stock level
- Number of reviews
Save the scraped results to a CSV file.
Implementation
1. Create the project
scrapy startproject books_to_scrape
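This generates a project skeleton roughly like the following (the exact layout can vary slightly between Scrapy versions):
```
books_to_scrape/
    scrapy.cfg
    books_to_scrape/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```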
2. Create the Spider
Create a book_spider.py file under the spiders directory.
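Alternatively, Scrapy can generate a spider skeleton for you with the genspider command (run inside the project directory; note that it names the file after the spider, e.g. books.py rather than book_spider.py):
```
scrapy genspider books books.toscrape.com
```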
3. Define an Item class for the book information in items.py:
```
import scrapy


class BookItem(scrapy.Item):
    # Title
    name = scrapy.Field()
    # Price
    price = scrapy.Field()
    # Review rating, 1-5 stars
    review_rating = scrapy.Field()
    # Number of reviews
    review_num = scrapy.Field()
    # UPC (product code)
    upc = scrapy.Field()
    # Stock level
    stock = scrapy.Field()
```
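A scrapy.Item behaves much like a dict, which is easy to verify in a Python shell (a minimal sketch; the field values here are only illustrative):
```
from books_to_scrape.items import BookItem

item = BookItem(name='A Light in the Attic', price='£51.77')
item['stock'] = 'In stock (22 available)'
print(item['name'])
print(dict(item))  # view the populated fields as a plain dict
```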
4. Implement BooksSpider
   1. Subclass Spider to create the BooksSpider class
   2. Give the Spider a name
   3. Specify the start URL
   4. Implement the parse function for book list pages
   5. Implement the parse function for book detail pages
```
# _*_ coding:utf-8 _*_
import scrapy
from scrapy.linkextractors import LinkExtractor

from ..items import BookItem


class BooksSpider(scrapy.Spider):
    # Unique identifier for this spider
    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    allowed_domains = ['books.toscrape.com']

    def parse(self, response):
        # XPath-based alternative:
        # for book in response.xpath('//article[@class="product_pod"]'):
        #     book_url = book.xpath('./h3/a/@href').extract_first()
        #     yield scrapy.Request(response.urljoin(book_url), callback=self.parse_book)

        # LinkExtractor-based parsing: follow every book detail link
        le = LinkExtractor(restrict_css='article.product_pod h3')
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_book)

        # Next page:
        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_book(self, response):
        book = BookItem()
        sel = response.css('div.product_main')
        # Relative XPath ('.//') keeps the query scoped to the selected element;
        # a bare '//' would search the whole document instead
        book['name'] = sel.xpath('.//h1/text()').extract_first()
        book['price'] = sel.xpath('.//p[@class="price_color"]/text()').extract_first()
        book['review_rating'] = sel.css('p.star-rating::attr(class)').re_first('star-rating ([A-Za-z]+)')

        sel = response.css('table.table.table-striped')
        book['upc'] = sel.xpath('.//tr[1]/td/text()').extract_first()
        book['stock'] = sel.xpath('.//tr[last()-1]/td/text()').extract_first()
        book['review_num'] = sel.xpath('.//tr[last()]/td/text()').extract_first()
        yield book
```
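Before running the full crawl, the selectors in parse_book can be sanity-checked interactively with scrapy shell (an illustrative session; the URL is just one example detail page):
```
$ scrapy shell 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
>>> response.css('div.product_main h1::text').extract_first()
>>> sel = response.css('table.table.table-striped')
>>> sel.xpath('.//tr[1]/td/text()').extract_first()
```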
Run the spider
scrapy crawl books -o books.csv
It works, the books are scraped!
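A note on the -o flag: if books.csv already exists, -o appends to it; recent Scrapy releases (2.1+) also offer -O to overwrite the file instead.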
One issue: the field order in the CSV file comes out random. You can set FEED_EXPORT_FIELDS in settings.py to fix the column order:
```
FEED_EXPORT_FIELDS = ['upc', 'name', 'price', 'stock', 'review_rating', 'review_num']
```
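With this setting in place, the CSV header row should come out in exactly that order:
```
upc,name,price,stock,review_rating,review_num
```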
5. Write a Pipeline
The review_rating values scraped from the page are English words such as One, Two, Three, and we want to convert them to numbers. A Pipeline is overkill for such a small transformation, but it makes a good refresher on how Pipelines work.
```
class BookPipeline(object):
    # Map the English word on the page to a numeric rating
    review_rating_map = {
        'One': 1,
        'Two': 2,
        'Three': 3,
        'Four': 4,
        'Five': 5,
    }

    def process_item(self, item, spider):
        rating = item.get('review_rating')
        if rating:
            item['review_rating'] = self.review_rating_map[rating]
        return item
```
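The mapping can be checked without a crawl by calling the pipeline directly (a minimal sketch; process_item never touches the spider argument, so None is passed in):
```
from books_to_scrape.items import BookItem
from books_to_scrape.pipelines import BookPipeline

item = BookItem(name='Sharp Objects', review_rating='Four')
item = BookPipeline().process_item(item, spider=None)
print(item['review_rating'])  # 4
```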
Don't forget the configuration! Enable BookPipeline in settings.py:
```
ITEM_PIPELINES = {
    'books_to_scrape.pipelines.BookPipeline': 300,
}
```
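The value 300 is the pipeline's order: when several pipelines are enabled, items flow through them in ascending order of this number, and by convention the numbers are chosen from the 0-1000 range.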
Run the crawl again and the ratings come out as numbers. All good!