Building a crawler with the Python-based Scrapy framework really is simple (the wheels are already built for us; we just assemble them).
This example crawls pages that require no login.
The items.py module
from scrapy import Item, Field


class BookItem(Item):
    title = Field()
    price = Field()
    tags = Field()
    author = Field()
    copyright = Field()
    score = Field()
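A Scrapy Item behaves like a dictionary restricted to its declared fields. As a quick sketch (the values below are made up for illustration), BookItem can be filled and read like this:

item = BookItem(title="Example Book", price="9.99")   # fields can be set at construction time
item["tags"] = ["fiction", "bestseller"]               # or via dict-style assignment
print(item["title"])                                   # "Example Book"
print(dict(item))                                      # convert to a plain dict, e.g. for export
# item["publisher"] = "X"  would raise KeyError: that field is not declared on BookItem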
The simple crawler in three steps:
Scrapy step 1: gather all the links and use the yield keyword to create a generator over them.
Scrapy step 2: send a Request for each link, assigning a different parse callback so each page type is handled step by step.
Scrapy step 3: every item passes through the pipeline class for data processing.
The book.py spider module
# -*- coding: utf-8 -*-
from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor

from scrapy_first.items import BookItem


class BookSpider(Spider):
    # The name attribute is the spider_name used by [scrapy crawl spider_name]; it must be unique
    name = 'book'
    allowed_domains = ['yuedu.baidu.com']
    start_urls = ['https://yuedu.baidu.com/rank/hotsale']

    def __init__(self, name=None, **kwargs):
        super().__init__(name=name, **kwargs)
        self.title = ""
        self.price = ""
        self.tags = ""
        self.author = ""
        self.copyright = ""
        self.score = ""

    # Scrapy step 1: gather all the links and use the yield keyword to create a generator over them
    def parse(self, response):
        # Follow the pagination link and feed it back into this same callback
        next_href = response.xpath("//div[@class='pager-inner']/a/@href").get()
        if next_href:
            yield Request(url=response.urljoin(next_href), callback=self.parse)
        # Each book's detail-page link gets its own request with a dedicated callback
        extractor = LinkExtractor(restrict_css=".book .al.title-link")
        for link in extractor.extract_links(response):
            yield Request(url=link.url, callback=self.parse_book)

    # Scrapy step 2: request each detail page and parse it with its own callback
    def parse_book(self, response):
        for book in response.css(".doc-info-bd.clearfix .content-block"):
            item = BookItem()
            item["title"] = book.xpath(".//h1[@class='book-title']/text()").get()
            item["price"] = book.xpath(".//span[@class='numeric']/text()").get()
            item["author"] = book.xpath(".//a[@class='doc-info-field-val doc-info-author-link']/text()").get()
            item["copyright"] = book.xpath(".//a[@class='doc-info-field-val']/text()").get()
            item["tags"] = [value.get() for value in book.xpath(".//a[@class='tag-item doc-info-field-val mb5']/text()")]
            item["score"] = book.xpath(".//span[@class='doc-info-read-count']/text()").get()
            yield item
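A crawl is normally launched from the project root with the scrapy crawl book command. As a minimal sketch, the same thing can be done from a plain Python script via Scrapy's CrawlerProcess, assuming the script runs inside this project so that ./settings.py can be found:

# run.py -- minimal sketch for launching the spider without the scrapy CLI
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads ./settings.py
process.crawl("book")                             # the spider_name declared on BookSpider
process.start()                                   # blocks until the crawl finishes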
The pipeline data-processing module
Scrapy step 3: every item passes through the pipeline class for data processing.
import json


class BookPipeline:
    def __init__(self, count):
        self.count = count

    # from_crawler returns a pipeline instance, initialised with values read from the settings file
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # how a field is read from ./settings.py
            count=crawler.settings["BOOK_FILTER_COUNT"]
        )

    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    # Every item yielded by the spider passes through this method; it is called once per item.
    # To filter certain items out, raise a scrapy.exceptions.DropItem exception here.
    def process_item(self, item, spider):
        # Strip newlines and surrounding whitespace from the data
        item["title"] = (item.get("title") or "").replace("\n", "").strip()
        # Write each item to items.jl as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
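BOOK_FILTER_COUNT is read into self.count above but not used yet. As a hedged sketch, the DropItem filtering mentioned in the comment could look like the pipeline below; the "fewer tags than the threshold" rule is invented purely for illustration:

from scrapy.exceptions import DropItem


class BookFilterPipeline:
    def __init__(self, count):
        self.count = count

    @classmethod
    def from_crawler(cls, crawler):
        return cls(count=crawler.settings.getint("BOOK_FILTER_COUNT", 3))

    def process_item(self, item, spider):
        # Hypothetical rule: drop books that carry fewer tags than BOOK_FILTER_COUNT
        if len(item.get("tags") or []) < self.count:
            raise DropItem(f"too few tags: {item.get('title')!r}")
        return item

Like BookPipeline, this class would also need its own entry in ITEM_PIPELINES to take effect.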
The ./settings.py configuration file
Pipeline execution configuration:
ITEM_PIPELINES = {
    # the number (customarily 0-1000) is the pipeline's order: lower values run earlier
    'scrapy_first.pipelines.BookPipeline': 1,
}
Custom setting:
BOOK_FILTER_COUNT = 3
Feed export configuration:
# Export path: an export_data directory under the current working directory
FEED_URI = "export_data/%(name)s_%(time)s.json"
# Output file format (JSON, CSV, XML, ...)
FEED_FORMAT = "json"
# Output text encoding (the JSON exporter does not default to UTF-8)
FEED_EXPORT_ENCODING = "UTF-8"
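FEED_URI and FEED_FORMAT still work but are deprecated in newer Scrapy versions (2.1+) in favour of the consolidated FEEDS setting; the rough equivalent of the three settings above would be:

FEEDS = {
    "export_data/%(name)s_%(time)s.json": {
        "format": "json",
        "encoding": "utf8",
    },
}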