Scrapy Crawler Framework Tutorial

1. How Scrapy Works

  (Figure: Scrapy architecture diagram. The Engine coordinates the Scheduler, Downloader, Spiders, and Item Pipeline; requests and responses pass through the downloader and spider middlewares along the way.)

2. Installing Scrapy

  Option 1: install with pip from the command line:

pip install scrapy

  Option 2: download the package first, then install it:

pip download scrapy -d ./

  Change into the download directory and run the following command to install:

pip install Scrapy-2.5.1-py2.py3-none-any.whl
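
  To confirm the installation, a quick check from Python works with either method:

import scrapy

# Prints the installed Scrapy version, e.g. 2.5.1 if the wheel above was used.
print(scrapy.__version__)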

3. Scrapy Commands

  1. Global commands: can be run anywhere.

scrapy version     # show the Scrapy version
scrapy version -v  # show versions of Scrapy and its key dependencies
scrapy -h          # show help (same as running scrapy with no arguments)
scrapy --help      # show help

  2. Project commands: must be run inside a Scrapy project.

scrapy startproject demo  # create a new project
scrapy -h                 # show the project commands
scrapy bench              # run a quick benchmark
scrapy list               # list the spiders in the project
scrapy crawl spider_name  # run a spider
scrapy view https://www.baidu.com # open the page in a browser as Scrapy sees it


4. The Interactive Shell

  Request Baidu and drop into an interactive shell with a ready-made response object.

scrapy shell https://www.baidu.com

  Inspect the response:

response.url
response.status
response.body[:50]
response.css("title::text")
response.css("title::text").extract_first()


5. Creating a Project

  (1) Create a new project.

scrapy startproject demo

The generated files are:

items.py: defines the Items, i.e. the data structures to be scraped
middlewares.py: defines the Spider and Downloader middleware implementations
pipelines.py: defines the Item Pipeline implementations, i.e. the data processing channel
settings.py: defines the project-wide configuration
spiders/: contains the individual Spider implementations, one file per Spider
scrapy.cfg: the deployment configuration file; defines the settings module path and deployment-related information
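
  For reference, the layout generated by scrapy startproject demo looks like this:

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py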

  (2) Generate a spider with genspider. A project can contain multiple spiders, but each spider name must be unique.

scrapy genspider name domain
# scrapy genspider baidu baidu.com
scrapy list # list the spiders in the current project

  (3) Run a spider with runspider: this runs a single spider file directly, without needing the whole project.

scrapy runspider spider_file.py

  (4) Run a spider with crawl (by name, inside the project).

scrapy crawl spider_name
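
  As an aside (not from the original post), a spider can also be launched from a plain Python script with CrawlerProcess; the spider name "baidu" below assumes the genspider example above was run:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings so the spider can be looked up by name.
process = CrawlerProcess(get_project_settings())
process.crawl("baidu")  # the spider created by `scrapy genspider baidu baidu.com`
process.start()         # blocks here until the crawl is finished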

6. Selectors

  (1) Three ways to select data (XPath, CSS, and regular expressions).

from scrapy import Selector

context = "<html><head><title>Test html</title></head><body><h3 class='center'>Hello World!</h3></body></html>"
selector = Selector(text=context)
selector.xpath("//title/text()").extract_first()
selector.xpath("//title/text()").get()
selector.xpath("//h3[@class='center']/text()").get()
selector.css("title::text").get()
selector.css("h3.center::text").get()
selector.re_first(r"<title>(.*?)</title>")

  (2) Test in the interactive shell.

scrapy shell https://www.baidu.com
response.url
response.selector.xpath("//title/text()").get()
response.selector.css("title::text").get()
response.selector.re_first("<title>(.*?)</title>")
alist = response.xpath("//a")
for vo in alist:
    print(vo.css("::attr(href)").get(), ":", vo.css("::text").get())
    # print(vo.xpath("./@href").get(),":",vo.xpath("./text()").get())

  Note: for regular expressions the .selector attribute cannot be omitted; XPath and CSS can be called directly on the response (response.xpath() / response.css()).
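
  A quick offline check of that rule, using a hand-built TextResponse (the HTML here is made up for illustration):

from scrapy.http import TextResponse

resp = TextResponse(url="http://example.com",
                    body=b"<html><head><title>Demo</title></head><body></body></html>",
                    encoding="utf-8")
print(resp.xpath("//title/text()").get())               # shortcut for resp.selector.xpath(...)
print(resp.css("title::text").get())                    # shortcut for resp.selector.css(...)
print(resp.selector.re_first(r"<title>(.*?)</title>"))  # regex has no shortcut on the response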

7. Using Spiders

  (1) In Scrapy, a site's entry links, crawling logic, and parsing logic are all defined in the Spider.
  (2) A Spider has exactly two jobs: define the actions used to crawl the site, and parse the pages that come back (a minimal sketch follows, then three complete examples).
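
  A minimal sketch of that two-part contract (the quotes.toscrape.com site and its CSS classes are used only for illustration and are not part of the three methods below):

import scrapy

class MiniSpider(scrapy.Spider):
    """Hypothetical minimal spider: start_urls defines the crawl action,
    parse() defines how each downloaded page is interpreted."""
    name = "mini"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract data from the current page...
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # ...and schedule the next page to crawl.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
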
  Method 1:

import scrapy
from tqdm import tqdm

class VillaSpider(scrapy.Spider):
    name = 'villa'
    allowed_domains = ['3d.qingmo.com', 'img.qingmo.com']
    start_urls = ['https://3d.qingmo.com/so/%E5%88%AB%E5%A2%85-1-6-0-0-0.html']
    page = 1
    img_count = 0

    def parse(self, response):
        imgs = response.css("div.box")
        print(f"===============正在爬取第{self.page}页===============")
        if self.page < 5:
            for img in tqdm(imgs):
                yield scrapy.Request(url=img.css("img::attr(data-src)").get(), callback=self.downloadImg)
            self.page += 1
            # Absolute path so urljoin() builds /so/... rather than /so/so/...
            next_url = f'/so/%E5%88%AB%E5%A2%85-{self.page}-6-0-0-0.html'
            url = response.urljoin(next_url)
            yield scrapy.Request(url=url, callback=self.parse)

    def downloadImg(self, response):
        self.img_count += 1
        with open(f'./images/{self.img_count}-{self.page}.jpg', 'wb') as f:
            f.write(response.body)

  Method 2:

import scrapy
from tqdm import tqdm

class Villa2Spider(scrapy.Spider):
    name = 'villa2'
    allowed_domains = ['3d.qingmo.com', 'img.qingmo.com']
    page = 1
    img_count = 0

    def start_requests(self):
        next_url = f'https://3d.qingmo.com/so/%E5%88%AB%E5%A2%85-{self.page}-6-0-0-0.html'
        yield scrapy.Request(url=next_url, callback=self.myParse)

    def myParse(self, response):
        imgs = response.css("div.box")
        for img in tqdm(imgs):
            yield scrapy.Request(url=img.css("img::attr(data-src)").get(), callback=self.downloadImg)

    def downloadImg(self, response):
        self.img_count += 1
        with open(f'./images/{self.img_count}.jpg', 'wb') as f:
            f.write(response.body)
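
  Both spiders above write into ./images/, which Scrapy does not create for you; a small guard run before crawling avoids a FileNotFoundError:

import os

# Make sure the output directory used by downloadImg() exists.
os.makedirs("./images", exist_ok=True)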

  Method 3 (POST request):

import scrapy
import json

class YoudaoSpider(scrapy.Spider):
    name = 'youdao'
    allowed_domains = ['fanyi.youdao.com']
    # start_urls = ['http://fanyi.youdao.com/']

    def start_requests(self):
        url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
        kw = input("Enter a word to translate: ")
        data = {'i': kw, 'doctype': 'json'}
        # FormRequest is how Scrapy sends a form-encoded POST request
        yield scrapy.FormRequest(
            url = url,
            formdata = data,
            callback = self.parse
        )

    def parse(self, response):
        res = json.loads(response.body)
        self.logger.info(f'result: {res["translateResult"][0][0]["tgt"]}')
        input("按任意键继续")

8. Downloader Middleware Proxies

  Configure the proxy pool in settings.py and enable the downloader middleware (the addresses below are placeholders):

IPPOOL=[
    {"ipaddr": 'http://0.0.0.0:00'},
    {"ipaddr": 'http://0.0.0.0:00'},
    {"ipaddr": 'http://0.0.0.0:00'},
]
DOWNLOADER_MIDDLEWARES = {
   'middletest.middlewares.MiddletestDownloaderMiddleware': 543,
}

  Write the proxy middleware class in middlewares.py:

from scrapy import signals
import random
from .settings import IPPOOL

class MiddletestDownloaderMiddleware:
 
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        ip = random.choice(IPPOOL)['ipaddr']
        request.meta['proxy'] = ip
        return None

    def process_response(self, request, response, spider):
        # Must return a Response (or a Request); returning None from here is invalid.
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
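
  To confirm the middleware really applies a proxy, a throwaway spider like this (hypothetical, using httpbin.org to echo the exit IP) can help:

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    """Hypothetical helper: the origin IP echoed by httpbin should match one of IPPOOL."""
    name = "proxycheck"
    start_urls = ["https://httpbin.org/ip"]

    def parse(self, response):
        self.logger.info("proxy used: %s", response.meta.get("proxy"))
        self.logger.info("exit ip: %s", response.text)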

9. Using Item Pipelines

  (1) Filtering data. Enable the pipeline in settings.py and drop items that fail a check:

ITEM_PIPELINES = {
   'dangdang.pipelines.DangdangPipeline': 300,
}
import scrapy
from scrapy.exceptions import DropItem

class DangdangPipeline:
    def process_item(self, item, spider):
        '''Drop items without a price.'''
        if item.get('price'):
            return item
        else:
            raise DropItem("Missing price")

  (2) Custom image storage: set the storage path and write the pipeline class.

IMAGES_STORE = './images'
ITEM_PIPELINES = {
   'dangdang.pipelines.MyImagesPipeline': 301,
#    'scrapy.pipelines.images.ImagesPipeline': 303,  # Scrapy's built-in image storage pipeline
}
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        yield scrapy.Request(item['pic'])

    def file_path(self, request, response=None, info=None, *, item=None):
        url = request.url
        file_name = url.split("/")[-1]
        return file_name

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['pic'] = image_paths
        return item

  (3) Saving data to a database: configure the connection settings and write the pipeline class.

DATABASES = {
    'default': {
        'ENGINE': 'mysql',
        'NAME': 'scrapy',
        'USER': 'root',
        'PASSWORD': '123456+',
        'HOST': 'localhost',
        'PORT': 3306,
    }
}
ITEM_PIPELINES = {
    'dangdang.pipelines.MysqlPipeline': 302,
}
import scrapy
import pymysql

class MysqlPipeline:
    
    def __init__(self, host, user, password, database, port):
        self.host = host
        self.user = user
        self.password = password
        self.database = database
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host = crawler.settings.get('DATABASES')['default']['HOST'],
            user = crawler.settings.get('DATABASES')['default']['USER'],
            password = crawler.settings.get('DATABASES')['default']['PASSWORD'],
            database = crawler.settings.get('DATABASES')['default']['NAME'],
            port = crawler.settings.get('DATABASES')['default']['PORT'],
        )

    def open_spider(self, spider):
        self.db = pymysql.connect(
            host = self.host,
            port = self.port,
            user = self.user,
            passwd = self.password,
            database = self.database,
            charset = 'utf8',
        )
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        # Use a parameterized query instead of string formatting to avoid quoting/injection issues.
        sql = 'insert into dangdang(title, pic, author, publish, price, comment) values (%s, %s, %s, %s, %s, %s);'
        self.cursor.execute(sql, (item["title"], item["pic"], item["author"],
                                  item["publish"], item["price"], item["comment"]))
        self.db.commit()
        return item
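
  The pipeline above assumes the dangdang table already exists. The original post does not show the schema, so the following is only a sketch of a matching table, derived from the item fields and created once with pymysql:

import pymysql

# Assumed schema, based on the fields used by MysqlPipeline above.
ddl = """
CREATE TABLE IF NOT EXISTS dangdang (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    pic VARCHAR(255),
    author VARCHAR(255),
    publish VARCHAR(255),
    price VARCHAR(32),
    comment VARCHAR(64)
) DEFAULT CHARSET=utf8;
"""

db = pymysql.connect(host="localhost", port=3306, user="root",
                     passwd="123456+", database="scrapy", charset="utf8")
with db.cursor() as cursor:
    cursor.execute(ddl)
db.commit()
db.close()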

  (4) The Item field class.

import scrapy

class DangdangItem(scrapy.Item):
    title = scrapy.Field()
    pic = scrapy.Field()
    author = scrapy.Field()
    publish = scrapy.Field()
    price = scrapy.Field()
    comment = scrapy.Field()
    # image_urls = scrapy.Field()

  (5) The spider class.

import scrapy
from dangdang.items import DangdangItem

class DangSpider(scrapy.Spider):
    name = 'dang'
    allowed_domains = ['search.dangdang.com']
    start_urls = ['http://search.dangdang.com/?key=python&act=input&page_index=1']
    page = 1
    all_pages = 5

    def parse(self, response):
        dlist = response.selector.css("ul.bigimg li")
        for d in dlist:
            item = DangdangItem()
            # item['title'] = d.css("a::attr(title)").get()
            item['title'] = d.xpath(".//a/@title").get()
            item['pic'] = d.css("a img::attr(data-original)").get()
            item['pic'] = response.urljoin(item['pic'])
            item['author'] = d.xpath(".//p[@class='search_book_author']//span[1]//a[1]/text()").get()
            item['publish'] = d.xpath(".//p[@class='search_book_author']//span[3]/a/text()").get()
            item['price'] = d.css("p.price span::text").get()
            item['comment'] = d.css("p.search_star_line a::text").get()
            # item['image_urls'] = [item['pic']]
            yield item

        next_url = response.selector.css("li.next a::attr(href)").get()
        print("="*50)
        print(next_url)
        print("="*50)
        if next_url and (self.page < self.all_pages):
            self.page += 1
            url = response.urljoin(next_url)
            yield scrapy.Request(url=url, callback=self.parse)

10. Logging

  (1) The following command writes the log output to access.log.

scrapy crawl spider_name -s LOG_FILE=access.log

  (2) Logging settings. Scrapy's logging is built on Python's logging module and can be configured in settings.py with the following options:

LOG_ENABLED: default True; enables logging
LOG_ENCODING: default 'utf-8'; the encoding used for logging
LOG_FILE: default None; file name (relative to the current directory) to write log output to
LOG_LEVEL: default 'DEBUG'; the minimum level to log
LOG_STDOUT: default False; if True, all standard output (and errors) of the process is redirected to the log

For example, in settings.py:

LOG_FILE = "access.log"
LOG_LEVEL = "INFO"

  Scrapy uses five logging levels:

CRITICAL  # critical errors
ERROR     # regular errors
WARNING   # warning messages
INFO      # informational messages
DEBUG     # debugging messages

  (3) Logging a message at the WARNING level. The old scrapy.log module has been removed; use Python's logging module instead:

import logging

logging.warning("This is a warning")

  (4) Logging from inside a spider.

def my_parse(self, response):
    lilist = response.selector.css("li.gl-item")
    self.logger.info("found %d items", len(lilist))
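
  Outside a spider (for example in a pipeline or middleware, where self.logger is not available), use a module-level logger from Python's logging module; the pipeline below is a hypothetical example:

import logging

logger = logging.getLogger(__name__)

class DemoPipeline:
    def process_item(self, item, spider):
        # Messages go through Scrapy's logging configuration (LOG_LEVEL, LOG_FILE, ...).
        logger.info("processing an item from spider %s", spider.name)
        return item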