Getting to know Scrapy, with a Douban crawler example

Scrapy, a popular crawler framework

A framework: a small amount of code, fast scraping

Components: the request (downloader) module, the spider module (hand-written), the scheduler module, the pipeline module for saving data (hand-written), and the engine

Each component interacts only with the engine

Create a project

 """scrapy startproject myspider"""

Create a spider

"""scrapy genspider itcast itcast.cn"""

Run it

 """scrapy crawl itcast --nolog"""

Pagination: construct a Request object and determine the URL of the next page
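For example, inside a spider's parse() method (a minimal sketch; the XPath selector below is a placeholder, not from the original post):

        # follow the "next page" link if there is one
        next_url = response.xpath('//a[@class="next"]/@href').extract_first()  # placeholder selector
        if next_url is not None:
            # urljoin turns a relative href into an absolute URL
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)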

Carrying cookies directly

Simulating a login

Sending POST requests
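A hedged sketch of these three approaches; the URL, form field names and credentials below are placeholders rather than anything from the original post:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://example.com/login']  # placeholder login page

    def start_requests(self):
        # 1. carry cookies directly: pass cookies copied from the browser to the Request
        cookies = {'sessionid': 'xxxx'}  # placeholder cookie value
        yield scrapy.Request(self.start_urls[0], cookies=cookies, callback=self.parse)

    def parse(self, response):
        # 2./3. simulate a login by sending a POST request built from the login form
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        pass  # parse the logged-in page here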

Middleware (named after whichever component it sits closest to)

  Pre-process Request and Response objects

  Use proxy IPs

  Apply custom logic to requests

Middleware is written in middlewares.py; what you usually write is downloader middleware

Downloader middleware

Spider middleware

How to use middleware: define a middleware class in middlewares.py

The class implements the two methods process_request and process_response; a minimal skeleton is shown below
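A minimal skeleton, with an illustrative class name; it only takes effect after being added to DOWNLOADER_MIDDLEWARES in settings.py:

# middlewares.py
class DemoDownloaderMiddleware(object):

    def process_request(self, request, spider):
        # called for every outgoing request; returning None lets processing continue
        return None

    def process_response(self, request, response, spider):
        # called for every incoming response; must return a Response (or a new Request)
        return response

# settings.py
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.DemoDownloaderMiddleware': 543}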

What CrawlSpider does: it follows links automatically according to rules

Creating a CrawlSpider: scrapy genspider -t crawl tencent hr.tencent.com

Using rules in a CrawlSpider:

rules is a tuple or list containing Rule objects

A Rule describes one crawling rule and takes parameters such as LinkExtractor, callback and follow

LinkExtractor: the link extractor; it matches URLs with regular expressions or XPath

callback: the callback for responses to the URLs extracted by the link extractor; it is optional, and if omitted the response is not processed by any callback

follow: whether responses to the URLs extracted by the link extractor are themselves run through the rules again; True means they are, False means they are not

CrawlSpider is often used when the data to collect sits on a single page; a minimal sketch follows
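A minimal CrawlSpider sketch; the URL patterns and the extracted field are illustrative assumptions, not from the original post:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php']  # placeholder start URL

    rules = (
        # follow pagination links and keep applying the rules to those pages
        Rule(LinkExtractor(allow=r'start=\d+'), follow=True),
        # hand detail-page responses to parse_item, without extracting further links from them
        Rule(LinkExtractor(allow=r'position_detail\.php'), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//title/text()').extract_first()  # illustrative field
        yield item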

1. Write an ordinary spider

  (1). Create the project

  (2). Define the target (items)

  (3). Create the spider

  (4). Save the content

2. Convert it into a distributed spider

  (1). Modify the spider

      1. Import the distributed spider class from scrapy_redis

      2. Inherit from that class

      3. Comment out start_urls & allowed_domains

      4. Set redis_key to supply the start_urls

      5. Configure __init__ to obtain the allowed domains

  (2). Modify the settings file

      Copy in the scrapy_redis configuration parameters (see the sketch below)
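A hedged sketch of the conversion, assuming the scrapy_redis package; the redis_key name and Redis address are placeholders:

# spider: inherit from the scrapy_redis spider class instead of scrapy.Spider
from scrapy_redis.spiders import RedisSpider


class MovieSpider(RedisSpider):
    name = 'movie'
    # start_urls and allowed_domains are commented out
    # allowed_domains = ['douban.com']
    # start_urls = ['https://movie.douban.com/top250']
    redis_key = 'movie:start_urls'  # start URLs are pushed into this Redis list

    def __init__(self, *args, **kwargs):
        # allowed domains can be passed in at crawl time, e.g. -a domain=douban.com
        domain = kwargs.pop('domain', '')
        self.allowed_domains = list(filter(None, domain.split(',')))
        super(MovieSpider, self).__init__(*args, **kwargs)


# settings.py: point the scheduler, dupefilter and (optionally) a pipeline at Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = 'redis://127.0.0.1:6379'
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}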

Back to the point: below is a walkthrough of a small Douban Top 250 crawler built with Scrapy.

items.py

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # movie title

middlewares.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
from Douban.settings import USER_AGENT_LIST,PROXY_LIST
from scrapy import signals
import base64

# Downloader middleware that sets a random User-Agent on each request
class RandomUserAgent(object):

    def process_request(self, request, spider):

        # print(request.headers['User-Agent'])
        # pick a random User-Agent from the list defined in settings.py
        ua = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = ua


# Downloader middleware that routes each request through a random proxy from settings.py
class RandomProxy(object):

    def process_request(self, request, spider):
        proxy = random.choice(PROXY_LIST)
        print(proxy)
        if 'user_passwd' in proxy:
            # Encode the username/password; in Python 3 base64 works on bytes, so encode first
            b64_up = base64.b64encode(proxy['user_passwd'].encode())
            # Set the proxy authentication header
            request.headers['Proxy-Authorization'] = 'Basic ' + b64_up.decode()
            # Set the proxy
            request.meta['proxy'] = proxy['ip_port']
        else:
            # Set the proxy (no authentication required)
            request.meta['proxy'] = proxy['ip_port']

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanPipeline(object):
    def process_item(self, item, spider):
        return item
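The generated pipeline simply passes items through. A minimal sketch of a pipeline that writes items to a JSON-lines file; the class and file names are illustrative, and it would also need to be enabled in ITEM_PIPELINES in settings.py:

import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called once when the spider is opened
        self.file = open('movies.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item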

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

USER_AGENT_LIST = [
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14 ",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7 ",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre ",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.8 (KHTML, like Gecko) Beamrise/17.2.0.9 Chrome/17.0.939.0 Safari/535.8 ",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/18.6.872.0 Safari/535.2 UNTRUSTED/1.0 3gpp-gba"
]

PROXY_LIST =[
   {"ip_port": "123.207.53.84:16816", "user_passwd": "morganna_mode_g:ggc22qxp"},
   # {"ip_port": "122.234.206.43:9000"},
]

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'Douban.middlewares.MyCustomDownloaderMiddleware': 543,
   # 'Douban.middlewares.RandomUserAgent': 543,
   'Douban.middlewares.RandomProxy': 543,
}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'Douban.pipelines.DoubanPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The main program

movie.py

# -*- coding: utf-8 -*-
import scrapy
from Douban.items import DoubanItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        print(response.request.headers['User-Agent'])

        movie_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li/div/div[2]')

        # print(len(movie_list))

        for movie in movie_list:
            item = DoubanItem()

            item['name'] = movie.xpath('./div[1]/a/span[1]/text()').extract_first()

            yield item

        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').extract_first()
        if next_url is not None:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(url=next_url)

Finally, cd into the project directory

and enter

 """scrapy crawl movie --nolog"""

to run it.
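Since ITEM_PIPELINES is left commented out in settings.py, the yielded items are not saved anywhere; Scrapy's built-in feed export can be used instead to write them to a file:

 """scrapy crawl movie -o movies.json"""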
