maoyan.py
import scrapy


class MaoyanSpider(scrapy.Spider):
    name = 'maoyan'
    allowed_domains = ['maoyan.com']
    start_urls = ['https://maoyan.com/films?showType=3']

    def parse(self, response):
        # Movie titles: the text of each <a> inside the title div.
        names = response.xpath(
            "//div[@class='channel-detail movie-item-title']/a/text()").getall()
        # Scores: string(.) concatenates the child text nodes of each
        # score div, since the integer and fraction parts sit in separate tags.
        scores = [
            score.xpath('string(.)').get()
            for score in response.xpath(
                "//div[@class='channel-detail channel-detail-orange']")
        ]
        for name, score in zip(names, scores):
            yield {'name': name, 'score': score}
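Why string(.) instead of /text()? Maoyan splits each score across child tags, so a plain /text() on the div would miss or fragment the value. A minimal standalone sketch using parsel (the selector library behind Scrapy's response.xpath), with assumed sample HTML, shows the trick:

# Standalone demo of the string(.) XPath trick; the HTML below is an
# assumed simplification of Maoyan's score markup.
from parsel import Selector

html = '''
<div class="channel-detail channel-detail-orange">
    <i class="integer">9.</i><i class="fraction">5</i>
</div>
'''
sel = Selector(text=html)

# string(.) concatenates every descendant text node of the div,
# merging the integer and fraction parts into one value.
score = sel.xpath(
    "//div[@class='channel-detail channel-detail-orange']"
).xpath('string(.)').get()
print(score.strip())  # -> 9.5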
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json


class MyfristPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('movie.text', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line (JSON Lines).
        self.file.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
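This pipeline works because the spider yields plain dicts, which json.dumps can serialize directly. If the project ever switches to scrapy.Item objects, a variant built on ItemAdapter (already imported by the template) handles both; the class name JsonLinesPipeline below is hypothetical:

import json
from itemadapter import ItemAdapter

class JsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open('movie.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter accepts dicts, scrapy.Item objects, dataclasses, etc.
        line = json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

Whichever pipeline you use must be registered in ITEM_PIPELINES, as settings.py does below.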
settings.py
# Scrapy settings for myfrist project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'myfrist'
SPIDER_MODULES = ['myfrist.spiders']
NEWSPIDER_MODULE = 'myfrist.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'myfrist.middlewares.MyfristSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'myfrist.middlewares.MyfristDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'myfrist.pipelines.MyfristPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
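Most of the template stays commented out. If Maoyan starts throttling or banning the crawl, one conservative combination to uncomment is sketched here; the values are assumptions to tune, not tested recommendations:

# A cautious rate-limiting setup (assumed values) for settings.py:
DOWNLOAD_DELAY = 1                     # wait 1s between requests to the same site
AUTOTHROTTLE_ENABLED = True            # adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for one request in flight per server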
main.py
from scrapy.cmdline import execute
execute('scrapy crawl maoyan'.split())
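If you prefer Scrapy's programmatic API over scrapy.cmdline, an equivalent main.py sketch using the documented CrawlerProcess entry point:

# Alternative main.py: start the 'maoyan' spider programmatically.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl('maoyan')  # spider name, as declared in MaoyanSpider.name
process.start()          # blocks until the crawl finishes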
File placement rules
Create a Scrapy project:
scrapy startproject <project name>
Create a spider file:
scrapy genspider qidian (spider file name) qidian.com (domain to crawl)
Run the spider:
In the terminal, run scrapy crawl maoyan (the spider name)
Or create and run a main.py file:
from scrapy.cmdline import execute
execute('scrapy crawl maoyan'.split())
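After the crawl finishes, movie.text contains one JSON object per line, of the form {"name": "...", "score": "..."}; ensure_ascii=False in the pipeline keeps the Chinese titles human-readable.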