How powerful is the Scrapy framework, really?
First, of course, you need to have it installed.
Installation
You can install it with conda:
conda install -c conda-forge scrapy
Or install it from PyPI with pip:
pip install Scrapy
Scrapy depends on a few related libraries:
lxml
parsel
w3lib
twisted
cryptography and pyOpenSSL
If you find one of these libraries missing while using Scrapy, just install it and move on.
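If you ever want to install them explicitly, something like this should cover them (pip normally pulls them in automatically as Scrapy's dependencies):
pip install lxml parsel w3lib Twisted cryptography pyOpenSSL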
Getting started
To avoid breaking anything, you can create a virtual environment first.
Reference: Three ways to create a Python virtual environment (镰刀韭菜, CSDN blog)
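For example, the standard-library venv module is enough (just an illustration; the environment name is arbitrary):
python -m venv scrapy_env
source scrapy_env/bin/activate   # on Windows: scrapy_env\Scripts\activate
pip install Scrapy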
Say we want to create a qiushibaike crawler project; it goes like this:
Basic commands
scrapy startproject qiushibaike
cd qiushibaike
scrapy crawl qiushibaike
Basic files
Inside the project you will find some configuration files and predefined settings. Our spider code goes under the spiders directory; here let's create one called qiushibaike_spider.
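The layout that scrapy startproject generates looks roughly like this (qiushibaike_spider.py is the file we are about to add):
qiushibaike/
├── scrapy.cfg
└── qiushibaike/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── qiushibaike_spider.py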
In that file we need to subclass scrapy.Spider. To create a Spider you must inherit from scrapy.Spider and define the following three attributes:
name: identifies the Spider. It must be unique; you cannot give two different Spiders the same name.
start_urls: the list of URLs the Spider starts crawling from. The first pages fetched will come from this list, and subsequent URLs are extracted from the data in those initial responses.
parse(): a method of the Spider. When called, the Response object generated for each downloaded initial URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and generating Request objects for URLs that need further processing.
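A minimal sketch along those lines (the class name and URL are just placeholders for this project) looks like this:

import scrapy

class QiushibaikeSpider(scrapy.Spider):
    name = 'qiushibaike'   # unique spider name
    start_urls = [
        'https://www.qiushibaike.com/text/page/1/',
    ]

    def parse(self, response):
        # parse the response and extract data here
        pass

Instead of start_urls, you can also generate the initial requests yourself by overriding start_requests, which is what we do here: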
def start_requests(self):
    urls = [
        'https://www.qiushibaike.com/text/page/1/',
        'https://www.qiushibaike.com/text/page/2/',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
In start_requests we yield Request objects one by one (the method returns a generator), and the callback=self.parse argument tells Scrapy which method to call with each response once it has been downloaded.
The callback looks like this:
def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'qiushi-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)
Saving
All this does is save the crawled HTML into local files. To run the spider with Scrapy, use the following commands:
cd qiushibaike/
scrapy crawl qiushibaike
Maoyan movies example
Reference: Python Scrapy crawler framework in practice (biancheng.net)
Finally, let's use the Scrapy framework to build a complete example: scraping the Maoyan Top100 movie chart.
1) Create the project
scrapy startproject Maoyan100
# enter the project directory
cd Maoyan100
# create the spider file; note that the url must be the site's domain
scrapy genspider maoyan www.maoyan.com
2) Define the data structure
First, define the data structure to be scraped in items.py, as shown below:
import scrapy

class Maoyan100Item(scrapy.Item):
    name = scrapy.Field()
    star = scrapy.Field()
    time = scrapy.Field()
3) Write the spider file
Next, write the spider file maoyan.py; the code is as follows:
import scrapy
from Maoyan100.items import Maoyan100Item

class Maoyan100Spider(scrapy.Spider):
    # name identifies the spider
    name = 'maoyan'
    allowed_domains = ['maoyan.com']  # site domain
    start_urls = ['https://maoyan.com/board/4?offset=0']  # first url to scrape
    offset = 0  # query-string parameter

    # response is the response object for the URLs in start_urls
    def parse(self, response):
        # base xpath: the list of dd nodes that hold the movie info
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            # instantiate the Maoyan100Item class defined in items.py
            item = Maoyan100Item()
            item['name'] = dd.xpath('./a/@title').get().strip()  # .get() since Scrapy 1.6, formerly extract_first()
            item['star'] = dd.xpath('.//p[@class="star"]/text()').get().strip()
            item['time'] = dd.xpath('.//p[@class="releasetime"]/text()').get().strip()
            yield item
        if self.offset < 90:  # keep paginating until offset reaches 90
            self.offset += 10
            url = 'https://maoyan.com/board/4?offset=' + str(self.offset)
            # hand the url to the scheduler's queue;
            # scrapy.Request() sends the request and passes the response to the parse() callback automatically
            yield scrapy.Request(url=url, callback=self.parse)
4) Implement data storage
Reference: Operating MySQL from Python with pymysql (吴国娣, CSDN blog)
We implement storage by writing the pipeline file pipelines.py, which saves the scraped data into a MySQL database. The database and table need to exist beforehand; they were created in an earlier chapter, but a quick sketch is included below for completeness.
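If the table does not exist yet, a minimal sketch with pymysql could create it like this (the database name, column names, and sizes are assumptions; the only requirement is three columns in the order the pipeline inserts them: name, star, time). The connection parameters mirror the constants defined in the settings section below.

import pymysql

# connect without selecting a database so it can be created first
conn = pymysql.connect(host='localhost', user='root', password='123456', charset='utf8')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS mydatabase CHARACTER SET utf8")
cur.execute("USE mydatabase")
cur.execute("""
    CREATE TABLE IF NOT EXISTS movieinfo (
        name VARCHAR(200),
        star VARCHAR(300),
        time VARCHAR(100)
    )
""")
conn.commit()
cur.close()
conn.close()

With the table in place, the pipeline code is as follows: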
import pymysql
from Maoyan100.settings import *

class Maoyan100Pipeline(object):
    def process_item(self, item, spider):
        print(item['name'], item['star'], item['time'])
        return item  # return the item so that later pipelines receive it too

# pipeline that stores the scraped data in MySQL
class Maoyan100MysqlPipeline(object):
    def open_spider(self, spider):
        # runs once when the spider starts: open the database connection
        # the constants below must be defined in the settings file
        self.db = pymysql.connect(
            host=MYSQL_HOST,
            user=MYSQL_USER,
            password=MYSQL_PWD,
            database=MYSQL_DB,
            charset=MYSQL_CHARSET
        )
        self.cursor = self.db.cursor()

    # insert each item into the table
    def process_item(self, item, spider):
        ins = 'insert into movieinfo values(%s,%s,%s)'
        L = [
            item['name'], item['star'], item['time']
        ]
        self.cursor.execute(ins, L)
        self.db.commit()
        return item

    # runs once at the very end, after all data has been scraped
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print('close_spider() was called, the project has finished')
A common error
The traceback looks like this:
Traceback (most recent call last):
  File "C:\Users\韩东平\PycharmProjects\Maoyan100\Maoyan100\pipelines.py", line 10, in <module>
    from Maoyan100.settings import *
ModuleNotFoundError: No module named 'Maoyan100.settings'
This error means that the import statement in pipelines.py is failing: Python cannot find the Maoyan100.settings module.
To resolve it, try the following steps:
Make sure the Maoyan100 package is importable. Maoyan100 is the local project package, not something installed from PyPI, so in practice this usually means launching the crawl from the project root with scrapy crawl or run.py rather than executing pipelines.py on its own.
Check that settings.py sits in the correct directory inside the Maoyan100 package; it should live in the inner Maoyan100 folder, next to pipelines.py.
Double-check the import statement in pipelines.py; it should read from Maoyan100.settings import *.
If none of the above works, try adding the Maoyan100 project path to the PYTHONPATH environment variable, replacing /path/to/Maoyan100 with the actual path on your system:
export PYTHONPATH="/path/to/Maoyan100:$PYTHONPATH"
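An alternative that sidesteps the import entirely (a sketch, not part of the original tutorial) is to read the constants from the running crawler's settings inside open_spider, since every spider carries a settings attribute once the crawl starts:

import pymysql

class Maoyan100MysqlPipeline(object):
    def open_spider(self, spider):
        # read the MYSQL_* constants from Scrapy's settings instead of importing settings.py
        s = spider.settings
        self.db = pymysql.connect(
            host=s.get('MYSQL_HOST'),
            user=s.get('MYSQL_USER'),
            password=s.get('MYSQL_PWD'),
            database=s.get('MYSQL_DB'),
            charset=s.get('MYSQL_CHARSET')
        )
        self.cursor = self.db.cursor()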
5) Define a launcher file
Finally, define the project launcher file run.py; the code is as follows:
from scrapy import cmdline
# run the spider; -o specifies the output file format
cmdline.execute('scrapy crawl maoyan -o maoyan.csv'.split())  # run the project and export the data as a CSV file
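Then run it from the project root (the directory that contains scrapy.cfg):
python run.py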
Modify the configuration file
The settings.py that Scrapy generates by default looks like this:
# Scrapy settings for Maoyan100 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = "Maoyan100"
SPIDER_MODULES = ["Maoyan100.spiders"]
NEWSPIDER_MODULE = "Maoyan100.spiders"
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "Maoyan100 (+http://www.yourdomain.com)"
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
# "Accept-Language": "en",
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# "Maoyan100.middlewares.Maoyan100SpiderMiddleware": 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# "Maoyan100.middlewares.Maoyan100DownloaderMiddleware": 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# "scrapy.extensions.telnet.TelnetConsole": None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# "Maoyan100.pipelines.Maoyan100Pipeline": 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = "httpcache"
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
The following changes can be made in Scrapy's configuration file settings.py:
Add log output:
LOG_ENABLED = True        # enable logging
LOG_LEVEL = 'INFO'        # log level: DEBUG / INFO / WARNING / ERROR / CRITICAL
LOG_FILE = 'scrapy.log'   # log file
Activate the item pipelines:
ITEM_PIPELINES = {
    'Maoyan100.pipelines.Maoyan100Pipeline': 300,       # the lower the number, the earlier the pipeline runs
    'Maoyan100.pipelines.Maoyan100MysqlPipeline': 400,
}
Define the database constants (using the names the pipeline imports):
MYSQL_HOST = 'localhost'
MYSQL_USER = 'root'
MYSQL_PWD = '123456'
MYSQL_DB = 'mydatabase'
MYSQL_CHARSET = 'utf8'   # should match the charset of your table
Other common options:
ROBOTSTXT_OBEY = True   # whether to obey robots.txt rules
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'   # user agent
DOWNLOAD_DELAY = 3   # download delay in seconds