A first look at Scrapy: settings, spider (with BeautifulSoup), items, pipelines

南星叨叨

Last modified 2022-01-29 15:43:35


First published 2022-01-29 15:00:29

Article link: https://blog.csdn.net/hans99812345/article/details/122743578


Scrapy is still one of the most popular crawling frameworks around. Here is how the official site describes it:

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

In short, Scrapy's selling points are that it is fast, simple, and highly extensible.

Creating the project

scrapy startproject game_spider


  D:\Code\new_scrapy_demo\game_spider

You can start your first spider with:
    cd game_spider
    scrapy genspider example example.com

D:\Code\new_scrapy_demo\game_spider>scrapy genspider learn_spider 4399.com

Directory structure

[Figure: directory structure of the game_spider project]

settings.py
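# Set by hand: messages at WARNING level and above are shown.
# Once the program is deployed to a server, raise the log level to ERROR.
LOG_LEVEL = "WARNING"  # CRITICAL, ERROR, WARNING, INFO, DEBUG

ROBOTSTXT_OBEY = False  # ignore robots.txt

ITEM_PIPELINES = {
    'game_spider.pipelines.GameSpiderPipeline': 300,
}
# Enable the pipeline; the number is its priority, and lower numbers run first.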
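
To make the priority rule concrete, here is a minimal sketch in plain Python that orders pipeline entries lowest number first. It only mimics the ordering rule and does not call Scrapy itself. `CleanupPipeline` and its value 200 are hypothetical, added purely for illustration.

```python
# Hypothetical settings: CleanupPipeline does not exist in the original project.
ITEM_PIPELINES = {
    'game_spider.pipelines.GameSpiderPipeline': 300,
    'game_spider.pipelines.CleanupPipeline': 200,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by priority, lowest number first."""
    return sorted(pipelines, key=pipelines.get)


for path in pipeline_order(ITEM_PIPELINES):
    print(ITEM_PIPELINES[path], path)
```

With these values the hypothetical `CleanupPipeline` would see each item before `GameSpiderPipeline` does.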

learn_spider.py


import scrapy
from bs4 import BeautifulSoup

from game_spider.items import GameSpiderItem  # PyCharm may flag this import as an error; it is a false positive and does not affect execution


class LearnSpiderSpider(scrapy.Spider):
    name = 'learn_spider'
    allowed_domains = ['4399.com']
    start_urls = ['http://www.4399.com/flash/gamehw.htm']  # crawling starts here

    def parse(self, response, **kwargs):
        soup = BeautifulSoup(response.text, 'lxml')
        li_list = soup.find('div', attrs={'class': 'w_box cf'}).find('ul', attrs={'class': 'tm_list'}).find_all('li')
        for li in li_list:
            # game_dict = {}
            game_name = li.find('a').text
            em_list = li.find_all('em')
            game_type = em_list[0].text
            game_time = em_list[1].text

            # Option 1: a plain dict
            # game_dict['game_name'] = game_name
            # game_dict['game_type'] = game_type
            # game_dict['game_time'] = game_time
            # yield game_dict

            # Option 2: a custom Item class, which is what items.py is for
            game_dict = GameSpiderItem()
            game_dict['name'] = game_name
            game_dict['category'] = game_type
            game_dict['date'] = game_time
            yield game_dict

# The yielded data is handed over to the pipeline.
# Note: parse may only yield Request objects, dicts, Item objects, or None.

items.py


# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GameSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    date = scrapy.Field()
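
As an alternative sketch, assuming Scrapy 2.2 or later (where dataclass objects are also accepted as items), the same three fields could be declared with a standard-library dataclass. `GameItem` is a hypothetical name and the sample values are made up; the original project uses `GameSpiderItem` as shown above.

```python
from dataclasses import dataclass, asdict


@dataclass
class GameItem:  # hypothetical name, not part of the original project
    name: str
    category: str
    date: str


# Made-up sample values, only to show how the item is built and inspected.
item = GameItem(name='Example Game', category='puzzle', date='2022-01-29')
print(asdict(item))
```

In the spider this would be yielded as `GameItem(name=game_name, category=game_type, date=game_time)` instead of filling the fields one by one.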

pipelines.py (the pipeline)


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class GameSpiderPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
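
Printing is enough for a first test. As a rough sketch of a next step, the pipeline below writes each item as one JSON line instead, using the `open_spider` / `close_spider` / `process_item` method names described in the Scrapy item pipeline documentation. The class name `JsonLinesPipeline` and the default file name `games.jsonl` are assumptions for illustration only.

```python
import json


class JsonLinesPipeline:  # hypothetical name
    def __init__(self, path='games.jsonl'):  # hypothetical output file
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item  # pass the item on to any later pipeline
```

To try it, the class would also need its own entry in `ITEM_PIPELINES`, just like `GameSpiderPipeline`.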

Running the spider

You can't just right-click and run this from the IDE; start it from the command line instead:
scrapy crawl <spider name>

scrapy crawl learn_spider
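
If you would still like something you can right-click and run, one possible workaround is a small launcher script that issues the same command for you. This is only a sketch: it assumes Scrapy is installed for the interpreter the IDE uses, that it can be invoked as `python -m scrapy`, and that the file sits next to scrapy.cfg. The function names are hypothetical.

```python
import subprocess
import sys


def build_crawl_command(spider_name):
    """Build the equivalent of `scrapy crawl <spider_name>` for this interpreter."""
    return [sys.executable, '-m', 'scrapy', 'crawl', spider_name]


def run_spider(spider_name):
    """Start the crawl in a child process and return its exit code."""
    return subprocess.run(build_crawl_command(spider_name)).returncode
```

Calling `run_spider('learn_spider')` at the bottom of such a file would then start the crawl from the IDE's Run button.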
