First steps with Scrapy: settings, spider (with BeautifulSoup), items, pipelines

Scrapy is still the most popular crawler framework on the planet. Quoting the official introduction:

An open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

Scrapy's selling points: fast, simple, and highly extensible.

Creating a project

scrapy startproject game_spider

The command reports where the project was created:

  D:\Code\new_scrapy_demo\game_spider

You can start your first spider with:
    cd game_spider
    scrapy genspider example example.com

Then generate a spider for the target site:

D:\Code\new_scrapy_demo\game_spider>scrapy genspider learn_spider 4399.com

Directory structure

startproject generates the standard layout:

game_spider/
    scrapy.cfg
    game_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            learn_spider.py

settings.py
# Added manually: only WARNING and above will be printed
# Once the program is deployed to a server, raise the log level to ERROR
LOG_LEVEL = "WARNING"  # CRITICAL, ERROR, WARNING, INFO, DEBUG

ROBOTSTXT_OBEY = False  # stop obeying robots.txt

ITEM_PIPELINES = {
    'game_spider.pipelines.GameSpiderPipeline': 300,
}
# Enable the pipeline; the trailing number is the priority (lower numbers run first)
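When several pipelines are enabled, the numbers decide the order they run in. A pure-Python sketch of that rule (the second pipeline name is hypothetical, not part of this project):

```python
# Sketch of how ITEM_PIPELINES priorities are interpreted (not Scrapy's
# actual internals): enabled pipelines run in ascending priority order.
ITEM_PIPELINES = {
    "game_spider.pipelines.GameSpiderPipeline": 300,
    "game_spider.pipelines.DedupePipeline": 100,  # hypothetical second pipeline
}

# Lower number = higher priority, so the dedupe pipeline would run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```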
learn_spider.py


import scrapy
from bs4 import BeautifulSoup

from game_spider.items import GameSpiderItem  # PyCharm may flag this import; it is a false positive and does not affect execution


class LearnSpiderSpider(scrapy.Spider):
    name = 'learn_spider'
    allowed_domains = ['4399.com']
    start_urls = ['http://www.4399.com/flash/gamehw.htm']  # crawling starts here

    def parse(self, response, **kwargs):
        soup = BeautifulSoup(response.text, 'lxml')
        li_list = soup.find('div', attrs={'class': 'w_box cf'}).find('ul', attrs={'class': 'tm_list'}).find_all('li')
        for li in li_list:
            # game_dict = {}
            game_name = li.find('a').text
            em_list = li.find_all('em')
            game_type = em_list[0].text
            game_time = em_list[1].text

            # Option 1: a plain dict
            # game_dict['game_name'] = game_name
            # game_dict['game_type'] = game_type
            # game_dict['game_time'] = game_time
            # yield game_dict

            # Option 2: a custom Item, which is what items.py is for
            game_dict = GameSpiderItem()
            game_dict['name'] = game_name
            game_dict['category'] = game_type
            game_dict['date'] = game_time
            yield game_dict

# Yielding hands the extracted data to the pipeline.
# Note: parse may only return Request objects, dicts, Item objects, or None.
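The parse logic above can be exercised outside Scrapy against a hand-written snippet. The HTML below is an assumption that merely mirrors the selectors used; the real 4399 markup may differ:

```python
from bs4 import BeautifulSoup

# Hand-written HTML mirroring the structure parse() expects (assumed, not
# copied from 4399.com).
html = """
<div class="w_box cf">
  <ul class="tm_list">
    <li><a href="/flash/1.htm">Demo Game</a><em>Puzzle</em><em>2021-01-01</em></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # html.parser avoids the lxml dependency
li = soup.find("div", attrs={"class": "w_box cf"}).find("ul", attrs={"class": "tm_list"}).find("li")
em_list = li.find_all("em")
print(li.find("a").text, em_list[0].text, em_list[1].text)
```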

items.py


# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GameSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    date = scrapy.Field()
pipelines.py


# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class GameSpiderPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
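The default pipeline above only prints each item. A sketch of a pipeline that persists items to CSV instead, using the open_spider/close_spider hooks Scrapy calls around a crawl (the file name games.csv is an arbitrary choice for this example):

```python
import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        # Called once when the crawl starts.
        self.f = open("games.csv", "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.f)

    def process_item(self, item, spider):
        self.writer.writerow([item["name"], item["category"], item["date"]])
        return item  # pass the item on to any lower-priority pipelines

    def close_spider(self, spider):
        # Called once when the crawl finishes.
        self.f.close()
```

It would be enabled the same way as GameSpiderPipeline, by adding its path to ITEM_PIPELINES with its own priority number.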

Running the spider

You cannot simply right-click and run this; start it from the command line with
scrapy crawl <spider_name>

scrapy crawl learn_spider