Scrapy is still one of the most popular crawler frameworks around. Here is how the official site describes it:
An open source and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way.
Scrapy's key traits: fast, simple, and highly extensible.
Creating a project
scrapy startproject game_spider
D:\Code\new_scrapy_demo\game_spider
You can start your first spider with:
cd game_spider
scrapy genspider example example.com
D:\Code\new_scrapy_demo\game_spider>scrapy genspider learn_spider 4399.com
Directory structure
The directory structure is shown in the figure.
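The figure is not reproduced in this text, so here is a rough sketch of the files these two commands typically generate. The list reflects the default project template as I understand it and may differ between Scrapy versions; `EXPECTED_FILES` and `missing_files` are hypothetical names used only for illustration.

```python
from pathlib import Path

# Files typically created by `scrapy startproject game_spider`, plus the
# spider added by `scrapy genspider learn_spider 4399.com`.
# Assumption: default template; the exact set can vary by Scrapy version.
EXPECTED_FILES = [
    "scrapy.cfg",
    "game_spider/__init__.py",
    "game_spider/items.py",
    "game_spider/middlewares.py",
    "game_spider/pipelines.py",
    "game_spider/settings.py",
    "game_spider/spiders/__init__.py",
    "game_spider/spiders/learn_spider.py",
]


def missing_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_files` with the outer game_spider folder (the one containing scrapy.cfg) should return an empty list for a freshly generated project.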
settings.py
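# Added by hand: messages at WARNING level and above are shown
# Once the program is deployed to a server, raise the log level to ERROR
LOG_LEVEL = "WARNING"  # CRITICAL, ERROR, WARNING, INFO, DEBUG
ROBOTSTXT_OBEY = False  # do not obey robots.txt
ITEM_PIPELINES = {
    'game_spider.pipelines.GameSpiderPipeline': 300,
}
# Enables the pipeline. The number is its priority: the lower the number, the earlier it runs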
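To make the priority rule concrete, here is a minimal sketch of how two pipelines would be ordered. `CleanPipeline` and `SavePipeline` are made-up names that do not exist in this project, and `pipeline_order` is only an illustration of the rule, not Scrapy's own code.

```python
# Hypothetical settings fragment: both pipeline classes are illustrative only.
ITEM_PIPELINES = {
    "game_spider.pipelines.SavePipeline": 300,
    "game_spider.pipelines.CleanPipeline": 100,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted the way items pass through them:
    lower priority numbers first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(pipeline_order(ITEM_PIPELINES))
# ['game_spider.pipelines.CleanPipeline', 'game_spider.pipelines.SavePipeline']
```

So an item would be cleaned (100) before it is saved (300), regardless of the order in which the entries are written in the dict.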
learn_spider.py
import scrapy
from bs4 import BeautifulSoup
from game_spider.items import GameSpiderItem  # Ignore any error flagged here: it is a PyCharm false positive, and the program runs fine even if nothing is marked in the IDE


class LearnSpiderSpider(scrapy.Spider):
    name = 'learn_spider'
    allowed_domains = ['4399.com']
    start_urls = ['http://www.4399.com/flash/gamehw.htm']  # crawling starts here

    def parse(self, response, **kwargs):
        soup = BeautifulSoup(response.text, 'lxml')
        li_list = soup.find('div', attrs={'class': 'w_box cf'}).find('ul', attrs={'class': 'tm_list'}).find_all('li')
        for li in li_list:
            # game_dict = {}
            game_name = li.find('a').text
            em_list = li.find_all('em')
            game_type = em_list[0].text
            game_time = em_list[1].text
            # Option 1: a plain dict
            # game_dict['game_name'] = game_name
            # game_dict['game_type'] = game_type
            # game_dict['game_time'] = game_time
            # yield game_dict
            # Option 2: a custom Item, which is what items.py is for
            game_dict = GameSpiderItem()
            game_dict['name'] = game_name
            game_dict['category'] = game_type
            game_dict['date'] = game_time
            yield game_dict
            # The yield above submits the extracted data to the pipeline.
            # Note: only Request objects, dicts, Item objects, or None may be returned here
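The hand-off from `parse` to the pipeline can be pictured with a small pure-Python simulation. This is only a simplified sketch of the idea, not how Scrapy's engine actually works: `run`, `PrintPipeline`, and the sample row are all made up for illustration.

```python
def parse(rows):
    """Stand-in for a spider callback: yield one dict per extracted row."""
    for name, category, date in rows:
        yield {"name": name, "category": category, "date": date}


class PrintPipeline:
    """Mirrors the shape of a pipeline: receive an item, return it."""

    def process_item(self, item, spider):
        print(item)
        return item


def run(rows, pipelines):
    """Feed every yielded item through each pipeline in turn."""
    collected = []
    for item in parse(rows):
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider=None)
        collected.append(item)
    return collected


# Made-up sample row, not real data from the site.
run([("Sample Game", "puzzle", "2021-01-01")], [PrintPipeline()])
```

Because `parse` is a generator, each item reaches the pipelines as soon as it is yielded, rather than after the whole page has been processed.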
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class GameSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    category = scrapy.Field()
    date = scrapy.Field()
pipelines.py (the item pipeline)
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
class GameSpiderPipeline:
    def process_item(self, item, spider):
        print(item)
        return item
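Printing is enough for a first test, but a pipeline usually stores the items somewhere. Below is a minimal sketch of a pipeline that writes each item as one JSON line, using the optional `open_spider` and `close_spider` hooks alongside `process_item`. The class name `JsonLinesPipeline` and the file name `games.jsonl` are assumptions for illustration, and the class would still need its own entry in ITEM_PIPELINES before Scrapy would call it.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: append every item to a JSON Lines file."""

    def __init__(self, path="games.jsonl"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for Item objects alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline, if any
```

Opening the file once per crawl rather than once per item avoids reopening it for every game in the list.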
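Starting the spider
A spider cannot be started by right-clicking the file and choosing Run. Use the scrapy command instead:
scrapy crawl <spider name>
scrapy crawl learn_spider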
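If you would still like something you can run from the IDE, one possible workaround is a small launcher script that calls the same command. This sketch assumes Scrapy is installed for the interpreter that runs the script and that `project_dir` is the folder containing scrapy.cfg; `crawl_command` and `run_spider` are hypothetical helper names.

```python
import subprocess
import sys


def crawl_command(spider_name):
    """Build the equivalent of `scrapy crawl <spider name>` for this interpreter."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir="."):
    """Run the crawl as a child process and return its exit code."""
    completed = subprocess.run(crawl_command(spider_name), cwd=project_dir)
    return completed.returncode
```

Saving this next to scrapy.cfg and adding a call such as `run_spider("learn_spider")` at the bottom gives a file that can be run directly.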