PS: The following content is based on Chapter 5 (pp. 78-87) of 《Python 3 爬虫、数据清洗与可视化实战》 (Python 3 Web Scraping, Data Cleaning and Visualization in Practice).
Contents
1. Create a Scrapy project
- Create a new folder anywhere; it only exists to keep the project organized and is not the real project folder itself.
- Open cmd, change into that folder, and run the following command to create the Scrapy project:
scrapy startproject stockstar
- In the settings file, disable obeying the robots.txt protocol:
ROBOTSTXT_OBEY = False
- In PyCharm, mark the project directory as "Sources Root" so that modules inside it (such as items.py) can be imported directly, as the spider code below does.
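After the startproject command runs, the generated layout looks roughly like this (a sketch; the exact set of files depends on the Scrapy version):
stockstar/
    scrapy.cfg              # project configuration file
    stockstar/
        __init__.py
        items.py            # item definitions (edited in section 2)
        middlewares.py
        pipelines.py
        settings.py         # project settings (edited in section 3)
        spiders/
            __init__.py     # spider modules go in this package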
2. Define an Item container
- An Item is the container that holds the scraped data.
- First analyze the data on the target page and decide the structure of the data to be scraped.
- items.py is created automatically; add the following code to it:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class StockstarItemLoader(ItemLoader):
    """
    Custom ItemLoader used to populate the fields scraped by the spider.
    """
    default_output_processor = TakeFirst()


class StockstarItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    code = scrapy.Field()            # stock code
    abbr = scrapy.Field()            # stock short name
    last_trade = scrapy.Field()      # latest price
    chg_ratio = scrapy.Field()       # change ratio (%)
    chg_amt = scrapy.Field()         # change amount
    chg_ratio_5min = scrapy.Field()  # 5-minute change ratio
    volumn = scrapy.Field()          # trading volume
    turn_over = scrapy.Field()       # turnover
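A minimal sketch (not part of the project files) of what `default_output_processor = TakeFirst()` does: without it each loaded field is a list of extracted strings, with it only the first value is kept. The HTML row and the stock values here are made up for illustration, and the import assumes the script is run from the outer project directory:
from scrapy.selector import Selector
from stockstar.items import StockstarItem, StockstarItemLoader  # the classes defined above

# A fake table row, just to exercise the loader locally.
html = '<table><tr><td><a>600000</a></td><td><a>浦发银行</a></td></tr></table>'
row = Selector(text=html).css('tr')[0]

loader = StockstarItemLoader(item=StockstarItem(), selector=row)
loader.add_css('code', 'td:nth-child(1) a::text')
loader.add_css('abbr', 'td:nth-child(2) a::text')
print(loader.load_item())  # prints something like: {'abbr': '浦发银行', 'code': '600000'}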
3. Configure basic crawler settings in the settings file
# settings.py
from scrapy.exporters import JsonItemExporter

# By default the exported Chinese text shows up as hard-to-read Unicode escapes.
# Define a subclass that keeps the original characters
# (simply pass the parent class's ensure_ascii parameter as False).
class CustomJsonLinesItemExporter(JsonItemExporter):
    def __init__(self, file, **kwargs):
        super(CustomJsonLinesItemExporter, self).__init__(file, ensure_ascii=False, **kwargs)

# Enable the newly defined exporter class.
FEED_EXPORTERS = {
    'json': 'stockstar.settings.CustomJsonLinesItemExporter',
}
DOWNLOAD_DELAY = 0.25
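As an optional alternative (not part of the book's code): if your Scrapy version supports it (roughly 1.2 and later), the same readable output can be obtained without a custom exporter by adding a single setting:
# settings.py -- alternative: make feed exports write readable UTF-8
FEED_EXPORT_ENCODING = 'utf-8'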
4. Write the spider logic
# -*- coding: utf-8 -*-
import scrapy
from items import StockstarItem, StockstarItemLoader


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']
    start_urls = ['http://quote.stockstar.com/stock/ranklist_a_3_1_1.html']

    def parse(self, response):
        page = int(response.url.split("_")[-1].split(".")[0])  # extract the page number from the URL
        item_nodes = response.css('#datalist tr')
        for item_node in item_nodes:
            # Fill in each field defined in the items file.
            item_loader = StockstarItemLoader(item=StockstarItem(), selector=item_node)
            item_loader.add_css("code", "td:nth-child(1) a::text")
            item_loader.add_css("abbr", "td:nth-child(2) a::text")
            item_loader.add_css("last_trade", "td:nth-child(3) span::text")
            item_loader.add_css("chg_ratio", "td:nth-child(4) span::text")
            item_loader.add_css("chg_amt", "td:nth-child(5) span::text")
            item_loader.add_css("chg_ratio_5min", "td:nth-child(6) span::text")
            item_loader.add_css("volumn", "td:nth-child(7)::text")
            item_loader.add_css("turn_over", "td:nth-child(8)::text")
            stock_item = item_loader.load_item()
            yield stock_item
        if item_nodes:
            next_page = page + 1
            next_url = response.url.replace("{0}.html".format(page), "{0}.html".format(next_page))
            yield scrapy.Request(url=next_url, callback=self.parse)
The following explains in more detail where the code above comes from.
- First open the page you want to scrape and press F12 in Chrome to inspect the CSS selectors of the relevant elements.
My analysis here is not sufficient on its own; I recommend reading these two articles. The selectors can also be checked interactively in the Scrapy shell, as sketched below:
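A minimal sketch of verifying the selectors before putting them in the spider (the returned values depend on the live page). In cmd, inside the project directory, run:
scrapy shell "http://quote.stockstar.com/stock/ranklist_a_3_1_1.html"
Then, at the interactive prompt that opens:
response.css('#datalist tr td:nth-child(1) a::text').extract_first()   # first stock code in the table
response.css('#datalist tr td:nth-child(3) span::text').extract_first()  # its latest price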
5. Debugging and running the code
- Create a new main.py with the following code:
from scrapy.cmdline import execute
execute(["scrapy", "crawl", "stock", "-o", "items.json"])
Alternatively, open cmd, change into the project directory, and run:
scrapy crawl stock -o items.json
Either way, the scraped data is exported to items.json.
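As a quick sanity check, the exported file can be read back like this (a sketch, assuming the feed is a single JSON array as produced by the JsonItemExporter subclass configured in settings.py above):
import json

with open('items.json', encoding='utf-8') as f:
    stocks = json.load(f)

print(len(stocks))   # number of scraped rows
print(stocks[0])     # the first stock record as a dict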
At this point the project is basically done. Remember to learn CSS well!