Using the Crawl Spider template

1. The Spider template

  • The spider template Scrapy uses by default is the basic template. The command that creates a spider file is: scrapy genspider dribbble dribbble.com; the command that lists the available spider templates is: scrapy genspider --list (it shows basic, crawl, csvfeed and xmlfeed).

  • To explicitly generate a spider from the crawl template inside a project, the command is: scrapy genspider -t crawl csdn www.csdn.net; the generated file looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # The generated stub returns an empty item; fill it with the fields you need.
        item = {}
        return item

2. The CrawlSpider class

  • CrawlSpider is a subclass of Spider designed to make whole-site crawling simpler; it is the spider commonly used for sites whose URLs follow a regular pattern. It builds on Spider and adds a few attributes of its own, most notably rules;

3. The rules list

  • Syntax: Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None). rules is a collection of Rule objects used to match the target pages and filter out the rest; a short sketch showing how these parameters fit together follows this list;

  • link_extractor: a LinkExtractor object that defines how links are extracted from the crawled pages;

  • callback: for each Response produced by a link extracted with link_extractor, the callable named by this parameter is invoked as the callback; it receives that response as its first argument;

  • cb_kwargs: a dict of keyword arguments passed to the callback as **kwargs;

  • follow: a boolean that controls whether links should keep being extracted from the pages crawled through this rule; it defaults to True when callback is None and to False otherwise;

  • process_links: names the spider method that will be called with the list of links obtained from link_extractor; it is mainly used to filter links;

  • process_request: names a function that is called for every Request generated by this Rule; it can modify the Request and must return either a Request or None;
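
A minimal sketch of how these parameters can be combined. The domain, URL patterns, and helper names (ExampleSpider, parse_article, drop_tracking_links, tag_request) are hypothetical and used only for illustration:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']          # hypothetical domain
    start_urls = ['https://example.com/']

    rules = (
        # Follow category pages without parsing them: no callback, follow=True.
        Rule(LinkExtractor(allow=r'/category/'), follow=True),
        # Parse article pages; cb_kwargs passes extra data to the callback,
        # process_links filters the extracted links, and process_request can
        # adjust (or drop) each generated Request.
        Rule(
            LinkExtractor(allow=r'/article/\d+'),
            callback='parse_article',
            cb_kwargs={'source': 'article_rule'},
            follow=False,
            process_links='drop_tracking_links',
            process_request='tag_request',
        ),
    )

    def parse_article(self, response, source):
        # `source` arrives here through cb_kwargs.
        yield {'url': response.url, 'source': source}

    def drop_tracking_links(self, links):
        # process_links receives the extracted link list and returns a filtered list.
        return [link for link in links if 'utm_' not in link.url]

    def tag_request(self, request, response=None):
        # process_request may modify the Request or return None to drop it
        # (newer Scrapy versions also pass the response the request came from).
        request.meta['seen_on'] = response.url if response else None
        return request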

4. LinkExtractors

  • The purpose of link extractors is, as the name suggests, to extract links. Every LinkExtractor has a single public method, extract_links(), which takes a Response object and returns a list of scrapy.link.Link objects;

  • A LinkExtractor is instantiated once, and its extract_links method is then called repeatedly with different responses to extract links;

Main parameters (a short standalone sketch using them follows this list):

  • allow: only URLs matching the regular expression(s) given here are extracted; if it is empty, everything matches;

  • deny: URLs matching this regular expression (or list of regular expressions) are never extracted;

  • allow_domains: the domains from which links will be extracted;

  • deny_domains: the domains from which links will never be extracted;

  • restrict_xpaths: XPath expressions that work together with allow to restrict the page regions links are extracted from;
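
A minimal, self-contained sketch of these parameters, calling extract_links() directly on an in-memory response. The page markup, URLs, and domains are hypothetical:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A tiny in-memory page to extract links from (hypothetical markup).
body = b"""
<html><body>
  <div id="content">
    <a href="https://blog.example.com/article/1">article 1</a>
    <a href="https://blog.example.com/login">login</a>
  </div>
  <div id="footer">
    <a href="https://ads.example.net/banner">an ad</a>
  </div>
</body></html>
"""
response = HtmlResponse(url='https://blog.example.com/', body=body, encoding='utf-8')

extractor = LinkExtractor(
    allow=r'/article/\d+',                   # only URLs matching this regex
    deny=r'/login',                          # these URLs are never extracted
    allow_domains=['blog.example.com'],      # only links to this domain
    deny_domains=['ads.example.net'],        # never links to this domain
    restrict_xpaths='//div[@id="content"]',  # only look inside this page region
)

# extract_links() takes a Response and returns a list of scrapy.link.Link objects.
for link in extractor.extract_links(response):
    print(link.url, link.text)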

5. Crawl CSDN articles and extract the URL and article title

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']

    # Rules that describe which links to extract and follow.
    rules = (
        # follow=True: after a page is crawled, keep extracting links from it and crawl on.
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        title = response.css('h1::text').get()
        print(title)
        print('-' * 100)
        return None
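
If you want to keep the extracted data rather than only print it, one option (a sketch, not part of the original post) is to have parse_item yield a plain dict, so that Scrapy's feed export can write it out, e.g. with scrapy crawl csdn -o articles.csv. The method below is a drop-in replacement for parse_item in the spider above:

    def parse_item(self, response):
        # Yield the URL and title as an item so the feed exporter can serialize them.
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
        }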

Reference: https://www.9xkd.com/user/plan-view.html?id=3716132715

Reposted from: https://my.oschina.net/u/4072026/blog/3069917

