Running a Scrapy crawl on a Linux system

1. Install Scrapy
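
For example, with pip, pinning the Scrapy version used in this post (a plain install without the version pin works too):

python3 -m pip install scrapy==2.7.0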

2. Create a Scrapy project

cd /opt/app

python3 -m scrapy startproject A_stock
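
This generates Scrapy's standard project layout:

A_stock/
    scrapy.cfg
    A_stock/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py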

3. Create the spider; here we crawl historical A-share trading data

cd /opt/app/A_stock/A_stock/spiders

python3 -m scrapy genspider money163 quotes.money.163.com

To keep things simple, the stock codes live in a plain file and are read from there.
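
For example, stock_list might look like this (hypothetical contents, one code per line; the zhishu_ prefix marks an index, and plain codes are assumed to be Shenzhen-listed because the spider below prepends '1' to them):

zhishu_000001
zhishu_399001
000002
000651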

import scrapy
import time
from ..items import AStockItem


class Money163Spider(scrapy.Spider):
    name = 'money163'
    allowed_domains = ['quotes.money.163.com']
    start_urls = ['https://quotes.money.163.com']
    headers = {
        'Referer': 'http://quotes.money.163.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    # Columns requested from the chddata.html download service
    FIELDS = 'TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'

    def start_requests(self):
        # One stock code per line; strip the newline so it does not
        # leak into the URL
        with open('../stock_list') as stock_list:
            for line in stock_list:
                code = line.strip()
                if not code:
                    continue
                lsjysj_url = 'http://quotes.money.163.com/trade/lsjysj_' + code + '.html'
                yield scrapy.Request(url=lsjysj_url, headers=self.headers,
                                     callback=self.parse,
                                     meta={'current_stock_code': code})

    def parse(self, response):
        item = AStockItem()
        current_stock_code = response.meta['current_stock_code']
        # The listing page exposes the first and last trading dates as
        # hidden inputs; turn "2000-01-04" into "20000104"
        start_date = ''.join(response.xpath('//input[@name="date_start_type"]/@value').get().split('-'))
        end_date = ''.join(response.xpath('//input[@name="date_end_type"]/@value').get().split('-'))
        time.sleep(5)  # crude throttling; note this blocks the whole reactor
        # Build the CSV download URL; indices and ordinary stocks use
        # different code prefixes
        if current_stock_code == 'zhishu_000001':
            query_code = '0000001'
        elif current_stock_code in ('zhishu_399001', 'zhishu_399006'):
            query_code = '1' + current_stock_code.replace('zhishu_', '')
        else:
            query_code = '1' + current_stock_code
        urls = ('https://quotes.money.163.com/service/chddata.html?code=' + query_code
                + '&start=' + start_date + '&end=' + end_date + '&fields=' + self.FIELDS)
        item['file_urls'] = urls
        item['files'] = '/opt/app/A_stock/A_stock/date/' + start_date + '_' + end_date + '_' + current_stock_code
        yield item
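
The spider imports AStockItem from items.py, which this post does not show; a minimal sketch matching the two fields the spider sets would be:

import scrapy


class AStockItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()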

4. Register the pipeline in settings.py. The download directory is hard-coded to a Linux path; if you test on Windows, change it to a local path, and in either case create the directory in advance, otherwise you will get: FileNotFoundError: [Errno 2] No such file or directory:

ITEM_PIPELINES = {
    'A_stock.pipelines.AStockPipeline': 300,
}

FILES_STORE = '/opt/app/A_stock/A_stock/date'
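
Create the download directory ahead of time, for example:

mkdir -p /opt/app/A_stock/A_stock/date

As an aside, the time.sleep(5) in the spider blocks Scrapy's event loop; Scrapy's built-in throttling setting could take its place here in settings.py:

DOWNLOAD_DELAY = 5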

5. The pipelines.py file

One error I hit here: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)

The cause is that the Linux system is old enough that its CA certificates have expired; the simplest workaround is to import ssl alongside urllib and disable certificate verification:

import urllib.request
import ssl

# Work around the outdated CA bundle by skipping certificate verification
ssl._create_default_https_context = ssl._create_unverified_context


class AStockPipeline:

    def process_item(self, item, spider):
        # Fetch the CSV produced by the chddata.html service and save it
        # to the path the spider prepared
        url = item.get('file_urls')
        filename = item.get('files') + '.csv'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
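
Two asides: urllib.request.urlretrieve is documented as a legacy interface, so urllib.request.urlopen plus an ordinary file write is the forward-compatible alternative; and the proper fix for the certificate error is refreshing the system CA bundle (the ca-certificates package on CentOS), though on an end-of-life CentOS 6 box the unverified-context workaround above is the pragmatic choice.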

6. Finally, run this from the spiders directory:

python3 -m scrapy crawl money163
and the corresponding stock data will be downloaded.
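
Each stock ends up as a start_end_code.csv file under the date directory, for example (hypothetical dates):

/opt/app/A_stock/A_stock/date/20000104_20221101_000002.csv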

Versions

Python 3.7.1

pip 22.3

Scrapy 2.7.0

CentOS 6
