1. Install Scrapy
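For example, using pip for the same python3 interpreter used in the steps below (assuming pip is already available):
python3 -m pip install scrapy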
2. Create a Scrapy project
cd /opt/app
python3 -m scrapy startproject A_stock
3. Create the spider file; here we crawl historical A-share trading data
cd /opt/app/A_stock/A_stock/spiders
python3 -m scrapy genspider money163 quotes.money.163.com
To keep things simple, the stock codes are written in a plain file and the spider reads them from there.
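The original does not show the file, but judging from how the spider below builds its URLs, stock_list would hold one code per line, with zhishu_ entries for indexes (the exact codes here are only an illustrative assumption):
zhishu_000001
zhishu_399001
000002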
import scrapy
import time
from ..items import AStockItem

class Money163Spider(scrapy.Spider):
    name = 'money163'
    allowed_domains = ['quotes.money.163.com']
    start_urls = ['https://quotes.money.163.com']
    headers = {
        'Referer': 'http://quotes.money.163.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }

    def start_requests(self):
        # One stock code per line; strip the trailing newline so it does not leak into the URLs
        with open('../stock_list') as stock_list:
            for line in stock_list:
                code = line.strip()
                if not code:
                    continue
                lsjysj_url = 'http://quotes.money.163.com/trade/lsjysj_' + code + '.html'
                yield scrapy.Request(url=lsjysj_url, headers=self.headers, callback=self.parse,
                                     meta={'current_stock_code': code})

    def parse(self, response):
        item = AStockItem()
        current_stock_code = response.meta['current_stock_code']
        # The listing date and today's date are pre-filled in the page's date-range inputs
        start_date = ''.join(response.xpath('//input[@name="date_start_type"]/@value').get().split('-'))
        end_date = ''.join(response.xpath('//input[@name="date_end_type"]/@value').get().split('-'))
        time.sleep(5)
        # Build the CSV download URL; indexes and ordinary stocks use different code prefixes
        fields = 'TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP'
        if current_stock_code == 'zhishu_000001':
            urls = 'https://quotes.money.163.com/service/chddata.html?code=0000001&start=' + start_date + '&end=' + end_date + '&fields=' + fields
        elif current_stock_code in ('zhishu_399001', 'zhishu_399006'):
            urls = 'https://quotes.money.163.com/service/chddata.html?code=1' + current_stock_code.replace('zhishu_', '') + '&start=' + start_date + '&end=' + end_date + '&fields=' + fields
        else:
            urls = 'https://quotes.money.163.com/service/chddata.html?code=1' + current_stock_code + '&start=' + start_date + '&end=' + end_date + '&fields=' + fields
        item['file_urls'] = urls
        item['files'] = '/opt/app/A_stock/A_stock/date/' + start_date + '_' + end_date + '_' + current_stock_code
        yield item
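The spider imports AStockItem from items.py; that file is not shown in the original, but a minimal version covering the two fields used above would look like this:
import scrapy

class AStockItem(scrapy.Item):
    # URL of the CSV download built in parse()
    file_urls = scrapy.Field()
    # Local path (without the .csv extension) where the pipeline will save the file
    files = scrapy.Field()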
4. Register the pipeline in settings.py. If you were testing on Windows, change this to the download path used on Linux and create the directory in advance, otherwise you will get: FileNotFoundError: [Errno 2] No such file or directory
ITEM_PIPELINES = {
    'A_stock.pipelines.AStockPipeline': 300,
}
FILES_STORE = '/opt/app/A_stock/A_stock/date'
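Since the custom pipeline writes straight into this directory, create it first (same path as above):
mkdir -p /opt/app/A_stock/A_stock/date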
5. pipelines.py
Here I hit an error: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1051)
The cause is that the Linux system is so old that its CA certificates have expired; the simplest workaround is to import ssl alongside urllib and set:
ssl._create_default_https_context = ssl._create_unverified_context
import urllib.request
import ssl

# Work around the expired system CA certificates (disables certificate verification)
ssl._create_default_https_context = ssl._create_unverified_context

class AStockPipeline:
    def process_item(self, item, spider):
        # Download the CSV built by the spider and save it under the path prepared above
        url = item.get('file_urls')
        filename = item.get('files') + '.csv'
        urllib.request.urlretrieve(url=url, filename=filename)
        return item
6. Finally, run this from the spiders directory:
python3 -m scrapy crawl money163
and the corresponding stock data will be downloaded.
Versions
Python 3.7.1
Pip 22.3
Scrapy 2.7.0
CentOS 6