Scraping Stock Data with Python (Two Methods)


weixin_39531378


Article link: https://blog.csdn.net/weixin_39531378/article/details/111454774


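This article presents two ways to scrape stock data with Python. The first uses requests, BeautifulSoup and regular expressions to collect stock codes from Eastmoney (东方财富网) and then fetch the details of each stock from Baidu Stocks (百度股票). The second uses the Scrapy framework together with regular expressions. Both involve HTML parsing and data storage.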


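Analyzing the stock HTML pages:

Eastmoney lists the stocks at:

http://quote.eastmoney.com/stocklist.html

In the page source, each stock is a link; the link texts look like this: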

R001(201008) R004(201010) RC001(202001)

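The stock code can be extracted from the href of each link.

For the details of an individual stock, look it up on Baidu Stocks at a URL of the form:

https://gupiao.baidu.com/stock/<stock code>.html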
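The two steps above (pull a code out of an href, then build the detail URL) can be sketched as follows. The sample hrefs are made up for illustration and only assume that a stock link contains `sh` or `sz` followed by six digits, which is the pattern the article's code searches for; `extract_codes` is a hypothetical helper name.

```python
import re

# Hypothetical hrefs, assumed to have the shape used on the list page.
sample_hrefs = [
    "http://quote.eastmoney.com/sh201008.html",
    "http://quote.eastmoney.com/sz300388.html",
    "http://quote.eastmoney.com/center/list.html",  # no stock code, so it is skipped
]

STOCK_CODE = re.compile(r"s[hz]\d{6}")  # sh or sz followed by six digits
DETAIL_URL = "https://gupiao.baidu.com/stock/"


def extract_codes(hrefs):
    """Return the first sh/sz code found in each href, skipping hrefs without one."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match:
            codes.append(match.group(0))
    return codes


codes = extract_codes(sample_hrefs)
detail_urls = [DETAIL_URL + code + ".html" for code in codes]
print(codes)
print(detail_urls)
```

Using `search` and testing the result avoids relying on an exception to skip links that carry no code.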

In the source of an individual stock page, the text content looks like this:

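国祯环保 (300388)
已休市 (market closed) 2017-09-29 15:00:03

Units: 手 = lots, 万 = 10^4, 亿 = 10^8.

今开 (open): 19.92
成交量 (volume): 8917手
最高 (high): 20.15
涨停 (limit up): 21.96
内盘 (sell-side volume): 4974手
成交额 (turnover): 1786.10万
委比 (order ratio): -50.69%
流通市值 (float market cap): 59.98亿
市盈率 MRQ (P/E, MRQ): 50.59
每股收益 (earnings per share): 0.20
总股本 (total shares): 3.06亿
昨收 (previous close): 19.96
换手率 (turnover rate): 0.30%
最低 (low): 19.92
跌停 (limit down): 17.96
外盘 (buy-side volume): 3943手
振幅 (amplitude): 1.15%
量比 (volume ratio): 0.11
总市值 (total market cap): 61.35亿
市净率 (P/B): 3.91
每股净资产 (net assets per share): 5.14
流通股本 (float shares): 2.99亿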

The stock name sits in the a tag with class="bets-name"; all the other data are in dt (label) and dd (value) tags.
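To make that layout concrete, here is a minimal sketch that pairs dt labels with dd values. It uses the standard library's `html.parser` rather than bs4 so that it runs without extra packages, and the HTML fragment is an assumed, simplified stand-in for the real page (the article does not reproduce the markup). `StockBetsParser` is a hypothetical name.

```python
from html.parser import HTMLParser

# Assumed, simplified fragment; the real page markup is not shown in the article.
SAMPLE = """
<div class="stock-bets">
  <a class="bets-name" href="/stock/sz300388.html">
    国祯环保 (<span>300388</span>)
  </a>
  <dl><dt>今开</dt><dd>19.92</dd></dl>
  <dl><dt>成交量</dt><dd>8917手</dd></dl>
</div>
"""


class StockBetsParser(HTMLParser):
    """Collect the text of a.bets-name and of every dt/dd element."""

    def __init__(self):
        super().__init__()
        self.current = None  # kind of element we are currently inside, if any
        self.name_parts = []
        self.keys = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("class") == "bets-name":
            self.current = "name"
        elif tag == "dt":
            self.current = "dt"
            self.keys.append("")
        elif tag == "dd":
            self.current = "dd"
            self.values.append("")

    def handle_endtag(self, tag):
        if tag in ("a", "dt", "dd"):
            self.current = None

    def handle_data(self, data):
        if self.current == "name":
            self.name_parts.append(data)
        elif self.current == "dt":
            self.keys[-1] += data.strip()
        elif self.current == "dd":
            self.values[-1] += data.strip()


parser = StockBetsParser()
parser.feed(SAMPLE)

# Same idea as taking the first whitespace-separated token of the link text.
info = {"股票名称": "".join(parser.name_parts).split()[0]}
info.update(zip(parser.keys, parser.values))
print(info)
```

Collecting the labels and values in two parallel lists and zipping them mirrors the index-based pairing of dt and dd elements.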

Method 1: Using the bs4 library and regular expressions

import re

import requests
from bs4 import BeautifulSoup


# Optimization: passing the encoding explicitly saves the time requests
# would otherwise spend guessing it.
def getHTMLText(url, code='UTF-8'):
    try:
        # The timeout keeps one stalled request from hanging the whole run
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""


def getStockList(url, stockList):
    html = getHTMLText(url, 'GB2312')
    soup = BeautifulSoup(html, 'html.parser')
    aInformation = soup.find_all('a')
    for ainfo in aInformation:
        try:
            # A stock code is sh or sz followed by six digits
            stockList.append(re.findall(r'[s][hz]\d{6}', ainfo.attrs['href'])[0])
        except (KeyError, IndexError):
            # The link has no href, or the href contains no stock code
            continue


def getStockInformation(detailUrl, outputFile, stockList):
    count = 0
    for name in stockList:
        count = count + 1
        stockUrl = detailUrl + name + '.html'
        html = getHTMLText(stockUrl)
        try:
            if html == "":
                continue
            stockDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockinfo = soup.find('div', attrs={'class': 'stock-bets'})
            stockname = stockinfo.find('a', attrs={'class': 'bets-name'})
            # When a tag contains nested tags, .text returns the full text,
            # whereas .string may be None
            stockDict["股票名称"] = stockname.text.split()[0]
            stockKey = stockinfo.find_all('dt')
            stockValue = stockinfo.find_all('dd')
            for i in range(len(stockKey)):
                stockDict[stockKey[i].string] = stockValue[i].string
            # \r moves the cursor back to the start of the line; end='' suppresses the newline
            print("\r{:5.2f}%".format(count / len(stockList) * 100), end='')
            # Append mode 'a'
            with open(outputFile, 'a', encoding='utf-8') as f:
                f.write(str(stockDict) + '\n')
        except Exception:
            # Pages without the expected structure are skipped; progress is still reported
            print("\r{:5.2f}%".format(count / len(stockList) * 100), end='')
            continue


def main():
    listUrl = 'http://quote.eastmoney.com/stocklist.html'
    detailUrl = 'https://gupiao.baidu.com/stock/'
    outputFile = 'C:/Users/Administrator/Desktop/out.txt'
    stockList = []
    getStockList(listUrl, stockList)
    getStockInformation(detailUrl, outputFile, stockList)


if __name__ == '__main__':
    main()
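Method 1 stores each stock as the `str()` of a dict, one per line. A file in that format can be read back with `ast.literal_eval`, as the sketch below shows. The records and the temporary file are invented for illustration, and `load_records` is a hypothetical helper, not part of the article's code.

```python
import ast
import os
import tempfile

# Made-up records in the shape the scraper writes.
records = [
    {"股票名称": "国祯环保", "今开": "19.92", "最高": "20.15"},
    {"股票名称": "sample", "今开": None},  # .string can be None when a tag has nested tags
]

# Write the records the same way: str(dict) plus a newline, in append mode.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(path, "a", encoding="utf-8") as f:
    for record in records:
        f.write(str(record) + "\n")


def load_records(path):
    """Parse one Python dict literal per non-empty line."""
    loaded = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                loaded.append(ast.literal_eval(line))
    return loaded


loaded = load_records(path)
print(loaded)
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice than `eval` for reading such a file back.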

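Method 2: Using the Scrapy framework and the regular expression library

(1) Create the project and the Spider template (saved as stocks.py)

In the command line, change to: E:\PythonProject\BaiduStocks

Enter: scrapy startproject BaiduStocks to create the Scrapy project

Enter: scrapy genspider stocks baidu.com to create the spider template. baidu.com is the domain the spider is restricted to crawling; simply delete that restriction in stocks.py.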

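(2) Write the spider (the stocks.py file)

CSS selectors return the selected tag elements. Calling extract() converts those elements to strings, which can then be processed with regular expressions.

The regular expression in detail:

国祯环保 (300388)

re.findall('.*\(', stockname)[0].split()[0] + '('+re.findall('\>.*\
(the rest of this expression is truncated in the source)

Result: 国祯环保(300388)

Because '(' is a metacharacter in regular expression syntax, it has to be escaped.

Since '.' does not match a newline, the pattern is matched line by line. The match returns [' 国祯环保 ('], and split() then splits on whitespace, giving ['国祯环保', '('].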
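The tail of the expression above is cut off in the source, so the sketch below shows only one possible way to reach the stated result 国祯环保(300388). Both the markup in `stockname` and the second half of the expression (`>.*<` followed by slicing) are assumptions made for illustration, not a recovery of the original line.

```python
import re

# Assumed shape of the extracted .bets-name element: the name and code sit on
# their own line, with the code wrapped in a nested tag.
stockname = (
    '<a class="bets-name" href="/stock/sz300388.html">\n'
    '            国祯环保 (<span>300388</span>)\n'
    '        </a>'
)

# First half, as described in the article: '.' stops at newlines, so the match
# is the text of the one line that contains '(' up to and including it.
matched = re.findall(r'.*\(', stockname)
name = matched[0].split()[0]

# Hypothetical second half: take whatever lies between '>' and the last '<'
# on a line, then strip those two delimiters.
code = re.findall(r'>.*<', stockname)[0][1:-1]

full_name = name + '(' + code + ')'
print(full_name)
```

Raw strings (`r'...'`) keep the backslash escapes intact without relying on Python passing unknown escape sequences through.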

# -*- coding: utf-8 -*-
import re

import scrapy


class StocksSpider(scrapy.Spider):
    name = 'stocks'
    start_urls = ['http://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        count = 0
        # Each item returned by extract() is the HTML string of one a tag
        for href in response.css('a').extract():
            try:
                # Stop after examining 300 links
                if count == 300:
                    break
                count = count + 1
                stockname = re.findall(r'[s][hz]\d{6}', href)[0]
                stockurl = 'https://gupiao.baidu.com/stock/' + stockname + '.html'
                yield scrapy.Request(url=stockurl, headers={"User-Agent": "Chrome/10"}, callback=self.stock_parse)
            except IndexError:
                # This link contains no stock code
                continue

    def stock_parse(self, response):
        stockDict = {}
        # Select the element with class="stock-bets"
        stockinfo = response.css('.stock-bets')
        # Convert the selected tags to a list of strings and take the first one
        stockname = stockinfo.css('.bets-name').extract()[0]
        keyList = stockinfo.css('dt').extract()
        valueList = stockinfo.css('dd').extract()
        # NOTE: the line that sets the stock name is truncated in the source.
        # It is kept verbatim as a comment, so '股票名称' is not set here:
        # stockDict['股票名称'] = re.findall('.*\(', stockname)[0].split()[0] + '('+re.findall('\>.*\
        for i in range(len(keyList)):
            # '>.*' matches from the first '>' to the end of the line; [1:-5]
            # drops that '>' and the 5-character closing tag (</dt> or </dd>)
            stockkey = re.findall(r'>.*', keyList[i])[0][1:-5]
            stockvalue = re.findall(r'>.*', valueList[i])[0][1:-5]
            stockDict[stockkey] = stockvalue
        yield stockDict
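The key and value extraction in stock_parse relies on the pattern '>.*' plus the slice [1:-5]. The sketch below isolates that step on made-up element strings to show what it returns and where it breaks down; `inner_text` is a hypothetical helper name.

```python
import re


def inner_text(element_html):
    """Text between the opening tag and a 5-character closing tag such as </dt> or </dd>."""
    # '>.*' matches from the first '>' to the end of that line.
    first = re.findall(r'>.*', element_html)[0]
    # Drop the leading '>' and the trailing closing tag.
    return first[1:-5]


key = inner_text('<dt>今开</dt>')
value = inner_text('<dd>19.92</dd>')

# Limitation: if the text is on a different line from the opening tag, the
# match ends at the first newline and the slice leaves nothing.
multiline = inner_text('<dd>\n19.92\n</dd>')

print(key, value, repr(multiline))
```

The slice only works for single-line elements whose closing tag is exactly five characters long. Scrapy's `::text` selector suffix, for example `response.css('dt::text')`, is an alternative that does not depend on either condition.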

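(3) Write the pipeline (the pipelines.py file)

Scrapy generates the item-processing class BaidustocksPipeline automatically. Instead of using it, we add a new class, BaidustocksinfoPipeline, and write the item-processing functions in it.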

# -*- coding: utf-8 -*-

# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item


class BaidustocksinfoPipeline(object):
    # Runs when the spider is opened
    def open_spider(self, spider):
        self.f = open(r'E:\PythonProject\BaiduStocks\BaiduStocks\asdqwe.txt', 'a', encoding='utf-8')

    # Runs when the spider is closed
    def close_spider(self, spider):
        self.f.close()

    # Processes each item
    def process_item(self, item, spider):
        try:
            self.f.write(str(item) + '\n')
        except Exception:
            # A failed write is ignored so that the crawl continues
            pass
        return item

Then modify the configuration file settings.py:

ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksinfoPipeline': 300,
}
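Once the pipeline is registered, Scrapy calls open_spider once at the start, process_item for every item the spider yields, and close_spider once at the end. The sketch below exercises the same three hooks by hand, without Scrapy, to make that order visible. `FileWriterPipeline`, its `path` parameter, the temporary file and the sample item are all assumptions for illustration.

```python
import os
import tempfile


class FileWriterPipeline:
    """Stand-in with the same three hooks; the output path is a parameter here."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Called once, before any item is processed
        self.f = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once, after the last item
        self.f.close()

    def process_item(self, item, spider):
        # Called for every item; returning it passes it on to later pipelines
        self.f.write(str(item) + "\n")
        return item


path = os.path.join(tempfile.mkdtemp(), "items.txt")
item = {"股票名称": "国祯环保(300388)", "今开": "19.92"}

pipeline = FileWriterPipeline(path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Opening the file once in open_spider, rather than once per item, keeps a single handle for the whole crawl and guarantees it is closed when the spider finishes.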

(4) Run the spider: scrapy crawl stocks
