Crawler Project 3 - Scraping Stock Data

Steps

  1. Scrape the stock names and codes. I use the Gucheng site, at:
    https://hq.gucheng.com/gpdmylb.html
import requests
import re
import csv
from bs4 import BeautifulSoup
import pandas as pd

def parse_html(url, headers):
    # Fetch the page and decode it as utf-8; return None on any error.
    try:
        res = requests.get(url=url, headers=headers)
        return res.content.decode('utf-8')
    except Exception:
        return None

def get_stock_list(url, headers):
    html = parse_html(url, headers)
    if html is None:
        return []
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')
    #print(links)
    # Each stock link looks like <a href="https://hq.gucheng.com/SZ000002/">name(code)...</a>,
    # so capture the exchange-prefixed code from the href and the name before the '('.
    pattern = re.compile(r'<a href="https://hq.gucheng.com/(.*?)/">(.*?)\(.*?</a>', re.S)
    res = []
    for item in links:
        try:
            res.append(re.findall(pattern, str(item))[0])
        except IndexError:
            continue
    return res

def main():
    url = 'https://hq.gucheng.com/gpdmylb.html'
    headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"}
    stocks = get_stock_list(url, headers)
    print(stocks)
    # Save the (code, name) pairs locally; the header row is what lets
    # load_csv in the next step read the file back by column name.
    with open('./stocklist.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Code', 'Name'])
        writer.writerows(stocks)

if __name__ == '__main__':
    main()
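After this script runs, stocklist.csv should look roughly like the following. The 万科A / SZ000002 row is only an illustration (it matches the stock code used in the example URL of the next step); the exact prefix format comes from the Gucheng hrefs:

Code,Name
SZ000002,万科A
...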
  2. Load the saved stock list and scrape each stock's historical data. I originally planned to scrape the pages directly, but it turned out to be simpler to simulate the operations with selenium, so I proceed as follows:
    • First open http://quotes.money.163.com/trade/lsjysj_000002.html?year=2010&season=1, where 000002 is the stock code.
    • Then use selenium to simulate the download: click "下载数据" (download data) -> enter the start date -> enter the end date -> click "下载" (download), as shown below:
      [screenshot of the download dialog]
    • Finally save the data locally. Since there are about 3,600 stocks in total, I download them manually in batches; this part of the code can be adjusted (see the download-directory sketch after the script).
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import random
import time

def load_csv(start_ind, end_ind):
    # Load stocks start_ind..end_ind from the saved list and strip the
    # exchange prefix from each code (e.g. 'SZ000002' -> '000002').
    stocklist = pd.read_csv('./stocklist.csv')
    stocklist['Code'] = stocklist['Code'].apply(lambda x: x[2:])
    return stocklist['Code'].tolist()[start_ind:end_ind]

def get_stock_history_data(stocklist, start_date='2010-01-01', end_date='2020-12-01'):
    # Download each stock's history from 2010-01-01 to 2020-12-01;
    # adjust the date range as needed.
    if not stocklist:
        return

    bro = webdriver.Chrome()
    bro.maximize_window()
    time.sleep(1)

    for stock in stocklist:
        try:
            bro.get("http://quotes.money.163.com/trade/lsjysj_{}.html?".format(stock))
            time.sleep(random.choice([1, 2]))
            bro.find_element(By.ID, "downloadData").click()
            time.sleep(random.choice([1, 2]))
            bro.find_element(By.NAME, 'date_start_value').clear()
            bro.find_element(By.NAME, 'date_start_value').send_keys(start_date)
            time.sleep(random.choice([1, 2]))
            bro.find_element(By.NAME, 'date_end_value').clear()
            bro.find_element(By.NAME, 'date_end_value').send_keys(end_date)
            time.sleep(random.choice([1, 2]))
            bro.find_element(By.CSS_SELECTOR, 'a.blue_btn.submit').click()
            # alternative: bro.find_element(By.XPATH, "//*[@name='tradeData']/div[3]/a").click()
            time.sleep(10)  # give the download time to finish
        except Exception:
            print("Stock {} does not exist".format(stock))

def main():
    stocks = load_csv(200, 500)  # pick one batch of stocks per run
    #print(stocks)
    get_stock_history_data(stocks)

if __name__ == '__main__':
    main()
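By default Chrome saves the downloaded CSVs to the system download folder. If you want each batch to land in its own directory, something like the following should work. This is a minimal sketch; the make_driver helper and its download_dir parameter are my own additions, not part of the original script:

from selenium import webdriver

def make_driver(download_dir):
    # Point Chrome's automatic downloads at download_dir so each batch
    # of stocks ends up in its own folder (use an absolute path).
    options = webdriver.ChromeOptions()
    options.add_experimental_option('prefs', {
        'download.default_directory': download_dir,  # e.g. an absolute path to ./stocks_data/part1
        'download.prompt_for_download': False,       # no save-as dialog
    })
    return webdriver.Chrome(options=options)

With this, bro = make_driver(...) replaces bro = webdriver.Chrome() in get_stock_history_data above.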
  3. At this point all of the 2010-2020 stock data has been downloaded. Now load every downloaded csv into a database so the data can be used as a whole.
  • Store the data as separate parts:
    [screenshot: the downloaded CSVs grouped into part folders]

  • Load each part into MySQL:

import pymysql
import pandas as pd
import os

def save_mysql(file_path, part):
    # The CSVs from 163.com may be GBK-encoded, so fall back to GBK when
    # utf-8 fails.
    try:
        df = pd.read_csv(file_path, sep=',', encoding='utf-8')
    except UnicodeDecodeError:
        df = pd.read_csv(file_path, sep=',', encoding='GBK')

    # Clean up a few special values; adjust this to your own data.
    df['股票代码'] = df['股票代码'].apply(lambda x: x.strip('\''))  # drop the leading quote
    df['涨跌额'] = df['涨跌额'].apply(lambda x: float(x) if x != 'None' else 0)
    df['涨跌幅'] = df['涨跌幅'].apply(lambda x: float(x) if x != 'None' else 0)

    conn = pymysql.connect(
        host='localhost',
        user='root',
        passwd='yoursqlpassword',
        db='dbname',
        port=3306,
        charset='utf8'
    )

    # Get a cursor.
    cur = conn.cursor()
    # Build the INSERT statement. The target table stocks_part<n> (e.g. stocks_part1)
    # must already exist in MySQL with matching column names and types; see the
    # CREATE TABLE sketch after this script.
    query = ('insert into stocks_part%i' % part +
             ' (Date, Stock_cd, CompanyName, Close, High, Low, Open, LastOpen,'
             ' ChgAmt, ChgPct, Turnover, Volume, TotalTransAmt, TotalValue, LiquidValue)'
             ' values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)')

    for _, item in df.iterrows():
        try:
            # The first 15 csv columns map, in order, onto the SQL columns
            # Date .. LiquidValue of the INSERT above.
            values = tuple(item.iloc[:15])
            cd = values[1]  # stock code, used in the log message below
            # Execute the INSERT.
            cur.execute(query, values)
        except Exception:
            continue

    # Commit the inserts.
    conn.commit()
    # Close the cursor.
    cur.close()
    # Close the MySQL connection.
    conn.close()

    # Report how many rows were imported.
    rows = len(df)
    try:
        print('Imported %i rows for stock %s into MySQL!' % (rows, cd))
    except NameError:
        print('No stock code found; import failed.')

def main():
    # Read one part at a time and write it into MySQL; pick the part manually.
    path = './stocks_data/part'
    part = 1
    folder_path = path + str(part)
    for file in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file)
        if not os.path.isdir(file_path) and file.endswith('.csv'):
            #print(file_path)
            save_mysql(file_path, part)

if __name__ == '__main__':
    main()
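The INSERT above assumes the stocks_part<n> tables already exist. A minimal sketch of a matching definition, to be run once per part in MySQL; the column types here are my assumptions, so adjust them to the actual data:

CREATE TABLE stocks_part1 (
    `Date`          DATE,
    `Stock_cd`      VARCHAR(10),
    `CompanyName`   VARCHAR(50),
    `Close`         DOUBLE,
    `High`          DOUBLE,
    `Low`           DOUBLE,
    `Open`          DOUBLE,
    `LastOpen`      DOUBLE,
    `ChgAmt`        DOUBLE,
    `ChgPct`        DOUBLE,
    `Turnover`      DOUBLE,
    `Volume`        BIGINT,
    `TotalTransAmt` DOUBLE,
    `TotalValue`    DOUBLE,
    `LiquidValue`   DOUBLE
) CHARACTER SET utf8;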
  • Finally, go into MySQL Workbench and union the different stocks_part tables to get the full table. As an example, query the stocks with the highest open price over 2010-2020 (a sketch of both statements follows):
    [screenshot of the query result]
    The results look broadly consistent with reality.
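A sketch of the two steps, assuming the parts are named stocks_part1, stocks_part2, ...; extend the UNION ALL chain to however many parts were created:

CREATE TABLE stocks_all AS
    SELECT * FROM stocks_part1
    UNION ALL SELECT * FROM stocks_part2
    UNION ALL SELECT * FROM stocks_part3;

-- Example: the ten highest open prices over 2010-2020
SELECT `Stock_cd`, `CompanyName`, MAX(`Open`) AS MaxOpen
FROM stocks_all
GROUP BY `Stock_cd`, `CompanyName`
ORDER BY MaxOpen DESC
LIMIT 10;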