简单的爬虫入门--爬取百度股票信息--来自mooc嵩老师视频

本文链接：https://blog.csdn.net/weixin_42066185/article/details/81304433

这个仅仅作为自己做这个项目的一些过程的记录和理解吧~~~

1、import 部分，将使用到的一些库引入进来

import requests
import re
from bs4 import BeautifulSou

2、先做一个简单的例子，说明原理

https://gupiao.baidu.com/stock/sz300059.html 这个链接，是百度股票查询的网页，我们可以通过右键查看网页源代码对其进行分析。我们就定义这个页面，是我们想要的页面信息

url='https://gupiao.baidu.com/stock/sz300059.html'
r=requests.get(url)

得到了返回的页面，我们定义html变量，为网页的整个内容

r.encoding = r.apparent_encoding
html=r.text

3、使用BeautifulSoup 做一锅汤

soup=BeautifulSoup(r,'html,parser')

我们对比一下不做汤，和做汤之后的区别吧

（1）不做汤

（2）做汤

可以看出更加结构化的不同喔

4、解析源网页的代码内容（静态网页）

可见上面的信息是分块的，所以我们的解析的方法如下：

同时看到这个的网址内容

刚好，我们可以通过获取这个class中的内容进行进一步的拓展

5、编写相关的代码

获取对应的股票的名称：

stockName=soup.find(attrs={'class':'bets-name'})

上面i就是如果我这样写的运行的结果。

所以真正的name 应该如下：

name=stockName.text.split()[0]

其中[0] 表示取前面的那个。

获取股票的今日信息：

那么这样信息肯定最好就直接存成列表的格式啊

那么，我们同样使用find功能

同样的对于对应的实际的数值

所以系列代码如下：

infoDict={}
stockName=soup.find(attrs={'class':'bets-name'})

infoDict.update({'股票名称':stockName.text.split()[0]})

stockInfo=soup.find('div',attrs={'class':'bets-content'})
keyList=stockInfo.find_all('dt')
keyVal=stockInfo.find_all('dd')
#to make it a list
 for i in range(len(keyList)):
     key=keyList[i].text
     val=valueList[i].text
     infoDict[key]=val   #建立字一个字典的幅值的方式，我们在原来的基础上继续增加新的

综上，是一个，静态原网页的提取的方式。

提取一系列的内容如下的代码，是我按照老师教的写的，并且调试好的代码：作为参考吧

import requests
from bs4  import  BeautifulSoup
import traceback
import  re



def getHTMLText(url,code='utf-8'):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding # 修改编码
        return r.text
    except:
        return ''

def getStockList(lst,stockURL):
    print('List information')
    html=getHTMLText(stock_list_url)
    #print(html)
    soup=BeautifulSoup(html,'html.parser')
    a=soup.find_all('a')

    '''
     <li><a target="_blank" href="http://quote.eastmoney.com/sh201000.html">R003(201000)</a></li>
    '''

    for i in a:
        try:
            # 个股的股票编号保存在lst中
            href = i.attrs['href']
            # print(type(href))
            lst.append(re.findall(r'[s][hz]\d{6}', href)[0])
            # findall返回列表，比如a=[sh012345],取0，就取出了里面的数字
            # 在append到lst里，这样lst就不会是[[sh201000], [sh201002]]这样
        except:
            continue
   # print(lst)
    print('ending list')




def getStockInfo(lst,stockURL,fpath):
    count=0   #进度条用的
    print('start information')
    for stock  in lst:
        url=stockURL+stock+'.html'
        print(url)
        html=getHTMLText(url)

        try:
            if html=='':
                continue
            infoDict={}
            soup=BeautifulSoup(html,'html.parser')
            #print(soup)
            stockInfo=soup.find('div',attrs={'class':'stock-bets'})
            print(stockInfo)
          #  print(stockInfo)
            #name=soup.find('div',attrs={'class':'bets-name'})
            name = soup.find('a', attrs={'class': 'bets-name'})
            print(name)
           # print(name)
            infoDict.update({'股票名称':name.text.split()[0]})
          #  print(infoDict)

            #print(stockInfo)
            keyList=stockInfo.find_all('dt')
            #print(keyList)
            valueList=stockInfo.find_all('dd')
            #print(valueList)
            for i in range(len(keyList)):
                key=keyList[i].text
                val=valueList[i].text
                infoDict[key]=val
            with open(fpath,'a',encoding='utf-8') as f:
                f.write(str(infoDict)+'\n')
                count = count+1
                print('\r当前速度：{:.2f}%'.format(count*100/len(lst)),end='')
        except:
            count = count + 1
            print('\r当前速度：{:.2f}%'.format(count * 100 / len(lst)), end='')
            traceback.print_exc()
            continue





if __name__=='__main__':
    stock_list_url='http://quote.eastmoney.com/stocklist.html'
    stock_info_url='https://gupiao.baidu.com/stock/'
    output_file='E:\BaiduStockInfo.txt'
    slist=[]
    getStockList(slist,stock_list_url)
    getStockInfo(slist,stock_info_url,output_file)