Python Web Scraping: Using selenium + bs4 to Scrape 'Bullish' or 'Bearish' Stock News from Xuangubao (选股宝)

Article link: https://blog.csdn.net/weixin_42533987/article/details/80794510

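This article walks through a hands-on Python scraping example that uses Selenium and BeautifulSoup4 to fetch and parse bullish and bearish stock news from the Xuangubao (选股宝) website. It covers the concrete implementation steps and the key technical points.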


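I. Preface

(1) I like to see the result first and the details afterwards, so here is the result, as shown in the figure:

(2) The information is scraped from Xuangubao (选股宝), https://xuangubao.cn/ (I set it to load 20 pages here; only a few entries are listed below):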

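(3) This project mainly uses the following in Python:

    1. Regular expressions;

        Python 3 regular expressions: http://www.runoob.com/python3/python3-reg-expressions.html

    2. Selenium, to simulate browser behavior;

    3. BeautifulSoup4, to parse the page:

        Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html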

(4) Runtime environment and tools:

    1》Python 3.6

    2》Selenium 3.12.0

    3》BeautifulSoup 4.6

    4》pip 10.0

    5》JetBrains PyCharm Community Edition 2017.3.4 x64

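II. Hands-on

(1) Import the libraries. Tutorials for installing and configuring them are easy to find online (honestly, I no longer remember exactly how I got them installed; it took a lot of Baidu searching).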

from bs4 import BeautifulSoup
import re
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

(2) Fetch the page source;

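def gethtml(url):
    options = Options()
    options.add_argument('-headless')  # run Firefox without a visible window
    driver = Firefox(options=options)  # Firefox browser; the options were created but never passed in
    driver.get(url)
    for _ in range(20):  # dynamically load 20 pages of data
        # Simulate a mouse click on "点击加载更多" (click to load more).
        # This is the Selenium 3 API; Selenium 4 uses driver.find_element(By.XPATH, ...) instead.
        driver.find_element_by_xpath("//span[@class='home-news-footer-loadmore']").click()
    html = driver.page_source  # get the page source
    driver.quit()  # close the browser
    return html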

This simulates clicking "点击加载更多" (click to load more); it is set to click 20 times. O(∩_∩)O Haha~
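
The click loop fires 20 clicks in a row without waiting for new items to load, and it raises as soon as the button cannot be found. One possible way to make it more forgiving is sketched below. The helper `click_repeatedly` and its `pause` parameter are hypothetical names, not part of Selenium or the original script; the helper takes any zero-argument callable, so the Selenium call is passed in rather than hard-coded.

```python
import time


def click_repeatedly(click_once, times=20, pause=1.0):
    """Call click_once() up to `times` times, sleeping `pause` seconds after each call.

    Stops early and returns the number of successful clicks if click_once raises
    (for example, because the "load more" button is no longer on the page).
    Hypothetical helper for illustration; not part of Selenium.
    """
    clicks = 0
    for _ in range(times):
        try:
            click_once()
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            break
        clicks += 1
        time.sleep(pause)  # give the page time to append the next batch of news
    return clicks
```

With the driver from gethtml, a call could look like `click_repeatedly(lambda: driver.find_element_by_xpath("//span[@class='home-news-footer-loadmore']").click())`. A fixed pause is the simplest choice; Selenium's explicit waits (WebDriverWait) are a sturdier alternative.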

(3) Extract the information;

def getinfor(html_str, str_type, str_char):
    soup = BeautifulSoup(html_str, 'html.parser')
    # Find the tags that directly hold the '利好' (bullish) or '利空' (bearish) label.
    bu = soup.find_all(class_=str_type)
    for date in bu:
        # Continue only if the block containing the label also lists stocks.
        bu_name = date.parent.parent.find_all(class_="stock-group-item")
        if bu_name:
            print()
            date_ = date.parent.parent.parent.parent.parent  # the enclosing <li> of the news item
            date_month = date_.find(class_="news-item-timeline-date-month")  # month
            print(date_month.string, end='/')
            date_day = date_.find(class_="news-item-timeline-date-day")  # day
            print(date_day.string, end='日/')
            date_time = date_.find(class_=re.compile("news-item-timeline-time .*")).get_text()  # time
            date_time_ = re.compile(r'[0-9]{1,2}:[0-9]{1,2}').search(date_time)
            print(date_time_.group(), end='/')
            print(str_char, end=' ')
            for a in bu_name:
                stock_name = a.find(class_="stock-group-item-name")  # stock name
                print(stock_name.string, end='[')
                stock_rate = a.find(class_="stock-group-item-rate")  # rate
                print(stock_rate.string, end='] ')
    print()
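
The time extraction step can be tried on its own. The sample strings below are made up for illustration and only mimic the kind of text that get_text() might return; the pattern is the one used in getinfor, and `extract_time` is a hypothetical helper name.

```python
import re

# Same pattern as in getinfor: one or two digits, a colon, one or two digits.
TIME_PATTERN = re.compile(r'[0-9]{1,2}:[0-9]{1,2}')


def extract_time(text):
    """Return the first HH:MM-like substring in text, or None if there is none."""
    match = TIME_PATTERN.search(text)
    return match.group() if match else None


# Made-up sample inputs, not real page content.
print(extract_time('\n  09:35  \n'))   # 09:35
print(extract_time('no time here'))    # None
```

Returning None instead of calling .group() directly avoids an AttributeError when a news item has no time text.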

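Explanation:

1》First locate '利好' (bullish) or '利空' (bearish) through the class attribute of the enclosing <span> tag: class="bullish-and-bear bullish" (for bearish, class="bullish-and-bear bear").

2》Continue only if the block lists stocks (for example '焦作万方'), because some news items have none.

3》date_=date.parent.parent.parent.parent.parent walks up to the enclosing <li>; inside it, the find() method can locate the tags that hold the information we need.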
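
Step 3 depends on the label sitting exactly five levels below the <li>. As an alternative sketch, BeautifulSoup's find_parent can look up the nearest enclosing <li> by tag name. The HTML fragment below is invented for illustration and is far simpler than the real page (the rate value is made up too); only the class names are taken from the code above.

```python
from bs4 import BeautifulSoup

# Invented, simplified fragment; the real page nests these tags differently.
html = """
<ul>
  <li>
    <span class="news-item-timeline-date-day">25</span>
    <div><div>
      <span class="bullish-and-bear bullish">利好</span>
      <span class="stock-group-item">
        <span class="stock-group-item-name">焦作万方</span>
        <span class="stock-group-item-rate">+1.23%</span>
      </span>
    </div></div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')
label = soup.find(class_='bullish-and-bear bullish')  # exact class-string match
item = label.find_parent('li')                         # nearest enclosing <li>

# Inside the <li>, find() locates the tags that hold each piece of information.
day = item.find(class_='news-item-timeline-date-day').string
name = item.find(class_='stock-group-item-name').string
rate = item.find(class_='stock-group-item-rate').string
print(day, name, rate)
```

Looking up the ancestor by tag name keeps working if the site adds or removes a wrapper <div>, whereas a fixed chain of .parent calls would then land on the wrong element.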

(4) Call the main function;

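def main1():
    stock_list_url = 'https://xuangubao.cn'
    stock_info_url = 'https://gupiao.baidu.com/stock/'  # defined in the original but never used
    html = gethtml(stock_list_url)  # page source after loading more news
    getinfor(html, 'bullish-and-bear bullish', ' 利好：')  # bullish news
    getinfor(html, 'bullish-and-bear bear', ' 利空：')  # bearish news


if __name__ == '__main__':
    main1()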

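III. Summary

(1) I gained a basic understanding of Python web scraping.

(2) I got to know the relevant libraries, especially while installing them: it really isn't as simple as running pip install *** and being done.

(3) I used PyCharm for the first time and learned some of its basics.

(4) This was my first scraper, written mainly because my teacher required it [dark-face emoji]. There is still a lot to learn; it just scrapes some information without a specific goal. Everyone is welcome to discuss, and if you spot any problems, please point them out.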