0315 Financial Web Scraping in Practice

Latest recommended article published on 2024-08-02 09:56:06

或许快要下雪了吧



Category: Financial Big Data

Article link: https://blog.csdn.net/qq_40647378/article/details/104891786


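This article walks through hands-on data scraping from several financial websites: Sina Finance, Eastmoney, the Shanghai Stock Exchange, China Judgments Online, Yicai, Wallstreetcn, and CNINFO. It explains how to obtain real-time data, financial news, and listed-company information by inspecting page source, using selenium, and handling page structure, and it highlights the details and techniques to watch for while scraping.

Summary generated by CSDN using AI.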

1. Scraping real-time stock data from Sina Finance

from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# options= replaces the deprecated chrome_options= keyword
browser = webdriver.Chrome(options=chrome_options)
browser.get("http://finance.sina.com.cn/realstock/company/sh000001/nc.shtml")
data = browser.page_source
print(data)


# The data has been fetched, so close the browser
browser.quit()

1. Inspect the page source

import re
p_price = '<div id="price" class=".*?">(.*?)</div>'
price = re.findall(p_price, data)

Combining the two snippets above gives the price:

['2493.90']
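As a minimal sketch of the parsing half of the two snippets, the function below wraps the same regular expression so it can be checked without opening a browser. The function name `extract_price` and the sample HTML fragment are assumptions made for illustration; the real page source comes from `browser.page_source` and may be structured differently.

```python
import re

# Same pattern as in the article: capture the text inside <div id="price" ...>.
P_PRICE = '<div id="price" class=".*?">(.*?)</div>'


def extract_price(page_source):
    """Return the first price string found in the page source, or None."""
    matches = re.findall(P_PRICE, page_source)
    return matches[0] if matches else None


# Hypothetical fragment imitating the relevant part of the Sina Finance page.
sample_html = '<div class="hq"><div id="price" class="down">2493.90</div></div>'
print(extract_price(sample_html))
```

Returning None when nothing matches avoids an IndexError if the page layout changes or the page has not finished rendering.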

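Two ways to get real-time stock data:

Buying an API: paid and timely, but it does not cover everything (futures, for example).

Writing your own scraper: free and somewhat delayed, but it can cover anything.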

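2. Scraping Eastmoney

Eastmoney is a professional online financial media outlet that provides 7*24-hour financial news and global market quotes, bringing together comprehensive financial news and market information.

On scraping dates

The text captured by (.*?) may contain line breaks, and by default . does not match a newline. Pass re.S to findall so that . matches newlines as well. Without re.S, the data cleaning and printing step in the next part fails with a "list index out of range" error (the date list has too few entries), because some dates were missed during extraction.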
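The effect of re.S can be seen in a small sketch. The HTML fragment is hypothetical and only imitates the shape of the search results; the pattern is the `p_date` pattern used in this article.

```python
import re

# Hypothetical search-result fragment: the second entry wraps onto a new line.
sample_html = (
    '<p class="news-desc">2020-03-15 first summary</p>\n'
    '<p class="news-desc">2020-03-14\nsecond summary</p>'
)

p_date = '<p class="news-desc">(.*?)</p>'

without_flag = re.findall(p_date, sample_html)      # . stops at newlines
with_flag = re.findall(p_date, sample_html, re.S)   # . also matches newlines

print(len(without_flag))  # the wrapped entry is missed
print(len(with_flag))     # both entries are found
```

If the title list has two entries and the date list only one, indexing the date list with the title's position is what triggers "list index out of range".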

from selenium import webdriver
import re

def dongfang(company):
    #1. Fetch the page source
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)
    url = 'http://so.eastmoney.com/news/s?keyword=' + company
    browser.get(url)
    data = browser.page_source
    browser.quit()
    # print(data)
    
    #Extract the fields with regular expressions
    p_title = '<div class="news-item"><h3><a href=".*?">(.*?)</a>'
    p_href = '<div class="news-item"><h3><a href="(.*?)">.*?</a>'
    p_date = '<p class="news-desc">(.*?)</p>'
    title = re.findall(p_title,data)
    href = re.findall(p_href,data)
    date = re.findall(p_date,data,re.S)
    
    #Check below that the three lists have the same length
    #print(title)
    #print(len(title))
    #print(href)
    #print(len(href))
    #print(date)
    #print(len(date))
    
#After the regex extraction, clean up the extracted information; the extracted titles turn out to contain some <em>
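The author's cleaning code is not shown in the visible part of the article. The following is only a sketch of one way the length check and the removal of `<em>` markup could be done; the function name `clean_results` and the sample values are assumptions, not the original implementation.

```python
import re


def clean_results(title, href, date):
    """Check that the three lists line up, then strip tags and whitespace."""
    if not (len(title) == len(href) == len(date)):
        raise ValueError('title, href and date lists have different lengths')
    cleaned = []
    for t, h, d in zip(title, href, date):
        t = re.sub('<.*?>', '', t).strip()  # drop <em>...</em> style markup
        d = d.strip()                       # trim surrounding whitespace
        cleaned.append((t, h, d))
    return cleaned


# Hypothetical values shaped like the lists re.findall returns in dongfang().
title = ['<em>Alibaba</em> releases quarterly report']
href = ['http://example.com/a.html']
date = ['\n 2020-03-15 summary \n']
print(clean_results(title, href, date))
```

Raising an error on mismatched lengths surfaces a missing re.S or a changed page layout early, before it turns into an index error further down.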
