Web Scraping: Crawling News Content and Images and Storing Them in a Database

小楼一夜听春雨258

Category: python. Tags: web scraping, database, python

Article link: https://blog.csdn.net/weixin_44458771/article/details/136624211

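This article presents an example that uses Python to scrape news from Xinhuanet (新华网), extracting the title, content, news category, and a qualifying image, and storing the data in a database. It fetches pages with requests, parses the HTML with BeautifulSoup, and iterates over the links that match specific criteria.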

I. Requirements

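1. Scrape the news articles linked from the news home page, parse out each article's title, content, news category, and image, and store them in a database.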

2. Scrape only news articles that contain an image; one image per article is enough.

II. Code

The following is example code for scraping Xinhuanet.

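import os
import re

import requests as rq
from bs4 import BeautifulSoup

from gbase import GBASE_DB  # database helper, not shown in this article
# IMGPATH: path prefix stored in the database; LOCALPATH: folder the image file
# is written to; PICSIZE: minimum image size in bytes
from conf import IMGPATH,LOCALPATH,PICSIZE

# Maps the channel name that appears in a news URL to a numeric news type
xinhua_dict = {'politics':1, # politics
               'culture':2, # culture
               'health':3,  # health
               'fortune':4,  # finance
               'world':5,  # international
               }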
    
def classify_news(s, news_list):
    for li in news_list:
        if li in s:
            return li
    return None
    
def get_xinhua_news(url):
    '''
    Scrape the title, category, content, and one image from a Xinhuanet article.
    '''
    newsWeb = rq.get(url, timeout=10)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text,'html.parser')
    # Get the title
    title_element = soup.find('span', class_='title')
    title = title_element.get_text(strip=True)
    # Get the category from the channel name in the URL
    news_type = xinhua_dict[classify_news(url,xinhua_dict.keys())]
    # Get the content
    content_element = soup.find('div', id='detail')
    paragraphs = content_element.find_all('p')
    content = '\n'.join(paragraph.get_text(strip=True) for paragraph in paragraphs)
    # Collapse blank lines, then escape double quotes and newlines for the SQL string
    content = re.sub('\n+', '\n', content).replace('"', '\\"').replace('\n', '\\n')
    # Get the image: candidate file names end in "1n.jpg" or "1n.jpeg"
    jpg_element = soup.find_all('img')
    jpg_pattern = re.compile(r'src="([^"]*1n\.(jpg|jpeg))"')
    j_list = jpg_pattern.findall(str(jpg_element))
    for j in j_list:
        jpg_path = os.path.basename(j[0])
        # Build the image URL from the article URL's directory part
        jpg_url = url[:url.find('c_')] + jpg_path
        picture = rq.get(jpg_url, timeout=10)
        # Keep the first image that downloads successfully and is larger than PICSIZE bytes
        if picture.status_code == 200 and len(picture.content) > PICSIZE:
            with open(LOCALPATH+jpg_path,"wb") as f:
                f.write(picture.content)
            return title,news_type,content,IMGPATH+jpg_path
    # No qualifying image: raise explicitly so that main() skips this article
    raise ValueError('no qualifying image found')

def main():
    db = GBASE_DB()
    newsUrl = 'http://www.xinhuanet.com/'
    newsWeb = rq.get(newsUrl, timeout=10)
    newsWeb.encoding = 'utf-8'
    soup = BeautifulSoup(newsWeb.text,'lxml')
    # Collect the list of news URLs
    link_list = []
    li_elements = soup.find_all('li')
    for li_element in li_elements:
        a_element = li_element.find('a')
        if a_element:
            url = a_element.get('href')
            # href may be missing, so check url before calling string methods on it
            if url and url.startswith("http://www.news.cn/") and url.endswith(".htm") and 'c_' in url and classify_news(url,xinhua_dict.keys()) is not None:
                link_list.append(url)

    # Parse the news URLs one by one
    for link in link_list:
        try:
            title,news_type,content,jpg_path = get_xinhua_news(link)

            # Note: the values are interpolated directly into the SQL text, so a
            # single quote in the title or content will break the statement
            sql = '''insert into table_name(title,type,content,image) 
                     values ('{}', {},'{}','{}')
                  '''.format(title,news_type,content,jpg_path)
            db.execute_sql(sql)
            print('(Success) Scraped Xinhuanet:',title)
        except Exception as e:
            print('Scrape failed:',link,' :',e)
            continue

if __name__ == '__main__':
    main()

First, the Xinhuanet home page is scraped, and every news link on it that passes the filter is collected into the link_list list.
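The link filter can also be written as a small pure function, which makes the selection rules easy to check without touching the network. This is only a minimal sketch, not the author's code: filter_news_links and the sample URLs are hypothetical, and the sketch additionally skips missing hrefs and duplicate links.

```python
# Channel names taken from the keys of xinhua_dict in the article
CHANNELS = ('politics', 'culture', 'health', 'fortune', 'world')

def filter_news_links(hrefs, channels=CHANNELS):
    """Keep article links using the same rules as the loop in main()."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        is_article = (href.startswith('http://www.news.cn/')
                      and href.endswith('.htm')
                      and 'c_' in href)
        has_channel = any(ch in href for ch in channels)
        if is_article and has_channel and href not in seen:
            seen.add(href)
            links.append(href)
    return links

# Made-up hrefs for illustration only
sample = [
    'http://www.news.cn/politics/2024-03/11/c_1130000001.htm',
    'http://www.news.cn/politics/2024-03/11/c_1130000001.htm',  # duplicate
    'http://www.news.cn/sports/2024-03/11/c_1130000002.htm',    # channel not in the mapping
    'http://www.news.cn/world/index.htm',                        # no c_ segment
    None,                                                        # anchor without href
]
print(filter_news_links(sample))
```

Only the first sample URL passes every rule, so the remaining entries are dropped.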

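Next, each news link is visited in turn and the title and content are parsed; whitespace and special characters need some cleaning. Each article is classified by the channel name in its URL path, and the first image whose file size exceeds the threshold PICSIZE is downloaded (the code compares the byte length of the downloaded content, which avoids picking up small images on the page such as QR codes). The image is saved in a folder on the server's local disk. If no image meets the condition, an error is raised; main catches the exception and skips that news link.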
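The classification and image-selection rules can be isolated into small helpers and tried on made-up data. This is a sketch under stated assumptions: classify_by_path, image_url_for, and pick_first_large are hypothetical names, the URL and file names are invented, and the byte strings stand in for downloaded image content.

```python
import os

# Same channel-to-type mapping as xinhua_dict in the article
XINHUA_TYPES = {'politics': 1, 'culture': 2, 'health': 3, 'fortune': 4, 'world': 5}

def classify_by_path(url, type_map=XINHUA_TYPES):
    """Return the type id of the first channel name found in the URL, else None."""
    for channel, type_id in type_map.items():
        if channel in url:
            return type_id
    return None

def image_url_for(article_url, img_src):
    """Rebuild the image URL as the article does: article directory + image file name."""
    base = article_url[:article_url.find('c_')]
    return base + os.path.basename(img_src)

def pick_first_large(candidates, min_bytes):
    """Return the name of the first (name, data) pair larger than min_bytes bytes."""
    for name, data in candidates:
        if len(data) > min_bytes:
            return name
    raise ValueError('no image larger than the threshold')

article = 'http://www.news.cn/politics/2024-03/11/c_1130000001.htm'  # made-up URL
print(classify_by_path(article))
print(image_url_for(article, '1130000001_1710000000000_1n.jpg'))

# Fake image payloads: a small QR-code-sized file and a larger photo
candidates = [('qr_1n.jpg', b'x' * 100), ('photo_1n.jpg', b'x' * 50000)]
print(pick_first_large(candidates, min_bytes=10000))
```

Raising an explicit ValueError when nothing qualifies keeps the "skip this article" behaviour while making the reason visible in the failure message.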

Finally, the record is stored in the database.
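Building the INSERT statement with string formatting breaks as soon as a title or body contains a single quote. Where the database driver supports placeholders, a parameterized insert avoids manual escaping. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; whether GBASE_DB.execute_sql accepts parameters is not shown in the article, and the table name news and the helper save_news are hypothetical.

```python
import sqlite3

def save_news(conn, title, news_type, content, image):
    """Insert one news row; the ? placeholders let the driver handle quoting."""
    conn.execute(
        'INSERT INTO news (title, type, content, image) VALUES (?, ?, ?, ?)',
        (title, news_type, content, image),
    )
    conn.commit()

# In-memory database so the example leaves no files behind
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE news (title TEXT, type INTEGER, content TEXT, image TEXT)')

# Quotes and newlines are stored unchanged, with no manual escaping
save_news(conn, 'It\'s a "quoted" title', 1, 'line one\nline two', '/img/example_1n.jpg')
row = conn.execute('SELECT title, type, content, image FROM news').fetchone()
print(row)
```

With placeholders, the content cleaning step only needs to normalize whitespace; it no longer has to escape quotes for the SQL text.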