Crawling China News (chinanews.com) with a Python Scraper

Mooney安

Article link: https://blog.csdn.net/Iv_zzy/article/details/107537295

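China News (chinanews.com) covers many news categories and publishes a large volume of articles, which makes it a good source when you need to collect news in bulk.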

The site looks like this:
[screenshot not included]

As the URL shows, changing the date in the URL retrieves the news for a different day.
[screenshot not included]
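As a rough sketch of that idea, the snippet below generates one scrolling-news URL per day for a date range. The URL pattern is the one used by `get_url` in section 2; the `'YYYY/MMDD'` form of the date segment and the helper name `scroll_urls` are assumptions made for illustration, so check them against a real URL in your browser first.

```python
from datetime import date, timedelta

# Prefix and suffix taken from the get_url function in section 2.
BASE = 'http://www.chinanews.com/scroll-news/'


def scroll_urls(start, end):
    """Yield one scrolling-news URL per day, from start to end inclusive."""
    day = start
    while day <= end:
        # Assumption: the date segment looks like '2019/0130' (YYYY/MMDD).
        yield BASE + day.strftime('%Y/%m%d') + '/news.shtml'
        day += timedelta(days=1)


urls = list(scroll_urls(date(2019, 1, 29), date(2019, 1, 31)))
for u in urls:
    print(u)
```

Each string produced by `day.strftime('%Y/%m%d')` could also be passed directly as the `date` argument of `get_url`, if the assumed format turns out to be right.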
With that, let's get started.

1. Fetching the detail page of a single news article

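import requests
from bs4 import BeautifulSoup

url = 'http://www.chinanews.com/auto/2019/01-30/8743035.shtml'
res = requests.get(url, timeout=10)
# Set the encoding by hand; which one works depends on the page.
res.encoding = 'GBK'  # for pages that requests reports as ISO-8859-1 (e.g. from 2012)
# res.encoding = 'utf-8'  # for newer pages (e.g. from 2019)
soup = BeautifulSoup(res.text, 'html.parser')

# The headline is the first <h1> element on the page.
title = soup.find('h1')
print(title.text.strip())

# The article body is the set of <p> tags inside <div class="left_zw">.
paragraphs = []
for p in soup.find('div', 'left_zw').find_all('p'):
    # Skip paragraphs that contain embedded JavaScript.
    if 'function' in p.text:
        continue
    paragraphs.append(p.text.strip())
news_contents = ''.join(paragraphs)
print(news_contents)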

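Note that the encoding of this site's pages has changed over time. If the output comes out garbled, switch to the other encoding and test which one works for the page you are fetching:

res.encoding = 'utf-8'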
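Instead of switching the encoding by hand, one option is to decode the raw bytes yourself and fall back from one encoding to the next. The following is a minimal sketch, not the author's method: the helper name `decode_html` and the candidate list are assumptions. It tries UTF-8 first because UTF-8 decoding fails loudly on GBK bytes, whereas decoding UTF-8 bytes as GBK can silently produce garbage. `gb18030` is used as a superset of GBK.

```python
def decode_html(raw, encodings=('utf-8', 'gb18030')):
    """Decode raw bytes with the first encoding that works.

    Returns a (text, encoding) tuple. The candidate encodings are an
    assumption based on the two encodings mentioned in this article.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what we can and mark the bad bytes.
    last = encodings[-1]
    return raw.decode(last, errors='replace'), last


# Offline demonstration with bytes encoded both ways.
gbk_bytes = '中国新闻网'.encode('gbk')
utf8_bytes = '中国新闻网'.encode('utf-8')
print(decode_html(gbk_bytes))
print(decode_html(utf8_bytes))
```

With `requests`, the raw bytes are available as `res.content`, so `text, enc = decode_html(res.content)` could replace setting `res.encoding` and reading `res.text`.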

2. Getting the article URLs from the scrolling-news page

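from pandas import DataFrame

def get_url(date):
    """Return a DataFrame of category, title and URL for one day's scrolling news."""
    url = 'http://www.chinanews.com/scroll-news/' + date + '/news.shtml'
    res = requests.get(url, timeout=10)
    res.encoding = 'GBK'  # see the encoding note in section 1
    # res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')

    # Each <li> in <div class="content_list"> holds a category link
    # followed by a title link.
    li_tag = soup.find('div', 'content_list').find_all('li')
    category_list = []
    title_list = []
    url_list = []
    for li in li_tag:
        info = li.find_all('a')
        if len(info) < 2:
            continue  # not a news item (fewer than two links)
        category = info[0].text
        # Keep only: military, entertainment, Taiwan, auto, education, health.
        if category in ['军事', '娱乐', '台湾', '汽车', '教育', '健康']:
            category_list.append(category)
            news_title = info[1].text
            title_list.append(news_title)
            news_url = 'http://www.chinanews.com' + str(info[1].get('href'))
            url_list.append(news_url)
            print("done: " + news_title + ": " + news_url)
    c = {'类别': category_list,  # category
         '标题': title_list,     # title
         'url': url_list}
    data = DataFrame(c)
    print(data)
    return data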