Python crawler: Eastday Shanghai news, with simple data analysis

waterHBO

Published 2024-08-26 11:25:01

Tags: python, crawler, data analysis

Article link: https://blog.csdn.net/waterHBO/article/details/141558995

Motivation:

I was planning a trip downtown, but my search turned up some related news stories, so I ~~decided to crawl the news site…~~

1. The crawler

import os
import csv
import time
import requests


"""
# home: https://sh.eastday.com/
# 1. Fields collected: title, url, source, time
"""

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36'
}


def get_data():
    file_name = 'shanghai_news.csv'
    has_file = os.path.exists(file_name)

    # Open the file in append mode so repeated runs add rows instead of overwriting
    with open(file_name, 'a', newline='', encoding='utf-8') as file:
        # Create a csv.DictWriter to write each news item as a dict
        columns = ['title', 'url', 'time', 'source']
        writer = csv.DictWriter(file, fieldnames=columns)

        # Write the header row only if the file did not exist before this run
        if not has_file:
            writer.writeheader()

        # Crawl 20 pages of 20 items each; the site publishes roughly 400 news items per day.
        for i in range(20):
            time.sleep(0.5)
            url = f"https://apin.eastday.com/apiplus/special/specialnewslistbyurl?specialUrl=1632798465040016&skipCount={i * 20}&limitCount=20"

            resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
            ret = resp.json()

            junk = ret['data']['list']

            for x in junk:
                item = dict()
                item['title'] = x["title"]
                item["url"] = x["url"]
                item["time"] = x["time"]
                item["source"] = x["infoSource"]

                # Write the row to the CSV file
                writer.writerow(item)
                print(item)

if __name__ == '__main__':
    get_data()
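
Because the script opens `shanghai_news.csv` in append mode, running it more than once a day will likely write the same articles again. The sketch below shows one way to drop those repeats before any analysis. It assumes that the `url` column uniquely identifies an article, which the original does not state. The helper names `dedupe_rows` and `dedupe_csv` and the output file `shanghai_news_dedup.csv` are hypothetical and not part of the original script.

```python
import csv


def dedupe_rows(rows, key="url"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # a later duplicate of a row already kept
        seen.add(value)
        unique.append(row)
    return unique


def dedupe_csv(src="shanghai_news.csv", dst="shanghai_news_dedup.csv"):
    """Copy src to dst without duplicate URLs; return the number of rows kept.

    Both file names are assumptions made for this example.
    """
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames
        rows = dedupe_rows(reader)

    with open(dst, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `dedupe_csv()` after a crawl leaves the original file untouched and writes the cleaned copy next to it, so the raw data can still be inspected if the `url` assumption turns out to be wrong.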