How do you scrape China's GDP for the last ten years and write it to a CSV file?

远远在北方

Article link: https://blog.csdn.net/m0_50628114/article/details/112561146

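How can we scrape China's GDP for the last ten years, see how the Chinese economy has changed, and then turn that into a chart? Here we start with step one: getting the data, because the data is the foundation.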

First we need to pick a website. Here we use 快易理财网 (kylc.com): https://www.kylc.com/stats/global/yearly_per_country/g_gdp/chn.html

Import the required libraries

import urllib.request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import ssl
import csv

Handle exceptions and return the HTML

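def exception_handling(url):
    """Fetch url and return its HTML as text, or None if the request fails."""
    try:
        # Trust every HTTPS certificate. This turns off certificate verification
        # for the whole process, so only use it for quick experiments.
        ssl._create_default_https_context = ssl._create_unverified_context
        req = urllib.request.Request(url)
        # Send a browser-like User-Agent header with the request
        req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0")
        response = urllib.request.urlopen(req)
        html = response.read().decode('utf-8')
        return html
    except HTTPError as e1:
        # The server answered with an error status code
        print(e1.code)
    except URLError as e2:
        # The server could not be reached
        print("The server can't connect!")
        print('Reason:{}'.format(e2.reason))
    except Exception as e3:
        # Any other failure, for example a decoding error
        print('Program error: {}'.format(e3))
    return None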
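
The function above disables certificate verification for the whole process by patching the ssl module. As an alternative, here is a minimal sketch that passes an explicit SSL context to a single urlopen call instead. The helper names build_request, build_context and fetch, the short User-Agent string and the 10 second timeout are assumptions for illustration, not part of the original script:

```python
import ssl
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request that carries a browser-like User-Agent header."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


def build_context(verify=True):
    """Return an SSL context; verify=False skips certificate checks."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be switched off before verify_mode is relaxed
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, verify=True, timeout=10):
    """Download url and return the body decoded as UTF-8."""
    req = build_request(url)
    ctx = build_context(verify)
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
        return resp.read().decode("utf-8")
```

With this shape, fetch(url) verifies certificates by default and fetch(url, verify=False) relaxes the check for that one request only. The error handling from the function above would still need to wrap the call.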

Scrape the data

Check the page source to find the matching div, then read the data from the table inside it.

def get_data(html):
    # Empty list for the column names (table header)
    data_header = []
    # Empty list for the data cells
    data_detail = []
    # Tell BeautifulSoup to parse with html.parser
    soup = BeautifulSoup(html, 'html.parser')
    # Find the innermost div that wraps the table
    table_divs = soup.find('div', {'class': 'table-responsive'})
    # Find the table inside that div
    table_divs1 = table_divs.find('table', {'class': 'table'})

    # Get the thead of the table
    thead = table_divs1.find('thead')
    # Get every th in the thead
    ths = thead.find_all('th')
    for th in ths:
        header = th.text.strip()
        data_header.append(header)

    # Get the tbody of the table
    tbody = table_divs1.find('tbody')
    # Get every td in the tbody
    tds = tbody.find_all('td')
    # Keep the data for 2010-2019: 10 rows x 3 columns = the first 30 cells
    for td in tds[0:30]:
        # strip() with no argument removes the surrounding whitespace
        detail = td.text.strip()
        data_detail.append(detail)

    # Return the header list and the flat list of cells
    return (data_header, data_detail)
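
get_data collects every td into one flat list, which only works while each table row has exactly three cells. As an illustration of keeping the row structure instead, here is a minimal sketch that uses the standard library's html.parser.HTMLParser rather than BeautifulSoup (a deliberate swap so the example has no third-party dependency). SAMPLE_HTML is a made-up snippet that mimics the div and table layout described above, and TableCellParser and parse_table are hypothetical names:

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect th texts as headers and td texts grouped by tr."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._cell_tag = None   # "th" or "td" while inside a cell
        self._buf = []          # text fragments of the current cell
        self._row = None        # cells of the current tr

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell_tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._cell_tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_tag == tag:
            text = "".join(self._buf).strip()
            if tag == "th":
                self.headers.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell_tag = None
        elif tag == "tr":
            # Header rows contain only th cells, so their td list stays empty
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return (headers, rows) for the table found in html."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.headers, parser.rows


# Made-up sample that mimics the page structure (not the real page source)
SAMPLE_HTML = """
<div class="table-responsive"><table class="table">
<thead><tr><th>Year</th><th>GDP (USD)</th><th>Share of world</th></tr></thead>
<tbody>
<tr><td>2019</td><td> 14.34 trillion </td><td>16.3550%</td></tr>
<tr><td>2018</td><td>13.89 trillion</td><td>16.0900%</td></tr>
</tbody></table></div>
"""
```

Calling parse_table(SAMPLE_HTML) yields one list per table row, so the later writing step would not need to regroup cells in threes.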

Of course, if you want more years of data, just drop the [0:30] slice on tds so the loop covers every cell.
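
To make the slice arithmetic explicit: each year takes up three cells, so N years correspond to the first 3 * N cells. A minimal sketch, assuming the flat cell list that get_data returns; chunk_rows and first_n_years are hypothetical helpers and sample_cells is made-up data in the same layout:

```python
def chunk_rows(cells, width=3):
    """Split a flat list of cells into rows of `width` cells each."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]


def first_n_years(cells, n_years=None, width=3):
    """Return the first n_years rows, or every row when n_years is None."""
    rows = chunk_rows(cells, width)
    return rows if n_years is None else rows[:n_years]


# Made-up sample in the year / GDP / share layout of the scraped cells
sample_cells = ["2019", "14.34", "16.3550%",
                "2018", "13.89", "16.0900%",
                "2017", "12.31", "15.1552%"]
```

first_n_years(sample_cells, 2) keeps the two most recent years, while first_n_years(sample_cells) keeps everything, which mirrors removing the slice.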

Write the data to a file

def data_write(data_h, data_d):
    with open('china_data1_0.csv', mode='w', encoding='utf-8', newline='') as f:
        # Build a csv writer on top of the file object
        csv_writer = csv.writer(f)
        # Write the header row first
        csv_writer.writerow(data_h)
        # Split the flat list into rows of three cells each
        for i in range(0, len(data_d), 3):
            row = data_d[i:i+3]
            # Skip rows in which every cell is empty
            if any(row):
                csv_writer.writerow(row)
        # print('Scraping finished, see china_data1_0.csv')
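
The GDP column contains commas, so it is worth seeing how csv.writer deals with such a field. A minimal sketch that writes to an in-memory buffer instead of a file; rows_to_csv_text and csv_text_to_rows are hypothetical helpers, and the English column names are placeholders:

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Serialize a header and rows to CSV text using csv.writer."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of rows using csv.reader."""
    return list(csv.reader(io.StringIO(text, newline="")))


header = ["year", "gdp_usd", "share_of_world"]  # placeholder column names
rows = [["2019", "14.34万亿 (14,342,902,842,915)", "16.3550%"]]
text = rows_to_csv_text(header, rows)
```

The field with commas is wrapped in double quotes on write and unwrapped again on read, so no manual escaping is needed.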

Main section

if __name__ == '__main__':
    url = "https://www.kylc.com/stats/global/yearly_per_country/g_gdp/chn.html"
    html = exception_handling(url)
    # Only parse and write when the download succeeded
    if html is not None:
        data_h, data_d = get_data(html)
        data_write(data_h, data_d)

Summary

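That is everything. Once the data is written you will see that it is still the raw data, as scraped (the values stay in Chinese; 万亿 means "trillion"):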
年份,中国,GDP(美元),占世界%
2019,"14.34万亿 (14,342,902,842,915)",16.3550%
2018,"13.89万亿 (13,894,817,110,036)",16.0900%
2017,"12.31万亿 (12,310,408,652,423)",15.1552%
2016,"11.23万亿 (11,233,277,146,512)",14.7156%
2015,"11.06万亿 (11,061,552,790,044)",14.7098%
2014,"10.48万亿 (10,475,682,846,632)",13.1851%
2013,"9.57万亿 (9,570,405,758,739)",12.3805%
2012,"8.53万亿 (8,532,230,724,141)",11.3542%
2011,"7.55万亿 (7,551,500,425,597)",10.2814%
2010,"6.09万亿 (6,087,164,527,421)",9.2072%
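
The GDP column mixes an abbreviated figure with the exact value in parentheses, and the share column carries a percent sign. As a rough illustration of why these values are not yet usable as numbers, here is a minimal sketch that pulls out the numeric parts. It is not the author's clean-up method; parse_gdp and parse_share are hypothetical helpers, and the sketch assumes every value follows the format shown above:

```python
import re


def parse_gdp(raw):
    """Return the exact GDP in USD from text like '14.34万亿 (14,342,902,842,915)'."""
    match = re.search(r"\(([\d,]+)\)", raw)
    if match is None:
        raise ValueError("no parenthesized figure in {!r}".format(raw))
    # Drop the thousands separators before converting to int
    return int(match.group(1).replace(",", ""))


def parse_share(raw):
    """Return the share of world GDP as a float percentage from text like '16.3550%'."""
    return float(raw.strip().rstrip("%"))
```

For the 2019 row, parse_gdp gives the integer inside the parentheses and parse_share gives the percentage as a float.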
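In the next post we will cover how to normalize this data into something we can actually use. Thanks for reading this blog by 菜鸟远远 ("newbie Yuanyuan"). If you liked it, please bookmark it and give it a like.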