Python Crash Course (Python编程从入门到实践), Chapter 16: Downloading Data

Latest recommended article published on 2024-06-28 04:47:22

无穷范数

Article link: https://blog.csdn.net/zj15001/article/details/125922399

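Chapter 16: Downloading Data

Contents

Chapter 16: Downloading Data
- 16.1 The CSV File Format
- 16.2 Mapping Global Earthquakes: The JSON Format
- Summary

Common data storage formats: CSV and JSON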

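16.1 The CSV File Format

A CSV file stores data as a series of comma-separated values written to a text file.
Differences between CSV and Excel:
1. CSV is plain text. An Excel workbook is not, and it carries a lot of formatting information along with the data.
2. CSV files are smaller and easier to create, distribute, and read, which makes them a good fit for structured information such as exported records or traffic statistics.
3. On Windows, CSV files open in Excel by default, but they are still plain text files.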
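
Because a CSV file is plain text, it can be written and parsed with nothing but the standard library. The following is a minimal sketch that uses an in-memory buffer in place of a real file; the sample rows are made up for illustration and are not taken from the weather data.

```python
import csv
import io

# Hypothetical sample rows, for illustration only.
rows = [
    ["NAME", "DATE", "TMAX"],
    ["SITKA AIRPORT, AK US", "2018-07-01", "62"],
]

buffer = io.StringIO()  # stands in for a file opened in text mode
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)  # ordinary text; the field that contains a comma is quoted

# Parsing the same text gives back the original fields.
parsed = list(csv.reader(io.StringIO(text)))
```

Note that the station name contains a comma, so the writer wraps it in double quotes. This is why splitting lines on commas by hand is fragile and the csv module is the safer choice.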

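16.1.1 Parsing the CSV File Headers

File header: the first line of a data file, which indicates what information each of the following lines contains.

import csv  # import the csv module

filename = 'data/sitka_weather_07-2018_simple.csv'  # path of the file to read

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row
    print(header_row)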

16.1.2 Printing the Headers and Their Positions

Task: print each header in the list along with its index position.

import csv  # import the csv module

filename = 'data/sitka_weather_07-2018_simple.csv'  # path of the file to read

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row

    for index, column_header in enumerate(header_row):  # enumerate() yields each item's index and value
        print(index, column_header)
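
The index positions printed above can also be stored in a dictionary, so later code can look a column up by name instead of hard-coding a number. This is only a sketch: the two-line sample and its column names are assumptions that mimic the layout of the weather file, and `column_index` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical two-line sample in the same layout as the weather file.
sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"\n'
)

reader = csv.reader(io.StringIO(sample))
header_row = next(reader)

# Map each column name to its position in the row.
column_index = {name: index for index, name in enumerate(header_row)}

first_row = next(reader)
high = int(first_row[column_index["TMAX"]])
print(column_index["TMAX"], high)
```

Looking columns up by name keeps the code working if the file's column order changes, at the cost of one extra dictionary.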

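16.1.3 Extracting and Reading Data

Task: read the high temperature for each day.

import csv  # import the csv module

filename = 'data/sitka_weather_07-2018_simple.csv'  # path of the file to read

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row

    # Get the high temperatures from the file.
    highs = []  # start with an empty list
    for row in reader:  # loop over the remaining rows of the file
        high = int(row[5])  # column 5 holds the daily high, stored as a string
        highs.append(high)

print(highs)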
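
The loop above assumes every row has a reading. `int('')` raises `ValueError`, so a single empty field would stop the program. One possible way to guard against that is sketched below; the sample data and the `skipped` list are hypothetical and only illustrate the pattern.

```python
import csv
import io

# Hypothetical sample; the second data row has an empty TMAX field.
sample = (
    "DATE,TMAX\n"
    "2018-07-01,62\n"
    "2018-07-02,\n"
    "2018-07-03,60\n"
)

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row

highs, skipped = [], []
for row in reader:
    try:
        high = int(row[1])  # int('') raises ValueError
    except ValueError:
        skipped.append(row[0])  # remember which date had no reading
    else:
        highs.append(high)

print(highs, skipped)
```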

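16.1.4 Plotting the Temperature List

Task: visualize the high temperatures and format the chart so it is easier to read.

import csv  # import the csv module
import matplotlib.pyplot as plt

filename = 'data/sitka_weather_07-2018_simple.csv'  # path of the file to read

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row

    # Get the high temperatures from the file.
    highs = []  # start with an empty list
    for row in reader:  # loop over the remaining rows of the file
        high = int(row[5])
        highs.append(high)

# Plot the high temperatures.
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(highs, c='red')

# Format the plot.
ax.set_title("Daily high temperatures - 2018", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate()
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()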

16.1.5 The `datetime` Module

Task: add dates to the chart.

"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"

The string "2018-07-01" has to be converted into an object that represents that date, which is what datetime.strptime() does.

from datetime import datetime
first_date = datetime.strptime("2018-07-01", "%Y-%m-%d")
print(first_date)
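
The format string tells `strptime()` how to read the text: `%Y` is a four-digit year, `%m` a two-digit month, and `%d` a two-digit day. The short sketch below shows the round trip, with `strftime()` turning the object back into text; the output format chosen here is arbitrary.

```python
from datetime import datetime

first_date = datetime.strptime("2018-07-01", "%Y-%m-%d")

# The parsed object exposes the date parts as integers.
parts = (first_date.year, first_date.month, first_date.day)

# strftime() goes the other way, from a datetime object to text.
label = first_date.strftime("%d/%m/%Y")
print(parts, label)
```

If the text does not match the format string, `strptime()` raises `ValueError`.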

import csv  # import the csv module
from datetime import datetime

import matplotlib.pyplot as plt

filename = 'data/sitka_weather_07-2018_simple.csv'  # path of the file to read

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row

    # Get dates and high temperatures from the file.
    dates, highs = [], []  # start with two empty lists
    for row in reader:  # loop over the remaining rows of the file
        current_date = datetime.strptime(row[2], '%Y-%m-%d')  # convert the date string in row[2] to a datetime object
        dates.append(current_date)  # append it to the dates list
        high = int(row[5])
        highs.append(high)

# Plot the high temperatures.
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates, highs, c='red')

# Format the plot.
ax.set_title("Daily high temperatures - 2018", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate()  # draw the date labels diagonally so they do not overlap
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

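16.1.7 Plotting a Longer Timeframe

Task: cover a full year of data.

import csv  # import the csv module
from datetime import datetime

import matplotlib.pyplot as plt

filename = 'data/sitka_weather_2018_simple.csv'  # path of the full-year file

with open(filename) as f:  # open the file and assign the file object to f
    reader = csv.reader(f)  # csv.reader() creates a reader object tied to the file object f
    header_row = next(reader)  # next() returns the next line of the file, here the header row

    # Get dates and high temperatures from the file.
    dates, highs = [], []  # start with two empty lists
    for row in reader:  # loop over the remaining rows of the file
        current_date = datetime.strptime(row[2], '%Y-%m-%d')  # convert the date string in row[2] to a datetime object
        dates.append(current_date)  # append it to the dates list
        high = int(row[5])
        highs.append(high)

# Plot the high temperatures.
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates, highs, c='red')

# Format the plot.
ax.set_title("Daily high temperatures - 2018", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate()  # draw the date labels diagonally so they do not overlap
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()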

16.1.8 Plotting a Second Data Series

Task: add the low temperature data.

import csv
from datetime import datetime

from matplotlib import pyplot as plt

filename = 'data/sitka_weather_2018_simple.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    # Get dates, and high and low temperatures from this file.
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        low = int(row[6])  # daily low temperature
        dates.append(current_date)
        highs.append(high)
        lows.append(low)

# Plot the high and low temperatures.
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates, highs, c='red', alpha=0.5)
ax.plot(dates, lows, c='blue', alpha=0.5)


# Format plot.
ax.set_title("Daily high and low temperatures - 2018", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate()
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()

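16.1.9 Shading an Area in the Chart

Task: use shading to show the range between each day's high and low temperatures.

fill_between() takes one series of x-values and two series of y-values, and fills the space between the two y-series.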

import csv
from datetime import datetime

from matplotlib import pyplot as plt

filename = 'data/sitka_weather_2018_simple.csv'
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    # Get dates, and high and low temperatures from this file.
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        high = int(row[5])
        low = int(row[6])
        dates.append(current_date)
        highs.append(high)
        lows.append(low)

# Plot the high and low temperatures.
# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
fig, ax = plt.subplots()
ax.plot(dates, highs, c='red', alpha=0.5)
ax.plot(dates, lows, c='blue', alpha=0.5)
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)  # fill the space between the two series

# Format plot.
ax.set_title("Daily high and low temperatures - 2018", fontsize=24)
ax.set_xlabel('', fontsize=16)
fig.autofmt_xdate()
ax.set_ylabel("Temperature (F)", fontsize=16)
ax.tick_params(axis='both', which='major', labelsize=16)

plt.show()
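
The height of the shaded band on a given day is simply that day's high minus its low. As a rough sketch of what the shading represents, the range can be computed directly from the two lists; the temperature values below are made up for illustration.

```python
# Hypothetical readings for three days, in degrees Fahrenheit.
highs = [62, 58, 70]
lows = [50, 53, 54]

# Pair each high with the low from the same day and take the difference.
ranges = [high - low for high, low in zip(highs, lows)]
widest = max(ranges)
print(ranges, widest)
```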

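16.2 Mapping Global Earthquakes: The JSON Format

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate.
JSON is a text format that is completely language independent, but it uses conventions familiar from the C family of languages (including C, C++, C#, Java, JavaScript, Perl, and Python). These properties make JSON an ideal data-interchange language.
JSON is built on two structures:
1. A collection of name/value pairs. In various languages this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
2. An ordered list of values. In most languages this is realized as an array.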
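
In Python these two structures map onto `dict` and `list`. The snippet below is a small sketch of that mapping using `json.loads()`; the JSON text is a made-up example, not a record from the earthquake file.

```python
import json

# Hypothetical JSON text: one object that contains an array.
text = '{"mag": 1.6, "place": "example", "coordinates": [-150.75, 61.76, 44.6]}'

data = json.loads(text)

# The JSON object becomes a dict, and the JSON array becomes a list.
print(type(data).__name__, type(data["coordinates"]).__name__)
print(data["mag"], data["coordinates"][0])
```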

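Task: load the data and display it in an easy-to-read form.

import json

filename = 'data/eq_data_30_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)  # json.load() converts the data into a form Python can work with

readable_file = 'data/readable_eq_data.json'  # new file that will hold the readable version of the data
with open(readable_file, 'w') as f:  # write to the new file, not to the original data file
    # json.dump() takes the JSON data object all_eq_data and the file object f;
    # indent=4 formats the output with indentation that matches the data's structure
    json.dump(all_eq_data, f, indent=4)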
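
To see what `indent=4` changes, compare the compact and indented output for the same object. This sketch uses `json.dumps()`, which returns a string instead of writing to a file, and a tiny made-up dictionary in place of the real earthquake data.

```python
import json

# Hypothetical stand-in for the earthquake data.
data = {"type": "FeatureCollection", "features": [{"id": "example"}]}

compact = json.dumps(data)  # everything on one line
readable = json.dumps(data, indent=4)  # one item per line, nested levels indented
print(compact)
print(readable)

# Both strings decode to the same data; only the whitespace differs.
same = json.loads(compact) == json.loads(readable)
```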

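16.2.3 Making a List of All Earthquakes

Task: make a list that contains all the information about every earthquake.

import json

filename = 'data/eq_data_30_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)

all_eq_dicts = all_eq_data['features']  # take the data stored under the 'features' key
print(len(all_eq_dicts))  # number of earthquakes in the file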
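
It matters which object is passed to `len()`: the top-level dictionary only has a handful of keys, while the list under `'features'` has one entry per earthquake. The sketch below uses a made-up structure with the same shape to show the difference; the key names other than `'features'` are assumptions.

```python
# Hypothetical data shaped like the earthquake file, with two earthquakes.
all_eq_data = {
    "type": "FeatureCollection",
    "metadata": {"title": "example feed"},
    "features": [{"id": "a"}, {"id": "b"}],
}

all_eq_dicts = all_eq_data["features"]

top_level_keys = len(all_eq_data)  # counts the keys of the outer dictionary
earthquake_count = len(all_eq_dicts)  # counts the earthquakes
print(top_level_keys, earthquake_count)
```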

16.2.4 Extracting Magnitudes

Task: extract the magnitude of each earthquake.

import json

filename = 'data/eq_data_30_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)
all_eq_dicts = all_eq_data['features']
mags = []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    mags.append(mag)
    

print(mags[:10])

16.2.4 Extracting Location Data

Task: extract the location data, which is stored under the "geometry" key in its "coordinates" list.

import json

filename = 'data/eq_data_30_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)
all_eq_dicts = all_eq_data['features']
mags, titles, lons, lats = [], [], [], []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    title = eq_dict['properties']['title']
    lon = eq_dict['geometry']['coordinates'][0]
    lat = eq_dict['geometry']['coordinates'][1]
    mags.append(mag)
    titles.append(title)
    lons.append(lon)
    lats.append(lat)


if __name__ == '__main__':
    print(mags[:5])
    print(titles[:5])
    print(lons[:5])
    print(lats[:5])
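
In the loop above, index 0 of `coordinates` is read as the longitude and index 1 as the latitude. The same extraction can be written with list comprehensions, as in this sketch; the two earthquake records are invented, and the third coordinate is assumed to be a depth value that is not used here.

```python
# Hypothetical records with the same nesting as the earthquake data.
all_eq_dicts = [
    {"properties": {"mag": 1.2, "title": "M 1.2 - example A"},
     "geometry": {"coordinates": [-116.5, 33.5, 10.0]}},
    {"properties": {"mag": 4.3, "title": "M 4.3 - example B"},
     "geometry": {"coordinates": [140.1, 36.2, 35.0]}},
]

mags = [eq["properties"]["mag"] for eq in all_eq_dicts]
titles = [eq["properties"]["title"] for eq in all_eq_dicts]
lons = [eq["geometry"]["coordinates"][0] for eq in all_eq_dicts]  # longitude comes first
lats = [eq["geometry"]["coordinates"][1] for eq in all_eq_dicts]  # latitude comes second
print(mags, lons, lats)
```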

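16.2.5 Building a Scatter Plot of the Earthquakes

Task: plot the earthquake locations as a scatter plot of longitude against latitude.

import plotly.express as px

# lons and lats are the lists built in the previous step.
fig = px.scatter(
    x=lons,
    y=lats,
    labels={'x': 'Longitude', 'y': 'Latitude'},
    range_x=[-200, 200],
    range_y=[-90, 90],
    width=800,
    height=800,
    title='Global Earthquakes',
)
fig.write_html('global_earthquakes.html')
fig.show()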

16.2.7 Packaging the Data with the `Pandas` Data Analysis Tool

Use the pandas library to package the data: all the information is gathered into one DataFrame whose named columns play the role of dictionary keys. Here the DataFrame is built from zipped lists plus a list of column names.

import json
import plotly.express as px
import pandas as pd


filename = 'data/eq_data_30_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)
all_eq_dicts = all_eq_data['features']
mags, titles, lons, lats = [], [], [], []
for eq_dict in all_eq_dicts:
    mag = eq_dict['properties']['mag']
    title = eq_dict['properties']['title']
    lon = eq_dict['geometry']['coordinates'][0]
    lat = eq_dict['geometry']['coordinates'][1]
    mags.append(mag)
    titles.append(title)
    lons.append(lon)
    lats.append(lat)

# Each zipped tuple becomes one row; the column names are given separately.
data = pd.DataFrame(
    data=zip(lons, lats, titles, mags),
    columns=['Longitude', 'Latitude', 'Location', 'Magnitude'],
)

fig = px.scatter(
    data,
    x='Longitude',
    y='Latitude',
    range_x=[-200, 200],
    range_y=[-90, 90],
    width=800,
    height=800,
    title='Global Earthquakes',
    size='Magnitude',
    size_max=10,
    color='Magnitude',  # color each marker by its magnitude
    hover_name='Location',  # text shown when the mouse hovers over a marker
)
fig.write_html('global_earthquakes.html')
fig.show()
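
The `zip()` call is what turns four separate lists into rows. The sketch below shows the rows it produces, using only the standard library and made-up values; the `records` list of dictionaries is a hypothetical illustration of the column-name-to-value pairing, not something pandas requires.

```python
# Hypothetical values for two earthquakes.
lons = [-116.5, 140.1]
lats = [33.5, 36.2]
titles = ["M 1.2 - example A", "M 4.3 - example B"]
mags = [1.2, 4.3]
columns = ["Longitude", "Latitude", "Location", "Magnitude"]

# zip() takes the i-th item from every list and groups them into one tuple.
rows = list(zip(lons, lats, titles, mags))

# Pairing each row with the column names gives one dictionary per earthquake.
records = [dict(zip(columns, row)) for row in rows]
print(rows[0])
print(records[1])
```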