Analyzing the Novel Coronavirus (COVID-19/2019-nCoV) Epidemic with Python

祈LHL

Important Notes

Grading weights: analysis write-up : completeness : code quality = 3 : 5 : 2

The analysis write-up means your reasoning for each question during the analysis, plus your interpretation and explanation of the results (be concise; don't write for the sake of writing).

P.S. Code you write yourself beats any ghost-written code, pretty or not; all that matters is being further along today than yesterday. Keep it up!

Because the dataset is large, preview it with head() or tail() to avoid leaving the program unresponsive for a long time.

=======================

This project's data comes from DXY (丁香园). The main goal is to analyze historical epidemic data in order to better understand the outbreak and how it is developing, and to provide data support for decisions in the fight against the epidemic.

The dataset used in this chapter can be obtained in the comments section of my Bilibili video.

I. Posing the Questions

We study the following questions from three angles: the whole country, your own province/city, and the situation abroad:

(1) How do the nationwide cumulative confirmed/suspected/cured/death counts trend over time?

(2) How do the nationwide daily new confirmed/suspected/cured/death counts trend over time?

(3) How do the nationwide new imported cases trend over time?

(4) What is the situation in your own province/city?

(5) What is the epidemic situation abroad?

(6) Based on your analysis, what advice would you give individuals and society for fighting the epidemic?

II. Understanding the Data

Raw dataset: AreaInfo.csv. Import the relevant packages and read the data:

r_hex = '#dc2624'     # red,       RGB = 220,38,36
dt_hex = '#2b4750'    # dark teal, RGB = 43,71,80
tl_hex = '#45a0a2'    # teal,      RGB = 69,160,162
r1_hex = '#e87a59'    # red,       RGB = 232,122,89
tl1_hex = '#7dcaa9'   # teal,      RGB = 125,202,169
g_hex = '#649E7D'     # green,     RGB = 100,158,125
o_hex = '#dc8018'     # orange,    RGB = 220,128,24
tn_hex = '#C89F91'    # tan,       RGB = 200,159,145
g50_hex = '#6c6d6c'   # grey-50,   RGB = 108,109,108
bg_hex = '#4f6268'    # blue grey, RGB = 79,98,104
g25_hex = '#c7cccf'   # grey-25,   RGB = 199,204,207
import numpy as np
import pandas as pd
import matplotlib
import re
import matplotlib.pyplot as plt
from matplotlib.pyplot import MultipleLocator


data = pd.read_csv(r'data/AreaInfo.csv')

Inspect and summarize the data to get a rough sense of it.

data.head()
[Output: first five rows. Country-level records (continent/country names in Chinese and English, province-level counts, updateTime 2020-06-23 10:01:45) for the US, Brazil, UK, Russia, and Chile; all city-level columns are NaN.]

III. Data Cleaning

(1) Basic processing

Data cleaning mainly covers subsetting, handling missing data, converting data formats, and handling outliers.

Selecting the domestic data (the final selection is named china)
  1. Select the domestic epidemic data.

  2. Convert the update-time column (updateTime) to a date type, extract year-month-day, and check the result. (Hint: dt.date)

  3. The data is updated hourly, so each day contains many duplicates; deduplicate and keep only the latest record within each day.

Hint: df.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first', inplace=False)

where df is the DataFrame of the domestic data you selected.
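The three steps above can be sketched on a tiny invented frame (the column names follow AreaInfo.csv; the `_demo` names and all numbers are made up for illustration):

```python
import pandas as pd

# Toy stand-in for AreaInfo.csv: two provinces, two hourly reports each on the
# same day (column names follow the real file; the numbers are invented)
df_demo = pd.DataFrame({
    'countryName': ['中国'] * 4,
    'provinceName': ['湖北省', '湖北省', '广东省', '广东省'],
    'province_confirmedCount': [100, 120, 50, 55],
    'updateTime': ['2020-02-01 08:00', '2020-02-01 20:00',
                   '2020-02-01 09:00', '2020-02-01 21:00'],
})

# Step 1: keep only the domestic rows (.copy() avoids SettingWithCopyWarning later)
china_demo = df_demo.loc[df_demo['countryName'] == '中国'].copy()

# Steps 2-3: sort newest-first while the full timestamp is still available, then
# reduce updateTime to a date and keep the first (i.e. latest) row per province per day
china_demo['updateTime'] = pd.to_datetime(china_demo['updateTime'])
china_demo = china_demo.sort_values('updateTime', ascending=False)
china_demo['updateTime'] = china_demo['updateTime'].dt.date
china_demo = china_demo.drop_duplicates(subset=['provinceName', 'updateTime'], keep='first')
```

Note that with keep='first' the sort direction matters: sorting ascending would keep each day's earliest report instead of the latest.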

Analysis: select the rows whose countryName is 中国 (China) to form CHINA.

CHINA = data.loc[data['countryName'] == '中国'].copy()  # .copy() so later in-place edits don't hit a view
CHINA.dropna(subset=['cityName'], how='any', inplace=True)
#CHINA

Analysis: build the list of all Chinese city names.

cities = list(set(CHINA['cityName']))

Analysis: sort each city's rows by updateTime, newest first.

# Sorting a slice inside a loop discards the result; sort the whole frame once,
# newest first, so that keep='first' in the dedup step retains each day's latest record
CHINA = CHINA.sort_values(by=['cityName', 'updateTime'], ascending=[True, False])

Analysis: drop rows with missing city names.

CHINA.dropna(subset=['cityName'],inplace=True)
#CHINA.loc[CHINA['cityName'] == '秦皇岛'].tail(20)

Analysis: normalize the updateTime column of CHINA to dates.

CHINA.updateTime = pd.to_datetime(CHINA.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#CHINA.loc[data['cityName'] == '秦皇岛'].tail(15)
CHINA.head()
[Output: five rows for 陕西省 (Shaanxi) dated 2020-06-23, cities 境外输入 (imported), 西安, 安康, 汉中, 咸阳, with province- and city-level counts.]

Analysis: within each day keep only the first record; the rows were sorted by time above, so the first record is that day's latest.
Analysis: concat will be used to merge the per-city frames, so initialize china from the first city.

real = CHINA.loc[CHINA['cityName'] == cities[0]].copy()  # start from cities[0] so no city is skipped
real.drop_duplicates(subset='updateTime', keep='first', inplace=True)
china = real

Analysis: deduplicate per day within each city's frame; deduplicating the whole frame at once would keep only one city's data per date.

for city in cities[1:]:
    real_data = CHINA.loc[CHINA['cityName'] == city].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)
    china = pd.concat([real_data, china],sort=False)

Check the data info: is anything missing, and are the dtypes correct?

Hint: if you don't know how to handle the missing values, you may drop them.

Analysis: not every city reports every day. If a day's total counted only the cities that happened to report, cities that have patients but did not report would be ignored and the totals would be distorted. So every city needs a record for every day, including non-reporting days; we therefore fill part of the data by interpolation. The details are in the pivoting and analysis section below.
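A minimal sketch of that filling idea, for a single province's cumulative series with gaps (the numbers are invented; the actual code below builds the calendar with pd.date_range plus concat rather than reindex, but the effect is the same):

```python
import pandas as pd

# One province's cumulative confirmed count, reported only on Jan 1 and Jan 4 (invented)
reported = pd.Series([10, 25],
                     index=pd.to_datetime(['2020-01-01', '2020-01-04']),
                     name='province_confirmedCount')

# Give the province a row for every calendar day, then carry the last
# reported cumulative value forward into the non-reporting days
full_days = pd.date_range(reported.index.min(), reported.index.max())
filled = reported.reindex(full_days).ffill()
```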

china.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32812 entries, 96106 to 208267
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   continentName            32812 non-null  object 
 1   continentEnglishName     32812 non-null  object 
 2   countryName              32812 non-null  object 
 3   countryEnglishName       32812 non-null  object 
 4   provinceName             32812 non-null  object 
 5   provinceEnglishName      32812 non-null  object 
 6   province_zipCode         32812 non-null  int64  
 7   province_confirmedCount  32812 non-null  int64  
 8   province_suspectedCount  32812 non-null  float64
 9   province_curedCount      32812 non-null  int64  
 10  province_deadCount       32812 non-null  int64  
 11  updateTime               32812 non-null  object 
 12  cityName                 32812 non-null  object 
 13  cityEnglishName          31968 non-null  object 
 14  city_zipCode             32502 non-null  float64
 15  city_confirmedCount      32812 non-null  float64
 16  city_suspectedCount      32812 non-null  float64
 17  city_curedCount          32812 non-null  float64
 18  city_deadCount           32812 non-null  float64
dtypes: float64(6), int64(4), object(9)
memory usage: 5.0+ MB
china.head()
[Output: five rows for 广西壮族自治区 (Guangxi), city 贵港 (Guigang), dated 2020-03-16 through 2020-04-02.]
Selecting your own province's data (the final selection is named myhome)

This step can also be deferred until it is needed.

myhome = china.loc[china['provinceName'] == '广东省']
myhome.head()
[Output: five rows for 广东省 (Guangdong): 外地来粤人员, 河源市, 外地来穗人员 in late January 2020, and 潮州 (Chaozhou) in June 2020.]
Selecting the overseas data (the final selection is named world)

This step can also be deferred until it is needed.

world = data.loc[data['countryName'] != '中国']
world.head()
[Output: the same country-level rows as data.head() above: US, Brazil, UK, Russia, Chile as of 2020-06-23.]

Pivoting and analysis

Analysis: interpolate china to fill in part of the missing data.

china.head()
[Output: the same Guigang (贵港) rows as china.head() above.]

Analysis: build the province list and the date list first, and initialize a draft.

province = list(set(china['provinceName']))#every province
#p_city = list(set(china[china['provinceName'] == province[0]]['cityName']))#every city in a province
date_0 = []
for dt in china.loc[china['provinceName'] ==  province[0]]['updateTime']:
    date_0.append(str(dt))
date_0 = list(set(date_0))
date_0.sort()
start = china.loc[china['provinceName'] ==  province[0]]['updateTime'].min()
end = china.loc[china['provinceName'] ==  province[0]]['updateTime'].max()
dates = pd.date_range(start=str(start), end=str(end))
aid_frame = pd.DataFrame({'updateTime': dates,'provinceName':[province[0]]*len(dates)})
aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
#draft = pd.merge(china.loc[china['provinceName'] ==  province[1]], aid_frame, on='updateTime', how='outer').sort_values('updateTime')
draft = pd.concat([china.loc[china['provinceName'] ==  province[0]], aid_frame], join='outer').sort_values('updateTime')
draft.province_confirmedCount.fillna(method="ffill",inplace=True)
draft.province_suspectedCount.fillna(method="ffill", inplace=True)
draft.province_curedCount.fillna(method="ffill", inplace=True)
draft.province_deadCount.fillna(method="ffill", inplace=True)

Analysis: fill in the missing dates by carrying the previous day's values forward. Some provinces stopped reporting new patients from late April on, so their data can only be filled up to late April; beyond that it gradually loses accuracy.

Analysis: format the dates at the same time.

for p in range(1,len(province)):
    date_d = []
    for dt in china.loc[china['provinceName'] ==  province[p]]['updateTime']:
        date_d.append(dt)
    date_d = list(set(date_d))
    date_d.sort()
    start = china.loc[china['provinceName'] ==  province[p]]['updateTime'].min()
    end = china.loc[china['provinceName'] ==  province[p]]['updateTime'].max()
    dates = pd.date_range(start=start, end=end)
    aid_frame = pd.DataFrame({'updateTime': dates,'provinceName':[province[p]]*len(dates)})
    aid_frame.updateTime = pd.to_datetime(aid_frame.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
    X = china.loc[china['provinceName'] ==  province[p]].reset_index(drop=True)  # reset_index returns a new frame; assign it
    Y = aid_frame.reset_index(drop=True)
    draft_d = pd.concat([X,Y], join='outer').sort_values('updateTime')
    draft = pd.concat([draft,draft_d])
    draft.province_confirmedCount.fillna(method="ffill",inplace=True)
    draft.province_suspectedCount.fillna(method="ffill", inplace=True)
    draft.province_curedCount.fillna(method="ffill", inplace=True)
    draft.province_deadCount.fillna(method="ffill", inplace=True)
    #draft['updateTime'] = draft['updateTime'].strftime('%Y-%m-%d')
    #draft['updateTime'] = pd.to_datetime(draft['updateTime'],format="%Y-%m-%d",errors='coerce').dt.date
china = draft
china.head()
[Output: five rows for 天津市 (Tianjin) districts (外地来津, 河北区, 和平区, 滨海新区, 西青区) dated 2020-01-26.]

IV. Data Analysis and Visualization

For each question, select the variables it needs and build a new DataFrame before analyzing and plotting; this keeps the data tidy and the logic clear.

Basic analysis

For the basic analysis, only the numpy, pandas, and matplotlib libraries are allowed.

You may present several axes in one figure or use separate figures.

Choose chart types (line, pie, histogram, scatter, and so on) according to the purpose of the analysis; if you are short on ideas, browse the Baidu epidemic map or other epidemic-tracking sites for inspiration.

(1) How do the nationwide cumulative confirmed/suspected/cured/death counts trend over time?

Analysis: to get the nationwide cumulative trend over time, first assemble the daily national cumulative confirmed counts into date_confirmed.

Analysis: to do that, extract each province's latest cumulative confirmed count for each day, sum across provinces into a DataFrame, and concatenate into date_confirmed in a for loop.
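As a cross-check, the same daily national totals can be computed without the explicit double loop used below, by taking one cumulative value per province per day and summing across provinces. A vectorized sketch with invented numbers:

```python
import pandas as pd

# Toy china-like frame: two provinces over two days (invented numbers)
china_demo = pd.DataFrame({
    'provinceName': ['湖北省', '湖北省', '广东省', '广东省'],
    'updateTime':   ['2020-02-01', '2020-02-02', '2020-02-01', '2020-02-02'],
    'province_confirmedCount': [100, 130, 50, 60],
})

# One cumulative value per province per day (max collapses any duplicate reports),
# then sum across provinces for the national daily total
per_province = china_demo.groupby(['updateTime', 'provinceName'])['province_confirmedCount'].max()
daily_total = per_province.groupby(level='updateTime').sum()
```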

date = list(set(china['updateTime']))
date.sort()
date
[datetime.date(2020, 1, 24),
 datetime.date(2020, 1, 25),
 datetime.date(2020, 1, 26),
 ...
 datetime.date(2020, 6, 22),
 datetime.date(2020, 6, 23)]
china = china.set_index('provinceName')
china = china.reset_index()

Analysis: loop over provinces and dates to get each province's daily cumulative confirmed count; since frames will be concatenated, initialize date_confirmed first.

list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_confirmedCount'])#this province's cumulative confirmed count that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_confirmed = pd.DataFrame(list_d,index=list_e)
date_confirmed.index.name="date"
date_confirmed.columns=["China_confirmedCount"]
date_confirmed
            China_confirmedCount
date
2020-01-24            1956.0

Analysis: loop over the provinces and concatenate each day's national confirmed total into the DataFrame.

l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l +=1
    for p in range(0,32):
        try:
            con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
            list_p.append(con_0['province_confirmedCount'])#this province's cumulative confirmed count that day
        except:
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name="date"
    confirmed.columns=["China_confirmedCount"]
    date_confirmed = pd.concat([date_confirmed,confirmed],sort=False)
date_confirmed
            China_confirmedCount
date
2020-01-24            1956.0
2020-01-25            2253.0
2020-01-26            1956.0
2020-01-27            2825.0
2020-01-28            4589.0
...                      ...
2020-06-17            8106.0
2020-06-18            6862.0
2020-06-19            6894.0
2020-06-20            6921.0
2020-06-21            6157.0

150 rows × 1 columns

Analysis: drop missing and incomplete values.

date_confirmed.dropna(subset=['China_confirmedCount'],inplace=True)
date_confirmed.tail(20)
            China_confirmedCount
date
2020-06-02           78782.0
2020-06-03           78780.0
2020-06-04           76903.0
2020-06-05           76908.0
2020-06-06            8777.0
2020-06-07            8782.0
2020-06-08            8628.0
2020-06-09            8634.0
2020-06-10            8638.0
2020-06-11            8649.0
2020-06-12            8658.0
2020-06-13            8665.0
2020-06-14            8733.0
2020-06-15            8772.0
2020-06-16            8055.0
2020-06-17            8106.0
2020-06-18            6862.0
2020-06-19            6894.0
2020-06-20            6921.0
2020-06-21            6157.0

Analysis: from late April to late May the data is distorted by missing provinces (some provinces have reported no new patients since late April), and from 2020-06-06 on it is entirely unreliable, so I dropped the data from 2020-06-06 onward.

date_confirmed = date_confirmed.drop(['2020-06-06','2020-06-07','2020-06-08','2020-06-09','2020-06-10','2020-06-11','2020-06-12','2020-06-13','2020-06-14',
                     '2020-06-15','2020-06-16','2020-06-19','2020-06-18','2020-06-20','2020-06-17','2020-06-21'])

Analysis: build a helper that assembles and concatenates the daily totals.

def data_frame(acc, china, element):
    # Append each day's national total of `element` to acc, then trim the unreliable dates
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l +=1
        for p in range(0,32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
                list_p.append(con_0[element])
            except:
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name="date"
        link.columns=["China"]
        acc = pd.concat([acc,link],sort=False)
    acc.dropna(subset=['China'],inplace=True)
    acc = acc.drop(['2020-06-06','2020-06-07','2020-06-08','2020-06-09','2020-06-10','2020-06-11','2020-06-12','2020-06-13','2020-06-14',
                  '2020-06-15','2020-06-16','2020-06-19','2020-06-18','2020-06-20','2020-06-17','2020-06-21'])
    return acc

Analysis: initialize each series.

#cumulative cured  date_cured
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_cured = pd.DataFrame(list_d, index=list_e)
date_cured.index.name="date"
date_cured.columns=["China"]



#cumulative deaths  date_dead
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_dead = pd.DataFrame(list_d, index=list_e)
date_dead.index.name="date"
date_dead.columns=["China"]
#cumulative confirmed  date_confirmed
plt.rcParams['font.sans-serif'] = ['SimHei'] #switch font so Chinese labels render
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_confirmed.index
y = date_confirmed.values
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_title('累计确诊患者',fontdict={
      'color':'black',
      'size':24
})
ax.set_xticks( range(0,len(x),30))

[Figure: line chart of cumulative confirmed patients (累计确诊患者)]

#cumulative cured  date_cured
date_cured = data_frame(date_cured,china,'province_curedCount')
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_cured.index
y = date_cured.values
ax.set_title('累计治愈患者',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),30))

[Figure: line chart of cumulative cured patients (累计治愈患者)]

Analysis: cumulative suspected counts cannot be recovered by filling.

#cumulative deaths  date_dead
date_dead = data_frame(date_dead,china,'province_deadCount')
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_dead.index
y = date_dead.values
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
x_major_locator=MultipleLocator(12)
ax=plt.gca()
ax.set_title('累计死亡患者',fontdict={
      'color':'black',
      'size':24
})
ax.xaxis.set_major_locator(x_major_locator)
ax.set_xticks( range(0,len(x),30))

[Figure: line chart of cumulative deaths (累计死亡患者)]

Analysis: the epidemic broke out in early January; growth slowed from late February and flattened out by late April. Cured counts rose sharply from early February and flattened by late March. Deaths climbed from late January, flattened by late February, then spiked in late April for statistical reasons before levelling off again.
Summary: from late April to late May the confirmed and cured series are distorted by too many missing provinces (some provinces have had no new patients since). Other gaps were filled where possible, but the closer to the tail, the less reliable the data. The death series filled almost perfectly, with hardly any gaps.

(2) How do the nationwide daily new confirmed/suspected/cured/death counts trend over time?

Analysis: daily new confirmed/cured/death counts require running a day-over-day diff on china, per province per day.

Analysis: first initialize the series, then adapt the concatenation helper above to this question.
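The diff step on a cumulative daily series can be sketched as follows (invented numbers; the first day has no predecessor, so diff leaves it NaN):

```python
import pandas as pd

# Toy cumulative national series indexed by date string (invented numbers)
cumulative = pd.DataFrame(
    {'China': [41, 45, 62, 70]},
    index=pd.Index(['2020-01-22', '2020-01-23', '2020-01-24', '2020-01-25'], name='date'),
)

# Day-over-day difference turns a cumulative series into daily new cases;
# the first day has no predecessor, so its diff is NaN
new_cases = cumulative['China'].diff()
```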

#daily new confirmed  date_new_confirmed
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_confirmedCount'])#this province's cumulative confirmed count that day
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_confirmed = pd.DataFrame(list_d,index=list_e)
date_new_confirmed.index.name="date"
date_new_confirmed.columns=["China"]
date_new_confirmed


#daily new cured  date_new_cured
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_curedCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_cured = pd.DataFrame(list_d, index=list_e)
date_new_cured.index.name="date"
date_new_cured.columns=["China"]


#daily new deaths  date_new_dead
list_p = []
list_d = []
list_e = []
for p in range(0,32):
    try:
        con_0 = china.loc[china['updateTime'] == date[2]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
        list_p.append(con_0['province_deadCount'])
    except:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_dead = pd.DataFrame(list_d, index=list_e)
date_new_dead.index.name="date"
date_new_dead.columns=["China"]

Analysis: build the concatenation helper.

def data_new_frame(acc, china, element):
    # Append each day's national total of `element` to acc, without trimming dates
    l = 0
    for i in date[3:]:
        list_p = []
        list_d = []
        list_e = []
        l +=1
        for p in range(0,32):
            try:
                con_0 = china.loc[china['updateTime'] == date[l]].loc[china['provinceName'] ==  province[p]].iloc[[0]].iloc[0] 
                list_p.append(con_0[element])
            except:
                continue
        list_d.append(sum(list_p))
        list_e.append(str(date[l]))
        link = pd.DataFrame(list_d, index=list_e)
        link.index.name="date"
        link.columns=["China"]
        acc = pd.concat([acc,link],sort=False)
    acc.dropna(subset=['China'],inplace=True)
    return acc

Analysis: fill the data and drop the dates that are missing provinces.

d = data_new_frame(date_new_confirmed,china,'province_confirmedCount')
for i in range(len(d)):
    dr = []
    for a,b in zip(range(0,len(d)-1),range(1,len(d)-2)):
        if d.iloc[b].iloc[0] < d.iloc[a].iloc[0]:
            dr.append(d.iloc[b].iloc[0])
    d = d[~d['China'].isin(dr)]

Analysis: take the day-over-day difference.

d['China'] = d['China'].diff()

Analysis: drop two dates that are missing provinces.

d.drop(['2020-06-20','2020-06-21'],inplace=True)

Analysis: draw a line chart to show the trend over time.

#daily new confirmed  date_confirmed
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = d.index
y = d.values
ax.set_title('新增确诊患者',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

[Figure: line chart of daily new confirmed patients (新增确诊患者)]

Analysis: build the date_new_cured DataFrame from the initialized series, then draw a line chart of the trend.

cu = data_new_frame(date_new_cured,china,'province_curedCount')
for i in range(len(cu)):
    dr = []
    for a,b in zip(range(0,len(cu)-1),range(1,len(cu)-2)):
        if cu.iloc[b].iloc[0] < cu.iloc[a].iloc[0]:
            dr.append(cu.iloc[b].iloc[0])
    cu = cu[~cu['China'].isin(dr)]
cu['China'] = cu['China'].diff()
cu.drop(['2020-06-20','2020-06-21'],inplace=True)
#daily new cured  date_new_cured
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = cu.index
y = cu.values
ax.set_title('新增治愈患者',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

[Figure: line chart of daily new cured patients (新增治愈患者)]

Analysis: build the date_new_dead DataFrame from the initialized series, then draw a line chart of the trend.

de = data_new_frame( date_new_dead,china,'province_deadCount')
for i in range(len(de)):
    dr = []
    for a,b in zip(range(0,len(de)-1),range(1,len(de)-2)):
        if de.iloc[b].iloc[0] < de.iloc[a].iloc[0]:
            dr.append(de.iloc[b].iloc[0])
    de = de[~de['China'].isin(dr)]
de['China'] = de['China'].diff()
de.drop(['2020-06-21'],inplace=True)
#daily new deaths   date_new_dead
fig = plt.figure( figsize=(16,6), dpi=100)
ax = fig.add_subplot(1,1,1)
x = de.index
y = de.values
ax.set_title('新增死亡患者',fontdict={
      'color':'black',
      'size':24
})
ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-' )
ax.set_xticks( range(0,len(x),10))

[Figure: line chart of daily new deaths (新增死亡患者)]

Analysis: new confirmed cases rose from late January, peaked around February 14, then declined and flattened.
Analysis: new cured cases rose from late January, peaked around March 2, then declined, flattening from early April.
Analysis: new deaths rose from late January, peaked in February, grew slowly from early March, and spiked around April 17 for statistical reasons before falling back.

(3) How do the nationwide new imported cases trend over time?

Analysis: the new imported-case counts require day-over-day subtraction on CHINA.

Analysis: first extract the imported-case (境外输入) rows from CHINA, then complete the time series and take differences.
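That plan, completing one province's imported-case calendar, forward-filling the cumulative count, and then differencing to get daily new imports, can be sketched with invented numbers:

```python
import pandas as pd

# One province's cumulative imported-case count on its reporting days only (invented)
reported = pd.Series([5, 9],
                     index=pd.to_datetime(['2020-03-24', '2020-03-27']),
                     name='city_confirmedCount')

# Complete the calendar, carry the cumulative count forward over the gap,
# then difference to get the daily new imported cases
days = pd.date_range(reported.index.min(), reported.index.max())
daily_new = reported.reindex(days).ffill().diff()
```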

imported = CHINA.loc[CHINA['cityName'] == '境外输入'].copy()  # .copy() so the date conversion below does not hit a view
imported.updateTime = pd.to_datetime(imported.updateTime,format="%Y-%m-%d",errors='coerce').dt.date
imported
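Pandas emits SettingWithCopyWarning when a column is assigned on a filtered slice like `imported`; taking an explicit `.copy()` after filtering avoids it. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({'cityName': ['境外输入', '武汉'],
                   'updateTime': ['2020-03-24 10:00:00', '2020-03-24 11:00:00']})

# Filtering returns a view-like object; .copy() makes ownership explicit
imported = df.loc[df['cityName'] == '境外输入'].copy()

# Column assignment is now unambiguous and raises no SettingWithCopyWarning
imported['updateTime'] = pd.to_datetime(imported['updateTime']).dt.date
print(imported['updateTime'].iloc[0])  # 2020-03-24
```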
[output: 607 rows × 19 columns — the 境外输入 rows, one per province per update; e.g.
陕西省 2020-06-23: city_confirmedCount 72.0, city_curedCount 65.0;
江苏省 2020-06-23: city_confirmedCount 23.0, city_curedCount 22.0]

Analysis: fill in the dates missing from each province's series

for i in range(0, len(province)):
    date_b = []
    for dt in imported.loc[imported['provinceName'] == province[i]]['updateTime']:
        date_b.append(str(dt))
    list_j_d = sorted(set(date_b))   # distinct dates this province reported on
    try:
        start = imported.loc[imported['provinceName'] == province[i]]['updateTime'].min()
        end = imported.loc[imported['provinceName'] == province[i]]['updateTime'].max()
        dates_b = pd.date_range(start=str(start), end=str(end))
        aid_frame_b = pd.DataFrame({'updateTime': dates_b,
                                    'provinceName': [province[i]] * len(dates_b)})
        aid_frame_b.updateTime = pd.to_datetime(aid_frame_b.updateTime, format="%Y-%m-%d", errors='coerce').dt.date
        # append placeholder rows for the missing dates, then carry the last
        # known cumulative counts forward
        draft_b = pd.concat([imported.loc[imported['provinceName'] == province[i]], aid_frame_b],
                            join='outer').sort_values('updateTime')
        draft_b.city_confirmedCount.fillna(method="ffill", inplace=True)
        draft_b.city_suspectedCount.fillna(method="ffill", inplace=True)
        draft_b.city_curedCount.fillna(method="ffill", inplace=True)
        draft_b.city_deadCount.fillna(method="ffill", inplace=True)
        # NOTE: the chained indexing in the next two statements returns copies,
        # so the requested fillna/diff never reaches draft_b and the city_* counts
        # stay cumulative (which the aggregation further down relies on)
        draft_b.loc[draft_b['provinceName'] == province[i]].fillna(0, inplace=True, limit=1)
        draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'] = \
            draft_b.loc[draft_b['provinceName'] == province[i]].loc[:, 'city_confirmedCount':'city_deadCount'].diff()
        draft_b.dropna(subset=['city_confirmedCount', 'city_suspectedCount',
                               'city_curedCount', 'city_deadCount'], inplace=True)
        imported = pd.concat([imported, draft_b], join='outer').sort_values('updateTime')
    except Exception:
        # provinces with no imported-case rows are skipped
        continue
imported
imported
[output: 2524 rows × 19 columns — the original 境外输入 rows plus the forward-filled
placeholder rows (which carry only provinceName, updateTime and the four city_* counts)]
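The gap-filling loop above can also be expressed with `reindex` over a complete date range followed by a forward fill. A sketch on a hypothetical cumulative series for one province:

```python
import pandas as pd

# Hypothetical cumulative counts reported on only two of four days
s = pd.Series([5.0, 9.0],
              index=pd.to_datetime(['2020-03-24', '2020-03-27']))

# Build the full daily range, then carry the last known value forward
full = pd.date_range(s.index.min(), s.index.max())
filled = s.reindex(full).ffill()
print(filled.tolist())  # [5.0, 5.0, 5.0, 9.0]
```

`reindex` inserts the missing dates as NaN rows in one step, which replaces the manual `pd.date_range` + `concat` scaffolding.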

Analysis: take a copy() so that a slip in data processing cannot destroy the original data

draft_i = imported.copy()

Analysis: initialise with one province's data to confirm the method works

real_s = imported.loc[imported['provinceName'] == province[0]].copy()   # .copy() avoids SettingWithCopyWarning
real_s.drop_duplicates(subset='updateTime', keep='first', inplace=True)
draft_i = real_s
for p in province:
    real_data = imported.loc[imported['provinceName'] == p].copy()
    real_data.drop_duplicates(subset='updateTime', keep='first', inplace=True)   # one record per province per day
    draft_i = pd.concat([real_data, draft_i], sort=False)

Analysis: the method checks out, so apply the same processing to the remaining provinces

imported = draft_i
# round-trip through the index so provinceName becomes the first column
imported = imported.set_index('provinceName')
imported = imported.reset_index()

Analysis: merge the per-province data into one national series.

list_p = []
list_d = []
list_e = []
for p in range(0, 32):
    try:
        con_0 = imported.loc[imported['updateTime'] == date[2]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
        list_p.append(con_0['city_confirmedCount'])   # that day's cumulative imported count for the province
    except Exception:
        continue
list_d.append(sum(list_p))
list_e.append(str(date[0]))
date_new_foreign_confirmed = pd.DataFrame(list_d, index=list_e)
date_new_foreign_confirmed.index.name = "date"
date_new_foreign_confirmed.columns = ["imported_confirmedCount"]
date_new_foreign_confirmed
            imported_confirmedCount
date
2020-01-24                        0
l = 0
for i in date[3:]:
    list_p = []
    list_d = []
    list_e = []
    l += 1
    for p in range(0, 32):
        try:
            con_0 = imported.loc[imported['updateTime'] == date[l]].loc[imported['provinceName'] == province[p]].iloc[[0]].iloc[0]
            list_p.append(con_0['city_confirmedCount'])   # that day's cumulative imported count for the province
        except Exception:
            continue
    list_d.append(sum(list_p))
    list_e.append(str(date[l]))
    confirmed = pd.DataFrame(list_d, index=list_e)
    confirmed.index.name = "date"
    confirmed.columns = ["imported_confirmedCount"]
    date_new_foreign_confirmed = pd.concat([date_new_foreign_confirmed, confirmed], sort=False)
date_new_foreign_confirmed
            imported_confirmedCount
date
2020-01-24                      0.0
2020-01-25                      0.0
2020-01-26                      0.0
2020-01-27                      0.0
2020-01-28                      0.0
...                             ...
2020-06-17                    848.0
2020-06-18                    800.0
2020-06-19                    800.0
2020-06-20                    802.0
2020-06-21                    775.0

150 rows × 1 columns
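The nested per-date/per-province loops above amount to "keep one record per province per day, then sum by day", which `drop_duplicates` plus `groupby` expresses directly. A sketch on hypothetical rows using the same column names:

```python
import pandas as pd

rows = pd.DataFrame({
    'provinceName': ['陕西省', '陕西省', '江苏省'],
    'updateTime':   ['2020-06-22', '2020-06-23', '2020-06-23'],
    'city_confirmedCount': [70.0, 72.0, 23.0],
})

# One record per province per day (first seen), then the national total per day
daily_total = (rows.drop_duplicates(subset=['provinceName', 'updateTime'])
                   .groupby('updateTime')['city_confirmedCount'].sum())
print(daily_total.loc['2020-06-23'])  # 95.0
```

This avoids the `try/except` indexing and the hand-maintained `list_p`/`list_d`/`list_e` accumulators.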

#新增境外输入
fig = plt.figure( figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = date_new_foreign_confirmed.index
y = date_new_foreign_confirmed.values
plot = ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-',label='date_new_foreign_confirmed' )
ax.set_xticks( range(0,len(x),10))
plt.xlabel('日期',fontsize=20)
plt.ylabel('人数',fontsize=20)
plt.title('COVID-19——新增境外输入',fontsize=30)
ax.legend( loc=0, frameon=True )

[figure: COVID-19——新增境外输入 line chart]

Summary: imported cases surged from late March, the growth slowed by early May, and slowed further from early June.

(四) What is the situation in your own province and city?

Analysis: first collect all of Guangdong's update dates, convert them to strings, and sort them

m_dates = sorted(str(d) for d in set(myhome['updateTime']))   # distinct dates as sorted strings
myhome = myhome.set_index('provinceName')
myhome = myhome.reset_index()

Analysis: iterate over the dates for my city's province to build the corresponding DataFrames

# Guangdong: cumulative confirmed
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_confirmedCount'])
    except Exception:
        list_g.append(0)   # placeholder; rows with 0 are filtered out below
        continue
g_date_confirmed = pd.DataFrame(list_g, index=m_dates)
g_date_confirmed.index.name = "date"
g_date_confirmed.columns = ["g_confirmed"]
g_date_confirmed = g_date_confirmed[~g_date_confirmed['g_confirmed'].isin([0])]


# Guangdong: cumulative cured
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_curedCount'])
    except Exception:
        list_g.append(0)
        continue
g_date_cured = pd.DataFrame(list_g, index=m_dates)
g_date_cured.index.name = "date"
g_date_cured.columns = ["g_cured"]
g_date_cured = g_date_cured[~g_date_cured['g_cured'].isin([0])]


# Guangdong: cumulative deaths
list_g = []
for i in range(0, len(m_dates)):
    try:
        con_m = myhome.loc[myhome['updateTime'] == date[i]].loc[myhome['cityName'] == '茂名'].iloc[[0]].iloc[0]
        list_g.append(con_m['province_deadCount'])
    except Exception:
        list_g.append(0)
        continue
g_date_dead = pd.DataFrame(list_g, index=m_dates)
g_date_dead.index.name = "date"
g_date_dead.columns = ["g_dead"]
g_date_dead = g_date_dead[~g_date_dead['g_dead'].isin([0])]
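The three near-identical loops above can be collapsed into a single `pivot_table` call that yields all three cumulative series at once. A sketch on hypothetical Guangdong rows:

```python
import pandas as pd

gd = pd.DataFrame({
    'updateTime': ['2020-02-01', '2020-02-02', '2020-02-02'],
    'province_confirmedCount': [100, 120, 120],
    'province_curedCount': [5, 9, 9],
    'province_deadCount': [0, 1, 1],
})

# One table with all three series, one row per date (duplicate reports averaged)
g = gd.pivot_table(index='updateTime',
                   values=['province_confirmedCount',
                           'province_curedCount',
                           'province_deadCount'],
                   aggfunc='mean')
print(g.loc['2020-02-02', 'province_curedCount'])  # 9.0
```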

Analysis: draw line charts to show how the epidemic evolved over time

## Guangdong: cumulative confirmed vs cumulative cured
plt.rcParams['font.sans-serif'] = ['SimHei']   # render Chinese labels
x = g_date_confirmed.index
y1 = g_date_confirmed.values
y2 = g_date_cured.values
plt.figure(figsize=(20,10),dpi = 80)
plt.plot(x,y1,color = r_hex,label = 'confirmed')
plt.plot(x,y2,color = g_hex,label = 'cured')
x_major_locator=MultipleLocator(12)
ax=plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.title('COVID-19 —— 广东省',fontsize=30)
plt.xlabel('日期',fontsize=20)
plt.ylabel('人数',fontsize=20)
plt.legend(loc=1, bbox_to_anchor=(1.00,0.90), bbox_transform=ax.transAxes)

[figure: COVID-19 —— 广东省, cumulative confirmed and cured]

# Guangdong: cumulative deaths
plt.rcParams['font.sans-serif'] = ['SimHei'] 
fig = plt.figure( figsize=(16,4), dpi=100)
ax = fig.add_subplot(1,1,1)
x = g_date_dead.index
y = g_date_dead.values
plot = ax.plot( x, y, color=dt_hex, linewidth=2, linestyle='-',label='dead' )
ax.set_xticks( range(0,len(x),10))
plt.xlabel('日期',fontsize=20)
plt.ylabel('人数',fontsize=20)
plt.title('COVID-19——广东省',fontsize=30)
ax.legend( loc=0, frameon=True )

[figure: COVID-19——广东省, cumulative deaths]

Analysis: the gap-filling for Guangdong worked well, so the series is highly faithful.
Analysis: the charts show infections in Guangdong surging from late January and levelling off by mid-February; in early March confirmed counts rose slightly for a short period as testing spread and statistics were adjusted. Cured cases surged from early February and levelled off from early June, following the flattening of new infections. Guangdong has recorded no new deaths since early March.

(五) What is the state of the epidemic abroad?

Analysis: drop the empty columns from the data

world = world.copy()                            # work on a copy to avoid SettingWithCopyWarning
world.dropna(axis=1, how='any', inplace=True)

Analysis: build a list of countries (country) and a list of dates (date_y)

country = list(set(world['provinceName']))
date_y = []
for dt in world.loc[world['provinceName'] == country[0]]['updateTime']:
    date_y.append(str(dt))
date_y = list(set(date_y))   # deduplicate
date_y.sort()

Analysis: normalise updateTime in world to plain dates and drop rows without a country name.

world.dropna(subset=['provinceName'], inplace=True)
world.updateTime = pd.to_datetime(world.updateTime, format="%Y-%m-%d", errors='coerce').dt.date

Analysis: pivot province_confirmedCount for the first 15 countries into world_confirmed, then fill the gaps

world_confirmed = world.loc[world['provinceName'] == world.head(15)['provinceName'][0]].pivot_table(index='updateTime', columns='provinceName', values='province_confirmedCount',aggfunc=np.mean)
for i in world.head(15)['provinceName'][1:]:
    draft_c = world.loc[world['provinceName'] == i].pivot_table(index='updateTime', columns='provinceName', values='province_confirmedCount',aggfunc=np.mean)
    world_confirmed = pd.merge(world_confirmed,draft_c,on='updateTime', how='outer',sort=True)
world_confirmed.fillna(0,inplace=True,limit = 1)
world_confirmed.fillna(method="ffill",inplace=True)
world_confirmed
[output: 144 rows × 15 columns — one column per country (美国, 巴西, 英国, 俄罗斯, 智利, 印度,
巴基斯坦, 秘鲁, 西班牙, 孟加拉国, 法国, 沙特阿拉伯, 瑞典, 南非, 厄瓜多尔), indexed by updateTime
from 2020-01-27 to 2020-06-23; e.g. on 2020-06-23 美国 ≈ 2,299,650 and 巴西 ≈ 1,106,470]
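The `fillna(0, limit=1)` followed by a forward fill, as used to build `world_confirmed`, gives each country a single zero before its first report and carries later values forward across gaps. A minimal sketch on one hypothetical country column:

```python
import numpy as np
import pandas as pd

# NaN before the country's first report, and a gap after it
s = pd.Series([np.nan, np.nan, 3.0, np.nan, 5.0])

# fillna with a value and limit=1 fills only the first NaN along the axis;
# ffill then propagates known values forward through the remaining gaps
filled = s.fillna(0, limit=1).ffill()
print(filled.tolist())  # [0.0, 0.0, 3.0, 3.0, 5.0]
```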

Analysis: chart how the epidemic evolved over time in those 15 countries

plt.rcParams['font.sans-serif'] = ['SimHei']   # render Chinese labels
fig = plt.figure(figsize=(16, 10))
plt.plot(world_confirmed)
plt.legend(world_confirmed.columns)
plt.title('前15个国家累计确诊人数', fontsize=20)
plt.xlabel('日期', fontsize=20)
plt.ylabel('人数', fontsize=20);   # counts are raw persons, not millions

[figure: 前15个国家累计确诊人数 line chart]

Analysis: the gap-filling for the foreign data worked reasonably well, so the series is plausible.
Analysis: confirmed cases abroad have surged since late March; the four worst-hit countries show no sign of bringing the epidemic under control, so the trend abroad is continued rapid growth in confirmed cases.

(六) Given your analysis, what do you recommend to individuals and society for fighting the epidemic?

The domestic curves flatten from late April; abroad, by contrast, the outbreak took off in early April and shows no sign of flattening yet.
The imported-case series shows we must guard against imported infections to prevent renewed domestic transmission; vigilance cannot be relaxed.
Individuals should avoid crowded areas, wear masks when going out, and disinfect thoroughly after returning home.
Society should roll out virus testing and disinfection in transport hubs and crowded venues, cutting transmission routes and protecting the gains of China's epidemic control.

Extra analysis (optional)

Any libraries may be used for the extra analysis, e.g. seaborn or pyecharts.

Not attempted, owing to the limits of my ability.
