The data used in this project consists of the past 24 hours of readings from each air quality monitoring station, including SO2, NO2, PM2.5, PM10, temperature, air pressure, wind direction, wind speed, and so on. The source is the Seniverse (心知天气) weather service. Before accessing it you must first register for an API key; with the key you can query the service through RESTful URLs. Seniverse licenses accurate data sources for its forecast information, and offers nationwide real-time weather, 24-hour forecasts, 15-day forecasts, air quality indices, and other forecast data. In the city air quality endpoint, the `city` key holds city-wide conditions (temperature, wind direction, etc.), while `stations` holds the readings of each individual monitoring station, so pay attention to which keys you select when scraping. This step generates the training dataset for the PM2.5 prediction task. Because the data on the Seniverse site is updated in real time, a standalone script fetches the data once and stores it on disk, so the dataset no longer changes and can serve as the data source for the prediction code. Both meteorological and pollutant data come from Seniverse via RESTful URLs: the meteorological data are Jinan's historical 24-hour hourly averages, and the pollutant data are the historical 24-hour per-station observations for Beijing.
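As a rough sketch of the key selection described above, the snippet below walks a mock response shaped like the payload this notebook parses (`results[0]['hourly_history'][i]['stations']`); the layout and field names are inferred from the parsing code later in this notebook, not taken from Seniverse documentation, and the station names and values are made up.

```python
# Mock response shaped like the air quality payload parsed below;
# the real service is queried over HTTPS with a registered key.
mock_response = {
    "results": [{
        "hourly_history": [              # one entry per past hour
            {"stations": [               # one dict per monitoring station
                {"station": "Dongcheng", "pm25": "35", "so2": "4",
                 "last_update": "2020-05-01T13:00:00+08:00"},
                {"station": "Haidian", "pm25": "41", "so2": "6",
                 "last_update": "2020-05-01T13:00:00+08:00"},
            ]},
        ],
    }],
}

# Select the per-station records ('stations'), not the city-wide summary
hourly = mock_response["results"][0]["hourly_history"]
by_station = {s["station"]: s for s in hourly[0]["stations"]}
print(sorted(by_station))            # station names become the dict keys
print(by_station["Haidian"]["pm25"])
```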
# In[91]:

import json
from urllib import request

import pandas as pd
import requests

# NB: the parsing code below expects the response to contain
# results[0]['hourly_history'] (the historical 24-hour records).
url_beijing_all = 'https://api.seniverse.com/v3/air/now.json?key=SnfGPnPVRV6VU3w9G&location=beijing&language=zh-Hans&scope=all'
url_beijing_weather = 'https://api.seniverse.com/v3/weather/now.json?key=SnfGPnPVRV6VU3w9G&location=beijing&language=zh-Hans&unit=c'

# In[92]:

s_p = requests.get(url_beijing_all).text  # fetch the pollutant data as a JSON string
s_w = request.urlopen(url_beijing_weather).read().decode('utf8')  # fetch the weather data
data_dict_p = json.loads(s_p)  # parse the JSON string into a dict
data_dict_w = json.loads(s_w)

# In[93]:

# Convert the list of hourly records into a pandas DataFrame
def gen_dataframe(data_list):
    hour_list = []
    for dict_1 in data_list:
        dict_station = {}
        for station in dict_1['stations']:  # list of per-station dicts
            dict_station[station['station']] = station  # {station name: readings}
        dict_hour = pd.DataFrame(dict_station)
        hour_list.append(dict_hour.T)
    data = pd.concat(hour_list)
    return data

# **Tidy the air pollutant data first**

# In[94]:

data_list_p = data_dict_p['results'][0]['hourly_history']
data_p = gen_dataframe(data_list_p)

# In[95]:

data_p.shape

# **Normalise the timestamp format**

# In[96]:

def adjust_time(data):
    time = data['last_update'].astype(str)
    time = time.str[:19]              # drop the timezone suffix
    time = time.str.replace('T', ' ')
    time = pd.to_datetime(time)       # object dtype -> datetime64, so .dt works
    time = time.dt.strftime('%H-%m/%d')
    data['last_update'] = time
    return data

# **Generate the pollutant DataFrame**

# In[97]:

data_p = adjust_time(data_p)

# In[98]:

data_p

# In[99]:

def gen_table_w(list1):
    # {row index: hourly record dict}
    return {i: value for i, value in enumerate(list1)}

# **Process the weather data into a DataFrame**

# In[100]:

data_list_w = data_dict_w['results'][0]['hourly_history']
table_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(table_w).T
data_w = adjust_time(data_w)

# In[101]:

data_w

# In[102]:

data_all = pd.merge(data_p, data_w, on='last_update')

# In[103]:

data_all.shape

# In[104]:

pd.set_option('display.max_columns', 27)
data_all = data_all.drop(['dew_point', 'wind_direction', 'wind_direction_degree',
                          'text', 'code', 'wind_scale'], axis=1)

# In[126]:

# Cast the string columns to numeric types. 'co' holds fractional values
# such as '0.3', so it must be float, not int.
float_cols = ['wind_speed', 'visibility', 'co']
int_cols = ['pm25', 'no2', 'o3', 'pm10', 'so2', 'clouds',
            'feels_like', 'humidity', 'pressure', 'temperature']
data_all[float_cols] = data_all[float_cols].astype(float)
data_all[int_cols] = data_all[int_cols].astype(int)

# **Because the site's data is updated in real time, re-running this script would change the training set, so we first store the data we already have as a fixed historical snapshot**

# In[159]:

data_all.to_excel('D:/python/practise/sample/weather/data_all.xlsx')