1. Grouped Data
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
1.1 Grouping once
- The y-axis shows the size of each group.
borough_group = df.groupby('Borough')
borough_group.size().plot(kind='bar')
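As a minimal sketch of what `groupby(...).size()` returns before it is plotted, assuming a toy DataFrame (the rows below are made up for illustration and are not from the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data; values are made up for illustration.
toy = pd.DataFrame({
    "Borough": ["BRONX", "QUEENS", "BRONX", "BROOKLYN", "QUEENS", "BRONX"],
})

# size() counts the rows in each group and returns a Series indexed by Borough.
group_sizes = toy.groupby("Borough").size()
print(group_sizes)
```

Calling `.plot(kind='bar')` on this Series draws one bar per index label, with the count as the bar height.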
1.2 Grouping twice
- Unstack to get, for each agency, the number of incidents in each borough.
agency_borough = df.groupby(['Agency','Borough'])
agency_borough.size().unstack().plot(kind='bar',title="Incidents in each Agency by Borough",figsize=(12,12))
# figsize and title set the chart size and title
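The reshaping step can be sketched on a toy DataFrame (made-up rows; `fill_value=0` is an optional extra that the original call does not use):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Agency":  ["NYPD", "NYPD", "DOT", "DOT", "NYPD"],
    "Borough": ["BRONX", "QUEENS", "BRONX", "BRONX", "BRONX"],
})

# size() yields a Series with a two-level (Agency, Borough) index.
counts = toy.groupby(["Agency", "Borough"]).size()

# unstack() moves the inner level (Borough) into columns; combinations that
# never occur become NaN unless fill_value is given.
wide = counts.unstack(fill_value=0)
print(wide)
```

Each row of `wide` is one agency and each column one borough, which is why the bar chart shows one cluster of bars per agency.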
- Time series plus grouping twice
For the strftime format codes, see: Python time strftime() method | 菜鸟教程 (Runoob)
# Handle the time series
import datetime
# Keep only the year and month
df['yyyymm'] = df['Created Date'].apply(lambda x:datetime.datetime.strftime(x,'%Y%m'))
# groupby + unstack
date_agency = df.groupby(['yyyymm','Agency'])
date_agency.size().unstack().plot(kind='bar',figsize=(12,12))
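- %y two-digit year (00-99)
- %Y four-digit year (0001-9999)
- %m month (01-12)
- %d day of the month (01-31)
- %H hour, 24-hour clock (00-23)
- %I hour, 12-hour clock (01-12)
- %M minute (00-59)
- %S second (00-59)
- %a abbreviated weekday name in the current locale
- %A full weekday name in the current locale
- %b abbreviated month name in the current locale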
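A short sketch of the numeric format codes listed above, using an arbitrary timestamp chosen for illustration:

```python
from datetime import datetime

# An arbitrary timestamp chosen for illustration.
ts = datetime(2016, 3, 7, 14, 5, 9)

year_month = ts.strftime("%Y%m")      # four-digit year + zero-padded month
short_date = ts.strftime("%y-%m-%d")  # two-digit year, month, day
clock_24h = ts.strftime("%H:%M:%S")   # 24-hour clock
hour_12h = ts.strftime("%I")          # 12-hour clock, zero-padded

print(year_month, short_date, clock_24h, hour_12h)
```

If the column is already a datetime64 column, `df['Created Date'].dt.strftime('%Y%m')` should give the same strings without `apply`.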
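1.3 Plotting the top 5
When grouping twice, decide which variable is shown in full (one subplot each) and which one is ranked. The variable shown in full goes second in the groupby, so that unstack turns it into columns; the variable from which the top 5 are picked goes first.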
agency_borough = df.groupby(['Agency', 'Borough']).size().unstack()
The resulting agency_borough looks like this; the keys (column names) are BRONX...
Agency | BRONX | BROOKLYN | MANHATTAN | QUEENS | STATEN ISLAND |
---|---|---|---|---|---|
3-1-1 | 17.0 | 28.0 | 23.0 | 28.0 | 6.0 |
DCA | 958.0 | 1532.0 | 1529.0 | 1547.0 | 194. |
%matplotlib inline
import matplotlib.pyplot as plt
# Grid layout: number of rows and columns
COL_NUM = 2
ROW_NUM = 3
fig, axes = plt.subplots(ROW_NUM, COL_NUM, figsize=(12,12))
colors = ['r','g','b','y','c']
# items() yields (column name, column) pairs, i.e. one pair per borough
for i, (borough, agency_count) in enumerate(agency_borough.items()):
    # Subplot position in the 3x2 grid: i=0 -> (0,0), i=1 -> (0,1), ...
    ax = axes[i // COL_NUM, i % COL_NUM]
    # Keep the five agencies with the most incidents
    agency_count = agency_count.sort_values(ascending=False)[:5]
    # Horizontal bar chart drawn on this subplot
    agency_count.plot(kind='barh', ax=ax, color=colors)
    ax.set_title(borough)
plt.tight_layout()
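The two pieces of logic in the loop, the grid position and the top-5 selection, can be checked without plotting. This is a sketch with made-up counts; `nlargest` is an alternative to the sort-and-slice used in the loop, not what the original code calls:

```python
import pandas as pd

COL_NUM = 2

# Grid position for each subplot index: row = i // COL_NUM, column = i % COL_NUM.
positions = [divmod(i, COL_NUM) for i in range(5)]

# Made-up counts standing in for one borough column of agency_borough.
agency_count = pd.Series(
    {"NYPD": 120, "DOT": 80, "HPD": 200, "DSNY": 40, "DEP": 60, "DCA": 10, "3-1-1": 5}
)

# With no ties, nlargest(5) picks the same rows as sort_values(ascending=False)[:5].
top5 = agency_count.nlargest(5)
print(positions)
print(top5)
```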
1.4 Processing-time statistics
import numpy as np
# Convert the processing time to a number of days
df['float_time'] =df['processing_time'].apply(lambda x:x/np.timedelta64(1, 'D'))
grouped = df[['float_time','Agency']].groupby('Agency')
grouped.mean().sort_values('float_time',ascending=False)
df['float_time'].hist(bins=50)
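A small sketch of the timedelta-to-days conversion, assuming made-up durations. Dividing the whole column at once avoids the per-row `apply`:

```python
import numpy as np
import pandas as pd

# Made-up processing times for illustration.
processing_time = pd.Series(pd.to_timedelta(["1 days", "12 hours", "3 days 6 hours"]))

# Dividing by a one-day timedelta converts each duration to a float number of days.
days_by_division = processing_time / np.timedelta64(1, "D")

# The same result via total seconds (86400 seconds per day).
days_by_seconds = processing_time.dt.total_seconds() / 86400

print(days_by_division.tolist())
```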
2. Distribution
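2.1 Plotting a statistical function
# Extract the hour of day
df['hour of day'] = df['Created Date'].apply(lambda x:x.hour)
# Plot the raw distribution
import seaborn as sns
# distplot is deprecated since seaborn 0.11; histplot and displot are its replacements
sns.distplot(df['hour of day'])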
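2.2 Plotting a fitted distribution
from scipy import stats
# Shift hours 0-3 to 24-27 so the late-night hours follow 23 on the axis
# fit takes a scipy.stats distribution: stats.gamma here, or stats.norm
sns.distplot(df['hour of day'].apply(lambda x: x if x>3 else x+24),kde=True,fit=stats.gamma)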
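The effect of the lambda in the call above can be seen on a few sample hours. This sketch only illustrates the shift; the reason for it (presumably keeping the after-midnight hours next to the late-evening ones before fitting) is an assumption:

```python
# Hours of day as extracted from the timestamps (0-23); sample values only.
hours = [0, 1, 3, 4, 12, 23]

# Hours 0-3 are pushed past midnight to 24-27; all other hours are unchanged.
wrapped = [h if h > 3 else h + 24 for h in hours]
print(wrapped)
```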
3. Mapping
3.1 Building the GeoJSON content
- Build the text content
map_dict = dict()
map_dict["type"] ="FeatureCollection"
features = list()
lats = df['Latitude']
longs = df['Longitude']
agencies = df['Agency']
for index in range(100):
    lat, lon, agency = lats.iloc[index], longs.iloc[index], agencies.iloc[index]
    data_point = {"type": "Feature",
                  "geometry": {"type": "Point", "coordinates": [lon, lat]},
                  "properties": {"Agency": agency}
                  }
    features.append(data_point)
map_dict['features'] = features
The result should have the following shape:
example = { "type" : "FeatureCollection",
"features": [
{"type": "Feature",
"geometry": {"type":"Point", "coordinates": [-73.9626, 40.8075]},
"properties": {"name":"Columbia University"}
}]}
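The same structure can be produced by a small function. This is a sketch: `make_feature_collection` is a hypothetical helper name, and the sample points are made up:

```python
import json

def make_feature_collection(points):
    """Build a GeoJSON FeatureCollection from (lat, lon, agency) tuples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"Agency": agency},
        }
        for lat, lon, agency in points
    ]
    return {"type": "FeatureCollection", "features": features}

# Made-up sample points for illustration.
sample = [(40.8075, -73.9626, "NYPD"), (40.7589, -73.9851, "DOT")]
collection = make_feature_collection(sample)
geojson_text = json.dumps(collection)
print(geojson_text)
```

Note the coordinate order: longitude comes first, which is easy to get wrong when the source columns are read as latitude, longitude.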
- Draw the map
import json
import geojsonio
geojsonio.display(json.dumps(map_dict))
3.2 Folium heat map
- Prepare the input data, sizes, which looks like the table below.
*** The zip code must be in str format!
import pandas as pd
zip_groups = df.groupby("Incident Zip")
sizes = pd.DataFrame(zip_groups.size())
sizes.rename(columns={0:"size"},inplace=True)
sizes.reset_index(level=0, inplace=True)
# Zip is the column joined to the GeoJSON postal code, so it must hold str values
sizes['Zip'] = sizes['Incident Zip']
 | Incident Zip | size | Zip |
---|---|---|---|
0 | 10000 | 38 | 10000 |
1 | 10001 | 5223 | 10001 |
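If the zip column was loaded as floats, one way to get the str values required above is sketched here. The float input and the five-character padding are assumptions about the data, not something the original states:

```python
import pandas as pd

# Made-up zip codes as they often load from CSV: floats with a missing value.
raw_zip = pd.Series([10001.0, 10002.0, None, 7030.0])

# Drop missing values, remove the decimal part, then pad to five characters.
zip_str = raw_zip.dropna().astype(int).astype(str).str.zfill(5)
print(zip_str.tolist())
```

Going through `int` first avoids strings such as '10001.0', which would not match the postal codes in the GeoJSON file.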
- Draw the map
import folium
#Center the map at Times Square
m = folium.Map(location = [40.7589,-73.9851],zoom_start=12)
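# key_on is the path inside geo_data to the key that is joined with data
# Map.choropleth was deprecated and later removed from folium; folium.Choropleth takes the same arguments
folium.Choropleth(geo_data='zipcode.geojson', data=sizes,
    columns=['Zip','size'],
    key_on='feature.properties.postalCode',
    fill_color='RdYlGn', fill_opacity=0.7, line_opacity=0.8,
    legend_name='Distribution of Incidents').add_to(m)
folium.LayerControl().add_to(m)
m  # display the map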
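To make the role of key_on concrete, the sketch below follows the dotted path by hand on one made-up feature. `resolve_key_on` is a hypothetical helper written for illustration; it is not part of folium and does not claim to mirror its internals:

```python
def resolve_key_on(feature, key_on):
    """Follow a dotted key_on path such as 'feature.properties.postalCode'."""
    parts = key_on.split(".")
    # The leading 'feature' names the feature object itself, so skip it.
    value = feature
    for part in parts[1:]:
        value = value[part]
    return value

# A made-up feature shaped like one entry of a zip-code GeoJSON file.
feature = {
    "type": "Feature",
    "properties": {"postalCode": "10001"},
    "geometry": {"type": "Point", "coordinates": [-73.99, 40.75]},
}

# Lookup table playing the role of the Zip and size columns.
size_by_zip = {"10001": 5223}

zip_code = resolve_key_on(feature, "feature.properties.postalCode")
matched_size = size_by_zip[zip_code]
print(zip_code, matched_size)
```

The lookup succeeds only because both sides are the string "10001"; an int key on the data side would not match, which is the reason for the str requirement on Zip.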
4. Miscellaneous
- Scatter plot
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.scatter(df['duration'],df['trip_distance'])