Data analysis: a population analysis example
import numpy as np
import pandas as pd
Read the files:
abb=pd.read_csv('./data/state-abbrevs.csv')
area=pd.read_csv('./data/state-areas.csv')
population=pd.read_csv('./data/state-population.csv')
Merge abb and population:
abb_pop=pd.merge(abb,population,left_on='abbreviation',right_on='state/region',how='outer')
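To see which keys fail to match in an outer merge like this one, `indicator=True` adds a `_merge` column that records where each row came from. The following is a minimal sketch on toy data; the frames `abb_demo` and `pop_demo` and their values are made up for illustration and are not the contents of the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (values are invented for illustration).
abb_demo = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "abbreviation": ["AL", "AK"],
})
pop_demo = pd.DataFrame({
    "state/region": ["AL", "AK", "PR"],
    "ages": ["total", "total", "total"],
    "year": [2010, 2010, 2010],
    "population": [100.0, 50.0, 30.0],
})

# Outer merge on differently named key columns; indicator=True adds "_merge".
merged_demo = pd.merge(
    abb_demo, pop_demo,
    left_on="abbreviation", right_on="state/region",
    how="outer", indicator=True,
)

# Rows that exist only in the population table have no matching abbreviation.
unmatched = merged_demo.loc[
    merged_demo["_merge"] == "right_only", "state/region"
].tolist()
print(unmatched)
```

Rows tagged `right_only` are the ones whose `state` value comes out as NaN after the merge.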
Drop the redundant abbreviation column (it duplicates state/region):
abb_pop.drop(labels='abbreviation',axis=1,inplace=True)
Check which columns contain NaN:
abb_pop.isnull().any(axis=0)
Values of state/region in the rows where the state column is null:
abb_pop.loc[abb_pop['state'].isnull(),'state/region'].unique()
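The two inspection steps (which columns contain NaN, and which keys lack a state name) can be tried on a small frame. This is only a sketch; `demo` and its values are hypothetical, and `isnull().sum()` is included as an optional extra that counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of abb_pop after the merge.
demo = pd.DataFrame({
    "state": ["Alabama", np.nan, np.nan, np.nan],
    "state/region": ["AL", "PR", "PR", "USA"],
    "population": [100.0, np.nan, 30.0, 1000.0],
})

# True for every column that holds at least one NaN.
cols_with_nan = demo.isnull().any(axis=0)

# Number of NaN values in each column.
null_counts = demo.isnull().sum()

# Distinct keys among the rows whose state name is missing.
missing_regions = demo.loc[demo["state"].isnull(), "state/region"].unique()
print(missing_regions)
```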
Fill in the missing state names:
indexs=abb_pop.loc[abb_pop['state/region']=='USA'].index
abb_pop.loc[indexs,'state']='United States'
indexs=abb_pop.loc[abb_pop['state/region']=='PR'].index
abb_pop.loc[indexs,'state']='Puerto Rico'
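When several keys need a name, one possible alternative to repeating the index lookup is to map the key column through a dictionary and pass the result to `fillna`. The sketch below assumes the full names "Puerto Rico" and "United States" are the desired labels; `demo` and `names` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of abb_pop with missing state names.
demo = pd.DataFrame({
    "state": ["Alabama", np.nan, np.nan],
    "state/region": ["AL", "PR", "USA"],
})

# Assumed labels for the keys that have no entry in the abbreviation table.
names = {"PR": "Puerto Rico", "USA": "United States"}

# map() yields NaN for keys not in the dict; fillna() only touches missing cells,
# so names that are already present are left unchanged.
demo["state"] = demo["state"].fillna(demo["state/region"].map(names))
print(demo)
```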
Merge in the area data:
abb_pop_area=pd.merge(abb_pop,area,how='outer')
Drop the rows where area (sq. mi) is NaN:
indexs=abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)
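The same row removal can also be written with `dropna` and its `subset` parameter, which avoids collecting the index first. A minimal sketch on a hypothetical two-row frame (the area value is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of abb_pop_area; the area value is illustrative.
demo = pd.DataFrame({
    "state": ["Alabama", "United States"],
    "area (sq. mi)": [52423.0, np.nan],
})

# Keep only the rows that have a value in the area column.
cleaned = demo.dropna(subset=["area (sq. mi)"])
print(cleaned)
```

Unlike `drop(..., inplace=True)`, this returns a new frame and leaves `demo` untouched.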
Find the rows where ages is total and year is 2010 (year is numeric, so it is compared with a number, not a string):
abb_pop_area.query('ages=="total" and year==2010')
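A filter on ages and year can be expressed either with `query` or with a boolean mask. The sketch below assumes the year column is numeric; `demo` and `year_wanted` are hypothetical names, and `@year_wanted` shows how `query` can refer to a local Python variable.

```python
import pandas as pd

# Hypothetical miniature of abb_pop_area.
demo = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama"],
    "ages": ["total", "under18", "total"],
    "year": [2010, 2010, 2012],
    "population": [100.0, 25.0, 110.0],
})

year_wanted = 2010

# query(): string values are quoted, numeric values are not.
by_query = demo.query('ages == "total" and year == @year_wanted')

# Equivalent boolean mask; each condition needs its own parentheses.
by_mask = demo[(demo["ages"] == "total") & (demo["year"] == year_wanted)]
print(by_query)
```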
Compute the population density:
abb_pop_area['midu']=abb_pop_area['population']/abb_pop_area['area (sq. mi)']
State with the highest population density:
abb_pop_area.sort_values(by='midu',axis=0,ascending=False).iloc[0]['state']
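Sorting the whole table ranks every combination of year and age group together. If the ranking is meant for a single snapshot, one option is to filter first and then take `idxmax`. The sketch below assumes the 2010 total population is the snapshot of interest; `demo` and its values are made up.

```python
import pandas as pd

# Hypothetical miniature of abb_pop_area (all values invented).
demo = pd.DataFrame({
    "state": ["A", "B", "C", "C"],
    "ages": ["total", "total", "total", "under18"],
    "year": [2010, 2010, 2010, 2010],
    "population": [100.0, 300.0, 50.0, 10.0],
    "area (sq. mi)": [10.0, 100.0, 1.0, 1.0],
})

# Restrict to one year and the total population before ranking.
snapshot = demo.query('ages == "total" and year == 2010').copy()
snapshot["midu"] = snapshot["population"] / snapshot["area (sq. mi)"]

# idxmax() returns the index label of the largest density.
densest = snapshot.loc[snapshot["midu"].idxmax(), "state"]
print(densest)
```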