Python for Data Analysis, Chapter 2, Case Study 3: US Baby Names Analysis

  
In [1]:
import pandas as pd
data_year = {}
path = 'C:\\Users\\yi&lei\\Documents\\电子书\\pydata-book-1st-edition\\pydata-book-1st-edition\\ch02\\names'
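# Read one file per year (1880-2010) into a dict keyed by year
for i in range(1880,2011):
    file_path = path + '\\yob%d.txt' %i  # renamed from dir to avoid shadowing the built-in dir()
    data_year[i] = pd.read_csv(file_path,engine='python',header=None,names=['name','gender','birth'])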
In [2]:
data_year[1880].head() ## birth is the number of births
Out[2]:
  name gender birth
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746
In [3]:
data = data_year[1880]
for i in range(1881,2011):
    data = pd.concat([data,data_year[i]],ignore_index=True)
data.head()  
Out[3]:
  name gender birth
0 Mary F 7065
1 Anna F 2604
2 Emma F 2003
3 Elizabeth F 1939
4 Minnie F 1746
In [4]:
data.shape
Out[4]:
(1690784, 3)
*A more efficient way to join multiple DataFrames with concat: collect them in a list and concatenate once
In [5]:
data_list = []
for i in range(1880,2011):
    file_path = path + '\\yob%d.txt' %i
    df = pd.read_csv(file_path,engine='python',header=None,names=['name','gender','birth'])
    df['year'] = i  # record which year each row came from
    data_list.append(df)
# Concatenate once at the end instead of once per iteration
data = pd.concat(data_list,ignore_index=True)
In [6]:
data.shape
Out[6]:
(1690784, 4)
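
To illustrate why the list-based version is preferable, here is a minimal sketch using three tiny made-up frames in place of the real yob files (the toy values are assumptions, not data from the dataset). Both approaches should produce the same result; the difference is that concat inside the loop copies all accumulated rows on every pass, while a single concat copies each frame once.

```python
import pandas as pd

# Hypothetical stand-ins for the per-year files: three tiny frames.
frames = []
for year in (1880, 1881, 1882):
    df = pd.DataFrame({'name': ['Mary', 'John'],
                       'gender': ['F', 'M'],
                       'birth': [10, 20]})
    df['year'] = year
    frames.append(df)

# Repeated concat: every pass copies all rows accumulated so far.
looped = frames[0]
for df in frames[1:]:
    looped = pd.concat([looped, df], ignore_index=True)

# Single concat: each frame is copied once.
combined = pd.concat(frames, ignore_index=True)

# Both routes should yield identical frames.
same = looped.equals(combined)
```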

Total births per year, broken down by gender

In [7]:
data_gp = data.groupby(['year','gender'])[['birth']].sum()  # sum only the numeric birth column, not name
data_gp.unstack().tail()
Out[7]:
  birth
gender F M
year    
2006 1896468 2050234
2007 1916888 2069242
2008 1883645 2032310
2009 1827643 1973359
2010 1759010 1898382
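
The same year-by-gender table can also be built with pivot_table. The sketch below uses a small made-up frame with the same column names as data (the values are illustrative assumptions) and compares the result against the groupby/unstack route.

```python
import pandas as pd

# Toy data, made up for illustration; same columns as `data` in the post.
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'birth':  [7, 3, 9, 6, 8],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# Rows are years, columns are genders, cells are summed births.
total_births = toy.pivot_table(values='birth', index='year',
                               columns='gender', aggfunc='sum')

# The groupby route, restricted to the numeric column, then unstacked.
via_groupby = toy.groupby(['year', 'gender'])['birth'].sum().unstack()
```

pivot_table yields single-level F/M columns, which can be handier for plotting than the two-level columns that come from unstacking a one-column DataFrame.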

*Add a prop column holding each name's share of total births within its year/gender group

In [10]:
data_g = data.groupby(['year','gender'])
data_g[['birth']].sum().head()  # sum only the numeric birth column
Out[10]:
    birth
year gender  
1880 F 90993
M 110493
1881 F 91955
M 100748
1882 F 107851
In [11]:
def add_prop(group):
    births = group.birth.astype(float)
    group['prop'] = births/births.sum()
    return group
name = data.groupby(['year','gender'],group_keys=False).apply(add_prop)  # group_keys=False keeps the flat row index shown below
In [12]:
name.head()
Out[12]:
  name gender birth year prop
0 Mary F 7065 1880 0.077643
1 Anna F 2604 1880 0.028618
2 Emma F 2003 1880 0.022013
3 Elizabeth F 1939 1880 0.021309
4 Minnie F 1746 1880 0.019188
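
As an alternative to apply, the proportion can be computed with transform, which broadcasts each group's total back onto its rows. This is only a sketch on made-up toy data (the names and counts are assumptions), not the method used in the post.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'William'],
    'gender': ['F', 'F', 'M', 'M'],
    'birth':  [30, 10, 15, 5],
    'year':   [1880, 1880, 1880, 1880],
})

# transform('sum') returns a Series aligned with toy's rows,
# holding the total births of each row's (year, gender) group.
group_totals = toy.groupby(['year', 'gender'])['birth'].transform('sum')

# Dividing element-wise gives each name's share within its group.
toy['prop'] = toy['birth'] / group_totals
```

Because transform returns a result aligned with the original rows, the frame keeps its index and columns without passing through a user-defined function.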

Check that the prop values within each group sum to 1

In [13]:
import numpy as np
np.allclose(name.groupby(['year','gender']).prop.sum(),1)
Out[13]:
True

*Take the top 1000 names for each year/gender pair; note how apply is used

In [14]:
name.groupby(['year','gender']).sort_index(by=['prop'])[:1000]
## A groupby object cannot be sorted directly; the error below says to use apply
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-a61dc3c107ca> in <module>()
----> 1 name.groupby(['year','gender']).sort_index(by=['prop'])[:1000]
      2 ## A groupby object cannot be sorted directly; the error below says to use apply

C:\PorgramFiles\Anaconda3\lib\site-packages\pandas\core\groupby.py in __getattr__(self, attr)
    546             return self[attr]
    547         if hasattr(self.obj, attr):
--> 548             return self._make_wrapper(attr)
    549 
    550         raise AttributeError("%r object has no attribute %r" %

C:\PorgramFiles\Anaconda3\lib\site-packages\pandas\core\groupby.py in _make_wrapper(self, name)
    560                    "using the 'apply' method".format(kind, name,
    561                                                      type(self).__name__))
--> 562             raise AttributeError(msg)
    563 
    564         # need to setup the selection

AttributeError: Cannot access callable attribute 'sort_index' of 'DataFrameGroupBy' objects, try using the 'apply' method
In [15]:
def top1000(group):
    return group.sort_values(by=['prop'],ascending=False)[:1000]
In [16]:
top_1000 = name.groupby(['year','gender']).apply(top1000)
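
Another way to take the top rows per group, shown here only as a sketch on made-up toy data, is to sort the whole frame once and then call head(n) on the groupby. The frame toy and the cutoff n = 2 are assumptions for illustration; the post uses 1000.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'Emma', 'John', 'William', 'James'],
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'year':   [1880] * 6,
    'prop':   [0.5, 0.2, 0.3, 0.6, 0.1, 0.3],
})

n = 2  # the post uses 1000

# Sort all rows by prop (descending), then keep the first n rows
# of every (year, gender) group in that sorted order.
top_n = (toy.sort_values('prop', ascending=False)
            .groupby(['year', 'gender'])
            .head(n))
```

With head(n) the result keeps the original row labels, so year and gender remain ordinary columns.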