Chapter 2: The US Baby Names Case Study with the numpy Library

豆乳_艾米
Article link: https://blog.csdn.net/yunini2/article/details/73601848
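#US baby names case study
    There is a lot you can do with this data set:
    1. Compute the yearly proportion of babies given a particular name (your own or someone else's)
    2. Compute the relative rank of a name
    3. Find the most popular names in each year, and the names that rose or fell the fastest
    4. Analyze naming trends: vowels, consonants, length, overall diversity, spelling changes, first and last letters, etc.
    5. Analyze external sources of trends: biblical names, celebrities, demographic changes, etc.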
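#Why does read_table add two extra columns when column names are passed?
#Most likely because read_table splits on tabs by default: each comma-separated line is read as a
#single field, so it fills the first column and the other two names come back as NaN.
#read_csv (or read_table with sep=',') splits on commas as intended.
import pandas as pd
names1880=pd.read_csv('E:/yob1880.txt',names=['name','sex','births'])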
names1880[:10]
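The read_table question above can be checked without the real files. The sketch below is a minimal illustration that reads a few illustrative rows from an in-memory string; the variable names `sample`, `as_table`, and `as_csv` are assumptions introduced here, not part of the original post.

```python
import io
import pandas as pd

# A few illustrative rows in the same comma-separated layout as the yob files
sample = "Mary,F,7065\nAnna,F,2604\nJohn,M,9655\n"
cols = ['name', 'sex', 'births']

# read_table splits on tabs by default, so each whole line becomes one field
as_table = pd.read_table(io.StringIO(sample), names=cols)

# read_csv splits on commas by default, so the three fields are separated
as_csv = pd.read_csv(io.StringIO(sample), names=cols)

print(as_table)  # 'name' holds the whole line; 'sex' and 'births' are NaN
print(as_csv)    # three properly separated columns
```

Passing `sep=','` to read_table should give the same result as read_csv here.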
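#Use the sum of births by sex as the total births for that year
names1880.groupby('sex').births.sum()
#Combine the data for all years into a single DataFrame with pandas.concat
#2010 is the last available year at the time of writing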
years=range(1880,2011)
pieces=[]
columns=['name','sex','births']
for year in years:
    path='E:/names/yob%d.txt' % year
    frame=pd.read_csv(path,names=columns)
    frame['year']=year
    pieces.append(frame)
#Concatenate everything into a single DataFrame
names_new=pd.concat(pieces,ignore_index=True)
#Aggregate at the year and sex level with groupby or pivot_table
total_births=names_new.pivot_table('births',index='year',columns='sex',aggfunc='sum')
total_births
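To make the groupby/pivot_table relationship concrete, here is a small sketch on made-up data. The frame `toy` and its numbers are invented for illustration; only the column layout follows the post.

```python
import pandas as pd

# Made-up rows with the same columns as names_new
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex':    ['F', 'F', 'M', 'F', 'M'],
    'births': [30, 10, 60, 25, 75],
    'year':   [1880, 1880, 1880, 1881, 1881],
})

# Pivot: one row per year, one column per sex, cells are summed births
by_pivot = toy.pivot_table('births', index='year', columns='sex', aggfunc='sum')

# The same aggregation spelled out with groupby, then reshaped with unstack
by_group = toy.groupby(['year', 'sex'])['births'].sum().unstack('sex')

print(by_pivot)
print((by_pivot.values == by_group.values).all())
```

Both routes produce a year-by-sex table; pivot_table is simply the shorter spelling when the result should be a grid.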
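#Add a prop column: each name's births as a fraction of all births for that year and sex
def add_prop(group):
    births=group.births.astype(float)
    group['prop']=births/births.sum()
    return group
#group_keys=False keeps the original row index instead of prepending year and sex to it
names_new=names_new.groupby(['year','sex'],group_keys=False).apply(add_prop)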
names_new[:10]
#Sanity check: verify that prop sums to 1 within every year/sex group
import numpy as np
np.allclose(names_new.groupby(['year','sex']).prop.sum(),1)
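The prop column can also be computed without a custom function. The following is an alternative sketch, not the method used in the post: it relies on `transform('sum')`, and the frame `toy` is made-up data.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as names_new
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'John', 'Will', 'Mary', 'John'],
    'sex':    ['F', 'F', 'M', 'M', 'F', 'M'],
    'births': [30, 10, 60, 40, 25, 75],
    'year':   [1880, 1880, 1880, 1880, 1881, 1881],
})

# transform('sum') returns each row's group total, aligned to the original index
group_total = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_total

print(toy)
# Same check as in the post: every (year, sex) group should sum to 1
print(np.allclose(toy.groupby(['year', 'sex'])['prop'].sum(), 1))
```

Because the result keeps the original index, no index cleanup is needed afterwards.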
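#Take a subset: the top 1000 names for each sex/year combination
def get_top1000(group):
    return group.sort_values(by='births',ascending=False)[:1000]
grouped=names_new.groupby(['year','sex'])
top1000=grouped.apply(get_top1000)
#Drop the group index; year and sex are already columns
top1000=top1000.reset_index(drop=True)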
#Split the top 1000 names into boys and girls
boys=top1000[top1000.sex=='M']
girls=top1000[top1000.sex=='F']
#Build a pivot table of total births by year and name
total_births=top1000.pivot_table('births',index='year',columns='name',aggfunc='sum')
subset=total_births[['John','Harry','Mary','Marilyn']]
%matplotlib inline
subset.plot(subplots=True,figsize=(12,10),grid=False,title="Number of births per year")
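#Measure the growth in naming diversity: compute the share of births covered by the 1000 most popular names, aggregated by year and sex, and plot it
table=top1000.pivot_table('prop',index='year',columns='sex',aggfunc='sum')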
table.plot(title="Sum of table1000.prop by year and sex",yticks=np.linspace(0,1.2,13),
           xticks=range(1880,2020,10))
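#Count how many distinct names make up the top 50% of births, looking only at boys in 2010
df=boys[boys.year==2010]
#Sort prop in descending order, take the cumulative sum (cumsum), then use searchsorted to find where 0.5 would be inserted
prop_cumsum=df.sort_values(by='prop',ascending=False).prop.cumsum()
prop_cumsum[:10]
prop_cumsum.searchsorted(0.5)
#The position is zero-based, so add 1: 117 names in total. For comparison, 1900 gives a much smaller number, only 25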
df=boys[boys.year==1900]
prop_cumsum=df.sort_values(by='prop',ascending=False).prop.cumsum()
prop_cumsum.searchsorted(0.5)+1
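The cumsum-plus-searchsorted step is easier to follow on a handful of numbers. This is a minimal sketch with made-up shares; `prop`, `cumulative`, and `pos` are names chosen for the example.

```python
import pandas as pd

# Made-up shares for five names in one year/sex group (they sum to 1)
prop = pd.Series([0.10, 0.40, 0.05, 0.25, 0.20])

# Sort descending, then accumulate: roughly 0.40, 0.65, 0.85, 0.95, 1.00
cumulative = prop.sort_values(ascending=False).cumsum()

# Position at which 0.5 would be inserted to keep the values sorted
pos = cumulative.values.searchsorted(0.5)

# Positions are zero-based, so the number of names needed is pos + 1
print(int(pos) + 1)
```

Here the two most popular names (0.40 and 0.25) are enough to pass the 50% mark, so the count is 2.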
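def get_quantile_count(group,q=0.5):
    group=group.sort_values(by='prop',ascending=False)
    #searchsorted on the underlying NumPy array returns a plain integer for a scalar q
    return int(group.prop.cumsum().values.searchsorted(q))+1
#Note: the result must be a scalar. Older pandas returned a one-element array from
#Series.searchsorted(q), which needed [0]; recent versions return a scalar, where [0] raises an error.
#If each cell holds an array, plotting fails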
diversity=top1000.groupby(['year','sex']).apply(get_quantile_count)
diversity=diversity.unstack('sex')
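#Unstacking on sex turns the Series into a DataFrame
#diversity now holds two time series, one per sex, indexed by year
#The bracketed cells such as [38] in the output below appear to come from a run where each result was still a one-element array
diversity.head  #without parentheses this echoes the bound method, whose repr includes the whole frame; use diversity.head() for just the first rows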
Out[112]: 
<bound method DataFrame.head of sex       F      M
year              
1880   [38]   [14]
1881   [38]   [14]
1882   [38]   [15]
1883   [39]   [15]
1884   [39]   [16]
1885   [40]   [16]
1886   [41]   [16]
1887   [41]   [17]
1888   [42]   [17]
1889   [43]   [18]
1890   [44]   [19]
1891   [44]   [20]
1892   [44]   [20]
1893   [44]   [21]
1894   [45]   [22]
1895   [46]   [22]
1896   [46]   [23]
1897   [46]   [23]
1898   [47]   [24]
1899   [47]   [25]
1900   [49]   [25]
1901   [49]   [25]
1902   [49]   [26]
1903   [49]   [27]
1904   [50]   [28]
1905   [50]   [28]
1906   [49]   [28]
1907   [50]   [30]
1908   [49]   [30]
1909   [49]   [30]
    ...    ...
1981   [78]   [35]
1982   [75]   [35]
1983   [71]   [34]
1984   [71]   [35]
1985   [72]   [36]
1986   [74]   [37]
1987   [75]   [39]
1988   [78]   [40]
1989   [83]   [43]
1990   [90]   [45]
1991   [95]   [48]
1992  [102]   [51]
1993  [107]   [54]
1994  [111]   [57]
1995  [115]   [60]
1996  [122]   [64]
1997  [129]   [67]
1998  [138]   [70]
1999  [146]   [73]
2000  [155]   [77]
2001  [164]   [81]
2002  [170]   [83]
2003  [178]   [87]
2004  [191]   [92]
2005  [199]   [96]
2006  [209]   [99]
2007  [223]  [103]
2008  [234]  [109]
2009  [241]  [114]
2010  [246]  [117]

[131 rows x 2 columns]>


diversity.plot(title="Number of popular names in top50%")

#The last-letter revolution
#Extract the last letter from the name column
get_last_letter=lambda x:x[-1]
last_letters=names_new.name.map(get_last_letter)
last_letters.name='last_letter'
table=names_new.pivot_table('births',index=last_letters,columns=['sex','year'],
                            aggfunc='sum')
subtable=table.reindex(columns=[1900,1950,2000],level='year')
subtable.head()
Out[124]: 
sex                 F                            M                    
year             1900      1950      2000     1900      1950      2000
last_letter                                                           
a             89934.0  576481.0  675485.0    870.0    4037.0   40837.0
b                 NaN      17.0     372.0    372.0    1632.0   50892.0
c                 NaN      16.0     525.0    299.0    6500.0   26998.0
d              3670.0    4413.0    4380.0  15499.0  263643.0   64251.0
e            107080.0  376863.0  318199.0  22731.0  168659.0  148821.0
#Normalize the table by total births to get, for each sex, the share of births ending in each letter
subtable.sum()
Out[125]: 
sex  year
F    1900     299873.0
     1950    1713001.0
     2000    1813960.0
M    1900     150554.0
     1950    1789936.0
     2000    1961702.0
dtype: float64
letter_prop=subtable/subtable.sum().astype(float)
letter_prop[:10]
sex                 F                             M                    
year             1900      1950      2000      1900      1950      2000
last_letter                                                            
a            0.299907  0.336533  0.372381  0.005779  0.002255  0.020817
b                 NaN  0.000010  0.000205  0.002471  0.000912  0.025943
c                 NaN  0.000009  0.000289  0.001986  0.003631  0.013763
d            0.012239  0.002576  0.002415  0.102946  0.147292  0.032753
e            0.357084  0.220002  0.175417  0.150982  0.094226  0.075863
f                 NaN       NaN  0.000015  0.000770  0.000475  0.000874
g            0.000110  0.000064  0.000322  0.001680  0.004155  0.001223
h            0.051032  0.045475  0.064617  0.041407  0.037949  0.043346
i            0.001201  0.010573  0.023460  0.001030  0.000347  0.009339
j                 NaN       NaN  0.000054       NaN  0.000003  0.000468
import matplotlib.pyplot as plt
%matplotlib inline
#Draw both bar charts on a single figure (plt.subplots already creates it, so no separate plt.figure call is needed)
fig,axes=plt.subplots(2,1,figsize=(12,20))
letter_prop['M'].plot(kind='bar',rot=0,ax=axes[0],title='Male')
letter_prop['F'].plot(kind='bar',rot=0,ax=axes[1],title='Female',legend=False)
plt.show()

#Normalize the full table by year and sex, pick a few last letters from the boys' names, and transpose so that each column becomes a time series
letter_prop=table/table.sum().astype(float)
dny_ts=letter_prop.loc[['d','n','q','y'],'M'].T
dny_ts.head()
Out[158]: 
last_letter         d         n   q         y
year                                         
1880         0.083055  0.153213 NaN  0.075760
1881         0.083247  0.153214 NaN  0.077451
1882         0.085340  0.149560 NaN  0.077537
1883         0.084066  0.151646 NaN  0.079144
1884         0.086120  0.149915 NaN  0.080405
dny_ts.plot()

#Boy names that became girl names
all_names=top1000.name.unique()
mask=np.array(['lesl' in x.lower() for x in all_names])
lesley_like=all_names[mask]
lesley_like
Out[163]: array(['Leslie', 'Lesley', 'Leslee', 'Lesli', 'Lesly'], dtype=object)
#Use this result to filter down to just those names, then sum births by name to see the relative frequencies
filtered=top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64
#Aggregate by sex and year, then normalize within each year
table=filtered.pivot_table('births',index='year',columns='sex',aggfunc='sum')
table=table.div(table.sum(axis=1),axis=0)
table.tail()
Out[170]: 
sex     F   M
year         
2006  1.0 NaN
2007  1.0 NaN
2008  1.0 NaN
2009  1.0 NaN
2010  1.0 NaN
table.plot(style={'M':'k-','F':'k--'})
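The per-year normalization with `div` can be seen on a tiny table. This is a sketch with made-up counts; `counts` and `shares` are names introduced for the example.

```python
import pandas as pd

# Made-up births by year (rows) and sex (columns)
counts = pd.DataFrame({'F': [10, 60, 90], 'M': [90, 40, 10]},
                      index=pd.Index([1900, 1950, 2000], name='year'))

# counts.sum(axis=1) is the per-year total; axis=0 aligns it with the row index
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares)
```

Each row of `shares` now sums to 1, which is what lets the plot above show the male and female share of a name over time.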