Python for Data Analysis: Second Case Study (Excerpt)

Article link: https://blog.csdn.net/qq_44182694/article/details/115580246

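The data first:
Link: https://pan.baidu.com/s/1aPPKntZ0dDrLXdwpLKnNgw
Extraction code: momz
This dataset records the names given to babies born in the United States from 1880 to 2010.
(It contains the year, name, sex, and the number of babies born with that name.)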

As before, this was developed in Jupyter Notebook.

First, load the data for a single year:

import pandas as pd
names1880=pd.read_csv('E:\\计算机知识\\python\\Python数据分析案例\\Desktop\\pydata-notebook-master\\datasets\\babynames\\yob1880.txt',names=['name','sex','births'],engine='python')
names1880[:10]

For example, 7,065 girls born in 1880 were named Mary; the other rows read the same way.
This is only one year of data.

Once all the years are combined there are about 1.69 million rows in total, which is a sizable dataset.

# Group by sex, then sum the births column
data1=names1880.groupby(['sex']).births.sum()
# Visualize the totals for girls and boys as a bar chart
%matplotlib inline
import matplotlib.pyplot as plt
data1.plot.bar()


# Combine the name files for all years into a single DataFrame
import pandas as pd
import numpy as np
years=range(1880,2011)
pieces=[]
columns=['name','sex','births']
for year in years:
    path='E:\\计算机知识\\python\\Python数据分析案例\\Desktop\\pydata-notebook-master\\datasets\\babynames\\yob%d.txt'%year
    frame=pd.read_csv(path,names=columns,engine='python')
    frame['year']=year
    pieces.append(frame)
names=pd.concat(pieces,ignore_index=True)
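
The loop above can also be wrapped in a small helper that builds each path with pathlib, so the long absolute path is not repeated. This is only a sketch: `load_names`, the temporary directory, and the two tiny files with made-up counts are assumptions for illustration and are not part of the original notebook.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_names(data_dir, years, columns=("name", "sex", "births")):
    """Read one yobYYYY.txt file per year and stack them into one DataFrame."""
    pieces = []
    for year in years:
        path = Path(data_dir) / f"yob{year}.txt"
        frame = pd.read_csv(path, names=list(columns))
        # The year only appears in the file name, so store it as a column
        frame["year"] = year
        pieces.append(frame)
    # ignore_index=True replaces each file's own row numbers with one running index
    return pd.concat(pieces, ignore_index=True)


# Hypothetical demo files with made-up counts, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Mary,F,10\nJohn,M,12\n")
    Path(tmp, "yob1881.txt").write_text("Mary,F,9\nJohn,M,11\nAnna,F,4\n")
    demo = load_names(tmp, range(1880, 1882))

print(demo)
```

With the real dataset, the call would presumably be `load_names(<your babynames folder>, range(1880, 2011))`.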

names


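# Pivot table: total births per year, split by sex
# (the string 'sum' is used because passing the builtin sum triggers a warning in recent pandas versions)
total_births=names.pivot_table('births',index='year',columns='sex',aggfunc='sum')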
total_births.tail()
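
To see what the pivot table computes, it may help to compare it with an explicit groupby followed by unstack on a toy frame. The five rows below are made up for illustration and `toy` is a hypothetical name; only the column layout mirrors `names`.

```python
import pandas as pd

# Made-up rows with the same columns as the combined names DataFrame
toy = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [10, 12, 9, 11, 4],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# One row per year, one column per sex, each cell holding the summed births
by_pivot = toy.pivot_table("births", index="year", columns="sex", aggfunc="sum")

# The same totals: sum births per (year, sex), then move sex into the columns
by_groupby = toy.groupby(["year", "sex"])["births"].sum().unstack("sex")

print(by_pivot)
print(by_groupby)
```

Both printed tables should hold the same totals, which suggests reading `pivot_table` here as a shorthand for the grouped sum.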


# Plot total births by sex over the years
%matplotlib inline
import matplotlib.pyplot as plt
total_births.plot(title='Total births by sex and year')


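# Compute each name's share of births within its year and sex group
def add_prop(group):
    group['prop']=group.births/group.births.sum()
    return group
# group_keys=False keeps the result indexed like the input instead of adding
# year and sex as index levels, which newer pandas versions would otherwise do
names=names.groupby(['year','sex'],group_keys=False).apply(add_prop)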
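
A quick way to check a proportion column like this is to confirm that it sums to 1 within every year and sex group. The sketch below does that on made-up rows, and computes `prop` with `transform('sum')` as an alternative to `apply`. The `toy` frame and its counts are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the combined names DataFrame
toy = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [10, 12, 9, 11, 4],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# transform('sum') returns the group total aligned to every row,
# so the share is a plain element-wise division
group_total = toy.groupby(["year", "sex"])["births"].transform("sum")
toy["prop"] = toy["births"] / group_total

# Sanity check: within each (year, sex) group the shares should add up to 1
prop_sums = toy.groupby(["year", "sex"])["prop"].sum()
print(prop_sums)
assert np.allclose(prop_sums, 1.0)
```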

# Plot the trend in births of girls and boys from 1880 to 2010
df_F=names[names['sex']=='F']
df_M=names[names['sex']=='M']
# Select births before summing so that only the numeric column is aggregated
df_F=df_F.groupby('year')['births'].sum().reset_index()
df_M=df_M.groupby('year')['births'].sum().reset_index()
from matplotlib import pyplot as plt
plt.figure(figsize=(20,8))
plt.plot(df_F['year'],df_F['births'],label='F')
plt.plot(df_M['year'],df_M['births'],label='M')
plt.legend()
plt.show()


Alternatively, use a second and simpler method, the pivot table:

df12=names.pivot_table('births',index='year',columns='sex',aggfunc='sum')
df12.plot(figsize=(20,8))


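Both methods produce the same plot. It shows that in recent years more boys than girls have been recorded every year (do American parents favor sons too? Just kidding.)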
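
An observation like this can also be checked numerically instead of by eye, for instance by dividing the M column of the pivot table by the F column. The sketch below uses a small table with made-up totals, not the real figures; `ratio` and `years_more_boys` are hypothetical names.

```python
import pandas as pd

# Made-up totals shaped like the year-by-sex pivot table (not the real data)
total_births = pd.DataFrame(
    {"F": [100, 120, 150], "M": [90, 125, 160]},
    index=pd.Index([1880, 1950, 2010], name="year"),
)
total_births.columns.name = "sex"

# A ratio above 1 means more boys than girls were recorded in that year
ratio = total_births["M"] / total_births["F"]
years_more_boys = ratio[ratio > 1].index.tolist()

print(ratio)
print(years_more_boys)
```
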
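Next, analyze the 1,000 names with the highest share in each year and sex group (that is, the popular names).
First, define a function: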

# Return the 1,000 rows with the highest prop in a group
def get_top1000(group):
    return group.sort_values(by='prop',ascending=False)[:1000]

top_1000=names.groupby(['year','sex']).apply(get_top1000)
top_1000.reset_index(inplace=True,drop=True)
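
The same top-N selection can be sketched without `apply`: sort once by `prop`, then keep the first N rows of every group with `groupby(...).head(N)`. The example uses made-up rows and N=2 so the result is easy to inspect; `top_n` and `toy` are hypothetical names.

```python
import pandas as pd

# Made-up rows for a single year; the real data has many years and names
toy = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "Will"],
    "sex": ["F", "F", "F", "M", "M"],
    "births": [10, 4, 6, 12, 8],
    "year": [1880] * 5,
})
toy["prop"] = toy["births"] / toy.groupby(["year", "sex"])["births"].transform("sum")


def top_n(df, n):
    """Keep the n rows with the highest prop in every (year, sex) group."""
    ordered = df.sort_values("prop", ascending=False)
    kept = ordered.groupby(["year", "sex"]).head(n)
    # Restore a readable order: by year, then sex, then descending share
    kept = kept.sort_values(["year", "sex", "prop"], ascending=[True, True, False])
    return kept.reset_index(drop=True)


top2 = top_n(toy, 2)
print(top2)
```

For the real data, `top_n(names, 1000)` would play the role of `top_1000`, assuming `names` already has the `prop` column.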

# Split the top 1,000 into boys and girls
boys=top_1000[top_1000.sex=='M']
girls=top_1000[top_1000.sex=="F"]

total_births=top_1000.pivot_table('births',index='year',columns='name',aggfunc='sum')
total_births.info()
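
When reading the output of `info()`, it helps to know that a name absent from the top 1,000 in some year appears as NaN in that year's row, so most name columns have fewer non-null values than there are years. A minimal sketch on made-up rows (the `top` frame is hypothetical):

```python
import pandas as pd

# Made-up rows: Anna only appears in 1881, John only in 1880
top = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "Anna"],
    "sex": ["F", "M", "F", "F"],
    "births": [10, 12, 9, 4],
    "year": [1880, 1880, 1881, 1881],
})

# One row per year, one column per name; missing combinations become NaN
by_name = top.pivot_table("births", index="year", columns="name", aggfunc="sum")
print(by_name)

# Count of years in which each name is present
print(by_name.notna().sum())
```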

Plot the trends for four names: 'John', 'Harry', 'Mary', and 'Marilyn'.

subset=total_births[['John','Harry','Mary','Marilyn']]
subset.plot(subplots=True,figsize=(12,10),grid=False,title='Number of births per year')

These four plots show that all four of these popular names have become less and less common in recent years.

Finally, analyze how the share of births covered by these top 1,000 names changed from 1880 to 2010.

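# Share of all births covered by the top 1,000 names, by year and sex
table=top_1000.pivot_table('prop',index='year',columns='sex',aggfunc='sum')
# Plot the table defined above (the original referred to an undefined name, table1)
table.plot(title='M or F',figsize=(12,4),xticks=range(1880,2020,10),yticks=np.linspace(0,1,20))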