python数据分析chapter2-3

最新推荐文章于 2022-05-20 17:28:40 发布

墨岚❤️

最新推荐文章于 2022-05-20 17:28:40 发布

阅读量1.4k

点赞数 1

分类专栏： python 数据处理与分析文章标签： python 数据分析

本文链接：https://blog.csdn.net/LY_ysys629/article/details/73604664

版权

python 同时被 2 个专栏收录

29 篇文章

订阅专栏

数据处理与分析

18 篇文章

订阅专栏

通过分析1880年至2011年美国婴儿名字数据，探究名字流行度、多样性及性别混用趋势的变化。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

1 统计婴儿姓名

婴儿的姓名，能反映什么呢？很多，比如某个名字的使用人数，流行程度，人口结构变化等等。下边就让我们来探索名字中隐藏的秘密吧~

1.1 下载数据

仍然是最常规的下载数据，然后显示原数据，看看原始数据文件里边都是什么样的，然后再想怎么处理，要得到什么的结果。

import pandas as pd

#查看原始数据
print open(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob1880.txt').readline()

names1880 = pd.read_csv(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob1880.txt',names=['name','sex','births'])
print type(names1880)
names1880[:5]

  Mary,F,7065

    <class 'pandas.core.frame.DataFrame'>

	name	sex	births
0	Mary	F	7065
1	Anna	F	2604
2	Emma	F	2003
3	Elizabeth	F	1939
4	Minnie	F	1746

1.2 查看1880年男女婴儿的出生数

print names1880.shape
print len(names1880)
print names1880.size#很显然，size是三列的乘积
names1880.groupby(['sex']).sum()

(2000, 3)
    2000
    6000

	births
sex
F	90993
M	110493

1.3 实现多个txt文本文件的融合（多个DataFrame的联结）

#由于统计的是1880-2011年的婴儿名字
years = range(1880,2011)
columns = ['name','sex','births']
pieces = []
for year in years:
    path = r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\names\yob%d.txt'%year
    frame = pd.read_csv(path, names = columns)
    frame['year'] = year
    pieces.append(frame)

names = pd.concat(pieces,ignore_index=True)
print len(names)
print names.shape

    1690784
    (1690784, 4)

1.4 统计并可视化每年不同性别婴儿的出生数量

一般原来数据的三列会被挑选出来，做成透视表，其中两列做成行列表，第三列填充表中内容，并实现可视化

total_births_by_sex = pd.pivot_table(names,values = 'births', index ='year',columns='sex',aggfunc = sum)
total_births_by_sex.tail()#默认显示最后5行

sex	F	M
year
2006	1896468	2050234
2007	1916888	2069242
2008	1883645	2032310
2009	1827643	1973359
2010	1759010	1898382

total_births_by_sex.plot(title='total births by sex and year')
import matplotlib.pyplot as plt
plt.show()

1.5 找出最受欢迎的名字

在达到要求之前，需要在原来的names数据集增加一列prop表示每年每个名字在当年相同性别中的使用比例

def add_prop(group):
    briths = group.births.astype(float)
    group['prop'] = briths/briths.sum()
    return group
names = names.groupby(['year','sex']).apply(add_prop)

names[:5]

	name	sex	births	year	prop
0	Mary	F	7065	1880	0.077643
1	Anna	F	2604	1880	0.028618
2	Emma	F	2003	1880	0.022013
3	Elizabeth	F	1939	1880	0.021309
4	Minnie	F	1746	1880	0.019188

import numpy as np
#检验分组总计值是否接近于1
np.allclose(names.groupby(['year','sex']).prop.sum(),1)

   True

def get_top1000(group):
    return group.sort_index(by = 'births',ascending = False)
top1000 = names.groupby(['year','sex']).apply(get_top1000)

    C:\Program Files\anaconda\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
      from ipykernel import kernelapp as app

top1000.info()#查看top1000的相关信息

    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 1690784 entries, (1880, F, 0) to (2010, M, 1690783)
    Data columns (total 5 columns):
    name      1690784 non-null object
    sex       1690784 non-null object
    births    1690784 non-null int64
    year      1690784 non-null int64
    prop      1690784 non-null float64
    dtypes: float64(1), int64(2), object(2)
    memory usage: 77.4+ MB

展示最受欢迎的名字

top1000[:5]

			name	sex	births	year	prop
year	sex
1880	F	0	Mary	F	7065	1880	0.077643
		1	Anna	F	2604	1880	0.028618
		2	Emma	F	2003	1880	0.022013
		3	Elizabeth	F	1939	1880	0.021309
		4	Minnie	F	1746	1880	0.019188

可以发现在1880年名字为mary的女baby最多

1.6 分析并可视化某个名字的随时间变化趋势

#将top1000中男女分开
boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

boys[:5]

			name	sex	births	year	prop
year	sex
1880	M	942	John	M	9655	1880	0.087381
		943	William	M	9533	1880	0.086277
		944	James	M	5927	1880	0.053641
		945	Charles	M	5348	1880	0.048401
		946	George	M	5126	1880	0.046392

girls[:5]

			name	sex	births	year	prop
year	sex
1880	F	0	Mary	F	7065	1880	0.077643
		1	Anna	F	2604	1880	0.028618
		2	Emma	F	2003	1880	0.022013
		3	Elizabeth	F	1939	1880	0.021309
		4	Minnie	F	1746	1880	0.019188

#生成year、name和births的透视表
total_births = top1000.pivot_table(values = 'births',index = 'year',columns = 'name',aggfunc = sum)

total_births[:5]

name	Aaban	Aabid	Aabriella	Aadam	Aadan	Aadarsh	Aaden	Aadesh	Aadhav	Aadhavan	…	Zyrus	Zysean	Zyshaun	Zyshawn	Zyshon	Zyshonne	Zytavious	Zyvion	Zyyanna	Zzyzx
year
1880	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1881	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1882	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1883	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1884	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 88496 columns

total_births.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 131 entries, 1880 to 2010
    Columns: 88496 entries, Aaban to Zzyzx
    dtypes: float64(88496)
    memory usage: 88.4 MB

subset = total_births['John']
subset[:5]

    year
    1880    9701.0
    1881    8795.0
    1882    9597.0
    1883    8934.0
    1884    9427.0
    Name: John, dtype: float64

subset = total_births[['John','Harry','Mary','Marilyn']]
subset[:5]

name	John	Harry	Mary	Marilyn
year
1880	9701.0	2158.0	7092.0	NaN
1881	8795.0	2002.0	6948.0	NaN
1882	9597.0	2246.0	8179.0	NaN
1883	8934.0	2116.0	8044.0	NaN
1884	9427.0	2338.0	9253.0	NaN

#可视化subset,注意该图是在运行三次出的，前两次有点小问题
subset.plot(subplots = True, figsize = (12,10),grid = False,title='Number of births per year')
plt.show()

为什么常见的名字会越来越少被使用？可能的原因是大家想让孩子的名字与众不同~

1.7 命名多样性增加

怎么来验证命名多样性的增加呢？在上述中top1000是按前prop选出的，prop表示某个名字每性别每年在所有婴儿出生数中的比例，当这个值降低时，说明其他人数少的名字的比例会增加，即可证明名字越来越不同。

tabal = top1000.pivot_table(values = 'prop',index = 'year',columns = 'sex',aggfunc = sum)
tabal[:5]

sex	F	M
year
1880	1.0	1.0
1881	1.0	1.0
1882	1.0	1.0
1883	1.0	1.0
1884	1.0	1.0

可以看出1880-1884年婴儿的名字几乎都是前1000名的名字集合中，命名多样性差

#tabal.plot(title ='sum of table1000.prop by year and sex',yticks = np.linspace(0,1.2,20),xticks=range(1880,2020,10))
tabal.plot(title ='sum of table1000.prop by year and sex')
plt.show()

未解之谜？？？

另外一种展示名字多样性的方式

占出生总人口数50%的且具有较大prop的名字的个数随时间越来越多，那么也可以说明命名多样性在增加。那么分别统计每年不同性别出生数占50%的名字个数的变化趋势，从该趋势上就能看出命名多样性。

def get_quantile_count(group,q=0.5):
    group = group.sort_index(by = 'prop',ascending = False)
    return (group.prop.cumsum().searchsorted(q)+1)[0]
diversity = top1000.groupby(['year','sex']).apply(get_quantile_count)
print type(diversity)
diversity[:5]

    C:\Program Files\anaconda\lib\site-packages\ipykernel\__main__.py:2: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
      from ipykernel import kernelapp as app


    <class 'pandas.core.series.Series'>

year  sex
1880  F      38
      M      14
1881  F      38
      M      14
1882  F      38
dtype: int64

#具有多个索引的Serise可以展开
diversity = diversity.unstack('sex')
diversity[:5]

sex	F	M
year
1880	38	14
1881	38	14
1882	38	15
1883	39	15
1884	39	16

diversity.plot(title='number of popular names in top 50%')
plt.show()

可以看出该曲线呈现上升趋势，说明命名多样性。

1.8 男孩名与女孩名字的混用情况

在以前，通常男孩使用的名字，在现在，被越来越多的使用在女孩上。在这里，只考察以‘lesl’开头的一组名字，男生女生使用随时间的变化趋势。

找出top1000表中以lesl开通的名字

all_names = top1000.name.unique()#找出top1000中所有名字的集合（没有重复的名字）
all_names[:5]

    array(['Mary', 'Anna', 'Emma', 'Elizabeth', 'Minnie'], dtype=object)

mask = np.array(['lesl' in x.lower() for x in all_names])
mask

  array([False, False, False, ..., False, False, False], dtype=bool)

lesley_like = all_names[mask]
print lesley_like.shape
lesl_like = np.array([ x for x in all_names if x.lower().startswith('lesl')])
print lesl_like.shape

    (24L,)
    (21L,)

lesl_top1000 = top1000[top1000.name.isin(lesl_like)]
lesl_top1000.groupby(['name']).births.sum()

    name
    Lesle            187
    Leslea           349
    Leslee          4863
    Leslei            52
    Lesleigh         436
    Lesley         37945
    Lesleyann         86
    Lesleyanne        80
    Lesli           5473
    Leslian           27
    Lesliann           6
    Leslianne         10
    Leslie        371686
    Leslieann        465
    Leslieanne        93
    Lesliee            8
    Leslly             5
    Lesly          12407
    Leslyann          16
    Leslye          2295
    Leslyn           166
    Name: births, dtype: int64

lesl_top1000[:5]

			name	sex	births	year	prop
year	sex
1880	F	654	Leslie	F	8	1880	0.000088
1880	M	1108	Leslie	M	79	1880	0.000715
1881	F	2523	Leslie	F	11	1881	0.000120
1881	M	3072	Leslie	M	92	1881	0.000913
1882	F	4593	Leslie	F	9	1882	0.000083

构造year、sex、以及births的透视图，并画出每年不同性别，名字在lesl_like集合中的出生数

lesl_pivot = lesl_top1000.pivot_table(values = 'births',index = 'year',columns = 'sex',aggfunc = sum)
lesl_pivot[:5]

sex	F	M
year
1880	8	79
1881	11	92
1882	9	128
1883	7	125
1884	15	125

lesl_pivot.plot(style = {'M':'k-','F':'k--'})
plt.show()

上图显示命名在lesl_like集合中的婴儿出生数量的变化，但是我们没有考虑，每年男女baby出生总量的变化因素，使得结果并不是很清晰的显示男名女用的变化趋势，接下里，将男女baby出生总量的变化因素考虑进去，看看结果如何吧~

lesl_pivot_div = lesl_pivot.div(lesl_pivot.sum(1),axis = 0)
lesl_pivot_div.tail()

sex	F	M
year
2006	0.979139	0.020861
2007	0.978508	0.021492
2008	0.977437	0.022563
2009	0.971627	0.028373
2010	0.978482	0.021518

lesl_pivot_div.plot(style = {'M':'k-','F':'k--'})
plt.show()

瞧！结果很明显，1880-1940年之间lesl_like集合中的名字大部分是男孩，但是之后，女孩使用的比例发生了很大变化！

2 总结

本部分主要使用panda中的groupby、pivot_table、.plot灯等函数分析了婴儿名字的使用变化情况。