A few small data-analysis case studies (from Python for Data Analysis)


The examples below come from Python for Data Analysis (2nd edition).
The datasets used in this post come from the book's companion dataset repository.

The MovieLens 1M dataset

Collected by the GroupLens lab from MovieLens, this dataset covers movie ratings from the late 1990s to the early 2000s. It contains the ratings themselves, movie metadata (genre and release year), and viewer demographics (age, zip code, gender, occupation).
First we read the data into DataFrames with pandas.read_csv (see the file-reading tutorial). Since the separator "::" is longer than one character, we pass engine="python":

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

unames = ["user_id","gender","age","occupation","zip"]
users = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/users.dat",
                    sep="::",header=None,names=unames,engine="python")
rnames = ["user_id","movie_id","rating","timestamp"]
ratings = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/ratings.dat",
                      sep="::",header=None,names=rnames,engine="python")
mnames = ["movie_id","title","genres"]
movies = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/movies.dat",
                     sep="::",header=None,names=mnames,engine="python")
print(users.head())
#    user_id gender  age  occupation    zip
# 0        1      F    1          10  48067
# 1        2      M   56          16  70072
# 2        3      M   25          15  55117
# 3        4      M   45           7  02460
# 4        5      M   25          20  55455
print(ratings.head())
#    user_id  movie_id  rating  timestamp
# 0        1      1193       5  978300760
# 1        1       661       3  978302109
# 2        1       914       3  978301968
# 3        1      3408       4  978300275
# 4        1      2355       5  978824291
print(movies.head())
#    movie_id                               title                        genres
# 0         1                    Toy Story (1995)   Animation|Children's|Comedy
# 1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
# 2         3             Grumpier Old Men (1995)                Comedy|Romance
# 3         4            Waiting to Exhale (1995)                  Comedy|Drama
# 4         5  Father of the Bride Part II (1995)                        Comedy

To make later processing easier, we merge the three DataFrames into one (see the DataFrame-merging tutorial). users and ratings share the user_id column, so we merge on it first; the merged result and movies both have a movie_id column, which serves as the key for the second merge:

data = pd.merge(pd.merge(ratings,users),movies)
print(data.head())
#    user_id  movie_id  ...                                   title  genres
# 0        1      1193  ...  One Flew Over the Cuckoo's Nest (1975)   Drama
# 1        2      1193  ...  One Flew Over the Cuckoo's Nest (1975)   Drama
# 2       12      1193  ...  One Flew Over the Cuckoo's Nest (1975)   Drama
# 3       15      1193  ...  One Flew Over the Cuckoo's Nest (1975)   Drama
# 4       17      1193  ...  One Flew Over the Cuckoo's Nest (1975)   Drama
print(data.iloc[0])
# user_id                                            1
# movie_id                                        1193
# rating                                             5
# timestamp                                  978300760
# gender                                             F
# age                                                1
# occupation                                        10
# zip                                            48067
# title         One Flew Over the Cuckoo's Nest (1975)
# genres                                         Drama
# Name: 0, dtype: object
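
pd.merge infers these join keys automatically from the overlapping column names. For clarity, here is an equivalent sketch with the keys spelled out explicitly:

tmp = pd.merge(ratings, users, on="user_id")    #attach viewer data to each rating
data = pd.merge(tmp, movies, on="movie_id")     #then attach movie metadata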

Next we can aggregate the data to compute the average rating each movie received from men and from women (see the aggregation tutorial):

mean_data = data.groupby(["title","gender"])["rating"].mean().unstack()
print(mean_data.head())
# gender                                F         M
# title                                            
# $1,000,000 Duck (1971)         3.375000  2.761905
# 'Night Mother (1986)           3.388889  3.352941
# 'Til There Was You (1997)      2.675676  2.733333
# 'burbs, The (1989)             2.793478  2.962085
# ...And Justice for All (1979)  3.828571  3.689024
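
The same table can also be produced in one step with pivot_table, which is what the book itself uses here:

mean_data = data.pivot_table("rating", index="title",
                             columns="gender", aggfunc="mean")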

In data analysis we often drop items with too few samples. Here we use aggregation to remove movies with fewer than 250 ratings:

ratings_by_title = data.groupby("title").size()
#the size of each group, i.e. the number of ratings per movie
print(ratings_by_title.head())
# title
# $1,000,000 Duck (1971)            37
# 'Night Mother (1986)              70
# 'Til There Was You (1997)         52
# 'burbs, The (1989)               303
# ...And Justice for All (1979)    199
# dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
#select the titles with at least 250 ratings
mean_data = mean_data.loc[active_titles]
#select rows by that index to get the filtered table
print(mean_data.head())
# gender                                    F         M
# title                                                
# 'burbs, The (1989)                 2.793478  2.962085
# 10 Things I Hate About You (1999)  3.646552  3.311966
# 101 Dalmatians (1961)              3.791444  3.500000
# 101 Dalmatians (1996)              3.240000  2.911215
# 12 Angry Men (1957)                4.184397  4.328421

Then we can use sort_values to get the 10 movies most loved by female viewers:

top_female_ratings = mean_data.sort_values(by="F",ascending=False)
print(top_female_ratings.head(10))
# gender                                                     F         M
# title                                                                 
# Close Shave, A (1995)                               4.644444  4.473795
# Wrong Trousers, The (1993)                          4.588235  4.478261
# Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
# Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
# Schindler's List (1993)                             4.562602  4.491415
# Shawshank Redemption, The (1994)                    4.539075  4.560625
# Grand Day Out, A (1992)                             4.537879  4.293255
# To Kill a Mockingbird (1962)                        4.536667  4.372611
# Creature Comforts (1990)                            4.513889  4.272277
# Usual Suspects, The (1995)                          4.513317  4.518248
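
The analogous ranking for male viewers is the same call on the other column:

top_male_ratings = mean_data.sort_values(by="M", ascending=False)
print(top_male_ratings.head(10))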

Measuring rating disagreement

We can add a column holding the difference between the male and female mean ratings as a measure of disagreement:

mean_data['diff'] = mean_data["M"] - mean_data["F"]
print(mean_data.sort_values(by="diff").head())
# gender                            F         M      diff
# title                                                  
# Dirty Dancing (1987)       3.790378  2.959596 -0.830782
# Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
# Grease (1978)              3.975265  3.367041 -0.608224
# Little Women (1994)        3.870588  3.321739 -0.548849
# Steel Magnolias (1989)     3.901734  3.365957 -0.535777
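
Sorting the other way gives the movies men rated higher than women, since those have the largest positive diff:

print(mean_data.sort_values(by="diff", ascending=False).head())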

Suppose we want the most divisive movies regardless of gender, i.e. those whose ratings are most spread out; we can rank by the standard deviation (or variance) of the ratings:

rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
print(rating_std_by_title.sort_values(ascending=False).head(10))
# title
# Dumb & Dumber (1994)                     1.321333
# Blair Witch Project, The (1999)          1.316368
# Natural Born Killers (1994)              1.307198
# Tank Girl (1995)                         1.277695
# Rocky Horror Picture Show, The (1975)    1.260177
# Eyes Wide Shut (1999)                    1.259624
# Evita (1996)                             1.253631
# Billy Madison (1995)                     1.249970
# Fear and Loathing in Las Vegas (1998)    1.246408
# Bicentennial Man (1999)                  1.245533
# Name: rating, dtype: float64
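
Variance works just as well for this ranking; being the square of the standard deviation, it preserves the ordering:

rating_var_by_title = data.groupby("title")["rating"].var().loc[active_titles]
print(rating_var_by_title.sort_values(ascending=False).head())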

US baby names, 1880–2010

The babynames folder in the book's datasets contains, for each year from 1880 to 2010, the names given to babies born that year, their sex, and the count for each name; only names that occurred more than 5 times are included (one txt file per year).
First, let's read the 1880 data as an example:

names1880 = pd.read_csv("../pydata-book-2nd-edition/datasets/babynames/yob1880.txt",
                        names=["name","sex","number"])
print(names1880.head())
#         name sex  number
# 0       Mary   F    7065
# 1       Anna   F    2604
# 2       Emma   F    2003
# 3  Elizabeth   F    1939
# 4     Minnie   F    1746

We can check the number of births by sex:

print(names1880.groupby("sex").number.sum())
# F     90993
# M    110493
# Name: number, dtype: int64

Since the data is split across many files (yob1880.txt through yob2010.txt), the first step is to gather it into a single DataFrame and tag every row with its year:

years = range(1880,2011)
#range excludes the end point, so 2011 is needed to include 2010
pieces = []
columns = ["name","sex","number"]

for year in years:
    path = "../pydata-book-2nd-edition/datasets/babynames/yob%d.txt" % year
    frame = pd.read_csv(path,names=columns)
    frame["year"] = year
    pieces.append(frame)

names = pd.concat(pieces,ignore_index=True)
#concatenate the frames row-wise (the default) and rebuild the row index
print(names.head())
#         name sex  number  year
# 0       Mary   F    7065  1880
# 1       Anna   F    2604  1880
# 2       Emma   F    2003  1880
# 3  Elizabeth   F    1939  1880
# 4     Minnie   F    1746  1880
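
A variant that discovers the files on disk instead of hardcoding the year range, as a sketch assuming the same directory layout (pathlib is in the standard library):

from pathlib import Path

data_dir = Path("../pydata-book-2nd-edition/datasets/babynames")
pieces = []
for path in sorted(data_dir.glob("yob*.txt")):
    frame = pd.read_csv(path, names=["name", "sex", "number"])
    frame["year"] = int(path.stem[3:])   #"yob1880" -> 1880
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)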

Now we can aggregate, for example counting the boys and girls born each year, and visualize the result:

total_births = names.groupby(["year","sex"])["number"].sum().unstack()
print(total_births.head(10))
# sex        F       M
# year                
# 1880   90993  110493
# 1881   91955  100748
# 1882  107851  113687
# 1883  112322  104632
# 1884  129021  114445
# 1885  133056  107802
# 1886  144538  110785
# 1887  145983  101412
# 1888  178631  120857
# 1889  178369  110590
total_births.plot(title="Total births by sex and year")
plt.show()

The visualization:
[Figure: total births by year and sex]
Now we insert a prop column giving each name's share of the total same-sex births in its year:

def add_prop(group):
    group["prop"] = group.number / group.number.sum()
    return group
names = names.groupby(["year","sex"],group_keys=False).apply(add_prop)
#group_keys=False keeps the original flat row index
print(names.head())
#         name sex  number  year      prop
# 0       Mary   F    7065  1880  0.077643
# 1       Anna   F    2604  1880  0.028618
# 2       Emma   F    2003  1880  0.022013
# 3  Elizabeth   F    1939  1880  0.021309
# 4     Minnie   F    1746  1880  0.019188
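
As a sanity check, the proportions should sum to 1 within each (year, sex) group; the book performs the same verification:

print(np.allclose(names.groupby(["year","sex"])["prop"].sum(), 1))
# True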

To make later analysis easier, we keep only the top 1000 names by count within each year/sex group:

def get_top1000(group):
    return group.sort_values(by="number",ascending=False)[:1000]
grouped = names.groupby(["year","sex"])
top1000 = grouped.apply(get_top1000)
top1000.reset_index(inplace=True,drop=True)
#drop the hierarchical group index and fall back to a default integer index
print(top1000.head())
#         name sex  number  year      prop
# 0       Mary   F    7065  1880  0.077643
# 1       Anna   F    2604  1880  0.028618
# 2       Emma   F    2003  1880  0.022013
# 3  Elizabeth   F    1939  1880  0.021309
# 4     Minnie   F    1746  1880  0.019188
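
An equivalent variant uses nlargest instead of a full sort; with group_keys=False the flat row index is kept, so no reset_index is needed:

top1000_alt = (names.groupby(["year","sex"], group_keys=False)
                    .apply(lambda g: g.nlargest(1000, "number")))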

Analyzing naming trends

To split the data into boys and girls:

boys = top1000[top1000["sex"] == "M"]
girls = top1000[top1000["sex"] == "F"]

We can also look at how the count for each name evolves over time:

total_births = top1000.pivot_table("number",index="year",columns="name",aggfunc=sum)
#equivalent to top1000.groupby(["year","name"])["number"].sum().unstack()
print(total_births.head())
# name  Aaden  Aaliyah  Aarav  Aaron  Ab  ...  Zola  Zollie  Zona  Zora  Zula
# year                                    ...                                
# 1880    NaN      NaN    NaN  102.0 NaN  ...   7.0     NaN   8.0  28.0  27.0
# 1881    NaN      NaN    NaN   94.0 NaN  ...  10.0     NaN   9.0  21.0  27.0
# 1882    NaN      NaN    NaN   85.0 NaN  ...   9.0     NaN  17.0  32.0  21.0
# 1883    NaN      NaN    NaN  105.0 NaN  ...  10.0     NaN  11.0  35.0  25.0
# 1884    NaN      NaN    NaN   97.0 NaN  ...  14.0     6.0   8.0  58.0  27.0

Then we can pick a few names and plot their trends:

subset = total_births[["John","Harry","Mary","Marilyn"]]
subset.plot(subplots = True,figsize = (12,10),grid = False,
            title = "Number of births per year")

plt.show()

The trends:
[Figure: births per year for John, Harry, Mary and Marilyn]
All four names become rarer over time; we dig into the reason below.

Quantifying the increase in naming diversity

Aggregating the share of total births covered by the top-1000 names reveals an interesting pattern:

table = top1000.pivot_table("prop",index="year",columns="sex",aggfunc=sum)
#sum prop by year and sex
table.plot(title="Sum of top1000.prop by year and sex",
           yticks=np.linspace(0,1.2,13),xticks=range(1880,2020,10))
plt.show()

[Figure: share of births covered by the top-1000 names, by sex and year]
The share covered by the top 1000 names declines over time, which says that naming is becoming more diverse.
A related question: how many distinct names make up the top 50% of births? We can sort prop in descending order with sort_values, accumulate it with cumsum, and then call searchsorted to find the position where the cumulative sum crosses 0.5. That position is the number of names (strictly speaking we add 1, since indexing starts at 0). Taking boys born in 2010 as an example:

boys = top1000[top1000.sex == "M"]
df = boys[boys.year == 2010]
print(df.head())
#            name sex  number  year      prop
# 260877    Jacob   M   21875  2010  0.011523
# 260878    Ethan   M   17866  2010  0.009411
# 260879  Michael   M   17133  2010  0.009025
# 260880   Jayden   M   17030  2010  0.008971
# 260881  William   M   16870  2010  0.008887
prop_cumsum = df.sort_values(by="prop",ascending=False).prop.cumsum()
print(prop_cumsum.head())
# 260877    0.011523
# 260878    0.020934
# 260879    0.029959
# 260880    0.038930
# 260881    0.047817
# Name: prop, dtype: float64
print(prop_cumsum.values.searchsorted(0.5))
# 116

We can apply this to every year/sex group with groupby and plot the trend:

def get_quantile_count(group,q=0.5):
    group = group.sort_values(by="prop",ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(["year","sex"]).apply(get_quantile_count)
diversity = diversity.unstack("sex")
print(diversity.head())
# sex    F   M
# year        
# 1880  38  14
# 1881  38  14
# 1882  38  15
# 1883  39  15
# 1884  39  16
diversity.plot(title="Number of popular names in top 50%")
plt.show()

[Figure: number of popular names making up the top 50%]
The number of popular names keeps growing, and girls' names are consistently more diverse than boys'.

The "last letter" revolution

Research has pointed out that the distribution of the last letters of boys' names has shifted significantly over the past hundred years. To confirm this, we re-aggregate the data by year, sex and last letter:

get_last_letter = lambda x:x[-1]
#anonymous function that extracts the last letter
last_letters = names.name.map(get_last_letter)
last_letters.name = "last_letter"

table = names.pivot_table("number",index=last_letters,
                          columns=["sex","year"],aggfunc=sum)
subtable = table.reindex(columns=[1910,1960,2010],level="year")
#pick three representative years
print(subtable.head())
# sex                 F                            M                    
# year             1910      1960      2010     1910      1960      2010
# last_letter                                                           
# a            108376.0  691247.0  670605.0    977.0    5204.0   28438.0
# b                 NaN     694.0     450.0    411.0    3912.0   38859.0
# c                 5.0      49.0     946.0    482.0   15476.0   23125.0
# d              6750.0    3729.0    2607.0  22111.0  262112.0   44398.0
# e            133569.0  435013.0  313833.0  28655.0  178823.0  129012.0

Next we normalize, converting each count into its share of total births for that sex and year, and draw the distributions as bar charts:

letter_prop = subtable/subtable.sum()
print(letter_prop.head())
# sex                 F                             M
# year             1910      1960      2010      1910      1960      2010
# last_letter
# a            0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
# b                 NaN  0.000343  0.000256  0.002116  0.001834  0.020470
# c            0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
# d            0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
# e            0.336941  0.215133  0.178415  0.147556  0.083853  0.067959
fig,axes = plt.subplots(2,1,figsize=(10,8))
letter_prop["M"].plot(kind="bar",rot=0,ax=axes[0],title="Male")
letter_prop["F"].plot(kind="bar",rot=0,ax=axes[1],title="Female")

plt.show()

男性和女性在不同年代名字最后一个字母的分布
可以看出名字以n结尾的男生的数量自上个世纪60年代以来发生了一个显著的增长,我们可以使用之前的数据来查看n等字母在期间的具体增长趋势:

letter_prop = table/table.sum()
dny_ts = letter_prop.loc[["d","n","y"],"M"].T
#select rows d, n and y and the M columns, then transpose
print(dny_ts.head())
# last_letter         d         n         y
# year
# 1880         0.083055  0.153213  0.075760
# 1881         0.083247  0.153214  0.077451
# 1882         0.085340  0.149560  0.077537
# 1883         0.084066  0.151646  0.079144
# 1884         0.086120  0.149915  0.080405
dny_ts.plot()
plt.show()

[Figure: share of boys' names ending in d, n and y over time]

Boy names that became girl names (and vice versa)

Another fun trend: some names that were popular among boys early on gradually became girls' names over time, such as Lesley or Leslie. To see this directly, we first pull out the names beginning with "lesl":

all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains("lesl")]
print(lesley_like)
# 632     Leslie
# 2294    Lesley
# 4262    Leslee
# 4728     Lesli
# 6103     Lesly
# dtype: object

Next we filter everything else out and look at how these Lesley-like names split between the sexes each year:

filtered = top1000[top1000.name.isin(lesley_like)]
#keep only the Lesley-like names
table = filtered.pivot_table("number",index="year",
                             columns="sex",aggfunc=sum)
print(table.head())
# sex      F      M
# year
# 1880   8.0   79.0
# 1881  11.0   92.0
# 1882   9.0  128.0
# 1883   7.0  125.0
# 1884  15.0  125.0
table = table.div(table.sum(1),axis=0)
#divide each element by its row sum
print(table.head())
# sex          F         M
# year                    
# 1880  0.091954  0.908046
# 1881  0.106796  0.893204
# 1882  0.065693  0.934307
# 1883  0.053030  0.946970
# 1884  0.107143  0.892857
table.plot(style={"M":"k-","F":"k--"})
#plot the trend
plt.show()

The result:
[Figure: share of Lesley-like names going to each sex over time]

The 2012 Federal Election Commission database

The file P00000001-ALL.csv covers the 2012 US presidential election and records contributor names, occupations, employers, addresses and contribution amounts. We can load it with pandas' read_csv:

fec = pd.read_csv("../pydata-book-2nd-edition/datasets/fec/P00000001-ALL.csv")
print(fec.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1001731 entries, 0 to 1001730
# Data columns (total 16 columns):
# cmte_id              1001731 non-null object
# cand_id              1001731 non-null object
# cand_nm              1001731 non-null object
# contbr_nm            1001731 non-null object
# contbr_city          1001712 non-null object
# contbr_st            1001727 non-null object
# contbr_zip           1001620 non-null object
# contbr_employer      988002 non-null object
# contbr_occupation    993301 non-null object
# contb_receipt_amt    1001731 non-null float64
# contb_receipt_dt     1001731 non-null object
# receipt_desc         14166 non-null object
# memo_cd              92482 non-null object
# memo_text            97770 non-null object
# form_tp              1001731 non-null object
# file_num             1001731 non-null int64
# dtypes: float64(1), int64(1), object(14)
# memory usage: 122.3+ MB

unique quickly gives the full list of distinct candidates (the cand_nm column):

unique_cands = fec.cand_nm.unique()
print(unique_cands)
# ['Bachmann, Michelle' 'Romney, Mitt' 'Obama, Barack'
#  "Roemer, Charles E. 'Buddy' III" 'Pawlenty, Timothy' 'Johnson, Gary Earl'
#  'Paul, Ron' 'Santorum, Rick' 'Cain, Herman' 'Gingrich, Newt'
#  'McCotter, Thaddeus G' 'Huntsman, Jon' 'Perry, Rick']

Notice that the data carries no party affiliation for the candidates, so we add it by hand (as a dict):

parties = {
    'Bachmann, Michelle':"Republican",
    'Cain, Herman':"Republican",
    'Gingrich, Newt':"Republican",
    'Huntsman, Jon':"Republican",
    'Johnson, Gary Earl':"Republican",
    'McCotter, Thaddeus G':"Republican",
    'Obama, Barack':'Democrat',
    'Paul, Ron':"Republican",
    'Pawlenty, Timothy':"Republican",
    'Perry, Rick':"Republican",
    "Roemer, Charles E. 'Buddy' III":"Republican",
    'Romney, Mitt':"Republican",
    'Santorum, Rick':"Republican"
}

When a dict is passed to map, each value is looked up by key and replaced with the corresponding dict value, for example:

print(fec.cand_nm[123456:123461])
# 123456    Obama, Barack
# 123457    Obama, Barack
# 123458    Obama, Barack
# 123459    Obama, Barack
# 123460    Obama, Barack
# Name: cand_nm, dtype: object
print(fec.cand_nm[123456:123461].map(parties))
# 123456    Democrat
# 123457    Democrat
# 123458    Democrat
# 123459    Democrat
# 123460    Democrat
# Name: cand_nm, dtype: object
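
One thing to watch for: any candidate missing from the dict maps to NaN. A quick check (a hypothetical sanity check, not from the book):

print(fec.cand_nm.map(parties).isna().sum())
# expect 0, since parties covers all 13 candidates listed above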

So we can use this behavior to add a new column:

fec["party"] = fec.cand_nm.map(parties)
print(fec["party"].value_counts())
# Democrat      593746
# Republican    407985
# Name: party, dtype: int64

Also note before analyzing that the data contains refunds (negative amounts) as well as contributions; to simplify the analysis, we restrict it to positive contributions:

print((fec.contb_receipt_amt > 0).value_counts())
# True     991475
# False     10256
# Name: contb_receipt_amt, dtype: int64
fec = fec[fec.contb_receipt_amt > 0]

Since Barack Obama and Mitt Romney were the two main candidates, we also prepare a subset holding only their records:

fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]
#isin returns True where the value appears in the given list, otherwise False

Donation statistics by occupation and employer

First we can get the number of donations per occupation:

print(fec.contbr_occupation.value_counts().head())
# RETIRED                  234829
# INFORMATION REQUESTED     35176
# ATTORNEY                  34409
# HOMEMAKER                 30199
# PHYSICIAN                 23530
# Name: contbr_occupation, dtype: int64

One problem here: the same occupation can be written several ways, e.g. "CEO" may also appear as "C.E.O". We fold such variants into one form:

occ_mapping = {
    "INFORMATION REQUESTED PER BEST EFFORT":"NOT PROVIDED",
    "INFORMATION REQUESTED":"NOT PROVIDED",
    "INFORMATION REQUESTED (BEST EFFORT)":"NOT PROVIDED",
    "C.E.O":"CEO"
}

f = lambda x:occ_mapping.get(x,x)
#dict.get: the first argument is the key to look up; if present, its mapped value is returned, otherwise the second argument (here the original value)
fec.contbr_occupation = fec.contbr_occupation.map(f)
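
To see how the fallback behaves:

print(f("INFORMATION REQUESTED"))   # -> "NOT PROVIDED" (remapped)
print(f("ENGINEER"))                # -> "ENGINEER" (passes through unchanged)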

The employer column suffers from the same problem, and we treat it the same way:

emp_mapping = {
    "INFORMATION REQUESTED PER BEST EFFORT": "NOT PROVIDED",
    "INFORMATION REQUESTED": "NOT PROVIDED",
    "SELF":"SELF-EMPLOYED",
    "SELF EMPLOYED":"SELF-EMPLOYED"
}

f = lambda x:emp_mapping.get(x,x)
fec.contbr_employer = fec.contbr_employer.map(f)

Now we can aggregate by party and occupation with pivot_table, keep only the occupations whose total donations exceed $2 million, and visualize:

by_occupation = fec.pivot_table("contb_receipt_amt",
                                index="contbr_occupation",
                                columns="party",aggfunc=sum)
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
print(over_2mm.head())
# party                 Democrat  Republican
# contbr_occupation
# ATTORNEY           11141982.97  7477194.43
# C.E.O.                 1690.00  2592983.11
# CEO                 2074284.79  1640758.41
# CONSULTANT          2459912.71  2544725.45
# ENGINEER             951525.55  1818373.70
over_2mm.plot(kind="barh")
plt.show()

[Figure: total donations by occupation, split by party]
We can reuse the techniques from earlier to find the occupations or employers that gave the most to Obama and Romney (e.g. the top 5):

def get_top_amounts(group,key,n=5):
    totals = group.groupby(key)["contb_receipt_amt"].sum()
    return totals.nlargest(n)
    #nlargest(n) returns the n largest values

grouped = fec_mrbo.groupby("cand_nm")
#split the data into the two candidates' records
print(grouped.apply(get_top_amounts,"contbr_occupation",n=5))
# cand_nm        contbr_occupation
# Obama, Barack  RETIRED              25305116.38
#                ATTORNEY             11141982.97
#                NOT PROVIDED          4866973.96
#                HOMEMAKER             4248875.80
#                PHYSICIAN             3735124.94
# Romney, Mitt   RETIRED              11508473.59
#                NOT PROVIDED         11396894.84
#                HOMEMAKER             8147446.22
#                ATTORNEY              5364718.82
#                PRESIDENT             2491244.89
# Name: contb_receipt_amt, dtype: float64

From there, visualization is straightforward:

Obama = grouped.apply(get_top_amounts,"contbr_occupation",n=5)["Obama, Barack"]
print(Obama)
Obama.plot.bar()
plt.show()

[Figure: top 5 occupations among Obama's contributors]

Bucketing donation amounts

We can use the cut function to discretize the contribution amounts into size buckets and count the donations in each:

bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt,bins)

grouped = fec_mrbo.groupby(["cand_nm",labels])
res = grouped.size().unstack(0)
print(res)
# cand_nm              Obama, Barack  Romney, Mitt
# contb_receipt_amt
# (0, 1]                       493.0          77.0
# (1, 10]                    40070.0        3681.0
# (10, 100]                 372280.0       31853.0
# (100, 1000]               153991.0       43357.0
# (1000, 10000]              22284.0       26186.0
# (10000, 100000]                2.0           1.0
# (100000, 1000000]              3.0           NaN
# (1000000, 10000000]            4.0           NaN
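
pd.cut can also attach readable labels to the buckets; a small sketch, with label strings made up for illustration (one per interval, so len(bins) - 1 of them):

bucket_names = ["<$1", "$1-10", "$10-100", "$100-1k",
                "$1k-10k", "$10k-100k", "$100k-1m", "$1m-10m"]
labeled = pd.cut(fec_mrbo.contb_receipt_amt, bins, labels=bucket_names)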

Obama clearly received far more donations than Romney, especially small ones. We can also compute each candidate's share of the total amount within each bucket, then visualize it:

bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1),axis=0)
print(normed_sums)
# cand_nm              Obama, Barack  Romney, Mitt
# contb_receipt_amt                               
# (0, 1]                    0.805182      0.194818
# (1, 10]                   0.918767      0.081233
# (10, 100]                 0.910769      0.089231
# (100, 1000]               0.710176      0.289824
# (1000, 10000]             0.447326      0.552674
# (10000, 100000]           0.823120      0.176880
# (100000, 1000000]         1.000000           NaN
# (1000000, 10000000]       1.000000           NaN
normed_sums[:-2].plot(kind="barh")
#exclude the two largest buckets, which only Obama received
plt.show()

[Figure: each candidate's share of the total donation amount, by bucket size]

Donation statistics by state

We can aggregate by candidate and state to see how much each state donated to each candidate:

grouped = fec_mrbo.groupby(["cand_nm","contbr_st"])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
#replace NA with 0
totals = totals[totals.sum(1) > 100000]
#keep states whose total donations exceed $100,000
print(totals.head())
# cand_nm    Obama, Barack  Romney, Mitt
# contbr_st                             
# AK             281840.15      86204.24
# AL             543123.48     527303.51
# AR             359247.28     105556.00
# AZ            1506476.98    1888436.23
# CA           23824984.24   11237636.60

Likewise we can get each candidate's share of the donations within each state:

percent = totals.div(totals.sum(1),axis=0)
print(percent.head())
# cand_nm    Obama, Barack  Romney, Mitt
# contbr_st                             
# AK              0.765778      0.234222
# AL              0.507390      0.492610
# AR              0.772902      0.227098
# AZ              0.443745      0.556255
# CA              0.679498      0.320502
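
From here, a quick ranking is one more line each, e.g. the states most tilted toward either candidate (a simple sketch in place of the book's map-based visualization):

print(percent["Obama, Barack"].nlargest(5))   #most Obama-leaning states
print(percent["Romney, Mitt"].nlargest(5))    #most Romney-leaning states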