数据分析入门之2012美国大选政治献金项目

操作环境: window10,Python3.7,Jupyter notebook
数据资料: https://www.lanzous.com/i98lfra

1、数据载入与总览

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
%matplotlib inline

1.1、数据加载

contb1 = pd.read_csv('./contb_01.csv')
contb2 = pd.read_csv('./contb_02.csv')
contb3 = pd.read_csv('./contb_03.csv')

1.2、数据合并

contb = pd.concat([contb1, contb2, contb3], axis=0)
contb.head() #查看前五行
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dt
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.005-JUL-11
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11

字段解释:

cmte_id :候选人ID
cand_nm :候选人姓名
contbr_nm : 捐赠人姓名
contbr_st :捐赠人所在州
contbr_employer : 捐赠人所在公司
contbr_occupation : 捐赠人职业
contb_receipt_amt :捐赠数额(美元)
contb_receipt_dt : 捐款的日期

1.3、数据预览和基本统计分析

1.3.1、查看数据形状

contb.shape
(1001733, 7)
  • 一共有1001733行,7列数据

1.3.2、查看是否有空值

contb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001733 entries, 0 to 1730
Data columns (total 7 columns):
cand_nm              1001733 non-null object
contbr_nm            1001733 non-null object
contbr_st            1001729 non-null object
contbr_employer      988004 non-null object
contbr_occupation    993303 non-null object
contb_receipt_amt    1001733 non-null float64
contb_receipt_dt     1001733 non-null object
dtypes: float64(1), object(6)
memory usage: 61.1+ MB
  • 从上面的结果可以看出:
    • 一共有1001733行数据,其中contbr_st,contbr_employe,contbr_occupation 有空值。

1.3.3、查看描述性的信息

contb.describe()
contb_receipt_amt
count1.001733e+06
mean2.982358e+02
std3.749663e+03
min-3.080000e+04
25%3.500000e+01
50%1.000000e+02
75%2.500000e+02
max2.014491e+06

2、数据清洗

2.1、缺失值处理

2.1.1、查看带有缺失值的列

cond = contb['contbr_employer'].isnull() #判断是否有缺失值,返回True和False
contb[cond] #筛选满足的数据
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dt
41Bachmann, MichelleMINNIS, RITACANaNNaN-1500.020-JUN-11
264Bachmann, MichelleBISHOP, GERARDNYNaNNaN-1700.028-JUN-11
752Romney, MittKNIGHT, RENAALNaNNaN-60.007-MAR-12
897Romney, MittTHE STEWART FIRM L.L.C.ALNaNNaN250.023-MAR-12
1033Romney, MittSELLERS, LEEALNaNNaN-120.020-JAN-12
........................
1715Perry, RickTEXAS ENERGY L.L.C.WYNaNNaN250.030-SEP-11
1717Perry, RickHARDER, ROBERTWYNaNRETIRED100.012-NOV-11
1720Perry, RickHARDER, ROBERTWYNaNRETIRED100.001-OCT-11
1722Perry, RickHARDER, ROBERTWYNaNRETIRED100.029-OCT-11
1723Perry, RickHARDER, ROBERTWYNaNRETIRED100.030-NOV-11

2.1.2、空值指定填充

  • 用某个值去填充上面的空值
  • 上面contbr_st,contbr_employe,contbr_occupation 的一些用户没有提供信息,这是使用NOT PROVIDE 进行填充。
# 填充空值,NOT PROVIDE没有提供
contb['contbr_employer'].fillna('NOT PROVIDE', inplace=True) #用NOT PROVIDE填充空值
#填充contbr_occupation为空的值为NOT PROVIDE
contb['contbr_occupation'].fillna('NOT PROVIDE', inplace=True)
contb['contbr_st'].fillna('NOT PROVIDE', inplace=True)
contb.info() # 查看填充后的结果
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001733 entries, 0 to 1730
Data columns (total 7 columns):
cand_nm              1001733 non-null object
contbr_nm            1001733 non-null object
contbr_st            1001733 non-null object
contbr_employer      1001733 non-null object
contbr_occupation    1001733 non-null object
contb_receipt_amt    1001733 non-null float64
contb_receipt_dt     1001733 non-null object
dtypes: float64(1), object(6)
memory usage: 61.1+ MB
  • 现在所有带有空值的数据被 NOT PROVIDE 填充后,不再有空值。

2.2、数据转换

利用字典映射进行转换:党派分析

2.2.1、候选人去重

contb['cand_nm'].unique()
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)
  • 去重后发现一共有13名候选人参与竞选

2.2.2、添加党派

  • 通过搜索引擎等途径,获取每个总统候选人的所属党派,建立字典parties,候选人名字作为键,所属党派作为对应的值

建立字典:

parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Republican',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Republican',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }

数据映射:

%%time
contb['party'] = contb['cand_nm'].map(parties)
Wall time: 129 ms
  • 新加党派,使用map字典映射,100万数据,增加一列耗时129ms

查看前5行:

contb.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11Republican
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11Republican
2Bachmann, MichelleSMITH, LANIERALINFORMATION REQUESTEDINFORMATION REQUESTED250.005-JUL-11Republican
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11Republican
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11Republican

2.2.3、统计两个党派支持的次数

contb['party'].value_counts()
Democrat      593747
Republican    407986
Name: party, dtype: int64

2.2.4、查询每个政党的捐献额

contb.groupby('party')['contb_receipt_amt'].sum()
party
Democrat      1.335028e+08
Republican    1.652498e+08
Name: contb_receipt_amt, dtype: float64

2.2.5、按照职业汇总对赞助总金额进行排序

grouped_occupation = contb.groupby(['contbr_occupation'])['contb_receipt_amt'].sum()
grouped_occupation.sort_values(ascending=False) #ascending=False降序
contbr_occupation
RETIRED                                   48176647.00
ATTORNEY                                  18470473.30
HOMEMAKER                                 17484807.65
INFORMATION REQUESTED PER BEST EFFORTS    15859514.55
INFORMATION REQUESTED                      8742357.59
                                             ...     
PRES OF GAS & ELECTRIC                       -2500.00
AVIATION ATTORNEY                            -2500.00
DREDGING                                     -2500.00
METAL SMITH                                  -4225.00
VENTURE PHILANTHROPIST                       -5000.00
Name: contb_receipt_amt, Length: 45074, dtype: float64

2.2.6、职业类型去重

  • 整理同种类型职业
  • 按照职位进行汇总,计算赞助总金额,发现不少职业都是相同的,只不过是表达形式不同而已,如C.E.0与CEO,都是一个职业
  • 利用函数进行转换:职业与雇主信息分析
  • 建立一个职业对应的字典,把相同职业的不同表达映射为对应的职业,比如C.E.O映射为CEO
#整理一部分相同的职业,如果全部整理需要花费很长时间
occupation = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDE',
             'INFORMATION REQUESTED':'NOT PROVIDE',
              'C.E.O.':'CEO',
              'LAWYER':'ATTORNEY',
              'SELF':'SELF-EMPLOYED',
              'SELF EMPLOYED ':'SELF-EMPLOYED'}
f = lambda x : occupation.get(x, x)
contb['contbr_occupation'] = contb['contbr_occupation'].map(f)
# 统计同种职业捐献的总额,查看前5个
contb.groupby(['contbr_employer'])['contb_receipt_amt'].sum().sort_values(ascending = False)[:10
contbr_employer
RETIRED           41374333.67
SELF-EMPLOYED     37483895.22
NOT PROVIDE       31281997.76
HOMEMAKER         14738524.86
NOT EMPLOYED       8636809.43
NONE               3809582.99
STUDENT             957971.85
REQUESTED           894009.54
MORGAN STANLEY      386129.40
UNEMPLOYED          377088.31
Name: contb_receipt_amt, dtype: float64

2.3、数据筛选

2.3.1、赞助金额筛选

  • 去掉金额小于 “0” 的异常数据
# 捐赠金额大于0
contb_over = contb[contb['contb_receipt_amt'] > 0]
contb_over.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
0Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED250.020-JUN-11Republican
1Bachmann, MichelleHARVEY, WILLIAMALRETIREDRETIRED50.023-JUN-11Republican
2Bachmann, MichelleSMITH, LANIERALNOT PROVIDENOT PROVIDE250.005-JUL-11Republican
3Bachmann, MichelleBLEVINS, DARONDAARNONERETIRED250.001-AUG-11Republican
4Bachmann, MichelleWARDENBURG, HAROLDARNONERETIRED300.020-JUN-11Republican
#去重前
contb.shape  #(1001733, 8)

#去重后
contb_over.shape  #(991477, 8)

2.3.2、查看各候选人获得的赞助总金额

cand_nm_amt = contb_over.groupby(['cand_nm'])['contb_receipt_amt'].sum().sort_values(ascending = False)
cand_nm_amt
cand_nm
Obama, Barack                     1.358776e+08
Romney, Mitt                      8.833591e+07
Paul, Ron                         2.100962e+07
Perry, Rick                       2.030675e+07
Gingrich, Newt                    1.283277e+07
Santorum, Rick                    1.104316e+07
Cain, Herman                      7.101082e+06
Pawlenty, Timothy                 6.004819e+06
Huntsman, Jon                     3.330373e+06
Bachmann, Michelle                2.711439e+06
Johnson, Gary Earl                5.669616e+05
Roemer, Charles E. 'Buddy' III    3.730099e+05
McCotter, Thaddeus G              3.903000e+04
Name: contb_receipt_amt, dtype: float64

2.3.3、数据可视化

plt.figure(figsize=(8, 8))
cand_nm_amt.plot(kind='pie') #

在这里插入图片描述

  • 从上面的数据可以看出,支持Obama, Barack 和 Romney, Mitt 的人是最多的。

2.3.4、选取选举人为Obama、Romney的子集数据

方法一:

cond1 = contb_over['cand_nm'] == 'Obama, Barack'
cond2 = contb_over['cand_nm'] == 'Romney, Mitt'
# 这是一个与运算
cond = cond1|cond2
cond.sum() #694283次
contb_vs = contb_over[cond]
contb_vs

方法二:

contb_over.query("cand_nm == 'Obama, Barack' or cand_nm == 'Romney, Mitt'")

方法三:

cond = contb_over['cand_nm'].isin(['Obama, Barack', 'Romney, Mitt'])
contb_over[cond]

结果:

cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
411Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER25.001-FEB-12Republican
412Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER110.001-FEB-12Republican
413Romney, MittCARLSEN, RICHARDAEDEFENSE INTELLIGENCE AGENCYINTELLIGENCE ANALYST250.013-APR-12Republican
414Romney, MittDELUCA, PIERREAECISCOENGINEER30.021-AUG-11Republican
415Romney, MittSARGENT, MICHAELAERAYTHEON TECHNICAL SERVICES CORPCOMPUTER SYSTEMS ENGINEER100.007-MAR-12Republican
...........................
201381Obama, BarackMOUNTS, ROBERTZZHQ USFK (FKDC-SA)GS-15 INTERNATIONAL RELATIONS OFFICER25.026-FEB-12Democrat
201382Obama, BarackTAITANO, TYRONEZZNOT EMPLOYEDRETIRED250.020-JAN-12Democrat
201383Obama, BarackTUCKER, PAMELAZZDODEAEDUCATOR3.020-JAN-12Democrat
201384Obama, BarackMOUNTS, ROBERTZZHQ USFK (FKDC-SA)GS-15 INTERNATIONAL RELATIONS OFFICER25.026-APR-12Democrat
201385Obama, BarackNEAL, AMBERZZTHE DEPARTMENT OF DEFENSE EDUCATION ACTEACHER135.004-SEP-11Democrat

2.4、面元化数据

接下来我们对该数据做另一种非常实用的分析,利用cut函数根据出资额大小将数据离散化到多个面元中

contb_over['contb_receipt_amt'].sort_values()
323688          0.01
327537          0.01
327468          0.01
326458          0.01
317091          0.01
             ...    
319478     526246.17
344419    1511192.17
344539    1679114.65
326651    1944042.43
325136    2014490.51
Name: contb_receipt_amt, Length: 991477, dtype: float64
  • 他们的金额出现在 (0.01,2014490.51)之间,现在将它们进行范围分组。
bins = [0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
labels = pd.cut(contb_over['contb_receipt_amt'], bins)
labels
0         (100, 1000]
1           (10, 100]
2         (100, 1000]
3         (100, 1000]
4         (100, 1000]
            ...      
1726    (1000, 10000]
1727    (1000, 10000]
1728      (100, 1000]
1729      (100, 1000]
1730    (1000, 10000]
Name: contb_receipt_amt, Length: 991477, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]


3、数据聚合与分组运算

contb_over.columns
Index(['cand_nm', 'contbr_nm', 'contbr_st', 'contbr_employer',
       'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt', 'party'],
      dtype='object')

3.1、透视表(pivot_table)分析党派和职业

  • 按照党派,职业对赞助金额进行汇总,类似Excel表中的透视表功能,聚合函数为sum
ret = contb_over.pivot_table('contb_receipt_amt', index='contbr_occupation', columns='party', aggfunc='sum', fill_value=0)
ret
partyDemocratRepublican
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER100.00.0
AREA VICE PRESIDENT250.00.0
RESEARCH ASSOCIATE100.00.0
TEACHER500.00.0
THERAPIST3900.00.0
.........
ZOOKEEPER35.00.0
ZOOLOGIST400.00.0
ZOOLOGY EDUCATION25.00.0
\NONE\0.0250.0
~0.075.0

3.1.1、求每个职业为两个党派献金之和

ret['total'] = ret['Democrat'] + ret['Republican']
ret.head()
partyDemocratRepublicantotal
contbr_occupation
MIXED-MEDIA ARTIST / STORYTELLER100.00.0100.0
AREA VICE PRESIDENT250.00.0250.0
RESEARCH ASSOCIATE100.00.0100.0
TEACHER500.00.0500.0
THERAPIST3900.00.03900.0

3.1.2、过滤掉总和小于200W的数据

cond = ret['total'] < 2000000 #条件
index = ret[cond].index # 索引
ret_big = ret.drop(labels=index) # 移除
ret_big
partyDemocratRepublicantotal
contbr_occupation
ATTORNEY14302461.847.868419e+062.217088e+07
CEO2074974.794.211041e+066.286015e+06
CONSULTANT2459912.712.544725e+065.004638e+06
ENGINEER951525.551.818374e+062.769899e+06
EXECUTIVE1355161.054.138850e+065.494011e+06
HOMEMAKER4248875.801.363428e+071.788315e+07
INVESTOR884133.002.431769e+063.315902e+06
MANAGER762883.221.444532e+062.207416e+06
NOT PROVIDE13725187.322.097161e+073.469680e+07
OWNER1001567.362.408287e+063.409854e+06
PHYSICIAN3735124.943.594320e+067.329445e+06
PRESIDENT1878509.954.720924e+066.599434e+06
PROFESSOR2165071.082.967027e+052.461774e+06
REAL ESTATE528902.091.625902e+062.154804e+06
RETIRED25305316.382.356124e+074.886656e+07
SELF-EMPLOYED721108.401.961786e+062.682894e+06

3.1.3、数据可视化

# 绘制图形
ret_big.plot(kind='bar', figsize=(12, 6))

在这里插入图片描述

3.2、分组及运算和转换

  • 根据职业和雇主信息分组运算
grouped = contb_over.groupby('cand_nm')
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D1D57CBE08>
  • 可以用 for 方法遍历出分组的结果

3.2.1、不同职业对他两人的支持度

  • 由于职业和雇主的处理非常相似,我们定义函数get_top_amounts()对两个字段进行分析处理
  • 首先统计各区间的赞助笔数,这里用到unstack(),stack()函数是堆叠,unstack()函数是不要堆叠,即把多层索引变为表格数据
def get_top_amounts(grouped, key, n):
    #先分组,grouped,然后继续再分
    return grouped.groupby(key)['contb_receipt_amt'].sum().sort_values(ascending=False)[:n]

grouped = contb_over.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_occupation', 7).unstack(level=0)
cand_nmObama, BarackRomney, Mitt
contbr_occupation
ATTORNEY14302461.845372424.02
CEONaN2324297.03
CONSULTANT2459912.71NaN
EXECUTIVENaN2300947.03
HOMEMAKER4248875.808147446.22
NOT PROVIDE13725187.3211638509.84
PHYSICIAN3735124.94NaN
PRESIDENTNaN2491244.89
PROFESSOR2165071.08NaN
RETIRED25305316.3811508473.59

结论:从数据可以看出,Obama更受精英群体(律师、医生、咨询顾问)的欢迎,Romney则得到更多企业家或企业高管的支持


3.2.2、不同公司对他两人的支持度

  • 同样,使用get_top_amounts()对雇主进行分析处理
def get_top_amounts(grouped, key, n):
    #先分组,grouped,然后继续再分
    return grouped.groupby(key)['contb_receipt_amt'].sum().sort_values(ascending=False)[:n]

grouped = contb_over.groupby('cand_nm')
grouped.apply(get_top_amounts, 'contbr_employer', 7)
cand_nm        contbr_employer
Obama, Barack  RETIRED            22694558.85
               SELF-EMPLOYED      18626807.16
               NOT PROVIDE        13883494.03
               NOT EMPLOYED        8586308.70
               HOMEMAKER           2605408.54
               STUDENT              318831.45
               VOLUNTEER            257104.00
Romney, Mitt   NOT PROVIDE        12321731.24
               RETIRED            11506225.71
               HOMEMAKER           8147196.22
               SELF-EMPLOYED       7414115.22
               STUDENT              496490.94
               CREDIT SUISSE        281150.00
               MORGAN STANLEY       267266.00
Name: contb_receipt_amt, dtype: float64

结论: Obama:微软,盛德国际律师事务所;Romney:瑞士瑞信银行,摩根斯坦利,高盛公司,巴克莱资本,H.I.G资本

3.3、统计各区间的赞助金额

3.3.1、查看区间金额

labels = pd.cut(contb_vs['contb_receipt_amt'], bins)
contb_vs.groupby(['cand_nm', labels]).size().unstack(level=0, fill_value=0)
cand_nmObama, BarackRomney, Mitt
contb_receipt_amt
(0, 1]49377
(1, 10]400703681
(10, 100]37228031853
(100, 1000]15399243357
(1000, 10000]2228426186
(10000, 100000]21
(100000, 1000000]30
(1000000, 10000000]40

3.3.2、绘制Obama和Romney各区间赞助的总金额

amt_vs = contb_vs.groupby(['cand_nm', labels]).sum().unstack(level=0, fill_value=0)
amt_vs.fillna(0, inplace=True)
amt_vs.plot(kind='bar', figsize=(12,6))

在这里插入图片描述

3.3.3、过滤掉大金额

# 过滤掉大金额
amt_vs[:-2].plot(kind='bar', figsize=(12,6))

在这里插入图片描述

3.3.4、百分比堆积图

算出每个区间两个候选人收到赞助总金额比:

amt_vs.div(amt_vs.sum(axis = 1), axis=0)[:-2]
contb_receipt_amt
cand_nmObama, BarackRomney, Mitt
contb_receipt_amt
(0, 1]0.8051820.194818
(1, 10]0.9187670.081233
(10, 100]0.9107690.089231
(100, 1000]0.7101770.289823
(1000, 10000]0.4473260.552674
(10000, 100000]0.8231200.176880

绘制百分比堆积图:

  • 参数 stacked=True
amt_vs.div(amt_vs.sum(axis = 1), axis=0)[:-2].plot(kind='bar', stacked=True, figsize=(12, 6))

在这里插入图片描述



4、时间处理

contb_vs.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
411Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER25.001-FEB-12Republican
412Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER110.001-FEB-12Republican
413Romney, MittCARLSEN, RICHARDAEDEFENSE INTELLIGENCE AGENCYINTELLIGENCE ANALYST250.013-APR-12Republican
414Romney, MittDELUCA, PIERREAECISCOENGINEER30.021-AUG-11Republican
415Romney, MittSARGENT, MICHAELAERAYTHEON TECHNICAL SERVICES CORPCOMPUTER SYSTEMS ENGINEER100.007-MAR-12Republican
  • 在上面的结果中,我们可以看出时间格式为 01-FEB-12 ,拥有英文符号,说明格式是字符串,不能直接用于运算。

4.1、查看数据类型

contb_vs.dtypes
cand_nm               object
contbr_nm             object
contbr_st             object
contbr_employer       object
contbr_occupation     object
contb_receipt_amt    float64
contb_receipt_dt      object
party                 object
dtype: object

4.2、str转datetime

contb_vs['contb_receipt_dt'] = pd.to_datetime(contb_vs['contb_receipt_dt'])
contb_vs.head()
cand_nmcontbr_nmcontbr_stcontbr_employercontbr_occupationcontb_receipt_amtcontb_receipt_dtparty
411Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER25.02012-02-01Republican
412Romney, MittELDERBAUM, WILLIAMAAUS GOVERNMENTFOREIGN SERVICE OFFICER110.02012-02-01Republican
413Romney, MittCARLSEN, RICHARDAEDEFENSE INTELLIGENCE AGENCYINTELLIGENCE ANALYST250.02012-04-13Republican
414Romney, MittDELUCA, PIERREAECISCOENGINEER30.02011-08-21Republican
415Romney, MittSARGENT, MICHAELAERAYTHEON TECHNICAL SERVICES CORPCOMPUTER SYSTEMS ENGINEER100.02012-03-07Republican
  • 现在时间是 2012-02-01 的形式,格式为 datetime64

5、各州支持率

5.1、数据分组

根据州和候选人进行分组:

state_vs = contb_vs.groupby(['cand_nm','contbr_st'])['contb_receipt_amt'].sum().unstack(level=0)
state_vs
cand_nmObama, BarackRomney, Mitt
contbr_st
AA56405.00135.00
AB2048.00NaN
AE42973.755680.00
AK281840.1586204.24
AL543123.48527303.51
.........
WI1130155.46270316.32
WV169154.47126725.12
WY194046.74252595.84
XXNaN400250.00
ZZ5963.00NaN

数据可视化:

state_vs.fillna(0, inplace=True)
state_vs.plot(kind='bar', figsize=(16, 6))

在这里插入图片描述

5.2、候选人各州金额占比

# 所占的比例
state_vs_rate = state_vs.div(state_vs.sum(axis=1), axis=0)
state_vs_rate
cand_nmObama, BarackRomney, Mitt
contbr_st
AA0.9976120.002388
AB1.0000000.000000
AE0.8832570.116743
AK0.7657780.234222
AL0.5073900.492610
.........
WI0.8069820.193018
WV0.5717000.428300
WY0.4344560.565544
XX0.0000001.000000
ZZ1.0000000.000000

5.3、删除不存在的州

state_vs_rate.drop(labels=['AA', 'AB', 'AE', 'NOT PROVIDE'], inplace=True)

5.4、绘制地图

5.4.1、导入相关库

from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Polygon #导入多边形包
from matplotlib.colors import rgb2hex #rgb2hex表示16进制的颜色

5.4.2、同一绘制不同颜色方法

# 着色
cmap = plt.cm.Reds #blues,0cean....
for i in range(10):
    print ((i+1) / 10)
    plt.plot(np.arange(10) + i, c=cmap((i+1) / 10))

输出结果:

0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0

在这里插入图片描述

5.4.3、查看Obama各州均值

obama = state_vs_rate['Obama, Barack']
obama
contbr_st
AK    0.765778
AL    0.507390
AP    0.957329
AR    0.772902
AS    1.000000
        ...   
WI    0.806982
WV    0.571700
WY    0.434456
XX    0.000000
ZZ    1.000000
Name: Obama, Barack, Length: 64, dtype: float64

5.4.3、绘制美国地图

'''
关键词	描述
llcrnrlon	所需地图域的左下角经度(度)。
llcrnrlat	所需地图域的左下角纬度(度)。
urcrnrlon	所需地图域右上角的经度(度)。
urcrnrlat	所需地图域右上角的纬度(度)。
'''
plt.figure(figsize=(12, 9))
m = Basemap(llcrnrlon = -122,
           llcrnrlat = 23.41,
           urcrnrlon = -64,
           urcrnrlat = 45,
           projection = 'lcc',
           lat_1 = 30,
           lon_0 = -100
            )

m.drawcoastlines(linewidth=1.5) #海岸线
m.drawcountries(linewidth=1.5) # 国家
# m.drawstates() #直接画出州

# 读取美国地图的现状,m中就有了各州的形状,数据
m.readshapefile('./USA/gadm36_USA_1', name='states')

colors = []
states = []

cmap = plt.cm.Reds
#州全称对应缩写缩写
abbr = {'Commonwealth of Kentucky':'KY','Commonwealth of Massachusetts':'MA','Commonwealth of Pennsylvania':'PA',
        'State of Rhode Island and Providence Plantations':'RI'}

for shapeinfo in m.states_info:
    a = shapeinfo['VARNAME_1'] #结果结构AL|Ala,AK|Alaska
#     州的缩写
    s = a.split('|')[0] # 结果AL,AK表示州的缩写
    try:
        rate = obama[s] # 取出obama的州对应的值
        colors.append(cmap(rate)) #转化颜色
        states.append(s) #州的简称
    except:
        colors.append(cmap(obama[abbr[s]]))#有些州没有检查,需要进行替换
        states.append(s)
# 州填充颜色
# seg州中的一部分区域,多边形
ax = plt.gca()
for n,seg in enumerate(m.states):
    c = rgb2hex(colors[n])
    poly = Polygon(seg,color = c )
    ax.add_patch(poly)
plt.show()

在这里插入图片描述

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值