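This competition is hosted by the open-source learning organization Datawhale. It walks learners through data analysis and data visualization in Python in four parts: dataset processing, data exploration and cleaning, data analysis, and data visualization, using third-party libraries such as pandas, matplotlib, and wordcloud.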
1.3 Packages to install in advance
# Install wordcloud, the package used to generate word clouds
!pip install wordcloud --user
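Before moving on, it can help to confirm that the libraries this tutorial relies on are importable in the current environment. The snippet below is a minimal sketch, not part of the original notebook; the `required` list simply mirrors the libraries named in the introduction.

```python
import importlib.util

# Libraries named in the introduction; adjust the list to your own setup.
required = ["pandas", "matplotlib", "wordcloud"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If anything is reported missing, install it with `pip install <name>` in the same way as wordcloud above.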
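2. Data processing
The goal is to analyze the relationship between candidates and donors, which requires a single table in which every donor is matched to a candidate. To get there, the three existing tables need to be joined to one another and consolidated into the data we need.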
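2.1 Match committees to candidates by joining the two tables on CAND_ID
The candidate-committee linkage table contains no candidate names, only candidate IDs (CAND_ID). We therefore use CAND_ID to look up each candidate's name in the candidate table, which gives us the final candidate-committee linkage table ccl.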
# Import the required packages
import pandas as pd

# Read the candidate information; the raw file has no header row, so column names are supplied here
candidates = pd.read_csv("weball20.txt", sep='|', names=[
    'CAND_ID', 'CAND_NAME', 'CAND_ICI', 'PTY_CD', 'CAND_PTY_AFFILIATION', 'TTL_RECEIPTS',
    'TRANS_FROM_AUTH', 'TTL_DISB', 'TRANS_TO_AUTH', 'COH_BOP', 'COH_COP', 'CAND_CONTRIB',
    'CAND_LOANS', 'OTHER_LOANS', 'CAND_LOAN_REPAY', 'OTHER_LOAN_REPAY', 'DEBTS_OWED_BY',
    'TTL_INDIV_CONTRIB', 'CAND_OFFICE_ST', 'CAND_OFFICE_DISTRICT', 'SPEC_ELECTION', 'PRIM_ELECTION',
    'RUN_ELECTION', 'GEN_ELECTION', 'GEN_ELECTION_PRECENT', 'OTHER_POL_CMTE_CONTRIB', 'POL_PTY_CONTRIB',
    'CVG_END_DT', 'INDIV_REFUNDS', 'CMTE_REFUNDS'])

# Read the candidate-committee linkage information
ccl = pd.read_csv("ccl.txt", sep='|', names=[
    'CAND_ID', 'CAND_ELECTION_YR', 'FEC_ELECTION_YR', 'CMTE_ID', 'CMTE_TP', 'CMTE_DSGN', 'LINKAGE_ID'])

# Join the two tables; with no key given, pd.merge joins on the only shared column, CAND_ID
ccl = pd.merge(ccl, candidates)

# Keep only the columns we need
ccl = pd.DataFrame(ccl, columns=['CMTE_ID', 'CAND_ID', 'CAND_NAME', 'CAND_PTY_AFFILIATION'])
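To make the join behavior easier to see, here is a minimal sketch on toy data. The frames `candidates_demo` and `ccl_demo` and all of their values are made up for illustration and do not come from the FEC files. The sketch names the key explicitly with `on="CAND_ID"` and adds `validate="m:1"`, which asks pandas to raise an error if a CAND_ID appears more than once in the candidate table; both are optional additions, not something the original notebook does.

```python
import pandas as pd

# Hypothetical stand-in for the candidate table (one row per candidate).
candidates_demo = pd.DataFrame({
    "CAND_ID": ["X1", "X2", "X3"],
    "CAND_NAME": ["DOE, JANE", "ROE, RICHARD", "POE, ALEX"],
    "CAND_PTY_AFFILIATION": ["DEM", "REP", "IND"],
})

# Hypothetical stand-in for the linkage table (a candidate may have several committees).
ccl_demo = pd.DataFrame({
    "CAND_ID": ["X1", "X2", "X2", "X9"],
    "CMTE_ID": ["C001", "C002", "C003", "C004"],
})

# Inner join on CAND_ID: linkage rows without a matching candidate (X9) are dropped,
# and candidates without a committee (X3) do not appear in the result.
merged_demo = pd.merge(ccl_demo, candidates_demo, on="CAND_ID", how="inner", validate="m:1")

# Keep the same four columns as the tutorial's ccl table.
merged_demo = merged_demo[["CMTE_ID", "CAND_ID", "CAND_NAME", "CAND_PTY_AFFILIATION"]]
print(merged_demo)
```

Because the default `how="inner"` silently drops unmatched rows, comparing row counts before and after the merge is a cheap sanity check on the real data.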
Field descriptions:
- CMTE_ID: committee ID
- CAND_ID: candidate ID
- CAND_NAME: candidate name
- CAND_PTY_AFFILIATION: candidate's party affiliation
# View the first 10 rows of ccl
ccl.head(10)
index | CMTE_ID | CAND_ID | CAND_NAME | CAND_PTY_AFFILIATION |
---|---|---|---|---|
0 | C00697789 | H0AL01055 | CARL, JERRY LEE, JR | REP |
1 | C00701557 | H0AL01063 | LAMBERT, DOUGLAS WESTLEY III | REP |
2 | C00701409 | H0AL01071 | PRINGLE, CHRISTOPHER PAUL | REP |
3 | C00703066 | H0AL01089 | HIGHTOWER, BILL | REP |
4 | C00708867 | H0AL01097 | AVERHART, JAMES | DEM |
5 | C00710947 | H0AL01105 | GARDNER, KIANI A | DEM |
6 | C00722512 | H0AL01121 | CASTORANI, JOHN | REP |
7 | C00725069 | H0AL01139 | COLLINS, FREDERICK G. RICK' | DEM |
8 | C00462143 | H0AL02087 | ROBY, MARTHA | REP |
9 | C00493783 | H0AL02087 | ROBY, MARTHA | REP |
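Rows 8 and 9 of the preview show the same candidate (ROBY, MARTHA, H0AL02087) linked to two different committees, so one candidate can map to several CMTE_ID values. The sketch below reproduces that check on three rows copied from the preview; the variable names `ccl_sample`, `cmte_per_cand`, and `multi` are illustrative, and the same lines could be run against the full ccl table.

```python
import pandas as pd

# Three rows copied from the ccl preview (rows 0, 8, and 9).
ccl_sample = pd.DataFrame({
    "CMTE_ID": ["C00697789", "C00462143", "C00493783"],
    "CAND_ID": ["H0AL01055", "H0AL02087", "H0AL02087"],
    "CAND_NAME": ["CARL, JERRY LEE, JR", "ROBY, MARTHA", "ROBY, MARTHA"],
})

# Count distinct committees per candidate.
cmte_per_cand = ccl_sample.groupby("CAND_ID")["CMTE_ID"].nunique()

# Candidates linked to more than one committee.
multi = cmte_per_cand[cmte_per_cand > 1]
print(multi)

# In this sample each committee appears only once; whether that also holds
# for the full table is worth checking before using CMTE_ID as a join key.
cmte_unique = ccl_sample["CMTE_ID"].is_unique
print("CMTE_ID unique in sample:", cmte_unique)
```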
Match candidates to donors