import numpy as np
import pandas as pd
import math
Task 1: Diversity of a company's revenue
A company's revenue diversity across industries can be defined, by analogy with information entropy, as an income entropy indicator:

I = -∑ p(x)·log₂ p(x)

where p(x) is the share of the company's revenue from one industry in its total revenue across all industries for that year. company.csv lists the companies and years to compute; company_data.csv holds each company's revenue amounts by category and year. Using the latter table, add a column to the former giving each company's income entropy indicator I for that year.
Approach:
1. Turn the company_data table into one holding the stock code (reformatted), the year, and the entropy value
2. The stock codes in the two tables use different formats (one stores strings like "#000007", the other plain ints), so they must be brought into a common format
3. Keep only the year part of the date
4. Group and total the revenues per firm-year, then combine the totals with the per-category rows to compute the values
Be sure to cleanly exclude the rows with negative revenue
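The entropy computation at the heart of these steps can be sketched first on toy numbers (the function name `income_entropy` is my own, not from the exercise):

```python
import numpy as np

# Hypothetical helper: income entropy of one firm-year from its segment
# revenues, I = -sum p(x) * log2 p(x), dropping non-positive revenues first.
def income_entropy(revenues):
    r = np.asarray(revenues, dtype=float)
    r = r[r > 0]                       # exclude negative/zero revenues
    p = r / r.sum()                    # p(x): each segment's share of the total
    return float(-(p * np.log2(p)).sum())

print(income_entropy([10, 10, 10, 10]))  # four equal segments -> 2.0
```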
df_com = pd.read_csv('练习题数据/Company.csv')
print(df_com.dtypes)
df_com.head(2)
证券代码 object
日期 int64
dtype: object
|   | 证券代码 | 日期 |
|---|---------|------|
| 0 | #000007 | 2014 |
| 1 | #000403 | 2015 |
df_com_data = pd.read_csv('练习题数据/Company_data.csv')
df_com_data.head(2)
|   | 证券代码 | 日期 | 收入类型 | 收入额 |
|---|---------|------------|---------|--------------|
| 0 | 1 | 2008/12/31 | 1 | 1.084218e+10 |
| 1 | 1 | 2008/12/31 | 2 | 1.259789e+10 |
1. First convert the dates to years. At first glance the 日期 column looked like a date type, but it actually stores strings, so slicing off the first four characters is enough.
A = list(df_com_data['日期'])
A = [i[:4] for i in A]
df_com_data['日期'] = A
df_com_data.head(2)
|   | 证券代码 | 日期 | 收入类型 | 收入额 |
|---|---------|------|---------|--------------|
| 0 | 1 | 2008 | 1 | 1.084218e+10 |
| 1 | 1 | 2008 | 2 | 1.259789e+10 |
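The three lines above can also be written without the intermediate list, using pandas' vectorized string accessor; a toy frame stands in for Company_data.csv here:

```python
import pandas as pd

# Same truncation as the list comprehension, done column-wise with .str
df = pd.DataFrame({'日期': ['2008/12/31', '2009/06/30']})
df['日期'] = df['日期'].str[:4]
print(df['日期'].tolist())  # ['2008', '2009']
```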
2. Left-pad the stock codes with zeros and prepend "#". Since the stored codes are ints, convert them to str first and then use zfill. The rows with non-positive 收入额 are also filtered out here.
B = df_com_data['证券代码']
B = ["#"+str(i).zfill(6) for i in B]
df_com_data['证券代码'] = B
df_com_data = df_com_data[df_com_data['收入额']>0]
df_com_data.head(2)
|   | 证券代码 | 日期 | 收入类型 | 收入额 |
|---|---------|------|---------|--------------|
| 0 | #000001 | 2008 | 1 | 1.084218e+10 |
| 1 | #000001 | 2008 | 2 | 1.259789e+10 |
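The same padding can be done with the `.str` accessor instead of a list comprehension (toy ints in place of the real column):

```python
import pandas as pd

# '#' + zero-padded six-digit code, vectorized
codes = pd.Series([1, 403])
padded = '#' + codes.astype(str).str.zfill(6)
print(padded.tolist())  # ['#000001', '#000403']
```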
3. Sum 收入额 grouped by 证券代码 and 日期, after checking for negative revenue values and clearing them out as outliers. Because groupby moves the grouping keys into the index by default, as_index=False is used here to keep them as columns, which makes the later cleanup easier.
gb_data = df_com_data.groupby(['证券代码','日期'],as_index=False)['收入额'].sum()
gb_data = gb_data.rename(columns={"收入额":"收入总额"})
gb_data.head(2)
|   | 证券代码 | 日期 | 收入总额 |
|---|---------|------|--------------|
| 0 | #000001 | 2008 | 7.402176e+10 |
| 1 | #000001 | 2009 | 4.771900e+10 |
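A pattern worth knowing here: `groupby(...).transform('sum')` broadcasts each group's total straight back onto its rows, so the separate sum-then-merge step becomes unnecessary. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'证券代码': ['#000001'] * 3,
                   '日期': ['2008'] * 3,
                   '收入额': [10.0, 20.0, 70.0]})
# each row receives its firm-year total without a separate merge
df['收入总额'] = df.groupby(['证券代码', '日期'])['收入额'].transform('sum')
print(df['收入总额'].tolist())  # [100.0, 100.0, 100.0]
```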
4. Now join the aggregated totals back onto company_data, then compute each p(i).
def Pi(x):
    Income = x['收入额']
    Income_ALL = x['收入总额']
    Pi_value = Income / Income_ALL
    return Pi_value.sum()

merge_data = df_com_data.merge(gb_data, on=['证券代码','日期'], how='left')
com_gb = merge_data.groupby(['证券代码','日期','收入类型'])[['收入额','收入总额']]
com_gb = pd.DataFrame(com_gb.apply(Pi))
com_gb1 = com_gb.reset_index()
com_gb1 = com_gb1.rename(columns={0: 'P'})
com_gb1['logP'] = com_gb1.apply(lambda x: math.log(x['P'], 2), axis=1)
com_gb1['P*logP'] = com_gb1.apply(lambda x: x['P'] * x['logP'], axis=1)
5. Drop 收入类型, group by 证券代码 and 日期, and sum to obtain the I value we want; finally merge it into df_com.
com_gb2 = com_gb1.groupby(['证券代码','日期'])['P*logP'].sum()
com_gb3 = pd.DataFrame(com_gb2)
com_gb3 = com_gb3.reset_index()
com_gb3 = com_gb3.rename(columns = {'P*logP':'I'})
com_gb3['日期'] = com_gb3['日期'].astype(int)  # the assignment back is essential
com_gb3.head()
com_gb3.dtypes
证券代码 object
日期 int32
I float64
dtype: object
Note: I hit a problem here: the 日期 columns of the two tables had different dtypes, and they must be made consistent before merge can match them.
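A minimal reproduction of the dtype-mismatch pitfall, assuming one key column is int64 and the other holds strings (recent pandas versions refuse to merge an int64 key against an object key, while older ones silently match nothing):

```python
import pandas as pd

left = pd.DataFrame({'日期': [2014]})                   # int64 key
right = pd.DataFrame({'日期': ['2014'], 'I': [-4.43]})  # object (string) key
right['日期'] = right['日期'].astype(int)               # align dtypes first
merged = left.merge(right, on='日期', how='left')
print(merged['I'].tolist())  # [-4.43]
```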
com_data = df_com.merge(com_gb3, on = ['证券代码','日期'], how='left')
com_data.head()
|   | 证券代码 | 日期 | I |
|---|---------|------|-----------|
| 0 | #000007 | 2014 | -4.429740 |
| 1 | #000403 | 2015 | -4.025963 |
| 2 | #000408 | 2016 | -4.066295 |
| 3 | #000408 | 2017 | NaN |
| 4 | #000426 | 2015 | -4.449655 |
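For reference, steps 3–5 collapse into a single groupby once the non-positive revenues are filtered out. The sketch below, on toy data, mirrors the P*logP sum computed above (note this is the negative of Shannon entropy, which is why the I values come out negative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'证券代码': ['#000001', '#000001', '#000002', '#000002'],
                     '日期':    [2008, 2008, 2008, 2008],
                     '收入额':  [50.0, 50.0, 25.0, 75.0]})

def entropy_sum(s):
    p = s / s.sum()                 # p(x) within one firm-year
    return (p * np.log2(p)).sum()   # sum of P*logP, as in the steps above

I = data.groupby(['证券代码', '日期'])['收入额'].apply(entropy_sum)
print(I.loc[('#000001', 2008)])  # two equal segments -> -1.0
```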
Task 2: Reshaping the study-group roster
Transform the study groups' team-information table into the form below, where the 是否队长 column is 1 for the captain and 0 for everyone else.
1. Read the data with read_excel and compare it with the target. The source spreads the member IDs and nicknames across many column pairs; these pairs must be turned into rows while keeping 队伍名称, and the 编号/昵称 slots that hold no value must be dropped.
df = pd.read_excel('练习题数据/组队信息汇总表(Pandas).xlsx')
df.head(2)
|   | 所在群 | 队伍名称 | 队长编号 | 队长_群昵称 | 队员1 编号 | 队员_群昵称 | 队员2 编号 | 队员_群昵称.1 | 队员3 编号 | 队员_群昵称.2 | ... | 队员6 编号 | 队员_群昵称.5 | 队员7 编号 | 队员_群昵称.6 | 队员8 编号 | 队员_群昵称.7 | 队员9 编号 | 队员_群昵称.8 | 队员10编号 | 队员_群昵称.9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pandas数据分析 | 你说的都对队 | 5 | 山枫叶纷飞 | 6 | 蔡 | 7.0 | 安慕希 | 8.0 | 信仰 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Pandas数据分析 | 熊猫人 | 175 | 鱼呲呲 | 44 | Heaven | 37.0 | 吕青 | 50.0 | 余柳成荫 | ... | 25.0 | Never say never | 55.0 | K | 120.0 | Y. | 28.0 | X.Y.Q | 151.0 | swrong |
2 rows × 24 columns
colum = ['所在群','队伍名称','编号-0','昵称-0','编号-1','昵称-1','编号-2','昵称-2','编号-3','昵称-3','编号-4','昵称-4','编号-5','昵称-5','编号-6','昵称-6','编号-7','昵称-7','编号-8','昵称-8','编号-9','昵称-9','编号-10','昵称-10']
df.columns=colum
df.head(2)
|   | 所在群 | 队伍名称 | 编号-0 | 昵称-0 | 编号-1 | 昵称-1 | 编号-2 | 昵称-2 | 编号-3 | 昵称-3 | ... | 编号-6 | 昵称-6 | 编号-7 | 昵称-7 | 编号-8 | 昵称-8 | 编号-9 | 昵称-9 | 编号-10 | 昵称-10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pandas数据分析 | 你说的都对队 | 5 | 山枫叶纷飞 | 6 | 蔡 | 7.0 | 安慕希 | 8.0 | 信仰 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Pandas数据分析 | 熊猫人 | 175 | 鱼呲呲 | 44 | Heaven | 37.0 | 吕青 | 50.0 | 余柳成荫 | ... | 25.0 | Never say never | 55.0 | K | 120.0 | Y. | 28.0 | X.Y.Q | 151.0 | swrong |
2 rows × 24 columns
2. Wide-to-long: use wide_to_long. First tidy the column names, then delete the leftover empty rows. After the reshape the 编号 column comes out as float and must be cast back with astype(int); when flagging captains, be sure to convert the captain IDs to a list before the membership test.
df1 = pd.wide_to_long(df,
                      stubnames=['编号','昵称'],
                      i=['所在群','队伍名称'],
                      j='编码',
                      sep='-',
                      suffix='.+')
df2 = df1[df1['编号'].notnull()]
df2 = df2.reset_index()
df2 = df2[['队伍名称','编号','昵称']]
df2['编号'] = df2['编号'].astype(int)
df2.head()
|   | 队伍名称 | 编号 | 昵称 |
|---|----------|------|------|
| 0 | 你说的都对队 | 5 | 山枫叶纷飞 |
| 1 | 你说的都对队 | 6 | 蔡 |
| 2 | 你说的都对队 | 7 | 安慕希 |
| 3 | 你说的都对队 | 8 | 信仰 |
| 4 | 你说的都对队 | 20 | biubiu🙈🙈 |
list1 = list(df['编号-0'].astype(int))  # be sure to convert to a list here
df2['是否队长'] = df2.apply(lambda x: 1 if x['编号'] in list1 else 0, axis=1)  # axis=1: row-wise
df2.head()
|   | 队伍名称 | 编号 | 昵称 | 是否队长 |
|---|----------|------|------|----------|
| 0 | 你说的都对队 | 5 | 山枫叶纷飞 | 1 |
| 1 | 你说的都对队 | 6 | 蔡 | 0 |
| 2 | 你说的都对队 | 7 | 安慕希 | 0 |
| 3 | 你说的都对队 | 8 | 信仰 | 0 |
| 4 | 你说的都对队 | 20 | biubiu🙈🙈 | 0 |
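A minimal, self-contained wide_to_long example with the same parameter roles as the call above (two toy member slots instead of eleven; the suffix regex is narrowed to digits):

```python
import pandas as pd

wide = pd.DataFrame({'队伍名称': ['你说的都对队'],
                     '编号-0': [5], '昵称-0': ['山枫叶纷飞'],
                     '编号-1': [6], '昵称-1': ['蔡']})
# stubnames are the column prefixes; j collects the '-N' suffix
long = pd.wide_to_long(wide, stubnames=['编号', '昵称'],
                       i='队伍名称', j='成员序号', sep='-', suffix=r'\d+')
print(sorted(long.reset_index()['编号'].tolist()))  # [5, 6]
```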
3. Put the columns in the same order as the target table
(a shout-out here to the lovely members of my team~)
cols = ['是否队长','队伍名称','昵称','编号']
df_data = df2.loc[:,cols]
df_data[df_data['队伍名称'] == '不急不躁我最棒✌️']
|    | 是否队长 | 队伍名称 | 昵称 | 编号 |
|----|----------|----------|------|------|
| 82 | 1 | 不急不躁我最棒✌️ | Lyndsey | 2 |
| 83 | 0 | 不急不躁我最棒✌️ | Roman. | 90 |
| 84 | 0 | 不急不躁我最棒✌️ | 李松泽 Orwell | 91 |
| 85 | 0 | 不急不躁我最棒✌️ | L. | 115 |
| 86 | 0 | 不急不躁我最棒✌️ | 阿涂 | 163 |
| 87 | 0 | 不急不躁我最棒✌️ | YHY | 68 |
| 88 | 0 | 不急不躁我最棒✌️ | 佬仔 | 62 |
| 89 | 0 | 不急不躁我最棒✌️ | JWJ | 30 |
| 90 | 0 | 不急不躁我最棒✌️ | 👀 | 154 |
Task 3: US presidential election votes
The two tables give, respectively, the population of each US county and the county-level votes in the presidential election. Answer the following questions:
county = pd.read_csv('练习题数据/county_population.csv')
county.head(2)
|   | US County | Population |
|---|-----------|------------|
| 0 | .Autauga County, Alabama | 55869 |
| 1 | .Baldwin County, Alabama | 223234 |
vote = pd.read_csv('练习题数据/president_county_candidate.csv')
vote.head(2)
|   | state | county | candidate | party | total_votes | won |
|---|-------|--------|-----------|-------|-------------|-----|
| 0 | Delaware | Kent County | Joe Biden | DEM | 44552 | True |
| 1 | Delaware | Kent County | Donald Trump | REP | 41009 | False |
Question 1: How many counties have a total vote count exceeding half of their population?
1. Looking at the data, the US County field of county combines the county and state fields of vote, so first build a shared key from those fields, then select the rows that satisfy the condition.
vote['US County'] = vote.apply(lambda x :'.'+x['county']+', '+x['state'],axis=1)
vote.head(2)
|   | state | county | candidate | party | total_votes | won | US County |
|---|-------|--------|-----------|-------|-------------|-----|-----------|
| 0 | Delaware | Kent County | Joe Biden | DEM | 44552 | True | .Kent County, Delaware |
| 1 | Delaware | Kent County | Donald Trump | REP | 41009 | False | .Kent County, Delaware |
vote_sum = vote.groupby(['US County'])['total_votes'].sum()
vote_sum = vote_sum.to_frame()
vote_sum = vote_sum.reset_index()  # must assign the result back
vote_sum.head()

|   | US County | total_votes |
|---|-----------|-------------|
| 0 | .Abbeville County, South Carolina | 12433 |
| 1 | .Abbot, Maine | 417 |
| 2 | .Abington, Massachusetts | 9660 |
| 3 | .Acadia Parish, Louisiana | 28425 |
| 4 | .Accomack County, Virginia | 16962 |
county_vote_sum = county.merge(vote_sum, on=['US County'], how='left')
county_1 = county_vote_sum[county_vote_sum['total_votes']/county_vote_sum['Population']>0.5]
county_1['US County'].head()
11 .Choctaw County, Alabama
12 .Clarke County, Alabama
13 .Clay County, Alabama
16 .Colbert County, Alabama
17 .Conecuh County, Alabama
Name: US County, dtype: object
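Instead of concatenating a key onto vote, one could equally split county's US County field into its parts with str.extract; a sketch on two toy rows (the regex and the capture-group names are my choices):

```python
import pandas as pd

county = pd.DataFrame({'US County': ['.Autauga County, Alabama',
                                     '.Baldwin County, Alabama']})
# leading dot, then 'county, state'; the state part contains no comma
parts = county['US County'].str.extract(r'^\.(?P<county>.+), (?P<state>[^,]+)$')
print(parts['state'].tolist())  # ['Alabama', 'Alabama']
```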
Question 2: Use state as the row index and the candidates as column names, with the columns ordered by each candidate's nationwide total votes in descending order; each cell holds that candidate's total votes in that state.
1. Understood this way, the task is to pivot the candidates into columns.
vote2 = vote[['state','candidate','total_votes']]
vote2.head(2)
|   | state | candidate | total_votes |
|---|-------|-----------|-------------|
| 0 | Delaware | Joe Biden | 44552 |
| 1 | Delaware | Donald Trump | 41009 |
pivot_vote = vote2.pivot_table(index='state',
                               columns='candidate',
                               values='total_votes',
                               aggfunc='sum')
pivot_vote.head()
| state | None of these candidates | Write-ins | Alyson Kennedy | Bill Hammons | Blake Huber | Brian Carroll | Brock Pierce | Brooke Paige | Christopher LaFontaine | Connie Gammon | ... | Mark Charles | Phil Collins | President Boddie | Princess Jacob-Fambro | Richard Duncan | Ricki Sue King | Rocky De La Fuente | Sheila Samm Tittle | Tom Hoefling | Zachary Scalf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | NaN | 7312.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alaska | NaN | 34210.0 | NaN | NaN | NaN | NaN | 825.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 318.0 | NaN | NaN | NaN |
| Arizona | NaN | 2032.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Arkansas | NaN | NaN | NaN | NaN | NaN | 1713.0 | 2141.0 | NaN | NaN | 1475.0 | ... | NaN | 2812.0 | NaN | NaN | NaN | NaN | 1321.0 | NaN | NaN | NaN |
| California | NaN | 80.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 60155.0 | NaN | NaN | NaN |
5 rows × 38 columns
2. Reorder the columns by total votes: first compute each candidate's nationwide total, then sort, then reindex the columns.
Note: after reset_index() the result must be assigned back to the variable, otherwise the original table is unchanged the next time it is used.
rank = vote2.groupby('candidate')['total_votes'].sum()
rank = rank.to_frame()
rank2 = rank.sort_values('total_votes', ascending = False)
rank2 = rank2.reset_index()
new_rank = list(rank2['candidate'])
state_vote = pivot_vote.loc[:,new_rank]
state_vote.head()
| state | Joe Biden | Donald Trump | Jo Jorgensen | Howie Hawkins | Write-ins | Rocky De La Fuente | Gloria La Riva | Kanye West | Don Blankenship | Brock Pierce | ... | Tom Hoefling | Ricki Sue King | Princess Jacob-Fambro | Blake Huber | Richard Duncan | Joseph Kishore | Jordan Scott | Gary Swing | Keith McCormic | Zachary Scalf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 849648.0 | 1441168.0 | 25176.0 | NaN | 7312.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alaska | 153405.0 | 189892.0 | 8896.0 | NaN | 34210.0 | 318.0 | NaN | NaN | 1127.0 | 825.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Arizona | 1672143.0 | 1661686.0 | 51465.0 | NaN | 2032.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Arkansas | 423932.0 | 760647.0 | 13133.0 | 2980.0 | NaN | 1321.0 | 1336.0 | 4099.0 | 2108.0 | 2141.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| California | 11109764.0 | 6005961.0 | 187885.0 | 81025.0 | 80.0 | 60155.0 | 51036.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 38 columns
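The ordering can also be read straight off the pivot itself, since its column sums are exactly the nationwide totals; a toy pivot illustrates the idiom:

```python
import pandas as pd

pivot = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, None], 'C': [3.0, 4.0]},
                     index=['s1', 's2'])
# column sums = nationwide totals; sort them and reindex the columns
order = pivot.sum().sort_values(ascending=False).index
print(list(pivot[order].columns))  # ['B', 'C', 'A']
```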
Question 3: Each state contains a number of counties. Define a county's BT indicator as Biden's vote share in that county minus Trump's vote share there. If the median BT over all of a state's counties is greater than 0, the state is called a Biden State. Find all Biden States.
1. First filter the data down to just Joe Biden and Donald Trump using isin, then pivot the two candidates into columns, which makes it easy to attach the county vote totals and compute the BT medians afterwards.
df_vote = vote[vote['candidate'].isin(['Joe Biden','Donald Trump'])]
df_vote1 = df_vote[['state','county','candidate','total_votes']]
df_vote2 = df_vote1.pivot_table(index=['state','county'],
                                columns='candidate',
                                values='total_votes',
                                aggfunc='sum')
df_vote2.head()
| state | county | Donald Trump | Joe Biden |
|-------|--------|--------------|-----------|
| Alabama | Autauga County | 19838 | 7503 |
| Alabama | Baldwin County | 83544 | 24578 |
| Alabama | Barbour County | 5622 | 4816 |
| Alabama | Bibb County | 7525 | 1986 |
| Alabama | Blount County | 24711 | 2640 |
2. Compute the vote shares: first find each county's total votes, merge that into the table above, turn each candidate's votes into a share, and then compute the county's BT indicator.
county_vote = vote.groupby(['state','county'])['total_votes'].sum()
county_rate = df_vote2.merge(county_vote, on=['state','county'] ,how='left')
county_rate['Donald Trump'] =county_rate['Donald Trump']/county_rate['total_votes']
county_rate['Joe Biden'] =county_rate['Joe Biden']/county_rate['total_votes']
county_rate['BT'] = county_rate['Joe Biden'] - county_rate['Donald Trump']
county_rate.head()
| state | county | Donald Trump | Joe Biden | total_votes | BT |
|-------|--------|--------------|-----------|-------------|-----|
| Alabama | Autauga County | 0.714368 | 0.270184 | 27770 | -0.444184 |
| Alabama | Baldwin County | 0.761714 | 0.224090 | 109679 | -0.537623 |
| Alabama | Barbour County | 0.534512 | 0.457882 | 10518 | -0.076631 |
| Alabama | Bibb County | 0.784263 | 0.206983 | 9595 | -0.577280 |
| Alabama | Blount County | 0.895716 | 0.095694 | 27588 | -0.800022 |
3. Compute the median BT over each state's counties
rate_date = county_rate.groupby('state')['BT'].median()
rate_date = rate_date.reset_index()
BidenState = rate_date[rate_date['BT']>0]
BidenState
|    | state | BT |
|----|-------|-----|
| 4  | California | 0.084957 |
| 6  | Connecticut | 0.082626 |
| 7  | Delaware | 0.040712 |
| 8  | District of Columbia | 0.895536 |
| 11 | Hawaii | 0.321145 |
| 21 | Massachusetts | 0.251015 |
| 30 | New Jersey | 0.066718 |
| 39 | Rhode Island | 0.136207 |
| 45 | Vermont | 0.242156 |
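The same three steps can be compressed into a Series-based form, sketched here on a toy state with two counties (each county's third-party votes omitted for brevity):

```python
import pandas as pd

vote = pd.DataFrame({'state': ['S'] * 4,
                     'county': ['c1', 'c1', 'c2', 'c2'],
                     'candidate': ['Joe Biden', 'Donald Trump'] * 2,
                     'total_votes': [60, 40, 30, 70]})
share = vote.set_index(['state', 'county', 'candidate'])['total_votes']
# per-county vote share, then Biden minus Trump, then state medians
share = share / share.groupby(level=['state', 'county']).transform('sum')
bt = share.unstack('candidate')
bt = bt['Joe Biden'] - bt['Donald Trump']
biden_states = bt.groupby(level='state').median()
print(round(biden_states.loc['S'], 6))  # median of 0.2 and -0.4 -> -0.1
```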