Comprehensive Exercise

biohsliu

Category: datawhale. Tags: python, pandas

Article link: https://blog.csdn.net/qq_33249277/article/details/112061225

Study reference: http://datawhale.club/t/topic/579

[Task 1] Diversity of Company Revenue

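[Problem] The diversity of a company's revenue across industries can be measured with a revenue entropy index, defined by analogy with information entropy:

$$I = -\sum_{i} p(x_i)\log_2 p(x_i)$$

where p(x_i) is the share of one industry's revenue in the company's total revenue across all industries for that year. company.csv lists the companies and years for which the index is needed; company_data.csv holds the company, the revenue amount of each type, and the revenue year. Using the data in the second table, add a column to the first table giving the revenue entropy index I of each company in each year.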
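
To make the definition concrete, here is a minimal sketch on made-up numbers. The helper name `revenue_entropy` and the revenue values are illustrative assumptions, not part of the dataset; the base-2 logarithm follows the solution code in this post.

```python
import numpy as np

def revenue_entropy(amounts):
    """Base-2 entropy of revenue shares (hypothetical helper for illustration)."""
    amounts = np.asarray(amounts, dtype=float)
    shares = amounts / amounts.sum()      # p(x_i): share of each industry
    shares = shares[shares > 0]           # treat 0 * log2(0) as 0
    return float(-(shares * np.log2(shares)).sum())

even = revenue_entropy([25, 25, 25, 25])   # four industries with equal revenue
single = revenue_entropy([100, 0, 0, 0])   # all revenue from one industry
print(even, single)
```

Four equal shares give an entropy of 2, while revenue concentrated in a single industry gives 0, so a larger I means more diversified revenue.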

import numpy as np
import pandas as pd
df1 = pd.read_csv('../data/第一次综合练习-数据集/任务一/company.csv')
df2 = pd.read_csv('../data/第一次综合练习-数据集/任务一/company_data.csv')
df1.head()

df2.head()

df1['code'] = df1['证券代码'].str.replace('#[0]*','',regex=True).astype('int64')  # new integer column `code`: drop the '#' prefix and leading zeros
df1.head()
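
The effect of that regular expression can be checked on a few sample strings. This is a small sketch; the code values below are made up and only assume the `#` plus zero-padded digits format implied by the pattern above.

```python
import pandas as pd

# Made-up security codes in the assumed '#' + zero-padded format
raw = pd.Series(['#000007', '#000403', '#600000'])

# '^#0*' matches the leading '#' and any zeros that directly follow it
codes = raw.str.replace(r'^#0*', '', regex=True).astype('int64')
print(codes.tolist())
```

Only the zeros directly after `#` are removed, so zeros inside or at the end of a code are preserved.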

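def income_entropy(x):
    # x: one row of the pivoted table (revenue by type for one company-year)
    income_sum = np.abs(x.sum())    # absolute value of the total revenue
    ratio = np.abs(x)/income_sum    # share of each revenue type
    # the sum skips NaN, so missing revenue types do not contribute
    entropy = -1*(np.sum(ratio*np.log2(ratio)))
    return entropy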

df2_w = df2.pivot(index=['证券代码','日期'], columns='收入类型', values='收入额')
df2_w1 = df2_w.apply(income_entropy,axis=1).to_frame()
df2_w1.reset_index(inplace = True)
df2_w1.head()
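
As an alternative to pivoting first, the same per-company, per-year aggregation can be sketched with `groupby`, which does not require each (company, year, type) combination to be unique. The toy rows and the helper name `entropy_of` are assumptions for illustration; the column names are the ones used in this post.

```python
import numpy as np
import pandas as pd

# Toy long-format data with the same column names as company_data.csv
toy = pd.DataFrame({
    '证券代码': [1, 1, 1, 2, 2],
    '日期': ['2010/12/31'] * 5,
    '收入类型': [1, 2, 3, 1, 2],
    '收入额': [50.0, 25.0, 25.0, 10.0, 10.0],
})

def entropy_of(amounts):
    # shares of the absolute amounts, so they always sum to 1
    p = amounts.abs() / amounts.abs().sum()
    p = p[p > 0]                       # skip zero shares (0 * log2(0) -> 0)
    return -(p * np.log2(p)).sum()

result = (toy.groupby(['证券代码', '日期'])['收入额']
             .apply(entropy_of)
             .rename('收入熵')
             .reset_index())
print(result)
```

Note that this sketch normalises by the sum of absolute amounts, which is a design choice that differs slightly from `income_entropy` above when negative amounts are present.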

df2_w1['日期'] = df2_w1['日期'].str.replace('/12/31','',regex=True).astype('int64')
df2_w1['证券代码'] = df2_w1['证券代码'].astype('int64')
df2_w1.columns = ['code','日期','收入熵']
df2_w1.head()

df3 = df1.merge(df2_w1, on=['code','日期'], how='left')
df3.shape  #(1048, 4)
df3.head()

[Task 2] Reshaping the Team-Learning Information Table

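[Problem] Reshape the team information table for the team-learning program into a long format with one row per member and the columns 昵称 (nickname), 是否队长 (is leader), 队伍名称 (team name) and 编号 (ID), where 是否队长 is 1 for a team leader and 0 otherwise.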

file2 = '../data/第一次综合练习-数据集/任务二/组队信息汇总表（Pandas）.xlsx'
df = pd.read_excel(file2)
df.head()

from openpyxl import load_workbook

name_dict = {}    # {nickname: team name}
id_dict = {}      # {nickname: ID}
leader_dict = {}  # {nickname: 1 if team leader, else 0}

workbook = load_workbook(file2, data_only=True)
booksheet = workbook.active  # the workbook has a single worksheet
# min_row=2 skips the header row; values_only=True yields cell values directly
for row in booksheet.iter_rows(min_row=2, values_only=True):
    line_list = list(row)[1:]  # drop the first column
    team_name = line_list[0]
    # after the team name, cells alternate ID, nickname; drop empty cells
    id_list = [v for v in line_list[1::2] if v is not None]
    name_list = [v for v in line_list[2::2] if v is not None]
    for index, (member_id, name) in enumerate(zip(id_list, name_list)):
        name_dict[name] = team_name
        id_dict[name] = member_id
        leader_dict[name] = 1 if index == 0 else 0  # the first pair is the leader

df = pd.DataFrame([leader_dict,name_dict,id_dict]).T.reset_index()
df.columns = ['昵称',"是否队长","队伍名称","编号"]
df
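
The same wide-to-long reshaping can also be sketched in pandas alone with `pd.wide_to_long`. The table below is a toy stand-in: the column names (`team`, `id_0`, `name_0`, ...) and the values are hypothetical, and the real spreadsheet would first need its columns renamed to such a stub-plus-number pattern.

```python
import pandas as pd

# Toy wide table; slot 0 is the leader, later slots are members (may be empty)
wide = pd.DataFrame({
    'team': ['Alpha', 'Beta'],
    'id_0': [5, 7], 'name_0': ['Ann', 'Bob'],
    'id_1': [6, None], 'name_1': ['Cat', None],
})

# Stack the id_* and name_* column groups; the numeric suffix becomes 'slot'
tidy = pd.wide_to_long(wide, stubnames=['id', 'name'],
                       i='team', j='slot', sep='_').reset_index()
tidy = tidy.dropna(subset=['name'])                 # remove empty member slots
tidy['id'] = tidy['id'].astype('int64')
tidy['is_leader'] = (tidy['slot'].astype(int) == 0).astype(int)
tidy = (tidy.sort_values(['team', 'slot'])
            [['is_leader', 'team', 'name', 'id']]
            .reset_index(drop=True))
print(tidy)
```

Unlike the dictionary approach, this keeps one row per (team, slot), so two members who share a nickname would not overwrite each other.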

[Task 3] Voting in the US Presidential Election

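[Problem] Two tables give the population of each US county and the county-level election results. Answer the following questions:

1. How many counties have a total vote count exceeding half of the county population?
2. Build a table with state as the row index and candidates as the column names, with the columns ordered by each candidate's nationwide vote total from highest to lowest; each cell holds the total votes that candidate received in that state.
3. Each state contains a number of counties. Define a county's BT index as Biden's vote share in that county minus Trump's vote share in that county. If the median BT index over all counties of a state is greater than 0, the state is called a Biden State. Find all Biden States.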

file_pop = "../data/第一次综合练习-数据集/任务三/county_population.csv"
file_vote = "../data/第一次综合练习-数据集/任务三/president_county_candidate.csv"
df_pop = pd.read_csv(file_pop)
df_pop.head()

df_vote = pd.read_csv(file_vote)
df_vote.head()

#1. How many counties have a total vote count exceeding half of the county population?
df_pop = pd.read_csv(file_pop)
print(df_pop.shape)
df_pop.head()

df_pop['US County'] = df_pop['US County'].str.lstrip('.')  # remove the leading '.' from county names
df_pop.head()

df_vote = pd.read_csv(file_vote)
df_vote.head()

df_pop_vote = df_vote.groupby(['state','county'])['total_votes'].sum().to_frame().reset_index()
df_pop_vote['US County'] = df_pop_vote['county']+', '+df_pop_vote['state']
print(df_pop_vote.shape)
df_pop_vote.head()

df_pop_vote = df_pop_vote.merge(df_pop,on = 'US County',how = 'inner')
df_pop_vote['vote_rate'] = df_pop_vote['total_votes']/df_pop_vote['Population']
df_pop_vote.head()

df_pop_vote.loc[df_pop_vote['vote_rate']> 0.5,'US County']  # 1419 counties in total have a turnout above 0.5
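
Because the tables are joined with an inner merge on a constructed name string, any county whose name is spelled differently in the two files is silently dropped from the count. A minimal sketch of how such mismatches could be audited with `indicator=True` follows; the county names and numbers are made up.

```python
import pandas as pd

# Made-up rows: one name matches, one exists only in each table
votes = pd.DataFrame({'US County': ['A County, X', 'B Parish, Y'],
                      'total_votes': [60, 30]})
pop = pd.DataFrame({'US County': ['A County, X', 'C County, Z'],
                    'Population': [100, 50]})

# An outer merge keeps every row; '_merge' records which side each row came from
check = votes.merge(pop, on='US County', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['US County', '_merge']]
print(unmatched)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, which is worth checking before trusting the final count.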

#2. State as the row index and candidates as the columns, ordered by nationwide vote total (descending); each cell is the candidate's total votes in that state
df_candidate = df_vote.groupby(['candidate','state'])['total_votes'].sum().to_frame().reset_index()
df_candidate_towide = df_candidate.pivot(index = 'state', columns='candidate', values='total_votes')
df_candidate_towide.head()

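# nationwide total per candidate, added as an extra row named 'state vote count'
candidate_vote_count = df_candidate_towide.sum(axis=0).rename('state vote count')
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df_candidate_vote = pd.concat([df_candidate_towide, candidate_vote_count.to_frame().T])
df_candidate_vote.tail()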

df_candidate_vote.sort_values(by='state vote count',axis=1,ascending = False)  # sort the columns by the values in the given row
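
A more compact variant, sketched here on toy data, aggregates with `pivot_table` and then reorders the columns by their totals, so no helper row has to be appended to the result. The states, counties, candidates and vote counts below are made up; only the column names follow the vote table used in this post.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['X', 'X', 'X', 'Y', 'Y'],
    'county': ['a', 'a', 'b', 'c', 'c'],
    'candidate': ['P', 'Q', 'P', 'P', 'Q'],
    'total_votes': [10, 40, 5, 20, 30],
})

# Sum county-level votes into one cell per (state, candidate)
wide = toy.pivot_table(index='state', columns='candidate',
                       values='total_votes', aggfunc='sum', fill_value=0)

# Order the columns by nationwide total, highest first
order = wide.sum().sort_values(ascending=False).index
wide = wide[order]
print(wide)
```

Here candidate Q has the larger nationwide total, so its column comes first.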

#3. A county's BT index is Biden's vote share minus Trump's vote share in that county; a state whose median county BT index is greater than 0 is a Biden State. Find all Biden States
df_state_vote = df_vote.groupby(['state','county'])['total_votes'].sum().to_frame().reset_index()
df_BT = df_vote.groupby(['state','county','candidate'])['total_votes'].sum().to_frame().reset_index()
df_BT = df_BT.query('candidate == ["Donald Trump","Joe Biden"]').rename(columns={'total_votes':'votes'})
print(df_BT.shape)
df_BT.head()

df_BT = df_BT.merge(df_state_vote,on=['state','county'], how='left')
df_BT['vote ratio'] = df_BT['votes']/df_BT['total_votes']
df_BT.head()

df_BT = df_BT.pivot(index = ['state','county'],columns='candidate', values='vote ratio').reset_index()
df_BT['BT_index'] = df_BT['Joe Biden'] - df_BT['Donald Trump']
df_BT.head()

# median BT index per state; selecting the single numeric column avoids the
# tuple-style groupby indexing that pandas 2.0 and later reject
df_BT_state = df_BT.groupby('state')['BT_index'].median().to_frame()
df_BT_state.head()

df_biden = df_BT_state.query('BT_index > 0')  # separate name so Task 1's df1 is not overwritten
df_biden.reset_index()['state']  # 9 Biden States in total
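
The whole BT-index pipeline can be traced on a toy example. This sketch uses made-up states, counties and vote counts, and computes each county's vote share with `groupby(...).transform('sum')` instead of a separate merge; that substitution is a choice made here for brevity, not the method used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['X'] * 6 + ['Y'] * 4,
    'county': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e'],
    'candidate': ['Joe Biden', 'Donald Trump'] * 5,
    'total_votes': [60, 40, 30, 70, 55, 45, 20, 80, 45, 55],
})

# Vote share of each candidate within its county
county_total = toy.groupby(['state', 'county'])['total_votes'].transform('sum')
ratios = (toy.assign(share=toy['total_votes'] / county_total)
             .pivot(index=['state', 'county'], columns='candidate', values='share'))

# BT index per county, then the median per state
bt = ratios['Joe Biden'] - ratios['Donald Trump']
median_bt = bt.groupby(level='state').median()
biden_states = median_bt[median_bt > 0].index.tolist()
print(biden_states)
```

In this toy data, state X has county BT values of 0.2, -0.4 and 0.1 (median 0.1), while state Y has -0.6 and -0.1 (median -0.35), so only X qualifies.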
