Datawhale Midterm Assessment

Latest recommended article published on 2022-07-01 21:00:16

qq_42776962

Article link: https://blog.csdn.net/qq_42776962/article/details/112069210

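This post covers the Datawhale midterm assessment tasks: Task1 analyzes the diversity of company revenue, Task2 reshapes a team-learning information table, and Task3 examines voting in the US presidential election. For each task, the author shares the code used and the results obtained.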

Pandas

Task1: Diversity of Company Revenue
Task2: Reshaping the Team-Learning Information Table
Task3: US Presidential Election Voting

Problem statement: http://datawhale.club/t/topic/579/4

Task1: Diversity of Company Revenue

Code

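# Task 1: revenue entropy per company and year
import pandas as pd
import numpy as np

df1 = pd.read_csv('company.csv')
df2 = pd.read_csv('company_data.csv')
df1.head(5)
df2.head(5)

df11 = df1.copy()
# Align the 证券代码 (security code) format across both tables: drop the leading character, cast to int
df11['证券代码'] = df11['证券代码'].str[1:].astype('int64')
# Align the 日期 (date) format across both tables: keep only the four-character year prefix
df2['日期'] = df2['日期'].str[:4].astype('int64')

def entropy(x):
    # Shannon entropy (base 2) of the revenue shares; NaN when no revenue value is nonzero
    if x.any():
        p = x / x.sum()
        return -(p * np.log2(p)).sum()
    return np.nan

res = (
    df11.merge(df2, on=['证券代码', '日期'], how='left')
        .groupby(['证券代码', '日期'])['收入额']  # 收入额: revenue amount
        .apply(entropy)
        .rename('收入熵指标')  # 收入熵指标: revenue entropy index
        .reset_index()
)
res.head(5)

# Attach the result by key rather than by position: groupby sorts and
# deduplicates the keys, so the rows of res need not line up with df1
df1['收入熵指标'] = df11.merge(res, on=['证券代码', '日期'], how='left')['收入熵指标'].to_numpy()
df1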
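
To make the entropy step easier to check by hand, here is a minimal sketch on made-up data. The column names `code` and `revenue` and all values are hypothetical and are not taken from company_data.csv; only the `entropy` function mirrors the one used above. Four equal revenue sources should give an entropy of 2 bits, a single source should give 0, and a group with no nonzero revenue should give NaN.

```python
import numpy as np
import pandas as pd

def entropy(x: pd.Series) -> float:
    """Base-2 Shannon entropy of the shares x / x.sum(); NaN if x has no nonzero value."""
    if x.any():
        p = x / x.sum()
        return -(p * np.log2(p)).sum()
    return np.nan

# Hypothetical toy data: three companies with different revenue structures
toy = pd.DataFrame({
    "code": [1, 1, 1, 1, 2, 3, 3],
    "revenue": [25.0, 25.0, 25.0, 25.0, 100.0, 0.0, 0.0],
})

# One entropy value per company, indexed by code
result = toy.groupby("code")["revenue"].apply(entropy)
print(result)
```

The more evenly revenue is spread across sources, the higher the value, which is why it can serve as a diversity measure.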

Results

Task2: Reshaping the Team-Learning Information Table

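# Task 2: reshape the team table from wide to long
import pandas as pd
import numpy as np

df = pd.read_excel('team_data.xlsx')  # the workbook was renamed to team_data, hence this filename
df.drop(columns='所在群', inplace=True)  # the 所在群 (group) column is not used, so drop it
df.head(5)

# Rename the columns to <stub>_<suffix>. Leader suffixes end in 1 and member suffixes in 0,
# so the last character can later serve as the 是否队长 (is-leader) flag.
col_1 = np.array(['队伍名称', '编号_leader01', '昵称_leader01'])
col_2 = np.array([[f'编号_member{i}0', f'昵称_member{i}0'] for i in range(1, 11)]).flatten()
df.columns = np.r_[col_1, col_2]
df.head(5)

res = pd.wide_to_long(
    df.reset_index(),
    stubnames=['昵称', '编号'],  # 昵称: nickname, 编号: ID number
    i=['index', '队伍名称'],  # 队伍名称: team name
    j='是否队长',
    sep='_',
    suffix='.+',
).dropna().reset_index().drop(columns='index')
res

res['是否队长'], res['编号'] = res['是否队长'].str[-1], res['编号'].astype('int64')
res.reindex(columns=['是否队长', '队伍名称', '昵称', '编号'])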
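
The reshaping relies on `pd.wide_to_long` matching column names of the form `<stub>_<suffix>`. The following is a small sketch of that mechanism on a made-up table; the column names (`team`, `id`, `name`, `role`, `is_leader`) and all values are hypothetical stand-ins for the real spreadsheet, and the leader flag is derived here by comparing the suffix to `"leader"` rather than by taking its last character.

```python
import pandas as pd

# Hypothetical toy table: one leader and up to two members per team
wide = pd.DataFrame({
    "team": ["A", "B"],
    "id_leader": [1, 4],
    "name_leader": ["Ann", "Dan"],
    "id_member1": [2.0, 5.0],
    "name_member1": ["Bob", "Eve"],
    "id_member2": [3.0, None],   # team B has no second member
    "name_member2": ["Cy", None],
})

# Columns "id_*" and "name_*" are stacked; the part after "_" becomes the "role" level
tidy = pd.wide_to_long(
    wide,
    stubnames=["id", "name"],
    i="team",
    j="role",
    sep="_",
    suffix=".+",
).dropna().reset_index()

tidy["is_leader"] = (tidy["role"] == "leader").astype(int)
tidy["id"] = tidy["id"].astype("int64")  # safe after dropna removed the empty slots
tidy = tidy[["is_leader", "team", "name", "id"]]
print(tidy)
```

The `suffix=".+"` argument matters because the default pattern only matches numeric suffixes, so `leader` and `member1` would otherwise be ignored.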

Results

Task3: US Presidential Election Voting

Code

# Task 3: US presidential election votes by county
import pandas as pd

df1 = pd.read_csv('president_county_candidate.csv')
df2 = pd.read_csv('county_population.csv')
df1.head(5)
df2.head(5)

# Part 1: counties where total votes exceed half of the population
sum_vote = df1.groupby(['county', 'state'])['total_votes'].sum().reset_index()
# Build the merge key as '.<county>, <state>' to match df2['US County']
sum_vote['US County'] = '.' + sum_vote['county'] + ', ' + sum_vote['state']
df4 = sum_vote.drop(columns=['county', 'state'])
df_12 = df2.merge(df4, on='US County', how='left')
df_12[df_12['total_votes'] / df_12['Population'] > 0.5].count(0)

# Part 2: state-by-candidate table, columns ordered by nationwide total votes
columns = df1.groupby('candidate')['total_votes'].sum().sort_values(ascending=False).index
# aggfunc='sum' yields total votes per state; the pivot_table default is the mean
result = df1.pivot_table(index='state', columns='candidate', values='total_votes', aggfunc='sum')
result.reindex(columns=columns)

# Part 3: states whose median county-level Biden-minus-Trump vote share is positive
df1['县总票数'] = df1.groupby(['state', 'county'])['total_votes'].transform('sum')  # county total votes
df1['县得票率'] = df1['total_votes'] / df1['县总票数']  # candidate's share of the county vote
df_bt = df1.pivot(index=['state', 'county'], columns='candidate', values='县得票率')
s_bt = df_bt['Joe Biden'] - df_bt['Donald Trump']
result3 = s_bt.rename('BT指标').reset_index()  # BT指标: the BT index

def function(x):
    if x.median() > 0:
        return 'Biden State'
    else:
        return 'Not Biden State'

result = result3.groupby('state')['BT指标'].transform(function)
result3[result == 'Biden State']['state'].drop_duplicates().reset_index(drop=True)
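
The last two steps can be verified on a tiny made-up dataset. In this sketch the candidates `X` and `Y`, the states, counties, and all vote counts are hypothetical. It assumes that "total votes per state" means a sum, hence `aggfunc="sum"`, and it takes the per-state median with `groupby(...).median()` instead of `transform`, which should select the same states.

```python
import pandas as pd

# Hypothetical toy votes: two states, five counties, two candidates
votes = pd.DataFrame({
    "state": ["S1", "S1", "S1", "S1", "S2", "S2", "S2", "S2", "S2", "S2"],
    "county": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e"],
    "candidate": ["X", "Y"] * 5,
    "total_votes": [60, 40, 70, 30, 10, 90, 45, 55, 80, 20],
})

# State-by-candidate totals, columns ordered by nationwide total (descending)
order = votes.groupby("candidate")["total_votes"].sum().sort_values(ascending=False).index
state_totals = votes.pivot_table(
    index="state", columns="candidate", values="total_votes", aggfunc="sum"
).reindex(columns=order)

# County vote share per candidate, then the X-minus-Y margin per county
county_total = votes.groupby(["state", "county"])["total_votes"].transform("sum")
votes["share"] = votes["total_votes"] / county_total
shares = votes.pivot(index=["state", "county"], columns="candidate", values="share")
margin = (shares["X"] - shares["Y"]).rename("margin")

# A state counts for X when the median county margin is positive
median_margin = margin.groupby(level="state").median()
x_states = median_margin[median_margin > 0].index.tolist()

print(state_totals)
print(x_states)
```

In this toy data S2 has one county that X wins by a wide margin, yet its median county margin is negative, so only S1 is selected. That illustrates how the median-based index differs from simply comparing statewide totals.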