【DS实践 | Coursera】Assignment4 | Introduction to Data Science in Python

Mart_inn

已于 2022-01-20 13:37:40 修改

阅读量1.3k

点赞数 2

分类专栏： Coursera Data Science with Python 文章标签： python 机器学习数据分析数据挖掘

于 2022-01-04 22:33:42 首次发布

本文链接：https://blog.csdn.net/mart_inn/article/details/122312148

版权

Data Science with Python 同时被 2 个专栏收录

16 篇文章 8 订阅

订阅专栏

Coursera

6 篇文章 1 订阅

订阅专栏

一、问题分析

1.1 问题描述

Description
In this assignment you must read in a file of metropolitan regions and associated sports teams from assets/wikipedia_data.html and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the “Big 4”: NFL (football, in assets/nfl.csv), MLB (baseball, in assets/mlb.csv), NBA (basketball, in assets/nba.csv or NHL (hockey, in assets/nhl.csv). Please keep in mind that all questions are from the perspective of the metropolitan region, and that this file is the “source of authority” for the location of a given sports team. Thus teams which are commonly known by a different area (e.g. “Oakland Raiders”) need to be mapped into the metropolitan region given (e.g. San Francisco Bay Area). This will require some human data understanding outside of the data you’ve been given (e.g. you will have to hand-code some names, and might need to google to find out where teams are)!

For each sport I would like you to answer the question: what is the win/loss ratio’s correlation with the population of the city it is in? Win/Loss ratio refers to the number of wins over the number of wins plus the number of losses. Remember that to calculate the correlation with pearsonr, so you are going to send in two ordered lists of values, the populations from the wikipedia_data.html file and the win/loss ratio for a given sport in the same order. Average the win/loss ratios for those cities which have multiple teams of a single sport. Each sport is worth an equal amount in this assignment (20%*4=80%) of the grade for this assignment. You should only use data from year 2018 for your analysis – this is important!

Notes

Do not include data about the MLS or CFL in any of the work you are doing, we’re only interested in the Big 4 in this assignment.
I highly suggest that you first tackle the four correlation questions in order, as they are all similar and worth the majority of grades for this assignment. This is by design!
It’s fair game to talk with peers about high level strategy as well as the relationship between metropolitan areas and sports teams. However, do not post code solving aspects of the assignment (including such as dictionaries mapping areas to teams, or regexes which will clean up names).
There may be more teams than the assert statements test, remember to collapse multiple teams in one city into a single value!

1.2 问题分析

处理html与csv数据，生成Dataframe格式

(对于csv文件的读取可见csv模块与python基础 )

(对于Pandas中DataFrame的产生与处理，可见Pandas中Series & DataFrame的结构、创建、查询、修改语法与实例 )

我们可以发现Q1-Q4都是要计算2018年中Big4中某个比赛的胜/负率与人口的相关系数，我们首先将assets中的csv文件数据导入，得到DataFrame格式的表，查看发现里面只有胜/负率相关的数据，没有球队所属城市和城市人口，好在前文给出了建议，从维基百科的网页上寻找城市人口数量相关的数据，并给出了html文件：wikipedia_data.html，那么我们要做的就是根据html上城市人口数量与球队与城市的关联制成另一张DataFrame格式的表。

用正则表达式规范队伍名

(对于正则表达式的用法，可见re模块与正则表达式 )

但是我们发现两个DataFrame中球队的格式规范在html文件和csv文件中又是不一样的，这时我们就需要用正则表达式regex来进行规范了，html中是一个城市对应多个球队名，在读取的时候把多个球队名连在了一起，需要分开，有的名字有脚注，需要去除，且只有名字(eg.‘Red Sox’)，而csv中则是全称(eg.‘Boston Red Sox’)但全称的前几个单词又不一定是所属城市名，见描述中的例子。本文用到的方法首先得到html生成的DataFrame中各个队伍名的长度，如果队伍名长度为1，则只匹配csv生成DataFrame的队伍名拆分后的最后一个单词，若队伍名为2，则匹配倒数两个，以此类推。

合并两张表并分组计算

(对于DataFrame的合并与分组计算，可见DataFrame的合并、分组聚合与数据透视表 )

将html得到的DataFrame和csv得到的DataFrame做内连接，得到Big4各联赛中各支球队所属的城市及城市的人口数量，关联键应为球队名。

合并后将数据转化成int型，再按城市分组，将各个城市的胜场与负场分别相加（如果先求平均值再聚合，得到的结果可能与我的结果不同，我看有人在论坛上问为什么数值很接近但没通过就是这个原因）再计算每个城市的胜/负率，再结合stats.pearsonr就可以计算相关的皮尔森系数了

计算T检验p-value矩阵

(对于T检验的数学理论与实现，可见假设检验、显著性水平、T检验的数学理论与scipy.stats模块实现)

为了得到T检验的p-value矩阵，我们只要将Big4各联赛的城市名与比赛胜/负率做T检验的配对变量，两辆配对检验一下就可以了。

二、具体代码与注释

2.1 Q1-Q4通用：得到合并后的城市与胜率

import pandas as pd
import numpy as np
import scipy.stats as stats
import re

def get_WLrate_Population(Big4):
    #传入参数为Big4名
    
    #读取html文件
    cities=pd.read_html("assets/wikipedia_data.html")[1]
    cities=cities.iloc[:-1,[0,3,5,6,7,8]]

    #1、处理《城市-人口-队伍表》
    #改名
    cities.rename(columns={'Population (2016 est.)[8]':'Population'},inplace=True)
    
    #去掉中括号及内容
    cities.replace(to_replace='\[[\w ]*\]$',value='',regex=True,inplace=True)
    
    #选择城市、人口、NHL队伍名
    cities=cities[['Metropolitan area','Population',Big4]]
    
    #删掉为空和'-'值的数据
    cities.replace(to_replace='—',value=np.nan,regex=True,inplace=True)
    cities.replace(to_replace='',value=np.nan,regex=True,inplace=True)
    cities.dropna(inplace=True)
    
    #重置index并删除原index
    cities.reset_index(inplace=True)
    cities.drop('index',axis=1,inplace=True)
    
    #针对一个城市产生多支队伍的情况开始匹配，得到“城市_队伍”的list，其元素为[城市名,队伍名]
    pattern='([A-Z]{0,2}[a-z0-9]+ [A-Z]{0,2}[a-z]+|[A-Z]{0,2}[a-z0-9]+)'
    cty_team=[]
    for i in range(len(cities)):
        a=re.findall(pattern,cities.iloc[i,2])
        #print(a,len(a))
        for j in a:
            cty_team.append([cities.iloc[i,0],j])
            
    #将队伍名按空格拆分，方便计算队伍名的长度        
    for i in cty_team:
        i[1]=re.split(' ',i[1])
    #print(len(cty_team))
    
    #2、处理《城市-队伍-战绩》 
    
    #取18年的战队数据并去掉分区信息  
    big4_df=pd.read_csv('assets/'+Big4.lower()+'.csv')
    big4_df2018=big4_df[big4_df['year']==2018]
    
    #提取'W'、'L'和'team'列
    big4_df2018=big4_df2018[['W','L','team']]
    
    #去掉*和?号，获得新的列'team_name'
    #注意：结尾有空值键，如果不去掉空值键无法匹配'*$'
    big4_df2018['team'].replace(to_replace='\([0-9]{0,2}\)$',value='',regex=True,inplace=True)
    big4_df2018['team'].replace(to_replace='[\*+]?[\s]?$',value='',regex=True,inplace=True)

    #将所有的team名按空格进行分割
    lists=[]
    for i in range(len(big4_df2018['team'])):
        lists.append(re.split(' ',big4_df2018['team'][i]))
    #print(lists)
    
    #添加分割后的列‘team_name’
    big4_df2018['team_name']=lists
    #big4_df2018['team_name']=big4_df2018['team'].str.extract('([\w]*$)')

    #增添城市信息，逻辑如下：
    #如果cty_team中队伍名字长度为1，匹配最后一个元素相同即可返回城市，若果队伍长度为2，匹配最后两个元素相同再返回城市
    def get_region(item):
        for i in cty_team:
            if len(i[-1])==1 and i[-1][-1]==item[-1] :
                return i[0]
            elif len(i[-1])==2 and i[-1][-1]==item[-1] and i[-1][-2]==item[-2]:
                return i[0]
    big4_df2018['Region']=big4_df2018['team_name'].apply(get_region)

    # #将两张表合并
    merge1=pd.merge(big4_df2018,cities,left_on='Region',right_on='Metropolitan area')
    #先将数据转化成int型，后按城市分类并求和
    merge1['W']=merge1['W'].astype('int')
    merge1['L']=merge1['L'].astype('int')
    
    #分组聚合求的胜场和、负场和与城市人口
    merge2=merge1.groupby('Region').agg({'W':np.sum,'L':np.sum,'Population':np.unique})
    merge2['W/L rate']=merge2['W']/(merge2['L']+merge2['W'])
    merge2['Population']=merge2['Population'].astype('int')
    
    #返回城市胜率与城市人口的DataFrame
    return merge2

文中提到的不去掉空值符，可能出现以下问题，这样就去不掉*号了：
在这里插入图片描述

2.2 Q1求解

取出对应的列计算皮尔森系数即可，后面一样：

def nhl_correlation(): 
    merge2=get_WLrate_Population('NHL')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nhl_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q1: Your lists must be the same length"
    assert len(population_by_region) == 28, "Q1: There should be 28 teams being analysed for NHL"
    
    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nhl_correlation()

解为：0.012308996455744264
所以NHL球队胜率与人口关系并不具有相关性。

2.3 Q2求解

def nba_correlation():
    merge2=get_WLrate_Population('NBA')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nba_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q2: Your lists must be the same length"
    assert len(population_by_region) == 28, "Q2: There should be 28 teams being analysed for NBA"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nba_correlation()

解为：-0.17657160252844614
所以NBA球队胜率与人口关系并不具有相关性。

2.4 Q3求解

def mlb_correlation(): 
    merge2=get_WLrate_Population('MLB')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nfl_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q3: Your lists must be the same length"
    assert len(population_by_region) == 26, "Q3: There should be 26 teams being analysed for MLB"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
mlb_correlation()

解为：0.1505230448710485
所以MLB球队胜率与人口关系并不具有相关性。

2.5 Q4求解

def nfl_correlation(): 

    merge2=get_WLrate_Population('NFL')
    population_by_region = merge2['Population'] # pass in metropolitan area population from cities
    win_loss_by_region = merge2['W/L rate'] # pass in win/loss ratio from nfl_df in the same order as cities["Metropolitan area"]

    assert len(population_by_region) == len(win_loss_by_region), "Q4: Your lists must be the same length"
    assert len(population_by_region) == 29, "Q4: There should be 29 teams being analysed for NFL"

    return stats.pearsonr(population_by_region, win_loss_by_region)[0]
nfl_correlation()

解为：0.004922112149349409
所以NFL球队胜率与人口关系并不具有相关性。

2.6 Q5求解

用双重for循环得到p-value矩阵即可

import pandas as pd
import numpy as np
import scipy.stats as stats
#import rehttps://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html

mlb_df=pd.read_csv("assets/mlb.csv")
nhl_df=pd.read_csv("assets/nhl.csv")
nba_df=pd.read_csv("assets/nba.csv")
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

def sports_team_performance():
    # Note: p_values is a full dataframe, so df.loc["NFL","NBA"] should be the same as df.loc["NBA","NFL"] and
    # df.loc["NFL","NFL"] should return np.nan
    sports = ['NFL', 'NBA', 'NHL', 'MLB']
    p_values = pd.DataFrame({k:np.nan for k in sports}, index=sports)
    
    for i in sports:
        for j in sports:
            if i!=j:
                Mi=get_WLrate_Population(i)
                Mj=get_WLrate_Population(j)
                Mi=Mi['W/L rate']
                Mj=Mj['W/L rate']
                merge=pd.merge(Mi,Mj,how='inner',left_index=True,right_index=True)
                #print(merge)
                p_values.loc[i, j]=stats.ttest_rel(merge['W/L rate_x'],merge['W/L rate_y'])[1]
    assert abs(p_values.loc["NBA", "NHL"] - 0.02) <= 1e-2, "The NBA-NFL p-value should be around 0.02"
    assert abs(p_values.loc["MLB", "NFL"] - 0.80) <= 1e-2, "The MLB-NFL p-value should be around 0.80"
    return p_values
sports_team_performance()