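Udacity Baseball Data Analysis Project

Height and weight characteristics of baseball players

I obtained a data set of physical measurements for baseball players born between 1820 and 1995. I am interested in the heights and weights of players from different places, how they change over time, and how they relate to player lifespan. The sections below analyze each of these.

Questions:

1. The distribution of players by birthplace
2. How players' height and weight change with birth year
3. The relationship between players' lifespan and their height and weight

Here, the players' height and weight are the dependent variables, and birth year and birthplace are the independent variables.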
# Import libraries

# -*- coding: utf-8 -*-
from __future__ import division  # __future__ imports must come before all other imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the data

def read_csv(filename):
    # Read a CSV file into a DataFrame
    data=pd.read_csv(filename)
    return data
player_df=read_csv('Master.csv')
#stars_df=read_csv('AllstarFull.csv')

Let's first look at the structure of the loaded data

player_df.head()
    playerID  birthYear  birthMonth  birthDay birthCountry birthState   birthCity  deathYear  deathMonth  deathDay  ...
0  aardsda01     1981.0        12.0      27.0          USA         CO      Denver        NaN         NaN       NaN  ...
1  aaronha01     1934.0         2.0       5.0          USA         AL      Mobile        NaN         NaN       NaN  ...
2  aaronto01     1939.0         8.0       5.0          USA         AL      Mobile     1984.0         8.0      16.0  ...
3   aasedo01     1954.0         9.0       8.0          USA         CA      Orange        NaN         NaN       NaN  ...
4   abadan01     1972.0         8.0      25.0          USA         FL  Palm Beach        NaN         NaN       NaN  ...

   ...  nameLast       nameGiven  weight  height bats throws      debut  finalGame   retroID    bbrefID
0  ...   Aardsma     David Allan   220.0    75.0    R      R   2004/4/6  2015/8/23  aardd001  aardsda01
1  ...     Aaron     Henry Louis   180.0    72.0    R      R  1954/4/13  1976/10/3  aaroh101  aaronha01
2  ...     Aaron      Tommie Lee   190.0    75.0    R      R  1962/4/10  1971/9/26  aarot101  aaronto01
3  ...      Aase  Donald William   190.0    75.0    R      R  1977/7/26  1990/10/3  aased001   aasedo01
4  ...      Abad   Fausto Andres   184.0    73.0    L      L  2001/9/10  2006/4/13  abada001   abadan01

5 rows × 24 columns

The column headers have the following meanings:

1.playerID       A unique code assigned to each player.  The playerID links
             the data in this file with records in the other files.
2.birthYear      Year player was born
3.birthMonth     Month player was born
4.birthDay       Day player was born
5.birthCountry   Country where player was born
6.birthState     State where player was born
7.birthCity      City where player was born
8.deathYear      Year player died
9.deathMonth     Month player died
10.deathDay       Day player died
11.deathCountry   Country where player died
12.deathState     State where player died
13.deathCity      City where player died
14.nameFirst      Player's first name
15.nameLast       Player's last name
16.nameGiven      Player's given name (typically first and middle)
17.weight         Player's weight in pounds
18.height         Player's height in inches
19.bats           Player's batting hand (left, right, or both)        
20.throws         Player's throwing hand (left or right)
21.debut          Date that player made first major league appearance

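The data set has many fields, but we only need the player ID, birth year, death year, birth country, state and city, weight and height, so we extract those columns here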

data1_df=player_df[['playerID','birthYear','deathYear','birthCountry','birthState','birthCity','weight','height']]

Let's look at the structure of the new data

data1_df.head()
    playerID  birthYear  deathYear birthCountry birthState   birthCity  weight  height
0  aardsda01     1981.0        NaN          USA         CO      Denver   220.0    75.0
1  aaronha01     1934.0        NaN          USA         AL      Mobile   180.0    72.0
2  aaronto01     1939.0     1984.0          USA         AL      Mobile   190.0    75.0
3   aasedo01     1954.0        NaN          USA         CA      Orange   190.0    75.0
4   abadan01     1972.0        NaN          USA         FL  Palm Beach   184.0    73.0

Next, let's look at the summary statistics of the data

data1_df.describe()
          birthYear    deathYear        weight        height
count  18703.000000  9336.000000  17975.000000  18041.000000
mean    1930.664118  1963.850364    185.980862     72.255640
std       41.229079    31.506369     21.226988      2.598983
min     1820.000000  1872.000000     65.000000     43.000000
25%     1894.000000  1942.000000    170.000000     71.000000
50%     1936.000000  1966.000000    185.000000     72.000000
75%     1968.000000  1989.000000    200.000000     74.000000
max     1995.000000  2016.000000    320.000000     83.000000

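The summary shows that the players' average height is about 72.26 inches, ranging from 43 to 83 inches; weight ranges from 65 to 320 pounds, with an average of 185.98 pounds

Let's check whether any data is missing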

data1_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18846 entries, 0 to 18845
Data columns (total 8 columns):
playerID        18846 non-null object
birthYear       18703 non-null float64
deathYear       9336 non-null float64
birthCountry    18773 non-null object
birthState      18220 non-null object
birthCity       18647 non-null object
weight          17975 non-null float64
height          18041 non-null float64
dtypes: float64(4), object(4)
memory usage: 1.2+ MB


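The output shows that the weight, height, birth year and death year columns are incomplete.
Missing heights and weights will be filled with the preceding value, while rows with a missing birth year need to be dropped.
# Work on an independent copy of the selected columns
data1_df=data1_df.copy()
# Define the fill function
def enfull_ave(letter):
    # ffill returns a new Series, so the result has to be assigned back
    data1_df[letter]=data1_df[letter].ffill()
# Fill missing weights
enfull_ave('weight')
# Fill missing heights
enfull_ave('height')
# Drop rows whose birth year is missing
data1_df=data1_df.dropna(subset=['birthYear'])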
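
Two pandas details matter for this step: `ffill` (like `fillna`) returns a new object instead of changing the column in place, and `dropna(how='all')` only removes rows in which every column is missing. A minimal sketch on a made-up three-row frame (the name `toy` and all values are hypothetical, not taken from Master.csv) illustrates both points:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Master.csv
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01'],
    'birthYear': [1900.0, np.nan, 1950.0],
    'weight': [180.0, np.nan, np.nan],
})

# Without assignment the column is left unchanged
toy['weight'].ffill()
missing_before = int(toy['weight'].isna().sum())

# Assigning the result back actually fills the gaps
toy['weight'] = toy['weight'].ffill()
missing_after = int(toy['weight'].isna().sum())

# how='all' keeps every row here, because playerID is never missing
rows_how_all = len(toy.dropna(how='all'))

# subset=['birthYear'] drops only the row whose birth year is missing
rows_subset = len(toy.dropna(subset=['birthYear']))

print(missing_before, missing_after, rows_how_all, rows_subset)
```

Forward filling copies the weight of whichever player happens to precede the gap in file order, so it is a convenience rather than an estimate of the missing value.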

Now let's analyze how the players are distributed across countries and cities

# Define a few helper functions used below
# Group the players by `name` and count the players in each group
def player_count(data,name):
    return data.groupby(name)['playerID'].count()

# Share of all players that falls into each group
def player_count_rate(data,name):
    b=player_count(data,name)
    a=data['playerID'].count()
    return b/a

# Draw a pie chart
def print_pie(group_data,title):
    group_data.plot.pie(title=title,figsize=(12, 12),autopct='%3.1f%%',startangle =90,legend=True)
# Draw a bar chart
def print_bar(data,title):
    bar=data.plot.bar(title=title,width=10)
    for p in bar.patches:
        bar.annotate('%3.1f%%' % (p.get_height()*100), (p.get_x() * 1.005, p.get_height() * 1.005))
# Draw a scatter plot of the column name1 against the index
def print_plot(data,name1,title):
    x=data.index
    y=data[name1]
    plt.figure(figsize=(12,6))  # create the figure
    plt.plot(x,y,'ro')  # red circular markers, no connecting line
    plt.xlabel("year")
    plt.ylabel(name1)
    plt.title(title)  # figure title
    plt.savefig("line.jpg")  # save before show(); afterwards the current figure may already be closed
    plt.show()  # display the figure

Next, let's look at the share of players born in each country

player_count_rate(data1_df,'birthCountry').sort_values(ascending=False)
birthCountry
USA               0.875730
D.R.              0.034119
Venezuela         0.018094
P.R.              0.013425
CAN               0.012947
Cuba              0.010506
Mexico            0.006261
Japan             0.003290
Panama            0.002918
Ireland           0.002653
United Kingdom    0.002600
Germany           0.002441
Australia         0.001486
South Korea       0.000902
Colombia          0.000902
Nicaragua         0.000743
Curacao           0.000743
V.I.              0.000637
Netherlands       0.000637
Taiwan            0.000584
Russia            0.000424
France            0.000424
Italy             0.000371
Bahamas           0.000318
Aruba             0.000265
Poland            0.000265
Austria           0.000212
Sweden            0.000212
Spain             0.000212
Czech Republic    0.000212
Jamaica           0.000212
Brazil            0.000159
Norway            0.000159
Saudi Arabia      0.000106
At Sea            0.000053
American Samoa    0.000053
Belgium           0.000053
Belize            0.000053
China             0.000053
Viet Nam          0.000053
Denmark           0.000053
Finland           0.000053
Greece            0.000053
Guam              0.000053
Honduras          0.000053
Indonesia         0.000053
Lithuania         0.000053
Philippines       0.000053
Singapore         0.000053
Slovakia          0.000053
Switzerland       0.000053
Afghanistan       0.000053
Name: playerID, dtype: float64

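As shown, the players come from more than 50 countries and regions. The vast majority were born in the USA, which accounts for 87.6%; the next largest shares belong to D.R., Venezuela, P.R., CAN and Cuba, each above 1%. Next, let's look at how the US-born players are distributed across states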
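
The shares above can also be obtained with the built-in `value_counts(normalize=True)`. The sketch below uses a made-up five-player frame (`toy` is hypothetical) and compares it with the same group-count-over-total computation that `player_count_rate` performs:

```python
import pandas as pd

# Hypothetical toy data, not taken from Master.csv
toy = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'birthCountry': ['USA', 'USA', 'USA', 'CAN', 'Cuba'],
})

# Same computation as player_count_rate: group sizes divided by the total
by_group = toy.groupby('birthCountry')['playerID'].count() / toy['playerID'].count()

# Built-in shortcut; the result is sorted by share in descending order
shares = toy['birthCountry'].value_counts(normalize=True)

print(shares)
```

One difference to keep in mind: `value_counts` leaves rows with a missing `birthCountry` out of the denominator by default, whereas `player_count_rate` divides by the number of all player IDs, so the two agree only when no country is missing.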

# Select the US-born players
data_usa=data1_df[data1_df['birthCountry']=='USA']
# Draw the pie chart
print_pie(player_count_rate(data_usa,'birthState'),'The player rate about States')

[Figure: pie chart "The player rate about States", share of US-born players by birth state]

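The chart shows that CA has the most US-born players, at 13%, followed by PA at 8.5%. The top five states are CA, PA, NY, IL and OH, where more than 44% of US-born players were born

Now let's look at the height and weight of players from different places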
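
A combined share such as the top-five figure can be read off programmatically rather than from the chart. A minimal sketch, assuming a Series of per-state shares like the one returned by `player_count_rate`; the state labels and numbers below are made up:

```python
import pandas as pd

# Hypothetical shares per state (illustrative numbers only)
state_share = pd.Series({'AA': 0.30, 'BB': 0.25, 'CC': 0.20, 'DD': 0.15, 'EE': 0.10})

top_n = 2  # number of leading states to combine
top = state_share.sort_values(ascending=False).head(top_n)
combined = top.sum()

print(list(top.index), round(combined, 2))
```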

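data2=data1_df[['birthCountry','birthState','height','weight']]
# Sort by mean height; numeric_only=True skips the text column birthState
data3=data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='height',ascending=False)
print('%d countries are above the overall average'%(data3['height'][data3['height']>=data1_df['height'].mean()].count()))
data3
26 countries are above the overall average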
                   height      weight
birthCountry
Indonesia       78.000000  220.000000
Belgium         77.000000  205.000000
Jamaica         75.250000  201.250000
Afghanistan     75.000000  215.000000
Brazil          74.333333  205.000000
Singapore       74.000000  205.000000
Honduras        74.000000  185.000000
Guam            74.000000  210.000000
Australia       73.500000  200.500000
Netherlands     73.454545  183.333333
South Korea     73.411765  198.294118
Curacao         73.357143  207.857143
Spain           73.250000  189.666667
Switzerland     73.000000  170.000000
Lithuania       73.000000  185.000000
Norway          73.000000  180.000000
China           73.000000  165.000000
Philippines     73.000000  188.000000
Aruba           73.000000  200.000000
Panama          72.890909  186.018182
D.R.            72.819596  192.916019
Taiwan          72.727273  194.454545
Sweden          72.666667  185.000000
Nicaragua       72.571429  189.785714
Germany         72.375000  182.871795
USA             72.257213  185.427646
Venezuela       72.225806  197.222874
Japan           72.209677  192.354839
Mexico          72.127119  189.118644
Saudi Arabia    72.000000  200.000000
Greece          72.000000  185.000000
American Samoa  72.000000  210.000000
Bahamas         72.000000  180.833333
Slovakia        72.000000  196.000000
CAN             71.979167  185.212500
P.R.            71.881423  185.818182
France          71.833333  184.666667
Austria         71.750000  190.250000
Cuba            71.682051  185.451282
Colombia        71.647059  199.125000
Poland          71.600000  179.800000
V.I.            71.333333  186.250000
Italy           71.142857  180.428571
Czech Republic  71.000000  184.000000
At Sea          71.000000  170.000000
Viet Nam        71.000000  200.000000
United Kingdom  70.377778  174.500000
Belize          70.000000  180.000000
Russia          69.857143  167.428571
Ireland         69.552632  170.131579
Finland         69.000000  165.000000
Denmark         67.000000  158.000000

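Indonesia has the highest average height at 78 inches, followed by Belgium at 77 inches. No country's average is below 67 inches, and 26 countries are above the overall average. Next, let's look at weight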

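c=data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='weight',ascending=False)
# Count the countries above the overall average
print('%d countries are above the overall average'%(c['weight'][c['weight']>=data1_df['weight'].mean()].count()))
c
27 countries are above the overall average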
                   height      weight
birthCountry
Indonesia       78.000000  220.000000
Afghanistan     75.000000  215.000000
American Samoa  72.000000  210.000000
Guam            74.000000  210.000000
Curacao         73.357143  207.857143
Singapore       74.000000  205.000000
Belgium         77.000000  205.000000
Brazil          74.333333  205.000000
Jamaica         75.250000  201.250000
Australia       73.500000  200.500000
Saudi Arabia    72.000000  200.000000
Viet Nam        71.000000  200.000000
Aruba           73.000000  200.000000
Colombia        71.647059  199.125000
South Korea     73.411765  198.294118
Venezuela       72.225806  197.222874
Slovakia        72.000000  196.000000
Taiwan          72.727273  194.454545
D.R.            72.819596  192.916019
Japan           72.209677  192.354839
Austria         71.750000  190.250000
Nicaragua       72.571429  189.785714
Spain           73.250000  189.666667
Mexico          72.127119  189.118644
Philippines     73.000000  188.000000
V.I.            71.333333  186.250000
Panama          72.890909  186.018182
P.R.            71.881423  185.818182
Cuba            71.682051  185.451282
USA             72.257213  185.427646
CAN             71.979167  185.212500
Lithuania       73.000000  185.000000
Greece          72.000000  185.000000
Honduras        74.000000  185.000000
Sweden          72.666667  185.000000
France          71.833333  184.666667
Czech Republic  71.000000  184.000000
Netherlands     73.454545  183.333333
Germany         72.375000  182.871795
Bahamas         72.000000  180.833333
Italy           71.142857  180.428571
Norway          73.000000  180.000000
Belize          70.000000  180.000000
Poland          71.600000  179.800000
United Kingdom  70.377778  174.500000
Ireland         69.552632  170.131579
At Sea          71.000000  170.000000
Switzerland     73.000000  170.000000
Russia          69.857143  167.428571
Finland         69.000000  165.000000
China           73.000000  165.000000
Denmark         67.000000  158.000000

Here the country with the highest average weight is again Indonesia at 220 pounds, followed by Afghanistan at 215 pounds; 27 countries are above the overall average
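
Country averages based on very few players are dominated by individuals. For instance, Indonesia's share in the country distribution is 0.000053 of 18846 records, which corresponds to about one player. A minimal sketch, using made-up data and an arbitrarily chosen threshold `min_players` (both are assumptions for illustration), of how group sizes could be reported next to the means:

```python
import pandas as pd

# Hypothetical toy data: country X has one player, country Y has four
toy = pd.DataFrame({
    'birthCountry': ['X', 'Y', 'Y', 'Y', 'Y'],
    'height': [78.0, 70.0, 72.0, 74.0, 72.0],
})

# Report the mean together with the number of players behind it
summary = toy.groupby('birthCountry')['height'].agg(['mean', 'count'])

# Keep only countries with enough players for the mean to be meaningful
min_players = 3  # assumed cut-off, chosen only for illustration
reliable = summary[summary['count'] >= min_players]

print(summary)
print(reliable)
```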

Next, let's look at how average height and average weight change with birth year


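# Extract the data: per-birth-year averages of the numeric columns
b=data1_df.groupby('birthYear').mean(numeric_only=True)
# Drop birth years for which any average is missing
d=b.dropna()
# Plot mean weight against birth year
print_plot(d,'weight','The weight change about birthyears')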

[Figure: "The weight change about birthyears", mean weight (y axis) against birth year (x axis)]
# Plot mean height against birth year
print_plot(d,'height','The height change about birthYear')

[Figure: "The height change about birthYear", mean height (y axis) against birth year (x axis)]

The plots show that the players' average height and weight both increase with birth year. How strong is this relationship? Let's check

# Extract the data
e=pd.DataFrame(d,columns=['birthyear','weight','height'])
e['birthyear']=e.index
# Compute the correlation coefficients
e.corrwith(e['birthyear'])
birthyear    1.000000
weight       0.929546
height       0.947681
dtype: float64



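The correlation coefficient between birth year and average height is 0.948, and between birth year and average weight it is 0.930. The players' average height and weight are therefore strongly correlated with birth year. Without further data, however, the cause of this trend cannot be determined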
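
Note that these coefficients are computed on yearly averages, which removes the variation between players born in the same year; at the level of individual players the correlation can be considerably weaker. A small sketch with made-up numbers (the frame `toy` is hypothetical) contrasts the two ways of computing it:

```python
import pandas as pd

# Hypothetical toy data: two players per birth year
toy = pd.DataFrame({
    'birthYear': [1900, 1900, 1950, 1950, 2000, 2000],
    'height': [68.0, 72.0, 70.0, 74.0, 72.0, 76.0],
})

# Player-level Pearson correlation: every player is one observation
r_players = toy['birthYear'].corr(toy['height'])

# Correlation of yearly means, as in the analysis above
yearly = toy.groupby('birthYear')['height'].mean().reset_index()
r_means = yearly['birthYear'].corr(yearly['height'])

print(round(r_players, 3), round(r_means, 3))
```

In this toy example the yearly means lie exactly on a line, while the individual heights scatter around it, so the averaged coefficient is the larger of the two.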

Next, let's look at the players' lifespan in relation to height and weight

# Keep only players with known birth and death years (this drops living players), then select columns
data_age=data1_df.dropna(subset=['birthYear','deathYear'])
data_age=data_age[['playerID','birthYear','deathYear','weight','height']]
# Compute each player's lifespan in years
data_age=pd.DataFrame(data_age,columns=['playerID','birthYear','deathYear','Age','weight','height'])
data_age['Age']=data_age['deathYear']-data_age['birthYear']

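Remove any remaining missing values

# Drop rows with missing data
data_age=data_age.dropna()
# Compute the averages for each lifespan; numeric_only=True skips playerID
f=data_age.groupby('Age').mean(numeric_only=True)
f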
         birthYear    deathYear      weight     height
Age
20.0   1907.500000  1927.500000  176.500000  70.500000
21.0   1867.000000  1888.000000  181.500000  72.500000
22.0   1925.800000  1947.800000  179.000000  71.400000
23.0   1915.000000  1938.000000  169.600000  72.000000
24.0   1916.200000  1940.200000  177.400000  71.300000
25.0   1898.307692  1923.307692  176.153846  72.461538
26.0   1903.400000  1929.400000  177.533333  71.733333
27.0   1887.769231  1914.769231  172.884615  70.884615
28.0   1894.500000  1922.500000  178.500000  71.500000
29.0   1907.432432  1936.432432  176.297297  71.486486
30.0   1888.709677  1918.709677  172.774194  71.064516
31.0   1881.666667  1912.666667  169.259259  70.777778
32.0   1889.393939  1921.393939  173.333333  70.727273
33.0   1894.258065  1927.258065  167.290323  70.516129
34.0   1898.900000  1932.900000  177.040000  71.820000
35.0   1899.135135  1934.135135  183.405405  71.756757
36.0   1891.051282  1927.051282  176.717949  70.128205
37.0   1886.538462  1923.538462  171.461538  70.333333
38.0   1892.083333  1930.083333  178.250000  71.354167
39.0   1897.589744  1936.589744  179.435897  71.641026
40.0   1892.311111  1932.311111  178.555556  71.133333
41.0   1893.500000  1934.500000  177.704545  70.727273
42.0   1893.225000  1935.225000  179.225000  71.275000
43.0   1891.204082  1934.204082  175.673469  70.816327
44.0   1885.344262  1929.344262  173.016393  70.377049
45.0   1898.121212  1943.121212  178.848485  71.136364
46.0   1893.938776  1939.938776  179.040816  71.061224
47.0   1893.441558  1940.441558  175.012987  70.805195
48.0   1894.000000  1942.000000  174.164557  70.949367
49.0   1894.213115  1943.213115  175.590164  70.868852
...            ...          ...         ...        ...
75.0   1900.285024  1975.285024  174.782609  71.164251
76.0   1897.894977  1973.894977  175.808219  71.118721
77.0   1897.607143  1974.607143  173.991071  71.004464
78.0   1897.606635  1975.606635  176.327014  71.033175
79.0   1898.990991  1977.990991  175.644144  71.157658
80.0   1899.351240  1979.351240  177.000000  71.190083
81.0   1899.879630  1980.879630  176.351852  70.925926
82.0   1900.754464  1982.754464  176.075893  71.281250
83.0   1901.454128  1984.454128  175.665138  71.243119
84.0   1898.257895  1982.257895  175.415789  70.915789
85.0   1900.005263  1985.005263  172.215789  70.968421
86.0   1903.913978  1989.913978  175.811828  71.209677
87.0   1897.798611  1984.798611  175.402778  71.090278
88.0   1904.540541  1992.540541  177.425676  71.533784
89.0   1900.299213  1989.299213  174.866142  71.228346
90.0   1901.486726  1991.486726  173.495575  70.858407
91.0   1899.068182  1990.068182  173.750000  70.681818
92.0   1901.673684  1993.673684  175.831579  71.157895
93.0   1901.513158  1994.513158  173.828947  71.000000
94.0   1898.088889  1992.088889  173.533333  71.311111
95.0   1899.461538  1994.461538  172.576923  70.826923
96.0   1902.222222  1998.222222  176.500000  71.111111
97.0   1893.647059  1990.647059  171.823529  70.352941
98.0   1900.882353  1998.882353  174.705882  70.705882
99.0   1897.222222  1996.222222  163.444444  69.666667
100.0  1899.700000  1999.700000  168.600000  70.100000
101.0  1900.400000  2001.400000  167.000000  70.400000
102.0  1900.000000  2002.000000  165.000000  71.000000
103.0  1911.000000  2014.000000  158.000000  65.000000
107.0  1891.000000  1998.000000  162.000000  69.000000

85 rows × 4 columns

# Extract the age from the index
age_df=pd.DataFrame(f,columns=['age','weight','height'])
age_df['age']=f.index
# Draw the plots
print_plot(age_df,'weight','weight-age')
print_plot(age_df,'height','height-age')

[Figure: "weight-age", mean weight (y axis) against lifespan in years (x axis)]

[Figure: "height-age", mean height (y axis) against lifespan in years (x axis)]
# Compute the correlation matrix
age_df.corr()
             age    weight    height
age     1.000000 -0.430298 -0.371683
weight -0.430298  1.000000  0.724237
height -0.371683  0.724237  1.000000

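The results show that lifespan is only weakly correlated with height and weight, and that the correlation is negative in both cases (-0.37 and -0.43). It is far weaker than the correlation with birth year. Still, it suggests that a player's height and weight may be associated with lifespan to some extent, although a correlation alone does not show that one causes the other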
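
The lifespan used in this section is the difference between death year and birth year, so it can be off by up to one year because months and days are ignored, and players who are still alive have no `deathYear` and must be excluded. A minimal sketch of that calculation on made-up records (the frame `toy` and all of its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; player 'c' is still alive (no deathYear)
toy = pd.DataFrame({
    'playerID': ['a', 'b', 'c', 'd'],
    'birthYear': [1900.0, 1910.0, 1980.0, 1920.0],
    'deathYear': [1970.0, 1990.0, np.nan, 1980.0],
    'weight': [180.0, 170.0, 210.0, 190.0],
})

# Keep deceased players only, then compute the lifespan in whole years
deceased = toy.dropna(subset=['deathYear']).copy()
deceased['Age'] = deceased['deathYear'] - deceased['birthYear']

# Player-level Pearson correlation between lifespan and weight
r = deceased['Age'].corr(deceased['weight'])

print(deceased[['playerID', 'Age', 'weight']])
print(round(r, 3))
```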
