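Find the country with the most Nobel laureates
1.1: Read the data and inspect it
1.2: Count laureates per country
1.3: Explore when the top country rose to prominence
1.1: Read the data and inspect it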
import pandas as pd
import numpy as np
read_csv(path/filename)
nobel = pd.read_csv('nobel.csv')
Inspect the first few rows: dataset.head()
nobel.head() # first five rows
year category prize motivation prize_share laureate_id laureate_type full_name birth_date birth_city birth_country sex organization_name organization_city organization_country death_date death_city death_country
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services … 1/1 160 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam Netherlands Male Berlin University Berlin Germany 1911-03-01 Berlin Germany
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit… 1/1 569 Individual Sully Prudhomme 1839-03-16 Paris France Male NaN NaN NaN 1907-09-07 Châtenay France
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its… 1/1 293 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) Prussia (Poland) Male Marburg University Marburg Germany 1917-03-31 Marburg Germany
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 462 Individual Jean Henry Dunant 1828-05-08 Geneva Switzerland Male NaN NaN NaN 1910-10-30 Heiden Switzerland
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 463 Individual Frédéric Passy 1822-05-20 Paris France Male NaN NaN NaN 1912-06-12 Paris France
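Since nobel.csv itself is not reproduced here, the read-and-inspect step can be sketched on a tiny in-memory CSV. The three rows below are copied from the output above and trimmed to four columns; the names sample_csv and sample, and the use of io.StringIO in place of a file path, are assumptions made only for illustration.

```python
import io

import pandas as pd

# A tiny stand-in for nobel.csv: three rows from the output above,
# trimmed to four of the 18 columns.
sample_csv = io.StringIO(
    "year,category,full_name,birth_country\n"
    "1901,Chemistry,Jacobus Henricus van 't Hoff,Netherlands\n"
    "1901,Literature,Sully Prudhomme,France\n"
    "1901,Medicine,Emil Adolf von Behring,Prussia (Poland)\n"
)

# read_csv accepts a path or any file-like object
sample = pd.read_csv(sample_csv)

print(sample.head(2))  # first two rows only
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # the type pandas inferred for each column
```

Checking shape and dtypes right after loading is a cheap way to confirm the file was parsed as expected before any analysis.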
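1.2: Count laureates per country
Look at the birth country column, birth_country, to see which countries have the most laureates
Count the occurrences of each country: dataset.value_counts()
head(number of rows you want to see)
nobel['birth_country'].value_counts().head(10)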
United States of America 259
United Kingdom 85
Germany 61
France 51
Sweden 29
Japan 24
Netherlands 18
Canada 18
Russia 17
Italy 17
Name: birth_country, dtype: int64
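To make the behaviour of value_counts() concrete, here is a minimal sketch on hypothetical toy data (the values are invented, not taken from nobel.csv). It shows that the result is sorted by count, that missing values are left out unless dropna=False is passed, and that nunique() gives the number of distinct countries.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
countries = pd.Series(
    ["France", "Germany", "France", None, "Sweden", "France", "Germany"],
    name="birth_country",
)

counts = countries.value_counts()            # sorted by count, missing values excluded
print(counts.head(2))                        # the two most frequent countries
print(countries.value_counts(dropna=False))  # also counts the missing value
print(countries.nunique())                   # number of distinct countries
```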
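1.3: Explore when the top country (the United States) rose to prominence
Flag every laureate born in the United States
nobel["usa_winner"] = nobel['birth_country'] == "United States of America"
nobel["usa_winner"]
Derive the time period, using ten years as one unit: the decade
(np.floor(year / 10)) * 10
nobel["decade"] = np.floor(nobel['year'] / 10) * 10
nobel["decade"]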
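The decade formula can be checked on a handful of years. The sketch below uses made-up years and also shows an integer alternative based on floor division, which is not part of the original walkthrough; it produces the same decades without turning the column into floats.

```python
import numpy as np
import pandas as pd

years = pd.Series([1901, 1909, 1910, 1999, 2016], name="year")

# Same formula as above: the result is a float column
decade_float = np.floor(years / 10) * 10

# Alternative (an assumption, not the original method): floor division keeps integers
decade_int = years // 10 * 10

print(decade_float.tolist())  # [1900.0, 1900.0, 1910.0, 1990.0, 2010.0]
print(decade_int.tolist())    # [1900, 1900, 1910, 1990, 2010]
```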
Compute the proportion: dataset.groupby("grouping column")["target column"].aggregation
prop_usa_winner = nobel.groupby("decade", as_index=False)["usa_winner"].mean()
prop_usa_winner
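Taking the mean of a boolean column works as a proportion because True counts as 1 and False as 0. A minimal sketch on hypothetical toy data (the toy frame and its values are invented):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "decade": [1900, 1900, 1900, 1900, 1910, 1910],
    "usa_winner": [False, False, False, True, True, False],
})

# mean() of booleans is the share of True values within each group
prop = toy.groupby("decade", as_index=False)["usa_winner"].mean()
print(prop)
```

In this toy frame one of four laureates in the 1900s is flagged, giving 0.25, and one of two in the 1910s, giving 0.5. With as_index=False the decade stays an ordinary column, which is convenient for plotting.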
Visualization
import matplotlib.pyplot as plt
import seaborn as sb
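plt.plot(prop_usa_winner["decade"], prop_usa_winner["usa_winner"]) # plot the per-decade proportions, not the raw True/False rows
plt.xlabel("Decade")
plt.ylabel("Proportion of US-born laureates")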
plt.show()
plt.rcParams['figure.figsize'] = [11, 7]
# Line plot: lineplot(x = x-axis data, y = y-axis data)
sb.lineplot(x=nobel["decade"], y=nobel["usa_winner"])
plt.show()
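As a sketch of the plotting step with explicit axis labels, the example below draws precomputed per-decade proportions using matplotlib only. The numbers are hypothetical placeholders, not results from nobel.csv, and the Agg backend and in-memory buffer are assumptions made so the snippet runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical proportions, not the real results
prop = pd.DataFrame({
    "decade": [1900, 1910, 1920, 1930],
    "usa_winner": [0.02, 0.08, 0.07, 0.25],
})

fig, ax = plt.subplots(figsize=(11, 7))
ax.plot(prop["decade"], prop["usa_winner"], marker="o")
ax.set_xlabel("Decade")
ax.set_ylabel("Proportion of US-born laureates")
ax.set_ylim(0, 1)  # proportions always lie between 0 and 1

buf = io.BytesIO()             # in-memory buffer; pass a filename to write a PNG instead
fig.savefig(buf, format="png")
```

Plotting the aggregated table gives one point per decade. Passing the raw rows to sb.lineplot, as above, lets seaborn compute the per-decade mean itself and draw a confidence band around it.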
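Find the proportion of women among Nobel laureates
2.1: Select all female Nobel laureates
2.2: Compute their share
2.3: In which year did the first woman win a Nobel Prize?
2.1: Select all female Nobel laureates
nobel["sex"].value_counts()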
Male 836
Female 49
Name: sex, dtype: int64
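The counts above already give the overall share of women: 49 / (836 + 49), or about 5.5%. The sketch below rebuilds a column with exactly those counts and computes the share in two ways. One caveat, stated as an assumption about the real data: if the sex column has missing values (for example for organizations), value_counts(normalize=True) ignores them, while the mean of a comparison such as sex == "Female" counts them in the denominator.

```python
import pandas as pd

# Rebuild a sex column with the counts shown above (836 Male, 49 Female)
sex = pd.Series(["Male"] * 836 + ["Female"] * 49, name="sex")

# Share of each value among the non-missing entries
shares = sex.value_counts(normalize=True)
print(shares.round(4))

# Share of True values among all rows
female_share = (sex == "Female").mean()
print(round(female_share, 4))  # 0.0554
```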
2.2: Compute their share
Flag all female laureates
nobel["female_winner"] = nobel["sex"] == "Female"
Use ten years as one unit
nobel["decade"]
Compute the share of female laureates in each decade: groupby("grouping column")["target column"].aggregation
prop_female_winner = nobel.groupby("decade", as_index=False)["female_winner"].mean()
Visualization: line plot
sb.lineplot(x="decade", y="female_winner", data=prop_female_winner)
<matplotlib.axes._subplots.AxesSubplot at 0x1228fde48>
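2.3: In which year did the first woman win a Nobel Prize?
female_winner = nobel[nobel["sex"] == "Female"]
female_winner["year"].min() # earliest year only
female_winner.nsmallest(1, 'year') # full row of the earliest female laureate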
year category prize motivation prize_share laureate_id laureate_type full_name birth_date birth_city ... sex organization_name organization_city organization_country death_date death_city death_country usa_winner decade female_winner
19 1903 Physics The Nobel Prize in Physics 1903 "in recognition of the extraordinary services … 1/4 6 Individual Marie Curie, née Sklodowska 1867-11-07 Warsaw … Female NaN NaN NaN 1934-07-04 Sallanches France False 1900.0 True
1 rows × 21 columns
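There are several ways to pull out the earliest row. The sketch below compares them on hypothetical toy data (the names are placeholders): nsmallest returns a one-row DataFrame, sorting and taking the head gives the same row, idxmin returns the index label of the earliest year, and min on the year column returns only the year.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "year": [1911, 1903, 1905],
    "full_name": ["Laureate C", "Laureate A", "Laureate B"],
})

first_row = toy.nsmallest(1, "year")        # one-row DataFrame
same_row = toy.sort_values("year").head(1)  # same row, obtained by sorting
first_label = toy["year"].idxmin()          # index label of the earliest year
first_year = toy["year"].min()              # just the year

print(first_row)
print(first_label, first_year)
```

Calling min() on the whole DataFrame is different: it returns the minimum of each column separately, so the values it reports need not come from the same row.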
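Find the average age of Nobel laureates
3.1: Compute each laureate's age at the time of the award
3.2: Visualize the result
3.3: Laureate ages for every prize category
3.1: Compute each laureate's age at the time of the award: award year - birth year = age at award
The birth date column needs to be converted to a date type first
to_datetime(the data to convert)
.dt.year extracts the year part
nobel['birth_date'] = pd.to_datetime(nobel['birth_date'])
nobel["age"] = nobel["year"] - nobel['birth_date'].dt.year
nobel["age"]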
year category prize motivation prize_share laureate_id laureate_type full_name birth_date birth_city ... organization_name organization_city organization_country death_date death_city death_country usa_winner decade female_winner age
0 1901 Chemistry The Nobel Prize in Chemistry 1901 "in recognition of the extraordinary services … 1/1 160 Individual Jacobus Henricus van 't Hoff 1852-08-30 Rotterdam … Berlin University Berlin Germany 1911-03-01 Berlin Germany False 1900.0 False 49.0
1 1901 Literature The Nobel Prize in Literature 1901 "in special recognition of his poetic composit… 1/1 569 Individual Sully Prudhomme 1839-03-16 Paris … NaN NaN NaN 1907-09-07 Châtenay France False 1900.0 False 62.0
2 1901 Medicine The Nobel Prize in Physiology or Medicine 1901 "for his work on serum therapy, especially its… 1/1 293 Individual Emil Adolf von Behring 1854-03-15 Hansdorf (Lawice) … Marburg University Marburg Germany 1917-03-31 Marburg Germany False 1900.0 False 47.0
3 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 462 Individual Jean Henry Dunant 1828-05-08 Geneva … NaN NaN NaN 1910-10-30 Heiden Switzerland False 1900.0 False 73.0
4 1901 Peace The Nobel Peace Prize 1901 NaN 1/2 463 Individual Frédéric Passy 1822-05-20 Paris … NaN NaN NaN 1912-06-12 Paris France False 1900.0 False 79.0
5 rows × 22 columns
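The age calculation can be verified by hand against the five rows shown above. The sketch below uses only their award years and birth dates; the toy frame is an illustration, not the full dataset.

```python
import pandas as pd

# Award years and birth dates of the five rows shown above
toy = pd.DataFrame({
    "year": [1901, 1901, 1901, 1901, 1901],
    "birth_date": ["1852-08-30", "1839-03-16", "1854-03-15", "1828-05-08", "1822-05-20"],
})

toy["birth_date"] = pd.to_datetime(toy["birth_date"])  # strings -> datetime64
toy["age"] = toy["year"] - toy["birth_date"].dt.year   # award year minus birth year
print(toy["age"].tolist())  # [49, 62, 47, 73, 79]
```

This is a difference of calendar years, so it can be one year higher than the laureate's actual age on the day of the award. The ages appear as floats (49.0) in the output above, presumably because some rows in the full dataset have no birth date.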
3.2: Visualize the result
sb.lmplot(x = "year", y = "age", data = nobel, aspect = width-to-height ratio, line_kws = {"color": color}, lowess = True)
sb.lmplot(x="year", y="age", data=nobel, aspect=2, line_kws={"color": "black"}, lowess=True)
<seaborn.axisgrid.FacetGrid at 0x1a27f51f60>
3.3: Laureate ages for every prize category
sb.lmplot(x = "year",
y="age",
data = nobel,
aspect= 2,
line_kws= {"color" : "black"},
lowess=True,
row = "category"
)
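A numeric summary per category can complement the faceted plot. The sketch below is an addition, not part of the original analysis, and runs on hypothetical toy data; on the real data the same call would be applied to nobel. Missing ages are skipped by count, mean and median.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "category": ["Physics", "Physics", "Peace", "Peace", "Peace"],
    "age": [40.0, 50.0, 60.0, 70.0, None],
})

# One row per category: number of known ages, mean age, median age
summary = toy.groupby("category")["age"].agg(["count", "mean", "median"])
print(summary)
```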