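Before We Begin
For setting up the data science environment, see my earlier post:
[Deep Dive into Python] Data Science with Python (1): Environment Setup
Previous posts in this series:
[Deep Dive into Python] Data Science with Python (2): jupyter-lab and NumPy Arrays
[Deep Dive into Python] Data Science with Python (3): NumPy Constants, Functions, and Linear Spaces
[Deep Dive into Python] Data Science with Python (4): Exercises and Solutions (book p. 337)
[Deep Dive into Python] Data Science with Python (5): Matplotlib Visualization (1)
[Deep Dive into Python] Data Science with Python (6): Matplotlib Visualization (2)
[Deep Dive into Python] Data Science with Python (7): Exercises (book p. 352)
[Deep Dive into Python] Data Science with Python (8): pandas Data Structures: Series and DataFrame
[Deep Dive into Python] Data Science with Python (9): Exercises (book p. 361)
[Deep Dive into Python] Data Science with Python (10): pandas Data Processing (Part 1)
[Deep Dive into Python] Data Science with Python (11): pandas Data Processing (Part 2)
A note on the code: the snippets were run in a live session, so some import statements may have been omitted.
This post continues the analysis of the Nobel laureates dataset (laureates.csv).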
Python Code Snippet 1
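Look up records by the date-of-birth (born) field.
import pandas as pd
nobel = pd.read_csv("laureates.csv")  # assumes the file is in the working directory
print(nobel.loc[nobel["born"] == "1879-03-14"])
print(nobel.loc[nobel["born"] == "1879-03-14", "surname"])
# Substring match on the date string; na=False treats missing values as non-matches
print(nobel.loc[nobel["born"].str.contains("06-28", na=False)])
print(nobel.loc[(nobel["born"].str.contains("06-28", na=False)) & (nobel["category"] == "physics")])
# iloc selects by integer position, not by index label
print(nobel.iloc[79])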
# Einstein's prize record
id firstname surname born died bornCountry \
25 26 Albert Einstein 1879-03-14 1955-04-18 Germany
bornCountryCode bornCity diedCountry diedCountryCode diedCity gender \
25 DE Ulm USA US Princeton NJ male
year category overallMotivation share \
25 1921 physics NaN 1
motivation \
25 "for his services to Theoretical Physics and e...
name city country
25 Kaiser-Wilhelm-Institut (now Max-Planck-Instit... Berlin Germany
# Only the value of the surname field
25 Einstein
Name: surname, dtype: object
# Laureates born on June 28
id firstname surname born died \
79 79 Maria Goeppert Mayer 1906-06-28 1972-02-20
125 126 Klaus von Klitzing 1943-06-28 0000-00-00
281 283 F. Sherwood Rowland 1927-06-28 2012-03-10
304 306 Alexis Carrel 1873-06-28 1944-11-05
598 607 Luigi Pirandello 1867-06-28 1936-12-10
790 809 Muhammad Yunus 1940-06-28 0000-00-00
889 916 William C. Campbell 1930-06-28 0000-00-00
bornCountry bornCountryCode \
79 Germany (now Poland) PL
125 German-occupied Poland (now Poland) PL
281 USA US
304 France FR
598 Italy IT
790 British India (now Bangladesh) BD
889 Ireland IE
bornCity diedCountry diedCountryCode diedCity \
79 Kattowitz (now Katowice) USA US San Diego CA
125 Schroda NaN NaN NaN
281 Delaware OH USA US Corona del Mar CA
304 Sainte-Foy-lès-Lyon France FR Paris
598 Agrigento Sicily Italy IT Rome
790 Chittagong NaN NaN NaN
889 Ramelton NaN NaN NaN
gender year category overallMotivation share \
79 female 1963 physics NaN 4
125 male 1985 physics NaN 1
281 male 1995 chemistry NaN 3
304 male 1912 medicine NaN 1
598 male 1934 literature NaN 1
790 male 2006 peace NaN 2
889 male 2015 medicine NaN 4
motivation \
79 "for their discoveries concerning nuclear shel...
125 "for the discovery of the quantized Hall effect"
281 "for their work in atmospheric chemistry parti...
304 "in recognition of his work on vascular suture...
598 "for his bold and ingenious revival of dramati...
790 "for their efforts to create economic and soci...
889 "for their discoveries concerning a novel ther...
name city country
79 University of California San Diego CA USA
125 Max-Planck-Institut für Festkörperforschung Stuttgart Germany
281 University of California Irvine CA USA
304 Rockefeller Institute for Medical Research New York NY USA
598 NaN NaN NaN
790 NaN NaN NaN
889 Drew University Madison NJ USA
# Laureates born on June 28 who won the physics prize
id firstname surname born died \
79 79 Maria Goeppert Mayer 1906-06-28 1972-02-20
125 126 Klaus von Klitzing 1943-06-28 0000-00-00
bornCountry bornCountryCode \
79 Germany (now Poland) PL
125 German-occupied Poland (now Poland) PL
bornCity diedCountry diedCountryCode diedCity \
79 Kattowitz (now Katowice) USA US San Diego CA
125 Schroda NaN NaN NaN
gender year category overallMotivation share \
79 female 1963 physics NaN 4
125 male 1985 physics NaN 1
motivation \
79 "for their discoveries concerning nuclear shel...
125 "for the discovery of the quantized Hall effect"
name city country
79 University of California San Diego CA USA
125 Max-Planck-Institut für Festkörperforschung Stuttgart Germany
# The row at integer position 79, selected with iloc (integer-location based indexing)
id 79
firstname Maria
surname Goeppert Mayer
born 1906-06-28
died 1972-02-20
bornCountry Germany (now Poland)
bornCountryCode PL
bornCity Kattowitz (now Katowice)
diedCountry USA
diedCountryCode US
diedCity San Diego CA
gender female
year 1963
category physics
overallMotivation NaN
share 4
motivation "for their discoveries concerning nuclear shel...
name University of California
city San Diego CA
country USA
Name: 79, dtype: object
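Because the file uses the default integer index, label-based `loc` and position-based `iloc` happen to agree for row 79 above, which can hide the difference between them. The following is a minimal sketch on a hand-built frame. The three rows are copied from the output above; the name `sample` and the choice to keep the original row labels as the index are assumptions made for illustration. It also uses `str.endswith("-06-28")`, a slightly stricter variant of the `str.contains` match in the snippet.

```python
import pandas as pd

# Three rows copied from the output above; the index keeps the original row labels.
sample = pd.DataFrame(
    {
        "surname": ["Einstein", "Goeppert Mayer", "von Klitzing"],
        "born": ["1879-03-14", "1906-06-28", "1943-06-28"],
        "category": ["physics", "physics", "physics"],
    },
    index=[25, 79, 125],
)

# loc selects by index label (here the label 79).
by_label = sample.loc[79, "surname"]

# iloc selects by integer position (here the second row), whatever its label is.
by_position = sample.iloc[1]["surname"]

# A boolean mask plus a column label returns the matching surnames as a Series.
june_28 = sample.loc[sample["born"].str.endswith("-06-28"), "surname"]

print(by_label)
print(by_position)
print(list(june_28))
```

In this small frame, `sample.iloc[79]` would raise an IndexError, because there is no 80th row, while `sample.loc[79]` still works.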
Python Code Snippet 2
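Retrieve a laureate's birth and death dates, compute the lifespan in days, and convert it to years.
import numpy as np
import pandas as pd
bethe = nobel.loc[nobel["surname"] == "Bethe"]
print(bethe["born"])
print(bethe["died"])
diff = pd.to_datetime(bethe["died"]) - pd.to_datetime(bethe["born"])
print(diff)
print(diff.dt.days)
# Newer pandas releases may reject the "Y" unit here, since a year has no fixed length;
# in that case diff.dt.days / 365.2425 should give the same number.
print(diff / np.timedelta64(1, "Y"))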
# Hans Bethe's date of birth
88 1906-07-02
Name: born, dtype: object
# Hans Bethe's date of death
88 2005-03-06
Name: died, dtype: object
# Hans Bethe's lifespan (in days)
88 36042 days
dtype: timedelta64[ns]
# The same value as a plain integer, via .dt.days
88 36042
dtype: int64
# The lifespan in years, obtained by dividing by np.timedelta64(1, "Y")
88 98.679644
dtype: float64
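The days-to-years conversion can also be written out by hand, which makes the year length explicit. This is a minimal sketch, not the book's method: it assumes a year of 365.2425 days (the mean Gregorian year), which is consistent with the 98.679644 printed above. The constant name `DAYS_PER_YEAR` is my own.

```python
import pandas as pd

# Hans Bethe's dates, as printed above.
born = pd.Timestamp("1906-07-02")
died = pd.Timestamp("2005-03-06")

lifespan = died - born   # a pandas Timedelta
days = lifespan.days     # whole days as a plain int

# Assumption: one year = 365.2425 days (mean Gregorian year).
DAYS_PER_YEAR = 365.2425
years = days / DAYS_PER_YEAR

print(days)
print(round(years, 6))
```

Dividing a day count by a constant avoids the ambiguous "Y" unit entirely, so it should behave the same across pandas versions.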
Python Code Snippet 3
Prize records for organizations (rather than individuals).
print(nobel.loc[nobel["born"] == "1873-00-00"])
print(nobel.iloc[465].born)
print(nobel.iloc[465].category)
print(nobel.iloc[465].year)
# The Institute of International Law was founded in 1873; because it is not a person, the month and day are recorded as 00-00
id firstname surname born died \
465 467 Institute of International Law NaN 1873-00-00 0000-00-00
bornCountry bornCountryCode bornCity diedCountry diedCountryCode diedCity \
465 NaN NaN NaN NaN NaN NaN
gender year category overallMotivation share \
465 org 1904 peace NaN 1
motivation name city country
465 "for its striving in public law to develop pea... NaN NaN NaN
# The organization was founded in 1873
1873-00-00
# It received the peace prize
peace
# Year of the award
1904
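Looking up an organization by its founding date only works if that date is already known. The output above shows the value `org` in the gender column for this record, so filtering on that column is one possible alternative. The sketch below uses two rows copied from the output in this post, with only a few columns kept. The name `sample` is an assumption, and whether every organization in the full file is tagged this way is not verified here.

```python
import pandas as pd

# Two rows copied from the output in this post (subset of columns).
sample = pd.DataFrame(
    {
        "firstname": ["Albert", "Institute of International Law"],
        "born": ["1879-03-14", "1873-00-00"],
        "gender": ["male", "org"],
        "category": ["physics", "peace"],
        "year": [1921, 1904],
    },
    index=[25, 465],
)

# Select organizations through the gender column instead of a known date.
orgs = sample.loc[sample["gender"] == "org"]

# Cross-check: does every selected row use the "-00-00" month/day placeholder?
placeholder_date = orgs["born"].str.endswith("-00-00")

print(orgs[["firstname", "category", "year"]])
print(placeholder_date.all())
```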
Python Code Snippet 4
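Compute each laureate's lifespan from the born and died fields and store it in a new lifespan column of the DataFrame (the CSV file itself is not modified).
nobel["born"] = pd.to_datetime(nobel["born"], errors="coerce")
nobel["died"] = pd.to_datetime(nobel["died"], errors="coerce")
print(nobel.iloc[465].born)
# Newer pandas releases may reject the "Y" unit; in that case
# (nobel["died"] - nobel["born"]).dt.days / 365.2425 should give the same values.
nobel["lifespan"] = (nobel["died"] - nobel["born"]) / np.timedelta64(1, "Y")
bethe = nobel.loc[nobel["surname"] == "Bethe"]
print(bethe["lifespan"])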
# Invalid "born" dates are converted to NaT (Not a Time)
NaT
# Hans Bethe's lifespan, in years
88 98.679644
Name: lifespan, dtype: float64
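To see what `errors="coerce"` does to the placeholder dates, and how the resulting NaT values carry through to the lifespan, here is a small sketch on three records whose date strings are copied from the output in this post. The frame name `sample`, the explicit `format="%Y-%m-%d"`, and the 365.2425-day year are assumptions on my part, not something the original code specifies.

```python
import pandas as pd

# Date strings copied from this post: Bethe, von Klitzing (no death date
# recorded), and the Institute of International Law (no surname).
sample = pd.DataFrame(
    {
        "surname": ["Bethe", "von Klitzing", None],
        "born": ["1906-07-02", "1943-06-28", "1873-00-00"],
        "died": ["2005-03-06", "0000-00-00", "0000-00-00"],
    }
)

for col in ("born", "died"):
    # errors="coerce" turns strings that are not valid dates into NaT.
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d", errors="coerce")

# NaT propagates through the subtraction, so those rows end up as NaN.
sample["lifespan"] = (sample["died"] - sample["born"]).dt.days / 365.2425

print(sample[["surname", "lifespan"]])
print(sample["lifespan"].notna().sum())
```

Only rows with both a valid birth date and a valid death date get a numeric lifespan, so living laureates and organizations drop out of any later statistics.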
Python Code Snippet 5
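Plot a histogram of laureates' lifespans from the lifespan column computed in the previous step. So if you want to live a long life, the best approach is to win a Nobel Prize. (Tongue in cheek, of course: the prize is typically awarded well into a career, so the sample is skewed toward people who had already reached an advanced age.)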
import matplotlib.pyplot as plt
nobel.hist(column = "lifespan")
plt.show()
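For readers without the CSV at hand, the same kind of plot can be sketched from the handful of birth and death dates printed earlier in this post. Everything beyond those dates is an assumption for illustration: the frame name `sample`, the 10-year bin edges, the axis labels, the non-interactive `Agg` backend, and the output file name `lifespan_hist.png`. Six records are far too few to say anything about laureates in general.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Dates copied from this post: Einstein, Goeppert Mayer, Rowland,
# Carrel, Pirandello, Bethe.
sample = pd.DataFrame(
    {
        "born": ["1879-03-14", "1906-06-28", "1927-06-28",
                 "1873-06-28", "1867-06-28", "1906-07-02"],
        "died": ["1955-04-18", "1972-02-20", "2012-03-10",
                 "1944-11-05", "1936-12-10", "2005-03-06"],
    }
)
born = pd.to_datetime(sample["born"])
died = pd.to_datetime(sample["died"])
sample["lifespan"] = (died - born).dt.days / 365.2425  # assumed year length

fig, ax = plt.subplots()
# Bins of 10 years from 60 to 100; dropna() guards against missing lifespans.
counts, edges, _ = ax.hist(sample["lifespan"].dropna(), bins=range(60, 101, 10))
ax.set_xlabel("Lifespan (years)")
ax.set_ylabel("Number of laureates")
ax.set_title("Lifespans of six laureates from this post")
fig.savefig("lifespan_hist.png")  # hypothetical output file name
```

Labeling the axes and fixing the bin edges makes the figure easier to read than the defaults of `DataFrame.hist`.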
Reference
Michael Hartl, Learn Enough Python to Be Dangerous: Software Development, Flask Web Apps, and Beginning Data Science with Python. Boston: Pearson, 2023.