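Preface
For setting up the data science environment, see my blog post:
[Deep Dive into Python] Data Science with Python (1): Environment Setup
Previous data science posts:
[Deep Dive into Python] Data Science with Python (2): jupyter-lab and NumPy Arrays
[Deep Dive into Python] Data Science with Python (3): NumPy Constants, Functions, and Linear Spaces
[Deep Dive into Python] Data Science with Python (4): Exercises and Solutions (Book p. 337)
[Deep Dive into Python] Data Science with Python (5): Matplotlib Visualization (1)
[Deep Dive into Python] Data Science with Python (6): Matplotlib Visualization (2)
[Deep Dive into Python] Data Science with Python (7): Exercises (Book p. 352)
[Deep Dive into Python] Data Science with Python (8): pandas Data Structures: Series and DataFrame
[Deep Dive into Python] Data Science with Python (9): Exercises (Book p. 361)
Code note: because the code was run in a live session, some import statements may have been omitted.
This post uses the pandas library for some first steps in data processing and analysis. The data is a CSV file of Nobel laureates over the years (laureates.csv, which must be downloaded beforehand). We use simple pandas commands to match and search through this large file and retrieve the award information for Richard Feynman and Kip Thorne, stopping at the first error we hit (a ValueError). The CSV file can be downloaded with the following command: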
curl -OL https://cdn.learnenough.com/laureates.csv
Python Code Snippet 1
import pandas as pd
nobel = pd.read_csv("laureates.csv")
print("Output for describe() method:")
print(nobel.describe())
print()
print("Output for head() method:")
print(nobel.head())
print()
print("Output for info() method:")
print(nobel.info())  # info() prints its report itself and returns None, hence the trailing "None" in the output
print()
Output for describe() method:
                id         year       share
count   975.000000   975.000000  975.000000
mean    496.221538  1972.471795    2.014359
std     290.594353    34.058064    0.943909
min       1.000000  1901.000000    1.000000
25%     244.500000  1948.500000    1.000000
50%     488.000000  1978.000000    2.000000
75%     746.500000  2001.000000    3.000000
max    1009.000000  2021.000000    4.000000
Output for head() method:
id firstname surname born died \
0 1 Wilhelm Conrad Röntgen 1845-03-27 1923-02-10
1 2 Hendrik A. Lorentz 1853-07-18 1928-02-04
2 3 Pieter Zeeman 1865-05-25 1943-10-09
3 4 Henri Becquerel 1852-12-15 1908-08-25
4 5 Pierre Curie 1859-05-15 1906-04-19
bornCountry bornCountryCode bornCity \
0 Prussia (now Germany) DE Lennep (now Remscheid)
1 the Netherlands NL Arnhem
2 the Netherlands NL Zonnemaire
3 France FR Paris
4 France FR Paris
diedCountry diedCountryCode diedCity gender year category \
0 Germany DE Munich male 1901 physics
1 the Netherlands NL NaN male 1902 physics
2 the Netherlands NL Amsterdam male 1902 physics
3 France FR NaN male 1903 physics
4 France FR Paris male 1903 physics
overallMotivation share motivation \
0 NaN 1 "in recognition of the extraordinary services ...
1 NaN 2 "in recognition of the extraordinary service t...
2 NaN 2 "in recognition of the extraordinary service t...
3 NaN 2 "in recognition of the extraordinary services ...
4 NaN 4 "in recognition of the extraordinary services ...
name city \
0 Munich University Munich
1 Leiden University Leiden
2 Amsterdam University Amsterdam
3 École Polytechnique Paris
4 École municipale de physique et de chimie indu... Paris
country
0 Germany
1 the Netherlands
2 the Netherlands
3 France
4 France
Output for info() method:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 975 non-null int64
1 firstname 975 non-null object
2 surname 945 non-null object
3 born 974 non-null object
4 died 975 non-null object
5 bornCountry 946 non-null object
6 bornCountryCode 946 non-null object
7 bornCity 943 non-null object
8 diedCountry 640 non-null object
9 diedCountryCode 640 non-null object
10 diedCity 634 non-null object
11 gender 975 non-null object
12 year 975 non-null int64
13 category 975 non-null object
14 overallMotivation 23 non-null object
15 share 975 non-null int64
16 motivation 975 non-null object
17 name 717 non-null object
18 city 712 non-null object
19 country 713 non-null object
dtypes: int64(3), object(17)
memory usage: 152.5+ KB
None
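The trailing None above comes from wrapping info() in print(). A minimal sketch of that behavior, assuming a small made-up DataFrame (the three rows below are illustrative stand-ins, not the real file):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the laureates data.
df = pd.DataFrame({"id": [1, 2, 3], "surname": ["Röntgen", "Lorentz", None]})

# info() writes its report to stdout on its own and returns None,
# so print(df.info()) would print the report followed by "None".
result = df.info()
print(result is None)  # True
```

Calling nobel.info() without print() would therefore give the same report without the extra line.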
Python Code Snippet 2
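print(nobel[nobel["surname"] == "Feynman"])  # rows whose surname is exactly "Feynman"
print(nobel[nobel["surname"] == "Feynman"].year)  # year column of those rows
print((nobel["surname"] == "Feynman")[86])  # value of the boolean mask at index label 86
print(nobel.loc[nobel["surname"] == "Feynman", "year"])  # same year lookup via .loc
print(nobel.loc[nobel["firstname"] == "Kip"])  # exact match: no firstname equals "Kip"
print(nobel.loc[nobel["firstname"] == "Kip S."])  # exact match on the full firstname
print(nobel.loc[nobel["firstname"] == "Kip S."].year)
print(nobel.loc[nobel["firstname"].str.contains("Kip")])  # substring match on firstname
print(nobel.loc[nobel["surname"].str.contains("Feynman")])  # raises ValueError: surname contains NaN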
Output (with error at the end):
# Award information for Richard Feynman (USA, 1918-1988): 1965, California Institute of Technology, found at row index 86:
id firstname surname born died bornCountry \
86 86 Richard P. Feynman 1918-05-11 1988-02-15 USA
bornCountryCode bornCity diedCountry diedCountryCode diedCity \
86 US New York NY USA US Los Angeles CA
gender year category overallMotivation share \
86 male 1965 physics NaN 3
motivation \
86 "for their fundamental work in quantum electro...
name city country
86 California Institute of Technology (Caltech) Pasadena CA USA
# Award year: 1965
86 1965
Name: year, dtype: int64
# The boolean mask is True at index 86:
True
# Return only the year:
86 1965
Name: year, dtype: int64
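The three lookups above all rest on the same idea: comparing a column with == yields a boolean Series, which then selects rows. A minimal sketch, assuming a tiny frame built from the first three rows shown by head() (only two of the columns are reproduced here):

```python
import pandas as pd

# Miniature frame using surnames and years from the head() output above.
df = pd.DataFrame(
    {"surname": ["Röntgen", "Lorentz", "Zeeman"], "year": [1901, 1902, 1902]}
)

# The comparison produces one True/False value per row.
mask = df["surname"] == "Zeeman"
print(mask.tolist())  # [False, False, True]

# Indexing the mask by label returns the flag for that single row.
print(mask[2])  # True

# .loc takes the mask as the row selector and a column name as the second argument.
years = df.loc[mask, "year"]
print(years.tolist())  # [1902]
```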
# The exact-match search for "Kip" returns nothing:
Empty DataFrame
Columns: [id, firstname, surname, born, died, bornCountry, bornCountryCode, bornCity, diedCountry, diedCountryCode, diedCity, gender, year, category, overallMotivation, share, motivation, name, city, country]
Index: []
# Changing the search to "Kip S." returns the award information for Kip Thorne (1940-, gravitational-wave observatory LIGO):
id firstname surname born died bornCountry \
916 943 Kip S. Thorne 1940-06-01 0000-00-00 USA
bornCountryCode bornCity diedCountry diedCountryCode diedCity gender \
916 US Logan UT NaN NaN NaN male
year category overallMotivation share \
916 2017 physics NaN 4
motivation \
916 "for decisive contributions to the LIGO detect...
name city country
916 LIGO/VIRGO Collaboration NaN NaN
# Award year for Kip S. Thorne:
916 2017
Name: year, dtype: int64
# Substring search for entries whose firstname contains "Kip"; it returns the row at index 916:
id firstname surname born died bornCountry \
916 943 Kip S. Thorne 1940-06-01 0000-00-00 USA
bornCountryCode bornCity diedCountry diedCountryCode diedCity gender \
916 US Logan UT NaN NaN NaN male
year category overallMotivation share \
916 2017 physics NaN 4
motivation \
916 "for decisive contributions to the LIGO detect...
name city country
916 LIGO/VIRGO Collaboration NaN NaN
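The difference between the empty result for "Kip" and the hit for "Kip S." is that == tests the whole string, while str.contains tests for a substring. A minimal sketch, assuming a three-row firstname column assembled from names that appear in the output above:

```python
import pandas as pd

# Firstnames taken from the output above; the frame itself is illustrative.
df = pd.DataFrame({"firstname": ["Wilhelm Conrad", "Hendrik A.", "Kip S."]})

# Equality requires the entire value to be "Kip", so nothing matches.
exact = df[df["firstname"] == "Kip"]
print(len(exact))  # 0

# str.contains matches anywhere inside the value.
partial = df[df["firstname"].str.contains("Kip")]
print(partial["firstname"].tolist())  # ['Kip S.']
```

Note that str.contains treats its pattern as a regular expression by default; passing regex=False makes it a literal substring test, which matters once the search text contains characters such as "." or "(".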
# Error: because the surname column contains NaN (missing) entries, the mask is not purely boolean and a ValueError is raised:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a3d47195d041> in <module>
      7 print(nobel.loc[nobel["firstname"] == "Kip S."].year)
      8 print(nobel.loc[nobel["firstname"].str.contains("Kip")])
----> 9 print(nobel.loc[nobel["surname"].str.contains("Feynman")])

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    877
    878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
    880
    881 def _is_scalar_access(self, key: Tuple):

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1087 self._validate_key(key, axis)
   1088 return self._get_slice_axis(key, axis=axis)
-> 1089 elif com.is_bool_indexer(key):
   1090 return self._getbool_axis(key, axis=axis)
   1091 elif is_list_like_indexer(key):

~/PycharmProjects/data_science/.venv/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
    132 na_msg = "Cannot mask with non-boolean array containing NA / NaN values"
    133 if isna(key).any():
--> 134 raise ValueError(na_msg)
    135 return False
    136 return True

ValueError: Cannot mask with non-boolean array containing NA / NaN values
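One common way around this ValueError is the na parameter of str.contains, which sets the mask value used for missing entries. The following is a minimal sketch, not necessarily the approach the book takes next; the frame is a made-up stand-in in which the third row (name and year) is hypothetical and exists only to supply a missing surname:

```python
import numpy as np
import pandas as pd

# Illustrative frame: the third row is hypothetical and has no surname.
df = pd.DataFrame(
    {
        "firstname": ["Richard P.", "Kip S.", "Some Organization"],
        "surname": ["Feynman", "Thorne", np.nan],
        "year": [1965, 2017, 2000],
    }
)

# na=False makes missing surnames count as "no match",
# so the mask contains only True/False and can be used with .loc.
mask = df["surname"].str.contains("Feynman", na=False)
print(mask.tolist())  # [True, False, False]

print(df.loc[mask, "year"].tolist())  # [1965]
```

Applied to the real data, the same idea would read nobel.loc[nobel["surname"].str.contains("Feynman", na=False)]. The firstname search did not fail because, per the info() output, that column has no missing values.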
References
Michael Hartl, Learn Enough Python to Be Dangerous: Software Development, Flask Web Apps, and Beginning Data Science with Python. Boston: Pearson, 2023.