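Notes on Learning Python Data Analysis in 4 Hours (Part 3)
Continued from: Notes on Learning Python Data Analysis in 4 Hours (Part 2)
Preface
These are my notes on the Python data analysis course from freecodecamp.org and RMOTR; the full video runs 4 hours 22 minutes. Text in italics or (parentheses) is my own thinking or additions; everything else is translated from the course. The notes are posted in several parts. Please credit the source when reposting.
Original video: https://www.youtube.com/watch?v=r-uOLxNrNk8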
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)
Doing data cleaning recently reminded me once again how important pandas is, so next up is the pandas Series.
Pandas Series
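Series is one of the two main data types in pandas (I mostly work with DataFrames and have hardly used Series).
Series basics
Create a Series. It looks a lot like a Python list, but the dtype: float64 in the output shows that the values are stored with a NumPy data type.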
import numpy as np
import pandas as pd

# in millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop
0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
dtype: float64
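To make the difference from a list concrete, here is a minimal sketch. The names `values`, `as_list`, `as_series`, and `underlying` are illustrative and not from the course. It shows that the Series wraps a typed NumPy array, and that multiplication behaves differently from a plain list.

```python
import numpy as np
import pandas as pd

values = [35.467, 63.951, 80.940]
as_list = values
as_series = pd.Series(values)

# The Series keeps its data in a typed NumPy array.
underlying = as_series.to_numpy()
print(type(underlying))
print(as_series.dtype)

# Multiplying a list repeats it; multiplying a Series scales each element.
print(len(as_list * 2))          # 6 items: the list was repeated
print((as_series * 2).tolist())  # 3 items: each value was doubled
```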
Next, give the Series a name.
g7_pop.name = 'G7 Population in millions'
g7_pop
0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
Name: G7 Population in millions, dtype: float64
Inspect the Series' dtype, values, and index.
g7_pop.dtype
dtype('float64')
g7_pop.values
array([ 35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523])
g7_pop.index
RangeIndex(start=0, stop=7, step=1)
Select the value at a specific index.
g7_pop[2]
80.94
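At this point a pandas Series may not seem very different from a Python list. That is only because we did not specify an index when creating it, so it defaulted to integers starting at 0, just like list positions. The index can in fact be set to anything we choose.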
g7_pop.index = [
'Canada',
'France',
'Germany',
'Italy',
'Japan',
'United Kingdom',
'United States'
]
g7_pop
Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64
From here on we no longer need to remember position numbers.
g7_pop['Canada']
35.467
g7_pop[['Canada', 'Japan']]
Canada 35.467
Japan 127.061
Name: G7 Population in millions, dtype: float64
g7_pop['Canada': 'Japan'] # Japan is included as well
Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
Name: G7 Population in millions, dtype: float64
How do we specify the index when the Series is first created? There are two ways.
pd.Series({
'Canada': 35.467,
'France': 63.951,
'Germany': 80.940,
'Italy': 60.665,
'Japan': 127.061,
'United Kingdom': 64.511,
'United States': 318.523
}, name = 'G7 Population in millions')
pd.Series(
[35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
name = 'G7 Population in millions'
)
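As a quick check that the two constructions give the same result, here is a small sketch on a three-country subset. The variable names `data`, `from_dict`, and `from_lists` are illustrative. `Series.equals` compares index and values; it does not compare the name.

```python
import pandas as pd

data = {'Canada': 35.467, 'France': 63.951, 'Germany': 80.940}

# Way 1: a dict, whose keys become the index.
from_dict = pd.Series(data, name='G7 Population in millions')

# Way 2: separate values and index lists.
from_lists = pd.Series(
    list(data.values()),
    index=list(data.keys()),
    name='G7 Population in millions',
)

print(from_dict.equals(from_lists))
```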
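You can build a new Series from an existing one by passing the index labels you want. Labels missing from the original, such as Spain here, get NaN.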
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])
France 63.951
Germany 80.940
Italy 60.665
Spain NaN
Name: G7 Population in millions, dtype: float64
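The course uses the constructor for this. As far as I can tell, `Series.reindex` gives the same result and states the intent more directly. A minimal sketch with illustrative names (`pop`, `wanted`):

```python
import pandas as pd

pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})
wanted = ['France', 'Italy', 'Spain']  # Spain is not in pop

via_constructor = pd.Series(pop, index=wanted)
via_reindex = pop.reindex(wanted)

# Both keep the requested labels and fill the missing one with NaN.
print(via_reindex)
print(via_constructor.equals(via_reindex))
```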
Even though the index is no longer numeric, we can still look up values by position.
g7_pop.iloc[0]
35.467
g7_pop.iloc[[0, 2]]
Canada 35.467
Germany 80.940
Name: G7 Population in millions, dtype: float64
g7_pop.iloc[0: 3]
Canada 35.467
France 63.951
Germany 80.940
Name: G7 Population in millions, dtype: float64
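One detail worth spelling out: a label slice includes its end label, while a positional slice excludes its end position, as with a Python list. The sketch below uses the explicit `.loc` and `.iloc` accessors on an illustrative four-country Series called `pop`.

```python
import pandas as pd

# Illustrative four-country subset of the population data, in millions.
pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# Label-based slice: the end label 'Germany' is included.
by_label = pop.loc['Canada':'Germany']

# Position-based slice: position 2 (Germany) is excluded, as with a list.
by_position = pop.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```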
Series operations and transformations
Basic arithmetic:
g7_pop * 1_000_000
Canada 35467000.0
France 63951000.0
Germany 80940000.0
Italy 60665000.0
Japan 127061000.0
United Kingdom 64511000.0
United States 318523000.0
Name: G7 Population in millions, dtype: float64
np.log(g7_pop)
Canada 3.568603
France 4.158117
Germany 4.393708
Italy 4.105367
Japan 4.844667
United Kingdom 4.166836
United States 5.763695
Name: G7 Population in millions, dtype: float64
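These operations are applied element by element and keep the index labels. The following sketch (names `pop`, `vectorized`, and `looped` are illustrative) compares `np.log` on a Series with an explicit Python loop to show they compute the same values.

```python
import math

import numpy as np
import pandas as pd

pop = pd.Series([35.467, 63.951], index=['Canada', 'France'])

# Vectorized: NumPy applies log to every element and returns a Series.
vectorized = np.log(pop)

# Equivalent explicit loop, rebuilt into a Series with the same index.
looped = pd.Series([math.log(v) for v in pop], index=pop.index)

print(isinstance(vectorized, pd.Series))
print(np.allclose(vectorized, looped))
```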
Boolean operations:
g7_pop > 70
Canada False
France False
Germany True
Italy False
Japan True
United Kingdom False
United States True
Name: G7 Population in millions, dtype: bool
g7_pop[g7_pop > 70]
Germany 80.940
Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64
g7_pop[g7_pop > g7_pop.mean()]
Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64
g7_pop[(g7_pop < g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
# A deliberately more complex example to show the operators working together
Canada 35.467
United States 318.523
Name: G7 Population in millions, dtype: float64
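Boolean masks are combined with `&` (and), `|` (or), and `~` (not) rather than Python's `and`, `or`, and `not`, and each comparison needs its own parentheses. A small sketch on an illustrative four-country Series:

```python
import pandas as pd

pop = pd.Series({
    'Canada': 35.467,
    'Germany': 80.940,
    'Japan': 127.061,
    'United States': 318.523,
})

# AND: both conditions must hold for an element to be kept.
between = pop[(pop > 70) & (pop < 200)]

# NOT: invert a mask with ~.
not_large = pop[~(pop > 70)]

print(list(between.index))
print(list(not_large.index))
```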
Modifying Series values:
g7_pop['Canada'] = 40.5
g7_pop
Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64
g7_pop.iloc[-1] = 500
g7_pop
Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 500.000
Name: G7 Population in millions, dtype: float64
g7_pop[g7_pop < 70] = 99.9
g7_pop
Canada 99.900
France 99.900
Germany 80.940
Italy 99.900
Japan 127.061
United Kingdom 99.900
United States 500.000
Name: G7 Population in millions, dtype: float64
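All of these assignments change the Series in place. If the original values are still needed, one option (my addition, not from the course) is to work on a copy. The names `pop` and `adjusted` are illustrative.

```python
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'Germany': 80.940, 'Japan': 127.061})

# Work on a copy so the original values survive the assignment.
adjusted = pop.copy()
adjusted[adjusted < 70] = 99.9

print(pop['Canada'])       # original value is unchanged
print(adjusted['Canada'])  # the copy holds the new value
```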
Summary
This part covered the basic usage of pandas Series.