Python Data Analysis in 4 Hours: Beginner Notes (Part 3) Data Analysis with Python (Numpy, Pandas, Matplotlib, Seaborn)

Alisonyao

Tags: data analysis, artificial intelligence, python

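Python Data Analysis in 4 Hours: Beginner Notes (Part 3)

Continued from: Python Data Analysis in 4 Hours: Beginner Notes (Part 2)
Preface
Pandas Series
- Series basics
- Series operations / transformations
Summary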

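Continued from: Python Data Analysis in 4 Hours: Beginner Notes (Part 2)

Preface

These are my notes on the freecodecamp.org + RMOTR Python data analysis course; the full video runs 4 hours 22 minutes. Italic text and (parenthesized) passages are my own thoughts or additions; everything else is translated from the course. The notes are posted in several installments. Please credit the source when reposting.
Original video: https://www.youtube.com/watch?v=r-uOLxNrNk8
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)

Some recent data cleaning work reminded me once again how important pandas is, so this part introduces the pandas Series.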

Pandas Series

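Series is one of the two core pandas data types (I mostly work with DataFrames and have rarely used Series directly).

Series basics

Create a Series. It looks a lot like a Python list, but the dtype: float64 line shows that the values are stored as a NumPy data type.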

import numpy as np
import pandas as pd
# G7 populations, in millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
dtype: float64
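To make the difference from a list concrete, here is a minimal sketch. The variable names (`as_list`, `as_series`, `backing`) are only illustrative and do not come from the course. It checks that the values sit in a typed NumPy array and that arithmetic behaves differently from a plain list.

```python
import numpy as np
import pandas as pd

values = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]
as_list = values               # plain Python list of float objects
as_series = pd.Series(values)  # values stored in a typed NumPy array

backing = as_series.to_numpy()
print(type(backing))   # <class 'numpy.ndarray'>
print(backing.dtype)   # float64

# Multiplying a list repeats it; it does not touch the elements.
print(len(as_list * 2))    # 14
# Multiplying a Series applies the operation to every element.
print(len(as_series * 2))  # 7
```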

Next, give the Series a name.

g7_pop.name = 'G7 Population in millions'
g7_pop

0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
Name: G7 Population in millions, dtype: float64

Inspect the Series's dtype, values, and index.

g7_pop.dtype

dtype('float64')

g7_pop.values

array([ 35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523])

g7_pop.index

RangeIndex(start=0, stop=7, step=1)

Select the value at a specific index.

g7_pop[2]

80.94

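At this point a pandas Series may not seem very different from a Python list. That is only because we did not specify an index when creating the Series, so it defaulted to counting from 0, just like a list. The index can in fact be set explicitly.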

g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]
g7_pop

Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64

From here on we no longer need to remember which position holds which country.

g7_pop['Canada']

35.467

g7_pop[['Canada', 'Japan']]

Canada 35.467
Japan 127.061
Name: G7 Population in millions, dtype: float64

g7_pop['Canada':'Japan']  # the end label, Japan, is included

Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
Name: G7 Population in millions, dtype: float64
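The same three label-based selections can also be written with `.loc`, which states explicitly that the keys are labels. This is a small sketch rather than part of the course notebook; the names `single`, `several`, and `span` are made up for illustration. Note that a label slice includes its end label, unlike a Python list slice.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

single = g7_pop.loc['Canada']              # one label -> scalar
several = g7_pop.loc[['Canada', 'Japan']]  # list of labels -> Series
span = g7_pop.loc['Canada':'Japan']        # label slice, end label included

print(single)
print(list(several.index))
print(list(span.index))
```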

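How do we specify the index when the Series is first created? There are two ways: pass a dict, or pass a list of values together with index=.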

pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name = 'G7 Population in millions')

pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
    name = 'G7 Population in millions'
)
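As a quick check that the two constructions give the same result, the sketch below builds both and compares them with `Series.equals`. The helper names `labels`, `values`, `from_dict`, and `from_list` are illustrative only.

```python
import pandas as pd

labels = ['Canada', 'France', 'Germany', 'Italy', 'Japan',
          'United Kingdom', 'United States']
values = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]

# Way 1: a dict, whose keys become the index (insertion order is kept).
from_dict = pd.Series(dict(zip(labels, values)),
                      name='G7 Population in millions')

# Way 2: a list of values plus an explicit index.
from_list = pd.Series(values, index=labels,
                      name='G7 Population in millions')

# equals() compares the values and the index.
print(from_dict.equals(from_list))
```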

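You can also build a new Series from an existing one by passing the index labels you want. A label that does not exist in the original (Spain here) gets NaN.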

pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France 63.951
Germany 80.940
Italy 60.665
Spain NaN
Name: G7 Population in millions, dtype: float64
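To my understanding, passing a Series plus `index=` to the constructor behaves like `Series.reindex`, so the sketch below compares the two and then shows one way to find or drop the missing entry. The names `subset` and `reindexed` are illustrative and not from the course.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)
wanted = ['France', 'Germany', 'Italy', 'Spain']

subset = pd.Series(g7_pop, index=wanted)  # form used in the notes
reindexed = g7_pop.reindex(wanted)        # explicit reindex

print(subset.equals(reindexed))      # equals() treats matching NaNs as equal
print(subset.isna())                 # True only for the missing label
print(list(subset.dropna().index))   # labels that actually had data
```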

Although the index is no longer numeric, we can still look values up by position with iloc.

g7_pop.iloc[0]

35.467

g7_pop.iloc[[0, 2]]

Canada 35.467
Germany 80.940
Name: G7 Population in millions, dtype: float64

g7_pop.iloc[0: 3]

Canada 35.467
France 63.951
Germany 80.940
Name: G7 Population in millions, dtype: float64
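Positional slicing with `.iloc` follows Python list rules: the end position is excluded and negative positions count from the end. The sketch below, with illustrative names, contrasts that with the label slice that selects the same three rows.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

first_three = g7_pop.iloc[0:3]             # positions 0, 1, 2 (3 is excluded)
last = g7_pop.iloc[-1]                     # negative positions count from the end
by_label = g7_pop.loc['Canada':'Germany']  # label slice includes 'Germany'

print(list(first_three.index))
print(last)
print(first_three.equals(by_label))
```

Using `.iloc` for positions and `.loc` for labels keeps the intent unambiguous once the index holds names instead of numbers.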

Series operations / transformations

Basic arithmetic:

g7_pop * 1_000_000

Canada 35467000.0
France 63951000.0
Germany 80940000.0
Italy 60665000.0
Japan 127061000.0
United Kingdom 64511000.0
United States 318523000.0
Name: G7 Population in millions, dtype: float64

np.log(g7_pop)

Canada 3.568603
France 4.158117
Germany 4.393708
Italy 4.105367
Japan 4.844667
United Kingdom 4.166836
United States 5.763695
Name: G7 Population in millions, dtype: float64
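These operations are vectorized: they are applied to every element and return a new Series with the same index and name. The following sketch checks this against a hand-written loop; `in_people`, `manual`, and `logged` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

in_people = g7_pop * 1_000_000  # vectorized multiplication

# The same computation written as an explicit per-element loop.
manual = pd.Series({country: value * 1_000_000
                    for country, value in g7_pop.items()})

logged = np.log(g7_pop)  # NumPy ufuncs accept a Series and return a Series

print(np.allclose(in_people.to_numpy(), manual.to_numpy()))
print(isinstance(logged, pd.Series))
print(logged.index.equals(g7_pop.index))
```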

Boolean operations:

g7_pop > 70

Canada False
France False
Germany True
Italy False
Japan True
United Kingdom False
United States True
Name: G7 Population in millions, dtype: bool

g7_pop[g7_pop > 70]

Germany 80.940
Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop[g7_pop > g7_pop.mean()]

Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop[(g7_pop < g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
# a deliberately more complex example, to show several operators combined

Canada 35.467
United States 318.523
Name: G7 Population in millions, dtype: float64
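Element-wise boolean logic on a Series uses `|` (or), `&` (and), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `<` and `>`, and Python's `or` / `and` keywords raise a ValueError on a Series. The sketch below splits the long expression into named steps; the names `mean`, `half_std`, `outside`, and `inside` are illustrative.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

mean = g7_pop.mean()
half_std = g7_pop.std() / 2

# More than half a standard deviation away from the mean, on either side.
outside = (g7_pop < mean - half_std) | (g7_pop > mean + half_std)

# The complement, written two ways: with ~ and with & on the flipped tests.
inside = ~outside
also_inside = (g7_pop >= mean - half_std) & (g7_pop <= mean + half_std)

print(list(g7_pop[outside].index))
print(inside.equals(also_inside))
```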

Modifying values in a Series:

g7_pop['Canada'] = 40.5
g7_pop

Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop.iloc[-1] = 500
g7_pop

Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 500.000
Name: G7 Population in millions, dtype: float64

g7_pop[g7_pop < 70] = 99.9
g7_pop

Canada 99.900
France 99.900
Germany 80.940
Italy 99.900
Japan 127.061
United Kingdom 99.900
United States 500.000
Name: G7 Population in millions, dtype: float64
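All three assignments above change `g7_pop` in place. If the original figures are still needed, one option (my own suggestion, not from the course) is to work on a copy. The sketch below repeats the same three edits on a copy named `updated`, an illustrative name, and confirms the original is untouched.

```python
import pandas as pd

g7_pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan',
           'United Kingdom', 'United States'],
    name='G7 Population in millions',
)

updated = g7_pop.copy()       # independent copy of the data

updated.loc['Canada'] = 40.5  # assign by label
updated.iloc[-1] = 500        # assign by position (last element)
updated[updated < 70] = 99.9  # assign through a boolean mask

print(g7_pop['Canada'])       # original value is unchanged
print(updated.tolist())
```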

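Summary

This part covered the basics of working with a pandas Series: creating one, naming it, setting its index, selecting by label and by position, arithmetic, boolean filtering, and modifying values.

Original video: https://www.youtube.com/watch?v=r-uOLxNrNk8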
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)

Previous: Python Data Analysis in 4 Hours: Beginner Notes (Part 2)
Next: Python Data Analysis in 4 Hours: Beginner Notes (Part 4)
