Notes on a 4-Hour Introduction to Python Data Analysis (Part 3): Data Analysis with Python (Numpy, Pandas, Matplotlib, Seaborn)

Continued from: Notes on a 4-Hour Introduction to Python Data Analysis (Part 2)

Preface

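These are my notes on the Python data analysis course by freecodecamp.org and RMOTR; the full video runs 4 hours 22 minutes. Italic text and (parenthesized) text are my own thoughts or additions; everything else follows the course. The notes are published in several parts. Please credit the source when reposting.
Original video: https://www.youtube.com/watch?v=r-uOLxNrNk8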
Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)

Some recent data-cleaning work reminded me yet again how important pandas is, so this part introduces the pandas Series.

Pandas Series

Series is one of the two core data types in pandas (I mostly work with DataFrames and have rarely used a Series on its own).

Series basics

Let's create a Series. It looks a lot like a Python list, but dtype: float64 tells us the values are stored as a NumPy data type.

import numpy as np
import pandas as pd

# in millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
dtype: float64
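
(A minimal sketch, not from the video, to make the "stored as a NumPy type" point concrete. The variable names are my own, and only the first three figures are reused to keep it short. It shows that the data behind a Series is a single typed NumPy array, and that mixing integers and floats is upcast to float64.)

```python
import numpy as np
import pandas as pd

values = [35.467, 63.951, 80.940]  # first three figures from the example above

as_series = pd.Series(values)

# A Python list holds arbitrary objects; a Series wraps one typed NumPy array.
backing = as_series.to_numpy()
print(type(backing))    # <class 'numpy.ndarray'>
print(as_series.dtype)  # float64

# Mixing ints and floats does not produce a mixed container:
# everything is upcast to a single dtype.
mixed = pd.Series([1, 2.5])
print(mixed.dtype)      # float64
```

Because every element shares one dtype, arithmetic on a Series can be carried out on the whole array at once rather than element by element.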

Next, let's give the Series a name.

g7_pop.name = 'G7 Population in millions'
g7_pop

0 35.467
1 63.951
2 80.940
3 60.665
4 127.061
5 64.511
6 318.523
Name: G7 Population in millions, dtype: float64

Inspect the dtype, values, and index of the Series.

g7_pop.dtype

dtype('float64')

g7_pop.values

array([ 35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523])

g7_pop.index

RangeIndex(start=0, stop=7, step=1)

Select the value at a specific index.

g7_pop[2]

80.94

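At this point a pandas Series may not seem very different from a Python list. That is only because we did not specify an index when creating it, so the index defaulted to integers starting at 0, just like a list. The index can in fact be set explicitly.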

g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States'
]
g7_pop

Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64

From now on we no longer need to remember which number belongs to which country.

g7_pop['Canada']

35.467

g7_pop[['Canada', 'Japan']]

Canada 35.467
Japan 127.061
Name: G7 Population in millions, dtype: float64

g7_pop['Canada': 'Japan'] # Japan is included too

Canada 35.467
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
Name: G7 Population in millions, dtype: float64
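
(A small sketch of my own to contrast the two kinds of slicing. It assumes a shortened five-country copy of the data called pop, which is not a name used in the course. Label slices include the end label, while position slices exclude the end position, as with Python lists.)

```python
import pandas as pd

# Hypothetical shortened copy of the G7 data, for illustration only.
pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan'],
)

by_label = pop.loc['Canada':'Germany']  # end label is included
by_position = pop.iloc[0:2]             # end position is excluded

print(list(by_label.index))     # ['Canada', 'France', 'Germany']
print(list(by_position.index))  # ['Canada', 'France']
```

Writing .loc and .iloc explicitly makes it obvious which rule applies, so it is a reasonable habit even where plain brackets would work.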

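How do we specify the index right from the start? There are two ways: pass a dict, or pass the values together with an index argument.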

pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name = 'G7 Population in millions')

pd.Series(
    [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
    index = ['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
    name = 'G7 Population in millions'
)
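
(A quick check of my own, not from the video, that the two constructions produce the same result. The names countries, figures, from_dict and from_list are made up for this sketch, and only three countries are used.)

```python
import pandas as pd

countries = ['Canada', 'France', 'Germany']
figures = [35.467, 63.951, 80.940]

# Way 1: a dict, where keys become the index and values become the data.
from_dict = pd.Series(dict(zip(countries, figures)), name='G7 Population in millions')

# Way 2: a list of values plus an explicit index.
from_list = pd.Series(figures, index=countries, name='G7 Population in millions')

# Series.equals compares both the labels and the values.
print(from_dict.equals(from_list))  # True
```

The dict form is handy when the data already arrives as key/value pairs; the list form fits better when the labels and values come from separate sources.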

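You can also build a new Series from an existing one by passing an index. Labels that do not exist in the original (Spain here) get NaN.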

pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France 63.951
Germany 80.940
Italy 60.665
Spain NaN
Name: G7 Population in millions, dtype: float64
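
(A sketch of my own showing what I believe is the equivalent, more explicit spelling with reindex, plus how to detect and drop the resulting NaN. The Series name pop and the three-country data are assumptions for this example.)

```python
import pandas as pd

pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})

# reindex keeps the requested labels; unknown labels are filled with NaN.
subset = pop.reindex(['France', 'Italy', 'Spain'])

print(subset.isna().tolist())  # [False, False, True]

# Drop the labels that had no data in the original Series.
known = subset.dropna()
print(list(known.index))       # ['France', 'Italy']
```

Checking isna() right after selecting by label is a cheap way to catch misspelled or missing labels before they propagate into later calculations.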

Although the index is no longer numeric, we can still look values up by position using iloc.

g7_pop.iloc[0]

35.467

g7_pop.iloc[[0, 2]]

Canada 35.467
Germany 80.940
Name: G7 Population in millions, dtype: float64

g7_pop.iloc[0: 3]

Canada 35.467
France 63.951
Germany 80.940
Name: G7 Population in millions, dtype: float64
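
(An illustration of my own of why loc and iloc are kept separate. The Series s below is hypothetical: its integer labels are deliberately out of order, so label and position disagree.)

```python
import pandas as pd

# Integer labels that do not match the positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

print(s.loc[0])   # 20 -> the row whose label is 0
print(s.iloc[0])  # 10 -> the first row by position
```

With string labels the difference is obvious, but with integer labels (for example after filtering or sorting a Series) the two can silently diverge, so it pays to be explicit.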

Series operations and transformations

Basic arithmetic (operations apply to every element at once):

g7_pop * 1_000_000

Canada 35467000.0
France 63951000.0
Germany 80940000.0
Italy 60665000.0
Japan 127061000.0
United Kingdom 64511000.0
United States 318523000.0
Name: G7 Population in millions, dtype: float64

np.log(g7_pop)

Canada 3.568603
France 4.158117
Germany 4.393708
Italy 4.105367
Japan 4.844667
United Kingdom 4.166836
United States 5.763695
Name: G7 Population in millions, dtype: float64
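
(A sketch of my own comparing the vectorized call with an explicit Python loop, to show they compute the same thing. The two-country Series pop and the names vectorized and looped are assumptions for this example.)

```python
import math

import numpy as np
import pandas as pd

pop = pd.Series({'Canada': 35.467, 'France': 63.951})

# Vectorized: NumPy applies log to every element and keeps the labels.
vectorized = np.log(pop)

# Equivalent explicit loop over (label, value) pairs.
looped = pd.Series({country: math.log(value) for country, value in pop.items()})

print(np.allclose(vectorized, looped))  # True
print(list(vectorized.index))           # ['Canada', 'France']
```

The vectorized form is shorter and avoids a Python-level loop; the loop version is only here to spell out what happens per element.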

Boolean operations (a comparison returns a boolean Series, which can then be used as a filter):

g7_pop > 70

Canada False
France False
Germany True
Italy False
Japan True
United Kingdom False
United States True
Name: G7 Population in millions, dtype: bool

g7_pop[g7_pop > 70]

Germany 80.940
Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop[g7_pop > g7_pop.mean()]

Japan 127.061
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop[(g7_pop < g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]
# A deliberately more complex example to show how the operators combine

Canada 35.467
United States 318.523
Name: G7 Population in millions, dtype: float64
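
(A smaller sketch of my own showing how masks combine. The four-country Series pop and the thresholds 70 and 200 are arbitrary choices for illustration. Note that each comparison needs its own parentheses, because | and & bind more tightly than < and >.)

```python
import pandas as pd

pop = pd.Series({
    'Canada': 35.467,
    'Germany': 80.940,
    'Japan': 127.061,
    'United States': 318.523,
})

small = pop < 70    # boolean Series
large = pop > 200   # boolean Series

either = pop[small | large]     # OR: small or large
middle = pop[~small & ~large]   # NOT + AND: neither small nor large

# The same "middle" filter written inline; the parentheses are required.
between = pop[(pop >= 70) & (pop <= 200)]

print(list(either.index))      # ['Canada', 'United States']
print(list(middle.index))      # ['Germany', 'Japan']
print(middle.equals(between))  # True
```

Naming the intermediate masks, as with small and large here, tends to be easier to read than one long bracketed expression.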

Modifying Series values (by label, by position, or with a boolean mask):

g7_pop['Canada'] = 40.5
g7_pop

Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 318.523
Name: G7 Population in millions, dtype: float64

g7_pop.iloc[-1] = 500
g7_pop

Canada 40.500
France 63.951
Germany 80.940
Italy 60.665
Japan 127.061
United Kingdom 64.511
United States 500.000
Name: G7 Population in millions, dtype: float64

g7_pop[g7_pop < 70] = 99.9
g7_pop

Canada 99.900
France 99.900
Germany 80.940
Italy 99.900
Japan 127.061
United Kingdom 99.900
United States 500.000
Name: G7 Population in millions, dtype: float64
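
(A precaution of my own, not from the video: the assignments above change g7_pop in place, so the original figures are gone afterwards. A minimal sketch, with made-up names original and edited, of working on a copy instead.)

```python
import pandas as pd

original = pd.Series({'Canada': 35.467, 'France': 63.951, 'Germany': 80.940})

# copy() gives an independent Series, so edits do not touch the original.
edited = original.copy()
edited[edited < 70] = 99.9

print(original['Canada'])  # 35.467
print(edited['Canada'])    # 99.9
print(edited['Germany'])   # 80.94
```

Keeping the untouched original around makes it easy to compare before and after, or to rerun a cleaning step with different thresholds.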

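Summary

This part covered the basics of the pandas Series: creating one, naming it, indexing by label and by position, arithmetic, boolean filtering, and modifying values.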

Previous: Notes on a 4-Hour Introduction to Python Data Analysis (Part 2)
Next: Notes on a 4-Hour Introduction to Python Data Analysis (Part 4)
