Learning pandas 1: An Introduction to the pandas Series Data Structure

魏晋小子

Article link: https://blog.csdn.net/weixin_43684951/article/details/88690364

These are my study notes from reading the book Python for Data Analysis (2nd Edition).

import pandas as pd
import numpy as np

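1. Creating a Series object

# A Series is a one-dimensional, array-like object
# It pairs data labels (the index) with the data values
# To me it feels somewhat like a Python dict
# If no index is given, it defaults to the integers 0 through length - 1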
obj = pd.Series([4, 7, -5, -3])
obj

0    4
1    7
2   -5
3   -3
dtype: int64

2. Using the values and index attributes to get a Series object's values and its index

# The values attribute returns the values of the Series
obj.values

array([ 4,  7, -5, -3], dtype=int64)

# The index attribute returns the index of the Series
obj.index

RangeIndex(start=0, stop=4, step=1)
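
The two attributes above can be sketched together as follows. The variable names `values` and `labels` are my own, not from the book. As far as I know, the pandas documentation now suggests `to_numpy()` over `.values`; for a plain integer Series like this one, both return the same NumPy array.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, -3])

# to_numpy() returns the data as a NumPy array, like obj.values does here
values = obj.to_numpy()

# The default RangeIndex can be expanded into an ordinary Python list
labels = obj.index.tolist()

print(type(values))
print(labels)  # [0, 1, 2, 3]
```

A RangeIndex stores only start, stop, and step, so expanding it with `tolist()` is only worthwhile when the explicit labels are actually needed.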

3. Creating a Series with a custom index

# Use the index parameter to define the index labels
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')

4. Getting or modifying Series values through the index

obj2["a"]

-5

obj2["d"] = 6

obj2

d    6
b    7
a   -5
c    3
dtype: int64

obj2[["a", "c", "d"]]

a   -5
c    3
d    6
dtype: int64
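
Plain `[]` indexing is unambiguous here because the labels are strings. Below is a minimal sketch, assuming the same `obj2` as above (after `obj2["d"] = 6`), of the explicit accessors `.loc` (select by label) and `.iloc` (select by position). The names `by_label`, `by_position`, and `subset` are only illustrative.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

by_label = obj2.loc["a"]            # value stored under the label "a"
by_position = obj2.iloc[0]          # first element, whatever its label is
subset = obj2.loc[["a", "c", "d"]]  # several labels, in the requested order

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` keeps the intent clear when the index itself contains integers, where `obj[0]` could be read either as a label or as a position.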

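5. Boolean filtering, scalar multiplication, and math functions

All of the operations below preserve the link between each index label and its value (even when the value itself changes).

obj2[obj2 > 0]
# Keeps only the elements greater than 0, so the negative value is dropped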

d    6
b    7
c    3
dtype: int64

obj2 * 2
# Multiplies every value in the Series by 2; the index is preserved

d    12
b    14
a   -10
c     6
dtype: int64

np.exp(obj2)
# Applies the natural exponential function to every value in the Series; the index is preserved

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

"b" in obj2
# Checks whether "b" is one of the index labels of obj2

True

obj2

d    6
b    7
a   -5
c    3
dtype: int64

6 in obj2.values
# Checks whether 6 is one of the values of obj2
# Likewise, "b" in obj2 is equivalent to "b" in obj2.index
# Seen this way, a Series behaves even more like a dict

True
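
The dict-like membership behaviour can be checked side by side. This is a small sketch, assuming the same `obj2` as above; the variable names are mine. It shows that `in` on the Series itself looks only at the index labels, so a value has to be tested through `.values` or `isin()`.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=["d", "b", "a", "c"])

label_hit = "b" in obj2        # membership test against the index labels
value_as_label = 6 in obj2     # 6 is a value, not a label
value_hit = 6 in obj2.values   # membership test against the values

# isin() is an element-wise alternative for testing values
value_hit_isin = bool(obj2.isin([6]).any())

print(label_hit, value_as_label, value_hit, value_hit_isin)
```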

6. Creating a Series from a dict

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

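# Note that the index order of obj3 matches the key order of sdata (dicts preserve insertion order since Python 3.7, and the Series keeps that order)
# So passing the index labels in the desired order when creating a Series produces a Series in that order; any order can be passed
key_list = list(sdata.keys())
key_list_sorted = sorted(key_list)  # the keys are now sorted
obj4 = pd.Series(sdata, index=key_list_sorted)  # the index of obj4 is in sorted order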
obj4

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
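
Passing a sorted key list is one way to get an ordered index. As an alternative sketch (not the approach used in these notes), `sort_index()` and `sort_values()` reorder an existing Series without rebuilding it. The names `by_label` and `by_value` are only illustrative.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Sort by index label; this returns a new Series and leaves obj3 unchanged
by_label = obj3.sort_index()

# Sort by value, largest first
by_value = obj3.sort_values(ascending=False)

print(by_label)
print(by_value)
```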

7. Missing data

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj5 = pd.Series(sdata, index=states)
obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

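# Positions with no matching data are filled with NaN (not a number) by default
# pandas uses NaN to mark missing (NA) values; note that 1/0 does not produce NaN (plain Python raises ZeroDivisionError), whereas a floating-point 0/0 in NumPy does
# Use isnull() and notnull() to detect missing data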
pd.isnull(obj5)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

pd.notnull(obj5)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

obj5.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

obj5.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
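
Once missing values are detected, the usual next step is to count, fill, or drop them. Below is a minimal sketch using the same `sdata` and `states` as above. The choice of 0 as the fill value is only an assumption for illustration, and `isna()` is simply another name for `isnull()`.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj5 = pd.Series(sdata, index=states)

n_missing = int(obj5.isna().sum())  # number of NaN entries
filled = obj5.fillna(0)             # replace NaN with 0 (illustrative choice)
dropped = obj5.dropna()             # remove the NaN entries entirely

print(n_missing)
print(filled)
print(dropped)
```

Both `fillna()` and `dropna()` return new Series, so `obj5` itself still contains the NaN afterwards.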

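8. Simple operations between two Series objects

# More useful operations between multiple Series objects will be summarized in later notes; this is only a first taste
# One of the most important features of a Series is that arithmetic automatically aligns data by index label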
obj4

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

obj5

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

obj6 = obj4 + obj5
obj6
# obj6 contains every index label from both obj4 and obj5; a label that exists in only one of them gets NaN after the operation
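
The alignment behaviour can be reproduced in a self-contained sketch, rebuilding `obj4` and `obj5` from the same data. The second half goes beyond the notes above: `Series.add` with `fill_value=0` treats a label missing on one side as 0, and the name `obj7` is my own.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=sorted(sdata))
obj5 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Values are matched by label; labels without a partner value become NaN
obj6 = obj4 + obj5

# fill_value=0 substitutes 0 when only one side is missing for a label
obj7 = obj4.add(obj5, fill_value=0)

print(obj6)
print(obj7)
```

In `obj7`, Utah keeps its value from `obj4`, while California stays NaN because it has no value on either side.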

9. The name attribute

obj5.name = "population"
obj5.index.name = "state"
obj5

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

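# Note that obj5.values.name = "million" cannot be used to give the values a name; it raises AttributeError
# My understanding of the reason:
# the index attribute returns an Index object, a pandas type that, like Series, has a name attribute
# whereas the values attribute returns a NumPy ndarray, which has no name attribute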
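
The note above can be verified with a short sketch. The `try`/`except` wrapper and the names `error` and `renamed` are my additions; `rename()` with a scalar argument is shown as one way to get a differently named copy of the Series.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj5 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])
obj5.name = "population"
obj5.index.name = "state"

error = None
try:
    # values is a NumPy ndarray, which does not accept a name attribute
    obj5.values.name = "million"
except AttributeError as exc:
    error = exc

# rename() with a scalar returns a copy with a new name; obj5 keeps its own
renamed = obj5.rename("population_millions")

print(type(obj5.index).__name__, type(obj5.values).__name__)
print(error)
print(renamed.name, obj5.name)
```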