Pandas - Series、DataFrame、Index

最新推荐文章于 2022-05-09 16:57:15 发布

夕麻

最新推荐文章于 2022-05-09 16:57:15 发布

阅读量295

点赞数 1

分类专栏： Pandas 简单入门文章标签： python

本文链接：https://blog.csdn.net/weixin_45735361/article/details/104240993

版权

Pandas 简单入门专栏收录该内容

6 篇文章 0 订阅

订阅专栏

pandas概述

Numpy主要处理结构化数据，数据量比较小，规则
对于⼤量数据，需要清理的数据，则需要pandas
⼀般使用方法是 import pandas as pd

Series对象

Series对象是⼀个带索引数据构成的⼀维数组。
Series把数组和⼀组索引绑定在⼀起。
如果想获取绑定的内容，分别可以使用values属性和index属性。

Series的创建

Series对象的创建主要分为：

使用数组直接创建,此时索引为默认索引
使用数组创建，显式指定索引
使用字典创建

# 直接使⽤数组进⾏创建
import pandas as pd
data = pd.Series([10, 11, 12, 13,14])
print("完整Series内容是：\n", data)
print("\nSeries的值： \n\n", data.values)
print("\nSeries的索引是： \n", data.index)

完整Series内容是
 0 10
1 11
2 12
3 13
4 14
dtype: int64
Series的值：
 [10 11 12 13 14]
Series的索引是：
 RangeIndex(start=0, stop=5, step=1)

显式制定索引

data_index = pd.Series([10,11,12,13,14], index=[1,2,3,4,5])
print("data_index: \n", data_index)

data_index:
 1 10
2 11
3 12
4 13
5 14
dtype: int64

使用字典进⾏创建

d = {"one":1, "two":2, "three":3, "four":4, "five":5}
d_s = pd.Series(d)
print(d_s)

one 1
two 2
three 3
four 4
five 5
dtype: int64

对于字典，可以通过index来显式筛选内容

d = {4:"four", 5:"five", 1:"one", 3:"trhee", 2:"two"}
d_s = pd.Series(d, index=[2,1,3])
print(d_s)

2 two
1 one
3 trhee
dtype: object

Series对象的简单使用

Series和Numpy的数组很像，他们的区别主要体现在索引上：
Numpy通过隐式索引来对数据进⾏操作
Series通过显式索引来将索引和数值关联
显式索引具有更强大的功能，你可以对其进行修改操作：
Series的values和indexs可以直接使⽤，查看。
请注意他们的类型。

import pandas as pd
data = pd.Series([10,11,12,13,14], index=[1,2,3,9,5])
print("data.values=", data.values)
print(type(data.values))
print()
print("data.indexs=", data.index)
print(type(data.index))

data.values= [10 11 12 13 14]
<class 'numpy.ndarray'>
data.indexs= Int64Index([1, 2, 3, 9, 5], dtype='int64')
<class 'pandas.core.indexes.numeric.Int64Index'>

对于series的对象的访问，完全可以像数组那样使用

# 数据选择使⽤显式索引
print("第⼀个值是：", data[1])
# 隐式索引
print("\n 前三个值是：\n ", data[:9])
print("\n 前三个值是：\n ", data[:4])

第⼀个值是： 10
前三个值是：
 1 10
2 11
3 12
9 13
5 14
dtype: int64
前三个值是：
 1 10
2 11
3 12
9 13
dtype: int64

对于不按顺序排列的index同样可以

# 请注意下例data[:2]是到index=2的那个位置结束，并不是从0开始的下标直到2结束
data = pd.Series([10, 11, 12, 13, 14], index=[54, 32, 2, 1, 9])
print("不连续数字作为索引也是可以的：\n", data)
print("\n data3 =", data[32])
print("\n data[:1] = \n", data[:1])

不连续数字作为索引也是可以的：
 54 10
32 11
2 12
1 13
9 14
dtype: int64
 data3 = 11
 data[:1] =
 54 10
dtype: int64

对于自定义index同样可以使用切片

# 注意此时切片的结束位置，使用显示索引包含结束位置！
d = {"one":1, "two":2, "three":3, "four":4, "five":5}
d_s = pd.Series(d)
print('d_s["four"] = ', d_s["four"])
print('d_s[":four"] = \n', d_s[:"four"])

d_s["four"] = 4
d_s[":four"] =
 one 1
two 2
three 3
four 4
dtype: int64

对数据的选取，Series处理的并不是特别好，在对数据访问的时候，显式索引和隐式索引容易造成混
淆。
为了应对这种混乱，Python为我们提供了三个索引器，用以清晰访问数据：

loc: 只使用显式索引, label based indexing
iloc：只使用隐式索引, positional indexing
ix：前两种索引的混合模式，主要用在DataFrame中, 不推荐
首先我们观察⼀下对Series的访问的显式索引和隐式索引的使用。

# Series对显式索引和隐式索引的使⽤
s = pd.Series([11,12,13,14,15], index=[3,5,7,9,1])
print("s = \n", s)
# Series对于单个内容的访问，采⽤的是显式索引
# s[1]表示的是显式索引 ”1“ 的内容
print("\n s[1]=", s[1])
# Series切⽚采⽤的是隐式所含，即默认是从下标0开始的升序索引
# s[1:3] 选中的内容是s[1] 和 s[2]
print("\n s[1:3] = \n", s[1:3])
s = pd.Series([1,2,3,4,5], index=["one", "two", "three", "four", "five"])
print("\n s=\n", s)
# 对于显式索引的切⽚， 是包含最后⼀位内容的
# 这⼀点跟隐式索引有很⼤区别
print('\n s["two": "four"]=', s["two":"four"])

s =
 3 11
5 12
7 13
9 14
1 15
dtype: int64
 s[1]= 15
 s[1:3] =
 5 12
7 13
dtype: int64
 s=
 one 1
two 2
three 3
four 4
five 5
dtype: int64
 s["two": "four"]= two 2
three 3
four 4
dtype: int64

# loc索引器表示切⽚和取值都是显式的，不使⽤隐式索引。
s = pd.Series([11,12,13,14,15], index=[3,5,7,9,1])
print("s = \n", s)
print("\n s.loc[3] = ", s.loc[3])
print("\n s.loc[3:9] = \n", s.loc[3:9])

s =
 3 11
5 12
7 13
9 14
1 15
dtype: int64
 s.loc[3] = 11
 s.loc[3:9] =
 3 11
5 12
7 13
9 14
dtype: int64

# iloc索引器表示切⽚和取值都是隐式的，不使⽤显式索引
s = pd.Series([11,12,13,14,15], index=[3,5,7,9,1])
print("s = \n", s)
print("\n s.iloc[3] = ", s.iloc[3])
print("\n s.iloc[3:9] = \n", s.iloc[3:9])

s =
 3 11
5 12
7 13
9 14
1 15
dtype: int64
 s.iloc[3] = 14
 s.iloc[3:9] =
  9 14
1 15
dtype: int64

DataFrame对象

DataFrame可以看做是通用的NumPy数组，也可以看做特殊的字典
DataFrame最常见的结构可以想象成⼀个Excel内容，每⼀行都有行号，每⼀列都有列名的二维结构
创建DataFrame的⽅式比较多，常见的有：
通过单个Series创建
通过字典列表创建
通过Series对象字典创建
通过NumPy⼆维数组创建
通过Numpy结构化数组创建

通过单个Series对象创建

import pandas as pd
s = pd.Series([1,2,3,4,5])
print("S=\n", s)
df = pd.DataFrame(s, columns=['digits'])
print("df=\n", df)

S=
 0 1
1 2
2 3
3 4
4 5
dtype: int64
df=
 digits
0 1
1 2
2 3
3 4
4 5

通过字典列表创建

dl = [{"个": i, "⼗":i*10, "百":i*100} for i in range(1,5)]
print("dl = ", dl)
df = pd.DataFrame(dl)
print("df = \n", df)
# 在通过字典创建的时候，如果有的值并不存在，则⾃动⽤NaN(not a number)填充
# 参看下⾯例⼦
dl = [{"a":1, "b":1}, {"b":2, "c":2}, {"c":3, "d":3}]
df = pd.DataFrame(dl)
print("df = \n", df)

dl = [{'个': 1, '⼗': 10, '百': 100}, {'个': 2, '⼗': 20, '百': 200}, {'个': 3,
'⼗': 30, '百': 300}, {'个': 4, '⼗': 40, '百': 400}]
df =
 个 ⼗ 百
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
df =
 a b c d
0 1.0 1.0 NaN NaN
1 NaN 2.0 2.0 NaN
2 NaN NaN 3.0 3.0

通过Series对象字典创建

s1 = pd.Series([i for i in range(1,6)], index=[1,2,3,4,5])
s2 = pd.Series([i*10 for i in range(1,6)], index=[3,4,5,6,7])
df = pd.DataFrame({"个": s1, "⼗":s2})
print("df = \n", df)

df =
 个 ⼗
1 1.0 NaN
2 2.0 NaN
3 3.0 10.0
4 4.0 20.0
5 5.0 30.0
6 NaN 40.0
7 NaN 50.0

通过Numpy⼆维数组创建

import numpy as np
df = pd.DataFrame(np.zeros([5,3]),
 columns=["A", "B", "C"],
 index=["one", "two", "three", "four", "five"])
print("df=\n",df)

df=
 A B C
one 0.0 0.0 0.0
two 0.0 0.0 0.0
three 0.0 0.0 0.0
four 0.0 0.0 0.0
five 0.0 0.0 0.0

通过Numpy结构化数组

d = np.zeros(3, dtype=[("A", "i8"), ("B", "f8")])
print("d = \n", d)
df = pd.DataFrame(d)
print("df=\n", df)

d =
 [(0, 0.) (0, 0.) (0, 0.)]
df=
 A B
0 0 0.0
1 0 0.0
2 0 0.0

对于DataFrame数据的使用方式，我们可以把DataFrame看做是具有行列号和⾸部标题行的Excel表格，而去除掉列号和⾸部标题行
DataFrame就可以看做是⼀个⼆维数组。
对于DataFrame的数据选择，可以采用字典形式的访问，此时访问的是一列值，也可以采用切片等，下面详细进行介绍：

dl = [{"个": i, "⼗":i*10, "百":i*100} for i in range(1,5)]
df = pd.DataFrame(dl)
print("df = \n", df)

使用字典的方式访问

print("df['百'] =\n", df['百'] )

df['百'] =
 0 100
1 200
2 300
3 400
Name: 百, dtype: int64

如果选取的键的名字跟上例中的df的属性或者函数不冲突，可以直接采用.进行访问：

# 上⾯访问⽅式等价于下⾯访问⽅式
# 但下⾯访问⽅式⾮通⽤和⽅法，可能会引起冲突
print("df.百=\n", df.百)

df.百=
 0 100
1 200
2 300
3 400
Name: 百, dtype: int64

DataFrame可以看做是⼀个增强版的⼆维数组，此时他的全部值可以⽤DataFrame.values来表示：

# values属性的使⽤
print("df.values = \n", df.values)

df.values =
 [[ 1 10 100]
 [ 2 20 200]
 [ 3 30 300]
 [ 4 40 400]]

对于DataFrame的访问，推荐时候用loc，iloc索引器进行访问，避免引起混淆：

使用loc进行显示访问

print('df.loc[:2, :"⼗"]=\n', df.loc[:2, :"⼗"])

df.loc[:2, :"⼗"]=
 个 ⼗
0 1 10
1 2 20
2 3 30

使用iloc进⾏隐式访问

print('df.iloc[:2, :2] = \n', df.iloc[:2, :2])

df.iloc[:2, :2] =
 个 ⼗
0 1 10
1 2 20

或者使用ix进行混合访问,并且使用时会出现警告

# indexing的初衷是避免访问混淆，但ix显然并没有达到这⼀点，所以，ix索引器不推荐使⽤
print(df.ix[:2, :"⼗"])

  个 ⼗
0 1 10
1 2 20
2 3 30


DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-isdeprecated

Index对象

Pandas的Index对象是⼀个独立的对象，用来表示数据的索引，可以把它看做不可变的数组(tuple)或者有序的集合。
当作为不可变数组的时候，除了数组的⼀些读操作外，还具有⼀些NumPy数组的属性。
Index作为有序集合操作主要是为了进行一些基于集合的操作，比如集合的交差并补操作。

Idnex作为不可变数组

idx = pd.Index([2,4,6,8,10])
print("idx = ", idx)
print("idx[1:4] = ", idx[1:4])
print("idx.size=", idx.size)
print("idx.shape=", idx.shape)
print("idx.ndim", idx.ndim)
print("idx.dtype=", idx.dtype)

idx = Int64Index([2, 4, 6, 8, 10], dtype='int64')
idx[1:4] = Int64Index([4, 6, 8], dtype='int64')
idx.size= 5
idx.shape= (5,)
idx.ndim 1
idx.dtype= int64

Index作为有序集合

idx_1 = pd.Index([1,3,5,6,7,])
idx_2 = pd.Index([2,4,6,7,8,9])
print("交集： idx_1 & idx_2 = ", idx_1 & idx_2)
print("并集： idx_1 | idx_2 = ", idx_1 | idx_2)
# print("差集： idx_1 - idx_2 = ", idx_1.intersection( idx_2))
print("异或： idx_1 ^ idx_2 = ", idx_1 ^ idx_2)

交集： idx_1 & idx_2 = Int64Index([6, 7], dtype='int64')
并集： idx_1 | idx_2 = Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
异或： idx_1 ^ idx_2 = Int64Index([1, 2, 3, 4, 5, 8, 9], dtype='int64')

夕麻

关注

1
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Pandas - Series、DataFrame、Index

pandas概述Numpy主要处理结构化数据，数据量比较小，规则对于⼤量数据，需要清理的数据，则需要pandas⼀般使用方法是 import pandas as pdSeries对象Series对象是⼀个带索引数据构成的⼀维数组。Series把数组和⼀组索引绑定在⼀起。如果想获取绑定的内容，分别可以使用values属性和index属性。Series的创建Series对象的创...
复制链接

扫一扫

专栏目录