Introduction to Pandas 1: The Basic Data Structures Provided by Pandas

Article link: https://blog.csdn.net/qq_29575471/article/details/108045642


import pandas as pd
import numpy as np

Preface

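An ordinary list or NumPy ndarray has an implicitly defined index: for the list a = [10, 20, 30, 40], the implicit indices 0, 1, 2, 3 correspond to the elements 10, 20, 30, 40.
The index of a Series, by contrast, can be defined explicitly.
The examples below show that a Series behaves partly like a NumPy array and partly like a dictionary.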

x = pd.Series([10, 20, 30, 40])
x  # each index label is bound to its value

0    10
1    20
2    30
3    40
dtype: int64

print(
    x.values,
    x.index,
    sep = '\n'
)

[10 20 30 40]
RangeIndex(start=0, stop=4, step=1)

y = pd.Series(
    [1, 2, 3, 4],
    index=['a', 'b', 'c', 'd']
)

y

a    1
b    2
c    3
d    4
dtype: int64

y['c']
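
The array-like and dictionary-like sides of a Series can be seen next to each other in a minimal sketch. The variable names (s, doubled, first_two and so on) are illustrative and do not come from the original post.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Array-like behaviour: vectorized arithmetic and positional slicing
doubled = s * 2
first_two = s.iloc[:2]

# Dictionary-like behaviour: lookup by label, membership test, key listing
value_c = s['c']
has_b = 'b' in s          # membership is tested against the index labels
labels = list(s.keys())   # keys() returns the index

print(doubled.tolist())    # [2, 4, 6, 8]
print(first_two.tolist())  # [1, 2]
print(labels)              # ['a', 'b', 'c', 'd']
```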

my_dict = {
    'a': 0,
    'b': 1
}
y = pd.Series(my_dict)
y

a    0
b    1
dtype: int64

x = {'a' : 10, 'b' : 20}
x = pd.Series(x)
x

a    10
b    20
dtype: int64

z1 = pd.DataFrame(
    data = [x, y],
    index = ['x', 'y']
)

z1

	a	b
x	10	20
y	0	1

z = pd.DataFrame({
    'x': x,
    'y': y
})

z

# Note the difference between the two ways of constructing a DataFrame shown above

	x	y
a	10	0
b	20	1

print(
    z1.index,
    z.index,
    sep = '\n'
)

Index(['x', 'y'], dtype='object')
Index(['a', 'b'], dtype='object')

print(
    z1.columns,
    z.columns,
    sep = '\n'
)

Index(['a', 'b'], dtype='object')
Index(['x', 'y'], dtype='object')

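How should we understand the difference between these two constructions?
The signature of pd.DataFrame is: pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False) (recent pandas releases use copy=None as the default).
The tests above show that the index parameter of DataFrame sets the row labels.
First, a Series is a labeled one-dimensional array: each label corresponds to a single value. A DataFrame is a labeled two-dimensional array: each label corresponds to a block of several values, and that block can carry labels of its own.
Next, consider the examples above.
z1 is constructed by passing data and index. Because data is a list, each element of the list becomes one row and index supplies the row labels; the labels inside each Series (a, b) become the column labels.
z is really z = pd.DataFrame(data = {'x': x, 'y': y}), with no index argument. Because data is a dictionary, its elements are not treated as rows: each key becomes a column label and each value becomes an entire column, while the row labels come from the index of the Series.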

Note that x and y here are both Series.
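
One way to check this reading is to confirm that the two constructions hold the same values with rows and columns swapped. This is a minimal sketch; the names by_rows and by_cols are illustrative stand-ins for z1 and z.

```python
import pandas as pd

x = pd.Series({'a': 10, 'b': 20})
y = pd.Series({'a': 0, 'b': 1})

by_rows = pd.DataFrame(data=[x, y], index=['x', 'y'])  # each Series becomes a row
by_cols = pd.DataFrame({'x': x, 'y': y})               # each Series becomes a column

# Transposing the row-wise frame should reproduce the column-wise frame
transposed = by_rows.T
same_values = bool((transposed.to_numpy() == by_cols.to_numpy()).all())
same_labels = (
    list(transposed.index) == list(by_cols.index)
    and list(transposed.columns) == list(by_cols.columns)
)
print(same_values, same_labels)
```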

	x	y
a	10	0
b	20	1

z['x']['a']
# The column label comes first

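Column-first access is unusual for a two-dimensional array, so it is generally better to think of a DataFrame as a dictionary of columns rather than as a two-dimensional array.
In this dictionary each key maps to a Series, and that Series is the value.
This is an interesting contrast. In the discussion of Series the core concept was the index: each index label is bound to a value, so a Series can be seen as an array whose index is specified explicitly. A DataFrame is different: its core concept is the dictionary key, that is, the column label.
Under this view, the index shown for the DataFrame above is best understood as the index shared by the Series it stores, and the column labels (the keys) are the core concept.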
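
The dictionary-of-Series view can be tried out directly. The sketch below rebuilds z from the values used earlier; col is an illustrative name, and .loc is shown only as the row-first alternative to chained lookup.

```python
import pandas as pd

z = pd.DataFrame({
    'x': pd.Series({'a': 10, 'b': 20}),
    'y': pd.Series({'a': 0, 'b': 1}),
})

col = z['x']                 # key lookup returns a whole column as a Series
print(type(col).__name__)    # Series
print(list(z.keys()))        # ['x', 'y']
print('x' in z, 'a' in z)    # True False (membership tests the column labels)

# Row-first access is still available through .loc[row_label, column_label]
print(z.loc['a', 'x'] == z['x']['a'])
```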
This gives a new way to understand why constructing z and z1 produces different results.
Let us look at the construction again:

z = pd.DataFrame({
    'x': x,
    'y': y
})

z1 = pd.DataFrame(
    data = [x, y],
    index = ['x', 'y']
)

	x	y
a	10	0
b	20	1

z1

	a	b
x	10	20
y	0	1

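z specifies the data of the DataFrame with a dictionary. Under the view above, the column labels are naturally the two keys of the dictionary, each mapping to one Series, and the row labels are the index of those Series.
z1 specifies the data with the list [x, y]. Each element of the list is a Series that fills one row, so the labels of the Series (a, b) become the column labels and the index argument names the rows.
Since it is better to treat a DataFrame as a dictionary than as a two-dimensional list, the construction used for z is the recommended one; it is more intuitive.
In summary, we treat a Series as a list with explicitly specified labels, and a DataFrame as a collection of Series.
Other ways to create a DataFrame: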
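
A few commonly used alternatives can be sketched as follows. These are standard pandas constructors, but the selection and the variable names (df_lists, df_records, df_array, df_series) are assumptions made for illustration, not a list taken from the original post.

```python
import numpy as np
import pandas as pd

# From a dictionary of equal-length lists: keys become column labels
df_lists = pd.DataFrame({'x': [10, 20], 'y': [0, 1]}, index=['a', 'b'])

# From a list of dictionaries: each dictionary becomes one row
df_records = pd.DataFrame(
    [{'x': 10, 'y': 0}, {'x': 20, 'y': 1}],
    index=['a', 'b'],
)

# From a two-dimensional NumPy array with explicit row and column labels
df_array = pd.DataFrame(
    np.array([[10, 0], [20, 1]]),
    index=['a', 'b'],
    columns=['x', 'y'],
)

# From a single Series: the Series becomes one named column
df_series = pd.Series({'a': 10, 'b': 20}).to_frame(name='x')

print(df_lists)
```

All of the first three calls produce the same table as z above; the last one produces only its x column.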


ind = pd.Index([i for i in range(5, 1, -1)])
ind

Int64Index([5, 4, 3, 2], dtype='int64')

ind[1] = 2
# Raises TypeError: Index does not support mutable operations
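
The immutability of an Index can be demonstrated without stopping the script by catching the exception. This is a minimal sketch; building a replacement Index with a list comprehension is only one possible workaround, and the names second, every_other, mutated and new_ind are illustrative.

```python
import pandas as pd

ind = pd.Index([5, 4, 3, 2])

# Reading works like an array
second = ind[1]
every_other = ind[::2].tolist()

# Writing does not: item assignment raises TypeError
try:
    ind[1] = 2
    mutated = True
except TypeError as err:
    mutated = False
    print(err)

# To "change" a label, build a new Index instead
new_ind = pd.Index([2 if v == 4 else v for v in ind])
print(new_ind.tolist())  # [5, 2, 3, 2]
```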

indA = pd.Index([i for i in range(1, 10, 1)])
indB = pd.Index([i for i in range(6, 15, 1)])
print(
    indA,
    indB,
    sep = '\n'
)

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Int64Index([6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

indA.intersection(indB)  # intersection; in pandas 2.x the & operator is element-wise, not a set operation

Int64Index([6, 7, 8, 9], dtype='int64')

indA.union(indB)  # union; in pandas 2.x the | operator is element-wise, not a set operation

Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

indA.symmetric_difference(indB)  # symmetric difference; in pandas 2.x the ^ operator is element-wise

Int64Index([1, 2, 3, 4, 5, 10, 11, 12, 13, 14], dtype='int64')
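
The set-style results can be cross-checked against Python's built-in sets. The sketch below uses the Index methods intersection, union, symmetric_difference and difference; the names ind_a, ind_b and the result variables are illustrative.

```python
import pandas as pd

ind_a = pd.Index(list(range(1, 10)))
ind_b = pd.Index(list(range(6, 15)))

common = ind_a.intersection(ind_b)
either = ind_a.union(ind_b)
only_one = ind_a.symmetric_difference(ind_b)
only_a = ind_a.difference(ind_b)

# Each Index method should agree with the matching Python set operation
set_a, set_b = set(ind_a.tolist()), set(ind_b.tolist())
agree = (
    set(common.tolist()) == set_a & set_b
    and set(either.tolist()) == set_a | set_b
    and set(only_one.tolist()) == set_a ^ set_b
    and set(only_a.tolist()) == set_a - set_b
)

print(common.tolist())  # [6, 7, 8, 9]
print(only_a.tolist())  # [1, 2, 3, 4, 5]
print(agree)            # True
```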
