The DataFrame Data Structure in Python pandas

Latest recommended article published 2024-06-11 19:49:24

在到处之间找我

Category: Python Study Notes. Tags: pandas, DataFrame, python

Article link: https://blog.csdn.net/sinat_41104353/article/details/85037221


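DataFrame is one of the two main data structures in pandas; the other is Series. The DataFrame documentation is in the official pandas API reference.

I needed this data structure for a small assignment over the past few days, so here is a summary of some basic DataFrame usage.

Creation

First, let's look at how to create a DataFrame. The constructor given in the DataFrame documentation is: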

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

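data can be a NumPy ndarray, a dict, or another DataFrame. It supplies the initial values of the DataFrame.
index holds the row labels and columns holds the column labels. In the examples I have seen, both are usually passed as lists.
dtype specifies the data type of the elements in data. The examples in the documentation linked above cover this parameter.
Note that the signature above comes from an older release; recent pandas releases document the default of copy as None rather than False.

Let's look at an example: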

>>> import pandas as pd
>>> from pandas import DataFrame
>>> data = {
...     'state': ['ok', 'normal', 'good', 'bad'],
...     'year': [2000, 2001, 2002, 2003],
...     'pop': [3.7, 3.6, 2.4, 0.9]}
>>> print(DataFrame(data))  # the row index defaults to 0, 1, 2, 3
    state  year  pop
0      ok  2000  3.7
1  normal  2001  3.6
2    good  2002  2.4
3     bad  2003  0.9

>>> print(DataFrame(data, index=['one', 'two', 'three', 'four']))  # explicit row index
        state  year  pop
one        ok  2000  3.7
two    normal  2001  3.6
three    good  2002  2.4
four      bad  2003  0.9

>>> print(DataFrame(None, index=range(3), columns=range(4)))  # data defaults to None, so omitting the first argument gives the same result
     0    1    2    3
0  NaN  NaN  NaN  NaN
1  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN

>>> print(DataFrame(2, index=range(3), columns=range(4)))
   0  1  2  3
0  2  2  2  2
1  2  2  2  2
2  2  2  2  2
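
The examples above build a DataFrame from a dict and from a scalar. As a complement, here is a minimal sketch of the other two points made about the constructor: passing a NumPy ndarray as data, and using dtype to set the element type. The array values and the labels r1, r2, a, b, c are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2x3 integer array used as the initial values (illustrative data).
arr = np.array([[1, 2, 3], [4, 5, 6]])

# index gives the row labels, columns gives the column labels.
df_arr = pd.DataFrame(arr, index=["r1", "r2"], columns=["a", "b", "c"])

# dtype forces one element type on the whole frame.
df_float = pd.DataFrame(arr, columns=["a", "b", "c"], dtype="float64")

print(df_arr)
print(df_float.dtypes)
```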

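Some functions also return a DataFrame, for example DataFrame.from_dict, pandas.read_csv, pandas.read_table, and pandas.read_clipboard (see the documentation for details).

For more detail, see Python Pandas - DataFrame; its Chinese version is Pandas数据帧（DataFrame）.

Element access

Reference: https://pandas.pydata.org/pandas-docs/stable/indexing.html

The Python and NumPy indexing operator [] and attribute operator . provide quick and easy access to pandas data structures.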
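
As a rough illustration of two of these functions, the sketch below builds one frame with DataFrame.from_dict and another with pandas.read_csv. The data is invented, and io.StringIO stands in for a real CSV file so that the snippet is self-contained.

```python
import io

import pandas as pd

# from_dict with orient="index": the dict keys become row labels.
records = {"one": [2000, 3.7], "two": [2001, 3.6]}
df_from_dict = pd.DataFrame.from_dict(
    records, orient="index", columns=["year", "pop"]
)

# read_csv accepts any file-like object; StringIO replaces a file on disk here.
csv_text = "state,year,pop\nok,2000,3.7\nnormal,2001,3.6\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(type(df_from_dict).__name__, type(df_from_csv).__name__)
```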

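Different Choices for Indexing

pandas supports several types of multi-axis indexing. The two covered here are .loc and .iloc; the third listed in older documentation, .ix, was deprecated and later removed.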

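.loc is primarily label based, but may also be used with a boolean array. .loc raises KeyError when an item is not found. Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, not as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included). See Slicing with labels.
- A boolean array.
- A callable with one argument (the calling Series or DataFrame, or Panel in old versions) that returns valid output for indexing (one of the above).
  See Selection by Label.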
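
The allowed .loc inputs can be sketched on a small frame as follows. The frame, its labels a to d, and the variable names are all made up for illustration; callables are left out here.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]},
    index=["a", "b", "c", "d"],
)

row_a = df.loc["a"]            # single label: the row as a Series
rows_ab = df.loc[["a", "b"]]   # list of labels: a DataFrame
rows_b_to_d = df.loc["b":"d"]  # label slice: both "b" and "d" are included
big_x = df.loc[df["x"] > 20]   # boolean array: keeps rows where it is True

# An integer passed to .loc is a label, not a position.
s = pd.Series(["p", "q"], index=[5, 7])
label_five = s.loc[5]          # the element labelled 5, which is at position 0

# A label that does not exist raises KeyError.
missing = False
try:
    df.loc["z"]
except KeyError:
    missing = True
```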

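.iloc is primarily integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc raises IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms to Python/NumPy slice semantics). Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with integers, e.g. 1:7.
- A boolean array.
- A callable with one argument (the calling Series or DataFrame, or Panel in old versions) that returns valid output for indexing (one of the above).

See Selection by Position, Advanced Indexing, and Advanced Hierarchical.

.loc, .iloc, and [] indexing can also accept a callable as the indexer. See Selection By Callable for more.

Getting values from an object with multi-axis selection uses the following notation (shown for .loc; .iloc works the same way). Any of the axis accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :]. Panel was removed in later pandas releases, so the Panel rows in the tables below apply only to old versions.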
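
A comparable sketch for .iloc, again on an invented frame, including the difference between an out-of-bounds integer (IndexError) and an out-of-bounds slice (silently clipped):

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [1.5, 2.5, 3.5, 4.5]},
    index=["a", "b", "c", "d"],
)

first_row = df.iloc[0]                         # integer: the row at position 0
picked = df.iloc[[3, 0]]                       # list of integers, in that order
middle = df.iloc[1:3]                          # positions 1 and 2; the stop is excluded
masked = df.iloc[[True, False, True, False]]   # boolean array
clipped = df.iloc[2:10]                        # out-of-bounds slice is allowed

# An out-of-bounds integer raises IndexError.
out_of_bounds = False
try:
    df.iloc[10]
except IndexError:
    out_of_bounds = True
```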
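
A minimal sketch of callable indexers with all three accessors. Each lambda receives the calling DataFrame and returns something the accessor already understands; the frame and the names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20, 30, 40], "y": [4, 3, 2, 1]},
    index=["a", "b", "c", "d"],
)

# .loc: the callable returns a boolean Series.
by_loc = df.loc[lambda d: d["x"] > 20]

# .iloc: the callable returns a list of integer positions (first and last row).
by_iloc = df.iloc[lambda d: [0, len(d) - 1]]

# []: the callable returns a boolean Series, which filters rows.
by_brackets = df[lambda d: d["y"] < 3]
```

Callables are mostly useful in method chains, where the intermediate frame has no name to refer to.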

Object Type	Indexers
Series	s.loc[indexer]
DataFrame	df.loc[row_indexer,column_indexer]
Panel	p.loc[item_indexer,major_indexer,minor_indexer]
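
For a DataFrame, the rule that an omitted axis is treated as : can be checked directly. This is a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [10, 20], "y": [1.5, 2.5]},
    index=["a", "b"],
)

row_short = df.loc["a"]      # column axis omitted
row_full = df.loc["a", :]    # column axis given as the null slice
column_x = df.loc[:, "x"]    # null slice on the row axis selects a whole column
cell = df.loc["b", "x"]      # both axes given: a single value
```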

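Basics

The primary function of indexing with [] is to select out lower-dimensional slices. The following table shows the return type when indexing pandas objects with []: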

Object Type	Selection	Return Value Type
Series	series[label]	scalar value
DataFrame	frame[colname]	Series corresponding to colname
Panel	panel[itemname]	DataFrame corresponding to the itemname
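
The Series and DataFrame rows of the table can be verified with a short sketch (the data is illustrative; Panel is omitted because it no longer exists in current pandas):

```python
import numpy as np
import pandas as pd

series = pd.Series([1.0, 2.0], index=["a", "b"])
value = series["a"]          # Series[label] gives a scalar

frame = pd.DataFrame({"col": [1, 2], "other": [3, 4]})
column = frame["col"]        # DataFrame[colname] gives a Series

print(np.isscalar(value), type(column).__name__)
```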

Below we construct a simple time series data set to illustrate the indexing functionality:

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import DataFrame

>>> dates = pd.date_range('1/1/2000', periods=8)
>>> df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
>>> df
                   A         B         C         D
2000-01-01 -1.770120  1.139613  0.298093  2.510808
2000-01-02  0.014236  0.822682  0.064727 -1.788583
2000-01-03  0.763586 -0.131537 -0.672899 -0.189418
2000-01-04 -0.472942  0.387618 -0.284481  0.000814
2000-01-05  0.212607  0.280663  0.573487 -0.174647
2000-01-06  0.154372  1.836240 -0.683947  1.273916
2000-01-07 -0.195618  0.529984 -0.771837 -0.009498
2000-01-08  1.054207 -0.130465 -0.785729 -0.657609

We can use [] for the most basic indexing:

>>> s = df['A']
>>> s
2000-01-01   -1.770120
2000-01-02    0.014236
2000-01-03    0.763586
2000-01-04   -0.472942
2000-01-05    0.212607
2000-01-06    0.154372
2000-01-07   -0.195618
2000-01-08    1.054207
Freq: D, Name: A, dtype: float64
>>> s[dates[5]]
0.1543721114379397

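We can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception is raised. Multiple columns can also be set in this manner.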

>>> df
                   A         B         C         D
2000-01-01 -1.770120  1.139613  0.298093  2.510808
2000-01-02  0.014236  0.822682  0.064727 -1.788583
2000-01-03  0.763586 -0.131537 -0.672899 -0.189418
2000-01-04 -0.472942  0.387618 -0.284481  0.000814
2000-01-05  0.212607  0.280663  0.573487 -0.174647
2000-01-06  0.154372  1.836240 -0.683947  1.273916
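
The original example is cut off at this point. As a stand-in, here is a minimal sketch of selecting and setting several columns with a list passed to []. It uses a smaller frame with deterministic values instead of the random data above, and the swap assigns raw NumPy values so that the result does not depend on column alignment; both choices are assumptions for illustration, not the author's code.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=3)
df = pd.DataFrame(
    np.arange(12.0).reshape(3, 4), index=dates, columns=["A", "B", "C", "D"]
)

# Columns come back in the order requested.
ba = df[["B", "A"]]

# Asking for a column that does not exist raises KeyError.
missing_column = False
try:
    df[["A", "Z"]]
except KeyError:
    missing_column = True

# Set two columns at once: swap A and B using the underlying values.
df[["B", "A"]] = df[["A", "B"]].to_numpy()
```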
