Pandas Study Notes [Intro to Data Structures]


Flamsky


Article link: https://blog.csdn.net/Flamsky/article/details/100811776


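Conventions

One to three pound signs (#) correspond to the first-, second-, and third-level headings of the book
and mark the level of each note; four pound signs mark the actual note content.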

Intro to data structures

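Core principle: data alignment is intrinsic

The link between labels and data is never broken unless you break it explicitly.

Series

Creating a Series

s = pd.Series(data, index = index)

Here data can be a dict, an ndarray, or a scalar value.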

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index = list('abcde'))

d = dict(b = 1, a = 0, c = 2)
c  = pd.Series(d)

e = pd.Series(5. , index = list('abcde'))
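When data is a dict and an index is also passed, the index selects and orders the dict's keys, and a label with no matching key is filled with NaN. A minimal sketch of that behaviour (the variable names from_dict and selected are illustrative, not from the notes):

```python
import numpy as np
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Without an index, the Series keeps the dict's insertion order.
from_dict = pd.Series(d)

# With an index, values are looked up by key. 'd' is not a key of the dict,
# so it becomes NaN and the dtype is promoted to float64.
selected = pd.Series(d, index=["b", "c", "d", "a"])

print(from_dict)
print(selected)
```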

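A Series behaves much like an ndarray

Note this usage

# Selecting by a list of integer positions works for a pandas Series and a NumPy array, but not for a plain Python list.
# On a label-indexed Series, use .iloc for positions; plain s[[4, 3, 1]] is deprecated in recent pandas releases.
s.iloc[[4, 3, 1]]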

Converting a Series to a NumPy ndarray

>>> e.to_numpy()      # method to_numpy()
array([5., 5., 5., 5., 5.])
>>> e.values              # attribute values
array([5., 5., 5., 5., 5.])

Checking whether a label exists

>>> 'e' in e
True
>>> 'l' in e
False

Accessing values in a Series

s.get('f')    # returns None if the label 'f' does not exist
s.get('f', np.nan)    # a default such as np.nan can be supplied explicitly
s['f']    # raises KeyError if 'f' does not exist
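The difference between the two lookup styles can be checked directly. The sketch below uses a small Series of its own, so the names and values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=list("abc"))

missing_default = s.get("f")       # label absent, no default given: None
missing_nan = s.get("f", np.nan)   # label absent, explicit default returned

# Bracket access has no fallback; a missing label raises KeyError.
try:
    s["f"]
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False

print(missing_default, missing_nan, lookup_failed)
```

get suits optional labels; brackets suit cases where a missing label should be treated as a bug.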

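Vectorized operations and label alignment

s + s
s ** 2
np.exp(s)
s * 2

Unlike an ndarray, operations between Series align automatically on labels; a label present in only one operand yields NaN in the result.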
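A short sketch of the alignment rule, using two overlapping slices of the same Series (the data is made up for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd"))

# s.iloc[1:] carries labels b, c, d; s.iloc[:-1] carries labels a, b, c.
# The sum is computed over the union of labels. 'a' and 'd' each appear in
# only one operand, so they come out as NaN.
total = s.iloc[1:] + s.iloc[:-1]
print(total)

# dropna() discards the labels that could not be matched.
matched = total.dropna()
print(matched)
```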

The name attribute of a Series

c.name = 'somename'
s = pd.Series(np.random.randn(5), name = 'another_name')

DataFrame

Accepted inputs include: a dict of 1-D ndarrays, lists, dicts, or Series; a 2-D ndarray; a structured or record ndarray; another DataFrame.

A DataFrame has two kinds of labels: index (row labels) and columns (column labels).

From a dict of Series

# Build a dict whose values are all Series
# Note that the index of each Series supplies the initial row labels
>>> d = dict(one = pd.Series([1., 2., 3.], index = list('abc')),
...     two = pd.Series([1., 2., 3., 4.], index = list('abcd')))
>>> d
{'one': a    1.0
b    2.0
c    3.0
dtype: float64, 'two': a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64}
>>> df = pd.DataFrame(d)    # 创建DataFrame
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
>>> pd.DataFrame(d, index = list('bda'))   # here index selects rows from d rather than defining new labels
   one  two
b  2.0  2.0
d  NaN  4.0
a  1.0  1.0
>>> pd.DataFrame(d, index = list('dba'), columns = ['two', 'three']) # columns also selects rather than defines; the absent 'three' is filled with NaN
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

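As the examples above show, when the data already carries labels (as a dict of Series does), the index and columns arguments select and reorder existing labels instead of defining new ones; requested labels that the data lacks are filled with NaN.

>>> df.index   # row labels
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> df.columns  # column labels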
Index(['one', 'two'], dtype='object')
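One way to see that index acts as a selection is to compare the constructor with reindex. The sketch below assumes the same dict of Series as in the notes; the names via_constructor and via_reindex are illustrative:

```python
import pandas as pd

d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=list("abc")),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=list("abcd")),
}

# Passing index to the constructor...
via_constructor = pd.DataFrame(d, index=list("bda"))

# ...gives, for this data, the same frame as building it first
# and then reindexing the rows.
via_reindex = pd.DataFrame(d).reindex(list("bda"))

print(via_constructor)
print(via_constructor.equals(via_reindex))
```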

From a dict of ndarrays / lists

# the ndarrays (or lists) must all have the same length
>>> d = dict(one = [1., 2., 3., 4.], two = [4, 3, 2, 1])
>>> d
{'one': [1.0, 2.0, 3.0, 4.0], 'two': [4, 3, 2, 1]}
>>> pd.DataFrame(d)
   one  two
0  1.0    4
1  2.0    3
2  3.0    2
3  4.0    1
# if an index is passed, its length must also match the length of the lists/ndarrays
>>> pd.DataFrame(d, index = list('abcd'))
   one  two
a  1.0    4
b  2.0    3
c  3.0    2
d  4.0    1

From a structured or record array

>>> data = np.zeros((2, ), dtype = [('A', 'i4'), ('B', 'f4'), ('C', 'S10')])
>>> data[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
>>> pd.DataFrame(data)
   A    B         C
0  1  2.0  b'Hello'
1  2  3.0  b'World'
>>> pd.DataFrame(data, index = ['first', 'second'])
        A    B         C
first   1  2.0  b'Hello'
second  2  3.0  b'World'
>>> pd.DataFrame(data, columns = list('CAB'))
          C  A    B
0  b'Hello'  1  2.0
1  b'World'  2  3.0

From a list of dicts

>>> data2 = [dict(a = 1, b =2), dict(a = 5, b = 10, c = 20)]
>>> data2
[{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
>>> pd.DataFrame(data2)
   a   b     c
0  1   2   NaN
1  5  10  20.0
# note that no new columns are created here; columns a and b are selected from data2
>>> pd.DataFrame(data2, columns = list('ab'))
   a   b
0  1   2
1  5  10

From a dict of tuples

>>> pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
...               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
...               ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
...               ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
...               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

# this creates a DataFrame with multi-level (MultiIndex) rows and columns
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

Alternate constructors

DataFrame.from_dict: builds a DataFrame from a dict of lists. By default, each key becomes a column.

>>> pd.DataFrame(dict(a = [1, 2, 3], b = [4, 5, 6]))
   a  b
0  1  4
1  2  5
2  3  6
>>> pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
   A  B
0  1  4
1  2  5
2  3  6
# The two results above are the same, but from_dict lets you change orient from its default 'columns' to 'index',
# so each list can become either a column or a row.
>>> pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]), orient = 'index')
# note that rows and columns are transposed below
   0  1  2
A  1  2  3
B  4  5  6
# with orient = 'index', column names can also be passed through columns
>>> pd.DataFrame.from_dict(dict( A = [1, 2, 3], B = [4, 5, 6]), orient = 'index',
...     columns = ['one', 'two', 'three'])
   one  two  three
A    1    2      3
B    4    5      6
# the plain DataFrame constructor has no orient parameter; it always treats dict keys as columns
>>> pd.DataFrame(dict(a = [1, 2, 3], b = [4, 5, 6]), orient = 'index')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'orient'

DataFrame.from_records differs from the DataFrame constructor in that a field of the records can be used as the index.

>>> data
array([(1, 2., b'Hello'), (2, 3., b'World')],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
>>> pd.DataFrame.from_records(data, index = 'C')   # index here is the field named 'C', which becomes the row labels
          A    B
C
b'Hello'  1  2.0
b'World'  2  3.0

From a Series: the result is a DataFrame with the same index as the Series and a single column whose name is the name of the Series.
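The notes give no example for this case, so here is a minimal sketch. The Series and its name "ser" are made up for illustration:

```python
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=list("abc"), name="ser")

# The constructor and Series.to_frame() both yield a one-column frame
# whose column label is the name of the Series.
df_from_ctor = pd.DataFrame(ser)
df_from_method = ser.to_frame()

print(df_from_ctor)
```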

Column selection, addition, and deletion

A DataFrame can be treated much like a dict of Series

Adding a column

>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
>>> df['one']
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
# to add a column, assign to a new column name
>>> df['three'] = df['one'] * df['two']
>>> df['flag'] = df['one'] > 2
>>> df
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
# a scalar assigned to a new column is broadcast to every row
>>> df['four'] = 'bar'
>>> df
   one  two  three   flag four
a  1.0  1.0    1.0  False  bar
b  2.0  2.0    4.0  False  bar
c  3.0  3.0    9.0   True  bar
d  NaN  4.0    NaN  False  bar
# a column can also be filled from a range, as with a list
>>> df['four'] = range(4)
>>> df
   one  two  three   flag  four
a  1.0  1.0    1.0  False     0
b  2.0  2.0    4.0  False     1
c  3.0  3.0    9.0   True     2
d  NaN  4.0    NaN  False     3

>>> df['one_trunc'] = df['one'][:2]   # a column can be initialized from a slice of another column; rows outside the slice get NaN
>>> df
   one  two  three   flag  one_trunc
a  1.0  1.0    1.0  False        1.0
b  2.0  2.0    4.0  False        2.0
c  3.0  3.0    9.0   True        NaN
d  NaN  4.0    NaN  False        NaN

Deleting a column

>>> df.pop('four')
a    0
b    1
c    2
d    3
Name: four, dtype: int32
>>> df
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False
>>> df.pop()   # pop requires the column name
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: pop() missing 1 required positional argument: 'item'

Inserting a column

>>> df.insert(2, 'bar', df['one'])   # insert 'bar' at column position 2, with the values of df['one']
>>> df
   one  two  bar  three   flag
a  1.0  1.0  1.0    1.0  False
b  2.0  2.0  2.0    4.0  False
c  3.0  3.0  3.0    9.0   True
d  NaN  4.0  NaN    NaN  False

Adding a column with assign

>>> df.assign(next = df['one'] + df['two'])  # this form reads clearly
   one  two  bar  three   flag  next
a  1.0  1.0  1.0    1.0  False   2.0
b  2.0  2.0  2.0    4.0  False   4.0
c  3.0  3.0  3.0    9.0   True   6.0
d  NaN  4.0  NaN    NaN  False   NaN
# in the lambda form below, x is the DataFrame that assign is called on (here, df)
>>> df.assign(next2 = lambda x: (x['one'] / x['two']))
   one  two  bar  three   flag  next2
a  1.0  1.0  1.0    1.0  False    1.0
b  2.0  2.0  2.0    4.0  False    1.0
c  3.0  3.0  3.0    9.0   True    1.0
d  NaN  4.0  NaN    NaN  False    NaN

As the code above shows, assign does not modify the original df; it returns a new DataFrame.
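The callable form of assign is most useful in a method chain, where the frame being extended has no variable name of its own. A minimal sketch with made-up data (the column name ratio is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [1.0, 2.0, 4.0]},
    index=list("abc"),
)

# The lambda receives the intermediate frame (only the rows where
# 'one' > 1), not the original df.
result = (
    df[df["one"] > 1]
    .assign(ratio=lambda x: x["one"] / x["two"])
)

print(result)
print(df.columns.tolist())  # the original frame still has only 'one' and 'two'
```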

Indexing / selection

Operation	Syntax	Result
Select a column	df[col]	Series
Select a row by label	df.loc[label]	Series
Select a row by integer position	df.iloc[loc]	Series
Slice rows	df[5:10]	DataFrame
Select rows by boolean vector	df[bool_vec]	DataFrame
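Each row of the table can be tried on a small frame. The data and variable names below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=list("abcd"),
    columns=list("XYZ"),
)

col = df["X"]                # select a column             -> Series
row_by_label = df.loc["b"]   # select a row by label       -> Series
row_by_pos = df.iloc[2]      # select a row by position    -> Series
row_slice = df[1:3]          # slice rows (by position)    -> DataFrame
row_mask = df[df["X"] > 3]   # select rows by bool vector  -> DataFrame

print(row_by_label)
print(row_slice)
print(row_mask)
```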

Creating a time series

>>> index = pd.date_range('1/1/2000', periods = 8)
>>> df = pd.DataFrame(np.random.randn(8, 3), index = index, columns = list('ABC'))
>>> df
                   A         B         C
2000-01-01 -1.055287 -0.718843  2.571885
2000-01-02 -0.185018  0.134261  0.961604
2000-01-03  0.473588 -0.272325 -0.497209
2000-01-04  1.154382 -1.370340  0.256893
2000-01-05 -0.967360 -0.118422 -0.792276
2000-01-06  1.625338  0.411694  0.908086
2000-01-07  0.414562  1.927479  0.813965
2000-01-08 -0.684130  0.477201  1.068507

Data alignment and arithmetic

>>> df1 = pd.DataFrame(np.random.randn(10, 4),
...              columns = list('ABCD'))

>>> df2 = pd.DataFrame(np.random.randn(7, 3),
...              columns = list('ABC'))
>>> df1 + df2    # the operation aligns on both index and columns automatically
          A         B         C   D
0 -4.223208 -2.010021  1.892859 NaN
1  2.270219 -2.490922  0.987272 NaN
2 -2.382549  2.489289 -1.313801 NaN
3 -0.223298 -0.415960  1.608284 NaN
4 -1.697305 -1.668344  0.341503 NaN
5  0.778567  0.378612 -0.533668 NaN
6 -0.643223 -0.257056  2.251711 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
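When NaN for unmatched cells is not wanted, the method form DataFrame.add accepts a fill_value. This goes beyond what the notes cover; the sketch uses small made-up frames so the numbers can be checked by hand:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"A": [100.0, 200.0]})

# Operator form: any cell missing from either frame becomes NaN.
plain = df1 + df2

# Method form: a cell missing from one frame is treated as 0 before adding.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```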

Data type of each column (dtypes)

>>> df1.dtypes
A    float64
B    float64
C    float64
D    float64
dtype: object

Boolean operators


>>> c = dict(a = [1, 0, 1], b = [0, 1 ,1 ])
>>> df1 = pd.DataFrame(c, dtype = bool)
>>> c = dict(a = [0, 1, 1], b = [1, 1, 0])
>>> df2 = pd.DataFrame(c, dtype = bool)
>>> df1 | df2
      a     b
0  True  True
1  True  True
2  True  True
>>> df1 ^ df2
       a      b
0   True   True
1   True  False
2  False   True
>>> - df1   # logical NOT; ~df1 is the more common spelling
       a      b
0  False   True
1   True  False
2  False  False

Using NumPy ufuncs with DataFrame and Series

NumPy currently provides more than 60 ufuncs.
For the full list, see [ufunc available][]

[ufunc available]: https://numpy.org/devdocs/reference/ufuncs.html#available-ufuncs "Available ufuncs"

>>> ser1 = pd.Series([1, 2, 3], index = list('abc'))
>>> ser2 = pd.Series([1, 3, 5], index = list('abc'))
>>> ser3 = pd.Series([2, 4, 5], index = list('bcd'))    # note that this index does not line up with ser1
>>> np.remainder(ser1, ser2)
a    0
b    2
c    3
dtype: int64
>>> np.remainder(ser1, ser3)  # labels found in only one operand yield NaN
a    NaN
b    0.0
c    3.0
d    NaN
dtype: float64
>>> type(np.remainder(ser1, ser3).iloc[0])   # check the type of the NaN value
<class 'numpy.float64'>
>>> np.remainder(ser1, ser3).iloc[0].dtype   # the NaN is stored as numpy.float64
dtype('float64')
>>> np.isnan(np.remainder(ser1, ser3))   # NaN is a regular floating-point value, so np.isnan applies element-wise
a     True
b    False
c    False
d     True
dtype: bool

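Mixing Index and Series in a ufunc

pandas also has an Index type, and an Index can be combined with a Series in a binary ufunc.

>>> ser = pd.Series([1, 2, 3], index = list('abc'))
>>> idx = pd.Index([4, 5, 6])   # an Index object has no index of its own
# The NumPy operation below works by position and ignores labels.
# Note that Index (the type) and index (the row labels) are different things.
# The result is a Series that keeps the index of the original Series.
>>> np.maximum(ser,idx)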
a    4
b    5
c    6
dtype: int64

The official documentation describes it this way:

When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence and a Series is returned.

Console display

>>> iris_data = pd.read_csv('iris.csv')
>>> iris_data
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[150 rows x 5 columns]
>>> iris_data.info()    # DataFrame has a built-in info() method; its output is broken down below
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
SepalLength    150 non-null float64
SepalWidth     150 non-null float64
PetalLength    150 non-null float64
PetalWidth     150 non-null float64
Name           150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Line	Meaning
<class 'pandas.core.frame.DataFrame'>	class of the object
RangeIndex: 150 entries, 0 to 149	type and range of the index
Data columns (total 5 columns):	number of columns
SepalLength 150 non-null float64	name of the first column, its count of non-null values, and its dtype
SepalWidth 150 non-null float64	…
PetalLength 150 non-null float64	…
PetalWidth 150 non-null float64	…
Name 150 non-null object	…
dtypes: float64(4), object(1)	dtype summary: 4 float64 columns and 1 object column
memory usage: 6.0+ KB	memory usage
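The same pieces of information can be read programmatically instead of from the printed report. The sketch below uses a tiny hypothetical stand-in for the iris data (two columns, three rows, one missing value), not the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris_data, small enough to inspect by eye.
demo = pd.DataFrame(
    {
        "SepalLength": [5.1, 4.9, np.nan],
        "Name": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
    }
)

n_rows, n_cols = demo.shape                 # number of entries and of columns
non_null = demo.notna().sum()               # the "non-null" count per column
dtype_counts = demo.dtypes.value_counts()   # basis of the "dtypes: ..." line
memory_bytes = demo.memory_usage().sum()    # basis of the "memory usage" line

print(n_rows, n_cols)
print(non_null)
print(dtype_counts)
print(memory_bytes)
```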

>>> print(iris_data)
     SepalLength  SepalWidth  PetalLength  PetalWidth            Name
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[150 rows x 5 columns]
>>> print(iris_data.to_string())
     SepalLength  SepalWidth  PetalLength  PetalWidth             Name
0            5.1         3.5          1.4         0.2      Iris-setosa
1            4.9         3.0          1.4         0.2      Iris-setosa
2            4.7         3.2          1.3         0.2      Iris-setosa
... (to_string() prints every row; the middle rows were removed by hand here)
146          6.3         2.5          5.0         1.9   Iris-virginica
147          6.5         3.0          5.2         2.0   Iris-virginica
148          6.2         3.4          5.4         2.3   Iris-virginica
149          5.9         3.0          5.1         1.8   Iris-virginica
>>> iris_data['PetalWidth']
0      0.2
1      0.2
2      0.2
3      0.2
4      0.2
      ...
145    2.3
146    1.9
147    2.0
148    2.3
149    1.8
Name: PetalWidth, Length: 150, dtype: float64
>>> iris_data['PetalWidth'].iloc[:20]
0     0.2
1     0.2
2     0.2
3     0.2
4     0.2
5     0.4
6     0.3
7     0.2
8     0.2
9     0.1
10    0.2
11    0.2
12    0.1
13    0.1
14    0.2
15    0.4
16    0.4
17    0.3
18    0.3
19    0.3
Name: PetalWidth, dtype: float64
>>> print(iris_data['PetalWidth'][:20])   # a valid way to select
0     0.2
1     0.2
2     0.2
3     0.2
4     0.2
5     0.4
6     0.3
7     0.2
8     0.2
9     0.1
10    0.2
11    0.2
12    0.1
13    0.1
14    0.2
15    0.4
16    0.4
17    0.3
18    0.3
19    0.3
Name: PetalWidth, dtype: float64
>>> print(iris_data['PetalWidth'].iloc[:20].to_string())    # a valid way to select
0     0.2
1     0.2
2     0.2
3     0.2
4     0.2
5     0.4
6     0.3
7     0.2
8     0.2
9     0.1
10    0.2
11    0.2
12    0.1
13    0.1
14    0.2
15    0.4
16    0.4
17    0.3
18    0.3
19    0.3

Correct ways to index / slice a DataFrame

>>> iris_data['PetalWidth'][15]
0.4
>>> iris_data['PetalWidth'].iloc[15]   # both forms work; df['column'] first selects one column from df
0.4
>>> iris_data.loc[20,'PetalWidth']  # with .loc, give the row label (index) first, then the column label
0.2

An incorrect way to index / slice

iris_data['PetalWidth', :20]  # raises an error; do not use
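The intent of the failing expression (the first rows of one column) can be written in several working ways. The sketch below uses a hypothetical five-row stand-in for iris_data with the default integer index:

```python
import pandas as pd

# Hypothetical stand-in for iris_data.
frame = pd.DataFrame(
    {"PetalWidth": [0.2, 0.2, 0.4, 0.3, 0.1], "Name": list("vwxyz")}
)

# Column first, then positions 0, 1, 2.
first_three_pos = frame["PetalWidth"].iloc[:3]

# Row labels and column label in one .loc call.
# Label slices include the end label, so :2 covers labels 0, 1 and 2.
first_three_lab = frame.loc[:2, "PetalWidth"]

# Purely positional: row positions and the column's integer position.
first_three_ix = frame.iloc[:3, frame.columns.get_loc("PetalWidth")]

print(first_three_lab)
```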

For comparison: NumPy indexing and slicing

>>> c = np.arange(1, 10, 1).reshape([3, 3])
>>> c
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> c[2, 1]    # a DataFrame needs .loc (or .iloc) for this form; again, row first, then column
8
>>> c[2][1]
8

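Console display settings

>>> pd.set_option('display.width', 40)   # display.width is the display width in characters (default 80); its effect was not obvious in this session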
>>> pd.DataFrame(np.random.randn(3, 50))
         0         1         2         3   ...        46        47        48        49
0 -0.201146 -1.992127  0.031741 -0.761252  ...  1.990652 -0.464801  2.085081 -0.790846
1 -0.440614  0.254293  1.178797 -0.952115  ... -0.454850 -0.027599 -0.299474 -0.051353
2  0.015812 -1.434413 -1.103818 -0.622329  ...  0.080404  1.734101 -1.673379 -0.830576

[3 rows x 50 columns]
>>> pd.set_option('display.width', 40)
>>> pd.DataFrame(np.random.randn(3, 50))
         0         1         2         3         4         5         6         7   ...        42        43        44        45        46        47        48        49
0 -0.959138  1.140804 -0.937232  0.381355  0.450238 -1.815852 -1.846794 -0.130763  ...  0.049053 -0.310651  0.473138 -0.848743  0.764086 -0.046611 -0.525201 -1.898720
1  0.582194  0.558483 -1.024662  1.009687 -0.552811 -0.228287  2.491511 -1.229770  ...  0.232407 -0.896743 -2.470100 -1.746119  0.012921 -0.135434 -0.114438 -0.814146
2  0.399524 -1.143016  1.150325 -0.659419 -0.192943  0.084315  0.498713 -0.103001  ...  1.186128  1.945769  0.946035  0.469909 -0.326366  0.585538  0.473147  0.170713

[3 rows x 50 columns]
>>> pd.set_option('display.max_colwidth', 4)
>>> pd.DataFrame(np.random.randn(3,50))
    0    1    2    3    4    5    6    7    8    9    10  ...   39   40   41   42   43   44   45   46   47   48   49
0  ...  ... -...  ... -... -...  ...  ...  ...  ... -...  ...  ... -...  ... -... -...  ... -...  ...  ...  ... -...
1  ...  ...  ...  ... -... -...  ...  ...  ... -...  ...  ... -... -...  ...  ...  ... -...  ...  ...  ...  ...  ...
2  ... -...  ...  ... -... -...  ...  ... -...  ...  ...  ... -... -...  ... -...  ...  ... -... -...  ...  ... -...

[3 rows x 50 columns]
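Since set_option changes the setting for the rest of the session, pd.option_context is a convenient way to try display options temporarily. A minimal sketch; the choice of 6 columns and a width of 120 is arbitrary:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 50)))

default_max_columns = pd.get_option("display.max_columns")

# Inside the block, at most 6 columns are shown and the rest collapse
# into a "..." column.
with pd.option_context("display.max_columns", 6, "display.width", 120):
    inside = pd.get_option("display.max_columns")
    text = repr(wide)

# On leaving the block the previous value is restored.
restored = pd.get_option("display.max_columns")

print(text)
print(inside, restored)
```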
