pandas Series and DataFrame: a comprehensive study

yangxiaodong88

Article link: https://blog.csdn.net/yangxiaodong88/article/details/80662572

Comprehensive study notes

Index objects

Index objects in pandas are responsible for managing axis labels and other metadata (such as axis names).

from pandas import Series

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index) # Index(['a', 'b', 'c'], dtype='object')
print(index[1:]) # Index(['b', 'c'], dtype='object')

An Index is immutable: users cannot modify it.

index[1] = 'd'
# Traceback (most recent call last):
#   File "E:/pandas_study/comone/a.py", line 8, in <module>
#     index[1] = 'd'
#   File "C:\Python36\lib\site-packages\pandas\core\indexes\base.py", line 1724, in __setitem__
#     raise TypeError("Index does not support mutable operations")
# TypeError: Index does not support mutable operations

Immutability matters, because it is what allows an Index object to be shared safely among multiple data structures.

from pandas import Series
import pandas as pd
import numpy as np

index = pd.Index(np.arange(3))
obj = Series([1.5, -2.5, 0], index=index)

print(index is obj.index)
print(obj.index is index)
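
Because an Index cannot be changed in place, operations that look like modifications hand back a new Index instead. A minimal sketch, assuming the Index.delete and Index.insert methods of current pandas (the variable names are illustrative and not from the original post):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# delete() and insert() each return a new Index; the original is left untouched
new_index = index.delete(1).insert(1, 'd')

print(new_index.tolist())
print(index.tolist())
```

Any Series or DataFrame that still shares the original index is unaffected, which is the point of making it immutable.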

Basic functionality

This section covers the basic ways of working with the data held in a Series or DataFrame.

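1 Reindexing
reindex creates a new object that conforms to a new index.
The example below compares three cases: no index given, an explicit index, and reindexing into a new order.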

from pandas import Series
import pandas as pd
import numpy as np

data = {"a": -5.3, "c": 3.6, "b": 7.2, 'd': 4.5}
# obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj = Series(data)
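print(obj)
# Output from the pandas version used here, which sorted dict keys;
# pandas 0.23+ on Python 3.6+ keeps insertion order instead (a, c, b, d).
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# dtype: float64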
print("=================")
obj2 = Series(data, index=['d', 'b', 'a', 'c'])
print(obj2)
# d    4.5
# b    7.2
# a   -5.3
# c    3.6
# dtype: float64
print("=================")
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# e    NaN
# dtype: float64

If an index value is not currently present, a missing value is introduced.

Use fill_value to supply a value for the entries that would otherwise be missing.

obj3 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=10)
print(obj3)
# a    -5.3
# b     7.2
# c     3.6
# d     4.5
# e    10.0
# dtype: float64

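Reindexing sometimes calls for filling in values along the way. The method option does this; ffill, for example, carries the previous value forward.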

obj = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj)
# 0      blue
# 2    purple
# 4    yellow
# dtype: object
print("====")
obj2 = obj.reindex(range(6), method='ffill')
print(obj2)
# 0      blue
# 1      blue
# 2    purple
# 3    purple
# 4    yellow
# 5    yellow
# dtype: object

ffill: forward fill (carry the previous value forward)
bfill: backward fill (carry the next value backward)
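
To contrast with the ffill example, here is a small sketch of bfill on the same Series. It is only an illustration; the variable name filled is not from the original post.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# bfill takes the next valid value; label 5 has nothing after it, so it stays NaN
filled = obj.reindex(range(6), method='bfill')
print(filled)
```

Note that both methods need the original index to be sorted (monotonic), otherwise pandas cannot tell which value is "previous" or "next".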

Changing the index

For a DataFrame, reindex can change the row index, the columns, or both. If only a sequence is passed, the rows are reindexed.

from pandas import Series, DataFrame
import numpy as np

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['yang', 'xiao', 'dong']
                  )
print(frame)
#    yang  xiao  dong
# a     0     1     2
# c     3     4     5
# d     6     7     8
print("=========")
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
print(frame2)
#    yang  xiao  dong
# a   0.0   1.0   2.0
# b   NaN   NaN   NaN
# c   3.0   4.0   5.0
# d   6.0   7.0   8.0
print("==============")
state = ['yang', 'yan', 'dong']
frame3 = frame.reindex(columns=state)
print(frame3)
#    yang  yan  dong
# a     0  NaN     2
# c     3  NaN     5
# d     6  NaN     8
print("=============")

Both rows and columns can be reindexed, but filling with method is meant to be applied along the rows (axis 0).

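# Reindex rows and columns together. Recent pandas applies method= to every
# axis being reindexed and requires that axis to be monotonic, so forward-fill
# the rows first and reindex the columns in a second step.
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill').reindex(columns=state)
# The original shorthand, frame.ix[['a', 'b', 'c', 'd'], state], relied on .ix,
# which was removed in pandas 1.0.

The label indexing of ix used to make this reindexing task more concise. With .ix gone, and .loc raising KeyError for labels that do not exist, reindex is the way to do this in current pandas.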

Parameters of the reindex function
[Image in the original: table of reindex parameters]
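
Since the parameter table did not survive, here is a rough sketch that exercises three reindex arguments: method, limit and fill_value. The data and variable names are made up for the illustration, and this is not a complete list of the parameters.

```python
import pandas as pd

obj = pd.Series(['blue', 'purple'], index=[0, 3])

# method='ffill' with limit=1 fills at most one consecutive gap after each value
limited = obj.reindex(range(5), method='ffill', limit=1)

# fill_value supplies a constant for labels that did not exist before
constant = obj.reindex(range(5), fill_value='missing')

print(limited)
print(constant)
```

In the first result label 2 stays missing, because it is the second gap in a row after label 0.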

Dropping entries from an axis

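Because dropping involves some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.
Note that it returns a new object; the original is not modified.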

Dropping from a Series

from pandas import Series
import pandas as pd
import numpy as np

obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj)
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# e    4.0
# dtype: float64
print("=============")
new_obj = obj.drop("c")
print(new_obj)
# a    0.0
# b    1.0
# d    3.0
# e    4.0
# dtype: float64
print("===============")
new_obj2 = obj.drop(['a', 'b'])
print(new_obj2)
# c    2.0
# d    3.0
# e    4.0
# dtype: float64
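
The fact that drop returns a new object is easy to check. A minimal sketch (the name trimmed is illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

trimmed = obj.drop(['a', 'b'])   # new Series without 'a' and 'b'

print(len(obj))                  # the original still has all five entries
print(trimmed.index.tolist())

# To "modify" obj, rebind the name to the object that drop returns
obj = obj.drop('c')
print(obj.index.tolist())
```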

Dropping from a DataFrame

Understanding axis=0 and axis=1

0: across the rows, moving vertically downward
1: across the columns, moving horizontally

For drop, removing columns means axis=1 and removing rows means axis=0.

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4,4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame)
#   one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# d   12   13     14    15
print("==============")
frame2 = frame.drop(['a', 'b'])
print(frame2)
#    one  two  three  four
# c    8    9     10    11
# d   12   13     14    15
print("======")
frame3 = frame.drop('two', axis=1)
print(frame3)
#    one  three  four
# a    0      2     3
# b    4      6     7
# c    8     10    11
# d   12     14    15
print("============")
frame4 = frame.drop(['two', 'four'], axis=1)
print(frame4)
#    one  three
# a    0      2
# b    4      6
# c    8     10
# d   12     14

The default is axis=0.
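
One way to get a feel for the two axes is to compare drop with a reduction such as sum. The sketch below uses a small made-up frame; for drop the axis says where the labels live, while for sum it says the direction in which values are collapsed.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['a', 'b'],
                     columns=['one', 'two', 'three'])

col_totals = frame.sum(axis=0)   # move down the rows: one total per column
row_totals = frame.sum(axis=1)   # move across the columns: one total per row

dropped_row = frame.drop('a')             # axis=0 is the default: 'a' is a row label
dropped_col = frame.drop('two', axis=1)   # axis=1: 'two' is a column label

print(col_totals)
print(row_totals)
```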

Indexing, selection and filtering

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj)
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# dtype: float64
print("==")
print(obj['b'])
print(obj.b)
print(obj.iloc[1])  # positional access; obj[1] on a labeled Series is deprecated in recent pandas
print(obj.iloc[3])
# 1.0
# 1.0
# 1.0
# 3.0
print("============")
print(obj[2:4])
print(obj[['b', 'c', 'd']])
print(obj.iloc[[1, 3]])
print(obj[obj < 2])

# c    2.0
# d    3.0
# dtype: float64
# b    1.0
# c    2.0
# d    3.0
# dtype: float64
# b    1.0
# d    3.0
# dtype: float64
# a    0.0
# b    1.0
# dtype: float64

Slicing with labels differs from ordinary Python slicing: the endpoint is included.

print(obj['b':'c'])
#b    1.0
#c    2.0
#dtype: float64

Assigning a value to a slice

obj['b':'c'] = 5
print(obj)
# a    0.0
# b    5.0
# c    5.0
# d    3.0
# dtype: float64
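
The difference between the two kinds of slice is easiest to see side by side. A small sketch using iloc for positions and loc for labels (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_position = obj.iloc[1:2]    # end position excluded: only 'b'
by_label = obj.loc['b':'c']    # end label included: 'b' and 'c'

print(by_position.index.tolist())
print(by_label.index.tolist())
```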

Indexing into a DataFrame retrieves one or more columns.

Special cases of indexing

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )

print(frame[:2])
#   one  two  three  four
#a    0    1      2     3
#b    4    5      6     7
print("========")
print(frame[frame['three'] > 5])
#   one  two  three  four
#b    4    5      6     7
#c    8    9     10    11
#d   12   13     14    15
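
A boolean DataFrame can also be used as a mask, both for selecting and for assigning. A minimal sketch on the same frame (the name mask is illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

mask = frame < 5     # boolean DataFrame with the same shape as frame
frame[mask] = 0      # assign to every cell where the mask is True

print(frame)
```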

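The ix indexing field

ix was introduced for label-based indexing on the rows of a DataFrame. It lets you select a subset of rows and columns from a DataFrame using NumPy-style notation together with axis labels. Note that .ix was deprecated in pandas 0.20 and removed in 1.0; .loc (label-based) and .iloc (position-based) replace it.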

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two', 'three', 'four']
                  )
print(frame)
#    one  two  three  four
# a    0    1      2     3
# b    4    5      6     7
# c    8    9     10    11
# d   12   13     14    15
print(frame.loc['a', ['two', 'three']])  # .loc replaces the removed .ix for labels
# two      1
# three    2
# Name: a, dtype: int32
print("=======")
print(frame.loc[['b', 'c'], frame.columns[[3, 0, 1]]])  # row labels, column positions
#    four  one  two
# b     7    4    5
# c    11    8    9
print(frame.loc[['b', 'c'], ["four", "one", "two"]])
#    four  one  two
# b     7    4    5
# c    11    8    9
print("=======")
print(frame.iloc[2])  # third row by position
# one       8
# two       9
# three    10
# four     11
# Name: c, dtype: int32
print(frame.loc[:'c', 'two'])
# a    1
# b    5
# c    9
# Name: two, dtype: int32
print("=========")
print(frame.loc[frame.three > 5, frame.columns[:3]])  # boolean rows, first three columns
#    one  two  three
# b    4    5      6
# c    8    9     10
# d   12   13     14

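pandas objects offer many ways to select and rearrange data.
The original post summarized them in two tables that were embedded as images and are not reproduced here.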
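
As a rough stand-in for a summary, the sketch below lists the selection forms that current pandas provides. It is a partial overview under my own choice of examples, not a reconstruction of the original tables.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['a', 'b', 'c', 'd'],
                     columns=['one', 'two', 'three', 'four'])

col = frame['two']                       # one column, as a Series
cols = frame[['one', 'three']]           # several columns, as a DataFrame
row = frame.loc['b']                     # one row by label
row_pos = frame.iloc[1]                  # the same row by position
cell = frame.at['c', 'two']              # one scalar by labels
cell_pos = frame.iat[2, 1]               # the same scalar by positions
sub = frame.loc[['a', 'd'], 'one':'two'] # rows by label list, columns by label slice

print(sub)
```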

Arithmetic and data alignment

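An important pandas feature is arithmetic between objects that have different indexes. When objects are added and their index labels do not all match, the index of the result is the union of the two indexes.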

s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s3 = s1 + s2
print(s1)
#a    7.3
#c   -2.5
#d    3.4
#e    1.5
#dtype: float64
print(s2)
#a   -2.1
#c    3.6
#e   -1.5
#f    4.0
#g    3.1
#dtype: float64
print(s3)
#a    5.2
#c    1.1
#d    NaN
#e    0.0
#f    NaN
#g    NaN
#dtype: float64

Automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through arithmetic.
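
Both claims can be checked directly. A small sketch, assuming the same s1 and s2 as above (the names total and expected_index are illustrative):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = s1 + s2
expected_index = s1.index.union(s2.index)

# The result is indexed by the union of both indexes
print(total.index.equals(expected_index))

# Labels present in only one input come out as NaN
only_one_side = total[total.isna()].index.tolist()
print(only_one_side)
```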

For a DataFrame, alignment happens on both the rows and the columns.

df = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
               index=['one', 'two', 'three']
               )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )

print(df)
#          b    c    d
# one    0.0  1.0  2.0
# two    3.0  4.0  5.0
# three  6.0  7.0  8.0
print(df2)
#         b     d     e
# five  0.0   1.0   2.0
# one   3.0   4.0   5.0
# two   6.0   7.0   8.0
# six   9.0  10.0  11.0
print(df + df2)
#          b   c     d   e
# five   NaN NaN   NaN NaN
# one    3.0 NaN   6.0 NaN
# six    NaN NaN   NaN NaN
# three  NaN NaN   NaN NaN
# two    9.0 NaN  12.0 NaN

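The result above contains many NaN values, which we now want to fill.
Use the add method with fill_value. The rule: where a (row, column) position exists in only one of the two objects, the missing side is treated as fill_value. Where the position exists in neither object, because the row is missing from one and the column from the other, the result is still NaN.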

from pandas import Series, DataFrame
import pandas as pd
import numpy as np

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['one', 'two', 'three']
                )

df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['five', 'one', 'two', 'six']
                )
print(df1 + df2)
#         b   c     d   e
# five   NaN NaN   NaN NaN
# one    3.0 NaN   6.0 NaN
# six    NaN NaN   NaN NaN
# three  NaN NaN   NaN NaN
# two    9.0 NaN  12.0 NaN
print(df1)
#          b    c    d
# one    0.0  1.0  2.0
# two    3.0  4.0  5.0
# three  6.0  7.0  8.0
print(df2)
#         b     d     e
# five  0.0   1.0   2.0
# one   3.0   4.0   5.0
# two   6.0   7.0   8.0
# six   9.0  10.0  11.0
df3 = df1.add(df2, fill_value=0)
print(df3)
#          b    c     d     e
# five   0.0  NaN   1.0   2.0
# one    3.0  1.0   6.0   5.0
# six    9.0  NaN  10.0  11.0
# three  6.0  7.0   8.0   NaN
# two    9.0  4.0  12.0   8.0
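
add is not the only arithmetic method that accepts fill_value. A brief sketch with sub, mul and the reversed division rdiv, using two tiny made-up frames a and b:

```python
import pandas as pd

a = pd.DataFrame({'x': [1.0, 2.0]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [10.0]}, index=['r1'])

diff = a.sub(b, fill_value=0)   # row r2 is missing from b, so it is treated as 0
prod = a.mul(b, fill_value=1)   # for multiplication, 1 is the neutral fill value
inv = a.rdiv(1)                 # reversed division: computes 1 / a

print(diff)
print(prod)
print(inv)
```

Choosing the fill value is a design decision: 0 leaves sums and differences unchanged, while 1 does the same for products.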

Operations between DataFrame and Series

Operations between the two rely on broadcasting. Let's first look at an operation between NumPy arrays, then move on to DataFrame and Series.

import numpy as np

arr = np.arange(12.).reshape((3, 4))
print(arr)
#[[ 0.  1.  2.  3.]
# [ 4.  5.  6.  7.]
# [ 8.  9. 10. 11.]]
print(arr[0]) # [0. 1. 2. 3.]
print("=====")
arr2 = arr - arr[0]
print(arr2)
#[[0. 0. 0. 0.]
# [4. 4. 4. 4.]
# [8. 8. 8. 8.]]

Now the same kind of operation between a DataFrame and a Series.

import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(12.).reshape((4,3)), columns=list('bde'),
                  index=['one', 'two', 'three', 'four']
                  )
series = frame.iloc[0]       # first row by position (.ix was removed in pandas 1.0)
print(series)
series2 = frame.loc["one"]   # the same row, selected by label
print(series2)

aa = frame - series
print(aa)
#          b    d    e
# one    0.0  0.0  0.0
# two    3.0  3.0  3.0
# three  6.0  6.0  6.0
# four   9.0  9.0  9.0

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union.

series = Series(range(3), index=['b', 'e', 'f'])
print(frame - series)
#          b   d     e   f
# one    0.0 NaN   1.0 NaN
# two    3.0 NaN   4.0 NaN
# three  6.0 NaN   7.0 NaN
# four   9.0 NaN  10.0 NaN

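Note that the example above broadcasts over the rows. Broadcasting over the columns needs extra care: you must use one of the arithmetic methods.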

series = frame['d']
print(series)
# one       1.0
# two       4.0
# three     7.0
# four     10.0
# Name: d, dtype: float64
print(frame.sub(series, axis=0))
#          b    d    e
# one   -1.0  0.0  1.0
# two   -1.0  0.0  1.0
# three -1.0  0.0  1.0
# four  -1.0  0.0  1.0

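The axis number you pass is the axis to match on. In this example the goal is to match on the DataFrame's row index and broadcast across the columns.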
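
A common use of this pattern is subtracting each row's mean from that row. The sketch below is one possible illustration on the same frame; row_means and centered are names chosen for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['one', 'two', 'three', 'four'])

row_means = frame.mean(axis=1)              # one mean per row, indexed by row label
centered = frame.sub(row_means, axis=0)     # match on the row index, broadcast across columns

print(centered)
```

Writing frame - row_means instead would try to match the row labels against the column names b, d and e, and would produce only NaN.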

Function application and mapping
