[3] 2 Data Analysis Tools: Pandas -- Data Structures

Latest recommended article published on 2022-03-06 19:00:46

数佳

Category: Data Analysis. Tags: pandas, python, Series, DataFrame, data structures

Article link: https://blog.csdn.net/yaocong1993/article/details/111052133

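I. The Series Data Structure

A Series is an ordered, labeled one-dimensional array that can hold integers, floats, strings, Python objects, and other data types. The axis labels are collectively called the index.

1. Creating a Series: pd.Series(data, index=[])

1) From an ndarray

When data is an ndarray (a one-dimensional array), index must have the same length as data. If no index is passed, a default integer index [0, ..., len(data) - 1] is created.

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5))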
s # <class 'pandas.core.series.Series'>
'''
0   -1.335396
1   -1.302328
2    0.048324
3   -0.145933
4   -0.821122
dtype: float64
'''
# .index returns the index of the Series; its type is <class 'pandas.core.indexes.range.RangeIndex'>
s.index
# .values returns the values of the Series; its type is <class 'numpy.ndarray'>
s.values

pd.Series(np.random.randn(5), index = list('abcde'))
'''
a   -0.290481
b    1.893950
c    0.927622
d    0.233786
e    1.502956
dtype: float64
'''

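2) From a dict

When data is a dict and no index is passed, the Series index follows the dict's insertion order (Python >= 3.6 and pandas >= 0.23); on older versions the index is sorted by key.
If index is passed, the values in data are looked up by those labels.

d = {'b': 1, 'd': 4, 'a': 0, 'c': 2}
pd.Series(d)
'''
Output from an older version, sorted by key (newer versions keep the insertion order b, d, a, c)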
a    0
b    1
c    2
d    4
dtype: int64
'''

pd.Series(d, index = ['a', 'c', 'b', 'b'])
'''
a    0
c    2
b    1
b    1
dtype: int64
'''
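
The lookup rule above also covers labels that are missing from the dict. The sketch below (the name s_missing is only illustrative) shows that such a label gets NaN, which also turns the dtype into float64:

```python
import pandas as pd

d = {'b': 1, 'd': 4, 'a': 0, 'c': 2}

# 'x' is not a key of d, so its value is NaN and the dtype becomes float64
s_missing = pd.Series(d, index=['a', 'x'])
print(s_missing)
```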

3) From a scalar value

When data is a scalar value, an index must be provided. The scalar is repeated to match the length of the index.

pd.Series('cong', index = range(5))
'''
0    cong
1    cong
2    cong
3    cong
4    cong
dtype: object
'''

2. Indexing and Slicing

s = pd.Series(np.random.randn(6), index = ['a', 'b', 'c', 'd', 'e', 'f'])
'''
a    1.367450
b    0.435309
c    1.596586
d   -0.083855
e   -1.464273
f   -2.672955
dtype: float64
'''

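1) Positional indexing and slicing

Positional slices are half-open [start, stop): the start is included and the stop is excluded.
Because the index here consists of strings, single positions are selected with .iloc; pandas 2.1 and later deprecate treating an integer key in s[...] as a position when the index is not integer-based.

s.iloc[3]
# -0.0838549595595
s.iloc[-1]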
# -2.67295467388
s[1:-2]
'''
b    0.435309
c    1.596586
d   -0.083855
dtype: float64
'''
s[::2]
'''
a    1.367450
c    1.596586
e   -1.464273
dtype: float64
'''

2) Label indexing and slicing

Label slices are closed on both ends [start, stop]: the end label is included.

s['b']
# 0.435309196563
s[['b','a','d']]
'''
b    0.435309
a    1.367450
d   -0.083855
dtype: float64
'''
s['b':'e']
'''
b    0.435309
c    1.596586
d   -0.083855
e   -1.464273
dtype: float64
'''
s['b'::2]
'''
b    0.435309
d   -0.083855
f   -2.672955
dtype: float64
'''
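
To make the difference between the two slicing rules explicit, here is a minimal sketch with fixed values (s_demo is a made-up Series, not the random one above). It uses .iloc for positions and .loc for labels, so the intent does not depend on the type of the index:

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40, 50, 60], index=list('abcdef'))

# Positional slice: positions 1 and 2, the stop position 3 is excluded
by_position = s_demo.iloc[1:3]

# Label slice: labels 'b' through 'd', the end label is included
by_label = s_demo.loc['b':'d']

print(by_position)
print(by_label)
```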

3) Boolean indexing

Applying a comparison to a Series returns a new Series of boolean values.
.isnull() / .notnull() test for missing values. Both None and NaN are treated as missing.

s = pd.Series(np.random.rand(3)*100)
s[2] = None
s[5] = None
s
'''
0    3.35074
1    76.1909
2        NaN
5       None
dtype: object
'''
s > 50
'''
0    False
1     True
2    False
5    False
dtype: bool
'''
s.isnull()
'''
0    False
1    False
2     True
5     True
dtype: bool
'''
s.notnull()
'''
0     True
1     True
2    False
5    False
dtype: bool
'''

s[s > 50]
'''
1    76.1909
dtype: object
'''
s[s.notnull()]
'''
0    3.35074
1    76.1909
dtype: object
'''
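
Boolean masks can also be combined. The following sketch uses made-up values (scores is a hypothetical Series) and joins conditions with & and |; each comparison needs its own parentheses because these operators bind more tightly than comparisons:

```python
import numpy as np
import pandas as pd

scores = pd.Series([3.4, 76.2, np.nan, 55.0, 91.5])

# Both conditions must hold; a NaN compares as False in both
in_range = scores[(scores > 50) & (scores < 90)]

# Either condition may hold
low_or_missing = scores[(scores < 10) | scores.isnull()]

print(in_range)
print(low_or_missing)
```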

3. Basic Operations

1) Viewing the first/last rows: .head() / .tail()

.head() returns the first rows
.tail() returns the last rows
Both return 5 rows by default

s = pd.Series(np.random.rand(50))
s.head()
'''
0    0.459775
1    0.630393
2    0.883201
3    0.976862
4    0.157557
dtype: float64
'''
s.tail(2)
'''
48    0.818718
49    0.776736
dtype: float64
'''

2) .reindex()

.reindex() selects data according to the given labels. A label that is not in the current index gets NaN, or fill_value if one is specified.

s = pd.Series(np.random.rand(3), index = ['a', 'b', 'c'])
s1 = s.reindex(['c', 'a', 'd'])
'''
c    0.779870
a    0.128787
d         NaN
dtype: float64
'''
s2 = s.reindex(['c', 'a', 'd'], fill_value = 0)
'''
c    0.779870
a    0.128787
d    0.000000
dtype: float64
'''

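3) Alignment

Operations between Series align automatically on their labels; a label that appears in only one operand yields NaN.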

chn = pd.Series([88, 72, 95], index = ['Jack', 'Mary', 'Tom'])
math = pd.Series([86, 90, 95], index = ['Lily', 'Mary', 'Jack'])
chn + math
'''
Jack    183.0
Lily      NaN
Mary    162.0
Tom       NaN
dtype: float64
'''
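
If the NaN for unmatched labels is not wanted, one option is the .add() method with fill_value, which substitutes a default for the side that lacks the label. This is only a sketch; math_scores is a renamed copy of the math Series above:

```python
import pandas as pd

chn = pd.Series([88, 72, 95], index=['Jack', 'Mary', 'Tom'])
math_scores = pd.Series([86, 90, 95], index=['Lily', 'Mary', 'Jack'])

# Labels found in only one Series are treated as 0 on the other side
total = chn.add(math_scores, fill_value=0)
print(total)
```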

4) Deleting: .drop()

The inplace parameter defaults to False: .drop() leaves the Series unchanged and returns a new Series.
With inplace=True, the Series itself is modified and .drop() returns None.

s = pd.Series(np.random.rand(3))
s.drop(1)
'''
0    0.286693
2    0.542345
dtype: float64
'''
s.drop([0,1])
'''
2    0.542345
dtype: float64
'''

s.drop(1, inplace = True) # None
s
'''
0    0.286693
2    0.542345
dtype: float64
'''

5) Adding

Assign to a new label to add a value in place;
Use pd.concat() to join Series into a new Series without changing the originals (the older Series.append() method was deprecated in pandas 1.4 and removed in 2.0).

s = pd.Series(np.random.rand(2))
s[5] = 100
s['a'] = 100
s
'''
0      0.864331
1      0.662573
5    100.000000
a    100.000000
dtype: float64
'''

s1 = pd.Series(np.random.rand(2), index = ['a', 'c'])
s2 = pd.Series(np.random.rand(2), index = ['b', 'd'])
pd.concat([s1, s2])
'''
a    0.443560
c    0.406555
b    0.899235
d    0.656586
dtype: float64
'''

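6) The .name attribute and .rename()

A Series has a name attribute. In many cases the name is assigned automatically, in particular when a one-dimensional slice of a DataFrame is taken.
.rename() returns a new Series with the new name; it is a different object from the original Series.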

s = pd.Series(np.random.rand(2), name = "test")
s
'''
0    0.311986
1    0.313821
Name: test, dtype: float64
'''
s.name # test

s2 = s.rename("test2")
s2[1] = 100
s2
'''
0      0.311986
1    100.000000
Name: test2, dtype: float64
'''
s
'''
0    0.311986
1    0.313821
Name: test, dtype: float64
'''

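II. The DataFrame Data Structure

A DataFrame is a two-dimensional labeled data structure whose columns can have different types.

1. Creating a DataFrame: pd.DataFrame(data, index=[], columns=[])

Summary:

For a list or ndarray, a specified index must have the same length as the list or ndarray
(because index relabels the rows rather than selecting some of them);
For a nested dict or Series, a specified index selects the values stored under the matching keys or labels.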
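
The two behaviors in this summary can be seen side by side in a small sketch (df_list, ser, and df_series are illustrative names, not from the original):

```python
import pandas as pd

# List values: index only relabels the rows, so the lengths must match
df_list = pd.DataFrame({"x": [1, 2, 3]}, index=list("abc"))

try:
    pd.DataFrame({"x": [1, 2, 3]}, index=list("ab"))
    length_error = False
except ValueError:
    length_error = True

# Series values: index selects matching labels and may be shorter
ser = pd.Series([1, 2, 3], index=list("abc"))
df_series = pd.DataFrame({"x": ser}, index=["c", "a"])

print(df_list)
print(length_error)
print(df_series)
```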

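1) Dict of lists / dict of ndarrays

When a DataFrame is created from a dict of lists or a dict of ndarrays:

The columns are the dict keys (sorted by name in the older pandas version that produced the outputs below; newer versions keep the dict's insertion order). columns can be specified explicitly; a name that is not a key in the dict produces a column of NaN.
The index defaults to integer labels. If index is passed, its length must match the length of the lists / ndarrays.
All lists / ndarrays must have the same length.

# Create from a dict of lists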
dic = {"name": ["Mary", "Tom", "Jack"],
         "age": [18, 25, 17]}
df = pd.DataFrame(dic) # <class 'pandas.core.frame.DataFrame'> 
'''
   age  name
0   18  Mary
1   25   Tom
2   17  Jack
'''

# .index returns the row labels: RangeIndex(start=0, stop=3, step=1), of type <class 'pandas.core.indexes.range.RangeIndex'>
df.index
# .columns returns the column labels: Index(['age', 'name'], dtype='object'), of type <class 'pandas.core.indexes.base.Index'>
df.columns
# .values returns the values, of type <class 'numpy.ndarray'>
df.values
'''
[[18 'Mary']
 [25 'Tom']
 [17 'Jack']]
'''

# The columns are the dict keys; columns can be specified explicitly, and a name that is not a dict key produces NaN.
dic = {"name": ["Mary", "Tom", "Jack"],
         "age": [18, 25, 17]}
df = pd.DataFrame(dic, columns = ["name", "gender"])
'''
   name gender
0  Mary    NaN
1   Tom    NaN
2  Jack    NaN
'''

# Create from a dict of ndarrays
dic = {"one":np.random.rand(3),
       "two":np.random.rand(3)}
df = pd.DataFrame(dic)
'''
        one       two
0  0.281193  0.777831
1  0.528591  0.618943
2  0.511128  0.215831
'''
df2 = pd.DataFrame(dic, index = list("bac")) # same values, rows relabeled b, a, c
'''
        one       two
b  0.281193  0.777831
a  0.528591  0.618943
c  0.511128  0.215831
'''

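2) Dict of Series

When a DataFrame is created from a dict of Series:

The columns are the dict keys. columns can be specified explicitly; a name that is not a key in the dict produces a column of NaN.
The index comes from the Series indexes. Without index, the result index is the union of all the Series indexes, and missing values are NaN; with index, only the specified labels are taken from each Series.
The Series may have different lengths.

# Create from a dict of Series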
dic = {"one":pd.Series(np.random.rand(3), index = ["a", "b", "c"]),
       "two":pd.Series(np.random.rand(2), index = ["b", "c"])}
pd.DataFrame(dic)
'''
        one       two
a  0.690421       NaN
b  0.067540  0.876073
c  0.213951  0.149273
'''
pd.DataFrame(dic, index = ["b", "a"])
'''
        one       two
b  0.067540  0.876073
a  0.690421       NaN
'''
pd.DataFrame(dic, columns = ["one", "three"])
'''
        one three
a  0.690421   NaN
b  0.067540   NaN
c  0.213951   NaN
'''

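3) Two-dimensional array

When a DataFrame is created from a two-dimensional array:

The lengths of index and columns must match the number of rows and the number of columns of the array, respectively.

# Create from a two-dimensional array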
arr = np.random.rand(8).reshape(2, 4)
pd.DataFrame(arr)
'''
          0         1         2         3
0  0.821861  0.491297  0.306242  0.416237
1  0.743759  0.409836  0.833850  0.557605
'''
pd.DataFrame(arr, index = list("ab"), columns = ["one", "two", "three", "four"])
'''
        one       two     three      four
a  0.821861  0.491297  0.306242  0.416237
b  0.743759  0.409836  0.833850  0.557605
'''

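4) List of dicts

When a DataFrame is created from a list of dicts:

The columns are the dict keys (sorted by name in the older version shown below). columns can be specified explicitly; a name that is not a key in any of the dicts produces NaN.
The index defaults to integer labels. If index is passed, its length must match the length of the list.

# Create from a list of dicts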
lst = [{"a": 1, "c": 5, "b": 6},
      {"b": 3, "c": 3}]
pd.DataFrame(lst)
'''
     a  b  c
0  1.0  6  5
1  NaN  3  3
'''
pd.DataFrame(lst, index = ["one", "two"], columns = ["b", "c"])
'''
     b  c
one  6  5
two  3  3
'''

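5) Nested dict

When a DataFrame is created from a nested dict (a dict of dicts):

The columns are the outer dict keys. columns can be specified explicitly; a name that is not an outer key produces NaN.
The index defaults to the inner dict keys. index can be specified explicitly; a label that is not an inner key produces NaN.

# Create from a nested dict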
dic = {'Jack': {'math': 99, 'chinese': 83, 'art': 87},
       'Mary': {'math': 90, 'chinese': 90, 'art': 91},
        'Bob': {'math': 86, 'chinese': 79}}
pd.DataFrame(dic)
'''
          Bob  Jack  Mary
art       NaN    87    91
chinese  79.0    83    90
math     86.0    99    90
'''
pd.DataFrame(dic, index = ["math", "chinese", "english"], columns = ["Mary", "Jack", "Tom"])
'''
         Mary  Jack  Tom
math     90.0  99.0  NaN
chinese  90.0  83.0  NaN
english   NaN   NaN  NaN
'''

2. Indexing and Slicing

1) Selecting columns: df[col]

Selecting a col that does not exist raises a KeyError.

# Select columns: df[col]
df = pd.DataFrame(np.random.rand(12).reshape(3, 4),
                 index = ["one", "two", "three"],
                 columns = ["a", "b", "c", "d"])
'''
              a         b         c         d
one    0.601347  0.617820  0.105009  0.128088
two    0.152353  0.242718  0.325983  0.163464
three  0.063760  0.216591  0.028874  0.729303
'''
df["c"] # <class 'pandas.core.series.Series'>
'''
one      0.105009
two      0.325983
three    0.028874
Name: c, dtype: float64
'''
df[["d", "b"]] # <class 'pandas.core.frame.DataFrame'>
'''
              d         b
one    0.128088  0.617820
two    0.163464  0.242718
three  0.729303  0.216591
'''

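2) Selecting rows by label: df.loc[label]

A label that does not exist raises a KeyError (pandas 1.0 and later; older versions returned NaN rows for missing labels in a list). Use df.reindex(labels) to get NaN rows for missing labels instead.

# Select rows by label: df.loc[label]
df = pd.DataFrame(np.random.rand(12).reshape(3,4),
                 columns = ["a", "b", "c", "d"])
'''
          a         b         c         d
0  0.515504  0.615885  0.185691  0.716931
1  0.039046  0.875238  0.145669  0.310193
2  0.243892  0.817281  0.701406  0.830673
'''
df.loc[1] # <class 'pandas.core.series.Series'>
'''
a    0.039046
b    0.875238
c    0.145669
d    0.310193
Name: 1, dtype: float64
'''
df.reindex([3, 0]) # <class 'pandas.core.frame.DataFrame'>; df.loc[[3, 0]] would raise a KeyError because label 3 is missing
'''
          a         b         c         d
3       NaN       NaN       NaN       NaN
0  0.515504  0.615885  0.185691  0.716931
'''
# Label slice, end label included
df.loc[0:1] # <class 'pandas.core.frame.DataFrame'>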
'''
          a         b         c         d
0  0.515504  0.615885  0.185691  0.716931
1  0.039046  0.875238  0.145669  0.310193
'''
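
When some requested labels may be absent, one way to stay with .loc is to filter the label list first. The sketch below assumes pandas 1.0 or later, where a list containing a missing label raises a KeyError; df_demo, wanted, and present are illustrative names:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0]})  # default index 0, 1, 2
wanted = [3, 0]

# Label 3 does not exist, so .loc is expected to raise a KeyError
try:
    df_demo.loc[wanted]
    raised = False
except KeyError:
    raised = True

# Keep only the labels that are present, preserving the requested order
present = [label for label in wanted if label in df_demo.index]
subset = df_demo.loc[present]

print(raised)
print(subset)
```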

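3) Selecting rows by integer position: df.iloc[index]

An out-of-bounds integer position raises an IndexError (out-of-bounds slices are truncated instead).

# Select rows by integer position: df.iloc[index]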
df = pd.DataFrame(np.random.rand(12).reshape(4, 3),
                  index = ['one', 'two', 'three', 'four'],
                  columns = ['a', 'b', 'c'])
'''
              a         b         c
one    0.925215  0.523920  0.825322
two    0.177753  0.069859  0.113256
three  0.490654  0.690347  0.910460
four   0.880070  0.192099  0.161468
'''
df.iloc[1] # <class 'pandas.core.series.Series'>
'''
a    0.177753
b    0.069859
c    0.113256
Name: two, dtype: float64
'''
df.iloc[[3,-1]] # <class 'pandas.core.frame.DataFrame'>
'''
            a         b         c
four  0.88007  0.192099  0.161468
four  0.88007  0.192099  0.161468
'''
df.iloc[::2] # <class 'pandas.core.frame.DataFrame'>
'''
              a         b         c
one    0.925215  0.523920  0.825322
three  0.490654  0.690347  0.910460
'''
# Positional slice, end position excluded
df.iloc[1:3] # <class 'pandas.core.frame.DataFrame'>
'''
              a         b         c
two    0.177753  0.069859  0.113256
three  0.490654  0.690347  0.910460
'''
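
.iloc also accepts a second indexer for columns, so rows and columns can be picked by position in one call. A minimal sketch with fixed values (df_pos is a made-up frame with the same labels as above):

```python
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(np.arange(12).reshape(4, 3),
                      index=['one', 'two', 'three', 'four'],
                      columns=['a', 'b', 'c'])

# Rows at positions 1 and 2, columns at positions 0 and 2
block = df_pos.iloc[1:3, [0, 2]]

# A single cell: first row, second column
cell = df_pos.iloc[0, 1]

print(block)
print(cell)
```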

4) Boolean indexing: df[bool_vec]

# Boolean indexing: df[bool_vec]
df = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                  index = ['one', 'two', 'three', 'four'],
                  columns = ['a', 'b', 'c', 'd'])
'''
               a          b          c          d
one    88.144522  68.527028  47.129355  32.633366
two    78.557709  81.340032  26.315262  13.479440
three  25.819493  63.532810  99.830179  22.410082
four   91.287652  88.188402   9.637995  81.761480
'''

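Testing every value: all rows and columns are returned; where the mask is True the original value is kept, and where it is False the value becomes NaN.

# Test every value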
df > 50
'''
           a     b      c      d
one     True  True  False  False
two     True  True  False  False
three  False  True   True  False
four    True  True  False   True
'''
df[df > 50] # <class 'pandas.core.frame.DataFrame'>
'''
               a          b          c         d
one    88.144522  68.527028        NaN       NaN
two    78.557709  81.340032        NaN       NaN
three        NaN  63.532810  99.830179       NaN
four   91.287652  88.188402        NaN  81.76148
'''

Testing a single column to select rows: only the rows where the mask is True are returned.

# Test a single column to select rows
df["a"] > 50
'''
one       True
two       True
three    False
four      True
Name: a, dtype: bool
'''
df[df["a"] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
              a          b          c          d
one   88.144522  68.527028  47.129355  32.633366
two   78.557709  81.340032  26.315262  13.479440
four  91.287652  88.188402   9.637995  81.761480
'''

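Testing several columns: all rows and columns are returned; True keeps the original value, False becomes NaN, and columns that were not tested are entirely NaN.

# Test several columns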
df[["a", "b"]] > 50
'''
           a     b
one     True  True
two     True  True
three  False  True
four    True  True
'''
df[df[["a", "b"]] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
               a          b   c   d
one    88.144522  68.527028 NaN NaN
two    78.557709  81.340032 NaN NaN
three        NaN  63.532810 NaN NaN
four   91.287652  88.188402 NaN NaN
'''

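Testing several rows: all rows and columns are returned; True keeps the original value, False becomes NaN, and rows that were not tested are entirely NaN.

# Test several rows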
df.loc[["two", "three"]] > 50
'''
           a     b      c      d
two     True  True  False  False
three  False  True   True  False
'''
df[df.loc[["two", "three"]] > 50] # <class 'pandas.core.frame.DataFrame'>
'''
               a          b          c   d
one          NaN        NaN        NaN NaN
two    78.557709  81.340032        NaN NaN
three        NaN  63.532810  99.830179 NaN
four         NaN        NaN        NaN NaN
'''
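
Two related patterns may be worth a short sketch: filtering rows on conditions over several columns, and the .where() method, which appears to be what df[bool_frame] does under the hood. The frame df_mask below uses rounded, made-up values similar to the example above:

```python
import pandas as pd

df_mask = pd.DataFrame({"a": [88.1, 78.6, 25.8, 91.3],
                        "b": [68.5, 81.3, 63.5, 88.2]},
                       index=["one", "two", "three", "four"])

# Row filter: both column conditions must hold for a row to be kept
both = df_mask[(df_mask["a"] > 50) & (df_mask["b"] > 80)]

# Element-wise mask: same shape as the input, NaN where the test fails
masked = df_mask.where(df_mask > 80)

print(both)
print(masked)
```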

5) Combining indexing methods

df[["a", "c"]].loc[["one", "four"]]
'''
              a          c
one   88.144522  47.129355
four  91.287652   9.637995
'''
df[df["a"] < 50].iloc[0]
'''
a    25.819493
b    63.532810
c    99.830179
d    22.410082
Name: three, dtype: float64
'''
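
The chained selections above can usually be written as a single .loc call that takes a row selector and a column selector together. A sketch under that assumption, with rounded made-up values (df_combo is an illustrative name):

```python
import pandas as pd

df_combo = pd.DataFrame({"a": [88.1, 78.6, 25.8, 91.3],
                         "c": [47.1, 26.3, 99.8, 9.6],
                         "d": [32.6, 13.5, 22.4, 81.8]},
                        index=["one", "two", "three", "four"])

# Row labels and column labels in one call
picked = df_combo.loc[["one", "four"], ["a", "c"]]

# Boolean row mask plus column labels, then the first matching row
first_low = df_combo.loc[df_combo["a"] < 50, ["a", "c"]].iloc[0]

print(picked)
print(first_low)
```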

3. Basic Operations

1) Transpose: .T

# Transpose
df = pd.DataFrame(np.random.rand(6).reshape(2, 3),
                 index = ["one", "two"],
                 columns = ["a", "b", "c"])
'''
            a         b         c
one  0.583898  0.506374  0.651442
two  0.193543  0.416532  0.912228
'''
df.T
'''
        one       two
a  0.583898  0.193543
b  0.506374  0.416532
c  0.651442  0.912228
'''

2) Viewing the first/last rows: .head() / .tail()

.head() returns the first rows
.tail() returns the last rows
Both return 5 rows by default

3) Adding and modifying

Rows and columns are added and modified through indexing.

# Add
df = pd.DataFrame(np.random.rand(9).reshape(3, 3) * 100,
                 index = ["one", "two", "three"],
                 columns = list("abc"))


df["d"] = 10 # add a column filled with a scalar value
df.loc["four"] = df.loc["one"] + df.loc["two"] # add a row

df
'''
               a           b           c     d
one    42.366201   81.319170   65.621465  10.0
two    23.171599   35.551451   55.851890  10.0
three  64.258966   25.451767   37.622228  10.0
four   65.537800  116.870621  121.473355  20.0
'''

# Modify
df = pd.DataFrame(np.random.rand(9).reshape(3, 3) * 100,
                 index = ["one", "two", "three"],
                 columns = list("abc"))

df["a"] = df["b"] + df["c"]
df.loc[["two", "three"], "b"] = 20 # a single .loc call; the chained form df["b"].loc[...] = 20 may assign to a copy

df
'''
                a          b          c
one     84.388654  59.621808  24.766845
two    129.506297  20.000000  91.568479
three  116.821680  20.000000  89.145148
'''

4) Deleting: del / .drop()

# Delete
df = pd.DataFrame(np.random.rand(25).reshape(5, 5) * 100,
                 index = ["one", "two", "three", "four", "five"],
                 columns = list("abcde"))

# Delete columns
del df["a"] # modifies the original
c = df.pop("c") # modifies the original and returns the removed column
df.drop(["d"], axis = 1, inplace = True) # modifies the original

newDf = df.drop(["b"], axis = 1) # inplace = False: returns a new DataFrame and leaves the original unchanged

df
'''
               b          e
one    29.632793  94.193498
two    84.801648   5.697404
three  79.095469  57.280865
four   45.895612  78.464924
five   20.933363  84.786636
'''
newDf
'''
               e
one    94.193498
two     5.697404
three  57.280865
four   78.464924
five   84.786636
'''

# Delete rows
df.drop("one", inplace = True)
df.drop(["three", "four"], inplace = True)

df
'''
              b          e
two   84.801648   5.697404
five  20.933363  84.786636
'''
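
As an alternative to axis, .drop() also takes index= and columns= keywords, which can read more clearly when rows and columns are removed together. A small sketch with made-up data (df_drop is an illustrative name):

```python
import numpy as np
import pandas as pd

df_drop = pd.DataFrame(np.arange(9).reshape(3, 3),
                       index=["one", "two", "three"],
                       columns=list("abc"))

# Remove one row and two columns in a single call; df_drop is unchanged
trimmed = df_drop.drop(index=["one"], columns=["a", "c"])

# errors="ignore" skips labels that do not exist instead of raising
ignored = df_drop.drop(columns=["zzz"], errors="ignore")

print(trimmed)
```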

5) Alignment

Operations between DataFrames align automatically on both index and columns.

# Alignment
df1 = pd.DataFrame(np.floor(np.random.rand(9).reshape(3, 3) * 10),
                   index = list("abc"))
df2 = pd.DataFrame(np.floor(np.random.rand(25).reshape(5, 5) * 10),
                   index = ["c", "d", "e", "b", "a"])
df1
'''
     0    1    2
a  4.0  6.0  5.0
b  1.0  5.0  6.0
c  4.0  8.0  9.0
'''
df2
'''
     0    1    2    3    4
c  9.0  1.0  8.0  7.0  8.0
d  1.0  5.0  2.0  7.0  9.0
e  6.0  6.0  1.0  7.0  7.0
b  9.0  5.0  8.0  5.0  1.0
a  0.0  1.0  3.0  9.0  1.0
'''
df1 + df2
'''
      0     1     2   3   4
a   4.0   7.0   8.0 NaN NaN
b  10.0  10.0  14.0 NaN NaN
c  13.0   9.0  17.0 NaN NaN
d   NaN   NaN   NaN NaN NaN
e   NaN   NaN   NaN NaN NaN
'''
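
As with Series, the .add() method with fill_value can reduce the number of NaN results: a cell missing on only one side is treated as the fill value, while a cell missing on both sides stays NaN. A sketch with small made-up frames (left and right are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({"x": [1.0, 2.0]}, index=["a", "b"])
right = pd.DataFrame({"x": [10.0, 20.0], "y": [5.0, 6.0]}, index=["b", "c"])

# Plain addition: any cell missing on either side becomes NaN
plain = left + right

# With fill_value, a cell missing on one side only is treated as 0
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```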

6) Sorting: .sort_values() / .sort_index()

These also work on a Series.

# Sort
df = pd.DataFrame(np.floor(np.random.rand(15).reshape(5, 3) * 10),
                  index = ["c", "d", "e", "b", "a"],
                  columns = ["one", "two", "three"])
'''
   one  two  three
c  0.0  3.0    4.0
d  0.0  4.0    0.0
e  5.0  6.0    9.0
b  5.0  1.0    6.0
a  7.0  8.0    0.0
'''

# Sort by values: .sort_values()
df.sort_values("one", ascending = True)
'''
   one  two  three
c  0.0  3.0    4.0
d  0.0  4.0    0.0
e  5.0  6.0    9.0
b  5.0  1.0    6.0
a  7.0  8.0    0.0
'''
df.sort_values("one", ascending = False)
'''
   one  two  three
a  7.0  8.0    0.0
e  5.0  6.0    9.0
b  5.0  1.0    6.0
c  0.0  3.0    4.0
d  0.0  4.0    0.0
'''
df.sort_values(["one", "three"]) # sort by several columns: ascending by the first, ties broken by the second (also ascending)
'''
   one  two  three
d  0.0  4.0    0.0
c  0.0  3.0    4.0
b  5.0  1.0    6.0
e  5.0  6.0    9.0
a  7.0  8.0    0.0
'''

df.sort_values("two", inplace = True) # None
df
'''
   one  two  three
b  5.0  1.0    6.0
c  0.0  3.0    4.0
d  0.0  4.0    0.0
e  5.0  6.0    9.0
a  7.0  8.0    0.0
'''

# Sort by index: .sort_index()
# Defaults: ascending = True, inplace = False
df.sort_index()
'''
   one  two  three
a  7.0  8.0    0.0
b  5.0  1.0    6.0
c  0.0  3.0    4.0
d  0.0  4.0    0.0
e  5.0  6.0    9.0
'''
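
The sort direction can also be set per column by passing a list to ascending, and .sort_index() accepts ascending=False. The sketch below reuses the "one" and "three" values from the example above in a made-up frame named df_sort:

```python
import pandas as pd

df_sort = pd.DataFrame({"one": [0, 0, 5, 5, 7],
                        "three": [4, 0, 9, 6, 0]},
                       index=list("cdeba"))

# Ascending by "one", ties broken by "three" in descending order
mixed = df_sort.sort_values(["one", "three"], ascending=[True, False])

# Index labels in descending order
by_index_desc = df_sort.sort_index(ascending=False)

print(mixed)
print(by_index_desc)
```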