Pandas入门

最新推荐文章于 2022-07-30 17:54:12 发布

人工智障1025

最新推荐文章于 2022-07-30 17:54:12 发布

阅读量214

点赞数

分类专栏： python数据分析文章标签： python 数据分析

本文链接：https://blog.csdn.net/bulo1025/article/details/85717060

版权

python数据分析专栏收录该内容

0 篇文章 0 订阅

订阅专栏

Pandas入门

Pandas数据结构介绍
- Series
- DataFrame
基本功能

Pandas数据结构介绍

两个主要的数据结构：Series和DataFrame。

Series

类似一维数组，由一组数据以及与之相关的标签组成。

In [1]: from pandas import Series,DataFrame

In [2]: import pandas as pd

In [3]: obj = Series([4,7,5,-3])

In [4]: obj
Out[4]:
0    4
1    7
2    5
3   -3
dtype: int64

没有指定索引，所以自动创建0到N-1（N为数据长度）的整数型索引。可以通过Series的values和index属性获取数组表示形式和索引对象。

In [7]: obj.values
Out[7]: array([ 4,  7,  5, -3], dtype=int64)


In [8]: obj.index
Out[8]: RangeIndex(start=0, stop=4, step=1)

那么如何创建带有索引的Series呢？

In [9]: obj2 = Series([4,7,-5,3],index=['d','b','a','c'])

In [10]: obj2
Out[10]:
d    4
b    7
a   -5
c    3
dtype: int64

In [11]: obj2.index
Out[11]: Index(['d', 'b', 'a', 'c'], dtype='object')

索引的建立方便我们快速的从Series中选取单个或多个的值

# 单个值
In [18]: obj2['a']
Out[18]: -5

In [19]: obj2['d'] = 6

# 多个值
In [20]: obj2[['c', 'a', 'd']]
Out[20]: 
c    3
a   -5
d    6
dtype: int64

需要注意的是，上面多个索引选取时，传入的参数是一个list，如果传入‘a’,‘c’,'b’的话会报错：TypeError: ‘tuple’ object cannot be interpreted as an integer

还有一种可以直接创建Series的方法，有些数据直接存放在一个字典中，那么直接传入，字典的key就是索引。

In [26]: sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [27]: obj3 = pd.Series(sdata)

In [28]: obj3
Out[28]: 
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

DataFrame

表格型的数据结构，存在行索引和列索引，可以看成是由Series组成的

那么如何构建DataFrame呢？

# 传入一个字典
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
In [46]: frame.head()
Out[46]: 
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

# 用columns规定列索引，index规定行索引，如果传入的列在数据中找不到，那么就会用NaN值填充

In [48]: frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
   ....:                       index=['one', 'two', 'three', 'four',
   ....:                              'five', 'six'])

In [49]: frame2
Out[49]: 
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

In [50]: frame2.columns
Out[50]: Index(['year', 'state', 'pop', 'debt'], dtype='object')

# 嵌套字典，外层字典的键作为列，内层键作为行索引
In [65]: pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....:        'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [66]: frame3 = pd.DataFrame(pop)

In [67]: frame3
Out[67]: 
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

从DataFrame中，我们可以从列获得Series，有两种方式：

# 索引的方式获得
In [51]: frame2['state']
Out[51]: 
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

# 方法和属性的方式获得
In [52]: frame2.year
Out[52]: 
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

我们也可以从行获得索引，利用loc方法：

In [53]: frame2.loc['three']
Out[53]: 
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

列可以通过赋值的方式修改，将列表或者数组赋给某个列是，长度要和DataFrame相匹配，如果给不存在的列赋值，则会创建一个新列。

警告：通过索引方式返回的是对应数据的视图，所以修改全是在原数据上修改，但可以用copy方法复制列。

基本功能

本节介绍一些pandas的基本手段

重新索引

reindex创建一个新的对象

In [91]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [92]: obj
Out[92]: 
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
# 当前引入的索引不存在时
In [93]: obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [94]: obj2
Out[94]: 
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

有时候需要连续一点的时间序列数据，我们可以进行插值处理

In [95]: obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [96]: obj3
Out[96]: 
0      blue
2    purple
4    yellow
dtype: object

In [97]: obj3.reindex(range(6), method='ffill')
Out[97]: 
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

利用reindex还可以修改行或者列的索引

In [98]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
   ....:                      index=['a', 'c', 'd'],
   ....:                      columns=['Ohio', 'Texas', 'California'])

In [99]: frame
Out[99]: 
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

In [100]: frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [101]: frame2
Out[101]: 
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

# 用columns关键字重新索引
In [102]: states = ['Texas', 'Utah', 'California']

In [103]: frame.reindex(columns=states)
Out[103]: 
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

丢弃指定轴上的项

顾名思义，指定轴肯定指的是columns和index分别的两个轴

# 丢弃Series中的项，就不存在行还是列了
In [105]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [106]: obj
Out[106]: 
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [107]: new_obj = obj.drop('c')

In [108]: new_obj
Out[108]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [109]: obj.drop(['d', 'c'])
Out[109]: 
a    0.0
b    1.0
e    4.0
dtype: float64

# 丢弃DataFrame行的项
In [110]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])

In [111]: data
Out[111]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [112]: data.drop(['Colorado', 'Ohio'])
Out[112]: 
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

# 通过传递 axis = 1或者 axis = 'columns' 删除列的值
In [113]: data.drop('two', axis=1)
Out[113]: 
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15

In [114]: data.drop(['two', 'four'], axis='columns')
Out[114]: 
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14

# drop中传入一个参数inplace可以就地修改，不会生成一个新的对象
In [115]: obj.drop('c', inplace=True)

In [116]: obj
Out[116]: 
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

索引、选取和过滤

对Series的操作

In [117]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [118]: obj
Out[118]: 
a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [119]: obj['b']
Out[119]: 1.0

In [120]: obj[1]
Out[120]: 1.0

In [121]: obj[2:4]  # 注意这里右边是不包含的，下面利用标签进行索引右边是包含的
Out[121]: 
c    2.0
d    3.0
dtype: float64

In [122]: obj[['b', 'a', 'd']]
Out[122]:
b    1.0
a    0.0
d    3.0
dtype: float64

In [123]: obj[[1, 3]]
Out[123]: 
b    1.0
d    3.0
dtype: float64

In [124]: obj[obj < 2]
Out[124]: 
a    0.0
b    1.0
dtype: float64

In [126]: obj['b':'c'] = 5     # 这里是标签的切片，就包括了右边，这是不同点，得记住，而且还可以直接修改

In [127]: obj
Out[127]: 
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

对DataFrame的操作

In [128]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
   .....:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
   .....:                     columns=['one', 'two', 'three', 'four'])

In [129]: data
Out[129]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

In [130]: data['two']
Out[130]: 
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [131]: data[['three', 'one']]
Out[131]: 
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12

用loc和iloc进行选取

轴标签索引loc；整数索引iloc，从DataFrame选取行和列的子集

In [137]: data.loc['Colorado', ['two', 'three']]   # 第一个为行；第二个为列，最后变成了Series
Out[137]: 
two      5
three    6
Name: Colorado, dtype: int64

In [138]: data.iloc[2, [3, 0, 1]]
Out[138]: 
four    11
one      8
two      9
Name: Utah, dtype: int64

In [139]: data.iloc[2]
Out[139]: 
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [140]: data.iloc[[1, 2], [3, 0, 1]]
Out[140]: 
          four  one  two
Colorado     7    4    5
Utah        11    8    9

DataFrame和Series的算术运算

pandas一个神奇的功能是，可以对不同索引的对象进行算术运算。在将对象相加时，如果存在不同的索引对，则结果的索引就是该索引对的并集。自动的数据对齐操作在不重叠的索引处引入了NA值。缺失值会在算术运算过程中传播。下面我们来看看在Series和DataFrame中会发生什么。

# Series中
In [150]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [151]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
   .....:                index=['a', 'c', 'e', 'f', 'g'])

In [152]: s1
Out[152]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [153]: s2
Out[153]: 
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [154]: s1 + s2
Out[154]: 
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

# DataFrame中
In [155]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
   .....:                    index=['Ohio', 'Texas', 'Colorado'])

In [156]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
   .....:                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [157]: df1
Out[157]: 
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0

In [158]: df2
Out[158]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [159]: df1 + df2
Out[159]: 
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

如果想让加法没有缺失值，可以这么做：

In [171]: df1.add(df2, fill_value=0)
Out[171]: 
In [171]: df1.add(df2, fill_value=0)
Out[171]: 
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0   5.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0

DataFrame和Series之间的算术运算

广播（broadcasting）的机制，可以通过下列的例子进行理解：

In [179]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
   .....:                      columns=list('bde'),
   .....:                      index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [180]: series = frame.iloc[0]

In [181]: frame
Out[181]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [182]: series
Out[182]: 
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [183]: frame - series
Out[183]: 
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

简而言之，每一行都会执行这个操作。那么如果想在列上进行操作怎么办呢？，使用算术运算方法可以解决这个问题。

In [186]: series3 = frame['d']

In [187]: frame
Out[187]: 
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0

In [188]: series3
Out[188]: 
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [189]: frame.sub(series3, axis='index')    # 利用参数axis='index' or axis='0'进行广播
Out[189]: 
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0

函数的应用和映射

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [192]: np.abs(frame)    #所有元素求绝对值 
Out[192]: 
               b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221

In [193]: f = lambda x: x.max() - x.min()    
In [194]: frame.apply(f)
Out[194]: 
b    1.802165
d    1.684034
e    2.689627
dtype: float64

In [195]: frame.apply(f, axis='columns')
Out[195]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64

排序

In [203]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
				  index=['three', 'one'],columns=['d', 'a', 'b', 'c']
In [206]: frame.sort_index(axis=1, ascending=False)   # 降序排序
Out[206]: 
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

描述性统计

In [230]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
   .....:                    [np.nan, np.nan], [0.75, -1.3]],
   .....:                   index=['a', 'b', 'c', 'd'],
   .....:                   columns=['one', 'two'])

In [231]: df
Out[231]: 
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

In [232]: df.sum()
Out[232]: 
one    9.25
two   -5.80
dtype: float64

In [234]: df.mean(axis='columns', skipna=False)   #skipna排除缺失值，默认值为True
Out[234]: 
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64