The pandas module

What is the difference between Numpy and Pandas?

If we compare them to Python's lists and dictionaries, Numpy is list-like, with no labels attached to the values, while Pandas is dictionary-like. Pandas is built on top of Numpy and makes Numpy-centric applications easier to write.

To use pandas, you first need to understand its two main data structures: Series and DataFrame.

The string representation of a Series shows the index on the left and the values on the right. Since we did not specify an index for the data, an integer index from 0 to N-1 (where N is the length) is created automatically.

A DataFrame is a tabular data structure that contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index, and it can be thought of as a big dictionary of Series (see the small sketch after the imports below).

Officially recommended imports:

from pandas import Series,DataFrame
import pandas as pd
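
To make the "dictionary of Series" picture concrete, here is a minimal sketch (the column names are arbitrary) that builds a DataFrame from two Series:

>>> s1 = Series([1, 2, 3])
>>> s2 = Series(['x', 'y', 'z'])
>>> DataFrame({'a': s1, 'b': s2})   # each Series becomes one column
   a  b
0  1  x
1  2  y
2  3  z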

Creating objects

>>> from pandas import Series,DataFrame
>>> import pandas as pd 
>>> import numpy as np 
>>> s = Series([1,2,3,'a',np.nan,[1,2]])
>>> s
0         1
1         2
2         3
3         a
4       NaN  # NaN means "not a number"
5    [1, 2]
dtype: object
>>> dates = pd.date_range('2017', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')
>>> df = DataFrame(np.random.randn(6,4), index=dates)  # when index or columns are not given, a default 0-based integer index is used
>>> df
                   0         1         2         3
2017-01-01 -0.923905  0.305506  0.676255 -1.428198
2017-01-02  0.234690  1.756183 -0.226916  0.516676
2017-01-03 -0.180496 -0.410745  0.145798 -1.189019
2017-01-04 -0.676189  0.602093 -0.151042 -0.915054
2017-01-05 -1.000729  0.784595  0.623079 -0.551410
2017-01-06  1.024644 -0.305822 -0.867859  0.867652
>>> df = DataFrame(np.random.randn(6,4), columns=('a','b','c','d'))
>>> df
          a         b         c         d
0  0.000196 -1.342386  0.189864 -0.874669
1 -0.638368 -1.403264  0.121946  0.720223
2 -0.504676  0.328643  0.478719 -1.165611
3 -0.011445 -0.775834  0.809029  2.148832
4 -1.012311  1.345237  0.725192 -1.658297
5 -1.580452 -0.664339 -0.370294 -1.370419

Viewing and selecting data
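
The df2 used below can be built, for example, like this (adapted from the "10 Minutes to pandas" guide so that the dtypes match the output shown):

>>> df2 = DataFrame({'A': 1.,
...                  'B': pd.Timestamp('20130102'),
...                  'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...                  'D': np.array([3] * 4, dtype='int32'),
...                  'E': ['test', 'train', 'test', 'train'],
...                  'F': 'foo'})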

>>> df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
>>> df2.head(2)  # first two rows
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
>>> df2.tail(2)
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
>>> df2[0:2]  # slicing rows works, but df2[0] would raise an error
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
>>> df2.A
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
>>> df2['B']
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
>>> df2.values
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df2.describe()  # statistics are computed for numeric columns only
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

loc

We can select data by label. In this example we select one row by its label, or select certain rows or all rows (: means all rows) and then pick one or more columns:

>>> dates = pd.date_range('20130101', periods=6)
>>> df = DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
>>> df.loc['20130102']
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int32
>>> df.loc['2013-01-01':'2013-01-04','A':'C']  # a label slice with ':' includes both endpoints; a positional slice like [0:3] includes the left end but not the right
             A   B   C
2013-01-01   0   1   2
2013-01-02   4   5   6
2013-01-03   8   9  10
2013-01-04  12  13  14
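
A single scalar can also be picked out with loc, or with the faster .at accessor (a small addition for illustration):

>>> df.loc['20130102', 'B']
5
>>> df.at[dates[1], 'B']   # .at is optimized for single scalar access
5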

iloc

We can also select by position with iloc, which lets us pick out a single element, a contiguous block, or non-adjacent rows and columns as needed.

>>> df.iloc[1:4,0:3]   # includes position 1, excludes position 4
             A   B   C
2013-01-02   4   5   6
2013-01-03   8   9  10
2013-01-04  12  13  14
>>> df.iloc[[1,3,4],:]
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
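
The positional counterpart for a single value is .iat (shown here for completeness):

>>> df.iloc[1, 1]
5
>>> df.iat[1, 1]   # positional scalar accessor
5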

ix

We can also mix label- and position-based selection with ix (note: .ix is deprecated in newer pandas; an equivalent using loc is sketched after the example).

>>> df.ix[0:2,['A','D']]
            A  D
2013-01-01  0  3
2013-01-02  4  7
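
Since .ix is deprecated (and removed in pandas 1.0+), the same mixed selection can be written with loc by converting the positional row slice into labels:

>>> df.loc[df.index[0:2], ['A', 'D']]
            A  D
2013-01-01  0  3
2013-01-02  4  7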

Boolean filtering

>>> df[df>5] 
               A     B     C     D
2013-01-01   NaN   NaN   NaN   NaN
2013-01-02   NaN   NaN   6.0   7.0
2013-01-03   8.0   9.0  10.0  11.0
2013-01-04  12.0  13.0  14.0  15.0
2013-01-05  16.0  17.0  18.0  19.0
2013-01-06  20.0  21.0  22.0  23.0
>>> df[df.A>8]  # rows where column A is greater than 8
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
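
Conditions can be combined with & (and) and | (or), each wrapped in parentheses; for example:

>>> df[(df.A > 8) & (df.D < 20)]
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19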

Handling NaN with Pandas

When importing or processing data we sometimes end up with empty or NaN values. How can these NaN entries be dropped or filled in?

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df.iloc[0,1] = np.nan
>>> df.iloc[1,2] = np.nan
"""
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
"""
>>> df.dropna()
             A     B     C   D
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df  # dropna returned a new frame; df itself is unchanged
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df.dropna(axis='columns',how='any')
             A   D
2013-01-01   0   3
2013-01-02   4   7
2013-01-03   8  11
2013-01-04  12  15
2013-01-05  16  19
2013-01-06  20  23
>>> df.fillna(value=-1)
             A     B     C   D
2013-01-01   0  -1.0   2.0   3
2013-01-02   4   5.0  -1.0   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df.isnull() 
                A      B      C      D
2013-01-01  False   True  False  False
2013-01-02  False  False   True  False
2013-01-03  False  False  False  False
2013-01-04  False  False  False  False
2013-01-05  False  False  False  False
2013-01-06  False  False  False  False
>>> np.any(df.isnull()) == True  # check whether any NaN is present; returns True if so
True
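
Instead of a fixed constant, a per-column statistic can also be used as the fill value, e.g. the column mean (a small sketch using the same df):

>>> df.fillna(df.mean())   # B's NaN becomes 13.0, C's becomes 13.2
             A     B     C   D
2013-01-01   0  13.0   2.0   3
2013-01-02   4   5.0  13.2   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23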

Storing and reading data with pandas

Formats that can be read and written include CSV, Excel, HDF5, SQL, JSON, pickle and more; see the pandas I/O documentation for the full list.

>>> path = r'C:\Users\zhifei\Desktop\student.csv'
>>> data = pd.read_csv(path)
>>> data
    Student ID  name   age  gender
0         1100  Kelly   22  Female
1         1101    Clo   21  Female
2         1102  Tilly   22  Female
3         1103   Tony   24    Male
4         1104  David   20    Male
5         1105  Catty   22  Female
6         1106      M    3  Female
7         1107      N   43    Male
8         1108      A   13    Male
9         1109      S   12    Male
10        1110  David   33    Male
11        1111     Dw    3  Female
12        1112      Q   23    Male
13        1113      W   21  Female
>>> type(data)
<class 'pandas.core.frame.DataFrame'>
>>> path2 = r'C:\Users\zhifei\Desktop\json.txt'
>>> data.to_json(path2)
>>> data_2 = pd.read_json(path2)  # note: columns and row labels come back sorted, not in the original order
>>> data_2
    Student ID  age  gender  name 
0         1100   22  Female  Kelly
1         1101   21  Female    Clo
10        1110   33    Male  David
11        1111    3  Female     Dw
12        1112   23    Male      Q
13        1113   21  Female      W
2         1102   22  Female  Tilly
3         1103   24    Male   Tony
4         1104   20    Male  David
5         1105   22  Female  Catty
6         1106    3  Female      M
7         1107   43    Male      N
8         1108   13    Male      A
9         1109   12    Male      S
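
Writing back out works symmetrically, e.g. with to_csv (the output path here is only illustrative):

>>> path3 = r'C:\Users\zhifei\Desktop\student_copy.csv'
>>> data.to_csv(path3, index=False)   # index=False keeps the row index out of the file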

Concatenating and merging data with pandas

pd.concat takes a list of objects plus parameters such as axis, join and ignore_index; see help(pd.concat) for the full signature.

import pandas as pd
import numpy as np

# build the example data sets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])

# concat: vertical (row-wise) concatenation
res = pd.concat([df1, df2, df3], axis=0)

# print the result
print(res)
#     a    b    c    d
# 0  0.0  0.0  0.0  0.0
# 1  0.0  0.0  0.0  0.0
# 2  0.0  0.0  0.0  0.0
# 0  1.0  1.0  1.0  1.0
# 1  1.0  1.0  1.0  1.0
# 2  1.0  1.0  1.0  1.0
# 0  2.0  2.0  2.0  2.0
# 1  2.0  2.0  2.0  2.0
# 2  2.0  2.0  2.0  2.0

res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

# print the result
print(res)
#     a    b    c    d
# 0  0.0  0.0  0.0  0.0
# 1  0.0  0.0  0.0  0.0
# 2  0.0  0.0  0.0  0.0
# 3  1.0  1.0  1.0  1.0
# 4  1.0  1.0  1.0  1.0
# 5  1.0  1.0  1.0  1.0
# 6  2.0  2.0  2.0  2.0
# 7  2.0  2.0  2.0  2.0
# 8  2.0  2.0  2.0  2.0

join='outer' is the default, so this is the behaviour when no extra parameters are given. The frames are aligned on their columns when stacked vertically: columns with the same name are placed on top of each other, columns that exist in only one frame get their own column in the result, and positions without a value are filled with NaN.

import pandas as pd
import numpy as np

# build the example data sets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

# 'outer' join of df1 and df2, stacked vertically
res = pd.concat([df1, df2], axis=0, join='outer')

print(res)
#     a    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN
# 2  0.0  0.0  0.0  0.0  NaN
# 3  0.0  0.0  0.0  0.0  NaN
# 2  NaN  1.0  1.0  1.0  1.0
# 3  NaN  1.0  1.0  1.0  1.0
# 4  NaN  1.0  1.0  1.0  1.0

# 'inner' join of df1 and df2, stacked vertically: only the shared columns are kept
res = pd.concat([df1, df2], axis=0, join='inner')

# print the result
print(res)
#     b    c    d
# 1  0.0  0.0  0.0
# 2  0.0  0.0  0.0
# 3  0.0  0.0  0.0
# 2  1.0  1.0  1.0
# 3  1.0  1.0  1.0
# 4  1.0  1.0  1.0

# reset the index and print the result
res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print(res)
#     b    c    d
# 0  0.0  0.0  0.0
# 1  0.0  0.0  0.0
# 2  0.0  0.0  0.0
# 3  1.0  1.0  1.0
# 4  1.0  1.0  1.0
# 5  1.0  1.0  1.0

join_axes (align on a given axis)

import pandas as pd
import numpy as np

# build the example data sets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

# concatenate horizontally, keeping only the rows in `df1.index`
res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])

# print the result
print(res)
#     a    b    c    d    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
# 2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0

# without join_axes all row labels are kept; print the result
res = pd.concat([df1, df2], axis=1)
print(res)
#     a    b    c    d    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
# 2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
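
The join_axes parameter has been removed from newer pandas releases; the same result can be obtained by reindexing after the concat, for example:

# equivalent without join_axes: concat first, then keep only df1's rows
res = pd.concat([df1, df2], axis=1).reindex(df1.index)
print(res)
#     a    b    c    d    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
# 2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0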

append appends the rows of another DataFrame (or a Series/dict) to the end of the caller; see help(pd.DataFrame.append) for the full signature.


>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2
   C  D
0  5  6
1  7  8
>>> df.append(df2)  # rows can only be added at the bottom
     A    B    C    D
0  1.0  2.0  NaN  NaN
1  3.0  4.0  NaN  NaN
0  NaN  NaN  5.0  6.0
1  NaN  NaN  7.0  8.0
>>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df3,ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
>>> s = pd.Series(['a','b'],index=['A','B'])
>>> df.append(s,ignore_index=True)
   A  B
0  1  2
1  3  4
2  a  b
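
DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat produces the same result:

>>> pd.concat([df, df3], ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8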

merge

pd.merge performs SQL-style joins between two DataFrames on one or more key columns.

For more details, see help(pd.merge).
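
A minimal illustrative example, joining two frames on a shared key column (the data here is made up for the sketch):

>>> left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'lval': [1, 2, 3]})
>>> right = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'rval': [4, 5, 6]})
>>> pd.merge(left, right, on='key')   # inner join on 'key' by default
  key  lval  rval
0  K0     1     4
1  K1     2     5
2  K2     3     6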

Plotting with pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# generate 1000 random data points
data = pd.Series(np.random.randn(1000), index=np.arange(1000))

# take the cumulative sum so the trend is easier to see
data = data.cumsum()

# pandas objects can be plotted directly
data.plot()

plt.show()
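
A DataFrame can be plotted the same way, one line per column (a minimal sketch reusing the imports above):

# four random-walk columns, drawn as four lines in one figure
df = pd.DataFrame(np.random.randn(1000, 4),
                  index=np.arange(1000), columns=list('ABCD'))
df = df.cumsum()
df.plot()
plt.show()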

For more details on plotting, see the matplotlib module.

References:

  1. http://pandas.pydata.org/pandas-docs/stable/10min.html
  2. https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/3-1-pd-intro/