What is the difference between NumPy and pandas?
By analogy with Python's built-in types: NumPy arrays are like lists, holding values without labels, while pandas attaches labels to the data, much like a dict. pandas is built on top of NumPy and makes NumPy-centric workflows much simpler.
To use pandas, first get to know its two main data structures: Series and DataFrame.
A Series prints with the index on the left and the values on the right. If we do not specify an index, an integer index from 0 to N-1 (N being the length) is created automatically.
A DataFrame is a tabular data structure: it holds an ordered collection of columns, and each column can have a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a large dict of Series.
Recommended way to import:
from pandas import Series,DataFrame
import pandas as pd
Creating objects
>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
>>> s = Series([1,2,3,'a',np.nan,[1,2]])
>>> s
0 1
1 2
2 3
3 a
4 NaN # i.e. "not a number"
5 [1, 2]
dtype: object
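When an explicit index is wanted instead of the automatic 0..N-1 one, it can be passed directly. A minimal sketch (the labels 'x', 'y', 'z' are illustrative):

```python
import pandas as pd

# a Series with a custom string index instead of the default 0..N-1
s2 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])

# values can then be looked up by label
print(s2['y'])
```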
>>> dates = pd.date_range('2017', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06'],dtype='datetime64[ns]', freq='D')
>>> df = DataFrame(np.random.randn(6,4), index=dates) # when index and columns are not given, integer labels starting at 0 are used by default.
>>> df
0 1 2 3
2017-01-01 -0.923905 0.305506 0.676255 -1.428198
2017-01-02 0.234690 1.756183 -0.226916 0.516676
2017-01-03 -0.180496 -0.410745 0.145798 -1.189019
2017-01-04 -0.676189 0.602093 -0.151042 -0.915054
2017-01-05 -1.000729 0.784595 0.623079 -0.551410
2017-01-06 1.024644 -0.305822 -0.867859 0.867652
>>> df = DataFrame(np.random.randn(6,4), columns=('a','b','c','d'))
>>> df
a b c d
0 0.000196 -1.342386 0.189864 -0.874669
1 -0.638368 -1.403264 0.121946 0.720223
2 -0.504676 0.328643 0.478719 -1.165611
3 -0.011445 -0.775834 0.809029 2.148832
4 -1.012311 1.345237 0.725192 -1.658297
5 -1.580452 -0.664339 -0.370294 -1.370419
Viewing and selecting data
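The df2 used below is not constructed in the session above; a plausible construction that reproduces the output shown (mirroring the example from the pandas "10 minutes to pandas" guide) is:

```python
import pandas as pd
import numpy as np

# a DataFrame built from a dict of column -> data, with a different dtype per column
df2 = pd.DataFrame({'A': 1.0,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=range(4), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': ['test', 'train', 'test', 'train'],
                    'F': 'foo'})
```

Scalar values like 'A' and 'F' are broadcast to the length implied by the list and array columns.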
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2.head(2) # first two rows
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.tail(2)
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2[0:2] # row slicing works, but df2[0] raises an error, since a single key selects a column
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.A
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
>>> df2['B']
0 2013-01-02
1 2013-01-02
2 2013-01-02
3 2013-01-02
Name: B, dtype: datetime64[ns]
>>> df2.values
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df2.describe() # statistics are computed for numeric columns only
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
loc
Data can be selected by label: pick a single row by its index label, or take some or all rows (':' means all rows) and then select one or more columns.
>>> dates = pd.date_range('20130101', periods=6)
>>> df = DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
>>> df.loc['20130102']
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
>>> df.loc['2013-01-01':'2013-01-04','A':'C'] # label slices with ':' include both endpoints; positional slices like [0:3] include the left end but not the right
A B C
2013-01-01 0 1 2
2013-01-02 4 5 6
2013-01-03 8 9 10
2013-01-04 12 13 14
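For a single cell, .loc also accepts a row label and a column label together, and .at is the scalar-optimized equivalent. A sketch rebuilding the same df so it is self-contained:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# scalar lookup by row label and column label
v1 = df.loc['2013-01-02', 'B']                 # 5
v2 = df.at[pd.Timestamp('2013-01-02'), 'B']    # same value; .at needs the exact label
```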
iloc
Alternatively, select by position with iloc: pick a single element, a contiguous slice, or non-contiguous rows, as needed.
>>> df.iloc[1:4,0:3] #包括1不包括4
A B C
2013-01-02 4 5 6
2013-01-03 8 9 10
2013-01-04 12 13 14
>>> df.iloc[[1,3,4],:]
A B C D
2013-01-02 4 5 6 7
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
ix
Mixed selection (positions on one axis, labels on the other) used to be done with ix. Note that ix was deprecated in pandas 0.20 and removed in 1.0.
>>> df.ix[0:2,['A','D']]
A D
2013-01-01 0 3
2013-01-02 4 7
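In modern pandas the same mixed selection can be written with .loc, translating row positions to labels via df.index (a sketch, with the df rebuilt here to be self-contained):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# positions 0:2 for rows, labels for columns -- the replacement for df.ix[0:2, ['A','D']]
res = df.loc[df.index[0:2], ['A', 'D']]
```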
Boolean filtering
>>> df[df>5]
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN 6.0 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-05 16.0 17.0 18.0 19.0
2013-01-06 20.0 21.0 22.0 23.0
>>> df[df.A>8] # rows where column A is greater than 8
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
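Multiple conditions combine with & and |, each wrapped in parentheses (Python's and/or do not work element-wise). For example, with the same df:

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])

# rows where A > 8 AND B < 21
res = df[(df.A > 8) & (df.B < 21)]
```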
Handling NaN in pandas
Imported or processed data often contains empty or NaN entries. How do we drop or fill them?
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df.iloc[0,1] = np.nan
>>> df.iloc[1,2] = np.nan
"""
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
"""
>>> df.dropna()
A B C D
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df.dropna(axis='columns',how='any')
A D
2013-01-01 0 3
2013-01-02 4 7
2013-01-03 8 11
2013-01-04 12 15
2013-01-05 16 19
2013-01-06 20 23
>>> df.fillna(value=-1)
A B C D
2013-01-01 0 -1.0 2.0 3
2013-01-02 4 5.0 -1.0 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df.isnull()
A B C D
2013-01-01 False True False False
2013-01-02 False False True False
2013-01-03 False False False False
2013-01-04 False False False False
2013-01-05 False False False False
2013-01-06 False False False False
>>> np.any(df.isnull()) == True # check whether any NaN exists; returns True if so
True
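A per-column count of missing values is often more informative than a single boolean; isnull().sum() gives one. A self-contained sketch reproducing the df above (built with a float dtype so the NaN assignments need no upcasting):

```python
import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)), index=dates,
                  columns=['A', 'B', 'C', 'D'])
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# number of NaN values per column: A 0, B 1, C 1, D 0
counts = df.isnull().sum()

# single boolean, equivalent to the np.any check above
has_nan = df.isnull().values.any()
```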
Storing and reading data with pandas
pandas can read and write many formats, for example CSV, JSON, Excel, HDF5 and SQL:
>>> path = r'C:\Users\zhifei\Desktop\student.csv'
>>> data = pd.read_csv(path)
>>> data
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
>>> type(data)
<class 'pandas.core.frame.DataFrame'>
>>> path2 = r'C:\Users\zhifei\Desktop\json.txt'
>>> data.to_json(path2)
>>> data_2 = pd.read_json(path2)
>>> data_2
Student ID age gender name
0 1100 22 Female Kelly
1 1101 21 Female Clo
10 1110 33 Male David
11 1111 3 Female Dw
12 1112 23 Male Q
13 1113 21 Female W
2 1102 22 Female Tilly
3 1103 24 Male Tony
4 1104 20 Male David
5 1105 22 Female Catty
6 1106 3 Female M
7 1107 43 Male N
8 1108 13 Male A
9 1109 12 Male S
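In the session above, read_json returned the rows ordered by the string form of the keys (0, 1, 10, 11, ...). Since the index is parsed back to integers, sort_index() restores the numeric order regardless of what a given pandas version returns. A self-contained round trip:

```python
import io
import pandas as pd

# twelve rows, so that string-ordered keys ('0', '1', '10', '11', '2', ...) would scramble them
df = pd.DataFrame({'age': range(12)})

buf = io.StringIO()
df.to_json(buf)
buf.seek(0)

# sort_index() restores the original 0..11 row order after the round trip
df_back = pd.read_json(buf).sort_index()
```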
Merging data with pandas
concat:
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])
#concatenate vertically with concat
res = pd.concat([df1, df2, df3], axis=0)
#print the result
print(res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 0 1.0 1.0 1.0 1.0
# 1 1.0 1.0 1.0 1.0
# 2 1.0 1.0 1.0 1.0
# 0 2.0 2.0 2.0 2.0
# 1 2.0 2.0 2.0 2.0
# 2 2.0 2.0 2.0 2.0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
#print the result
print(res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 3 1.0 1.0 1.0 1.0
# 4 1.0 1.0 1.0 1.0
# 5 1.0 1.0 1.0 1.0
# 6 2.0 2.0 2.0 2.0
# 7 2.0 2.0 2.0 2.0
# 8 2.0 2.0 2.0 2.0
join='outer' is the default, so it applies when no join argument is given. Concatenation then aligns on columns: shared columns are stacked on top of each other, columns unique to one frame each get their own column, and positions with no value are filled with NaN.
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
#"outer" vertical concatenation of df1 and df2
res = pd.concat([df1, df2], axis=0, join='outer')
print(res)
# a b c d e
# 1 0.0 0.0 0.0 0.0 NaN
# 2 0.0 0.0 0.0 0.0 NaN
# 3 0.0 0.0 0.0 0.0 NaN
# 2 NaN 1.0 1.0 1.0 1.0
# 3 NaN 1.0 1.0 1.0 1.0
# 4 NaN 1.0 1.0 1.0 1.0
#"inner" vertical concatenation of df1 and df2: keep only shared columns
res = pd.concat([df1, df2], axis=0, join='inner')
#print the result
print(res)
# b c d
# 1 0.0 0.0 0.0
# 2 0.0 0.0 0.0
# 3 0.0 0.0 0.0
# 2 1.0 1.0 1.0
# 3 1.0 1.0 1.0
# 4 1.0 1.0 1.0
#reset the index and print the result
res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print(res)
# b c d
# 0 0.0 0.0 0.0
# 1 0.0 0.0 0.0
# 2 0.0 0.0 0.0
# 3 1.0 1.0 1.0
# 4 1.0 1.0 1.0
# 5 1.0 1.0 1.0
join_axes (align on a given axis)
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
#horizontal concatenation aligned on `df1.index`
res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
#print the result
print(res)
# a b c d b c d e
# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
#remove join_axes and print the result
res = pd.concat([df1, df2], axis=1)
print(res)
# a b c d b c d e
# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
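Note that join_axes was removed in pandas 1.0; the documented replacement is to concat and then reindex on the desired axis. A sketch producing the same result as join_axes=[df1.index]:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# keep only df1's row labels: equivalent to the old join_axes=[df1.index]
res = pd.concat([df1, df2], axis=1).reindex(df1.index)
```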
append:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
>>> df
A B
0 1 2
1 3 4
>>> df2
C D
0 5 6
1 7 8
>>> df.append(df2) # append can only stack rows below
A B C D
0 1.0 2.0 NaN NaN
1 3.0 4.0 NaN NaN
0 NaN NaN 5.0 6.0
1 NaN NaN 7.0 8.0
>>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df3,ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
>>> s = pd.Series(['a','b'],index=['A','B'])
>>> df.append(s,ignore_index=True)
A B
0 1 2
1 3 4
2 a b
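DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat covers the same use. A sketch of the equivalent of df.append(df3, ignore_index=True):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# stack df3 below df and renumber the index, as append(ignore_index=True) did
res = pd.concat([df, df3], ignore_index=True)
```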
merge
merge joins two DataFrames on one or more key columns, SQL-style; for the full signature and details, see help(pd.merge).
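A minimal sketch of pd.merge joining two frames on a shared key column (the column names and key values here are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': [4, 5, 6]})

# default is an inner join on 'key': only K0 and K1 appear in both frames
res = pd.merge(left, right, on='key')
print(res)
#   key  A  B
# 0  K0  1  4
# 1  K1  2  5
```

Passing how='outer', 'left' or 'right' changes which keys are kept, with NaN filling the gaps.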
Plotting with pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# generate 1000 random data points
data = pd.Series(np.random.randn(1000), index=np.arange(1000))
# accumulate the data so the trend is easier to see (cumsum() returns a new Series, so assign it)
data = data.cumsum()
# pandas objects can be plotted directly
data.plot()
plt.show()
For more plotting operations, see the matplotlib module.