Pandas 是一个很强大的数据科学分析工具,你可以把它当做是Excel。它是NumPy的延伸库。如果结合NumPy使用,基本上可以解决大多数据问题。 我将从下面几个方面来介绍Pandas的一些常用功能和函数:
- Series 级数
- DataFrames 数据框
- Missing Data 空值(无效值)
- GroupBy 分组
- Merging, Joining,and Concatenating 归并
- Operations 运算操作0
- Data Input and Output 数据输入输出
Series
Series 很像 NumPy中的Array。两者的区别是: Series会有一个一个axis labels(维度标签),这个axis labels不仅仅是数字,Series可以用这个作为索引。而array的只能用实数表示位置。另外Series中的数据可以是任何类型的object,但是array中只能是数字(NumPy1.1版本数据类型)。从这个功能上看,Series有点像是Hash table。
让我们来看几个例子先:
import numpy as np
import pandas as pd
创建一个 Series
你可以将 list,numpy array, 或者 dictionary 转换成 Series:
labels = ['a','b','c'] #list
my_list = [10,20,30] #list
arr = np.array([10,20,30]) # numpy array
d = {'a':10,'b':20,'c':30} # dictionary
用 Lists
pd.Series(data=my_list)
0 10
1 20
2 30
dtype: int64
pd.Series(data=my_list,index=labels) # Series主要的两个参数data在前,index在后
a 10
b 20
c 30
dtype: int64
pd.Series(my_list,labels) # 在正式项目中最好指明哪个变量指向哪个参数
以后我还要介绍更多数据科学工具,有些功能与NumPy或者pandas相似,但是更专业,出来的效果对business更有帮助。所以在对代码格式和报告要求不严格的项目中可以随意一些,但是严谨的business项目中最好要对格式有所要求。
a 10
b 20
c 30
dtype: int64
NumPy Arrays
pd.Series(arr) # axis labels 默认为有理实数
0 10
1 20
2 30
dtype: int64
pd.Series(arr,labels)
a 10
b 20
c 30
dtype: int64
Dictionary
pd.Series(d)
a 10
b 20
c 30
dtype: int64
Series里的数据类型
pandas Series 可以存放很多种数据类型:
pd.Series(data=labels)
0 a
1 b
2 c
dtype: object
# 甚至函数
pd.Series([sum,print,len])
0 <built-in function sum>
1 <built-in function print> # 必须加 from __future__ import print_function 在代码前,原因参考下面列表
2 <built-in function len>
dtype: object
所有的build-in函数列表
使用Index
使用Series的关键在于使用index,Series包含两个关键参数:data和index,可见index的地位与data同等重要。学会使用index可以更快查找数据。 下面几个例子将练习如何在Series中使用index,首先我们创建两个Series: ser1 & ser2
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
USA 1
Germany 2
USSR 3
Japan 4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2
USA 1
Germany 2
Italy 5
Japan 4
dtype: int64
ser1['USA']
1
算数运算也会基于index进行
ser1 + ser2
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64 # 一旦进行算数运算结果自动转换为float.并且结果根据index将序排列
简单介绍一下,如果想向Series中增加数据,一般用set_value.后面的参数需要跟index和data。append和add方法只能加入另外一个一个Series。
ser2.set_value('Canada',7)
USA 1
Germany 2
Italy 5
Japan 4
china 6
Canada 7
dtype: int64
ser1 + ser2
Canada NaN
China NaN
Germany 4.0
Italy NaN
Japan 8.0
USA 2.0
USSR NaN
dtype: float64
# 只有两个Series中共有的index相加才有有结果,否则得到NaN值。
DataFrames
DataFrames 的重要性不言而喻,受R语言的启发而开发出来的。 DataFrame就是许多object分享同一个index。在我学习R的时候很难理解DataFrame的意义。简单点来说你可以把它当做Excel里的一个sheet,或者数据库里的一个表。 让我们来看几个例子
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101) # seed 相当于reset。 因为随机数每次会取一个数作为seed,然后对这个数进行运算,通过算法得到一系列随机数,如果seed定了,那么得到的随机数是可以预测的。所以我们每次可以重置seed。
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Selection and Indexing
从DataFrame中选取数据
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column names 注意是双括号
df[['W','Z']]
| W | Z |
---|
A | 2.706850 | 0.503826 |
B | 0.651118 | 0.605965 |
C | -2.018168 | -0.589001 |
D | 0.188695 | 0.955057 |
E | 0.190794 | 0.683509 |
用SQL语法(不推荐)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame 的 Columns 就是 Series (DataFrame Columns are just Series)
type(df['W'])
pandas.core.series.Series
增加列
df['new'] = df['W'] + df['Y']
df
| W | X | Y | Z | new |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
移除列
df.drop('new',axis=1) #axis=1指的是列,axis=0指行,axis=2指二维表,以此类推
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
注意:inplace参数默认为false,即是否替换原数据。上一个drop没有设置inplace参数,所以df的数据,没有改变
df
| W | X | Y | Z | new |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
现在我们设置inplace参数
df.drop('new',axis=1,inplace=True)
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
同样可以删除行
df.drop('E',axis=0)
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
选择行
df.loc['A'] # 选择行必须要用loc或者iloc函数
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
或者用位置代替index
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
选择子集
df.loc['B','Y']
-0.84807698340363147
注意啊,如果直接用 df[‘B’,’Y’] 会报错,用 df[‘Y’,’B’] 也会报错。所以必须用df.loc[‘B’,’Y’]
df.loc[['A','B'],['W','Y']]
| W | Y |
---|
A | 2.706850 | 0.907969 |
B | 0.651118 | -0.848077 |
df.loc[['B','X','A','Y']]
B 0.190915
X NaN
A -0.747158
Y NaN
Name: Y, dtype: float64
情况比较选择(Conditional Selection)
一个很重要的功能,很像numpy的一个功能:
df
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df>0
| W | X | Y | Z |
---|
A | True | True | True | True |
B | True | False | False | True |
C | False | True | True | False |
D | True | False | False | True |
E | True | True | True | True |
df[df>0]
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | NaN | NaN | 0.605965 |
C | NaN | 0.740122 | 0.528813 | NaN |
D | 0.188695 | NaN | NaN | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# 注意,在上一个情况中为false的值会用NaN占位(当然以后我们在机器学习中可以用其他数值替换NaN),但是在这个情况中,为false的值直接不显示
df[df['W']>0]
| W | X | Y | Z |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
# 注意啦!注意啦!此处是两个双中括号的并列关系,前面一个设定条件,后面一个设置需要选择的范围
| Y | X |
---|
A | 0.907969 | 0.628133 |
B | -0.848077 | -0.319318 |
D | -0.933237 | -0.758872 |
E | 2.605967 | 1.978757 |
df[(df['W']>0) & (df['Y'] > 1)]
| W | X | Y | Z |
---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
更多index细节 让我们来看看更多index的使用情况
df
| W | X | Y | Z | new |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
# 将index重置为 0,1...n
df.reset_index()
# 会有inplace参数 和fillin参数可以重置index
| index | W | X | Y | Z |
---|
0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df # 增加新的一列
| W | X | Y | Z | States |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States')
| W | X | Y | Z |
---|
States | | | | |
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df
| W | X | Y | Z | States |
---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States',inplace=True)
# 由此可见inplace参数将影响到原数据
df
| W | X | Y | Z |
---|
States | | | | |
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
复合index和index阶层
复合index 需要将index做成一个tuple
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
| | A | B |
---|
Group | Num | | |
| 1 | 0.153661 | 0.167638 |
G1 | 2 | -0.765930 | 0.962299 |
| 3 | 0.902826 | -0.537909 |
| 1 | -1.549671 | 0.435253 |
G2 | 2 | 1.259904 | -0.447898 |
| 3 | 0.266207 | 0.412580 |
利用xs函数返回dataframe的 cross-section子集 python df.xs('G1')
| A | B |
---|
Num | | |
1 | 0.153661 | 0.167638 |
2 | -0.765930 | 0.962299 |
3 | 0.902826 | -0.537909 |
df.xs(['G1',1])
A 0.153661
B 0.167638
Name: (G1, 1), dtype: float64
df.xs(1,level='Num')
| A | B |
---|
Group | | |
G1 | 0.153661 | 0.167638 |
G2 | -1.549671 | 0.435253 |
Missing Data(无效数据)
Pandas有几种方法来处理无效的数据
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
| A | B | C |
---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
df.dropna() # 将含有nan值的rows一同删除
| | A | B | C | | —- | —- | —- | —- | | 0 | 1.0 | 5.0 | 1 | python df.dropna(axis=1) # 将含有nan值的columns一同删除
df.dropna(thresh=2) # 这个很有用, drop掉所有含有有效数据(除NaN外)小于thresh的Series
# 还有个how参数,需要指定是drop掉全为NaN的行还是drop掉出现NaN的行,这我就不给出例子了
df.fillna(value='FILL VALUE') # 这个不用介绍了吧
| A | B | C |
---|
0 | 1 | 5 | 1 |
1 | 2 | FILL VALUE | 2 |
2 | FILL VALUE | FILL VALUE | 3 |
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
df.fillna(value=df['B'].mean())
| A | B | C |
---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | 5.0 | 2 |
2 | 5.0 | 5.0 | 3 |
所以记住处理NaN数据用dropna() 和 fillna() 两个方法就OK啦~~
Groupby
Groupby可以让你的数据分组并调用聚合函数。具体可以参考数据库中的groupby
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
| Company | Person | Sales |
---|
0 | GOOG | Sam | 200 |
1 | GOOG | Charlie | 120 |
2 | MSFT | Amy | 340 |
3 | MSFT | Vanessa | 124 |
4 | FB | Carl | 243 |
5 | FB | Sarah | 350 |
现在我们用groupBy()方法来进行分组,以下将增加一个 DataFrameGroupBy 的 object: python df.groupby('Company')
<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>
我们将它赋给一个新的变量 python by_comp = df.groupby("Company")
来call聚合方法 python by_comp.mean()
| Sales |
---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
等同于: python df.groupby('Company').mean()
| Sales |
---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
标准差: python by_comp.std()
| Sales |
---|
Company | |
FB | 75.660426 |
GOOG | 56.568542 |
MSFT | 152.735065 |
最小值: python by_comp.min()
| | Person | Sales | | ——- | ——- | —– | | Company | | | | FB | Carl | 243 | | GOOG | Charlie | 120 | | MSFT | Amy | 124 | 最大值:python by_comp.max()
| | Person | Sales | | ——- | ——- | —– | | Company | | | | FB | Sarah | 350 | | GOOG | Sam | 200 | | MSFT | Vanessa | 340 | 计数:python by_comp.count()
| Person | Sales |
---|
Company | | |
FB | 2 | 2 |
GOOG | 2 | 2 |
MSFT | 2 | 2 |
描述分析(等同于数据库中的功能): python by_comp.describe()
| | Sales |
---|
Company | | |
FB | count | 2.000000 |
FB | mean | 296.500000 |
FB | std | 75.660426 |
FB | min | 243.000000 |
FB | 25% | 269.750000 |
FB | 50% | 296.500000 |
FB | 75% | 323.250000 |
FB | max | 350.000000 |
GOOG | count | 2.000000 |
GOOG | mean | 160.000000 |
GOOG | std | 56.568542 |
GOOG | min | 120.000000 |
GOOG | 25% | 140.000000 |
GOOG | 50% | 160.000000 |
GOOG | 75% | 180.000000 |
GOOG | max | 200.000000 |
MSFT | count | 2.000000 |
MSFT | mean | 232.000000 |
MSFT | std | 152.735065 |
MSFT | min | 124.000000 |
MSFT | 25% | 178.000000 |
MSFT | 50% | 232.000000 |
MSFT | 75% | 286.000000 |
MSFT | max | 340.000000 |
注:由于markdown格式问题,我将公司名称全部写了出来,在编译器中公司名称其实只显示一个。
by_comp.describe().transpose() # 翻转
Company | FB | | | | | | | | | GOOG | | | | | | | MSFT | | | | |
---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | … | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
Sales | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 | 2.0 | 160.0 | … | 180.0 | 200.0 | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |
by_comp.describe().transpose()['GOOG']
也可以翻转一个公司的数据,我就不打印出来了
Merging, Joining, and Concatenating(归并,连接,级联)
这是三个主要的进行数据合并的方法,其中几种join的区别我就不详细介绍了,主要简单介绍一下这三种方法的差别
Concatenation(级联)
最基本的连接DF的方法,注意连接的DF维度一定要相同,可以通过axis参数来设置级联的位置,默认为0(在row的后面连接)
首先我们先建立三个DF
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
| A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2
| A | B | C | D |
---|
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
df3
| A | B | C | D |
---|
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat([df1,df2,df3])
| A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat([df1,df2,df3],axis=1) # 通过column相连
| A | B | C | D | A | B | C | D | A | B | C | D |
---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
Merging
合并有点像SQL语句里的join,需要keys(在两个DF中key可以是相同的Series也可以是不同的)进行合并 与join的区别是merge的key是通过相似的column(在实际问题中),并且两个合并的表没有主从之分,而join的index是指row的index,有一个为主一个为副。
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
| A | B | key |
---|
0 | A0 | B0 | K0 |
1 | A1 | B1 | K1 |
2 | A2 | B2 | K2 |
3 | A3 | B3 | K3 |
right
| C | D | key |
---|
0 | C0 | D0 | K0 |
1 | C1 | D1 | K1 |
2 | C2 | D2 | K2 |
3 | C3 | D3 | K3 |
pd.merge(left,right,how='inner',on='key')
| A | B | key | C | D |
---|
0 | A0 | B0 | K0 | C0 | D0 |
1 | A1 | B1 | K1 | C1 | D1 |
2 | A2 | B2 | K2 | C2 | D2 |
3 | A3 | B3 | K3 | C3 | D3 |
下面我们使用key不同的两个表
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
pd.merge(right,left, on=['key1', 'key2']) # 注意左右交换的结果
| C | D | key1 | key2 | A | B |
---|
0 | C0 | D0 | K0 | K0 | A0 | B0 |
1 | C1 | D1 | K1 | K0 | A2 | B2 |
2 | C2 | D2 | K1 | K0 | A2 | B2 |
pd.merge(left, right, how='outer', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
5 | NaN | NaN | K2 | K0 | C3 | D3 |
pd.merge(left, right, how='right', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
3 | NaN | NaN | K2 | K0 | C3 | D3 |
pd.merge(left, right, how='left', on=['key1', 'key2'])
| A | B | key1 | key2 | C | D |
---|
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
Joining
Joining是将两个不同index的DF合并成一个单一的DF,所以要求大部分column相同
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
| A | B | C | D |
---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
left.join(right, how='outer')
| A | B | C | D |
---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
K3 | NaN | NaN | C3 | D3 |
Operations(运算)
哈,终于说到运算操作啦。其中很多方法与R语言的很像。
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
| col1 | col2 | col3 |
---|
0 | 1 | 444 | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | 4 | 444 | xyz |
各种Unique值
df['col2'].unique() # 找出唯一的值并返回array
array([444, 555, 666])
df['col2'].nunique() # 返回唯一值的个数
3
df['col2'].value_counts() # 计数
444 2
555 1
666 1
Name: col2, dtype: int64
选择数据
# 从复合条件中选择数据
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
应用方程或函数
def times2(x):
return x*2
df['col1'].apply(times2) # 通过apply对DF使用自定义函数
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
df['col3'].apply(len)
0 3
1 3
2 3
3 3
Name: col3, dtype: int64
df['col1'].sum()
10 ### 永久性删除一个column
del df['col1']
df
| col2 | col3 |
---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
得到column或者index名称
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
排序
df
| col2 | col3 |
---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
df.sort_values(by='col2') #inplace=False 默认
| col2 | col3 |
---|
0 | 444 | abc |
3 | 444 | xyz |
1 | 555 | def |
2 | 666 | ghi |
判空
df.isnull()
| col2 | col3 |
---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
# Drop rows with NaN Values
df.dropna()
| col2 | col3 |
---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
替换NaN
import numpy as np
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
| col1 | col2 | col3 |
---|
0 | 1.0 | NaN | abc |
1 | 2.0 | 555.0 | def |
2 | 3.0 | 666.0 | ghi |
3 | NaN | 444.0 | xyz |
df.fillna('FILL')
| col1 | col2 | col3 |
---|
0 | 1 | FILL | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | FILL | 444 | xyz |
建立一个基准表,挺有意思的
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
| A | B | C | D |
---|
0 | foo | one | x | 1 |
1 | foo | one | y | 3 |
2 | foo | two | x | 2 |
3 | bar | two | y | 5 |
4 | bar | one | x | 4 |
5 | bar | one | y | 1 |
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
| C | x | y |
---|
A | B | | |
bar | one | 4.0 | 1.0 |
bar | two | NaN | 5.0 |
foo | one | 1.0 | 3.0 |
foo | two | 2.0 | NaN |
数据输入与输出
pandas需要用 pd.read_ methods可以读取各种类型的数据哦!!
import numpy as np
import pandas as pd
## CSV
CSV 输入
df = pd.read_csv('example') # 文件最好在相同目录下,不在的话需要指定文件路径
df
| a | b | c | d |
---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
CSV 输出
df.to_csv('example',index=False)
## Excel Pandas 仅仅针对Data进行输入,不能输入图像或者公式,否则会报错
Excel 输入
pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1') # 注意需要指明哪个sheet
| a | b | c | d |
---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
Excel 输出
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
## HTML
需要先安装Python的html5库支持
pip install lxml
pip install html5lib
pip install BeautifulSoup4 # 这个我很喜欢,写一下简单的爬虫时候比较好用
HTML 输入
Pandas read_html 将会自动读取网页中的表格并且返回一个包含DataFrame的List对象
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
数据太多了我就不列出来了,大家可以自己试试
HTML 输出
df.to_html(‘banklist.html’) # 注意一定要扩展名
SQL语句
一般不需要用pandas直接对数据库的数据进行读取。由于安全考虑和独立性的原因。 但是pandas有 pandas.io.sql 模块可以对数据库进行数据读取,但是要有合适的API
以下是一些常用的功能:
- read_sql_table(table_name, con[, schema, …]) 读取数据库中的表到DF中
- read_sql_query(sql, con[, index_col, …]) 读取Query进DF
- read_sql(sql, con[, index_col, …]) 读取表或者query进DF
- DataFrame.to_sql(name, con[, flavor, …]) 将DF中的记录导入数据库
以下给出一个简单的例子:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:') # Python默认使用sqlite数据库
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
sql_df
| index | a | b | c | d |
---|
0 | 0 | 0 | 1 | 2 | 3 |
1 | 1 | 4 | 5 | 6 | 7 |
2 | 2 | 8 | 9 | 10 | 11 |
3 | 3 | 12 | 13 | 14 | 15 |