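Creating a DataFrame
Use pd.DataFrame(). The data parameter holds the values and accepts a 2D array or nested list, a Series or a list of Series, or a dict. The index parameter is the list of row labels and the columns parameter is the list of column labels.
Creating from a 2D array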
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(2,3)),index = list('AB'),columns=list('ABC'))
print(data1)
# A B C
# A 5 8 9
# B 5 0 0
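As a small check of the parameters described above, the sketch below builds the same kind of frame from a nested list instead of a random array and inspects its shape and labels. The values are made up for illustration.

```python
import pandas as pd

# Hypothetical values chosen for illustration; any 2 x 3 nested list works.
rows = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data=rows, index=list('AB'), columns=list('ABC'))

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
```

The row and column labels may reuse the same letters, as in the original example, because they live on separate axes.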
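Creating from a Series or a list of Series
A single Series converted to a DataFrame becomes one column: its name attribute becomes the column name by default and its index becomes the row index. With a list of Series, each Series becomes one row of the DataFrame (name becomes the row label and the Series index becomes the column labels). If the Series do not share the same index, pandas aligns them on the union of the indexes and fills the cells that received no value with NaN.
When the indexes align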
import numpy as np
import pandas as pd
np.random.seed(1)
s1 = pd.Series(['jack','mike'],name = 'Name')
s2 = pd.Series([22,33],name = 'Age')
s3 = pd.Series(['clerk','policeman'],name = 'Position')
data1 = pd.DataFrame(data=s1)
data2 = pd.DataFrame(data=[s1,s2,s3])
print(data1)
print(data2)
# Name
# 0 jack
# 1 mike
# 0 1
# Name jack mike
# Age 22 33
# Position clerk policeman
When the indexes do not align
import numpy as np
import pandas as pd
np.random.seed(1)
s1 = pd.Series(['jack','mike'],name = 'Name',index=[1,2])
s2 = pd.Series([22,33],name = 'Age',index=[2,3])
s3 = pd.Series(['clerk','policeman'],name = 'Position',index=[0,1])
data2 = pd.DataFrame(data=[s1,s2,s3])
print(data2)
# 1 2 3 0
# Name jack mike NaN NaN
# Age NaN 22.0 33.0 NaN
# Position policeman NaN NaN clerk
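Here the indexes are aligned on their union: s1 only has the labels 1 and 2, but the other Series introduce the labels 0 and 3, so every cell without a source value is set to NaN.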
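The alignment can be inspected directly. The following sketch reuses two of the Series above, marks the filled cells with isna(), and, as one possible alternative when each Series should become a column rather than a row, uses pd.concat with axis=1.

```python
import pandas as pd

s1 = pd.Series(['jack', 'mike'], name='Name', index=[1, 2])
s2 = pd.Series([22, 33], name='Age', index=[2, 3])

# Each Series becomes one row; the columns are the union of both indexes.
as_rows = pd.DataFrame([s1, s2])
print(sorted(as_rows.columns))
# isna() marks the cells that had no source value.
print(as_rows.isna())

# One option for "one column per Series": concatenate along axis=1.
as_columns = pd.concat([s1, s2], axis=1)
print(as_columns)
```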
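Creating from a dict or a list of dicts
When a dict is used, its keys become the column names of the DataFrame. Besides calling the DataFrame constructor, the data can also be loaded with the class method pd.DataFrame.from_dict().
Creating from a single dict
With a single dict, the keys become the column names and the values should be containers (list-like objects). The values normally need to have equal length; index alignment with NaN filling only applies when the values are Series objects with different indexes. When the values are Series, the row index of the DataFrame defaults to the index of those Series.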
import numpy as np
import pandas as pd
data1 = pd.DataFrame(data={
'name' : ['Jack','Mike'],
'Age' : [18,20]
},index=np.arange(2))
print(data1)
# name Age
# 0 Jack 18
# 1 Mike 20
import numpy as np
import pandas as pd
dict1 = {
'name' : pd.Series(['aaa','bbb'],index=list('ab')),
'Age' : pd.Series([18,20]),
}
data2 = pd.DataFrame(data=dict1)
print(data2)
# name Age
# a aaa NaN
# b bbb NaN
# 0 NaN 18.0
# 1 NaN 20.0
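In the example above the two Series have different indexes, so the result has four rows instead of the expected two. The extra rows come from the indexes not aligning, and the cells at the non-matching positions are set to NaN.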
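For contrast, a minimal sketch of the aligned case: when both Series are given the same labels (the labels list here is only for illustration), the result keeps two rows and no NaN appears.

```python
import pandas as pd

labels = list('ab')
df = pd.DataFrame({
    'name': pd.Series(['aaa', 'bbb'], index=labels),
    'Age': pd.Series([18, 20], index=labels),  # same labels as 'name'
})
print(df)
print(df.isna().any().any())  # whether any cell is missing
```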
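from_dict can also be used to load the data. Note that this example passes a list of dicts, the form covered in the next subsection.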
import numpy as np
import pandas as pd
dict1 = {
'name' : 'jack',
'Age' : 20
}
dict2 = {
'name' : 'mike',
'Age' : 30
}
dict3 = {
'position' : 'police'
}
data2 = pd.DataFrame.from_dict(data=[dict1,dict2,dict3])
print(data2)
# name Age position
# 0 jack 20.0 NaN
# 1 mike 30.0 NaN
# 2 NaN NaN police
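Creating from a list of dicts
With a list of dicts, the keys are still the column names, but each value is usually a single element and each dict becomes one row of the DataFrame. Keys that a dict does not contain are filled with NaN in its row.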
import numpy as np
import pandas as pd
dict1 = {
'name' : 'jack',
'Age' : 20
}
dict2 = {
'name' : 'mike',
'Age' : 30
}
dict3 = {
'position' : 'police'
}
data2 = pd.DataFrame(data=[dict1,dict2,dict3])
print(data2)
# name Age position
# 0 jack 20.0 NaN
# 1 mike 30.0 NaN
# 2 NaN NaN police
from_dict() can be used here as well
import numpy as np
import pandas as pd
dict1 = {
'name' : 'jack',
'Age' : 20
}
dict2 = {
'name' : 'mike',
'Age' : 30
}
dict3 = {
'position' : 'police'
}
data2 = pd.DataFrame.from_dict(data=[dict1,dict2,dict3])
print(data2)
# name Age position
# 0 jack 20.0 NaN
# 1 mike 30.0 NaN
# 2 NaN NaN police
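from_dict is mainly intended for a single dict. The sketch below, with made-up records, shows its orient parameter: the default orient='columns' treats the keys as column names, while orient='index' treats them as row labels.

```python
import pandas as pd

# Default orient='columns': keys are column names, values are the column data.
by_cols = pd.DataFrame.from_dict({'name': ['jack', 'mike'], 'Age': [20, 30]})

# orient='index': keys are row labels, each value describes one row.
# The labels 'r1' and 'r2' are hypothetical.
records = {
    'r1': {'name': 'jack', 'Age': 20},
    'r2': {'name': 'mike', 'Age': 30},
}
by_rows = pd.DataFrame.from_dict(records, orient='index')

print(by_cols)
print(by_rows)
```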
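DataFrame attributes
Commonly used DataFrame attributes include dtypes (the data type of each column), values (the data as a 2D ndarray), index (the row labels) and columns (the column labels).
The most common dtypes are int, float and object.
Each column of a DataFrame is a Series, and a Series describes all of its elements with a single dtype, which is why printing a Series shows a dtype: line at the bottom.
object can stand for strings, and it is also the catch-all dtype for a column that mixes several types, such as strings and numbers.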
import numpy as np
import pandas as pd
dict1 = {
'name' : 'jack',
'Age' : 20
}
dict2 = {
'name' : 'mike',
'Age' : 30
}
data2 = pd.DataFrame.from_dict(data=[dict1,dict2])
print(data2)
print(data2.dtypes)
print(data2.values)
print(data2.index,list(data2.index))
print(data2.columns,list(data2.columns))
# name Age
# 0 jack 20
# 1 mike 30
# name object
# Age int64
# dtype: object
# [['jack' 20]
# ['mike' 30]]
# RangeIndex(start=0, stop=2, step=1) [0, 1]
# Index(['name', 'Age'], dtype='object') ['name', 'Age']
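To make the dtype remarks concrete, here is a small sketch with illustrative data: an integer column, a Series that mixes a string with a number, and a conversion with astype.

```python
import pandas as pd

df = pd.DataFrame({'name': ['jack', 'mike'], 'Age': [20, 30]})
mixed = pd.Series(['jack', 20])   # a string and an int share one Series

print(df['Age'].dtype)            # an integer dtype
print(mixed.dtype)                # object, because the element types differ

# astype returns a converted copy; the original column is unchanged.
as_float = df['Age'].astype('float64')
print(as_float.dtype)
```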
DataFrame arithmetic
DataFrame also supports broadcasting. The common operators and their equivalent methods are listed below.
Python Operator | Pandas Method(s) |
---|---|
+ | add() |
- | sub() , subtract() |
* | mul() , multiply() |
/ | truediv() , div() , divide() |
// | floordiv() |
% | mod() |
** | pow() |
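A quick way to convince yourself that the operators and methods in the table are interchangeable is to compare their results. The frame below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# equals() compares shape, labels, dtypes and values.
same_add = (df + 10).equals(df.add(10))
same_div = (df / 2).equals(df.truediv(2))
same_pow = (df ** 2).equals(df.pow(2))
print(same_add, same_div, same_pow)
```

The method form is mainly useful when extra parameters such as axis or fill_value are needed.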
Broadcasting with a scalar
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)))
print(data1)
print(data1+100)
# 0 1 2 3
# 0 5 8 9 5
# 1 0 0 1 7
# 2 6 9 2 4
# 0 1 2 3
# 0 105 108 109 105
# 1 100 100 101 107
# 2 106 109 102 104
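Broadcasting with a 1D list (row-wise)
The 1D list must have as many elements as the DataFrame has columns. Broadcasting conceptually repeats the list for every row until it matches the shape of the DataFrame, then computes element by element.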
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)))
print(data1)
print(data1+[100,200,300,400])
# 0 1 2 3
# 0 5 8 9 5
# 1 0 0 1 7
# 2 6 9 2 4
# 0 1 2 3
# 0 105 208 309 405
# 1 100 200 301 407
# 2 106 209 302 404
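Broadcasting with a 2D array (column-wise)
Column-wise broadcasting requires a 2D ndarray of shape (n, 1), where n is the number of rows of the DataFrame. Broadcasting conceptually repeats that column until it matches the shape of the DataFrame, then computes element by element.
Note that with the operator form only a 2D ndarray gives this behavior; otherwise the DataFrame treats the operand as a row-wise broadcast.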
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)))
print(data1)
print(data1+np.array([[100],[200],[300]]))
# 0 1 2 3
# 0 5 8 9 5
# 1 0 0 1 7
# 2 6 9 2 4
# 0 1 2 3
# 0 105 108 109 105
# 1 200 200 201 207
# 2 306 309 302 304
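If the per-row values start out as a 1D array, one way to get the (n, 1) shape is reshape(-1, 1). The sketch below uses a frame of zeros so the broadcast values are easy to read; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 4), dtype=int))
offsets = np.array([100, 200, 300])

# reshape(-1, 1) turns shape (3,) into (3, 1): one value per row.
by_column = df + offsets.reshape(-1, 1)
# A length-4 1D sequence is matched against the columns instead.
by_row = df + [1, 2, 3, 4]

print(by_column)
print(by_row)
```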
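Broadcasting with the arithmetic methods
Taking add as the example, the axis parameter selects the direction of the broadcast. axis=0 (or 'index') matches the operand against the row index and gives the column-wise broadcast described above (one value per row); axis=1 (or 'columns') matches it against the column labels and gives the row-wise broadcast (one value per column). The default is axis=1 ('columns').
Note that a 1D array or list should be passed here, rather than the 2D (n, 1) array used earlier.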
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)))
print(data1)
print(data1.add([100,200,300],axis=0))
# 0 1 2 3
# 0 5 8 9 5
# 1 0 0 1 7
# 2 6 9 2 4
# 0 1 2 3
# 0 105 108 109 105
# 1 200 200 201 207
# 2 306 309 302 304
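The effect of axis is easiest to see on a square frame, where the same list fits either direction. A minimal sketch with a 2 x 2 frame of zeros:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 2), dtype=int))
values = [10, 20]

along_columns = df.add(values)          # default: one value per column
along_index = df.add(values, axis=0)    # axis=0 ('index'): one value per row

print(along_columns)
print(along_index)
```

Because the shapes alone cannot disambiguate this case, stating axis explicitly can make the intent clearer.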
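Arithmetic between a DataFrame and a Series
Arithmetic between a DataFrame and a Series follows index alignment: by default the Series index is matched against the columns of the DataFrame, and with axis=0 it is matched against the row index.
When the indexes match completely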
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)),index = list('ABC'),columns=list('defg'))
data2 = pd.Series(data=np.random.randint(0,10,4),index=list('defg'))
data3 = pd.Series(data=np.random.randint(0,10,3),index=list('ABC'))
print(data1,end='\n\n')
print(data2,end='\n\n')
print(data3,end='\n\n')
print(data1.add(data2),end='\n\n')
print(data1.add(data3,axis=0))
# d e f g
# A 5 8 9 5
# B 0 0 1 7
# C 6 9 2 4
#
# d 5
# e 2
# f 4
# g 2
# dtype: int64
#
# A 4
# B 7
# C 7
# dtype: int64
#
# d e f g
# A 10 10 13 7
# B 5 2 5 9
# C 11 11 6 6
#
# d e f g
# A 9 12 13 9
# B 7 7 8 14
# C 13 16 9 11
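When the indexes do not match completely
If the indexes only partly match, alignment produces the union of the labels and the positions without a match are filled with NaN.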
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)),index = list('ABC'),columns=list('defg'))
data2 = pd.Series(data=np.random.randint(0,10,4),index=list('dety'))
data3 = pd.Series(data=np.random.randint(0,10,3),index=list('BCD'))
print(data1,end='\n\n')
print(data2,end='\n\n')
print(data3,end='\n\n')
print(data1.add(data2),end='\n\n')
print(data1.add(data3,axis=0))
# d e f g
# A 5 8 9 5
# B 0 0 1 7
# C 6 9 2 4
#
# d 5
# e 2
# t 4
# y 2
# dtype: int64
#
# B 4
# C 7
# D 7
# dtype: int64
#
# d e f g t y
# A 10.0 10.0 NaN NaN NaN NaN
# B 5.0 2.0 NaN NaN NaN NaN
# C 11.0 11.0 NaN NaN NaN NaN
#
# d e f g
# A NaN NaN NaN NaN
# B 4.0 4.0 5.0 11.0
# C 13.0 16.0 9.0 11.0
# D NaN NaN NaN NaN
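When the NaN columns are not wanted, one possible workaround (an assumption about the desired result, not the only approach) is to reindex the Series to the DataFrame's columns before the operation, treating labels the Series lacks as 0. The data below is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'d': [1, 2], 'e': [3, 4]}, index=list('AB'))
s = pd.Series([10, 20], index=list('dt'))   # 't' is not a column of df

# Plain alignment: columns d, e, t; only 'd' has values on both sides.
raw = df.add(s)

# Restrict the Series to df's columns, using 0 for labels it lacks.
aligned = df.add(s.reindex(df.columns, fill_value=0))

print(raw)
print(aligned)
```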
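Filling missing values during an operation
When using an arithmetic method such as add, the fill_value parameter gives the value that stands in for a missing value; its default is None, which means no filling. Note that fill_value only takes effect where, after alignment, one of the two DataFrames is missing a value at a position and the other one has a value there. If both are missing, the result stays NaN.
fill_value is limited to DataFrame-with-DataFrame or Series-with-Series operations; it cannot be used when operating a DataFrame with a Series.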
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)),index = list('ABC'),columns=list('defg'))
data2 = pd.DataFrame(pd.Series(data=np.random.randint(0,10,2),index=list('de'))).transpose()
data2.index=['A']
data3 = pd.DataFrame(pd.Series(data=np.random.randint(0,10,2),index=list('BC')),columns=['d'])
print(data1,end='\n\n')
print(data2,end='\n\n')
print(data3,end='\n\n')
print(data1.add(data2,fill_value=0),end='\n\n')
print(data1.add(data3,fill_value=0))
# d e f g
# A 5 8 9 5
# B 0 0 1 7
# C 6 9 2 4
#
# d e
# A 5 2
#
# d
# B 4
# C 2
#
# d e f g
# A 10.0 10.0 9.0 5.0
# B 0.0 0.0 1.0 7.0
# C 6.0 9.0 2.0 4.0
#
# d e f g
# A 5.0 8.0 9.0 5.0
# B 4.0 0.0 1.0 7.0
# C 8.0 9.0 2.0 4.0
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)),index = list('ABC'),columns=list('defg'))
data2 = pd.DataFrame(pd.Series(data=np.random.randint(0,10,4),index=list('dety'))).transpose()
data3 = pd.DataFrame(pd.Series(data=np.random.randint(0,10,3),index=list('BCD')))
print(data1,end='\n\n')
print(data2,end='\n\n')
print(data3,end='\n\n')
print(data1.add(data2,fill_value=0),end='\n\n')
print(data1.add(data3,fill_value=0))
# d e f g
# A 5 8 9 5
# B 0 0 1 7
# C 6 9 2 4
#
# d e t y
# 0 5 2 4 2
#
# 0
# B 4
# C 7
# D 7
#
# d e f g t y
# A 5.0 8.0 9.0 5.0 NaN NaN
# B 0.0 0.0 1.0 7.0 NaN NaN
# C 6.0 9.0 2.0 4.0 NaN NaN
# 0 5.0 2.0 NaN NaN 4.0 2.0
#
# d e f g 0
# A 5.0 8.0 9.0 5.0 NaN
# B 0.0 0.0 1.0 7.0 4.0
# C 6.0 9.0 2.0 4.0 7.0
# D NaN NaN NaN NaN 7.0
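The rule that a cell missing on both sides stays NaN can be isolated in a two-row sketch. The frames left and right are made up for illustration.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]}, index=list('ab'))
right = pd.DataFrame({'x': [np.nan, np.nan]}, index=list('ab'))

result = left.add(right, fill_value=0)
print(result)
# Row 'a': only the right side is missing, so it is replaced by 0.
# Row 'b': both sides are missing, so the result stays NaN.
```

This is the same reason the t and y columns for rows A to C, and the column 0 for row A, remain NaN in the output above.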
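Reading and writing elements of DataFrame and Series
Access by explicit index
Explicit-index access means accessing elements by the given row and column labels.
Explicit indexes are accessed through the loc indexer of a Series or DataFrame.
Series read and write
Reading and writing a Series through its explicit labels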
import numpy as np
import pandas as pd
data1 = pd.Series(data = [1,2,3],index=list('ABC'))
print(data1)
print(data1['A'])
print(data1.A)
data1.A = 100
print(data1)
data1['A'] = 200
print(data1)
# A 1
# B 2
# C 3
# dtype: int64
# 1
# 1
# A 100
# B 2
# C 3
# dtype: int64
# A 200
# B 2
# C 3
# dtype: int64
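Reading and writing slices
Slices on explicit labels are closed on both ends: the start label and the stop label are both included.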
import numpy as np
import pandas as pd
data1 = pd.Series(data = [1,2,3],index=list('ABC'))
print(data1)
print(data1.loc['A':'B'])
data1.loc['A':'B'] = [100,200]
print(data1)
# A 1
# B 2
# C 3
# dtype: int64
# A 1
# B 2
# dtype: int64
# A 100
# B 200
# C 3
# dtype: int64
Values can also be read as an attribute, in the form series.label
import numpy as np
import pandas as pd
data1 = pd.Series(data = [1,2,3],index=list('ABC'))
print(data1)
print(data1.A)
# A 1
# B 2
# C 3
# dtype: int64
# 1
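Attribute access has a limitation worth knowing: when a label has the same name as an existing Series attribute or method, the attribute wins. The sketch below uses the label 'count' (chosen for illustration because count() is a Series method) to show that bracket access still reaches the value.

```python
import pandas as pd

s = pd.Series([1, 2], index=['A', 'count'])   # 'count' is also a Series method

print(s.A)                  # attribute access works for a plain label
print(callable(s.count))    # s.count is the method, not the stored value
print(s['count'])           # bracket access reaches the value
```

For that reason bracket or loc access is the safer default, and attribute access is best kept for interactive use.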
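DataFrame read and write
Because a DataFrame is a two-dimensional table, specify the row label first and then the column label.
Explicit access goes through the loc indexer of the DataFrame (loc is short for location).
A column can also be read as an attribute, in the form df.column_name.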
import numpy as np
import pandas as pd
np.random.seed(1)
data1 = pd.DataFrame(data=np.random.randint(1,10,(3,4)),index = list('abc'),columns=list('ABCD'))
print(data1)
print(data1.A)
print(data1.loc['a']['A'])
print(data1.loc['a','A'])
print(data1.loc[['a','c'],['A','C']])
print(data1.loc[['a','b','c'],['A','B']])
data1.loc['a','A']=100
print(data1)
data1.loc[['a','b','c'],['A','B']]=np.array([[100,200],[300,400],[500,600]])
print(data1)
# A B C D
# a 6 9 6 1
# b 1 2 8 7
# c 3 5 6 3
# a 6
# b 1
# c 3
# Name: A, dtype: int64
# 6
# 6
# A C
# a 6 6
# c 3 6
# A B
# a 6 9
# b 1 2
# c 3 5
# A B C D
# a 100 9 6 1
# b 1 2 8 7
# c 3 5 6 3
# A B C D
# a 100 200 6 1
# b 300 400 8 7
# c 500 600 6 3
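loc also accepts label slices and boolean masks in the row position. The sketch below, on a small illustrative frame, writes through a single loc call in both cases. The chained form df.loc['a']['A'] is fine for reading, but for assignment a single loc[row, column] call is the safer choice, since a chained assignment may end up writing to a temporary object.

```python
import pandas as pd

df = pd.DataFrame({'A': [6, 1, 3], 'B': [9, 2, 5]}, index=list('abc'))

# Label slices in loc include both endpoints, so this selects rows a and b.
df.loc['a':'b', 'B'] = 0
# A boolean mask can stand in for the row labels.
df.loc[df['A'] > 2, 'A'] = -1

print(df)
```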
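Access by implicit index
Implicit (positional) indexes are accessed through the iloc indexer of a Series or DataFrame; iloc is short for integer location.
In a Series the implicit index runs from 0 to len - 1.
In a DataFrame the row positions run from 0 to (number of rows) - 1 and the column positions from 0 to (number of columns) - 1.
Series read and write
Slices on the implicit index of a Series are half-open: the start position is included and the stop position is excluded.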
import numpy as np
import pandas as pd
data1 = pd.Series(data = [1,2,3],index=list('ABC'))
print(data1)
print(data1.iloc[0])
print(data1.iloc[0:2])
# A 1
# B 2
# C 3
# dtype: int64
# 1
# A 1
# B 2
# dtype: int64
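The difference between the two slicing conventions shows up when the same Series is sliced both ways. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=list('ABC'))

by_label = s.loc['A':'B']     # label slice: 'B' is included
by_position = s.iloc[0:1]     # position slice: position 1 is excluded

print(by_label.tolist())
print(by_position.tolist())
```

To get the same two elements by position, the stop has to be one past the last wanted position, that is s.iloc[0:2].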
DataFrame read and write
import numpy as np
import pandas as pd
np.random.seed(0)
data1 = pd.DataFrame(data=np.random.randint(0,10,(3,4)),index = list('abc'),columns=list('ABCD'))
print(data1)
print(data1.iloc[0, 0])
print(data1.iloc[:,0:2])
data1.iloc[:,0:2]=np.random.randint(100,200,(3,2))
print(data1)
# A B C D
# a 5 0 3 3
# b 7 9 3 5
# c 2 4 7 6
# 5
# A B
# a 5 0
# b 7 9
# c 2 4
# A B C D
# a 188 188 3 3
# b 112 158 3 5
# c 165 139 7 6
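Two further details of iloc, sketched on a deterministic frame built with np.arange so the values are easy to follow: negative positions count from the end, and a scalar assigned to a slice is broadcast to every selected cell.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list('abc'), columns=list('ABCD'))

last_row = df.iloc[-1].tolist()   # negative positions count from the end
df.iloc[0:2, 1:3] = 0             # rows 0-1, columns 1-2 (stop excluded)

print(last_row)
print(df)
```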