Review of Key Knowledge Points for the Midterm Test of the "Data Collection and Analysis" Course
① Key NumPy knowledge points to master
I. Creating NumPy arrays
1. Creating from a one-/two-dimensional list
import numpy as np
a1_1 = np.array([1, 2, 3, 4])
a1_1
array([1, 2, 3, 4])
a1_2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a1_2
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
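A detail worth remembering alongside the examples above: the creation functions also accept a dtype argument that fixes the element type at creation time. A minimal sketch:

```python
import numpy as np

# dtype fixes the element type when the array is created
a = np.array([1, 2, 3], dtype=float)
print(a)        # [1. 2. 3.]
print(a.dtype)  # float64
```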
2. Creating one-dimensional arrays with functions that generate regular sequences
a2 = np.arange(1, 10, 2)
print("Elements of array a2:", a2)
a3 = np.linspace(0, 100, 6)
print("Elements of array a3:", a3)
Elements of array a2: [1 3 5 7 9]
Elements of array a3: [  0.  20.  40.  60.  80. 100.]
3. Creating two-dimensional arrays with special functions
b2 = np.zeros((3, 4))
print("All-zero array b2:\n", b2)
b3 = np.ones((3, 4))
print("All-one array b3:\n", b3)
b4 = np.full((3, 4), 5)
print("All-five array b4:\n", b4)
b5 = np.identity(3)
print("Identity array b5 (ones on the diagonal):\n", b5)
b6 = np.diag([5, 6, 7], k=0)
print("Diagonal array b6:\n", b6)
All-zero array b2:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
All-one array b3:
 [[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
All-five array b4:
 [[5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]]
Identity array b5 (ones on the diagonal):
 [[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Diagonal array b6:
 [[5 0 0]
 [0 6 0]
 [0 0 7]]
4. Creating arrays from random numbers
np.random.seed(666)
c1 = np.random.random(3)
print("Random 1-D array of floats in [0, 1):", c1)
c2 = np.random.random((2, 3))
print("Random 2-D array of floats in [0, 1):\n", c2)
c3 = np.random.randint(1, 100, 6)
print("Random 1-D array of integers in [1, 100):", c3)
c4 = np.random.randint(1, 100, (2, 3))
print("Random 2-D array of integers in [1, 100):\n", c4)
Random 1-D array of floats in [0, 1): [0.70043712 0.84418664 0.67651434]
Random 2-D array of floats in [0, 1):
 [[0.72785806 0.95145796 0.0127032 ]
 [0.4135877  0.04881279 0.09992856]]
Random 1-D array of integers in [1, 100): [64 17 47 40 70 83]
Random 2-D array of integers in [1, 100):
 [[77 80 14]
 [70 21 12]]
c7 = np.random.uniform(1, 11, (2, 3))
print("2-D array of random floats uniformly distributed over [1, 11):\n", c7)
c8 = np.random.normal(5, 2, (2, 3))
print("2-D array of random floats from a normal distribution with mean 5 and std 2:\n", c8)
2-D array of random floats uniformly distributed over [1, 11):
 [[1.05108839 2.12857654 2.10953672]
 [3.47668229 1.23236299 8.27321154]]
2-D array of random floats from a normal distribution with mean 5 and std 2:
 [[2.82241403 3.84845851 1.63419846]
 [5.4583705  1.48674955 6.68926524]]
II. Inspecting array attributes and online help
1. Checking an array's number of dimensions, shape, and other attributes
lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
b1 = np.array(lst)
print("Elements of array b1:\n", b1)
print("Number of dimensions of b1:", b1.ndim)
print("Shape of b1:", b1.shape)
print("Number of elements in b1:", b1.size)
Elements of array b1:
 [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
Number of dimensions of b1: 2
Shape of b1: (4, 3)
Number of elements in b1: 12
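Besides ndim, shape, and size, arrays also expose memory-related attributes; a short sketch (the exact integer dtype is platform-dependent):

```python
import numpy as np

b1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
print("dtype of b1:", b1.dtype)        # element type, typically int64
print("itemsize of b1:", b1.itemsize)  # bytes per element
print("nbytes of b1:", b1.nbytes)      # total bytes = size * itemsize
```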
2. Two common ways to view online help (help and ?)
help(np.random.randint)
Help on built-in function randint:

randint(...) method of numpy.random.mtrand.RandomState instance
    randint(low, high=None, size=None, dtype=int)

    Return random integers from `low` (inclusive) to `high` (exclusive).

    Return random integers from the "discrete uniform" distribution of
    the specified dtype in the "half-open" interval [`low`, `high`). If
    `high` is None (the default), then results are from [0, `low`).

    .. note::
        New code should use the ``integers`` method of a ``default_rng()``
        instance instead; please see the :ref:`random-quick-start`.

    Parameters
    ----------
    low : int or array-like of ints
        Lowest (signed) integers to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is one above the
        *highest* such integer).
    high : int or array-like of ints, optional
        If provided, one above the largest (signed) integer to be drawn
        from the distribution (see above for behavior if ``high=None``).
        If array-like, must contain integer values
    size : int or tuple of ints, optional
        Output shape. If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn. Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result. Byteorder must be native.
        The default value is int.

        .. versionadded:: 1.11.0

    Returns
    -------
    out : int or ndarray of ints
        `size`-shaped array of random integers from the appropriate
        distribution, or a single such random int if `size` not provided.

    See Also
    --------
    random_integers : similar to `randint`, only for the closed
        interval [`low`, `high`], and 1 is the lowest value if `high` is
        omitted.
    random.Generator.integers: which should be used for new code.

    Examples
    --------
    >>> np.random.randint(2, size=10)
    array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])
    >>> np.random.randint(1, size=10)
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

    Generate a 2 x 4 array of ints between 0 and 4, inclusive:

    >>> np.random.randint(5, size=(2, 4))
    array([[4, 0, 2, 1],
           [3, 2, 2, 0]])

    Generate a 1 x 3 array with 3 different upper bounds

    >>> np.random.randint(1, [3, 5, 10])
    array([2, 2, 9])

    Generate a 1 by 3 array with 3 different lower bounds

    >>> np.random.randint([1, 5, 7], 10)
    array([9, 8, 7])

    Generate a 2 by 4 array using broadcasting with dtype of uint8

    >>> np.random.randint([1, 3, 5, 7], [[10], [20]], dtype=np.uint8)
    array([[ 8,  6,  9,  7],
           [ 1, 16,  9, 12]], dtype=uint8)
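As the note in the docstring above says, new code is encouraged to use a Generator obtained from default_rng instead of the legacy np.random functions; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(666)           # seeded Generator instance
ints = rng.integers(1, 100, size=(2, 3))   # same half-open [1, 100) convention as randint
print(ints)
print(rng.random(3))                       # floats in [0, 1)
```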
III. Transposing, reshaping, and adding dimensions
1. Transposing an array with the .T attribute
print("b1 before transposing:", b1)
b1 = b1.T
print("b1 after transposing:", b1)
b1 before transposing: [[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
b1 after transposing: [[ 1  4  7 10]
 [ 2  5  8 11]
 [ 3  6  9 12]]
2. Changing an array's shape
b1_2 = b1.reshape(2, 6)
b1_2
array([[ 1,  4,  7, 10,  2,  5],
       [ 8, 11,  3,  6,  9, 12]])
b1_3 = b1_2.ravel()
print("b1_2 is unchanged after ravel:", b1_2)
print("The flattened result b1_3:", b1_3)
b1_2 is unchanged after ravel: [[ 1  4  7 10  2  5]
 [ 8 11  3  6  9 12]]
The flattened result b1_3: [ 1  4  7 10  2  5  8 11  3  6  9 12]
print("flatten likewise leaves b1_2 unchanged and returns a flattened copy:", b1_2.flatten())
flatten likewise leaves b1_2 unchanged and returns a flattened copy: [ 1  4  7 10  2  5  8 11  3  6  9 12]
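The real difference between ravel and flatten is view versus copy: ravel usually returns a view that shares memory with the original array, while flatten always returns an independent copy. A small demonstration:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
v = m.ravel()    # usually a view sharing memory with m
f = m.flatten()  # always an independent copy
v[0] = 99
print(m[0, 0])   # 99: the change through the view is visible in m
f[1] = -1
print(m[0, 1])   # 1: modifying the copy leaves m untouched
```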
3. Adding a dimension to an array
a = np.array([1, 2, 3, 4, 5, 6])
print("Shape of 1-D array a:", a.shape)
b = a[np.newaxis, :]
print("Shape of 2-D array b:", b.shape)
c = a[:, np.newaxis]
print("Shape of 2-D array c:", c.shape)
Shape of 1-D array a: (6,)
Shape of 2-D array b: (1, 6)
Shape of 2-D array c: (6, 1)
IV. Array indexing and slicing
n = np.array([[1, 2, 3], [11, 22, 33], [111, 222, 333], [1111, 2222, 3333]])
print("n =", n)
print("n[1:,2:4] =", n[1:, 2:4])
print("n[::,-3::2] =", n[::, -3::2])
print("n[1,:] selects the second row:", n[1, :])
print("n[:,-1] selects the last column:", n[:, -1])
n = [[   1    2    3]
 [  11   22   33]
 [ 111  222  333]
 [1111 2222 3333]]
n[1:,2:4] = [[  33]
 [ 333]
 [3333]]
n[::,-3::2] = [[   1    3]
 [  11   33]
 [ 111  333]
 [1111 3333]]
n[1,:] selects the second row: [11 22 33]
n[:,-1] selects the last column: [   3   33  333 3333]
p = n[:, :]
print("With slicing, p and n point to the same underlying array:", p)
p[1:2, :] = 555
print("p after slice assignment =", p)
print("n after slice assignment =", n)
With slicing, p and n point to the same underlying array: [[   1    2    3]
 [  11   22   33]
 [ 111  222  333]
 [1111 2222 3333]]
p after slice assignment = [[   1    2    3]
 [ 555  555  555]
 [ 111  222  333]
 [1111 2222 3333]]
n after slice assignment = [[   1    2    3]
 [ 555  555  555]
 [ 111  222  333]
 [1111 2222 3333]]
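When this aliasing between a slice and the original array is unwanted, calling .copy() on the slice yields an independent array; a minimal sketch:

```python
import numpy as np

n = np.array([[1, 2, 3], [11, 22, 33]])
p = n[:, :].copy()   # explicit copy instead of a shared view
p[0, 0] = 555
print(p[0, 0])       # 555
print(n[0, 0])       # 1: n is unaffected by changes to p
```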
x = np.arange(15).reshape((3, 5))
print(x)
print(x % 3 == 0)
y = x[x % 3 == 0]
print("New array y of the multiples of 3 in x =", y)
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
[[ True False False  True False]
 [False  True False False  True]
 [False False  True False False]]
New array y of the multiples of 3 in x = [ 0  3  6  9 12]
np.random.seed(666)
z = np.random.randint(1, 100, 12).reshape((3, 4))
print("2-D random integer array z =", z)
idx = (2, [1, 3])   # a tuple: row 2, columns 1 and 3
print("Index idx =", idx)
print("Subset z[idx] obtained by indexing z with idx =", z[idx])
2-D random integer array z = [[ 3 46 31 63]
 [71 74 31 37]
 [62 92 95 52]]
Index idx = (2, [1, 3])
Subset z[idx] obtained by indexing z with idx = [92 52]
Note: idx must be a tuple here; newer versions of NumPy no longer accept a plain list of mixed indices such as [2, [1, 3]].
V. Applying statistics and sorting functions
np.random.seed(666)
z = np.random.randint(1, 100, 12).reshape((3, 4))
print("2-D random integer array z =", z)
print("Sum of all elements of z:", z.sum())
print("Column sums of z:", z.sum(axis=0))
print("Row sums of z:", z.sum(axis=1))
print("Mean of all elements of z:", np.mean(z))
print("Column means of z:", z.mean(axis=0))
print("Row means of z:", z.mean(axis=1))
print("Maximum of z:", z.max())
print("Index of the maximum of z (in the flattened array):", z.argmax())
print("Row-wise maxima of z:", z.max(axis=1))
print("Indices of the row-wise maxima:", z.argmax(axis=1))
print("Number of elements of z greater than 90:", np.sum(z > 90))
print("Number of elements of z between 60 and 80:", np.sum((z >= 60) & (z <= 80)))
2-D random integer array z = [[ 3 46 31 63]
 [71 74 31 37]
 [62 92 95 52]]
Sum of all elements of z: 657
Column sums of z: [136 212 157 152]
Row sums of z: [143 213 301]
Mean of all elements of z: 54.75
Column means of z: [45.33333333 70.66666667 52.33333333 50.66666667]
Row means of z: [35.75 53.25 75.25]
Maximum of z: 95
Index of the maximum of z (in the flattened array): 10
Row-wise maxima of z: [63 74 95]
Indices of the row-wise maxima: [3 1 2]
Number of elements of z greater than 90: 2
Number of elements of z between 60 and 80: 4
print("z before sorting =", z)
print("Sorted along rows:", np.sort(z))
print("Original indices of the row-wise sort:", np.argsort(z))
print("Sorted along columns:", np.sort(z, axis=0))
print("Original indices of the column-wise sort:", np.argsort(z, axis=0))
r = z.flatten()
print("1-D array r from flattening z row by row =", r)
print("Sorted r:", np.sort(r))
print("Descending order via slicing:", np.sort(r)[::-1])
print("Descending order via argsort:", r[np.argsort(-r)])
print("*************************")
print("r before sorting =", r)
print("Result of np.sort(r):", np.sort(r))
print("r after np.sort (unchanged):", r)
print("Return value of r.sort():", r.sort())
print("r after r.sort() (sorted in place):", r)
z before sorting = [[ 3 46 31 63]
 [71 74 31 37]
 [62 92 95 52]]
Sorted along rows: [[ 3 31 46 63]
 [31 37 71 74]
 [52 62 92 95]]
Original indices of the row-wise sort: [[0 2 1 3]
 [2 3 0 1]
 [3 0 1 2]]
Sorted along columns: [[ 3 46 31 37]
 [62 74 31 52]
 [71 92 95 63]]
Original indices of the column-wise sort: [[0 0 0 1]
 [2 1 1 2]
 [1 2 2 0]]
1-D array r from flattening z row by row = [ 3 46 31 63 71 74 31 37 62 92 95 52]
Sorted r: [ 3 31 31 37 46 52 62 63 71 74 92 95]
Descending order via slicing: [95 92 74 71 63 62 52 46 37 31 31  3]
Descending order via argsort: [95 92 74 71 63 62 52 46 37 31 31  3]
*************************
r before sorting = [ 3 46 31 63 71 74 31 37 62 92 95 52]
Result of np.sort(r): [ 3 31 31 37 46 52 62 63 71 74 92 95]
r after np.sort (unchanged): [ 3 46 31 63 71 74 31 37 62 92 95 52]
Return value of r.sort(): None
r after r.sort() (sorted in place): [ 3 31 31 37 46 52 62 63 71 74 92 95]
Note: np.sort(r) returns a sorted copy and leaves r unchanged, whereas r.sort() sorts r in place and returns None.
② Key Pandas knowledge points to master
VI. Creating a DataFrame
1. Creating directly from two-dimensional data (with both the index and columns parameters)
import numpy as np
import pandas as pd
scores = np.array([[97, 93, 86],
                   [95, 97, 88]])
pd.DataFrame(scores, index=['s01', 's02'], columns=['Math', 'English', 'Chinese'])
2. Creating from data in an Excel file
team = pd.read_excel('team.xlsx')
team.head()
    name team  Q1  Q2  Q3  Q4
0  Liver    E  89  21  24  64
1   Arry    C  36  37  37  57
2    Ack    A  57  60  18  84
3  Eorge    C  93  96  71  78
4    Oah    D  65  49  61  86
VII. Viewing DataFrame data and online help
1. Viewing particular rows
team.head(3)
    name team  Q1  Q2  Q3  Q4
0  Liver    E  89  21  24  64
1   Arry    C  36  37  37  57
2    Ack    A  57  60  18  84
team.tail()
        name team  Q1  Q2  Q3  Q4
95   Gabriel    C  48  59  87  74
96   Austin7    C  21  31  30  43
97  Lincoln4    C  98  93   1  20
98       Eli    E  11  74  58  91
99       Ben    E  21  43  41  74
team.sample(2)
      name team  Q1  Q2  Q3  Q4
73  Elliot    C  15  17  76  22
26   Teddy    E  71  91  21  48
2. Two common ways to view online help (help and ?)
help(team.sample)
3. Viewing summary statistics
team.describe()
               Q1          Q2          Q3          Q4
count  100.000000  100.000000  100.000000  100.000000
mean    49.200000   52.550000   52.670000   52.780000
std     29.962603   29.845181   26.543677   27.818524
min      1.000000    1.000000    1.000000    2.000000
25%     19.500000   26.750000   29.500000   29.500000
50%     51.500000   49.500000   55.000000   53.000000
75%     74.250000   77.750000   76.250000   75.250000
max     98.000000   99.000000   99.000000   99.000000
4. Viewing data by row or column position
team.iloc[3:5, [0, 2]]
team[10:13]
      name team  Q1  Q2  Q3  Q4
10     Leo    B  17   4  33  79
11   Logan    B   9  89  35  65
12  Archie    C  83  89  59  68
5. Viewing data by row or column label
team.loc[3:4, ["name", "Q1"]]
Note: although the two general forms above produce the same output, they work differently:
(1) Slices with the iloc indexer exclude the end value, so team.iloc[3:5, [0, 2]] does not include the row at position 5.
(2) Slices with the loc indexer include the end value, so team.loc[3:4, ["name", "Q1"]] does include the row labeled 4.
(3) The same integer is interpreted as a row/column position by iloc, but as a row/column label by loc.
print(team['team'].unique())
team[['name', 'Q1']].head(3)
['E' 'C' 'A' 'D' 'B']
    name  Q1
0  Liver  89
1   Arry  36
2    Ack  57
Additional note: the general forms using the .iloc or .loc indexer apply more broadly, so mastering them is the baseline requirement; beyond that, it is worth knowing the shorthand based on column labels, since it is also very common.
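The inclusive/exclusive difference between loc and iloc can be demonstrated on a small stand-in frame (hypothetical data, not team.xlsx), using an integer index that does not start at 0 so that positions and labels differ:

```python
import pandas as pd

# integer index chosen so labels differ from positions
df = pd.DataFrame({'Q1': [10, 20, 30, 40]}, index=[3, 4, 5, 6])
print(df.iloc[0:2])  # positions 0 and 1 -> rows labeled 3 and 4 (end excluded)
print(df.loc[3:4])   # labels 3 through 4 inclusive -> the same two rows here
```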
6. Querying data by condition
team.loc[team['Q1'] > 90, 'name']
3         Eorge
17        Henry
19          Max
32    Alexander
38       Elijah
80         Ryan
88        Aaron
97     Lincoln4
Name: name, dtype: object
team.loc[team['name'].str.startswith('M'), ['name', 'Q1', 'Q2']]
       name  Q1  Q2
19      Max  97  75
23    Mason  80  96
62  Matthew  44  33
77  Michael  89  21
if team.index.name != 'name':
    team.set_index("name", inplace=True)
team.loc[team.index.str.startswith('M'), 'Q1':'Q4']
         Q1  Q2  Q3  Q4
name
Max      97  75  41   3
Mason    80  96  26  49
Matthew  44  33  41  98
Michael  89  21  59  92
VIII. Adding, deleting, and modifying data in a DataFrame
1. Appending a column at the end
df = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                   'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
print("Before adding the sex column:\n", df)
sex_value = pd.Series(['M', 'M', 'F', 'F'])
salary_value = [6000, 5000, 4000, 3000]
df['sex'] = sex_value
df['salary'] = salary_value
print("After adding the sex and salary columns:")
df
Before adding the sex column:
   employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
After adding the sex and salary columns:
  employee        group sex  salary
0      Bob   Accounting   M    6000
1     Jake  Engineering   M    5000
2     Lisa  Engineering   F    4000
3      Sue           HR   F    3000
2. Appending a row at the end
df.loc[len(df), :] = ['Mike', 'Guarding', 'M', 2000]
print("After appending a row:")
df
After appending a row:
  employee        group sex  salary
0      Bob   Accounting   M  6000.0
1     Jake  Engineering   M  5000.0
2     Lisa  Engineering   F  4000.0
3      Sue           HR   F  3000.0
4     Mike     Guarding   M  2000.0
3. Replacing a column
new_sex = len(df) * ["Unknown"]
print(new_sex)
df['sex'] = new_sex
print("After modifying the sex column:")
df
['Unknown', 'Unknown', 'Unknown', 'Unknown', 'Unknown']
After modifying the sex column:
  employee        group      sex  salary
0      Bob   Accounting  Unknown  6000.0
1     Jake  Engineering  Unknown  5000.0
2     Lisa  Engineering  Unknown  4000.0
3      Sue           HR  Unknown  3000.0
4     Mike     Guarding  Unknown  2000.0
4. Replacing a row
df.loc[2, :] = ["Rose", "Sales", "Female", 3500]
print("After modifying the row labeled 2:")
df
After modifying the row labeled 2:
  employee        group      sex  salary
0      Bob   Accounting  Unknown  6000.0
1     Jake  Engineering  Unknown  5000.0
2     Rose        Sales   Female  3500.0
3      Sue           HR  Unknown  3000.0
4     Mike     Guarding  Unknown  2000.0
5. Deleting one or more columns
df.drop(['sex', 'salary'], axis=1, inplace=True)
print("After dropping the sex and salary columns:")
df
After dropping the sex and salary columns:
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Rose        Sales
3      Sue           HR
4     Mike     Guarding
6. Deleting a row
df.drop(4, inplace=True)
print("After dropping the row labeled 4:")
df
After dropping the row labeled 4:
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Rose        Sales
3      Sue           HR
Note: the parameters of the functions above can be checked with ? or help; for example, df.drop? shows the help for the drop function.
IX. Merging DataFrames
Problem: given the two DataFrames df2 and df3 below, how can they be merged into a single DataFrame?
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df2
  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014
df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue', 'Tom'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR', np.nan]})
df3
  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR
4      Tom          NaN
1. Merging with merge
merge joins on matching column values and combines columns, much like a join in SQL. Four join rules are available when merging (how='inner', 'outer', 'left', 'right'):
df4_1 = pd.merge(df3, df2)
df4_1
  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
df4_2 = pd.merge(df3, df2, how='outer')
df4_2
  employee        group  hire_date
0      Bob   Accounting     2008.0
1     Jake  Engineering     2012.0
2     Lisa  Engineering     2004.0
3      Sue           HR     2014.0
4      Tom          NaN        NaN
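The remaining two join rules, how='left' and how='right', can be sketched with the same two frames:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue', 'Tom'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR', np.nan]})
# how='left' keeps every row of the left frame (df3), so Tom survives with NaN hire_date
print(pd.merge(df3, df2, how='left'))
# how='right' keeps every row of the right frame (df2), so Tom is dropped
print(pd.merge(df3, df2, how='right'))
```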
2. Merging with concat
df5 = pd.concat([df3, df2])
df5
  employee        group  hire_date
0      Bob   Accounting        NaN
1     Jake  Engineering        NaN
2     Lisa  Engineering        NaN
3      Sue           HR        NaN
4      Tom          NaN        NaN
0     Lisa          NaN     2004.0
1      Bob          NaN     2008.0
2     Jake          NaN     2012.0
3      Sue          NaN     2014.0
3. Merging with join
df6 = df3.join(df2, lsuffix='_l', rsuffix='_r')
df6
  employee_l        group employee_r  hire_date
0        Bob   Accounting       Lisa     2004.0
1       Jake  Engineering        Bob     2008.0
2       Lisa  Engineering       Jake     2012.0
3        Sue           HR        Sue     2014.0
4        Tom          NaN        NaN        NaN
Summary: concat by default concatenates rows and takes the union of columns (axis=0, join='outer');
merge by default joins columns on matching column values and keeps the intersection of rows (how='inner');
join by default merges columns on the row index and performs a left join.
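concat can also stack column-wise by passing axis=1, aligning rows on the index; a minimal sketch:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})
# axis=1 places the frames side by side, matching rows by index label
wide = pd.concat([left, right], axis=1)
print(wide)
```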
X. Grouping and related computations
1. Grouping and statistics
team.groupby('team')[['Q1', 'Q2']].mean()
             Q1         Q2
team
A     62.705882  37.588235
B     44.318182  55.363636
C     48.000000  54.272727
D     45.263158  62.684211
E     48.150000  50.650000
team.groupby('team').mean()[['Q1', 'Q2']]
             Q1         Q2
team
A     62.705882  37.588235
B     44.318182  55.363636
C     48.000000  54.272727
D     45.263158  62.684211
E     48.150000  50.650000
2. Finding the groups that satisfy a condition (filtering out the groups that do not)
flt_df = team.groupby('team').filter(lambda x: (x['Q1'].mean() > 45) & (x['Q2'].mean() > 45))
flt_df.groupby('team')[['Q1', 'Q2']].mean()
             Q1         Q2
team
C     48.000000  54.272727
D     45.263158  62.684211
E     48.150000  50.650000
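Beyond a single statistic such as mean, groupby also supports agg for computing several statistics at once; a sketch on a small hypothetical frame shaped like team:

```python
import pandas as pd

# hypothetical data standing in for the team frame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'Q1': [60, 70, 40, 50]})
# agg computes several statistics per group in one call
print(df.groupby('team')['Q1'].agg(['mean', 'max', 'min']))
```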
XI. Handling missing values
1. How Pandas represents missing values
data = pd.Series([1, np.nan, 'hello', None])
data
0        1
1      NaN
2    hello
3     None
dtype: object
2. Methods for detecting and handling missing values
isnull(): tests each element for missingness, returning a boolean Pandas object of the same shape as the original
notnull(): the opposite of isnull()
dropna(): returns a data object with the missing values removed
fillna(): returns a data object with the missing values filled in
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
data.isnull().sum()
2
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df.isnull().sum().sum()
2
df.dropna()
df.dropna(axis=1)
df.dropna(axis='columns', how='all')
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
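dropna also takes a thresh parameter, which keeps only the rows (or columns) that contain at least that many non-missing values; a minimal sketch on the same df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
# thresh=3 keeps only rows with at least 3 non-missing values
print(df.dropna(thresh=3))  # only the fully populated middle row remains
```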
3. Filling missing values
df.fillna(0)
     0    1  2
0  1.0  0.0  2
1  2.0  3.0  5
2  0.0  4.0  6
df.fillna(method='ffill')
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  2.0  4.0  6
df.fillna(method='bfill')
     0    1  2
0  1.0  3.0  2
1  2.0  3.0  5
2  NaN  4.0  6
df.interpolate(method='linear', limit_direction='forward', axis=1)
     0    1    2
0  1.0  1.5  2.0
1  2.0  3.0  5.0
2  NaN  4.0  6.0
print(df)
df.interpolate(method='linear', limit_direction='forward', axis=1, inplace=True)
df
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
     0    1    2
0  1.0  1.5  2.0
1  2.0  3.0  5.0
2  NaN  4.0  6.0
Note: with limit_direction='forward', a leading NaN (here row 2, column 0) has no earlier value to interpolate from and therefore remains missing.