Data Processing with DataFrame in Python

1. Overview

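 DataFrame is the data structure in the Pandas library for handling tabular data. It supports database-like operations in Python and is one of the most commonly used tools for data mining in Python. The sections below introduce some of its common methods. The outputs shown are from the original run, which appears to have used Python 2 and an older pandas release: columns are listed alphabetically and print() with two arguments shows up as a tuple such as ('idx1', 0). With Python 3 and a current pandas, columns keep the order given in the dict, the same line prints as idx1 0, and get_dummies may show True/False instead of 1/0.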

2. Iterating over rows

1) Code

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})
print(df)
# iterrows() yields (index label, row as a Series) pairs
for idx, item in df.iterrows():
    print(idx)
    print(item)

2) Output

   data1  data2 key
0      1      4   a
1      2      5   b
2      3      6   c
0
data1    1
data2    4
key      a
Name: 0, dtype: object
… (remaining rows omitted)
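
As a complement to iterrows(), the following is a minimal sketch of row iteration with itertuples(), which yields one namedtuple per row instead of a Series. The variable name rows and the sum computed per row are illustrative assumptions, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})

# itertuples() yields one namedtuple per row; columns are read as attributes
# and the index label is available as row.Index.
rows = []
for row in df.itertuples(index=True):
    rows.append((row.Index, row.key, row.data1 + row.data2))

print(rows)
```

Because no Series is built for each row, this form is usually faster than iterrows() on large tables, although a column-wise expression is often preferable when one exists.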

3. Iterating over two DataFrames at the same time

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'data1': [1, 2]})
df2 = pd.DataFrame({'key': ['c', 'd'], 'data2': [4, 5]})
# zip() pairs the rows by position and stops at the shorter table
for (idx1, item1), (idx2, item2) in zip(df1.iterrows(), df2.iterrows()):
    print("idx1", idx1)
    print(item1)
    print("idx2", idx2)
    print(item2)

2) Output

('idx1', 0)
data1    1
key      a
Name: 0, dtype: object
('idx2', 0)
data2    4
key      c
Name: 0, dtype: object
('idx1', 1)
data1    2
key      b
Name: 1, dtype: object
('idx2', 1)
data2    5
key      d
Name: 1, dtype: object

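4. Selecting one or more rows

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
# Slicing selects rows by position: [:1] keeps the first row, [:2] would keep the first two
df2 = df1[:1]
print(df2)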

2) Output

   data1 key
0      1   a
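
Beyond slicing, rows can also be selected by position, by index label, or by a condition. The sketch below uses the same df1; the variable names on the left-hand side are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})

first_row = df1.iloc[0]            # one row by position, returned as a Series
first_two = df1.iloc[0:2]          # positions 0 and 1, returned as a DataFrame
by_label = df1.loc[[0, 2]]         # rows whose index labels are 0 and 2
filtered = df1[df1['data1'] > 1]   # rows that satisfy a boolean condition

print(first_row)
print(first_two)
print(by_label)
print(filtered)
```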

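5. Selecting one or more columns

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
# Start from an empty DataFrame and copy the 'key' column into it under a new name
df2 = pd.DataFrame()
df2['key2'] = df1['key']
print(df2)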

2) Output

  key2
0    a
1    b
2    c
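
To take several columns in one step, a list of column names can be passed inside the brackets. This is a small sketch with an assumed three-column table; one_col and sub are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})

one_col = df['key']            # a single name returns a Series
sub = df[['key', 'data2']]     # a list of names returns a DataFrame

print(type(one_col).__name__)
print(sub)
```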

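6. Joining columns (horizontal, the table gets wider): merge

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data2': [4, 5, 6]})
print(df1)
print(df2)
# With no 'on' argument, merge joins on the columns both tables share ('key' here)
df3 = pd.merge(df1, df2)
print(df3)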

2) Output

   data1 key
0      1   a
1      2   b
2      3   c
   data2 key
0      4   a
1      5   b
2      6   c
   data1 key  data2
0      1   a      4
1      2   b      5
2      3   c      6
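
When the two tables do not contain the same keys, the how argument decides which rows survive. The following sketch assumes a right-hand table whose keys only partly overlap; the table names left and right are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [4, 5, 6]})

inner = pd.merge(left, right, on='key')                   # default how='inner': keys in both tables
left_join = pd.merge(left, right, on='key', how='left')   # every key from left; missing data2 becomes NaN
outer = pd.merge(left, right, on='key', how='outer')      # keys from either table

print(inner)
print(left_join)
print(outer)
```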

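7. Joining rows (vertical, the table gets longer): concat

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['d', 'e', 'f'], 'data': [4, 5, 6]})
print(df1)
print(df2)
# concat stacks the tables and keeps each table's original index labels
df3 = pd.concat([df1, df2])
print(df3)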

2) Output

   data key
0     1   a
1     2   b
2     3   c
   data key
0     4   d
1     5   e
2     6   f
   data key
0     1   a
1     2   b
2     3   c
0     4   d
1     5   e
2     6   f
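
The combined table above repeats the index labels 0 to 2. If a fresh running index is wanted, ignore_index=True can be passed, as in this minimal sketch.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['d', 'e', 'f'], 'data': [4, 5, 6]})

# ignore_index=True discards the original labels and renumbers the rows from 0
df3 = pd.concat([df1, df2], ignore_index=True)
print(df3)
```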

8. Simple transformation of a column

1) Code

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
print(df)
# Arithmetic on a column is applied to every element
df['data1'] = df['data1'] + 1
print(df)

2) Output

   data1 key
0      1   a
1      2   b
2      3   c
   data1 key
0      2   a
1      3   b
2      4   c

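9. Complex transformation of a column

1) Code

import pandas as pd
import math

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
print(df)
# apply() calls the function once for each element of the column
df['data1'] = df['data1'].apply(lambda x: math.sin(x))
print(df)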

2) Output

   data1 key
0      1   a
1      2   b
2      3   c
      data1 key
0  0.841471   a
1  0.909297   b
2  0.141120   c
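
For functions that NumPy provides, the whole column can be passed to the NumPy function directly instead of going through apply(). This sketch assumes NumPy is installed (pandas depends on it) and compares the two routes; via_apply and via_numpy are illustrative names.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})

via_apply = df['data1'].apply(math.sin)   # one Python call per element
via_numpy = np.sin(df['data1'])           # one call on the whole column, returns a Series

print(np.allclose(via_apply, via_numpy))
```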

10. Processing a column with a function

1) Code

import pandas as pd

def testme(x):
    # x is a single element of the column
    print("???", x)
    y = x + 3000
    return y

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
print(df)
df['data1'] = df['data1'].apply(testme)
print(df)

2) Output

   data1 key
0      1   a
1      2   b
2      3   c
('???', 1)
('???', 2)
('???', 3)
   data1 key
0   3001   a
1   3002   b
2   3003   c

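11. Computing a new column from several columns

1) Code

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})
print(df)
# Columns are added element by element, row for row
df['data3'] = df['data1'] + df['data2']
print(df)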

2) Output

   data1  data2 key
0      1      4   a
1      2      5   b
2      3      6   c
   data1  data2 key  data3
0      1      4   a      5
1      2      5   b      7
2      3      6   c      9

12. Generating a new column from several columns with a function

1) Code

import pandas as pd

def testme(x):
    # With axis=1, x is one row, passed in as a Series
    print(x['data1'], x['data2'])
    return x['data1'] + x['data2']

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})
print(df)
df['data3'] = df.apply(testme, axis=1)
print(df)

2) Output

   data1  data2 key
0      1      4   a
1      2      5   b
2      3      6   c
(1, 4)
(2, 5)
(3, 6)
   data1  data2 key  data3
0      1      4   a      5
1      2      5   b      7
2      3      6   c      9

13. Deleting a column

1) Code

import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})
print(df)
# axis=1 means the labels refer to columns; drop() returns a new DataFrame
df = df.drop(['data2'], axis=1)
print(df)

2) Output

   data1  data2 key
0      1      4   a
1      2      5   b
2      3      6   c
   data1 key
0      1   a
1      2   b
2      3   c
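
drop() also accepts the keyword arguments columns and index, which some readers find clearer than axis. A minimal sketch, with illustrative result names:

```python
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3], 'data2': [4, 5, 6]})

no_data2 = df.drop(columns=['data2'])   # same effect as df.drop(['data2'], axis=1)
no_first_row = df.drop(index=[0])       # removes the row whose index label is 0

print(no_data2)
print(no_first_row)
```

Both calls return new tables, so df itself still holds all three columns and rows.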

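14. One-hot encoding

(turns one categorical column into several numeric columns)

1) Code

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
print(df1)
# Encode a single column: one indicator column per distinct value
df2 = pd.get_dummies(df1['key'])
print(df2)
# Encode a whole table: string columns are expanded, numeric columns are kept
df3 = pd.get_dummies(df1)
print(df3)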

2) Output

   data1 key
0      1   a
1      2   b
2      3   c
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
   data1  key_a  key_b  key_c
0      1      1      0      0
1      2      0      1      0
2      3      0      0      1
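
get_dummies can also be told which columns to encode, which prefix to use, and which dtype the indicator columns should have. The sketch below requests integer indicators explicitly; the prefix 'k' and the name encoded are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})

# columns selects what to encode, prefix names the new columns,
# dtype=int asks for 0/1 integers rather than the library default
encoded = pd.get_dummies(df1, columns=['key'], prefix='k', dtype=int)
print(encoded)
```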

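15. Other common methods

In the calls below, df is a DataFrame and f stands for a column name.

1) Summary statistics (count, mean, standard deviation, min, quartiles including the median, max)

df[f].describe()

2) Mean

df[f].mean()

3) Standard deviation (use df[f].var() for the variance)

df[f].std()

4) Drop rows that contain missing values

df.dropna()

5) Fill missing values (a fill value is required; 0 is used here as an example)

df.fillna(0)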
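
The calls in this section can be tried together on a small table that contains one missing value. This is a sketch under assumed data: the table contents, the column name stored in f, and filling with the column mean are all illustrative choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c', 'd'], 'data1': [1.0, 2.0, np.nan, 4.0]})
f = 'data1'  # hypothetical column name used for the examples

summary = df[f].describe()   # count, mean, std, min, quartiles, max
mean = df[f].mean()          # missing values are skipped
std = df[f].std()            # sample standard deviation (ddof=1)
var = df[f].var()            # sample variance, the square of std

dropped = df.dropna()                 # rows with any missing value removed
filled = df.fillna({f: mean})         # missing values in column f replaced by the mean

print(summary)
print(mean, std, var)
print(dropped)
print(filled)
```

dropna() and fillna() return new tables by default, so df keeps its missing value unless the result is assigned back.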

