文章目录
【1】处理丢失数据
1.人为制造丢失数据NaN
import numpy as np
import pandas as pd
row = pd.date_range('20200726',periods=6)
dates = np.arange(24).reshape(6,4)
column = ['A','B','C','D']
df = pd.DataFrame(dates,index=row,columns=column)
df.iloc[1,3] = np.nan
df.iloc[0,2] = np.nan
print(df)
'''
A B C D
2020-07-26 0 1 NaN 3.0
2020-07-27 4 5 6.0 NaN
2020-07-28 8 9 10.0 11.0
2020-07-29 12 13 14.0 15.0
2020-07-30 16 17 18.0 19.0
2020-07-31 20 21 22.0 23.0
'''
2.检查是否含有nan
print(df.isnull())
将所有是nan的位置返回True,否则返回False
print(np.any(df.isnull()) == True)
检查所有的isnull,如果存在一个True则np.any会返回True
print(df.isnull())
'''
A B C D
2020-07-26 False False True False
2020-07-27 False False False True
2020-07-28 False False False False
2020-07-29 False False False False
2020-07-30 False False False False
2020-07-31 False False False False
'''
print(np.any(df.isnull()) == True)
# True
3.统一将nan赋值
print(df.fillna(value=0))
'''
A B C D
2020-07-26 0 1 0.0 3.0
2020-07-27 4 5 6.0 0.0
2020-07-28 8 9 10.0 11.0
2020-07-29 12 13 14.0 15.0
2020-07-30 16 17 18.0 19.0
2020-07-31 20 21 22.0 23.0
'''
4.删除含有nan的行、列
axis = 0,表示行操作,axis = 1表示列操作。how = all/any,默认为any。
当how=all表示:该行、列全为nan则删除该行、列;当how=any表示:该行、列有一个为nan则删除该行、列
print(df.dropna(axis=1,how='all')) #与元数据相同
'''
A B C D
2020-07-26 0 1 NaN 3.0
2020-07-27 4 5 6.0 NaN
2020-07-28 8 9 10.0 11.0
2020-07-29 12 13 14.0 15.0
2020-07-30 16 17 18.0 19.0
2020-07-31 20 21 22.0 23.0
'''
df = df.dropna(axis=0,how='any') #去掉了两行
print(df)
'''
A B C D
2020-07-28 8 9 10.0 11.0
2020-07-29 12 13 14.0 15.0
2020-07-30 16 17 18.0 19.0
2020-07-31 20 21 22.0 23.0
'''
【2】导入导出数据
1.导入
pd.read_csv(filename) | 从CSV文件导入数据 |
---|---|
pd.read_table(filename) | 从限定分隔符的文本文件导入数据 |
pd.read_excel(filename) | 从Excel文件导入数据 |
pd.read_sql(query, connection_object) | 从SQL表/库导入数据 |
pd.read_json(json_string) | 从JSON格式的字符串导入数据 |
pd.read_html(url) | 解析URL、字符串或者HTML文件,抽取其中的tables表格 |
pd.read_clipboard() | 从你的粘贴板获取内容,并传给read_table() |
pd.DataFrame(dict) | 从字典对象导入数据,Key是列名,Value是数据 |
import numpy as np
import pandas as pd
filenames = 'student.csv'
dates = pd.read_csv(filenames)
print(dates)
2.导出
df.to_csv(filename) | 导出数据到CSV文件 |
---|---|
df.to_excel(filename) | 导出数据到Excel文件 |
df.to_sql(table_name, connection_object) | 导出数据到SQL表 |
df.to_json(filename) | 以Json格式导出数据到文本文件 |
【3】合并数据
1.将多表合并concatenate:concat
(1)原始数据
import numpy as np
import pandas as pd
column_name = ['A','B','C','D']
df1 = pd.DataFrame(np.ones((3,4))*0,columns=column_name)
df2 = pd.DataFrame(np.ones((3,4))*1,columns=column_name)
df3 = pd.DataFrame(np.ones((3,4))*2,columns=column_name)
print(df1)
print(df2)
print(df3)
'''
A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
A B C D
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
A B C D
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
'''
(2)连接相同的列标签
axis = 0表示行操作,增加的列表会在原表的下方合成。默认axis = 0。如果不设置ignore_index=True,默认为False即纵轴产生0 1 2 0 1 2 0 1 2的效果,忽略原坐标的标签会重新进行排序
ret = pd.concat([df1,df2,df3],axis=0,ignore_index=True)
print(ret)
'''
A B C D
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
'''
ret = pd.concat([df1,df2,df3],axis=1,ignore_index=True)
print(ret)
'''
0 1 2 3 4 5 6 7 8 9 10 11
0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
1 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
'''
由于进行列操作后,column的ABCD不够8个,所以默认为0-7的数字
(3)标注分类
ret = pd.concat([df1,df2,df3],keys=['df1','df2','df3'],names=['Series name', 'Row num'])
print(ret)
'''
A B C D
Series name Row num
df1 0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
df2 0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
df3 0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
'''
(4)合并列标签不同的表
join = outer 会把df1在df2中没有的列用nan代替,把df2在df1中没有的列用nan代替,默认join = outer
join = inner 会把df1在df2中没有的列去掉,把df2在df1中没有的列去掉,最后只有两者共有的部分
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((3,3))*0,index=[1,2,3],columns=['a','b','c'])
df2 = pd.DataFrame(np.ones((3,3))*1,index=[2,3,4],columns=['b','c','d'])
ret = pd.concat([df1,df2],join='outer')
print(ret)
'''
a b c d
1 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 NaN
2 NaN 1.0 1.0 1.0
3 NaN 1.0 1.0 1.0
4 NaN 1.0 1.0 1.0
'''
ret = pd.concat([df1,df2],join='inner')
print(ret)
'''
b c
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
'''
同样可以用ignore_index将行的标签排序
(4)水平合并行列标签都不同的表
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((3,3))*0,index=[1,2,3],columns=['a','b','c'])
df2 = pd.DataFrame(np.ones((3,3))*1,index=[2,3,4],columns=['b','c','d'])
ret = pd.concat([df1,df2],axis=1)
print(ret)
'''
a b c b c d
1 0.0 0.0 0.0 NaN NaN NaN
2 0.0 0.0 0.0 1.0 1.0 1.0
3 0.0 0.0 0.0 1.0 1.0 1.0
4 NaN NaN NaN 1.0 1.0 1.0
'''
2.在表后append添加
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.ones((3,3))*0,columns=['a','b','c'])
df2 = pd.DataFrame(np.ones((3,3))*1,columns=['a','b','c'])
df3 = pd.DataFrame(np.ones((3,3))*2,columns=['a','b','c'])
ret = df1.append(df2,ignore_index=True) #添加一行
print(ret)
'''
a b c
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
'''
s1 = pd.Series([1,2,3],index=['a','b','c'])
ret = df1.append(s1,ignore_index=True)
print(ret)
'''
a b c
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 2.0 3.0
'''
ret = df1.append([df2,df3],ignore_index=True) #添加多行
print(ret)
'''
a b c
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 1.0 1.0 1.0
4 1.0 1.0 1.0
5 1.0 1.0 1.0
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
'''
3.merge合并带key的两表
一个key有inner、outer两种合并方式
按照key相同的部分合并,因为默认how = inner
import numpy as np
import pandas as pd
left = pd.DataFrame({
'A':['A0','A1','A2','A3']
,'B':['B0','B1','B2','B3']
,'key':['K0','K1','K2','K3']
})
right = pd.DataFrame({
'C':['C0','C1','C2','C3']
,'D':['D0','D1','D2','D3']
,'key':['K0','K1','K2','K4']
})
print(left)
print(right)
'''
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K4
'''
ret = pd.merge(left,right,on='key')
print(ret)
'''
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
'''
两个key有left、right、inner、outer四种合并方式
原数据
import numpy as np
import pandas as pd
left = pd.DataFrame({
'A':['A0','A1','A2','A3']
,'B':['B0','B1','B2','B3']
,'key1':['K0','K1','K2','K3']
,'key2':['K0','K1','K0','K1']
})
right = pd.DataFrame({
'C':['C0','C1','C2','C3']
,'D':['D0','D1','D2','D3']
,'key1':['K0','K1','K1','K2']
,'key2':['K0','K1','K1','K0']
})
print(left)
print(right)
'''
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K1 K1
2 A2 B2 K2 K0
3 A3 B3 K3 K1
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K1
2 C2 D2 K1 K1
3 C3 D3 K2 K0
'''
outer inner
outer为按照两表的key1、key2的并集合并,没有的按照nan补充;inner为按照两表的key1、key2的交集合并,没有的按照nan补充。how默认为inner
在使用outer合并时:right的key中有两个K1,而left中只有一个K1,合并时会把K1对应的A1 B1重复来进行填充。
在使用inner合并时:left的K0 K0与right的K0 K0可以对应上,所以合并后有K0 K0;left有一个K1 K1,right有两个K1 K1,交集会有两个所以合并后会有两个K1 K1,left不够-重复
A1 B1进行填充
ret = pd.merge(left,right,on=['key1','key2'],how='outer')
print(ret)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A1 B1 K1 K1 C2 D2
3 A2 B2 K2 K0 C3 D3
4 A3 B3 K3 K1 NaN NaN
'''
ret = pd.merge(left,right,on=['key1','key2'])
print(ret)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A1 B1 K1 K1 C2 D2
3 A2 B2 K2 K0 C3 D3
'''
left right
left、right为按照left、right的key1,key2合并两表(将left/right的key1、key2完全照抄,放入另一表进行填充)
ret = pd.merge(left,right,on=['key1','key2'],how='left')
print(ret)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A1 B1 K1 K1 C2 D2
3 A2 B2 K2 K0 C3 D3
4 A3 B3 K3 K1 NaN NaN
'''
ret = pd.merge(left,right,on=['key1','key2'],how='right')
print(ret)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K1 K1 C1 D1
2 A1 B1 K1 K1 C2 D2
3 A2 B2 K2 K0 C3 D3
'''
不过测试的时候how = left不是按照 left的key1 key2 合并right,还不知道原因。
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
'col1':[0,1]
,'col_left':['a','b']
})
df2 = pd.DataFrame({
'col1':[1,2,2]
,'col_right':[1,2,2]
})
print(df1)
print(df2)
'''
col1 col_left
0 0 a
1 1 b
col1 col_right
0 1 1
1 2 2
2 2 2
'''