Python visualization basics
import pandas as pd
import numpy as np
# Make Jupyter display every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
1. Reshaping tables
- When plotting with the plotnine package or aggregating with groupby(), the data is usually expected to be a long-format (one observation per row) DataFrame. If the imported table is in wide format, convert it to long format first.
df = pd.DataFrame({'X':['A','B','C'],'2019':[1,4,6],'2020':[4,1,7]})
df
df_melt = pd.melt(df,id_vars = 'X',var_name = 'year',value_name = 'value')
df_melt
|   | X | year | value |
|---|---|------|-------|
| 0 | A | 2019 | 1 |
| 1 | B | 2019 | 4 |
| 2 | C | 2019 | 6 |
| 3 | A | 2020 | 4 |
| 4 | B | 2020 | 1 |
| 5 | C | 2020 | 7 |
- Conversely, pivot_table() spreads a single column into several columns keyed by a variable, turning the long-format table back into a wide one.
df_pivot = df_melt.pivot_table(index = 'X',columns = 'year',values = 'value')
df_pivot = df_pivot.reset_index()
df_pivot
| year | X | 2019 | 2020 |
|------|---|------|------|
| 0 | A | 1 | 4 |
| 1 | B | 4 | 1 |
| 2 | C | 6 | 7 |
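- Note that pivot_table() aggregates duplicated index/column combinations (using the mean by default), while DataFrame.pivot() raises an error on duplicates. A minimal sketch of the difference, reusing df_melt from above (df_dup is only an illustrative name):
# pivot() requires unique (index, columns) pairs; pivot_table() averages duplicates
df_dup = pd.concat([df_melt, df_melt.head(1)], axis = 0)             # duplicate the first row
df_dup.pivot_table(index = 'X', columns = 'year', values = 'value')  # works: duplicates are averaged
# df_dup.pivot(index = 'X', columns = 'year', values = 'value')      # would raise ValueError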
2. Transforming variables
- Sometimes each element of the DataFrame needs to be processed by some computation, and the result added as a new column.
df_melt['new_value'] = df_melt['value']*2
df_melt
|   | X | year | value | new_value |
|---|---|------|-------|-----------|
| 0 | A | 2019 | 1 | 2 |
| 1 | B | 2019 | 4 | 8 |
| 2 | C | 2019 | 6 | 12 |
| 3 | A | 2020 | 4 | 8 |
| 4 | B | 2020 | 1 | 2 |
| 5 | C | 2020 | 7 | 14 |
- Many other approaches work as well; use whichever fits your needs. The example below uses a row-wise apply() with a lambda, and an alternative using np.where() is sketched after the output table.
df_melt['new_value2'] = df_melt.apply(lambda x:x['value']*2 if x['year']== '2019' else x['value'],axis = 1)
df_melt
|   | X | year | value | new_value | new_value2 |
|---|---|------|-------|-----------|------------|
| 0 | A | 2019 | 1 | 2 | 2 |
| 1 | B | 2019 | 4 | 8 | 8 |
| 2 | C | 2019 | 6 | 12 | 12 |
| 3 | A | 2020 | 4 | 8 | 4 |
| 4 | B | 2020 | 1 | 2 | 1 |
| 5 | C | 2020 | 7 | 14 | 7 |
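- A vectorized alternative to the row-wise apply() above: a minimal sketch using np.where() (the column name new_value3 is only illustrative):
# np.where(condition, value_if_true, value_if_false) avoids the Python-level loop of apply()
df_melt['new_value3'] = np.where(df_melt['year'] == '2019', df_melt['value']*2, df_melt['value'])
df_melt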
3. Sorting tables
- A vector can be sorted with np.sort(), while a DataFrame is sorted with sort_values(), which orders the whole table by the values of one or more columns (a short note on np.sort() and resetting the index follows the two output tables below).
df_melt = pd.melt(df,id_vars = 'X',var_name = 'year',value_name = 'value')
dat_sort1 = df_melt.sort_values(by = 'value',ascending = True)
dat_sort2 = df_melt.sort_values(by = ['year','value'],ascending = False)
dat_sort1
dat_sort2
|   | X | year | value |
|---|---|------|-------|
| 0 | A | 2019 | 1 |
| 4 | B | 2020 | 1 |
| 1 | B | 2019 | 4 |
| 3 | A | 2020 | 4 |
| 2 | C | 2019 | 6 |
| 5 | C | 2020 | 7 |
|   | X | year | value |
|---|---|------|-------|
| 5 | C | 2020 | 7 |
| 3 | A | 2020 | 4 |
| 4 | B | 2020 | 1 |
| 2 | C | 2019 | 6 |
| 1 | B | 2019 | 4 |
| 0 | A | 2019 | 1 |
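- As mentioned above, np.sort() handles plain vectors, and sort_values() keeps the original index labels (visible in the tables). A minimal sketch of both points:
np.sort(np.array([3, 1, 2]))          # sorted NumPy vector: array([1, 2, 3])
dat_sort1.reset_index(drop = True)    # renumber the rows 0..n-1; drop = True discards the old index labels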
4. Concatenating tables
df1 = pd.DataFrame(dict(x = ['a','b','c'],y = range(1,4)))
df2 = pd.DataFrame(dict(z = ['B','D','H'],g = range(3,6)))
df3 = pd.DataFrame(dict(x = ['g','d'],y = [2,5]))
df1
df2
df3
- Sometimes you need to add new rows or columns to existing data, or stack another table onto it. This is done with pd.concat() (DataFrame.append() served the same purpose but is deprecated in recent pandas versions); a sketch of adding a single row follows the code below.
- Adding columns, or concatenating tables horizontally: axis = 1
- Adding rows, or concatenating tables vertically: axis = 0
dat_cbind = pd.concat([df1,df2],axis = 1)
dat_rbind = pd.concat([df1,df3],axis = 0)
dat_cbind
dat_rbind
df1.drop(labels = 'y',axis = 1,inplace = False)   # returns a copy of df1 without column 'y'; df1 itself is unchanged
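- Because append() is deprecated, adding a single row is also done with pd.concat(). A minimal sketch (the row content is only illustrative):
new_row = pd.DataFrame(dict(x = ['e'], y = [9]))           # build the new row as a one-row DataFrame
pd.concat([df1, new_row], axis = 0, ignore_index = True)   # ignore_index renumbers the rows 0..n-1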
5. Merging tables
- Some datasets are not consistent enough to be joined by simple concatenation; a common key is then needed as the basis for the merge. The most commonly used function for merging tables is pd.merge().
df1 = pd.DataFrame(dict(x = ['a','b','c'],y = range(1,4)))
df2 = pd.DataFrame(dict(x = ['a','b','d'],z = range(3,6)))
df3 = pd.DataFrame(dict(g = ['a','b','d'],y = [2,5,3]))
df4 = pd.DataFrame(dict(x = ['a','b','d'],y = [1,4,2],z = [2,5,3]))
df1
df2
df3
df4
dat_merge1 = pd.merge(left = df1,right = df2,how = 'left',on = 'x')    # left join: keep all rows of df1
dat_merge1
dat_merge2 = pd.merge(left = df1,right = df2,how = 'right',on = 'x')   # right join: keep all rows of df2
dat_merge2
dat_merge3 = pd.merge(left = df1,right = df2,how = 'inner',on = 'x')   # inner join: keep only keys present in both
dat_merge3
dat_merge4 = pd.merge(left = df1,right = df2,how = 'outer',on = 'x')   # outer join: keep all keys from both sides
dat_merge4
|   | x | y | z |
|---|---|---|---|
| 0 | a | 1.0 | 3.0 |
| 1 | b | 2.0 | 4.0 |
| 2 | c | 3.0 | NaN |
| 3 | d | NaN | 5.0 |
dat_merge5 = pd.merge(left = df1,right = df4,how = 'right',on =['x','y'])                    # right join on two keys
dat_merge5
dat_merge6 = pd.merge(left = df2, right = df3, how = 'left',left_on = 'x',right_on = 'g')    # join keys have different names on each side
dat_merge6
dat_merge7 = pd.merge(left = df1,right = df4,how = 'left',on = 'x',suffixes = ['.1','.2'])   # suffixes disambiguate the duplicated column y
dat_merge7
|   | x | y.1 | y.2 | z |
|---|---|-----|-----|---|
| 0 | a | 1 | 1.0 | 2.0 |
| 1 | b | 2 | 4.0 | 5.0 |
| 2 | c | 3 | NaN | NaN |
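- When checking a merge it can help to see which side each row came from. A minimal sketch using the indicator argument of pd.merge():
# indicator = True adds a '_merge' column with values 'left_only', 'right_only' or 'both'
pd.merge(left = df1, right = df2, how = 'outer', on = 'x', indicator = True)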
6. Grouped operations on tables
df = pd.DataFrame({'x':['A','B','C','B','C'],'2017':[1,4,5,7,9],'2018':[2,3,4,8,9],'2019':[3,4,1,2,7]})
df_melt = pd.melt(df,id_vars = ['x'],var_name = 'year',value_name = 'value')
df
df_melt
|   | x | 2017 | 2018 | 2019 |
|---|---|------|------|------|
| 0 | A | 1 | 2 | 3 |
| 1 | B | 4 | 3 | 4 |
| 2 | C | 5 | 4 | 1 |
| 3 | B | 7 | 8 | 2 |
| 4 | C | 9 | 9 | 7 |
|    | x | year | value |
|----|---|------|-------|
| 0 | A | 2017 | 1 |
| 1 | B | 2017 | 4 |
| 2 | C | 2017 | 5 |
| 3 | B | 2017 | 7 |
| 4 | C | 2017 | 9 |
| 5 | A | 2018 | 2 |
| 6 | B | 2018 | 3 |
| 7 | C | 2018 | 4 |
| 8 | B | 2018 | 8 |
| 9 | C | 2018 | 9 |
| 10 | A | 2019 | 3 |
| 11 | B | 2019 | 4 |
| 12 | C | 2019 | 1 |
| 13 | B | 2019 | 2 |
| 14 | C | 2019 | 7 |
df_rowsum = df[['2017','2018']].apply(lambda x:x.sum(),axis = 1)   # axis = 1: sum across the selected columns, one value per row
df_colsum = df[['2017','2018']].apply(lambda x:x.sum(),axis = 0)   # axis = 0: sum down each column
df_rowsum
df_colsum
0 3
1 7
2 9
3 15
4 18
dtype: int64
2017 26
2018 26
dtype: int64
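- For simple sums, the apply() calls above can be replaced by the built-in reduction, which is shorter and faster. A minimal equivalent sketch:
df[['2017','2018']].sum(axis = 1)   # row sums, same result as df_rowsum
df[['2017','2018']].sum(axis = 0)   # column sums, same result as df_colsum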
df['2017_2'] = df['2017'].apply(lambda x:x+2)
df['2018_2019'] = df.apply(lambda x:x['2018']+3*x['2019'],axis = 1)
df
|   | x | 2017 | 2018 | 2019 | 2017_2 | 2018_2019 |
|---|---|------|------|------|--------|-----------|
| 0 | A | 1 | 2 | 3 | 3 | 11 |
| 1 | B | 4 | 3 | 4 | 6 | 15 |
| 2 | C | 5 | 4 | 1 | 7 | 7 |
| 3 | B | 7 | 8 | 2 | 9 | 14 |
| 4 | C | 9 | 9 | 7 | 11 | 30 |
6.1 Grouping
# note: in newer pandas versions the non-numeric column 'x' must be excluded first,
# e.g. df_melt.groupby('year', as_index = False)['value'].mean()
df_group_mean1 = df_melt.groupby('year',as_index = False).mean()
df_group_mean1
|   | year | value |
|---|------|-------|
| 0 | 2017 | 5.2 |
| 1 | 2018 | 5.2 |
| 2 | 2019 | 3.4 |
df_group_mean2 = df_melt.groupby(['x','year'],as_index = False).mean()
df_group_mean2
|   | x | year | value |
|---|---|------|-------|
| 0 | A | 2017 | 1.0 |
| 1 | A | 2018 | 2.0 |
| 2 | A | 2019 | 3.0 |
| 3 | B | 2017 | 5.5 |
| 4 | B | 2018 | 5.5 |
| 5 | B | 2019 | 3.0 |
| 6 | C | 2017 | 7.0 |
| 7 | C | 2018 | 6.5 |
| 8 | C | 2019 | 4.0 |
6.2 Grouped aggregation
- aggregate() combined with groupby() implements SQL-style grouped aggregation (aggregate() can be abbreviated to agg()); a sketch of the newer named-aggregation syntax follows the output table.
df_group = df_melt.groupby(['x','year'],as_index = False).agg({'value': ['mean', 'median']})   # a list (rather than a set) keeps the column order predictable
df_group
|   | x | year | value (mean) | value (median) |
|---|---|------|--------------|----------------|
| 0 | A | 2017 | 1.0 | 1.0 |
| 1 | A | 2018 | 2.0 | 2.0 |
| 2 | A | 2019 | 3.0 | 3.0 |
| 3 | B | 2017 | 5.5 | 5.5 |
| 4 | B | 2018 | 5.5 | 5.5 |
| 5 | B | 2019 | 3.0 | 3.0 |
| 6 | C | 2017 | 7.0 | 7.0 |
| 7 | C | 2018 | 6.5 | 6.5 |
| 8 | C | 2019 | 4.0 | 4.0 |
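- Recent pandas versions also support named aggregation, which avoids the two-level column index shown above by naming the output columns directly. A minimal sketch (the output column names value_mean and value_median are only illustrative):
# each keyword argument is output_column = (input_column, aggregation_function)
df_melt.groupby(['x','year'], as_index = False).agg(value_mean = ('value','mean'), value_median = ('value','median'))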
6.3 Grouped transforms and filtering
- transform() combined with groupby() implements SQL-style grouped calculations whose result has the same length as the original table (a shorthand without the lambda is sketched after the output tables below).
- filter() combined with groupby() implements SQL-style group filtering, keeping or dropping entire groups.
df_melt['percentage'] = df_melt.groupby('x')['value'].transform(lambda x:x/x.sum())   # share of each value within its x group
df_melt
df_filter = df_melt.groupby('x').filter(lambda x:x['value'].mean()>4)                 # keep only groups whose mean value exceeds 4
df_filter
|    | x | year | value | percentage |
|----|---|------|-------|------------|
| 0 | A | 2017 | 1 | 0.166667 |
| 1 | B | 2017 | 4 | 0.142857 |
| 2 | C | 2017 | 5 | 0.142857 |
| 3 | B | 2017 | 7 | 0.250000 |
| 4 | C | 2017 | 9 | 0.257143 |
| 5 | A | 2018 | 2 | 0.333333 |
| 6 | B | 2018 | 3 | 0.107143 |
| 7 | C | 2018 | 4 | 0.114286 |
| 8 | B | 2018 | 8 | 0.285714 |
| 9 | C | 2018 | 9 | 0.257143 |
| 10 | A | 2019 | 3 | 0.500000 |
| 11 | B | 2019 | 4 | 0.142857 |
| 12 | C | 2019 | 1 | 0.028571 |
| 13 | B | 2019 | 2 | 0.071429 |
| 14 | C | 2019 | 7 | 0.200000 |
|    | x | year | value | percentage |
|----|---|------|-------|------------|
| 1 | B | 2017 | 4 | 0.142857 |
| 2 | C | 2017 | 5 | 0.142857 |
| 3 | B | 2017 | 7 | 0.250000 |
| 4 | C | 2017 | 9 | 0.257143 |
| 6 | B | 2018 | 3 | 0.107143 |
| 7 | C | 2018 | 4 | 0.114286 |
| 8 | B | 2018 | 8 | 0.285714 |
| 9 | C | 2018 | 9 | 0.257143 |
| 11 | B | 2019 | 4 | 0.142857 |
| 12 | C | 2019 | 1 | 0.028571 |
| 13 | B | 2019 | 2 | 0.071429 |
| 14 | C | 2019 | 7 | 0.200000 |
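- For a simple reduction, the lambda passed to transform() can be replaced by the name of a built-in aggregation. A minimal sketch that reproduces the percentage column:
# transform('sum') broadcasts each group total back to every row of that group
df_melt['value'] / df_melt.groupby('x')['value'].transform('sum')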
7. Importing and exporting data
- Most of the time the data is imported from an external file and then used to draw charts. The three most common formats, CSV, TXT, and Excel, are covered below; a combined, runnable sketch follows at the end of this section.
- CSV data. Read with df = pd.read_csv('Data.csv', sep = ',', header = 0, index_col = None, encoding = 'utf8'); save with df.to_csv('Data.csv', index = False, header = True). index = False drops the index from the output file; with index = True the first column of the file keeps the index values.
- TXT data. Load with np.loadtxt(); note that every line of the text file must contain the same number of values, and delimiter sets the separator: df = pd.DataFrame(np.loadtxt('Data.txt', delimiter = ',')). Use np.savetxt(fname, X) to save an ndarray as a TXT file.
- Excel data. Read with pd.read_excel('data.xlsx', sheet_name = 'sheetname', header = 0) (newer pandas uses sheet_name rather than the old sheetname argument); if the data has no column names, pass header = None. Save with df.to_excel(excel_writer, sheet_name = 'sheetname', index = False).
- Imported data sometimes contains missing values (NaN), which can be handled in the following ways:
- Drop rows that contain NaN: df_NA1 = df.dropna(axis = 0)
- Forward-fill NaN with the previous valid value: df_NA2 = df.fillna(method = 'ffill') (equivalently df.ffill() in newer pandas)
- Replace NaN with a specified value: df_NA3 = df.fillna(13)
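- A combined, runnable sketch of the round trip described above. The file names, the small example frame, and writing into the current working directory are only illustrative assumptions:
df_io = pd.DataFrame({'x':['A','B','C'],'value':[1.0,np.nan,3.0]})       # small frame with one missing value
df_io.to_csv('Data.csv', index = False, header = True)
df_csv = pd.read_csv('Data.csv', sep = ',', header = 0, index_col = None, encoding = 'utf8')
df_io.to_excel('data.xlsx', sheet_name = 'sheetname', index = False)     # needs an Excel engine such as openpyxl
df_xlsx = pd.read_excel('data.xlsx', sheet_name = 'sheetname', header = 0)
df_csv.dropna(axis = 0)            # drop rows containing NaN
df_csv.fillna(method = 'ffill')    # forward-fill (df_csv.ffill() in newer pandas)
df_csv.fillna(13)                  # replace NaN with a fixed value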