继续Python处理Excel的常用操作(一)接着分享:
8、多表联合
在excel中,有两个sheet,我们需要将这两个表进行合并处理,简单来说是直接将其中一个复制粘贴到另一个表中,但一般会用到一些函数语句,才能达到最终目的,在Python中,可以这样处理。
首先,先看一下需求,现在有两个数据集,如下图所示,需要将Score表单添加到student表单中,同时需要按照ID列一一对应。
![02cd0b7418d69fd64683d900965b0874.png](https://i-blog.csdnimg.cn/blog_migrate/cbf3ecaf41e9265336cd0c77c633a760.jpeg)
![3dc5a84d5ba29fc291888b81475d81c4.png](https://i-blog.csdnimg.cn/blog_migrate/c8ef8f334e9e55482da0eefd564b763d.png)
因此,需要先读取两个数据集,利用merge()函数进行相关操作,具体可见pandas文档,本次案例简单来说,便是按照student表的ID列左边连接score表,连接后发现有空值,以0填充,最后将文本类型改为整形。
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Students')
scores = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Scores')
table = students.merge(scores, how='left', on='ID').fillna(0)
table.Score = table.Score.astype(int)
print(students,scores,table)
# 最终输出结果如下:
ID Name Score
0 1 Student_001 81
1 3 Student_003 83
2 5 Student_005 85
3 7 Student_007 87
4 9 Student_009 89
5 11 Student_011 91
6 13 Student_013 93
7 15 Student_015 95
8 17 Student_017 97
9 19 Student_019 99
10 21 Student_021 0
11 23 Student_023 0
12 25 Student_025 0
13 27 Student_027 0
14 29 Student_029 0
15 31 Student_031 0
16 33 Student_033 0
17 35 Student_035 0
18 37 Student_037 0
19 39 Student_039 0
第二种方法:本文上一种是按照系统自带的index进行操作,现在我将ID列作为DateFrame的index,因此需要变化一下merge()函数的参数。
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Students', index_col='ID')
scores = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Scores', index_col='ID')
table = students.merge(scores, how='left', left_on=students.index, right_on=scores.index).fillna(0)
table.Score = table.Score.astype(int)
print(table)
key_0 Name Score
0 1 Student_001 81
1 3 Student_003 83
2 5 Student_005 85
3 7 Student_007 87
4 9 Student_009 89
5 11 Student_011 91
6 13 Student_013 93
7 15 Student_015 95
8 17 Student_017 97
9 19 Student_019 99
10 21 Student_021 0
11 23 Student_023 0
12 25 Student_025 0
13 27 Student_027 0
14 29 Student_029 0
15 31 Student_031 0
16 33 Student_033 0
17 35 Student_035 0
18 37 Student_037 0
19 39 Student_039 0
第三种方法:利用join()函数,具体可见文档,至于一些表连接的函数对比分析可见下面这篇文章。
yeayee:Pandas Dataframe操作concat、join、mergezhuanlan.zhihu.com![5af86ff9bf4075a7bb4bc60cb7713af8.png](https://i-blog.csdnimg.cn/blog_migrate/d9922f0cb168003a9fbec7259c77dee1.jpeg)
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Students', index_col='ID')
scores = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel16Student_Score.xlsx', sheet_name='Scores', index_col='ID')
table = students.join(scores, how='left').fillna(0)
table.Score = table.Score.astype(int)
print(table)
9、数据校验
在excel中,有时会设置单元格数值的范围,比如说设定分数列只能填0-100,填其他数字便报错
![55a6be1544dde05547604fdf77e24e87.png](https://i-blog.csdnimg.cn/blog_migrate/be7ef8ec48ee6d0600cf92835790bbfa.jpeg)
在python中,可以先定义一个函数,然后将此函数通过apply()函数进行数据验证
def score_valication(row):
if not 0 <= row.Score <= 100:
print(f'#{row.ID}tstudent {row.Name} has an invalid score {row.Score}')
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel17Students.xlsx')
# print(students)
students.apply(score_valication, axis=1)
#1 student Student_001 has an invalid score -40
#2 student Student_002 has an invalid score -30
#3 student Student_003 has an invalid score -20
#4 student Student_004 has an invalid score -10
#16 student Student_016 has an invalid score 110
#17 student Student_017 has an invalid score 120
#18 student Student_018 has an invalid score 130
#19 student Student_019 has an invalid score 140
#20 student Student_020 has an invalid score 150
10、数据分列
在excel中有很多数据分列的方法,今天介绍一个简单方法:ctrl+E
![66946004d43b9603899bbc47c3a56f27.png](https://i-blog.csdnimg.cn/blog_migrate/e1ebd026708b3d2a9a748542a453d002.jpeg)
![eb7a1e910854fcbab41941c1326626c9.png](https://i-blog.csdnimg.cn/blog_migrate/e58d4ece3cec9a613141d2e284ed0816.jpeg)
在python中也可以进行相关操作,利用split()函数。
employees = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel18Employees.xlsx', index_col='ID')
df = employees['Full Name'].str.split(expand=True)
employees['First Name'] = df[0]
employees['Last Name'] = df[1]
print(employees)
Full Name First Name Last Name
ID
1 Syed Abbas Syed Abbas
2 Catherine Abel Catherine Abel
3 Kim Abercrombie Kim Abercrombie
4 Kim Abercrombie Kim Abercrombie
5 Kim Abercrombie Kim Abercrombie
6 Hazem Abolrous Hazem Abolrous
7 Sam Abolrous Sam Abolrous
8 Humberto Acevedo Humberto Acevedo
9 Gustavo Achong Gustavo Achong
10 Pilar Ackerman Pilar Ackerman
11 Pilar Ackerman Pilar Ackerman
12 Aaron Adams Aaron Adams
13 Adam Adams Adam Adams
14 Alex Adams Alex Adams
15 Alexandra Adams Alexandra Adams
16 Allison Adams Allison Adams
17 Amanda Adams Amanda Adams
18 Amber Adams Amber Adams
19 Andrea Adams Andrea Adams
20 Angel Adams Angel Adams
11、统计求和求平均
在excel中,计算平均数和总和是很常见的,大都是利用average()和sum()函数
![0f076415e4e0670ed195bc660b3447ce.png](https://i-blog.csdnimg.cn/blog_migrate/c39128459f5fd96e08891a3496b9900b.jpeg)
在python中,sum()代表求和,mean()代表平均数
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel19Students.xlsx', index_col='ID')
students['Total'] = students[['Test_1', 'Test_2', 'Test_3']].sum(axis=1)
students['Average'] = students[['Test_1', 'Test_2', 'Test_3']].mean(axis=1)
col_mean = students[['Test_1', 'Test_2', 'Test_3', 'Total', 'Average']].mean()
col_mean['Name'] = 'Summary'
students = students.append(col_mean, ignore_index=True)
print(students)
Name Test_1 Test_2 Test_3 Total Average
0 Student_001 62.00 86.00 83.0 231.0 77.000000
1 Student_002 77.00 97.00 78.0 252.0 84.000000
2 Student_003 57.00 96.00 46.0 199.0 66.333333
3 Student_004 57.00 87.00 80.0 224.0 74.666667
4 Student_005 95.00 59.00 87.0 241.0 80.333333
5 Student_006 56.00 97.00 61.0 214.0 71.333333
6 Student_007 64.00 91.00 67.0 222.0 74.000000
7 Student_008 96.00 70.00 48.0 214.0 71.333333
8 Student_009 77.00 73.00 48.0 198.0 66.000000
9 Student_010 90.00 94.00 67.0 251.0 83.666667
10 Student_011 62.00 55.00 63.0 180.0 60.000000
11 Student_012 83.00 76.00 81.0 240.0 80.000000
12 Student_013 68.00 60.00 90.0 218.0 72.666667
13 Student_014 82.00 68.00 98.0 248.0 82.666667
14 Student_015 61.00 67.00 91.0 219.0 73.000000
15 Student_016 59.00 63.00 46.0 168.0 56.000000
16 Student_017 62.00 83.00 93.0 238.0 79.333333
17 Student_018 90.00 75.00 80.0 245.0 81.666667
18 Student_019 100.00 95.00 55.0 250.0 83.333333
19 Student_020 61.00 87.00 100.0 248.0 82.666667
20 Summary 72.95 78.95 73.1 225.0 75.000000
12、消除重复值
在excel中,消除重复值可以利用条件格式
![b3442a232f929523634a1de925f462bf.png](https://i-blog.csdnimg.cn/blog_migrate/1cae311294aa0aac46d550d57cb91b3a.jpeg)
在python中,可以利用drop_duplicates()函数
students = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel020Students_Duplicates.xlsx')
students.drop_duplicates(subset='Name', inplace=True, keep='last')
print(students)
ID Name Test_1 Test_2 Test_3
5 6 Student_006 56 97 61
6 7 Student_007 64 91 67
7 8 Student_008 96 70 48
8 9 Student_009 77 73 48
9 10 Student_010 90 94 67
10 11 Student_011 62 55 63
11 12 Student_012 83 76 81
12 13 Student_013 68 60 90
13 14 Student_014 82 68 98
14 15 Student_015 61 67 91
15 16 Student_016 59 63 46
16 17 Student_017 62 83 93
17 18 Student_018 90 75 80
18 19 Student_019 100 95 55
19 20 Student_020 61 87 100
20 21 Student_001 62 86 83
21 22 Student_002 77 97 78
22 23 Student_003 57 96 46
23 24 Student_004 57 87 80
24 25 Student_005 95 59 87
13、行列转换
在excel中,进行行列转换,可以有一些操作,但是我还是不太会,就直接上python代码:
![984f0e230695399c013ad4707bc16f7d.png](https://i-blog.csdnimg.cn/blog_migrate/52f2974227ba62564d54e8ceb40539ae.png)
pd.options.display.max_columns = 999
videos = pd.read_excel(r'C:UsersAdministratorDesktoppythonprojectexcel021Videos.xlsx', index_col='Month')
# table = videos.transpose()
table = videos.T
print(table)
Month Jan Feb Mar Apr May Jun Jul Aug Sep
Active User 1000 1500 2250 3375 5063 7595 11393 17090 25635
Page View 3500 5250 7875 11813 17721 26583 39876 59815 89723
Month Oct Nov Dec
Active User 38453 57680 86520
Page View 134586 201880 302820