[Python Office Automation] Quickly splitting an Excel file in bulk by the distinct values of a column

Article link: https://blog.csdn.net/weixin_44216391/article/details/104750623

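At work you may often run into this situation: for data-security reasons, the full dataset cannot be sent to every recipient. Instead, it has to be split by channel (identified by a field), and each part sent to the corresponding recipient.

If this kind of task comes up frequently and each run handles a large volume of data, it is worth writing a small automation script that does the split in one step, hence this post. After splitting, the files also need to be emailed automatically to the different recipients, but that is not covered here.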
—— —— —— —— —— —— ——
Here I use rental listing data scraped from 贝壳找房 (Beike Zhaofang), and assume the Excel file needs to be split by apartment brand.

The core code is:

import pandas as pd

data = pd.read_excel(
    "D:\\Python_Efficiency\\sources\\20200305广州租房_贝壳找房_后羿采集器.xlsx")
data.head(3)

# Rows with a missing brand are skipped here; fill them first (as in the full
# listing further down) if they should be exported too.
department_list = data["brand"].dropna().unique()   # distinct apartment brands, in order of first appearance

for department in department_list:
    # Select all rows of this brand at once instead of appending them row by row
    new_df = data[data["brand"] == department].reset_index(drop=True)

#     new_df.to_excel(str(department)+".xls", sheet_name=department, index = False)   # save each apartment brand as a new Excel file
    new_df.to_excel("D:\\Python_Efficiency\\sources\\20200305广州租房_贝壳找房_后羿采集器_拆分结果\\"+ str(department)+ "-20200305广州租房_贝壳找房.xlsx", sheet_name=str(department), index = False)   # save each apartment brand as a new Excel file
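
The same split can also be written with `DataFrame.groupby`, which walks the table once instead of filtering it once per brand. The following is only a minimal sketch on a small made-up table standing in for the rental spreadsheet; the helper name `split_by_column` and the sample data are assumptions, not part of the original script.

```python
import pandas as pd


def split_by_column(df, column):
    """Return {value: sub-DataFrame} for each distinct non-missing value in `column`."""
    # sort=False keeps the groups in order of first appearance;
    # groupby skips rows whose key is missing (NaN/None) by default.
    return {
        key: group.reset_index(drop=True)
        for key, group in df.groupby(column, sort=False)
    }


# Made-up sample data standing in for the real spreadsheet (assumption).
sample = pd.DataFrame({
    "title": ["flat 1", "flat 2", "flat 3", "flat 4"],
    "brand": ["链家", "自如", "链家", None],
})

parts = split_by_column(sample, "brand")
for brand, part in parts.items():
    print(brand, len(part))
    # To write the files, something like this could follow:
    # part.to_excel(f"{brand}.xlsx", sheet_name=str(brand), index=False)
```

Because `groupby` drops missing keys by default, rows without a brand would still need to be filled beforehand if they should end up in a file of their own.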

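However, since I noticed some missing values, I cleaned them up a bit before splitting.
The split has been tested and works; the result is shown below:
[Image: screenshot of the split result]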

—— —— —— —— —— —— —— ——

The complete code, including the data-cleaning steps, is attached below for reference.

# 2020-03-06 sufm#

import pandas as pd

data = pd.read_excel(
    "D:\\Python_Efficiency\\sources\\20200305广州租房_贝壳找房_后羿采集器.xlsx")
data.head(3)

	标题	content__list--item--des	content__list--item-price	brand
0	整租·万科山景城 2室2厅北	黄埔-科学城-万科山景城\n /\n 65㎡\n ...	2200 元/月	NaN
1	整租·鑫润花园 1室1厅复式东	番禺-市桥北-鑫润花园\n /\n 54㎡\n /...	2300 元/月	链家
2	整租·金道花园 1室1厅东	荔湾-鹤洞-金道花园\n /\n 43㎡\n /东...	1500 元/月	NaN

rows = data.shape[0]  # number of rows; shape[1] gives the number of columns
rows

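a=data["brand"].unique() # unique values in the brand column (NaN included)
b=len(a)   # number of distinct values in the brand column (NaN counts as one)
c=data["brand"].value_counts() # occurrences of each value in the brand column (NaN excluded by default)
print("[Unique values in the brand column]",a,"\n","-- -- --")
print("[Number of distinct values in the brand column]",b,"\n","-- -- --")
print(c)

# The brand column turns out to contain NaN values. To make them easier to read, they are replaced below with the Chinese label "A非品牌中介租房" ("A: non-brand agency rental").

[Unique values in the brand column] [nan '链家' '自如' '广州悠家公寓' '邻佑' '渥公寓' '一屋inroom公寓' '建方长租' '龙泉公寓' '团创物业'
 '惠庭公寓' '晟家公寓' '迹寓' '烽寓' '墨菲公寓' '迈尚公寓' 'Junyu君寓' '逗号公寓' '汉仕公寓' '中升长租'
 '爱守候公寓' '星河盟客公寓' '译家公寓' '安屋' '52团租' '广州领寓' '悠家青年公寓' '她寓' '常盛公寓' '柠檬小居'
 '君立国际公寓' '森悦创享'] 
 -- -- --
[Number of distinct values in the brand column] 32 
 -- -- --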
自如            740
链家            734
广州悠家公寓         63
Junyu君寓        39
52团租           22
建方长租           22
一屋inroom公寓     22
龙泉公寓           19
邻佑             18
晟家公寓           17
中升长租           17
烽寓             16
迈尚公寓           13
墨菲公寓           13
渥公寓             9
她寓              7
迹寓              6
悠家青年公寓          5
常盛公寓            5
广州领寓            4
安屋              4
汉仕公寓            3
团创物业            2
译家公寓            2
君立国际公寓          2
逗号公寓            2
爱守候公寓           1
惠庭公寓            1
星河盟客公寓          1
柠檬小居            1
森悦创享            1
Name: brand, dtype: int64
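
One detail in the output above: `len(data["brand"].unique())` reports 32 because `unique()` includes NaN, while `value_counts()` drops NaN by default and therefore lists only 31 brands. A small sketch of the difference, using a made-up Series rather than the real data (the sample values are an assumption):

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for data["brand"] (assumption).
brand = pd.Series(["链家", np.nan, "自如", "链家", np.nan], name="brand")

n_with_nan = len(brand.unique())               # NaN counts as one value here
n_without_nan = brand.nunique()                # nunique() ignores NaN by default
counts_all = brand.value_counts(dropna=False)  # keeps a row for NaN as well

print(n_with_nan, n_without_nan)
print(counts_all)
```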

data.isnull().sum(axis = 0)    # count the missing values in each column
# The brand column has 1156 missing values.

标题                              0
content__list--item--des        0
content__list--item-price       0
brand                        1156
dtype: int64

data.fillna(value={"brand":"A非品牌中介租房"},inplace=True)    
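# fillna fills the missing values in the brand column with "A非品牌中介租房"; inplace=True is needed so that the original table is modified instead of a new one being returned.

# Reference example this was adapted from:
# data3.fillna(value = {'gender': data3['gender'].mode()[0], # replace missing gender with the mode
# 'age':data3['age'].mean() # replace missing age with the mean
# },
# inplace = True# modify the data in place
# )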
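
An equivalent way to fill the column without `inplace=True` is to assign the result of `fillna` back to the column. A minimal sketch on made-up data (the sample frame is an assumption; the fill label is the one used above):

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the rental table (assumption).
df = pd.DataFrame({"brand": ["链家", np.nan, "自如", np.nan]})

# fillna returns a new Series; assigning it back updates the column.
df["brand"] = df["brand"].fillna("A非品牌中介租房")

print(df["brand"].tolist())
```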

data.head(3)

	标题	content__list--item--des	content__list--item-price	brand
0	整租·万科山景城 2室2厅北	黄埔-科学城-万科山景城\n /\n 65㎡\n ...	2200 元/月	A非品牌中介租房
1	整租·鑫润花园 1室1厅复式东	番禺-市桥北-鑫润花园\n /\n 54㎡\n /...	2300 元/月	链家
2	整租·金道花园 1室1厅东	荔湾-鹤洞-金道花园\n /\n 43㎡\n /东...	1500 元/月	A非品牌中介租房

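# data.brand = data.brand.map({"A非品牌中介租房":"B非品牌中介租房"}) # try replacing a value
# # map replaces "A非品牌中介租房" with "B非品牌中介租房" in the brand column; because the result is assigned back, no inplace argument (or anything similar) is needed to change the original table.
# # However, every value other than "A非品牌中介租房", for example brand names such as "链家", is not in the dict and becomes NaN, which does more harm than good. A condition is needed so that the replacement with "B" only happens where brand is "A"; I won't go deeper into that here.
# data.head(10)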
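
For the conditional replacement mentioned in the comment above, `Series.replace` is one option: unlike `map` with a dict, it leaves values that are not in the dict untouched. A minimal sketch on a made-up Series (the sample values are an assumption):

```python
import pandas as pd

# Made-up sample standing in for data["brand"] (assumption).
brand = pd.Series(["A非品牌中介租房", "链家", "自如"], name="brand")
mapping = {"A非品牌中介租房": "B非品牌中介租房"}

mapped = brand.map(mapping)        # values missing from the dict become NaN
replaced = brand.replace(mapping)  # values missing from the dict are kept

print(mapped.tolist())
print(replaced.tolist())
```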

department_list = data["brand"].unique()   # distinct apartment brands, in order of first appearance

for department in department_list:
    # Select all rows of this brand at once instead of appending them row by row
    new_df = data[data["brand"] == department].reset_index(drop=True)

#     new_df.to_excel(str(department)+".xls", sheet_name=department, index = False)   # save each apartment brand as a new Excel file
    new_df.to_excel("D:\\Python_Efficiency\\sources\\20200305广州租房_贝壳找房_后羿采集器_拆分结果\\"+ str(department)+ "-20200305广州租房_贝壳找房.xlsx", sheet_name=str(department), index = False)   # save each apartment brand as a new Excel file
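
When the split column holds arbitrary text, the value ends up in both the file name and the sheet name, and Excel limits sheet names to 31 characters and rejects the characters `[ ] : * ? / \`. The sketch below shows one possible way to guard against that using only the standard library; the helper names `safe_sheet_name` and `output_path`, the `"Sheet1"` fallback, and the `out` folder are assumptions for illustration and do not come from the original script.

```python
import re
from pathlib import Path

# Characters Excel rejects in sheet names; names are also capped at 31 characters.
_INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")


def safe_sheet_name(value):
    """Turn any value into a string usable as an Excel sheet name."""
    name = _INVALID_SHEET_CHARS.sub("_", str(value))
    return name[:31] or "Sheet1"  # fall back to a default if nothing is left


def output_path(out_dir, value, suffix):
    """Build <out_dir>/<value>-<suffix>.xlsx, replacing characters unsafe in file names."""
    stem = re.sub(r'[\\/:*?"<>|]', "_", str(value))
    return Path(out_dir) / f"{stem}-{suffix}.xlsx"


print(safe_sheet_name("A/B:公寓"))
print(output_path("out", "链家", "20200305").name)
```

Inside the loop, these could be passed to `new_df.to_excel(output_path(...), sheet_name=safe_sheet_name(department), index=False)`. Calling `Path(out_dir).mkdir(parents=True, exist_ok=True)` before the loop creates the output folder if it does not exist yet, since `to_excel` does not create missing directories.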