Business requirement: we have two datasets that need cleaning and preprocessing. Regex alone is not enough to handle them, so we fuzzy-match the two datasets against each other to resolve equivalent entries to a single canonical value.
Imports
- fuzzywuzzy is a library for fuzzy string matching.
- tqdm is a library for displaying progress bars.
- re is Python's regular-expression module, used for string matching and substitution.
- multiprocess is a multiprocessing library for Python (a third-party fork of the standard multiprocessing module), used here to parallelize the data processing.
import pandas as pd
from fuzzywuzzy import fuzz
from tqdm import tqdm
import re
import multiprocess as mp
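To build intuition for the 0-100 scores that fuzzywuzzy's scorers return, here is a minimal pure-Python sketch using the standard library's difflib. This is an illustration of the idea only, not fuzzywuzzy's actual implementation; both helper names are hypothetical:

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    """Rough analogue of fuzz.ratio: 0-100 character-level similarity."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def token_set_style(a: str, b: str) -> int:
    """Rough analogue of fuzz.token_set_ratio: compare sorted unique tokens,
    so word order and duplicated words stop mattering."""
    ta = " ".join(sorted(set(a.lower().split())))
    tb = " ".join(sorted(set(b.lower().split())))
    return simple_ratio(ta, tb)

print(token_set_style("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # → 100
```

Token-set-style scoring is what lets "fuzzy fuzzy was a bear" match "fuzzy was a bear" perfectly, which is why the code below uses `fuzz.token_set_ratio` for its fallback matching.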
Define the processing function
a. Extract the data from the original_column column and apply some preprocessing, such as removing parenthesized text.
b. Match the data in df_a against the relevant columns of df_b, trying an exact match first and then a fuzzy match, and store matched results in the new_data list.
c. Store unmatched items in the remaining_data list.
d. Update the df's new_column and original_column columns.

def process_dataframe(df):
    df['new_column'] = ''  # initialise the new_column column
    for index, row in tqdm(df.iterrows(), total=len(df), desc=f'Processing {mp.current_process().name}'):
        if pd.notnull(row['original_column']):  # skip rows with an empty original_column
            data = str(row['original_column'])
            # 1. Apply some preprocessing first, such as removing parenthesized text
            processed_data = re.sub(r'\([^)]*\)', '', data)
            matched_indices = []
            new_data = []
            remaining_data = []
            for i, item in enumerate(processed_data.split(', ')):
                # Exact matching, with the word as the smallest unit
                item_words = item.lower().split()
                match_found = False
                for b_item in df_b['target_column']:
                    b_item_words = str(b_item).lower().split()
                    if all(word in b_item_words for word in item_words):
                        new_data.append(b_item)
                        matched_indices.append(i)
                        match_found = True
                        break
                if not match_found:
                    # Fall back to fuzzy matching with fuzzywuzzy
                    highest_match_score = 0
                    highest_match_item = None
                    for b_item in df_b['target_column']:
                        match_score = fuzz.token_set_ratio(' '.join(item_words), str(b_item).lower())
                        if match_score > highest_match_score:
                            highest_match_score = match_score
                            highest_match_item = b_item
                    if highest_match_score >= 80:
                        new_data.append(highest_match_item)
                        matched_indices.append(i)
                    else:
                        remaining_data.append(data.split(', ')[i])
            # Update the relevant columns
            df.at[index, 'new_column'] = ', '.join(new_data)
            df.at[index, 'original_column'] = ', '.join(remaining_data)
    return df
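Step a's parenthesis-stripping can be seen in isolation. This uses the same regex as the function above, with hypothetical sample data:

```python
import re

data = "Acme Corp (formerly Acme Ltd), Beta Industries (HQ: NYC), Gamma LLC"
# Remove every "(...)" group, exactly as process_dataframe does before splitting
processed = re.sub(r'\([^)]*\)', '', data)
print(processed)  # → "Acme Corp , Beta Industries , Gamma LLC"
```

Note the leftover space before each comma: the regex removes only the parentheses and their contents, so a per-item `.strip()` after `split(', ')` may be worth adding. Also beware that a comma inside parentheses would make `processed_data.split(', ')` and `data.split(', ')` disagree in length, which would desynchronize the index `i` used for `remaining_data`.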
Speeding up processing with a process pool
num_processes = mp.cpu_count() * 2
pool = mp.Pool(processes=num_processes)
# Submit both tasks before calling get(): get() blocks, so calling it
# immediately after each apply_async would run the tasks one after another
res_all = pool.apply_async(process_dataframe, args=(df_a.copy(),))
res_link = pool.apply_async(process_dataframe, args=(df_a.copy(),))
df_a_all = res_all.get()
df_a_link = res_link.get()
pool.close()
pool.join()
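The submit-then-collect pattern matters: `get()` blocks until its task finishes, so interleaving `apply_async(...).get()` serializes the pool. A toy sketch with the standard library's multiprocessing (the `square` function is a hypothetical stand-in for process_dataframe):

```python
import multiprocessing as mp  # stdlib stand-in for the multiprocess package

def square(x):
    """Toy stand-in for process_dataframe."""
    return x * x

# On platforms using the "spawn" start method (Windows, recent macOS),
# this code must live under an `if __name__ == "__main__":` guard.
pool = mp.Pool(processes=2)
# Submit all tasks first...
results = [pool.apply_async(square, args=(n,)) for n in (3, 4)]
# ...then collect; both tasks have been running concurrently in the meantime
values = [r.get() for r in results]
pool.close()
pool.join()
print(values)  # → [9, 16]
```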
Full code
import pandas as pd
from fuzzywuzzy import fuzz
from tqdm import tqdm
import re
import multiprocess as mp
# Read the data
df_a = pd.read_csv('../Desktop/DataSource_A/Data_A_updated.csv')
df_b = pd.read_csv('../Desktop/DataSource_B/Data_B.csv')
# Define a function to process a single dataframe
def process_dataframe(df):
    df['new_column'] = ''  # initialise the new_column column
    for index, row in tqdm(df.iterrows(), total=len(df), desc=f'Processing {mp.current_process().name}'):
        if pd.notnull(row['original_column']):  # skip rows with an empty original_column
            data = str(row['original_column'])
            # 1. Apply some preprocessing first, such as removing parenthesized text
            processed_data = re.sub(r'\([^)]*\)', '', data)
            matched_indices = []
            new_data = []
            remaining_data = []
            for i, item in enumerate(processed_data.split(', ')):
                # Exact matching, with the word as the smallest unit
                item_words = item.lower().split()
                match_found = False
                for b_item in df_b['target_column']:
                    b_item_words = str(b_item).lower().split()
                    if all(word in b_item_words for word in item_words):
                        new_data.append(b_item)
                        matched_indices.append(i)
                        match_found = True
                        break
                if not match_found:
                    # Fall back to fuzzy matching with fuzzywuzzy
                    highest_match_score = 0
                    highest_match_item = None
                    for b_item in df_b['target_column']:
                        match_score = fuzz.token_set_ratio(' '.join(item_words), str(b_item).lower())
                        if match_score > highest_match_score:
                            highest_match_score = match_score
                            highest_match_item = b_item
                    if highest_match_score >= 80:
                        new_data.append(highest_match_item)
                        matched_indices.append(i)
                    else:
                        remaining_data.append(data.split(', ')[i])
            # Update the relevant columns
            df.at[index, 'new_column'] = ', '.join(new_data)
            df.at[index, 'original_column'] = ', '.join(remaining_data)
    return df
# Create a process pool and process the data
num_processes = mp.cpu_count() * 2
pool = mp.Pool(processes=num_processes)
# Submit both tasks before calling get(): get() blocks, so calling it
# immediately after each apply_async would run the tasks one after another
res_all = pool.apply_async(process_dataframe, args=(df_a.copy(),))
res_link = pool.apply_async(process_dataframe, args=(df_a.copy(),))
df_a_all = res_all.get()
df_a_link = res_link.get()
pool.close()
pool.join()
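The exact-match-then-fuzzy-fallback partition that process_dataframe performs per row can be sketched without pandas or fuzzywuzzy. Here `targets`, `match_items`, and `simple_score` are all hypothetical; `simple_score` is a difflib-based stand-in for `fuzz.token_set_ratio`:

```python
from difflib import SequenceMatcher

targets = ["new york mets", "boston red sox", "chicago cubs"]

def simple_score(a, b):
    # Hypothetical stand-in for fuzz.token_set_ratio (0-100)
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def match_items(raw, threshold=80):
    """Split a comma-separated field and partition items into
    matched (new_data) and unmatched (remaining), like process_dataframe."""
    new_data, remaining = [], []
    for item in raw.split(', '):
        words = item.lower().split()
        # 1. Exact match: every word of the item appears in some target
        exact = next((t for t in targets
                      if all(w in t.lower().split() for w in words)), None)
        if exact is not None:
            new_data.append(exact)
            continue
        # 2. Fuzzy fallback: best-scoring target, kept only above the threshold
        best = max(targets, key=lambda t: simple_score(' '.join(words), t.lower()))
        if simple_score(' '.join(words), best.lower()) >= threshold:
            new_data.append(best)
        else:
            remaining.append(item)
    return new_data, remaining

matched, remaining = match_items("mets new york, chicago cubz, zzz")
print(matched)    # → ['new york mets', 'chicago cubs']
print(remaining)  # → ['zzz']
```

"mets new york" matches exactly despite the word order, "chicago cubz" survives the typo via the fuzzy fallback, and "zzz" falls through to `remaining`, mirroring how unmatched items are written back to original_column.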