Business requirement: we have two datasets that need cleaning and preprocessing. Regex alone is not enough to handle them, so we fuzzy-match the two datasets against each other to resolve equivalent entries to a single canonical value.
Imports
- fuzzywuzzy is a library for fuzzy string matching.
- tqdm is a library for displaying progress bars.
- re is Python's regular-expression module, used for string matching and substitution.
- multiprocess is a multiprocessing library for Python (a third-party fork of the standard multiprocessing module), used here to parallelize the data processing.
import pandas as pd
from fuzzywuzzy import fuzz
from tqdm import tqdm
import re
import multiprocess as mp
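To build intuition for the 0-100 scores that fuzzywuzzy's scorers return, here is a minimal pure-Python sketch using the standard library's difflib. This is an illustration of the idea only, not fuzzywuzzy's actual implementation; both helper names are hypothetical:

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    """Rough analogue of fuzz.ratio: 0-100 character-level similarity."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def token_set_style(a: str, b: str) -> int:
    """Rough analogue of fuzz.token_set_ratio: compare sorted unique tokens,
    so word order and duplicated words stop mattering."""
    ta = " ".join(sorted(set(a.lower().split())))
    tb = " ".join(sorted(set(b.lower().split())))
    return simple_ratio(ta, tb)

print(token_set_style("fuzzy was a bear", "fuzzy fuzzy was a bear"))  # → 100
```

Token-set-style scoring is what lets "fuzzy fuzzy was a bear" match "fuzzy was a bear" perfectly, which is why the code below uses `fuzz.token_set_ratio` for its fallback matching.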
Define the processing function
a. Extract the data from the original_column column and apply some preprocessing, such as removing parenthesized text.
b. Match the data in df_a against the relevant columns of df_b, trying an exact match first and then a fuzzy match, and store matched results in the new_data list.
c. Store unmatched items in the remaining_data list.
d. Update the df's new_column and original_column columns.

def process_dataframe(df):
    df['new_column'] = ''  # initialise the new_column column
    for index, row in tqdm(df.iterrows(), total=len(df), desc=f'Processing {mp.current_process().name}'):
        if pd.notnull(row['original_column']):  # skip rows with an empty original_column
            data = str(row['original_column'])
            # 1. Apply some preprocessing first, such as removing parenthesized text
            processed_data = re.sub(r'\([^)]*\)', '', data)
            matched_indices = []
            new_data = []
            remaining_data = []
            for i, item in enumerate(processed_data.split(', ')):
                # Exact matching, with the word as the smallest unit
                item_words = item.lower().split()
                match_found = False
                for b_item in df_b['target_column']:
                    b_item_words = str(b_item).lower().split()
                    if all(word in b_item_words for word in item_words):
                        new_data.append(b_item)
                        matched_indices.append(i)
                        match_found = True
                        break
                if not match_found:
                    # Fall back to fuzzy matching with fuzzywuzzy
                    highest_match_score = 0
                    highest_match_item = None
                    for b_item in df_b['target_column']:
                        match_score = fuzz.token_set_ratio(' '.join(item_words), str(b_item).lower())
                        if match_score > highest_match_score:
                            highest_match_score = match_score
                            highest_match_item = b_item
                    if highest_match_score >= 80:
                        new_data.append(highest_match_item)
                        matched_indices.append(i)
                    else:
                        remaining_data.append(data.split(', ')[i])
            # Update the relevant columns
            df.at[index, 'new_column'] = ', '.join(new_data)
            df.at[index, 'original_column'] = ', '.join(remaining_data)
    return df
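Step a's parenthesis-stripping can be seen in isolation. This uses the same regex as the function above, with hypothetical sample data:

```python
import re

data = "Acme Corp (formerly Acme Ltd), Beta Industries (HQ: NYC), Gamma LLC"
# Remove every "(...)" group, exactly as process_dataframe does before splitting
processed = re.sub(r'\([^)]*\)', '', data)
print(processed)  # → "Acme Corp , Beta Industries , Gamma LLC"
```

Note the leftover space before each comma: the regex removes only the parentheses and their contents, so a per-item `.strip()` after `split(', ')` may be worth adding. Also beware that a comma inside parentheses would make `processed_data.split(', ')` and `data.split(', ')` disagree in length, which would desynchronize the index `i` used for `remaining_data`.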
Speeding up processing with a process pool
num_processes = mp.cpu_count() * 2
pool = mp.Pool(processes=num_processes)
# Submit both tasks before calling get(): get() blocks, so calling it
# immediately after each apply_async would run the tasks one after another
res_all = pool.apply_async(process_dataframe, args=(df_a.copy(),))
res_link = pool.apply_async(process_dataframe, args=(df_a.copy(),))
df_a_all = res_all.get()
df_a_link = res_link.get()
pool.close()
pool.join()
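The submit-then-collect pattern matters: `get()` blocks until its task finishes, so interleaving `apply_async(...).get()` serializes the pool. A toy sketch with the standard library's multiprocessing (the `square` function is a hypothetical stand-in for process_dataframe):

```python
import multiprocessing as mp  # stdlib stand-in for the multiprocess package

def square(x):
    """Toy stand-in for process_dataframe."""
    return x * x

# On platforms using the "spawn" start method (Windows, recent macOS),
# this code must live under an `if __name__ == "__main__":` guard.
pool = mp.Pool(processes=2)
# Submit all tasks first...
results = [pool.apply_async(square, args=(n,)) for n in (3, 4)]
# ...then collect; both tasks have been running concurrently in the meantime
values = [r.get() for r in results]
pool.close()
pool.join()
print(values)  # → [9, 16]
```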
Full code
import pandas as pd
from fuzzywuzzy import fuzz
from tqdm import tqdm
import re
import multiprocess as mp
# Read the data
df_a = pd.read_csv('../Desktop/DataSource_A/Data_A_updated.csv')
df_b = pd.read_csv('../Desktop/DataSource_B/Data_B.csv')
# Define a function to process a single dataframe
def process_dataframe(df):
    df['new_column'] = ''  # initialise the new_column column
    for index, row in tqdm(df.iterrows(), total=len(df), desc=f'Processing {mp.current_process().name}'):
        if pd.notnull(row['original_column']):  # skip rows with an empty original_column
            data = str(row['original_column'])
            # 1. Apply some preprocessing first, such as removing parenthesized text
            processed_data = re.sub(r'\([^)]*\)', '', data)
            matched_indices = []
            new_data = []
            remaining_data = []
            for i, item in enumerate(processed_data.split(', ')):
                # Exact matching, with the word as the smallest unit
                item_words = item.lower().split()
                match_found = False
                for b_item in df_b['target_column']:
                    b_item_words = str(b_item).lower().split()
                    if all(word in b_item_words for word in item_words):
                        new_data.append(b_item)
                        matched_indices.append(i)
                        match_found = True
                        break
                if not match_found:
                    # Fall back to fuzzy matching with fuzzywuzzy
                    highest_match_score = 0
                    highest_match_item = None
                    for b_item in df_b['target_column']:
                        match_score = fuzz.token_set_ratio(' '.join(item_words), str(b_item).lower())
                        if match_score > highest_match_score:
                            highest_match_score = match_score
                            highest_match_item = b_item
                    if highest_match_score >= 80:
                        new_data.append(highest_match_item)
                        matched_indices.append(i)
                    else:
                        remaining_data.append(data.split(', ')[i])
            # Update the relevant columns
            df.at[index, 'new_column'] = ', '.join(new_data)
            df.at[index, 'original_column'] = ', '.join(remaining_data)
    return df
# Create a process pool and process the data
num_processes = mp.cpu_count() * 2
pool = mp.Pool(processes=num_processes)
# Submit both tasks before calling get(): get() blocks, so calling it
# immediately after each apply_async would run the tasks one after another
res_all = pool.apply_async(process_dataframe, args=(df_a.copy(),))
res_link = pool.apply_async(process_dataframe, args=(df_a.copy(),))
df_a_all = res_all.get()
df_a_link = res_link.get()
pool.close()
pool.join()
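The exact-match-then-fuzzy-fallback partition that process_dataframe performs per row can be sketched without pandas or fuzzywuzzy. Here `targets`, `match_items`, and `simple_score` are all hypothetical; `simple_score` is a difflib-based stand-in for `fuzz.token_set_ratio`:

```python
from difflib import SequenceMatcher

targets = ["new york mets", "boston red sox", "chicago cubs"]

def simple_score(a, b):
    # Hypothetical stand-in for fuzz.token_set_ratio (0-100)
    return round(SequenceMatcher(None, a, b).ratio() * 100)

def match_items(raw, threshold=80):
    """Split a comma-separated field and partition items into
    matched (new_data) and unmatched (remaining), like process_dataframe."""
    new_data, remaining = [], []
    for item in raw.split(', '):
        words = item.lower().split()
        # 1. Exact match: every word of the item appears in some target
        exact = next((t for t in targets
                      if all(w in t.lower().split() for w in words)), None)
        if exact is not None:
            new_data.append(exact)
            continue
        # 2. Fuzzy fallback: best-scoring target, kept only above the threshold
        best = max(targets, key=lambda t: simple_score(' '.join(words), t.lower()))
        if simple_score(' '.join(words), best.lower()) >= threshold:
            new_data.append(best)
        else:
            remaining.append(item)
    return new_data, remaining

matched, remaining = match_items("mets new york, chicago cubz, zzz")
print(matched)    # → ['new york mets', 'chicago cubs']
print(remaining)  # → ['zzz']
```

"mets new york" matches exactly despite the word order, "chicago cubz" survives the typo via the fuzzy fallback, and "zzz" falls through to `remaining`, mirroring how unmatched items are written back to original_column.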