Kaggle Beginner Notes (Day 5: Inconsistent Data Entry)

qq_18884827

Category: kaggle

Article link: https://blog.csdn.net/qq_18884827/article/details/79849733

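Simply put, entries that should be the same thing can end up written in several similar ways, because of inconsistent capitalization, extra spaces, typing mistakes, or different ways of phrasing, so a count reports them as several distinct values. This lesson is mainly about fixing that kind of problem.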

1. Get our environment set up

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# set seed for reproducibility
np.random.seed(0)

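Import the packages. I remember Zhu Wei, from my postgraduate entrance exam prep, saying that "zz" carries a sense of confusion, so fuzzywuzzy here is surely the package for handling inconsistent strings.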

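First, use the method from the previous lesson to check the encoding of the CSV file.

Then open the .csv file with the encoding that was inferred.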
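The two steps above can be sketched roughly as follows. This is only a minimal illustration: the real course file is not named in these notes, so the snippet writes a tiny made-up CSV to a temporary folder (the path `sample_path` and its contents are assumptions) and then lets chardet guess the encoding before pandas reads it.

```python
import os
import tempfile

import chardet
import pandas as pd

# Hypothetical stand-in for the course CSV: a tiny made-up file in a temp folder.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_attacks.csv")
with open(sample_path, "wb") as f:
    f.write(b"id,City\n1,Lahore\n2,Lahore \n3,Lakki Marwat\n")

# Let chardet guess the encoding from the first bytes of the file.
with open(sample_path, "rb") as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)

# Read the file with the guessed encoding.
suicide_attacks = pd.read_csv(sample_path, encoding=result["encoding"])
print(suicide_attacks.head())
```

The guess is only a guess, so if reading fails on a real file it may be worth passing more bytes to `chardet.detect` or trying another encoding by hand.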

2. Do some preliminary text pre-processing

Looking at the City column, we can see 'Lahore' and 'Lahore ', as well as 'Lakki Marwat' and 'Lakki marwat'. Each pair is really one city, yet here it shows up as two separate entries.
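One way to spot such duplicates is to list the unique values of the column in sorted order, so near-identical names end up next to each other. The sketch below uses a small made-up DataFrame (an assumption, not the course data) with a `City` column.

```python
import pandas as pd

# Made-up sample with the kinds of inconsistencies described above.
suicide_attacks = pd.DataFrame(
    {"City": ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat", "Karachi"]}
)

# Sort the unique values so that near-duplicates sit side by side.
cities = sorted(suicide_attacks["City"].unique())
for city in cities:
    # repr() makes trailing spaces visible
    print(repr(city))
```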

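We deal with this by converting every entry to lower case and stripping the whitespace at the start and end of each string.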
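A minimal sketch of that clean-up, assuming a made-up DataFrame with a `City` column (the sample values are illustrative only):

```python
import pandas as pd

# Made-up sample with case and whitespace inconsistencies.
suicide_attacks = pd.DataFrame(
    {"City": ["Lahore", "Lahore ", "Lakki Marwat", "Lakki marwat", "Karachi"]}
)

# Convert to lower case, then remove leading and trailing whitespace.
suicide_attacks["City"] = suicide_attacks["City"].str.lower().str.strip()

cleaned_cities = sorted(suicide_attacks["City"].unique())
print(cleaned_cities)
```

After this step the five raw spellings collapse into three distinct city names.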

3. Use fuzzy matching to correct inconsistent data entry

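That fixes the problem above, but 'd. i khan' and 'd.i khan' should also be the same city, and the previous method cannot catch this. On top of that, 'd.g khan' is a different city and must not be merged with them. This is where fuzzy matching comes in.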

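With this method we can find the ten city names closest to a given string. The number after each name is its similarity score, where 100 means the two strings are identical as far as the scorer is concerned.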

fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

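The first argument is the reference string, the second is the collection of strings to compare against it, the third (limit=10) asks for the ten most similar entries, and the fourth (scorer) is the function used to compute the score.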

Judging from the output, entries that score 90 or above should be merged into one group.

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio (default 90);
    # in each match, match[0] is the city name and match[1] is its score
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # boolean mask of the rows whose value is one of the close matches
    rows_with_matches = df[column].isin(close_matches)

    # overwrite those rows in the given column with the string we matched against
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
