Author: 禅与计算机程序设计艺术
1. Introduction
Data cleaning is a crucial step in any data science or data mining project: raw data must be transformed into a format suitable for further processing and analysis. In this article we demonstrate how machine learning techniques, such as the K-means clustering algorithm and the Naive Bayes classifier, can be used to identify and correct errors in messy data. We apply this automatic cleansing process to real-world datasets, such as credit card transactions and employee records, of the kind commonly found in industries like finance, healthcare, and insurance. We also compare the performance of these techniques with other common data cleansing methods, such as regular-expression-based validation and manual inspection. Finally,