1. Uncleaned data
(1) Dirty data (quality issues), where the content itself is wrong:
- Inaccurate data
- Corrupted data
- Duplicate data
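(2) Messy data (tidiness issues), where the structure is wrong. Tidy data follows three rules, and messy data breaks at least one of them:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table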
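
As a minimal sketch of the tidiness rules above, the snippet below reshapes a small wide table into a tidy one. The rows, the column names (`patient`, `auralin`, `novodra`, `drug`, `start_dose`), and the `melt` helper are all hypothetical and exist only for illustration; it uses plain Python rather than any particular data library.

```python
# Hypothetical wide table: one column per drug, so the variable "drug"
# is hidden in the column headers instead of forming its own column.
wide_rows = [
    {"patient": "A", "auralin": 41, "novodra": None},
    {"patient": "B", "auralin": None, "novodra": 33},
]


def melt(rows, id_key, value_keys, var_name, value_name):
    """Turn one column per category into one row per observation."""
    tidy = []
    for row in rows:
        for key in value_keys:
            if row[key] is not None:  # skip cells that hold no observation
                tidy.append(
                    {id_key: row[id_key], var_name: key, value_name: row[key]}
                )
    return tidy


tidy_rows = melt(wide_rows, "patient", ["auralin", "novodra"], "drug", "start_dose")
for row in tidy_rows:
    print(row)
```

After reshaping, each row holds one observation (one patient on one drug) and each variable has its own column.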
2. Visual assessment
Before cleaning the data, write down the issues found during assessment, for example:
Quality
'patients' table
- zip code is a float, not a string
- zip code sometimes has only four digits
- Tim Neudorf's height is 27 in instead of 72 in
- state is sometimes a full name, sometimes an abbreviation
'treatments' table
- missing HbA1c_change values
- the letter "u" appears in the starting and ending doses for Auralin and Novodra
- given_name and surname are lowercase
- missing records (280 instead of 350)
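
Several of the notes above translate directly into small cleaning steps. The following is a rough sketch in plain Python, not the actual cleaning code for this dataset. The sample rows, the field names `zip_code`, `state`, `height`, `start_dose`, and `end_dose`, and the partial `STATE_ABBREV` mapping are invented for illustration. It also assumes that four-digit zip codes lost a leading zero when stored as floats.

```python
# Hypothetical sample rows; all values are invented for illustration.
patients = [
    {"given_name": "Tim", "surname": "Neudorf", "zip_code": 1234.0,
     "state": "California", "height": 27},
    {"given_name": "Jane", "surname": "Doe", "zip_code": 92390.0,
     "state": "NY", "height": 66},
]
treatments = [
    {"given_name": "tim", "surname": "neudorf",
     "start_dose": "41u", "end_dose": "48u"},
]

# Assumed, deliberately partial mapping of full state names to abbreviations.
STATE_ABBREV = {"California": "CA", "New York": "NY"}


def clean_zip(value):
    """Convert a float zip code to a 5-character string, padding with zeros."""
    return str(int(value)).zfill(5)


def clean_dose(value):
    """Strip the trailing unit letter 'u' and return the dose as an int."""
    return int(value.rstrip("u"))


for p in patients:
    p["zip_code"] = clean_zip(p["zip_code"])
    # Leave the value unchanged if it is already an abbreviation.
    p["state"] = STATE_ABBREV.get(p["state"], p["state"])
    # Correct the single transposed height recorded in the notes.
    if (p["given_name"], p["surname"]) == ("Tim", "Neudorf") and p["height"] == 27:
        p["height"] = 72

for t in treatments:
    t["start_dose"] = clean_dose(t["start_dose"])
    t["end_dose"] = clean_dose(t["end_dose"])
    # str.title() is a simplification; it mishandles names like "McDonald".
    t["given_name"] = t["given_name"].title()
    t["surname"] = t["surname"].title()

print(patients[0])
print(treatments[0])
```

The missing records and missing HbA1c_change values are not addressed here, since they cannot be repaired by reformatting existing values.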