Unclean Data: Low Quality vs. Untidy
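Unclean data suffers from two kinds of problems: low quality and untidiness. The corresponding terms are Low Quality Data (or Dirty Data) and Untidy Data (or Messy Data).
As an analogy, picture a dirty, cluttered room. Dirty data (Low Quality Data/Dirty Data) is like the trash, dust, and banana peels in the room; messy data (Untidy Data/Messy Data) is like the belongings, clothes, and books strewn around it.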
Low Quality Data/Dirty Data
Low quality data, also called dirty data, usually corresponds to content issues.
low quality data = dirty data = content issues
For example:
inaccurate data,
corrupted data,
duplicate data
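The three kinds of content issue listed above can be checked for mechanically. The following is only a minimal sketch: the sample records, the `find_content_issues` helper, the 0 to 120 age range, and the use of the U+FFFD replacement character as a corruption signal are all assumptions made for illustration, not rules from any particular tool.

```python
# Hypothetical sample records; names and values are made up for illustration.
records = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": -5},          # inaccurate: negative age
    {"id": 3, "name": "Car\ufffdl", "age": 41},   # corrupted: undecodable character
    {"id": 1, "name": "Alice", "age": 34},        # duplicate of the first record
]


def find_content_issues(rows):
    """Return a dict mapping each issue type to the row indexes that show it."""
    issues = {"inaccurate": [], "corrupted": [], "duplicate": []}
    seen = set()
    for i, row in enumerate(rows):
        # Assumed plausibility rule: an age must fall between 0 and 120.
        if not 0 <= row["age"] <= 120:
            issues["inaccurate"].append(i)
        # U+FFFD usually marks bytes that could not be decoded.
        if "\ufffd" in row["name"]:
            issues["corrupted"].append(i)
        # A row counts as a duplicate when every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues["duplicate"].append(i)
        seen.add(key)
    return issues


issues = find_content_issues(records)
print(issues)
```

Real checks depend on the domain: what counts as an implausible value or a duplicate has to be decided per dataset.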
Sources of Dirty Data
- We’re going to have user entry errors.
- In some situations, we won’t have any data coding standards, or where we do have standards they’ll be poorly applied, causing problems in the resulting data.
- We might have to integrate data where different schemas have been used for the same type of item.
- We’ll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve. Needs change, and data changes.
- Some of our data won’t have the unique identifiers it should.
- Other data will be lost in transformation from one format to another.
- And then, of course, there’s always programmer error.
- And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that’s not our fault.
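To make the schema-integration source concrete, here is a small sketch of mapping records from two differently named schemas onto one shared schema. The field names, the two mapping tables, and the `to_shared_schema` helper are hypothetical. The helper returns unmapped fields separately so that nothing is silently lost in the transformation.

```python
# Hypothetical field mappings from two source schemas to one shared schema.
CRM_MAP = {"cust_id": "customer_id", "fullName": "name"}
BILLING_MAP = {"CustomerID": "customer_id", "customer_name": "name"}


def to_shared_schema(record, field_map):
    """Rename known fields; collect unknown ones instead of dropping them."""
    renamed, unmapped = {}, {}
    for field, value in record.items():
        if field in field_map:
            renamed[field_map[field]] = value
        else:
            unmapped[field] = value
    return renamed, unmapped


crm_row, crm_extra = to_shared_schema(
    {"cust_id": 7, "fullName": "Ada Lovelace"}, CRM_MAP
)
billing_row, billing_extra = to_shared_schema(
    {"CustomerID": 7, "customer_name": "Ada Lovelace", "fax": None}, BILLING_MAP
)

print(crm_row == billing_row)  # both sources now share one set of field names
print(billing_extra)           # fields that had no mapping, kept for review
```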
Untidy Data/Messy Data
Untidy data, also called messy data, usually corresponds to structural issues.
untidy data = messy data = structural issues
Any data that is not tidy is untidy, so what counts as tidy data?
Tidy data requirements:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table
by Hadley Wickham
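A common violation of the first two requirements is a "wide" table whose column headers are really values of a variable. The sketch below reshapes such a table into tidy form using plain Python; the sample table and the `melt` helper are hypothetical and written only to illustrate the idea.

```python
# Hypothetical untidy table: the year columns hold values of one variable (cases),
# so the "year" variable is hidden in the column headers.
wide = [
    {"country": "A", "1999": 10, "2000": 20},
    {"country": "B", "1999": 30, "2000": 40},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own row (one observation per row)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: column, value_name: value})
    return tidy


tidy = melt(wide, "country", "year", "cases")
for row in tidy:
    print(row)
```

After reshaping, `country`, `year`, and `cases` each form a column, and each row is one observation of one country in one year.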
Sources of Messy Data
Messy data is usually the result of poor data planning or a lack of awareness of the benefits of tidy data.