Unclean Data: Low Quality vs. Untidy

Unclean data suffers from two kinds of problems: low quality and untidiness. These are known as Low Quality Data (Dirty Data) and Untidy Data (Messy Data), respectively.

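By analogy, picture a dirty, cluttered room. Dirty data (Low Quality Data/Dirty Data) is like the garbage, dust, and banana peels in the room; messy data (Untidy Data/Messy Data) is like the belongings, clothes, and books left lying around at random.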

Low Quality Data/Dirty Data

Low quality data (Low Quality Data/Dirty Data) usually corresponds to content issues.

low quality data = dirty data = content issues

For example:
inaccurate data,
corrupted data,
duplicate data
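
As a rough illustration of what such content issues look like in practice, the sketch below flags duplicate rows and implausible values in a small list of records. The records, the field names, and the plausible height range of 50 to 250 cm are all made-up assumptions for this example, not part of any real dataset.

```python
from collections import Counter

# Hypothetical records; names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ann", "height_cm": 165},
    {"id": 2, "name": "Bob", "height_cm": 27},   # implausible height: likely inaccurate
    {"id": 3, "name": "Cy", "height_cm": 180},
    {"id": 1, "name": "Ann", "height_cm": 165},  # exact duplicate of the first row
]


def find_duplicates(rows):
    """Return one copy of every row that appears more than once (all fields compared)."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def find_implausible(rows, field, low, high):
    """Return rows whose value for `field` falls outside the range [low, high]."""
    return [row for row in rows if not (low <= row[field] <= high)]


duplicates = find_duplicates(records)
# The 50-250 cm range is an assumed sanity bound, not a standard.
implausible = find_implausible(records, "height_cm", 50, 250)

print("duplicates:", duplicates)
print("implausible:", implausible)
```

Checks like these only detect the problem; deciding whether to drop, correct, or keep a flagged row still depends on the dataset.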

Sources of Dirty Data

  • We’re going to have user entry errors.
  • In some situations, we won’t have any data coding standards, or where we do have standards they’ll be poorly applied, causing problems in the resulting data.
  • We might have to integrate data where different schemas have been used for the same type of item.
  • We’ll have legacy data systems, where data was coded when disk and memory constraints were much more restrictive than they are now. Over time systems evolve. Needs change, and data changes.
  • Some of our data won’t have the unique identifiers it should.
  • Other data will be lost in transformation from one format to another.
  • And then, of course, there’s always programmer error.
  • And finally, data might have been corrupted in transmission or storage by cosmic rays or other physical phenomena. So hey, one that’s not our fault.

Untidy Data/Messy Data

Untidy data (Untidy Data/Messy Data) usually corresponds to structural issues.

untidy data = messy data = structural issues

Anything that is not tidy data is untidy data. So what is tidy data?

Tidy data requirements:
1. Each variable forms a column
2. Each observation forms a row
3. Each type of observational unit forms a table
by Hadley Wickham
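
To make the first two requirements concrete, here is a minimal sketch that reshapes a small untidy table into tidy form. The table, the column names `country`, `year`, and `cases`, and the helper `melt` are hypothetical and written only for this example; in pandas, `DataFrame.melt` performs a similar wide-to-long reshape.

```python
# Hypothetical untidy table: the headers "2019" and "2020" are values of a
# "year" variable, not variable names, so one row holds two observations.
untidy = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row (wide to long)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            tidy_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return tidy_rows


tidy = melt(untidy, "country", "year", "cases")
for row in tidy:
    print(row)
```

After reshaping, each variable (`country`, `year`, `cases`) has its own column and each row is a single observation of one country in one year.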

(For details on data tidiness issues, see this note.)

Sources of Messy Data

Messy data is usually the result of poor data planning or a lack of awareness of the benefits of tidy data.
