How a Simple Cleaning Framework Helps Startups Organize Data for Growth

Getting data into a clean format can sound confusing. It is the lengthiest part of data hygiene, and it involves a number of steps that a small start-up team may not anticipate. Many startups rely on software, be it a website, an app, or a platform service, and all of that software and activity creates a need for data hygiene to keep the platform, and the business model, operating smoothly.

It can be daunting for a start-up to know where best to start with managing data. But keeping a few concepts in mind can help organize the work needed to prepare data for advanced analysis, such as a regression or a tensor for TensorFlow.

Here are three general concepts for a start-up team to consider. The key benefit of each statement is that it helps frame the right questions, and the tasks that follow from them, for data cleaning.

Clean data is identifiable to you.

This first statement means that when you look at a data table, you understand what the fields are meant to contain and can map how they are arranged. There may be a recognizable ID in each row. You may see duplicate entries that should not exist. In short, your knowledge of the subject associated with the data drives the degree of data literacy needed to clean it.
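
As an illustration of that kind of inspection, here is a minimal Python sketch that checks a table for repeated and empty IDs. The rows, the `customer_id` field name, and both helper functions are hypothetical and not taken from the original article.

```python
from collections import Counter

# Hypothetical rows; the field names are illustrative assumptions.
rows = [
    {"customer_id": 101, "email": "a@example.com"},
    {"customer_id": 102, "email": "b@example.com"},
    {"customer_id": 101, "email": "a@example.com"},
    {"customer_id": None, "email": "c@example.com"},
]

def find_duplicate_ids(rows, id_field):
    """Return the ID values that appear on more than one row."""
    counts = Counter(row[id_field] for row in rows if row[id_field] is not None)
    return sorted(value for value, n in counts.items() if n > 1)

def find_missing_ids(rows, id_field):
    """Return the positions of rows whose ID field is empty."""
    return [i for i, row in enumerate(rows) if row[id_field] is None]

duplicate_ids = find_duplicate_ids(rows, "customer_id")
missing_id_rows = find_missing_ids(rows, "customer_id")
print(duplicate_ids, missing_id_rows)
```

Whether a repeated ID is really an error depends on the subject: in an orders table the same customer may legitimately appear many times, which is why knowledge of the domain matters here.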

Clean data highlights programming formats and libraries

Every language and database has a structure that also dictates how the data is considered in a calculation. For example, in R a NULL is not counted as an element of a vector, whereas a null value may be counted as an element in another programming language. That difference can matter when planning a program that counts the number of observations.
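
To make the counting issue concrete, here is a small Python sketch. In Python a list keeps its `None` entries, so `len` counts them, and a count of real observations has to filter them out explicitly. The sample values and the `is_missing` helper are assumptions made for illustration.

```python
import math

# Hypothetical measurements; None and NaN both stand for "no observation".
observations = [4.2, None, 5.1, float("nan"), 3.8]

# len() counts every slot in the list, including the empty ones.
total_slots = len(observations)

def is_missing(value):
    """Treat None and floating-point NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Count only the slots that hold a real observation.
valid_count = sum(1 for v in observations if not is_missing(v))

print(total_slots, valid_count)
```

Here `total_slots` is 5 and `valid_count` is 3, so a report built on one number will disagree with a report built on the other unless the team agrees on which count it means.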

Thus structure dictates how queries and calculations are created in a given programming language. It can also raise questions about dependencies, the libraries used for specific functions. For example, libraries are used in R to set up a tidy data structure or to import data through an API query. The objective is to account for these structural qualities when planning a data table or an advanced data model.
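
As a rough picture of what a tidy structure means, the sketch below reshapes a wide table (one column per month) into a long one (one row per observation) using plain Python. The table, the field names, and the `to_long` helper are hypothetical; in practice a library would usually do this, which is exactly the kind of dependency worth noting when planning.

```python
# Hypothetical wide table: one row per store, one column per month.
wide_rows = [
    {"store": "A", "jan": 10, "feb": 12},
    {"store": "B", "jan": 7, "feb": 9},
]

def to_long(rows, id_field, value_fields, var_name="month", value_name="sales"):
    """Reshape wide rows into one row per (id, variable) observation."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_rows.append(
                {id_field: row[id_field], var_name: field, value_name: row[field]}
            )
    return long_rows

long_rows = to_long(wide_rows, "store", ["jan", "feb"])
for row in long_rows:
    print(row)
```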

Clean data has no statistically obvious bad details.

"Obvious" can be a subjective term in this case, because you are relying on what is obvious to the professional doing the data cleaning. Here you are working to make explicit the assumptions you are placing on your data so that an analysis can be calculated. Did you decide to impute an average value for a set of missing data? How are outliers classified? What about normalizing a set of data? Is there a need to classify the variables in a table? Are they categorical?
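
The sketch below writes three such assumptions down as code: mean imputation, an outlier rule, and min-max normalization. The sample values and the 1.5 × IQR rule are choices made for this illustration, not recommendations from the original article.

```python
import statistics

# Hypothetical order values; None marks a missing observation.
values = [12.0, 15.0, None, 14.0, 13.0, 95.0, None, 16.0]
observed = [v for v in values if v is not None]

# Assumption 1: fill missing values with the mean of the observed values.
mean_value = statistics.mean(observed)
imputed = [mean_value if v is None else v for v in values]

# Assumption 2: a value is an outlier if it lies more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1, _median, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [v for v in observed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Assumption 3: rescale the imputed values to the range [0, 1].
low, high = min(imputed), max(imputed)
normalized = [(v - low) / (high - low) for v in imputed]

print(mean_value, outliers)
```

In this sample the single outlier (95.0) pulls the imputed mean up to 27.5, well above most observed values. The order in which these decisions are made therefore changes the result, which is a reason to record them.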

Categorical variables are binary in nature (either a given observation is in a category or it is not). Because the parameters of a category are often well defined, categories raise a question: how should outliers of a category be dealt with? You can also encounter unknown categories, ones that you did not anticipate. Categories are sometimes treated as integers. For example, when a customer selects one of three options, do we treat the field as a category or as an integer?
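
One common way to respect the binary nature of categories is one-hot encoding, sketched below with an extra flag for values that were not anticipated. The three plan names and the `one_hot` helper are hypothetical.

```python
# Hypothetical set of three options a customer can select.
KNOWN_PLANS = ["basic", "plus", "pro"]

def one_hot(value, categories=KNOWN_PLANS):
    """Encode a value as binary indicators plus an 'unknown' flag."""
    encoding = {f"plan_{c}": int(value == c) for c in categories}
    # Anything outside the known categories is flagged instead of dropped.
    encoding["plan_unknown"] = int(value not in categories)
    return encoding

encoded = one_hot("plus")
unexpected = one_hot("enterprise")
print(encoded)
print(unexpected)
```

Storing the same field as the integers 1, 2, and 3 would instead imply an order and a distance between the options, which only makes sense if the options really are ranked.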

In short, the statistical details should be accounted for, because the assumptions behind the treatment of the observations need to be clear to others when a table, a model, or a visualization is shared.

Other Considerations

Creating a clean data structure has other impacts. For starters, data hygiene makes more exacting demands of predictive analytics than of reporting dashboards, because machine learning models are strengthened or weakened by the data collected. Very few models can handle empty fields. Some can, but as the quality of a model's input data varies, its output can vary even more wildly. Ultimately, users should know how a model handles missing data.
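
Before data reaches a model, a quick report of how much of each field is empty can show whether missing data is a real concern. The following is a minimal sketch; the rows, the field names, and the choice to treat `None` and the empty string as "empty" are all assumptions for illustration.

```python
# Hypothetical rows destined for a model.
rows = [
    {"age": 34, "city": "Austin"},
    {"age": None, "city": "Boston"},
    {"age": 29, "city": ""},
    {"age": None, "city": "Denver"},
]

EMPTY_VALUES = (None, "")  # assumed definition of an empty field

def missing_report(rows, fields):
    """Return the share of empty values for each field."""
    report = {}
    for field in fields:
        empty = sum(1 for row in rows if row.get(field) in EMPTY_VALUES)
        report[field] = empty / len(rows)
    return report

report = missing_report(rows, ["age", "city"])

# Rows with no empty field at all, which any model could accept as-is.
complete_rows = [
    row for row in rows
    if all(row.get(f) not in EMPTY_VALUES for f in ("age", "city"))
]
print(report, len(complete_rows))
```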

Clean data also raises the topic of privacy compliance. Many privacy regulations, such as GDPR and CCPA, require identifying the parties responsible for the data. GDPR, for example, distinguishes a data controller from a data processor. These are the teams responsible for identifying the impact of data usage, such as how long data is retained, declaring the purpose of data collection, and documenting the associated processes.

Thus many aspects of this clean data checklist dovetail with what privacy compliance is looking for. If an analyst is deciding what is identifiable, it may help to determine which ID fields could be considered Personally Identifiable Information (PII). Such a discussion can help the team better understand which data fields must be part of an identity protection plan.
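
A first pass at that discussion could be a simple name-based scan of the table's fields, as in the sketch below. The hint list, the field names, and the `flag_possible_pii` helper are hypothetical, and a name match is only a prompt for human review, not a compliance determination.

```python
# Hypothetical substrings that often appear in names of personal-data fields.
PII_HINTS = ("name", "email", "phone", "address", "ssn", "birth")

def flag_possible_pii(field_names, hints=PII_HINTS):
    """Return the fields whose names contain one of the hint substrings."""
    return [f for f in field_names if any(h in f.lower() for h in hints)]

fields = ["customer_id", "Email", "signup_date", "phone_number", "plan"]
flagged = flag_possible_pii(fields)
print(flagged)
```

Note that `customer_id` is not flagged by this scan, yet an ID that can be linked back to a person may still count as personal data, so fields the scan misses still need to be reviewed.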

In short, clean data means bringing the data and its assumptions together so that everyone involved in a project can use the data with confidence.

Translated from: https://medium.com/swlh/how-a-simple-cleaning-framework-helps-startups-organize-data-for-growth-7e9693dd1f93
