Before Writing the First Line of Code in a Data Science Project


李_涛


Original article: https://medium.com/@pratim09/before-writing-the-first-line-of-code-in-a-data-science-project-1b1705ae76d5


Surprisingly, in today's data science world, cleaning data and building, tuning, and deploying models are treated as the primary aspects of a project, while a few other key areas usually go unnoticed during the initial phase. This is an attempt to write down some that I have recently encountered and found important to the successful delivery of a project.

Inception of the project:

Engaging with business stakeholders:

Business analysts and senior data science leads are in continuous talks with stakeholders in different verticals such as supply, finance, IT, and HR to stay up to date with the business processes and current challenges in those departments. During these discussions, they propose plausible AI/ML solutions to the existing challenges. This is a key function of an organization's analytics department, because winning the stakeholders' confidence and showcasing the team's capabilities play a huge role in the successful delivery of projects. Once the stakeholders are convinced by the proposed solutions, the project gets their sign-off, and hey presto, we have a viable project in hand.

Use case building:

Define the problem statement:

This is exactly what the name suggests. Defining the problem statement neatly is an important part of the whole project life cycle. The focus should be on defining it simply enough that even a non-technical person can understand it. This takes good writing skills: the problem should be stated in a few short, crisp sentences without missing any important attribute. The problem statement should also be signed off by the stakeholders at the beginning of the project and should typically not change over the project's life cycle.

Define the scope:

Defining the scope early and clearly goes a long way toward the successful delivery of the project. We should also pay special attention to what is out of scope and document it. Getting the stakeholders' sign-off on the scope should not be skipped at this stage.

Assumptions:

We should state the project's assumptions explicitly. Even points that seemed obvious and clearly understood by everyone during meetings should be recorded here. Because the backgrounds of stakeholders and technical people differ vastly, something obvious to the business might not be obvious from the data scientist's perspective, and vice versa; this can lead to undesired results if the assumptions are not clearly spelled out.

Current/Proposed Process:

We need to understand the current workflow of the system: who the key role players are, how the system currently behaves, and where the current impediments lie. At this point it is crucial to draw the current architecture of the system we are dealing with, showing where the proposed AI/ML solution fits into it and how it will impact the business.

World of Data:

Data availability:

Try to find out how the final live data will be made available to the proposed system. The key point is that we should not think only of the minimum viable product (MVP), which may currently be built on static data shared by the business, but also start thinking about how the MVP can be scaled and industrialized so that it helps the business in production systems. Many MVPs fail to scale to the next level because this step is given the least importance at an early stage, and scaling later becomes a humongous task.

At this stage we should also get answers to a few more data-related questions and be confident of them:

What are the current data sources?

What are the current data generation processes?

What are the current data storage and management processes?

What are the current data collection processes, and how frequently is data collected?

Which data tables are involved, and what types of relationships exist among them?

Which data fields are involved in the data?

Finally, we should build a data dictionary with an entry for each data field involved, e.g. Name, Description, Data Type, Mandatory/Optional, Remarks, etc. This document is a savior at various points in the project and should be built in close discussion with the business and signed off by it.
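One way to keep such a data dictionary usable throughout the project is to store it in a machine-readable form, so that incoming records can be checked against it. The following is only a minimal sketch: the `FieldSpec` class, the `validate_record` helper, and the sample fields (`order_id`, `order_date`, `quantity`, `discount_code`) are hypothetical names chosen for illustration and do not come from the original article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry: Name, Description, Data Type, Mandatory/Optional, Remarks."""
    name: str
    description: str
    data_type: type
    mandatory: bool
    remarks: str = ""


# Hypothetical fields, for illustration only.
DATA_DICTIONARY = [
    FieldSpec("order_id", "Unique identifier of an order", str, True),
    FieldSpec("order_date", "Date the order was placed", str, True, "ISO 8601 date"),
    FieldSpec("quantity", "Units ordered", int, True, "Must be positive"),
    FieldSpec("discount_code", "Promotion code, if any", str, False),
]


def validate_record(record: dict, dictionary: list[FieldSpec]) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for spec in dictionary:
        value = record.get(spec.name)
        if value is None:
            # Missing optional fields are fine; missing mandatory ones are not.
            if spec.mandatory:
                problems.append(f"{spec.name}: missing mandatory field")
            continue
        if not isinstance(value, spec.data_type):
            problems.append(
                f"{spec.name}: expected {spec.data_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems
```

Calling `validate_record(record, DATA_DICTIONARY)` on each incoming row gives an early warning when the data shared by the business drifts away from what was agreed and signed off.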

Feasibility study:

Once we get the data from the business, we assess the feasibility of the project. More often than not, it turns out that the shared data does not contain data points related to what the business wants to solve. The feasibility study is therefore a key step in the life cycle, and we conduct it through a data quality check that involves a few parameters.

A few standard data quality checks:

Completeness: Ensuring that there are no undesired gaps in the data.

Accuracy: The data collected is correct and accurately represents what it should.

Timeliness: How up to date is the data? Does the data depict the problem?

Consistency: The data should have a similar format throughout.

Relevancy: Is the available information useful for the problem statement?
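Some of these checks can be approximated with a few lines of code before any modelling starts. The sketch below covers completeness, consistency, and timeliness only; accuracy and relevancy usually need domain knowledge and discussion with the business. The sample rows, the field names, and the choice of an ISO date format as the "consistent" format are assumptions made for illustration, and the functions assume non-empty input.

```python
from datetime import date, datetime

# Hypothetical sample rows, for illustration only.
ROWS = [
    {"order_id": "A1", "order_date": "2024-01-05", "quantity": 3},
    {"order_id": "A2", "order_date": "05/01/2024", "quantity": None},
    {"order_id": "A3", "order_date": "2024-01-09", "quantity": 7},
    {"order_id": "A4", "order_date": None, "quantity": 2},
]


def parse_iso(value):
    """Return a date if value is a 'YYYY-MM-DD' string, otherwise None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def completeness(rows, field):
    """Fraction of rows in which the field is present and not None."""
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def consistency(rows, field):
    """Fraction of non-missing values that follow the assumed ISO date format."""
    values = [row[field] for row in rows if row.get(field) is not None]
    well_formed = sum(1 for value in values if parse_iso(value) is not None)
    return well_formed / len(values)


def staleness_days(rows, field, as_of):
    """Days between the most recent parseable date in the field and as_of."""
    dates = [d for d in (parse_iso(row.get(field)) for row in rows) if d is not None]
    return (as_of - max(dates)).days
```

For the sample rows, `completeness(ROWS, "quantity")` reports that three of the four rows carry a quantity, and `consistency(ROWS, "order_date")` flags that one of the three dates present uses a different format. What counts as an acceptable value for each check is a project decision to agree with the business.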

Success Criteria

Defining agreed success criteria is a must during the initial discussions with the project's stakeholders.

Current performance level:

We should attempt to get the current performance level of the existing process from the business and try to improve on it with the proposed AI/ML solution. The business might not have such numbers or might not be comfortable sharing them, but the question should still be raised during business discussions.

Expected performance level:

Here we define and explain to the business what the KPI will be and what value it is expected to reach for the proposed solution; this determines the success of the project. Some usual KPIs are MAPE, accuracy, precision, and log loss. An example of a success criterion is as follows:

“The success criterion accepted by the business for the current MVP prediction model is: precision should be greater than 0.85.”
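A criterion like this can be written down as an executable check, so that both sides agree on exactly what is being measured. The sketch below assumes binary labels where 1 is the positive class; the function names and the `PRECISION_THRESHOLD` constant are illustrative. Precision is computed as TP / (TP + FP), and MAPE, another KPI mentioned above, as the mean of |actual - forecast| / |actual| expressed as a percentage.

```python
PRECISION_THRESHOLD = 0.85  # value taken from the example criterion above


def precision(y_true, y_pred):
    """Precision for binary labels: true positives / all predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        raise ValueError("no positive predictions; precision is undefined")
    return tp / (tp + fp)


def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(errors) / len(errors)


def meets_success_criterion(y_true, y_pred, threshold=PRECISION_THRESHOLD):
    """True if precision is strictly greater than the agreed threshold."""
    return precision(y_true, y_pred) > threshold
```

Writing the comparison as strictly greater than the threshold mirrors the wording of the example; whether a value of exactly 0.85 should pass is the kind of detail worth settling with the business in advance.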

This reminds me of a saying from one of my colleagues: “A project can be successful from a data scientist's perspective but may not be so from the business side.” So defining the problem statement crisply and the success criteria clearly plays a crucial role in the successful delivery of a data science project.

Solution Approach:

Now the core work starts for the data scientists, and various solutions are tried out to meet the business success criteria.

Summary

In conclusion, during the initial stages of building a minimum viable product, there are crucial non-coding areas, as discussed above, on which data scientists should spend some quality time for the successful delivery of a project.
