Roles in an Industrial Big Data Team
Data science is one of the most promising fields of the near future. With recent advances in technology and statistical models, a new wave of data is knocking at our doors and promising a complete revolution. It is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. As diverse as this field sounds, its team has to be just as diverse to carry out tasks efficiently! To understand this better, let's follow the pipeline of a data science project.
The most important part of the job is to understand the business problem at the very beginning. In meetings with clients, a data science professional asks relevant questions, then understands and defines the objectives for the problem that needs to be tackled. Asking plenty of questions in order to understand the project better is one of the many traits of a good data scientist. Next comes data acquisition: gathering and scraping data from multiple sources such as web servers, logs, databases, APIs and online repositories. Finding the right data takes both time and effort.
Once the data is gathered, next comes data preparation, which involves data cleaning and data transformation. Data cleaning is the most time-consuming step, as it means handling many messy scenarios: inconsistent data types, misspelt attributes, missing and duplicate values, and more. In the transformation step, the data is then modified according to mapping rules. In a project, ETL tools are used to perform complex transformations, which also helps the team understand the structure of the data better.
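The cleaning and transformation steps described here can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the misspelt `temprature` attribute, the `COLUMN_MAP` mapping rule and the choice to fill missing values with the mean are all made up for the example, and a real project would more likely rely on a dedicated ETL tool or a data-frame library.

```python
from typing import Optional

# Hypothetical raw records, as they might arrive from several sources.
RAW_ROWS = [
    {"id": "1", "city": "Pune ", "temprature": "21.5"},
    {"id": "2", "city": "delhi", "temprature": ""},
    {"id": "2", "city": "delhi", "temprature": ""},  # duplicate record
    {"id": "3", "city": "MUMBAI", "temprature": "30"},
]

# Assumed mapping rule: source attribute name -> target attribute name.
COLUMN_MAP = {"id": "id", "city": "city", "temprature": "temperature_c"}


def parse_float(text: str) -> Optional[float]:
    """Return a float, or None when the value is missing or unparseable."""
    try:
        return float(text)
    except ValueError:
        return None


def clean(rows):
    """Rename attributes, fix data types, drop duplicates, fill missing values."""
    seen = set()
    cleaned = []
    for row in rows:
        # Rename (misspelt) attributes according to the mapping rule.
        record = {COLUMN_MAP[key]: value.strip() for key, value in row.items()}
        # Enforce consistent data types and formatting.
        record["id"] = int(record["id"])
        record["city"] = record["city"].title()
        record["temperature_c"] = parse_float(record["temperature_c"])
        # Drop duplicate records, keeping the first occurrence.
        key = (record["id"], record["city"], record["temperature_c"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)

    # Fill missing temperatures with the mean of the known values
    # (one simple strategy among many).
    known = [r["temperature_c"] for r in cleaned if r["temperature_c"] is not None]
    mean = sum(known) / len(known) if known else None
    for record in cleaned:
        if record["temperature_c"] is None:
            record["temperature_c"] = mean
    return cleaned


if __name__ == "__main__":
    for record in clean(RAW_ROWS):
        print(record)
```

Keeping the mapping rule in a separate dictionary means the renaming logic can change without touching the cleaning code, which is roughly what ETL tools let a team do at larger scale.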
It is then crucial to understand what can actually be done with the data, and this is what exploratory data analysis (EDA) is for. With the help of EDA, the team defines and selects the feature variables that will be used in model development. Next comes the core activity of a data science project: data modelling. Various machine learning techniques, such as KNN, Naive Bayes, decision trees and support vector machines, are applied to the data in order to identify the model that best fits the business requirement. The candidate models are trained on the training dataset and then tested to select the best-performing one. The team uses languages such as Python, R or SAS to model the data.
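To make the "train, test, pick the best model" step concrete, here is a minimal sketch using KNN, one of the techniques named above, written with only the Python standard library. The toy data, the labels and the candidate values of `k` are invented for illustration. A real project would typically use a machine learning library and a separate validation set rather than choosing a model directly on the test data.

```python
import math
from collections import Counter


def knn_predict(train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


# Made-up toy dataset: (features, label) pairs.
TRAIN = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((0.8, 1.1), "low"),
    ((3.0, 3.2), "high"), ((3.1, 2.9), "high"), ((2.9, 3.0), "high"),
]
TEST = [((1.1, 1.0), "low"), ((3.0, 3.0), "high"), ((0.9, 0.9), "low")]

# Each value of k is treated as one candidate model.
scores = {k: accuracy(TRAIN, TEST, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

if __name__ == "__main__":
    print(scores, "best k:", best_k)
```

The same loop shape applies when the candidates are different algorithms instead of different values of `k`: score each one on held-out data and keep the best performer.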
Now comes the trickiest part: visualisation and communication. The team meets the clients again to communicate the business findings in a simple and effective manner and to convince the stakeholders. Tools such as Tableau, Power BI and QlikView help here to create powerful reports and dashboards. Finally, the model is deployed and maintained. The selected model is tested in a pre-production environment before it is deployed to production, and after a successful deployment the team uses dashboards and reports to get real-time analytics. The team also continues to monitor and maintain the project's performance. This is how a data science project is completed!
Hence, building and structuring a good team is essential to meeting the business needs of an organisation. It should come as no surprise that data science isn't a single field. It is actually three different jobs, with people working together to produce the final answers.
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/ca0e49f08d5c2b06fa1fbf7b911f6f88.png)
Data Engineer
Data engineers control the flow of information. As information architects, they help build specialised data storage systems and infrastructure, and they maintain data access so that the data is easy to obtain and process. Most data engineers are very familiar with SQL, which they use to store and manage large quantities of data. They also use programming languages such as Java, Scala or Python to process data and automate data-related tasks.
Data Analyst
Data analysts describe the present state of things through data. They do this by creating dashboards, running hypothesis tests and visualising data. They often have some background in statistics or computer science, but tend to have less engineering experience than data engineers and less math experience than machine learning scientists. Data analysts use spreadsheets (Excel or Google Sheets) to perform simple analysis on small quantities of data, and SQL (the same language used by data engineers) for large-scale analysis. While data engineers build and configure SQL storage solutions, data analysts use existing databases to consume and summarise data. Analysts also use business intelligence (BI) tools such as Tableau, Power BI or Looker to create dashboards and share their analysis.
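As a rough illustration of how an analyst consumes and summarises data with SQL, the sketch below runs an aggregate query from Python using the standard-library `sqlite3` module. The `orders` table, its columns and its rows are assumptions made up for the example. In practice the table would already exist in a database built by the data engineers.

```python
import sqlite3

# An in-memory database stands in for an existing, engineer-maintained one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [
        ("north", 120.0),
        ("north", 80.0),
        ("south", 50.0),
        ("south", 70.0),
        ("south", 30.0),
    ],
)

# The analyst's part: summarise the data per region.
summary = conn.execute(
    """
    SELECT region,
           COUNT(*)    AS n_orders,
           SUM(amount) AS total,
           AVG(amount) AS average
    FROM orders
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

if __name__ == "__main__":
    for region, n_orders, total, average in summary:
        print(region, n_orders, total, average)
```

A summary like this is the kind of result that would then feed a dashboard in a BI tool.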
Machine Learning Scientist
Machine learning is perhaps the buzziest part of data science. It is used to predict and extrapolate what is likely to be true from what we already know. Machine learning scientists use training data to classify larger, unrulier data. For example, machine learning can help estimate how much a stock may be worth next week, predict which images contain a car through image processing, or identify the sentiment expressed in a tweet through automated text analysis. Machine learning scientists use either Python or R to create predictive models. Both are great programming languages for data science, and a candidate who knows one can likely read code in the other. Note that programming languages aren't as difficult to learn as spoken languages: someone who speaks Hindi might need years to learn to speak Spanish. Programming languages are more like power tools. If we know how to use a power drill, we may not know how to use an electric saw, but we can probably learn with a little training or help.
To summarize:
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/b95994c763b5618c42c911646151369f.png)
Once the roles are well defined for everyone on the team, and once an organisation has hired some data professionals, there are three main ways a data team can be structured.
Isolated
An isolated data team can contain one or several kinds of data employees, but no members from other teams such as engineering or product. This is a great structure for training new team members and for quickly changing which project each member is working on.
Embedded
Alternatively, it can be helpful to use an embedded model, where each data employee is part of a squad that also contains engineers and product managers. This model lets each data employee gain experience with a specific business project, making them a valuable expert.
Hybrid
The hybrid model is similar to the embedded model, but adds a sync for all data employees across all squads. This additional layer of organisation allows for uniform data processes and career development, regardless of which project an employee is assigned to.
Translated from: https://medium.com/analytics-vidhya/what-constitutes-a-perfect-data-team-fb5ceadfffff