Build a Free and Fuss-Free Data Pipeline

Do you process the same data frequently, say weekly or monthly, but lack the resources to purchase fancy software to do so? Read on if you’re in the same boat.

How many of us find ourselves in this situation:

  1. You work on a large set of data and process the same set every month
  2. There is a lot of data cleansing
  3. Your IT department does not have time to help you
  4. Your department or company does not have the money to invest in software for data cleansing
  5. Your department cannot afford MS Flow or MS Data Platform
  6. Or you could write a Python script for the data cleansing, but no one understands it except you, which renders it useless once you move on

WELCOME TO THIS FREE GUIDE

If you tick most of the boxes, welcome to the club of the “DIY non-IT business analyst”. Let me guide you through preparing a data pipeline that is free of charge (FOC), easy to maintain, and understandable by most people (hopefully the new generation is comfortable with Power Query).

If the following describes your current workflow and data volume, the effort is worthwhile:

  1. You work on huge data sets (more than 500k rows)
  2. You desperately need to automate your data cleansing activities
  3. You feed the data to various analytics needs or data models

[Image: Rough plan of a data pipeline from source data to visualization]

Source: Identify the source of the main data you are using and standardize the download format, both the fields and the file type (Excel, CSV or text). Inconsistencies here can derail your later steps once you start automating the process.
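
The article does this work in Excel with Power Query, so the following is only a rough Python sketch of the same idea: checking that each download matches the agreed format before it enters the pipeline. The column names in `EXPECTED_COLUMNS` and the helper `check_download` are hypothetical and would need to match your own export.

```python
import csv
from pathlib import Path

# Hypothetical standard layout; replace with the fields of your own export.
EXPECTED_COLUMNS = ["EmployeeID", "Date", "Location"]
# Assumption: the team agreed on CSV as the single download format.
ALLOWED_SUFFIXES = {".csv"}


def check_download(path: Path) -> list:
    """Return a list of problems found in one downloaded file (empty if none)."""
    problems = []
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"{path.name}: unexpected file type {path.suffix!r}")
        return problems
    # Only the header row is read; the data rows are not inspected here.
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    if header != EXPECTED_COLUMNS:
        problems.append(f"{path.name}: header {header} differs from {EXPECTED_COLUMNS}")
    return problems
```

Running a check like this on every new download catches a changed field or file type before it breaks the automated steps that follow.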

[Image: Folders and files: data pipeline and mini database]

[Image: Monthly download files within the folder]

Pipe it into a folder: In this case, I created a folder, “3.PeopleTrackingMovement”, and dumped all the monthly data into it. For the next step, I created a new Excel workbook (the “Pipe”), “3.PeopleTrackingMovment”, and created a query to read the folder “3.PeopleTrackingMovement”. Since the file format has been standardized, the query only needs to be created once; it then cleans the data and appends it together every month.
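
For readers who think in code, the folder “Pipe” can be sketched in Python as follows. This is not the author’s Power Query query, only an illustration of the same read, clean and append pattern. It assumes the monthly files are CSVs, and `clean_row`, `append_folder` and the `SourceFile` column are hypothetical names.

```python
import csv
from pathlib import Path


def clean_row(row: dict) -> dict:
    """Example cleaning step: trim stray whitespace from field names and values."""
    return {
        key.strip(): (value or "").strip()
        for key, value in row.items()
        if key is not None  # skip overflow cells from rows longer than the header
    }


def append_folder(folder: Path) -> list:
    """Read every CSV in the folder, clean each row, and append them into one list."""
    combined = []
    for path in sorted(folder.glob("*.csv")):  # sorted so files load in a stable order
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                cleaned = clean_row(row)
                cleaned["SourceFile"] = path.name  # record which download the row came from
                combined.append(cleaned)
    return combined
```

Calling `append_folder(Path("3.PeopleTrackingMovement"))` after each monthly download would rebuild the combined data set without changing the code, which mirrors how the query is written once and refreshed every month.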

[Image: Data models and dependencies]

Bring it together: After creating the various “Pipes” (1 to 4), create a “Mini Database”. In this example, “Consolidation” was created to handle further data wrangling on [1] “1.AllTraining” and [3] “3.PeopleTrackingMovment”. [3] “3.PeopleTrackingMovment” holds a huge volume of data, more than 1 million rows, so it is better to create another file to handle the load. Some planning is needed here: you need to know how this file will be used in the bigger visualization so that you don’t have to do much data wrangling at a later stage.

For this use case, we need to know who [3] has done what training [1] over the years. After much review, these two data sets aren’t used in any other “Pipe”.
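
The consolidation step is essentially a join between the two data sets. A minimal Python sketch of that idea, assuming both sets share a hypothetical `EmployeeID` key and the training data has a hypothetical `Course` column, could look like this:

```python
from collections import defaultdict


def training_by_person(people_rows, training_rows, key="EmployeeID"):
    """Inner join: pair each person record with every course recorded for that person."""
    # Index the training records by person so each lookup is a dictionary access.
    trainings = defaultdict(list)
    for row in training_rows:
        trainings[row[key]].append(row["Course"])

    consolidated = []
    for person in people_rows:
        # People with no training records produce no output rows.
        for course in trainings.get(person[key], []):
            consolidated.append({**person, "Course": course})
    return consolidated
```

With monthly people data, the same person appears once per month, so a real consolidation would probably also need to match on a date or period. That is left out here to keep the sketch short.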

Big data models: At this stage you build either a Power BI or an Excel report, pulling in one or two “Mini Databases” and a few “Pipes” to create your final report.

Bringing all the smaller data models together to build your reporting or visualization is one of the final steps. As we tend to build many types of reports or visualizations, it is easier to prepare smaller data models, read them, and build from there. This way, you will not overload the visualization or report or run into performance issues.
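
One way to picture a “smaller data model” is a pre-aggregated summary file that the report reads instead of the raw rows. The sketch below is an assumption about how that could be done in Python, not the author’s workflow. `summarise`, `write_model` and the `Count` column are hypothetical names.

```python
import csv
from collections import Counter
from pathlib import Path


def summarise(rows, group_fields):
    """Count rows per combination of the given fields."""
    counts = Counter(tuple(row[field] for field in group_fields) for row in rows)
    return [dict(zip(group_fields, key), Count=count) for key, count in sorted(counts.items())]


def write_model(rows, path):
    """Write the small model to a CSV file that a report can read."""
    if not rows:
        return  # nothing to write
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

A report that only needs counts per year and course can then load a file with a few hundred rows rather than the full consolidated data.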

This is a free and fuss-free method to prepare your data for regular use, especially if your department doesn’t have dedicated IT personnel or the money to fund specialised ETL software.

If you’re experiencing similar issues preparing data, feel free to drop me a message.

Original article: https://medium.com/@pzy/build-a-free-and-fussy-free-data-pipeline-dee7a01b5980
