The Epic Data Fetch Quest

Original article: https://medium.com/@Randy_Au/the-epic-data-fetch-quest-d2cf99dc8a3b

Special thanks to the person on Twitter who messaged me with this question and is letting me use it as the starting point for a post. Poking at real scenarios is real fun, and I can always take a bit of creative liberty in anonymizing details.

Here’s their (paraphrased) problem statement.

I recently became a data analyst at a company. It looks like I need to do a lot of organization database creation work first. A lot of data is in Excel files in different systems. I want to gather everything, organize it, make it queryable and visualizable for users. Do you have any advice for a DB, tools, ways to design?

So there are two parts to this question: 1) the overt question: what should I do to increase my chances of success if I embark on such an epic quest, and 2) the implied question: is this the best thing to do right now?

Problems like this generally resemble a big RPG. You’re dropped into a brand new world where everything is shiny and important-seeming, and, lacking any stronger storyline quests thrust upon you by the gods, you are sent to fetch objects from all over the world in exchange for unspecified rewards. This is how I wind up spending a hundred hours doing side quests and level grinding, which may be fine as a leisure activity but is perhaps not ideal for your career.

I think people with some project management experience behind them are likely to have alarm bells going off in their heads: “Warning! Unbounded scope!”

So let’s go into it, most important questions first:

This is probably the key question for this endeavor. There are ways to be effective and impactful without embarking on a giant mission with no true end.

Putting everything into “one centralized place that everyone shares and is more powerful” is a seemingly natural goal to have. “Everything is so hard right now because it’s scattered around. It takes work to put things together to even begin doing analysis work, this is why everything is horrible!”

Such projects are also, in my personal experience, where projects go to die. I firmly believe that the second you attach the term “data warehouse” to a project, your chances of failure go up a ton. And I say this as someone who has designed and built/prototyped three of the damned things in my career. I hope there isn’t a fourth.

The reason DW projects are hard has little to do with technical difficulty. They all fit well-known patterns. The tech part is relatively easy.

Data is routed to a centralized system designed for analytics.
The centralized system is just a big database-like service.
There are a lot of annoying ETL jobs to be written for data ingestion.
There are interfaces for pulling data back out and doing analysis.
Finally, there’s lots of maintenance and upkeep.

It’s hard to draw the line between the long line of older data stores and analytics systems and what would resemble a modern data warehouse, but by the 1990s, all the pieces and scale were ready. That’s at least 30 years of technological refinement. Modern systems just take the core idea and do it at better scale, at lower cost, and with more ease than before.
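
The pattern in the list above is small enough to sketch in a few lines. This is only an illustration, not a production design: it assumes a hypothetical CSV export with `order_id`, `customer`, and `amount` columns, and it uses Python’s built-in `sqlite3` module as a stand-in for the centralized database.

```python
import csv
import io
import sqlite3

# Hypothetical export from one source system (in practice, a file on disk).
RAW_CSV = """order_id,customer,amount
1,acme, 120.50
2,ACME,80
3,globex,15.25
"""

def extract(text):
    """Read raw rows from a CSV export."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Normalize types and spelling so rows from different sources agree."""
    return [
        (int(r["order_id"]), r["customer"].strip().lower(), float(r["amount"]))
        for r in rows
    ]

def load(conn, records):
    """Write cleaned rows into the central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the central database
load(conn, transform(extract(RAW_CSV)))

# The "interface for pulling data back out" is plain SQL here.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Even in this toy version, most of the code is the ingestion step, which matches where the tedious part of the work tends to sit.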

Humans created the problem, humans will be needed to solve it

The main reason these projects are hard is that the assumptions ignore the very human part of why things came to be as “messy” as they are in the first place. The desire for a grand “let’s sweep all the ugliness away and get everyone on the same page!” vision partly stems from not understanding that the disparate systems were created by different people, each to fulfill a specific set of needs. Those systems then evolved until they worked reasonably well for their limited scopes. But because they were never coordinated at the beginning, getting them to work together now comes with a huge human cost.

Every input dataset has these things:

an owner with unique needs and a different level of willingness to cooperate
new or slightly different definitions of terms and measurements that need to be understood and reconciled
different access permissions that need to be overcome
different human processes surrounding the data (collection, entry, consumption)

It takes a lot of skill, patience, and resources to work with all these people and get the desired result. Many of those skills have nothing to do with software development or data analysis. So when you’re thinking about gathering literally everything into one system, this is what you’re signing up for.

You’ll be working with Alice from engineering to get access to the analytics side of production systems, and you’ll have to debug all the data issues there as you use the system. You’ll then talk to Bob from customer support to learn about the third-party support ticketing system with custom in-house tooling, only to find that the engineer who maintained it left last month. Then Eve from finance is totally swamped working on the Series B preparation and doesn’t have time to explain how any of the financial data works this quarter.

Once you overcome the hurdles, understand all these data sources, and merge them together into one place, you’re then trapped maintaining the system because you’re the only one who knows how the thing works and no one else is invested in keeping it up to date.

No good deed goes unpunished.

Nothing says a data warehouse must spring forth fully formed from your forehead, ready to provide wisdom to the world. As I described above, attempting that is generally fatal. You might call it head-splitting. Instead, by invoking the time-honored project management spell of “limiting scope”, there are ways to survive.

Identify some core business questions, aim for those first

While any organization will have many functions and processes within it, and an endless sea of analytic possibilities, there should be only a small handful of core functions that everything else supports. Whether it’s production of widgets or sales of subscriptions, some things are inherently more important than others. If it significantly threatens the life of the business when it goes bad, it’s probably worthy of attention.

Identify those systems, and think of some good analyses that could help people understand them better. The goal is to build out something that supports just those core business questions first. It could be understanding costs, sales, repeat sales, getting new customers, keeping current customers, etc. It is most likely not how many coffee filters are used every month in the office (I’m sure you can find some data on this in the office manager’s data).

More often than not, the data that surrounds those core business questions is located in only a small handful of systems. It might even be on a single system, because this is the core of the business and everything grew up around the core. You really only want to work on a single system at a time at best, and two systems at the absolute worst.

Do some analysis before you start coding

Now that your scope is limited to a handful of questions targeting a handful of systems, the next thing you need to do is some basic analysis.

Pick a question, whatever makes you feel excited. You’re going to need that motivation for the next step. You’re going to make a one-off analysis to examine that question. Maybe it’s about customer retention, or time-to-sale, or cost of customer acquisition. Go through the actual mechanics of getting the data, understanding how it works, and most importantly, using the data to create an artifact that someone else finds useful.

This is your prototype.
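
A one-off analysis of this kind can be very small. As a rough sketch, assuming a hypothetical purchase log of `(customer_id, month)` pairs (the data and the `monthly_retention` helper are invented for illustration), a first pass at the customer retention question might look like this:

```python
from collections import defaultdict

# Hypothetical purchase log: (customer_id, month of purchase).
purchases = [
    ("a", "2020-01"), ("b", "2020-01"), ("c", "2020-01"), ("d", "2020-01"),
    ("a", "2020-02"), ("c", "2020-02"), ("e", "2020-02"),
]

def monthly_retention(rows, first_month, next_month):
    """Share of customers who bought in first_month and again in next_month."""
    buyers = defaultdict(set)
    for customer, month in rows:
        buyers[month].add(customer)
    cohort = buyers[first_month]
    if not cohort:
        return None  # no customers to retain; avoid dividing by zero
    return len(cohort & buyers[next_month]) / len(cohort)

rate = monthly_retention(purchases, "2020-01", "2020-02")
print(f"January -> February retention: {rate:.0%}")
```

The calculation is the easy part. Most of the effort goes into learning where the real purchase log lives and what counts as a purchase.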

You’re going to learn so much about how those systems work just by doing this that it will make the rest of the process easier. You’ll most likely have to talk to people and ask them what various bits of data mean, where things are unreliable, and where data collection can be improved. You’ll find weird bugs in the data that will break your analysis pipelines. It will take longer than you would expect, even to do the hackiest, jankiest, throw-away analysis. You will likely be fixing bugs for weeks along the way.
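
Some of those weird bugs can be surfaced early with a few blunt checks that run before any analysis does. The sketch below is an assumption-laden example: the field names (`id`, `signup`, `plan`) and the three rules are hypothetical, and real checks would come out of conversations with the data’s owners.

```python
from datetime import date

# Hypothetical rows pulled from a source system.
rows = [
    {"id": 1, "signup": "2020-03-01", "plan": "basic"},
    {"id": 2, "signup": "2020-13-01", "plan": "basic"},  # impossible month
    {"id": 2, "signup": "2020-03-05", "plan": ""},       # duplicate id, blank plan
]

def find_problems(rows):
    """Return (row id, description) pairs for rows that look wrong."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            problems.append((row["id"], "duplicate id"))
        seen_ids.add(row["id"])
        try:
            date.fromisoformat(row["signup"])
        except ValueError:
            problems.append((row["id"], "unparseable signup date"))
        if not row["plan"].strip():
            problems.append((row["id"], "missing plan"))
    return problems

for row_id, issue in find_problems(rows):
    print(f"row {row_id}: {issue}")
```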

But do not despair. This is all work you would have had to do to build the data warehouse to begin with, but now you are doing it for a concrete purpose instead of for a vague abstract system in the future. There is an end in sight with a single deliverable.

Finally, once you create your analysis result, you can show people. Hopefully they’re excited about what you created because it helped them understand something better, or offered insight. They might even have follow-up questions that you’ll have to go analyze further. But guess what you just did? You’ve built excitement for your project. You’ve shown them how you can help them, and they’re going to be a lot more willing to help you in the future if you need them to change their data processes for your data warehouse.

Automate your analysis

Now that you’ve actually done an analysis end-to-end, you should be familiar enough with the system to have a decent idea of how to automate it. So go ahead and do it.

You’re going to find new issues and hiccups when you do. The most important one is “wait, how do I automate this stuff, and on what system?” If you’re the first person to do this, there may not be any infrastructure for it. Time to put on your data engineer hat and figure stuff out (while working together with the operations folks).
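
One common shape for an automated analysis is a function that recomputes a single day’s result and can safely be rerun, so whatever scheduler you end up with can call it without creating duplicates. This is a minimal sketch under assumptions: the `orders` and `daily_sales` tables and the `run_daily_report` function are hypothetical, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from datetime import date

def run_daily_report(conn, run_date):
    """Recompute one day's summary; safe to rerun for the same date."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_sales "
        "(day TEXT PRIMARY KEY, order_count INTEGER, revenue REAL)"
    )
    count, revenue = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM orders WHERE day = ?",
        (run_date.isoformat(),),
    ).fetchone()
    # INSERT OR REPLACE overwrites any earlier result for this day.
    conn.execute(
        "INSERT OR REPLACE INTO daily_sales VALUES (?, ?, ?)",
        (run_date.isoformat(), count, revenue),
    )
    conn.commit()

# Hypothetical source table, stubbed in memory so the sketch runs as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2020-05-01", 10.0), ("2020-05-01", 5.0), ("2020-05-02", 7.5)],
)

run_daily_report(conn, date(2020, 5, 1))
run_daily_report(conn, date(2020, 5, 1))  # rerunning does not duplicate rows
summary = conn.execute("SELECT * FROM daily_sales").fetchall()
print(summary)
```

Where this function runs, and who gets paged when it fails, are exactly the questions to take to the operations folks.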

This is when you can start evaluating your technical needs. Do you use a simple relational database? (Answer: yes in 99% of cases). What languages do you want to use? Who’s responsible for keeping the systems up?

There are a huge number of details and potential choices to be made here. The only thing I can stress is: consult your engineering partners! You want to avoid being the odd one out.

If you’re the only person who knows R and everyone else is using Java, please don’t use R without a super important, pressing reason. Doing so means volunteering either to be the only person maintaining the system forever or to teach others a language they don’t know. This applies to your other infrastructure too. If you’re a SQL Server shop, don’t spin up a PostgreSQL instance without a good reason.

Luckily, data warehouse solutions are plentiful. Every cloud service provider offers entire solutions that can cost as much money as you’re willing to spend. You can spin up your own using open source tech, including just a simple PostgreSQL database. Pick what works for your environment and budget.

Once you do all this stuff and you finally finish automating, congratulations, you’ve created a data pipeline! You even have users who are waiting to consume the output.

Repeat until you’re tired of all this

Now that you have one system working with an analysis, you can think about expanding. Find a new analysis, a new outcome you want, test it, build it, automate it, and integrate it into your system. Maybe this will involve integrating a new data source into your system, or you might just do a new analysis with the existing data.

Repeat this over and over. It’s sorta exhausting, but again, thanks to having limited scope there’s always an end in sight. You also will have to do less data engineering the second time around.

What about those big data warehouse analytics databases like AWS Redshift and GCP BigQuery?

You could use those if they fit your needs. They’re designed for large volumes of data. They’re often nice in that you can throw CSV files into those systems and they’ll ingest them and present you with a very zippy SQL interface. Since CSV is the lingua franca of data, this can make a lot of your ETL jobs easier. I used to have to write raw MapReduce jobs to do the things Redshift and BQ do in SQL. From that background, those new tools are amazing.

At the same time, you’re going to be paying for those amazing features, both in terms of log storage and the cost of running queries. Those bills can be nontrivial. They’re marketed as the foundation tech of data warehousing solutions, but they’re complete overkill if your data sources are primarily Excel. (Excel only supports ~1 million rows per worksheet, which easily fits in your laptop’s RAM.)
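
A back-of-envelope calculation supports the RAM claim. The row limit below is Excel’s per-worksheet maximum. The column count and bytes per cell are assumptions picked for illustration, and text-heavy sheets or in-memory object overhead will push the real figure higher.

```python
EXCEL_MAX_ROWS = 1_048_576   # per-worksheet row limit (2**20)
ASSUMED_COLUMNS = 20         # assumption: a fairly wide business sheet
ASSUMED_BYTES_PER_CELL = 8   # assumption: each cell stored as a 64-bit number

def worksheet_bytes(rows=EXCEL_MAX_ROWS, columns=ASSUMED_COLUMNS,
                    bytes_per_cell=ASSUMED_BYTES_PER_CELL):
    """Rough raw size of a worksheet held as a dense numeric table."""
    return rows * columns * bytes_per_cell

size_mib = worksheet_bytes() / 2**20
print(f"{size_mib:.0f} MiB")
```

Under these assumptions a completely full worksheet comes to 160 MiB, so even a few dozen of them is a comfortable fit for a laptop.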

My recommendation is to stick with much cheaper, small, local relational databases to start. You can migrate and scale up with some medium amount of discomfort later.

Plan ahead

You can plan ahead a bit in your programming and system architecture. You can avoid writing too much throwaway code because you know your data pipelines and systems are going to be reused for other systems.

Design in a certain amount of flexibility from the start. A few extra layers of abstraction here and there will save you some refactoring in the future. Balancing the level of abstraction is a bit of an art form, and you can certainly go too far, but it only takes a few minutes to split things out into functions and methods instead of one giant mess of spaghetti. Avoid having your present self inflict pain on your future self.

A decent rule of thumb I use is to be fairly UNIX-y in the separation of concerns, with an eye towards making modules of functionality that will represent different data systems in the future. The modules can then be swapped around or iterated over as needed.
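
One way to read that rule of thumb in code is to give every data system the same small interface, so the rest of the pipeline only ever loops over a list of sources. This is a sketch, not a prescribed design: the `Source` protocol, the two example classes, and their hard-coded rows are all hypothetical.

```python
from typing import Iterable, Protocol

class Source(Protocol):
    """Anything that can hand back rows as dictionaries."""
    name: str
    def fetch(self) -> Iterable[dict]: ...

class SalesExport:
    name = "sales"
    def fetch(self):
        # Hypothetical: would read the sales team's spreadsheet export.
        return [{"customer": "acme", "amount": 120.5}]

class SupportTickets:
    name = "support"
    def fetch(self):
        # Hypothetical: would call the ticketing system.
        return [{"customer": "acme", "ticket": 42}]

def collect(sources):
    """Pull from every source and tag each row with where it came from."""
    combined = []
    for source in sources:
        for row in source.fetch():
            combined.append({"source": source.name, **row})
    return combined

all_rows = collect([SalesExport(), SupportTickets()])
print(all_rows)
```

Adding a third system then means writing one more class with a `fetch` method, and `collect` does not change.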

Work with people

People will make or break the process. If what you’re building doesn’t interface with their world, they won’t use it. You’re going to depend on them to give you data, so don’t disrupt their processes without their input. When your pipelines are complete, you’re going to depend on them to warn you if changes are incoming.

If people are invested in your new system because they’re directly getting value out of it, then they’ll be interested in helping you keep all the moving parts running smoothly. Managing this relationship is critical. If people stop caring, things will start breaking.

It will take time

This is not a one-week project. Make sure everyone involved has realistic expectations. Make sure they know things can derail due to bugs and issues.

Then it’s just a matter of continuing on. Good luck!

Originally published on Randy’s free weekly newsletter, Counting Stuff, covering the more mundane, important parts of data and tech.

Translated from: https://medium.com/@Randy_Au/the-epic-data-fetch-quest-d2cf99dc8a3b
