Building a Data Platform to Enable Analytics and AI-Driven Innovation

Businesses realize that as more and more products and services become digitized, there is an opportunity to capture a lot of value by taking better advantage of data. In retail, it could be by avoiding deep discounting by stocking the right items at the right time. In financial services, it could be by identifying unusual activity and behavior faster than the competition. In media, it could be by increasing engagement by offering up more personalized recommendations.

Key Challenges

In my talk at Cloud Next OnAir, I describe a few key challenges that you will have to address in order to lead your company towards data-powered innovation:

  • The size of the data that you employ will increase 30–100% year on year. You are looking at 5x data growth over the next 3–4 years. Do not build your infrastructure for the data you currently have. Plan for growth.

  • 25% of your data will be streaming data. Avoid the temptation of building a batch-only data processing platform. You will want to unify batch and stream processing.

  • Data quality degrades the farther the data gets from the originating team. So, you will have to give domain experts control over the data. Don’t centralize data in IT.

  • The greatest value in ML/AI will be obtained by combining data from across your organization and even data shared by partners. Breaking silos and building a data culture will be key.

  • Much of your data will be unstructured — images, video, audio (chat), and free-form text. You will be building data and ML pipelines that derive insights from unstructured data.

  • AI/ML skills will be scarce. You will have to take advantage of packaged AI solutions and systems that democratize machine learning.

The platform that you build will need to address all of these challenges and serve as an enabler of innovation.

In this article, I will summarize the key points from my talk, and delve into technical details that I didn’t have time to cover. I recommend both watching the talk and reading this article because the two are complementary.

The 5-step journey

Based on our experience helping many Google Cloud customers go through a digital transformation journey, there are five steps in the journey:

Step 1: Simplify operations and lower the total cost of ownership

The first step for most enterprises is to find the budget. Moving your enterprise data warehouse and data lakes to the cloud can save you anywhere from 50% to 75%, mostly by reducing the need to spend valuable time doing resource provisioning. Ephemeral and spiky workloads will also benefit from autoscaling and the cloud economics of pay-for-what-you-use.

But when doing this, make sure you are setting yourself up for success because this is only the first step of the journey. Your goal is not just to save money; it is to drive innovation. You can get the ability to handle more data, more unstructured data, streaming data, and build a data culture (“modernize your data platform”) and save money at the same time by moving to a capable platform. Make sure to pick a platform that is serverless, self-tuning, highly scalable, provides high-performance streaming ingestion, allows you to operationalize ML without moving data, enables domain experts to “own” the data but share it broadly with the organization, and does all this in a robust, secure way.

When it comes to analytics, Google BigQuery is the recommended destination for structured and semi-structured data, and Google Cloud Storage is what we recommend for unstructured data. We have low-risk migration offers to quickly move on-premises data (Teradata/Netezza/Exadata), Hadoop and Spark workloads, and data warehouses like Redshift and Snowflake to BigQuery. Similarly, there are options to capture logs or change streams from transactional databases into the cloud for analytics.

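To make the landing zone concrete, here is a minimal sketch of loading files that have already been staged in Cloud Storage into BigQuery with the Python client library; the project, bucket, dataset, and table names are hypothetical placeholders.

```python
# A minimal sketch of batch-loading staged Parquet files from Cloud Storage into
# BigQuery with the google-cloud-bigquery client. All names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,  # schema is read from the files
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/exports/orders/*.parquet",  # hypothetical staging path
    "my-analytics-project.sales.orders",                # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes
print(client.get_table("my-analytics-project.sales.orders").num_rows, "rows loaded")
```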

Step 2: Break down silos, democratize analytics, and build a data culture

My recommendation to choose the storage layer based on the type of data might seem surprising. Shouldn’t you store “raw” data in a data lake, and “clean” data in a data warehouse? No, that is not a good idea. Data platforms and roles are converging, and you need to be aware that traditional terminology like Data Lake and Data Warehouse can lead to status quo bias and bad choices. My recommendation instead is for you to think about what type of data it is, and choose your storage layer accordingly. Some of your “raw” data, if it is structured, will be in BigQuery, and some of your final, fully produced media clips will reside in Cloud Storage.

Don’t fall into the temptation of centralizing the control of data in order to break down silos. Data quality degrades the further you get from the domain experts. You want to make sure that domain experts create datasets in BigQuery and own buckets in Cloud Storage. This allows for local control, but access to these datasets will be controlled through Cloud IAM roles and permissions. The use of encryption, access transparency, and masking with Cloud Data Loss Prevention can help ensure org-wide security even if the responsibility for data accuracy lies with the domain teams.

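To make the pattern of local ownership with broad sharing concrete, here is a minimal sketch, using the BigQuery Python client, of a domain team granting organization-wide read access to a dataset it owns; the dataset and group names are hypothetical placeholders.

```python
# A minimal sketch of a domain team granting read access to a dataset it owns,
# using the google-cloud-bigquery client. Dataset and group names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("retail-domain.store_inventory")  # hypothetical dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="all-analysts@example.com",  # hypothetical org-wide analyst group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # the domain team remains the owner
```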

Each analytics dataset or bucket will be in a single cloud region (or multi-region such as EU or US). Following Zhamak Dehghani’s nomenclature, you could call such a storage layer a “distributed data mesh” to avoid getting sidetracked by the lake vs. warehouse debate.

Encourage teams to provide wide access to their datasets (“default open”). Owners of data control access to it, subject to org-wide data governance policies. IT teams also have the ability to tag datasets (for privacy, etc.). Cloud IAM is managed by IT, while permissions to each dataset are managed by the data owners. Upskill your workforce so that they are discovering and tagging datasets through Data Catalog, and building no-code integration pipelines using Data Fusion to continually increase the breadth and coverage of your data mesh.

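As an illustration of what discovery across the mesh can look like, here is a minimal sketch of searching Data Catalog from Python; the project ids and search term are hypothetical placeholders.

```python
# A minimal sketch of discovering datasets across domain projects with the
# Data Catalog client (google-cloud-datacatalog). Names are hypothetical.
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["retail-domain", "finance-domain"]  # hypothetical projects
)
results = client.search_catalog(request={"scope": scope, "query": "orders"})

for result in results:
    # Each result points at an entry in the mesh, e.g. a BigQuery table.
    print(result.linked_resource)
```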

One problem you will run into when you build a democratized data culture is that you will start to see analytics silos. Every place a Key Performance Indicator (KPI) is calculated is one more opportunity for it to be calculated the wrong way. So, encourage data analytics teams to build a semantic layer using Looker and apply governance through that semantic layer.

This has the advantage of being multi-vendor and multi-cloud. The actual queries are carried out in the underlying data warehouse, so there is no data duplication.

Regardless of where you store the data, you should bring compute to that data. On Google Cloud, the compute and storage are separate and you can mix and match. For example, your structured data can be in BigQuery, but you can choose to do your processing using SQL in BigQuery, Java/Python Apache Beam in Cloud Dataflow, or Spark on Cloud Dataproc.

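As a sketch of this mix-and-match, the following hypothetical Apache Beam (Python) pipeline reads directly from a table in BigQuery and could be run on Cloud Dataflow; the project, table, and bucket names are placeholders.

```python
# A minimal sketch of bringing a different compute engine (Apache Beam on Cloud
# Dataflow) to data that lives in BigQuery, without copying it first.
# The query, project, and bucket names are hypothetical placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DataflowRunner", project="my-analytics-project",
                          region="us-central1", temp_location="gs://my-temp-bucket/tmp")

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromBigQuery(
            query="SELECT store_id, amount FROM `my-analytics-project.sales.orders`",
            use_standard_sql=True)
        | "KeyByStore" >> beam.Map(lambda row: (row["store_id"], row["amount"]))
        | "SumPerStore" >> beam.CombinePerKey(sum)
        | "Log" >> beam.Map(print)  # printed to worker logs; a real job would write results out
    )
```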

Do not make copies of data.

Step 3: Make decisions in context, faster

The value of a business decision, especially a decision that is made in the long tail, drops with latency and distance. For example, suppose you are able to approve a loan in 1 minute or in 1 day. The 1-minute approval is much, much more valuable than the 1-day turnaround. Similarly, if you are able to make a decision that takes into account spatial context (whether it is based on where the user currently lives, or where they are currently visiting), that decision is much more valuable than one devoid of spatial context.

One goal of your platform should be that you can do GIS, streaming, and machine learning on data without making copies of the data. The principle above, of bringing compute to the data, should apply to GIS, streaming, and ML as well.

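For example, BigQuery's built-in geography functions let you ask spatial questions of the data where it already sits; here is a minimal sketch with a hypothetical table and made-up coordinates.

```python
# A minimal sketch of running a GIS query in place with BigQuery geography
# functions, via the Python client. The table and coordinates are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT store_id, name
FROM `my-analytics-project.sales.stores`
WHERE ST_DWITHIN(location,                      -- GEOGRAPHY column on the table
                 ST_GEOGPOINT(-122.33, 47.61),  -- the user's current position
                 5000)                          -- within 5 km
"""
for row in client.query(query).result():
    print(row.store_id, row.name)
```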

On Google Cloud, you can stream data into BigQuery, and all queries on BigQuery are streaming SQL. Even as you are streaming data into BigQuery, you can carry out time-window transformations (to take into account user and business context) in order to power real-time AI and populate real-time dashboards.

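Here is a minimal sketch of what such a pipeline might look like in Apache Beam (Python): events read from Pub/Sub, aggregated over a fixed time window, and streamed into BigQuery. The topic, table, and field names are hypothetical placeholders.

```python
# A minimal sketch of a streaming Beam pipeline that applies a time-window
# transformation on the way into BigQuery. All names are hypothetical.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions(runner="DataflowRunner", project="my-analytics-project",
                          region="us-central1", temp_location="gs://my-temp-bucket/tmp")
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-analytics-project/topics/transactions")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "KeyByUser" >> beam.Map(lambda event: (event["user_id"], event["amount"]))
        | "OneMinuteWindows" >> beam.WindowInto(window.FixedWindows(60))
        | "SpendPerUser" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"user_id": kv[0], "spend_1m": kv[1]})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-analytics-project:realtime.user_spend",
            schema="user_id:STRING,spend_1m:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```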

Step 4: Leapfrog with end-to-end AI Solutions

ML/AI is software, and like any software, you should consider whether you should build or whether you can buy. Google Cloud’s strategy in AI is to bring the best of Google’s AI to our customers in the form of APIs (e.g. the Vision API) and building blocks (e.g. AutoML Vision, where you can fine-tune the Vision API on your own data, with the advantage that you need much less of it).

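As an example of using a pre-trained building block rather than training your own model, here is a minimal sketch of calling the Vision API on an image that is already in Cloud Storage; the bucket path is a hypothetical placeholder.

```python
# A minimal sketch of calling the pre-trained Vision API on an image stored in
# Cloud Storage. The bucket path is a hypothetical placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(
    source=vision.ImageSource(image_uri="gs://my-bucket/catalog/item-123.jpg")
)

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```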

When it comes to AI (arguably this is true of all tech, but it is particularly apparent in AI because it is so new), every vendor seems to check all the boxes. We really encourage you to look at the quality of the underlying services. No competing natural language or text classifier comes close to the Cloud Natural Language API or AutoML Natural Language. The same holds for our vision, speech-to-text, and other models.

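For instance, classifying free-form text with the Cloud Natural Language API takes only a few lines; the sample text below is made up.

```python
# A minimal sketch of classifying free-form text with the Cloud Natural Language
# API. The sample text is made up.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content=(
        "The central bank held interest rates steady on Thursday, citing slowing "
        "inflation and a resilient labor market, while analysts expect cuts later this year."
    ),
    type_=language_v1.Document.Type.PLAIN_TEXT,
)

response = client.classify_text(request={"document": document})
for category in response.categories:
    print(category.name, category.confidence)
```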

We are also putting together our basic capabilities into higher-value, highly integrated solutions. One example is Contact Center AI, a packaged solution where we do automated call handling, operator assistance, and call analytics. Another is Document AI, where we tie together form parsing and knowledge extraction.

Step 5: Empower data and ML teams with scaled AI platforms

I recommend that you split your portfolio of AI solutions into 3 categories. For many problems, using APIs and building blocks will be sufficient. Build out a data science team to solve AI problems that will uniquely differentiate you and give you sustainable advantage.

Once you decide to build a data science team, though, make sure that you enable them to do machine learning efficiently. This will require the ability to experiment on models in notebooks, capture ML workflows as experiments, deploy ML models using containers, and do CI/CD for continuous training and evaluation. You should use our ML Pipelines for that. It is well integrated with our data analytics platform and with Cloud AI Platform services.

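As a rough sketch of what such a pipeline can look like with the Kubeflow Pipelines (kfp) SDK, here is a hypothetical train-then-deploy workflow; the container images, arguments, and table name are placeholders, and a real pipeline would add evaluation, validation, and CI/CD triggers.

```python
# A minimal sketch of a train-then-deploy ML pipeline expressed with the kfp v1 DSL.
# Container images, arguments, and the feature table are hypothetical placeholders.
import kfp
from kfp import dsl


@dsl.pipeline(name="churn-training", description="Train and deploy a churn model.")
def churn_pipeline(training_table: str = "my-project.sales.churn_features"):
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/churn-trainer:latest",   # hypothetical trainer image
        arguments=["--training-table", training_table],
        file_outputs={"model_path": "/tmp/model_path.txt"},
    )
    dsl.ContainerOp(
        name="deploy",
        image="gcr.io/my-project/model-deployer:latest",  # hypothetical deployer image
        arguments=["--model-path", train.outputs["model_path"]],
    )


if __name__ == "__main__":
    # Compile to a package that can be uploaded to an ML Pipelines instance.
    kfp.compiler.Compiler().compile(churn_pipeline, "churn_pipeline.tar.gz")
```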

At Google Cloud, we will walk with you through every step of this journey. Contact us!

Next Steps

Watch my talk at Cloud Next OnAir.

Here are some articles and white papers that might be useful:

For technical details, see these books

Translated from: https://medium.com/swlh/building-a-data-platform-to-enable-analytics-and-ai-driven-innovation-1bd95e37efb9
