Data Analytics Lifecycle Overview

This article discusses the importance and components of the data analytics pipeline, including the generate, collect, store, and process stages, and highlights the challenge of extracting value from large volumes of diverse data. By understanding these processes, organisations can make better use of data for operations and forecasting.

What comprises a Data Analytics pipeline? Confused by the sheer number of data channels? Don't worry! This blog will try to explain it with ease and efficacy.

Data Ammunition :

“ Just as nearly all mechanical weapons require some form of ammunition to operate, organisations in today's era need data to operate and forecast. ” — Praveen

Data is a strategic asset of every organisation. As data continues to grow, databases are becoming increasingly pivotal to understanding data and converting it into valuable insights. IT leaders and entrepreneurs need to look at the different flavours of data and, based on that, look for ways to get more value from it. With the rapid growth of data — not just in volume and velocity but also in flavour, complexity, and interconnectedness — the needs of data analytics and its corresponding databases have changed.

Before we begin discussing the data analytics pipeline, it's imperative to understand the common data categories and use cases.

Data Analytics Pipeline :

Before data can be analysed, it needs to be generated, collected, stored, and processed. You can think of this as an analytics pipeline that extracts data from source systems, processes the data, and then loads it into data stores where it can be analysed. Analytics pipelines are designed to handle large volumes of incoming data from heterogeneous sources such as databases, applications, and devices.

In general, a data analytics pipeline consists of six sections. We will go through each of these sections in detail and evaluate them accordingly.

1. Generate :

Data is continuously generated by several sources such as IoT devices, web logs, social media feeds, and transaction and ERP systems.

Data Generation Process — Image AWS
  • IoT system : Devices and sensors around the world send messages continuously. Organisations see a growing need today to capture this data and derive intelligence from it. Just like a server, these devices generate logs. From a hardware perspective, the states of the onboard memory, the microcontroller, and any sensors are all described by logs. That data can tell you, among other things, whether the system is functioning as expected.

  • Web Log : Logs generated by web servers. IIS, Apache, Tomcat, WebSphere, NGINX, and every other web engine can generate useful logs.

  • Social Media : Social media logs.

  • Transactional Data : Data such as e-commerce purchase transactions and financial transactions is typically stored in an RDBMS (relational database management system). An RDBMS solution is suitable for recording transactions, especially when a transaction may need to update multiple table rows.

  • NoSQL Data : A NoSQL database is suitable when the data is not structured well enough to fit a defined schema, or when the schema changes very often.

  • ERP : Perceptible logs are important for any software, and an ERP is no exception.
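Most of the sources above ultimately emit plain-text log lines. As a small illustration (the combined log format is the standard one used by Apache and NGINX, but the parsing code itself is my own sketch, not something from the original post), here is how a single access-log line can be turned into structured fields ready for collection:

```python
import re

# Combined Log Format, as emitted (with minor variations) by Apache and NGINX.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_log_line(line):
    """Turn one raw access-log line into a dict of structured fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # drop malformed lines instead of crashing
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record

sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
parsed = parse_log_line(sample)
```

Once each line is a dict like this, downstream collection and analysis stages can treat the log as structured data rather than free text.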

2. Collect :

After data is generated, it needs to be collected somewhere. Web applications, mobile devices, and many software applications and services can generate staggering amounts of streaming data—sometimes terabytes per hour—that need to be collected.

Data Collection — Using Amazon DMS, S3, DataSync & Snowball : Image — AWS
Data Collection — Using Polling Application, Amazon Kinesis and Kafka Stream : Image — AWS

Data is collected using :

  • Amazon DMS (Database Migration Service) : You can use AWS Database Migration Service to consolidate multiple source databases into a single target database. This can be done for homogeneous and heterogeneous migrations, and you can use this feature with all supported database engines. The source database can be located on your own premises outside of AWS, running on an Amazon EC2 instance, or it can be an Amazon RDS database. For more info on AWS migration, please visit my blog on database migration using AWS.

  • Amazon S3 (Simple Storage Service) : One Amazon S3 Source can collect data from a single S3 bucket. However, you can configure multiple S3 Sources to collect from one S3 bucket. For example, you could use one S3 Source to collect one particular data type, and then configure another S3 Source to collect another data type.

  • AWS DataSync : AWS DataSync makes it simple and fast to move large amounts of data online between on-premises storage and Amazon S3, Amazon Elastic File System (Amazon EFS), or Amazon FSx for Windows File Server.

  • AWS Snowcone : You can use Snowcone to collect, process, and move data to AWS, either offline by shipping the device, or online with AWS DataSync (see the point above). AWS Snowcone is the smallest member of the AWS Snow Family of edge computing and data transfer devices. Snowcone is portable, rugged, and secure.

  • AWS Snowball Edge : Snowball Edge Storage Optimised devices provide both block storage and Amazon S3-compatible object storage, and 40 vCPUs. They are well suited for local storage and large-scale data transfer.

  • Amazon Kinesis : Amazon Kinesis is used to collect streaming data. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications.

  • Amazon Managed Streaming for Apache Kafka (Amazon MSK) : With Amazon MSK, you can use native Apache Kafka APIs to populate data lakes, stream changes to and from databases, and power machine learning and analytics applications.
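Streaming collectors like Kinesis and MSK buffer incoming records into batches before delivering them downstream. As a toy sketch of that buffering behaviour in plain Python (the batch size and record shape here are illustrative assumptions, not a real Kinesis API):

```python
import json

def batch_records(records, max_batch_size=3):
    """Group incoming records into fixed-size batches, the way a stream
    collector buffers records before delivering them to storage."""
    batches, current = [], []
    for record in records:
        current.append(record)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:  # flush the final partial batch
        batches.append(current)
    return batches

# Simulated clickstream events, JSON-encoded as a producer would send them.
events = [json.dumps({"user": i, "action": "click"}) for i in range(7)]
batches = batch_records(events, max_batch_size=3)
```

In a real pipeline the producer and collector are separate systems, but the batching trade-off is the same: larger batches mean better throughput, smaller batches mean lower latency.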

3. Store :

After the collection process, it's time to store the data. You can store your data in either a data lake or an analytical store such as a data warehouse. AWS provides several services to store your data, but before going into each of them, let's understand the concepts of the data lake, the data warehouse, and the data mart.

Data Lake : A data lake is a centralised repository for all data, both structured and unstructured. In a data lake the schema is not defined up front, enabling additional types of analytics such as big data analytics, real-time analytics, and machine learning. Data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights. They give organisations the flexibility to use the widest array of analytics and machine learning services, with easy access to all relevant data, without compromising on security or governance.

Data Warehouse : A data warehouse is a centralised repository of information coming from one or more data sources — or your data lake — where data is transformed, cleansed, and deduplicated to fit into a predefined data model. It is primarily designed for data analytics, which involves reading large amounts of data to understand the relationships within it and help find trends.

Data Mart : A data mart is a simple form of a data warehouse focused on a specific functional area or subject matter and contains copies of a subset of data in the data warehouse. For example, you can have specific data marts for each division in your organisation or segment data marts based on regions. You can build data marts from a large data warehouse, operational stores, or a hybrid of the two. Data marts are simple to design, build, and administer.
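To make the warehouse/mart relationship concrete, here is a minimal sketch using Python's built-in sqlite3 module; the sales schema and the region-based split are purely illustrative, not from the original post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny "warehouse" fact table covering the whole organisation.
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", "widget", 120.0), ("EU", "gadget", 80.0),
     ("US", "widget", 200.0), ("APAC", "gadget", 50.0)],
)

# A regional "data mart": a copy of just the EU slice of the warehouse,
# small and focused, as the text above describes.
conn.execute(
    "CREATE TABLE eu_sales_mart AS SELECT * FROM sales WHERE region = 'EU'"
)

mart_rows = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM eu_sales_mart"
).fetchone()
```

The mart holds only the subset a regional team needs, which is exactly why marts are simpler to design and administer than the full warehouse.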

Data Store — Using Amazon S3, RDS and Databases on Amazon EC2 : Image — AWS
  • Amazon S3 : Amazon Simple Storage Service (S3) is the largest and most performant object storage service for structured and unstructured data, and the storage service of choice for building a data lake. With Amazon S3, you can cost-effectively build and scale a data lake of any size in a secure environment. With a data lake built on Amazon S3, you can use native AWS services to run big data analytics, artificial intelligence (AI), machine learning (ML), and media data processing applications to gain insights from your unstructured data sets.

  • Amazon RDS : Amazon RDS is available with a variety of database engines. Data stored in it is highly secure, highly available, and compatible.

Supported engines include Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle, and MS SQL Server.

4. ETL (Extract, Transform, Load) Or Process Data :

This process gathers or extracts data from data sources, transforms the data, and stores it in a separate destination such as another database, a data lake, or an analytics service like a data warehouse (Amazon Redshift), where the data can be processed and analysed.

ETL is the process of pulling or extracting data from multiple sources, transforming the data to fit a defined target schema (schema-on-write), and loading the data into a destination data store. ETL is normally a continuous, ongoing process with a well-defined workflow that runs at specific times, such as nightly. Setting up and running ETL jobs can be a tedious task, and some ETL jobs may take hours to complete.

Similar to ETL, it is also important to understand ELT (Extract, Load, Transform). ELT is a variant of ETL where the extracted data is loaded into the target system before any transformations are made. The schema is defined when the data is read or used (schema-on-read). ELT typically works well when your target system is powerful enough to handle the transformations, and when you want to explore the data in ways not consistent with a predefined format.
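The ETL/ELT distinction can be sketched in a few lines of plain Python (the record shape and field names are illustrative assumptions): ETL enforces the target schema before loading, while ELT stores the raw data and applies the schema only at query time:

```python
import json

# Raw source records, as extracted: amounts arrive as strings.
raw_events = ['{"user": "a", "amount": "10.5"}', '{"user": "b", "amount": "4.5"}']

def etl(lines):
    """ETL: transform to the target schema first, then load (schema-on-write)."""
    warehouse = []
    for line in lines:
        record = json.loads(line)
        record["amount"] = float(record["amount"])  # enforce schema up front
        warehouse.append(record)
    return warehouse

def elt_query_total(raw_store):
    """ELT: the store holds raw lines; the schema is applied when the data
    is queried (schema-on-read), by the powerful target system."""
    return sum(float(json.loads(line)["amount"]) for line in raw_store)

warehouse = etl(raw_events)          # loaded already typed and cleaned
raw_store = list(raw_events)         # ELT "load" is just a copy of the raw data
total = elt_query_total(raw_store)   # transformation happens at read time
```

The trade-off mirrors the text above: ETL pays the transformation cost once, on a schedule; ELT defers it to each query, in exchange for flexibility.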

ETL — Using EMR, Lambda and KCL : Image — AWS
ETL — Using AWS Glue : Image — AWS

Here is a list of some of the Amazon Web Services that can be used for ETL.

  • Amazon EMR (Elastic MapReduce) : EMR is an AWS tool for big data processing and analysis. It uses big data frameworks like Apache Hadoop and Apache Spark. Amazon EMR can be used to quickly and cost-effectively perform data transformation workloads (ETL) such as sort, aggregate, and join on large datasets. We can build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket.

  • AWS Lambda : Lambda lets you run your data pipeline in serverless mode. Serverless ETL is becoming the future for those who want cost-effectiveness and at the same time want to focus on the crux of the application, without having to worry about large infrastructure to power the data pipelines.

  • Amazon Kinesis Client Library : This is one of the methods of developing consumer applications that can process data from a Kinesis data stream. The Kinesis Client Library (KCL) is available in multiple languages such as Java, .NET, and Python.

  • AWS Glue : AWS Glue is a serverless, fully managed, and cloud-optimised ETL service. You just need to point AWS Glue at your data stored in AWS, and it discovers your data and stores the associated metadata (table schemas and definitions) in the AWS Glue Data Catalog. Once catalogued, your data is searchable, queryable, and ready for ETL processing.

AWS Glue — Steps for ETL : Image — AWS

5. Analyse Data :

Now we have reached a stage where we are ready to unveil the real value of data. Let's unlock what is hiding behind your data. A modern analytics pipeline can utilise a variety of tools to unlock the value hidden in the data. We know that one size does not fit all: any analytics tool should be able to access and process any data from the same source — your data lake.

Data can be copied from your data lake into your data warehouse to fit a structured and normalised data model that takes advantage of a high-performance query engine. At the same time, some use cases require analysis of unstructured data in context with the normalised data in the data warehouse. Here, extending data warehouse queries to include data residing in both the data warehouse and the data lake, without the delay of data transformation and movement, is essential to timely insights.

Other big data analytics tools should be able to access the same data in the data lake. So here are the types of analysis required by data scientists or business users, depending on their use cases:

  • Interactive Analysis : Interactive analysis typically uses standard SQL query tools to access and analyse data. End users want fast results and the ability to modify queries quickly and rerun them.

  • Data Warehousing Analytics : Data warehousing provides the ability to run complex analytics queries against large volumes of data — petabytes — using a high-performance, optimised, scalable query engine.

  • Data Lake Analytics : A new breed of data warehouse is emerging that extends data warehouse queries to a data lake to process structured or unstructured data in the data warehouse and data lake and scale up to exabytes without moving data.

  • Big Data Analytics : Big data processing uses the Hadoop and Spark frameworks to process vast amounts of data.

  • Operational Analytics : Operational analytics focuses on improving existing operations and uses data such as application monitoring, logs, and clickstream data.

  • Business Intelligence (BI) : BI software is an easy-to-use application that retrieves, analyses, transforms, and reports data for business decision-making. BI tools generally read data that is stored in an analytics service like a data warehouse or big data analytics system. BI tools create reports, dashboards, and visualisations and enable users to dive deeper into specific data on an ad-hoc basis.

Based on the above, organisations are applying machine learning processes to automate tasks, provide customised services to end users, and increase the efficiency of operations by analysing their data. Generally, you first need to collect and prepare your training data to discover which elements of your data set are important. This is a vast topic; my intention here is just to apprise you of the machine learning processing related to the analysis of data.

Data Analysis using Amazon Services : Image — AWS

There are a few Amazon services which are used to analyse data. I will go through each of them at a high level just to give you an overview:

  • Amazon Athena : Amazon Athena is an interactive query service that makes it easy to analyse data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. Because Athena is a serverless query service, an analyst doesn't need to manage any underlying compute infrastructure to use it. A data analyst accesses Athena through either the AWS Management Console or an application programming interface (API). Here are some features of Amazon Athena:

1. It unites batch and streaming data.

2. Query data in Amazon S3 directly with ANSI SQL.

3. Use CREATE TABLE AS SELECT (CTAS) to create new tables from the result of a SELECT query.

4. Serverless, with no infrastructure to manage.

5. Pay $5 per TB of data scanned by your query.
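Athena speaks ANSI SQL, so a CTAS statement has the same shape as in most SQL engines. Here is an illustration using Python's built-in sqlite3 module (the log table and its contents are made up for the example), plus the arithmetic behind the $5-per-TB pricing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (path TEXT, status INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?)",
    [("/home", 200), ("/home", 500), ("/about", 200)],
)

# CTAS: materialise the result of a SELECT as a new table in one statement.
conn.execute("""
    CREATE TABLE error_summary AS
    SELECT path, COUNT(*) AS errors
    FROM logs
    WHERE status >= 500
    GROUP BY path
""")
rows = conn.execute("SELECT * FROM error_summary").fetchall()

# Athena pricing: $5 per TB scanned, so a query scanning 250 GB costs:
cost = 5 * (250 / 1024)
```

Because you pay per byte scanned, compressing and partitioning the data in S3 directly reduces query cost as well as query time.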

  • Amazon EMR : As described in the ETL section above, EMR uses big data frameworks such as Apache Hadoop and Apache Spark, and the same clusters can also be used to analyse vast amounts of data at scale, such as web server logs stored in an Amazon S3 bucket.

  • Amazon Redshift : It is a relational, OLAP-style database. It's a data warehouse built for the cloud, designed to run the most complex analytical workloads in standard SQL.

  • Amazon Redshift Spectrum : Amazon Redshift Spectrum is a feature of Amazon Redshift. Spectrum is a serverless query processing engine that allows you to join data that sits in Amazon S3 with data in Amazon Redshift. Athena follows the same logic as Spectrum, except that you go all-in on serverless and skip the warehouse.

  • Amazon Kinesis Analytics : Amazon Kinesis Data Analytics enables you to easily and quickly build queries and sophisticated streaming applications in three simple steps: set up your streaming data sources, write your queries or streaming applications, and set up your destination for processed data.

Amazon Kinesis Data Analytics — Image from AWS

6. Visualisation and Reporting :

Business analytics and data visualisation are two faces of the same coin. You need the ability to chart, graph, and plot your data. Now we have reached the end of the data analytics pipeline: creating the visualisations, dashboards, and insightful reports to be used by data scientists, business users, and other engagement platforms.

A key aspect of our ability to understand what's going on is looking for patterns. These patterns or insights are not delivered by just viewing data in tables or logs. They become visible when we apply the right tools and techniques to the data to make it more presentable to end users.
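Before a chart or dashboard can reveal a pattern, the raw events are usually aggregated into a small series that a visualisation tool can plot. A minimal sketch (the event shape is illustrative):

```python
from collections import Counter

# Raw clickstream-style events; a dashboard would plot views per page.
events = [
    {"page": "/home"}, {"page": "/pricing"}, {"page": "/home"},
    {"page": "/docs"}, {"page": "/home"}, {"page": "/pricing"},
]

# Aggregate the raw events into the series behind a bar chart.
page_views = Counter(event["page"] for event in events)
top_pages = page_views.most_common(2)  # the "top pages" widget data
```

Tools like QuickSight or Kibana perform exactly this kind of aggregation behind the scenes before rendering the chart.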

Visualisation & Reporting Process : Image — AWS

One of the tools for Visualisation is Amazon QuickSight :

  • Amazon QuickSight : Amazon QuickSight lets you create interactive dashboards, charts, and ML insights. These can be embedded in your applications or websites, and QuickSight can be easily integrated with your cloud or on-premises setups.

Amazon QuickSight — Image from AWS
  • ELK : You can also use the open source ELK stack, which stands for Elasticsearch, Logstash, and Kibana. Kibana is an open source data visualisation plugin for Elasticsearch. It provides visualisation capabilities and an open user interface that lets you visualise your Elasticsearch data and navigate the Elastic Stack.

Kibana used for Visualisation : Image — Kibana site

Please read and let me know if you have any questions about this blog.

Enjoy reading!

Translated from: https://medium.com/@praveen.kasana/data-analytics-lifecycle-overview-aws-2bb73af90bad
