Introducing Presto

郝ren

Original article: https://medium.com/oreillymedia/introducing-presto-839a26aac724

Editor’s Note: We think this piece is important because it reviews a tool designed to efficiently query vast amounts of data, allowing for analytics across an entire organization. As foremost experts in Presto, Matt Fuller, Manfred Moser, and Martin Traverso provide an introductory overview of this powerful open source, distributed SQL query engine. We’d love to hear from you about what you think about this piece.

The Problems with Big Data

Everybody is capturing more and more data from device metrics, user behavior tracking, business transactions, location data, software and system testing procedures and workflows, and much more. The insights gained from understanding that data and working with it can make or break the success of any initiative, or even a company.

At the same time, the diversity of storage mechanisms available for data has exploded: relational databases, NoSQL databases, document databases, key-value stores, object storage systems, and so on. Many of them are necessary in today’s organizations, and it is no longer possible to use just one of them. Dealing with this can be a daunting task that feels overwhelming.

In addition, all these different systems do not allow you to query and inspect the data with standard tools. Different query languages and analysis tools for niche systems are everywhere. Meanwhile, your business analysts are used to the industry standard, SQL. A myriad of powerful tools rely on SQL for analytics, dashboard creation, rich reporting, and other business intelligence work.

The data is distributed across various silos, and some of them cannot even be queried at the performance your analytics needs require. Other systems, unlike modern cloud applications, store data in monolithic systems that cannot scale horizontally. Without these capabilities, you are narrowing the number of potential use cases and users, and therefore the usefulness of the data.

The traditional approach of creating and maintaining large, dedicated data warehouses has proven to be very expensive in organizations across the globe. Most often, this approach is also found to be too slow and cumbersome for many users and usage patterns.

You can see the tremendous opportunity for a system to unlock all this value.

Presto to the Rescue

Presto is capable of solving all these problems, and of unlocking new opportunities with federated queries to disparate systems, parallel queries, horizontal cluster scaling, and much more.

Presto is an open source, distributed SQL query engine. It was designed and written from the ground up to efficiently query data against disparate data sources of all sizes, ranging from gigabytes to petabytes. Presto breaks the false choice between having fast analytics using an expensive commercial solution, or using a slow “free” solution that requires excessive hardware.

Designed for Performance and Scale

Presto is a tool designed to efficiently query vast amounts of data by using distributed execution. If you have terabytes or even petabytes of data to query, you are likely using tools such as Apache Hive that interact with Hadoop and its Hadoop Distributed File System (HDFS). Presto is designed as an alternative to these tools to more efficiently query that data.

Analysts who expect SQL response times ranging from milliseconds for real-time analysis to seconds and minutes should use Presto. Presto supports SQL, commonly used in data warehousing and analytics for analyzing data, aggregating large amounts of data, and producing reports. These workloads are often classified as online analytical processing (OLAP).

Even though Presto understands and can efficiently execute SQL, Presto is not a database, as it does not include its own data storage system. It is not meant to be a general-purpose relational database that serves to replace Microsoft SQL Server, Oracle Database, MySQL, or PostgreSQL. Further, Presto is not designed to handle online transaction processing (OLTP). This is also true of other databases designed and optimized for data warehousing or analytics, such as Teradata, Netezza, Vertica, and Amazon Redshift.

Presto leverages both well-known and novel techniques for distributed query processing. These techniques include in-memory parallel processing, pipelined execution across nodes in the cluster, a multithreaded execution model to keep all the CPU cores busy, efficient flat-memory data structures to minimize Java garbage collection, and Java bytecode generation. For Presto users, these techniques translate into faster insights into your data at a fraction of the cost of other solutions.
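To make the idea of parallel, staged execution more concrete, here is a toy Python sketch. It is not how Presto is implemented (Presto is written in Java and uses far more sophisticated machinery); it only illustrates the general pattern of workers aggregating their own chunks of data in parallel, followed by a final merge. The data, the `SPLITS` name, and all function names are hypothetical and chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "split" is a chunk of (region, amount) rows
# that a separate worker could scan independently.
SPLITS = [
    [("eu", 10), ("us", 5)],
    [("us", 7), ("apac", 3)],
    [("eu", 1), ("apac", 4)],
]


def partial_sum(split):
    """Stage 1: a worker aggregates only the rows in its own split."""
    totals = Counter()
    for region, amount in split:
        totals[region] += amount
    return totals


def final_merge(partials):
    """Stage 2: combine the partial results into the final answer."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return dict(merged)


def run_query(splits, workers=3):
    # Conceptually: SELECT region, sum(amount) FROM sales GROUP BY region
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, splits))
    return final_merge(partials)


if __name__ == "__main__":
    print(run_query(SPLITS))
```

The point of the sketch is the shape of the computation: because each split can be aggregated on its own, adding more workers lets more of the scan run at the same time, and only the small partial results need to be combined at the end.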

SQL-on-Anything

Presto was initially designed to query data from HDFS. And it can do that very efficiently. But that is not where it ends. On the contrary, Presto is a query engine that can query data from object storage, relational database management systems (RDBMSs), NoSQL databases, and other systems.

Presto queries data where it lives and does not require a migration of data to a single location. So Presto allows you to query data in HDFS and other distributed object storage systems. It allows you to query RDBMSs and other data sources. As such, it can really query data wherever it lives and therefore be a replacement for the traditional, expensive, and heavy extract, transform, and load (ETL) processes. Or at a minimum, it can help you with them and lighten the load. So Presto is clearly not just another SQL-on-Hadoop solution.

Object storage systems include Amazon Web Services (AWS) Simple Storage Service (S3), Microsoft Azure Blob Storage, Google Cloud Storage, and S3-compatible storage such as MinIO and Ceph. Presto can query traditional RDBMSs such as Microsoft SQL Server, PostgreSQL, MySQL, Oracle, Teradata, and Amazon Redshift. Presto can also query NoSQL systems such as Apache Cassandra, Apache Kafka, MongoDB, or Elasticsearch. Presto can query virtually anything and is truly a SQL-on-Anything system.

For users, this means that suddenly they no longer have to rely on specific query languages or tools to interact with the data in those specific systems. They can simply leverage Presto and their existing SQL skills and their well-understood analytics, dashboarding, and reporting tools. These tools, built on top of SQL, allow analysis of those additional data sets, which are otherwise locked in separate systems. Users can even use Presto to query across different systems with the SQL they know.
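As a rough analogy for querying across different systems with one SQL statement, the following Python sketch uses the standard-library `sqlite3` module with two separate in-memory databases. This is a stand-in, not Presto: in Presto, tables are addressed as `catalog.schema.table`, so a comparable query might join something like `hive.sales.orders` with `postgresql.public.customers` (both names are hypothetical). The table names, data, and function names below are assumptions made for illustration.

```python
import sqlite3


def build_demo_connection():
    """Create two separate in-memory databases behind one connection.

    `main` stands in for one data source and `crm` for another;
    both are hypothetical stand-ins for systems Presto might query.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("CREATE TABLE main.orders (customer_id INTEGER, total REAL)")
    conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO main.orders VALUES (?, ?)",
        [(1, 20.0), (1, 5.0), (2, 12.5)],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    return conn


def spend_per_customer(conn):
    """Join tables that live in two different databases with plain SQL."""
    query = """
        SELECT c.name, SUM(o.total) AS spend
        FROM main.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_demo_connection()
    for name, spend in spend_per_customer(connection):
        print(name, spend)
    connection.close()
```

The query itself is ordinary SQL. The only thing that tells you two stores are involved is the qualifier in front of each table name, which is the experience the article describes for analysts.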

Separation of Data Storage and Query Compute Resources

Presto is not a database with storage; rather, it simply queries data where it lives. When using Presto, storage and compute are decoupled and can be scaled independently. Presto represents the compute layer, whereas the underlying data sources represent the storage layer.

This allows Presto to scale its compute resources for query processing up and down, based on the analytics demand for this data. There is no need to move your data, provision compute and storage to the exact needs of the current queries, or change that regularly based on your changing query needs.

Presto can scale the query power by scaling the compute cluster dynamically, and the data can be queried right where it lives in the data source. This characteristic allows you to greatly optimize your hardware resource needs and therefore reduce cost.
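The effect of scaling compute independently of storage can be sketched with a small toy model in Python. This is only an illustration under simplifying assumptions, not Presto's actual scheduler: the data is represented as a fixed list of hypothetical "splits", and workers are plain integers. The names `assign_splits` and `max_load` are invented for this example.

```python
def assign_splits(splits, worker_count):
    """Spread the same data splits round-robin over the available workers."""
    if worker_count < 1:
        raise ValueError("need at least one worker")
    assignment = {worker: [] for worker in range(worker_count)}
    for index, split in enumerate(splits):
        assignment[index % worker_count].append(split)
    return assignment


def max_load(assignment):
    """Largest number of splits any single worker has to process."""
    return max(len(assigned) for assigned in assignment.values())


# Hypothetical storage layer: twelve chunks of data that never move.
SPLITS = [f"split-{i}" for i in range(12)]

small_cluster = assign_splits(SPLITS, 2)
large_cluster = assign_splits(SPLITS, 6)

if __name__ == "__main__":
    print("workers=2, busiest worker handles", max_load(small_cluster), "splits")
    print("workers=6, busiest worker handles", max_load(large_cluster), "splits")
```

In this model the list of splits is identical in both runs; only the number of workers changes, and with it the amount of work each worker has to do.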

Matt Fuller is a cofounder at Starburst, the Presto Company. Prior to founding Starburst, Matt was a director of engineering at Teradata, where he worked to build the new Center for Hadoop division within the company. As a major part of this, Matt worked to bring Presto to the enterprise market. Matt has managed a team contributing to the open source Presto project since 2015 and led the internal Presto product roadmap. Starburst was later formed from this team at Teradata.

Manfred Moser is a community advocate, writer, trainer and software engineer at Starburst. Manfred has a long history of developing and advocating open source software. He is an Apache Maven committer, wrote the Hudson book and others, and continues to be active in the open source community and his projects. He is a seasoned trainer and conference presenter for CI/CD, Cloud Native, Agile and other software development tools and processes, having trained well over 20,000 developers for companies including Walmart Labs, Sonatype, and Telus.

Martin Traverso is the cofounder of the Presto Software Foundation and CTO at Starburst. Prior to Starburst, Martin worked as a software engineer at Facebook where he saw the need for fast interactive SQL analytics. Martin and three other engineers worked to create what became Presto. Martin led the Presto development team and in the spring of 2013 Presto was rolled out into production, later made open source in the fall of 2013. Since then, Presto has gained wide adoption both internal and external to Facebook.

Translated from: https://medium.com/oreillymedia/introducing-presto-839a26aac724
