2021年04月_过往记忆

12月 11月 10月 09月 08月 07月 06月 05月 04月 03月 02月 01月

转载知乎的 Flink 数据集成平台建设实践

摘要：本文由知乎技术平台负责人孙晓光分享，主要介绍知乎 Flink 数据集成平台建设实践。内容如下：业务场景历史设计全面转向 Flink 后的设计未来 Flink 应用场景的规划Tips：...

2021-04-28 09:00:00 432

转载贝壳 OLAP 平台架构及演进

分享嘉宾：肖赞贝壳资深工程师，编辑整理：赵冬生出品平台：DataFunTalk导读：随着大数据的持续发展及数字化转型的兴起，大数据OLAP分析需求越来越迫切，不论是大型互联网企业，还...

2021-04-27 09:00:00 1725

转载列存数据库仅仅需要列式存储？

黄峰，Kyligence 公司高级研发工程师，目前主要负责 Kyligence 企业级产品的开发以及维护工作。对 OLAP 场景的查询而言，单个查询往往需要在存储端扫描大量数据，再在内存中...

2021-04-26 09:00:00 359

转载每个大数据工程师都应该知道的消息队列演进

导语 |市面上有非常多的消息中间件，rabbitMQ、kafka、rocketMQ、pulsar、 redis等等，多得令人眼花缭乱。它们到底有什么异同，你应该选哪个？本文尝试通过技术演...

2021-04-25 12:23:47 419

转载一文理解 Redis 的核心原理与技术

文章作者：何永康，腾讯 CSIG 后台研发工程师。一、Redis 基础数据结构1. StringRedis 里的字符串是动态字符串，会根据实际情况动态调整。类似于 Go 里面的切片-sli...

2021-04-24 21:12:26 443

转载重磅！大数据知识总结和调参技巧开放下载了

大数据被誉为“新石油”，如何管理并洞悉数据的价值，是企业未来发展的核心竞争力。进入大数据时代，数据规模与日俱增。另一方面，数据仓库的市场份额被其他技术蚕食，比如大数据、机器学习和人工智能。...

2021-04-23 09:00:00 300

转载万台 HDFS 集群规模在快手的挑战与实践

作为快手内部数据规模和机器规模最大的分布式文件存储系统，HDFS一直伴随着快手业务的飞速发展而快速成长。HDFS架构介绍HDFS 全名 Hadoop Distributed File Sy...

2021-04-23 09:00:00 424

转载 Flink 容器化在唯品会的实践

唯品会自2017年开始基于k8s深入打造高性能、稳定、可靠、易用的实时计算平台，支持唯品会内部业务在平时以及大促的平稳运行。现平台支持Flink、Spark、Storm等主流框架。本文主要...

2021-04-22 12:20:12 427

转载地表最强！数据湖三剑客

Introduction在构建数据湖时，也许没有比数据格式存储更具有意义的决定。其结果将对其性能、可用性和兼容性产生直接影响。通过简单地改变数据的存储格式，我们就可以解锁新的功能，提高整个...

2021-04-21 12:51:02 1182

转载进群交流

话不多说，喜欢就进。我提供平台，您自便（蹭流量打广告的请绕行）二维码失效或者满人了的话请加个人微信号deltalake,备注进群 ...

2021-04-21 12:51:02 217

原创每个 Apache Kafka 开发者都应该知道的5件事

Apache Kafka 是一个开源流处理平台，如今有超过30％的财富500强企业使用该平台。Kafka 有很多特性使其成为事件流平台（event streaming platform）的...

2021-04-18 21:31:08 7815 6

转载 Apache Iceberg 在网易云音乐的实践

1iceberg 详细设计Apache iceberg 是Netflix开源的全新的存储格式，我们已经有了parquet、orc、arvo等非常优秀的存储格式以后，Netfix为什么还要设...

2021-04-16 09:48:10 604

转载数据湖 | 大数据存储架构峰会：数据湖技术论坛

2021年4月17日，9:00-12:45，大数据存储架构峰会：数据湖技术论坛将如约而至。本次论坛由阿里封神出品，邀请了来自阿里与Alluxio的5位专家，就数据湖相关技术主题进行分享。峰...

2021-04-16 09:48:10 695

原创 Apache Hudi 0.8.0 版本发布，Flink 集成有重大提升以及支持并行写

4月初，Apache Hudi 发布了 0.8 版本，这个版本供解决了 97 个 ISSUES，下面简单介绍一下这个版本的迁移以及重要特性。迁移指南•如果从 0.5.3 以下版本迁移，请检...

2021-04-15 09:00:00 881

转载滴滴基于 HDFS EC 节约大量存储成本的实践

桔妹导读：HDFS中默认的3副本方案在存储空间和其他资源（例如网络带宽）上有200％的开销。对于冷数据，使用纠删码（ErasureCoding，EC）存储代替副本存储是一种非常不错的替代方...

2021-04-14 09:00:00 1123 1

转载腾讯游戏实时计算应用平台建设实践

摘要：本文由腾讯游戏增值服务部数据中心许振文分享，主要介绍腾讯游戏实时计算应用平台的建设实践。内容包括：建设背景统一实时大数据开发OneData统一大数据接口服务 OneFun数据服务微...

2021-04-13 09:00:00 617

转载阿里、美团内部大数据资料！果然牛逼！

之前参加了一个技术论坛，有幸认识几位大佬，本着“近朱者赤近墨者黑”的原则，梦想有朝一日也当上大佬的我只好厚着脸皮要资料，据说有不少小伙伴靠这份秘籍成功掌握了大数据的核心技能，拿到了 BAT...

2021-04-13 09:00:00 1247

转载 ClickHouse 在爱奇艺视频生产实时数仓的应用

众所周知，爱奇艺拥有海量视频，在视频生产过程中产生的上千QPS的实时数据、T级别的数据存储。要支持这样的数据进行即席查询和多个大表的JOIN，是爱奇艺视频生产团队大数据应用的难点。具体来说...

2021-04-12 09:00:00 1095 1

转载即将发布的 Apache Kafka 2.8 将不需要依赖 Zookeeper，单集群支持数百万个分区

Apache Kafka 的核心设计是日志（Log）—— 一个简单的数据结构，使用顺序操作。以日志为中心的设计带来了高效的磁盘缓冲和 CPU 缓存使用、预取、零拷贝数据传输和许多其他好处，...

2021-04-11 21:00:00 1333

转载 Apache DolphinScheduler：国人主导的分布式工作流调度平台正式成为 Apache 顶级项目...

全球最大的开源软件基金会 Apache 软件基金会（以下简称 Apache）于北京时间 2021年4月9日在官方渠道宣布Apache DolphinScheduler 毕业成为Apache...

2021-04-10 00:00:00 1366 1

转载 DorisDB致工程师们的一封信

各位工程师同学们：你好！如果你在从事后端开发，你也许曾幻想过设计出性能比标准库还高数倍的数据结构或者算法；如果你在从事存储系统开发，你也许曾幻想过实现一种全新的存储模式，大幅提升读取和写...

2021-04-09 09:00:00 996 1

转载 Apache Flink 集成 Apache Hudi 快速入门指南

摘要：本文由阿里巴巴的陈玉兆分享，主要介绍 Flink 集成 Hudi 的最新版本功能以及快速上手实践指南。内容包括：背景环境准备Batch 模式的读写Streaming 读总结一、背景...

2021-04-08 09:00:00 1647 1

转载百万QPS，秒级延迟，携程基于实时流的大数据基础层建设

作者简介纪成，携程数据开发总监，负责金融数据基础组件及平台开发、数仓建设与治理相关的工作。对大数据领域开源技术框架有浓厚兴趣。一一、背景2017年9月携程金融成立，本着践行金融助力旅行的...

2021-04-07 09:00:00 859

转载一文理解 MVCC 实现原理

MVCC 到底是什么？MVCC 即多版本控制器，其特点就是在同一时间，不同事务可以读取到不同版本的数据，从而去解决脏读和不可重复读的问题。这样的解释你看了不下几十遍了吧！但是你真的理解什么...

2021-04-06 09:00:00 1250

转载 Kafka、Flink 数据中台实践：事件模型、调度、实时/离线数仓架构之道

去年年底，网传阿里董事局主席张勇，在阿里内网发帖称“现在阿里的业务发展太慢，要把中台变薄，变得敏捷和快速。”此言一出激起千层浪，难道中台概念真成也阿里败也阿里？有人戏称阿里“过河拆中台”，...

2021-04-05 21:00:00 926

转载美团图数据库平台建设及业务实践

图数据结构，能够更好地表征现实世界。美团业务相对较复杂，存在比较多的图数据存储及多跳查询需求，亟需一种组件来对千亿量级图数据进行管理，海量图数据的高效存储和查询是图数据库研究的核心课题。本...

2021-04-05 21:00:00 487 1

转载 Apache Iceberg 你需要知道的原理与技术

实时数据仓库的发展、架构和趋势这篇文章从实时数仓开始讲到批流一体，谈了谈对大数据架构体系发展趋势的看法。文章最后讲到了基于数据湖Iceberg实现的存储层统一方案，以及要实现此方案Ice...

2021-04-02 08:59:00 5388

转载实时数据聚合怎么破

作者简介数据猩猩，携程数据分析总监，关注分布式数据存储和实时数据分析。实时数据分析一直是个热门话题，需要实时数据分析的场景也越来越多，如金融支付中的风控，基础运维中的监控告警，实时大盘之...

2021-04-01 12:05:09 584

WeCenter 3.2.2

WeCenter 是一款开源知识型的社交化问答社区程序，专注于社区内容的整理、归类和检索，并通过连接微信公众平台，移动APP进行内容分发。

2018-09-13

HBase在不同版本（1.x, 2.x, 3.0）中针对不同类型的硬件（以IO为例，HDD/SATA-SSD/PCIe-SSD/Cloud）和场景（single/batch, get/scan）做了（即将做）各种不同的优化，这些优化都有哪些？如何针对自己的生产业务和硬件环境选择和使用合适的版本/功能？在生产环境可能出现各种问题，而监控系统是发现并解决问题的关键。目前HBase提供了大量的metrics用于监控，其中有哪些是要特别关注的？线上不同类型的问题应该重点查看哪些metrics来定位问题？如何结合metrics和客户端／服务端日志快速定位问题？

2018-08-13

HBase Procedure V2介绍

主要介绍一下Procedure V2的设计和结构，以及为什么用Procedure V2能比较容易实现出正确的AssignmentManager。最后介绍一下最近在2.1分支上对一些Procedure实现修正和改进。

2018-08-13

HBase在贝壳找房的应用实践

介绍贝壳基于hbase在多维分析（kylin）,楼盘字典等核心项目的应用，并分享在实践过程中遇到的问题和性能优化经验。

2018-08-13

Scala Cheat Sheet

本速查表可以用于快速地查找Scala语法结构。Licensed by Brendan O’Connor under a CC-BY-SA 3.0 license.

2018-07-04

Apache Hive Functions Cheat Sheet

How to create and use Hive Functions, Listing of Built-In Functions that are supported in Hive

2018-07-04

Apache Spark Cheat Sheet

Apache Spark has become the engine to enhance many of the capabilities of the ever-present Apache Hadoop environment. For Big Data, Apache Spark meets a lot of needs and runs natively on Apache Hadoop’s YARN. By running Apache Spark in your Apache Hadoop environment, you gain all the security, governance, and scalability inherent to that platform. Apache Spark is also extremely well integrated with Apache Hive and gains access to all your Apache Hadoop tables utilizing integrated security.

2018-07-04

spark-summit-north-america-2018-06 全部 PPT -part1

spark-summit-north-america-2018-06 全部 PPT -part1部分。 spark-summit-north-america-2018-06 全部 PPT -part1部分

2018-06-19

spark-summit-north-america-2018-06 全部 PPT -part2

spark-summit-north-america-2018-06全部PPT，下载。spark-summit-north-america-2018-06

2018-06-17

A Deep Dive into Stateful Stream Processing in Structured Streaming

A Deep Dive into Stateful Stream Processing in Structured Streaming A Deep Dive into Stateful Stream Processing in Structured Streaming

2018-06-17

Implementing AutoML Techniques at Salesforce Scale

Implementing AutoML Techniques at Salesforce Scale,Implementing AutoML Techniques at Salesforce Scale

2018-06-17

Using AI to Deliver a Device as a Service

Using AI to Deliver a Device as a Service,Using AI to Deliver a Device as a Service

2018-06-17

Foundations of streaming SQL

Covering ideas from across the Apache Beam, Apache Calcite, Apache Kafka, and Apache Flink communities, with thoughts and contributions from Julian Hyde, Fabian Hueske, Shaoxuan Wang, Kenn Knowles, Ben Chambers, Reuven Lax, Mingmin Xu, James Xu, Martin Kleppmann, Jay Kreps and many more, not to mention that whole database community thing...

2018-06-15

Deep Dive into Spark SQL with Advanced Performance Tuning

Spark SQL is a highly scalable and efficient relational processing engine with ease-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.

2018-06-11

HBase-The Definitive Guide-Second Edition-Early Release.pdf

If you’re looking for a scalable storage solution to accommodate a virtually endless amount of data, this updated edition shows you how Apache HBase can meet your needs. Modeled after Google’s BigTable architecture, HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant. Fully revised for HBase 1.0, this second edition brings you up to speed on the new HBase client API, as well as security features and new case studies that demonstrate HBase use in the real world. Whether you just started to evaluate this non-relational database, or plan to put it into practice right away, this book has your back. Launch into basic, advanced, and administrative features of HBase’s new client-facing API Use new classes to integrate HBase with Hadoop’s MapReduce framework Explore HBase’s architecture, including the storage format, write-ahead log, and background processes Dive into advanced usage, such extended client and server options Learn cluster sizing, tuning, and monitoring best practices Design schemas, copy tables, import bulk data, decommission nodes, and other tasks Go deeper into HBase security, including Kerberos and encryption at rest

2018-05-23

QCon北京2018－《RandonDb新一代分布式关系型数据库》－张雁飞.pdf

RadonDB ►可扩展 ►高可用 ►强一致 ►易部署 ►MyNewSQL

2018-05-16

QCon北京2018-《TiDB架构与开源之路》-申砾.pdf

TiDB架构与开源之路,TiDB架构与开源之路,TiDB架构与开源之路

2018-05-16

Qcon北京2018-《区块链服务在华为公有云平台上的重要问题设计实现及解决方法》-张子怡.pdf

区块链是在点对点网络中对交易具有防篡改功能的共享数据账本，Hyperledger fabric是一个比较知名的开源区块链框架，其中作为分布式系统的核心问题就是共识算法以及共识算法的效率问题。如何既保证这个共识算法能让参与区块链的联盟各方都认可它的安全可信，又能提高联盟成员间的共识效率就是一个所有人都关注的重要问题，这里我们将会介绍一种优化的bft共识算法的设计和使用方式。对于区块链服务的使用者，数据安全性是一个非常重要的问题，例如同态加密，零知识证明和国密算法等，我们会介绍这些高级功能特性，讲解这些特性的原理，以及介绍华为提供的这些特性支持中接口是怎么使用，还有通过代码示例演示怎么使用这些高级特性，让大家对区块链服务的基础和基于它的一些高级功能能有初步认识到基本实践的能力。

2018-05-16

QCon北京2018-《用正确分享来磨练专家实力——分享型专家升级记》-黄闻欣.pdf

有一期《奇葩说》，老罗说跨界很重要，实在想不到跨什么，就跨界去学演讲吧。他给的道理是影响力。我给的道理是演讲能从根本上提升你的软实力和硬实力。这次分享，我会用我的从工程师到专家工程师的亲身经历作为案例，从沟通力，学习力，思考力，强迫力，告诉大家，用怎样的钥匙才能打开这扇门。希望听众能收获并践行，让自己的职业生涯更进一步。

2018-05-16

QCon北京2018-《Oracle区块链架构及其应用开发》-蒋春明.pdf

Oracle区块链云服务基于开源的Hyperledger Fabric软件打造，是一个与其他高性能Oracle云服务相集成，且预先集成了Oracle SaaS和Oracle内部部署应用的开放的API式解决方案，能够与任何系统进行定制化整合。

2018-05-16

Apache iceberg：Netflix 数据仓库的基石

Apache Iceberg 是一种用于跟踪超大规模表的新格式，是专门为对象存储（如S3）而设计的。本文将介绍为什么 Netflix 需要构建 Iceberg，Apache Iceberg 的高层次设计，并会介绍那些能够更好地解决查询性能问题的细节。

2020-02-23

Apache Hadoop 3.x state of the union and upgrade guidance

Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without repeatedly worrying about resource management, isolation, multitenancy issues, etc. The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. It employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters. Wangda Tan and Wei-Chiu Chuang the current status of Apache Hadoop 3.x—how it’s used today in deployments large and small, and they dive into the exciting present and future of Hadoop 3.x—features that further strengthen Hadoop as the primary resource-management platform and the storage system for enterprise data centers. They explore the current status and the future promise of features and initiatives for both YARN and HDFS of Hadoop 3.×. For YARN 3.x, there is powerful container placement, global scheduling, support for machine learning (Spark) and deep learning (TensorFlow) workloads through GPU and field-programmable gate array (FPGA) scheduling and isolation support, extreme scale with YARN federation, containerized apps on YARN, support for long-running services (alongside applications) natively without any changes, seamless application/services upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service v2, a new web UI, better queue management, etc. Also, HDFS 3.0 announced GA for erasure coding, which doubles the storage efficiency of data and thus reduces the cost of storage for enterprise use cases. HDFS added support for multiple standby NameNodes for better availability. For better reliability of metadata and easier operations, Journal nodes have been enhanced to sync the edit log segments to protect against rolling failures. Disk balancing within a DataNode was another important feature added to ensure disks are evenly utilized in a DataNode, which also ensures better aggregate throughput and prevents from lopsided utilization if new disks are added or replaced in a DataNode. The HDFS team is currently driving the Ozone initiative, which lays the foundation of the next generation of storage architecture for HDFS where data blocks are organized in storage containers for higher scale and handling of small objects in HDFS. The Ozone project also includes an object store implementation to support new use cases. And you’ll leave with all the knowledge of how to upgrade painlessly from 2.x to 3.x to get all the benefits.

2020-02-04

Apache Doris (Incubating) 原理与实践.pdf

Doris（原百度 Palo）是一款基于大规模并行处理技术的分布式 SQL 数据库，由百度在 2017 年开源，2018 年 8 月进入 Apache 孵化器。

2019-12-10

Spark SQL 在字节跳动的优化实践-郭俊.pdf

Spark 在字节跳动内部扮演着重要角色。在数据仓库领域，Spark SQL 正在逐渐取代 Hive 成为主要的 ETL 计算引擎，另外它还是字节跳动内部重要的 ad-hoc 查询引擎。目前 Spark 每天处理百万亿级数据，单任务 Shuffle 数据量可超过 200TB。同时 Spark 与其它系统混合部署，因此性能与稳定性都是需要重点解决的问题。本次分享将会基于基础架构团队过往的工作成果，介绍字节跳动在提升基于 Spark SQL 的 ETL 稳定性以及优化 ad-hoc 查询的性能方面的实践。

2019-12-03

Spark+AI Summit Europe 2019 Part 3

Spark+AI Summit Europe 2019 补充PPT，解压密码请到 https://www.iteblog.com/archives/8424.html 获取。为期三天的 SPARK + AI SUMMIT Europe 2019 于 2019年10月15日-17日荷兰首都阿姆斯特丹举行。数据和 AI 是需要结合的，而 Spark 能够处理海量数据的分析，将 Spark 和 AI 进行结合，无疑会带来更好的产品。Spark+AI Summit Europe 2019 是欧洲最大的数据和机器学习会议，大约有1700多名数据科学家、工程师和分析师参加此次会议。本次会议的提议包括了Apache Spark™、TensorFlow、MLflow 、 PyTorch、Delta Lake、 MLflow 以及 Koalas 等开源技术的最新进展，以及在现实世界中部署人工智能的最佳实践。

2019-11-03

Spark+AI Summit Europe 2019_iteblog.zip.002

由于文件过大，分成2个文件下载。解压密码请到 https://www.iteblog.com/archives/8424.html 获取。为期三天的 SPARK + AI SUMMIT Europe 2019 于 2019年10月15日-17日荷兰首都阿姆斯特丹举行。数据和 AI 是需要结合的，而 Spark 能够处理海量数据的分析，将 Spark 和 AI 进行结合，无疑会带来更好的产品。Spark+AI Summit Europe 2019 是欧洲最大的数据和机器学习会议，大约有1700多名数据科学家、工程师和分析师参加此次会议。本次会议的提议包括了Apache Spark™、TensorFlow、MLflow 、 PyTorch、Delta Lake、 MLflow 以及 Koalas 等开源技术的最新进展，以及在现实世界中部署人工智能的最佳实践。

2019-11-01

Spark+AI Summit Europe 2019_iteblog.zip.001

2019-11-01

The Delta Architecture Delta Lake + Apache Spark Structured Streaming.pdf

数据工程师的纠结与运维的凌乱 • Delta Lake基本原理 • Delta 架构 • Delta 架构的特性 • Delta 架构的经典案例 & Demo • Delta Lake 社区

2019-10-28

Apache Spark 3.0, Koalas, Delta Lake 最新进展

In this talk, we will highlight major efforts happening in the Spark ecosystem. In particular, we will dive into the details of adaptive and static query optimizations in Spark 3.0 to make Spark easier to use and faster to run. We will also demonstrate how new features in Koalas, an open source library that provides Pandas-like API on top of Spark, helps data scientists gain insights from their data quicker.

2019-10-28

SPARK + AI SUMMIT 2019 全部 PPT

为期三天的 SPARK + AI SUMMIT 2019 于 2019年04月23日-25日在旧金山（San Francisco）进行。数据和 AI 是需要结合的，而 Spark 能够处理海量数据的分析，将 Spark 和 AI 进行结合，无疑会带来更好的产品。作为大数据领域的顶级会议，Spark+AI Summit 2019 吸引了全球大量技术大咖参会，而且 Spark+AI Summit 越做越大，本次会议议题快接近200多个。详情：https://www.iteblog.com/archives/2431.html

2019-09-21

From Stream Processor to a Unified Data Processing System

The Apache Flink community has pushed (and continues to push) the boundary for Stream Processing over the last years, following the understanding that Stream Processing is unifying paradigm to build data processing applications, beyond real-time analytics. The latest major effort in the Flink community is nothing less then re-architecting the API and runtime stack, with the goal to naturally support the spectrum of analytics and data-driven applications, to unify the APIs for batch and streaming (Table API and DataStream API), and to build a streaming runtime that is not only state-of-the-art in stream processing, but also in batch processing performance. In this keynote, we give an overview of the goals and technology behind the above effort, and look at the adoption of Apache Flink for Stream Processing and "beyond streaming" use cases, as well as various efforts in the community to support the growth in users, applications, and ecosystem.

2019-04-20

Apache Spark 2.4 and beyond

Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more. Xiao Li and We

2019-04-14

Flink社区专刊S2-重新定义计算

阿里巴巴最新一期Flink电子月刊《重新定义计算：Apache Flink 实践》正式发布，该月刊融合了 Apache Flink 在国内各大互联网公司的大规模实践和Flink Forward China峰会上的精彩演讲内容，希望对大家有所帮助。详情参考：https://mp.weixin.qq.com/s/HS9qoGTKzyd46VgjEpNiwg

2019-04-11

从MPP数仓迁移至Spark：案例与最佳实践分享

本次主要分享关于迁移实际案例与最佳实践更加深入的探讨。在迁移过程中，我们遇到了很多的预料之外的问题，如字符集问题，数字进位问题，各种OOM等等，更加深入地了解了Spark和RDMBS之间的差异。在弥补鸿沟和解决问题的过程中，我们做了很多的实践，贡献给了社区很多的反馈，也解决了很多的bug。即便对于Spark当前不能处理的场景，比如recurisve query，也有了一些可行的探索。此外，我们现在还开发了一套自动化框架来帮助加速迁移工作。在这次分享中，我们会深入迁移的关键步骤，并分享踩过的一些坑，最后会介绍我们的自动化工具，如SQL Converter等。相信对正工作在类似的任务或者即将开展类似工作的工程师们会有所帮助。下面是PPT原文：关注 Hadoop技术博文并回复 ebay_spark 获取本文PPT。

2019-03-31

2018 Apache HBase 技术实战专刊

本专刊由中国HBase技术社区整理，一共156页，包含HBase案例、组件、技术、平台等方面的介绍，详情参见https://www.iteblog.com/archives/2496.html

2019-01-07

Apache Spark Shuffle I/O 在 Facebook 的优化 [PDF]

我们都知道，Shuffle 操作在 Spark 中是一种昂贵的操作。在 Facebook，单个 Job 的 Shuffle 就可能往磁盘中写入 300TB 的数据；而且 shuffle reads 也是一种低效的操作，这会大大延长作业的整体执行时间，并且消耗大量的系统资源。为了提高 shuffle 的性能并提高资源利用率，Facebook 开发了 Spark-optimized Shuffle (SOS) 。这种 shuffle 技术有效地将大量小的 shuffle 读请求转换成少并且大的顺序 I/O 请求。目前这个技术于2018年4月已经在 Facebook 大规模使用了，作业整体的 I/O 提升了两倍，计算效率提高10％。值得高兴的是，这项技术 Facebook 打算共享给社区。本地址是这项技术的视频介绍。关注Hadoop技术博文(iteblog_hadoop) 公众号并回复 sos 获取本文相关ppt及相关技术论文。

2018-12-10

空空如也

TA创建的收藏夹 TA关注的收藏夹

TA关注的人

WeCenter 3.2.2

HBase in Practise: 性能、监控和问题排查

HBase Procedure V2介绍

HBase在贝壳找房的应用实践

Scala Cheat Sheet

Apache Hive Functions Cheat Sheet

Apache Spark Cheat Sheet

spark-summit-north-america-2018-06 全部 PPT -part1

spark-summit-north-america-2018-06 全部 PPT -part2

A Deep Dive into Stateful Stream Processing in Structured Streaming

Implementing AutoML Techniques at Salesforce Scale

Using AI to Deliver a Device as a Service

Foundations of streaming SQL

Deep Dive into Spark SQL with Advanced Performance Tuning

HBase-The Definitive Guide-Second Edition-Early Release.pdf

QCon北京2018－《RandonDb新一代分布式关系型数据库》－张雁飞.pdf

QCon北京2018-《TiDB架构与开源之路》-申砾.pdf

Qcon北京2018-《区块链服务在华为公有云平台上的重要问题设计实现及解决方法》-张子怡.pdf

QCon北京2018-《用正确分享来磨练专家实力——分享型专家升级记》-黄闻欣.pdf

QCon北京2018-《Oracle区块链架构及其应用开发》-蒋春明.pdf

Apache iceberg：Netflix 数据仓库的基石

Apache Hadoop 3.x state of the union and upgrade guidance

Apache Doris (Incubating) 原理与实践.pdf

Spark SQL 在字节跳动的优化实践-郭俊.pdf

Spark+AI Summit Europe 2019 Part 3

Spark+AI Summit Europe 2019_iteblog.zip.002

Spark+AI Summit Europe 2019_iteblog.zip.001

The Delta Architecture Delta Lake + Apache Spark Structured Streaming.pdf

Apache Spark 3.0, Koalas, Delta Lake 最新进展

SPARK + AI SUMMIT 2019 全部 PPT

From Stream Processor to a Unified Data Processing System

Apache Spark 2.4 and beyond

Flink社区专刊S2-重新定义计算

从MPP数仓迁移至Spark：案例与最佳实践分享

2018 Apache HBase 技术实战专刊

Apache Spark Shuffle I/O 在 Facebook 的优化 [PDF]

Apache Spark Shuffle I/O 在 Facebook 的优化

不仅仅是流计算：Apache Flink实践

Spark AI Summit Europe 2018 全部PPT - part1

Easy, Scalable, Fault-tolerant stream processing with Structured Streaming-TD

空空如也