MySQL Applier for Hadoop

Enabling Real-Time MySQL to HDFS Integration

Big Data is transforming the way organizations harness new insights from their business, and Apache Hadoop is at the center of that transformation.

Batch processing delivered by MapReduce remains central to Hadoop, but as the pressure to gain competitive advantage from "speed of thought" analytics grows, Hadoop itself is undergoing significant evolution. Technologies enabling real-time queries, such as Apache Drill, Cloudera Impala, and the Stinger Initiative, are emerging, supported by a new generation of resource management in Apache YARN.

To support this growing emphasis on real-time operations, MySQL is releasing a new Hadoop Applier to enable the replication of events from MySQL to Hadoop / Hive / HDFS (Hadoop Distributed File System) as they happen. The Hadoop Applier complements existing batch-based Apache Sqoop connectivity.

This developer article gives you everything you need to get started implementing real-time MySQL to Hadoop integration.

MySQL in Big Data

MySQL is playing a key role in many big data platforms. Based on estimates from a leading Hadoop vendor, MySQL is a core component of the big data pipeline in over 80% of deployments, including those implemented by the likes of Facebook and Twitter.

Recent research by Intel [1] reports similar findings.

As an example, online retailers can use big data from their web properties to better understand site visitors' activities, such as paths through the site, pages viewed, and comments posted, captured from clickstreams and weblogs. This data can be combined with user profiles and purchasing history to gain deeper insight into their customers and enable the delivery of highly targeted offers.

Of course, it is not just in the web that big data can make a difference. Many other business activities can benefit, with other common use cases including:

Sentiment analysis;

Marketing campaign analysis;

Customer churn modeling;

Fraud detection;

Research and development;

Risk modeling;

And more.

Today many users integrate data from MySQL to Hadoop using Apache Sqoop, allowing bulk transfers of data between MySQL and HDFS, or related systems such as Hive and HBase.

Apache Sqoop is a well-proven approach for bulk data loading. However, with the growing number of use cases for streaming real-time updates from MySQL into Hadoop for immediate analysis, we need to look at complementary solutions.

In addition, the process of bulk loading to and from tables can place additional demands on the production database infrastructure, potentially impacting performance and the predictability of throughput and latency.

The Hadoop Applier is designed to address these issues by performing real-time replication of events from MySQL to Hadoop.

MySQL Applier for Hadoop

Replication via the Hadoop Applier works by connecting to the MySQL master, reading binary log events as soon as they are committed, and writing them into a file in HDFS. "Events" describe database changes, such as table creation operations or changes to table data.


Figure 1: MySQL to HDFS Integration

The Hadoop Applier uses an API provided by libhdfs, a C library to manipulate files in HDFS. The library comes precompiled with Hadoop distributions.

It connects to the MySQL master to read the binary log and then:

Fetches the row insert events occurring on the master;

Decodes these events, extracts the data inserted into each field of the row, and uses content handlers to convert it into the required format;

Appends it to a text file in HDFS (a minimal sketch of this step follows).
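The append step relies on a handful of libhdfs calls. The following minimal C++ sketch illustrates the pattern, assuming a reachable HDFS deployment; the namenode address, file path, and row contents are illustrative assumptions rather than the Applier's actual source:

    #include <fcntl.h>    // O_WRONLY, O_APPEND
    #include <cstdio>
    #include <cstring>
    #include "hdfs.h"     // libhdfs header; the library ships precompiled with Hadoop

    int main() {
        // "default" tells libhdfs to pick up fs.default.name from the Hadoop config.
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) { std::fprintf(stderr, "cannot connect to HDFS\n"); return 1; }

        // One text file per table under the Hive warehouse directory
        // (see Figure 2 below; database and table names are hypothetical).
        const char *path = "/user/hive/warehouse/db1.db/t1/datafile1.txt";
        hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_APPEND, 0, 0, 0);
        if (!out) { std::fprintf(stderr, "cannot open %s\n", path); return 1; }

        // A decoded row insert event, rendered as one comma-separated line.
        const char *row = "101,John,2013-07-25\n";
        hdfsWrite(fs, out, row, std::strlen(row));
        hdfsFlush(fs, out);

        hdfsCloseFile(fs, out);
        hdfsDisconnect(fs);
        return 0;
    }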

Databases are mapped as separate directories, with their tables mapped as sub-directories within the Hive data warehouse directory. Data inserted into each table is written into text files (named datafile1.txt) in Hive / HDFS. By default the data is written in comma-separated format; any other delimiter can be configured via command-line arguments.
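The resulting layout looks roughly like this, assuming the default warehouse path /user/hive/warehouse and a hypothetical database db1 with tables t1 and t2 (the <dbname>.db directory suffix follows Hive's usual convention):

    /user/hive/warehouse/
        db1.db/                  <- MySQL database "db1"
            t1/                  <- table "t1"
                datafile1.txt    <- inserted rows, e.g. 101,John,2013-07-25
            t2/                  <- table "t2"
                datafile1.txt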


Figure 2: Mapping between MySQL and HDFS Schema

The installation, configuration and implementation are discussed in detail in the Hadoop Applier blog. Integration with Hive is also documented.

We also have a Hadoop Applier video demo which shows the integration.

In this first version only WRITE_ROW_EVENTS are supported, i.e. only insert statements are replicated. Deletes, updates, and DDL statements may be handled in future releases. It would be great to get your requirements - please use the comments section in the Hadoop Applier blog.
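For a sense of how this event filtering works, here is a minimal sketch modeled on the examples shipped with the MySQL Binlog API (mysql-replication-listener), on which the Applier builds. The connection string is an assumption, and exact signatures and error codes may differ between library versions:

    #include "binlog_api.h"   // MySQL Binlog API (mysql-replication-listener)

    int main() {
        // Connect to the master; user, host, and port are illustrative.
        mysql::Binary_log binlog(
            mysql::system::create_transport("mysql://reader@127.0.0.1:3306"));
        binlog.connect();

        mysql::Binary_log_event *event;
        while (binlog.wait_for_next_event(&event) == 0) {  // 0 is ERR_OK
            // The library's event enum names row inserts WRITE_ROWS_EVENT;
            // these are the only events applied in this first version.
            if (event->get_event_type() == mysql::WRITE_ROWS_EVENT) {
                // decode fields and append to HDFS (see the earlier sketch)
            }
            delete event;
        }
        return 0;
    }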

Summary

The Hadoop Applier is a major step forward in providing real-time integration between MySQL and Hadoop.

With the growth in big data projects and Hadoop adoption, it would be great to get your feedback on how we can further develop the Applier to meet your real-time integration needs.

Key Resources

Download the Hadoop Applier from MySQL Labs (select "Hadoop Applier" from the drop-down menu)

[1] http://www.intel.com/content/www/xa/en/big-data/data-insights-peer-research-report.html
