Java ETL process

I have this new challenge: load ~100M rows from an Oracle database and insert them into a remote MySQL database server.

I've divided the problem into two parts:

a server-side REST service responsible for loading data into the MySQL server;

a client-side Java app responsible for reading from the Oracle data source.

On the Java side I've used plain JDBC to load paginated content and transfer it over the wire to the server. This approach works, but it makes the code cumbersome and not very scalable, as I'm doing the pagination myself using Oracle's ROWNUM (WHERE ROWNUM > x AND ROWNUM < y).
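For reference, a minimal sketch of the pagination I'm describing (the table, columns, and Row type are made up for illustration). Note the nested query: a bare WHERE ROWNUM > x never matches in Oracle, so the lower bound has to be applied to an aliased ROWNUM:

```java
import java.sql.*;
import java.util.*;

public class OraclePager {
    record Row(long id, String name) {}  // placeholder row type

    // Fetches one page; the inner query assigns ROWNUM before the
    // lower-bound filter, otherwise "ROWNUM > x" would never be true.
    static List<Row> fetchPage(Connection oracle, long lower, long upper) throws SQLException {
        String sql =
            "SELECT id, name FROM (" +
            "  SELECT t.*, ROWNUM rn FROM (" +
            "    SELECT id, name FROM source_table ORDER BY id" +
            "  ) t WHERE ROWNUM <= ?" +
            ") WHERE rn > ?";
        List<Row> page = new ArrayList<>();
        try (PreparedStatement ps = oracle.prepareStatement(sql)) {
            ps.setLong(1, upper);
            ps.setLong(2, lower);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(new Row(rs.getLong("id"), rs.getString("name")));
                }
            }
        }
        return page;
    }
}
```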

I've now tried Hibernate's StatelessSession with my entities mapped via annotations. The code is much more readable and clean, but the performance is worse.
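For comparison, the StatelessSession variant looks roughly like this (SourceRow stands in for one of my annotated entities; the ScrollableResults API shown is the Hibernate 5 style):

```java
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

public class StatelessReader {
    // Streams entities without a first-level cache or dirty checking,
    // so memory stays flat even for very large result sets.
    void streamAll(SessionFactory sessionFactory) {
        StatelessSession session = sessionFactory.openStatelessSession();
        try {
            ScrollableResults results = session
                .createQuery("from SourceRow order by id")
                .setFetchSize(1_000)
                .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                SourceRow row = (SourceRow) results.get(0);
                // ... transfer row over the wire ...
            }
            results.close();
        } finally {
            session.close();
        }
    }
}
```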

I've heard of ETL tools and Spring Batch, but I don't know them very well.

Are there other approaches to this problem?

Thanks in advance.

UPDATE

Thank you for the invaluable suggestions.

I've opted for Spring Batch to load data from the Oracle database, because the environment is pretty locked down and I don't have access to Oracle's toolset. Spring Batch is tried and true.

For the data-writing step I opted to write chunks of records using MySQL's LOAD DATA INFILE, as you all suggested. REST services sit in the middle, since the two databases are hidden from each other for security reasons.
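For what it's worth, the reading side can stay as simple as this minimal sketch (table, columns, and the SourceRow type are invented; the builder API is Spring Batch 4+). A single cursor replaces the manual ROWNUM pagination, with the fetch size controlling how much travels per round trip:

```java
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;

public class OracleReaderConfig {
    record SourceRow(long id, String name) {}  // placeholder row type

    // Streams rows from Oracle over one cursor; Spring Batch hands the
    // results to the writer in configurable chunks.
    public JdbcCursorItemReader<SourceRow> oracleReader(DataSource oracleDataSource) {
        return new JdbcCursorItemReaderBuilder<SourceRow>()
            .name("oracleReader")
            .dataSource(oracleDataSource)
            .sql("SELECT id, name FROM source_table ORDER BY id")
            .fetchSize(10_000)
            .rowMapper((rs, rowNum) -> new SourceRow(rs.getLong("id"), rs.getString("name")))
            .build();
    }
}
```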

Solution

100M rows is quite a lot. You can design this in plenty of ways: REST servers, JDBC reading, Spring Batch, Spring Integration, Hibernate, ETL tools. But the bottom line is: time.

No matter what architecture you choose, you eventually have to perform these INSERTs into MySQL. Your mileage may vary, but just to give you an order of magnitude: at 2K inserts per second, 100M rows take 100,000,000 / 2,000 = 50,000 seconds, i.e. roughly half a day (source).

According to the same source, LOAD DATA INFILE can handle around 25K inserts per second: roughly 10x more, which cuts the job to about an hour (100,000,000 / 25,000 ≈ 4,000 seconds).

That being said, with this amount of data I would suggest:

dump the Oracle table using native Oracle tools that produce human-readable output (or machine-readable, as long as you can parse it)

parse the dump file with the fastest tools you can find; maybe grep/sed/gawk/cut will be enough?

generate a target file compatible with MySQL's LOAD DATA INFILE (it is very configurable)

import the file into MySQL using the aforementioned command (see the sketch after this list)
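A minimal sketch of that last step from Java via MySQL Connector/J (the host, credentials, file path, and table layout are placeholders; recent Connector/J versions also require allowLoadLocalInfile=true on the JDBC URL for the LOCAL variant):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://target-host:3306/etl?allowLoadLocalInfile=true";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
             Statement stmt = conn.createStatement()) {
            // One statement loads a whole CSV chunk server-side, an order
            // of magnitude faster than row-by-row INSERTs.
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/data/chunk_0001.csv' " +
                "INTO TABLE target_table " +
                "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n' " +
                "(id, name, created_at)");
        }
    }
}
```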

Of course you can do all of this in Java with nice, readable, unit-tested and versioned code. But with this amount of data you need to be pragmatic.

That covers the initial load. After that, Spring Batch will probably be a good choice. If you can, connect your application directly to both databases; again, this will be faster. On the other hand, this might not be possible for security reasons.

If you want to be very flexible and not tie yourself to the databases directly, expose both the input (Oracle) and the output (MySQL) behind web services (REST is fine as well). Spring Integration will help you a lot.
