Impala 是什么不用介绍了,Impala 的作用是什么也不用介绍了,我分享的主要都是干货,
主要是在hadoop平台上,测试 Impala 的性能。如下的代码,都是测试的干货。
准备环境(安装好Impala在linux系统上,),测试数据和创建数据库 :
execute the follwing commands:
a:mkdir test ; cd test
b:wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar
c:java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /tmp/dbgen -b 1 -c 1 -t 1
java -Xms10240m -Xmx10240m -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /home/data -b 0.1 -c 1 -t 1
d:
hadoop fs -mkdir /data/
hadoop fs -put /tmp/dbgen/* /data/
hadoop fs -ls -h -R /data/
2.Create and Populate Impala Tables.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/query-impala-create-tables.html
execute the follwing commands:
a: impala-shell
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/books/';
c:create EXTERNAL TABLE customers( id BIGINT, name STRING, date_of_birth TIMESTAMP, gender STRING, state STRING, email STRING, phone STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';
d:create EXTERNAL TABLE transactions( id BIGINT, customer_id BIGINT, book_id BIGINT, quantity INT, transaction_date TIMESTAMP )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/transactions/';
3.Query Data in Impala
a:
SELECT COUNT(*)
FROM customers
WHERE name = 'Harrison SMITH';
b:
SELECT category, count(*) cnt
FROM books
GROUP BY category
ORDER BY cnt DESC LIMIT 10;
c:
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
FROM books JOIN [SHUFFLE] transactions ON (
transactions.book_id = books.id
AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
)
GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;
d:
SELECT tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
FROM books
JOIN [SHUFFLE] transactions ON (
transactions.book_id = books.id
)
JOIN [SHUFFLE] customers ON (
transactions.customer_id = customers.id
AND customers.state IN ('WA', 'CA', 'NY')
)
GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;
主要来自:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-what-is-emr.html