Apache Drill is an engine that can connect to many different data sources and provide a SQL interface to them. It's not some half-hearted SQL interface that falls over on anything complex, but a fully functional one, including support for many built-in functions as well as window functions. While it can connect to standard data sources that you'd be able to query with SQL anyway, such as Oracle or MySQL, it can also work with flat files such as CSV or JSON, as well as Avro and Parquet formats. It is this ability to run SQL against files that first sparked my interest in Apache Drill. I've spent a lot of time working with big data architectures and tools, including Big Data Discovery. One of the gaps I found when trialling data pipeline options was the ability to dig into files in their raw state, before bringing them into something like Hive, which would then make them available through BDD and other tools.
In this article I'm going to walk through getting started with Apache Drill, and show some examples of what I think makes it so useful.
Getting Started
Getting started with Apache Drill is really simple – just download, unzip, and run. While it can be run distributed across machines for performance, it can also run standalone on a laptop.
To start it up:
cd /opt/apache-drill-1.7.0/
bin/sqlline -u jdbc:drill:zk=local
If you get errors such as No current connection or com.fasterxml.jackson.databind.JavaType.isReferenceType()Z, then you have a conflicting-JAR problem (I hit this on the Oracle BigDataLite VM, for example) and should launch it in a clean environment:
env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /opt/apache-drill-1.7.0/bin/drill-embedded
There's a built-in dataset that you can use for testing:
USE cp;
SELECT employee_id, first_name FROM `employee.json` limit 5;
If you're used to SQL*Plus and similar tools, this will be a pretty familiar environment; it should return five rows:
0: jdbc:drill:zk=local> USE cp;
+-------+---------------------------------+
| ok | summary |
+-------+---------------------------------+
| true | Default schema changed to [cp] |
+-------+---------------------------------+
1 row selected (1.776 seconds)
0: jdbc:drill:zk=local> SELECT employee_id, first_name FROM `employee.json` limit 5;
+--------------+-------------+
| employee_id | first_name |
+--------------+-------------+
| 1 | Sheri |
| 2 | Derrick |
| 4 | Michael |
| 5 | Maya |
| 6 | Roberta |
+--------------+-------------+
5 rows selected (3.624 seconds)
So far, so SQL, so relational – so familiar, really. Where Apache Drill starts to deviate from the obvious is its use of storage handlers. In the query above, cp is the "database" we're running the query against, but this is actually the "classpath" (hence "cp") storage handler that is defined by default. Within a "database" there are "schemas", which are sub-configurations of the storage handler. We'll look at viewing and defining these later on. For now, it's useful to know that you can also list the available databases:
0: jdbc:drill:zk=local> show databases;
+---------------------+
| SCHEMA_NAME |
+---------------------+
| INFORMATION_SCHEMA |
| cp.default |
| dfs.default |
| dfs.root |
| dfs.tmp |
| sys |
+---------------------+
Note that databases in this command is a synonym for schemas; what both return is <database>.<schema>. In Apache Drill the backtick is used to enclose identifiers (such as schema names, column names, and so on), and it's quite particular about this. For example, this is valid:
0: jdbc:drill:zk=local> USE `cp.default`;
+-------+-----------------------------------------+
| ok | summary |
+-------+-----------------------------------------+
| true | Default schema changed to [cp.default] |
+-------+-----------------------------------------+
1 row selected (0.171 seconds)
while this isn't:
0: jdbc:drill:zk=local> USE cp.default;
Error: PARSE ERROR: Encountered ". default" at line 1, column 7.
Was expecting one of:
<EOF>
"." <IDENTIFIER> ...
"." <QUOTED_IDENTIFIER> ...
"." <BACK_QUOTED_IDENTIFIER> ...
"." <BRACKET_QUOTED_IDENTIFIER> ...
"." <UNICODE_QUOTED_IDENTIFIER> ...
"." "*" ...
SQL Query USE cp.default
This is because default is a reserved word, and hence must be quoted. Because of that, you can also use
0: jdbc:drill:zk=local> use cp.`default`;
but not
0: jdbc:drill:zk=local> use `cp`.default;
Querying JSON Data
There are some useful tutorials on the Apache Drill website, including one working with data provided by Yelp. This was the dataset that originally motivated my looking at Drill, since I was using it as an input for Big Data Discovery (BDD) but was struggling on two fronts. The first was how best to define a suitable Hive table over it so that it could be ingested into BDD. Following on from that, how long would it take me to iterate on the way I exposed the data in Hive as we tried to understand what value the data might hold? The examples below show the kind of complexity that complex JSON can throw up when it's queried in tabular form.
To start with, we query the JSON file directly, and the schema is inferred automatically. Pretty cool.
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` limit 5;
+---------+------+-------------+-------+------+------+
| user_id | text | business_id | likes | date | type |
+---------+------+-------------+-------+------+------+
| -6rEfobYjMxpUWLNxszaxQ | Don't waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip |
| EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip |
| xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip |
| QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip |
| MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip |
+---------+------+-------------+-------+------+------+
5 rows selected (2.341 seconds)
We can use standard SQL aggregations such as COUNT:
0: jdbc:drill:zk=local> select count(*) from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json`;
+---------+
| EXPR$0 |
+---------+
| 591864 |
+---------+
1 row selected (4.495 seconds)
as well as GROUP BY operations:
0: jdbc:drill:zk=local> select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
+-------------+------------+
| date | tip_count |
+-------------+------------+
| 2012-07-21 | 719 |
| 2012-05-19 | 718 |
| 2012-08-04 | 699 |
| 2012-06-23 | 690 |
| 2012-07-28 | 682 |
+-------------+------------+
5 rows selected (7.111 seconds)
Digging around in the data a bit, we can see that it isn't entirely flat – note, for example, the hours column, which is a nested JSON object:
0: jdbc:drill:zk=local> select full_address,city,hours from `/user/oracle/incoming/yelp/business_json` b limit 5;
+--------------+------+-------+
| full_address | city | hours |
+--------------+------+-------+
| 4734 Lebanon Church Rd
Dravosburg, PA 15034 | Dravosburg | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
| 202 McClure St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1 Ravine St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1530 Hamilton Rd
Bethel Park, PA 15234 | Bethel Park | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 301 South Hills Village
Pittsburgh, PA 15241 | Pittsburgh | {"Friday":{"close":"17:00","open":"10:00"},"Tuesday":{"close":"21:00","open":"10:00"},"Thursday":{"close":"17:00","open":"10:00"},"Wednesday":{"close":"21:00","open":"10:00"},"Monday":{"close":"21:00","open":"10:00"},"Sunday":{"close":"18:00","open":"11:00"},"Saturday":{"close":"21:00","open":"10:00"}} |
+--------------+------+-------+
5 rows selected (0.721 seconds)
0: jdbc:drill:zk=local>
With Apache Drill we can simply use dot notation to access nested values. When doing so, the table must be aliased (b in this example):
0: jdbc:drill:zk=local> select b.hours from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-------+
| hours |
+-------+
| {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
+-------+
Nested objects can themselves be nested – not a problem for Apache Drill; we just chain the dot notation further:
0: jdbc:drill:zk=local> select b.hours.Friday from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-----------------------------------+
| EXPR$0 |
+-----------------------------------+
| {"close":"21:00","open":"11:00"} |
+-----------------------------------+
1 row selected (0.238 seconds)
Note the use of backticks (`) to quote the reserved keywords open and close:
0: jdbc:drill:zk=local> select b.hours.Friday.`open`,b.hours.Friday.`close` from `/user/oracle/incoming/yelp/business_json` b limit 1;
+---------+---------+
| EXPR$0 | EXPR$1 |
+---------+---------+
| 11:00 | 21:00 |
+---------+---------+
1 row selected (0.58 seconds)
Nested columns are proper objects in their own right in a query, and can also be used as predicates:
0: jdbc:drill:zk=local> select b.name,b.full_address,b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
| name | full_address | EXPR$2 |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie | 4734 Lebanon Church Rd
Dravosburg, PA 15034 | 11:00 |
| Alexion's Bar & Grill | 141 Hawthorne St
Greentree
Carnegie, PA 15106 | 11:00 |
| Rocky's Lounge | 1201 Washington Ave
Carnegie, PA 15106 | 11:00 |
| Papa J's | 200 E Main St
Carnegie
Carnegie, PA 15106 | 11:00 |
| Italian Village Pizza | 2615 Main St
Homestead, PA 15120 | 11:00 |
+------------------------+------------------------------------------------+---------+
5 rows selected (0.404 seconds)
You'll notice in the above output that the full_address field contains line breaks – we can use a SQL function to replace those with commas:
0: jdbc:drill:zk=local> select b.name,regexp_replace(b.full_address,'\n',','),b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
| name | EXPR$1 | EXPR$2 |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie | 4734 Lebanon Church Rd,Dravosburg, PA 15034 | 11:00 |
| Alexion's Bar & Grill | 141 Hawthorne St,Greentree,Carnegie, PA 15106 | 11:00 |
| Rocky's Lounge | 1201 Washington Ave,Carnegie, PA 15106 | 11:00 |
| Papa J's | 200 E Main St,Carnegie,Carnegie, PA 15106 | 11:00 |
| Italian Village Pizza | 2615 Main St,Homestead, PA 15120 | 11:00 |
+------------------------+------------------------------------------------+---------+
5 rows selected (1.346 seconds)
Query Federation
So Apache Drill enables you to run SQL queries against data in a multitude of formats and locations, which is pretty useful in itself. But better than that, it lets you federate these sources within a single query. Here's an example of a join between data in HDFS and data in Oracle:
0: jdbc:drill:zk=local> select X.text,
. . . . . . . . . . . > Y.NAME
. . . . . . . . . . . > from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X
. . . . . . . . . . . > inner join ora.MOVIEDEMO.YELP_BUSINESS Y
. . . . . . . . . . . > on X.business_id = Y.BUSINESS_ID
. . . . . . . . . . . > where Y.NAME = 'Chick-fil-A'
. . . . . . . . . . . > limit 5;
+--------------------------------------------------------------------+--------------+
| text | NAME |
+--------------------------------------------------------------------+--------------+
| It's daddy daughter date night here and they go ALL OUT! | Chick-fil-A |
| Chicken minis! The best part of waking up Saturday mornings. :) | Chick-fil-A |
| Nice folks as always unlike those ghetto joints | Chick-fil-A |
| Great clean and delicious chicken sandwiches! | Chick-fil-A |
| Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW! | Chick-fil-A |
+--------------------------------------------------------------------+--------------+
5 rows selected (3.234 seconds)
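The ora database in this query is a JDBC storage plugin defined beforehand. As a sketch, such a plugin definition looks something like the following – the driver, URL, and username values here are placeholders for illustration, not taken from the environment above:

```json
{
  "type": "jdbc",
  "driver": "oracle.jdbc.OracleDriver",
  "url": "jdbc:oracle:thin:@localhost:1521/orcl",
  "username": "moviedemo",
  "password": "...",
  "enabled": true
}
```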
You can define a view over this:
0: jdbc:drill:zk=local> create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID ;
+-------+-------------------------------------------------------------+
| ok | summary |
+-------+-------------------------------------------------------------+
| true | View 'yelp_tips' replaced successfully in 'dfs.tmp' schema |
+-------+-------------------------------------------------------------+
1 row selected (0.574 seconds)
0: jdbc:drill:zk=local> describe dfs.tmp.yelp_tips;
+----------------+--------------------+--------------+
| COLUMN_NAME | DATA_TYPE | IS_NULLABLE |
+----------------+--------------------+--------------+
| tip_text | ANY | YES |
| business_name | CHARACTER VARYING | YES |
+----------------+--------------------+--------------+
2 rows selected (0.756 seconds)
and then query it as you would any regular object:
0: jdbc:drill:zk=local> select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5;
+------+------+
| tip_text | business_name |
+------+------+
| Great drink specials! | Alexion's Bar & Grill |
| Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexion's Bar & Grill |
| Pretty quiet here... | Uno Pizzeria & Grill |
| I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat) the broccoli cheddar soup is delicious. | Uno Pizzeria & Grill |
| Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria & Grill |
+------+------+
5 rows selected (3.272 seconds)
Querying Twitter JSON Data
Here's an example of using Drill to query a local file holding some Twitter data. You can download the file from here if you'd like to try querying it yourself.
First I switch to using the dfs storage plugin:
0: jdbc:drill:zk=local> use dfs;
+-------+----------------------------------+
| ok | summary |
+-------+----------------------------------+
| true | Default schema changed to [dfs] |
+-------+----------------------------------+
and then try a select against the file. Note the limit 5 clause – very useful when you're just examining the structure of a file.
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/twitter/geo_tweets.json` limit 5;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entries
File /user/oracle/incoming/twitter/geo_tweets.json
Record 2819
Column 3503
Fragment 0:0
An error? That's not supposed to happen. I've got a JSON file, right? It turns out the JSON file holds one complete JSON object per line – except it doesn't, on the very last record. Note the record count of 2819 given in the error above:
[oracle@bigdatalite ~]$ wc -l geo_tweets.json
2818 geo_tweets.json
So the file only has 2818 complete lines. Hmm. Let's take a look at that record, using a head/tail bash combo:
[oracle@bigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1
{"created_at":"Sun Jul 24 21:00:44 +0000 2016","id":757319630432067584,"id_str":"757319630432067584","text":"And now @HillaryClinton hires @DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica https://t.co/8jAGUu6w2f","source":"TweetCaster for iOS","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2170786369,"id_str":"2170786369","name":"Patricia Weber","screen_name":"InnieBabyBoomer","location":"Williamsburg, VA","url":"http://lovesrantsandraves.blogspot.com/","description":"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, ♥ Books, Cars, Ferrari, F1 Race♥ #tcot","protected":false,"verified":false,"followers_count":861,"friends_count":918,"listed_count":22,"favourites_count":17,"statuses_count":2363,"created_at":"Sat Nov 02 19:13:06 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000107659131/3589f
That's the complete data in the file – so Drill is right, the JSON is broken. If we delete that last record, create a new file (geo_tweets.fixed.json), and query it again, we get something!
0: jdbc:drill:zk=local> select text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------+
| text |
+------+
| Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. |
| Obama: "We must stand together and stop terrorism"
Trump: "We don't want these people in our country"
|
| Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
+------+
5 rows selected (0.246 seconds)
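As an aside, this kind of malformed JSON-lines file can be caught before Drill ever sees it. A minimal line-by-line validator looks like this (plain Python; the function name and approach are my own, not part of Drill):

```python
import json

def find_bad_records(path):
    """Return (line_number, error_message) for each line that isn't a complete JSON object."""
    bad = []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                bad.append((lineno, str(e)))
    return bad
```

Run against geo_tweets.json, this would flag line 2819 – the truncated record that Drill complained about.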
text here is one of the JSON fields. I could do a select * but it's not so intelligible:
0: jdbc:drill:zk=local> select * from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | @version | @timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status |
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| Fri Jul 22 19:37:11 +0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | dlvr.it | false | {"id":67898674,"id_str":"67898674","name":"Vancouver Press","screen_name":"Vancouver_CP","location":"Vancouver, BC","url":"http://vancouver.cityandpress.com/","description":"Latest news from Vancouver. Updates are frequent.","protected":false,"verified":false,"followers_count":807,"friends_count":13,"listed_count":94,"favourites_count":1,"statuses_count":131010,"created_at":"Sat Aug 22 14:25:37 +0000 2009","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"8A1C3B","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"2A2C31","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/67898674/1411821103","default_profile":false,"default_profile_image":false} | {"type":"Point","coordinates":[49.2814375,-123.12109067]} | {"type":"Point","coordinates":[-123.12109067,49.2814375]} | {"id":"1e5cb4d0509db554","url":"https://api.twitter.com/1.1/geo/id/1e5cb4d0509db554.json","place_type":"city","name":"Vancouver","full_name":"Vancouver, British Columbia","country_code":"CA","country":"Canada","bounding_box":{"type":"Polygon","coordinates":[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},"attributes":{}} | false | 
0 | 0 | {"urls":[{"url":"https://t.co/joI9GMfRim","expanded_url":"http://toplocalnow.com/ca/vancouver?section=trends","display_url":"toplocalnow.com/ca/vancouver?s…","indices":[70,93]}],"hashtags":[],"user_mentions":[],"media":[],"symbols":[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 +0000 2009 | Vancouver | Canada | | ["toplocalnow.com/ca/vancouver?s…"] | toplocalnow.com/ca/vancouver?s… | | -123.12109067 | 49.2814375 | [] | {"media":[]} | [] | null | null | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"hashtags":[],"urls":[]},"extended_entities":{"media":[]},"quoted_status":{"user":{},"entities":{"hashtags":[],"user_mentions":[],"media":[],"urls":[]},"extended_entities":{"media":[]}}} | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"urls":[],"hashtags":[]},"extended_entities":{"media":[]},"place":{"bounding_box":{"coordinates":[]},"attributes":{}},"geo":{"coordinates":[]},"coordinates":{"coordinates":[]}} |
Within the Twitter data there are root-level fields, such as text, as well as nested ones, such as the information about the tweeter in the user field. As we saw above, you reference nested fields using dot notation. Now is a good time to point out a couple of common mistakes you may hit. The first is failing to quote reserved words, which is the first thing to check for if you get an error such as Encountered ".":
0: jdbc:drill:zk=local> select user.screen_name,text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Error: PARSE ERROR: Encountered "." at line 1, column 12.
[...]
The second is not declaring a table alias when using dot notation – if you don't, Apache Drill thinks the parent column is actually a table name (VALIDATION ERROR: [...] Table 'user' not found):
0: jdbc:drill:zk=local> select `user`.screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Aug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found
Aug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found
Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not found
SQL Query null
[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)
With those mistakes fixed, we can see the users' screen names:
0: jdbc:drill:zk=local> select tweets.`user`.`screen_name` as user_screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 2;
+------------------+------+
| user_screen_name | text |
+------------------+------+
| Vancouver_CP | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
+------------------+------+
2 rows selected (0.256 seconds)
0: jdbc:drill:zk=local>
As well as nested objects, JSON also supports arrays. An example of this in the Twitter data is hashtags, or URLs: in a given tweet there can be zero, one, or many of each.
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+--------+
| EXPR$0 |
+--------+
| [] |
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [] |
| [] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
+--------+
5 rows selected (0.286 seconds)
With the FLATTEN function, each array entry becomes a new row, thus:
0: jdbc:drill:zk=local> select flatten(tweets.entities.hashtags) from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+----------------------------------------------+
| EXPR$0 |
+----------------------------------------------+
| {"text":"hiring","indices":[6,13]} |
| {"text":"Job","indices":[98,102]} |
| {"text":"SkilledTrade","indices":[103,116]} |
| {"text":"Tucson","indices":[117,124]} |
| {"text":"Jobs","indices":[129,134]} |
+----------------------------------------------+
5 rows selected (0.139 seconds)
Note that the limit 5 clause is only showing the first five array instances, which are in fact just the hashtags of the first tweet in the list above.
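Conceptually, what FLATTEN does can be sketched in a few lines of plain Python – this is just to illustrate the semantics, and the sample data loosely mirrors the tweets above:

```python
# FLATTEN: each element of a nested array becomes its own output row,
# with the other selected fields repeated alongside it.
def flatten_rows(rows, array_field):
    out = []
    for row in rows:
        for element in row.get(array_field, []):
            new_row = dict(row)
            new_row[array_field] = element
            out.append(new_row)
    return out

tweets = [
    {"user": "tmj_TUC_skltrd", "hashtags": [{"text": "hiring"}, {"text": "Job"}]},
    {"user": "someone_else", "hashtags": []},
]
# The first tweet produces two rows, one per hashtag; the second,
# having no hashtags, produces no rows at all.
```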
To access the text of the hashtag we use a subquery and dot notation to reach the text field:
0: jdbc:drill:zk=local> select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5;
+---------------+
| EXPR$0 |
+---------------+
| hiring |
| Job |
| SkilledTrade |
| Tucson |
| Jobs |
+---------------+
5 rows selected (0.168 seconds)
The same result can be achieved using a Common Table Expression (CTE, also known as subquery factoring) for improved readability:
0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text from ent_hashtags
. . . . . . . . . . . > limit 5;
+---------------+
| EXPR$0 |
+---------------+
| hiring |
| Job |
| SkilledTrade |
| Tucson |
| Jobs |
+---------------+
5 rows selected (0.253 seconds)
Combining the flattened array with existing fields enables us to see things like a list of tweets with each of their associated hashtags:
0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | lol |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | nowthatsfunny |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | WinstonSalem |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | ws |
| trendinaliaSG | 6. Hit The Stage
7. TTTT
8. Demi Lovato
9. Beijing
10. Donald Trump
2016/7/23 03:36 SGT #trndnl https://t.co/psP0GzBgZB | trndnl |
+------------------+------+---------+
10 rows selected (0.166 seconds)
We can also filter based on the hashtag:
0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_VAL_health | Want to work at Genesis Rehab Services? We're #hiring in #Clinton, MD! Click for details: https://t.co/4lt7I4gMZk #Job #Healthcare #Jobs | Job |
| tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https://t.co/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job |
| tmj_la_hrta | Want to work at SONIC Drive-In? We're #hiring in #Clinton, LA! Click for details: https://t.co/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job |
| tmj_ia_hrta | We're #hiring! Click to apply: Department Manager - https://t.co/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job |
+------------------+------+---------+
5 rows selected (0.207 seconds)
as well as summarise hashtag counts:
0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags
. . . . . . . . . . . > group by ent_hashtags.hashtags.text
. . . . . . . . . . . > order by 2 desc;
+-----------------------------+---------+
| EXPR$0 | EXPR$1 |
+-----------------------------+---------+
| Trump | 365 |
| trndnl | 176 |
| job | 170 |
| Hiring | 127 |
| Clinton | 108 |
| Yorkshire | 100 |
| CareerArc | 100 |
[...]
To filter out records that may not have the array value (not every tweet has hashtags, for example), and without which the query would otherwise fail, use IS NOT NULL against an attribute of the first index of the array:
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+--------+
| EXPR$0 |
+--------+
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
| [{"text":"WinstonSalem","indices":[0,13]},{"text":"ws","indices":[92,95]}] |
| [{"text":"trndnl","indices":[89,96]}] |
| [{"text":"trndnl","indices":[92,99]}] |
+--------+
5 rows selected (0.187 seconds)
If you try to compare the array itself, it doesn't work:
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5;
Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors:
Error in expression at index -1. Error: Missing function implementation: [isnotnull(MAP-REPEATED)]. Full expression: --UNKNOWN EXPRESSION--..
Fragment 0:0
[Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0)
The above example also demonstrates array indexing, which is an alternative to FLATTEN for accessing individual objects in the array, if you know they're going to be present:
0: jdbc:drill:zk=local> select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+---------------+------+
| first_hashtag | text |
+---------------+------+
| hiring | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| lol | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
| WinstonSalem | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 |
Querying CSV Files
JSON files are relatively easy to interpret, because they carry a semi-defined schema within them, including the column names. CSV (and character-delimited files in general), on the other hand, are much more of a wild west when it comes to reliably inferring column names. You can configure Apache Drill to ignore the first line of a CSV file (on the assumption that it's a header) if you want to, or to use it as the column names. If you don't, and you query a CSV file that looks like this:
[oracle@bigdatalite ~]$ head nyc_parking_violations.csv
Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,
you get back records that are each just one column wide, holding an array:
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` LIMIT 5;
+---------+
| columns |
+---------+
| ["Summons Number","Plate ID","Registration State","Plate Type","Issue Date","Violation Code","Vehicle Body Type","Vehicle Make","Issuing Agency","Street Code1","Street Code2","Street Code3","Vehicle Expiration Date","Violation Location","Violation Precinct","Issuer Precinct","Issuer Code","Issuer Command","Issuer Squad","Violation Time","Time First Observed","Violation County","Violation In Front Of Or Opposite","House Number","Street Name","Intersecting Street","Date First Observed","Law Section","Sub Division","Violation Legal Code","Days Parking In Effect ","From Hours In Effect","To Hours In Effect","Vehicle Color","Unregistered Vehicle?","Vehicle Year","Meter Number","Feet From Curb","Violation Post Code","Violation Description","No Standing or Stopping Violation","Hydrant Violation","Double Parking Violation"] |
| ["1360858775","PHW9801","OH","PAS","07/01/2015","20","SUBN","HONDA","P","61490","26160","26190","0","0044","44","44","929822","0044","0000","0653P","","BX","O","651","RIVER AVE","","0","408","D","","BBBBBBB","ALL","ALL","","0","0","-","0","","","","",""] |
To access the actual columns in the CSV file, you need to reference them using the columns[x] syntax. Note that `columns` is case-sensitive, and the numbering is zero-based:
0: jdbc:drill:zk=local> select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` limit 5;
+----------+--------------------+
| PlateID | RegistrationState |
+----------+--------------------+
| AR877A | NJ |
| 73268ME | NY |
| 2050240 | IN |
| 2250017 | IN |
| AH524C | NJ |
+----------+--------------------+
5 rows selected (0.247 seconds)
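Under the hood, the `columns` array is essentially the result of splitting each line on the delimiter, so `columns[x]` is plain zero-based list indexing. A small Python sketch of that mapping, using a shortened (illustrative) version of the header and data row shown above:

```python
import csv
import io

# A shortened version of the NYC parking header and one data row (illustrative only)
sample = (
    "Summons Number,Plate ID,Registration State,Plate Type\n"
    "1360858775,PHW9801,OH,PAS\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1]

# Drill's columns[x] is zero-based, so columns[1] is the second field
assert header[1] == "Plate ID"
assert data[1] == "PHW9801"   # what columns[1] returns for this row
assert data[2] == "OH"        # what columns[2] returns for this row
```

This is why `columns[1]` in the query above yields the Plate ID, even though it is the second field in the file.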
To make it easier to work with the data repeatedly, you can define a view over it:
0: jdbc:drill:zk=local> create view dfs.tmp.NYC_Parking_01 as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-------+-----------------------------------------------------------------+
| ok | summary |
+-------+-----------------------------------------------------------------+
| true | View 'NYC_Parking_01' created successfully in 'dfs.tmp' schema |
+-------+-----------------------------------------------------------------+
1 row selected (0.304 seconds)
This uses the `dfs` storage plugin and the `tmp` schema within it, which has the following storage configuration; note that `writable` is true:
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
(If you use the wrong database [storage plugin] or schema, you'll get `Schema [hdfs] is immutable.`)
Querying the new view:
0: jdbc:drill:zk=local> select * from dfs.tmp.NYC_Parking_01 limit 5;
+-----------+---------------------+
| PlateID | RegistrationState |
+-----------+---------------------+
| Plate ID | Registration State |
| PHW9801 | OH |
| K8010F | TN |
| GFG6211 | NY |
| GHL1805 | NY |
+-----------+---------------------+
5 rows selected (0.191 seconds)
Through the view, or directly against the CSV path, you can also run aggregations:
0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID | EXPR$1 |
+----------+---------+
| 2050240 | 4 |
+----------+---------+
1 row selected (15.983 seconds)
Although this isn't rerunnable for the same result, probably because of the `limit` clause without an accompanying ORDER BY:
0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID | EXPR$1 |
+----------+---------+
| AR877A | 3 |
+----------+---------+
1 row selected (12.881 seconds)
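The GROUP BY ... HAVING count(*) > 1 pattern in these queries is just occurrence counting with a filter. A rough Python analogue of what Drill computes here, using a handful of made-up plate IDs rather than the real data:

```python
from collections import Counter

# Hypothetical sample of plate IDs, standing in for the PlateID column
plates = ["2050240", "AR877A", "2050240", "73268ME", "2050240", "AR877A", "2050240"]

counts = Counter(plates)
# HAVING count(*) > 1 keeps only plates seen more than once
repeats = {plate: n for plate, n in counts.items() if n > 1}
assert repeats == {"2050240": 4, "AR877A": 2}
```

Without an ORDER BY, any one of the qualifying plates may come back first under LIMIT 1, which matches the varying results seen above.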
Under the covers, the view definition is written to `/tmp`; you'll want to amend this path if you want the view to persist across reboots:
[oracle@bigdatalite parking]$ cat /tmp/NYC_Parking_01.view.drill
{
"name" : "NYC_Parking_01",
"sql" : "SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\nFROM `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`",
"fields" : [ {
"name" : "PlateID",
"type" : "ANY",
"isNullable" : true
}, {
"name" : "RegistrationState",
"type" : "ANY",
"isNullable" : true
} ],
"workspaceSchemaPath" : [ "hdfs" ]
You can also create actual tables using CTAS (Create Table As Select):
0: jdbc:drill:zk=local> create table dfs.tmp.parking as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 1_1 | 4471875 |
| 1_0 | 4788421 |
+-----------+----------------------------+
2 rows selected (42.913 seconds)
This is stored on disk (as defined by the `dfs` configuration), by default in Parquet format:
[oracle@bigdatalite parking]$ ls -l /tmp/parking/
total 76508
-rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet
-rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquet
Drill's Web Interface
Drill comes with a web interface, which you can access at http://<host>:8047/.
Defining storage plugins
From the Drill web interface you can view existing storage plugins, or define new ones. To create a new one, enter its name (e.g. `hdfs`, though it could just as well be `fred`; it's just a label) under New Storage Plugin on the Storage page, and click Create. Paste the necessary JSON definition into the Configuration box, and then click Create. If you don't want to use the GUI, there's also a REST API.
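The REST API alternative takes the same JSON definition as the GUI. A sketch of building the request body (the POST /storage/<name>.json endpoint shape is from my recollection of the Drill docs, so treat it as an assumption; no request is actually sent here):

```python
import json

def storage_plugin_payload(name, config):
    """Build the JSON body Drill's REST API expects when creating or updating
    a storage plugin (POST /storage/<name>.json; endpoint shape assumed)."""
    return json.dumps({"name": name, "config": config})

payload = storage_plugin_payload("hdfs", {
    "type": "file",
    "enabled": True,
    "connection": "hdfs://cdh57-01-node-01:8020/",
})

assert json.loads(payload)["name"] == "hdfs"
```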
Storage plugin configurations are stored in Zookeeper (when running Drill distributed), or locally in the `sys.store.provider.local.path` path when running standalone. By default this is under `/tmp`, which gets cleared out at server reboot. To persist custom storage configurations, amend `sys.store.provider.local.path` in `drill-override.conf`, for example:
drill.exec: {
cluster-id: "drillbits1",
zk.connect: "localhost:2181"
sys.store.provider.local.path="/home/oracle/drill/"
}
Working with filesystem data
Here is an example of a storage configuration that enables Drill to access a CDH cluster's HDFS:
{
"type": "file",
"enabled": true,
"connection": "hdfs://cdh57-01-node-01:8020/",
"config": null,
"workspaces": {
"root": {
"location": "/",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"json": {
"type": "json",
"extensions": [
"json"
]
}
}
}
As well as the `connection` parameter itself for HDFS, the important bit in this configuration is the `formats` section. This tells Drill how to handle the files that it finds, without the end user having to explicitly declare their type.
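Conceptually, the `formats` section gives Drill an extension-to-reader lookup table. A toy Python version of that dispatch, mirroring the config above (this is an illustration of the idea, not Drill's actual implementation):

```python
# Mirrors the "formats" section of the storage config above
formats = {
    "csv": {"type": "text", "delimiter": ","},
    "json": {"type": "json"},
}

def reader_for(filename):
    """Pick the format config based on the file extension, as Drill does."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return formats.get(ext)  # None -> no defined format; the query will fail

assert reader_for("violations.csv")["type"] == "text"
assert reader_for("tweets.json")["type"] == "json"
assert reader_for("FlumeData.1466176113171") is None
```

The `None` case corresponds to the "querying data without a recognized extension" problem covered later, where the type has to be declared in the query itself.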
For filesystem-based plugins like `dfs` (which can cover local files, HDFS, and even Amazon S3), you can browse the available "tables":
Listing the files in HDFS (previously selected with `use hdfs;`):
0: jdbc:drill:zk=local> show files;
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
| hbase | true | false | 0 | hbase | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-25 14:46:08.212 |
| share | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-05-15 12:28:08.152 |
| solr | true | false | 0 | solr | solr | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:34:50.716 |
| tmp | true | false | 0 | hdfs | supergroup | rwxrwxrwt | 1969-12-31 19:00:00.0 | 2016-06-24 04:54:41.491 |
| user | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-06-21 15:55:59.084 |
| var | true | false | 0 | hdfs | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-11 17:53:29.804 |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
6 rows selected (0.145 seconds)
Show the files in a given path:
0: jdbc:drill:zk=local> show files in `/user/oracle`;
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
| .Trash | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-23 20:42:34.815 |
| .sparkStaging | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-06 03:56:38.863 |
| .staging | true | false | 0 | oracle | oracle | rwx------ | 1969-12-31 19:00:00.0 | 2016-06-01 18:37:04.005 |
| incoming | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-08-03 05:34:12.38 |
| mediademo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:59:45.653 |
| moviedemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:02:55.652 |
| moviework | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.497 |
| oggdemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.552 |
| oozie-oozi | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.651 |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
9 rows selected (0.428 seconds)
You can also query across multiple files by specifying a wildcard match. Here is a truncated list of the files available:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> show files in `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/`;
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
| name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime |
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
| FlumeData.1466176113171 | false | true | 1055675 | rmoff | rmoff | rw-r--r-- | 2016-08-10 21:28:27.072 | 2016-06-17 16:08:38.023 |
| FlumeData.1466176113172 | false | true | 1051411 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:40.597 |
| FlumeData.1466176113173 | false | true | 1054734 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:43.33 |
| FlumeData.1466176113174 | false | true | 1050991 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.743 | 2016-06-17 16:08:44.361 |
| FlumeData.1466176113175 | false | true | 1053577 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.748 | 2016-06-17 16:08:45.162 |
| FlumeData.1466176113176 | false | true | 1051965 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:46.261 |
| FlumeData.1466176113177 | false | true | 1049555 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:47.425 |
| FlumeData.1466176113178 | false | true | 1050566 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:48.23 |
| FlumeData.1466176113179 | false | true | 1051751 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:49.381 |
| FlumeData.1466176113180 | false | true | 1052249 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:50.042 |
| FlumeData.1466176113181 | false | true | 1055002 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:50.896 |
| FlumeData.1466176113182 | false | true | 1050812 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:52.191 |
| FlumeData.1466176113183 | false | true | 1048954 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:52.994 |
| FlumeData.1466176113184 | false | true | 1051559 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.773 | 2016-06-17 16:08:54.025 |
[...]
Count the number of records in one file (FlumeData.1466176113171):
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json'));
+---------+
| EXPR$0 |
+---------+
| 277 |
+---------+
1 row selected (0.798 seconds)
In several files (FlumeData.146617611317*):
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.146617611317*`(type => 'json'));
+---------+
| EXPR$0 |
+---------+
| 2415 |
+---------+
1 row selected (2.466 seconds)
In all the files in the folder (*):
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/*`(type => 'json'));
+---------+
| EXPR$0 |
+---------+
| 7414 |
+---------+
1 row selected (3.867 seconds)
And even across multiple folders:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/06/*/*`(type => 'json'));
+---------+
| EXPR$0 |
+---------+
| 206793 |
+---------+
1 row selected (87.545 seconds)
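Drill's wildcard matching boils down to globbing for files and summing their record counts. A Python sketch of the same idea over newline-delimited JSON, using throwaway temp files rather than the Flume data above:

```python
import glob
import json
import os
import tempfile

# Write a couple of newline-delimited JSON files, as Flume might produce
tmpdir = tempfile.mkdtemp()
for i, n in enumerate([3, 4]):
    with open(os.path.join(tmpdir, f"FlumeData.{i}"), "w") as f:
        for j in range(n):
            f.write(json.dumps({"text": f"tweet {j}"}) + "\n")

# Equivalent of SELECT count(*) FROM table(`.../FlumeData.*`(type => 'json'))
total = 0
for path in glob.glob(os.path.join(tmpdir, "FlumeData.*")):
    with open(path) as f:
        total += sum(1 for line in f if line.strip())

assert total == 7
```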
Querying data without a recognized extension
Drill relies on the `format` clause of the storage extension configurations in order to determine how to interpret files based on their extension. You won't always have that extension available, or defined. If you try to query such data, you won't get far. In this example I'm querying data on HDFS that's in JSON format but without the `.json` suffix:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171` limit 5;
Error: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs./user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171' not found
SQL Query null
Fear not: you can declare the type as part of the query syntax.
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')) limit 5;
+------+
| text |
+------+
| RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https://t.co/2NvTOShRbI |
| Oracle Java 1Z0-808 Web Exam Simulator https://t.co/tZ3gU8EMj3 |
| @TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha |
| Short impression of yesterday's speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https://t.co/SVK2ovOe3U |
| Want to work at Oracle? We're #hiring in New York! Click for details: https://t.co/NMTo1WMHVw #Sales #Job #Jobs #CareerArc |
+------+
5 rows selected (1.267 seconds)
Storage configuration – Oracle
Per the docs, querying data in an RDBMS such as Oracle is easy. Simply copy the JDBC driver into Apache Drill's jar folder:
cp /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar /opt/apache-drill-1.7.0/jars/3rdparty/
and then add the necessary storage configuration, which I've called ora:
{
"type": "jdbc",
"driver": "oracle.jdbc.OracleDriver",
"url": "jdbc:oracle:thin:moviedemo/welcome1@localhost:1521/ORCL",
"username": null,
"password": null,
"enabled": true
}
If you get the error `Please retry: error (unable to create/ update storage)` then check that the target Oracle database is up, the password is correct, and so on.
You can then query the data:
0: jdbc:drill:zk=local> use ora.MOVIEDEMO;
+-------+--------------------------------------------+
| ok | summary |
+-------+--------------------------------------------+
| true | Default schema changed to [ora.MOVIEDEMO] |
+-------+--------------------------------------------+
1 row selected (0.205 seconds)
0: jdbc:drill:zk=local> show tables;
+----------------+-----------------------------+
| TABLE_SCHEMA | TABLE_NAME |
+----------------+-----------------------------+
| ora.MOVIEDEMO | ACTIVITY |
| ora.MOVIEDEMO | BDS_CUSTOMER_RFM |
| ora.MOVIEDEMO | BUSINESS_REVIEW_SUMMARY |
[...]
0: jdbc:drill:zk=local> select * from ACTIVITY limit 5;
+--------------+---------+
| ACTIVITY_ID | NAME |
+--------------+---------+
| 3.0 | Pause |
| 6.0 | List |
| 7.0 | Search |
| 8.0 | Login |
| 9.0 | Logout |
+--------------+---------+
5 rows selected (1.644 seconds)
If you get the error `DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.` then enable verbose errors in Apache Drill to see what the problem is:
0: jdbc:drill:zk=local> ALTER SESSION SET `exec.errors.verbose` = true;
+-------+-------------------------------+
| ok | summary |
+-------+-------------------------------+
| true | exec.errors.verbose updated. |
+-------+-------------------------------+
1 row selected (0.154 seconds)
0: jdbc:drill:zk=local> select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1;
Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.
sql SELECT *
FROM "MOVIEDEMO"."YELP_BUSINESS"
plugin ora
Fragment 0:0
[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010]
(java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
This was a problem with the external table that Oracle was querying (`ORA-29913: error in executing ODCIEXTTABLEOPEN`). It was actually an Oracle external table on top of a Hive table, which presumably Drill could have queried directly, but we're just sandboxing here…
Query execution
Just as Oracle has its Cost Based Optimizer (CBO) to help it determine how to execute a query, and to do so most efficiently, Apache Drill's execution engine determines how to actually execute the query you give it. This includes how to split it up over multiple nodes ("drillbits") if available, as well as optimisations such as partition pruning in certain cases. You can read more about how query execution works in the documentation, and watch a detailed presentation on it. To see the explain plan for a query, use `explain plan`:
0: jdbc:drill:zk=local> !set maxwidth 10000
0: jdbc:drill:zk=local> explain plan for select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
+------+------+
| text | json |
+------+------+
| 00-00 Screen
00-01 Project(date=[$0], tip_count=[$1])
00-02 SelectionVectorRemover
00-03 Limit(fetch=[5])
00-04 SelectionVectorRemover
00-05 TopN(limit=[5])
00-06 HashAgg(group=[{0}], tip_count=[$SUM0($1)])
00-07 HashAgg(group=[{0}], tip_count=[COUNT()])
00-08 Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json]]])
| {
"head" : {
"version" : 1,
"generator" : {
"type" : "ExplainHandler",
"info" : ""
[...]
You can also use the Drill web interface to see information about how a query executed:
Drill Explorer
The MapR Drill ODBC driver comes with a tool called Drill Explorer. This is a GUI that lets you explore data by navigating databases (== storage plugins) and the folders/files within them, previewing the data, and even creating views on it.
Drill Client
Within the Drill client there are various settings available:
0: jdbc:drill:zk=local> !set
autocommit true
autosave false
color true
fastconnect true
force false
headerinterval 100
historyfile /home/oracle/.sqlline/history
incremental true
isolation TRANSACTION_REPEATABLE_READ
maxcolumnwidth 15
maxheight 56
maxwidth 1000000
numberformat default
outputformat table
propertiesfile /home/oracle/.sqlline/sqlline.properties
rowlimit 0
showelapsedtime true
showheader true
shownestederrs false
showwarnings true
silent false
timeout -1
trimscripts true
verbose false
To change one, for example the width of the displayed output:
0: jdbc:drill:zk=local> !set maxwidth 10000
To connect to a remote Drill, specify the Zookeeper node(s) that hold the Drillbit connection information:
rmoff@asgard-3:apache-drill-1.7.0> bin/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181
Conclusion
Apache Drill is a powerful tool for using a familiar querying language (SQL) against different data sources. On a small scale, just being able to slice and dice through structured files like JSON is a massive win. On a larger scale, it will be interesting to experiment with how Apache Drill compares when querying larger volumes of data across a cluster of machines, perhaps against a tool such as Impala.
For more information on Apache Drill, see how to access Drill from within OBIEE.
Translated from: https://www.javacodegeeks.com/2016/09/guide-getting-started-apache-drill.html