1)json操作
hive> select get_json_object("{\"a\":1}","$.a") from filterd_url limit 2;
OK
1
1
Time taken: 364.962 seconds, Fetched: 2 row(s)
hive> select * from filterd_url where get_json_object("{\"a\":1}","$.a") <=> NULL limit 2;
OK
Time taken: 1.06 seconds
hive> select * from filterd_url where get_json_object("{\"a\":1}","$.b") <=> NULL limit 2;
OK
http://my.tv.sohu.com/us/54198372/14304719.shtml 20140505 noblog
http://www.letv.com/ptv/vplay/2126589.html 20140505 noblog
Time taken: 0.072 seconds, Fetched: 2 row(s)
hive>
2)hive的等于和不等于操作时遇见NULL的情况hive ;> select * from filterd_url where get_json_object("{\"a\":1}","$.a") = '1' limit 2;OKhttp://my.tv.sohu.com/us/54198372/14304719.shtml 20140505 nobloghttp://www.letv.com/ptv/vplay/2126589.html 20140505 noblogTime taken: 0.109 seconds, Fetched: 2 row(s)hive> select * from filterd_url where get_json_object("{\"a\":1}","$.a") = NULL limit 2;OKTime taken: 463.32 secondshive> select * from filterd_url where get_json_object("{\"a\":1}","$.b") = NULL limit 2;OKTime taken: 151.736 secondshive> select * from filterd_url where !(get_json_object("{\"a\":1}","$.a") <=> NULL) limit 2;OKhttp://my.tv.sohu.com/us/54198372/14304719.shtml 20140505 nobloghttp://www.letv.com/ptv/vplay/2126589.html 20140505 noblogTime taken: 1.092 seconds, Fetched: 2 row(s)hive>hive> select * from filterd_url where get_json_object("{\"a\":1}","$.a") is not NULL limit 2;OKhttp://my.tv.sohu.com/us/54198372/14304719.shtml 20140505 nobloghttp://www.letv.com/ptv/vplay/2126589.html 20140505 noblogTime taken: 1.07 seconds, Fetched: 2 row(s)所以,有效过滤出json字符串中不存在的信息的方法是:!(get_json_object("{\"a\":1}","$.a") <=> NULL)或者使用:get_json_object("{\"a\":1}","$.a") is not NULL
不等于操作符:<>,如果操作数中只要有一个为NULL,则返回NULL;否则,执行比较操作
不等于操作符:<=>,如果操作数中一个为NULL,则返回False;如果两个操作数都为NULL,则返回True;否则,执行比较操作
3)hive group by 操作
4)hive xml 操作select uid, concat_ws(" ", collect_set(cluster_ids)) from site_user_cluster_info group by uid;
> SELECT xpath_int ('<a>1</a>', 'a') FROM wxj_test_url;
1
> SELECT xpath_int ('<a><b>1</b><c>2</c></a>', 'a') FROM wxj_test_url;
12
> SELECT xpath_int ('<a><b>1</b><c>2</c></a>', 'a/*') FROM wxj_test_url;
1
> SELECT xpath_int ('<a><b>1</b><c>2</c></a>', 'sum(a/*)') FROM wxj_test_url;
3
> SELECT xpath_int ('<a><b>5</b><c>2</c></a>', 'a/b * a/c') FROM wxj_test_url;
10