A small Hive demo

Running a GIS program on Hadoop and Hive.

 

  • Sample tools that demonstrate full stack implementations of all the resources provided to solve GIS problems using Hadoop
  • Templates for building custom tools that solve specific problems

Keep only the useful portion of the data and delete the rest; this is the preparatory work for the demo.

Before this, the Hadoop and MySQL environments were already set up.

Start MySQL and create a hive database, with username hive_user and password hive_pass. Import the schema file scripts/metastore/upgrade/mysql/hive-schema-2.1.0.mysql.sql from the Hive distribution into the hive database.
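A minimal sketch of this step (the grant scope and the exact schema path are assumptions; adjust to your install):

# from the shell:
mysql -u root -p

-- then inside the mysql shell:
CREATE DATABASE hive;
CREATE USER 'hive_user'@'%' IDENTIFIED BY 'hive_pass';
GRANT ALL PRIVILEGES ON hive.* TO 'hive_user'@'%';
FLUSH PRIVILEGES;
USE hive;
SOURCE /home/hadoop/apache-hive-2.1.0/scripts/metastore/upgrade/mysql/hive-schema-2.1.0.mysql.sql;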

With the prerequisites done, set up the Hive environment:

Download and unpack the release.
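For example (a sketch; the mirror URL is an assumption, and the 2.1.0 version matches the apache-hive-2.1.0 path that appears in the logs below):

wget https://archive.apache.org/dist/hive/hive-2.1.0/apache-hive-2.1.0-bin.tar.gz
tar -xzf apache-hive-2.1.0-bin.tar.gz -C /home/hadoop/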

You can export the Hive path into your environment, but this is optional, not required.
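If you do export it, something like this (the path matches the install location used later in the logs):

export HIVE_HOME=/home/hadoop/apache-hive-2.1.0
export PATH=$PATH:$HIVE_HOME/bin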

Create hive-site.xml under conf:

<configuration>
    <!-- MySQL connection settings -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysql-hostname:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive_user</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive_pass</value>
    </property>
</configuration>
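Two related details from general Hive 2 setup (not specific to this demo): the MySQL JDBC driver jar must sit on Hive's classpath, and Hive 2 also ships a schematool that can initialize the metastore schema as an alternative to the manual import above. A sketch; the connector version is an assumption:

# put the MySQL JDBC driver where Hive can load it
cp mysql-connector-java-5.1.38-bin.jar /home/hadoop/apache-hive-2.1.0/lib/
# alternative to importing hive-schema-2.1.0.mysql.sql by hand
/home/hadoop/apache-hive-2.1.0/bin/schematool -dbType mysql -initSchema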

I haven't figured out whether the metastore is actually stored in MySQL; this is still an open question for me.
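One way to check (a sketch; with the hive-site.xml above, the metastore tables should end up in the MySQL hive database):

mysql -u hive_user -phive_pass hive -e "SHOW TABLES;"
# metastore tables such as DBS, TBLS and COLUMNS_V2 in the output mean the metastore does live in MySQL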

This is only a rough configuration; there are probably settings still missing. From further reading I learned the following (the gist: Hive 2 promotes a new CLI, the Beeline client paired with the HiveServer2 server):

 

Beeline – a new command-line shell

 

HiveServer2 supports a new command shell Beeline that works with HiveServer2. It's a JDBC client that is based on the SQLLine CLI (http://sqlline.sourceforge.net/). There’s detailed documentation of SQLLine which is applicable to Beeline as well.

The Beeline shell works in both embedded mode as well as remote mode. In the embedded mode, it runs an embedded Hive (similar to Hive CLI) whereas remote mode is for connecting to a separate HiveServer2 process over Thrift. Starting in Hive 0.14, when Beeline is used with HiveServer2, it also prints the log messages from HiveServer2 for queries it executes to STDERR.

 


In remote mode HiveServer2 only accepts valid Thrift calls – even in HTTP mode, the message body contains Thrift payloads.

 

 

Beeline is meant to be used together with HiveServer2, and supports both embedded and remote modes.

Start HiveServer2: ./bin/hiveserver2

Start Beeline:

$ ./bin/beeline 

beeline> !connect jdbc:hive2://localhost:10000
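Beeline can also connect and run a statement in one step; a sketch, assuming the default port 10000 and the hadoop user:

$ ./bin/beeline -u jdbc:hive2://localhost:10000 -n hadoop -e "SHOW DATABASES;"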

 

https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

http://blog.cloudera.com/blog/2014/02/migrating-from-hive-cli-to-beeline-a-primer/


----------------------------------------------------------------------------------------------------------------------------------------------------

At this point Hive is more or less configured. There are still many things and concepts I don't know, such as HiveQL statements; see:

http://www.cnblogs.com/HondaHsu/p/4346354.html

Running hive here brings up the command prompt, which shows it is working.

Unpack the demo and put the data into /gis/data (this is a path in HDFS; the data must be uploaded with hdfs dfs -put .........). It contains two folders, one with JSON-format data and one with CSV-format data.
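Spelled out, the upload might look like this (a sketch; the local unpack location is an assumption):

hdfs dfs -mkdir -p /gis/data
hdfs dfs -put /home/hadoop/gis-for-hadoop/samples/data/counties-data /gis/data
hdfs dfs -put /home/hadoop/gis-for-hadoop/samples/data/earthquake-data /gis/data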

 

Permission   Owner    Group       Size   Last Modified         Replication   Block Size   Name
drwxr-xr-x   hadoop   supergroup  0 B    2016/7/3 5:43:00 PM   0             0 B          counties-data
drwxr-xr-x   hadoop   supergroup  0 B    2016/7/3 5:43:02 PM   0             0 B          earthquake-data

Modify the following in the demo:

add jar
  /home/hadoop/gis-for-hadoop/lib/esri-geometry-api.jar
  /home/hadoop/gis-for-hadoop/lib/spatial-sdk-hadoop.jar;

# Hive's add jar adds the packages to the session environment. /home/hadoop/gis-for-hadoop/... is a physical path on the Ubuntu machine; where exactly the jars end up, I'm not yet sure.
# However, while running, the program went looking for these two files in HDFS, so the two jars also had to be uploaded into HDFS. I don't know the exact reason: maybe it's the program code, or maybe the add jar command was used incorrectly here. I'll only know once I've studied the code and the Hive commands!
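A sketch of the workaround described in the comments above: mirror the two jars at the same absolute path in HDFS (assuming the path really is being resolved against HDFS, as observed):

hdfs dfs -mkdir -p /home/hadoop/gis-for-hadoop/lib
hdfs dfs -put /home/hadoop/gis-for-hadoop/lib/esri-geometry-api.jar /home/hadoop/gis-for-hadoop/lib/
hdfs dfs -put /home/hadoop/gis-for-hadoop/lib/spatial-sdk-hadoop.jar /home/hadoop/gis-for-hadoop/lib/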

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';

# Register the HDFS files' data in a table. A problem appeared here: at first localhost was replaced with the master hostname, but that failed with "connection refused". After looking into:

# http://wiki.apache.org/hadoop/ConnectionRefused

# http://blog.csdn.net/z363115269/article/details/39048589

# http://www.iteblog.com/archives/802

# I learned to query the DB_LOCATION_URI column of the DBS table in the metastore database: select DB_LOCATION_URI from DBS; — it showed that localhost had been recorded there, which is why localhost is used below. The underlying cause I don't know yet; it needs further study!
CREATE EXTERNAL TABLE IF NOT EXISTS earthquakes (earthquake_date STRING, latitude DOUBLE, longitude DOUBLE, depth DOUBLE, magnitude DOUBLE,
    magtype string, mbstations string, gap string, distance string, rms string, source string, eventid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://localhost:9000/gis/data/earthquake-data';

# Import the other file into a table:
CREATE EXTERNAL TABLE IF NOT EXISTS counties (Area string, Perimeter string, State string, County string, Name string, BoundaryShape binary)                                         
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'              
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs://localhost:9000/gis/data/counties-data';
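A quick sanity check that both external tables can see their data (my addition, not part of the original demo):

SELECT * FROM earthquakes LIMIT 5;
SELECT name, state FROM counties LIMIT 5;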

# This is the key point, the essence of Hive! My understanding: tables are used to manage the underlying data files, and Hive SQL decomposes the job into MapReduce tasks, relying on HDFS for file management and file access. Hive translates a HiveQL statement into a MapReduce pipeline, which lowers the barrier to entry.

# SQL-like statements replace programming! The advantage is that accessing static files and running MapReduce become simple. Note that the JOIN below has no equality key, so Hive builds a cross product of counties and earthquakes and filters it with ST_Contains; the "cross product" warning in the log further down is expected for this kind of spatial join.

SELECT counties.name, count(*) cnt FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

 

Enter hive and execute the HiveQL above; the run produces the following output:

hive> SELECT counties.name, count(*) cnt FROM counties
    > JOIN earthquakes
    > WHERE ST_Contains(counties.boundaryshape, ST_Point(earthquakes.longitude, earthquakes.latitude))
    > GROUP BY counties.name
    > ORDER BY cnt desc;
Warning: Map Join MAPJOIN[20][bigTable=?] in task 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = hadoop_20160703034243_ce75c555-bf04-48f9-9b5a-a6466a70a9e1
Total jobs = 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/apache-hive-2.1.0/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-07-03 03:42:58 Starting to launch local task to process map join;maximum memory = 518979584
2016-07-03 03:43:02 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/hadoop/3e288758-d62b-4417-93e8-f379f4969ea2/hive_2016-07-03_03-42-43_138_2061802838610441568-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile20--.hashtable
2016-07-03 03:43:02 Uploaded 1 File to: file:/tmp/hadoop/3e288758-d62b-4417-93e8-f379f4969ea2/hive_2016-07-03_03-42-43_138_2061802838610441568-1/-local-10006/HashTable-Stage-2/MapJoin-mapfile20--.hashtable (181836 bytes)
2016-07-03 03:43:02 End of local task; Time Taken: 4.391 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-07-03 03:43:10,532 Stage-2 map = 0%,  reduce = 0%
2016-07-03 03:43:22,064 Stage-2 map = 100%,  reduce = 0%
2016-07-03 03:43:23,077 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_local718067817_0003
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-07-03 03:43:26,449 Stage-3 map = 100%,  reduce = 100%
Ended Job = job_local1888560612_0004
MapReduce Jobs Launched: 
Stage-Stage-2:  HDFS Read: 13516872 HDFS Write: 0 SUCCESS
Stage-Stage-3:  HDFS Read: 15548312 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Kern 36
San Bernardino 35
Imperial 28
Inyo 20
Los Angeles 18
Riverside 14
Monterey 14
Santa Clara 12
Fresno 11
San Benito 11
San Diego 7
Santa Cruz 5
San Luis Obispo 3
Ventura 3
Orange 2
San Mateo 1
Time taken: 43.319 seconds, Fetched: 16 row(s)
hive> 
