1. Create a Hive table to hold the data to be processed
$hive> create table textlines(line string);
Load the data:
$hive> load data inpath 'tmp/input' overwrite into table textlines;
To load from the local filesystem instead, use load data local inpath '...'; in that case the path would be something like /tmp/input/wordcount.
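A minimal sketch of that local variant, assuming the input file really does sit at /tmp/input/wordcount on the local machine:
$hive> load data local inpath '/tmp/input/wordcount' overwrite into table textlines;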
Check that the load succeeded:
$hive> select * from textlines;
2. Use Hive to implement the map/reduce computation
Below is the word count HQL given in the Hive programming guide:
SELECT word, count(1) AS count FROM (SELECT explode(split(line, '[ \t]+')) AS word FROM docs) w GROUP BY word ORDER BY word;
I split the statement above into two statements:
SELECT explode(split(line, '\s')) AS word FROM docs;
SELECT word,count(1) FROM w GROUP BY word ORDER BY word;
So I first create a words table to hold the result of the first query:
$hive> create table words(word STRING);
$hive> insert overwrite table words select explode(split(line, '\s')) word from textlines;
Note: the key expression is select explode(split(line, '\s')), which triggers a Hadoop MapReduce job. \s is a regular expression matching whitespace, so split(line, '\s') splits line (a column of the textlines table) into an array of words, and explode() then flattens that array into one row per element. This HQL statement plays the role of the Map function in Hadoop.
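As a rough illustration of that pipeline (the literal string below is a made-up example, not the real input, and it assumes a Hive version that accepts SELECT without a FROM clause), split produces an array and explode turns it into one row per word:
$hive> select split('hello world hello', '\s');
  ["hello","world","hello"]
$hive> select explode(split('hello world hello', '\s')) as word;
  hello
  world
  hello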
Now for the second HQL statement: SELECT word,count(1) FROM w GROUP BY word ORDER BY word;
$hive> SELECT word,count(1) FROM words GROUP BY word ORDER BY word;
Note: this statement plays the role of the Reduce function in Hadoop.
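For reference, the two steps can also stay as a single nested query over textlines, which is essentially the guide's one-liner with docs replaced by textlines and no intermediate words table:
$hive> SELECT word, count(1) AS count FROM (SELECT explode(split(line, '\s')) AS word FROM textlines) w GROUP BY word ORDER BY word;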
3. Export the results we obtained
$hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/wordcount_result' SELECT word,count(1) FROM words GROUP BY word ORDER BY word;
Removing LOCAL writes the result to a directory on HDFS instead.
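A minimal sketch of that HDFS variant, reusing the same illustrative /tmp/wordcount_result path (on HDFS this time):
$hive> INSERT OVERWRITE DIRECTORY '/tmp/wordcount_result' SELECT word,count(1) FROM words GROUP BY word ORDER BY word;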
Next, use Sqoop to store the result on HDFS into MySQL.
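A rough sketch of that export (the JDBC URL, database, MySQL table name wordcount, and credentials are placeholders to adapt; the '\001' delimiter assumes Hive's default output field separator, and Sqoop export also requires the target MySQL table to exist beforehand):
$ sqoop export \
    --connect jdbc:mysql://localhost/wordcount_db \
    --username hive \
    --password '***' \
    --table wordcount \
    --export-dir /tmp/wordcount_result \
    --input-fields-terminated-by '\001'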