- 显示已有数据库列表
show databases;
- 使用数据库db
use db;
- 显示数据表列表
show tables;
- 描述数据表结构
desc/describe tables;
- 创建数据表
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE; - 向表中加载数据
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
- 将表中数据分组
INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
- 利用Python脚本
weekday_mapper.py:
import sys
import datetime
for line in sys.stdin:
line = line.strip()
userid, movieid, rating, unixtime = line.split('\t')
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print '\t'.join([userid, movieid, rating, str(weekday)])
使用上述脚本:
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
区别:order by、sort by和cluster by
order by:全局排序,因此只有一个reducer,因此当数据量较大时,排序耗费时间很长。
sort by: 非全局排序,可以有多个reducer,只保证每个reducer的输出有序,之后可以对各个reducer的输出做归并排序
distribute by: 主要是控制map端如何拆分数据给reduce端
cluster by:可以理解为distribute by + sort by但排序只能是倒叙,不能指定排序规则