1. 下载数据:
http://www.bayareabikeshare.com/datachallenge
2. 登陆HUE
File Browser 上传文件: 201408_station_data.csv 201408_trip_data.csv
Data Browser/Metastore Tables,选择数据库 default, 执行action: Create a new table from a file。输入表名、文件路径名,持续Next,最后点击 Create Table。
Query Editors/Hive, 单机刷新按钮,点击Settings/FILE RESOURCES,需要设置解析CSV文件的serde Jar包。
登陆网站: https://github.com/ogrodnek/csv-serde 下载文件csv-serde-1.1.2-0.11.0-all.jar
返回File Browser 上传文件:csv-serde-1.1.2-0.11.0-all.jar
再次执行Query Editors/Hive, 单机刷新按钮,点击Settings/FILE RESOURCES,将刚刚下载的Jar包引入进来,并执行SQL:
alter table station set serde 'com.bizo.hive.serde.csv.CSVSerde';
或者 alter table trip set serde 'com.bizo.hive.serde.csv.CSVSerde';
3. 进行交互式查询进行数据分析
执行SQL:
SELECT startterminal, startstation, COUNT(1) AS count FROM bikeshare.trips GROUP BY startterminal, startstation ORDER BY count DESC LIMIT 10
找出路线最多的前10名的站点,点击Chart按钮,设置bar graph。
4. 执行Long查询
找到站点ID=70的总路线和平均运行时间:
SELECT
hour,
COUNT(1) AS trips,
ROUND(AVG(duration) / 60) AS avg_duration
FROM (
SELECT
CAST(SPLIT(SPLIT(t.startdate, ' ')[1], ':')[0] AS INT) AS hour,
t.duration AS duration
FROM `bikeshare`.`trips` t
WHERE
t.startterminal = 70
AND
t.duration IS NOT NULL
) r
GROUP BY hour
ORDER BY hour ASC;
设置scatterplot graph。