一、网页分析
1. Top250页面分析
2. 电影Pulp Fiction评论页面分析
二、Java WebMagic爬取信息
1. 爬取Top250电影的详细信息,以键值对形式保存为Json文件
2. 爬取电影Pulp Fiction的用户评论信息,以键值对形式保存为Json文件
三、数据清洗
将保存在本地的Json文件上传到HDFS中:
hadoop fs -mkdir -p /myimdb/movie/in
hadoop fs -put /data/imdb1/movie/www.imdb.com/* /myimdb/comment/in
hadoop fs -mkdir -p /myimdb/comment/in
hadoop fs -put /data/imdb1/comment/www.imdb.com/* /myimdb/comment/in
MapReduce对数据清洗:读取Json文件、去重、去除空值、生成格式化的数据文件:
四、利用Hive进行电影信息的数据分析
在Hive中创建imdb数据库:
create database imdb;
创建外部表movies:
create external table if not exists movies (
movie_name string,
movie_year int,
movie_area string,
movie_type string,
movie_score float,
movie_director string
) row format delimited fields terminated by '\t' location '/myimdb/movie/out/1';
创建外部表comments:
create external table if not exists comments (
comm_username string,
comm_score int,
comm_title string,
comm_detail string
) row format delimited fields terminated by '\t' location '/myimdb/cmments/out/1';
将格式化的数据文件下载到本地:
cd /data/imdb1/movie/
hadoop fs -get /myimdb/movie/out/1/*
cd /data/imdb1/comment/
hadoop fs -get /myimdb/comment/out/1/*
将数据导入到Hive的movies表中:
load data local inpath '/data/imdb1/movie/part-r-00000' into table movies;
select * from movies limit 5; //查看数据,验证是否导入成功
load data local inpath '/data/imdb1/comment/part-r-00000' into table comments;
select * from comments limit 5; //查看数据,验证是否导入成功
统计共有多少条目:
select count(1) as num from movies;
select count(1) as num from comments;
创建score_movienum表,统计各个评分分别有多少部电影,将统计结果以临时文件形式保存到score_movienum表中:
create table score_movienum (movie_score float, num int)
row format delimited fields terminated by '\t' stored as textfile;
insert into table score_movienum select movie_score, count(1) as num
from movies group by movie_score;
select * from score_movienum;
创建year_movienum表,统计各个年代分别有多少部电影,将统计结果以临时文件形式保存到year_movienum表中:
create table year_movienum (movie_year int, num int)
row format delimited fields terminated by '\t' stored as textfile;
insert into table year_movienum select movie_year, count(1) as num
from movies group by movie_year;
select * from year_movienum;
创建type_movienum表,统计各个类型分别有多少部电影,将统计结果以临时文件形式保存到type_movienum表中:
create table type_movienum (movie_type string, num int)
row format delimited fields terminated by '\t' stored as textfile;
insert into table type_movienum select movie_type, count(1) as num
from movies group by movie_type;
select * from type_movienum;
创建area_movienum表,统计各个类型分别有多少部电影,将统计结果以临时文