Contents
Implementing Movie Recommendation Based on a Collaborative Filtering Algorithm
1.2 Using Kettle to ETL the data into HDFS
2.1 Writing Spark code to implement recommendation
2.3 Writing a Spark program to store the results in MySQL
3.4 Calling the program from a web page and displaying the results
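1 Loading the dataset into HDFS via ETL
1.1 Downloading the dataset
This case study uses the dataset movie_recommend.zip. On the Linux system, visit the "Download Area" of this tutorial's official website, find the "Datasets" directory, and download movie_recommend.zip from it to the local Linux system. Downloaded files are saved in "~/Downloads" by default.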
movie_recommend.zip contains 3 datasets:
- User ratings dataset: ratings.dat
- Sample ratings dataset: personalRatings.txt
- Movie dataset: movies.dat
First, unzip movie_recommend.zip with the following commands:
$ cd ~/Downloads
$ unzip movie_recommend.zip -d movie_recommend
Then start Hadoop with the following commands:
$ cd /usr/local/hadoop
$ ./sbin/start-dfs.sh
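Next, create an HDFS directory named input_spark to hold the datasets for this case study. If the directory does not exist yet, create it with the following commands:
$ cd /usr/local/hadoop
$ ./bin/hdfs dfs -mkdir /input_spark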
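1.2 Using Kettle to ETL the data into HDFS
Following "4.3 Example 2: Loading data into HDFS with Kettle" in "Chapter 4: Installing and Using the ETL Tool Kettle", upload ratings.dat, personalRatings.txt, movies.dat, and user.dat to the "input_spark" directory in HDFS.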
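Once the transfer is complete, you can use HDFS shell commands in a Linux terminal to inspect the files that were just transferred to HDFS. For example, the following commands show the first five lines of ratings.dat:
$ cd /usr/local/hadoop
$ ./bin/hdfs dfs -cat /input_spark/ratings.dat | head -5
You should see data similar to the following:
1::1::5::978300760
1::2::3::978302109
2::1::1::978300761
2::2::2::978302102
2::3::3::978301963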
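Each line of ratings.dat holds four fields separated by "::". Judging from the sample above and from the Spark code in Section 2, the first three are the user ID, the movie ID, and the rating; the fourth looks like a Unix timestamp, as in the MovieLens format, though that is an assumption. A minimal plain-Python sketch of parsing such lines follows. The RatingRecord type and the parse_rating helper are illustrative names, not part of the tutorial's code.

```python
from typing import NamedTuple


class RatingRecord(NamedTuple):
    user_id: int
    movie_id: int
    rating: float
    timestamp: int  # assumed to be a Unix timestamp


def parse_rating(line: str) -> RatingRecord:
    """Split one '::'-delimited line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return RatingRecord(int(user_id), int(movie_id), float(rating), int(timestamp))


# The five sample lines shown above.
sample = [
    "1::1::5::978300760",
    "1::2::3::978302109",
    "2::1::1::978300761",
    "2::2::2::978302102",
    "2::3::3::978301963",
]
records = [parse_rating(line) for line in sample]
for record in records:
    print(record)
```

The Spark program in Section 2 performs the same split with x.split("::") and keeps only the first three fields.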
2 Writing a Spark program for movie recommendation
2.1 Writing Spark code to implement recommendation
Reference code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.mllib.recommendation import ALS

# MySQL settings (adjust for your environment)
PROP = {'user': 'root', 'password': '123456', 'driver': 'com.mysql.cj.jdbc.Driver'}
# Database URL (adjust for your environment)
URL = 'jdbc:mysql://localhost:3306/movierecommend'


def insert_mysql():
    """Load movies.dat and append its rows to the MySQL table 'movies'."""
    sc = SparkContext(master='local', appName='sql')
    spark = SQLContext(sc)
    # Assumption: movies.dat sits next to ratings.dat; adjust the path as needed
    rdd = sc.textFile("file:///data/soft/movie_recommend/movies.dat")
    rdd2 = rdd.map(lambda x: x.split("::")).map(lambda x: Row(x[0], x[1], x[2]))
    schema = StructType([StructField('id', StringType()),
                         StructField('name', StringType()),
                         StructField('movieType', StringType())])
    df = spark.createDataFrame(rdd2, schema)
    # Write to the database
    df.write.jdbc(url=URL, table='movies', mode='append', properties=PROP)
    # Stop the Spark context
    sc.stop()


def read_from_mysql(userid):
    """Print the rows of 'personalratings' that belong to one user."""
    sc = SparkContext(master='local', appName='sql')
    spark = SQLContext(sc)
    # read.jdbc() already returns a DataFrame, so no load() call is needed
    personalRatingsDF = spark.read.jdbc(url=URL, table='personalratings', properties=PROP)
    personalRatingsDF.show()
    personalRatingsDF.createOrReplaceTempView("personalratings")
    # int() rejects anything that is not a numeric user ID
    prDF = spark.sql("select * from personalratings where userid=" + str(int(userid)))
    # A DataFrame has no map(); go through its underlying RDD
    myrdd = prDF.rdd.map(lambda x: str(x[0]) + "::" + str(x[1]) + "::" + str(x[2]) + "::" + str(x[3]))
    myrdd.foreach(print)
    sc.stop()


if __name__ == "__main__":
    sparkConf = SparkConf().setAppName("Recommendation")
    sc = SparkContext(conf=sparkConf)

    # logger = sc._jvm.org.apache.log4j
    # Logger = logger.LogManager.getLogger(__name__)
    # Logger.info("pyspark init")

    filepath = "file:///data/soft/movie_recommend/ratings.dat"
    # Keep the first three fields of each line: user ID, movie ID, rating
    ratingsRDD = sc.textFile(filepath) \
        .map(lambda x: x.split("::")) \
        .map(lambda x: (int(x[0]), int(x[1]), float(x[2])))

    numRatings = ratingsRDD.count()
    numUsers = ratingsRDD.map(lambda x: x[0]).distinct().count()
    numMovies = ratingsRDD.map(lambda x: x[1]).distinct().count()
    print(str(numRatings) + " " + str(numUsers) + " " + str(numMovies))

    # Positional arguments: rank=5, iterations=20, lambda_=0.1
    model = ALS.train(ratingsRDD, 5, 20, 0.1)
    # Top 5 movies for user 100, then top 5 users for movie 200
    print(model.recommendProducts(100, 5))
    print(model.recommendUsers(product=200, num=5))
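ALS.train fits a matrix-factorization model by alternating least squares: with the movie factors held fixed it solves a regularized least-squares problem for each user's factors, then does the same for the movies with the user factors fixed, and repeats. The following is a minimal single-machine sketch of that idea with one latent factor (rank 1), using the five sample ratings from Section 1.2. It is an illustration under simplifying assumptions: the function names are hypothetical, and Spark's implementation differs in rank, regularization details, and distributed execution.

```python
def regularized_loss(ratings, user_f, movie_f, reg):
    """Squared error on the known ratings plus an L2 penalty on all factors."""
    error = sum((r - user_f[u] * movie_f[m]) ** 2 for u, m, r in ratings)
    penalty = sum(f * f for f in user_f.values()) + sum(f * f for f in movie_f.values())
    return error + reg * penalty


def als_rank1(ratings, iterations=20, reg=0.1):
    """Alternating least squares with a single latent factor per user and movie."""
    user_f = {u: 1.0 for u, _, _ in ratings}
    movie_f = {m: 1.0 for _, m, _ in ratings}
    for _ in range(iterations):
        # Movie factors fixed: closed-form ridge solution for each user.
        for u in user_f:
            rated = [(movie_f[m], r) for uu, m, r in ratings if uu == u]
            user_f[u] = sum(f * r for f, r in rated) / (reg + sum(f * f for f, _ in rated))
        # User factors fixed: the same update for each movie.
        for m in movie_f:
            rated = [(user_f[u], r) for u, mm, r in ratings if mm == m]
            movie_f[m] = sum(f * r for f, r in rated) / (reg + sum(f * f for f, _ in rated))
    return user_f, movie_f


# (user ID, movie ID, rating) triples taken from the sample lines of ratings.dat.
ratings = [(1, 1, 5.0), (1, 2, 3.0), (2, 1, 1.0), (2, 2, 2.0), (2, 3, 3.0)]

initial = regularized_loss(ratings, {1: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0, 3: 1.0}, 0.1)
user_f, movie_f = als_rank1(ratings)
final = regularized_loss(ratings, user_f, movie_f, 0.1)
print("loss before: %.3f, after: %.3f" % (initial, final))

# Predicted rating of user 1 for movie 3, which user 1 has not rated.
print(user_f[1] * movie_f[3])
```

Because each half-step minimizes the loss exactly over one set of factors, the loss never increases from one iteration to the next. A prediction for an unrated movie is the product of the user's factor and the movie's factor, which is what recommendProducts ranks by in the higher-rank case.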
2.2 Submitting the code for execution
Submit the code to the cluster for execution, either interactively through pyspark or with spark-submit.
2.3 Writing a Spark program to store the results in MySQL
The reference code for this step is the same program listed in Section 2.1. Its insert_mysql() function writes the parsed movie records to the MySQL table movies through JDBC, and read_from_mysql(userid) reads the ratings of a given user back from the personalratings table. Adjust the user name, password, and database URL in the listing to match your MySQL installation.
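model.recommendProducts returns a list of (user, product, rating) records. As a rough sketch of storing such records and reading them back for one user, the example below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server. The table name recommendresult, its columns, and the sample values are all hypothetical. The user ID is passed as a query parameter rather than concatenated into the SQL string.

```python
import sqlite3

# Hypothetical recommendations in (user, product, rating) form.
recommendations = [
    (100, 318, 4.8),
    (100, 858, 4.6),
    (200, 50, 4.7),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL database
conn.execute(
    "CREATE TABLE recommendresult (userid INTEGER, movieid INTEGER, rating REAL)"
)
conn.executemany("INSERT INTO recommendresult VALUES (?, ?, ?)", recommendations)
conn.commit()


def recommendations_for(connection, userid):
    """Return (movieid, rating) rows for one user, highest rating first."""
    cursor = connection.execute(
        "SELECT movieid, rating FROM recommendresult "
        "WHERE userid = ? ORDER BY rating DESC",
        (userid,),
    )
    return cursor.fetchall()


rows = recommendations_for(conn, 100)
print(rows)
```

With a real MySQL connection the overall shape stays the same, although MySQL drivers generally use a different placeholder syntax than sqlite3's question mark.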
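3 Displaying the results in a web page with Flask
3.1 Creating the project directory
3.2 Creating the routes
Create app.py in the root directory of the Flask project and configure the routes. Reference code: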
from flask import Flask, render_template, request, url_for, redirect


class UserData(object):
    def __init__(self, name, password, phone):
        self.name = name
        self.password = password
        self.phone = phone


app = Flask(__name__)

names = ['Marry', 'Linda', 'Alexi', 'Mia', 'Bessie', 'Adam', '韩梅梅', '安吉丽娜', '王刚']
# In-memory user list; it is reset every time the server restarts
users = [
    UserData('a1', '123', '12345678910'),
    UserData('b1', '123', '12345678910'),
    UserData('b1', '123', '12345678910')
]


@app.route('/login', methods=['GET', 'POST'])
def login():
    return render_template('login.html')


@app.route('/logincheck', methods=['POST'])
def login_check():
    name = request.form.get('name')
    password = request.form.get('password')
    for user in users:
        if name == user.name and password == user.password:
            return redirect(url_for('index', msg='Login successful'))
    # Report failure only after every user has been checked
    return render_template('login.html', msg='Login failed, please try again')


@app.route('/index', methods=['GET', 'POST'])
def main():
    return render_template('main.html')


@app.route('/index/<msg>', methods=['GET', 'POST'])
def index(msg):
    return render_template('main.html', msg=msg)


@app.route('/', methods=['GET', 'POST'])
def first():
    return render_template('index.html')


@app.route('/scores', methods=['GET', 'POST'])
def scores():
    return render_template('scores.html')


@app.route('/recommend', methods=['GET', 'POST'])
def recommend():
    return render_template('recommend.html')


@app.route('/register', methods=['GET', 'POST'])
def register():
    if request.method == "POST":
        # Read the submitted form data
        username = request.form.get('username')
        password = request.form.get('password')
        phone = request.form.get('phone')
        repassword = request.form.get('repassword')
        if password == repassword:
            # Reject duplicate user names, then create the user
            for user in users:
                if user.name == username:
                    return render_template("register.html", msg="User name already exists")
            user = UserData(username, password, phone)
            users.append(user)
            return redirect(url_for('index', msg='Registration successful'))
        else:
            return render_template("register.html", msg="Passwords do not match, please check")
    return render_template("register.html")


if __name__ == '__main__':
    app.run()
The code above starts an HTTP server that, by default, listens for incoming connections on port 5000.
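The credential check in login_check has to examine every registered user before it reports a failure. One way to keep that logic easy to test is to pull it out of the route into a plain function, as in the sketch below. check_credentials is an illustrative helper, not part of the original app, and the UserData class is repeated so that the snippet runs on its own without Flask.

```python
class UserData(object):
    def __init__(self, name, password, phone):
        self.name = name
        self.password = password
        self.phone = phone


def check_credentials(users, name, password):
    """Return True if any registered user matches both name and password."""
    return any(u.name == name and u.password == password for u in users)


users = [
    UserData('a1', '123', '12345678910'),
    UserData('b1', '123', '12345678910'),
]

# 'b1' is the second entry, so the check must look past the first user.
print(check_credentials(users, 'b1', '123'))
print(check_credentials(users, 'b1', 'wrong'))
```

Inside the route, a True result would lead to the redirect to the main page and a False result to re-rendering login.html with the failure message.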
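3.3 Adding the template files
In the project directory, create a subdirectory named "templates" and add the required template files to it: the login, registration, main, movie rating, and movie recommendation pages. The login and registration templates are shown below for reference.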
login.html:
<html>
<h1>User Login</h1>
<p style="color: red;"> {{ msg }} </p>
<form action="/logincheck" method="post">
  <p><input type="text" name="name" placeholder="User name"></p>
  <p><input type="password" name="password" placeholder="Password"></p>
  <p><input type="submit" value="Log in"></p>
</form>
<form action="/" method="post">
  <p><button>Back to home page</button></p>
</form>
</html>

register.html:
<html>
<h1>User Registration</h1>
<p style="color: red;"> {{ msg }} </p>
<form action="/register" method="post">
  <p><input type="text" name="username" placeholder="User name"></p>
  <p><input type="password" name="password" placeholder="Password"></p>
  <p><input type="password" name="repassword" placeholder="Confirm password"></p>
  <p><input type="number" name="phone" placeholder="Mobile number"></p>
  <p><input type="submit" value="Register"></p>
</form>
<form action="/" method="post">
  <p><button>Back to home page</button></p>
</form>
</html>
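3.4 Calling the program from a web page and displaying the results
Open a command window (cmd), change to the Flask project directory, and enter the following command to start the HTTP server:
$ python app.py
This completes the case study!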