Machine Learning with Spark Study Notes: Extracting Features from the 100k Movie Dataset

Note: the code in the original book is written and executed in spark-shell, while mine is written and run in Eclipse, so my output may look slightly different from the book's.

First, load the user data file u.data into the SparkContext, then print the first record to check it:

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "ExtractFeatures")
val rawData = sc.textFile("F:\\ScalaWorkSpace\\data\\ml-100k\\u.data")
println(rawData.first())

Note: the first line creates the Spark context. If you run the code in spark-shell, a Spark context named sc is created for you automatically; since I am writing the code in Eclipse, I have to create it myself. The output looks like this:

(screenshot: the first record of u.data)

Each record is delimited by "\t". We now split each record and take its first three fields: the user ID, the movie ID, and the rating the user gave the movie. The code is as follows:

val rawRatings = rawData.map(_.split("\t").take(3))
rawRatings.first().foreach(println)

The output looks something like this:
(screenshot: the first parsed record)
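To see what the split and take(3) steps do without a Spark cluster, here is a minimal plain-Scala sketch on a single record. The field values in the sample line are illustrative, but the format matches u.data (user ID, movie ID, rating, timestamp, separated by tabs):

```scala
// A sample tab-separated record in the u.data format:
// userID \t movieID \t rating \t timestamp (values are illustrative)
val line = "196\t242\t3\t881250949"

// Split on tabs and keep only the first three fields,
// discarding the trailing timestamp.
val fields = line.split("\t").take(3)

// fields(0) = user ID, fields(1) = movie ID, fields(2) = rating
fields.foreach(println)
```

The same logic is what `rawData.map(_.split("\t").take(3))` applies to every line of the RDD.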

Next we will use Spark's built-in MLlib library to train our model. Let's first see which methods are available and what inputs they require. Start by importing ALS:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

The following exploration is done in spark-shell. At the console, type ALS. (note the dot after ALS) followed by the Tab key:
(screenshot: tab-completion listing for ALS)
The method we will use is train.

If we enter ALS.train by itself, it returns an error, but the error message reveals the method's details:
(screenshot: error message showing the train signatures)
We can see that we must provide at least three parameters: ratings, rank, and iterations; the second overload also takes a lambda parameter. Let's look at the Rating class used for the ratings parameter:
(screenshot: the Rating class signature)
As you can see, we need to supply the ALS model with an RDD of Rating objects, where Rating wraps the user id, the movie id (called product here), and the rating. We will apply map to the rating dataset to turn each array of IDs and rating into a Rating object:

val ratings = rawRatings.map {
  case Array(user, movie, rating) =>
    Rating(user.toInt, movie.toInt, rating.toDouble)
}
println(ratings.first())

The output looks like this:
(screenshot: the first Rating object)
We now have an RDD of Rating objects.
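With the ratings RDD in hand, the natural next step is to pass it to ALS.train along with the other required parameters identified above. A minimal sketch follows; the rank, iterations, and lambda values here are illustrative choices of mine, not values prescribed by this section:

```scala
// rank       = number of latent factors in the model
// iterations = number of ALS optimization sweeps
// lambda     = regularization parameter
// (all three values below are only example settings)
val rank = 50
val iterations = 10
val lambda = 0.01

// Returns a MatrixFactorizationModel trained on our Rating RDD.
val model = ALS.train(ratings, rank, iterations, lambda)

// The trained model can then predict a given user's rating
// for a given movie, e.g.:
// val predictedRating = model.predict(userId, movieId)
```

Larger rank captures more structure but costs more memory and risks overfitting; lambda trades off fit against regularization.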
