Spark
https://spark.apache.org/docs
Apache Spark is a central tool for mining and analytics in big data.
Spark uses a DAG (directed acyclic graph) as its execution model
and performs computation mainly in memory.
Spark is built on a unified data model (the RDD) and a unified programming model (Transformations/Actions).
Spark is a top-level project of the Apache Software Foundation
and an improvement over Hadoop MapReduce.
Hadoop MapReduce is disk-oriented:
for complex workloads, for example
iterative computation,
where the result of one pass feeds the next pass,
it generates a lot of disk I/O and becomes slow.
It is also inefficient for real-time computation, interactive queries, and similar workloads.
Hadoop MapReduce was simply not designed for iterative computation, so it is a poor fit for machine learning.
Spark is memory-oriented,
which lets it deliver near real-time processing over data from many different sources
and makes it well suited to applications that repeatedly operate on the same dataset.
Processing the same data in the same experimental environment,
Spark can be up to 100x faster than MapReduce
when everything runs in memory.
In other respects, such as iterative computation,
analytical reporting, and sorting,
Spark is also much faster than MapReduce.
Spark is also stronger than Hadoop in ease of use and generality.
However, Spark is memory-hungry:
when memory is insufficient, jobs can fail,
so it cannot completely replace Hadoop MapReduce;
choose according to the use case.
Spark covers structured data, streaming data,
a machine learning algorithm library, and a graph-mining framework with its own algorithm library.
Spark is written natively in Scala;
the API supports Python, Java, Scala, and R.
Programs written in Scala can be faster in some cases, for example RDD code, but Spark SQL performance is the same across languages.
Spark Core
The core component.
Spark is agnostic to the underlying cluster manager.
As long as it can acquire executor processes,
and those processes can communicate with each other,
it can run fairly easily on top of general-purpose cluster resource managers such as Mesos and YARN.
A Spark application runs as a set of independent processes on a cluster, coordinated by the SparkContext object in the main program (also called the driver program).
Once the SparkContext connects to the cluster, Spark first acquires executor processes on the cluster nodes; these processes execute the computation and storage logic of our program.
It then ships our application code (as a jar) to each executor process.
Finally, the SparkContext dispatches tasks to the executors for execution.
SparkContext (sc) is created in the driver.
Using the sc, a connection with the cluster manager is established.
Once connected, executors are requested.
An executor is a process that performs the computation and stores the data.
The driver sends the code and tasks to the executors.
RDDs are distributed across the whole cluster
In Spark the driver runs serially,
while the workers run in parallel.
The SparkContext is created in the driver
and establishes a connection with the cluster manager;
executor processes (the workers that compute and store the data) are then requested,
and the driver sends the code and the tasks (functions and global variables) to the executors.
Note: your driver may itself run on a worker node (when using YARN).
Each application has its own isolated executors with no communication between them,
which means that changes executors make to global variables are not visible in the driver,
and sharing/sending large data structures between worker nodes performs poorly.
Broadcast variables and Accumulators can be used for sharing.
One worker node can run multiple executors.
Driver/Master
The driver program runs on a single node,
and everything it runs is serial.
One advantage of Spark is that it can merge operations:
some operations run in parallel, others serially.
If Spark runs on YARN,
the driver program may run on a worker node;
that is how the Resource Manager manages all resources.
The driver sends code and global variables along with each task;
tasks go to executors, and when an executor finishes its work it returns the result to the driver program.
The SparkContext is created on the driver node, establishes a connection with the Cluster Manager, and can then request executors.
Worker
Parallel operations can be executed on multiple compute nodes of a cluster,
or in local threads on your own machine.
One worker node may run several executors.
Worker -> Executor -> Task
Executor
The executor process is responsible for performing the computation and storing the data;
the driver sends the executor its code and tasks.
An executor is a process that runs inside a container
on a worker node
and is responsible for running tasks (the unit of work that runs on an executor).
Each application has its own executor processes, which stay up for the whole lifetime of the application and run tasks in multiple threads. The benefit of this design is that resource consumption is isolated between applications, since each runs in its own JVM. It also means, however, that SparkContexts of different applications cannot share data, except through an external storage medium.
From different programs/applications.
Depending on the number of cores and RAM memory needed.
When using YARN, an executor runs in a container, and the driver program may run on a worker node.
Data model: RDD
RDD(Spark 1.0)
RDD stands for Resilient Distributed Dataset; it is an extension of the MapReduce model.
It is a fault-tolerant, parallelisable, cacheable data structure (a collection of elements/objects) that lets users choose whether data is kept on disk or in memory and control how the data is partitioned.
An RDD's data is spread across the nodes of the distributed cluster.
Once created, an RDD is immutable; you can only derive new RDDs from it,
and it is rebuilt automatically when a failure occurs.
It also provides a set of efficient programming interfaces for operating on the dataset.
That Spark can serve several big-data domains at once
is largely due to the power of RDDs.
Batch processing, stream computing, graph computing and machine learning look unrelated at first sight, but they share one common need: sharing data efficiently between parallel computation phases.
The designers of RDD recognised this, and built RDD around efficient data sharing plus MapReduce-like operations, so it can emulate iterative algorithms, relational queries, MapReduce and stream processing.
Thanks to this design, Spark spans multiple domains, which means a single program can combine several kinds of operation.
For example, you can filter data with a SQL query and then run machine learning on it, or operate on streaming data through SQL. Besides the convenience, this flattens the learning curve: with Spark you learn one programming model and can work across several domains.
https://blog.csdn.net/dsdaasaaa/article/details/94181269
Creating RDDs
After an RDD is created, the operations applied to it do not run on the driver node; they run in parallel/distributed fashion on the worker nodes. The code and data are shipped to the workers, and the results are returned to the driver node.
Variables declared in the driver are global
and can be accessed by all workers.
There are three ways to create RDDs:
1. Parallelising an existing collection in your driver program.
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 4)
# print the number of partitions
print(rdd.getNumPartitions())
# 4
sc.parallelize() is a lazy operation. There isn’t any kind of computation. Spark simply saves how to create the RDD with 4 partitions.
If you omit the 4, the data is split into a default number of partitions.
Partitions enable parallel computation;
2 to 4 partitions per CPU core is usually a good setting.
Each partition corresponds to one task.
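A quick way to see how the elements were split across partitions is glom(), which turns each partition into a list. A minimal sketch (the exact split may differ on your machine):
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
# glom() groups the elements of each partition into a list, so collect() shows the split
print(rdd.glom().collect())
# e.g. [[1, 2], [3, 4], [5, 6], [7, 8]]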
2. Referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, etc.
❑You can reference HDFS, Amazon S3, Hbase,…
❑If you use "*", you can load all the files in a folder.
All worker nodes must be able to access the target file.
distFile = sc.textFile("hdfs://localhost:9000/user/pszit/quixote.txt", 4)
sc.textFile() is also a lazy operation.
lines = sc.textFile("README.md")
pythonLines = lines.filter(lambda line: "Python" in line)
# print the first line; first() returns only the first element, not a list
print(pythonLines.first())
lines = sc.textFile("data/")
# read all the files under the directory
print(lines.count())
# 43461
# In this case, we put every single line of each file as an element of the lines RDD.
# And there is another function, wholeTextFiles(), that provides a key-value RDD with the file names as keys and the file contents as values.
files = sc.wholeTextFiles("data/")
files.count()
# 2
# We can also save an RDD as a text file using saveAsTextFile(filename). For example, I can first filter and then save:
bookRDD = sc.textFile("quixote.txt")
quixote_lines = bookRDD.filter(lambda line: "Quixote" in line)
quixote_lines.saveAsTextFile("quixote_lines")
# But quixote_lines will not be a single file; it will be a directory with multiple files (one file per partition). This is because each node writes to a different place.
All operations to RDDs are executed in parallel. So, the instruction is sent to all workers.
Reading JSON files
import json
data = sc.textFile("data/data.json").map(lambda x: json.loads(x))
print(f"Number of elements in the RDD from data.json: {data.count()}")
# Number of elements in the RDD from data.json: 3
# We could write it back to drive:
data.map(lambda x: json.dumps(x)).saveAsTextFile("data/output.json")
# This again will be a directory!
Reading CSV files
# We could load a csv file and parse it like this:
import csv
import io

def loadRecord(line):
    """Parse a CSV line"""
    input_data = io.StringIO(line)
    reader = csv.DictReader(input_data, fieldnames=["Name", "Phone", "email"])
    return next(reader)
inputCSV = sc.textFile("data/personas.csv").map(loadRecord)
inputCSV.take(2)
# [OrderedDict([('Name', 'Name'), ('Phone', 'Phone'), ('email', 'email')]), OrderedDict([('Name', 'Dalton Murphy'), ('Phone', '815-368-844'), ('email', 'mauris@mollisPhaselluslibero.ca')])]
# And you can write things back to drive:
def writeRecords(records):
    """Write out CSV lines"""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=["Name", "Phone", "email"])
    for record in records:
        writer.writerow(record)
    return [output.getvalue()]
inputCSV.mapPartitions(writeRecords).saveAsTextFile("data/output.csv")
3. Transforming an existing RDD.
Create a new RDD from an existing one.
See the "RDD programming model" section below.
RDD programming model
RDDs support two kinds of parallel operations:
transformations and actions.
Note: in a (key, value) tuple, the key cannot be a list.
Transformation
Create a new RDD from an existing one.
A transformation defines a new RDD by turning one RDD into another,
for example:
turning ("hello", "hello", "hello"), ("world", "world") into ("hello", 3), ("world", 2).
Transformations are lazy: they are not executed immediately,
but only when an action is executed,
because this allows some automatic optimisations.
Chaining calls is very common in Spark:
a().b().c()
rdd = sc.parallelize([1, 2, 3])
print(rdd.map(lambda x: x * 2).collect())
# [2, 4, 6]
print(rdd.map(lambda x: [x, x+5]).collect())
# [[1, 6], [2, 7], [3, 8]]
print(rdd.flatMap(lambda x: [x, x+5]).collect())
# [1, 6, 2, 7, 3, 8]
print(rdd.filter(lambda x: x != 1).collect())
# [2, 3]
print(rdd.filter(lambda x: x % 2 == 0).collect())
# [2]
map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.
Returns a new RDD which is built applying to each element of the original RDD the function func.
A structural transformation: for example, map a list to its size,
or a string str to (str, 1).
# swap key and value
x = [(1, "a"), (2, "b"), (3, "c")]
print(sc.parallelize(x).map(lambda x : (x[1],x[0])).collect())
# [('a', 1), ('b', 2), ('c', 3)]
rdd = sc.parallelize([(1,2),(3,4),(3,6)])
rdd.map(lambda k_v: (k_v[0],1)).collect()
# [(1, 1), (3, 1), (3, 1)]
flatMap(func)
Similar to map, but each input element can be mapped to 0 or more output elements,
so func should return a sequence rather than a single value; otherwise it fails.
A flattening operation: it splits each element into parts, turning 1 input into n outputs.
flatMap expects the function func to return a sequence (e.g. a list) rather than a single item.
rdd = sc.parallelize([1,2,3,4])
print(rdd.flatMap(lambda x: [x * 2]).collect())
# [2, 4, 6, 8]
lines = sc.parallelize(["hello world", "hi"])
print(lines.flatMap(lambda line: line.split(" ")).collect())
# ["hello", "world", "hi"]
# implementing this with map() would give a list of lists
filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.
Returns a new RDD which is composed of the elements of the original RDD that return true after applying func.
Filtering.
groupBy
Grouping: group the elements of the RDD by the key returned by a function; each key is paired with an iterable of its elements.
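A minimal sketch of groupBy, grouping integers by parity (the order of the groups may vary):
rdd = sc.parallelize([1, 2, 3, 4, 5])
# the grouping function returns the key; the values of each group come back as an iterable
print(rdd.groupBy(lambda x: x % 2).map(lambda kv: (kv[0], list(kv[1]))).collect())
# e.g. [(0, [2, 4]), (1, [1, 3, 5])]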
keyBy
Generates a key for each element using the given function,
producing a key-value RDD.
demo = [("T", 18), ("A", 23), ("G", 17), ("A", 7)]
rdd = sc.parallelize(demo)
print(rdd.keyBy(lambda x: x[0]).collect())
# [('T', ('T', 18)), ('A', ('A', 23)), ('G', ('G', 17)), ('A', ('A', 7))]
str = ["aaa" ,"aa", "a"]
rdd = sc.parallelize(str)
print(rdd.keyBy(lambda x: x.__len__()).collect())
# [(3, 'aaa'), (2, 'aa'), (1, 'a')]
【pseudo-sets】
distinct()
Deduplication.
Return an RDD without repetitions. Actually I showed you this before, but I didn’t tell you that this requires a Shuffle! (sending data through the network! may be slow!)
rdd2 = sc.parallelize([1, 2, 2, 3, 4])
print(rdd2.distinct().collect())
# [1, 4, 2, 3]
rdd1 = sc.parallelize(["agua", "vino", "cerveza", "agua", "agua", "vino"])
rdd2 = sc.parallelize(["cerveza", "cerveza", "agua", "agua", "vino", "coca-cola", "nara"])
print(rdd1.distinct().collect())
# ['cerveza', 'agua', 'vino']
print(rdd1.union(rdd2).collect())
# ['agua', 'vino', 'cerveza', 'agua', 'agua', 'vino', 'cerveza', 'cerveza', 'agua', 'agua', 'vino', 'coca-cola', 'nara']
print(rdd1.intersection(rdd2).collect())
# ['agua', 'vino', 'cerveza']
print(rdd1.subtract(rdd2).collect())
# []
print(rdd1.distinct().cartesian(rdd2.distinct()).collect())
# [('cerveza', 'cerveza'), ('cerveza', 'coca-cola'), ('cerveza', 'nara'), ('cerveza', 'agua'), ('cerveza', 'vino'), ('agua', 'cerveza'), ('vino', 'cerveza'), ('agua', 'coca-cola'), ('agua', 'nara'), ('vino', 'coca-cola'), ('vino', 'nara'), ('agua', 'agua'), ('agua', 'vino'), ('vino', 'agua'), ('vino', 'vino')]
distinct()
Return an RDD without repetitions. Warning! Requires a shuffle (sending data through the network may be slow).
union(rdd)
Return the union of two RDDs (keeps duplicates)
intersection(rdd)
Return the intersection of two RDDs (remove duplicates) - Warning! Require a shuffle!
subtract(rdd)
Return the elements that are in the first RDD but not in the second - Warning! Require a shuffle!
cartesian(rdd)
Return an RDD with all potential pairs of elements of both RDDs.
【key-value】
groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs. This one might be very time consuming as it moves lots of data through the network!
Returns a new RDD of tuples (k, iterable(v)) - Warning! Require a shuffle!
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
# cast the iterator x[1] to a list; groupByKey is relatively inefficient, prefer reduceByKey
print(rdd.groupByKey().map(lambda x: (x[0], list(x[1]))).collect())
# [(1, [2]), (3, [4, 6])]
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.
Returns a new RDD of tuples (k,v) where the values of each key k are aggregated using using the function func. This function should take two elements of type v and return the same type.
Note that reduceByKey is a transformation,
not an action.
It does grouping and aggregation in one operation:
the values of each key are reduced together,
for example summing the values per key, as in word count.
It is efficient thanks to internal optimisations (local combining before the shuffle).
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
print(rdd.reduceByKey(lambda a, b: a + b).collect())
# [(1, 2), (3, 10)]
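Putting flatMap, map and reduceByKey together gives the classic word count. A minimal sketch, assuming a local text file such as README.md:
counts = (sc.textFile("README.md")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(3))
# a few (word, count) pairs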
sortByKey([ascending], [numPartitions])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
Returns a new RDD of tuples (k,v) that has been sorted (by default in ascending order).
rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
print(rdd2.sortByKey().collect())
# [(1, 'a'), (1, 'b'), (2, 'c')]
join(rdd)
Inner join between RDDs
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)]) print(sorted(x.join(y).collect()))
# [('a', (1, 2)), ('a', (1, 3))]
leftOuterJoin(rdd)
Joins two RDDs keeping every key of the first RDD; keys missing from the second RDD get the value None.
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
print(sorted(x.leftOuterJoin(y).collect()))
# [('a', (1, 2)), ('b', (4, None))]
rightOuterJoin(rdd)
Joins two RDDs keeping every key of the second RDD; keys missing from the first RDD get the value None.
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
print(sorted(x.rightOuterJoin(y).collect()))
# [('a', (1, 2))]
fullOuterJoin(rdd)
Joins the elements of two RDDs where the key must be present in any of the two RDDs.
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 8)])
print(sorted(x.fullOuterJoin(y).collect()))
# [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
Action
Return a result to the driver node.
An action triggers the computation immediately and produces a result;
it either returns the result to the driver process
or writes it to external storage.
rdd = sc.parallelize([1, 2, 3])
print(rdd.count())
# 3
print(rdd.take(2))
# [1, 2]
print(rdd.collect())
# [1, 2, 3]
rdd = sc.parallelize([5, 3, 1, 2])
# take the first 3 elements after sorting (ascending, then descending)
print(rdd.takeOrdered(3))
# [1, 2, 3]
print(rdd.takeOrdered(3, lambda s: -1 * s))
# [5, 3, 2]
takeOrdered(n, key=func)
Return n elements in ascending order or in the order determined by the optional function func
count()
Return the number of elements in the dataset.
Count the total number of elements.
take(n)
Return a list of the first n elements of an RDD.
first()
Return the first element of the dataset (similar to take(1)).
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
It brings the data from the workers to the driver.
The result is materialised as a Python list.
saveAsTextFile(path)
Save the RDD to external storage (as shown earlier, one output file per partition).
reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
# total number of letters
words = ["alex", "yang", "python"]
wordsRDD = sc.parallelize(words)
print(wordsRDD.map(lambda x: len(x)).reduce(lambda a, b: a + b))
# 14
lst = [1, 2, 3]
lstRDD = sc.parallelize(lst)
print(lstRDD.reduce(lambda a, b: a * b))
# 6
If the RDD contains tuples, reduce can combine keys and values separately, e.g.
lambda x, y: (x[0] + y[0], x[1] + y[1]) sums the first components and the second components of the tuples independently.
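For example, summing keys and values separately over an RDD of pairs:
pairs = sc.parallelize([(1, 2), (3, 4), (3, 6)])
# add the first components together and the second components together
print(pairs.reduce(lambda x, y: (x[0] + y[0], x[1] + y[1])))
# (7, 12)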
foreach(func)
Apply the function func to each element of the RDD. It doesn’t return anything. It could be useful to insert stuff in a database.
rdd = sc.parallelize([1, 2, 3])
# the printing happens on the worker nodes, so you won't see any output in the driver
rdd.foreach(lambda x: print(x))
key-value
countByKey()
Counts the number of elements for each key. Returns a dictionary.
Total number of occurrences of each key.
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
print(rdd.countByKey())
# defaultdict(<class 'int'>, {1: 1, 3: 2})
rdd1 = sc.parallelize(["a", "b", "a", "a"])
print(rdd1.countByKey())
# defaultdict(<class 'int'>, {'a': 3, 'b': 1})
collectAsMap()
Collect the RDD as a dictionary, but it only provides one of the values!
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
print(rdd.collectAsMap())
# {1: 2, 3: 6}
lookup(key)
Return the value associated to a given key.
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
print(rdd.lookup(3))
# [4, 6]
RDD caching
By default an RDD is not cached:
its computed result is not kept in memory.
If it is going to be reused, you must cache it explicitly
so that the data stays in memory and can be reused;
otherwise a lot of computation is wasted recomputing it.
When re-using RDDs, we should use the cache() (or persist) function to maintain the RDD
in the main memory of the workers, and therefore using it more efficiently.
persist()
cache()
Keep the computed result in memory
so it does not have to be recomputed every time the result is needed.
lines = sc.textFile("...", 4)
comments = lines.filter(lambda line: "Python" in line)
print(lines.count(), comments.count())
# When comments.count() runs,
# Spark recomputes lines from scratch,
# including re-reading the data, running the operations on every partition, and merging intermediate results on the driver.
# That is inefficient, so we use caching
# to keep lines in memory.
lines = sc.textFile("...", 4)
lines.cache()
...
There are different persistance/caching levels. E.g.:
❑Memory Only
❑Memory and Disk
❑Disk Only
❑…
In Python, all objects are serialised with the pickle library.
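A minimal sketch of picking a persistence level explicitly with persist() (cache() on an RDD is just persist() with the default memory-only level):
from pyspark import StorageLevel
lines = sc.textFile("README.md")
# keep the RDD in memory and spill to disk if it does not fit
lines.persist(StorageLevel.MEMORY_AND_DISK)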
Advanced RDD features
Shared variables
Spark automatically sends functions and global variables to all executors for each task (from the driver node to the worker nodes), and there is no communication among executors.
Any changes made to global variables by executors are not visible in the driver!
Executors cannot see each other's changes either; there is no shared memory, because the data is scattered across different places in the cluster.
This might be a problem for iterative jobs that share big data structures.
Inefficient to send large sized collections to each worker.
If a global variable is large, performance suffers,
because it has to be shipped with every task.
Spark provides two advanced variables:
【Broadcast variables】
❑Read-only variables that are sent efficiently to all executors and will remained cached.
❑Stored in the workers (in main memory), so that, they can be used by one or more Spark operations. It is only sent once, not for each task.
Imagine you have a big look-up table (yet small enough to store in main memory without parallelisation).
For example, a table to determine the name of a country given its abbreviation. To simulate that, I will create a dictionary that I would like to be shared (efficiently) by all nodes.
tableLookUp = {1: "a", 2: "b", 3: "c", 4: "d"}
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda v: tableLookUp[v]).collect())
# ['a', 'b', 'c', 'd']
# tableLookUp is a global variable, shared with all nodes by default,
# but once it gets large this becomes inefficient, so we use a broadcast variable.
tableLookUp_bc = sc.broadcast(tableLookUp)
# This uses peer-to-peer strategies to share those specific variables more efficiently than sending it one-by-one to each node.
# Then, we use that broadcast variable in our map function, which we know will be happening in parallel. Pay attention to the use of the ’attribute’ value to access the value of the broadcast variable:
print(rdd.map(lambda v: tableLookUp_bc.value[v]).collect())
Accumulators
Accumulators allow worker tasks to add to a shared variable;
the aggregated result becomes visible in the driver, which makes them handy for counting how many times something happened.
However, because of the fault-tolerance mechanism, a task that adds to the accumulator and then fails will, when re-run, apply the addition twice,
so accumulators are not well suited to transformations.
❑Aggregate values from the executors in the driver
❑Only the driver can access the values of these variables
❑For the tasks (executors), accumulators are written-only
❑You can only add values; typically used to implement efficient counters and parallel additions.
# These are global variables shared across the whole cluster execution.
# Every node adds to the same variable, even while it is also being modified by other nodes.
# They are useful as counters of operations that are happening.
# We can define an accumulator variables as:
accum = sc.accumulator(0)
# If we have an RDD in which we want to apply a function to each element,
# but it is not going to return anything, as we will accumulate in a global variable the result.
# We can use the action foreach for that.
rdd = sc.parallelize([1, 2, 3, 4])
# I first define the function I want to apply to each element. In this case, simply add the element
# to the accumulator, that is a global variable!
def f(x):
    global accum
    accum += x
# Now, we apply the action foreach, and we will see it doesn’t return anything:
rdd.foreach(f)
print(accum.value)
# 10
# If you run the foreach multiple times, the accumulator will keep increasing!
# The previous example wasn’t really useful, because we could do the same with a reduce, more efficiently and elegantly.
# But how about reading the ’Quixote’ book and do two things at once, extract the words and count the number of blank lines.
quixote_rdd = sc.textFile("quixote.txt")
# We could create a function that do both:
blank_lines = sc.accumulator(0)
def extract_words_blanklines(line):
    global blank_lines
    if line == "":
        blank_lines += 1
    return line.split(" ")
# And use that function within the flatMap:
words_quixote = quixote_rdd.flatMap(extract_words_blanklines)
print(words_quixote.count())
# 437863
print(blank_lines.value)
# 6820
Using Accumulators on transformations might not be ideal, maybe good for debugging only. If a task fails, we can’t guarantee that it won’t add 1 to that global variable! In this example, the counter could be wrong! This won’t happen if we use accumulators with Actions.
Partitions
A quick example:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
sum_count = rdd.map(lambda num: (num, 1)).reduce(lambda x,y: (x[0]+y[0],x[1]+y[1]))
#(28, 7)
Shared variables (broadcast variables) and partition-level operations such as mapPartitions are both ways to improve performance.
Instead of applying an operation to every single element of an RDD, we might want to use all the elements of a partition at once.
mapPartitions(func)
Apply the function functo each partition of the RDD. func receives an iterator and returns another iterator that can be of different type.
In many occasions, we may need to use entire data partitions to perform some operations. For example in machine learning using mapPartitions could be an obvious way to do a divide-and- conquer approach, so we learn a model in a subset of the data.
Let’s see the difference between map and mapPartitions with an example.
Let’s say we want to compute the average of an RDD of integers. We could do something like this:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
# We will try to come up with a tuple (a,b),
# where a is the sum of all elements, and b is the count of elements.
# The map could simply return the number as key, and the value 1 (to indicate you counted it once).
# The reduce could aggregate the elements of the tuple individually:
sum_count = rdd.map(lambda num: (num, 1)).reduce(lambda x,y: (x[0]+y[0],x[1]+y[1]))
print(sum_count)
# (28, 7)
print(sum_count[0]/sum_count[1])
# 4.0
# With mapPartitions,
# we should probably define a function to determine how to compute the tuple in an entire partition of data (which contains multiple tuples).
# We have a number of partitions:
print(rdd.getNumPartitions())
# 4
# We said before that you can split the data into different partitions when loading it rather than apply the operation element-by-element.
def partitionCtr(nums):
    """Compute sumCounter for partition"""
    sum_count = [0, 0]
    for num in nums:
        sum_count[0] += num
        sum_count[1] += 1
    return [sum_count]  # this should return an iterator!
# The reduce would be the same:
print(rdd.mapPartitions(partitionCtr).reduce(lambda x,y: (x[0]+y[0],x[1]+y[1])))
# (28, 7)
# Which one is best? In a single computer, and for this very simple example, we won’t see much difference.
# However, pre-computing things in each partition (map) before sending things away to the reducers may save a lot of network overhead!
mapPartitionsWithIndex(func)
Apply the function func to each partition of the RDD. func receives a tuple (integer, iterator) where the integer represent the index of the partition, and iterator contains all the elements of the partitions.
# Same as before, but we can figure out the index of each partition, which could be really useful:
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7], 3)
# The function we define must have 2 input parameters, the index,
# and the iterator (which allow us to go through all the elements in the partition).
# Let’s simply print out both indexes and elements of each partition:
def show(index, iterator):
    return ['index: ' + str(index) + " values: " + str(list(iterator))]
print(rdd.mapPartitionsWithIndex(show).collect())
# ['index: 0 values: [1, 2]', 'index: 1 values: [3, 4]', 'index: 2 values: [5, 6, 7]']
foreachPartition(func)
Apply the function func to each partition of the RDD but it doesn’t return anything. func receives an iterator and returns nothing.
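A sketch of foreachPartition: set up one expensive resource, such as a database connection, per partition instead of per element (the connection helpers below are hypothetical placeholders):
def save_partition(iterator):
    # conn = open_db_connection()   # hypothetical helper, created once per partition
    for record in iterator:
        pass                        # conn.insert(record)
    # conn.close()

rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
rdd.foreachPartition(save_partition)   # returns nothing; the work happens on the workers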
Numeric RDD operations
Spark provides some built-in methods to generate some descriptive statistics of a numeric RDDs. E.g. stats(), count(), mean(), max(), etc.
stats()
If you have an RDD of integers or real numbers, you can easily apply the stats method to obtain
some basic stats of your RDD.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
results = rdd.stats()
print(results)
# (count: 7, mean: 4.0, stdev: 2.0, max: 7, min: 1)
print(type(results))
# <class 'pyspark.statcounter.StatCounter'>
# This method returns a StatsCounter object with all the statistics of this RDD.
# The main benefit of this method is that all the stats are computed in one single go over the data.
# You could also apply specific operations like count, mean, sum, max, and so forth,
# but if you do it individually it will be less efficient than stats() because every time we run one of these operations we go over the data.
# So the following print statement provides exactly the same output but it is more inefficient than stats().
print(f"(count: {rdd.count()}, mean: {rdd.mean()}, stdev: {rdd.stdev()}, max: {rdd.max()}, min: {rdd.min()}")
RDD dependencies
we need to distinguish between two types of transformations:
narrow and wide dependencies.
- Narrow dependencies: Each partition will contribute to only one output partition. What does it mean? It means that the operation is applied to each element of an RDD and to compute the output we don't need any other information. Imagine, for example, a map function that multiplies each element of an RDD by two (e.g. rdd.map(lambda x: x * 2)). The result does not depend on the values of other elements, only on the element itself.
- Wide dependencies: Multiple input partitions contribute to many output partitions. Here we are talking about transformations that provoke a shuffle, and therefore traffic over the network, because the result depends on more than one input. E.g. rdd.groupByKey() is a wide dependency, because it requires data from different nodes to create the list of values for each key.
I already anticipated that Spark performs some optimisations. The interesting thing about this distinction is that narrow dependencies can be optimised, while wide dependencies mark the end of a stage in Spark.
As a data structure, an RDD
is essentially a read-only, partitioned collection of records.
An RDD can have multiple partitions,
each holding a fragment of the data.
RDDs can depend on one another.
If each partition of a parent RDD is used by at most one partition of a child RDD, the dependency is called narrow;
❑ Each partition will contribute to only one output partition
if multiple child partitions depend on one partition of the parent RDD,
the dependency is called wide.
❑Input partitions contribute to many output partitions
❑There will be a shuffle, exchanging partitions/data across a cluster
❑Define Spark Stages
❑E.g. sort, reduceByKey, groupByKey, join.
Different operations produce different kinds of dependencies, depending on their nature.
For example, map produces a narrow dependency,
while join produces a wide dependency.
Spark distinguishes these two kinds of dependencies for two reasons.
First,
narrow dependencies can be executed as a pipeline on a single cluster node,
for example running a filter right after a map.
In contrast, a wide dependency requires the data of all parent partitions to be available and shuffled before execution can continue.
Second, failure recovery is more efficient with narrow dependencies, because only the lost parent partitions need to be recomputed, and those recomputations can run in parallel on different nodes.
With wide dependencies, in contrast, the failure of a single node may cause some partitions of all the ancestor RDDs of an RDD to be lost, forcing the computation to be re-executed.
The difference between narrow and wide dependencies can be pictured as follows:
think of each small box as a partition. Taking map as an example, the data in a partition on the left becomes the transformed data of the corresponding partition on the right; that output partition depends only on its own original partition, no data moves across the network, and the work stays within a single partition. That is a narrow dependency.
Narrow dependencies can be optimised internally; wide dependencies cannot.
In short: a narrow dependency depends on a single partition, while a wide dependency depends on several partitions (and has an impact on several partitions).
Spark creates a Directed Acyclic Graph (DAG) to optimise the execution of an application.
- An action defines a new job (e.g. a count(), collect(), etc)
- A job is split into a number of stages, whose boundaries are defined by the appearance of wide transformations (which can't be further optimised).
- Each stage is composed of a number of tasks, which is basically applying an operation to each partition of an RDD.
I am going to investigate one simple example. We have some data, in which I am going to perform a number of operations (several transformation and one final action):
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(rdd.filter(lambda x : x<5).map(lambda x: (x,x)).groupByKey().map(lambda k_v : (sum(k_v[1]),k_v[0])).sortByKey().count())
# 4
# First we remove the elements greater than or equal to 5,
# then we transform each element into a tuple (element, element).
In the code above we have one action: count() at the end, that defines a job. We have two wide transformations: the groupByKey and the sortByKey that split the processing into 3 stages.
In the first two lines, we first remove those elements greater or equal than 5, and then we transform each element into a tuple of (element, element).
Both filter and map are narrow transformations, and we could do both in one single pass through the data!
rdd.filter(lambda x : x<5)\
.map(lambda x: (x,x))\
The next instruction was a .groupByKey(), which is a wide transformation and requires shuffling data around. So the previous two instructions formed a Stage that is optimised and can be applied in one single go.
Then, we don’t have a wide transformation until we get to the sortByKey, so the next two lines form a different stage:
.groupByKey()
.map(lambda k_v : (sum(k_v[1]),k_v[0]))\
Here, we grouped the tuples by key, and then the map will be used to apply an operation on each key.
Finally, the last wide transformation is put together as a final Stage with the action to complete the job.
.sortByKey()\
.count()
All of these transformations plus the final action together form one job, and the job is split into 3 stages; stages are delimited by wide dependencies.
When the user performs an action on an RDD,
the scheduler builds a DAG (directed acyclic graph) from the RDD dependency graph and uses it to execute the program.
The DAG is made up of several stages;
each stage contains a chain of consecutive narrow dependencies,
and the boundaries between stages are wide dependencies.
In the usual DAG diagrams, solid boxes represent RDDs,
the rectangles inside a box represent its partitions,
and a partition already held in memory is drawn in black.
Fault tolerance
Traditional distributed systems achieve fault tolerance through data replication or recovery logs.
For a data-centric system,
both approaches are very expensive,
because they copy large amounts of data across the cluster network,
and network bandwidth is far slower than memory access.
RDDs support fault tolerance by design.
First, an RDD is itself an immutable dataset;
second, Spark uses a DAG as its execution model,
so through the dependencies between RDDs it remembers the whole series of operations as a DAG.
When a task fails,
Spark simply recomputes the affected partitions according to the DAG.
Since it does not need replication for fault tolerance,
Spark greatly reduces the cost of shipping data across the network.
An RDD is rebuilt automatically when a failure occurs.
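You can inspect the lineage (the chain of dependencies Spark keeps for recomputation) with toDebugString(). A small sketch:
rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 2)
# toDebugString() returns a bytes description of the RDD and its parents
print(rdd.toDebugString().decode())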
Limitations of RDDs
RDD performance is relatively limited because:
We learnt that Spark optimises the execution of the parallel tasks on RDDs but this is only possible for narrow transformations.
Spark can’t optimise much more RDD executions because transformations are somehow ’black-boxes’. We apply a function to a partition of data and return another partition, as an iterator (Partition => Iterator[T]). However, the data type ([T]) of each partition could be anything (that needs to be serialised when sending it to other nodes), and that is a great thing, providing lots of flexibility, but at the same time makes it very difficult to be optimised!
Although RDDs did reduce the I/O overhead, when Spark is in need of writing something to disk or distribute it over the network, it does this using Java serialisation (and something called Kryo for quicker serialisation). The overhead of serialising individual Java and Scala objects is expensive and requires sending both data and structure between nodes.
Spark SQL can improve on the performance of plain RDDs.
RDD performance also depends on the programming language used, whereas DataFrame performance is language independent.
If you like the world of Big Data, you will soon realise that one of the main challenges is to keep up with the latest advances in the available frameworks. Spark is a great example of this, as it continues evolving very rapidly, and one of its latest modules, SparkSQL aims to improve the performance of RDDs.
RDDs are very good but they are so flexible that Spark can’t apply further optimisations.
Much of what happens inside a transformation is a black box to Spark,
so optimisation is only possible across narrow transformations.
This is where Spark SQL, described next, comes in.
SparkSQL
The second half of this video:
https://www.bilibili.com/video/BV16z4y1m7yL?from=search&seid=8985173065620006881
Spark SQL is a module in Spark for structured data processing. The key idea of this module is to impose structure to data, so that, Spark can perform further optimisations by knowing more about the underlying data.
❑Impose structure to the data to optimise the execution
❑This means to organise the data – similar to Pandas in Python
❑Spark SQL is a module to process structured data
❑It allows further optimisation as Spark knows more about the underlying data
❑However, RDDs are still being used underneath! DataFrames and Datasets are still built on RDDs under the hood.
❑We can apply SQL queries
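For instance, a DataFrame can be registered as a temporary view and queried with plain SQL. A minimal sketch, assuming a SparkSession named sparkSession as created later in these notes:
df = sparkSession.createDataFrame([('Alice', 1), ('Pepe', 25)], ['name', 'age'])
# register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sparkSession.sql("SELECT name FROM people WHERE age > 18").show()
# +----+
# |name|
# +----+
# |Pepe|
# +----+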
【Spark usage patterns with tuples】
rdd = sc.parallelize([(1,2),(3,4),(3,6)])
rdd.map(lambda k_v: (k_v[0],1)).collect()
#[(1,1),(3,1),(3,1)]
rdd.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
#[(1,[2]),(3,[4,6])]
#groupByKey performs poorly; replace it with reduceByKey,
#because all the key-value pairs are shuffled around
#Because reduceByKey will apply some optimisations (using combiners) by
#performing local reductions before moving things across the network!
rdd.reduceByKey(lambda a, b: a + b).collect()
#[(1, 2),(3, 10)]
#elements with the same key end up on the same node
words_quixote = quixote_rdd.flatMap(lambda line: line.split(' ')).cache()
#read the file, split every line into words, and cache the result
words_quixote.filter(lambda line: "blockhead" in line).count()
#count how many words contain "blockhead"
DataFrames
DataFrames (>= Spark 1.3)
DataFrames perform better than RDDs.
❑RDDs of rows with columns that can be accessed by their names
❑Similar to Pandas in Python (dataframes in R)
❑Avoid the Java serialisation performed by RDDs: imposing structure on your data allows Spark to avoid Java serialisation and object overhead.
❑API natural for developers familiar with building query plans (SQL)
❑Introduced as a part of the Tungsten project
❑Efficient memory management
❑Concept of schema to describe data
The DataFrame API was originally quite different from the RDD API because it builds a relational query plan, so it was more intuitive for database programmers, and did not follow an object-oriented programming style. The DataFrame API introduced the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization!
Structured data vs. Non-structured data: DataFrames vs RDDs
❑RDDs:
Advantages
• Easy to understand and very flexible
• Object-oriented programming (e.g. data.map())
• Compile-time type-safety (obviously not in Python!)
Main weaknesses
• Performance in some cases
• Data distribution over network or storing implies→serialisation
• Java serialization or Kryo (the serialisation overhead is significant)
• Overhead for each object (storage and send it)
• Overhead of the Garbage Collector
Syntax comparison:
RDDs
rdd.filter(lambda x: x.age > 21)
DataFrames
SQL style
df.filter("age > 21");
OOP style
df.filter(df.col("age").gt(21)); df.filter(df.col("age") > 21);
RDDs:
❑Allow us to decide HOW to do operations
❑Limits the optimisation that Spark can do.
DataFrames/Datasets:
❑Allow us to define WHAT we want to do
❑Let Spark decide how to do it
DataFrames have become the primary Machine learning API for Spark.
What is a DataFrame?
❑A 2-D data structure (a table), which can be accessed by the name of the column.
❑All elements of a column belong to the same type.
❑In Spark, a DataFrame is a distributed collection of elements of type Row.
❑A Row is an array that can be indexed by name (like a dictionary).
❑Like RDDs, they are immutable!
| Age  | Name    |
|------|---------|
| null | Michael |
| 30   | Andy    |
| 19   | Justin  |
DataFrame performance is the same in every language.
The SparkSession is used as the entry point.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("pySparkSQL example") \
    .getOrCreate()
sc = spark.sparkContext
# you can still get the SparkContext back from the SparkSession
sparkSession = spark  # alias used in the examples below
Creating DataFrames
❑We can create a DataFrame from an existing RDD, and Hive Table or any other Spark data source.
❑We require a schema that defines the column names and their types.
❑Also to determine whether the values can be NULL or not (similar to a CREATE TABLE in SQL)
To create a DataFrame we need to provide a schema, which can be defined following different strategies:
To define a schema in Spark there are 3 strategies:
❑Infer the schema automatically from the data (e.g. JSON files, RDDs, etc)
❑Infer the schema automatically from metadata (e.g. JDBC, JavaBeans)
❑Explicit definition
Similar to RDDs, we can create DataFrames from an existing collection or a storage system.
❑Note that the use of those strategies depend on the input data (e.g. JSON vs RDD) and the language you’re using.
Inferring the schema from the data:
# From tuples
tuples = [('Alice', 1)]
df = sparkSession.createDataFrame(tuples)
print(type(df))
# <class 'pyspark.sql.dataframe.DataFrame'>
print(df)
# DataFrame[_1: string, _2: bigint]
df.printSchema()
# root
# |-- _1: string (nullable = true)
# |-- _2: long (nullable = true)
print(df.collect())
# [Row(_1='Alice', _2=1)]
df.show()
# +-----+---+
# | _1| _2|
# +-----+---+
# |Alice| 1|
# +-----+---+
# specify the column names
print(sparkSession.createDataFrame(tuples, ['name', 'age']).collect())
# [Row(name='Alice', age=1)]
# From an RDD of tuples
rdd = sc.parallelize(tuples)
print(sparkSession.createDataFrame(rdd).collect())
# [Row(_1='Alice', _2=1)]
df2 = sparkSession.createDataFrame(rdd, ['name', 'age'])
print(df2.collect())
# [Row(name='Alice', age=1)]
# show only the first row
df2.show(1)
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 1|
# +-----+---+
rdd.toDF().show(1)
# +-----+---+
# | _1| _2|
# +-----+---+
# |Alice| 1|
# +-----+---+
Inferring the schema from the data:
# From RDDs of Rows
from pyspark.sql import Row
rdd = sc.parallelize([('Alice', 1)])
Person = Row('name', 'age')
print(Person)
# <Row('name', 'age')>
Person("Isaac",34)
# Row(name='Isaac', age=34)
person = rdd.map(lambda r: Person(*r))
print(sparkSession.createDataFrame(person).collect())
# [Row(name='Alice', age=1)]
row = Row(name="Isaac", age=34)
print(row)
# Row(name='Isaac', age=34)
# We can access the fields like ’attributes’ or like values of a dictionary.
print(row.name)
# Isaac
# this form is recommended to avoid potential name clashes
print(row["name"])
# Isaac
# A few operators you can use in this data type.
print("name" in row)
# True
print("surname" in row)
# False
Manually specifying the Schema:
❑You will normally try to do this automatically, but if you need to
specify it manually, you can use StructType.
❑You need to access the different SQL types of Spark.
from pyspark.sql.types import *
rdd = sc.parallelize([('Alice', 1)])
schema = StructType([StructField("name", StringType(), True), StructField("age", IntegerType(), True)])
df = sparkSession.createDataFrame(rdd, schema)
print(df.collect())
# [Row(name='Alice', age=1)]
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: integer (nullable = true)
print(sparkSession.createDataFrame(rdd, "a: string, b: int").collect())
# [Row(a='Alice', b=1)]
You can read from different data types and infer their schema.
sparkSession.read.json(”data/data.json")
Inferring a schema for a large dataset might be too slow. Use a random subset of the data to do this.
dataDF = sparkSession.read.option("samplingRatio", 0.2).json(”data/data.json")
dataDF.show(1)
dataDF.printSchema()
DataFrame to RDD to DataFrame. Both data structure are interoperable:
df.rdd
spark.createDataFrame(rdd)
The column names are like indexes, but doing something like df[“colName”] will not provide the actual content, but just the column name. You need to perform an operation to select the column.
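A tiny illustration of that point, using the df with name and age built just above: indexing gives a Column expression, and only a transformation plus an action returns data:
print(df["age"])
# Column<'age'>  (a column expression, not the data; the exact repr varies by Spark version)
df.select(df["age"]).show()
# +---+
# |age|
# +---+
# |  1|
# +---+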
DataFrame operations
Again we have two types of operations:
❑Transformations:
Create a new DataFrame using a lazy evaluation, i.e., there is no computation until an action is triggered.
❑Actions:
Use a DataFrame to return a result to the driver (or to save it to drive)
Something important to highlight is that in the SparkSQL there are various ways to perform the same operations. Some times is more Object-Oriented Programming style, others more SQL style. Use the ones you feel more comfortable with.
Transformations
❑Similar to SQL type of operations
❑Many more like distinct, dropDuplicates, drop, and cache()
In SparkSQL there are many equivalent ways to program operations.
Column Transformations
❑Very important transformations, very useful to operate at the
column level (not at the DataFrame level!)
❑These are functions we can use INSIDE of transformations.
❑These functions return column expressions (don’t do anything)
❑Same problem as we showed before when doing: df['colName']
❑They don’t do anything directly, you need to put them in a transformation (e.g. a select)
❑Example of these: alias(), between(), when(), explode()…
tuples = [('Alice', 1), ('Bob', 4), ('Juan', 10), ('Pepe', 25), ('Panchito', 15)]
df = sparkSession.createDataFrame(tuples, ['name', 'age'])
print(df)
# DataFrame[name: string, age: bigint]
select( * cols)
Return a new DataFrame using a series of expressions that could be column names or Column objects. If " * " is used, all the columns of the DataFrame are returned.
Use select to create a new DataFrame with the content of another DataFrame.
df.select('*').show()
# +--------+---+
# | name|age|
# +--------+---+
# | Alice| 1|
# | Bob| 4|
# | Juan| 10|
# | Pepe| 25|
# |Panchito| 15|
# +--------+---+
# the following three lines give the same result
df.select('age').show(2)
df.select(df.age).show(2)
df.select(df['age']).show(2)
# +---+
# |age|
# +---+
# | 1|
# | 4|
# +---+
# only showing top 2 rows
df.select(df.name, (df.age + 10)).show()
# +--------+----------+
# | name|(age + 10)|
# +--------+----------+
# | Alice| 11|
# | Bob| 14|
# | Juan| 20|
# | Pepe| 35|
# |Panchito| 25|
# +--------+----------+
df.select(df.name, (df.age + 10).alias('age')).show()
# +--------+---+
# | name|age|
# +--------+---+
# | Alice| 11|
# | Bob| 14|
# | Juan| 20|
# | Pepe| 35|
# |Panchito| 25|
# +--------+---+
selectExpr( * expr)
Variation of select to allow for SQL expressions.
You can apply a number of SQL expressions on this, this might look a bit funny:
df.selectExpr("age * 2", "abs(age)").show()
# +---------+--------+
# | (age * 2)|abs(age)|
# +---------+--------+
# | 2| 1|
# | 8| 4|
# | 20| 10|
# | 50| 25|
# | 30| 15|
# +---------+--------+
filter(condition) / where(condition)
Filter the rows using a condition.
Filter and where are the same transformation and they are very similar to what we did with RDDs.
However, there are two different styles to do this:
• Object Oriented Programming (OOP) style
• SQL style
Both styles do the same, choose the one you feel more comfortable with!
# OOP style
df.filter(df.age > 18).show()
# +----+---+
# | name|age|
# +----+---+
# |Pepe| 25|
# +----+---+
df.where(df.age == 1).show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 1|
# +-----+---+
# SQL style
df.filter("age > 18").show()
# +----+---+
# | name|age|
# +----+---+
# |Pepe| 25|
# +----+---+
df.where("age = 1").show()
# +-----+---+
# | name|age|
# +-----+---+
# |Alice| 1|
# +-----+---+
orderBy( * cols, ascending)
Return a new DataFrame sorted by the specified columns. By default, ascending order.
sort( * cols, ** kwargs)
Return a new DataFrame sorted by the specified columns. The condition must be of type Column or a SQL expression
They have different parameters but they perform the same operation. In both, we first indicate the columns that will be used to sort the DataFrame, and then the way we want to sort it, i.e. ascending or descending.
Let’s do descending order by age:
# the following three lines give the same result
print(df.sort(df.age.desc()).collect())
print(df.sort("age", ascending=False).collect())
print(df.orderBy(df.age.desc()).collect())
# [Row(name='Pepe', age=25), Row(name='Panchito', age=15), Row(name='Juan', age=10), Row(name='Bob', age=4), Row(name='Alice', age=1)]
from pyspark.sql.functions import *
print("Ascending order by age:")
print(df.sort(asc("age")).collect())
print("Descending order by age but ascending by name:")
print(df.orderBy(desc("age"), "name").collect())
print(df.orderBy(["age", "name"], ascending=[0, 1]).collect())
# Ascending order by age:
# [Row(name='Alice', age=1), Row(name='Bob', age=4), Row(name='Juan', age=10), Row(name='Panchito', age=15), Row(name='Pepe', age=25)]
# Descending order by age but ascending by name:
# [Row(name='Pepe', age=25), Row(name='Panchito', age=15), Row(name='Juan', age=10), Row(name='Bob', age=4), Row(name='Alice', age=1)]
# [Row(name='Pepe', age=25), Row(name='Panchito', age=15), Row(name='Juan', age=10), Row(name='Bob', age=4), Row(name='Alice', age=1)]
distinct()
Return a new DataFrame with the uniques rows of the original.
dropDuplicates( * cols)
Return a new DataFrame without duplicates considering the columns specified
The difference between these two operations is that in dropDuplicates we can specify the fields to consider if two rows are duplicated.
I am going to create a DataFrame from an RDD of Rows with attributes name, age and height:
from pyspark.sql import Row
df2 = sc.parallelize([Row(name='Alice', age=5, height=80), Row(name='Alice', age=5, height=80), Row(name='Alice', age=10, height=80)]).toDF()
df2.show()
# +-----+---+------+
# | name|age|height|
# +-----+---+------+
# |Alice| 5| 80|
# |Alice| 5| 80|
# |Alice| 10| 80|
# +-----+---+------+
# the following two calls have the same effect
df2.distinct().show()
df2.dropDuplicates().show()
# +-----+---+------+
# | name|age|height|
# +-----+---+------+
# |Alice| 5| 80|
# |Alice| 10| 80|
# +-----+---+------+
df2.dropDuplicates(['name', 'height']).show()
# +-----+---+------+
# | name|age|height|
# +-----+---+------+
# |Alice| 5| 80|
# +-----+---+------+
drop(col)
Return a new DataFrame without a specified column.
This transformation removes the column but doesn’t modify the input DataFrame (as they are immutable). It simply returns the DataFrame without a column
Let’s remove the age column in df:
# the following two calls are equivalent
print(df.drop('age').collect())
print(df.drop(df.age).collect())
# [Row(name='Alice'), Row(name='Bob'), Row(name='Juan'), Row(name='Pepe'), Row(name='Panchito')]
# df itself stays unchanged
print(df.collect())
# [Row(name='Alice', age=1), Row(name='Bob', age=4), Row(name='Juan', age=10), Row(name='Pepe', age=25), Row(name='Panchito', age=15)]
limit(num)
Limit the number of rows obtained as a result.
You can indicate the number of Rows you want to get before applying an action:
print(df.limit(0).collect())
# []
print(df.limit(1).collect())
# [Row(name='Alice', age=1)]
withColumn(colName, col)
Return a new DataFrame adding a new column or replacing an existing column with the given name.
Similar to what we did before with select, we can add a new column, provide a name and perform
an operation to it:
print(df.withColumn('age2', df.age + 2).collect())
# [Row(name='Alice', age=1, age2=3), Row(name='Bob', age=4, age2=6), Row(name='Juan', age=10, age2=12), Row(name='Pepe', age=25, age2=27), Row(name='Panchito', age=15, age2=17)]
withColumnRenamed(existing, new)
Return a new DataFrame renaming an existing column.
You can rename the columns:
print(df.withColumnRenamed('age', 'age2').collect())
# [Row(name='Alice', age2=1), Row(name='Bob', age2=4), Row(name='Juan', age2=10), Row(name='Pepe', age2=25), Row(name='Panchito', age2=15)]
cache()
Keep the DataFrame cached for future re-use.
These are important transformations that are applied on Columns and only return a Column, i.e. a SQL expression. To apply them to a DataFrame, they must be used in conjunction with other transformations such as select. Note that they are not applied on a DataFrame directly either; e.g. df.alias() wouldn’t work.
alias( * alias)
Return a column with a new name
Change the name of a column:
print(df.name.alias("Hello"))
# Column<b'name AS `Hello`'>
# This hasn’t done anything at all in df nor has returned a new DataFrame. It only returns that Column expression that can be used to modify the name.
df.select(df.name.alias("Hello")).show()
# +--------+
# | Hello|
# +--------+
# | Alice|
# | Bob|
# | Juan|
# | Pepe|
# |Panchito|
# +--------+
# Again, df is not modified, but a new DataFrame is returned with the column name name changed to Hello.
between(lowerBound, upperBound)
True if the values is in the range specified
df.age.between(18, 65)
# Column<b'((age >= 18) AND (age <= 65))'>
df.select(df.name, df.age.between(2, 4)).show()
+--------+---------------------------+
| name|((age >= 2) AND (age <= 4))|
+--------+---------------------------+
| Alice| false|
| Bob| true|
| Juan| false|
| Pepe| false|
|Panchito| false|
+--------+---------------------------+
df.filter(df.age.between(2, 4)).show()
# +----+---+
# |name|age|
# +----+---+
# | Bob| 4|
# +----+---+
isNull() / isNotNull()
True if the value is NULL or not
print(df.age.isNull())
# Column<b'(age IS NULL)'>
print(df.age.isNotNull())
# Column<b'(age IS NOT NULL)'>
df.filter(df.name.isNull()).show()
# +----+---+
# |name|age|
# +----+---+
# +----+---+
when(condition, value) / otherwise(value)
Evaluate a list of conditions and perform a transformation based on those.
Similar to an if-else structure. For these cases, we need to import the SQL functions, here I am
going to import it as ’F’:
from pyspark.sql import functions as F
print(F.when(df.age > 18, "adult").otherwise("kid"))
# Column<b'CASE WHEN (age > 18) THEN adult ELSE kid END'>
df.select(df.name, F.when(df.age > 18, 'adult').otherwise('kid')).show()
+--------+--------------------------------------------+
| name|CASE WHEN (age > 18) THEN adult ELSE kid END|
+--------+--------------------------------------------+
| Alice| kid|
| Bob| kid|
| Juan| kid|
| Pepe| adult|
|Panchito| kid|
+--------+--------------------------------------------+
df.select(df.name, df.age, F.when(df.age > 18, 'Adult').when(df.age < 12, 'kid').otherwise("teenager")).show()
+--------+---+--------------------------------------------------------------------------+
| name|age|CASE WHEN (age > 18) THEN Adult WHEN (age < 12) THEN kid ELSE teenager END|
+--------+---+--------------------------------------------------------------------------+
| Alice| 1| kid|
| Bob| 4| kid|
| Juan| 10| kid|
| Pepe| 25| Adult|
|Panchito| 15| teenager|
+--------+---+--------------------------------------------------------------------------+
startswith(other), substring(startPos, len), like(other)
Function to operate with strings.
print(df.select(df.name.substr(1, 3).alias("Abbr")).collect())
# [Row(Abbr='Ali'), Row(Abbr='Bob'), Row(Abbr='Jua'), Row(Abbr='Pep'), Row(Abbr='Pan')]
isin(* cols)
True if the value is in the list of arguments
A Column expression to check whether my name or Pepe's appears in the DataFrame:
print(df.name.isin("Isaac", "Pepe"))
# Column<b'(name IN (Isaac, Pepe))'>
# the following two lines have the same effect
print(df.filter(df.name.isin("Isaac", "Pepe")).collect())
print(df[df.name.isin("Isaac", "Pepe")].collect())
# [Row(name='Pepe', age=25)]
explode(col)
Returns a new row for each element of the array.
This is an important operation that can simulate the use of flatMap in RDDs because it creates as many rows as values that are contained in an array or a dictionary.
It basically expands a list, tuple or dictionary into rows.
eDF = sparkSession.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
eDF.show()
# +---+---------+--------+
# | a| intlist|mapfield|
# +---+---------+--------+
# | 1|[1, 2, 3]|[a -> b]|
# +---+---------+--------+
eDF.select("*", explode("intlist")).show()
# +---+---------+--------+---+
# | a| intlist|mapfield|col|
# +---+---------+--------+---+
# | 1|[1, 2, 3]|[a -> b]| 1|
# | 1|[1, 2, 3]|[a -> b]| 2|
# | 1|[1, 2, 3]|[a -> b]| 3|
# +---+---------+--------+---+
# As we used the "*", columns a,intlist and mapfield remain the same, but duplicated in each row. There is now a new column col that contains the elements of intlist. We could actually give it a new name with alias:
eDF.select("*", explode("intlist").alias('new_explode')).show()
# +---+---------+--------+-----------+
# | a| intlist|mapfield|new_explode|
# +---+---------+--------+-----------+
# | 1|[1, 2, 3]|[a -> b]| 1|
# | 1|[1, 2, 3]|[a -> b]| 2|
# | 1|[1, 2, 3]|[a -> b]| 3|
# +---+---------+--------+-----------+
eDF.select(F.explode(eDF.intlist).alias("anInt")).show()
# +-----+
# |anInt|
# +-----+
# | 1|
# | 2|
# | 3|
# +-----+
eDF.select(F.explode(eDF.mapfield).alias("key", "value")).show()
# +---+-----+
# |key|value|
# +---+-----+
# | a| b|
# +---+-----+
lit(value)
Create a column with value provided
from pyspark.sql import functions as F
print(F.lit(0))
# Column<b'0'>
df.select("*", F.lit(0).alias("Balance")).show()
# +--------+---+-------+
# | name|age|Balance|
# +--------+---+-------+
# | Alice| 1| 0|
# | Bob| 4| 0|
# | Juan| 10| 0|
# | Pepe| 25| 0|
# |Panchito| 15| 0|
# +--------+---+-------+
length(col)
Returns the length of the column
print(F.length(df.name))
# Column<b'length(name)'>
df.select(F.length(df.name).alias('len')).show()
# +---+
# |len|
# +---+
# | 5|
# | 3|
# | 4|
# | 4|
# | 8|
# +---+
union(df)
Returns the union of the rows of two DataFrames (keeps duplicates).
intersect(df)
Returns the intersection of the rows of two DataFrames (removes duplicates). Warning! Requires a shuffle!
subtract(df)
Returns a DataFrame with the rows present in the first DataFrame and not
in the second. Warning! Requires a shuffle!
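A minimal sketch of these three set-like operations on two small DataFrames (row order in the output may vary):
df_a = sparkSession.createDataFrame([('Alice', 1), ('Bob', 4)], ['name', 'age'])
df_b = sparkSession.createDataFrame([('Bob', 4), ('Carol', 7)], ['name', 'age'])
df_a.union(df_b).show()      # 4 rows, duplicates kept
df_a.intersect(df_b).show()  # only ('Bob', 4)
df_a.subtract(df_b).show()   # only ('Alice', 1)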
Actions
show(n=20, truncate=True)
Print n rows of the DataFrame. Truncate indicates whether strings should be cut if too long
count()
Return the number of rows
collect()
Return all the rows of the DataFrame as a list of Rows Warning: Should fit in main memory
first()
Return the first row
take(n)
Return the first n rows as a list of Rows
toPandas()
Return the content as a pandas.DataFrame Warning: Should fit in main memory
columns
Return the column names a list
describe(* cols)
Provide some statistics for numeric columns (count, mean, stddev, min, and max)
explain(extended=False)
Print out the physical and logical
plans for debugging.
df.filter(df.age > 10).select(df.age).explain(True)
# == Parsed Logical Plan ==
# Project [age#7L]
# +- Filter (age#7L > cast(10 as bigint))
# +- LogicalRDD [name#6, age#7L], false
#
# == Analyzed Logical Plan ==
# age: bigint
# Project [age#7L]
# +- Filter (age#7L > cast(10 as bigint))
# +- LogicalRDD [name#6, age#7L], false
#
# == Optimized Logical Plan ==
# Project [age#7L]
# +- Filter (isnotnull(age#7L) AND (age#7L > 10))
# +- LogicalRDD [name#6, age#7L], false
#
# == Physical Plan ==
# *(1) Project [age#7L]
# +- *(1) Filter (isnotnull(age#7L) AND (age#7L > 10))
# +- *(1) Scan ExistingRDD[name#6,age#7L]
DataFrame caching
❑Caching DataFrames: it doesn’t happen by default!!
Lines_df = spark.read.text("...")
words_df = lines_df.select(explode(split('value', ‘ ')).alias('word’)).cache()
words_df.where(words_df.word.like("%block-head%")).count()
words_df.where(words_df.word.like("%spear%")).count()
Datasets
Datasets (>= Spark 1.6)
• Idea: Strongly typed RDDs
• Functional transformations (map, flatMap, filter)
• Best of both RDDs and DataFrames
• Object-oriented programming
• Compile-time type safety
• Catalyst optimisation
• Tungsten off-heap memory optimisation
• Only for Scala and Java
Starting with Spark 2.0, the DataFrame and Dataset APIs were merged.
(from version 1.6) the Dataset API emerged to bring the best of both RDD and DataFrame worlds, allowing for object-oriented programming and compile-time type safety, but this was only for Scala and Java. The key idea for this was to provide strongly typed RDDs.
From Spark 2.0, Dataset and DataFrame APIs were fused, so that a DataFrame is simply a Datasets of Rows. In Python, we don’t have Datasets and obviously not compile-time type-safety, as this is an interpreted language.
Advanced Spark SQL features
Aggregations
❑Aggregations and operations on grouped DataFrames
❑Similar to the descriptive statistics methods for numeric RDDs, there are specific operations that allow you to aggregate the numeric columns of a DataFrame
❑E.g. to compute avg, max, min, sum or count
❑You can first group your data in a certain way prior to performing the aggregation (using groupBy)
❑E.g. you want to count the number of people with the same name
❑groupBy produces a different data type (GroupedData) on which we can apply different aggregations (e.g. count, max, agg, etc.).
agg(* exprs)
Perform aggregations over the entire DataFrame, without groups (equivalent to df.groupBy().agg()).
Available aggregations: avg, max, min, sum, count. exprs can be:
• A key-value dictionary in which the key is the name of a column and the value an aggregation function
• A list of Column aggregation expressions
With a dictionary we could do something like:
df.agg({"age": "max"}).show()
# +--------+
# |max(age)|
# +--------+
# | 25|
# +--------+
from pyspark.sql import functions as F
df.agg(F.min(df.age)).show()
# +--------+
# |min(age)|
# +--------+
# | 1|
# +--------+
df.agg({"age": "max", "age": "min"}).show()
# +--------+
# |min(age)|
# +--------+
# | 1|
# +--------+
df.agg(F.min(df.age),F.max(df.age),F.mean(df.age)).show()
# +--------+--------+--------+
# |min(age)|max(age)|avg(age)|
# +--------+--------+--------+
# | 1| 25| 11.0|
# +--------+--------+--------+
groupBy(* cols)
Group a DataFrame using the specified columns, so that aggregations can be run on each group – returns GroupedData.
This operation groups together the rows of a DataFrame based on the specified columns. It creates a GroupedData object that can later be used to compute per-group aggregations. As this is a transformation, we then apply the agg function (or one of the shortcuts below).
For example, we could group our DataFrame by name and, for each name, count the number of repetitions and compute the average age. Let me first build a more complete DataFrame:
df.show()
# +--------+---+
# | name|age|
# +--------+---+
# | Alice| 1|
# | Bob| 4|
# | Juan| 10|
# | Pepe| 25|
# |Panchito| 15|
# +--------+---+
tuples2 = [('Alice', 5), ('Bob', 10), ('Juan', 15), ('Juan', 20)]
df2 = sparkSession.createDataFrame(tuples2, ['name', 'age'])
df3 = df.union(df2)
df3.show()
# +--------+---+
# | name|age|
# +--------+---+
# | Alice| 1|
# | Bob| 4|
# | Juan| 10|
# | Pepe| 25|
# |Panchito| 15|
# | Alice| 5|
# | Bob| 10|
# | Juan| 15|
# | Juan| 20|
# +--------+---+
print(df3.groupBy('name'))
# <pyspark.sql.group.GroupedData object at 0x7fb62f9af7c0>
# This data type cannot be collected
df3.groupBy('name').count().show()
# +--------+-----+
# | name|count|
# +--------+-----+
# | Pepe| 1|
# | Bob| 2|
# | Alice| 2|
# |Panchito| 1|
# | Juan| 3|
# +--------+-----+
df3.groupBy('name').agg(F.mean(df3.age), F.count(df3.age)).show()
# +--------+--------+----------+
# | name|avg(age)|count(age)|
# +--------+--------+----------+
# | Pepe| 25.0| 1|
# | Bob| 7.0| 2|
# | Alice| 3.0| 2|
# |Panchito| 15.0| 1|
# | Juan| 15.0| 3|
# +--------+--------+----------+
avg(* cols) / mean(* cols)
Calculate the average value for each group in each numeric column.
count()
Count the number of rows for each group.
max(* cols)
Calculate the max for each group in each numeric column specified.
min(* cols)
Calculate the min for each group in each numeric column specified.
sum(* cols)
Calculate the sum for each group in each numeric column specified.
pivot(pivot_col, values)
Pivot over a column of the DataFrame to perform a specific aggregation.
This function allows you to pivot over a column to then perform an aggregation. This is similar to
a pivot table in Pandas (if you are familiar with it).
• pivot_col indicates the column in which we want to pivot.
• values specifies which values will appear in the columns of the resulting DataFrame. If not
provided, Spark computes it automatically (less efficient though!).
data = [(2019,'Alice', 67, 60, 'COMP4008'), (2019,'Alice', 67, 25, 'COMP4103'),
(2019,'Bob', 34, 70, 'COMP4008'),
(2019,'Bob', 34, 95, 'COMP4103'),
(2020,'Thorsten', 67, 60, 'COMP4008'),
(2020,'Thorsten', 67, 25, 'COMP4103'),
(2020,'Isaac', 34, 70, 'COMP4008'),
(2020,'Isaac', 34, 95, 'COMP4103') ]
df = sparkSession.createDataFrame(data, ['year','name','age','grade','module_name'])
df.show()
# +----+--------+---+-----+-----------+
# |year| name|age|grade|module_name|
# +----+--------+---+-----+-----------+
# |2019| Alice| 67| 60| COMP4008|
# |2019| Alice| 67| 25| COMP4103|
# |2019| Bob| 34| 70| COMP4008|
# |2019| Bob| 34| 95| COMP4103|
# |2020|Thorsten| 67| 60| COMP4008|
# |2020|Thorsten| 67| 25| COMP4103|
# |2020| Isaac| 34| 70| COMP4008|
# |2020| Isaac| 34| 95| COMP4103|
# +----+--------+---+-----+-----------+
df.groupBy("year").pivot("module_name", ["COMP4008", "COMP4103"]).avg("grade").show()
# +----+--------+--------+
# |year|COMP4008|COMP4103|
# +----+--------+--------+
# |2020| 65.0| 60.0|
# |2019| 65.0| 60.0|
# +----+--------+--------+
# You can also omit the list of column values, but it is less efficient
df.groupBy("year").pivot("module_name").avg("grade").show()
# +----+--------+--------+
# |year|COMP4008|COMP4103|
# +----+--------+--------+
# |2020| 65.0| 60.0|
# |2019| 65.0| 60.0|
# +----+--------+--------+
agg(* exprs)
Same as before, but applied to each group of the DataFrame.
User Defined Functions (UDFs)
❑SparkSQL allows for User Defined Functions (UDFs) to transform a DataFrame.
You can define your own function to work over the columns of a DataFrame. I included this to mention that it is possible, but I wouldn't advise it unless strictly necessary. Native functions are highly optimised, whilst your own functions won't run as fast in Spark.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
slen = udf(lambda s: len(s), IntegerType())
df.select(slen(df.name).alias('slen')).collect()
# [Row(slen=5), Row(slen=3)]
df.select(df.name,slen(df.name).alias('slen')).show()
df.select(F.length(df.name).alias('length')).show()
❑Note that these functions won’t be as efficient as the native functions provided by Apache Spark.
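If you do need a custom function, a usually faster alternative is a vectorised (pandas) UDF, which operates on whole batches of values instead of row by row. A sketch, assuming Spark >= 2.3 with pyarrow installed; slen_pandas is a hypothetical name:
from pyspark.sql.functions import pandas_udf
@pandas_udf("long")
def slen_pandas(s):
    # s is a pandas.Series containing a batch of values from the column
    return s.str.len()
df.select(slen_pandas(df.name).alias('slen')).show()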
Using SQL
❑We can apply SQL queries on a DataFrame
❑They follow the same optimisation (Catalyst, Tungsten)
❑Return a DataFrame
❑The only thing you need to do is to 'register' a DataFrame as a SQL temporary view.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")
sqlDF = sparkSession.sql("SELECT * FROM people")
sqlDF.show()
sparkSession.sql("SELECT * FROM people WHERE age > 34 AND module_name != 'COMP4103'").show(
Reading and writing files
❑Compared to RDDs, reading and writing CSV, JSON or text files is natively implemented in SparkSQL.
# Reading files
dfText = sparkSession.read.text("data.txt")
dfJSON = sparkSession.read.json("data.json") # infer the Schema automatically
dfCSV = sparkSession.read.csv("data.csv", inferSchema=True, header=True)
# Writing files
dfText.write.text("datos/salidaTXT1.txt")
dfJSON.write.json("datos/salidaJSON1.json")
dfCSV.write.csv("datos/salidaCSV1.csv", header=True)
More at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
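If you prefer not to rely on schema inference, you can pass an explicit schema when reading; a sketch with a hypothetical two-column file:
from pyspark.sql.types import StructType, StructField, StringType, FloatType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", FloatType(), True)])
dfCSV2 = sparkSession.read.csv("data.csv", schema=schema, header=True)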
Runtime environments
Like a Tomcat application, our application cannot run on its own:
it needs a Spark environment to run in.
Local mode
Spark can run in a local environment, mainly used for practice and testing,
i.e. for learning scenarios, not for production workloads.
Install Spark on your own machine, then write and run programs from the command line.
Programs developed in an IDE can also be submitted to the local environment for execution.
./bin/spark-submit /Users/yangminghan/PycharmProjects/yangdemo/spark/SparkDemo.py
Monitoring web UI
In local mode, open localhost:4040 to access it.
If there are multiple SparkContexts,
the ports are 4041, 4042, and so on.
Cluster mode
Mainly used for production workloads.
Spark currently supports three cluster deployment modes:
Spark can run independently in standalone mode,
or run on top of Mesos or YARN.
YARN and Mesos are cluster resource management tools.
Standalone
Standalone mode uses a master/slave architecture.
Open /conf/slaves
and replace localhost with the names of the cluster nodes,
for example:
linux1
linux2
linux3
Open /conf/spark-env.sh
and add at the bottom:
export JAVA_HOME=/xxx/xxx/java
SPARK_MASTER_HOST=linux1
SPARK_MASTER_PORT=7077
Distribute the spark-standalone directory:
xsync spark-standalone
Start the cluster:
sbin/start-all.sh
xcall jps
to check which processes are currently running.
Check the monitoring web UI at:
localhost:8080
./bin/spark-submit --master spark://linux1:7077 /Users/yangminghan/PycharmProjects/yangdemo/spark/SparkDemo.py
History server
https://www.bilibili.com/video/BV11A411L7CK?p=15
Provides a monitoring page with the details of previously executed jobs.
High availability
https://www.bilibili.com/video/BV11A411L7CK?p=16
YARN (Hadoop)
https://www.bilibili.com/video/BV11A411L7CK?p=17
Mesos and K8s
https://www.bilibili.com/video/BV11A411L7CK?p=18
Spark applications
We have our own application with its own business logic;
it connects to the Spark framework to run,
and closes the connection when it is done.
Some configuration:
sparkConf = SparkConf().setMaster("local").setAppName("My App")
master is a Spark, Mesos or YARN cluster URL, or the special "local" string to run in local mode.
sc = SparkContext(conf=sparkConf)
This establishes the connection.
Closing the connection:
sc.stop()
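Putting the pieces together, a minimal sketch of what such an application file (e.g. the SparkDemo.py submitted above) might look like; the RDD contents here are made up:
from pyspark import SparkConf, SparkContext

sparkConf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=sparkConf)            # open the connection
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())    # [2, 4, 6, 8]
sc.stop()                                    # close the connection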
PySpark
Python Interface for Spark
Environment setup
Remove the .template suffix from the files in /conf.
Run a demo from the root of the Spark folder:
./bin/run-example SparkPi 10
Type pyspark on the command line to start a local Spark shell;
you can program interactively there and leave with exit().
Open in a browser:
http://localhost:4040
To make Python 3.x the default version,
edit the following file in the Spark directory:
./conf/spark-env.sh
and add at the end of the file:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
In PyCharm you need to set the project environment variable
PYTHONPATH
/Users/xxx/spark-3.0.2-bin-hadoop2.7/python
PYTHONPATH=/Users/yangminghan/spark-3.0.2-bin-hadoop2.7/python
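An alternative to setting PYTHONPATH by hand is the third-party findspark package (an assumption: it has to be installed separately, e.g. with pip):
import findspark
findspark.init("/Users/yangminghan/spark-3.0.2-bin-hadoop2.7")  # the SPARK_HOME path from above
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()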
System environment variables
PATH="/Library/Frameworks/Python.framework/Versions/3.9/bin:${PATH}"
export PATH
export SPARK_HOME=/Users/yangminghan/spark-3.0.2-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
If macOS asks you to download developer tools
(xcode-select: note: no developer tools were found at '/Applications/Xcode.app'),
get them from https://developer.apple.com/download/more
SparkSQL tips
❑Always use the API from SparkSQL to perform operations
❑Do not use collect() when handling big datasets!
❑Use take(n)
❑Cache your DataFrames!
SparkMLlib
Spark ships with a rich machine learning library, SparkMLlib,
which contains a large number of algorithms that can be used directly.
The data analysis process
Thanks to its speed, Apache Spark has become one of the most popular big data processing tools. MLlib is Spark's scalable machine learning library. It integrates with Hadoop and interoperates with NumPy and R. It includes many machine learning algorithms for classification, regression, decision trees, recommendation, clustering, topic modelling, feature transformation, model evaluation, ML pipeline construction, ML persistence, survival analysis, frequent itemset and sequential pattern mining, distributed linear algebra and statistics.
The MLlib library provides multiple tools such as
❑ML algorithms for classification, regression, clustering and
collaborative filtering.
❑Feature transformation algorithms such as feature extraction, dimensionality reduction, etc
❑Pipelines: tools that allow us to construct, evaluate and reuse ML pipelines
❑Persistence tools to save and load models, pipelines, etc
❑“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.”
From raw data to knowledge/insights there are multiple stages, known as the Knowledge Discovery in Databases process that includes: data selection, data preprocessing, data mining to extract patterns that are interpreted and evaluated to obtain knowledge.
For example:
data preprocessing techniques (such as data reduction) are supposed to produce smaller datasets that are manageable for machine learning techniques, but they themselves are affected by the amount of initial data, so they also need to be able to scale up.
We will build several pipelines to learn how to go from raw data to a trained machine learning model.
❑Data mining techniques have demonstrated to be very useful tools to extract new valuable knowledge from data.
❑The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
❑The main challenges are to deal with:
❑The increasing scale of data
❑at the level of instances
❑at the level of features
❑The complexity of the problem
❑And many other points
We can use open-source machine learning libraries for data analysis,
rather than developing a machine learning algorithm from scratch.
Core concepts
❑ DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
❑ Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
This is an algorithm that modifies/transforms our DataFrame into a new one, generally by adding one or more columns. This could be any machine learning model capable of transforming an input DataFrame with features into a DataFrame with predictions. It implements the method transform().
❑ Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
This is the learning algorithm. It takes a DataFrame as input and fits a model, returning a Transformer object. That Transformer is then used as above to make predictions. It implements the method fit().
❑Parameter: All Transformers and Estimators now share a common API for specifying parameters.
Pipeline
https://spark.apache.org/docs/latest/ml-pipeline.html#pipeline
A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
A Pipeline chains multiple Transformers and Estimators together to define an ML workflow.
The Pipeline concept lets us run a sequence of algorithms in a specific order,
shaping the data before any learning algorithm is applied.
In machine learning it is common to run a series of algorithms to process and learn from the data.
For example, a simple text-document processing workflow might include several stages:
split each document's text into words;
convert each document's words into a numerical feature vector;
learn a prediction model using the feature vectors and labels.
MLlib calls such a workflow a pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) executed in order; a sketch of this text-processing pipeline follows below.
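A minimal sketch of that text-processing pipeline, loosely based on the official ML Pipelines example (the toy documents and labels here are made up):
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = sparkSession.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)], ["id", "text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")                        # stage 1: split text into words
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")   # stage 2: words -> feature vectors
lr = LogisticRegression(maxIter=10, regParam=0.001)                              # stage 3: the estimator
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)   # the fitted pipeline is a PipelineModel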
For example, before applying a linear regression as an estimator, we need to preprocess the data (e.g. binarising numerical data, remove some features). This pipeline can then be re-used as many times as you like! The result of ’fitting’ that pipeline is a PipelineModel, which can be used to transform the test set and obtain the predictions.
❑It is an estimator that consists of 1 or more phases.
❑A re-usable data workflow.
❑It can be composed of estimators or transformers.
https://spark.apache.org/docs/latest/ml-pipeline.html
PipelineModel:
❑Model obtained from a Pipeline
❑After fitting the data with fit()
❑It is used to transform the test set and obtain the predictions.
https://spark.apache.org/docs/latest/ml-pipeline.html
Example:
Standard Scaler Pipeline
An example of this could be a scaler, which needs to be applied in both the training and test phases. In the training phase, we create a Pipeline that uses the scaler to normalise the data before feeding it to a LinearRegression algorithm. When that Pipeline is fitted, we get a PipelineModel composed of a ScalerModel plus a LinearRegressionModel. The test data is fed to that PipelineModel to transform it and then the LinearRegression model makes the predictions.
❑Pipeline details:
❑Runtime checks: Feeding a wrongly formatted DataFrame will produce a runtime error before running the Pipeline.
❑Unique Pipeline stages: You can’t run the same preprocessing operation more than once in the same pipeline. If you need anything like that, you would have to create two different pipelines and apply them to your data.
There are a few more details about Pipelines that could be interesting, e.g. how to indicate parameters to different models, saving models, etc., but let's see these with an example.
Without a pipeline,
the training and test preprocessing steps would have to be repeated by hand;
a pipeline bundles them together conveniently.
pipeline = Pipeline(stages=[assembler, lr])
pipeline_model = pipeline.fit(train)
prediction = pipeline_model.transform(test)
Before training on the data, we must tell Spark which columns are the input features (assembled into a vector with VectorAssembler) and which column is the output label.
VectorAssembler works on columns:
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
Computing the error of the regression model:
evaluator = RegressionEvaluator(metricName="mae", labelCol="Total", predictionCol="prediction")
evaluator.evaluate(prediction)
Clustering algorithms?
k-means
Logistic regression
Linear regression
We will use the marks of 12 students on four exercises (Ex01-04) and the project (Project)
to predict their total mark (Total).
# First, initialise the SparkSession
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
sparkSession = SparkSession.builder.master("local[*]").appName("MLlib").getOrCreate()
sc = sparkSession.sparkContext
# Load the data from a CSV file
df = sparkSession.read.format('csv').option("header", 'true').load("/Users/linear-regression-result.csv")
# header='true' means the first row of the file contains the column names
df.cache()
# Cache the DataFrame
df.printSchema()
# |-- Question 1: string (nullable = true)
# |-- Question 2: string (nullable = true)
# |-- Question 3: string (nullable = true)
# |-- Question 4: string (nullable = true)
# |-- Ex01: string (nullable = true)
# |-- Ex02: string (nullable = true)
# |-- Ex03: string (nullable = true)
# |-- Ex04: string (nullable = true)
# |-- Project: string (nullable = true)
# |-- Total: string (nullable = true)
# All columns are read as strings, so cast them to float
df = df.select([F.col(c).cast("float").alias(c) for c in df.columns])
df.printSchema()
# root
# |-- Question 1: float (nullable = true)
# |-- Question 2: float (nullable = true)
# |-- Question 3: float (nullable = true)
# |-- Question 4: float (nullable = true)
# |-- Ex01: float (nullable = true)
# |-- Ex02: float (nullable = true)
# |-- Ex03: float (nullable = true)
# |-- Ex04: float (nullable = true)
# |-- Project: float (nullable = true)
# |-- Total: float (nullable = true)
# Look at two rows
df.show(2)
# +----------+----------+----------+----------+-----+----+----+----+-------+-----+
# |Question 1|Question 2|Question 3|Question 4| Ex01|Ex02|Ex03|Ex04|Project|Total|
# +----------+----------+----------+----------+-----+----+----+----+-------+-----+
# | 18.0| 16.0| 23.0| 22.0|100.0|85.0|80.0|70.0| 80.0| 81.0|
# | 20.0| 13.0| 11.0| 22.0|100.0|85.0|80.0|90.0| 93.0| 79.0|
# +----------+----------+----------+----------+-----+----+----+----+-------+-----+
# only showing top 2 rows
# Drop some columns we don't need
df = df.drop('Question 1').drop('Question 2').drop('Question 3').drop('Question 4')
df.show(2)
# +-----+----+----+----+-------+-----+
# | Ex01|Ex02|Ex03|Ex04|Project|Total|
# +-----+----+----+----+-------+-----+
# |100.0|85.0|80.0|70.0| 80.0| 81.0|
# |100.0|85.0|80.0|90.0| 93.0| 79.0|
# +-----+----+----+----+-------+-----+
# only showing top 2 rows
# Split the original data into two DataFrames: a training set and a test set
train, test = df.randomSplit([0.7, 0.3])
print(train.count())
# 7
print(test.count())
# 5
# [Learning a model]
# Choose linear regression and import it
from pyspark.ml.regression import LinearRegression
# Create a linear regression instance; some parameters can be passed in
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
print(type(lr))
# <class 'pyspark.ml.regression.LinearRegression'>
# Explain the parameters
print(lr.explainParams())
# aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
# elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.8)
# epsilon: The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber (default: 1.35)
# featuresCol: features column name. (default: features)
# fitIntercept: whether to fit an intercept term. (default: True)
# labelCol: label column name. (default: label)
# loss: The loss function to be optimized. Supported options: squaredError, huber. (default: squaredError)
# maxIter: max number of iterations (>= 0). (default: 100, current: 10)
# predictionCol: prediction column name. (default: prediction)
# regParam: regularization parameter (>= 0). (default: 0.0, current: 0.3)
# solver: The solver algorithm for optimization. Supported options: auto, normal, l-bfgs. (default: auto)
# standardization: whether to standardize the training features before fitting the model. (default: True)
# tol: the convergence tolerance for iterative algorithms (>= 0). (default: 1e-06)
# weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)
Two very important concepts need to be introduced here:
a Feature is an attribute of the data, i.e. the input data;
a Label is the value we want to predict, i.e. the output data.
Take listening to music as an example: when we listen to a song,
features such as its tempo, intensity and duration are fed into our brain,
which processes them and classifies the song with one of two labels: "like" or "dislike".
Similarly, if we predict the house price y of an area from three attributes (features), namely per-capita crime rate, average number of rooms per dwelling, and pupil-teacher ratio, then x1 = crime rate, x2 = average rooms, x3 = pupil-teacher ratio, and the collection of all features is called the feature vector, written as a column vector x = [x1; x2; x3].
# [Pipelines]
# We cannot feed the training DataFrame directly to the linear regression algorithm; the data must be formatted and assembled into a vector-based structure.
# We need to tell Spark which column is the Label and which ones are the Features.
# To build a Pipeline, the data must be in a form MLlib accepts: with VectorAssembler we put all the feature columns together into a single vector column.
# Import the package
from pyspark.ml.feature import VectorAssembler
# Build the list of feature columns (several columns will be merged into one vector column)
feature_cols = train.columns
feature_cols.remove('Total')
print(feature_cols)
# ['Ex01', 'Ex02', 'Ex03', 'Ex04', 'Project']
# Create the assembler: pass in the feature columns and give the new vector column a name
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
# To understand what the assembler does, let's apply it on the training data:
assembler.transform(train).show(10)
# +-----+----+-----+----+-------+-----+--------------------+
# | Ex01|Ex02| Ex03|Ex04|Project|Total| features|
# +-----+----+-----+----+-------+-----+--------------------+
# | 75.0|90.0| 75.0|70.0| 93.0| 86.0|[75.0,90.0,75.0,7...|
# | 80.0|80.0| 65.0|60.0| 55.0| 62.0|[80.0,80.0,65.0,6...|
# | 80.0|85.0|100.0|80.0| 92.0| 81.0|[80.0,85.0,100.0,...|
# | 90.0|70.0| 45.0|65.0| 90.0| 52.0|[90.0,70.0,45.0,6...|
# | 90.0|75.0|100.0|90.0| 74.0| 70.0|[90.0,75.0,100.0,...|
# |100.0|85.0| 80.0|90.0| 93.0| 79.0|[100.0,85.0,80.0,...|
# +-----+----+-----+----+-------+-----+--------------------+
# only showing top 10 rows
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, labelCol="Total")
# The linear regression estimator uses Total as the label column and features as the features column
lr.fit(assembler.transform(train).select('Total','features'))
# But if I now want to apply the model to the test set, I would have to first transform the test set into the right format (i.e. a vector column) before making the predictions.
# This is when Pipelines come very handy. We know the two stages are: assembly the vector of features and then apply linear regression. With that in mind, I create and fit the pipeline as follows:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[assembler, lr])
pipeline_model = pipeline.fit(train)
print(type(pipeline_model))
# <class 'pyspark.ml.pipeline.PipelineModel'>
# Use the pipeline model to obtain predictions on the test set
prediction = pipeline_model.transform(test)
for row in prediction.collect():
print(row.prediction, row.Total)
# 72.88048609448553 39.0
# 79.6539978844677 81.0
# 77.96547405268844 42.0
# 86.80957209509462 79.0
# Because the dataset is tiny, some of the errors can be quite large.
# [Model evaluation]
# We usually use an error (or accuracy) measure to decide whether the predictions are good. In regression, a very common error measure is the mean absolute error, which, as its name suggests, averages the absolute error over all test students.
# MLlib provides a module for evaluating regression models; import it
from pyspark.ml.evaluation import RegressionEvaluator
# Instantiate the evaluator
evaluator = RegressionEvaluator(metricName="mae", labelCol="Total", predictionCol="prediction")
# Above I indicated the metric name (mean absolute error), the column name of the target output, and the name of the predicted column. We now simply have to call the method `evaluate` on the data frame we obtained before after transforming the test set.
print(evaluator.evaluate(prediction))
print(evaluator.isLargerBetter())
# 16.078176575930197
# False
# The error is quite high (about 16) because there is very little data.
[Model tuning (parameter tuning)]
There are two ways to tune the parameters; once the best parameters are found,
the model is re-trained on the whole training set.
Cross-validation:
split the training set into a number of folds and perform cross-validation.
For example, with 3-fold cross-validation, CrossValidator generates 3 datasets: 2 folds are used for training and 1 for testing. You can also use 10-fold cross-validation. The accuracy, precision, recall, etc. measured on the validation folds are then used to pick the best parameters.
CrossValidator requires an Estimator, a set of parameters and an evaluator. In addition, we also need to indicate the number of folds into which the training data will be split.
# Cross-validation (CrossValidator):
# Import the packages
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# First, we create a 'grid' of parameters, that is, the parameters we want to investigate and the different values we want to investigate. To do this, we use ParamGridBuilder as follows:
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.5, 0.1]).addGrid(lr.maxIter, [1, 5, 10, 20]).build()
# The above grid will investigate 2*4 parameter combinations.
# We can now create a CrossValidator that uses the pipeline we created before as an estimator.
# I am going to use mae as metric again, and 3 folds.
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=RegressionEvaluator(metricName="mae", labelCol="Total", predictionCol="prediction"), numFolds=3)
# Alternatively, you could indicate the parameters when fitting the model. But to do so, you need to create a ParamMap. A ParamMap is a set of (parameter, value) pairs.
cvModel = crossval.fit(train)
# This may take quite a long time
# Now we use that cvModel,
# which is a cross-validated pipeline model to transform the test and make predictions.
prediction = cvModel.transform(test)
print(evaluator.evaluate(prediction))
# 10.144908042843117
# Better than the previous result; now let's find the best model
print(cvModel.bestModel)
# PipelineModel_0e474e33faae
bestPipeline = cvModel.bestModel
# I now want to see what happen in that best pipeline in the stage 1. Stage 0 of our pipeline was the assembler.
bestLRModel = bestPipeline.stages[1]
bestParams = bestLRModel.extractParamMap()
print(bestParams)
# {Param(parent='LinearRegression_198456858a82', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
# Param(parent='LinearRegression_198456858a82', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.8,
# Param(parent='LinearRegression_198456858a82', name='epsilon', doc='The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber'): 1.35,
# Param(parent='LinearRegression_198456858a82', name='featuresCol', doc='features column name.'): 'features',
# Param(parent='LinearRegression_198456858a82', name='fitIntercept', doc='whether to fit an intercept term.'): True,
# Param(parent='LinearRegression_198456858a82', name='labelCol', doc='label column name.'): 'Total',
# Param(parent='LinearRegression_198456858a82', name='loss', doc='The loss function to be optimized. Supported options: squaredError, huber.'): 'squaredError',
# Param(parent='LinearRegression_198456858a82', name='maxIter', doc='max number of iterations (>= 0).'): 5,
# Param(parent='LinearRegression_198456858a82', name='predictionCol', doc='prediction column name.'): 'prediction',
# Param(parent='LinearRegression_198456858a82', name='regParam', doc='regularization parameter (>= 0).'): 0.5,
# Param(parent='LinearRegression_198456858a82', name='solver', doc='The solver algorithm for optimization. Supported options: auto, normal, l-bfgs.'): 'auto',
# Param(parent='LinearRegression_198456858a82', name='standardization', doc='whether to standardize the training features before fitting the model.'): True,
# Param(parent='LinearRegression_198456858a82', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06}
# The coefficients of the best linear regression model
print(bestLRModel.coefficients)
# [0.11649930340941787,-1.1560140805968824,0.0,-0.8269535229814308,0.9723840735463517]
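# The fitted model also exposes an intercept; for one row, the prediction is essentially
# sum(coefficients[i] * features[i]) + intercept (a sketch; the value depends on your random split):
print(bestLRModel.intercept)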
Train/validation split: split the training set into two subsets, a training set and a validation set, and pick the best parameters according to the performance on the validation set.
The CrossValidator may become very expensive. Alternatively you can use TrainValidationSplit, which only evaluates each combination of parameters once. Like CrossValidator, it requires an Estimator, a set of parameters and an evaluator. In addition, we also need to indicate the ratio of the train/validation split.
# Train/validation split (TrainValidationSplit):
# Import the package
from pyspark.ml.tuning import TrainValidationSplit
# 80% of the data will be used for training, 20% for validation.
tvs = TrainValidationSplit(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=RegressionEvaluator(metricName="mae", labelCol="Total", predictionCol="prediction"),trainRatio=0.8)
tvsModel = tvs.fit(train)
prediction = tvsModel.transform(test)
print(evaluator.evaluate(prediction))
# Challenge #1: In this example, I have treated the prediction of the mark as a regression problem, but I could aim to predict the 'final classification' instead: e.g. Fail <50, Pass [50,60) , merit [60,70) or distinction >=70. You will have to transform your DataFrame adding a column with the right labels (e.g. Fail, Pass, Merit or distinction).
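# A sketch of how that transformation could be done with F.when (the column name 'classification'
# and the thresholds are just the ones suggested in the challenge):
df_cls = df.withColumn("classification",
    F.when(df.Total < 50, "Fail")
     .when(df.Total < 60, "Pass")
     .when(df.Total < 70, "Merit")
     .otherwise("Distinction"))
df_cls.select("Total", "classification").show()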
Finally, what if the original data is not a CSV file
but some data built by hand?
# Create a linear regression instance; some parameters can be passed in
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# We cannot feed a plain DataFrame to the linear regression algorithm; it must be formatted into a vector-based structure.
# We need to tell Spark which column is the Label (output) and which ones are the Features (input), and all the features must be part of a single array (vector).
# Import the package
from pyspark.ml.linalg import Vectors
# Build a small training set by hand
simple_training = sparkSession.createDataFrame([ (1.0, Vectors.dense([0.0, 1.1, 0.1])),
(0.0, Vectors.dense([2.0, 1.0, -1.0])),
(0.5, Vectors.dense([2.0, 1.3, 1.0])),
(1.5, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
# Feed the vectorised data to the linear regression algorithm to obtain a model
model1 = lr.fit(simple_training)
print(type(model1))
# <class 'pyspark.ml.regression.LinearRegressionModel'>
# Parameters can be passed when the estimator is created, or when the data is fitted; the latter requires a ParamMap, which is a set of (parameter, value) pairs.
param_map = ({lr.regParam: 0.1, lr.elasticNetParam: 0.55})
print(type(param_map))
# <class 'dict'>
print(lr.fit(simple_training, param_map))
# LinearRegressionModel: uid=LinearRegression_c7c2d035bcf4, numFeatures=3
# To make predictions, build a small test set by hand
simple_test = sparkSession.createDataFrame([
(1.0, Vectors.dense([-1.0, 1.5, 1.3])), (0.3, Vectors.dense([3.0, 2.0, -0.1])),
(1.5, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])
# The linear regression model can be used to make predictions
prediction = model1.transform(simple_test)
print(type(prediction))
# <class 'pyspark.sql.dataframe.DataFrame'>
# A new column called prediction has been added
print(prediction.collect())
# [Row(label=1.0, features=DenseVector([-1.0, 1.5, 1.3]), prediction=1.2195974883443117), Row(label=0.3, features=DenseVector([3.0, 2.0, -0.1]), prediction=0.2804025116556883), Row(label=1.5, features=DenseVector([0.0, 2.2, -1.5]), prediction=0.9847987441721558)]
# Show only the predictions
prediction.select('prediction').show()
# +------------------+
# | prediction|
# +------------------+
# |1.2195974883443117|
# |0.2804025116556883|
# |0.9847987441721558|
# +------------------+
Gradient-boosted tree regression
The aim of this notebook is to play with the MLlib of Apache Spark to create a Machine Learning pipeline that preprocesses a dataset, trains a model and makes predictions.
Appliance energy prediction
Data: This dataset contains energy consumption from appliances at 10 min resolution for about 4.5 months. The house temperature and humidity conditions were monitored with a wireless sensor network. Each wireless node transmitted the temperature and humidity conditions around every 3.3 minutes. Then, the wireless data was averaged over 10 minute periods. The energy data was logged every 10 minutes. Weather from the nearest airport weather station (Chievres Airport, Belgium) was also logged. This dataset is from Candanedo et al. and is hosted by the UCI Machine Learning Repository.
Goal: We want to learn to predict appliances’ energy consumption based on weather information. It would also be nice to know which input features are the most relevant to make predictions.
Approach: We will use Spark ML Pipelines, which help users piece together parts of a workflow such as feature processing and model training. We will also demonstrate model selection (a.k.a. hyperparameter tuning) using Cross Validation in order to fine-tune and improve our ML model.
from pyspark.sql import SparkSession
sparkSession = SparkSession \
.builder \
.master("local[*]") \
.appName("Lab 5") \
.getOrCreate()
sc = sparkSession.sparkContext
# %matplotlib inline
import matplotlib.pyplot as plt
from test_helper import Test
import pyspark.sql.functions as F
from pyspark.sql import Row
# 【1. Load and understand the data】
# We begin by loading the data, which is in Comma-Separated Value (CSV) format. For that, we use spark.read to read the file. We also cache the data so that we only read it from disk once.
df = sparkSession.read.format('csv').option("header", 'true').load("/Users/energydata_complete.csv")
# Cache the data
df.cache()
Test.assertEquals(df.count(), 19735, 'Incorrect number of rows')
Test.assertEquals(df.is_cached, True, 'df not cached')
# Column descriptions:
# date time year-month-day hour:minute:second
# Appliances, energy use in Wh
# lights, energy use of light fixtures in the house in Wh
# T1, Temperature in kitchen area, in Celsius
# RH_1, Humidity in kitchen area, in %
# T2, Temperature in living room area, in Celsius
# RH_2, Humidity in living room area, in %
# T3, Temperature in laundry room area
# RH_3, Humidity in laundry room area, in %
# T4, Temperature in office room, in Celsius
# RH_4, Humidity in office room, in %
# T5, Temperature in bathroom, in Celsius
# RH_5, Humidity in bathroom, in %
# T6, Temperature outside the building (north side), in Celsius
# RH_6, Humidity outside the building (north side), in %
# T7, Temperature in ironing room , in Celsius
# RH_7, Humidity in ironing room, in %
# T8, Temperature in teenager room 2, in Celsius
# RH_8, Humidity in teenager room 2, in %
# T9, Temperature in parents room, in Celsius
# RH_9, Humidity in parents room, in %
# To, Temperature outside (from Chievres weather station), in Celsius
# Pressure (from Chievres weather station), in mm Hg
# RH_out, Humidity outside (from Chievres weather station), in %
# Wind speed (from Chievres weather station), in m/s
# Visibility (from Chievres weather station), in km
# Tdewpoint (from Chievres weather station), °C
# rv1, Random variable 1, nondimensional
# rv2, Random variable 2, nondimensional
# The target variable is the energy use of the Appliances.
# For now, we will leave the two variables rv1 and rv2 in our dataset,
# to see if they are affecting much our methods,
# then we can try to remove them and see if we improve the results.
# Use show to visualise the data. Be careful not to show the entire data frame :-)
df.show(5)
# +-------------------+----------+------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
# | date|Appliances|lights| T1| RH_1| T2| RH_2| T3| RH_3| T4| RH_4| T5| RH_5| T6| RH_6| T7| RH_7| T8| RH_8| T9| RH_9| T_out| Press_mm_hg| RH_out| Windspeed| Visibility| Tdewpoint| rv1| rv2|
# +-------------------+----------+------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
# |2016-01-11 17:00:00| 60| 30|19.890000000000001|47.596666666666700|19.199999999999999|44.789999999999999|19.789999999999999|44.729999999999997|19.000000000000000|45.566666666666698|17.166666666666700|55.200000000000003| 7.02666666666666...|84.2566666666667032|17.199999999999999|41.626666666666701|18.199999999999999|48.899999999999999|17.033333333333299|45.530000000000001| 6.59999999999999...|733.50000000000000| 92.000000000000000| 7.00000000000000000|63.0000000000000000| 5.29999999999999...|13.27543315710499...|13.27543315710499...|
# |2016-01-11 17:10:00| 60| 30|19.890000000000001|46.693333333333300|19.199999999999999|44.722499999999997|19.789999999999999|44.789999999999999|19.000000000000000|45.992500000000000|17.166666666666700|55.200000000000003| 6.83333333333333...|84.0633333333333042|17.199999999999999|41.560000000000002|18.199999999999999|48.863333333333301|17.066666666666698|45.560000000000002| 6.48333333333332...|733.60000000000002| 92.000000000000000| 6.66666666666666963|59.1666666666666998| 5.20000000000000...|18.60619498183950...|18.60619498183950...|
# |2016-01-11 17:20:00| 50| 30|19.890000000000001|46.299999999999997|19.199999999999999|44.626666666666701|19.789999999999999|44.933333333333302|18.926666666666701|45.890000000000001|17.166666666666700|55.090000000000003| 6.55999999999999...|83.1566666666666947|17.199999999999999|41.433333333333302|18.199999999999999|48.729999999999997|17.000000000000000|45.500000000000000| 6.36666666666666...|733.70000000000005| 92.000000000000000| 6.33333333333333037|55.3333333333333002| 5.09999999999999...|28.64266816759482...|28.64266816759482...|
# |2016-01-11 17:30:00| 50| 40|19.890000000000001|46.066666666666698|19.199999999999999|44.590000000000003|19.789999999999999|45.000000000000000|18.890000000000001|45.723333333333301|17.166666666666700|55.090000000000003| 6.43333333333333...|83.4233333333333036|17.133333333333301|41.289999999999999|18.100000000000001|48.590000000000003|17.000000000000000|45.399999999999999| 6.25000000000000...|733.79999999999995| 92.000000000000000| 6.00000000000000000|51.5000000000000000| 5.00000000000000...|45.41038949973881...|45.41038949973881...|
# |2016-01-11 17:40:00| 60| 40|19.890000000000001|46.333333333333300|19.199999999999999|44.530000000000001|19.789999999999999|45.000000000000000|18.890000000000001|45.530000000000001|17.199999999999999|55.090000000000003| 6.36666666666666...|84.8933333333333024|17.199999999999999|41.229999999999997|18.100000000000001|48.590000000000003|17.000000000000000|45.399999999999999| 6.13333333333333...|733.89999999999998| 92.000000000000000| 5.66666666666666963|47.6666666666666998| 4.90000000000000...|10.08409655187278...|10.08409655187278...|
# +-------------------+----------+------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+-------------------+------------------+------------------+------------------+------------------+------------------+------------------+--------------------+------------------+-------------------+--------------------+-------------------+--------------------+--------------------+--------------------+
# only showing top 5 rows
# 【2. Data preprocessing】
# This dataset is nicely prepared for Machine Learning and required very little preprocessing.
# However, rather than keeping the date as a timestamp, I would like to have some additional columns,
# including 'day of the year', 'hour', and 'month of the year'.
# Please use the naming dayofyear, hour and month, respectively.
# Hint: Of course the SparkSQL library has a function to transform strings with datetime!
# Would you like to have any other information from the datetime?
# Feel free to add other features.
df = df.withColumn("dayofyear", F.dayofyear('date'))
df = df.withColumn("hour", F.hour('date'))
df = df.withColumn("month", F.month('date'))
# When your dataframe df has the additional columns, please remove the date column:
df = df.drop('date')
Test.assertEquals("date" in df.columns, False, "Column date has not been remove!")
Test.assertEquals("hour" in df.columns, True, "The hour hasn't been added")
Test.assertEquals("dayofyear" in df.columns, True, "The dayofyear hasn't been added")
Test.assertEquals("month" in df.columns, True, "The month hasn't been added")
# Check the schema of your dataframe:
df.printSchema()
# root
# |-- Appliances: string (nullable = true)
# |-- lights: string (nullable = true)
# |-- T1: string (nullable = true)
# |-- RH_1: string (nullable = true)
# |-- T2: string (nullable = true)
# |-- RH_2: string (nullable = true)
# |-- T3: string (nullable = true)
# |-- RH_3: string (nullable = true)
# |-- T4: string (nullable = true)
# |-- RH_4: string (nullable = true)
# |-- T5: string (nullable = true)
# |-- RH_5: string (nullable = true)
# |-- T6: string (nullable = true)
# |-- RH_6: string (nullable = true)
# |-- T7: string (nullable = true)
# |-- RH_7: string (nullable = true)
# |-- T8: string (nullable = true)
# |-- RH_8: string (nullable = true)
# |-- T9: string (nullable = true)
# |-- RH_9: string (nullable = true)
# |-- T_out: string (nullable = true)
# |-- Press_mm_hg: string (nullable = true)
# |-- RH_out: string (nullable = true)
# |-- Windspeed: string (nullable = true)
# |-- Visibility: string (nullable = true)
# |-- Tdewpoint: string (nullable = true)
# |-- rv1: string (nullable = true)
# |-- rv2: string (nullable = true)
# |-- dayofyear: integer (nullable = true)
# |-- hour: integer (nullable = true)
# |-- month: integer (nullable = true)
# Dammit, all the input features have been inferred as strings rather than numeric values.
# This is because we read the data from a CSV file with the data in between quotes.
# Your task now is to transform that into numerical values.
# All of the features are actually numeric, so you could cast all of them.
# You are recommended to use functions like cast and col to do this.
# You could try to leave out the datetime columns we created, but it's fine if you transform them to float.
numerical_cols = df.columns
df = df.select([F.col(c).cast("float").alias(c) for c in numerical_cols])
df.printSchema()
# root
# |-- Appliances: float (nullable = true)
# |-- lights: float (nullable = true)
# |-- T1: float (nullable = true)
# |-- RH_1: float (nullable = true)
# |-- T2: float (nullable = true)
# |-- RH_2: float (nullable = true)
# |-- T3: float (nullable = true)
# |-- RH_3: float (nullable = true)
# |-- T4: float (nullable = true)
# |-- RH_4: float (nullable = true)
# |-- T5: float (nullable = true)
# |-- RH_5: float (nullable = true)
# |-- T6: float (nullable = true)
# |-- RH_6: float (nullable = true)
# |-- T7: float (nullable = true)
# |-- RH_7: float (nullable = true)
# |-- T8: float (nullable = true)
# |-- RH_8: float (nullable = true)
# |-- T9: float (nullable = true)
# |-- RH_9: float (nullable = true)
# |-- T_out: float (nullable = true)
# |-- Press_mm_hg: float (nullable = true)
# |-- RH_out: float (nullable = true)
# |-- Windspeed: float (nullable = true)
# |-- Visibility: float (nullable = true)
# |-- Tdewpoint: float (nullable = true)
# |-- rv1: float (nullable = true)
# |-- rv2: float (nullable = true)
# |-- dayofyear: float (nullable = true)
# |-- hour: float (nullable = true)
# |-- month: float (nullable = true)
# Split data into training and test sets:
# Our final data preparation step is to split our dataset into training and test sets.
# We will train and tune our model on the training set, and then see how well we do in the test.
# Your task is to split the dataframe df into 70% for training and 30% for test. Please use the same random seed.
seed = 1203123
train, test = df.randomSplit([0.7, 0.3], seed)
# Even though we have fixed the random seed, you will not get the exact same split.
# Note that this is the simplest way of validating your results.
# You may want to carry out a k-fold cross validation and split the dataset into k folds,
# and build and test k models.
# We will do later cross validation but for parameter tuning! not to validate our approach!
# Data visualisation:
# Before applying any machine learning algorithm,
# it is a good practice to try to visualise your data.
# For example, we could see how much energy is spent in appliances depending on the month.
hist_elect = train.groupBy('month').sum('Appliances').sort("month").collect()
(x_values, y_values) = zip(*hist_elect)
plt.bar(x_values, y_values)
plt.title('Sum of appliance energy per month')
plt.xlabel('Month')
plt.ylabel('Sum of Appliance Energy')
plt.show()
# As we don't seem to have all the data from January,
# the consumption looks lower in those particular months,
# so the month is probably not a good feature, don't you think?
# I am going to remove it from the dataframe and the training and test!
df = df.drop('month')
train = train.drop('month')
test = test.drop('month')
# You could do other plots to understand better the data and practice with Spark :-).
# This is a good opportunity to practice with the DataFrame API.
# 【3. Create a Pipeline with Spark ML】
# As you know, we can't feed the data frame directly to a machine learning algorithm,
# as we need to put all the input features as an Array,
# and indicate which one is the output feature (in our case, the 'Appliances' column!).
# We will put together a simple Pipeline with the following stages:
# 1. VectorAssembler: To combine all the input columns into a single vector column
# (i.e. all the columns but the 'Appliances' one).
# 2. Learning algorithm: I feel like using Gradient-Boosted Trees GBTs for this example,
# but feel free to use anything else.
# 3. CrossValidator: I will use cross validation to tune the parameters of the GBT model.
# Yes, this can be added as part of a pipeline!
# This is going to change the way we access the best model later!
# Step 1: Create the VectorAssembler:
from pyspark.ml.feature import VectorAssembler
featuresCols = df.columns
featuresCols.remove('Appliances')
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="features")
# Step 2: Create an instance of GBTRegressor in which you don't indicate any parameters but the class label (i.e. set labelCol to 'Appliances').
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(labelCol="Appliances")
# Step 3: Create a CrossValidator for gbt.
# You can explore the parameters you like for GBT.
# Full documentation here.
# I would suggest to create a 'grid' for at least the depth of the tree and the number of iterations.
# Don't investigate more than 4-8 combinations! :-)
# Import the right libraries:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
paramGrid = ParamGridBuilder().addGrid(gbt.maxDepth, [5, 8]).addGrid(gbt.maxIter, [10, 20]).build()
# Create a RegressionEvaluator that uses the Root Mean Squared Error (RMSE) as our performance metric.
# Import the right package first:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol=gbt.getLabelCol(), predictionCol=gbt.getPredictionCol())
# We can now create a CrossValidator that uses the gbt as estimator,
# as well as the evaluator and grid we defined above.
cv = CrossValidator(estimator=gbt, evaluator=evaluator, estimatorParamMaps=paramGrid)
# Step 4. Create a Pipeline that contains the two stages: vectorAssembler and cv
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, cv])
# Step 5. Finally, fit the model. This might take quite a bit of time!
pipelineModel = pipeline.fit(train)
# It could be a good idea to save this to disk, in case it takes too long, so we can read it later.
# pipelineModel.save("Model-lab5")
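# If you do save it, it can be loaded back later (a sketch, assuming the save above was executed):
# from pyspark.ml import PipelineModel
# pipelineModel = PipelineModel.load("Model-lab5")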
# 【4. Evaluate the results】
# To obtain the predictions in the test set, apply the method transform() of the trained pipeline on the test DataFrame!
# This will not apply the cross-validation of course!
predictions = pipelineModel.transform(test)
# It is easier to view the results when we limit the columns displayed to:
# * Appliances: the consumption of the Appliances in Wh
# * prediction: our predicted consumption
predictions.select("Appliances", "prediction").show(5)
# +----------+------------------+
# |Appliances| prediction|
# +----------+------------------+
# | 10.0| 36.21900839930263|
# | 10.0| 40.01802383573517|
# | 10.0| 37.57194764506838|
# | 20.0|103.72147179350603|
# | 20.0| 37.02464784367771|
# +----------+------------------+
# only showing top 5 rows
# Are these results any good? Let's compute the RMSE using the evaluator!
rmse = evaluator.evaluate(predictions)
print(rmse)
# 78.69480542310478
# Seems a bit high? Well,
# this number is close to what is reported in the original paper with RF (RMSE around 69).
# But maybe you can investigate a bit more if you can improve that.
# Can you find out the importance of the features from the GBTs?
# You first need to find out the best model!!
# In the way we trained the pipeline,
# you can find the trained cvModel as one of the stages of the pipelineModel:
cvModel = pipelineModel.stages[1]
print(cvModel.bestModel)
# GBTRegressionModel: uid=GBTRegressor_fa35adca6aa1, numTrees=20, numFeatures=29
# Check the feature importances of that cvModel:
print(cvModel.bestModel.featureImportances)
# (29,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,27,28],[0.028672878137592415,0.035589721486073395,0.054071129135496394,0.0339124471076883,0.06307568985335354,0.053452321055447005,0.048505788679583446,0.028403823814875325,0.023390015834817826,0.036978504159729836,0.04517951108601859,0.0320188028515642,0.03741482934578118,0.025898607161440847,0.02201627957577745,0.036289940860545114,0.028223738628164696,0.013627210660775515,0.03649095974105368,0.029868018555982756,0.0401727145293604,0.03251964192852627,0.04070991817485187,0.017912011990612275,0.02763196520253157,0.03940120853803313,0.02129154179032882,0.06728078011399392])
# Uhm, looks like our model gave feature #28 quite a bit of importance,
# which is the hour of the day if I recall correctly.
# Features #25 and #26 were random,
# and the model noticed that 26 was completely useless,
# but gave some importance to 'rv1'.
# GBTs perform somehow an implicit feature selection,
# so those low importance features won't affect much their performance,
# but I wonder if we could just remove low importance features?
# Task: Create a list of those features with less than for example 0.05.
pairs = cvModel.bestModel.featureImportances
lst = []
to_remove = []
for i in range(len(featuresCols)):
lst.append((featuresCols[i],pairs[i]))
if pairs[i]<0.05:
to_remove.append(featuresCols[i])
print(lst)
# [('lights', 0.028672878137592415), ('T1', 0.035589721486073395), ('RH_1', 0.054071129135496394), ('T2', 0.0339124471076883), ('RH_2', 0.06307568985335354), ('T3', 0.053452321055447005), ('RH_3', 0.048505788679583446), ('T4', 0.028403823814875325), ('RH_4', 0.023390015834817826), ('T5', 0.036978504159729836), ('RH_5', 0.04517951108601859), ('T6', 0.0320188028515642), ('RH_6', 0.03741482934578118), ('T7', 0.025898607161440847), ('RH_7', 0.02201627957577745), ('T8', 0.036289940860545114), ('RH_8', 0.028223738628164696), ('T9', 0.013627210660775515), ('RH_9', 0.03649095974105368), ('T_out', 0.029868018555982756), ('Press_mm_hg', 0.0401727145293604), ('RH_out', 0.03251964192852627), ('Windspeed', 0.04070991817485187), ('Visibility', 0.017912011990612275), ('Tdewpoint', 0.02763196520253157), ('rv1', 0.03940120853803313), ('rv2', 0.0), ('dayofyear', 0.02129154179032882), ('hour', 0.06728078011399392)]
print(to_remove)
# ['lights', 'T1', 'T2', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'rv1', 'rv2', 'dayofyear']
# Removing low importance features:
# Check the current schema of the training data:
train.printSchema()
# root
# |-- Appliances: float (nullable = true)
# |-- lights: float (nullable = true)
# |-- T1: float (nullable = true)
# |-- RH_1: float (nullable = true)
# |-- T2: float (nullable = true)
# |-- RH_2: float (nullable = true)
# |-- T3: float (nullable = true)
# |-- RH_3: float (nullable = true)
# |-- T4: float (nullable = true)
# |-- RH_4: float (nullable = true)
# |-- T5: float (nullable = true)
# |-- RH_5: float (nullable = true)
# |-- T6: float (nullable = true)
# |-- RH_6: float (nullable = true)
# |-- T7: float (nullable = true)
# |-- RH_7: float (nullable = true)
# |-- T8: float (nullable = true)
# |-- RH_8: float (nullable = true)
# |-- T9: float (nullable = true)
# |-- RH_9: float (nullable = true)
# |-- T_out: float (nullable = true)
# |-- Press_mm_hg: float (nullable = true)
# |-- RH_out: float (nullable = true)
# |-- Windspeed: float (nullable = true)
# |-- Visibility: float (nullable = true)
# |-- Tdewpoint: float (nullable = true)
# |-- rv1: float (nullable = true)
# |-- rv2: float (nullable = true)
# |-- dayofyear: float (nullable = true)
# |-- hour: float (nullable = true)
# You don't need to remove the columns from the train and test partitions, as you won't be able to re-train the pipeline without modifying the VectorAssembler anyway.
# You simply need to indicate the columns you want to use when creating again the vectorAssembler:
featuresCols2 = train.columns
featuresCols2.remove('Appliances')
for feat in to_remove:
featuresCols2.remove(feat)
#print(feat)
#print(featuresCols)
vectorAssembler2 = VectorAssembler(inputCols=featuresCols2, outputCol="features")
# and create a new pipeline with that new vectorAssemble2:
pipeline2 = Pipeline(stages=[vectorAssembler2, cv])
# And fit the new pipeline:
pipelineModel2 = pipeline2.fit(train)
# Finally, make predictions and compute the error:
predictions2 = pipelineModel2.transform(test)
rmse = evaluator.evaluate(predictions2)
print(rmse)
# 85.10075391405992
# Check the features of the best model and feature importances:
cvModel2 = pipelineModel2.stages[1]
print(cvModel2.bestModel)
# GBTRegressionModel: uid=GBTRegressor_fa35adca6aa1, numTrees=20, numFeatures=4
print(cvModel2.bestModel.featureImportances)
# (4,[0,1,2,3],[0.22949786681182618,0.22741050145272015,0.2676727302003895,0.2754189015350641])
# In my case, I got roughly similar performance as before,
# but this time my model only considered 4 input features!
# This might not have made our model more precise (because GBTs already largely ignored those features),
# but it makes it more interpretable!
# 【5. Improving further your model】
# There might be many ways to improve the results we obtained here.
# If you find any alternatives, please post your approach on the Moodle discussion forum.
# A few ideas for you to think about:
# 1. Parameter tuning: I have used a relatively small set of parameters,
# and I haven't investigated what happened in training and test,
# is there overfitting of the training? would we be able to use a large number of trees?
# 2. The features of this dataset are numerical,
# are there other classifiers that may be more appropriate than GBTs?
# 3. We haven't really done any careful pre-processing of the data.
# Are there outliers or noise that might be having an impact in the results?
# 4. Do we need any normalisation?
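# On point 4, a normalisation step can simply be added as an extra pipeline stage; a sketch using
# StandardScaler (scaler, gbt2 and pipeline3 are hypothetical names; note that tree-based models such
# as GBTs are largely insensitive to feature scaling, so the benefit is not guaranteed):
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
gbt2 = GBTRegressor(labelCol="Appliances", featuresCol="scaledFeatures")
pipeline3 = Pipeline(stages=[vectorAssembler, scaler, gbt2])
# pipeline3.fit(train) would then train the scaled variant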
Global vs Local
The difference between Global and Local is not simply the difference between DataFrames and RDDs;
DataFrames and RDDs are just two ways of representing data, and both can be used for Global and Local approaches.
In a Global approach, x nodes share a single model (which has to be broadcast):
all x nodes train that one model, and all the data contributes to it. In a Local approach, the x nodes hold x models: each node trains its own model on its portion of the data only, and the results of the models are combined at the end.
The distinction comes down to whether each model is trained only on the data local to a cluster node,
or whether the whole dataset is used to build a single model.
Designing ML methods in Big Data: Global vs Local
What is the easiest way to apply Machine Learning on Big Data?
Divide-and-conquer!
❑A model is created for each partition of the data (only using the data of that partition)
❑All the models are combined when predicting the class of a new example→Ensemble
We usually call this Local approaches
Decision Tree with a single MapReduce process:
Questions:
❑Is this an approximation or an exact implementation of Decision Tree
for Big Data?
❑This is certainly an approximation – we call this Local models
❑Should we always develop this kind of strategy?
❑Task: Analyse Runtime vs Accuracy depending on the number of
partitions
❑With Spark we are not limited to a single MapReduce stage, could we not do anything better than the previous divide-and-conquer approach?
Decision Trees in Spark: Key design principles
❑Iterate multiple times through the data to build a tree (exploiting in-memory computation)
❑Stats are computed in parallel
❑The model is brought back to the driver at each step and broadcast to all nodes.
Decision Trees for Big Data
Step 1: Calculate root node.
❑Calculate the best cut point for each feature, based on impurity metrics like Gini or Entropy (highly parallelisable!)
❑Aggregate stats from each worker to determine best feature to split and create root node.
Step 2: Broadcast model to all workers
❑Whilst the workers contributed to compute the stats, they don’t know yet which one is the root node, so the driver broadcasts that info to all workers.
Step 3: Recursively compute all nodes on a level-by-level basis
❑All the nodes in a level are learnt with a single pass through the whole dataset.
❑Decision Trees in Spark – reducing the computational cost
❑Numeric attributes are discretised into bins in order to reduce the computational cost
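As a concrete, hedged illustration of these limits, the sketch below sets the maxBins and maxDepth parameters of pyspark.ml's DecisionTreeRegressor; the column names "features" and "Total" are placeholders consistent with the marks example used earlier in these notes.
# Minimal sketch: limiting the number of bins and the depth of Spark's tree learner.
from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(
    featuresCol="features",  # assembled feature vector
    labelCol="Total",        # target column
    maxDepth=5,              # the tree is grown level by level up to this depth
    maxBins=32)              # numeric features are discretised into at most 32 bins
# dtModel = dt.fit(train)    # train: a DataFrame with a "features" vector column and a "Total" label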
Advantages and disadvantages - reflection
❑Decision Trees in Spark
Advantages:
❑Same classification/regression performance independently of number of partitions – Recommended to set the partitions to the total of cores
❑Can be used with DataFrames/Datasets, which are very efficient
❑These are the base for the RandomForest implementation of Spark!
Caveats: some limitations w.r.t. the original Decision tree
❑Lack of Pruning
❑Maximum depth
❑Better than local models?
Local models
Advantages:
❑Usually faster (training phase)
❑Gets faster as the number of partitions is increased
❑Any existing model can be applied
❑Only the aggregation phase has to be designed
Disadvantages:
❑Slow in test phase, too many models have to be executed
❑Loss of accuracy as the number of partitions increases
❑With few partitions, accuracy can improve due to the ensemble effect
❑With too many partitions, the accuracy tends to drop, since there are not enough examples in each partition
❑They do not take advantage of the data as a whole
Alternatively, you could attempt …
Globally
❑A single model is created using all the available data
❑Try to obtain the same model as the one that would be obtained if the method could be executed on a single node
Global model
Advantages
❑Greater accuracy is expected (not proved)
❑All the examples are used to learn a single model
❑Anyway, a global ensemble can also be built
❑The model is independent of the number of partitions
❑Faster in test phase
Disadvantages
❑More complex design and implementation
❑Distributed nature of Big Data processing has to be taken into account (computation/communication)
General Rules to design Big Data Solutions:
❑Rule #1: Computation and storage should be linear
❑Rule #2: Perform computations in parallel and in main memory
(Cache!)
❑Rule #3: Minimise network communication
❑Remember the memory and communication hierarchy!
❑Imagine an imbalanced classification problem in which there are about 1000 positive instances, and 2 million negative instances.
❑Could you apply a Local approach to tackle this problem?
❑If yes, explain how you would do it.
❑If not, explain why it wouldn’t be suitable.
In the examples above, we have seen that when designing machine learning algorithms for big datasets, there are two main approaches that can be followed.
Local models: Local approaches are approximations of the original algorithms in which the data is split into a number of smaller subsets and the original algorithm is applied locally in each chunk. Then, the results from each individual partition are (smartly) combined. Obviously, these models are partial and approximate because they are learned only considering a part of the data and without interacting with the rest.
Global models: Global models, sometimes known as exact approaches, aim to replicate the behaviour of the sequential version by looking at the data as a whole (and not as a combination of smaller parts). This means that we need to know well how the original machine learning method works, and where the bottlenecks to distributing its processing would be. All the machine learning algorithms developed in Spark's MLlib library are global models.
Let’s compare these two approaches.
Advantages of Local models over Global models:
Local models usually involve a single pass through the data to build a model (training phase), whilst global models may require multiple iterations.
Local models are conceptually very simple to understand, and we could 'virtually' adapt any machine learning model in that way, while global models require a careful design. The only aspect that needs to be designed carefully in local models is the aggregation phase, which might not be suitable for every machine learning technique. For example, although possible, I don't see a nice way of doing k-means with a local model.
If we find a good number of partitions for our dataset, we might benefit from the 'ensemble effect', but determining the best number of partitions for a dataset requires experimental validation!
The main advantages of Global models over local models:
In global models, all the data is used as a whole, so one would expect them to be more robust and to yield greater accuracy. However, this is not always the case in practice.
The classification/regression performance of a global model is independent of the number of partitions. If we have more cores, we can split our data further and get exactly the same result, only faster. In local models, however, the final performance (error/accuracy) and the runtime depend on the number of partitions: increasing the number of partitions might reduce the runtime but decrease the accuracy!
The test phase of a global model would be really quick, as we only have one model. However, in a local design, if we have too many models, this might become very slow in the test phase.
Reading a CSV file with pandas and iterating over it:
https://blog.csdn.net/Jarry_cm/article/details/99683788
import pandas as pd
results = pd.read_csv('/Users/artificial-marks.csv')
# read the data as a Pandas DataFrame
X = results.loc[:,'Ex01':'Project']
y = results['Total']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = \
train_test_split(X,y,\
test_size=0.3,\
random_state=1)
from sklearn.tree import DecisionTreeRegressor
regr = DecisionTreeRegressor(random_state=0, max_depth=2)
# max_depth is the maximum depth of the tree
print(regr.fit(X_train, y_train))
# DecisionTreeRegressor(max_depth=2, random_state=0)
# I built that tree using the scikit-learn library,
# which is a library for standard machine learning algorithms. Hence, no parallelisation.
# What would happen if I fed a very big dataset to that library? Well, it would either take ages to create the tree, or you might actually run out of memory while building it. So, why don't we try to do this with what we have learnt so far?
# What is the easiest way to apply Machine Learning to Big Datasets?
# DIVIDE-AND-CONQUER
# When we load the data with Spark, the data is distributed among the workers. Using the idea of MapReduce (divide-and-conquer), why don’t we simply create a separate Decision Tree for each partition of the data (only using the data of that partition)?
# The problem then is how to combine all these models to make predictions.
# The easiest way to do this with a Decision Tree is to combine them together as independent classifiers which give us a ’vote’ about the predicted class for an unseen example. We could apply a simple majority voting scheme to make the final classification. In other words, we create an Ensemble of Trees :-). By the way, is this way of ’merging’ the different models going to be general for all Machine Learning models? Definitely not! There are methods that don’t create any model, for example, lazy learning methods like the k-nearest neighbour, or performing clustering. But for most Eager Learning approaches that build a model, this approach would actually do the trick to use all the data in some way.
# Let’s implement it with Spark!
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd  # needed below to build a Pandas DataFrame for each partition
sparkSession = SparkSession.builder.master("local[*]").appName("MLbig").getOrCreate()
sc = sparkSession.sparkContext
df = sparkSession.read.format('csv').option("header", 'true').load("/Users/yangminghan/Downloads/Nottingham/正课/春季/大数据/文件/artificial-marks.csv")
# Remove the columns I am not interested in:
df = df.drop('Question 1').drop('Question 2').drop('Question 3').drop('Question 4')
train, test = df.randomSplit([0.7, 0.3])
# I am not using the MLlib; I want to do this myself directly with the Spark API. Actually, to apply that divide-and-conquer approach, the DataFrame API is not really flexible enough. However, if you remember from the chapter about RDDs, there was a transformation called mapPartitions that does exactly what we need: apply a function to an entire partition of the data. So, I read the data as a DataFrame, but I am going to transform this into an RDD, and I want to split it into 4 partitions. Note that the DataFrame API determines the number of partitions automatically and doesn't let you set it as we did with RDDs, so I will split it myself using repartition:
print(train.rdd.getNumPartitions())
# 1
train_rdd = train.rdd.repartition(4)
# For test, I will also split it, so we can parallelise also that operation:
test_rdd = test.rdd.repartition(2)
# I will need to know the name of the columns later, so I am going to store it in a global variable.
column_names = df.columns
# Building the model in parallel
# The definition of mapPartitions(func) is as follows: apply the function func to each partition of the RDD. func receives an iterator and returns another iterator that can be of a different type.
#
# So, we will get an iterator to go through all the instances/rows of a partition. What we need to do is use those instances to train a Decision Tree model and return that model. To do so, rather than implementing my own Decision Tree in each partition, I will simply use the scikit-learn library. However, this means that I need to create a Pandas DataFrame for each partition.
def toPandas_partition(instances):
panda_df = pd.DataFrame(columns=column_names) # using the global variable
for instance in instances: # each instance is of Row type
panda_df = panda_df.append(instance.asDict(), ignore_index=True)
return [panda_df]
rdd_pandas = train_rdd.mapPartitions(toPandas_partition)
def build_model(partition):
from sklearn.tree import DecisionTreeRegressor
regr = DecisionTreeRegressor(random_state=0, max_depth=2)
X_train = partition.loc[:, 'Ex01':'Project']
y_train = partition['Total']
model = regr.fit(X_train, y_train)
return model
# Now let's train the model, this is going to be my equivalent to fit.
# As the models will be relatively light-weight, we can bring them all back to the driver:
model = rdd_pandas.map(build_model).collect()
type(model)
# Testing the model in parallel
# model is a variable that exists in the 'driver' program. To perform the test, we need that model to be available in the workers. If the size of the variable model were too large, I would suggest broadcasting the model! In our case, however, it is a very small list of models, so I will pass it as a global variable to the workers.
#
# We have the test data as an RDD in test_rdd. Again, I want to use the scikit-learn implementation in every single worker to predict the output of each element of test_rdd. We can again use the previous function toPandas_partition to transform each partition (composed of Spark Rows) into Pandas.
test_rdd_pandas = test_rdd.mapPartitions(toPandas_partition)
# Now, we need to carry out two operations.
# First we need to predict the outputs with all the models existing in model.
# I am going to define a function test_classifier that will receive a partition,
# which is already transformed into a Pandas data frame, and I am going to predict with each one of the models the outputs, generating a list of lists:
def test_classifier(partition):
predictions = []
X_test = partition.loc[:,'Ex01':'Project']
for m in model:
predictions.append(m.predict(X_test).tolist())
return predictions
# I am going to use that function to transform the test_rdd_pandas into predictions, but later I will have to aggregate those predictions, because we have 4 predictions (one per partition building phase) for each test sample. Thus, I create a function aggregate_predictions that will take predictions and aggregate them. As this is a regression problem, this final prediction could simply be the mean value of the 4 predictions.
def aggregate_predictions(preds):
agg = [0] * len(preds[0]) # list of 0s where I will aggregate the resutls
for lst in preds:
for i in range(len(lst)):
agg[i] += lst[i]
agg = [number / len(preds) for number in agg]
return agg
# I apply now the two transformations to the previous rdd to obtain the predictions:
predictions = test_rdd_pandas.map(test_classifier).flatMap(aggregate_predictions)
y_pred = predictions.collect()
# Let's see how well we have done. I am going to compute the mean absolute error.
# First I take the real values for the test set:
y_test = list(map(lambda x: float(x.Total), test.select("Total").collect()))
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test,y_pred))
# 9.774107142857146
# Alright, we have now completed the first implementation of a Big Data solution, by splitting the original data into chunks and applying locally independent models. This is what we could call a Local approach, which is obviously an approximation of the original algorithm. This doesn't ''see'' the data as a whole, but in different chunks. This was the implementation offered in Apache Mahout with Hadoop.
# How does this differ from using the sequential implementation? This should be worse, right?
#
# Let me use the scikit-learn on the same training and test data.
train_pandas = train.toPandas()
test_pandas = test.toPandas()
from sklearn.tree import DecisionTreeRegressor
regr = DecisionTreeRegressor(random_state=0, max_depth=2)
X_train = train_pandas.loc[:, 'Ex01':'Project']
y_train = train_pandas['Total']
regr.fit(X_train, y_train)
X_test = test_pandas.loc[:, 'Ex01':'Project']
y_pred = regr.predict(X_test)
print(mean_absolute_error(y_test,y_pred))
# 10.817582417582418
# Oh, this is fun. I have used a local model, which is obviously not seeing all the data at once, but we have got even better results! How is that possible?
#
# This is what we call the 'ensemble' effect. We now have 4 partitions, and therefore 4 'weak' regressors that agree on a final result. Ensemble learning is a well-known approach in machine learning, and here we have exploited that concept to reduce the error while using all the data in a very simple way, with little effort and a model that was probably very fast to build.
#
# It is somewhat remarkable how easy this was; we didn't even need to understand much about the 'underlying model'. We could replace Decision Trees with any other existing regression or classification algorithm from the scikit-learn library. So, should we always develop this kind of Local approach?
#
# To answer that question, I want you to think about this question: what is going to happen if I increase the number of partitions?
#
# The building phase will probably be faster, because we have less data in each partition and we can parallelise the process even further.
# The final performance will depend on the number of partitions we make!
# If we split the data into too many partitions, we may end up with insufficient data to learn any meaningful model!
# The test phase will become slower!
Challenge #1: Modify the code above to test how much we can parallelise this before decreasing the performance. It would also be good to quantify the runtime!
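A minimal sketch for this challenge (my own addition, reusing the train DataFrame and the toPandas_partition and build_model functions defined above): time the whole build job for a few partition counts and compare.
# Sketch: quantify the runtime of the local build phase for different numbers of partitions.
import time
for n in [2, 4, 8, 16]:
    rdd_n = train.rdd.repartition(n).mapPartitions(toPandas_partition)
    start = time.time()
    models_n = rdd_n.map(build_model).collect()   # one tree per partition, collected to the driver
    print(n, "partitions ->", round(time.time() - start, 2), "s,", len(models_n), "models")
The accuracy side of the analysis can then be measured exactly as before, by aggregating the predictions of models_n on the test partitions.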
What is the alternative to this Local model?
The alternative to this would be to create a single global model using all the available data at once. In other words, try to obtain the same model as the one that would be obtained if the method could be executed on a single node. What does it imply? It means that we need to know very well how the method works and parallelise its operations, rather than distributing the execution of the whole method over pieces of the data.
As I mentioned before, the MLlib is mostly based on ‘exact’ replicas of the original methods, which means that they aim to implement global methods. With Spark we are not limited to a single MapReduce stage, so it is a suitable platform to attempt global models in which we may need to iterate through the data multiple times!
Decision Trees
Distributing the internal mechanisms of a Decision Tree is exactly what Spark implements in MLlib. Databricks provided a great presentation describing the details of how Spark parallelises Decision Trees, which can be found here. Although that is how the RDD API was implemented, the ideas are the same. As a side note, if you want to look at the source code, you will see that the Decision Tree is implemented as a Random Forest of a single tree using all the features, so that is where you need to look.
My goal is not to discuss all the details about how this is implemented, but for you to get an idea of how the processing has been distributed. As mentioned before, a Decision Tree searches for the best feature/attribute and cut-point value that allow us to split the data into two groups that are more similar among themselves with respect to the target variable. And this process is repeated recursively to construct the tree.
This means that we need to iterate through the data multiple times, but we don’t really perform any modifications to the data, but we simply compute some statistics about the input features and the output feature, and that will shape our tree. The tree structure is actually not expected to be a really big data structure, as this contains nodes with the information about which attributes are more important, and the values that are used to virtually ‘‘split’’ the data. Thus, in the Spark implementation of Decision Trees, the model is a variable that lives in the driver node and it is constructed iteratively from the information/stats provided by the workers.
This is a brief description of the steps that are performed to create the tree in parallel.
We assume that we have the training data as either a DataFrame or an RDD, distributed and ‘cached’ in the main memory of multiple worker nodes.
Step 1: Compute the root node: For each feature in the training data, we need to compute the best cut-points using impurity metrics like gini, or other dispersion metrics.
In a categorical feature, e.g. colour, taking the values green, red and blue, this is straightforward: for example, we can measure the dispersion of the target variable when the colour is red, green or blue and see which split is best. However, for a numerical/continuous variable this is slightly more difficult, because if the values of a feature range over [0,1] there are infinitely many cut-points we could test. This is where a process called 'binning' is used to discretise numerical features. If the number of bins is set to 12, for example, we could test values such as 0, 0.1, 0.2, ..., 0.9 and 1. Spark sets a limit on the maximum number of bins to reduce the computational cost when tackling numerical features.
In either case, this process is simply the computation of some stats for each attribute, which can easily be done in parallel. This information is sent to a single worker node with reduceByKey() that chooses the global best (feature, split) pair, or chooses to stop splitting if the stopping criteria are met.
These stats need to be aggregated and returned to the driver so we can determine the best feature on which to split the dataset, and therefore create the root node.
Step 2: Broadcast current model to all workers: In the previous step we have come up with the root node, but that information is only in the driver node. To further construct the model, all the workers need to know the current shape of the tree. Thus, the current model is broadcast to all workers before computing the next level of the tree.
Step 3: Repeat Step 1 on a level-by-level basis: In the next iteration we repeat the process, but to compute the stats that the driver node needs to update the model, the workers need the current model from the previous step. To accelerate the process, Spark computes all the stats necessary to construct the next level in only one pass over the data. This means that the running time scales linearly with the dataset size!
Caveats: Whilst the statistics are computed globally, which makes this very similar to the original sequential implementation of Decision Trees, there are a few things that are different.
Max Depth: the maximum depth of the tree (i.e. the number of levels/iterations) is capped to a certain level to limit the runtime required to construct the tree.
Binning: similarly, the number of bins for numerical features is also limited.
No pruning: while having a maximum depth might be relatively problematic, the main issue we might have with Decision Trees is that they may overfit the training data if they build very deep trees. We usually use pruning techniques to reduce overfitting, but this has not been included in the MLlib as far as I know.
On a final reflection, the key question is: is this global implementation better than the local model I did before?
Spark resources
Hi Ian, sorry to trouble you. I was wondering if someone said xxx function must be commutative and associative, so could you please tell me what’s the meaning of that?
Hi Minghan. A commutative function is one where you can swap the order of the operands and still get the same result: addition is commutative because 3+4 = 4+3, but subtraction isn’t because 3-4 != 4-3
An associative function is one where changing the placement of parentheses doesn’t affect the result: so addition is associative because (2+3)+4 = 2+(3+4)
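A tiny PySpark illustration of this point (my own, assuming sc is a SparkContext like the one created earlier in these notes): the function given to reduce() or reduceByKey() must be commutative and associative, otherwise the result depends on how the data happens to be partitioned.
# Sketch: reduce() assumes a commutative and associative function.
nums = sc.parallelize([1, 2, 3, 4], 2)
print(nums.reduce(lambda a, b: a + b))   # addition: always 10, regardless of partitioning
print(nums.reduce(lambda a, b: a - b))   # subtraction: the result depends on the partitioning, so it is NOT reliable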
http://dblab.xmu.edu.cn/blog/1709-2/
https://spark.apache.org/docs/latest/api/python/index.html
https://spark.apache.org/docs/latest/quick-start.html
https://blog.csdn.net/dxyna/article/details/79772343
https://www.sohu.com/a/435103777_167201
Lei Man once said that the internet is a 'feeding' society.
This feeding happens after big-data analysis
and machine-learning decisions, and the result is pushed to people.
Big data can record how long a user lingers on a particular image
to judge which kind of image the user is interested in.
To obtain this data, these companies need to monitor every place a user goes, every preference and every behaviour, tracking, analysing and evaluating without limit, chasing profit without limit.
You like watching NBA videos? Fine: before your next video of the same kind, an advert for basketball shoes is inserted.
The algorithms build a profile of each user, and that profile is built from countless behavioural data points.
This data needs no human supervision; the machines can learn from it automatically and make predictions.
The models that predict people's behaviour are built by these companies,
and the self-learning code is written by them.
In China, the exaggerated, inflammatory fake-news videos that some middle-aged and elderly people like to watch are not necessarily believed by their creators: fake news spreads faster than real news, and the traffic can be monetised. The people watching the videos for free are the product being sold.
Undersampling and oversampling:
https://www.zhihu.com/question/269698662
RDDs are immutable,
so the result of any operation on an RDD must be captured in a new RDD.
newRDD = oldRDD.filter();
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, labelCol="Total")
Initialise the linear regression algorithm.
print(lr.explainParams())
Print an explanation of the algorithm's parameters.
What does 'fit' mean? Feeding data to the model?
fit = fitting the model to the data.
During training, both the label data and the feature data are known;
it is precisely because both are known that we can teach the machine.
For a supervised learning model,
too small a feature set makes the model too simple,
while too large a feature set makes the model too complex.
The too-small case is called underfitting;
the too-large case is called overfitting.
Algorithm + training dataset = model.
Model + test dataset = predictions.
fit() behaves like an action, whereas transform() is a transformation,
so it waits for an action to trigger its execution.
Creating a pipeline:
Transformer (Binarizer) -> Transformer (VectorAssembler) -> Estimator (LinearRegression)
Instantiate LinearRegression with its parameters; you must tell it which column is the Label.
Instantiate VectorAssembler with its parameters; you must tell it which columns are the Features.
Without a pipeline this amounts to: linearRegression.fit(vectorAssembler.transform(train)).
With a pipeline: instantiate Pipeline with the stages parameter, passing the VectorAssembler and LinearRegression objects,
then create the pipeline model with pipelineModel = pipeline.fit(train).
The pipeline model is used for prediction: prediction = pipelineModel.transform(test).
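A minimal sketch of this pipeline (the column names "Ex01", "Ex02" and "Total" are placeholders consistent with the marks dataset used earlier in these notes):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
assembler = VectorAssembler(inputCols=["Ex01", "Ex02"], outputCol="features")   # which columns are the features
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                      featuresCol="features", labelCol="Total")                 # which column is the label
pipeline = Pipeline(stages=[assembler, lr])
pipelineModel = pipeline.fit(train)          # Estimator.fit -> PipelineModel
prediction = pipelineModel.transform(test)   # PipelineModel.transform -> DataFrame with predictions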
The 'global variables' mentioned in the lecture are global with respect to Spark's worker nodes.
RDDs have some limitations (downsides);
DataFrames offer better performance than RDDs, and
DataFrames have become the primary Machine Learning API for Spark.
do not use collect() when handling big datasets!
❑Use take(n)
❑Cache your DataFrames!
Spark function cheat sheet (a combined example follows at the end of this list)
sc.parallelize()
Turn a Python list into an RDD.
map()
e.g. turn a list of numbers into a list of tuples.
flatMap()
e.g. split strings on spaces so that all the words are flattened into one list.
groupByKey()
Poor performance; prefer reduceByKey().
reduceByKey()
For (key, value) tuples: aggregate the values that share the same key.
sortByKey()
len()
Number of elements, e.g. the number of elements in a list.
list()
Convert to a list.
take()
Take the first n elements.
filter()
Filter elements.
stats(), count(), mean(), stdev(), max(), min()
RDD statistics: count, mean, standard deviation, max/min, etc.
sc.textFile()
Read a file or a directory.
saveAsTextFile()
Write to a directory: one output file per partition.
spark.createDataFrame(tuples)
spark.createDataFrame(tuples, ['name', 'age'])
Create a DataFrame, converting an RDD or a list of tuples into a DataFrame.
df.printSchema()
Inspect the DataFrame's column information; Spark's inferred types may be wrong, in which case cast them yourself.
df = df.select([F.col(c).cast("float").alias(c) for c in df.columns])
Cast column types.
df = df.drop('Question 1').drop('Question 2').drop('Question 3').drop('Question 4')
Drop columns.
train, test = df.randomSplit([0.7, 0.3])
Randomly split the data into two parts for machine learning.
df.show(2)
df.collect()
Display the DataFrame's contents; show() takes the number of rows to display.
rdd.toDF()
Convert an RDD into a DataFrame.
spark.read.json("data/data.json")
Read a JSON file.
spark.read.option("samplingRatio", 0.2).json("data/data.json")
Read a JSON file sampling only a fraction of the data to infer the schema, useful for large files that are slow to read.
lines_df = spark.read.text("./data/quixote.txt")
dfText = spark.read.text("data/pg2000.txt")
dfJSON = spark.read.json("data/json.json") # infer the Schema automatically
dfCSV = spark.read.csv("data/personas.csv", inferSchema=True, header=True)
spark.read.format('csv').option("header", 'true').load("data/results.csv")
Read files into DataFrames.
dfText.write.text("datos/salidaTXT1.txt")
dfJSON.write.json("datos/salidaJSON1.json")
dfCSV.write.csv("datos/salidaCSV1.csv", header=True)
Write DataFrames to files.
dataDF.rdd
Convert a DataFrame into an RDD.
df.name
df["name"]
Access a DataFrame column.
df.select(*cols)
Select data from a DataFrame.
df.selectExpr(*expr)
df.selectExpr("age * 2", "abs(age)").show()
Query a DataFrame with SQL expressions.
df.filter(condition)
df.where(condition)
Filter DataFrame rows.
df.orderBy(*cols, ascending)
df.sort(*cols, **kwargs)
Sort DataFrame rows.
df.select(df.name, (df.age + 10)).show()
df.select(df.name, (df.age + 10).alias('age')).show()
Compute new values on the fly and rename the resulting column.
df.distinct()
df.dropDuplicates(*cols)
Remove duplicate rows.
df.withColumn(colName, col)
Add a new column, optionally computed from existing ones.
df.withColumnRenamed(existing, new)
df.name.alias("Hello")
Rename a column.
df.drop(col)
Drop a column from a DataFrame.
df.limit(num)
Limit the number of rows.
df.cache()
Cache the DataFrame.
df.age.between(18, 65)
df.select(df.name, df.age.between(2, 4)).show()
df.filter(df.age.between(2, 4)).show()
Test whether a value lies between two bounds.
df.age.isNull()
df.age.isNotNull()
df.filter(df.name.isNull()).show()
Null and not-null tests.
df.select(df.name, F.when(df.age > 18, 'adult').otherwise('kid')).show()
if-else logic for DataFrames.
df.select(df.name, df.age, F.when(df.age > 18, 'Adult').when(df.age < 12, 'kid').otherwise(...)).show()
Multiple conditions can be chained.
df.select(df.name.substr(1, 3).alias("Abbr")).collect()
startswith(other)
substring(startPos, len)
like(other)
String operations on columns.
isin(*cols)
df.name.isin("Isaac", "Pepe")
df.filter(df.name.isin("Isaac", "Pepe")).collect()
Test whether a column value is one of a given set of values.
explode(col)
eDF.select(F.explode(eDF.intlist).alias("anInt")).show()
Explode a column into rows: turn the tuples, dicts or lists stored in a column into separate rows, similar to flatMap.
df.select("*", F.lit(0).alias("Balance")).show()
Create a new column with a literal value.
df.select(F.length(df.name).alias('len')).show()
Compute the string length of a column.
df.distinct()
Remove duplicates.
df.union(df2).show()
Union of two DataFrames (keeps duplicates).
df1.intersect(df2).show()
Intersection of two DataFrames (removes duplicates; expensive).
df1.subtract(df2).show()
Difference of two DataFrames (expensive).
df.show(n=20, truncate=True)
Show a given number of rows.
df.count()
Return the number of rows.
df.collect()
Return all rows to the driver.
df.first()
Return the first row.
df.take(2)
Return the first n rows.
df.toPandas()
Convert to a pandas DataFrame.
df.columns
List of column names.
df.describe(*cols)
Summary statistics for columns (mean, min/max, ...).
df.explain(extended=False)
df.filter(df.age > 10).select(df.age).explain(True)
Show the execution plan for debugging and inspecting optimisations.
df.cache()
lines_df.select(explode(split('value', ' ')).alias('word')).cache()
words_df.where(words_df.word.like("%blockhead%")).count()
words_df.where(words_df.word.like("%spear%")).count()
Cache a DataFrame that is reused by several queries.
agg(*exprs)
df.groupBy().agg()
df.agg(F.min(df.age), F.max(df.age), F.mean(df.age)).show()
df.agg({"age": "max"}).show()
Aggregations (avg, max, min, sum, count) without grouping.
groupBy(*cols)
Group rows, similar to GROUP BY in MySQL.
df3.groupBy('name').count().show()
df3.groupBy('name').agg(F.mean(df3.age), F.count(df3.age)).show()
The following can all be used inside a grouped aggregation (numeric columns only):
avg(*cols) / mean(*cols)
count()
max(*cols)
min(*cols)
sum(*cols)
pivot(pivot_col, values): turn rows into columns.
agg(*exprs)
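To tie several of the functions above together, here is a small self-contained sketch (my own toy data, not from the coursework) that exercises parallelize/flatMap/map/reduceByKey on RDDs and createDataFrame/filter/groupBy/agg on DataFrames:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.master("local[*]").appName("cheatsheet").getOrCreate()
sc = spark.sparkContext
# RDD side: flatMap splits the words, map builds (word, 1) tuples, reduceByKey counts them.
words = (sc.parallelize(["to be or", "not to be"])
         .flatMap(lambda line: line.split(" "))
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b))
print(words.take(3))
# DataFrame side: create, filter, group and aggregate.
df = spark.createDataFrame([("Ana", 23), ("Bob", 17), ("Ana", 31)], ["name", "age"])
df.filter(df.age > 18).groupBy("name").agg(F.mean("age").alias("avg_age")).show()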
Group project
Understanding experimental validation procedures. For example, Fold-cross validation, performance metrics appropriate for the given problem, etc.
You need to understand that we have to split the data into training and test sets and perform some sort of cross-validation to validate your experiments, and that you need different metrics depending on the kind of problem you're dealing with;
all of this will really be needed, especially when you do your group project, which I will discuss in a little bit.
Problem-based project
Given the data,
research the different topics covered,
and connect them to solve the problem;
you can combine different pre-processing techniques with a classifier to obtain results.
Machine-learning knowledge is needed,
and you must make the right choices based on the data.
Papers must be clearly presented in English, must not exceed 8 pages,
including tables, figures, references and appendixes, in IEEE Computer
Society proceedings format with Portable Document Format (.pdf).
You should master some basic machine learning algorithms.
Code copied from the internet will be detected very quickly,
and you will be asked to explain every detail during the demonstration.
Successful completion means that you are able to explain your solution during the presentations.
PB06: A Classification Model for Sentiment analysis of Tweets
You can determine exactly what you implement, e.g. which algorithms and how you solve it.
Here we suggest you aim to classify twitter text messages according to hot hashtags for US election 2020, which was proposed recently in Kaggle. To tackle this problem, your solution should be able to quickly extract the text data features to highlight the hot keywords which are re-tweeted more than once, and apply various machine learning algorithms to understand if there is any correlation between what’s going on twitter and the actual results in the elections.
A # marks a hashtag; adding one to a tweet tells whoever sees it what topic you are talking about
and lets people quickly search for tweets with the same tag.
The task is to classify Twitter text messages
according to the hot hashtags of the 2020 US election.
The suggested solution: quickly extract features from the text data to highlight hot keywords that are retweeted more than once.
https://www.kaggle.com/manchunhui/us-election-2020-tweets/activity
https://www.sciencedirect.com/science/article/pii/S1877050920306669
Resource list:
Spark MLlib sentiment analysis dataset (Shiyanlou/Lanqiao lab)
https://www.lanqiao.cn/activities/user-invitation/?from=modal
https://www.lanqiao.cn/courses/1283
https://www.lanqiao.cn/courses/1283/learning/?id=10763
https://www.cnblogs.com/yidada/p/11868870.html
Install the related Python packages:
wget https://labfile.oss.aliyuncs.com/courses/722/numpy-1.11.3-cp27-cp27mu-manylinux1_x86_64.whl
wget https://labfile.oss.aliyuncs.com/courses/722/pyparsing-2.1.10-py2.py3-none-any.whl
wget https://labfile.oss.aliyuncs.com/courses/722/pytz-2016.10-py2.py3-none-any.whl
wget https://labfile.oss.aliyuncs.com/courses/722/cycler-0.10.0-py2.py3-none-any.whl
wget https://labfile.oss.aliyuncs.com/courses/722/python_dateutil-2.6.0-py2.py3-none-any.whl
wget https://labfile.oss.aliyuncs.com/courses/722/matplotlib-1.5.3-cp27-cp27mu-manylinux1_x86_64.whl
sudo pip install numpy-1.11.3-cp27-cp27mu-manylinux1_x86_64.whl
sudo pip install pyparsing-2.1.10-py2.py3-none-any.whl
sudo pip install pytz-2016.10-py2.py3-none-any.whl
sudo pip install cycler-0.10.0-py2.py3-none-any.whl
sudo pip install python_dateutil-2.6.0-py2.py3-none-any.whl
sudo pip install matplotlib-1.5.3-cp27-cp27mu-manylinux1_x86_64.whl
The analysis results are visualised with Basemap, a third-party Python package for geographic/map visualisation. Basemap is installed as follows:
$ wget https://labfile.oss.aliyuncs.com/courses/722/basemap-1.0.7.tar.gz
$ tar zxvf basemap-1.0.7.tar.gz
$ cd basemap-1.0.7
$ cd geos-3.3.3
$ ./configure
$ make
$ sudo make install
The steps above take quite a long time; meanwhile you can read ahead through the later steps of the lab. Once they finish, return to the basemap-1.0.7 directory and install basemap:
$ cd ..
$ sudo vim /etc/apt/sources.list # uncomment the deb-src lines
# update the dependencies first, then install
$ sudo apt-get update
$ sudo apt-get build-dep python-lxml
$ sudo python setup.py install
# update after installation and install python-tk
$ sudo apt-get update
$ sudo apt-get install python-tk
Go into the examples directory and run simpletest.py to check whether the installation succeeded:
$ cd examples
$ python simpletest.py
wget https://labfile.oss.aliyuncs.com/courses/722/tweets.json
wget https://labfile.oss.aliyuncs.com/courses/722/donald.json
wget https://labfile.oss.aliyuncs.com/courses/722/hillary.json
from __future__ import print_function
import json
import re
import string
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark import SQLContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.feature import Normalizer
from pyspark.mllib.regression import LabeledPoint
conf = SparkConf().setAppName("sentiment_analysis")
sc = SparkContext(conf=conf)
sc.setLogLevel("WARN")
sqlContext = SQLContext(sc)
In this lab we use the downloaded tweets.json as the training set for the sentiment-analysis classifier. The test set here is hillary.json; to analyse Trump instead, simply replace the test set with donald.json.
As the figure shows, we first tokenise each tweet. Since these are English tweets, it is enough to split on words (unlike Chinese sentiment analysis), and we define stop_words, a list of stop words that are irrelevant to sentiment analysis.
The machine-learning models in Spark MLlib work on data in vector form, so we must convert the text into vectors. To save time, we pre-trained a word2vec model using Spark's Word2Vec together with part of the vocabulary in the provided text8 file, and saved it in the word2vecM_simple folder; in this lab you can simply load this model when converting tweets into vectors. Download the offline model from the following address and unzip it:
cd ~
wget https://labfile.oss.aliyuncs.com/courses/722/word2vecM_simple.zip
unzip word2vecM_simple.zip
Because of the limitations of the Shiyanlou online environment, you can instead train the word-vector model yourself offline with text8, or search online for a word2vec model trained on tokenised tweets.
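As a hedged sketch of how the pre-trained model can be used (assuming the word2vecM_simple folder unzipped above and the default vector size of 100), a tweet can be represented by averaging the vectors of its words:
from pyspark.mllib.feature import Word2VecModel
import numpy as np
w2v = Word2VecModel.load(sc, "word2vecM_simple")   # load the offline word2vec model
def tweet_to_vec(words, size=100):
    vecs = []
    for w in words:
        try:
            vecs.append(w2v.transform(w).toArray())  # vector for a single word
        except Exception:                            # word not in the vocabulary
            pass
    return np.mean(vecs, axis=0) if vecs else np.zeros(size)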
Once the sentiment-analysis helper functions are defined, we can read the data from the JSON files, create RDD objects, and run sentiment analysis with Spark MLlib's classifiers.
Based on the RandomForest classifier trained in the previous section, we take its sentiment predictions for tweets about Hillary and, using basemap, display them over the 48 contiguous US states (excluding Alaska and Hawaii) to see how those states feel about Hillary.
We will draw the visualisation in the function res_visulization, so first we need to define it.
Import the plotting modules:
from mpl_toolkits.basemap import Basemap
from mpl_toolkits.basemap import cm
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.colors import rgb2hex
from matplotlib.patches import Polygon
We draw the map from a US shapefile with basemap, so the following three files are needed (place them in the shiyanlou_cs722 directory):
wget https://labfile.oss.aliyuncs.com/courses/722/st99_d00.shp
wget https://labfile.oss.aliyuncs.com/courses/722/st99_d00.dbf
wget https://labfile.oss.aliyuncs.com/courses/722/st99_d00.shx
The function res_visulization visualises the sentiment-analysis results.
# pred_result: the sentiment predictions produced with Spark MLlib
def res_visulization(pred_result):
    # popdensity_ori stores each state's sentiment based on the tweet polarity labels given in advance
popdensity_ori = {'New Jersey': 0., 'Rhode Island': 0., 'Massachusetts': 0., 'Connecticut': 0.,
'Maryland': 0.,'New York': 0., 'Delaware': 0., 'Florida': 0., 'Ohio': 0., 'Pennsylvania': 0.,
'Illinois': 0., 'California': 0., 'Hawaii': 0., 'Virginia': 0., 'Michigan': 0.,
'Indiana': 0., 'North Carolina': 0., 'Georgia': 0., 'Tennessee': 0., 'New Hampshire': 0.,
'South Carolina': 0., 'Louisiana': 0., 'Kentucky': 0., 'Wisconsin': 0., 'Washington': 0.,
'Alabama': 0., 'Missouri': 0., 'Texas': 0., 'West Virginia': 0., 'Vermont': 0.,
'Minnesota': 0., 'Mississippi': 0., 'Iowa': 0., 'Arkansas': 0., 'Oklahoma': 0.,
'Arizona': 0., 'Colorado': 0., 'Maine': 0., 'Oregon': 0., 'Kansas': 0., 'Utah': 0.,
'Nebraska': 0., 'Nevada': 0., 'Idaho': 0., 'New Mexico': 0., 'South Dakota': 0.,
'North Dakota': 0., 'Montana': 0., 'Wyoming': 0., 'Alaska': 0.}
    # popdensity stores each state's sentiment based on the random-forest predictions
popdensity = {'New Jersey': 0., 'Rhode Island': 0., 'Massachusetts': 0., 'Connecticut': 0.,
'Maryland': 0.,'New York': 0., 'Delaware': 0., 'Florida': 0., 'Ohio': 0., 'Pennsylvania': 0.,
'Illinois': 0., 'California': 0., 'Hawaii': 0., 'Virginia': 0., 'Michigan': 0.,
'Indiana': 0., 'North Carolina': 0., 'Georgia': 0., 'Tennessee': 0., 'New Hampshire': 0.,
'South Carolina': 0., 'Louisiana': 0., 'Kentucky': 0., 'Wisconsin': 0., 'Washington': 0.,
'Alabama': 0., 'Missouri': 0., 'Texas': 0., 'West Virginia': 0., 'Vermont': 0.,
'Minnesota': 0., 'Mississippi': 0., 'Iowa': 0., 'Arkansas': 0., 'Oklahoma': 0.,
'Arizona': 0., 'Colorado': 0., 'Maine': 0., 'Oregon': 0., 'Kansas': 0., 'Utah': 0.,
'Nebraska': 0., 'Nevada': 0., 'Idaho': 0., 'New Mexico': 0., 'South Dakota': 0.,
'North Dakota': 0., 'Montana': 0., 'Wyoming': 0., 'Alaska': 0.}
idx = 0
for obj in rawTst_data['results']:
user_location = obj['user_location']
popdensity_ori[user_location] += (obj['polarity'] - 1)
popdensity[user_location] += (pred_result[idx] - 1)
idx += 1
    # Print each state's sentiment to the terminal.
    # We encoded polarity as positive: 0, neutral: 1, negative: 2,
    # so the larger a state's value, the more negative it is about the new president; the smaller, the more positive.
print('popdensity_ori')
print(popdensity_ori)
print("---------------------------------------------------------")
print('popdensity')
print(popdensity)
print("---------------------------------------------------------")
# Lambert Conformal map of lower 48 states.
fig = plt.figure(figsize=(14, 6))
    # ax1 and ax3 show the original sentiment values of the test data and the model's predictions, respectively
ax1 = fig.add_axes([0.05, 0.20, 0.40, 0.75])
ax3 = fig.add_axes([0.50, 0.20, 0.40, 0.75])
    # initialise a Basemap object m1 covering the continental US
m1 = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
projection='lcc',lat_1=33,lat_2=45,lon_0=-95, ax = ax1)
# draw state boundaries.
# data from U.S Census Bureau
# http://www.census.gov/geo/www/cob/st2000.html
shp_info = m1.readshapefile('st99_d00','states',drawbounds=True)
print(shp_info)
Add each state's abbreviation to the visualisation; only some of the abbreviations are given here, and a few of the smaller eastern states are not listed.
# state abbreviations
cities = ['WA', 'OR', 'CA', 'NV', 'MT','ID','WY','UT',
'CO', 'AZ', 'NM', 'ND', 'SD', 'NE', 'KS',
'OK', 'TX', 'MN', 'IA', 'MO', 'AR', 'LA',
'WI', 'IL', 'MI', 'IN', 'OH', 'KY', 'TN',
'MS', 'AL', 'PA', 'WV', 'GA', 'ME', 'VT',
'NY', 'VA', 'NC', 'SC', 'FL', 'AL']
lat = [47.40, 44.57, 36.12, 38.31, 46.92, 44.24,
42.75, 40.15, 39.06, 33.73, 34.84, 47.53,
44.30, 41.125, 38.526, 35.565, 31.05,
45.69, 42.01, 38.46, 34.97, 31.17, 44.27,
40.35, 43.33, 39.85, 40.39, 37.67, 35.75,
32.74, 61.37, 40.59, 38.49, 33.04, 44.69,
44.045, 42.165, 37.77, 35.63, 33.86, 27.77,
32.81]
lon = [-121.49, -122.07, -119.68, -117.05, -110.45,
-114.48, -107.30, -111.86, -105.31, -111.43,
-106.25, -99.93, -99.44, -98.27, -96.726,
-96.93, -97.56, -93.90, -93.21, -92.29,
-92.37, -91.87, -89.62, -88.99, -84.54,
-86.26, -82.76, -84.67, -86.70, -89.68,
-152.40, -77.21, -80.95, -83.64, -69.38,
-72.71, -74.95, -78.17, -79.81, -80.945,
-81.67, -86.79]
The latitude and longitude of each US state can be looked up at the link below.
https://inkplant.com/code/state-latitudes-longitudes
# choose a color for each state based on population density.
colors={}
colors2 = {}
statenames=[]
    # use the cm.GMT_polar colormap to display the different sentiments
cmap = cm.GMT_polar
    # get the max and min sentiment values over the 48 states, both for the original test labels and for the model's predictions
    # original labels that came with the test data
inverse = [(value, key) for key, value in popdensity_ori.items()]
vmin = min(inverse)[0]
vmax = max(inverse)[0] # set range.
    # sentiment predictions for the test data
inverse = [(value, key) for key, value in popdensity.items()]
vmin_pred = min(inverse)[0]
vmax_pred = max(inverse)[0]
print('vmax:')
print(vmax)
print(m1.states_info[0].keys())
for shapedict in m1.states_info:
statename = shapedict['NAME']
# skip DC and Puerto Rico.
if statename not in ['District of Columbia','Puerto Rico']:
pop = popdensity_ori[statename]
pop_pred = popdensity[statename]
# calling colormap with value between 0 and 1 returns
# rgba value. Invert color range (hot colors are high
# population), take sqrt root to spread out colors more.
            # with the cm.GMT_polar colormap, a state's colour gets redder the more positive it is and bluer the more negative; neutral is colourless
            # blue corresponds to cmap values below 0.5, colourless to 0.5, red to values above 0.5
if pop == 0:
colors[statename] = cmap(0.5)[:3]
elif pop < 0:
colors[statename] = cmap(1.0 - np.sqrt((pop - vmin)/(0-vmin)))[:3]
else:
colors[statename] = cmap(0.5 - np.sqrt((pop - 0)/(vmax-0)))[:3]
            # as above, colors2 stores the colours based on the MLlib sentiment predictions
if pop_pred == 0:
colors2[statename] = cmap(0.5)[:3]
elif pop_pred < 0:
colors2[statename] = cmap(1.0 - np.sqrt((pop_pred - vmin_pred) / (0 - vmin_pred)))[:3]
else:
colors2[statename] = cmap(0.5 - np.sqrt((pop_pred - 0) / (vmax_pred - 0)))[:3]
statenames.append(statename)
# cycle through state names, color each one.
#ax = plt.gca() # get current axes instance
for nshape,seg in enumerate(m1.states):
# skip DC and Puerto Rico.
if statenames[nshape] not in ['District of Columbia','Puerto Rico']:
color = rgb2hex(colors[statenames[nshape]])
#print(statenames[nshape])
poly = Polygon(seg,facecolor=color,edgecolor=color)
ax1.add_patch(poly)
    # draw parallels and meridians
m1.drawparallels(np.arange(25,65,20),labels=[1,0,0,0])
m1.drawmeridians(np.arange(-120,-40,20),labels=[0,0,0,1])
    # add the state abbreviations to the map
x, y = m1(lon, lat)
for city, xc, yc in zip(cities, x, y):
ax1.text(xc - 60000, yc - 50000, city)
ax1.set_title('Twitter-based sentiment analysis about Hillary ')
m2 = Basemap(llcrnrlon=-119,llcrnrlat=22,urcrnrlon=-64,urcrnrlat=49,
projection='lcc',lat_1=33,lat_2=45,lon_0=-95, ax = ax3)
m2.readshapefile('st99_d00', 'states', drawbounds=True)
for nshape, seg in enumerate(m2.states):
# skip DC and Puerto Rico.
if statenames[nshape] not in ['District of Columbia', 'Puerto Rico']:
color = rgb2hex(colors2[statenames[nshape]])
# print(statenames[nshape])
poly = Polygon(seg, facecolor=color, edgecolor=color)
ax3.add_patch(poly)
ax3.set_title('Random Forest prediction')
m2.drawparallels(np.arange(25,65,20),labels=[1,0,0,0])
m2.drawmeridians(np.arange(-120,-40,20),labels=[0,0,0,1])
x, y = m2(lon, lat)
for city, xc, yc in zip(cities, x, y):
ax3.text(xc - 60000, yc - 50000, city)
    # add a gradient colorbar so colours can be matched to sentiment values
ax2 = fig.add_axes([0.05, 0.10, 0.9, 0.05])
norm = mpl.colors.Normalize(vmin=-1, vmax=1)
cb1 = mpl.colorbar.ColorbarBase(ax2, cmap=cmap,
norm=norm,
orientation='horizontal',
ticks=[-1, 0, 1])
    cb1.ax.set_xticklabels(['negative', 'neutral', 'positive'])
cb1.set_label('Sentiment')
plt.show()
Once the function is defined, call it on the last line of sparkSA.py so that it runs. Add this line at the end of the code:
res_visulization(predictions.collect())
Save the file and exit the editor. In the shiyanlou_cs722 directory, submit the completed Python script sparkSA.py with spark-submit to run the program:
When the program finishes, we can see each state's original and predicted sentiment values:
Final sentiment-analysis visualisation (sometimes the program finishes before the plot appears; if so, just submit it again):
wget https://labfile.oss.aliyuncs.com/courses/722/sparkSA.py
Many Spark MLlib sentiment-analysis examples use a Naive Bayes model, whereas here we used a random forest. As an exercise, rewrite the code to use Spark MLlib's Naive Bayes model and output the sentiment-analysis results.
Group discussion
- Records with an empty state name need to be filtered out;
Ian:
- So the important thing is to talk about how you approached doing the classification in a big-data way that makes sense, not just we did some random forests but how did you specifically exploit big-data techniques to make this faster. It’s not just a machine learning coursework, although you need to talk about that you also need to talk about how you exploited big data techniques to make this more efficient. So basically, you need to describe how you used spark to parallelise your data - you should show how you successfully split your dataset into multiple parts and trained the classifier that way, as opposed to just running it normally on a single node.
- There is no gold standard for measuring the accuracy of the sentiment values (unless every tweet were labelled by hand, which is impossible); what we evaluate is the accuracy of predicting the TextBlob labels (and that is fine): we assume TextBlob is highly accurate, and we only need to reproduce TextBlob's results.
- You need to label every single instance in your datasets regardless of whether it’s going to be used for training or testing. The way the testing works is that you take the entries in your test data, run your classifiers on it and then check to see whether your classifier returns the same result as what TextBlob returned, whether your method gives the same label as what TextBlob gave.
- 70% of the data is used for training and 30% for testing. No separate 'application' data is needed; for this coursework, training and testing are enough, and you report the test results, etc.;
- The focus of the coursework is how you applied big-data techniques to solve the problem; don't spend too much time on data visualisation, which is only the icing on the cake;
- On local models: as an example, if a local model is tuned only on tweets about the economy and fits them very well (without overfitting), then that local model is only good for analysing economic tweets and is useless for tweets about anything else. A global model fits all of the training data, spreading its performance over it. A local model does not try to fit all the data and focuses only on a small part of it,
because for a particular instance a large amount of the data may be completely irrelevant, in which case there is no need to fit it; we only need to fit the relevant data. You can search for "classification local model"; for details see https://www.tandfonline.com/doi/abs/10.1198/0003130031423?casa_token=bNOVUPIwxgcAAAAA%3Ahepruxxl3lT7zTolJkgkITOBVlwpi8lnEyWQaI5k8uyWbsAGTkrtmQx0MSUCKeLRfQST-P0yIBRQ& Data cleaning (modifying or deleting records to get the data into the required shape) has to be done before pre-processing; the details can be sent to Ian for another look.
- Improving the classifier's accuracy depends on the features you choose and on parameter tuning; you can adjust the number of trees and their depth, then compare the results to see whether it helps.
Reference material
Vectors
The machine-learning models in Spark MLlib work on data in vector form, so the text has to be converted into vectors.
Here, Spark's Word2Vec was used together with part of the vocabulary of the provided text8 file to pre-train a word2vec model,
which was saved in the word2vecM_simple folder, so in this lab that model is simply loaded when converting tweets into vectors.
You can also train the word-vector model yourself with text8, or search online for a word2vec model trained on tokenised tweets.
- https://www.cnblogs.com/tina-smile/p/5204619.html
- https://blog.51cto.com/u_15127586/2670975
- https://blog.csdn.net/chuchus/article/details/71330882
- https://blog.csdn.net/chuchus/article/details/77145579
Random forest
- https://www.cnblogs.com/mrchige/p/6346601.html
- https://www.jianshu.com/p/310ef75e150d
- https://blog.csdn.net/qq_41853758/article/details/82934506
- https://blog.csdn.net/wustjk124/article/details/81320995
- https://blog.csdn.net/zyp199301/article/details/71727278
- https://my.oschina.net/u/4347889/blog/3346852
- https://blog.csdn.net/p_function/article/details/77713611
Naive Bayes
- http://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
- https://developer.ibm.com/alert-zh
- https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2
- http://introtopython.org/visualization_earthquakes.html#Adding-detail
Machine learning
Human learning:
existing knowledge -> summarise patterns -> solve new problems.
Machine learning:
existing data -> build a model (apply an algorithm) -> predict new situations.
What does 'predict new situations' mean?
Used-car pricing: car A sells for 1000, car B for 2000,
and car C for 3000; how much will car D sell for? Can you predict it?
The 'watermelon book' and Andrew Ng:
https://www.bilibili.com/video/BV164411b7dx?p=1
You can watch Andrew Ng's ML course on Bilibili.
https://www.bilibili.com/video/BV164411S78V?from=search&seid=9762645346183072559
- Shangguigu's 2021 'Big Data Spark from beginner to expert' course
- Watch the recorded lectures on the platform
- Self-study machine learning
Big data requires certain operations/computations,
and those operations rely on machine-learning algorithms.
Applications of machine learning:
image recognition (recognising people and objects in images); Google search, which tailors results to your personal search habits, e.g. searching for 'Java' gives different results depending on whether you are a barista or a programmer;
lie detection, recommender systems, text and speech recognition, autonomous driving.
Some machine-learning algorithms:
(machine learning needs to be studied on your own)
classification problem?
classification algorithms?
regression problem?
time series forecasting problem?
tabular data?
artificial neural network
support vector machine
clustering?
clustering algorithms
https://moodle.nottingham.ac.uk/course/view.php?id=102891
https://blog.csdn.net/javastart/article/details/102885269
https://haokan.baidu.com/v?vid=9547042470775173495&pd=bjh&fr=bjhauthor&type=video
https://haokan.baidu.com/v?vid=6939919533874762798&pd=bjh&fr=bjhauthor&type=video
Machine learning is using data to answer questions.
Data: images, music, text, video, audio, and so on.
To do machine learning, you first need data.
https://www.jianshu.com/p/fe50fa4e2dd7
https://www.youtube.com/watch?v=HcqpanDadyQ
Train with data, using it to learn user habits,
and keep optimising the prediction models;
these models can then make predictions on new data and answer questions.
The more data there is, the better the models become.
The machine-learning workflow
https://www.bilibili.com/video/BV17t411b7cv
https://www.sohu.com/a/198093510_783844
1. Collect data
We need enough data/samples to train the model
and let it learn.
The quality of the data directly determines the quality of the model.
2. Prepare the data
The data is usually standardised, visualised and shuffled,
turning the raw data into something the machine can easily work with.
Split the raw data into a training set and a test set.
Vectorisation:
if the data to be processed is text, for example,
methods such as word2vec or tf-idf are needed to turn it into numbers that a mathematical model can work with (see the sketch after this step).
Pre-processing:
filter out low-quality records manually or automatically,
or convert data types,
so the data can be used in the subsequent modelling.
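A minimal sketch (toy sentences of my own) of turning text into TF-IDF vectors with Spark ML:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
spark = SparkSession.builder.master("local[*]").appName("tfidf").getOrCreate()
docs = spark.createDataFrame([(0, "spark is fast"), (1, "hadoop map reduce")], ["id", "text"])
words = Tokenizer(inputCol="text", outputCol="words").transform(docs)                # split text into words
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1024).transform(words)  # term frequencies
idf_model = IDF(inputCol="tf", outputCol="features").fit(tf)                         # learn IDF weights
idf_model.transform(tf).select("id", "features").show(truncate=False)                # TF-IDF vectors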
3. Choose a model
Depending on the form of the data, choose a function to fit it.
The model is used to summarise the patterns in the data/samples
and apply them to unseen situations,
drawing conclusions that are consistent with the known patterns.
Which models are candidates depends on the business problem.
In general, a model has a fixed shape and form.
You can pick one based on luck or experience.
For example:
to predict product sales,
choose a numeric prediction model (e.g. regression, time-series forecasting, ...);
to predict whether an employee will resign,
choose a classification model (e.g. decision tree, neural network, ...).
Supervised Learning
Supervised learning is the umbrella term for a large family of machine-learning algorithms.
In essence, supervised learning is the method of undetermined coefficients:
the learning process is the process of solving for the unknown coefficients of an assumed expression.
Predicting tomorrow's temperature is a regression task;
predicting whether tomorrow will be cloudy, sunny or rainy is a classification task.
Classification
Decision Tree
https://www.bilibili.com/video/BV1Xp4y1U7vW
https://www.bilibili.com/video/BV1HV411b7JR
https://baijiahao.baidu.com/s?id=1669031487336565105&wfr=spider&for=pc
A decision tree builds a classification model with a tree structure: each internal node corresponds to an attribute, and according to the split on that attribute we move down to one of the node's children, until a leaf is reached; each leaf represents a class, which is how classification is achieved.
A decision tree is a tree-structured machine-learning algorithm
in which each internal node represents a test on an attribute,
each branch represents one outcome of that test,
and each leaf node represents a classification result.
It works by modularising the data: all the data is placed in one node,
a decision condition is applied at that node,
the data in the node is split into two parts
placed in child nodes, and this logic is repeated
until the data in every child node belongs to a single class,
at which point all the data has been classified by the tree.
When a new test sample arrives for which only the feature values are known,
not the class,
we feed those features into the tree from top to bottom,
applying the decision conditions in turn,
and this tells us the class of the test sample.
The deeper the tree (i.e. the more decision conditions applied), the more reliable the classification model.
Decision trees are very intuitive and easy to visualise, and when something goes wrong it is easy to trace back through the decisions.
Smart healthcare: based on a large amount of known patient information, decide
whether a patient should have surgery.
Source:
https://www.bilibili.com/video/BV1Xp4y1U7vW
How do we choose the decision conditions?
How do we quickly find the most suitable condition so that the decision tree is more efficient?
Starting with "is it an animal?" needs 3 conditions to classify everything,
whereas starting with "does it have feathers?" classifies everything with just 1 condition.
The amount of information can be quantified with the concept of entropy,
which measures the uncertainty within a node.
The higher the entropy, the greater the uncertainty and the more evenly mixed the classes in the node.
The lower the entropy, the smaller the uncertainty and the more the samples lean towards a single class.
The faster the entropy decreases, the more efficiently the decision tree classifies.
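A tiny, self-contained illustration (my own toy labels) of how a node's entropy reflects how mixed its classes are:
# The purer the class distribution in a node, the lower its entropy.
import math
def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)
print(entropy(["bird"] * 5 + ["dog"] * 5))   # evenly mixed node -> 1.0 bit (the maximum for two classes)
print(entropy(["bird"] * 10))                # pure node -> 0.0 bits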
Decision Tree is a very well-known machine learning technique for classification or regression.
Most promising characteristics:
❑Capable of handling both numerical
and categorical data
❑ Interpretable
❑Detect variable importance
❑Typically used in ensemble methods (e.g. RandomForest)
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-0GdjHqC9-1629958218056)(evernotecid://BCE3D193-8584-4CB1-94B3-46FF37A1AC6C/appyinxiangcom/12192613/ENResource/p623)]
This technique partitions the data into different subsets depending on the input attributes. It starts off by measuring the ability of each input attribute/feature to break the data into groups of the same nature (e.g. class label) using dispersion metrics (e.g. Gini). The feature and value with the highest power to split the data is selected as the root node, and the data is split into two child nodes (i.e. it builds binary trees). This process is applied recursively until each node reaches a certain size at which we don't want to split the data further. The final model is a tree-like structure that allows us to predict the output of unseen cases.
Random Forests
https://blog.csdn.net/qq_34841823/article/details/78611023
A random forest is an upgraded version of the decision tree.
Random: both the data samples and the features are sampled randomly, so every tree sees different samples and features, gives a different result, and is independent of the other trees.
Forest: many decision trees placed side by side and processed in parallel.
Once the forest is built and we want to predict, every decision tree in the forest makes its own judgement about which class a new input sample belongs to, and the class chosen most often becomes the prediction for that sample.
Some samples from the overall training set may appear several times in the training set of one tree and never appear in the training set of another.
Why use randomness?
Randomness reduces the influence of anomalous samples and anomalous features on the classification result.
The output of a random forest is decided by voting:
if most of the decision trees think a result is right, it is taken to be right.
The decision trees are independent of each other,
so they can be trained in parallel without taking too long.
Random forests belong to ensemble learning,
which combines several models to solve a problem (for example decision trees plus neural networks);
the models learn and predict independently and then vote on the result,
giving higher accuracy than a single model.
https://www.bilibili.com/video/BV1H5411e73F
https://www.bilibili.com/video/BV11i4y1F7n4
https://www.cnblogs.com/hunttown/p/6927819.html
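A hedged sketch of training such a forest with Spark ML (the column names "features" and "label" are placeholder assumptions, not from the coursework):
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=50,                    # size of the forest
                            featureSubsetStrategy="sqrt")   # random subset of features considered per split
# rfModel = rf.fit(train)   # train: a DataFrame with a "features" vector column and a "label" column
# rfModel.transform(test)   # each tree votes; the majority class becomes the prediction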
Naive Bayes
'Naive' = assuming that all features are mutually independent.
Independent features: "this dog is female" and "this dog is a Pomeranian". Dependent features: "Curry shoots accurately" and "Curry is a basketball player".
For spam detection, every word in an email is a feature.
Neural networks
https://www.bilibili.com/video/BV1aV411z7FS
https://www.bilibili.com/read/cv6962864
https://www.bilibili.com/video/BV1hW411b7Fk
https://www.bilibili.com/video/BV18W411b72E
https://www.bilibili.com/video/BV18W411b7S2
https://www.bilibili.com/video/BV1jW411B7gp
A data structure in which information is passed through layer-by-layer connections;
it has nodes, inputs and outputs.
The connections between the nodes carry weight parameters and form edges.
Neurons (nodes):
a network made up of many neurons is called a neural network.
The output of each layer feeds the next:
input layer -> hidden layers (n of them) -> output layer.
Convolutional Neural Network (CNN)
https://www.bilibili.com/video/BV1s541187Qk
Recurrent Neural Network (RNN)
Batch Normalization
Regression
In data mining, regression is a kind of supervised learning, which aims to learn a function that maps a number of input variables with a target output (which is a real number). It is called supervised learning because we assume that we have some initial data from which we can learn that mapping between the variables.
The full name of regression is "regression towards the mean". Given a set of points on a plot,
we need to find a line
that lies as much as possible in the middle of all the points;
that process is regression.
Why look for this line?
Because people hope to use it to predict the future.
The "finding relationships" examples below are borrowed from Liqun, a PhD from UCL in the UK.
A simple relationship:
a car uses 10 litres of fuel per 100 km,
so driving 200 km needs 20 litres of fuel; this is direct proportionality.
A more complex relationship:
a player wins 5 ranked games of Honor of Kings in a row and earns 5 stars,
leaving only 5 more stars to the next rank;
they assume that by playing on they will rank up,
but what follows is a 10-game losing streak (losing a game costs a star).
What is the relationship between the number of games played and the number of ranking stars?
We can plot the result of every game:
the x-axis is the game number and the y-axis is the number of stars.
Connecting the points gives a jagged, up-and-down curve,
which does not let us make any prediction.
At this point we need a line that lies as much as possible in the middle of all the points
to represent this player's trend,
so that we can anticipate how later games will go.
Regression is the process of finding, among a set of seemingly unrelated points, a line that lies as much as possible among all of them, making the overall relationship clearly visible.
Only a line can show the trend; scattered points cannot.
Linear Regression
https://www.bilibili.com/video/BV17T4y1J7SB
https://www.bilibili.com/video/BV1w5411a7wH/?spm_id_from=333.788.recommend_more_video.4
https://www.bilibili.com/video/BV1YW411j7Vr?from=search&seid=14334434471568281637
https://www.bilibili.com/video/BV1Ft411b7uR?from=search&seid=14334434471568281637
https://www.bilibili.com/video/BV1pz4y1r7zQ?from=search&seid=5546280859464872641
https://www.bilibili.com/video/BV1rK4y1x748?from=search&seid=17339009151011043803
https://my.oschina.net/u/4305185/blog/3848949
https://zhuanlan.zhihu.com/p/225650671
https://blog.csdn.net/ivy_reny/article/details/78599523
https://zhuanlan.zhihu.com/p/33877902
https://blog.csdn.net/weixin_40161254/article/details/89447860
https://zhuanlan.zhihu.com/p/45023349
https://blog.csdn.net/weixin_45717457/article/details/104304784
https://developers.google.com/machine-learning/crash-course/framing/ml-terminology?hl=zh-cn
https://zhuanlan.zhihu.com/p/105363846
https://blog.csdn.net/Android_xue/article/details/90899659
https://www.jianshu.com/p/9c0d4bac8a67
https://www.jianshu.com/p/c6f15063d521
https://www.freesion.com/article/34011206778/
https://blog.csdn.net/u013719780/article/details/51822346
If the line we are looking for is a straight line, this is linear regression.
Linear regression is the process of finding, among a set of seemingly unrelated points, a straight line that lies as much as possible among all of them, making the overall relationship clearly visible.
We can take a first-degree function y = kx + b as the hypothesis function; given a set of points, we work out the values of k and b and thereby determine a straight line that lies as much as possible in the middle of all the points. Finding the line by determining its parameters is linear regression.
The position of the line is determined by its parameters (k and b),
so finding the line is the same as solving for the parameters.
The method of undetermined coefficients:
- Assume y = kx + b;
- Substitute: plug the two points (0,5) and (5,10) into the function,
giving 5 = k * 0 + b and 10 = k * 5 + b. Why two points? Because two points determine a straight line; - Solve for the parameters: k = 1, b = 5; substituting back into the hypothesis function gives y = x + 5;
If reaching the King rank requires 25 stars, then:
25 = x + 5
x = 20
The player needs to play 20 more games to reach the King rank.
Machine learning follows the same steps as the method of undetermined coefficients:
- assume a hypothesis function, y = kx + b;
- substitute all the data points into the hypothesis function, which yields something like a system of equations (an overdetermined system);
- solve for the parameters: after substitution we do not solve the equations by hand; a computer algorithm finds the parameters, and substituting them back into the hypothesis function gives an expression with fixed parameters, i.e. the model, which can then be used to predict the future.
Suppose a linear regression algorithm gives
k = 0.4727 and b = 6; substituting back into the hypothesis function
gives y = 0.4727 * x + 6.
If reaching the King rank requires 25 stars, then:
25 = 0.4727 * x + 6,
x ≈ 40.2, i.e. roughly 40 more games.
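The same idea in a minimal scikit-learn sketch, with a few made-up (game number, stars) points of my own:
# Let scikit-learn find k and b, then estimate the game at which 25 stars is reached.
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([[1], [2], [3], [4], [5], [6]])       # game number
y = np.array([6.5, 7.0, 7.4, 8.0, 8.3, 8.9])       # stars after that game (made up)
reg = LinearRegression().fit(x, y)
print(reg.coef_[0], reg.intercept_)                # the fitted k and b
print((25 - reg.intercept_) / reg.coef_[0])        # games needed to reach 25 stars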
Loss function
https://www.bilibili.com/video/BV1d7411T7Z6
https://www.bilibili.com/video/BV1X541167yA?p=3
Gradient descent
https://www.bilibili.com/video/BV1CE411E7PE
https://www.bilibili.com/video/BV1X541167yA?p=4
Multivariate linear regression
https://www.bilibili.com/video/BV1eV411o7jd
Logistic Regression
Linear regression and logistic regression are the two most fundamental algorithms in machine learning.
https://www.bilibili.com/video/BV1gA411i7SR/?spm_id_from=autoNext
A logistic regression model only predicts values between 0 and 1, which express the probability of an event occurring, so it is used mostly for classification: good vs bad, buy vs sell, yes vs no, and so on.
It can be used to compute probabilities.
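A minimal sketch (my own toy data) showing how logistic regression turns a probability into a class decision:
from sklearn.linear_model import LogisticRegression
X = [[1], [2], [3], [10], [11], [12]]   # a single made-up feature
y = [0, 0, 0, 1, 1, 1]                  # two classes, e.g. "no" / "yes"
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[6]]))         # probability of each class, between 0 and 1
print(clf.predict([[6]]))               # final 0/1 decision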
Gradient-Boosted Tree Regression
K nearest neighbour
https://www.bilibili.com/video/BV13K411H7Zs
KNN
The k-nearest-neighbour algorithm, a classification algorithm.
Unsupervised learning
k-means
https://www.bilibili.com/video/BV1ei4y1V7hX/?spm_id_from=autoNext
Semi-supervised learning
Reinforcement learning
Transfer learning
4. Train the model
Using the chosen model,
learn from the training dataset
to determine the parameters that the model leaves unknown.
For example,
the exponential model y = a * exp(b * x),
i.e. y equals a times the constant e raised to the power b times x:
neither a nor b is known at the start,
and we use the training dataset
to let the computer work out the values of a and b for us.
The rough shape or pattern of a model is fixed,
but it contains some parts that are not fixed,
which is what gives the model its generality;
if everything in the model were fixed,
the model would have no generality at all.
The parts of a model that are allowed to vary are usually called parameters.
Training a model really just means determining the most suitable model parameters from real business data.
Once the model is trained, the most suitable parameters have been found.
Once the optimal parameters are found, the model is essentially usable.
Of course, finding the optimal model parameters is generally difficult. How do we search for them? This is where algorithms come in, and analysis tools can also be used.
A good algorithm should run fast and have low complexity,
so that it converges quickly
and finds globally optimal parameters;
otherwise training takes too long and is inefficient,
and may end up with only locally optimal parameters, which is hard to tolerate.
5. Evaluate the model
Next we need to test how good the model is:
was it trained well?
Can it represent the patterns in the existing data?
We need some evaluation metrics,
for example
the coefficient of determination R^2 (R squared),
a number between 0 and 1; the closer to 1, the better the model.
In one prediction exercise, the R^2 value over several runs was 0.9996.
Other criteria include the mean squared error (MSE), and so on.
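A minimal sketch (toy numbers of my own) of computing these two metrics with scikit-learn:
from sklearn.metrics import r2_score, mean_squared_error
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.8, 5.1, 6.9, 9.3]
print(r2_score(y_true, y_pred))            # closer to 1 means a better fit
print(mean_squared_error(y_true, y_pred))  # lower is better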
Evaluating the model means judging its quality and deciding whether it is useful.
How good a model is has to be evaluated in a specific business scenario,
i.e. only on a specific dataset can we tell which model is better or worse.
To evaluate a model we need evaluation metrics. For numeric prediction models, common quality metrics include the mean error rate and the coefficient of determination R^2; for classification models, common metrics include accuracy, recall, precision, the ROC curve and the AUC value.
For classification models, accuracy, recall and similar metrics should generally be as high as possible, ideally close to 100%, indicating a good model with no misclassifications.
In a real business scenario, the evaluation metrics are computed on the test set, not the training set.
So when building a model, the original dataset is usually split into two parts: one used to train the model, called the training set, and one used to evaluate it, called the test set.
You might wonder why two different datasets are needed for evaluation and whether the training set alone would do. In theory it would not, because the model is built from the training set, so in theory it is bound to perform reasonably well on it.
However, it turns out that a model that predicts well on the training set does not necessarily predict well in the real business scenario (a phenomenon called overfitting).
Keeping the training set and the test set separate, one for training and one for evaluation, lets us detect overfitting in advance.
If the predictive performance on the training set and the test set is similar, the model quality is decent and it can probably be used directly.
If the performance on the training set and the test set differs a lot,
the model still has room for improvement.
A single validation run is not enough to assess a model accurately;
cross-validation can be used to evaluate it several times
and find an accurate estimate of the model error.
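A minimal sketch of such a cross-validation (reusing the X and y pandas data loaded earlier in these notes; the scoring choice is my own):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
scores = cross_val_score(DecisionTreeRegressor(random_state=0, max_depth=2), X, y, cv=5, scoring="r2")
print(scores.mean(), scores.std())   # average quality and its spread across the 5 folds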
In fact, model evaluation happens in two business settings:
one is validation against past business data, i.e. the test set
(the model was built from past data in the first place);
the other is validation against data from the real business scenario,
i.e. checking the model's real application results in the model-application step.
6. Tune the model
If the training results are not good enough,
the model or some of its hyperparameters need to be adjusted.
Model optimisation usually happens in two situations: first, during evaluation, if the model is found to underfit or overfit, it needs optimising;
second, in the real application, optimisation is done periodically, or whenever the model is found to perform poorly in the real business scenario.
If evaluation shows the model underfits (i.e. performs poorly) or overfits, the model is not usable and needs optimising. Model optimisation can
take several forms:
1) choose a new model;
2) add new factors to the model;
3) tune the model's thresholds to their optimum;
4) do more pre-processing of the raw data, for example deriving new variables.
Different models are optimised in different ways. For a regression model, for example, you may need to consider the influence of outliers and test for non-linearity and collinearity; for a classification model, optimisation is mostly a matter of adjusting thresholds to balance precision and generality. Meta-algorithms can also be used to optimise a model: training several weak models to build one strong model (three cobblers together match Zhuge Liang).
In fact, model optimisation covers not only the model itself but also how the raw data is processed; effective pre-processing can, to some extent, lower the demands placed on the model. So when every model you try performs poorly, remember that the cause may be that your dataset has not been pre-processed effectively, or that the right key factors (independent variables) have not been found.
No single model fits every business scenario, and it is unlikely that some off-the-shelf model fits your scenario exactly: good models are the product of optimisation!
Finally, just as with the standard data-mining process, these model-building steps are not one-way but form a loop: when the model is found lacking, it needs optimising, and you may have to go back to the very beginning and rethink. Even once the model is usable, it needs periodic maintenance and optimisation so that it keeps fitting new business scenarios.
7. Predict
The previous six steps all exist to serve this one:
by this point the model has been trained on the existing data,
and we can use it to predict future trends.
If the evaluated model quality is within an acceptable range
and there is no overfitting, we can start applying the model.
In this step the usable model is developed
and deployed in the data-analysis system,
which can then produce analysis templates and visualised results,
enabling automated data-analysis reports.
Applying the model means using it in the real business scenario.
The whole point of building a model is to solve business problems, such as predicting customer behaviour or segmenting customers.
Of course, while the model is being applied,
the predicted results and the actual business outcomes still need to be collected,
both to check the model's effectiveness in the real business scenario
and to feed the subsequent optimisation of the model.