Apache Spark 2.2.0 released; all 2.x users are advised to upgrade

言则yanze

Tags: Big Data, Spark

Link to this article: https://blog.csdn.net/imgxr/article/details/80130522

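Apache Spark 2.2.0 is the third release on the 2.x line. This release removes the experimental tag from Structured Streaming, resolves more than 1,100 tickets, and focuses on usability, stability, and performance.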

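All 2.x users are advised to upgrade to 2.2.0. The release is available from the downloads page, and more details can be found in JIRA. The changes below are grouped by major module: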

Core and Spark SQL
Structured Streaming
MLlib
SparkR
GraphX
Deprecations
Changes of behavior
Known issues

Core and Spark SQL

API updates

SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
SPARK-16475: Added Broadcast Hints BROADCAST, BROADCASTJOIN, and MAPJOIN, for SQL Queries
SPARK-18350: Support session local timezone
SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
SPARK-20420: Add events to the external catalog
SPARK-18127: Add hooks and extension points to Spark
SPARK-20576: Support generic hint function in Dataset/DataFrame
SPARK-17203: Data source options should always be case insensitive
SPARK-19139: AES-based authentication mechanism for Spark
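
SPARK-16475 in the list above adds hints that ask the planner to broadcast one side of a join. As a rough illustration of the idea behind a broadcast hash join, and not Spark's implementation, the sketch below builds a hash table from the small input and probes it once per row of the large input. The function name `broadcast_hash_join` and the sample rows are hypothetical.

```python
from collections import defaultdict

def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the small side."""
    # Build phase: index the small side by join key. In a cluster, this
    # table is the piece that would be copied to every worker.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    # Probe phase: stream the large side and look up matches.
    joined = []
    for row in large_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Hypothetical sample data.
orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 15},
    {"cust_id": 1, "amount": 5},
    {"cust_id": 9, "amount": 7},  # no matching customer
]
customers = [{"cust_id": 1, "name": "a"}, {"cust_id": 2, "name": "b"}]

result = broadcast_hash_join(orders, customers, "cust_id")
print(result)
```

The large side is never shuffled or sorted here, which is why broadcasting pays off only when one input is small enough to copy everywhere.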

Performance and stability

Cost-Based Optimizer
- SPARK-17075, SPARK-17076, SPARK-19020, SPARK-17077, SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
- SPARK-17080: Cost-based join re-ordering
- SPARK-17626: TPC-DS performance improvements using star-schema heuristics
SPARK-17949: Introduce a JVM object based aggregate operator
SPARK-18186: Partial aggregation support of HiveUDAFFunction
SPARK-18362, SPARK-19918: File listing/IO improvements for CSV and JSON
SPARK-18775: Limit the max number of records written per file
SPARK-18761: Uncancellable / unkillable tasks shouldn’t starve jobs of resources
SPARK-15352: Topology aware block replication
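
The cost-based optimizer entries above depend on cardinality estimates for operators such as filter and join. The sketch below uses textbook formulas under the assumptions of uniformly distributed values and a single join column; it is not Spark's actual estimator, and the `TableStats` class and the numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    rows: int  # estimated row count
    ndv: int   # number of distinct values in the filter/join column

def estimate_equality_filter(stats: TableStats) -> float:
    # `col = constant` keeps about rows / ndv rows under uniformity.
    return stats.rows / stats.ndv

def estimate_equi_join(left: TableStats, right: TableStats) -> float:
    # Classic estimate: |L| * |R| / max(ndv_L, ndv_R).
    return left.rows * right.rows / max(left.ndv, right.ndv)

# Hypothetical statistics for a fact table and a dimension table.
fact = TableStats(rows=1_000_000, ndv=10_000)
dim = TableStats(rows=10_000, ndv=10_000)

filtered = estimate_equality_filter(fact)
joined = estimate_equi_join(fact, dim)
print(filtered, joined)
```

A join-reordering step can compare estimates like these for alternative join orders and prefer the order with the smaller intermediate results.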

Other notable changes

SPARK-18352: Support for parsing multi-line JSON files
SPARK-19610: Support for parsing multi-line CSV files
SPARK-21079: Analyze Table Command on partitioned tables
SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
SPARK-18209: More robust view canonicalization without full SQL expansion
SPARK-13446, SPARK-18112: Support reading data from Hive metastore 2.0/2.1
SPARK-18191: Port RDD API to use commit protocol
SPARK-8425: Add blacklist mechanism for task scheduling
SPARK-19464: Remove support for Hadoop 2.5 and earlier
SPARK-19493: Remove Java 7 support
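
SPARK-18352 in the list above concerns JSON files in which a single record spans several lines. The stdlib-only sketch below, which is not Spark code, shows why line-by-line parsing works for one-record-per-line input but fails on a pretty-printed file, while parsing the whole document succeeds. The sample strings are made up.

```python
import json

# One record per line versus one pretty-printed document.
json_lines = '{"id": 1}\n{"id": 2}\n'
multi_line = '[\n  {"id": 1},\n  {\n    "id": 2\n  }\n]\n'

def parse_per_line(text):
    """Parse each non-empty line as its own JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def parse_whole(text):
    """Parse the entire text as a single JSON document."""
    return json.loads(text)

records = parse_per_line(json_lines)

try:
    parse_per_line(multi_line)
    per_line_ok = True
except json.JSONDecodeError:
    # A line such as "[" is not valid JSON on its own.
    per_line_ok = False

whole = parse_whole(multi_line)
print(records, per_line_ok, whole)
```

The trade-off is that a reader can split line-delimited input at line boundaries, whereas a multi-line document has to be handed to the parser as a whole.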

Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide

Structured Streaming

General Availability

SPARK-20844: The Structured Streaming APIs are now GA and are no longer labeled experimental

Kafka improvements

SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka
SPARK-19968: Cached producer for lower-latency Kafka-to-Kafka streams

API updates

SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
SPARK-19876: Support for one time triggers
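
SPARK-19067 in the list above covers stateful processing with timeouts: a query keeps user-defined state per key across triggers and drops it once the key has been idle too long. The following plain-Python sketch shows that pattern only. It neither uses nor mirrors the Spark API, and `update_sessions`, the `timeout` parameter, and the event tuples are all hypothetical.

```python
def update_sessions(state, batch, now, timeout):
    """Fold one micro-batch of (key, value) events into per-key state.

    `state` maps key -> {"count": int, "last_seen": number}.
    Returns the keys whose state expired during this call.
    """
    for key, _value in batch:
        entry = state.setdefault(key, {"count": 0, "last_seen": now})
        entry["count"] += 1
        entry["last_seen"] = now
    # Expire keys that have been idle for longer than `timeout`.
    expired = [k for k, e in state.items() if now - e["last_seen"] > timeout]
    for k in expired:
        del state[k]
    return expired

state = {}
update_sessions(state, [("a", 1), ("b", 1), ("a", 2)], now=0, timeout=10)
first = {k: v["count"] for k, v in state.items()}

# Only "a" is seen again; "b" has been idle for 15 time units.
expired = update_sessions(state, [("a", 3)], now=15, timeout=10)
second = {k: v["count"] for k, v in state.items()}
print(first, expired, second)
```

Passing `now` explicitly keeps the sketch deterministic; a real streaming engine would supply processing time or event time itself.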

Other notable changes

SPARK-20979: Rate source for testing and benchmarks

Programming guide: Structured Streaming Programming Guide

MLlib

New algorithms in the DataFrame-based API

SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
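
SPARK-13568 in the list above adds a feature transformer that fills in missing values. As a minimal stdlib sketch of the technique, and not the MLlib API, the function below replaces missing entries in a column with the mean or median of the observed values. The name `impute` and the sample column are hypothetical.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries.
column = [1.0, None, 3.0, 8.0, None]
by_mean = impute(column, "mean")
by_median = impute(column, "median")
print(by_mean, by_median)
```

The median is less sensitive to outliers such as the 8.0 above, which is why the two strategies produce different fill values here.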

Existing algorithms added to the Python and R APIs

SPARK-18239: Gradient Boosted Trees (R)
SPARK-18821: Bisecting K-Means (R)
SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)

Major bug fixes

SPARK-19110: DistributedLDAModel.logPrior correctness fix
SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
SPARK-18715: Fix wrong AIC calculation in Binomial GLM
SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
SPARK-20047: Box-constrained Logistic Regression

Programming guide: Machine Learning Library (MLlib) Guide

SparkR

The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:

Major features

SPARK-19654: Structured Streaming API for R
SPARK-20159: Support complete Catalog API in R
SPARK-19795: column functions to_json, from_json
SPARK-19399: Coalesce on DataFrame and coalesce on column
SPARK-20020: Support DataFrame checkpointing
SPARK-18285: Multi-column approxQuantile in R

Programming guide: SparkR (R on Spark)

GraphX

Bug fixes

SPARK-18847: PageRank gives incorrect results for graphs with sinks
SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results in ClassCastException
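
SPARK-18847 in the list above concerns graphs with sinks, that is, vertices with no outgoing edges, which leak rank on every iteration if they are not handled. A common textbook remedy, sketched below in plain Python, is to spread the rank held by sinks uniformly over all vertices so that the total stays constant. This illustrates the problem only and is not GraphX's implementation; the function and the sample graph are hypothetical.

```python
def pagerank(edges, nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank that redistributes rank held by sinks."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank sitting on sinks would otherwise vanish each iteration.
        sink_mass = sum(rank[v] for v in nodes if not out[v])
        nxt = {v: (1.0 - damping) / n + damping * sink_mass / n for v in nodes}
        for v in nodes:
            for dst in out[v]:
                nxt[dst] += damping * rank[v] / len(out[v])
        rank = nxt
    return rank

# Hypothetical chain graph a -> b -> c, where "c" is a sink.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
ranks = pagerank(edges, nodes)
total = sum(ranks.values())
print(ranks, total)
```

Without the `sink_mass` term the total rank would shrink on every iteration, since everything that flows into "c" would be lost.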

Optimizations

SPARK-18845: PageRank initial value improvement for faster convergence
SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError

Programming guide: GraphX Programming Guide

Deprecations

MLlib

SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.

SparkR

SPARK-20195: deprecate createExternalTable

Changes of behavior

MLlib

SPARK-19787: DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API’s default regParam setting.

SparkR

SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.