Spark: Features by Version

0.3

Save Operations
You can now save distributed datasets to the Hadoop filesystem (HDFS), Amazon S3, Hypertable, and any other storage system supported by Hadoop. There are convenience methods for several common formats, like text files and SequenceFiles. For example, to save a dataset as text files, call saveAsTextFile with an output path.
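
A minimal sketch of the text-file save path, assuming the Spark shell's existing SparkContext sc; the HDFS URL and the data are placeholders:

    // an RDD of strings built from local data for illustration
    val data = sc.parallelize(Seq("alpha", "beta", "gamma"))

    // writes part-* text files under the given directory; any Hadoop-supported
    // path scheme (hdfs://, s3n://, local file://) should work the same way
    data.saveAsTextFile("hdfs://namenode:9000/output/strings")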

Faster Broadcast & Shuffle
This release includes new broadcast and shuffle algorithms, described in the accompanying research paper, to better support applications that communicate large amounts of data.

Support for Non-Filesystem Hadoop Input Formats
The new SparkContext.hadoopRDD method allows reading data from Hadoop-compatible storage systems other than file systems, such as HBase, Hypertable, etc.

Outer join operators (leftOuterJoin, rightOuterJoin, etc).

Support for Scala 2.9 interpreter features (history search, Ctrl-C current line, etc.) in the Scala 2.9 build.

Better default levels of parallelism for various operations.

Ability to control number of splits in a file.

1.0.0

API Stability
Spark 1.0.0 is the first release in the 1.X major line. Spark guarantees stability of its core API for all 1.X releases. Historically Spark has already been very conservative with API changes, but this guarantee codifies our commitment to application writers. The project has also clearly annotated experimental, alpha, and developer APIs to provide guidance on future API changes of newer components.

Integration with YARN Security
For users running in secured Hadoop environments, Spark now integrates with the Hadoop/YARN security model. Spark will authenticate job submission, securely transfer HDFS credentials, and authenticate communication between components.

Operational and Packaging Improvements
This release significantly simplifies the process of bundling and submitting a Spark application. A new spark-submit tool allows users to submit an application to any Spark cluster, including local clusters, Mesos, or YARN, through a common process. The documentation for bundling Spark applications has been substantially expanded. We’ve also added a history server for Spark’s web UI, allowing users to view Spark application data after individual applications are finished.

Spark SQL
This release introduces Spark SQL as a new alpha component. Spark SQL provides support for loading and manipulating structured data in Spark, either from external structured data sources (currently Hive and Parquet) or by adding a schema to an existing RDD. Spark SQL's API interoperates with the RDD data model, allowing users to interleave Spark code with SQL statements. Under the hood, Spark SQL uses the Catalyst optimizer to choose an efficient execution plan, and can automatically push predicates into storage formats like Parquet. In future releases, Spark SQL will also provide a common API to other storage systems.
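
A minimal sketch of the alpha API, assuming a comma-separated people.txt file and the shell's SparkContext sc; registerAsTable was the 1.0-era method name (later renamed registerTempTable):

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion RDD -> SchemaRDD

    // add a schema to an ordinary RDD and expose it to SQL
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")

    // interleave SQL with regular Spark code
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)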

MLlib Improvements
In 1.0.0, Spark’s MLlib adds support for sparse feature vectors in Scala, Java, and Python. It takes advantage of sparsity in both storage and computation in linear methods, k-means, and naive Bayes. In addition, this release adds several new algorithms: scalable decision trees for both classification and regression, distributed matrix algorithms including SVD and PCA, model evaluation functions, and L-BFGS as an optimization primitive. The MLlib programming guide and code examples have also been greatly expanded.

GraphX and Streaming Improvements
In addition to usability and maintainability improvements, GraphX in Spark 1.0 brings substantial performance boosts in graph loading, edge reversal, and neighborhood computation. These operations now require less communication and produce simpler RDD graphs. Spark’s Streaming module has added performance optimizations for stateful stream transformations, along with improved Flume support, and automated state cleanup for long running jobs.

Extended Java and Python Support
Spark 1.0 adds support for Java 8's new lambda syntax in its Java bindings. Java 8 supports a concise syntax for writing anonymous functions, similar to the closure syntax in Scala and Python. Supporting this required small changes to the current Java API, which are noted in the documentation. Spark's Python API has been extended to support several new functions. We've also included several stability improvements in the Python API, particularly for large datasets. PySpark now supports running on YARN as well.

Documentation
Spark’s programming guide has been significantly expanded to centrally cover all supported languages and discuss more operators and aspects of the development life cycle. The MLlib guide has also been expanded with significantly more detail and examples for each algorithm, while documents on configuration, YARN and Mesos have also been revamped.

PySpark now works with more Python versions than before – Python 2.6+ instead of 2.7+, and NumPy 1.4+ instead of 1.7+.
Spark has upgraded to Avro 1.7.6, adding support for Avro specific types.

Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs.

Support for off-heap storage in Tachyon has been added via a special build target.

Datasets persisted with DISK_ONLY now write directly to disk, significantly improving memory usage for large datasets.
Intermediate state created during a Spark job is now garbage collected when the corresponding RDDs become unreferenced, improving performance.

Spark now includes a Javadoc version of all its API docs and a unified Scaladoc for all modules.

A new SparkContext.wholeTextFiles method lets you operate on small text files as individual records.
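
A short sketch of wholeTextFiles, assuming the shell's SparkContext sc and a directory of small files (the path is a placeholder):

    import org.apache.spark.SparkContext._   // pair-RDD operations (mapValues) on older releases

    // each file becomes a single (path, contents) record instead of one record per line
    val files = sc.wholeTextFiles("hdfs://namenode:9000/input/small-files")
    val sizes = files.mapValues(_.length)
    sizes.collect().foreach { case (path, n) => println(path + " -> " + n + " characters") }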
Migrating to Spark 1.0

While most of the Spark API remains the same as in 0.x versions, a few changes have been made for long-term flexibility, especially in the Java API (to support Java 8 lambdas). The documentation includes migration information to upgrade your applications.

1.1

Spark 1.1.0 is the first minor release on the 1.X line. This release brings operational and performance improvements in Spark core along with significant extensions to Spark’s newest libraries: MLlib and Spark SQL. It also builds out Spark’s Python support and adds new components to the Spark Streaming module. Spark 1.1 represents the work of 171 contributors, the most to ever contribute to a Spark release!

Performance and Usability Improvements
Across the board, Spark 1.1 adds features for improved stability and performance, particularly for large-scale workloads. Spark now performs disk spilling for skewed blocks during cache operations, guarding against memory overflows if a single RDD partition is large. Disk spilling during aggregations, introduced in Spark 1.0, has been ported to PySpark. This release introduces a new shuffle implementation optimized for very large scale shuffles. This "sort-based shuffle" will become the default in the next release, and is now available to users; for jobs with large numbers of reducers, we recommend turning it on. This release also adds several usability improvements for monitoring the performance of long running or complex jobs. Among the changes are better named accumulators that display in Spark's UI, dynamic updating of metrics for in-progress tasks, and reporting of input metrics for tasks that read input data.
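
A sketch of opting in via SparkConf; the configuration key matches the one whose default changes in 1.2 (noted later in these notes), so treat the exact value as an assumption for 1.1:

    import org.apache.spark.{SparkConf, SparkContext}

    // enable the new sort-based shuffle for jobs with many reducers
    val conf = new SparkConf()
      .setAppName("many-reducers-job")
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)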

Spark SQL
Spark SQL adds a number of new features and performance improvements in this release. A JDBC/ODBC server allows users to connect to Spark SQL from many different applications and provides shared access to cached tables. A new module provides support for loading JSON data directly into Spark's SchemaRDD format, including automatic schema inference. Spark SQL introduces dynamic bytecode generation in this release, a technique which significantly speeds up execution for queries that perform complex expression evaluation. This release also adds support for registering Python, Scala, and Java lambda functions as UDFs, which can then be called directly in SQL. Spark 1.1 adds a public types API to allow users to create SchemaRDDs from custom data sources. Finally, many optimizations have been added to the native Parquet support as well as throughout the engine.
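
A sketch of the new JSON loading with schema inference, assuming a file of one JSON object per line and an existing SQLContext named sqlContext; method names follow the 1.1-era API:

    // load JSON directly into a SchemaRDD; the schema is inferred automatically
    val people = sqlContext.jsonFile("people.json")
    people.printSchema()

    // make it queryable from SQL
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age >= 13").collect().foreach(println)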

MLlib
MLlib adds several new algorithms and optimizations in this release. 1.1 introduces a new library of statistical packages which provides exploratory analytic functions. These include stratified sampling, correlations, chi-squared tests and support for creating random datasets. This release adds utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling). Also new is support for nonnegative matrix factorization and SVD via Lanczos. The decision tree algorithm has been added in Python and Java. A tree aggregation primitive has been added to help optimize many existing algorithms. Performance improves across the board in MLlib 1.1, with improvements of around 2-3X for many algorithms and up to 5X for large scale decision tree problems.

GraphX and Spark Streaming
Spark Streaming adds a new data source: Amazon Kinesis. For Apache Flume, a new mode is supported which pulls data from Flume, simplifying deployment and providing high availability. The first of a set of streaming machine learning algorithms is introduced with streaming linear regression. Finally, rate limiting has been added for streaming inputs. GraphX adds custom storage levels for vertices and edges along with improved numerical precision across the board. GraphX also adds a new label propagation algorithm.

PySpark now allows reading and writing arbitrary Hadoop InputFormats, including SequenceFiles, HBase, Cassandra, Avro, and other data sources
Stage resubmissions are now handled gracefully in the Spark UI
Spark supports tight firewall rules for all network ports
An overflow bug in GraphX has been fixed that affects graphs with more than 4 billion vertices
Upgrade Notes
Spark 1.1.0 is backwards compatible with Spark 1.0.X. Some configuration option defaults have changed which might be relevant to existing users:

The default value of spark.io.compression.codec is now snappy for improved memory usage. Old behavior can be restored by switching to lzf.
The default value of spark.broadcast.factory is now org.apache.spark.broadcast.TorrentBroadcastFactory for improved efficiency of broadcasts. Old behavior can be restored by switching to org.apache.spark.broadcast.HttpBroadcastFactory.

PySpark now performs external spilling during aggregations. Old behavior can be restored by setting spark.shuffle.spill to false.

PySpark uses a new heuristic for determining the parallelism of shuffle operations. Old behavior can be restored by setting spark.default.parallelism to the number of cores in the cluster.
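
A sketch of restoring the pre-1.1 defaults described above via SparkConf (keys and values are taken directly from the notes):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.io.compression.codec", "lzf")    // pre-1.1 compression codec
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
      .set("spark.shuffle.spill", "false")         // restore pre-1.1 PySpark aggregation behavior
    // spark.default.parallelism can likewise be set to the number of cores in the cluster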

1.2

Spark 1.2.0 is the third release on the 1.X line. This release brings performance and usability improvements in Spark's core engine, a major new API for MLlib, expanded ML support in Python, a full high-availability (H/A) mode in Spark Streaming, and much more. GraphX has seen major performance and API improvements and graduates from an alpha component. Spark 1.2 represents the work of 172 contributors from more than 60 institutions in more than 1000 individual patches.

Spark Core
In 1.2 Spark core upgrades two major subsystems to improve the performance and stability of very large scale shuffles. The first is Spark’s communication manager used during bulk transfers, which upgrades to a netty-based implementation. The second is Spark’s shuffle mechanism, which upgrades to the “sort based” shuffle initially released in Spark 1.1. These both improve the performance and stability of very large scale shuffles. Spark also adds an elastic scaling mechanism designed to improve cluster utilization during long running ETL-style jobs. This is currently supported on YARN and will make its way to other cluster managers in future versions. Finally, Spark 1.2 adds support for Scala 2.11. For instructions on building for Scala 2.11 see the build documentation.

Spark Streaming
This release includes two major feature additions to Spark's streaming library, a Python API and a write ahead log for full driver H/A. The Python API covers almost all the DStream transformations and output operations. Input sources based on text files and text over sockets are currently supported. Support for Kafka and Flume input streams in Python will be added in the next release. Second, Spark Streaming now features H/A driver support through a write ahead log (WAL). In Spark 1.1 and earlier, some buffered (received but not yet processed) data can be lost during driver restarts. To prevent this, Spark 1.2 adds an optional WAL, which buffers received data into a fault-tolerant file system (e.g. HDFS). See the streaming programming guide for more details.

MLlib
Spark 1.2 previews a new set of machine learning APIs in a package called spark.ml that supports learning pipelines, where multiple algorithms are run in sequence with varying parameters. This type of pipeline is common in practical machine learning deployments. The new ML package uses Spark's SchemaRDD to represent ML datasets, providing direct interoperability with Spark SQL. In addition to the new API, Spark 1.2 extends decision trees with two tree ensemble methods: random forests and gradient-boosted trees, among the most successful tree-based models for classification and regression. Finally, MLlib's Python implementation receives a major update in 1.2 to simplify the process of adding Python APIs, along with better Python API coverage.
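
A minimal sketch of the preview pipeline API, assuming a training SchemaRDD with "label" and "text" columns; class and method names follow the era's text-classification example, so treat exact signatures as assumptions:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // three stages run in sequence: tokenize text, hash into features, fit a classifier
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)   // returns a PipelineModel usable with transform()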

Spark SQL
In this release Spark SQL adds a new API for external data sources. This API supports mounting external data sources as temporary tables, with support for optimizations such as predicate pushdown. Spark’s Parquet and JSON bindings have been re-written to use this API and we expect a variety of community projects to integrate with other systems and formats during the 1.2 lifecycle.

Hive integration has been improved with support for the fixed-precision decimal type and Hive 0.13. Spark SQL also adds dynamically partitioned inserts, a popular Hive feature. An internal re-architecting around caching improves the performance and semantics of caching SchemaRDD instances and adds support for statistics-based partition pruning for cached data.

GraphX
In 1.2 GraphX graduates from an alpha component and adds a stable API. This means applications written against GraphX are guaranteed to work with future Spark versions without code changes. A new core API, aggregateMessages, is introduced to replace the now deprecated mapReduceTriplets API. The new aggregateMessages API features a more imperative programming model and improves performance. Some early test users found 20% to 1X performance improvements by switching to the new API.

In addition, Spark now supports graph checkpointing and lineage truncation which are necessary to support large numbers of iterations in production jobs. Finally, a handful of performance improvements have been added for PageRank and graph loading.

PySpark’s sort operator now supports external spilling for large datasets.
PySpark now supports broadcast variables larger than 2GB and performs external spilling during sorts.
Spark adds a job-level progress page in the Spark UI, a stable API for progress reporting, and dynamic updating of output metrics as jobs complete.
Spark now has support for reading binary files for images and other binary formats.
Upgrading to Spark 1.2
Spark 1.2 is binary compatible with Spark 1.0 and 1.1, so no code changes are necessary. This excludes APIs marked explicitly as unstable. Spark changes default configuration in a handful of cases for improved performance. Users who want to preserve identical configurations to Spark 1.1 can roll back these changes.

spark.shuffle.blockTransferService has been changed from nio to netty
spark.shuffle.manager has been changed from hash to sort
In PySpark, the default batch size has been changed to 0, which means the batch size is chosen based on the size of the objects. Pre-1.2 behavior can be restored using SparkContext([… args… ], batchSize=1024).
Spark SQL has changed the following defaults:
spark.sql.parquet.cacheMetadata: false -> true
spark.sql.parquet.compression.codec: snappy -> gzip
spark.sql.hive.convertMetastoreParquet: false -> true
spark.sql.inMemoryColumnarStorage.compressed: false -> true
spark.sql.inMemoryColumnarStorage.batchSize: 1000 -> 10000
spark.sql.autoBroadcastJoinThreshold: 10000 -> 10485760 (10 MB)
Known Issues
A few smaller bugs did not make the release window. They will be fixed in Spark 1.2.1:

Netty shuffle does not respect secured port configuration. Workaround: revert to nio shuffle: SPARK-4837
java.io.FileNotFoundException when creating EXTERNAL Hive tables. Workaround: set hive.stats.autogather = false: SPARK-4892
Exceptions from PySpark's zip function on text-file inputs: SPARK-4841
MetricsServlet not properly initialized: SPARK-4595

1.3

Spark 1.3.0 is the fourth release on the 1.X line. This release brings a new DataFrame API alongside the graduation of Spark SQL from an alpha project. It also brings usability improvements in Spark’s core engine and expansion of MLlib and Spark Streaming. Spark 1.3 represents the work of 174 contributors from more than 60 institutions in more than 1000 individual patches.

Spark Core
Spark 1.3 sees a handful of usability improvements in the core engine. The core API now supports multi-level aggregation trees to help speed up expensive reduce operations. Improved error reporting has been added for certain gotcha operations. Spark's Jetty dependency is now shaded to help avoid conflicts with user programs. Spark now supports SSL encryption for some communication endpoints. Finally, real-time GC metrics and record counts have been added to the UI.

DataFrame API
Spark 1.3 adds a new DataFrames API that provides powerful and convenient operators when working with structured datasets. The DataFrame is an evolution of the base RDD API that includes named fields along with schema information. It’s easy to construct a DataFrame from sources such as Hive tables, JSON data, a JDBC database, or any implementation of Spark’s new data source API. Data frames will become a common interchange format between Spark components and when importing and exporting data to other systems. Data frames are supported in Python, Scala, and Java.
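
A short sketch of the DataFrame API, assuming an existing SQLContext named sqlContext and a JSON file of records with "name" and "age" fields; jsonFile was the 1.3-era loader, replaced by read.json in later versions:

    val df = sqlContext.jsonFile("people.json")
    df.printSchema()                       // schema information travels with the data

    df.select("name").show()               // column projection by name
    df.filter(df("age") > 21).show()       // expression-based filtering
    df.groupBy("age").count().show()       // aggregation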

Spark SQL
In this release Spark SQL graduates from an alpha project, providing backwards compatibility guarantees for the HiveQL dialect and stable programmatic APIs. Spark SQL adds support for writing tables in the data sources API. A new JDBC data source allows importing and exporting from MySQL, Postgres, and other RDBMS systems. A variety of small changes have expanded the coverage of HiveQL in Spark SQL. Spark SQL also adds support for schema evolution with the ability to merge compatible schemas in Parquet.

Spark ML/MLlib
In this release Spark MLlib introduces several new algorithms: latent Dirichlet allocation (LDA) for topic modeling, multinomial logistic regression for multiclass classification, Gaussian mixture model (GMM) and power iteration clustering for clustering, FP-growth for frequent pattern mining, and a block matrix abstraction for distributed linear algebra. Initial support has been added for model import/export in an exchangeable format, which will be expanded in future versions to cover more model types in Java/Python/Scala. The implementations of k-means and ALS receive updates that lead to significant performance gains. PySpark now supports the ML pipeline API added in Spark 1.2, as well as gradient-boosted trees and Gaussian mixture models. Finally, the ML pipeline API has been ported to support the new DataFrames abstraction.

Spark Streaming
Spark 1.3 introduces a new direct Kafka API which enables exactly-once delivery without the use of write ahead logs. It also adds a Python Kafka API along with infrastructure for additional Python APIs in future releases. An online version of logistic regression and the ability to read binary records have also been added. For stateful operations, support has been added for loading an initial state RDD. Finally, the streaming programming guide has been updated to include information about SQL and DataFrame operations within streaming applications, and important clarifications to the fault-tolerance semantics.
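
A sketch of the direct Kafka API, assuming an existing StreamingContext named ssc; the broker addresses and topic name are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")

    // no receivers or write ahead logs: the stream tracks Kafka offsets itself
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
    stream.map(_._2).count().print()   // message values only, counted per batch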

GraphX
GraphX adds a handful of utility functions in this release, including conversion into a canonical edge graph.

Upgrading to Spark 1.3
Spark 1.3 is binary compatible with Spark 1.X releases, so no code changes are necessary. This excludes APIs marked explicitly as unstable.

As part of stabilizing the Spark SQL API, the SchemaRDD class has been renamed to DataFrame. Spark SQL’s migration guide describes the upgrade process in detail. Spark SQL also now requires that column identifiers which use reserved words (such as “string” or “table”) be escaped using backticks.

Known Issues
This release has a few known issues which will be addressed in Spark 1.3.1:

SPARK-6194: A memory leak in PySpark's collect().
SPARK-6222: An issue with failure recovery in Spark Streaming.
SPARK-6315: Spark SQL can’t read parquet data generated with Spark 1.1.
SPARK-6247: Errors analyzing certain join types in Spark SQL.

1.4

Spark 1.4.0 is the fifth release on the 1.X line. This release brings an R API to Spark. It also brings usability improvements in Spark’s core engine and expansion of MLlib and Spark Streaming. Spark 1.4 represents the work of more than 210 contributors from more than 70 institutions in more than 1000 individual patches.

SparkR
Spark 1.4 is the first release to package SparkR, an R binding for Spark based on Spark’s new DataFrame API. SparkR gives R users access to Spark’s scale-out parallel runtime along with all of Spark’s input and output formats. It also supports calling directly into Spark SQL. The R programming guide has more information on how to get up and running with SparkR.

Spark Core
Spark core adds a variety of improvements focused on operations, performance, and compatibility:

SPARK-6942: Visualization for Spark DAGs and operational monitoring
SPARK-4897: Python 3 support
SPARK-3644: A REST API for application information
SPARK-4550: Serialized shuffle outputs for improved performance
SPARK-7081: Initial performance improvements in project Tungsten
SPARK-3074: External spilling for Python groupByKey operations
SPARK-3674: YARN support for Spark EC2 and SPARK-5342: Security for long running YARN applications
SPARK-2691: Docker support in Mesos and SPARK-6338: Cluster mode in Mesos
DataFrame API and Spark SQL
The DataFrame API sees major extensions in Spark 1.4 (see the release notes for a full list) with a focus on analytic and mathematical functions. Spark SQL introduces new operational utilities along with support for ORCFile.

SPARK-2883: Support for ORCFile format
SPARK-2213: Sort-merge joins to optimize very large joins
SPARK-5100: Dedicated UI for the SQL JDBC server
SPARK-6829: Mathematical functions in DataFrames
SPARK-8299: Improved error message reporting for DataFrame and SQL
SPARK-1442: Window functions in Spark SQL and DataFrames
SPARK-6231 / SPARK-7059: Improved API support for self joins
SPARK-5947: Partitioning support in Spark’s data source API
SPARK-7320: Rollup and cube functions
SPARK-6117: Summary and descriptive statistics
Spark ML/MLlib
Spark’s ML pipelines API graduates from alpha in this release, with new transformers and improved Python coverage. MLlib also adds several new algorithms.

SPARK-5884: A variety of feature transformers for ML pipelines
SPARK-7381: Python API for ML pipelines
SPARK-5854: Personalized PageRank for GraphX
SPARK-6113: Stabilize DecisionTree and ensembles APIs
SPARK-7262: Binary LogisticRegression with L1/L2 (elastic net)
SPARK-7015: OneVsRest multiclass to binary reduction
SPARK-4588: Add API for feature attributes
SPARK-1406: PMML model evaluation support via MLlib
SPARK-5995: Make ML Prediction Developer APIs public
SPARK-3066: Support recommendAll in matrix factorization model
SPARK-4894: Bernoulli naive Bayes
SPARK-5563: LDA with online variational inference
Spark Streaming
Spark Streaming adds visual instrumentation graphs and significantly improved debugging information in the UI. It also enhances support for both Kafka and Kinesis.

SPARK-7602: Visualization and monitoring in the streaming UI including batch drill down (SPARK-6796, SPARK-6862)
SPARK-7621: Better error reporting for Kafka
SPARK-2808: Support for Kafka 0.8.2.1 and Kafka with Scala 2.11
SPARK-5946: Python API for Kafka direct mode
SPARK-7111: Input rate tracking for Kafka
SPARK-5960: Support for transferring AWS credentials to Kinesis
SPARK-7056: A pluggable interface for write ahead logs
Known Issues
This release has a few known issues which will be addressed in Spark 1.4.1

Python sortBy()/sortByKey() can hang if a single partition is larger than worker memory SPARK-8202
Unintended behavior change of JSON schema inference SPARK-8093
Some ML pipeline components do not correctly implement copy SPARK-8151
Spark-ec2 branch pointer is wrong SPARK-8310
Credits
Test Partners
Thanks to the following organizations, which helped benchmark or integration-test release candidates:
Intel, Palantir, Cloudera, Mesosphere, Huawei, Shopify, Netflix, Yahoo, UC Berkeley and Databricks.

1.5

APIs: RDD, DataFrame and SQL

Consistent resolution of column names (see Behavior Changes section)
SPARK-3947: New experimental user-defined aggregate function (UDAF) interface
SPARK-8300: DataFrame hint for broadcast joins
SPARK-8668: expr function for turning a SQL expression into a DataFrame column
SPARK-9076: Improved support for NaN values
NaN functions: isnan, nanvl
dropna/fillna also fill/drop NaN values in addition to NULL values
Equality test on NaN = NaN returns true
NaN is greater than all other values
In aggregation, NaN values go into one group
SPARK-8828: Sum function returns null when all input values are nulls

Data types
SPARK-8943: CalendarIntervalType for time intervals
SPARK-7937: Support ordering on StructType
SPARK-8866: TimestampType's precision is reduced to 1 microsecond (1us)
SPARK-8159: Added ~100 functions, including date/time, string, math.
SPARK-8947: Improved type coercion and error reporting in plan analysis phase (i.e. most errors should be reported in analysis time, rather than execution time)
SPARK-1855: Memory and local disk only checkpointing support
Backend Execution: DataFrame and SQL
Code generation on by default for almost all DataFrame/SQL functions
Improved aggregation execution in DataFrame/SQL
Cache friendly in-memory hash map layout
Fallback to external-sort-based aggregation when memory is exhausted
Code generation on by default for aggregations
Improved join execution in DataFrame/SQL
Prefer (external) sort-merge join over hash join in shuffle joins (for left/right outer and inner joins), i.e. join data size is now bounded by disk rather than memory
Support using (external) sort-merge join method for left/right outer joins
Support for broadcast outer join
Improved sort execution in DataFrame/SQL
Cache-friendly in-memory layout for sorting
Fallback to external sorting when data exceeds memory size
Code generated comparator for fast comparisons
Native memory management & representation
Compact binary in-memory data representation, leading to lower memory usage
Execution memory is explicitly accounted for, without relying on JVM GC, leading to less GC and more robust memory management

SPARK-8638: Improved performance & memory usage in window functions
Metrics instrumentation, reporting, and visualization
SPARK-8856: Plan visualization for DataFrame/SQL
SPARK-8735: Expose metrics for runtime memory usage in web UI
SPARK-4598: Pagination for jobs with large number of tasks in web UI
Integrations: Data Sources, Hive, Hadoop, Mesos and Cluster Management

Mesos
SPARK-6284: Support framework authentication and Mesos roles
SPARK-6287: Dynamic allocation in Mesos coarse-grained mode
SPARK-6707: User specified constraints on Mesos slave attributes

YARN
SPARK-4352: Dynamic allocation in YARN works with preferred locations
Standalone Cluster Manager
SPARK-4751: Dynamic resource allocation support
SPARK-6906: Improved Hive and metastore support
SPARK-8131: Improved Hive database support

Upgraded the Hive dependency to Hive 1.2
Support connecting to Hive 0.13, 0.14, 1.0/0.14.1, 1.1, and 1.2 metastores
Support partition pruning pushdown into the metastore (off by default; config flag spark.sql.hive.metastorePartitionPruning)
Support persisting data in Hive compatible format in metastore
SPARK-9381: Support data partitioning for JSON data sources
SPARK-5463: Parquet improvements

Upgrade to Parquet 1.7
Speedup metadata discovery and schema merging
Predicate pushdown on by default
SPARK-6774: Support for reading non-standard legacy Parquet files generated by various libraries/systems by fully implementing all backwards-compatibility rules defined in parquet-format spec
SPARK-4176: Support for writing decimal values with precision greater than 18
ORC improvements (various bug fixes)
SPARK-8890: Faster and more robust dynamic partition insert
SPARK-9486: DataSourceRegister interface for external data sources to specify short names
R Language
SPARK-6797: Support for YARN cluster mode in R
SPARK-6805: GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
SPARK-8742: Improved error messages for R
SPARK-9315: Aliases to make DataFrame functions more R-like
Machine Learning and Advanced Analytics
SPARK-8521: New Feature transformers: CountVectorizer, Discrete Cosine transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover, and VectorSlicer.
New Estimators in Pipeline API: SPARK-8600 naive Bayes, SPARK-7879 k-means, and SPARK-8671 isotonic regression.
New Algorithms: SPARK-9471 multilayer perceptron classifier, SPARK-6487 PrefixSpan for sequential pattern mining, SPARK-8559 association rule generation, SPARK-8598 1-sample Kolmogorov-Smirnov test, etc.
Improvements to existing algorithms
LDA: online LDA performance, asymmetric doc concentration, perplexity, log-likelihood, top topics/documents, save/load, etc.
Trees and ensembles: class probabilities, feature importance for random forests, thresholds for classification, checkpointing for GBTs, etc.
Pregel-API: more efficient Pregel API implementation for GraphX.

GMM: distribute matrix inversions.
Model summary for linear and logistic regression.
Python API: distributed matrices, streaming k-means and linear models, LDA, power iteration clustering, etc.
Tuning and evaluation: train-validation split and multiclass classification evaluator.
Documentation: document the release version of public API methods

Spark Streaming
SPARK-7398: Backpressure: Automatic and dynamic rate controlling in Spark Streaming for handling bursty input streams. This allows a streaming pipeline to dynamically adapt to changes in ingestion rates and computation loads. This works with receivers as well as the Direct Kafka approach.
Python API for streaming sources
SPARK-8389: Kafka offsets of Direct Kafka streams available through Python API
SPARK-8564: Kinesis Python API
SPARK-8378: Flume Python API
SPARK-5155: MQTT Python API
SPARK-3258: Python API for streaming machine learning algorithms: K-Means, linear regression, and logistic regression
SPARK-9215: Improved reliability of Kinesis streams: no need to enable write ahead logs for saving and recovering received data across driver failures
Direct Kafka API graduated: Not experimental any more.
SPARK-8701: Input metadata in UI: Kafka offsets, and input files are visible in the batch details UI
SPARK-8882: Better load balancing and scheduling of receivers across cluster
SPARK-4072: Include streaming storage in web UI
Deprecations, Removals, Configs, and Behavior Changes

Spark Core
DAGScheduler’s local task execution mode has been removed
Default driver and executor memory increased from 512m to 1g
Default setting of JVM’s MaxPermSize increased from 128m to 256m
Default logging level of spark-shell changed from INFO to WARN
NIO-based ConnectionManager is deprecated, and will be removed in 1.6

Spark SQL & DataFrames
Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with code generation for expression evaluation. These features can both be disabled by setting spark.sql.tungsten.enabled to false.
Parquet schema merging is no longer enabled by default. It can be re-enabled by setting spark.sql.parquet.mergeSchema to true.
Resolution of strings to columns in Python now supports using dots (.) to qualify the column or access nested values. For example df['table.column.nestedField']. However, this means that if your column name contains any dots you must now escape them using backticks (e.g., table.`column.with.dots`.nested).
In-memory columnar storage partition pruning is on by default. It can be disabled by setting spark.sql.inMemoryColumnarStorage.partitionPruning to false.
Unlimited precision decimal columns are no longer supported, instead Spark SQL enforces a maximum precision of 38. When inferring schema from BigDecimal objects, a precision of (38, 18) is now used. When no precision is specified in DDL then the default remains Decimal(10, 0).
Timestamps are now processed at a precision of 1us, rather than 100ns.
Sum function returns null when all input values are nulls (null before 1.4, 0 in 1.4).
In the SQL dialect, floating point numbers are now parsed as decimal. HiveQL parsing remains unchanged.
The canonical name of SQL/DataFrame functions are now lower case (e.g. sum vs SUM).
Using the DirectOutputCommitter when speculation is enabled has been determined to be unsafe, so this output committer will not be used by Parquet when speculation is on, independent of configuration.
JSON data source will not automatically load new files that are created by other applications (i.e. files that are not inserted to the dataset through Spark SQL). For a JSON persistent table (i.e. the metadata of the table is stored in Hive Metastore), users can use the REFRESH TABLE SQL command or HiveContext's refreshTable method to include those new files in the table. For a DataFrame representing a JSON dataset, users need to recreate the DataFrame, and the new DataFrame will include the new files.

Spark Streaming
New experimental backpressure feature can be enabled by setting the configuration spark.streaming.backpressure.enabled to true.
Write Ahead Log does not need to be enabled for Kinesis streams. The updated Kinesis receiver keeps track of Kinesis sequence numbers received in each batch, and uses that information to re-read the necessary data while recovering from failures.
The number of times receivers are relaunched on failure is no longer limited by the maximum number of Spark task attempts. The system will always try to relaunch receivers after failures until the StreamingContext is stopped.
Improved load balancing of receivers across the executors, even after relaunching.
Enabling checkpointing when using queueStream now throws an exception, since queueStream cannot be checkpointed. However, this was found to break certain existing applications, so the change will be reverted in Spark 1.5.1.
MLlib
In the spark.mllib package, there are no breaking API changes but some behavior changes:

SPARK-9005: RegressionMetrics.explainedVariance returns the average regression sum of squares.
SPARK-8600: NaiveBayesModel.labels become sorted.
SPARK-3382: GradientDescent has a default convergence tolerance 1e-3, and hence iterations might end earlier than 1.4.
In the experimental spark.ml package, there exists one breaking API change and one behavior change:

SPARK-9268: Java’s varargs support is removed from Params.setDefault due to a Scala compiler bug.
SPARK-10097: Evaluator.isLargerBetter is added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
Known Issues
The following issues are known in 1.5.0, and will be fixed in 1.5.1 release.

SQL/DataFrame
SPARK-10301: Reading parquet files with different schema (schema merging) for nested structs can return the wrong answer
SPARK-10466: AssertionError when spilling data during sort-based shuffle with data spill
SPARK-10441: Timestamp data type cannot be written out as JSON
SPARK-10495: Date values saved to JSON are stored as strings representing the number of days from epoch (1970-01-01 00:00:00 UTC) instead of strings in the format of “yyyy-mm-dd”.
SPARK-10403: Tungsten mode does not work with tungsten-sort shuffle manager (which is off by default)
SPARK-10422: In-memory cache of string type with dictionary encoding is broken
SPARK-10434: Parquet files with null elements in arrays written by Spark 1.5.0 cannot be read by earlier versions of Spark
Streaming
SPARK-10224: Small chance of data loss when StreamingContext is stopped gracefully

1.6

Spark Core/SQL

SPARK-9999 Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine (see the sketch after this list).
SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in the in-memory table scan, and adding distributeBy and localSort to the DataFrame API
SPARK-11778 - DataFrameReader.table supports specifying database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
SPARK-10947 - With schema inference from JSON into a Dataframe, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
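
A minimal sketch of the Dataset API mentioned in the first item above, assuming the 1.6-era SQLContext entry point named sqlContext:

    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    // typed objects plus lambdas, executed by the Spark SQL engine
    val ds = Seq(Person("Ann", 34), Person("Bob", 17)).toDS()
    val adults = ds.filter(p => p.age >= 18).map(p => p.name)
    adults.show()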
Performance
SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead

Spark Streaming
API Updates
SPARK-2629 New improved state management - mapWithState - a DStream transformation for stateful stream processing that supersedes updateStateByKey in functionality and performance (see the sketch after this list).
SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and supports transparent deaggregation of KPL-aggregated records.
SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to each Kinesis record in the Kinesis receiver, to customize what data is stored in memory.
SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) from Python.
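
A sketch of mapWithState from the first item above, assuming pairs is an existing DStream[(String, Int)] of per-batch word counts:

    import org.apache.spark.streaming.{State, StateSpec}

    // keeps a running total per key and emits (word, totalSoFar) each batch
    val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
      val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)
      (word, sum)
    }

    val runningCounts = pairs.mapWithState(StateSpec.function(mappingFunc))
    runningCounts.print()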
UI Improvements
Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
Made output operations visible in the streaming tab as progress bars.

MLlib
New algorithms/models
SPARK-8518 Survival analysis - Log-linear model for survival analysis
SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
API improvements

ML Pipelines
SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
R API
SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
SPARK-9681 Feature interactions in R formula - Interaction operator “:” in R formula
Python API - Many improvements to Python API to approach feature parity
Misc improvements
SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
Documentation improvements
SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
SPARK-11337 Testable example code - Automated testing for code in user guide examples
Deprecations
In spark.mllib.clustering.KMeans, the “runs” parameter has been deprecated.
In spark.ml.classification.LogisticRegressionModel and spark.ml.regression.LinearRegressionModel, the “weights” field has been deprecated, in favor of the new name “coefficients.” This helps disambiguate from instance (row) weights given to algorithms.
Changes of behavior

MLlib
spark.mllib.tree.GradientBoostedTrees validationTol has changed semantics in 1.6. Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of GradientDescent convergenceTol: For large errors, it uses relative error (relative to the previous error); for small errors (< 0.01), it uses absolute error.
spark.ml.feature.RegexTokenizer: Previously, it did not convert strings to lowercase before tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the behavior of the simpler Tokenizer transformer.
SQL
The flag (spark.sql.tungsten.enabled) that turns off Tungsten mode and code generation has been removed. Tungsten mode and code generation are always enabled (SPARK-11644).
Spark SQL’s partition discovery has been changed to only discover partition directories that are children of the given path. (i.e. if path="/my/data/x=1" then x=1 will no longer be considered a partition but only children of x=1.) This behavior can be overridden by manually specifying the basePath that partitioning discovery should start with (SPARK-11678).
For a UDF, if it has primitive type input argument (a non-nullable input argument), when the value of this argument is null, this UDF will return null (SPARK-11725).
When casting a value of an integral type to timestamp (e.g. casting a long value to timestamp), the value is treated as being in seconds instead of milliseconds (SPARK-11724).
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5’s planner, please set spark.sql.specializeSingleDistinctAggPlanning to true (SPARK-12077).
getBoolean, getByte, getShort, getInt, getLong, getFloat and getDouble of a Row will throw a NullPointerException if the value at the given ordinal is a null (SPARK-11553).
variance is the alias of var_samp instead of var_pop (SPARK-11490).
The semantic of casting a String type value to a Boolean type value has been changed (SPARK-10442). Casting any one of “t”, “true”, “y”, “yes”, and “1” will return true. Casting any of “f”, “false”, “n”, “no”, and “0” will return false. For other String literals, casting them to a Boolean type value will return null.
Aggregate function first and last will not ignore null values by default (SPARK-9740). To make them ignore null values, users can set the second argument of first and last to true. For example, first(col, true) will return the first non-null value of the column col.

SPARK-12546 Save DataFrame/table as Parquet with dynamic partitions may cause OOM. This can be worked around by decreasing the memory used by both Spark and Parquet via spark.memory.fraction (for example, 0.4) and parquet.memory.pool.ratio (for example, 0.3, set in the Hadoop configuration, e.g. in core-site.xml).

2.0.0

Programming APIs
One of the largest changes in Spark 2.0 is the new updated APIs:

Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility. See the sketch after this list.
A new, streamlined configuration API for SparkSession
Simpler, more performant accumulator API
A new, improved Aggregator API for typed aggregation in Datasets
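
A minimal sketch of the new entry point and streamlined configuration API referenced above (application name and config value are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("example")
      .config("spark.sql.shuffle.partitions", "200")   // streamlined configuration API
      .getOrCreate()

    import spark.implicits._
    // DataFrame is now just a type alias for Dataset[Row]
    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    df.show()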
SQL
Spark 2.0 substantially improves SQL functionality with SQL:2003 support. Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:

A native SQL parser that supports both ANSI-SQL as well as Hive QL
Native DDL command implementations
Subquery support, including
Uncorrelated Scalar Subqueries
Correlated Scalar Subqueries
NOT IN predicate Subqueries (in WHERE/HAVING clauses)
IN predicate subqueries (in WHERE/HAVING clauses)
(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
View canonicalization support
In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.

Native CSV data source, based on Databricks’ spark-csv module

Off-heap memory management for both caching and runtime execution

Hive style bucketing support

Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
Performance and Runtime

Substantial (2 - 10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole stage code generation.

Improved Parquet scan throughput through vectorization
Improved ORC performance

Many improvements in the Catalyst query optimizer for common workloads
Improved window function performance via native implementations for all window functions
Automatic file coalescing for native data sources

MLlib
The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details

ML persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R. See this blog post and the following JIRAs for details: SPARK-6725, SPARK-11939, SPARK-14311.

MLlib in R: SparkR now offers MLlib APIs for generalized linear models, naive Bayes, k-means clustering, and survival regression. See this talk to learn more.

Python: PySpark now offers many more MLlib algorithms, including LDA, Gaussian Mixture Model, Generalized Linear Regression, and more.

Algorithms added to DataFrames-based API: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer.

Speed/scaling
Vectors and Matrices stored in DataFrames now use much more efficient serialization, reducing overhead in calling MLlib algorithms. (SPARK-14850)

SparkR
The largest improvement to SparkR in Spark 2.0 is user-defined functions. There are three user-defined function APIs: dapply, gapply, and lapply. The first two can be used for partition-based UDFs, e.g. partitioned model learning, while lapply can be used for hyper-parameter tuning.

In addition, there are a number of new features:

Improved algorithm coverage for machine learning in R, including naive Bayes, k-means clustering, and survival regression.
Generalized linear models support more families and link functions.
Save and load for all ML models.
More DataFrame functionality: window functions API, reader/writer support for JDBC and CSV, SparkSession
Streaming
Spark 2.0 ships the initial experimental release for Structured Streaming, a high level streaming API built on top of Spark SQL and the Catalyst optimizer. Structured Streaming enables users to program against streaming sources and sinks using the same DataFrame/Dataset API as in static data sources, leveraging the Catalyst optimizer to automatically incrementalize the query plans.
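
A sketch of a Structured Streaming query, assuming an existing SparkSession named spark and a socket source on localhost:9999 (the classic streaming word count):

    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // the same DataFrame/Dataset operators as for static data; Catalyst incrementalizes the plan
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()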

For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.

Dependency, Packaging, and Operations
There are a variety of changes to Spark’s operations and packaging process:

Spark 2.0 no longer requires a fat assembly jar for production deployment.
Akka dependency has been removed, and as a result, user applications can program against any version of Akka.
Support launching multiple Mesos executors in coarse grained Mesos mode.
Kryo version is bumped to 3.0.
The default build is now using Scala 2.11 rather than Scala 2.10.
Removals, Behavior Changes and Deprecations
Removals
The following features have been removed in Spark 2.0:

Bagel
Support for Hadoop 2.1 and earlier
The ability to configure closure serializer
HTTPBroadcast
TTL-based metadata cleaning
Semi-private class org.apache.spark.Logging. We suggest you use slf4j directly.
SparkContext.metricsSystem
Block-oriented integration with Tachyon (subsumed by file system integration)
Methods deprecated in Spark 1.x
Methods on Python DataFrame that returned RDDs (map, flatMap, mapPartitions, etc). They are still available in dataframe.rdd field, e.g. dataframe.rdd.map.
Less frequently used streaming connectors, including Twitter, Akka, MQTT, ZeroMQ
Hash-based shuffle manager
History serving functionality from standalone Master
For Java and Scala, DataFrame no longer exists as a class. As a result, data sources would need to be updated.
Spark EC2 script has been fully moved to an external repository hosted by the UC Berkeley AMPLab
Behavior Changes
The following changes might require updating existing applications that depend on the old behavior or API.

The default build is now using Scala 2.11 rather than Scala 2.10.
In SQL, floating literals are now parsed as decimal data type rather than double data type.
Kryo version is bumped to 3.0.
Java RDD's flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java Iterator so the functions do not need to materialize all the data.
Java RDD's countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
When writing Parquet files, the summary files are not written by default. To re-enable it, users must set “parquet.enable.summary-metadata” to true.
The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.
For a more complete list, please see SPARK-11806 for deprecations and removals.

Deprecations
The following features have been deprecated in Spark 2.0, and might be removed in future versions of Spark 2.x:

Fine-grained mode in Apache Mesos
Support for Java 7
Support for Python 2.6
Known Issues
Lead and Lag's behavior has been changed from respecting nulls (the 1.6 behavior) to ignoring nulls. This behavioral change will be fixed in 2.0.1 (SPARK-16721).
Lead and Lag functions using constant input values do not return the default value when the offset row does not exist (SPARK-16633).

2.1.0

API updates
SPARK-17864: Data type APIs are stable APIs.
SPARK-18351: from_json and to_json for parsing and generating JSON in string columns (see the sketch after this list)
SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType.
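
A sketch of from_json and to_json from the list above, assuming a DataFrame df with a string column "json" containing objects like {"name":"Ann","age":30}:

    import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
    import org.apache.spark.sql.types._

    val schema = new StructType().add("name", StringType).add("age", IntegerType)

    // parse the JSON string column into a struct, then flatten it into top-level columns
    val parsed = df.select(from_json(col("json"), schema).as("data")).select("data.*")

    // and back again: serialize columns to a JSON string
    val roundTrip = parsed.select(to_json(struct(col("name"), col("age"))).as("json"))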
Performance and stability
SPARK-17861: Scalable Partition Handling. Hive metastore stores all table partition metadata by default for Spark tables stored with Hive’s storage formats as well as tables stored with Spark’s native formats. This change reduces first query latency over partitioned tables and allows for the use of DDL commands to manipulate partitions for tables stored with Spark’s native formats. Users can migrate tables stored with Spark’s native formats created by previous versions by using the MSCK command.
SPARK-16523: Speeds up group-by aggregate performance by adding a fast aggregation cache that is backed by a row-based hashmap.
Other notable changes
SPARK-9876: parquet-mr upgraded to 1.8.1
Programming guides: Spark Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming
API updates
SPARK-17346: Kafka 0.10 support in Structured Streaming
SPARK-17731: Metrics for Structured Streaming
SPARK-17829: Stable format for offset log
SPARK-18124: Observed delay based Event Time Watermarks
SPARK-18192: Support all file formats in structured streaming
SPARK-18516: Separate instantaneous state from progress performance statistics
Stability
SPARK-17267: Long running structured streaming requirements
Programming guide: Structured Streaming Programming Guide.

MLlib
API updates
SPARK-5992: Locality Sensitive Hashing
SPARK-7159: Multiclass Logistic Regression in DataFrame-based API
SPARK-16000: ML persistence: Make model loading backwards-compatible with Spark 1.x with saved models using spark.mllib.linalg.Vector columns in DataFrame-based API
Performance and stability
SPARK-17748: Faster, more stable LinearRegression for < 4096 features
SPARK-16719: RandomForest: communicate fewer trees on each iteration
Programming guide: Machine Learning Library (MLlib) Guide.

SparkR
The main focus of SparkR in the 2.1.0 release was adding extensive support for ML algorithms, which include:

New ML algorithms in SparkR including LDA, Gaussian Mixture Models, ALS, Random Forest, Gradient Boosted Trees, and more
Support for multinomial logistic regression providing similar functionality as the glmnet R package
Enable installing third party packages on workers using spark.addFile (SPARK-17577).
Standalone installable package built with the Apache Spark release. We will be submitting this to CRAN soon.
Programming guide: SparkR (R on Spark).

GraphX
SPARK-11496: Personalized pagerank
Programming guide: GraphX Programming Guide.

Deprecations
MLlib
SPARK-18592: Deprecate unnecessary Param setter methods in tree and ensemble models
Changes of behavior
Core and SQL
SPARK-18360: The default table path of tables in the default database will be under the location of the default database instead of always depending on the warehouse location setting.
SPARK-18377: spark.sql.warehouse.dir is a static configuration now. Users need to set it before the start of the first SparkSession and its value is shared by sessions in the same application.
SPARK-14393: Values generated by non-deterministic functions will not change after coalesce or union.
SPARK-18076: Fix default Locale used in DateFormat, NumberFormat to Locale.US
SPARK-16216: CSV and JSON data sources write timestamp and date values in ISO 8601 formatted string. Two options, timestampFormat and dateFormat, are added to these two data sources to let users control the format of timestamp and date value in string representation, respectively. Please refer to the API doc of DataFrameReader and DataFrameWriter for more details about these two configurations.
SPARK-17427: Function SIZE returns -1 when its input parameter is null.
SPARK-16498: LazyBinaryColumnarSerDe is fixed as the SerDe for RCFile.
SPARK-16552: If a user does not specify the schema to a table and relies on schema inference, the inferred schema will be stored in the metastore. The schema will not be inferred again when this table is used.
Structured Streaming
SPARK-18516: Separate instantaneous state from progress performance statistics

MLlib
SPARK-17870: ChiSquareSelector now accounts for degrees of freedom by using pValue rather than raw statistic to select the top features.

Known Issues
SPARK-17647: In the SQL LIKE clause, the wildcard characters ‘%’ and ‘_’ immediately after backslashes are always escaped.
SPARK-18908: If a StreamExecution fails to start, users need to check stderr for the error.

2.2.0

API updates
SPARK-19107: Support creating hive table with DataFrameWriter and Catalog
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
SPARK-18885: Unify CREATE TABLE syntax for data source and hive serde tables
SPARK-16475: Added broadcast hints BROADCAST, BROADCASTJOIN, and MAPJOIN for SQL queries (see the sketch after this list)
SPARK-18350: Support session local timezone
SPARK-19261: Support ALTER TABLE table_name ADD COLUMNS
SPARK-20420: Add events to the external catalog
SPARK-18127: Add hooks and extension points to Spark
SPARK-20576: Support generic hint function in Dataset/DataFrame
SPARK-17203: Data source options should always be case insensitive
SPARK-19139: AES-based authentication mechanism for Spark
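
A short PySpark sketch of two of the items above, the broadcast hint (SPARK-16475) and the session-local timezone (SPARK-18350); the view names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("sql-hints-sketch").getOrCreate()

# SPARK-18350: timestamps are rendered and parsed in the session-local timezone.
spark.conf.set("spark.sql.session.timeZone", "UTC")

# Hypothetical fact/dimension views used only for illustration.
spark.range(1000).createOrReplaceTempView("facts")
spark.range(10).createOrReplaceTempView("dims")

# SPARK-16475: broadcast hint in SQL text ...
hinted = spark.sql("""
    SELECT /*+ BROADCAST(dims) */ *
    FROM facts JOIN dims ON facts.id = dims.id
""")

# ... and the equivalent DataFrame-side hint.
same_idea = spark.table("facts").join(broadcast(spark.table("dims")), "id")

hinted.explain()  # the plan should show a broadcast join
```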
Performance and stability
Cost-Based Optimizer
SPARK-17075 SPARK-17076 SPARK-19020 SPARK-17077 SPARK-19350: Cardinality estimation for filter, join, aggregate, project and limit/sample operators
SPARK-17080: Cost-based join re-ordering
SPARK-17626: TPC-DS performance improvements using star-schema heuristics
SPARK-17949: Introduce a JVM object based aggregate operator
SPARK-18186: Partial aggregation support of HiveUDAFFunction
SPARK-18362 SPARK-19918: File listing/IO improvements for CSV and JSON
SPARK-18775: Limit the max number of records written per file
SPARK-18761: Uncancellable / unkillable tasks shouldn’t starve jobs of resources
SPARK-15352: Topology aware block replication
Other notable changes
SPARK-18352: Support for parsing multi-line JSON files (see the example after this list)
SPARK-19610: Support for parsing multi-line CSV files
SPARK-21079: Analyze Table Command on partitioned tables
SPARK-18703: Drop Staging Directories and Data Files after completion of Insertion/CTAS against Hive-serde Tables
SPARK-18209: More robust view canonicalization without full SQL expansion
SPARK-13446: [SPARK-18112] Support reading data from Hive metastore 2.0/2.1
SPARK-18191: Port RDD API to use commit protocol
SPARK-8425: Add blacklist mechanism for task scheduling
SPARK-19464: Remove support for Hadoop 2.5 and earlier
SPARK-19493: Remove Java 7 support
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.
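
A minimal sketch of the multi-line JSON/CSV parsing mentioned above (SPARK-18352, SPARK-19610); the input paths are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-sketch").getOrCreate()

# With multiLine enabled a single record may span several physical lines,
# e.g. pretty-printed JSON or CSV fields with embedded newlines.
people_json = spark.read.option("multiLine", True).json("/tmp/people_pretty.json")
people_csv = (spark.read
              .option("multiLine", True)
              .option("header", True)
              .csv("/tmp/people_quoted.csv"))

people_json.printSchema()
people_csv.show()
```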

Structured Streaming
General Availability
SPARK-20844: The Structured Streaming APIs are now GA and are no longer labeled experimental
Kafka Improvements
SPARK-19719: Support for reading and writing data in streaming or batch to/from Apache Kafka (see the sketch after this list)
SPARK-19968: Cached producer for lower-latency Kafka-to-Kafka streaming.
API updates
SPARK-19067: Support for complex stateful processing and timeouts using [flat]MapGroupsWithState
SPARK-19876: Support for one time triggers
Other notable changes
SPARK-20979: Rate source for testing and benchmarks
Programming guide: Structured Streaming Programming Guide.
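
A minimal sketch combining the Kafka source (SPARK-19719) with a one-time trigger (SPARK-19876); the broker address, topic, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-once-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

# A one-time trigger processes whatever is available and then stops, which
# makes it easy to run a streaming query as a scheduled batch job.
query = (events.writeStream
         .format("parquet")
         .option("path", "/tmp/events_out")
         .option("checkpointLocation", "/tmp/events_chk")
         .trigger(once=True)
         .start())
query.awaitTermination()
```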

MLlib
New algorithms in DataFrame-based API
SPARK-14709: LinearSVC (Linear SVM Classifier) (Scala/Java/Python/R)
SPARK-19635: ChiSquare test in DataFrame-based API (Scala/Java/Python)
SPARK-19636: Correlation in DataFrame-based API (Scala/Java/Python)
SPARK-13568: Imputer feature transformer for imputing missing values (Scala/Java/Python)
SPARK-18929: Add Tweedie distribution for GLMs (Scala/Java/Python/R)
SPARK-14503: FPGrowth frequent pattern mining and AssociationRules (Scala/Java/Python/R)
Existing algorithms added to Python and R APIs
SPARK-18239: Gradient Boosted Trees (R)
SPARK-18821: Bisecting K-Means (R)
SPARK-18080: Locality Sensitive Hashing (LSH) (Python)
SPARK-6227: Distributed PCA and SVD for PySpark (in RDD-based API)
Major bug fixes
SPARK-19110: DistributedLDAModel.logPrior correctness fix
SPARK-17975: EMLDAOptimizer fails with ClassCastException (caused by GraphX checkpointing bug)
SPARK-18715: Fix wrong AIC calculation in Binomial GLM
SPARK-16473: BisectingKMeans failing during training with “java.util.NoSuchElementException: key not found” for certain inputs
SPARK-19348: pyspark.ml.Pipeline gets corrupted under multi-threaded use
SPARK-20047: Box-constrained Logistic Regression
Programming guide: Machine Learning Library (MLlib) Guide.

SparkR
The main focus of SparkR in the 2.2.0 release was adding extensive support for existing Spark SQL features:

Major features
SPARK-19654: Structured Streaming API for R
SPARK-20159: Support complete Catalog API in R
SPARK-19795: column functions to_json, from_json
SPARK-19399: Coalesce on DataFrame and coalesce on column
SPARK-20020: Support DataFrame checkpointing
SPARK-18285: Multi-column approxQuantile in R
Programming guide: SparkR (R on Spark).

GraphX
Bug fixes
SPARK-18847: PageRank gives incorrect results for graphs with sinks
SPARK-14804: Graph vertexRDD/EdgeRDD checkpoint results ClassCastException
Optimizations
SPARK-18845: PageRank initial value improvement for faster convergence
SPARK-5484: Pregel should checkpoint periodically to avoid StackOverflowError
Programming guide: GraphX Programming Guide.

Deprecations
Python
SPARK-12661: Drop support for Python 2.6

MLlib
SPARK-18613: spark.ml LDA classes should not expose spark.mllib in APIs. In spark.ml.LDAModel, deprecated oldLocalModel and getModel.

SparkR
SPARK-20195: deprecate createExternalTable
Changes of behavior

MLlib
SPARK-19787: DeveloperApi ALS.train() uses default regParam value 0.1 instead of 1.0, in order to match regular ALS API’s default regParam setting.

SparkR
SPARK-19291: This added log-likelihood for SparkR Gaussian Mixture Models, but doing so introduced a SparkR model persistence incompatibility: Gaussian Mixture Models saved from SparkR 2.1 may not be loaded into SparkR 2.2. We plan to put in place backwards compatibility guarantees for SparkR in the future.
Known Issues
SPARK-21093: Multiple gapply execution occasionally failed in SparkR

2.3.0

Major features
Spark on Kubernetes: [SPARK-18278] A new Kubernetes scheduler backend that supports native submission of Spark jobs to a cluster managed by Kubernetes. Note that this support is currently experimental and behavioral changes around configurations, container images and entrypoints should be expected.

Vectorized ORC Reader: [SPARK-16060] Adds support for a new ORC reader that substantially improves ORC scan throughput through vectorization (2-5x). To enable the reader, users can set spark.sql.orc.impl to native.
Spark History Server V2: [SPARK-18085] A new spark history server (SHS) backend that provides better scalability for large scale applications with a more efficient event storage mechanism.

Data source API V2: [SPARK-15689][SPARK-22386] An experimental API for plugging in new data sources in Spark. The new API attempts to address several limitations of the V1 API and aims to facilitate development of high-performance, easy-to-maintain, and extensible external data sources. Note that this API is still undergoing active development and breaking changes should be expected.

PySpark Performance Enhancements: [SPARK-22216][SPARK-21187] Significant improvements in Python performance and interoperability through fast data serialization and vectorized execution, as sketched below.
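
A minimal PySpark sketch of the vectorized execution path, assuming pyarrow and pandas are installed; the ORC conf line also shows how to opt in to the native reader mentioned above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("arrow-sketch").getOrCreate()

# SPARK-16060: opt in to the vectorized native ORC reader.
spark.conf.set("spark.sql.orc.impl", "native")

# SPARK-22216 / SPARK-21187: Arrow-based conversion between Spark and Pandas.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# A scalar (vectorized) Pandas UDF; it receives pandas.Series batches.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

df = spark.range(1000).selectExpr("CAST(id AS DOUBLE) AS x")
df.select(plus_one("x").alias("x_plus_one")).show(3)

# toPandas() uses Arrow when the conf above is enabled.
pdf = df.limit(5).toPandas()
```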

Performance and stability
[SPARK-21975] Histogram support in cost-based optimizer
[SPARK-20331] Better support for predicate pushdown for Hive partition pruning
[SPARK-19112] Support for ZStandard compression codec
[SPARK-21113] Support for read ahead input stream to amortize disk I/O cost in the spill reader
[SPARK-22510][SPARK-22692][SPARK-21871] Further stabilize the codegen framework to avoid hitting the 64KB JVM bytecode limit on the Java method and Java compiler constant pool limit
[SPARK-23207] Fixed a long standing bug in Spark where consecutive shuffle+repartition on a DataFrame could lead to incorrect answers in certain surgical cases
[SPARK-22062][SPARK-17788][SPARK-21907] Fix various causes of OOMs
[SPARK-22489][SPARK-22916][SPARK-22895][SPARK-20758][SPARK-22266][SPARK-19122][SPARK-22662][SPARK-21652] Enhancements in rule-based optimizer and planner

[SPARK-20236] Support Hive style dynamic partition overwrite semantics.
[SPARK-4131] Support INSERT OVERWRITE DIRECTORY to directly write data into the filesystem from a query
[SPARK-19285][SPARK-22945][SPARK-21499][SPARK-20586][SPARK-20416][SPARK-20668] UDF enhancements
[SPARK-20463][SPARK-19951][SPARK-22934][SPARK-21055][SPARK-17729][SPARK-20962][SPARK-20963][SPARK-20841][SPARK-17642][SPARK-22475][SPARK-22934] Improved ANSI SQL compliance and Hive compatibility
[SPARK-20746] More comprehensive SQL built-in functions
[SPARK-21485] Spark SQL documentation generation for built-in functions
[SPARK-19810] Remove support for Scala 2.10
[SPARK-22324] Upgrade Arrow to 0.8.0 and Netty to 4.1.17
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming
Continuous Processing
A new execution engine that can execute streaming queries with sub-millisecond end-to-end latency by changing only a single line of user code. To learn more see the programming guide.
Stream-Stream Joins
Ability to join two streams of data, buffering rows until matching tuples arrive in the other stream. Predicates can be used against event time columns to bound the amount of state that needs to be retained.
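
A sketch of a stream-stream join, adapted from the pattern in the programming guide; the Kafka topics, column names, and time bounds are illustrative, and the Kafka connector package is assumed.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-join-sketch").getOrCreate()

def kafka_stream(topic):
    # Hypothetical Kafka-backed stream; the broker address is illustrative.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", topic)
            .load())

impressions = kafka_stream("impressions").selectExpr(
    "CAST(key AS STRING) AS impressionAdId", "timestamp AS impressionTime")
clicks = kafka_stream("clicks").selectExpr(
    "CAST(key AS STRING) AS clickAdId", "timestamp AS clickTime")

# Watermarks bound how long unmatched rows are buffered; the event-time
# predicate bounds the join state that must be retained.
joined = (impressions.withWatermark("impressionTime", "2 hours")
          .join(clicks.withWatermark("clickTime", "3 hours"),
                expr("""
                    clickAdId = impressionAdId AND
                    clickTime >= impressionTime AND
                    clickTime <= impressionTime + interval 1 hour
                """)))

query = (joined.writeStream.format("console")
         .option("checkpointLocation", "/tmp/join_chk")
         .start())
```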
Streaming API V2
An experimental API for plugging in new sources and sinks that works for batch, micro-batch, and continuous execution. Note this API is still undergoing active development and breaking changes should be expected.
Programming guide: Structured Streaming Programming Guide.

MLlib

Highlights
ML Prediction now works with Structured Streaming, using updated APIs. Details below.
New/Improved APIs
[SPARK-21866]: Built-in support for reading images into a DataFrame (Scala/Java/Python)
[SPARK-19634]: DataFrame functions for descriptive summary statistics over vector columns (Scala/Java)
[SPARK-14516]: ClusteringEvaluator for tuning clustering algorithms, supporting Cosine silhouette and squared Euclidean silhouette metrics (Scala/Java/Python)
[SPARK-3181]: Robust linear regression with Huber loss (Scala/Java/Python)
[SPARK-13969]: FeatureHasher transformer (Scala/Java/Python)
Multiple column support for several feature transformers:
[SPARK-13030]: OneHotEncoderEstimator (Scala/Java/Python)
[SPARK-22397]: QuantileDiscretizer (Scala/Java)
[SPARK-20542]: Bucketizer (Scala/Java/Python)
[SPARK-21633] and [SPARK-21542]: Improved support for custom pipeline components in Python.
[SPARK-21087]: CrossValidator and TrainValidationSplit can collect all models when fitting (Scala/Java). This allows you to inspect or save all fitted models.
[SPARK-19357]: Meta-algorithms CrossValidator, TrainValidationSplit, and OneVsRest support a parallelism Param for fitting multiple sub-models in parallel Spark jobs (see the sketch after this list)
[SPARK-17139]: Model summary for multinomial logistic regression (Scala/Java/Python)
[SPARK-18710]: Add offset in GLM
[SPARK-20199]: Added featureSubsetStrategy Param to GBTClassifier and GBTRegressor. Using this to subsample features can significantly improve training speed; this option has been a key strength of xgboost.
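
A small sketch of fitting sub-models in parallel with the parallelism Param (SPARK-19357); the toy dataset and parameter grid are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cv-parallelism-sketch").getOrCreate()

# Tiny synthetic dataset, for illustration only.
df = spark.createDataFrame(
    [(0.0, 0.0, 0.1), (1.0, 1.0, 0.9), (0.0, 0.2, 0.1), (1.0, 0.9, 0.8)] * 10,
    ["label", "f1", "f2"])

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[assembler, lr])
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# parallelism=2 fits up to two sub-models at a time in parallel Spark jobs.
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=2)
model = cv.fit(df)
```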

Bug fixes
[SPARK-22156] Fixed Word2Vec learning rate scaling with the number of iterations. The new learning rate is set to match the original Word2Vec C code and should give better results from training.
[SPARK-22289] Add JSON support for Matrix parameters (This fixed a bug for ML persistence with LogisticRegressionModel when using bounds on coefficients.)
[SPARK-22700] Bucketizer.transform incorrectly drops rows containing NaN. When Param handleInvalid was set to “skip,” Bucketizer would drop a row with a valid value in the input column if another (irrelevant) column had a NaN value.
[SPARK-22446] Catalyst optimizer sometimes caused StringIndexerModel to throw an incorrect “Unseen label” exception when handleInvalid was set to “error.” This could happen for filtered data, due to predicate push-down, causing errors even after invalid rows had already been filtered from the input dataset.
[SPARK-21681] Fixed an edge case bug in multinomial logistic regression that resulted in incorrect coefficients when some features had zero variance.

Major optimizations:
[SPARK-22707] Reduced memory consumption for CrossValidator
[SPARK-22949] Reduced memory consumption for TrainValidationSplit
[SPARK-21690] Imputer should train using a single pass over the data
[SPARK-14371] OnlineLDAOptimizer avoids collecting statistics to the driver for each mini-batch.
Programming guide: Machine Learning Library (MLlib) Guide.

SparkR
The main focus of SparkR in the 2.3.0 release was on improving the stability of UDFs and adding several new SparkR wrappers around existing APIs:

Major features
Improved function parity between SQL and R
[SPARK-22933]: Structured Streaming APIs for withWatermark, trigger, partitionBy and stream-stream joins
[SPARK-21266]: SparkR UDF with DDL-formatted schema support
[SPARK-20726][SPARK-22924][SPARK-22843] Several new DataFrame API wrappers
[SPARK-15767][SPARK-21622][SPARK-20917][SPARK-20307][SPARK-20906] Several new SparkML API Wrappers
Programming guide: SparkR (R on Spark).

GraphX
Optimizations
[SPARK-5484] Pregel now checkpoints periodically to avoid StackOverflowErrors
[SPARK-21491] Small performance improvement in several places
Programming guide: GraphX Programming Guide.

Deprecations
Python
[SPARK-23122] Deprecate register* for UDFs in SQLContext and Catalog in PySpark

MLlib
[SPARK-13030] OneHotEncoder has been deprecated and will be removed in 3.0. It has been replaced by the new OneHotEncoderEstimator. Note that OneHotEncoderEstimator will be renamed to OneHotEncoder in 3.0 (but OneHotEncoderEstimator will be kept as an alias).

SparkSQL
[SPARK-22036]: By default arithmetic operations between decimals return a rounded value if an exact representation is not possible (instead of returning NULL in the prior versions)
[SPARK-22937]: When all inputs are binary, SQL elt() returns an output as binary. Otherwise, it returns a string. In prior versions, it always returned a string regardless of the input types.
[SPARK-22895]: The Join/Filter’s deterministic predicates that come after the first non-deterministic predicate are also pushed down/through the child operators, if possible. In prior versions, these filters were not eligible for predicate pushdown.
[SPARK-22771]: When all inputs are binary, functions.concat() returns an output as binary. Otherwise, it returns a string. In prior versions, it always returned a string regardless of the input types.
[SPARK-22489]: When either of the join sides is broadcastable, we prefer broadcasting the table that is explicitly specified in a broadcast hint.
[SPARK-22165]: Partition column inference previously found incorrect common type for different inferred types, for example, previously it ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. For details, see the migration guide.
[SPARK-22100]: The percentile_approx function previously accepted numeric type input and outputted double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
[SPARK-21610]: Queries over raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). Instead, you can cache or save the parsed results and then send the same query.
[SPARK-23421]: Since Spark 2.2.1 and 2.3.0, the schema is always inferred at runtime when the data source tables have the columns that exist in both partition schema and data schema. The inferred schema does not have the partitioned columns. When reading the table, Spark respects the partition values of these overlapping columns instead of the values stored in the data source files. In 2.2.0 and 2.1.x release, the inferred schema is partitioned but the data of the table is invisible to users (i.e., the result set is empty).

PySpark
[SPARK-19732]: na.fill() or fillna also accepts boolean and replaces nulls with booleans. In prior Spark versions, PySpark just ignored it and returned the original Dataset/DataFrame.
[SPARK-22395]: Pandas 0.19.2 or higher is required for using Pandas-related functionality, such as toPandas, createDataFrame from a Pandas DataFrame, etc.
[SPARK-22395]: The behavior of timestamp values for Pandas-related functionality was changed to respect the session timezone, which was ignored in prior versions.
[SPARK-23328]: df.replace does not allow value to be omitted when to_replace is not a dictionary. Previously, value could be omitted in the other cases and defaulted to None, which was counter-intuitive and error-prone.

MLlib
Breaking API Changes: The class and trait hierarchy for logistic regression model summaries was changed to be cleaner and better accommodate the addition of the multi-class summary. This is a breaking change for user code that casts a LogisticRegressionTrainingSummary to a BinaryLogisticRegressionTrainingSummary. Users should instead use the model.binarySummary method. See [SPARK-17139] for more detail (note this is an @Experimental API). This does not affect the Python summary method, which will still work correctly for both multinomial and binary cases.
[SPARK-21806]: BinaryClassificationMetrics.pr(): first point (0.0, 1.0) is misleading and has been replaced by (0.0, p) where precision p matches the lowest recall point.
[SPARK-16957]: Decision trees now use weighted midpoints when choosing split values. This may change results from model training.
[SPARK-14657]: RFormula without an intercept now outputs the reference category when encoding string terms, in order to match native R behavior. This may change results from model training.
[SPARK-21027]: The default parallelism used in OneVsRest is now set to 1 (i.e. serial). In 2.2 and earlier versions, the level of parallelism was set to the default threadpool size in Scala. This may change performance.
[SPARK-21523]: Upgraded Breeze to 0.13.2. This included an important bug fix in strong Wolfe line search for L-BFGS.
[SPARK-15526]: The JPMML dependency is now shaded.
Also see the “Bug fixes” section for behavior changes resulting from fixing bugs.

Known Issues
[SPARK-23523][SQL] Incorrect result caused by the rule OptimizeMetadataOnlyQuery
[SPARK-23406] Bugs in stream-stream self-joins

2.4.0

Major features
Barrier Execution Mode: [SPARK-24374] Support Barrier Execution Mode in the scheduler, to better integrate with deep learning frameworks (see the sketch after this list).
Scala 2.12 Support: [SPARK-14220] Add experimental Scala 2.12 support. Now you can build Spark with Scala 2.12 and write Spark applications in Scala 2.12.
Higher-order functions: [SPARK-23899] Add many new built-in functions, including higher-order functions, to make it easier to work with complex data types.
Built-in Avro data source: [SPARK-24768] Inline the Spark-Avro package, with logical type support, better performance, and usability.
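
A minimal sketch of barrier execution mode, which only illustrates the task synchronization; a real job would hand each partition to a deep-learning framework. Note that the partition count must not exceed the number of simultaneously available task slots.

```python
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
sc = spark.sparkContext

def train_partition(iterator):
    # All tasks in a barrier stage are launched together; barrier() blocks
    # until every task in the stage has reached this point.
    ctx = BarrierTaskContext.get()
    ctx.barrier()
    yield (ctx.partitionId(), sum(iterator))

# barrier() marks the stage for barrier execution mode.
result = (sc.parallelize(range(100), 2)
          .barrier()
          .mapPartitions(train_partition)
          .collect())
print(result)
```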

A new scheduling model (Barrier Scheduling) that lets users properly embed distributed deep-learning training in Spark stages, simplifying distributed training workflows.
35 new higher-order functions for manipulating arrays/maps in Spark SQL (see the sketch after this list).
A new native AVRO data source, based on Databricks’ spark-avro module.
PySpark adds an eager evaluation mode for all operations, aimed at teaching and debuggability.
Spark on Kubernetes supports PySpark and R, as well as client mode.
Various enhancements to Structured Streaming, for example stateful operators in continuous processing.
Various performance improvements to built-in data sources, for example Parquet nested schema pruning.
Scala 2.12 support.
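
A minimal sketch of the higher-order functions and the built-in Avro source; the "avro" format name assumes the spark-avro module is on the classpath (e.g. added via --packages), and the output path is illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-avro-sketch").getOrCreate()

# SPARK-23899: higher-order functions take lambda expressions in SQL.
spark.sql("""
    SELECT transform(array(1, 2, 3), x -> x + 1)               AS plus_one,
           filter(array(1, 2, 3, 4), x -> x % 2 = 0)           AS evens,
           aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x)   AS total
""").show(truncate=False)

# SPARK-24768: read/write Avro through the built-in "avro" format.
df = spark.range(5)
df.write.mode("overwrite").format("avro").save("/tmp/avro_out")
spark.read.format("avro").load("/tmp/avro_out").show()
```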

API
[SPARK-24035] SQL syntax for Pivot
[SPARK-24940] Coalesce and Repartition Hint for SQL Queries
[SPARK-19602] Support column resolution of fully qualified column name
[SPARK-21274] Implement EXCEPT ALL and INTERSECT ALL
Performance and stability
[SPARK-16406] Reference resolution for large number of columns should be faster
[SPARK-23486] Cache the function name from the external catalog for lookupFunctions
[SPARK-23803] Support Bucket Pruning
[SPARK-24802] Optimization Rule Exclusion
[SPARK-4502] Nested schema pruning for Parquet tables
[SPARK-24296] Support replicating blocks larger than 2 GB
[SPARK-24307] Support sending messages over 2GB from memory
[SPARK-23243] Shuffle+Repartition on an RDD could lead to incorrect answers
[SPARK-25181] Limited the size of BlockManager master and slave thread pools, lowering memory overhead when networking is slow
Connectors
[SPARK-23972] Update Parquet from 1.8.2 to 1.10.0
[SPARK-25419] Parquet predicate pushdown improvement
[SPARK-23456] Native ORC reader is on by default
[SPARK-22279] Use native ORC reader to read Hive serde tables by default
[SPARK-21783] Turn on ORC filter push-down by default
[SPARK-24959] Speed up count() for JSON and CSV
[SPARK-24244] Parsing only required columns to the CSV parser
[SPARK-23786] CSV schema validation - column names are not checked
[SPARK-24423] Option query for specifying the query to read from JDBC
[SPARK-22814] Support Date/Timestamp in JDBC partition column
[SPARK-24771] Update Avro from 1.7.7 to 1.8
Kubernetes Scheduler Backend
[SPARK-23984] PySpark bindings for K8S
[SPARK-24433] R bindings for K8S
[SPARK-23146] Support client mode for Kubernetes cluster backend
[SPARK-23529] Support for mounting K8S volumes
PySpark
[SPARK-24215] Implement eager evaluation for DataFrame APIs (see the sketch after this list)
[SPARK-22274] User-defined aggregation functions with Pandas UDF
[SPARK-22239] User-defined window functions with Pandas UDF
[SPARK-24396] Add Structured Streaming ForeachWriter for Python
[SPARK-23874] Upgrade Apache Arrow to 0.10.0
[SPARK-25004] Add spark.executor.pyspark.memory limit
[SPARK-23030] Use Arrow stream format for creating from and collecting Pandas DataFrames
[SPARK-24624] Support mixture of Python UDF and Scalar Pandas UDF
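
A short sketch of eager evaluation (SPARK-24215) and a grouped-aggregate Pandas UDF (SPARK-22274); pyarrow and pandas are assumed to be installed, and the data is a toy example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("pyspark-24-sketch").getOrCreate()

# SPARK-24215: DataFrames render eagerly in REPLs/notebooks with this conf.
spark.conf.set("spark.sql.repl.eagerEval.enabled", "true")

df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "v"])

# SPARK-22274: a grouped-aggregate Pandas UDF receives one pandas.Series
# per group and returns a scalar.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    return v.mean()

df.groupBy("key").agg(mean_udf(df["v"]).alias("mean_v")).show()
```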
Other notable changes
[SPARK-24596] Non-cascading Cache Invalidation
[SPARK-23880] Do not trigger any job for caching data
[SPARK-23510] Support Hive 2.2 and Hive 2.3 metastore
[SPARK-23711] Add fallback generator for UnsafeProjection
[SPARK-24626] Parallelize location size calculation in Analyze Table command
Programming guides: Spark RDD Programming Guide and Spark SQL, DataFrames and Datasets Guide.

Structured Streaming
Major features
[SPARK-24565] Exposed the output rows of each microbatch as a DataFrame using foreachBatch (Python, Scala, and Java) (see the sketch after this list)
[SPARK-24396] Added Python API for foreach and ForeachWriter
[SPARK-25005] Support “kafka.isolation.level” to read only committed records from Kafka topics that are written using a transactional producer.

[SPARK-24662] Support the LIMIT operator for streams in Append or Complete mode
[SPARK-24763] Remove redundant key data from value in streaming aggregation
[SPARK-24156] Faster generation of output results and/or state cleanup with stateful operations (mapGroupsWithState, stream-stream join, streaming aggregation, streaming dropDuplicates) when there is no data in the input stream.
[SPARK-24730] Support for choosing either the min or max watermark when there are multiple input streams in a query
[SPARK-25399] Fixed a bug where reusing execution threads from continuous processing for microbatch streaming can result in a correctness issue
[SPARK-18057] Upgraded Kafka client version from 0.10.0.1 to 2.0.0
Programming guide: Structured Streaming Programming Guide.
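
A minimal sketch of foreachBatch (SPARK-24565) using the built-in rate source; the output and checkpoint paths are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("foreach-batch-sketch").getOrCreate()

# The rate source generates (timestamp, value) rows and is handy for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

def write_batch(batch_df, batch_id):
    # Each micro-batch arrives as an ordinary DataFrame, so any existing
    # batch writer (Parquet here, but it could be JDBC, etc.) can be reused.
    batch_df.withColumn("batch_id", lit(batch_id)) \
            .write.mode("append").parquet("/tmp/rate_batches")

query = (stream.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/rate_chk")
         .start())
```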

MLlib
Major features
[SPARK-22666] Spark datasource for image format
Other notable changes
[SPARK-22119] Add cosine distance measure to KMeans/BisectingKMeans/Clustering evaluator
[SPARK-10697] Lift Calculation in Association Rule mining
[SPARK-14682] Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[SPARK-7132] Add fit with validation set to spark.ml GBT
[SPARK-15784] Add Power Iteration Clustering to spark.ml
[SPARK-15064] Locale support in StopWordsRemover
[SPARK-21741] Python API for DataFrame-based multivariate summarizer
[SPARK-21898] Feature parity for KolmogorovSmirnovTest in MLlib
[SPARK-10884] Support prediction on single instance for regression and classification related models
[SPARK-23783] Add new generic export trait for ML pipelines
[SPARK-11239] PMML export for ML linear regression
Programming guide: Machine Learning Library (MLlib) Guide.

SparkR
Major features
[SPARK-25393] Adding new function from_csv()
[SPARK-21291] add R partitionBy API in DataFrame
[SPARK-25007] Add array_intersect/array_except/array_union/shuffle to SparkR
[SPARK-25234] avoid integer overflow in parallelize
[SPARK-25117] Add EXCEPT ALL and INTERSECT ALL support in R
[SPARK-24537] Add array_remove / array_zip / map_from_arrays / array_distinct
[SPARK-24187] Add array_join function to SparkR
[SPARK-24331] Adding arrays_overlap, array_repeat, map_entries to SparkR
[SPARK-24198] Adding slice function to SparkR
[SPARK-24197] Adding array_sort function to SparkR
[SPARK-24185] add flatten function to SparkR
[SPARK-24069] Add array_min / array_max functions
[SPARK-24054] Add array_position function / element_at functions
[SPARK-23770] Add repartitionByRange API in SparkR
Programming guide: SparkR (R on Spark).

GraphX
Major features
[SPARK-25268] Fix a serialization exception thrown by runParallelPersonalizedPageRank
Programming guide: GraphX Programming Guide.

Deprecations
MLlib
[SPARK-23451] Deprecate KMeans computeCost
[SPARK-25345] Deprecate readImages APIs from ImageSchema
Changes of behavior

Spark Core
[SPARK-25088] Rest Server default & doc updates
Spark SQL
[SPARK-23549] Cast to timestamp when comparing timestamp with date
[SPARK-24324] Pandas Grouped Map UDF should assign result columns by name
[SPARK-23425] LOAD DATA for an HDFS file path with wildcards now works properly
[SPARK-23173] from_json can produce nulls for fields which are marked as non-nullable
[SPARK-24966] Implement precedence rules for set operations
[SPARK-25708] HAVING without GROUP BY should be global aggregate
[SPARK-24341] Correctly handle multi-value IN subquery
[SPARK-19724] Creating a managed table with an existing default location should throw an exception
Please read the Migration Guide for all the behavior changes.

Known Issues
[SPARK-25271] CTAS with Hive parquet tables should leverage native parquet source
[SPARK-24935] Problem with executing Hive UDAFs from Spark 2.2 onwards
[SPARK-25879] Schema pruning fails when a nested field and top level field are selected
[SPARK-25906] spark-shell cannot handle -i option correctly
[SPARK-25921] Python worker reuse causes Barrier tasks to run without BarrierTaskContext
[SPARK-25918] LOAD DATA LOCAL INPATH should handle a relative path

References:

http://spark.apache.org/releases/

