On January 4, 2016, Spark 1.6.0 was published on the official website, so I downloaded and built it.
With the build experience from earlier releases and the Java dependencies already on hand, the build finished in about a minute; after redeploying and adjusting the configuration, everything was up and running. Very efficient.
Building against Scala still takes just one command:
build/sbt -Dscala=2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly
I then tested the new features in Spark 1.6 (in particular the Dataset API).
The new features in 1.6 include:
Spark Core/SQL
- API Updates
- SPARK-9999 Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
- SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
- SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table.
- SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
- SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
- SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
- SPARK-4849 Advanced Layout of Cached Data - storing partitioning and ordering schemes in In-memory table scan, and adding distributeBy and localSort to DF API
- SPARK-11778 - DataFrameReader.table supports specifying database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
- SPARK-10947 - With schema inference from JSON into a Dataframe, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false.
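The Dataset API (SPARK-9999) above is the headline feature I wanted to try. A minimal sketch in spark-shell, assuming a local Spark 1.6 deployment; the Person case class and the sample data are made up for illustration:

```scala
import org.apache.spark.sql.SQLContext

// Assumes `sc` is an existing SparkContext (e.g. provided by spark-shell)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Hypothetical domain class for illustration
case class Person(name: String, age: Int)

// Encode a local collection as a Dataset and use RDD-style lambdas,
// while still benefiting from the Spark SQL execution engine
val ds = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()
val names = ds.filter(_.age > 30).map(_.name)
names.show()
```

Unlike a DataFrame, the filter and map above work on typed Person objects, so the compiler catches field-name mistakes.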
- Performance
- SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
- SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
- SPARK-9241 Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
- SPARK-9858 Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
- SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
- SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a cartesian product.
- SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
- SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
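Of the items above, SQL queries on files (SPARK-11197) is easy to demonstrate. A sketch assuming an existing SparkContext; the Parquet path is a placeholder:

```scala
import org.apache.spark.sql.SQLContext

// Assumes `sc` is an existing SparkContext
val sqlContext = new SQLContext(sc)

// Query a Parquet file directly by path, without registering a
// temporary table first; the file path here is a placeholder
val df = sqlContext.sql("SELECT * FROM parquet.`/path/to/events.parquet`")
df.show()
```

The format name before the backtick-quoted path can be any supported data source (parquet, json, etc.).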
Spark Streaming
- API Updates
- SPARK-2629 New improved state management - mapWithState, a DStream transformation for stateful stream processing, supersedes updateStateByKey in functionality and performance.
- SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
- SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
- SPARK-6328 Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) in streaming.
- UI Improvements
- Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
- Made output operations visible in the streaming tab as progress bars.
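The new mapWithState transformation (SPARK-2629) can be sketched as a running word count; this assumes an existing SparkContext, and the socket source on localhost:9999 is a placeholder:

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Assumes `sc` is an existing SparkContext; 1-second batches
val ssc = new StreamingContext(sc, Seconds(1))

// Placeholder input source producing (word, 1) pairs per batch
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" ")).map((_, 1))

// State update function: keep a running sum per key
val trackState = (key: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

// mapWithState only touches keys that appear in the current batch,
// which is where its performance win over updateStateByKey comes from
val runningCounts = words.mapWithState(StateSpec.function(trackState))
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```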
MLlib
- New algorithms/models
- SPARK-8518 Survival analysis - Log-linear model for survival analysis
- SPARK-9834 Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
- SPARK-3147 Online hypothesis testing - A/B testing in the Spark Streaming framework
- SPARK-9930 New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
- SPARK-6517 Bisecting K-Means clustering - Fast top-down clustering variant of K-Means
- API improvements
- ML Pipelines
- SPARK-6725 Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
- SPARK-5565 LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
- R API
- SPARK-9836 R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
- SPARK-9681 Feature interactions in R formula - Interaction operator “:” in R formula
- Python API - Many improvements to Python API to approach feature parity
- Misc improvements
- SPARK-7685, SPARK-9642 Instance weights for GLMs - Logistic and Linear Regression can take instance weights
- SPARK-10384, SPARK-10385 Univariate and bivariate statistics in DataFrames - Variance, stddev, correlations, etc.
- SPARK-10117 LIBSVM data source - LIBSVM as a SQL data source
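The LIBSVM data source (SPARK-10117) makes loading ML training data a one-liner. A sketch assuming an existing SparkContext; the file path is a placeholder:

```scala
import org.apache.spark.sql.SQLContext

// Assumes `sc` is an existing SparkContext
val sqlContext = new SQLContext(sc)

// Read a LIBSVM-formatted file as a DataFrame with (label, features)
// columns; the path here is a placeholder
val data = sqlContext.read.format("libsvm").load("/path/to/sample.libsvm")
data.printSchema()
```

The resulting DataFrame plugs directly into spark.ml Pipeline stages, which expect exactly this (label, features) schema.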
- Documentation improvements
- SPARK-7751 @since versions - Documentation includes initial version when classes and methods were added
- SPARK-11337 Testable example code - Automated testing for code in user guide examples