Building and Deploying Spark 1.6.0 (Scala 2.11)

On January 4, 2016, Spark 1.6.0 was published on the official website, so I downloaded and built it.


With the build experience from earlier versions and the Java dependencies already downloaded, compilation finished in about a minute; after a quick redeploy and reconfiguration everything was immediately OK. Remarkably efficient.


Compiling for Scala 2.11 still takes just one command, after first switching the source tree over with dev/change-scala-version.sh 2.11: build/sbt -Dscala-2.11 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver assembly




Testing one of the new features in Spark 1.6: the Dataset API.
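
Below is a minimal sketch of such a Dataset test, assuming a Spark 1.6 build with a SQLContext on hand; the Person case class and the sample rows are made up for illustration. Datasets offer typed, lambda-friendly transformations like RDDs while still running through the Spark SQL engine:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Case classes must be defined at the top level so an Encoder can be derived.
    case class Person(name: String, age: Long)

    object DatasetTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("DatasetTest"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Build a Dataset from a local collection; the encoder is derived implicitly.
        val ds = Seq(Person("Ann", 30), Person("Bob", 15)).toDS()

        // Typed lambdas, planned and optimized by the SQL engine.
        ds.filter(_.age >= 18).map(_.name).show()

        sc.stop()
      }
    }

Unlike a DataFrame, the field accesses here (_.age, _.name) are checked at compile time, which is the main draw of the Dataset API.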


The other new features in 1.6 include:

Spark Core/SQL

  • API Updates
    • SPARK-9999  Dataset API - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.
    • SPARK-10810 Session Management - Different users can share a cluster while having different configuration and temporary tables.
    • SPARK-11197 SQL Queries on Files - Concise syntax for running SQL queries over files of any supported format without registering a table (see the sketch after this list).
    • SPARK-11745 Reading non-standard JSON files - Added options to read non-standard JSON files (e.g. single-quotes, unquoted attributes)
    • SPARK-10412 Per-operator Metrics for SQL Execution - Display statistics on a per-operator basis for memory usage and spilled data size.
    • SPARK-11329 Star (*) expansion for StructTypes - Makes it easier to nest and unnest arbitrary numbers of columns
    • SPARK-4849  Advanced Layout of Cached Data - Stores partitioning and ordering schemes in the in-memory table scan, and adds distributeBy and localSort to the DataFrame API
    • SPARK-11778  - DataFrameReader.table supports specifying a database name. For example, sqlContext.read.table("dbName.tableName") can be used to create a DataFrame from a table called "tableName" in the database "dbName".
    • SPARK-10947  - With schema inference from JSON into a DataFrame, users can set primitivesAsString to true (in data source options) to infer all primitive value types as Strings. The default value of primitivesAsString is false (see the sketch after this list).
  • Performance
    • SPARK-10000 Unified Memory Management - Shared memory for execution and caching instead of exclusive division of the regions.
    • SPARK-11787 Parquet Performance - Improve Parquet scan performance when using flat schemas.
    • SPARK-9241  Improved query planner for queries having distinct aggregations - Query plans of distinct aggregations are more robust when distinct columns have high cardinality.
    • SPARK-9858  Adaptive query execution - Initial support for automatically selecting the number of reducers for joins and aggregations.
    • SPARK-10978 Avoiding double filters in Data Source API - When implementing a data source with filter pushdown, developers can now tell Spark SQL to avoid double evaluating a pushed-down filter.
    • SPARK-11111 Fast null-safe joins - Joins using null-safe equality (<=>) will now execute using SortMergeJoin instead of computing a Cartesian product.
    • SPARK-10917, SPARK-11149 In-memory Columnar Cache Performance - Significant (up to 14x) speed up when caching data that contains complex types in DataFrames or SQL.
    • SPARK-11389 SQL Execution Using Off-Heap Memory - Support for configuring query execution to occur using off-heap memory to avoid GC overhead
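
As referenced in the list, here is a minimal sketch of two of these SQL additions: querying a file in place (SPARK-11197) and JSON schema inference with primitivesAsString (SPARK-10947). The file paths are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object SqlOnFilesTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SqlOnFilesTest"))
        val sqlContext = new SQLContext(sc)

        // SPARK-11197: SQL over a file via format.`path`, no temp table needed.
        sqlContext.sql("SELECT * FROM parquet.`/data/users.parquet`").show()

        // SPARK-10947: infer every primitive JSON value as a String.
        val events = sqlContext.read
          .option("primitivesAsString", "true")
          .json("/data/events.json")
        events.printSchema()  // numeric and boolean fields all appear as string

        sc.stop()
      }
    }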

Spark Streaming

  • API Updates
    • SPARK-2629  New improved state management - mapWithState - a DStream transformation for stateful stream processing that supersedes updateStateByKey in functionality and performance (see the sketch after this list).
    • SPARK-11198 Kinesis record deaggregation - Kinesis streams have been upgraded to use KCL 1.4.0 and support transparent deaggregation of KPL-aggregated records.
    • SPARK-10891 Kinesis message handler function - Allows an arbitrary function to be applied to a Kinesis record in the Kinesis receiver to customize what data is stored in memory.
    • SPARK-6328  Python Streaming Listener API - Get streaming statistics (scheduling delays, batch processing times, etc.) from Python.
  • UI Improvements
    • Made failures visible in the streaming tab, in the timelines, batch list, and batch details page.
    • Made output operations visible in the streaming tab as progress bars.
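
As referenced in the list, a minimal word count using mapWithState (SPARK-2629); the socket source and the checkpoint directory are hypothetical:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    object MapWithStateTest {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("MapWithStateTest"), Seconds(5))
        ssc.checkpoint("/tmp/mws-checkpoint")  // stateful ops require checkpointing

        val words = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" ")).map(w => (w, 1))

        // Keep a running count per word. Unlike updateStateByKey, only keys
        // that appear in the current batch are touched, hence the speedup.
        val spec = StateSpec.function(
          (word: String, one: Option[Int], state: State[Int]) => {
            val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
            state.update(sum)
            (word, sum)
          })
        words.mapWithState(spec).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }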

MLlib

  • New algorithms/models
    • SPARK-8518  Survival analysis - Log-linear model for survival analysis
    • SPARK-9834  Normal equation for least squares - Normal equation solver, providing R-like model summary statistics
    • SPARK-3147  Online hypothesis testing - A/B testing in the Spark Streaming framework
    • SPARK-9930  New feature transformers - ChiSqSelector, QuantileDiscretizer, SQL transformer
    • SPARK-6517  Bisecting K-Means clustering - Fast top-down clustering variant of K-Means (see the sketch after this list)
  • API improvements
    • ML Pipelines
      • SPARK-6725  Pipeline persistence - Save/load for ML Pipelines, with partial coverage of spark.ml algorithms
      • SPARK-5565  LDA in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
    • R API
      • SPARK-9836  R-like statistics for GLMs - (Partial) R-like stats for ordinary least squares via summary(model)
      • SPARK-9681  Feature interactions in R formula - Interaction operator “:” in R formula
    • Python API - Many improvements to Python API to approach feature parity
  • Misc improvements
  • Documentation improvements
    • SPARK-7751  @since versions - Documentation includes initial version when classes and methods were added
    • SPARK-11337 Testable example code - Automated testing for code in user guide examples
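
As referenced in the list, a minimal sketch of bisecting k-means (SPARK-6517) on toy 2-D points; k = 2 and the data are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.BisectingKMeans
    import org.apache.spark.mllib.linalg.Vectors

    object BisectingKMeansTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("BisectingKMeansTest"))

        // Two obvious groups of 2-D points.
        val data = sc.parallelize(Seq(
          Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.2),
          Vectors.dense(9.0, 9.0), Vectors.dense(9.2, 9.1)))

        // Divisive clustering: start from one cluster and split until k is reached.
        val model = new BisectingKMeans().setK(2).run(data)
        println(s"Compute cost: ${model.computeCost(data)}")
        model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
          println(s"Center $i: $center")
        }

        sc.stop()
      }
    }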
