Spark SQL configuration

# exported with:
spark.sql("SET -v").show(n=200, truncate=False)
| key | value | meaning |
|-----|-------|---------|
| spark.sql.adaptive.enabled | false | When true, enable adaptive query execution. |
| spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 67108864b | The target post-shuffle input size in bytes of a task. |
| spark.sql.autoBroadcastJoinThreshold | 10485760 | Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Setting this value to -1 disables broadcasting. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run, and for file-based data source tables where the statistics are computed directly on the data files. |
| spark.sql.broadcastTimeout | 300 | Timeout in seconds for the broadcast wait time in broadcast joins. |
| spark.sql.cbo.enabled | false | Enables CBO for estimation of plan statistics when set to true. |
| spark.sql.cbo.joinReorder.dp.star.filter | false | Applies star-join filter heuristics to cost-based join enumeration. |
| spark.sql.cbo.joinReorder.dp.threshold | 12 | The maximum number of joined nodes allowed in the dynamic programming algorithm. |
| spark.sql.cbo.joinReorder.enabled | false | Enables join reordering in CBO. |
| spark.sql.cbo.starSchemaDetection | false | When true, enables join reordering based on star schema detection. |
| spark.sql.columnNameOfCorruptRecord | _corrupt_record | The name of the internal column for storing raw/unparsed JSON and CSV records that fail to parse. |
| spark.sql.crossJoin.enabled | false | When false, an error is thrown if a query contains a cartesian product without explicit CROSS JOIN syntax. |
| spark.sql.extensions | | Name of the class used to configure Spark Session extensions. The class should implement Function1[SparkSessionExtension, Unit], and must have a no-args constructor. |
| spark.sql.files.ignoreCorruptFiles | false | Whether to ignore corrupt files. If true, Spark jobs will continue to run when encountering corrupted files, and the contents that have been read will still be returned. |
| spark.sql.files.maxPartitionBytes | 134217728 | The maximum number of bytes to pack into a single partition when reading files. |
| spark.sql.files.maxRecordsPerFile | 0 | Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit. |
| spark.sql.groupByAliases | true | When true, aliases in a select list can be used in group by clauses. When false, an analysis exception is thrown in that case. |
| spark.sql.groupByOrdinal | true | When true, the ordinal numbers in group by clauses are treated as the position in the select list. When false, the ordinal numbers are ignored. |
| spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive-compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names, or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring). |
| spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled. |
| spark.sql.hive.manageFilesourcePartitions | true | When true, enables metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning. |
| spark.sql.hive.metastorePartitionPruning | true | When true, some predicates will be pushed down into the Hive metastore so that non-matching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information). |
| spark.sql.hive.thriftServer.singleSession | false | When set to true, the Hive Thrift server runs in single-session mode. All JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database. |
| spark.sql.hive.verifyPartitionPath | false | When true, check all the partition paths under the table's root directory when reading data stored in HDFS. |
| spark.sql.optimizer.metadataOnly | true | When true, enables the metadata-only query optimization that uses the table's metadata to produce the partition columns instead of table scans. It applies when all the columns scanned are partition columns and the query has an aggregate operator that satisfies distinct semantics. |
| spark.sql.orc.filterPushdown | false | When true, enables filter pushdown for ORC files. |
| spark.sql.orderByOrdinal | true | When true, the ordinal numbers are treated as the position in the select list. When false, the ordinal numbers in order/sort by clauses are ignored. |
| spark.sql.parquet.binaryAsString | false | Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems. |
| spark.sql.parquet.cacheMetadata | true | Turns on caching of Parquet schema metadata. Can speed up querying of static data. |
| spark.sql.parquet.compression.codec | snappy | Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo. |
| spark.sql.parquet.enableVectorizedReader | true | Enables vectorized Parquet decoding. |
| spark.sql.parquet.filterPushdown | true | Enables Parquet filter push-down optimization when set to true. |
| spark.sql.parquet.int64AsTimestampMillis | false | When true, timestamp values will be stored as INT64 with TIMESTAMP_MILLIS as the extended type. In this mode, the microsecond portion of the timestamp value will be truncated. |
| spark.sql.parquet.int96AsTimestamp | true | Some Parquet-producing systems, in particular Impala, store timestamps as INT96. Spark also stores timestamps as INT96 to avoid precision loss in the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems. |
| spark.sql.parquet.mergeSchema | false | When true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file, or a random data file if no summary file is available. |
| spark.sql.parquet.respectSummaryFiles | false | When true, we assume that all part-files of Parquet are consistent with the summary files and ignore them when merging schemas. Otherwise, if this is false (the default), we merge all part-files. This should be considered an expert-only option, and shouldn't be enabled before knowing exactly what it means. |
| spark.sql.parquet.writeLegacyFormat | false | Whether to follow Parquet's format specification when converting a Parquet schema to a Spark SQL schema and vice versa. |
| spark.sql.pivotMaxValues | 10000 | When doing a pivot without specifying values for the pivot column, this is the maximum number of (distinct) values that will be collected without error. |
| spark.sql.session.timeZone | Etc/UTC | The ID of the session-local timezone, e.g. "GMT", "America/Los_Angeles", etc. |
| spark.sql.shuffle.partitions | 80 | The default number of partitions to use when shuffling data for joins or aggregations. |
| spark.sql.sources.bucketing.enabled | true | When false, bucketed tables are treated as normal tables. |
| spark.sql.sources.default | parquet | The default data source to use in input/output. |
| spark.sql.sources.parallelPartitionDiscovery.threshold | 32 | The maximum number of paths allowed for listing files at the driver side. If the number of detected paths exceeds this value during partition discovery, it tries to list the files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and LibSVM data sources. |
| spark.sql.sources.partitionColumnTypeInference.enabled | true | When true, automatically infer the data types for partitioned columns. |
| spark.sql.statistics.fallBackToHdfs | false | If the table statistics are not available from the table metadata, enable falling back to HDFS. This is useful in determining whether a table is small enough to use auto broadcast joins. |
| spark.sql.streaming.checkpointLocation | | The default location for storing checkpoint data for streaming queries. |
| spark.sql.streaming.metricsEnabled | false | Whether Dropwizard/Codahale metrics will be reported for active streaming queries. |
| spark.sql.streaming.numRecentProgressUpdates | 100 | The number of progress updates to retain for a streaming query. |
| spark.sql.thriftserver.scheduler.pool | | Set a Fair Scheduler pool for a JDBC client session. |
| spark.sql.thriftserver.ui.retainedSessions | 200 | The number of SQL client sessions kept in the JDBC/ODBC web UI history. |
| spark.sql.thriftserver.ui.retainedStatements | 200 | The number of SQL statements kept in the JDBC/ODBC web UI history. |
| spark.sql.variable.substitute | true | This enables substitution using syntax like ${var}, ${system:var} and ${env:var}. |
| spark.sql.warehouse.dir | file:/home/buildbot/datacalc/spark-warehouse/ | The default location for managed databases and tables. |
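
Most of these entries can be set when the session is built or changed at runtime. A minimal sketch, assuming PySpark is installed; the values shown are illustrative, not recommendations:

# values below are illustrative only
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-conf-demo")
         .config("spark.sql.shuffle.partitions", "80")      # shuffle partitions for joins/aggregations
         .config("spark.sql.session.timeZone", "Etc/UTC")   # session-local time zone
         .getOrCreate())

# runtime changes to SQL configs take effect for subsequent queries
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")   # disable broadcast joins
spark.sql("SET spark.sql.cbo.enabled=true")                    # the SQL SET syntax works as well
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  # prints -1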

Other Spark SQL configuration references:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
https://github.com/unnunique/Conclusions/blob/master/AADocs/bigdata-docs/compute-components-docs/sparkbasic-docs/standalone.md
