Big Data
zhixingheyi_tian
Intel Big Data. Spark
Data Warehouse Topics
In Alibaba's data architecture, we recommend dividing the data warehouse into three layers, from bottom to top: the data staging layer (ODS, Operational Data Store), the common data layer (CDM, Common Data Model), and the application data layer (ADS, Application Data Service). The common summary fact layer (DWS) takes the analysis subject area as the modeling driver: based on the metric requirements of upstream applications and products, it builds summary fact tables at a common granularity, physically modeled as wide tables, which reduces the risk of inconsistent calculation logic and algorithms. Tables in the common dimension layer are also called logical dimension tables; dimensions and dimension logic tables usually correspond one-to-one. Original · 2023-11-16 16:42:18 · 1275 reads · 0 comments
Introduction to SIMD
AVX is a new instruction set introduced with the Sandy Bridge and Larrabee architectures. It widens SIMD (Single Instruction, Multiple Data) operations from 128 bits to 256 bits. Because Sandy Bridge extends the SIMD execution units to 256 bits while also improving data transfer, peak floating-point performance per CPU core theoretically doubles. Intel AVX enhances SIMD compute performance while remaining compatible with the earlier MMX/SSE instruction sets; unlike MMX/SSE, however, the enhanced AVX instructions differ substantially even in their encoding format... Original · 2021-10-18 15:56:28 · 4842 reads · 0 comments
Spark: SparkContext
Initializing Spark: The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf o... Original · 2018-12-20 09:33:09 · 219 reads · 1 comment
Spark Shared Variables
Broadcast variables and accumulators: Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variable... Original · 2018-12-20 16:45:57 · 167 reads · 0 comments
Exploring Dataset
Overview: Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs were replaced by Dataset, which is strongly typed like an RDD, but with r... Original · 2018-12-20 17:24:43 · 261 reads · 0 comments
Installing ZooKeeper
Cluster install. Edit the configuration file: download zookeeper-3.4.13.tar.gz, extract it, and enter the conf directory. Run cp zoo_sample.cfg zoo.cfg, then edit zoo.cfg: # The number of milliseconds of each tick tickTime=2000 # The number of ticks that the initial # synchroniza... Original · 2018-12-26 15:50:55 · 157 reads · 0 comments
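For a three-node ensemble, a complete zoo.cfg typically looks like the sketch below. The hostnames (zk1..zk3) and the data directory are placeholders, not values from the original post:

```properties
# zoo.cfg -- minimal 3-node ensemble sketch (hostnames are placeholders)
tickTime=2000        # milliseconds per tick
initLimit=10         # ticks a follower may take to connect and sync with the leader
syncLimit=5          # ticks a follower may lag behind before being dropped
dataDir=/var/lib/zookeeper
clientPort=2181
# server.<myid>=<host>:<peer-port>:<leader-election-port>
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Each server additionally needs a `myid` file under `dataDir` containing its own id (1, 2, or 3) so it can find itself in the `server.N` list.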
What Triggers a Spark Job
A telltale sign that an RDD operation is an action is that its implementation calls sc.runJob. Original · 2018-11-16 16:49:31 · 635 reads · 0 comments
Installing Kafka
Single-node install: see the official docs. Distributed install: download the latest stable Kafka from http://mirror.bit.edu.cn/apache/kafka/2.1.0/kafka_2.11-2.1.0.tgz, extract it, and edit config/server.properties, changing two entries: broker.id=541 and zookeeper.connect=sr541:2181,sr553:2181,sr554... Original · 2018-12-26 16:12:04 · 100 reads · 0 comments
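Put together, the per-broker changes look roughly like the fragment below; the broker id and the first two hostnames follow the post's sr541/sr553/sr554 naming, while the third port and the log directory are assumptions:

```properties
# config/server.properties -- minimal per-broker changes (sketch)
broker.id=541                                       # unique integer per broker
zookeeper.connect=sr541:2181,sr553:2181,sr554:2181  # shared ZooKeeper ensemble
log.dirs=/tmp/kafka-logs                            # placeholder data directory
```

Every broker in the cluster points at the same `zookeeper.connect` string; only `broker.id` differs between machines.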
Reading and Writing Parquet in Spark
SQLConf: // This is used to set the default data source val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default") .doc("The default data source to use in input/output.") .stringCo... Original · 2018-12-10 15:47:41 · 2984 reads · 1 comment
Spark: FileFormat
Every FileFormat implements inferSchema, but it is called only once, at initialization time. ParquetFileFormat: Spark obtains a Parquet file's schema by launching a job. /** * Figures out a merged Parquet schema with a distributed Spark job. * * Note that lo... Original · 2018-12-17 11:43:04 · 462 reads · 0 comments
spark-shell
spark-shell is just a script that delegates to spark-submit: function main() { if $cygwin; then # Workaround for issue involving JLine and Cygwin # (see http://sourceforge.net/p/jline/bugs/40/). # If you're usi... Original · 2018-12-23 11:43:12 · 356 reads · 0 comments
Exploring RDDs
Overview: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. RDD: The main abstraction Spark provi... Original · 2018-12-19 15:28:23 · 235 reads · 1 comment
spark on yarn
Apache Hadoop YARN concept: The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons... Original · 2018-12-13 12:53:55 · 114 reads · 0 comments
Spark Basic Concepts
Datasets and DataFrames: A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda f... Original · 2018-11-16 17:12:55 · 112 reads · 0 comments
Algorithm Notes
An ordinary hash scheme, such as modulo arithmetic, takes the key modulo the number of server nodes. Consistent hashing instead takes the hash modulo 2^32: the entire hash space is organized as a virtual ring. Suppose a hash function H has the value space 0 to 2^32-1 (hashes are 32-bit unsigned integers). The ring is traversed clockwise, with 0 at the top, 1 as the first point to its right, then 2, 3, 4, 5, 6... up to 2^32-1; we call this ring of 2^32 points... Original · 2018-11-19 10:52:18 · 213 reads · 1 comment
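The clockwise-lookup rule above can be sketched in a few lines of Python. This is an illustrative toy (no virtual nodes, MD5 chosen arbitrarily as H), not production code:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Map a key onto the 0 .. 2**32 - 1 ring described above.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    """Clockwise lookup on a 2**32-point hash ring."""

    def __init__(self, nodes=()):
        self._hashes = []   # sorted positions on the ring
        self._nodes = {}    # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        pos = ring_hash(node)
        bisect.insort(self._hashes, pos)
        self._nodes[pos] = node

    def remove(self, node: str) -> None:
        pos = ring_hash(node)
        self._hashes.remove(pos)
        del self._nodes[pos]

    def lookup(self, key: str) -> str:
        # First node clockwise from the key; wrap past 2**32 - 1 back to 0.
        idx = bisect.bisect_left(self._hashes, ring_hash(key)) % len(self._hashes)
        return self._nodes[self._hashes[idx]]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
owner = ring.lookup("user:42")   # one of the three servers
```

The payoff over plain `hash(key) % n`: removing one node only remaps the keys that node owned; every other key keeps its current server.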
Submitting a Spark Job
On the driver side, job submission involves the following work: analyze RDD dependencies to build the DAG; split the job into stages according to the DAG; once the stages are determined, generate the corresponding tasks and dispatch them to executors. The entry point of the implementation is in SparkContext.scala: /** * Run a job on all partitions in an RDD and return t... Original · 2018-11-19 14:48:03 · 195 reads · 1 comment
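The "split the job into stages" step can be illustrated with a toy model (not Spark's actual scheduler code): represent the lineage as a list of operations flagged as wide (shuffle) or narrow, and cut a new stage at each shuffle boundary:

```python
# Toy lineage: (operation, needs_shuffle). Wide dependencies start a new
# stage, mirroring how the DAG is cut at shuffle boundaries.
lineage = [
    ("textFile", False),
    ("map", False),
    ("reduceByKey", True),
    ("map", False),
    ("sortByKey", True),
    ("collect", False),
]

def split_into_stages(ops):
    stages, current = [], []
    for name, needs_shuffle in ops:
        if needs_shuffle and current:
            stages.append(current)   # close the stage feeding this shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(lineage)  # two shuffles -> two cuts -> 3 stages
```

Each resulting stage is a pipeline of narrow operations that can run as one set of tasks without moving data between executors.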
Parquet Notes
Glossary: Block (hdfs block): This means a block in hdfs, and the meaning is unchanged for describing this file format. The file format is designed to work well on top of hdfs. File: A hdfs file th... Original · 2018-11-26 15:36:22 · 125 reads · 1 comment
Dataset schema
/** * Returns the schema of this Dataset. * * @group basic * @since 1.6.0 */ def schema: StructType = queryExecution.analyzed.schema Original · 2018-12-04 13:20:02 · 562 reads · 0 comments
Scala: the case keyword
Benefits of declaring a class with case: it creates the case class together with its companion object; the companion implements apply, so you can create instances without new; every parameter in the primary constructor's parameter list gets val by default; natural hashCode, equals, and toString methods are added. Since == in Scala always means equals, case class instances are always comparable. The following three operations are equivalent: val ... Original · 2018-12-04 14:16:02 · 626 reads · 0 comments
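A Scala example cannot run here, but Python's @dataclass provides an analogous convenience bundle (generated constructor, structural equality, hashing, readable repr), which makes the list above concrete; Point and its fields are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)   # frozen -> fields are read-only, like `val` params
class Point:
    x: int
    y: int

p1 = Point(1, 2)          # generated constructor; no `new` needed
p2 = Point(1, 2)
# Structural equality, hashing, and repr come for free, as with a case class.
```

As with Scala's ==, equality here compares field values rather than object identity.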
Scala: the lazy keyword
Start with an example: scala> val a = { println("I'am a"); "aaa"} I'am a a: String = aaa scala> a res8: String = aaa scala> lazy val a = { println("I'am a"); "aaa"} Original · 2018-11-28 11:46:21 · 229 reads · 0 comments
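The deferred-and-cached semantics of lazy val can be sketched in Python with a hand-rolled Lazy wrapper (invented for illustration): the thunk does not run at definition time, runs on first access, and its result is cached afterwards:

```python
class Lazy:
    """Evaluate a zero-argument function on first access, then cache it."""

    def __init__(self, thunk):
        self._thunk = thunk
        self._done = False
        self._value = None

    def get(self):
        if not self._done:        # first access: run the body (side effect fires)
            self._value = self._thunk()
            self._done = True
        return self._value        # later accesses: cached value, no re-run

log = []

def make_a():
    log.append("I'am a")          # mirrors the println in the REPL session
    return "aaa"

a = Lazy(make_a)                  # nothing evaluated yet, unlike a plain val
```

In the REPL session above, the plain val prints "I'am a" immediately; the lazy val would print it only when `a` is first referenced.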
Steps to Implement a Spark DataSourceV2
Extend DataSourceV2: class SimpleWritableDataSource extends DataSourceV2 with ReadSupport with WriteSupport { override def createReader() override def createWriter()} Construct the Reader: class Reader(path: St... Original · 2018-11-28 14:21:01 · 748 reads · 0 comments
Spark Deploy Modes and Launched Processes
Spark Standalone Mode: Launching Spark Applications: The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster. For standalone cl... Original · 2018-12-12 15:45:56 · 518 reads · 0 comments
OAP read parquet
Spark 2.1, FileScanRDD: private def nextIterator(): Boolean = {...currentIterator = readFunction(currentFile)...} OptimizedParquetFileFormat: override def buildReaderWithPartitionValues( sparkS... Original · 2018-12-12 13:26:02 · 323 reads · 0 comments
Spark Components
SparkEnv: SparkEnv is Spark's execution-environment object; one lives in the driver process and one in each executor process. BlockManager: Both the driver application and the executors create a BlockManager. Manager running on every node (driver and executors) which provides interfaces... Original · 2018-12-18 15:05:15 · 154 reads · 0 comments
OAP ParquetDataFile and Cache
ParquetDataFile.scala: val iterator = reader.iteratorWithRowIds(requiredIds, rowIds) .asInstanceOf[OapCompletionIterator[InternalRow]] val result = ArrayBuffer[Int]() while (iterator.hasN... Original · 2019-02-01 16:17:34 · 177 reads · 0 comments
The Kubernetes Operator for Apache Spark (spark-on-k8s-operator)
Kubernetes Operator for Apache Spark Design. Introduction: In Spark 2.3, Kubernetes became an official scheduler backend for Spark, in addition to the standalone scheduler, Mesos, and YARN. Compared w... Original · 2019-04-16 20:23:02 · 767 reads · 0 comments
spark start-thriftserver.sh & Kubernetes
Launch command: sh sbin/start-thriftserver.sh --master k8s://https://192.168.99.108:8443 --name spark-thriftserver --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=zhixingheyitian/spark... Original · 2019-04-19 14:05:40 · 1321 reads · 1 comment
spark sql examples on kubernetes
Submit SQL to the Thrift server via Beeline. Run the Thrift server in a pod: sh sbin/start-thriftserver.sh \ --master k8s://https://kubernetes.default.svc.cluster.local:443 \ --name spark-thriftserver \ ... Original · 2019-05-07 20:51:37 · 582 reads · 0 comments
OAP UI for Different Storage Media
bin/spark-sql \ --master k8s://https://192.168.99.108:8443 \ --deploy-mode client \ --name spark-sql \ --conf spark.executor.instances=2 \ --conf spark.kubernetes.container.image=z... Original · 2019-05-13 16:49:52 · 169 reads · 0 comments
Repairing a Hadoop Cluster
Repairing a single datanode. First set up passwordless login: append the contents of the namenode's public key ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on the datanode. Then update the configuration: in $HADOOP_HOME/etc/slaves, if the datanode's IP changed, update slaves; in /etc/hosts, if the datanode's IP changed, update hosts... Original · 2019-06-22 09:27:20 · 206 reads · 0 comments
OAP FileFormat
OAP File: // OAP Data File V1 Meta Part // .. // Field Length In Byte // Meta // Magic and Version 4 // Row Count In Each Row Group 4 // ... Original · 2019-06-18 10:38:36 · 340 reads · 0 comments
Spark on Yarn
Deploy modes: There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on t... Original · 2019-07-06 10:01:53 · 116 reads · 0 comments
Experiments with Spark Deploy Modes
YARN cluster mode: the driver runs inside the AM process on a NodeManager; driver stdout... Original · 2019-08-29 10:28:10 · 822 reads · 0 comments
Spark Source Analysis: SparkSubmit.scala
clusterManager, deployMode: //Spark 2.3.2 SparkSubmit.scala private def doPrepareSubmitEnvironment( args: SparkSubmitArguments, conf: Option[HadoopConfiguration] = None) : (Seq[Stri... Original · 2019-04-16 15:36:53 · 332 reads · 1 comment
Notes on Spark, Hadoop, and YARN Clusters
Running jps on the Hadoop master node shows these processes. Hadoop: SecondaryNameNode, NameNode. YARN: ResourceManager. Spark's Job History service (started with sbin/start-history-server.sh): HistoryServer. Hive: RunJar... Original · 2019-04-09 17:07:29 · 132 reads · 0 comments
implement a spark-sql case of separating computation and storage using Kubernetes
Prerequisites: Set up a single-node Kubernetes (minikube) with --cpus 8 --memory 8192. Build and push the Spark 2.4.1 image. Put hive-site.xml in the conf dir. Run: bin/spark-sql \ --master k8s://https:... Original · 2019-04-12 16:57:47 · 114 reads · 0 comments
Databricks IO (DBIO) cache
Databricks IO Cache: The Databricks IO cache accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. The data is cached automatically whe... Original · 2019-01-04 09:40:54 · 596 reads · 0 comments
Spark: Strategy
package object sql { /** * Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting * with the query planner and is not designed to be stable across spark ... Original · 2019-01-05 21:21:08 · 426 reads · 0 comments
Physical Query Operator
BinaryExecNode: a binary physical operator with two children, the left and right physical operators. LeafExecNode: a leaf physical operator with no children. By default, the set of all attributes that are produce... Original · 2019-01-12 15:31:56 · 208 reads · 0 comments
FileSourceScanExec
FileSourceScanExec is a leaf physical operator (as a DataSourceScanExec) that represents a scan over collections of files (incl. Hive tables). FileSourceScanExec is created exclusively for a Logic... Original · 2019-01-13 10:18:44 · 466 reads · 0 comments