big data - zjfzjf2012's blog - CSDN Blog

big data

Average article quality score: 85

Articles: 41 · Views: 6936 · Favorites: 0

Author: zjfzjf2012

kudu - impala

- Choose a number of partitions equal to the number of cores in the cluster.
- Kudu optimizes SQL predicates that use =, <=, <, >, >=, BETWEEN, or IN, but not !=, LIKE, or any other predicate type.

Original · 2018-08-12 15:54:49 · 255 views · 0 comments
HBase MapReduce

- Data locality: under the block placement policy, the first copy is written to the data node where the region server runs.
- TableInputFormat: divides the table at region boundaries, by start row and end row.
static class ...

Translated · 2018-05-12 15:54:01 · 123 views · 0 comments
HBase Filters, Counters & Coprocessors

- Filter -> FilterBase; applied with the setFilter(filter) method on Get and Scan.
- CompareFilter: operator + comparator; matching data is kept.
CompareFilter(CompareOp valueCompareOp, WritableByteArrayComparable valu...

Translated · 2018-05-12 12:02:29 · 128 views · 0 comments
scala notes (6) - Annotation, Future and Type Parameter

- Annotation
class MyContainer[@specialized T]
def country: String @Localized
@Test(timeout = 0, expected = classOf[org.junit.Test.None])
def testSomeFeature() { ... }
Java annotation can be mixed with Sc...

Translated · 2018-04-27 15:34:52 · 115 views · 0 comments
spark - Pair RDD (Key/Value Pairs)

- Create Pair RDD
from a regular RDD by calling the map function.
val pairs = lines.map(x => (x.split(" ")(0), x))
transformations on a Pair RDD (data: {(1,2),(3,4),(3,6)})
reduceByKey => {(1,2), (3,10)}
grou...

Translated · 2018-04-27 10:24:24 · 346 views · 0 comments
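
The reduceByKey result quoted in the excerpt above can be reproduced without a cluster. The following is a minimal sketch using plain Scala collections rather than Spark; `PairRddSketch`, `reduceByKeyLocal` and the sample `lines` are hypothetical names and data introduced only for illustration, not part of the Spark API.

```scala
object PairRddSketch {
  // Hypothetical stand-in for RDD.reduceByKey: merge all values of a key with f.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Building key/value pairs the same way as the excerpt: first word is the key.
  val lines: Seq[String] = Seq("spark is fast", "scala is fun") // sample data (assumption)
  val pairs: Seq[(String, String)] = lines.map(x => (x.split(" ")(0), x))

  // The data set from the excerpt: {(1,2),(3,4),(3,6)}
  val data: Seq[(Int, Int)] = Seq((1, 2), (3, 4), (3, 6))
  val summed: Map[Int, Int] = reduceByKeyLocal(data)(_ + _) // Map(1 -> 2, 3 -> 10)
}
```

In real Spark the same merge function would be passed to `reduceByKey` on the pair RDD; the local version only mirrors the per-key semantics, not the distributed shuffle.
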
spark - Running on Cluster

- Package the Spark app (Maven)
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>

Translated · 2018-05-05 09:21:14 · 186 views · 0 comments
spark - Tuning and Debugging Spark

- Submit application (a SparkConf object cannot be changed after the SparkContext is created)
method 1
bin/spark-submit \
  --class com.example.MyApp \
  --master local[4] \
  --name "My Spark App" \
  --conf spark.ui.port=...

Translated · 2018-05-04 18:42:03 · 143 views · 0 comments
scala notes (5) - pattern and case class

- Pattern and Case Class
ch match {
  case _ if Character.isDigit(ch) => ...
  case '+' => ...
  case _ => ...
}
prefix match {
  case "0" | "0x" | "0X" => ...
}
A variable in a case pattern should start with a lowercase letter. ...

Translated · 2018-04-26 12:08:49 · 95 views · 0 comments
spark - Advanced Spark Programming

- Accumulator
val blankLines = new LongAccumulator
sc.register(blankLines)
Use accumulators inside transformations for debugging purposes only: because of speculative tasks, the value is not accurate. But in an action, the accum...

Translated · 2018-05-03 20:04:33 · 205 views · 0 comments
spark - Loading and Saving Data

- File Formats
Text File
sc.textFile: load a text file
sc.wholeTextFiles: load multiple files under the specified directory as (filename, entire content) pairs
JSON
sc.textFile.map to JSON object (people.add(mapper.readValue...

Translated · 2018-05-03 18:03:54 · 122 views · 0 comments
LSM Log-Structured Merge-Tree

- Sequential access is faster than random access -> WAL: append each update to a log.
- Keep data in memory for quick lookup -> Memstore, which flushes its data to a store file when it reaches a threshold.
- Merge multiple...

Translated · 2018-05-12 19:43:51 · 139 views · 0 comments
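
The first two points of the excerpt above (append to a log, buffer in a sorted memstore, flush at a threshold) can be illustrated with a toy store. This is only a sketch under stated assumptions: `TinyLsm`, `flushThreshold` and `segmentCount` are hypothetical names, the "store files" are in-memory maps rather than files on disk, and the merge (compaction) step that the excerpt truncates is not modelled.

```scala
import scala.collection.immutable.TreeMap
import scala.collection.mutable.ArrayBuffer

// Toy LSM-style store; every name here is hypothetical.
class TinyLsm(flushThreshold: Int) {
  // Write-ahead log: updates are only ever appended (sequential access).
  val wal: ArrayBuffer[(String, String)] = ArrayBuffer.empty[(String, String)]
  // Sorted in-memory buffer for quick lookups.
  private var memstore: TreeMap[String, String] = TreeMap.empty[String, String]
  // Flushed, immutable sorted segments, newest first (stand-ins for store files).
  private var segments: List[TreeMap[String, String]] = Nil

  def put(key: String, value: String): Unit = {
    wal += ((key, value))                     // 1. append to the log
    memstore = memstore.updated(key, value)   // 2. update the memstore
    if (memstore.size >= flushThreshold) flush()
  }

  // Move the full memstore into a new segment and start an empty one.
  private def flush(): Unit = {
    segments = memstore :: segments
    memstore = TreeMap.empty[String, String]
  }

  // Read path: memstore first, then segments from newest to oldest.
  def get(key: String): Option[String] =
    memstore.get(key).orElse(segments.collectFirst { case s if s.contains(key) => s(key) })

  def segmentCount: Int = segments.size
}
```

Because newer data shadows older segments on the read path, an overwritten key returns its latest value even before any merge has run.
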
bloom filter

- Space-efficient lookup for a fixed number of static elements.
- Answers are either "may contain" or "definitely does not contain".
n: number of elements
k: number of hash functions, k = (m/n)*ln2
m: number of bits, m >= n*lg(1/E)*lg(e)
E: expec...

Translated · 2018-05-07 13:07:58 · 107 views · 0 comments
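
As a worked example of the sizing formulas in the excerpt above, the sketch below computes m and k for a given n and E. It assumes that E (truncated in the excerpt) is the expected false-positive rate and uses the standard optimum k = (m/n)*ln2; `BloomSizing`, `bits` and `hashes` are hypothetical names.

```scala
object BloomSizing {
  private val Ln2 = math.log(2)

  // Bits needed: m >= n * lg(1/E) * lg(e), with lg = log base 2.
  def bits(n: Long, e: Double): Long = {
    val lgInvE = math.log(1 / e) / Ln2 // lg(1/E)
    val lgE = 1 / Ln2                  // lg(e), Euler's number
    math.ceil(n * lgInvE * lgE).toLong
  }

  // Optimal number of hash functions: k = (m / n) * ln 2, at least 1.
  def hashes(m: Long, n: Long): Int =
    math.max(1, math.round(m.toDouble / n * Ln2).toInt)
}
```

For one million elements at a 1% false-positive rate this gives roughly 9.6 bits per element and 7 hash functions.
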
cassandra, hbase and mongodb

- Cassandra: AP system, weak consistency, handles heavy writes, high availability, good for online use
- HBase: CP system, good support for batch analytics, good for analytics, not typical for online use
- MongoDB,...

Reposted · 2018-07-27 11:36:58 · 244 views · 0 comments
JVM trouble shooting

- jps, top and jstack
jps: find Java process info such as class name, parameters of main, JVM arguments and pid: jps -m -l
top: find the most CPU-bound thread: top -Hp pid
jstack: dump the thread stacks: jstac...

Reposted · 2018-07-10 18:01:33 · 216 views · 0 comments
Spark Trouble Shooting and Performance Tuning

jjjjj

Reposted · 2018-07-11 10:32:30 · 143 views · 0 comments
kafka

- Use cases
log collection
message system
user activity
stream processing
event sourcing
- Design
Kafka broker leader: multiple brokers contend to become leader by creating an ephemeral node in ZooKeeper. only on...

Original · 2018-07-16 17:40:38 · 209 views · 0 comments
submit spark code to yarn

- Configure Spark to submit code to a remote YARN cluster
val sparkConf = new SparkConf()
  .setAppName(s"Bulk Import $manualNbr")
  .setMaster("yarn")
  .set("deploy-mode", "client")
// ...

Original · 2018-05-27 16:11:37 · 245 views · 0 comments
HBase Concept

- Data model: a sparse, distributed, persistent multidimensional sorted map
(row:string, column:string, time:int64) -> string // both key and value are uninterpreted bytes
Row
single row read and update i...

Translated · 2018-05-08 21:23:51 · 140 views · 0 comments
compile spark source code

Change the Scala version to the one installed on your machine: ./dev/change-scala-version.sh <version>
Shut down zinc: ./build/zinc-<version>/bin/zinc -shutdown
Compile Spark: ./build/mvn -Pyarn ...

Reposted · 2018-05-25 18:12:01 · 184 views · 0 comments
scala notes (7) - Advanced Type and Implicit

- Advanced types
singleton type
def setTitle(title: String): this.type = { ...; this } // for subtypes
def set(obj: Title.type): this.type = { useNextArgAs = obj; this } // take object as parameter, no ...

Translated · 2018-04-29 22:48:12 · 118 views · 0 comments
scala notes (4) - collection

- Collection
Array is the equivalent of a Java array: its elements can be updated, but its size cannot change.
sequence
Vector is the immutable equivalent of ArrayBuffer, an indexed sequence with fast random acces...

Translated · 2018-04-25 18:04:21 · 112 views · 0 comments
MapReduce Types and Formats

- Types
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
- Partition (HashPartitioner)
public abstract class Partitioner<KEY, VALUE> {
public ...

Translated · 2018-04-21 19:45:47 · 83 views · 0 comments
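
The type signatures in the excerpt above can be made concrete with a word count. This is a sketch in plain Scala collections, not Hadoop code: `MapReduceTypes`, `mapFn`, `reduceFn`, `partition` and `run` are hypothetical names, the shuffle is simulated with `groupBy`, and the combiner is omitted. The partition function follows the usual hash-partitioning rule of masking the sign bit before taking the modulus.

```scala
object MapReduceTypes {
  // map: (K1, V1) → list(K2, V2); here (line offset, line) → list(word, 1)
  def mapFn(offset: Long, line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toList

  // reduce: (K2, list(V2)) → list(K3, V3); here (word, counts) → list(word, total)
  def reduceFn(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  // Hash partitioning: non-negative hash modulo the number of partitions.
  def partition(key: String, numPartitions: Int): Int =
    (key.hashCode & Int.MaxValue) % numPartitions

  // Simulated job: map every line, group by key (the shuffle), then reduce.
  def run(lines: List[String]): Map[String, Int] = {
    val mapped = lines.zipWithIndex.flatMap { case (l, i) => mapFn(i.toLong, l) }
    val shuffled = mapped.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    shuffled.toList.flatMap { case (k, vs) => reduceFn(k, vs) }.toMap
  }
}
```

Here K1 is the line offset, K2 and K3 are both the word, and V2 and V3 are both counts, which is why the same function could also serve as a combiner in this particular job.
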
MapReduce Features

- Counters (values are definitive only once the job has completed successfully)
Task Counters
Filesystem Counters
Job Counters (maintained only by the application master, so they do not need to be sent across the network; mainly about t...

Reposted · 2018-04-22 19:52:21 · 73 views · 0 comments
hadoop general

- Schema on read, versus the RDBMS schema on write
- Data flow
- Splits: the split size tends to equal the HDFS block size, so that a split does not span two nodes, which would hurt data locality
data locality. same node ->...

Translated · 2018-04-18 11:20:51 · 141 views · 0 comments
scala field overriding

• A def can only override another def.
• A val can only override another val or a parameterless def.
• A var can only override an abstract var.

Original · 2018-03-29 11:40:46 · 110 views · 0 comments
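
The three overriding rules above can be seen in one small class hierarchy. This is an illustrative sketch; `OverrideDemo`, `Shape` and `Square` are hypothetical names chosen for the example.

```scala
object OverrideDemo {
  abstract class Shape {
    def name: String        // abstract parameterless def
    def sides: Int = 0      // concrete parameterless def
    val id: Int = 0         // concrete val
    var label: String       // abstract var (no initializer)
  }

  class Square extends Shape {
    override def name: String = "square" // a def overrides another def
    override val sides: Int = 4           // a val overrides a parameterless def
    override val id: Int = 1              // a val overrides another val
    var label: String = "unlabeled"       // a var implements the abstract var
  }
}
```

Trying the reverse, for example overriding `val id` with a `def` or overriding a concrete `var`, is rejected by the compiler.
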
scala linearization

Scala linearization determines how super calls are resolved.
Scala initialization order is the reverse of the linearization.

Original · 2018-03-29 09:20:02 · 111 views · 0 comments
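
Both statements above can be observed with stacked traits. The sketch below is illustrative only; `LinearizationDemo`, `Base`, `A`, `B`, `C` and the `initOrder` buffer are hypothetical names. Each `describe` prepends its own name and delegates to `super`, and each constructor body records its name as it runs.

```scala
import scala.collection.mutable.ArrayBuffer

object LinearizationDemo {
  // Records the order in which trait and class bodies are initialized.
  val initOrder: ArrayBuffer[String] = ArrayBuffer.empty[String]

  trait Base {
    initOrder += "Base"
    def describe: String = "Base"
  }
  trait A extends Base {
    initOrder += "A"
    override def describe: String = "A -> " + super.describe
  }
  trait B extends Base {
    initOrder += "B"
    override def describe: String = "B -> " + super.describe
  }
  // Linearization of C (ignoring AnyRef and Any): C, B, A, Base
  class C extends A with B {
    initOrder += "C"
    override def describe: String = "C -> " + super.describe
  }
}
```

Creating one `new LinearizationDemo.C` and calling `describe` yields "C -> B -> A -> Base", following the linearization, while `initOrder` then holds Base, A, B, C, the reverse order.
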
scala ordered trait for subclass

How the Ordered trait can be applied to a subclass

Original · 2018-03-29 09:20:11 · 86 views · 0 comments
scala compilation to java class

how?

Original · 2018-04-01 15:10:46 · 106 views · 0 comments
cake pattern

Cake pattern in Scala
https://www.clianz.com/2016/04/26/scala-cake-pattern/

Reposted · 2018-03-30 09:56:19 · 247 views · 0 comments
scala type parameters

- Type bounds
class Pair[T <: Comparable[T]](val first: T, val second: T) {
  def smaller = if (first.compareTo(second) < 0) first else second // compareTo
}
class Pair[T](val first: T, val seco

Translated · 2018-04-23 18:23:37 · 645 views · 0 comments
YARN (Yet Another Resource Negotiator) - Cluster Manager

- What is YARN
- YARN application run
- Resource requests
all requests made up front (Spark), or requested dynamically (MapReduce: map task requests are made up front, but reduce task requests are dynamic)
- application lifes...

Translated · 2018-04-19 17:24:24 · 204 views · 0 comments
MapReduce Workflow

check the output folder
calculate splits
The application master gets progress and completion reports from tasks. It also requests containers for map tasks and reduce tasks. It starts containers through the nodemanage...

Translated · 2018-04-21 16:13:32 · 290 views · 0 comments
HBase Region Split

- Split policy (ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, SteppingSplitPolicy)
- Split point: the first row of the center block of the biggest file of the store
- Split Workflo...

Translated · 2018-05-09 17:55:27 · 152 views · 0 comments
MapReduce Application

- Configuration
conf.addDefaultResource, conf.addResource, configuration overridden
<property><name>fs.defaultFS</name><value>file:/// or hdfs://namenode</value></pr...

Translated · 2018-04-21 11:22:59 · 225 views · 0 comments
scala notes (3) - Files & Regular Expression, Trait, Operation and Function

- Files & Regular Expressions
read from a file, URL or string; remember to close the source
val source = Source.fromFile("myfile.txt", "UTF-8")
val source1 = Source.fromURL("http://horstmann.com", "UTF-8...

Translated · 2018-04-25 11:14:26 · 117 views · 0 comments
Hadoop I/O

- Checksum: CRC-32C, computed for every 512 bytes
write: the last datanode of the pipeline verifies the checksum
read: the client verifies blocks as it reads them
RawLocalFileSystem: used to disable checksums
- Compression (default is ...

Translated · 2018-04-20 15:11:40 · 90 views · 0 comments
scala notes (2) - Class, Object, Package & Import and Inheritance

- Class
class Counter {
  private var value = 0 // You must initialize the field; otherwise the class is abstract.
  def increment() { value += 1 } // Methods are public by default
  def current() ...

Translated · 2018-04-24 19:05:24 · 131 views · 0 comments
scala notes (1) - Basic, Control & Function, Array and Map & Tuple

- Basics
val greeting: String = null
val xmax, ymax = 100 // both are set
String -> StringOps // intersect, sorted...
Int -> RichInt // 1.to(10)
primitive -> Rich*
BigInt & BigDecimal // * can be...

Translated · 2018-04-24 12:02:34 · 109 views · 0 comments
HDFS

- Suitable for: very large files (terabytes, petabytes); write once, read many times; handling node failure without noticeable interruption
- Not suitable for some applications with: low-latency data access, HBa...

Translated · 2018-04-19 14:51:12 · 229 views · 0 comments
Programming with RDD

- Passing functions to Spark (be careful with references to the containing object, which then needs to be serializable)
class SearchFunctions(val query: String) {
  def isMatch(s: String): Boolean = {
    s.contains(...

Translated · 2018-04-23 18:51:18 · 80 views · 0 comments
