spark
分子美食家 (machine learning enthusiast)
pyspark-mongo-input-output
1. Create the connection between PySpark and MongoDB. First load the connector dependency; there are three ways to do this: 1) drop the jar directly into the jars directory of the Spark installation; 2) add the dependency information on the spark-submit command line; 3) add the dependency information when building the Spark session object, for example: spark = SparkSession.builder.appName('mongo connection').config("spark.mongodb.input.uri", "mongodb://host:port/dev.myCollection?… (2021-01-21)
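The same configuration keys apply from Scala; below is a minimal Scala sketch of option 3, assuming the mongo-spark-connector jar is already on the classpath. The host, port, collection name, and the data source class name (which varies with the connector version) are placeholders rather than values from the original post.

```scala
import org.apache.spark.sql.SparkSession

object MongoConnection {
  def main(args: Array[String]): Unit = {
    // Option 3: declare the MongoDB URIs while building the session.
    // "host:27017/dev.myCollection" is a placeholder, as in the post.
    val spark = SparkSession.builder()
      .appName("mongo connection")
      .config("spark.mongodb.input.uri", "mongodb://host:27017/dev.myCollection")
      .config("spark.mongodb.output.uri", "mongodb://host:27017/dev.myCollection")
      .getOrCreate()

    // Read the collection into a DataFrame through the connector's data source.
    val df = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .load()
    df.show()

    spark.stop()
  }
}
```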
Create a distributed matrix
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkContext, SparkConf}
object MatrixLearning { def main(args: Array[String]) { val mx = Matrices.dense(2, 3, Array(1, 2, 3, 4, 5, 6)… (2020-09-01)
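The excerpt cuts off mid-expression, so the block below is a reconstruction rather than the original listing: it builds the same 2x3 local dense matrix (note that Matrices.dense expects Double values) and, to match the post's title, also wraps an RDD of vectors into a distributed RowMatrix. Everything beyond the excerpt is an assumption.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.{SparkConf, SparkContext}

object MatrixLearning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("MatrixLearning")
    val sc = new SparkContext(conf)

    // Local dense matrix: 2 rows x 3 columns, filled column-major.
    val mx = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    println(mx)

    // Distributed matrix: an RDD of vectors wrapped in a RowMatrix.
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0)
    ))
    val rowMatrix = new RowMatrix(rows)
    println(s"rows = ${rowMatrix.numRows()}, cols = ${rowMatrix.numCols()}")

    sc.stop()
  }
}
```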
scala-spark read csv data
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkContext, SparkConf}
object labeledPointLoadlibSVMFile { def main(args: Array[String]) { val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName… (2020-09-01)
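A runnable sketch that continues the truncated excerpt above: it loads a libSVM-format file into an RDD of LabeledPoint with MLUtils.loadLibSVMFile. The input path is a placeholder.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object LabeledPointLoadLibSVMFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)

    // Load a file in libSVM format ("label index:value ...") into an RDD of LabeledPoint.
    // The path is a placeholder; point it at any libSVM-formatted file.
    val data: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    data.take(5).foreach(println)
    sc.stop()
  }
}
```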
scala-spark reduce,reduceByKey,sorted,lookup,take,saveAsTextFile
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark._
object Reduce_demo { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("Transformation1").setMaster("local") val spark = new SparkConte… (2020-09-01)
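The excerpt stops right after the SparkContext is created; the sketch below illustrates each operation named in the title (using sortByKey for the sorting step). The sample data and the output path are assumptions, not taken from the post.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.{SparkConf, SparkContext}

object ReduceDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Transformation1").setMaster("local")
    val sc = new SparkContext(conf)

    // reduce: an action that folds all elements into a single value.
    val nums = sc.parallelize(1 to 10)
    println(nums.reduce(_ + _)) // 55

    // reduceByKey: a transformation that merges values per key.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val summed = pairs.reduceByKey(_ + _)

    // sortByKey, lookup, take: sort by key, fetch values for one key, grab a few records.
    println(summed.sortByKey().collect().mkString(", "))
    println(summed.lookup("a"))      // Seq(4)
    println(summed.take(1).mkString) // first element only

    // saveAsTextFile: write the RDD out under a timestamped directory (placeholder path).
    val stamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(new Date())
    summed.saveAsTextFile(s"output/reduce_demo_$stamp")

    sc.stop()
  }
}
```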
spark-scala transformation union join distinct
import org.apache.spark._
import org.apache.spark.network.netty.SparkTransportConf
object Transformation { def main(args:Array[String]): Unit ={ val conf =new SparkConf().setAppName("Transformation1").setMaster("local") val spark=new SparkContex… (2020-09-01)
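A self-contained sketch of the three transformations in the title. The sample RDDs are invented for illustration, and the unused netty import from the excerpt is dropped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Transformation {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Transformation1").setMaster("local")
    val sc = new SparkContext(conf)

    val a = sc.parallelize(Seq(1, 2, 3, 3))
    val b = sc.parallelize(Seq(3, 4, 5))

    // union keeps duplicates; distinct removes them.
    println(a.union(b).collect().mkString(", "))            // 1, 2, 3, 3, 3, 4, 5
    println(a.union(b).distinct().collect().mkString(", ")) // deduplicated

    // join works on pair RDDs and matches rows on the key.
    val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val scores = sc.parallelize(Seq((1, 90), (2, 75), (3, 60)))
    users.join(scores).collect().foreach(println) // (1,(alice,90)), (2,(bob,75))

    sc.stop()
  }
}
```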
scala-hadoop-hdfs-spark interaction
import org.apache.spark._
//import java.util._;
import scala.util.Random
import java.text.SimpleDateFormat
import java.util.Date
import scala.math._
object RDDparallelizeSaveAsFile { def main(args:Array[String]) { // val conf = new SparkConf().s.… (2020-09-01)
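The excerpt ends before any HDFS interaction, so the following is only a sketch of what the object name suggests: parallelize a local collection, write it to HDFS with saveAsTextFile, and read it back. The namenode address and output path are placeholders.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object RDDParallelizeSaveAsFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("RDDParallelizeSaveAsFile")
    val sc = new SparkContext(conf)

    // Build a small RDD of random numbers on the driver and distribute it across 4 partitions.
    val data = sc.parallelize(Seq.fill(100)(Random.nextInt(1000)), numSlices = 4)

    // Write to HDFS; the namenode host/port and path are placeholders.
    val stamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(new Date())
    val path  = s"hdfs://namenode:9000/user/spark/demo_$stamp"
    data.saveAsTextFile(path)

    // Read the data back from HDFS to confirm the round trip.
    val readBack = sc.textFile(path)
    println(readBack.count())

    sc.stop()
  }
}
```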
scala-spark-wordcount
```scala
import org.apache.spark._
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object SparkWordCount {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage
```
… (2020-09-01)
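A complete word count in the same spirit as the truncated excerpt; the usage message, whitespace tokenization, and the take(20) at the end are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: SparkWordCount <input file>")
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("SparkWordCount")
    val sc = new SparkContext(conf)

    // Classic word count: split lines into words, pair each word with 1, sum per word.
    val counts = sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    counts.take(20).foreach { case (word, n) => println(s"$word\t$n") }
    sc.stop()
  }
}
```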
Getting the class name in Spark
object CollaborativeFilteringSpark {
  val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName().filter(!_.equals('$')))
  // println(this.getClass().getSimpleName().filter(!_.equals('$')))
  // set up the environment
  val sc = new SparkCon… (2020-09-01)
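The point of the filter is that for a Scala object the JVM reports the class name with a trailing '$' (e.g. "CollaborativeFilteringSpark$"); stripping it gives a clean application name. A minimal runnable sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollaborativeFilteringSpark {
  // For a Scala `object`, getClass.getSimpleName returns "CollaborativeFilteringSpark$";
  // filtering out '$' yields a clean application name.
  private val appName = this.getClass.getSimpleName.filter(!_.equals('$'))

  val conf = new SparkConf().setMaster("local").setAppName(appName)
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    println(appName) // CollaborativeFilteringSpark
    sc.stop()
  }
}
```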
Submitting a Scala job
Add the dependencies with Maven:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.o… (2020-09-01)
Scala error: 20/08/31 23:48:40 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 192.168.28.94, exec
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/mav… (2020-08-31)
PySpark core concepts
The RDD type. An RDD (Resilient Distributed Dataset) is Spark's core abstraction: a resilient, distributed collection that can be derived from data in stable storage through transformation operations. An RDD can be operated on in parallel, is partitioned when it is created, and the result of each RDD operation can be kept in cache, avoiding the frequent disk I/O of MapReduce. Operations on RDDs come in two kinds: transformations and actions. Transformations turn one RDD into another… (2020-07-04)
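To keep this page's code in a single language, here is the same transformation/action distinction sketched with Spark's Scala RDD API rather than PySpark; the data and operations are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("RddBasics"))

    val numbers = sc.parallelize(1 to 1000, numSlices = 4) // partitioned at creation time

    // Transformations are lazy: nothing runs yet, they only describe a new RDD.
    val evens   = numbers.filter(_ % 2 == 0)
    val squared = evens.map(n => n * n)

    // cache() keeps the result in memory so later actions reuse it instead of recomputing.
    squared.cache()

    // Actions trigger the actual computation and return a value to the driver.
    println(squared.count())
    println(squared.take(5).mkString(", "))

    sc.stop()
  }
}
```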
hive sql
DDL operations
• create table
• drop table
• alter table structure
• create/drop view
• create database
• show commands
Create table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], …)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]
[CLUSTERED BY (… (2020-06-10)
Exporting files from PySpark
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
df.toPandas().to_csv('mycsv.csv')
(2020-06-01)
The process of submitting a Spark job
submit.sh
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF l… (2020-06-01)
Common PySpark APIs
union and unionAll: union concatenates DataFrames vertically. In this Spark article, you will learn how to union two or more DataFrames of the same schema to append one DataFrame to another, or merge two DataFrames, and the difference between union and unionAll, with Scala examples. Data… (2020-05-21)
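A short Scala sketch of the behaviour described above, with invented sample data: in Spark 2.x, DataFrame.union keeps duplicate rows (it behaves like SQL UNION ALL, and unionAll itself is deprecated), so distinct() is added when deduplication is wanted.

```scala
import org.apache.spark.sql.SparkSession

object UnionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("UnionExample").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("alice", 90), ("bob", 75)).toDF("name", "score")
    val df2 = Seq(("bob", 75), ("carol", 60)).toDF("name", "score")

    // union appends the rows of two DataFrames with the same schema; duplicates are kept.
    val appended = df1.union(df2)
    appended.show()

    // To drop duplicate rows after the union, add distinct().
    appended.distinct().show()

    spark.stop()
  }
}
```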
Hive setup error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
Hive setup error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" ja… (2020-01-13)
Installing and Running Hadoop and Spark on Ubuntu 18
Installing and Running Hadoop and Spark on Ubuntu 18
#hadoop #spark #java #scala
Hadoop & Spark (4 Part Series)
This is a short guide (updated … (2020-01-09)
Spark operations
The Spark DataFrame is built on top of the RDD class but offers far richer data manipulation, chiefly SQL-like operations. A situation that comes up in real work is filtering two datasets, merging them, and writing the result back to storage. First load the datasets; when pulling just the first few rows, the limit function is the one to use. Merging is done with union, and writing back means registering the DataFrame as a table with registerTempTable and then writing it into Hive. The DataFrame API is impressively powerful… (2020-01-06)
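A Scala sketch of that workflow under stated assumptions: the input paths, column names, and Hive table name are placeholders, Hive support is enabled on the session, and createOrReplaceTempView is used in place of the older registerTempTable mentioned in the post.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object DataFrameOps {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport is needed for writing into Hive-managed tables.
    val spark = SparkSession.builder()
      .appName("DataFrameOps")
      .enableHiveSupport()
      .getOrCreate()

    // Load two datasets with the same schema (paths are placeholders).
    val dfA = spark.read.parquet("/data/events_a")
    val dfB = spark.read.parquet("/data/events_b")

    // Peek at the first few rows with limit, then merge the two sets with union.
    dfA.limit(5).show()
    val merged = dfA.union(dfB)

    // Register as a temporary view for SQL-style filtering.
    // (registerTempTable is the older name; newer Spark prefers createOrReplaceTempView.)
    merged.createOrReplaceTempView("merged_events")
    val filtered = spark.sql("SELECT * FROM merged_events WHERE event_type = 'click'")

    // Write the result back into a Hive table (table name is a placeholder).
    filtered.write.mode(SaveMode.Overwrite).saveAsTable("dw.merged_click_events")

    spark.stop()
  }
}
```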