Importing data into Elasticsearch with Spark

The previous post gave a brief introduction to Elasticsearch; now it's time to actually use it. When you are facing massive amounts of data, you obviously cannot load documents one at a time as before, so I use Spark to bulk-load the data into ES. Third-party dependency package: [download link placeholder in the original post]
You can download the matching dependency package there, unpack it and use it. PS: always keep the version correspondence between Spark, Scala and ES in mind. I use Spark 1.6.2 with Scala 2.10.4, and the Spark-ES jar elasticsearch-spark_2.10-2.3.3, which was compiled against Scala 2.10.5. Not paying attention to this at first produced the following error: java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
With that out of the way, here is how to use spark-es:
Add the dependency jar to the Spark classpath:
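One way is simply to start the shell with the jar on the classpath, e.g. spark-shell --jars elasticsearch-spark_2.10-2.3.3.jar. If the project is built with sbt instead, the coordinates would look roughly like this (group/artifact names are assumed from the jar name above):

// build.sbt: a minimal sketch, assuming the connector is pulled from Maven Central
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"               % "1.6.2" % "provided",
  "org.elasticsearch" %  "elasticsearch-spark_2.10" % "2.3.3"
)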
When Spark talks to ES, the connection parameters can be configured in two ways:
1) Set them on the SparkConf:

val conf = new SparkConf().setAppName("loadToES")
conf.set("es.nodes", "192.168.0.23")
conf.set("es.port", "9200")
conf.set("es.cluster.name", "elasticsearch")
val sc = new SparkContext(conf)

2) Pass them as a Map to each individual read/write call:

val configuration = Map("es.nodes" -> "192.168.0.23", "es.port" -> "9200", "es.cluster.name" -> "elasticsearch")
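Either way the settings reach the connector. A minimal sketch of passing the Map from 2) explicitly to a write call (the index name es/info is only an example):

import org.elasticsearch.spark.rdd.EsSpark

// each Map in the RDD becomes one document in the es/info index
val docs = sc.makeRDD(Seq(Map("name" -> "test", "value" -> 1)))
EsSpark.saveToEs(docs, "es/info", configuration)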

Reading from and writing to ES with Spark goes through two entry points:
EsSpark and EsSparkSQL, which work with Spark RDDs and Spark DataFrames respectively.
EsSpark (RDD) reading:
1) def esRDD(sc: SparkContext): RDD[(String, Map[String, AnyRef])]
2) def esRDD(sc: SparkContext, resource: String, query: String, cfg: Map[String, String]): RDD[(String, Map[String, AnyRef])]
3) def esJsonRDD(sc: SparkContext, resource: String, query: String, cfg: Map[String, String]): RDD[(String, String)]
For example: sc.esRDD("radio/artists", "?q=me*")
This creates an RDD streaming all the documents matching me* from the index radio/artists.
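For this implicit sc.esRDD call to compile, the connector's implicits must be imported. A minimal, self-contained read sketch (reusing the node address from the configuration above):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

val conf = new SparkConf().setAppName("readFromES")
conf.set("es.nodes", "192.168.0.23")
conf.set("es.port", "9200")
val sc = new SparkContext(conf)

// each element is (documentId, Map of fields)
val artists = sc.esRDD("radio/artists", "?q=me*")
artists.take(10).foreach(println)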

The same query can also be run through the Hadoop InputFormat provided by the connector, either with the old MapReduce API:

import org.apache.hadoop.io.{MapWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.elasticsearch.hadoop.mr.EsInputFormat

val conf = new JobConf()
conf.set("es.resource", "radio/artists")
conf.set("es.query", "?q=me*")
val esRDD = sc.hadoopRDD(conf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
val docCount = esRDD.count()

Or with the new MapReduce API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{MapWritable, Text}
import org.elasticsearch.hadoop.mr.EsInputFormat

val conf = new Configuration()
conf.set("es.resource", "radio/artists")
conf.set("es.query", "?q=me*")
val esRDD = sc.newAPIHadoopRDD(conf,
  classOf[EsInputFormat[Text, MapWritable]],
  classOf[Text], classOf[MapWritable])
val docCount = esRDD.count()

EsSpark (RDD) bulk writing (a usage sketch follows the method list):
1) def saveToEs(rdd: RDD[_], resource: String, cfg: Map[String, String])
2) def saveToEsWithMeta[K, V](rdd: RDD[(K, V)], resource: String, cfg: Map[String, String])
3) def saveJsonToEs(rdd: RDD[_], resource: String, cfg: Map[String, String])
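A minimal usage sketch of 2) and 3); plain saveToEs was shown above, and the index name airports/docs is only illustrative:

import org.elasticsearch.spark.rdd.EsSpark

// saveToEsWithMeta: the key of each pair carries document metadata such as the _id
val keyed = sc.makeRDD(Seq((1, Map("name" -> "OTP")), (2, Map("name" -> "MUC"))))
EsSpark.saveToEsWithMeta(keyed, "airports/docs", configuration)

// saveJsonToEs: each element is already a JSON string and is indexed as-is
val json = sc.makeRDD(Seq("""{"name":"SFO"}""", """{"name":"OTP"}"""))
EsSpark.saveJsonToEs(json, "airports/docs", configuration)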

EsSparkSQL (DataFrame) reading and bulk writing:
1) Reading: def esDF(sc: SQLContext, resource: String, query: String)
// sc is an existing SparkContext
val sql = new SQLContext(sc)

// Spark 1.3 style
val df = sql.load("spark/index", "org.elasticsearch.spark.sql")

// Spark 1.4 style
val df = sql.read.format("org.elasticsearch.spark.sql").load("spark/index")

// Spark 1.5 style
val df = sql.read.format("es").load("spark/index")

// options for Spark 1.3 need to include the target path/resource
val options13 = Map("path" -> "spark/index",
  "pushdown" -> "true",
  "es.nodes" -> "someNode", "es.port" -> "9200")

// Spark 1.3 style
val spark13DF = sql.load("org.elasticsearch.spark.sql", options13)

// options for Spark 1.4 - the path/resource is specified separately
val options = Map("pushdown" -> "true", "es.nodes" -> "someNode", "es.port" -> "9200")

// Spark 1.4 style
val spark14DF = sql.read.format("org.elasticsearch.spark.sql")
  .options(options).load("spark/index")

The pushdown option is specific to the Spark data source API: when it is enabled, the connector translates Spark SQL filters into Elasticsearch queries and pushes them down to the cluster, so filtering happens at the source.

The data source can also be registered as a temporary table and queried with SQL:

sqlContext.sql(
  "CREATE TEMPORARY TABLE myIndex " +
  "USING org.elasticsearch.spark.sql " +
  "OPTIONS (resource 'spark/index', nodes 'someNode')")
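Once registered, the temporary table can be queried with plain SQL; a minimal sketch (the name column is assumed to exist in the index):

val results = sqlContext.sql("SELECT * FROM myIndex WHERE name LIKE 'me%'")
results.show()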

The fields returned from ES can also be filtered through the include/exclude settings, for example:

es.read.field.include = name, address.*
es.read.field.exclude = *.created
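These filters are ordinary connector settings, so they can be supplied through the options map of a DataFrame read; a minimal sketch (the field names are illustrative):

val filterOptions = Map(
  "es.nodes" -> "192.168.0.23", "es.port" -> "9200",
  "es.read.field.include" -> "name,address.*",
  "es.read.field.exclude" -> "*.created")
val filteredDF = sql.read.format("org.elasticsearch.spark.sql")
  .options(filterOptions).load("spark/index")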
2) Writing: def saveToEs(srdd: DataFrame, resource: String, cfg: Map[String, String])

import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // needed for .toDF()

// case class describing the documents
case class Person(name: String, surname: String, age: Int)
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1), p(2).trim.toInt))
  .toDF()
people.saveToEs("spark/people")
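The same write can also be invoked through the EsSparkSQL object when a per-call configuration Map is needed, matching the signature under 2):

import org.elasticsearch.spark.sql.EsSparkSQL
EsSparkSQL.saveToEs(people, "spark/people", configuration)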

Use case:

Using a custom _id:
Method 1:

import com.google.common.base.Charsets
import com.google.common.hash.Hashing
import org.elasticsearch.spark.sql.EsSparkSQL

// hash each line into a long and use it as the "key" column of the DataFrame
val rdd = sc.textFile("hdfs://master:9000/es.nb").map(x => (Hashing.md5().hashString(x, Charsets.UTF_8).asLong(), x))
val relation = sqlContext.createDataFrame(rdd).toDF("key", "val")
EsSparkSQL.saveToEs(relation, "es/info", configuration)
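Note: for the key column to actually be used as the document _id, the connector has to be told which field holds it via the es.mapping.id setting; a minimal sketch extending the configuration Map from earlier:

// take the _id from the "key" column of the DataFrame
val configWithId = configuration + ("es.mapping.id" -> "key")
EsSparkSQL.saveToEs(relation, "es/info", configWithId)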

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// instance of SparkContext
val sc = ...

// the key of each pair (1, 2, 3) becomes the document _id
val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc), (3, sfo)))
airportsRDD.saveToEsWithMeta("airports/2015")
This approach specifies the _id through the key of each pair.

Method 2:

import org.elasticsearch.spark._
import org.elasticsearch.spark.rdd.Metadata._   // enum of the supported metadata fields (ID, TTL, VERSION, ...)

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// metadata for each document
// note it's not required for them to have the same structure
val otpMeta = Map(ID -> 1, TTL -> "3h")        // ID with a value of 1 and TTL with a value of 3h
val mucMeta = Map(ID -> 2, VERSION -> "23")    // ID plus an explicit version
val sfoMeta = Map(ID -> 3)                     // ID only

// instance of SparkContext
val sc = ...

// pair each metadata Map with its document and save
val airportsRDD = sc.makeRDD(Seq((otpMeta, otp), (mucMeta, muc), (sfoMeta, sfo)))
airportsRDD.saveToEsWithMeta("airports/2015")

Feel free to try out the other methods as well. I (Xiao Yang) am also new to Elasticsearch and still feeling my way around, so pointers and corrections are very welcome. QQ: 791279468
