Usage Example
import org.apache.spark.{SparkConf, SparkContext}

object Save2EsLocalTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("save2eslocal").setMaster("local[*]")
    conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
    conf.set("es.index.auto.create", "false")
    conf.set("es.nodes", "127.0.0.1")
    conf.set("es.port", "9200")
    val sc = new SparkContext(conf)
    /*
     * Relevant es settings:
     *
     * es.resource.write : the index/type to write to
     * es.write.operation :
     *   index  - add new documents
     *   upsert - insert when the document does not exist, update when it does
     *
     * es.mapping.id : maps a document field to the document id
     */
    val config = scala.collection.mutable.Map(
      "es.resource.write"  -> "test/students",
      "es.mapping.id"      -> "sid",
      "es.write.operation" -> "upsert")
    // This import is required: it provides the implicit conversion that adds saveToEs to RDD
    import org.elasticsearch.spark._
    val students = sc.makeRDD(Seq(Map("sid" -> "7", "sname" -> "hhy", "sage" -> 100)))
    students.saveToEs(config)
    sc.stop()
  }
}
The full list of configuration options is documented at: https://www.elastic.co/guide/en/elasticsearch/hadoop/5.5/configuration.html
Source Code Analysis
We start from the main entry point, saveToEs. RDD itself has no saveToEs method, so why can we call it here? Because we imported org.elasticsearch.spark._: that package contains a package object spark that defines the required implicit conversions (see any reference on Scala implicits for background).
The relevant code is:
implicit def sparkRDDFunctions[T : ClassTag](rdd: RDD[T]) = new SparkRDDFunctions[T](rdd)

class SparkRDDFunctions[T : ClassTag](rdd: RDD[T]) extends Serializable {
  def saveToEs(resource: String): Unit = { EsSpark.saveToEs(rdd, resource) }
  def saveToEs(resource: String, cfg: scala.collection.Map[String, String]): Unit = { EsSpark.saveToEs(rdd, resource, cfg) }
  def saveToEs(cfg: scala.collection.Map[String, String]): Unit = { EsSpark.saveToEs(rdd, cfg) }
}
As the code shows, the implicit conversion simply delegates to the saveToEs methods of EsSpark. If you call saveToEs(resource: String), the code creates a cfg map and wraps the resource into it as "es.resource.write" -> resource, where resource has the format index/type. Whichever of the three overloads you call, the resulting cfg is then merged with the settings from SparkConf (keys starting with spark. are included with that prefix stripped), producing the final settings.
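As an illustration (a sketch reusing the students RDD from the example above), the three overloads end up with equivalent configurations:

import org.elasticsearch.spark._

// 1) resource only: wrapped internally as "es.resource.write" -> "test/students"
students.saveToEs("test/students")

// 2) resource plus extra settings
students.saveToEs("test/students", Map("es.mapping.id" -> "sid"))

// 3) everything in the cfg map
students.saveToEs(Map(
  "es.resource.write" -> "test/students",
  "es.mapping.id"     -> "sid"))

// Settings placed on SparkConf with a "spark." prefix are merged in with the
// prefix stripped, e.g. conf.set("spark.es.nodes", "...") becomes "es.nodes".

Inside EsSpark, the cfg-based overload looks like this: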
def saveToEs(rdd: RDD[_], cfg: Map[String, String]) {
  doSaveToEs(rdd, cfg, false)
}
EsSpark.saveToEs mainly delegates to the doSaveToEs method:
private[spark] def doSaveToEs(rdd: RDD[_], cfg: Map[String, String], hasMeta: Boolean) {
  CompatUtils.warnSchemaRDD(rdd, LogFactory.getLog("org.elasticsearch.spark.rdd.EsSpark"))

  if (rdd == null || rdd.partitions.length == 0) {
    return
  }

  val sparkCfg = new SparkSettingsManager().load(rdd.sparkContext.getConf)
  val config = new PropertiesSettings().load(sparkCfg.save())
  config.merge(cfg.asJava)

  // Need to discover the EsVersion here before checking if the index exists
  InitializationUtils.discoverEsVersion(config, LOG)
  InitializationUtils.checkIdForOperation(config)
  InitializationUtils.checkIndexExistence(config)

  rdd.sparkContext.runJob(rdd, new EsRDDWriter(config.save(), hasMeta).write _)
}
runJob is Spark's job-submission entry point; its implementation is standard Spark and worth reading up on separately. Our focus here is new EsRDDWriter(config.save(), hasMeta).write _. The code above first discovers the cluster version over REST, then checks the operation: an update/upsert requires es.mapping.id to be set (its value should name a field of your document). In addition, if es.index.auto.create is set to no/false, it checks whether the index exists and throws an exception if it does not; if set to yes/true, a missing index is created automatically, so no existence check is performed.
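For illustration (a hypothetical configuration, not from the source), here is a cfg that would fail these checks fast:

val badCfg = Map(
  "es.resource.write"  -> "missing_index/doc",
  "es.write.operation" -> "upsert")   // no es.mapping.id -> checkIdForOperation throws
// And with es.index.auto.create=false, checkIndexExistence throws
// if missing_index does not already exist in the cluster.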
Now let's step into new EsRDDWriter(config.save(), hasMeta).write.
Constructing the EsRDDWriter initializes the following:
protected def valueWriter: Class[_ <: ValueWriter[_]] = classOf[ScalaValueWriter]
protected def bytesConverter: Class[_ <: BytesConverter] = classOf[JdkBytesConverter]
protected def fieldExtractor: Class[_ <: FieldExtractor] = classOf[ScalaMapFieldExtractor]

lazy val settings = {
  // serializedSettings here is config.save(): the configuration serialized to a String
  // and then loaded back again
  val settings = new PropertiesSettings().load(serializedSettings)
  // set some required defaults; each can be overridden with a custom implementation via
  // es.ser.writer.value.class / es.ser.writer.bytes.class / es.mapping.default.extractor.class
  InitializationUtils.setValueWriterIfNotSet(settings, valueWriter, log)
  InitializationUtils.setBytesConverterIfNeeded(settings, bytesConverter, log)
  InitializationUtils.setFieldExtractorIfNotSet(settings, fieldExtractor, log)
  settings
}

lazy val metaExtractor = new ScalaMetadataExtractor()
Scala's lazy keyword defines a lazy value: it must be a val (immutable), and it is instantiated only on first access.
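A minimal, self-contained illustration of the lazy semantics (not library code):

object LazyDemo extends App {
  lazy val settings = {            // the body does not run at definition time
    println("initializing settings")
    Map("es.nodes" -> "127.0.0.1")
  }
  println("before first access")
  println(settings("es.nodes"))    // "initializing settings" prints only now
  println(settings("es.nodes"))    // already initialized; the body is not run again
}

Next, the write method that runJob invokes for each partition: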
def write(taskContext: TaskContext, data: Iterator[T]) {
  val writer = RestService.createWriter(settings, taskContext.partitionId, -1, log)

  taskContext.addTaskCompletionListener((TaskContext) => writer.close())

  if (runtimeMetadata) {
    writer.repository.addRuntimeFieldExtractor(metaExtractor)
  }

  while (data.hasNext) {
    writer.repository.writeToIndex(processData(data))
  }
}
A closer look at the write method
write first creates a PartitionWriter:
val writer = RestService.createWriter(settings, taskContext.partitionId, -1, log)
Here is the implementation of createWriter:
public static PartitionWriter createWriter(Settings settings, int currentSplit, int totalSplits, Log log) {
    Version.logVersion();

    InitializationUtils.validateSettings(settings);
    InitializationUtils.discoverEsVersion(settings, log);
    InitializationUtils.discoverNodesIfNeeded(settings, log);
    InitializationUtils.filterNonClientNodesIfNeeded(settings, log);
    InitializationUtils.filterNonDataNodesIfNeeded(settings, log);
    InitializationUtils.filterNonIngestNodesIfNeeded(settings, log);

    List<String> nodes = SettingsUtils.discoveredOrDeclaredNodes(settings);

    int selectedNode = (currentSplit < 0) ? new Random().nextInt(nodes.size()) : currentSplit % nodes.size();
    // select the appropriate nodes first, to spread the load before-hand
    SettingsUtils.pinNode(settings, nodes.get(selectedNode));

    Resource resource = new Resource(settings, false);
    log.info(String.format("Writing to [%s]", resource));

    // determine whether this is a single-index or a multi-index write; the difference is that
    // a single index can be pinned to a node, while multiple indices pick nodes randomly
    IndexExtractor iformat = ObjectUtils.instantiate(settings.getMappingIndexExtractorClassName(), settings);
    iformat.compile(resource.toString());

    RestRepository repository = (iformat.hasPattern() ? initMultiIndices(settings, currentSplit, resource, log) : initSingleIndex(settings, currentSplit, resource, log));
    return new PartitionWriter(settings, currentSplit, totalSplits, repository);
}
In the single-index case:
shards have a mapping to nodes; the partitionId is taken modulo the shard count to pick the corresponding node, and that PartitionWriter then writes to that shard (see the sketch below);
if es.nodes.client.only=true, a client node is pinned instead;
if es.nodes.wan.only=true, the behavior is the same as the multi-index case.
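A quick sketch of the modulo selection from createWriter above (hypothetical node list; a negative currentSplit means a random pick):

val nodes = Vector("node-a:9200", "node-b:9200", "node-c:9200")
def selectNode(currentSplit: Int): String = {
  val selected =
    if (currentSplit < 0) scala.util.Random.nextInt(nodes.size) // random node
    else currentSplit % nodes.size                              // round-robin by partition id
  nodes(selected)
}
// selectNode(0) == "node-a:9200", selectNode(1) == "node-b:9200",
// selectNode(3) == "node-a:9200", and so on.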
Registering a task-completion listener:
taskContext.addTaskCompletionListener((TaskContext) => writer.close())
The actual write loop:
while (data.hasNext) {
  writer.repository.writeToIndex(processData(data))
}
public void writeToIndex(Object object) {
    Assert.notNull(object, "no object data given");

    lazyInitWriting();
    doWriteToIndex(command.write(object));
}
First, lazyInitWriting():
private void lazyInitWriting() {
    if (!writeInitialized) {
        writeInitialized = true;

        autoFlush = !settings.getBatchFlushManual();
        ba.bytes(new byte[settings.getBatchSizeInBytes()], 0);
        trivialBytesRef = new BytesRef();
        bufferEntriesThreshold = settings.getBatchSizeInEntries();
        requiresRefreshAfterBulk = settings.getBatchRefreshAfterWrite();

        this.command = BulkCommands.create(settings, metaExtractor, client.internalVersion);
    }
}
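The buffer and flush behavior initialized here is driven by the batch settings; a hedged configuration sketch (key names as documented for es-hadoop, defaults may vary by version):

val batchCfg = Map(
  "es.batch.size.bytes"    -> "1mb",   // size of ba, the backing byte buffer
  "es.batch.size.entries"  -> "1000",  // bufferEntriesThreshold
  "es.batch.write.refresh" -> "true")  // requiresRefreshAfterBulk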
The important part is BulkCommands.create:
public static BulkCommand create(Settings settings, MetadataExtractor metaExtractor, EsMajorVersion version) {
    String operation = settings.getOperation();
    BulkFactory factory = null;

    if (ConfigurationOptions.ES_OPERATION_CREATE.equals(operation)) {
        factory = new CreateBulkFactory(settings, metaExtractor);
    }
    else if (ConfigurationOptions.ES_OPERATION_INDEX.equals(operation)) {
        factory = new IndexBulkFactory(settings, metaExtractor);
    }
    else if (ConfigurationOptions.ES_OPERATION_UPDATE.equals(operation)) {
        factory = new UpdateBulkFactory(settings, metaExtractor, version);
    }
    else if (ConfigurationOptions.ES_OPERATION_UPSERT.equals(operation)) {
        factory = new UpdateBulkFactory(settings, true, metaExtractor, version);
    }
    else {
        throw new EsHadoopIllegalArgumentException("Unknown operation " + operation);
    }

    return factory.createBulk();
}
As the code shows, a different concrete AbstractBulkFactory is created depending on the value of es.write.operation, but factory.createBulk() always runs AbstractBulkFactory's implementation; none of the concrete subclasses override it.
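Restating the dispatch as a Scala sketch (assuming the usual string values of the ES_OPERATION_* constants):

def factoryFor(operation: String): String = operation match {
  case "create" => "CreateBulkFactory"
  case "index"  => "IndexBulkFactory"
  case "update" => "UpdateBulkFactory"
  case "upsert" => "UpdateBulkFactory with the upsert flag set to true"
  case other    => throw new IllegalArgumentException(s"Unknown operation $other")
}

Here is AbstractBulkFactory.createBulk: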
public BulkCommand createBulk() {
    List<Object> before = new ArrayList<Object>();
    List<Object> after = new ArrayList<Object>();

    if (!isStatic) {
        before.add(new DynamicHeaderRef());
        after.add(new DynamicEndRef());
    }
    else {
        writeObjectHeader(before);
        before = compact(before);
        writeObjectEnd(after);
        after = compact(after);
    }

    boolean isScriptUpdate = settings.hasUpdateScript();

    // is the RDD data already JSON?
    if (jsonInput) {
        // JSON input combined with a script
        if (isScriptUpdate) {
            return new JsonScriptTemplateBulk(before, after, jsonExtractors, settings);
        }
        // JSON input without a script
        return new JsonTemplatedBulk(before, after, jsonExtractors, settings);
    }
    // non-JSON input combined with a script
    if (isScriptUpdate) {
        return new ScriptTemplateBulk(settings, before, after, valueWriter);
    }
    // everything else
    return new TemplatedBulk(before, after, valueWriter);
}
before holds the metadata written ahead of each document, such as _version, _type, and _routing;
after holds the trailing "\n".
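For orientation (an illustrative example, not taken from the source), a single entry in the bulk payload for the index operation, using the document from the usage example, would look roughly like this: before contributes the action line, the serialized document follows, and after contributes the newline.

val bulkEntry =
  """{"index":{"_id":"7"}}
    |{"sid":"7","sname":"hhy","sage":100}
    |""".stripMargin

Both lists are post-processed by compact: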
private List<Object> compact(List<Object> list) {
    if (list == null || list.isEmpty()) {
        return null;
    }

    List<Object> compacted = new ArrayList<Object>();
    StringBuilder stringAccumulator = new StringBuilder();
    for (Object object : list) {
        if (object instanceof FieldExtractor) {
            if (stringAccumulator.length() > 0) {
                compacted.add(new BytesArray(stringAccumulator.toString()));
                stringAccumulator.setLength(0);
            }
            compacted.add(new FieldWriter((FieldExtractor) object));
        }
        else {
            stringAccumulator.append(object.toString());
        }
    }

    if (stringAccumulator.length() > 0) {
        compacted.add(new BytesArray(stringAccumulator.toString()));
    }

    return compacted;
}
compact further wraps the values accumulated in before/after: FieldExtractor instances are wrapped in FieldWriter, and strings are wrapped in BytesArray; these two lists only ever contain those two kinds of objects.
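A Scala restatement of compact (a sketch, not the library code): adjacent strings collapse into a single chunk, while extractor markers split the accumulation.

case class Extractor(field: String)  // stands in for a FieldExtractor

def compactSketch(list: List[Any]): List[Any] = {
  val out = scala.collection.mutable.ListBuffer.empty[Any]
  val acc = new StringBuilder
  list.foreach {
    case s: String => acc.append(s)
    case extractor =>
      if (acc.nonEmpty) { out += acc.toString; acc.setLength(0) }
      out += extractor
  }
  if (acc.nonEmpty) out += acc.toString
  out.toList
}

// compactSketch(List("{\"index\":{", "\"_id\":\"", Extractor("sid"), "\"}}"))
//   == List("{\"index\":{\"_id\":\"", Extractor("sid"), "\"}}")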
Back in writeToIndex, look at the doWriteToIndex(command.write(object)) call. Assuming our data is a Map, command is a TemplatedBulk instance; here is its write implementation:
public BytesRef write(Object object) {
    ref.reset();
    scratchPad.reset();

    Object processed = preProcess(object, scratchPad);

    // write before object
    writeTemplate(beforeObject, processed);
    // write object
    doWriteObject(processed, scratchPad, valueWriter);
    ref.add(scratchPad);
    // write after object
    writeTemplate(afterObject, processed);

    return ref;
}
The code above packs before, the serialized document, and after into a single BytesRef.
private void doWriteToIndex(BytesRef payload) {
    // check space first
    // ba is the backing array for data
    if (payload.length() > ba.available()) {
        if (autoFlush) {
            flush();
        }
        else {
            throw new EsHadoopIllegalStateException(
                String.format("Auto-flush disabled and bulk buffer full; disable manual flush or increase capacity [current size %s]; bailing out", ba.capacity()));
        }
    }

    data.copyFrom(payload);
    payload.reset();

    dataEntries++;
    if (bufferEntriesThreshold > 0 && dataEntries >= bufferEntriesThreshold) {
        if (autoFlush) {
            flush();
        }
        else {
            // handle the corner case of manual flush that occurs only after the buffer is completely full (think size of 1)
            if (dataEntries > bufferEntriesThreshold) {
                throw new EsHadoopIllegalStateException(
                    String.format(
                        "Auto-flush disabled and maximum number of entries surpassed; disable manual flush or increase capacity [current size %s]; bailing out",
                        bufferEntriesThreshold));
            }
        }
    }
}
The key piece above is flush(), whose main work is done by the following method:
public BulkResponse tryFlush() {
    BulkResponse bulkResult;

    try {
        // double check data - it might be a false flush (called on clean-up)
        if (data.length() > 0) {
            if (log.isDebugEnabled()) {
                log.debug(String.format("Sending batch of [%d] bytes/[%s] entries", data.length(), dataEntries));
            }

            bulkResult = client.bulk(resourceW, data);
            executedBulkWrite = true;
        } else {
            bulkResult = BulkResponse.ok(0);
        }
    } catch (EsHadoopException ex) {
        hadWriteErrors = true;
        throw ex;
    }

    // always discard data since there's no code path that uses the in flight data
    discard();

    return bulkResult;
}
The key line is:
bulkResult = client.bulk(resourceW, data);
Here client is a RestClient instance; it was initialized when the RestRepository was constructed, and the RestRepository itself was created while creating the PartitionWriter:
public RestRepository(Settings settings) {
    this.settings = settings;

    if (StringUtils.hasText(settings.getResourceRead())) {
        this.resourceR = new Resource(settings, true);
    }

    if (StringUtils.hasText(settings.getResourceWrite())) {
        this.resourceW = new Resource(settings, false);
    }

    Assert.isTrue(resourceR != null || resourceW != null, "Invalid configuration - No read or write resource specified");

    this.client = new RestClient(settings);
}
Now look at client.bulk:
public BulkResponse bulk(Resource resource, TrackingBytesArray data) {
    Retry retry = retryPolicy.init();
    BulkResponse processedResponse;

    boolean isRetry = false;

    do {
        // NB: dynamically get the stats since the transport can change
        long start = network.transportStats().netTotalTime;
        Response response = execute(PUT, resource.bulk(), data);
        long spent = network.transportStats().netTotalTime - start;

        stats.bulkTotal++;
        stats.docsSent += data.entries();
        stats.bulkTotalTime += spent;
        // bytes will be counted by the transport layer

        if (isRetry) {
            stats.docsRetried += data.entries();
            stats.bytesRetried += data.length();
            stats.bulkRetries++;
            stats.bulkRetriesTotalTime += spent;
        }

        isRetry = true;

        processedResponse = processBulkResponse(response, data);
    } while (data.length() > 0 && retry.retry(processedResponse.getHttpStatus()));

    return processedResponse;
}
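The do/while retry loop above is governed by the bulk retry policy; a hedged configuration sketch (these are the documented es-hadoop retry settings, defaults may vary by version):

val retryCfg = Map(
  "es.batch.write.retry.count" -> "3",   // how many times a rejected bulk request is retried
  "es.batch.write.retry.wait"  -> "10s") // how long to wait between retries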
Focus on this line: Response response = execute(PUT, resource.bulk(), data);
protected Response execute(Method method, String path, ByteSequence buffer) {
    return execute(new SimpleRequest(method, null, path, null, buffer), true);
}

// the method above delegates to:
protected Response execute(Request request, boolean checkStatus) {
    Response response = network.execute(request);
    if (checkStatus) {
        checkResponse(request, response);
    }
    return response;
}
Here network is a NetworkClient; the core of network.execute is response = currentTransport.execute(routedRequest), where currentTransport is a Transport. That interface has two main implementations: CommonsHttpTransport and LeasedTransport.
public NetworkClient(Settings settings) {
    this(settings, (!SettingsUtils.hasJobTransportPoolingKey(settings) ? new CommonsHttpTransportFactory() : PooledTransportManager.getTransportFactory(settings)));
}
If es.internal.transport.pooling.key is not set, a CommonsHttpTransportFactory creates a CommonsHttpTransport; otherwise the pooled variant is used. When the Transport is created, if a username and password are configured via es.net.http.auth.user and es.net.http.auth.pass, authentication is performed.
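A minimal sketch of enabling basic authentication, reusing the SparkConf from the usage example (the values are placeholders):

conf.set("es.net.http.auth.user", "elastic")    // hypothetical user name
conf.set("es.net.http.auth.pass", "changeme")   // hypothetical password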
Next, response = currentTransport.execute(routedRequest) is invoked:
public Response execute(Request request) throws IOException {
    HttpMethod http = null;

    switch (request.method()) {
    case DELETE:
        http = new DeleteMethodWithBody();
        break;
    case HEAD:
        http = new HeadMethod();
        break;
    case GET:
        http = (request.body() == null ? new GetMethod() : new GetMethodWithBody());
        break;
    case POST:
        http = new PostMethod();
        break;
    case PUT:
        http = new PutMethod();
        break;
    default:
        throw new EsHadoopTransportException("Unknown request method " + request.method());
    }

    CharSequence uri = request.uri();
    if (StringUtils.hasText(uri)) {
        if (String.valueOf(uri).contains("?")) {
            throw new EsHadoopInvalidRequest("URI has query portion on it: [" + uri + "]");
        }
        http.setURI(new URI(escapeUri(uri.toString(), sslEnabled), false));
    }

    // NB: initialize the path _after_ the URI otherwise the path gets reset to /
    // add node prefix (if specified)
    String path = pathPrefix + addLeadingSlashIfNeeded(request.path().toString());
    if (path.contains("?")) {
        throw new EsHadoopInvalidRequest("Path has query portion on it: [" + path + "]");
    }

    path = HttpEncodingTools.encodePath(path);

    http.setPath(path);

    try {
        // validate new URI
        uri = http.getURI().toString();
    } catch (URIException uriex) {
        throw new EsHadoopTransportException("Invalid target URI " + request, uriex);
    }

    CharSequence params = request.params();
    if (StringUtils.hasText(params)) {
        http.setQueryString(params.toString());
    }

    ByteSequence ba = request.body();
    if (ba != null && ba.length() > 0) {
        if (!(http instanceof EntityEnclosingMethod)) {
            throw new IllegalStateException(String.format("Method %s cannot contain body - implementation bug", request.method().name()));
        }
        EntityEnclosingMethod entityMethod = (EntityEnclosingMethod) http;
        entityMethod.setRequestEntity(new BytesArrayRequestEntity(ba));
        entityMethod.setContentChunked(false);
    }

    // apply request headers, configured through es.net.http.header.xxx settings
    // (headers = new HeaderProcessor(settings))
    headers.applyTo(http);

    // when tracing, log everything
    if (log.isTraceEnabled()) {
        log.trace(String.format("Tx %s[%s]@[%s][%s]?[%s] w/ payload [%s]", proxyInfo, request.method().name(), httpInfo, request.path(), request.params(), request.body()));
    }

    long start = System.currentTimeMillis();
    try {
        // the actual HTTP call, via commons-httpclient
        client.executeMethod(http);
    } finally {
        stats.netTotalTime += (System.currentTimeMillis() - start);
    }

    if (log.isTraceEnabled()) {
        Socket sk = ReflectionUtils.invoke(GET_SOCKET, conn, (Object[]) null);
        String addr = sk.getLocalAddress().getHostAddress();
        log.trace(String.format("Rx %s@[%s] [%s-%s] [%s]", proxyInfo, addr, http.getStatusCode(), HttpStatus.getStatusText(http.getStatusCode()), http.getResponseBodyAsString()));
    }

    // the request URI is not set (since it is retried across hosts), so use the http info instead for source
    return new SimpleResponse(http.getStatusCode(), new ResponseInputStream(http), httpInfo);
}
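As the headers.applyTo(http) comment notes, custom request headers can be supplied through settings prefixed with es.net.http.header.; a minimal sketch on the SparkConf from the usage example (the header value here is just an illustration):

conf.set("es.net.http.header.Accept-Encoding", "gzip")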
Since the bulk method executed Response response = execute(PUT, resource.bulk(), data);, request.method() matches PUT, and therefore HttpMethod http = new PutMethod();.