Spark Solr(1) Read Data from SOLR


I kept running into jackson-databind (fasterxml) version conflicts between Spark and Solr, so I cloned the spark-solr project and updated a few dependency versions there.

The original project is https://github.com/LucidWorks/spark-solr; my fork is here:
>git clone https://github.com/luohuazju/spark-solr

Only a few dependency versions are updated in pom.xml:
<modelVersion>4.0.0</modelVersion>
<groupId>com.lucidworks.spark</groupId>
<artifactId>spark-solr</artifactId>
- <version>3.4.0-SNAPSHOT</version>
+ <version>3.4.0.1</version>
<packaging>jar</packaging>
<name>spark-solr</name>
<description>Tools for reading data from Spark into Solr</description>
@@ -39,11 +39,10 @@
<java.version>1.8</java.version>
<spark.version>2.2.1</spark.version>
<solr.version>7.1.0</solr.version>
- <fasterxml.version>2.4.0</fasterxml.version>
+ <fasterxml.version>2.6.7</fasterxml.version>
<scala.version>2.11.8</scala.version>
<scala.binary.version>2.11</scala.binary.version>
<scoverage.plugin.version>1.1.1</scoverage.plugin.version>
- <fasterxml.version>2.4.0</fasterxml.version>
<MaxPermSize>128m</MaxPermSize>
</properties>
<repositories>

Command to build the package:
>mvn clean compile install -DskipTests

After the build, the driver is installed into the local Maven repository with version <spark.solr.version>3.4.0.1</spark.solr.version>.

Set Up the SOLR Spark Task
pom.xml declares the dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sillycat</groupId>
    <artifactId>sillycat-spark-solr</artifactId>
    <version>1.0</version>
    <description>Read data from SOLR with Spark</description>
    <name>Spark Streaming System</name>
    <packaging>jar</packaging>

    <properties>
        <spark.version>2.2.1</spark.version>
        <spark.solr.version>3.4.0.1</spark.solr.version>
    </properties>

    <dependencies>
        <!-- spark framework -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- spark SOLR -->
        <dependency>
            <groupId>com.lucidworks.spark</groupId>
            <artifactId>spark-solr</artifactId>
            <version>${spark.solr.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>
        <!-- JUNIT -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.sillycat.sparkjava.SparkJavaApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Here is the main implementation class; it connects to ZooKeeper and SOLR, queries the collection, and filters the results on the Spark side.
SeniorJavaFeedApp.java
package com.sillycat.sparkjava.app;

import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import com.lucidworks.spark.rdd.SolrJavaRDD;
import com.sillycat.sparkjava.base.SparkBaseApp;

public class SeniorJavaFeedApp extends SparkBaseApp {

    private static final long serialVersionUID = -1219898501920199612L;

    protected String getAppName() {
        return "SeniorJavaFeedApp";
    }

    public void executeTask(List<String> params) {
        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);

        // ZooKeeper ensemble (with the /solr chroot) and collection that spark-solr reads from
        String zkHost = "zookeeper1.us-east-1.elasticbeanstalk.com,zookeeper2.us-east-1.elasticbeanstalk.com,zookeeper3.us-east-1.elasticbeanstalk.com/solr/allJobs";
        String collection = "allJobs";
        String solrQuery = "expired: false AND title: Java* AND source_id: 4675";
        String keyword = "Architect";

        logger.info("Prepare the resource from " + solrQuery);
        JavaRDD<SolrDocument> rdd = this.generateRdd(sc, zkHost, collection, solrQuery);
        logger.info("Executing the calculation based on keyword " + keyword);

        List<SolrDocument> results = processRows(rdd, keyword);
        for (SolrDocument result : results) {
            logger.info("Find some jobs for you:" + result);
        }
        sc.stop();
    }

    // Query Solr across all shards and expose the matching documents as a JavaRDD
    private JavaRDD<SolrDocument> generateRdd(SparkContext sc, String zkHost, String collection, String solrQuery) {
        SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, sc);
        JavaRDD<SolrDocument> resultsRDD = solrRDD.queryShards(solrQuery);
        return resultsRDD;
    }

    // Keep only the documents whose title field contains the keyword
    private List<SolrDocument> processRows(JavaRDD<SolrDocument> rows, String keyword) {
        JavaRDD<SolrDocument> lines = rows.filter(new Function<SolrDocument, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(SolrDocument s) throws Exception {
                Object titleObj = s.getFieldValue("title");
                if (titleObj != null) {
                    String title = titleObj.toString();
                    if (title.contains(keyword)) {
                        return true;
                    }
                }
                return false;
            }
        });
        return lines.collect();
    }

}
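
SeniorJavaFeedApp extends SparkBaseApp, which is not listed in this post; the code above only relies on getAppName(), getSparkConf(), executeTask() and a logger field from it. A minimal sketch of what such a base class could look like; the local[*] default master and the log4j logger here are my assumptions, not the original implementation.
SparkBaseApp.java
package com.sillycat.sparkjava.base;

import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;

// Sketch of the base class referenced by SeniorJavaFeedApp. Only the members used by the
// subclass are taken from the post; the defaults below are illustrative assumptions.
public abstract class SparkBaseApp implements Serializable {

    private static final long serialVersionUID = 1L;

    // transient: the anonymous filter Function captures the enclosing app instance,
    // so the app gets serialized by Spark and the logger must not be part of that.
    protected transient Logger logger = Logger.getLogger(this.getClass());

    // Each task names itself; used as the Spark application name.
    protected abstract String getAppName();

    // Each task implements its own Spark job.
    public abstract void executeTask(List<String> params);

    // Build a SparkConf; spark-submit can still override the master,
    // the local[*] default only keeps plain `java -jar` runs working.
    protected SparkConf getSparkConf() {
        SparkConf conf = new SparkConf().setAppName(getAppName());
        if (!conf.contains("spark.master")) {
            conf.setMaster("local[*]");
        }
        return conf;
    }
}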

Here are the commands to run the Spark task locally and on the cluster. They all go through the entry class com.sillycat.sparkjava.SparkJavaApp, which dispatches to the task class given on the command line (a sketch of that class follows the commands below).
#Run locally with java#

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run the binary locally with spark-submit#

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run the binary on a remote YARN cluster#

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
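
The entry class com.sillycat.sparkjava.SparkJavaApp (declared as the mainClass in pom.xml) is not listed in this post. From the commands above it appears to take the task class name as its first argument, instantiate it, and call executeTask(); a minimal sketch under that assumption:
SparkJavaApp.java
package com.sillycat.sparkjava;

import java.util.Arrays;
import java.util.List;

import com.sillycat.sparkjava.base.SparkBaseApp;

// Sketch of the entry class named in the pom.xml manifest and in the commands above.
// Assumption: the first argument is the task class to run, the rest are passed to the task.
public class SparkJavaApp {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: SparkJavaApp <task class name> [params...]");
            System.exit(1);
        }
        String taskClassName = args[0];
        List<String> params = Arrays.asList(args).subList(1, args.length);

        // Load the task class reflectively and run it,
        // e.g. com.sillycat.sparkjava.app.SeniorJavaFeedApp
        SparkBaseApp task = (SparkBaseApp) Class.forName(taskClassName)
                .getDeclaredConstructor().newInstance();
        task.executeTask(params);
    }
}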


References:
https://github.com/LucidWorks/spark-solr
https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/
https://lucidworks.com/2016/08/16/solr-as-sparksql-datasource-part-ii/
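
The two LucidWorks posts above describe exposing Solr as a Spark SQL data source instead of the RDD API used in SeniorJavaFeedApp. A minimal Java sketch of that route, reusing the zkHost and collection from this post; SolrDataFrameExample is just an illustrative class name, the query value is only an example, and spark-sql_2.11 would need to be on the classpath in addition to the dependencies listed earlier.
SolrDataFrameExample.java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SolrDataFrameExample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SolrDataFrameExample").getOrCreate();

        String zkHost = "zookeeper1.us-east-1.elasticbeanstalk.com,zookeeper2.us-east-1.elasticbeanstalk.com,zookeeper3.us-east-1.elasticbeanstalk.com/solr/allJobs";

        // spark-solr registers the "solr" data source; zkhost, collection and query are its options.
        Dataset<Row> jobs = spark.read()
                .format("solr")
                .option("zkhost", zkHost)
                .option("collection", "allJobs")
                .option("query", "expired:false AND title:Java*")
                .load();

        jobs.printSchema();
        jobs.show(10);

        spark.stop();
    }
}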

Spark library
http://spark-packages.org/

Write to XML - StAX
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
https://www.journaldev.com/892/how-to-write-xml-file-in-java-using-java-stax-api
Spark to S3
http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark