Spark Solr(1) Read Data from SOLR
I ran into a lot of Jackson databind version conflicts between Spark and Solr, so I cloned the project and made some version updates there.
It is forked from lucidworks/spark-solr.
Only some dependencies were updated in pom.xml (shown as a diff against the upstream pom; a few property values near the end lost their tag names when the post was published and are left as-is):
 <modelVersion>4.0.0</modelVersion>
 <groupId>com.lucidworks.spark</groupId>
 <artifactId>spark-solr</artifactId>
-<version>3.4.0-SNAPSHOT</version>
+<version>3.4.0.1</version>
 <packaging>jar</packaging>
 <name>spark-solr</name>
 <description>Tools for reading data from Spark into Solr</description>
@@ -39,11 +39,10 @@
 <java.version>1.8</java.version>
 <spark.version>2.2.1</spark.version>
 <solr.version>7.1.0</solr.version>
-<fasterxml.version>2.4.0</fasterxml.version>
+<fasterxml.version>2.6.7</fasterxml.version>
 <scala.version>2.11.8</scala.version>
 <scala.binary.version>2.11</scala.binary.version>
 1.1.1
-2.4.0
 128m
Command to build the package:
>mvn clean compile install -DskipTests
After the build, I get a driver versioned 3.4.0.1.
Set Up SOLR Spark Task
The pom.xml that declares the dependencies:
<modelVersion>4.0.0</modelVersion>
<groupId>com.sillycat</groupId>
<artifactId>sillycat-spark-solr</artifactId>
<version>1.0</version>
<name>Fetch the Events from Kafka</name>
<description>Spark Streaming System</description>
<packaging>jar</packaging>
<properties>
    <spark.version>2.2.1</spark.version>
    <spark.solr.version>3.4.0.1</spark.solr.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.lucidworks.spark</groupId>
        <artifactId>spark-solr</artifactId>
        <version>${spark.solr.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.sillycat.sparkjava.SparkJavaApp</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>assemble-all</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Here is the main implementation class, which connects to ZooKeeper and SOLR:
import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import com.lucidworks.spark.rdd.SolrJavaRDD;
import com.sillycat.sparkjava.base.SparkBaseApp;

public class SeniorJavaFeedApp extends SparkBaseApp {

    private static final long serialVersionUID = -1219898501920199612L;

    protected String getAppName() {
        return "SeniorJavaFeedApp";
    }

    public void executeTask(List<String> params) {
        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);
        // Placeholder ZooKeeper address; in practice this likely comes from
        // SparkBaseApp or external configuration.
        String zkHost = "localhost:2181/solr";
        String collection = "allJobs";
        String solrQuery = "expired: false AND title: Java* AND source_id: 4675";
        String keyword = "Architect";

        logger.info("Prepare the resource from " + solrQuery);
        JavaRDD<SolrDocument> rdd = this.generateRdd(sc, zkHost, collection, solrQuery);

        logger.info("Executing the calculation based on keyword " + keyword);
        List<SolrDocument> results = processRows(rdd, keyword);
        for (SolrDocument result : results) {
            logger.info("Found a job for you: " + result);
        }
        sc.stop();
    }

    private JavaRDD<SolrDocument> generateRdd(SparkContext sc, String zkHost, String collection, String solrQuery) {
        // Connect to the SolrCloud collection via ZooKeeper and query across all shards
        SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, sc);
        return solrRDD.queryShards(solrQuery);
    }

    private List<SolrDocument> processRows(JavaRDD<SolrDocument> rows, String keyword) {
        // Keep only the documents whose title contains the keyword
        JavaRDD<SolrDocument> lines = rows.filter(new Function<SolrDocument, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(SolrDocument s) throws Exception {
                Object titleObj = s.getFieldValue("title");
                if (titleObj != null && titleObj.toString().contains(keyword)) {
                    return true;
                }
                return false;
            }
        });
        return lines.collect();
    }
}
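The keyword filter in processRows is plain string matching, so the logic can be sanity-checked without a Spark or Solr instance. A minimal sketch, where a plain field map stands in for SolrDocument (the map-based stand-in is an assumption for illustration, not part of the project):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class TitleFilterSketch {

    // Same check as the Function<SolrDocument, Boolean> in processRows,
    // expressed over a plain field map so it runs standalone.
    static Predicate<Map<String, Object>> containsKeyword(String keyword) {
        return doc -> {
            Object title = doc.get("title");
            return title != null && title.toString().contains(keyword);
        };
    }

    public static void main(String[] args) {
        Map<String, Object> a = new HashMap<>();
        a.put("title", "Senior Java Architect");
        Map<String, Object> b = new HashMap<>();
        b.put("title", "Java Developer");

        // Only the first document matches the keyword "Architect"
        List<Map<String, Object>> hits = Arrays.asList(a, b).stream()
                .filter(containsKeyword("Architect"))
                .collect(Collectors.toList());
        System.out.println(hits.size()); // prints 1
    }
}
```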
The entry class is com.sillycat.sparkjava.SparkJavaApp; here is how to run the Spark tasks locally and on the cluster.
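The entry class itself is not shown in this post. Judging from the commands below, it takes the app class name as its first argument and dispatches to it; a hypothetical sketch of such a dispatcher (the DemoApp stand-in and the reflection wiring are assumptions, not the project's actual code):

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

public class SparkJavaApp {

    // Stand-in app so this sketch runs on its own; the real apps extend SparkBaseApp.
    public static class DemoApp {
        public void executeTask(List<String> params) {
            System.out.println("executed with " + params.size() + " params");
        }
    }

    public static void main(String[] args) throws Exception {
        // First argument selects the app class; remaining arguments are passed through.
        String className = args.length > 0 ? args[0] : DemoApp.class.getName();
        Object app = Class.forName(className).getDeclaredConstructor().newInstance();
        List<String> params = Arrays.asList(args)
                .subList(args.length > 0 ? 1 : 0, args.length);
        // Invoke executeTask(List) on the selected app
        Method m = app.getClass().getMethod("executeTask", List.class);
        m.invoke(app, params);
    }
}
```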
#Run locally#
>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
#Run binary locally#
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
#Run binary on Remote YARN Cluster#
Note: in Spark 2.x, --master yarn-client is deprecated; --master yarn --deploy-mode client is the equivalent form.
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp
>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp