Solr(9) Solr Index Replication on Ubuntu and Scala Client
1. Create One More Core
Go to the example directory and copy an existing core directory to create the new core:
> pwd
/opt/solr/example
> cp -r collection1 jobs
> rm -fr collection1/
Back in the example directory, start the server:
> java -jar start.jar
Open the admin console web UI:
http://ubuntu-master:8983/solr/#/
Add Core with the following settings:
name: jobs
instanceDir: /opt/solr/example/solr/jobs
dataDir: /opt/solr/example/solr/jobs/data
config: /opt/solr/example/solr/jobs/conf/solrconfig.xml
schema: /opt/solr/example/solr/jobs/conf/schema.xml
Add one record to the Solr system:
jobs —> Documents —> Request-Handler —> Document Type (Solr Command Raw XML or JSON)
<add>
<doc>
<field name="id">1</field>
<field name="title">senior software engineer</field>
</doc>
</add>
The XML form was not working for me, maybe because of a commit issue. I tried with JSON and it works:
{
"id": "1",
"title": "software engineer"
}
Click on the "Query" tab and you will see your data there.
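One common pitfall with the raw XML form is curly quotes or unescaped characters in field values, which make the command malformed. A minimal sketch of building a well-formed add command (xmlEscape and addCommand are hypothetical helpers for illustration; the admin UI normally does this for you):

```scala
// Escape XML special characters so values like "R&D engineer" stay well-formed.
def xmlEscape(s: String): String =
  s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

// Build a Solr <add> command from field name/value pairs.
def addCommand(fields: Map[String, String]): String = {
  val fieldXml = fields
    .map { case (name, value) => "<field name=\"" + name + "\">" + xmlEscape(value) + "</field>" }
    .mkString
  "<add><doc>" + fieldXml + "</doc></add>"
}
```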
2. Set up the Replica Server
On each slave, copy the core directory from the master:
> scp -r ubuntu-master:/opt/solr/example/solr/jobs ./
Check the master configuration: search solrconfig.xml for "/replication" and add this configuration:
<lst name="master">
<str name="enable">${master.enable:false}</str>
<str name="replicateAfter">commit</str>
<str name="replicateAfter">startup</str>
<str name="confFiles">schema.xml,stopwords.txt</str>
</lst>
<lst name="slave">
<str name="enable">${slave.enable:false}</str>
<str name="masterUrl">${master.url:http://ubuntu-master:8983/solr/jobs}</str>
<str name="pollInterval">00:00:60</str>
<str name="httpConnTimeout">5000</str>
<str name="httpReadTimeout">10000</str>
</lst>
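The pollInterval above is in HH:mm:ss format, so the slaves check the master every 60 seconds. A quick sketch of converting that string into seconds, e.g. to reason about worst-case replication lag (pollIntervalSeconds is a hypothetical helper, not a Solr API):

```scala
// Convert Solr's HH:mm:ss pollInterval string into total seconds.
def pollIntervalSeconds(interval: String): Int = {
  val Array(h, m, s) = interval.split(":").map(_.toInt)
  h * 3600 + m * 60 + s
}
```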
Also, adjust the autoCommit settings in the configuration:
<autoCommit>
<maxDocs>300000</maxDocs>
<!-- 5 minutes -->
<maxTime>300000</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>
Apply the same changes to the Solr configuration files on the slaves. I have two slaves: ubuntu-dev1 and ubuntu-dev2.
Start the master with the master option enabled:
> java -Dmaster.enable=true -jar start.jar
Start the slaves on the slave servers:
> java -Dslave.enable=true -jar start.jar
From the server web UI console, I can see that replication is enabled:
http://ubuntu-master:8983/solr/#/jobs/replication
We can also go to the slave console to check:
http://ubuntu-dev1:8983/solr/#/jobs/query
Now I can add one more document on the master and verify that it gets indexed on the slaves.
Some console logging from the slaves:
529867 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Slave in sync with master.
589867 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Master's generation: 8
589868 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Slave's generation: 7
589869 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Starting replication process
589883 [snapPuller-10-thread-1] INFO org.apache.solr.handler.SnapPuller – Number of files in latest index in master: 52
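As the log shows, the slave compares index generation numbers and starts pulling when the master's generation is ahead of its own. A sketch of extracting those generation numbers from log lines like the ones above (hypothetical parsing, for illustration only):

```scala
// Extract the index generation number from a SnapPuller log line, if present.
val genPattern = """generation: (\d+)""".r

def generation(line: String): Option[Int] =
  genPattern.findFirstMatchIn(line).map(_.group(1).toInt)
```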
After the replication process, the latest data is searchable on both the master and the slaves.
3. Set up the Load Balancer
I am running HAProxy on the same machine as the Solr master, so I need to choose another port number. The configuration is as follows:
listen solr_cluster 0.0.0.0:8984
acl master_methods method POST DELETE PUT
use_backend solr_master_backend if master_methods
default_backend solr_read_backends
backend solr_master_backend
server solr-master ubuntu-master:8983 check inter 5000 rise 2 fall 2
backend solr_read_backends
balance roundrobin
server solr-slave1 ubuntu-dev1:8983 check inter 5000 rise 2 fall 2
server solr-slave2 ubuntu-dev2:8983 check inter 5000 rise 2 fall 2
It works well; we can check the status here:
http://ubuntu-master/haproxy-status
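The ACL above routes write methods (POST, PUT, DELETE) to the master backend and everything else to the round-robin read backends. The routing decision can be sketched like this (backendFor is a hypothetical function mirroring the config, not part of HAProxy):

```scala
// Mirror the HAProxy ACL: write methods go to the master backend,
// everything else is load-balanced across the read slaves.
val writeMethods = Set("POST", "DELETE", "PUT")

def backendFor(method: String): String =
  if (writeMethods.contains(method.toUpperCase)) "solr_master_backend"
  else "solr_read_backends"
```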
4. Build a Simple Client
https://github.com/takezoe/solr-scala-client
The CaseClassMapper in this library helps a lot:
package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.{IncludeConfig, IncludeLogger}
import jp.sf.amateras.solr.scala.SolrClient

/**
 * Created by carl on 8/6/15.
 */
object SolrClientDAO extends IncludeLogger with IncludeConfig {

  private val solrClient = {
    try {
      logger.info("Init the SOLR Client ---------------")
      val solrURL = config.getString(envStr("solr.url.jobs"))
      logger.info("SOLR URL = " + solrURL)
      new SolrClient(solrURL)
    } catch {
      case x: Throwable =>
        logger.error("Couldn't connect to SOLR: " + x)
        null
    }
  }

  def releaseResource(): Unit = {
    if (solrClient != null) {
      solrClient.shutdown()
    }
  }

  def addJob(job: Job): Unit = {
    solrClient.add(job)
  }

  def query(query: String): Seq[Job] = {
    logger.debug("Fetching the job results with query = " + query)
    val result = solrClient.query(query).getResultAs[Job]()
    result.documents
  }

  def commit(): Unit = {
    solrClient.commit()
  }
}
The build.sbt dependencies are as follows:
//for solr scala driver
resolvers += "amateras-repo" at "http://amateras.sourceforge.jp/mvn/"
"jp.sf.amateras.solr.scala" %% "solr-scala-client" % "0.0.12",
And the test class is as follows:
package com.sillycat.jobsconsumer.persistence

import com.sillycat.jobsconsumer.models.Job
import com.sillycat.jobsconsumer.utilities.IncludeConfig
import org.scalatest.{BeforeAndAfterAll, FunSpec, Matchers}

/**
 * Created by carl on 8/7/15.
 */
class SolrDAOSpec extends FunSpec with Matchers with BeforeAndAfterAll with IncludeConfig {

  override def beforeAll() {
    if (config.getString("build.env").equals("test")) {
    }
  }

  override def afterAll() {
  }

  describe("SolrDAO") {
    describe("#add and query") {
      it("Add one single job to Solr") {
        val num = 10000
        val start = System.currentTimeMillis()
        for (i <- 1 to num) {
          val job = Job("id" + i, "title" + i, "desc" + i, "industry" + i)
          SolrClientDAO.addJob(job)
        }
        val end = System.currentTimeMillis()
        println("total time for " + num + " is " + (end - start))
        println("it is " + num / ((end - start) / 1000) + " jobs/second")
        // SolrClientDAO.commit()
        // val result = SolrClientDAO.query("title:title1")
        // result should not be (null)
        // result.size > 0 should be (true)
        // result.foreach { item =>
        //   println(item.toString + "\n")
        // }
      }
    }
  }
}
To clean all the data during testing:
http://ubuntu-master:8983/solr/jobs/update?stream.body=%3Cdelete%3E%3Cquery%3E*:*%3C/query%3E%3C/delete%3E&commit=true
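The stream.body parameter above is just the XML delete command URL-encoded. A sketch of producing it (note that java.net.URLEncoder also escapes ':' and '/', which the handler accepts just as well as the partially encoded form above):

```scala
import java.net.URLEncoder

// URL-encode the delete-all command for use as a stream.body parameter.
val deleteAll = URLEncoder.encode("<delete><query>*:*</query></delete>", "UTF-8")
val url = "http://ubuntu-master:8983/solr/jobs/update?stream.body=" + deleteAll + "&commit=true"
```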
The data schema is stored and defined in conf/schema.xml; I updated it as follows:
<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="desc" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="industry" type="text_general" indexed="true" stored="true" multiValued="false"/>
Adding a single job at a time:
total time for 10000 is 180096
it is 55 jobs/second
To reduce logging overhead, find the log4j.properties file here and change the log level:
/opt/solr/example/resources/log4j.properties
After I turned off the logging and used 2 client threads, I got roughly the following performance on each thread:
total time for 10000 is 51688
it is 196 jobs/second
For a single thread, the performance is as follows:
total time for 10000 is 28398
it is 357 jobs/second
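The jobs/second figures above come from the integer arithmetic in the test, num / ((end - start) / 1000), so the results are truncated. A sketch of that computation (jobsPerSecond is a hypothetical helper matching the test's println):

```scala
// Throughput as printed by the test: integer division, so results are
// truncated, and elapsed must be at least 1000 ms to avoid division by zero.
def jobsPerSecond(num: Long, elapsedMillis: Long): Long =
  num / (elapsedMillis / 1000)
```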
References:
Setup Scaling Servers
http://blog.csdn.net/thundersssss/article/details/5385699
http://lutaf.com/197.htm
http://blog.warningrc.com/2013/06/10/Solr-data-backup.html
Single mode on Jetty
http://sillycat.iteye.com/blog/2227398
Load balance on the slaves
http://davehall.com.au/blog/dave/2010/03/13/solr-replication-load-balancing-haproxy-and-drupal
https://gist.github.com/feniix/1974460
http://stackoverflow.com/questions/10090386/how-to-check-solr-healthy-using-haproxy
Solr clients
https://github.com/takezoe/solr-scala-client
https://wiki.apache.org/solr/Solrj