http://wiki.apache.org/solr/DataImportHandler#Commands
NOTE:Each table contains an additional column last_modified of timestamp type. You may want to download the database again since it has been updated recently. We use this timestamp field to determine what rows in each table have changed since the last indexed time.
Solution1:
Using delta-import command
Delta Import operation can be started by hitting the URL http://localhost:8983/solr/dataimport?command=delta-import . This operation will be started in a new thread and the status attribute in the response should be shown busy now. Depending on the size of your data set, this operation may take some time. At any time, you can hit http://localhost:8983/solr/dataimport to see the status flag.
When delta-import command is executed, it reads the start time stored in conf/dataimport.properties . It uses that timestamp to run delta queries and after completion, updates the timestamp in conf/dataimport.properties .
Note : there is an alternative approach for updating documents in Solr, which is in many cases more efficient and also requires less configuration explained on DataImportHandlerDeltaQueryViaFullImport .
Delta-Import Example
We will use the same example database used in the full import example. Note that the database schema has been updated and each table contains an additional column last_modified of timestamp type. You may want to download the database again since it has been updated recently. We use this timestamp field to determine what rows in each table have changed since the last indexed time.
Take a look at the following data-config.xml
<dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" pk="ID" query ="select * from item" deltaImportQuery ="select * from item where ID='${dataimporter.delta.id}'" deltaQuery ="select id from item where last_modified > '${dataimporter.last_index_time}'"> <entity name="feature" pk="ITEM_ID" query="select description as features from feature where item_id='${item.ID}'"> </entity> <entity name="item_category" pk="ITEM_ID, CATEGORY_ID" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"> <entity name="category" pk="ID" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"> </entity> </entity> </entity> </document> </dataConfig>
Pay attention to the deltaQuery attribute which has an SQL statement capable of detecting changes in the item table. Note the variable ${dataimporter.last_index_time } The DataImportHandler exposes a variable called last_index_time which is a timestamp value denoting the last time full-import 'or ' delta-import was run. You can use this variable anywhere in the SQL you write in data-config.xml and it will be replaced by the value during processing.
Note
-
The deltaQuery in the above example only detects changes in item but not in other tables. You can detect the changes to all child tables in one SQL query as specified below. Figuring out it's details is an exercise for the user
deltaQuery="select id from item where id in (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}') or id in (select item_id as id from item_category where item_id in (select id as item_id from category where last_modified > '${dataimporter.last_index_time}') or last_modified > '${dataimporter.last_index_time}') or last_modified > '${dataimporter.last_index_time}'"
- Writing a huge deltaQuery like the above one is not a very enjoyable task, so we have an alternate mechanism of achieving this goal.
<dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document> <entity name="item" pk="ID" query="select * from item" deltaImportQuery="select * from item where ID=='${dataimporter.delta.id}'" deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"> <entity name="feature" pk="ITEM_ID" query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/> <entity name="item_category" pk="ITEM_ID, CATEGORY_ID" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}"> <entity name="category" pk="ID" query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'" deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'" parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/> </entity> </entity> </document> </dataConfig>
Here we have three queries specified for each entity except the root (which has only two).
-
The query gives the data needed to populate fields of the Solr document in full-import
-
The deltaImportQuery gives the data needed to populate fields when running a delta-import
-
The deltaQuery gives the primary keys of the current entity which have changes since the last index time
-
The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.
Let us reiterate on the findings:
-
For each row given by query , the query of the child entity is executed once.
-
For each row given by deltaQuery , the parentDeltaQuery is executed.
- If any row in the root/child entity changes, we regenerate the complete Solr document which contained that row.
Solution2:
My delta-import goes out of memory . Any workaround ?
It is possible to do delta import using full-import. Taken from the delta-import example
<dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document> <entity name="findDelta" query="select id from item where id in (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}') or id in (select item_id as id from item_category where item_id in (select id as item_id from category where last_modified > '${dataimporter.last_index_time}') or last_modified > '${dataimporter.last_index_time}') or last_modified > '${dataimporter.last_index_time}'" rootEntity="false"> <entity name="item" query="select * from item where id='${findDelta.id}'> <entity name="feature" query="select description as features from feature where item_id='${item.ID}'"> </entity> <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"> <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'"> </entity> </entity> </entity> </entity> </document> </dataConfig>
notice that for the entity 'findDelta' rootEntity is set to false, so a document is created for each row from 'item'. The command has to be
command=full-import&clean=false
Solution3:
Using query attribute for both full and delta import
The standard approach in Solr is to define one query for the initial import and a second query to fetch the IDs of documents that have changed and a third query to fetch the data that changed. Especially if you expect a large number of changes this isn't very efficient. Furthermore if both in the initial import and the delta case you have the same SELECT list, its tedious to maintain 3 queries where two are almost identical and one still very similar. The fundamental idea is to only define one query for both the full and delta import by reading out the requests parameters on a full-import.
Example
So take the example from the docs:
<entity name="item" pk="ID" query="SELECT * FROM item" deltaImportQuery="SELECT * FROM item WHERE id = '${dataimporter.delta.id}'" deltaQuery="SELECT id FROM item WHERE last_modified > '${dataimporter.last_index_time}'">
This can be rewritten as follows:
<entity name="item" pk="ID"
query="SELECT * FROM item
WHERE '${dataimporter.request.clean}' != 'false'
OR last_modified > '${dataimporter.last_index_time}
'">
When doing a normal full import solr defaults the clean to true (watch out you can override this default in the solrconfig.xml so you might want to always set it explicitly to true). As a result the first part of the WHERE condition will be 'true' != 'false' which any decent RDBMS will figure out will lead to the entire all always evaluating to true aka reading the entire table: http://localhost:8983/solr/core0/dataimport?command=full-import&clean=true
Now when doing a delta import you do not use the delta import, but instead you do a normal full-import but with the clear GET parameter set to false : http://localhost:8983/solr/core0/dataimport?command=full-import&clean=false
In this case the first part of the WHERE will be 'false' != 'false' which is obviously always false and any RDBMS should optimize that away and just evaluate the second condition.
In this case it means obviously that in case you also want to use deletedPkQuery then when running the delta-import command is still necessary.