1. Introduction
A Solr index can accept data from many different sources, including XML files, comma-separated values (CSV) files, data extracted from databases, and files in common formats such as MS Word or PDF. There are three common methods of loading data into a Solr index:
* Using the Solr Cell framework built on Apache Tika for ingesting binary or structured files such as Office, Word, PDF, and other proprietary formats.
* Uploading XML files via HTTP requests.
* Writing a custom Java application with SolrJ. This may be the best choice.
2. The Post Tool
2.1 Indexing XML
$ bin/post -h
$ bin/post -c gettingstarted *.xml
$ bin/post -c gettingstarted -p 8984 *.xml
$ bin/post -c gettingstarted -d '<delete><id>42</id></delete>'
2.2 Indexing CSV
$ bin/post -c gettingstarted *.csv
Indexing a tab-separated file:
$ bin/post -c signals -params "separator=%09" -type text/csv data.tsv
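The separator=%09 parameter works because %09 is the URL-encoded tab character. A quick check, sketched in Python:

```python
from urllib.parse import quote

# A tab character URL-encodes to %09, which is why
# "separator=%09" tells the CSV handler to split on tabs.
print(quote("\t"))  # -> %09
```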
2.3 Indexing JSON
$ bin/post -c gettingstarted *.json
2.4 Indexing Rich Documents
$ bin/post -c gettingstarted a.pdf
$ bin/post -c gettingstarted afolder/
$ bin/post -c gettingstarted -filetypes ppt,html afolder/
3. Uploading Data with Index Handlers
Index Handlers are request handlers designed to add, delete, and update documents in the index.
In addition to plugins for importing rich documents with Tika and structured data with the Data Import Handler, Solr natively supports indexing XML, CSV, and JSON documents.
3.1 XML Formatted Index Updates
Content-Type: application/xml or Content-Type: text/xml
(1) Adding documents
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="price">12.40</field>
<field name="title" boost="2.0">Summer of the all-rounder: Test and championship cricket in England 1982</field>
</doc>
<doc boost="2.5">
...
</doc>
</add>
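An add message like the one above can also be built programmatically. A minimal sketch using Python's standard library (the field names and boost values simply mirror the example):

```python
import xml.etree.ElementTree as ET

# Build an <add> message with one <doc> and its <field> children,
# mirroring the XML example above.
add = ET.Element("add")
doc = ET.SubElement(add, "doc")
for name, value in [("authors", "Patrick Eagar"),
                    ("subject", "Sports"),
                    ("price", "12.40")]:
    field = ET.SubElement(doc, "field", name=name)
    field.text = value
title = ET.SubElement(doc, "field", name="title", boost="2.0")
title.text = "Summer of the all-rounder"
payload = ET.tostring(add, encoding="unicode")
print(payload)
```

The resulting string is what would be sent as the request body to the /update handler.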
(2) XML update commands
- Commit and Optimize
The <commit> operation writes all documents loaded since the last commit to disk. Until a commit is issued, newly indexed documents are not visible to searchers.
A commit can be issued explicitly with a <commit/> message, or it can be triggered by <autocommit> parameters in solrconfig.xml.
Parameters:
waitSearcher
expungeDeletes
The <optimize> operation requests that Solr merge its internal data structures for more efficient searching. On a large index, optimization can take some time.
Parameters:
waitSearcher
maxSegments
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
- Delete operations
Two approaches: "delete by ID" (the document's uniqueKey) or "delete by query".
A single delete message can contain multiple delete operations:
<delete>
<id>0002166313</id>
<id>0031745983</id>
<query>subject:sport</query>
<query>publisher:penguin</query>
</delete>
- Rollback operations
A rollback withdraws all add and delete operations made since the last commit.
<rollback/>
- Performing updates with curl
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml"
--data-binary '
<add>
<doc>
<field name="authors">Patrick Eagar</field>
<field name="subject">Sports</field>
<field name="dd">796.35</field>
<field name="isbn">0002166313</field>
<field name="yearpub">1982</field>
<field name="publisher">Collins</field>
</doc>
</add>'
curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml"
--data-binary @myfile.xml
curl http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E
(3) Using XSLT to Transform XML Index Updates
3.2 JSON Formatted Index Updates
Content-Type: application/json or Content-Type: text/json
- Adding documents
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
{
"id": "1",
"title": "Doc 1"
}'
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
[
{
"id": "1",
"title": "Doc 1"
},
{
"id": "2",
"title": "Doc 2"
}
]'
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary
@example/exampledocs/books.json -H 'Content-type:application/json'
- Sending update commands
curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/my_collection/update' --data-binary '
{
"add": {
"doc": {
"id": "DOC1",
"my_boosted_field": { /* use a map with boost/value for a boosted field */
"boost": 2.3,
"value": "test"
},
"my_multivalued_field": [ "aaa", "bbb" ] /* Can use an array for a multi-valued field */
}
},
"add": {
"commitWithin": 5000, /* commit this document within 5 seconds */
"overwrite": false, /* don't check for existing documents with the same uniqueKey */
"boost": 3.45, /* a document boost */
"doc": {
"f1": "v1", /* Can use repeated keys for a multi-valued field */
"f1": "v2"
}
},
"commit": {},
"optimize": { "waitSearcher":false },
"delete": { "id":"ID" }, /* delete by ID */
"delete": { "query":"QUERY" } /* delete by query */
}'
A simple delete-by-id:
{ "delete":"myid" }
{ "delete":["id1","id2"] }
{
"delete": { "id":50, "_version_":12345 }
}
Convenience request paths:
/update/json
/update/json/docs
- Transforming and indexing custom JSON
curl 'http://localhost:8983/solr/my_collection/update/json/docs'\
'?split=/exams'\
'&f=first:/first'\
'&f=last:/last'\
'&f=grade:/grade'\
'&f=subject:/exams/subject'\
'&f=test:/exams/test'\
'&f=marks:/exams/marks'\
-H 'Content-type:application/json' -d '
{
"first": "John",
"last": "Doe",
"grade": 8,
"exams": [
{
"subject": "Maths",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}'
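The effect of split=/exams plus the f=... field mappings can be simulated locally: each element of the "exams" array becomes one Solr document that also carries the parent-level fields. A rough sketch (not Solr's actual code) of the transformation:

```python
import json

# The input document from the curl example above.
src = json.loads('''{
  "first": "John", "last": "Doe", "grade": 8,
  "exams": [
    {"subject": "Maths",   "test": "term1", "marks": 90},
    {"subject": "Biology", "test": "term1", "marks": 86}
  ]
}''')

# split=/exams: one output document per array element;
# f=...: copy parent fields and the element's own fields.
docs = [{"first": src["first"], "last": src["last"], "grade": src["grade"],
         "subject": e["subject"], "test": e["test"], "marks": e["marks"]}
        for e in src["exams"]]
print(len(docs))  # -> 2
```

Solr would index two documents here, one per exam, each with first/last/grade repeated.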
3.3 CSV Formatted Index Updates
curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'
curl 'http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=%5c'
--data-binary @/tmp/result.txt
4. Uploading Data with Solr Cell using Apache Tika
The ExtractingRequestHandler uses Apache Tika to support uploading binary files, such as Word and PDF documents, for data extraction and indexing.
curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true'
-F "myfile=@example/exampledocs/solr-word.pdf"
- Configuring the ExtractingRequestHandler
solrconfig.xml must load the relevant JARs:
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
Then configure the handler in solrconfig.xml:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.Last-Modified">last_modified</str>
<str name="uprefix">ignored_</str>
</lst>
<!--Optional. Specify a path to a tika configuration file. See the Tika docs for
details.-->
<str name="tika.config">/my/path/to/tika.config</str>
<!-- Optional. Specify one or more date formats to parse. See
DateUtil.DEFAULT_DATE_FORMATS
for default date formats -->
<lst name="date.formats">
<str>yyyy-MM-dd</str>
</lst>
</requestHandler>
5. Uploading Structured Data Store Data with the Data Import Handler
Configure solrconfig.xml:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">/path/to/my/DIHconfigfile.xml</str>
</lst>
</requestHandler>
Configure DIHconfigfile.xml; for reference, see:
example/example-DIH/solr/db/conf/db-data-config.xml
6. Updating Parts of Documents
Solr supports two approaches to updating documents that have only partially changed:
- Atomic updates allow changing only one or more fields of a document without having to re-index the whole document.
- Optimistic concurrency, or optimistic locking, is a feature of many NoSQL databases that allows conditional updates based on a document's version.
6.1 Atomic Updates
set: set or replace a field value
add: add one or more values to a multi-valued field
remove: remove the specified values from a multi-valued field
removeregex: remove values matching the given regular expression from a multi-valued field
inc: increment a numeric field by a specific amount
An existing document:
{
"id":"mydoc",
"price":10,
"popularity":42,
"categories":["kids"],
"promo_ids":["a123x"],
"tags":["free_to_try","buy_now","clearance","on_sale"]
}
Applying this update command:
{
"id":"mydoc",
"price":{"set":99},
"popularity":{"inc":20},
"categories":{"add":["toys","games"]},
"promo_ids":{"remove":"a123x"},
"tags":{"remove":["free_to_try","on_sale"]}
}
The document after the update:
{
"id":"mydoc",
"price":99,
"popularity":62,
"categories":["kids","toys","games"],
"tags":["buy_now","clearance"]
}
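The before/after pair above can be reproduced with a local sketch of the atomic-update semantics (not Solr code): "set" replaces, "inc" adds to a number, "add" appends values, "remove" deletes matching values.

```python
# Apply a Solr-style atomic update command to a plain dict.
def apply_atomic(doc, update):
    out = dict(doc)
    for field, ops in update.items():
        if not isinstance(ops, dict):  # e.g. the "id" field
            continue
        if "set" in ops:
            out[field] = ops["set"]
        if "inc" in ops:
            out[field] = out.get(field, 0) + ops["inc"]
        if "add" in ops:
            vals = ops["add"] if isinstance(ops["add"], list) else [ops["add"]]
            out[field] = out.get(field, []) + vals
        if "remove" in ops:
            rm = ops["remove"] if isinstance(ops["remove"], list) else [ops["remove"]]
            out[field] = [v for v in out.get(field, []) if v not in rm]
    return out

before = {"id": "mydoc", "price": 10, "popularity": 42,
          "categories": ["kids"], "promo_ids": ["a123x"],
          "tags": ["free_to_try", "buy_now", "clearance", "on_sale"]}
update = {"id": "mydoc", "price": {"set": 99}, "popularity": {"inc": 20},
          "categories": {"add": ["toys", "games"]},
          "promo_ids": {"remove": "a123x"},
          "tags": {"remove": ["free_to_try", "on_sale"]}}
after = apply_atomic(before, update)
print(after["price"], after["popularity"], after["categories"])
```

The result matches the "after" document shown above (promo_ids is left empty once its only value is removed).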
6.2 Optimistic Concurrency
Optimistic concurrency is a feature of Solr that client applications can use to ensure that the document they are updating has not been concurrently modified by another client. It requires a _version_ field in every indexed document, which is compared against the _version_ specified in the update command. Solr's schema.xml includes a _version_ field by default.
<field name="_version_" type="long" indexed="false" stored="true" required="true" docValues="true"/>
The typical optimistic concurrency workflow is:
(1) The client reads a document, using /get to ensure it has the latest version.
(2) The client modifies the document locally.
(3) The client submits the modified document to Solr via /update.
(4) On a version conflict (HTTP error 409), the client starts the process over.
Update rules:
* If the _version_ value is greater than 1, the _version_ in the document must match the _version_ in the index.
* If the value is equal to 1, the document must exist; no version matching is performed. If the document does not exist, the update is rejected.
* If the value is less than 0, the document must not exist; if it does exist, the update is rejected.
* If the value is equal to 0, there is no version check at all: an existing document is overwritten, and a missing one is added.
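The four rules above can be sketched as a small decision function (a simplified model, not Solr internals):

```python
# Decide whether an update with a supplied _version_ is accepted.
# `indexed` is the document's current _version_, or None if the
# document does not exist.
def version_check(supplied, indexed):
    if supplied > 1:        # must match the indexed version exactly
        return indexed == supplied
    if supplied == 1:       # document must exist; no version matching
        return indexed is not None
    if supplied < 0:        # document must not exist
        return indexed is None
    return True             # supplied == 0: no check at all

print(version_check(999999, 1498562471222312960))  # False -> HTTP 409
```

This mirrors the curl examples that follow: a wrong existing version is rejected with a 409, while the correct version is accepted.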
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?versions=true' --data-binary '
[ { "id" : "aaa" },
{ "id" : "bbb" } ]'
{"responseHeader":{"status":0,"QTime":6},
"adds":["aaa",1498562471222312960,
"bbb",1498562471225458688]}
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?_version_=999999&versions=true'
--data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with wrong existing version" }]'
{"responseHeader":{"status":409,"QTime":3},
"error":{"msg":"version conflict for aaa expected=999999
actual=1498562471222312960",
"code":409}}
$ curl -X POST -H 'Content-Type: application/json'
'http://localhost:8983/solr/techproducts/update?_version_=1498562471222312960&versions=true&commit=true' --data-binary '
[{ "id" : "aaa",
"foo_s" : "update attempt with correct existing version" }]'
{"responseHeader":{"status":0,"QTime":5},
"adds":["aaa",1498562624496861184]}
$ curl 'http://localhost:8983/solr/techproducts/query?q=*:*&fl=id,_version_'
{
"responseHeader":{
"status":0,
"QTime":5,
"params":{
"fl":"id,_version_",
"q":"*:*"}},
"response":{"numFound":2,"start":0,"docs":[
{
"id":"bbb",
"_version_":1498562471225458688},
{
"id":"aaa",
"_version_":1498562624496861184}]
}}
7. De-duplication
Solr natively supports de-duplication via the <Signature> class, and it is easy to add new hash/signature implementations. A Signature can be implemented in several ways:
MD5Signature: a 128-bit hash
Lookup3Signature: a 64-bit hash
TextProfileSignature: fuzzy hashing from Nutch
Configuration:
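The idea behind a hash-based signature can be sketched as follows; this is only an illustration of the concept, and the real SignatureUpdateProcessorFactory has its own field-concatenation rules:

```python
import hashlib

# Hash the configured fields' values so documents with identical
# content in those fields produce the same 128-bit signature.
def md5_signature(doc, fields):
    h = hashlib.md5()
    for f in fields:
        h.update(str(doc.get(f, "")).encode("utf-8"))
    return h.hexdigest()

a = {"name": "iPod", "features": "music player", "cat": "electronics"}
b = dict(a)  # identical content -> identical signature
print(md5_signature(a, ["name", "features", "cat"]) ==
      md5_signature(b, ["name", "features", "cat"]))  # -> True
```

With overwriteDupes enabled, documents yielding the same signatureField value would overwrite each other instead of accumulating as duplicates.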
- solrconfig.xml
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<bool name="overwriteDupes">false</bool>
<str name="fields">name,features,cat</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
...
</requestHandler>
- schema.xml
<field name="signatureField" type="string" stored="true" indexed="true" multiValued="false" />
8. Language Detection at Index Time
The langid UpdateRequestProcessor can detect languages at index time and map text to language-specific fields. Solr supports two implementations:
* Tika's language detection feature: http://tika.apache.org/0.10/detection.html
* LangDetect language detection: http://code.google.com/p/language-detection/
<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
<lst name="defaults">
<str name="langid.fl">title,subject,text,keywords</str>
<str name="langid.langField">language_s</str>
</lst>
</processor>
9. Content Streams
When a SolrRequestHandler is accessed by URL, the SolrQueryRequest object that holds the request parameters can also contain a list of ContentStreams carrying bulk data for the request.
10. UIMA Integration
Apache Unstructured Information Management Architecture (UIMA)