概述
因工作需要,经常使用kettle进行数据采集,使用数据库作为etl资源库,存储各种sql算法,然后通过java调度程序,定时执行etl脚本。
现有各类sql算法已达到上千个,kettle目前仅提供根据脚本名称,描述信息进行脚本检索的方式,当需要根据表名,关键字等查找sql算法时,kettle工具不支持;此时,需要分析etl资源库逻辑结构,编写sql查询相关算法,手工实现;由于etl资源库被检索利用的频率很高,单纯的通过手工方式检索,无法满足日益快速的业务变更的需求,亟需一种可以简化算法搜索的手段。
本文介绍如何将etl资源库中的sql算法导入到solr,并通过velocity模版展示查询结果,以及通过highlight.js高亮显示sql脚本。
下载solr
Solr 官网:https://lucene.apache.org/solr/
目前最新版Solr8.0.0:http://www.apache.org/dyn/closer.lua/lucene/solr/8.0.0
下载solr安装包:solr-8.0.0.zip
并解压到目录:D:\dev\solr-8.0.0,目录结构如下:
Solr 结构化数据导入DIH
Solr提供一种方式,能够将数据库中的数据直接导入到solr,官方在线文档是7.7版本,但是同样适用于8.0,在线文档地址:
http://lucene.apache.org/solr/guide/7_7/uploading-structured-data-store-data-with-the-data-import-handler.html
启动solr示例:dih
solr start -e dih
INFO - 2019-04-12 09:08:16.169; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Starting up Solr on port 8983 using command:
"D:\dev\solr-8.0.0\bin\solr.cmd" start -p 8983 -s "D:\dev\solr-8.0.0\example\example-DIH\solr"
INFO - 2019-04-12 09:08:18.326; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Waiting up to 30 to see Solr running on port 8983
Solr dih example launched successfully. Direct your Web browser to http://localhost:8983/solr to visit the Solr Admin UI
Dataimport页面:
在左侧选择core:db,点击:dataimport,在右侧,点击:configuration,可以看到示例的数据库配置信息:
示例配置目录:solr-8.0.0\example\example-DIH\solr\db\conf
Solrconfig配置文件
启用dih,需要配置如下选项:
<!-- 启用dataimport 功能,需要如下配置 -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- 如果需要连接sqlserver数据库,建议将数据库驱动jar包单独放在dist/jdbc-lib 目录下 -->
<lib dir="${solr.install.dir:../../../..}/dist/jdbc-lib/" regex=".*\.jar" />
<!-- 请求处理 -->
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config.xml</str>
</lst>
</requestHandler>
Solr-velocity search ui
Solr提供一套快速搭建结果查询的界面,官方文档地址:
http://lucene.apache.org/solr/guide/7_7/velocity-search-ui.html
启动solr示例:techproducts
注意:启动之前,需要关闭dih示例,或者使用新端口启动。
solr start -e techproducts
INFO - 2019-04-12 22:16:44.914; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Solr home directory D:\dev\solr-8.0.0\example\techproducts\solr already exists.
Starting up Solr on port 8983 using command:
"D:\dev\solr-8.0.0\bin\solr.cmd" start -p 8983 -s "D:\dev\solr-8.0.0\example\techproducts\solr"
INFO - 2019-04-12 22:16:48.536; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Waiting up to 30 to see Solr running on port 8983
WARNING: Core 'techproducts' already exists!
Checked core existence using Core API command:
http://localhost:8983/solr/admin/cores?action=STATUS&core=techproducts
Solr techproducts example launched successfully. Direct your Web browser to http://localhost:8983/solr to visit the Solr Admin UI
browse页面:
打开示例自带的browse查询结果页面,地址:
http://localhost:8983/solr/techproducts/browse
当输入查询条件,显示结果如下:
Solrconfig配置文件
配置文件中与velocity相关的内容如下:
<lib dir="${solr.install.dir:../../../..}/contrib/velocity/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-velocity-\d.*\.jar" />
<requestHandler name="/browse" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- VelocityResponseWriter settings -->
<str name="wt">velocity</str>
<str name="v.template">browse</str>
<str name="v.layout">layout</str>
<str name="title">Solr-itas</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>
<str name="mm">100%</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="mlt.qf">
text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>
<str name="mlt.fl">text,features,name,sku,id,manu,cat,title,description,keywords,author,resourcename</str>
<int name="mlt.count">3</int>
<!-- Faceting defaults -->
<str name="facet">on</str>
<str name="facet.missing">true</str>
<str name="facet.field">cat</str>
<str name="facet.field">manu_exact</str>
<str name="facet.field">content_type</str>
<str name="facet.field">author_s</str>
<str name="facet.query">ipod</str>
<str name="facet.query">GB</str>
<str name="facet.mincount">1</str>
<str name="facet.pivot">cat,inStock</str>
<str name="facet.range.other">after</str>
<str name="facet.range">price</str>
<int name="f.price.facet.range.start">0</int>
<int name="f.price.facet.range.end">600</int>
<int name="f.price.facet.range.gap">50</int>
<str name="facet.range">popularity</str>
<int name="f.popularity.facet.range.start">0</int>
<int name="f.popularity.facet.range.end">10</int>
<int name="f.popularity.facet.range.gap">3</int>
<str name="facet.range">manufacturedate_dt</str>
<str name="f.manufacturedate_dt.facet.range.start">NOW/YEAR-10YEARS</str>
<str name="f.manufacturedate_dt.facet.range.end">NOW</str>
<str name="f.manufacturedate_dt.facet.range.gap">+1YEAR</str>
<str name="f.manufacturedate_dt.facet.range.other">before</str>
<str name="f.manufacturedate_dt.facet.range.other">after</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">content features title name</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre"><b></str>
<str name="hl.simple.post"></b></str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.name.hl.fragsize">0</str>
<str name="f.name.hl.alternateField">name</str>
<str name="f.content.hl.snippets">3</str>
<str name="f.content.hl.fragsize">200</str>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">750</str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<!-- append spellchecking to our list of components -->
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy">
<str name="template.base.dir">${velocity.template.base.dir:}</str>
</queryResponseWriter>
手工创建solr core后,/browse 节点配置如下:
需要手工添加相关配置节点才能够满足常规使用。
<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
</requestHandler>
highlight语法高亮
用途:
用于在web上高亮显示sql等语法;
用法:
下载highlight.zip
配置highlight
解压highlight.zip,目录结构如下:
拷贝Css文件存储在目录:solr-8.0.0\server\solr-webapp\webapp\css\highlight\styles
拷贝Js文件存储在目录:solr-8.0.0\server\solr-webapp\webapp\js\highlight\highlight.pack.js
Etl资源库分析算法
创建视图
用于分析etl资源库内部信息:
drop view v_etl_sql_analyze;
go
create view v_etl_sql_analyze as
SELECT 'etl_db_' + CAST(a1.id_step AS VARCHAR(20))+ CAST(a2.id_step AS VARCHAR(20)) as id
, 'etl_db' AS db_name
, b.modified_date
, a1.id_step
, b.id_directory
, c.directory_name
, b.id_transformation
, b.name AS trans_name
, CAST (b.description AS VARCHAR (200)) AS trans_desc
, d1.name AS db_from
, d2.name AS db_to
, CAST (a1.value_str AS VARCHAR (8000)) AS step_from
, CAST (a2.value_str AS VARCHAR (80)) AS step_to
FROM etl_db..r_step_attribute a1
, etl_db..r_step_attribute a2
, etl_db..r_step_attribute a1d
, etl_db..r_step_attribute a2d
, etl_db..R_DATABASE D1
, etl_db..R_DATABASE D2
, etl_db..r_trans_hop h
, etl_db..r_transformation b
LEFT JOIN etl_db..r_directory c ON b.id_directory = c.id_directory
WHERE h.id_transformation = b.id_transformation
AND a1.id_step = h.id_step_from
AND a1d.id_step = h.id_step_from
AND a1.code = 'sql'
AND a1d.code = 'id_connection'
AND a1d.VALUE_NUM = d1.ID_DATABASE
AND a2.id_step = h.id_step_to
AND a2d.id_step = h.id_step_to
AND a2.code = 'table'
AND a2d.code = 'id_connection'
AND a2d.VALUE_NUM = d2.ID_DATABASE
视图查询:
综合示例
上面介绍基本的知识点,下面开始综合使用上述知识,展示etl资源库的sql算法。
创建core
进入bin目录,执行:
solr start -f -m 1g -p 8980
solr create -c etl -s 2 -rf 2
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
To turn off: bin\solr config -c etl -p 8983 -action set-user-property -property update.autoCreateFields -value false
INFO - 2019-04-12 08:50:36.339; org.apache.solr.util.configuration.SSLCredentialProviderFactory; Processing SSL Credential Provider chain: env;sysprop
Created new core 'etl'
创建后,默认放置在目录:
D:\dev\solr-8.0.0\server\solr\etl
开发环境
基于eclipse,可以方便修改velocity配置,配置流程:
1、新建maven工程:etl-solr
2、将solr目录(D:\dev\solr-8.0.0\server\solr)拷贝到webapp目录下:
安装Veloedit插件
Eclipse安装veloedit插件步骤:
打开Help->Eclipse Marketplace,在search中输入:velocity,安装veloedit:
配置solr启动脚本
为了启动eclipse中的solr目录,需要设置-s参数启动solr服务,命令如下:
@echo off
title etl-solr-etl[8980][%date% %time%]
set solr.home=D:\etl-solr\src\main\webapp\solr
set path=%path%;D:\dev\solr-8.0.0\bin
solr start -f -m 1g -p 8980 -s %solr.home%
配置managed-schema
添加如下字段
<field name="db_from" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="db_name" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="db_to" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="directory_name" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="modified_date" type="pdate" uninvertible="true" indexed="true" stored="true"/>
<field name="step_from" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="step_to" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="text" type="text_general" uninvertible="true" multiValued="true" indexed="true" stored="false"/>
<field name="trans_desc" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="trans_name" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<copyField source="db_from" dest="text"/>
<copyField source="db_name" dest="text"/>
<copyField source="db_to" dest="text"/>
<copyField source="directory_name" dest="text"/>
<copyField source="step_from" dest="text"/>
<copyField source="step_to" dest="text"/>
<copyField source="trans_desc" dest="text"/>
<copyField source="trans_name" dest="text"/>
Solrconfig配置文件
添加dataimport配置:
默认创建的core,不支持dataimport,需要手工添加如下配置:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- jdbc driver -->
<lib dir="${solr.install.dir:../../../..}/dist/jdbc-lib/" regex=".*\.jar" />
<requestHandler name="/dataimport" class="solr.DataImportHandler">
<lst name="defaults">
<str name="config">db-data-config-etl.xml</str>
</lst>
</requestHandler>
Etl资源库使用的是sqlserver,需要添加驱动到目录:Jdbc-lib
数据库配置db-data-config-etl.xml
官方提供的数据配置文件过于繁琐,实际使用时,建议通过视图方式简化连接方式。简化后的配置文件如下:
本示例用于加载sqlserver数据库中的视图,视图中的字段命名与schema中的命名尽量一致,减少字段映射;
<dataConfig>
<dataSource
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://192.168.1.2:1433; DatabaseName=etl_db"
user="sa"
password="123456"/>
<document>
<entity name="etl_sql_analyze"
query="select * from v_etl_sql_analyze"
deltaQuery="select id from v_etl_sql_analyze where modified_date > '${dataimporter.last_index_time}'">
<!-- 如果数据库字段命名与solr-schema 配置一致,不需要在此处配置映射
<field column="trans_name" name="trans_name" />
trans_desc -->
</entity>
</document>
</dataConfig>
执行数据导入:
由于kettle保存脚本文件时,采用的是删除,插入模式,即不存在更新操作,所以每次执行solr数据导入时,需要勾选:clean,commit,然后执行:execute,执行后,点击:refresh status 查询执行状态。
Query
执行query,验证数据导入情况
browse配置
默认的browse配置,无法满足检索要求,需要根据techproducts示例提供的配置进行参数调整,调整后如下:
<requestHandler name="/browse" class="solr.SearchHandler" useParams="query,facets,velocity,browse">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- VelocityResponseWriter settings -->
<str name="wt">velocity</str>
<str name="v.template">browse</str>
<str name="v.layout">layout</str>
<str name="title">etl-sql-analyze</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="qf">
text^0.5 db_name^1.0 trans_name^2.0 trans_desc^2.0 directory_name^1.0 id^1.0 db_from^1.1 db_to^1.1
step_from^10.0 step_to^5.0
</str>
<str name="mm">100%</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="mlt.qf">
text^0.5 db_name^1.0 trans_name^2.0 trans_desc^2.0 directory_name^1.0 id^1.0 db_from^1.1 db_to^1.1
step_from^10.0 step_to^5.0
</str>
<str name="mlt.fl">db_name,trans_name,trans_desc,directory_name,db_from,db_to,step_from,step_to</str>
<int name="mlt.count">3</int>
<!-- Faceting defaults -->
<str name="facet">on</str>
<str name="facet.missing">true</str>
<str name="facet.field">db_name</str>
<str name="facet.field">db_from</str>
<str name="facet.field">db_to</str>
<str name="facet.field">step_to</str>
<str name="facet.query">dbcenter</str>
<str name="facet.query">cmp</str>
<str name="facet.mincount">1</str>
<str name="facet.pivot">db_from,db_to</str>
<str name="facet.range">modified_date</str>
<str name="f.modified_date.facet.range.start">NOW/YEAR-5YEARS</str>
<str name="f.modified_date.facet.range.end">NOW</str>
<str name="f.modified_date.facet.range.gap">+1YEAR</str>
<str name="f.modified_date.facet.range.other">before</str>
<str name="f.modified_date.facet.range.other">after</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">*</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre"><b><font color="#c00"></str>
<str name="hl.simple.post"></font></b></str>
<str name="f.title.hl.fragsize">0</str>
<str name="f.title.hl.alternateField">title</str>
<str name="f.name.hl.fragsize">0</str>
<str name="f.name.hl.alternateField">name</str>
<str name="f.content.hl.snippets">3</str>
<str name="f.content.hl.fragsize">200</str>
<str name="f.content.hl.alternateField">content</str>
<str name="f.content.hl.maxAlternateFieldLength">750</str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<!-- append spellchecking to our list of components -->
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
velocity配置模版
velocity 目录提供了简易的查询界面配置模版;
已知限制
1、browse 与 VelocityResponseWriter 组件在solr下提供内容服务,并可以使用solr http api 方式访问。高级用户潜在访问solr其他部分内容的问题。
2、velocity模版存在硬编码的字段,由于其存储在conf目录下,配置内容需要与schema.xml,solrconfig.xml保持一致。
复制velocity配置
复制techproducts的velocity配置目录:
solr-8.0.0\example\techproducts\solr\techproducts\conf\velocity
拷贝到目录:
solr\etl\conf\velocity
修改browse.vm
#**
* Main entry point into the /browse templates
*#
#set($searcher = $request.searcher)
#set($params = $request.params)
#set($clusters = $response.response.clusters)
#set($mltResults = $response.response.get("moreLikeThis"))
#set($annotate = $params.get("annotateBrowse"))
#parse('query_form.vm')
#parse('did_you_mean.vm')
<div class="navigators">
#parse("facets.vm")
</div>
<div class="content_center">
<div class="pagination">
#parse("pagination_top.vm")
</div>
## Show Error Message, if any
<div class="error">
#parse("error.vm")
</div>
## Render Results, actual matching docs
<div class="results">
#parse("results_list.vm")
</div>
<div class="pagination">
#parse("pagination_bottom.vm")
</div>
<div id="footer">
#parse("footer.vm")
</div>
</div>
<div class="result-sql">
#parse("etl_sql.vm")
</div>
修改Tabs.vm
当前工程未包含地理位置信息,注释掉页签:Spatial, Group
#**
* Provides navigation/access to Advanced search options
* Usually displayed near the top of the page
*#
##TODO: Make some nice tabs here
#set($queryOpts = $params.get("queryOpts"))
<div class="tabs-bar" #annTitle("Click the link to demonstrate various Solr capabilities")>
<span>Type of Search:</span>
##queryOpts=$queryOpts
## return to Simple Search
##set( $selected = ($queryOpts && $queryOpts != "") )
#set( $selected = ! $queryOpts )
<span class="tab #if($selected)selected#end">
#if($selected)
Simple
#else
<a href="#url_for_home/?#debug#annotate">
Simple</a>
#end
</span>
## ## GEO-Spatial / Location Based
## #set( $selected = ($queryOpts == "spatial") )
## <span class="tab #if($selected)selected#end">
## #if($selected)
## Spatial
## #else
## <a href="#url_for_home?&queryOpts=spatial#debug#annotate">
## Spatial</a>
## #end
## </span>
## ## Group By Field
## #set( $selected = ($queryOpts == "group") )
## <span class="tab #if($selected)selected#end">
## #if($selected)
## Group By
## #else
## <a href="#url_for_home?#debug#annotate&queryOpts=group&group=true&group.field=manu_exact">
## Group By</a>
## #end
## </span>
</div>
修改hit.vm
用于根据自定义的字段【trans_name】加载模版【etl_doc.vm】
#**
* Called for each matching document but then
* calls one of product_doc, join_doc or richtext_doc
* depending on which fields the doc has
*#
#set($docId = $doc.getFieldValue('id'))
<div class="result-document">
## Has a "trans_name" field ?
#if($doc.getFieldValue('trans_name'))
#parse("etl_doc.vm")
## Has a "compName_s" field ?
#elseif($doc.getFieldValue('compName_s'))
#parse("join_doc.vm")
## Fallback to richtext_doc
#else
#parse("richtext_doc.vm")
#end
</div>
新增etl_doc.vm
脚本名称:etl_doc.vm,根据managed-schema内容配置页面显示模版:
#**
* Render a hit representing a etl
* assumed to have a field called "trans_name"
*#
<link rel="stylesheet" href="/solr/css/highlight/styles/rainbow.css">
<script src="/solr/js/highlight/highlight.pack.js"></script>
<script>
hljs.initHighlightingOnLoad();
function show_sql(div_id){
$('#highlight_sql').html(document.getElementById(div_id.trim()).innerHTML);
return false;
}
</script>
<div><b><a href="javascript:void(0)" onclick='show_sql("#field("id")")'>用途:#field('trans_desc')</a></b></div>
<div>资源库:[#field('db_name')]
脚本:#if($doc.getFieldValue('directory_name'))/#field('directory_name')/#else/#end#field('trans_name')</div>
<div>数据流:[#field('db_from') --> #field('db_to')]
目标表:[#field('step_to')]</div>
<div>
#set($str=$doc.getFieldValue('step_from')[0])
#if($str.length()>200)
#set($str=$str.substring(0,200))
<pre><code class="sql">$str ...</code></pre>
#else
<pre><code class="sql">$str</code></pre>
#end
</div>
<div id=#field('id') class="hidden">
<pre><code class="sql">$doc.getFieldValue('step_from')[0]</code></pre>
</div>
#parse('debug.vm')
新增etl_sql.vm
#**
* Render a hit representing a etl sql
* assumed to have a field called "trans_name"
*#
<h2>sql</h2>
<div id="highlight_sql"/>
click title to show full sql.
最终效果:
通过点击标题:[用途],可以在右侧显示详细的sql代码:
–END–