How to ignore HTML tags in SQL Server 2008 Full-Text Search

I'm working on a knowledge base project using the SQL Server 2008 full-text search engine. The project consists of articles and files, where each article has multiple files. The entire content of each article is pure HTML.

Right now, I have successfully created a full-text catalog and index on SQL Server 2008, and my database is version 10 compatible.

Here are my questions:

1) Is it possible to ignore HTML tags (more precisely, any text inside "<...>") when searching these articles? If I search for div, table, etc., no results should be returned.

2) Articles can be updated at any time, so the full-text index must be updated whenever a record is inserted. Is it enough to set only "TRACK CHANGES AUTOMATIC" when creating the full-text catalog?

3) We may use the FILESTREAM feature later. Does SQL Server 2008 perform well when full-text indexing files? Which document types is SQL Server 2008 good at indexing?

Regards

2 Answers

#1

-1

Please check the following:

1) In SQL Server Full-Text Search, you can define noise words (stopwords). You can edit the noise word file and then rebuild the catalog, so you can register all the HTML tags as noise words. Please see:

http://msdn.microsoft.com/en-us/library/ms142551.aspx
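
As a rough illustration of the stopword idea on SQL Server 2008, where stopwords are managed through stoplist objects, the sketch below creates a stoplist, adds a few tag names to it, and attaches it to an existing full-text index. The names HtmlTagStoplist and dbo.Articles are hypothetical, and the tag list is only a sample. Note that a stopword is dropped wherever it occurs, so a word such as "table" would also stop matching in ordinary article text.

```sql
-- Sketch only: HtmlTagStoplist and dbo.Articles are hypothetical names.
-- Start from the system stoplist so the standard stopwords are kept.
CREATE FULLTEXT STOPLIST HtmlTagStoplist FROM SYSTEM STOPLIST;
GO

-- Add a few tag names as stopwords for English (LCID 1033).
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'div' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'table' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'span' LANGUAGE 1033;
GO

-- Attach the stoplist to the table's existing full-text index.
-- Changing the stoplist triggers a repopulation unless WITH NO POPULATION is given.
ALTER FULLTEXT INDEX ON dbo.Articles SET STOPLIST HtmlTagStoplist;
GO
```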

2) With change tracking, changes are automatically included in the current full-text search, but the ranking of newly added articles differs from that of previously indexed ones. Until the master index is synced, rankings will fluctuate.
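
A minimal sketch of what this could look like in T-SQL, assuming a table named dbo.Articles and a catalog named ArticleCatalog (both hypothetical). Change tracking is a property of the full-text index rather than of the catalog, and reorganizing the catalog runs a master merge of its index fragments.

```sql
-- Sketch only: dbo.Articles and ArticleCatalog are hypothetical names.
-- Turn on automatic change tracking for the table's full-text index.
ALTER FULLTEXT INDEX ON dbo.Articles SET CHANGE_TRACKING AUTO;
GO

-- Confirm the current change tracking mode of each full-text index.
SELECT OBJECT_NAME(object_id) AS table_name,
       change_tracking_state_desc
FROM sys.fulltext_indexes;
GO

-- Merge the index fragments into the master index, for example during off-peak hours.
ALTER FULLTEXT CATALOG ArticleCatalog REORGANIZE;
GO
```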

3) As far as I know, you can implement custom filters, stemmers, and word breakers and plug them into SQL Server full-text search. I don't know the complete list of document types supported by default, but it handles doc and pdf.

For more information on full-text search in SQL Server 2008, please see:

http://technet.microsoft.com/en-us/library/cc721269.aspx

#2

26

There is a filter for .htm and .html files.

To see whether you have the filter installed, run this SQL:

SELECT * FROM sys.fulltext_document_types

You should see:

.htm E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

So, if you can convert your articles column to varbinary(max), you can add a full-text index on it and specify a document type of '.html'.
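
One way this conversion might be done is sketched below. It assumes a table dbo.Articles with an nvarchar(max) column named Body and an existing full-text index on the table; the names Body, BodyBinary, and DocType are all hypothetical. The document type is supplied through a type column that holds the file extension for each row.

```sql
-- Sketch only: dbo.Articles, Body, BodyBinary and DocType are hypothetical names.
-- Add a binary copy of the article HTML plus a column holding its document type.
ALTER TABLE dbo.Articles ADD BodyBinary varbinary(max) NULL;
ALTER TABLE dbo.Articles
    ADD DocType varchar(8) NOT NULL
        CONSTRAINT DF_Articles_DocType DEFAULT '.html';
GO

-- Copy the existing HTML into the binary column.
-- Assumption: Body is nvarchar(max), so the bytes are UTF-16; check that the
-- filter detects the encoding correctly for your data.
UPDATE dbo.Articles
SET BodyBinary = CAST(Body AS varbinary(max));
GO

-- Add the binary column to the existing full-text index, telling the engine
-- to pick the filter from the DocType column ('.html' selects the HTML filter).
ALTER FULLTEXT INDEX ON dbo.Articles
    ADD (BodyBinary TYPE COLUMN DocType LANGUAGE 1033);
GO
```

If the table has no full-text index yet, the same column specification can be given to CREATE FULLTEXT INDEX instead.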

Once the index has been populated, you can verify the keywords using this SQL:

SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID('your_db'), OBJECT_ID('your_table'))
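
To check specifically that tag names were not indexed, the keyword query can be narrowed to a few of them, as in this sketch. The placeholders your_db and your_table come from the query above; the tag list is only a sample. A tag name can still legitimately appear if the word occurs in the visible text of an article.

```sql
-- Sketch only: replace your_db and your_table with real names.
-- Look for tag names among the indexed keywords; with the HTML filter in use,
-- rows returned here should come from visible article text, not from markup.
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords(DB_ID('your_db'), OBJECT_ID('your_table'))
WHERE display_term IN (N'div', N'span', N'table')
ORDER BY document_count DESC;
```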
