I'm working on a knowledge base project that uses the SQL Server 2008 full-text search engine. The project consists of articles and files, where each article can have multiple files. The entire content of each article is pure HTML.
So far I have successfully created a full-text catalog and index on SQL Server 2008, and my database is compatible with version 10.
Here are my questions:
1) Is it possible to ignore HTML tags, that is, any text inside "<...>", when searching these articles? A search for div, table, etc. should return no results.
2) Articles can be updated at any time, so the full-text index must be updated whenever a record is inserted. Is it enough to set only "TRACK CHANGES AUTOMATIC" when creating the full-text catalog?
3) We may use the FILESTREAM feature later. Does SQL Server 2008 perform well when full-text indexing files? Which document types is SQL Server 2008 good at indexing?
Regards
2 Answers
#1
-1
Please check the following:
1) SQL Server full-text search lets you define noise words (called stopwords in SQL Server 2008). You can edit the noise-word list (in SQL Server 2008, a stoplist) and then rebuild the catalog. So you can add all the HTML tag names as noise words. Please see:
http://msdn.microsoft.com/en-us/library/ms142551.aspx
2) With change tracking, changes are included in the current full-text index automatically, but the ranking of newly added articles differs from that of previously indexed ones. So until the master index is synced, rankings will fluctuate.
3) As far as I know, we can implement custom filters, stemmers, and word breakers and plug them into SQL Server full-text search. I don't know the complete default list, but it handles doc and pdf.
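As a rough sketch of point 1 on SQL Server 2008, where stopwords live in stoplist objects rather than noise-word files: the stoplist name HtmlTagStoplist and the table dbo.Articles are hypothetical, and the table is assumed to already have a full-text index. Note that a stopword is ignored everywhere, so "table" would also stop matching when it appears in an article's visible text.

```sql
-- Hypothetical names: HtmlTagStoplist, dbo.Articles.
-- Requires database compatibility level 100.

-- Start from a copy of the system stoplist.
CREATE FULLTEXT STOPLIST HtmlTagStoplist FROM SYSTEM STOPLIST;
GO

-- Add tag names as stopwords for English (LCID 1033).
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'div' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST HtmlTagStoplist ADD 'table' LANGUAGE 1033;
GO

-- Attach the stoplist to the existing full-text index on the table.
ALTER FULLTEXT INDEX ON dbo.Articles SET STOPLIST HtmlTagStoplist;
GO
```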
For more information on full-text search in SQL Server 2008, see:
http://technet.microsoft.com/en-us/library/cc721269.aspx
#2
26
There is a filter for .htm and .html files.
To see whether the filter is installed, run this SQL:
SELECT * FROM sys.fulltext_document_types
You should see:
.htm E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation
.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation
So, if you can convert your articles column to varbinary(max), you can add a full-text index on it and specify a document type of '.html'.
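A minimal sketch of that conversion, under several assumptions: the table dbo.Articles, its text column Content, the primary key index PK_Articles, the catalog ArticleCatalog, and the new columns ContentBin and ContentType are all hypothetical names, and the table is assumed to have no full-text index yet (a table can have only one). WITH CHANGE_TRACKING AUTO is the index-level option that keeps the index updated as rows are inserted or changed.

```sql
-- Hypothetical schema: dbo.Articles(ArticleId INT, Content NVARCHAR(MAX))
-- with a unique, single-column, non-nullable key index named PK_Articles.

-- Add a binary copy of the content plus a type column that names the filter.
ALTER TABLE dbo.Articles ADD
    ContentBin  VARBINARY(MAX) NULL,
    ContentType NVARCHAR(8) NOT NULL
        CONSTRAINT DF_Articles_ContentType DEFAULT N'.html';
GO

-- Copy the existing HTML into the binary column.
-- Assumption: the HTML filter can detect the resulting encoding; depending on
-- the data, a byte-order mark or a meta charset tag may be needed.
UPDATE dbo.Articles
SET ContentBin = CONVERT(VARBINARY(MAX), Content);
GO

CREATE FULLTEXT CATALOG ArticleCatalog;
GO

-- TYPE COLUMN tells full-text search which filter to apply to each row.
CREATE FULLTEXT INDEX ON dbo.Articles
    (ContentBin TYPE COLUMN ContentType LANGUAGE 1033)
    KEY INDEX PK_Articles
    ON ArticleCatalog
    WITH CHANGE_TRACKING AUTO;
GO
```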
Once the index has been populated, you can verify the keywords using this SQL:
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords
(DB_ID('your_db'), OBJECT_ID('your_table'))
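To narrow that check to the tag names from the question, the same function can be filtered by term. This is only a sketch that reuses the placeholder names your_db and your_table from above. If the HTML filter is doing its job, a tag name should appear here only when the word also occurs in an article's visible text.

```sql
-- Look for tag names among the indexed keywords.
-- your_db and your_table are placeholders, as in the query above.
SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords
    (DB_ID('your_db'), OBJECT_ID('your_table'))
WHERE display_term IN (N'div', N'table');
```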