Ryan Rinaldi

.net development served with a slice of sarcasm
  • rss
  • About Me
  • Blog
  • Contact

SQL Server full text indexing of HTML content

Ryan Rinaldi 12/29/2008 3:40:19 AM

Sorry for the long and boring title, but I wanted to make it easy for Google to find this post.

If you happen to have a site that is storing Unicode (that's bold because it's important, as you will see later!) HTML content in a SQL Server database and you want to enable full text indexing there are a few hoops you have to jump through that aren't obvious.

First of all, you can't store your content as varchar, nvarchar or text.  The assumption is that the text in those columns is plain text so the indexer uses a vanilla iFilter instead of the iFilter for HTML.  To use the HTML iFilter you will need to create a pair of columns.  One "Document Type" column and another varbinary column to store your content.  What I did was create a document type column and a persisted calculated varbinary column that was the column actually indexed (not the column that the application edits).

In the end I had 3 columns:

  1. [Content] -> This is the varchar(max) column that my application updates
  2. [IndexedContent] -> This is the varbinary(max) column that is indexed
  3. [ContentDocumentType] -> This is the (obviously) Document Type column that tells SQL Server to use the HTML iFilter.

Here is the code to setup full text indexing (let's assume you wanted to index some content in a CMS):

EXEC sp_fulltext_catalog @ftcat= 'ContentManagementSystem' , 
     @action= 'Create'
GO
 
EXEC sp_fulltext_table 'dbo.Entries', 'create', 'ContentManagementSystem', 'PK_Entries_EntryID';
GO
 
EXEC sp_fulltext_service @action='load_os_resources', @value=1;
GO
 
ALTER TABLE Entries
    ADD IndexedContent AS (0xFFFE+CONVERT([varbinary](max),[Content],0))
GO
 
ALTER TABLE Entries
    ADD ContentDocumentType AS 'htm'
GO
 
EXEC sp_fulltext_column 'dbo.Entries', 'IndexedContent', 'add', 'ContentDocumentType'
GO
 
EXEC sp_fulltext_table 'dbo.Entries','start_change_tracking';
GO
 
EXEC sp_fulltext_table 'dbo.Entries','activate';
GO

The key part is where you see 0xFFEE.  When you are storing Unicode content in a varbinary field, the field needs to start with those magic bits or else the indexer doesn't index the field properly and you will get ZERO results back.

Comments
1 Comment
Tags
blog

1 responses.

 avatar

says:

1.11.2009 at 6:45 PM

Hi, this caught my interest.

Firstly, is it right to say that the problem being solved here is that when you store HTML, some words will be next to a tag without a punctuation character in between, and so not be recognised?

And is it right to say that your solution is to apply the HTML iFilter, but that to do this you need to duplicate the 'Content' field because the iFilter alters the field to which it is applied, which would render it useless for publishing as opposed to searching purposes?

Do you know of a way to do full-text comparisons in SQL Server, to get for example a list of related records ranked by the strength of the match?

Leave a Comment

  • My Tweets

    • Looking at our options to move to TFS 2010 from 08 and change process template: just as painful to migrate off of TFS. sigh.
    • RT @Snywhip: I'm very proud of Spain and it's banning of the bull fights starting in 2012. Finally ;)
    • RT @adamtybor: @sergiopereira happy birthday!
    • Follow Me on Twitter
Powered by Graffiti CMS
  • Content © Ryan Rinaldi 2010
  • design by jide
  • Theme converted to GraffitiCMS by Adonis Bitar