1. Introduction
The first article on 'Real World ASP.NET Search Engines' at http://www.kdkeys.net/forums/4548/ShowPost.aspx introduces text indexing and text searching in CommunityServer's ASP.NET Search Engine. It presents some of the reasons why local Search Engines are used in Websites and why you may need to use a custom Search Engine for your own Website.
This article presents a detailed explanation of how text indexing and text searching work in CommunityServer's ASP.NET Search Engine. If you haven't read the first article, please take a moment to read it at http://www.kdkeys.net/forums/4548/ShowPost.aspx .
2. Text Indexing
To index the text of a Weblog, the IndexPosts method of the WeblogSearch class creates an instance of the WeblogDataProvider class and calls its SearchReindexPosts method to get the list of Weblog posts whose text will be indexed:
WeblogDataProvider wdp = WeblogDataProvider.Instance();
PostSet postSet = wdp.SearchReindexPosts(setSize, settingsID);
The PostSet object is a collection of Weblog post objects. By iterating through the posts in the PostSet collection, each post can be indexed for searching. The PostSet collection could equally be a collection of Forum posts, or of Web pages where each post in the collection is a single Web page. If a collection of Forum posts needs to be indexed, for example, an instance of the ForumDataProvider class is created and its SearchReindexPosts method is executed to return a collection of posts:
ForumDataProvider fdp = ForumDataProvider.Instance();
PostSet postSet = fdp.SearchReindexPosts(setSize, settingsID);
An indexable Post object could consist of different types of text. For example, if a Post object represents a Web page, then the Web page can be defined as a Post object consisting of Title, Keywords meta tags, Description meta tags, Author meta tags, Subject and Body. Likewise, a Forum post consists of a Username (post author), Subject and Body. To index text, the Search.Index method is called from the ForumSearch, GallerySearch or WeblogSearch class with the indexable Post property as a parameter.

ForumSearch.IndexPosts:
Hashtable words = new Hashtable();
// Index the Post Author
words = Index(post.Username, words, WordLocation.Author, settingsID);
// Index the Post Subject
words = Index(post.Subject, words, WordLocation.Subject, settingsID);
// Index the Post Body
words = Index(post.Body, words, WordLocation.Body, settingsID);
If the Post object represented a Web page and the meta tags needed to be considered by this Search Engine, the meta tags could be indexed with calls similar to:
//Index the Keywords Meta Tag
words = Index(post.Keywords, words, WordLocation.Keywords, settingsID);
//Index the Description Meta Tag
words = Index(post.Description, words, WordLocation.Description, settingsID);
GallerySearch.IndexPosts:
Picture picture = post as Picture;
if (picture != null)
{
    // Index the Picture Subject
    words = Index(picture.Subject, words, WordLocation.Subject, settingsID);
    // Index the Picture's attachment filename as part of the subject,
    // unless it is identical to the subject
    if (picture.Subject != picture.AttachmentFilename)
        words = Index(picture.AttachmentFilename, words, WordLocation.Subject, settingsID);
    // Index the Picture Body
    words = Index(picture.Body, words, WordLocation.Body, settingsID);
    // Get a count of the total words in the body
    totalBodyWords = CleanSearchTerms(post.Body).Length;
    // Save the indexed picture to the database
    InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);
}
GalleryComment comment = post as GalleryComment;
if (comment != null)
{
    // Index the Comment Body
    words = Index(comment.Body, words, WordLocation.Body, settingsID);
    // Get a count of the total words in the body
    totalBodyWords = CleanSearchTerms(post.Body).Length;
    // Save the indexed comment to the database
    InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);
}
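The article does not show the body of the Index method, so the following is only a minimal sketch of what such a method might do, assuming it splits the text into words, keys each word by an integer hash, and counts how often the word occurs in each location. The Word fields, the three-value WordLocation enum, the separator list and the use of GetHashCode are all assumptions made for illustration, and the settingsID parameter is left out for brevity.

```csharp
using System;
using System.Collections;

// Hypothetical stand-ins for the Community Server types named in the article.
public enum WordLocation { Author, Subject, Body }

public class Word
{
    public string Text;
    public int AuthorCount, SubjectCount, BodyCount; // occurrences per location (assumed fields)
    public double Weight;

    public Word(string text) { Text = text; }

    public void Increment(WordLocation location)
    {
        switch (location)
        {
            case WordLocation.Author: AuthorCount++; break;
            case WordLocation.Subject: SubjectCount++; break;
            default: BodyCount++; break;
        }
    }
}

public static class IndexSketch
{
    static readonly char[] Separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    // Adds every word of 'text' to 'words' and returns the same Hashtable,
    // so calls can be chained the way the article's snippets chain them.
    public static Hashtable Index(string text, Hashtable words, WordLocation location)
    {
        if (string.IsNullOrEmpty(text)) return words;

        string[] tokens = text.ToLowerInvariant()
                              .Split(Separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string token in tokens)
        {
            // Assumption: the real engine may hash words differently.
            // Two different words with the same hash would be merged here.
            int key = token.GetHashCode();
            Word word = words[key] as Word;
            if (word == null)
            {
                word = new Word(token);
                words[key] = word;
            }
            word.Increment(location);
        }
        return words;
    }
}
```

Calling IndexSketch.Index first with a subject and then with a body, passing the same Hashtable each time, leaves one Word entry per distinct word with separate subject and body counts, which is the information a later ranking step needs.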
3. SearchBarrel
After the specific parts of a Post are indexed, a call is made to the famous SearchBarrel to save the indexed text in the database:
InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);
The totalBodyWords parameter is the total number of words in the Post body, which is counted with:
totalBodyWords = CleanSearchTerms(post.Body).Length;
The InsertIntoSearchBarrel method is declared as
public void InsertIntoSearchBarrel(Hashtable words, Post post, int settingsID, int totalBodyWords)
and inserts all the indexed words of a post into the Community Server cs_SearchBarrel table.
InsertIntoSearchBarrel iterates through all the keys in the Hashtable object words, retrieves the word or indexed text for each key and calculates the weight of the word using the CalculateWordRank method,
foreach (int key in words.Keys)
{
    Word w = words[key] as Word;
    w.Weight = CalculateWordRank(w, post, totalBodyWords);
}
and then saves the Hashtable of indexed words in the database:
CommonDataProvider dp = CommonDataProvider.Instance();
dp.InsertIntoSearchBarrel(words, post, settingsID);
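The article does not give the formula behind CalculateWordRank. As a rough illustration of why totalBodyWords is passed in, here is one possible weighting, assuming a word scores a fixed boost for each appearance in the subject plus its frequency in the body relative to the body length. The RankedWord class, its fields and the boost value are hypothetical and are not taken from Community Server.

```csharp
using System;

// Hypothetical stand-in for the article's Word class; the field names are assumptions.
public class RankedWord
{
    public int SubjectCount;
    public int BodyCount;
    public double Weight;
}

public static class WordRankSketch
{
    // Assumed boost: each subject hit adds a fixed amount to the weight.
    const double SubjectBoost = 5.0;

    public static double CalculateWordRank(RankedWord word, int totalBodyWords)
    {
        // Body frequency is divided by the body length so that a long post
        // is not ranked higher merely for containing more words.
        double bodyScore = totalBodyWords > 0
            ? (double)word.BodyCount / totalBodyWords
            : 0.0;
        return word.SubjectCount * SubjectBoost + bodyScore;
    }
}
```

Under these assumptions, a word that appears once in the subject and twice in a four-word body gets a weight of 5.5.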
Finally, the InsertIntoSearchBarrel method of the SqlCommonDataProvider class is executed to save the indexed text in the database using the stored procedure cs_Search_Add.
Since the SqlCommonDataProvider.InsertIntoSearchBarrel method receives a Hashtable containing the indexed words, the entire Hashtable is saved by calling the cs_Search_Add stored procedure repeatedly, passing in one indexed word each time:
foreach (int wordHash in words.Keys)
{
    // Get the Word instance to process
    Word word = (Word) words[wordHash];
    // Run cs_Search_Add for this word (parameter assignment not shown)
    myCommand.ExecuteNonQuery();
}
4. Text Searching
In the first half of this article I explained how Community Server’s ASP.NET Search Engine first indexes text or the content of a Web page and saves the indexed text to a database. In this part of the article, I will explain how the indexed text is searched for words that match all or some of the words in a user's query.
After the text contained in a Weblog, Forum, Gallery or Web page is indexed, it is ready to be searched for matches to a user's query. Text is searched when the GetSearchResults method of the Search class is called with the user’s search query. The ForumsSearch, GallerySearch and WeblogSearch classes derive from the Search class and are used to search text in Forums, Picture Galleries and Weblogs respectively.
The GetSearchResults method is overridden by the classes deriving from the Search class, including the ForumsSearch and WeblogSearch classes, to execute a local search for possible matches to a user’s search query. In the ForumsSearch class, for example, an instance of the ForumDataProvider is created and its GetSearchResults method is executed:
protected override SearchResultSet GetSearchResults(SearchQuery query, SearchTerms terms)
{
    ForumDataProvider fdp = ForumDataProvider.Instance();
    SearchResultSet results = fdp.GetSearchResults(query, terms);
    return results;
}
The ForumsSearch GetSearchResults method delegates the actual work of searching the database for possible matches to the ForumsSqlDataProvider.GetSearchResults method, which creates a dynamic SQL statement in
string searchSQL = SqlGenerator.SearchText(query, terms, GetSettingsID(), ApplicationType.Forum);
and searches the indexed posts for matches to the search terms with a call to the stored procedure cs_forums_Search.
Note that the dynamic SQL statement, which includes the search terms, must fit in a variable-length Unicode string of fewer than 4000 characters. The dynamic SQL statement searches against the Search Barrel table, cs_SearchBarrel.
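To make the length limit concrete, the sketch below builds a search statement from a list of word hashes and refuses to return one that reaches 4000 characters. It is not the SQL that SqlGenerator.SearchText produces, which the article does not show. The column names PostID, Weight, SettingsID and WordHash, and the shape of the query, are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SearchSqlSketch
{
    // Limit described in the article: the statement must stay below 4000 characters.
    const int MaxSqlLength = 4000;

    public static string BuildSearchSql(IList<int> wordHashes, int settingsID)
    {
        if (wordHashes == null || wordHashes.Count == 0)
            throw new ArgumentException("At least one search term is required.", "wordHashes");

        // Only integers are concatenated, so no user-typed text reaches the SQL string.
        // Table name is from the article; the column names are hypothetical.
        string sql = "SELECT PostID, SUM(Weight) AS TotalWeight FROM cs_SearchBarrel"
            + " WHERE SettingsID = " + settingsID
            + " AND WordHash IN (" + string.Join(", ", wordHashes) + ")"
            + " GROUP BY PostID ORDER BY TotalWeight DESC";

        if (sql.Length >= MaxSqlLength)
            throw new InvalidOperationException(
                "The search statement is " + sql.Length + " characters long; the limit is "
                + MaxSqlLength + ". Use fewer search terms.");

        return sql;
    }
}
```

Checking the length before the statement is sent means an overly long query fails with a clear message instead of being silently truncated into invalid SQL.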
While the GetSearchResults() method is responsible for matching a set of words otherwise known as search terms against the cs_Posts and cs_SearchBarrel tables, the GetSearchTerms() method is responsible for creating a set of words or search terms which can be matched in the cs_Posts and cs_SearchBarrel tables.
The GetSearchTerms method of the Search class receives a user's search as a SearchQuery object, strips HTML tags from the query, tokenizes the clean search terms, breaks up the tokenized terms into "And" and "Or" search terms and returns the final set of terms to the GetSearchResults method of the Search class. In other words, it converts the words typed by a user into a form suitable for matching against the cs_SearchBarrel and cs_Posts tables.
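The three steps just described (strip HTML, tokenize, split into "And" and "Or" terms) can be sketched as follows. This is an illustration under stated assumptions rather than the Community Server implementation: the SearchTermsSketch class is hypothetical, and so is the convention that a word preceded by the keyword "or" is optional while every other word is required.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the article's SearchTerms class.
public class SearchTermsSketch
{
    public List<string> And = new List<string>(); // every one of these must match
    public List<string> Or = new List<string>();  // any of these may match

    public static SearchTermsSketch Parse(string rawQuery)
    {
        SearchTermsSketch terms = new SearchTermsSketch();

        // 1. Strip anything that looks like an HTML tag.
        string clean = Regex.Replace(rawQuery ?? string.Empty, "<[^>]*>", " ");

        // 2. Tokenize on whitespace (a null separator array means "any whitespace").
        string[] tokens = clean.ToLowerInvariant()
                               .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        // 3. Assumed convention: a word preceded by "or" is optional, all others are required.
        bool nextIsOr = false;
        foreach (string token in tokens)
        {
            if (token == "or") { nextIsOr = true; continue; }
            if (token == "and") { nextIsOr = false; continue; }
            (nextIsOr ? terms.Or : terms.And).Add(token);
            nextIsOr = false;
        }
        return terms;
    }
}
```

With this convention, the query "<b>search</b> engine OR indexing" yields the required terms "search" and "engine" and the optional term "indexing".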
5. Summary
This article presents a detailed explanation of how text is indexed and searched for matches on some or all of the words used in a Search Engine's search query. It demonstrates how ASP.NET and SQL Server are used to create a scalable real world Search Engine that is used to power searches on a number of popular high traffic websites. It also provides a detailed explanation of how text from Weblogs, Forums and Photo Galleries stored on a database is indexed and searched for matches to search terms and queries created by online users.
An introduction to Text Indexing and Text Searching is available in the previous article at http://www.kdkeys.net/forums/4548/ShowPost.aspx
6. About The Author
Kingsley Tagbo is a freelance writer and consultant.
You can reach him via his blogs at http://www.kdkeys.net/blogs/kingsleytagbo or http://www.kdkeys.net/blogs/kingsley.tagbo