Data Mining

Announcing The Data Mining Source Code Newsletter!

Subscribe By Email | Subscribe By RSS Feed

Real World ASP.NET Search Engine For Text Mining, Text Indexing and Text Searching - Part 2

Last post 04-29-2005, 11:02 by Sedgewick. 1 replies.
  •  04-17-2005, 20:24 4569

    Real World ASP.NET Search Engine For Text Mining, Text Indexing and Text Searching - Part 2

    1. Introduction

    The first article on 'Real World ASP.NET Search Engines' at http://www.kdkeys.net/forums/4548/ShowPost.aspx introduces text indexing and text searching in CommunityServer's ASP.NET Search Engine. It presents some of the reasons why local Search Engines are used in Websites and why you may need to use a custom Search Engine for your own Website.

    This article presents a detailed explanation of how text indexing and text searching work in CommunityServer's ASP.NET Search Engine. If you haven't read the first article, please take a moment to read it at http://www.kdkeys.net/forums/4548/ShowPost.aspx .

    2. Text Indexing

    To index the text of a Weblog, the IndexPosts method of the WeblogSearch class gets an instance of the WeblogDataProvider class and calls its SearchReindexPosts method to get the set of Weblog posts that will be indexed:

    WeblogDataProvider wdp = WeblogDataProvider.Instance();
    PostSet postSet = wdp.SearchReindexPosts(setSize, settingsID);

    The PostSet object is a collection of Weblog post objects. By iterating through the posts in the PostSet collection, each post can be indexed for searching. The PostSet collection could equally be a collection of Forum posts, or of Web pages where each post in the collection is a single Web page. If a collection of Forum posts needs to be indexed, for example, an instance of the ForumDataProvider class is obtained and its SearchReindexPosts method executed to return a collection of posts:

    ForumDataProvider fdp = ForumDataProvider.Instance();
    PostSet postSet = fdp.SearchReindexPosts(setSize, settingsID);

    An indexable Post object can consist of several kinds of text. For example, if a Post object represents a Web page, it can be defined as consisting of a Title, Keywords meta tag, Description meta tag, Author meta tag, Subject and Body. Likewise, a Forum post consists of a Username (the post author), a Subject and a Body. To index text, the Search.Index method is called from the ForumsSearch, GallerySearch or WeblogSearch class with the indexable Post property as a parameter.

    ForumsSearch.IndexPosts:

    Hashtable words = new Hashtable();

    // Index the post author
    words = Index(post.Username, words, WordLocation.Author, settingsID);

    // Index the post subject
    words = Index(post.Subject, words, WordLocation.Subject, settingsID);

    // Index the post body
    words = Index(post.Body, words, WordLocation.Body, settingsID);

    If the Post object represented a Web page and this Search Engine needed to consider its meta tags, the meta tags could be indexed with calls similar to:

    // Index the Keywords meta tag
    words = Index(post.Keywords, words, WordLocation.Keywords, settingsID);

    // Index the Description meta tag
    words = Index(post.Description, words, WordLocation.Description, settingsID);
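    The article does not show the body of the Index method, so the following is only a minimal sketch of what such a method might do, not Community Server's implementation. It assumes that the index is keyed by a hash of each word and that each entry records how often the word occurs and in which parts of the post it was found. The Word class, the IndexSketch class, the StableHash helper and the cut-down WordLocation enum are hypothetical stand-ins, and a generic Dictionary is used in place of the Hashtable.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration; these are NOT the Community Server types.
[Flags]
public enum WordLocation { None = 0, Author = 1, Subject = 2, Body = 4 }

public class Word
{
    public string Text;
    public int Occurrences;         // how many times the word was seen
    public WordLocation Locations;  // every part of the post the word was seen in
}

public static class IndexSketch
{
    // Splits text into lowercase words and merges them into the running index.
    public static Dictionary<int, Word> Index(
        string text, Dictionary<int, Word> words, WordLocation location)
    {
        if (string.IsNullOrEmpty(text)) return words;

        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        string[] tokens = text.ToLowerInvariant()
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            int key = StableHash(token);
            if (!words.TryGetValue(key, out Word word))
            {
                word = new Word { Text = token };
                words[key] = word;
            }
            word.Occurrences++;
            word.Locations |= location;
        }
        return words;
    }

    // Deterministic FNV-1a-style hash over the characters of the word.
    // Hash collisions are ignored in this sketch.
    public static int StableHash(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s) hash = (hash ^ c) * 16777619;
            return (int)hash;
        }
    }
}
```

    Calling IndexSketch.Index once per post property with the same dictionary, as the calls above do with the Hashtable, accumulates the words from the author, subject and body into one index.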

    GallerySearch.IndexPosts:

    Picture picture = post as Picture;
    if (picture != null) {
        // Index the picture subject
        words = Index(picture.Subject, words, WordLocation.Subject, settingsID);

        // Index the picture's filename as part of the subject, unless it equals the subject
        if (picture.Subject != picture.AttachmentFilename)
            words = Index(picture.AttachmentFilename, words, WordLocation.Subject, settingsID);

        // Index the picture body
        words = Index(picture.Body, words, WordLocation.Body, settingsID);

        // Get a count of the total words in the body
        totalBodyWords = CleanSearchTerms(post.Body).Length;

        // Save the indexed picture to the database
        InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);
    }

    GalleryComment comment = post as GalleryComment;
    if (comment != null) {
        // Index the comment body
        words = Index(comment.Body, words, WordLocation.Body, settingsID);

        // Get a count of the total words
        totalBodyWords = CleanSearchTerms(post.Body).Length;

        // Save the indexed comment to the database
        InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);
    }

    3. SearchBarrel

    After the individual parts of a Post are indexed, InsertIntoSearchBarrel is called to save the indexed text in the database:

    InsertIntoSearchBarrel(words, post, settingsID, totalBodyWords);

    The totalBodyWords argument is the total number of words in the Post body, computed with:

    totalBodyWords = CleanSearchTerms(post.Body).Length;

    The method has the signature

    public void InsertIntoSearchBarrel(Hashtable words, Post post, int settingsID, int totalBodyWords)

    and inserts all the indexed words of a post into the Community Server cs_SearchBarrel table.

    InsertIntoSearchBarrel iterates through all the keys of the words Hashtable, retrieves the Word stored under each key and calculates the weight of that word using the CalculateWordRank method,

    foreach (int key in words.Keys) {
        Word w = words[key] as Word;
        w.Weight = CalculateWordRank(w, post, totalBodyWords);
    }

    and then saves the Hashtable of indexed words in the database:

    CommonDataProvider dp = CommonDataProvider.Instance();
    dp.InsertIntoSearchBarrel(words, post, settingsID);
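    The article does not give the formula behind CalculateWordRank. As a rough illustration of how a word weight could be derived from the word and the totalBodyWords count, the sketch below assumes the weight is the word's share of the body plus fixed bonuses when the word also appears in the subject or author. The formula, the bonus values and the Word, WordLocation and RankSketch types are all assumptions made for this example, not Community Server's ranking.

```csharp
using System;

// Hypothetical stand-ins for illustration; these are NOT the Community Server types.
[Flags]
public enum WordLocation { None = 0, Author = 1, Subject = 2, Body = 4 }

public class Word
{
    public int BodyOccurrences;     // times the word occurs in the post body
    public WordLocation Locations;  // parts of the post the word was found in
    public double Weight;
}

public static class RankSketch
{
    // Assumed bonus values; the article does not state the real ones.
    const double SubjectBonus = 2.0;
    const double AuthorBonus = 1.0;

    public static double CalculateWordRank(Word word, int totalBodyWords)
    {
        // Share of the body taken up by this word; an empty body contributes nothing.
        double density = totalBodyWords > 0
            ? (double)word.BodyOccurrences / totalBodyWords
            : 0.0;

        double weight = density;
        if ((word.Locations & WordLocation.Subject) != 0) weight += SubjectBonus;
        if ((word.Locations & WordLocation.Author) != 0) weight += AuthorBonus;
        return weight;
    }
}
```

    Dividing by totalBodyWords keeps a long post from outranking a short one merely because it repeats a word more often, which is one plausible reason for passing the body word count into the ranking step.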

    Finally, the InsertIntoSearchBarrel method of the SqlCommonDataProvider class is executed to save the indexed text in the database using the stored procedure cs_Search_Add.

    Since the SqlCommonDataProvider.InsertIntoSearchBarrel method receives a Hashtable of indexed words, the entire Hashtable is saved by calling the cs_Search_Add stored procedure repeatedly, passing in one indexed word each time:

    foreach (int wordHash in words.Keys) {
        // Get the Word instance to process
        Word word = (Word) words[wordHash];

        // (setting the cs_Search_Add parameters from word is not shown here)
        myCommand.ExecuteNonQuery();
    }

    4. Text Searching

    In the first half of this article I explained how Community Server's ASP.NET Search Engine indexes text, such as the content of a Web page, and saves the indexed text to a database. In this part of the article, I will explain how the indexed text is searched for words that match all or some of the words in a user's query.

    After the text contained in a Weblog, Forum, Gallery or Web page is indexed, it is ready to be searched for matches to a user's query. The search runs when the GetSearchResults method of the Search class is called with the user's search query. The ForumsSearch, GallerySearch and WeblogSearch classes derive from the Search class and search text in Forums, Picture Galleries and Weblogs respectively.

    Classes deriving from the Search class, including ForumsSearch and WeblogSearch, override the GetSearchResults method to execute a local search for possible matches to a user's search query. In the ForumsSearch class, for example, an instance of the ForumDataProvider is obtained and its GetSearchResults method executed:

    protected override SearchResultSet GetSearchResults(SearchQuery query, SearchTerms terms) {
        ForumDataProvider fdp = ForumDataProvider.Instance();
        SearchResultSet results = fdp.GetSearchResults(query, terms);
        return results;
    }

    The ForumsSearch GetSearchResults method delegates the actual work of searching the database for possible matches to the ForumsSqlDataProvider.GetSearchResults method, which creates a dynamic SQL statement with

    string searchSQL = SqlGenerator.SearchText(query,terms,GetSettingsID(),ApplicationType.Forum);

    and searches the indexed posts for matches to the search terms with a call to the stored procedure cs_forums_Search.

    Note that the dynamic SQL statement, which includes the search terms, must fit in a variable-length Unicode string (nvarchar) of fewer than 4000 characters, so a query with very many terms can exceed the limit. The dynamic SQL statement searches against the Search Barrel table cs_SearchBarrel.
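    To make the length limit concrete, here is a small sketch of building a search statement from word hashes and rejecting it when it reaches 4000 characters. It is not the output of SqlGenerator.SearchText, which the article does not show. The SearchSqlSketch class, the BuildSearchSql method and the PostID and WordHash column names are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class SearchSqlSketch
{
    // Limit taken from the article: the statement must stay under 4000 characters.
    public const int MaxSqlLength = 4000;

    // Builds a hypothetical lookup against cs_SearchBarrel for the given word hashes.
    // PostID and WordHash are assumed column names, not the real schema.
    public static string BuildSearchSql(IEnumerable<int> wordHashes)
    {
        string list = string.Join(", ", wordHashes);
        if (list.Length == 0)
            throw new ArgumentException("At least one search term is required.");

        string sql = "SELECT PostID FROM cs_SearchBarrel WHERE WordHash IN (" + list + ")";
        if (sql.Length >= MaxSqlLength)
            throw new ArgumentException("Too many search terms for one statement.");

        return sql;
    }
}
```

    Because only integer hashes are concatenated into the statement, no user-supplied text reaches the SQL string in this sketch. A real implementation would still need to decide whether to truncate or reject an over-long query.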

    While the GetSearchResults() method is responsible for matching a set of words, otherwise known as search terms, against the cs_Posts and cs_SearchBarrel tables, the GetSearchTerms() method is responsible for creating the set of search terms to be matched.

    The GetSearchTerms method of the Search class receives a user's search as a SearchQuery object, strips HTML tags from the query, tokenizes the cleaned text, separates the tokens into "And" and "Or" search terms and returns the result to the GetSearchResults method of the Search class. In other words, it converts the words typed by a user into a form suitable for matching against the cs_SearchBarrel and cs_Posts tables.
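    The three steps described above (strip HTML, tokenize, separate "And" and "Or" terms) can be sketched as follows. This is only an illustration under stated assumptions, not the Community Server code. The SearchTermsSketch class and its Parse method are hypothetical, and the rule that a word preceded by the keyword OR becomes an "Or" term while every other word is an "And" term is an assumption, since the article does not say how the two kinds of terms are told apart.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for illustration; NOT the Community Server SearchTerms class.
public class SearchTermsSketch
{
    public List<string> And = new List<string>();
    public List<string> Or = new List<string>();

    public static SearchTermsSketch Parse(string rawQuery)
    {
        var terms = new SearchTermsSketch();

        // 1. Strip anything that looks like an HTML tag.
        string clean = Regex.Replace(rawQuery ?? string.Empty, "<[^>]*>", " ");

        // 2. Tokenize on whitespace.
        char[] whitespace = { ' ', '\t', '\r', '\n' };
        string[] tokens = clean.Split(whitespace, StringSplitOptions.RemoveEmptyEntries);

        // 3. Assumed rule: a word that follows the keyword OR is optional,
        //    every other word is required.
        bool nextIsOr = false;
        foreach (string token in tokens)
        {
            if (token.Equals("OR", StringComparison.OrdinalIgnoreCase)) { nextIsOr = true; continue; }
            if (token.Equals("AND", StringComparison.OrdinalIgnoreCase)) { nextIsOr = false; continue; }

            (nextIsOr ? terms.Or : terms.And).Add(token.ToLowerInvariant());
            nextIsOr = false;
        }
        return terms;
    }
}
```

    For a query such as "<b>asp.net</b> search OR indexing", this sketch treats asp.net and search as required terms and indexing as an optional one.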

    5. Summary

    This article presents a detailed explanation of how text is indexed and then searched for matches on some or all of the words in a search query. It demonstrates how ASP.NET and SQL Server are used to create a scalable, real-world Search Engine that powers searches on a number of popular high-traffic websites. It also explains how text from Weblogs, Forums and Photo Galleries stored in a database is indexed and searched for matches to the search terms and queries created by online users.

    An introduction to Text Indexing and Text Searching is available in the previous article at http://www.kdkeys.net/forums/4548/ShowPost.aspx

    6. About The Author

    Kingsley Tagbo is a freelance writer and consultant.

    You can reach him via his blogs at http://www.kdkeys.net/blogs/kingsleytagbo or http://www.kdkeys.net/blogs/kingsley.tagbo




  •  04-29-2005, 11:02 4628 in reply to 4569

    Re: Real World ASP.NET Search Engine For Text Mining, Text Indexing and Text Searching - Part 2

    Incredible article - very well executed!

    Isn't it amazing that the concept of the word Barrel works so well, and performs so well?
