Microsoft Presents FAST forward 09: Engage Your User
18 December 08 03:31 PM | Microsoft Enterprise Search Blog

The Mirage, Las Vegas, Feb 9-11

Since its inaugural conference in 2006, FASTforward has been a venue for thought leadership and innovation in the field of search. This year, FASTforward’09 is the industry’s largest business and technology conference dedicated to search-driven innovation. Join the discussion! At FASTforward’09, we explore how businesses are responding – and evolving – in the face of rapid technological change and the growing demands for user control. As The User Revolution continues, we examine search’s critical role in helping companies engage their users. This year’s conference will also highlight Microsoft’s vision for enterprise search technology.

New this year, a SharePoint technology track covering Enterprise Search, Social Computing, Enterprise Content Management and more!  Other tracks include:

  • Monetization via Search (customer-facing)
  • Productivity via Search (internal enterprise)
  • FAST technology
  • Partner Solutions

Top Ten Reasons Why You Should Attend FASTforward’09:

1. Uncover new opportunities for using search

2. Hear what others have done with search technology

3. Learn industry best practices for search

4. Hear the Microsoft vision for search and FAST

5. Learn how SharePoint and FAST products are positioned

6. Gain insight on integration plans for SharePoint and FAST products

7. Understand how partners can help

8. Obtain access to Microsoft and FAST executives and industry luminaries

9. Network with colleagues

10. Attend convenient pre-conference technical training

Come spend three days with us at the Mirage in Las Vegas learning from industry thought leaders, customers, partners, and our own Microsoft experts!

Learn more at FASTforward ‘09. Register before January 9 and receive $400 off of the full registration fee. See you there!

Microsoft positioned in the Leaders Quadrant of the 2008 Information Access Magic Quadrant
30 October 08 05:33 PM | Microsoft Enterprise Search Blog

We’ve got great news to share! Last month, Gartner published the 2008 Magic Quadrant for Information Access Technology, and Microsoft was positioned in the Leaders Quadrant. Since the completion of the acquisition, we’ve worked incredibly hard to communicate and demonstrate a combined vision and strategy to our customers and partners. It’s good to know we’re heading in the right direction!

When I talk with customers about search, it’s clear that organizations have very different needs. In fact, many people tell me that even within an organization the one-size-fits-all approach just doesn’t work. So over the last year, we’ve announced some bold moves designed to create a compelling portfolio of search applications. With the addition of Search Server Express and the acquisition of FAST, we now have a product line-up designed to meet a broad range of business needs:

  • Some departments or small organizations need search that is quick and easy to set up; we offer Microsoft Search Server Express as a free download so that you can get it up and running in about 30 minutes. We’re excited to see customers like St. Jude Medical and Urbis having quick successes with Express. We’re also seeing partners, such as StartReady, build solutions around Search Server Express to create a search appliance.
  • Many organizations need search as an integral part of a business productivity infrastructure; Search in Microsoft Office SharePoint Server is integrated with other key SharePoint productivity workloads such as portals, collaboration, ECM, business processes and BI. Customers like McCann Worldgroup and Jones Lang LaSalle are all deriving productivity increases with better search in SharePoint. In particular, both companies are promoting collaboration and leveraging in-house experts with people search enhanced by user profiles in MySites.
  • Some organizations face business problems that demand high-end search; FAST ESP offers best-in-class search with extreme scalability, query performance, and other advanced capabilities for sophisticated customer-facing or inside-the-firewall applications. For example, Aerotek and TEKsystems, two of the world’s largest staffing companies, deliver job searching to more than 1.3 million users. In more than 164 million queries, greater than 99.5% of query results came back in less than 2 seconds. For inside-the-firewall productivity, they index more than 10 million complex candidate records with low latency during high volume index updates. We’re also excited to see Pfizer pushing the envelope with an Enterprise Collaboration Framework driven by FAST ESP on top of SharePoint.

While our “Leaders Quadrant” position in the Magic Quadrant is an important milestone, we still think of this as the very beginning of our journey. We’re continuing to combine our deep technical expertise with our broad reach to deliver exciting innovations to the market – so you can and should expect great things to come. Stay tuned!

Kirk Koenigsbauer
General Manager,
SharePoint Business Group

Magic Quadrant for Information Access Technology (Gartner Research, Sept. 30, 2008) Microsoft is positioned in the Leaders Quadrant of Gartner, Inc.'s 2008 Magic Quadrant for Information Access Technology. This report assesses vendors with capabilities that go beyond enterprise search to encompass a range of technologies. Their capabilities include search; federated search, content classification, categorization and clustering; fact and entity extraction; taxonomy creation and management; information presentation (for example, visualization) to support analysis and understanding; and desktop search to address user-controlled repositories in order to locate and "invoke" documents, data, e-mail and intelligence.

The Magic Quadrant is copyrighted 2008 by Gartner, Inc. and is reused with permission. The Magic Quadrant is a graphical representation of a marketplace at and for a specific time period. It depicts Gartner's analysis of how certain vendors measure against criteria for that marketplace, as defined by Gartner. Gartner does not endorse any vendor, product or service depicted in the Magic Quadrant, and does not advise technology users to select only those vendors placed in the "Leaders" quadrant. The Magic Quadrant is intended solely as a research tool, and is not meant to be a specific guide to action. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Taking People Search on the Road….
16 October 08 01:33 PM | Microsoft Enterprise Search Blog

In another great blog post, Matt McDermott walks you through the steps of enabling SharePoint’s people search capability on a mobile device, with the end result looking something like this:

Search Results

The post is here:

 http://blogs.catapultsystems.com/matthew/archive/2008/09/27/mobile-people-search.aspx

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

Partner Post: One Stop Search from the Microsoft Office Research Task Pane
26 September 08 04:04 PM | Microsoft Enterprise Search Blog

Since the release of Microsoft Office 2003, Microsoft desktop applications such as MS Word, PowerPoint, Excel, Outlook and Internet Explorer have contained an internal federated or meta-search capability known as the ‘Research Pane’. To see this in action in Office 2003 (see link for instructions for Office 2007), select (i.e. highlight) a word or phrase within MS Word or MS Outlook and, on PCs, right-click the highlighted word, pull down to the “Look Up” option and click. Another way to do this is to hold down the ‘Alt’ key while left-clicking on a highlighted word (on Macs use a command-click). The Research Pane should then open up in the application window and execute a search on the highlighted section. Out of the box, MS Office ships with several research sources such as the Microsoft Encarta Dictionary, Microsoft Live Search, MSN Money and some third party offerings from Factiva and Thomson Gale among others. Here is a screenshot of content returned from three enterprise search engines as well as from some public biomedical websites.

clip_image002

The list of sources that can be searched from the Research Pane is expandable by adding connections to Research Pane service providers. Armed with a URL to a Research Pane “registration service”, a user can install the source into their MS applications using the “Research options…” link. This potentially gives users access to a large set of data sources to choose from. Once a source is installed, the user can select the source from a dropdown list (which causes the search to be executed) or can select a set of sources based on certain pre-defined categories.

Raritan Technologies specializes in Federated Search solutions and has created an array of search connectors to a number of web sites, web services, search engines, databases and directory services (to name a few) using our Search Integration Framework Toolkit (SIFT) and Federation Manager. We and our partner in this effort, New Idea Engineering, have also provided a number of ways to deploy these federated search connectors to web applications and within web services such as SOAP and Open Search. We have recently added to this list by providing an MS Research Pane service ‘front-end’ to our federated connectors. This enables connections to search engines such as Autonomy IDOL, K2 or Ultraseek, Dieselpoint, Endeca, Exalead, FAST, Lucene, Mark Logic (and others) as well as SharePoint (out of the box), SQL databases, LDAP directories, SOAP and OpenSearch web services, Z39.50 sources, Internet web sites that have search boxes (a very large list that includes general web search engines and specialized sites such as news or research sites), Content Management Systems such as Alfresco, Documentum and eRoom, and Archival Systems like Symantec Enterprise Vault to be ‘plugged-in’ to any MS Office application. The modular design of the Raritan Search Integration Framework enables other connectors to be added to this list and, as this happens, these new sources will automatically be available to users of the Research Pane once configured as a service.

The ability to combine internal content sources from content management systems, enterprise search engines, databases and directory services with external content from subscription or public web sites and web services into MS Office applications provides a huge potential for search integration at the “tip of the sword” where thought and knowledge are combined to create new content.

For more information on the Raritan Technologies “Research Pane Integration” or to arrange for a trial connector please visit http://www.raritantechnologies.com/ResearchPane.shtml.

Barry Freindlich
President
Raritan Technologies, Inc.

How to: Customize the Thesaurus in SharePoint Search and Search Server
23 September 08 05:49 PM | Microsoft Enterprise Search Blog

The thesaurus is an xml file that provides users with a means of automatically expanding or rewriting their queries to include synonyms, acronyms, etc. For example, in a chemical company, product ID 1234, oxygen, O2 and LOX could all refer to the same item.

A SharePoint Search administrator can modify the thesaurus file to substitute all these words at search query time. This document explains how to set up a thesaurus and where to find the relevant files.

Supported Thesaurus Syntax:
To use the sample files provided by the product, you need to remove the comment beginning (<!--) and ending lines (-->) from the xml file.

Explanation of terms:

Term                    Meaning
thesaurus               Marks the beginning (and end) of the thesaurus.
diacritics_sensitive    Diacritics are marks, such as accents, that are added to letters and change their pronunciation. For example, the acute accent over an e gives you: é.
                        0 – ignore diacritics
                        1 – respect diacritics
expansion               A list of alternative forms, each marked by the <sub> keyword.
sub                     One of several alternatives in an expansion.
replacement             Several patterns will be replaced with a substitution.
pat                     A pattern to be replaced.
sub                     Item to be substituted.

Example:

<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>

The example means:

  • We have elected to ignore accents, etc. in the thesaurus.
  • Queries containing “IE”, or any other one of the <sub> terms in the expansion, will be expanded to also search for the other terms in that group (here “Internet Explorer” and “IE5”).
  • If a query contains the terms “NT5” or “W2K”, they will be replaced by “Windows 2000”.
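
To make the expansion and replacement semantics concrete, here is a minimal Python sketch that loads a thesaurus like the one in the example and applies it to query terms. It is only an illustration of the rules described above, not how the Search service is implemented: the function names are hypothetical, the file is assumed to be closed with a </XML> tag, and matching is a plain case-insensitive comparison rather than the product's own word breaking.

```python
import re
import xml.etree.ElementTree as ET

# Same content as the example thesaurus (closing </XML> tag assumed).
THESAURUS_XML = """<XML ID="Microsoft Search Thesaurus">
  <thesaurus xmlns="x-schema:tsSchema.xml">
    <diacritics_sensitive>0</diacritics_sensitive>
    <expansion>
      <sub>Internet Explorer</sub>
      <sub>IE</sub>
      <sub>IE5</sub>
    </expansion>
    <replacement>
      <pat>NT5</pat>
      <pat>W2K</pat>
      <sub>Windows 2000</sub>
    </replacement>
  </thesaurus>
</XML>"""

# Elements below <thesaurus> live in the namespace declared by its xmlns.
NS = "{x-schema:tsSchema.xml}"


def load_thesaurus(xml_text):
    """Return (expansions, replacements) parsed from the thesaurus XML."""
    root = ET.fromstring(xml_text)
    thesaurus = root.find(NS + "thesaurus")
    expansions = [
        [sub.text for sub in exp.findall(NS + "sub")]
        for exp in thesaurus.findall(NS + "expansion")
    ]
    replacements = [
        ([pat.text for pat in rep.findall(NS + "pat")], rep.find(NS + "sub").text)
        for rep in thesaurus.findall(NS + "replacement")
    ]
    return expansions, replacements


def apply_replacements(query, replacements):
    """Swap every whole-word <pat> match for the <sub> of its replacement."""
    for patterns, substitution in replacements:
        for pattern in patterns:
            query = re.sub(
                r"\b" + re.escape(pattern) + r"\b",
                lambda match: substitution,
                query,
                flags=re.IGNORECASE,
            )
    return query


def expand_term(term, expansions):
    """Return all alternatives for a term, or the term alone if none apply."""
    for group in expansions:
        if term.lower() in (alt.lower() for alt in group):
            return list(group)
    return [term]


EXPANSIONS, REPLACEMENTS = load_thesaurus(THESAURUS_XML)
```

With this sketch, expand_term("ie", EXPANSIONS) yields all three browser terms, while apply_replacements rewrites "W2K" or "NT5" to "Windows 2000" before the query runs.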

How to Customize the Thesaurus:

  1. Find the appropriate thesaurus file in the config folder contained in the registry key: [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"
  2. Update the thesaurus file(s) for each appropriate language with each desired <expansion> or <replacement>.
  3. Replace the file(s) on each index, query and web frontend server for each search application path:
    %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config
    Note: index propagation does not sync these files across the servers in the farm.
  4. Stop and restart the search service (this is needed to load the new thesaurus files). For example, in a console window, run “net stop osearch & net start osearch” without quotes, or launch Programs\Administrative Tools\Services, right-click Office SharePoint Search Service, and choose Restart.

Notes:

See “Finding Important Files” below for a summary of where to find the key files to manage your thesaurus.

  1. (optional) If you want to have the same thesaurus files apply to all newly created SSPs, put your thesaurus files under the main config folder
    (e.g., %programfiles%\Microsoft Office Servers\12.0\Data\config).
  2. If there is a syntax error in the thesaurus file, all expansions and replacements will be ignored.
  3. If a word in the thesaurus file matches a stop word in the stop word file, it will be ignored.   To avoid this, remove it from the appropriate stop word file.
  4. Thesaurus terms are broken into words at query time.  Add words you do not want to be broken into the custom dictionary file customLANG.lex (see Finding Important Files for more details).
  5. Search first applies the thesaurus, and then expands words into their alternate forms, when “stemming” functionality is turned on.   Care should be taken to avoid expanding into too many unnecessary forms as this may harm search performance and accuracy.
  6. The “All words” option on the Advanced Search page might no longer work when using multiple term substitution with the thesaurus. This is because an implicit “+” is used between every term.  For example, if we used our example thesaurus above and typed “browser ie” in the “All words” field, it would look for “+browser +ie” – it would no longer allow “Internet Explorer”.
  7. There is a 10,000 term limit per language in the thesaurus.
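
Because a single syntax error silently disables every expansion and replacement, it can be worth checking a thesaurus file before copying it to the servers. The following Python sketch is one hedged way to do that: check_thesaurus is a hypothetical helper, and counting every <sub> and <pat> element as one "term" is an assumption about how the 10,000 term limit is measured.

```python
import xml.etree.ElementTree as ET

TERM_LIMIT = 10000  # per-language limit quoted in the notes above


def check_thesaurus(xml_text):
    """Return (ok, message) for a thesaurus file's contents."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # A malformed file would cause all entries to be ignored.
        return False, "syntax error: %s" % exc

    # Count <sub> and <pat> elements regardless of their XML namespace.
    # Assumption: each such element counts as one term toward the limit.
    terms = [
        el for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] in ("sub", "pat")
    ]
    if not terms:
        # The shipped sample files keep their entries inside a comment.
        return False, "no terms found (is the sample still commented out?)"
    if len(terms) > TERM_LIMIT:
        return False, "%d terms exceeds the limit of %d" % (len(terms), TERM_LIMIT)
    return True, "%d terms" % len(terms)
```

The helper takes the file contents as a string, so it can be run against each language's file before the search service is restarted.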

Finding Important Files:

The following are the most important files used to manage your thesaurus.

There are 50 default stop word files and 48 thesaurus sample files for the languages we support.

The search service install path can be located by examining registry key [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]"DefaultApplicationsPath"

The default location of the thesaurus files (for each index, query and web frontend server) is:
%programfiles%\Microsoft Office Servers\12.0\Data\Office Server
When a search application is created, a copy of the thesaurus file will also be placed under: %programfiles%\Microsoft Office Servers\12.0\Data\Office Server\Applications\[GUID]\Config

Stop word files for each language can be found as noiseLANG.txt, where LANG is the 3 letter acronym for that language. For example, US English is noiseENU.txt, and the language neutral list is noiseNEU.txt.

To find the appropriate acronym for your language(s), please look them up under: http://www.microsoft.com/globaldev/nlsweb/default.mspx.

Ping Lin
Senior Test Lead
Microsoft Corp.
Victor Poznanski
Senior Program Manager
Microsoft Corp.
SharePoint Image Search
19 September 08 06:03 PM | Microsoft Enterprise Search Blog

Matthew McDermott, a SharePoint MVP, has written a great 4 part blog post on how to make SharePoint 2007 search (and Search Server) render image results in a way that looks very similar to http://images.live.com.

Not only does this make searching images much easier, it’s also a very thorough step-by-step tutorial on how to customize results using the built in Web Parts and XSL – it’s well worth a read.

SharePoint Image Search (Part 1)

SharePoint Image Search (Part 2)

SharePoint Image Search (Part 3)

SharePoint Image Search (Part 4)

The end result makes SharePoint Image results look like the screencap below.

isearch

Richard Riley
Senior Technical Product Manager
Microsoft Corp.

SQL File groups and Search
16 September 08 06:17 PM | Microsoft Enterprise Search Blog

This article has been a long time coming, but it is finally here.  In the post below I will cover how to configure the Search database to span multiple filegroups.  First I'll cover a little about the benefits of doing so:

General references on what SQL file groups are:

The method that we have chosen to implement filegroups on the Search database is one of segregation.  We have identified all of the tables and indexes within the database that are solely used for crawling and not used at all to satisfy end-user queries.  The remaining tables and indexes are used for end-user queries.  However, the nature of the Search and indexing problem still dictates that the "query" tables are written to during a crawl.  The crawl only tables and indexes are isolated into their own filegroup.  With the crawl and query centric filegroups identified you can now ensure that the IO intensive process of crawling has a reduced impact on the IO subsystem that is hosting the query filegroup by ensuring that these filegroups are on separate spindles.

The whole goal of using filegroups is to improve the performance of the system.  This is done by providing an additional file.  This file must be placed on a different set of spindles to see any kind of performance enhancement.  If your SQL machine is not IO bound for the Search database then implementing filegroups will not provide you with any benefits. 

To make the migration process easier we did not actually create a query filegroup.  We simply created a new filegroup called "CrawlFileGroup" and moved the crawl tables out of the PRIMARY filegroup, so that PRIMARY effectively becomes the query filegroup.  This migration process can be quite expensive and could take hours to finish.  Keep this in mind when scheduling it on your production servers.  Because the move involves dropping and recreating numerous clustered indexes, you should assume that the DB is offline during the move, as many long-running locks will be taken to recreate the indexes.

Issues and concerns with using filegroups:

Back-up and Restore

One concern that you will need to be aware of in your planning for deploying filegroups on the Search database is that your restore process will be slightly impacted.  Out of the box, Search restore is unaware of the filegroup that will exist within the backup image.  Because of this there is no way to indicate where this file should be restored to.  As a result the restore process is going to try to place the crawl filegroup file onto the same drive that it existed on when you ran the back-up.  Once you enable filegroups you will be committed to making sure that all future machines that you restore your back-up to have a drive with the same drive letter that you initially created the filegroup on.

Future upgrades, Service packs and Hot fixes

Each Hotfix, Service Pack and update that you apply to the server has the potential to modify an index that was moved into the CrawlFileGroup or add a new index to one of the tables moved to the filegroup.  When/if this happens the index will be moved back to, or created in, the PRIMARY filegroup.  Updates will also clean out any non-product sproc.  Because of the risk of index modification, you will need to reinstall the stored proc and run the scripts again after each update you apply.

The risk of a new index being added or modified is quite low at this time.  We have confirmed that this does not occur when upgrading from RTM to SP1, but it does happen when upgrading from SP1 to the Infrastructure Update.  Future updates are less likely to modify the set of indexes.

However, the risk still exists and you will want to re-run the scripts below after each update that you apply to your system.  In the case when you apply an update and the index did not change running the script is a no-op and nothing gets moved.  So it is very cheap to run the script on a system that already has the indexes moved. 

SQL 2005 and greater

The script that is moving the indexes is utilizing new features that were released in SQL 2005.  As such you cannot perform this optimization with SQL 2000. 

Step-by-Step instructions for applying filegroups to your environment

To deploy this you will need to manually create a file group on the Search database.  To do this execute the following steps:

a. Go to the Filegroups section of the Search database properties within SQL Server Management Studio.

b. From the Filegroups section click Add and fill in the name "CrawlFileGroup".  The scripts are written to assume the filegroup has this name; failing to use it will cause the scripts to fail early.

clip_image001[1]

c. Once you have a new filegroup with the name CrawlFileGroup you need to add a file to this group.  To do this select the Files section of the database properties dialog and add a new file to the CrawlFileGroup.  Be sure that you place this file onto a separate drive with isolated spindles.

clip_image002[1]

d. Next you need to install the stored proc that will move the indexes and tables to the new filegroup.  Open the script named MoveTableToFileGroup.sql within Management Studio and execute it, ensuring that you are working with the Search database.  This will create a stored proc named proc_MoveTableToFileGroup.  Confirm that this sproc does indeed exist within the Search database.

e. Open and execute the second script, named MoveCrawlTablesToFileGroup.sql.  This is the script that does all of the work by calling proc_MoveTableToFileGroup for each table that is dedicated to crawling.

That is all there is to it.  You have now moved your crawl tables onto a separate set of spindles.
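
If you would rather script steps a through c than click through Management Studio, the same result can be reached with the standard ALTER DATABASE ... ADD FILEGROUP and ADD FILE ... TO FILEGROUP statements. The Python sketch below only builds those two statements as text; filegroup_setup_sql is a hypothetical helper, and the database name, logical file name and file path in the usage note are placeholders you would replace with your own values.

```python
def filegroup_setup_sql(database, file_path,
                        logical_name="CrawlFileGroupData",
                        filegroup="CrawlFileGroup"):
    """Build the T-SQL that creates the filegroup and adds one file to it.

    The filegroup name defaults to CrawlFileGroup because the move scripts
    assume that name. logical_name is an illustrative default only.
    """
    def bracket(identifier):
        # Quote an identifier, doubling any closing bracket inside it.
        return "[" + identifier.replace("]", "]]") + "]"

    def literal(value):
        # Quote a Unicode string literal, doubling embedded single quotes.
        return "N'" + value.replace("'", "''") + "'"

    db = bracket(database)
    fg = bracket(filegroup)
    return [
        "ALTER DATABASE %s ADD FILEGROUP %s;" % (db, fg),
        "ALTER DATABASE %s ADD FILE (NAME = %s, FILENAME = %s) TO FILEGROUP %s;"
        % (db, literal(logical_name), literal(file_path), fg),
    ]
```

For example, filegroup_setup_sql("SearchDB", r"E:\SearchCrawl\CrawlData.ndf") returns two statements to review and run against the server, where E: stands in for the drive with the isolated spindles.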

Thank you for your time and, as always, I welcome any feedback or questions.

Dan Blood
Senior Test  Engineer
Microsoft Corp

Partner Post: Announcing conceptClassifier for SharePoint – Automatic Classification within Office
02 September 08 08:10 PM | Microsoft Enterprise Search Blog

Enterprise customers are increasingly struggling with how to apply policy and governance at the desktop. End user adoption is cited as the single most critical barrier to success in ECM and Records Management initiatives. Using Concept Searching’s unique compound term processing, conceptClassifier for SharePoint can now be used to automatically classify content from Microsoft Office applications, upload the documents directly to SharePoint, store the metadata in SharePoint properties and write back the classifications to the custom properties of the document for use within knowledge and workflow applications or enterprise applications such as ECM, Document Management, Records Management, or eDiscovery.

The classification can take place automatically without end user intervention. Optionally, Subject Matter Experts can be granted the authority to manually adjust the classification based on the taxonomy. A ribbon bar has been added to the familiar Office interface enabling automatic classification of content. When the end user classifies a document the system will retrieve existing concepts as an aid to the classification process as shown below. Subject Matter Experts also have the ability to add or delete classes in the taxonomy.

clip_image002

Documents are uploaded to SharePoint and the classification metadata is stored in the properties fields. The classification status automatically reflects the manual classification so as to not overwrite the classification classes the Subject Matter Expert entered. The systems administrator features currently enabled include the ability to edit the classifications, classify the document, a batch of documents or the full library. This metadata can now be used by Microsoft Enterprise Search to improve identification of relevant documents when searching.

clip_image004

For more information visit www.conceptsearching.com or click here to view a webcast demo of the integrated technology.

Martin Garland
President
Concept Searching, Inc

SQL Index defrag and maintenance tasks for Search
02 September 08 07:47 PM | Microsoft Enterprise Search Blog

Hi all, this topic is an area that has caused me much pain and work.  My goal was to follow the recommended SQL guidelines while minimizing the impact that these maintenance jobs have on crawling and queries.  We know from the SQL Monitoring and I/O post that Search is extremely I/O intensive.  As it turns out, so is all of the regular maintenance that SQL recommends, so finding the right balance between the two is an interesting scheduling task.

As a starting point much information about SQL maintenance and MOSS is covered in the following paper:

There are some key areas from the above paper that I would like to augment here.

  1. The stored procedure (proc_DefragIndexes) identified in this paper will work, but it is extremely expensive to run on the Search DB as it defrags all of the indexes in the table.
  2. Maintenance plans generated with the Maintenance Plan Wizard in SQL Server 2005 can cause unexpected results (KB 932744.)  While this was fixed in SQL 2005 SP2 these maintenance plans also do more work than is necessary to have a healthy functional system.   
  3. Shrinking  the Search DB  should not be a necessary task that you need to perform.  The process of Shrinking the database does not provide a performance benefit.  SQL best practices for DBCC SHRINKFILE suggest that this operation is most effective after an operation that creates lots of unused space.  Search does not regularly perform these types of operations.  The only time that a SHRINKFILE may make sense is after you have cleaned out your index by removing a Content Source.     
  4. Rebuilding an index can cause latency issues with SQL Mirroring if the SQL I/O subsystem is constrained.  If you are using SQL Mirroring, be sure you are following the SQL best practices and the SharePoint mirroring white paper.  Because Search, SQL Mirroring, and defrag are all very I/O intensive you will want to be extra cautious with your deployment plan for this defrag script and make sure you test the script prior to going into production.

DBCC CHECKDB

DBCC CHECKDB is a command used to check the logical and physical integrity of all the objects in a database.  SQL best practices recommend that you run DBCC CHECKDB periodically.  For a Search deployment we recommend that you run DBCC CHECKDB WITH PHYSICAL_ONLY on a regular basis.  The PHYSICAL_ONLY option reduces the overhead of the command; however, it is still costly, so you should schedule it during off-peak times.  The frequency of execution depends on your business needs, but a good place to start is once a week, just prior to your back-up.  You still need to run the full DBCC CHECKDB (without PHYSICAL_ONLY), but less frequently, again based on business needs: perhaps before every other or every third back-up.

When running these commands make sure that you have a monitoring process in-place.  DBCC only reports errors, it does not fix them unless explicitly specified by other options.  You either want to archive the output of the DBCC command for post processing or make sure you have event log monitoring set-up (for example MOM) to check for DBCC errors.

In very large environments you can run DBCC on an off-line (sandbox) copy of the database.  This will be less intrusive to end-users and the crawl.  In this scenario you would restore your back-up to a separate sandbox and run DBCC CHECKDB in the restored  environment.        
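
As a rough illustration of the schedule suggested above, the Python sketch below picks which DBCC command to run before a given back-up: the cheaper PHYSICAL_ONLY check most weeks, and the full check before every third back-up. The function name, the back-up counter and the default interval are assumptions for illustration; tune the interval to your own business needs.

```python
def checkdb_command(database, backup_number, full_every=3):
    """Return the DBCC statement to run before the given back-up.

    backup_number counts back-ups starting at 1. Every full_every-th
    back-up gets the full check; the others get PHYSICAL_ONLY.
    """
    # Quote the database name, doubling any closing bracket inside it.
    name = "[" + database.replace("]", "]]") + "]"
    if backup_number % full_every == 0:
        return "DBCC CHECKDB (%s);" % name
    return "DBCC CHECKDB (%s) WITH PHYSICAL_ONLY;" % name
```

Whichever command is chosen, its output still needs to be archived or monitored, since DBCC only reports errors.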

Fragmentation and index statistics freshness

We started with the proc_DefragIndexes script mentioned above.  After running it, it became obvious that the script was just too expensive to run on a regular basis.  To reduce the load placed on the I/O system we took a look at all of our indexes in the Search DB and defragged them one by one, measuring query performance along the way.  Doing this we were able to identify the indexes that provided a performance benefit to the system when they were defragmented.  These indexes are listed below:

  • IX_MSSDocProps
  • IX_MSSDocSdids
  • IX_AlertDocHistory
  • IX_MSSDEFINITIONS_DOCID
  • IX_MSSDEFINITIONS_TERM
  • PK_Sdid
  • IX_SDHash
  • IX_DOCID

Optionally there are two additional indexes that you may want to include in your defrag maintenance plan.  These indexes do not see much use in typical out of box situations and are commented out in the script.  But if your environment is built on a custom UI or makes extensive use of the Advanced Search UI you will see improvements in query latencies if you defrag them.

  • IX_int -- defrag this index if you have a lot of queries that use numeric properties in the property store.  The classic case is date range queries.
  • IX_Str -- defrag this index if you have a lot of queries that use string properties in the property store.  There is not a common case for this, but if you have made changes to your managed properties and are driving your search UI off of exact matches for a string-based property you will want to regularly defrag this index.

Once we knew which indexes to defrag we looked at how long it took for an index to reach a 10% fragmentation rate.  From this we adjusted the FILLFACTOR so we could go longer between defrags.  At this point we are seeing somewhere around 2+ weeks between defrags.  Do note that adjusting the FILLFACTOR to leave more free space on each index page did grow the size of the database slightly; the growth on SearchBeta was not that large.

We then looked at the cost/benefit of doing a Reorganize versus a Rebuild.  This was an interesting discovery for us.  Initially we had a script in place similar to proc_DefragIndexes that would choose to Reorganize or Rebuild based on percent fragmentation, with 30% being the decision point (i.e., greater than 30% would do a Rebuild).  What we found was that a Reorganize was taking over 8 hours with a 10% fragmentation rate, and during this time end-user queries suffered dramatically.  Out of curiosity and desperation we tried a Rebuild, which is supposed to be the more expensive of the two operations.  The Rebuild completes in approximately 1 hour, while the Reorganize takes as long as 8 hours.  The Rebuild is more expensive in the sense that you will see some failed queries during the hour that it runs, whereas the Reorganize doesn't have as drastic an effect on queries; but the overall cost of the Reorganize is much higher, since you have an 8-hour window in which query performance is degraded.

UPDATE STATISTICS:  In the experiments we ran, we found that because the rebuild also updates statistics, it was not necessary to run this command regularly.
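
The change in policy described above can be summarized in a few lines of Python. This is a simplified sketch, not the actual defrag script: the function names are hypothetical, the earlier script's behavior is reduced to its 30% decision point, and the fragmentation figure is assumed to come from a source such as the avg_fragmentation_in_percent column of sys.dm_db_index_physical_stats.

```python
def old_policy(fragmentation_percent):
    """Earlier approach: pick the operation by fragmentation (30% cut-off)."""
    return "REBUILD" if fragmentation_percent > 30 else "REORGANIZE"


def new_policy(fragmentation_percent, threshold=10):
    """Approach arrived at above: rebuild at the threshold, else do nothing."""
    return "REBUILD" if fragmentation_percent >= threshold else None


def alter_index_sql(index, table, action):
    """Build the ALTER INDEX statement for the chosen action.

    table is passed already qualified and quoted, e.g. [dbo].[MSSDocProps].
    """
    return "ALTER INDEX [%s] ON %s %s;" % (index, table, action)
```

At 10% fragmentation old_policy returns REORGANIZE, the operation that took over 8 hours, while new_policy returns REBUILD, which finished in about an hour in the environment described.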

Finally we deployed the script into an environment that utilized SQL Mirroring.  Unfortunately this didn't work out very well.  The mirror got so far behind that we eventually had to disconnect the mirror and stop the defrag.  Analysis made it clear that the root cause was that the environment was heavily I/O bound and the defrag script generated more I/O than the system could keep up with.  While the mirror was behind, end-user query latencies suffered dramatically.  To recover from this we ultimately had to improve the hardware by increasing the number of spindles.

To mitigate this we have added a parameter to the script that allows you to reduce the MAXDOP used in the index rebuild.  Setting this parameter to 1 on a SQL box that is minimally I/O bound helps, but it may not be enough depending on how constrained the system is.  If you are in an environment  that is I/O bound (with or without SQL Mirroring) we strongly recommend that you go through a test of the defrag before you go live with the deployment.  The easiest thing to try is the following SQL statement:

ALTER INDEX IX_MSSDocProps ON [dbo].[MSSDocProps]

REBUILD WITH (MAXDOP = 1, FILLFACTOR = 80, ONLINE = OFF)

The statement above rebuilds the largest index using the lowest possible MAXDOP.  This index must be rebuilt OFFLINE, so you will need to run this on a test system or during a maintenance window.  While this command is running, keep an eye on:

  • The duration of the command.  Will it complete within your service window?  For comparison purposes, this command completes in under an hour on the SearchBeta hardware.
  • SQL I/O latencies
  • If you have mirroring in place
    • The Database Mirroring Monitor
    • Send and Redo Queues  within perfmon.  The monitor above will tell you if mirroring is too far out of sync, but these counters are useful for comparison if you start changing the MAXDOP parameter.

Bottom line: we feel the rebuild is a much better operation to run, and recommend that you:

  1. Run the script on a regular basis; once a night or on the weekends depending on your service windows.
    • Weekends or weekly - reduce the fragmentation rate (sproc parameter) to 5.0 or lower to prevent missing the defrag due to a fraction of a percent (for example, an index sitting at 9.5%).
    • Nightly - use the defaults for fragmentation rate. The largest index (MSSDocProps) gets rebuilt approximately every 2 weeks on SearchBeta. Running the script nightly will ensure that your indexes are up to date more often, but gives you less control over the exact time that the index rebuild occurs.
  2. Before running the script the first time test out how your system will behave when rebuilding MSSDocProps.
  3. Reduce MAXDOP - If your environment shows poor I/O response time or unacceptable durations (cannot complete a defrag inside your service window) reducing the MAXDOP value may reduce the duration of the script and put less pressure on the I/O system.  Reducing the MAXDOP will not help enough if the system is very I/O bound. 
  4. SQL Mirroring - SQL mirroring is sensitive to I/O latencies; adding the defrag may be too much I/O for the system to handle.
  5. Poor I/O latency - You should focus on improving the I/O subsystem of your SQL environment before you begin running this script.    

Stored Procedure syntax:

exec proc_DefragSearchIndexes [MAXDOP value], [fragmentation percent]
  • MAXDOP value - Integer value. Default is 0, which means that all available CPUs will be used.
  • Fragmentation percent - decimal value. Default is 10.0.  This value was explicitly chosen because we were able to measure query latency improvements on SearchBeta when defragging at the 10% boundary.
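
The effect of the fragmentation threshold can be illustrated with a small Python sketch. This is not the stored procedure itself: the function names are hypothetical, and the comparison used (rebuild when fragmentation is at or above the threshold) is an assumption about how the script decides. It shows why an index at 9.5% is skipped with the default of 10.0 but picked up by a weekly run using 5.0, and it builds the same test statement given earlier.

```python
def needs_rebuild(fragmentation_pct, threshold_pct=10.0):
    """True when measured fragmentation has reached the threshold.

    Assumption: the real script may use a slightly different comparison.
    """
    return fragmentation_pct >= threshold_pct


def rebuild_statement(index, table, maxdop=1, fillfactor=80):
    """Build the offline REBUILD statement in the form shown in the post."""
    return (
        f"ALTER INDEX {index} ON [dbo].[{table}] "
        f"REBUILD WITH (MAXDOP = {maxdop}, FILLFACTOR = {fillfactor}, ONLINE = OFF)"
    )


# An index sitting at 9.5% is skipped by a run using the default 10.0 threshold...
print(needs_rebuild(9.5))        # False
# ...but is picked up when a weekly run lowers the threshold to 5.0.
print(needs_rebuild(9.5, 5.0))   # True
print(rebuild_statement("IX_MSSDocProps", "MSSDocProps"))
```

With a nightly schedule the miss only costs one day; with a weekly schedule it costs a full week, which is the reason for the lower threshold.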

-Thanks

Dan Blood
Senior Test  Engineer
Microsoft Corp

How to: Mine the ULS logs for query latency
02 September 08 07:38 PM | Microsoft Enterprise Search Blog

Tracking query latencies can be made easier through the use of the product's ULS logs.  Below you will find information on how to enable the specific ULS traces as well as information on how to parse the logs.  The primary use of this information is to monitor the ongoing health of your system.  It is one tool in the toolbox to make sure that the system is running in a viable state.  It is also necessary when you are making small changes to your environment, so you can measure the benefits or detriments of the changes made.  Another key use of the query latency ULS logs is the ability to see where the larger portions of time are being spent in the query.  For example, you can see the time spent in SQL improve after doing index defrags.

ULS logging

Making changes to ULS log settings can impact performance and cause more disk space to be consumed.  However, the category and level changes mentioned below are what SearchBeta is running with, and the cost of this is negligible given the benefit it provides.  Just make sure your log files are not on a drive that is tight on disk space.

You will need to change the following ULS settings to get the events that we need traced. 

From "Central Admin.Operations.Diagnostic Logging" set the following:
Category: "MS Search Query Processor"
Least critical event to report to the trace log: "High"

LogParser

There are a number of interesting traces that you get with the above setting.  To really look at this data you will need to use some kind of log parsing utility to strip out the interesting traces and perform some additional post processing.  I recommend that you use logparser.exe to do this parsing.  Below I give examples of Log Parser queries to get at the data. Additionally you should provide the following input parameters to logparser.exe since the ULS log files are Unicode, tab separated text files. 

  • -i:TSV -iCodepage:-1 -fixedSep:ON

Traces

With the above ULS trace settings you will get the following messages in the log (location of these log files can be found in the above UI for changing the logging level):

  • Completed query execution with timings: v1 v2 v3 v4 v5 v6
    • The 6 numbers v1, v2, v3, v4, v5 and v6 are time measurements in milliseconds
      • v6 = Cumulated time spent in various calls to SQL, except the property fetching
      • v5 = Time spent waiting for the full-text query results from the query server (TimeSpentInIndex)
      • v4 = Latency of the query measured after the joining of index results with the SQL part of the query. This includes v5 and the time spent in SQL for resolving the SQL part of advanced queries (e.g. queries sorted by date or queries including property based restrictions like AND size > 1000).
      • v4-v5 = Join time
      • v3 = Latency of the query measured after security trimming. It includes v4 plus retrieval of descriptors from SQL and access check.
      • v3-v4 = Security trimming time
      • v2 = Latency of the query measured after the duplicate detection.
      • v2-v3 = Duplicate detection time
      • v1 = Total time spent in QP. (TotalQPTime)
      • v1-v2 = Time spent retrieving properties and hit highlighting. (FetchTime)
  • Join retry v1 v2 v3
    • Retry caused because there were not enough results from SQL that matched the results returned from the full-text index.
  • Security trimming retry v1 v2 v3
    • Caused by the user executing a query that returns a number of results that they do not have permission to read.  The query is retried until enough results are available to display the first page of results.
  • Near duplicate removal retry v1 v2 v3
    • There were so many virtually identical documents that were trimmed out that the query processor did not have an adequate number of documents to display
    • The 3 numbers v1,v2 and v3 are counts of documents.  If you see one of these messages in the log it means that the query processor was unable to satisfy the requested number of results on the first attempt and had to execute the SQL portion of the query a second++ time with a larger number of requested results.  The numbers here are not excessively useful and most of the analysis you will do is around the existence of this trace.   This and the relative frequency of each of the retries allows you to determine why so much time is being spent in a given phase of the query.   
      • v1 is the current upper bound on the number of documents to work with (this will go up on subsequent retries)
      • v2 is the number of documents before the operation that caused the retry
      • v3 is the number of documents after the operation that caused the retry.
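
To make the arithmetic between the six timing values concrete, here is a minimal Python sketch that splits one "Completed query execution with timings" message into the derived phase times. The function name, the "SqlTime" label for v6, and the sample numbers are made up for illustration; only the message layout and the subtractions come from the description above.

```python
MARKER = "Completed query execution with timings:"


def parse_timings(message):
    """Split one trace message into the derived per-phase times (ms)."""
    # Take the six numbers that follow the marker, so an optional
    # "QueryID: XXX." prefix does not change the token positions.
    values = message.split(MARKER, 1)[1].split()[:6]
    v1, v2, v3, v4, v5, v6 = (int(v) for v in values)
    return {
        "TotalQPTime": v1,
        "TimeSpentInIndex": v5,
        "JoinTime": v4 - v5,
        "SecurityTrimmingTime": v3 - v4,
        # Report 0 when v2 is 0 rather than a negative duration.
        "DuplicateDetectionTime": 0 if v2 == 0 else v2 - v3,
        "FetchTime": v1 - v2,
        "SqlTime": v6,  # hypothetical label for v6
    }


# Made-up numbers in the message layout described above.
sample = "QueryID: 42. Completed query execution with timings: 500 420 400 250 200 90"
print(parse_timings(sample))
```

Because each of v2 through v5 is a cumulative latency measured at a later point in the pipeline, every phase time is simply the difference between two neighbouring values.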

Where is all of the time being spent for the queries executed in the system?

The answer to this question is primarily within the "Completed query execution…" trace.  The number of retries helps explain why the time spent in any one location is so high.  Given all of the timing information that you can get from a single query, and the fact that this data is available for each and every query executed, the problem becomes more of an exercise in figuring out how to store the data and provide a mechanism to summarize or chart it.  Without doing this there is just too much data to try and interpret.  The solution we have on SearchBeta is to collect the data on a regular basis (hourly) and import it into a SQL reporting server that is segregated from the SQL machine hosting the Search farm.

Once the data is in SQL we have created a number of Excel spreadsheets that query the data directly from SQL and chart it using Excel Pivot Tables/Charts.  We have also gone further to provide a set of dashboards within a MOSS system that use Excel Server to provide up to date reports on the health of the system that are available for anyone to look at.  

Once you have the basics of this system set up, there are a multitude of other reports and health monitoring that are possible, from collecting performance counters to mining IIS logs.  The IIS logs provide a key piece of information about query latencies that is missing from the ULS trace: they answer the question of how much additional time is spent rendering the UI.

A sample of one of the charts that we are able to produce with the ULS log data is below:

clip_image001

The Log Parser query that we use to mine the ULS logs is below.  Note that there are a number of output options for LogParser; I am using a simple CSV file below, but you can also import the data directly into SQL.

*Remember that the numbers in the log are in milliseconds.  The query as shown outputs them unchanged, so divide by 1000 if you want seconds.

Select  Timestamp
      , TO_INT(Extract_token(Message,7, ' ')) as TotalQPTime
      , TO_INT(Extract_token(Message,8, ' ')) as v2
      , TO_INT(Extract_token(Message,9, ' ')) as v3
      , TO_INT(Extract_token(Message,10, ' ')) as v4
      , TO_INT(Extract_token(Message,11, ' ')) as TimeSpentInIndex
      , TO_INT(Extract_token(Message,12, ' ')) as v6
      , SUB(v4, TimeSpentInIndex) as JoinTime
      , SUB(v3, v4) as SecurityTrimmingTime
      , CASE v2
            WHEN 0 THEN 0 
            ELSE SUB(v2, v3) 
        End as DuplicateDetectionTime
      , SUB(TotalQPTime, v2) as FetchTime
INTO QTiming
FROM \\%wfeHost%\ULSlogs\%wfeHost%*.log
WHERE Category = 'MS Search Query Processor' 
      AND Message LIKE '%Completed query execution with timings:%' 

*FYI -- Prior to the MSS release and Infrastructure Update updating MOSS with the MSS changes, the first two "tokens" (QueryID: XXX.) at the beginning of the trace did not exist.  So you will need to subtract 2 from the second parameter of each "Extract_token" predicate in the above SQL command.

What is the percentage of retries that the system has?

To get an idea for how many "retries" are occurring you need to correlate the number of retries with the number of queries executed and calculate a % of total retry values for each type of retry.  The timing data above does include time spent in a retry.    

Log Parser queries:

  • Total number of queries executed
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Completed query execution%'
  • Total number of retries due to Security trimming
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Security trimming retry%'
  • Total number of retries due to Join retries
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Join retry%'
  • Total number of retries due to Duplicate Removal
SELECT count (Message) 
FROM *.log 
WHERE Category = 'MS Search Query Processor' and Message 
like '%Near duplicate removal retry%'
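
Once the four counts from the queries above are in hand, the percentage calculation is simple arithmetic. The following Python sketch shows it; the function name and the counts are hypothetical and are not measurements from SearchBeta.

```python
def retry_percentages(total_queries, retry_counts):
    """Return each retry type as a percentage of all executed queries."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    return {
        name: round(100.0 * count / total_queries, 2)
        for name, count in retry_counts.items()
    }


# Hypothetical counts, one per Log Parser query above.
counts = {"security_trimming": 150, "join": 40, "near_duplicate": 10}
print(retry_percentages(2000, counts))
# {'security_trimming': 7.5, 'join': 2.0, 'near_duplicate': 0.5}
```

Since a single query can retry more than once, these figures are best read as retries per 100 queries rather than as the share of queries that retried.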

Thank you for your time and as always I welcome any feedback or questions

Dan Blood
Senior Test  Engineer
Microsoft Corp

Search Server 2008 Express Redistribution Rights
21 August 08 07:26 PM | Microsoft Enterprise Search Blog

If you’re interested in using Search Server 2008 Express in your application or shipping it on hardware then take a look at the redistribution license page on the Enterprise Search site.

This redistribution license agreement grants you the right to redistribute Microsoft Search Server 2008 Express with your software application or hardware.

To obtain a Search Server 2008 Express redistribution license, you must:

  1. Review the Search Server 2008 Express Redistribution End-User License Agreement (EULA).

  2. Print and retain a copy of the Search Server 2008 Express Redistribution EULA for your records.

  3. Register for Search Server 2008 Express redistribution rights.

The license is applicable for all 37 Search Server 2008 Express languages.

 

image

Announcing Faceted Search v2.5
12 August 08 05:50 PM | Microsoft Enterprise Search Blog

Starting with Faceted Search 2.5, the solution relies on Microsoft Enterprise Library to address common software requirements such as caching, logging, exception handling and policy injection. More importantly, 2.5 is a groundbreaking release that sets new targets for Faceted Search. So, what’s new?

image

New Features

1. Caching – dramatically improves performance and decreases the load on the search engine

The solution uses two manageable cache mechanisms: quick and long. I built the caching logic on the assumption that the user knows what he/she is looking for. The Search Facets web part caches the original result set and uses it for search refinement, paging and other postbacks. If the initial result set doesn’t provide full coverage of the search, a smart second thread runs against real-time data and adjusts the cached match.

2. Synchronization with Core Search Results web part

The MOSS search is adjusted by several parameters that the designer can set on the Core Search Results web part itself. These include remove duplicates, enable trimming, and permit noise words. When you drop the Search Facets web part onto the search results page, it finds the Core Search Results web part, reads its parameters and syncs the search query parameters to exactly match the ones used by the Core.

image

3. Support for advanced search

This has been the most wanted feature since Faceted Search 1.0. With 2.5, the facets are rendered for advanced search, although they do not yet extend to ranges. The functionality is accomplished by extending the SearchQuery structure to accommodate POST requests and sync back to the GET query.

image

4. Match of search counters

This release introduces an updated search syntax that is designed to provide counters matching those of the core search. In fact, the new search query uses both KeywordQuery and FullTextQuery through the use of generics.

public class GenericQuery<T> : IDisposable where T : Query
{
    private EventHandler _customLogic;

    public ResultTableCollection Execute(EventArgs args)
    {
        _customLogic(_query, args);
        return _query.Execute();
    }

    ...
}

Additionally, the WHERE clause of the search query was modified to provide a closer match to the Core counter.

5. Introducing Parent-Child relationships

By design, the facets could support only 2 levels. This release extends the Facets schema to allow management of nested layers, which eases the pain of displaying complex hierarchies such as geography or an org chart. A Parent-Child relationship can be set by facet name and facet value, or just by facet name.

<Column Name="BDCCity" DisplayName="City" ParentName="BDCState" />
<Column Name="BDCState" DisplayName="State" >
  <Mappings>
    <Mapping Match="Alberta"  ParentName="BDCCountry" ParentValue="Canada"/>    
    <Mapping Match="Manitoba" ParentName="BDCCountry" ParentValue="Canada" />
    <Mapping Match="Ontario"  ParentName="BDCCountry" ParentValue="Canada"/>
    <Mapping Match="Quebec"   ParentName="BDCCountry" ParentValue="Canada"/>
  </Mappings>
</Column>

In the configuration above, the City facets will display only after the user has chosen a State. Each State value is in turn mapped to its parent Country.
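
To make the two kinds of parent rule concrete, here is a small Python sketch that reads a configuration shaped like the one above and looks up the parent of a facet value. It is not part of Faceted Search, which is a .NET solution. The wrapping <Columns> root element, the function name, the "Calgary" value and the rule that a value-level mapping takes precedence over a name-only parent are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the configuration, wrapped in a hypothetical root element.
CONFIG = """
<Columns>
  <Column Name="BDCCity" DisplayName="City" ParentName="BDCState" />
  <Column Name="BDCState" DisplayName="State">
    <Mappings>
      <Mapping Match="Alberta" ParentName="BDCCountry" ParentValue="Canada" />
      <Mapping Match="Ontario" ParentName="BDCCountry" ParentValue="Canada" />
    </Mappings>
  </Column>
</Columns>
"""


def find_parent(root, facet_name, facet_value):
    """Return (parent_name, parent_value) for a facet value, or None."""
    for column in root.iter("Column"):
        if column.get("Name") != facet_name:
            continue
        # Assumed precedence: a value-level mapping wins over a name-only parent.
        for mapping in column.iter("Mapping"):
            if mapping.get("Match") == facet_value:
                return mapping.get("ParentName"), mapping.get("ParentValue")
        if column.get("ParentName"):
            return column.get("ParentName"), None
    return None


root = ET.fromstring(CONFIG)
print(find_parent(root, "BDCState", "Alberta"))  # ('BDCCountry', 'Canada')
print(find_parent(root, "BDCCity", "Calgary"))   # ('BDCState', None)
```

A parent value of None here stands for the name-only case, where any chosen State is enough to unlock the City facet.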

6. Extending search to logical “OR” queries

The original facets always represent “AND” queries, which means that search results can only be narrowed by adding extra criteria. In this release I prototyped a way to expand the search by adding additional matches to the criteria. This in fact resulted in a revamped Bread Crumbs UI. The out-of-the-box support for languages that is now provided is a good example of how “OR” queries empower the search.

7. Simplified web part properties

The 2.5 release is friendly to modifications of the web part properties. I have all properties classified and broken down to groups for each of the web parts.

image

8. Other

There are numerous other fixes and enhancements, including improved security validation, code refactoring, extended facet sorting, and support for quoted search and duplicates.

What’s next

It’s my privilege to say that we now have a team that helps to shape new releases and brainstorm the future of Faceted Search. At present we are looking at AJAX and Silverlight, and hopefully you’ll start seeing more and more of the power of Facets in the near future.

Leonid Lyublinski
Senior Consultant
Microsoft Consultancy Services

SharePoint Server 2007 Powers Beijing 2008 Olympic Games
05 August 08 03:20 PM | Microsoft Enterprise Search Blog

In light of the 2008 Olympics starting tomorrow in Beijing, we’re extremely excited to share that the Beijing Organizing Committee for the Olympic Games (BOCOG) is one of our newest SharePoint Server 2007 customers! After checking out a variety of different search solutions, such as the Google Search Appliance, the BOCOG is using SharePoint as a search platform to power their INFO 2008 database, which over 3,500 Olympics partners and members of the media will have access to. This means that thousands of athletes, officials, partners and media will be able to quickly and easily search and find the information they need to successfully present, promote, and report the 29th Summer Olympics on the SharePoint platform.

Not only is the BOCOG a great customer to work with, this is a perfect example of how SharePoint can perform under pressure and scale to fit the business needs of any enterprise-sized customer. Using our powerful Enterprise Search functions in SharePoint, INFO 2008 users will be able to access the information they need in an easy, efficient way.

Below is a screenshot of the INFO 2008 system in action, using SharePoint to search athlete bios:

clip_image001

Check out the rest of the BOCOG case study here, and let the games begin!
