One of the points on a slide I present in training and in many presentations on typical problems with Enterprise Search engines is 'Bad Document Authoring'. I just found another example where this is a problem, so I thought I'd write a few words about it here.
Many corporations produce documents in a number of different formats. Authors use a variety of word processing or desktop publishing tools to produce documents. The most common, of course, are Microsoft Office formats like Word or Excel, and Portable Document Format (PDF). Many organizations have also decided that PDF is a better format for publishing official documents because PDF files are hard to modify and their formatting is preserved.
The major problem with indexing these documents and returning them as good results is that authors are really bad at adding the kind of information (metadata) that describes them. Often, even filenames have little to no meaning. Many believe that this is the reason why you need a search engine - to find these poorly authored and organized documents.
For us, the most common problem is returning a PDF document that has a title somewhat like this: 'Microsoft Word - Document 2384' or simply '010504ext.doc'. MondoSearch, like most search engines, will look at the title tag content of the documents it finds and try to use that as a title. If this doesn't exist, it will look elsewhere or try to generate the title. Often there is no chance of this, because the title tag exists but contains only the filename of the document.
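To make the problem concrete, here is a minimal Python sketch of how such junk titles could be detected so that a fallback title is used instead. It is not MondoSearch code; the function names and the patterns are my own assumptions, based only on the two example titles above.

```python
import re

# Hypothetical heuristics based on the two example titles above;
# tune the patterns for your own document collection.
_APP_PREFIX = re.compile(r"^Microsoft (Word|Excel|PowerPoint) - ", re.IGNORECASE)
_FILE_EXT = re.compile(r"\.(docx?|xlsx?|pptx?|pdf|rtf|txt)$", re.IGNORECASE)


def is_junk_title(title):
    """Return True if the title looks auto-generated rather than authored."""
    if title is None:
        return True
    text = title.strip()
    if not text:
        return True
    # Titles such as 'Microsoft Word - Document 2384'
    if _APP_PREFIX.match(text):
        return True
    # Titles that are really filenames, such as '010504ext.doc'
    if _FILE_EXT.search(text):
        return True
    return False


def choose_title(embedded_title, fallback):
    """Use the embedded title unless it looks like junk."""
    if is_junk_title(embedded_title):
        return fallback
    return embedded_title.strip()
```

For example, `choose_title("010504ext.doc", "Expense Policy")` falls back to the second argument, while a real title such as "Quarterly Sales Report" is kept as it is.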
So how can you get around this? Well, there are several options:
1) Get your authors to enter metadata in their Word documents. - This is probably the best method and is easy to do, but it suffers from poor user adoption. The authors must open the properties dialog when creating the document and type in a title and a description of the document. The title should ideally be 2-5 words and the description 3-8 words long. I will likely make another post about titles on this blog, so I won't go into more detail on this now.
2) Add titles and descriptions to the PDF documents. - Most PDF documents are generated by pushing a little button with a PDF icon in the corner of Word. This generates the document automatically and does not offer to add the information. Therefore, adding it to the PDFs manually is the only option. You must open the PDF in Acrobat, click on the little arrow above the scrollbar on the right, and choose document properties. Here you (or your part-time monkey) can enter a title and description for the document.
3) Use MondoSearch's pre-indexing module, Content Optimizer, to add the information to the document at crawl time. - Our Content Optimizer is a pretty powerful tool that allows you to programmatically add metadata to documents at crawl time. If all your documents have similar patterns, you can use the rules in the Optimizer to generate titles and descriptions from these patterns. I've used this tool to add a lot of metadata, ignore irrelevant content, and even boost ranking on all sorts of document types.
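To illustrate the idea behind option 3, here is a rough Python sketch of generating titles from patterns. This is not Content Optimizer's rule syntax; the paths, the rules, and the `title_from_path` function are all made up for illustration, and the sketch assumes your document paths follow a predictable scheme.

```python
import re

# Hypothetical rules: a pattern matched against the document path,
# paired with a title template filled from the named groups.
RULES = [
    (re.compile(r"/policies/(?P<name>[a-z0-9-]+)\.pdf$"),
     "Policy: {name}"),
    (re.compile(r"/minutes/(?P<year>\d{4})-(?P<month>\d{2})\.pdf$"),
     "Meeting Minutes {year}-{month}"),
]


def title_from_path(path):
    """Return a generated title for the path, or None if no rule matches."""
    for pattern, template in RULES:
        match = pattern.search(path)
        if match:
            # Turn 'travel-expenses' into 'Travel Expenses'; digits are unchanged.
            fields = {
                key: value.replace("-", " ").title()
                for key, value in match.groupdict().items()
            }
            return template.format(**fields)
    return None
```

With these made-up rules, `title_from_path("/docs/policies/travel-expenses.pdf")` gives "Policy: Travel Expenses", and a path that matches no rule gives `None`, so the document keeps whatever title it already had.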
Although I love our Content Optimizer, the best way to solve this problem is at the source: educate authors to create documents with good metadata. Even fixing all the existing PDFs by hand is probably better than building all sorts of rules to compensate for bad authoring. However, if options 1 and 2 are not available to you, try out Content Optimizer. Some consulting may be needed, but I'd be happy to help you out.