Thursday, October 04, 2007

Search and Research

The major references from Microsoft that I have found cover:

  • How to do basic setup (file types, how to choose which content gets crawled, etc.)
  • Capacity (but in the context of THEIR scale, which is hardly anyone else's at this point - terabyte index for crying out loud)
  • Some details - a bit of minutiae, but not necessarily enough to give you all the answers (this is where I am living right now)

I have specific concerns in a couple of categories: capacity planning (space, processor, etc.) and functional specifics (what really gets indexed). You would think the limits you set would have a profound influence on the former. But maybe not.

Our users are concerned about having all content completely full-text indexed (it seems redundant - "complete" and "full" - but it's important!). Early on I had lots of questions about metadata vs. attachments, lists vs. libraries, etc., but for now I am strictly concerned with the file content in document libraries. More specifically: how do I balance the maximum upload size against the search settings, and what are the repercussions of incomplete crawls, whether due to file size ("The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled") or compressed content (the index grow factor - more on this in a bit)?

There are two relevant settings, MaxDownloadSize and MaxGrowFactor. It was proposed that we set MaxDownloadSize and the maximum upload size to the same value - 50 MB (we had been using a larger maximum upload). MaxDownloadSize and MaxGrowFactor are registry settings on the indexer, while the upload size is a SharePoint Central Administration setting. The grow factor is a multiple of the original file's size: with a factor of 4, if the original file was a compressed 1 MB, then when the indexer's filter uncompresses it and starts adding text to the index, it cannot add more than 4 MB.
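To make the arithmetic concrete, here is a tiny sketch of my mental model of the grow factor. This is my reading of the behavior, not anything from official documentation, and the helper and numbers are purely illustrative:

    def max_indexable_text_mb(original_file_mb, max_grow_factor=4):
        # The filter can add at most MaxGrowFactor times the original file's
        # size in uncompressed text to the index (as I understand it).
        return original_file_mb * max_grow_factor

    # A compressed 1 MB file that expands to 6 MB of text: only 4 MB gets indexed.
    expanded_text_mb = 6
    print(min(expanded_text_mb, max_indexable_text_mb(1)))   # prints 4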

My tests had two goals - (1) determine the functional ramifications of the settings and (2) find out how storage was impacted. For now I am not interested in how long it takes to complete, network or SQL impact, etc. None of those things are much of an issue for me for the time being.

For my initial tests, I had the default MaxDownloadSize of 10 MB and the default MaxGrowFactor of 4. I have a test database with 50,000 documents in it and a significant amount of content larger than 10 MB. I also created some control content consisting of large PDFs (some with a text source and some OCRed) and large .DOC and .DOCX files (mostly text, some graphics).

The results were ugly. With PDFs over the 10 MB MaxDownloadSize, no content was indexed (apparently the searchable parts are not in the first 10 MB). My Word tests were inconsistent and not extensive enough to be conclusive - it seems that sometimes the whole document is ignored and sometimes not. Word 2007 documents seem to work well, even when they are too big.

My plan was then to raise MaxDownloadSize to 50 MB (and not mess with MaxGrowFactor). At first I followed the advice of the TechNet entries below, and it did nothing. After several passes at this, I started looking around at the services and realized that there must be another registry key. But once I found where I thought it should go, it wasn't there. I now had something different to search for, and was able to confirm through an entry written by Bill English that you need to add the MaxDownloadSize value under the key for the MOSS search service (not the WSS one, unless you aren't running MOSS):

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager

Then I created MaxDownloadSize there as a DWORD value and set it to 50 (decimal).
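For what it's worth, here is a small script that does the same thing I did by hand in regedit - a sketch, not anything official. It has to run on the index server with admin rights, and it uses exactly the key path and value name above:

    import winreg

    # Create/update the MaxDownloadSize DWORD (decimal 50, i.e. 50 MB) under the
    # MOSS Gathering Manager key described above.
    KEY_PATH = (r"SOFTWARE\Microsoft\Office Server\12.0"
                r"\Search\Global\Gathering Manager")

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE | winreg.KEY_QUERY_VALUE) as key:
        winreg.SetValueEx(key, "MaxDownloadSize", 0, winreg.REG_DWORD, 50)
        value, _ = winreg.QueryValueEx(key, "MaxDownloadSize")
        print("MaxDownloadSize is now", value, "MB")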

Now I could re-run the tests and expect some results, and that's what I got. There were no longer any warnings:

"The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled."

None. This is interesting, because some of my test PDFs still didn't show up in the results (I had one that was 1,700+ pages, mostly text, at 47 MB), but most of them did work. It was an arduous test, and what I have learned so far is that the settings make things better, but still not perfect. We previously had done a lot of this kind of work with Lotus Notes databases. Notes search is very different (and probably superior in most regards, like most things Notes). But Notes also had known issues with some PDFs not searching correctly, and with Notes it is not so easy to install a different IFilter, at least as far as I know.

Incredibly, the database and local index files were actually smaller. Let me explain. The search database was about 30% larger, but there was a significant amount of free space; after a shrink, it was smaller than before. The local index files on the server were a bit smaller too. I felt that I had good controls on the experiment, but I don't understand that part of the results. I think we will just be very careful and monitor the growth of the index database.
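Since "monitor it" is the plan, here is a rough sketch of how I might track the on-disk size of the index over time. The paths are assumptions (point it at wherever your index files actually live, and wherever you want the log), not anything prescribed:

    import datetime
    import os

    # Assumed locations - adjust to your environment. The idea is just to append
    # a timestamped total to a CSV after each crawl so growth is easy to chart.
    INDEX_DIR = r"C:\Program Files\Microsoft Office Servers\12.0\Data"
    LOG_FILE = r"C:\Temp\index_size_log.csv"

    def dir_size_bytes(path):
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                try:
                    total += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass  # a file may be locked or deleted mid-walk
        return total

    with open(LOG_FILE, "a") as log:
        log.write("%s,%d\n" % (datetime.datetime.now().isoformat(),
                               dir_size_bytes(INDEX_DIR)))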


References:

  • http://technet2.microsoft.com/Office/f/?en-us/library/01cad765-f754-4d07-9dd9-4c725852a8d91033.mspx
  • http://office.microsoft.com/en-us/sharepointserver/HA011648411033.aspx
  • http://support.microsoft.com/kb/287231
  • http://admincompanion.mindsharp.com/SharePoint%20Server%202007%20Administrators%20Companion%20Wi/Crawling%20Details.aspx

1 comment:

  1. Anonymous, 10:50 AM

    I am having the same issue with large files. I have adjusted the maxdownloadsize and maxgrowfactor registry settings so now I also get no warnings... but if I search for a string that is found towards the end of a large file it will not return as a hit. But the file will return if I use a string from the top portion of the large file. Any ideas?
