Monday, December 28, 2009

Problems Indexing an EML Inside a ZIP in SharePoint

Previously on Bob Klass.Info (I've been watching too much TV lately, re-watching several seasons of the "Dexter" series in the last few days):

"Friday, December 18, 2009

Adding Zip Search File Type in MOSS 2007

I am a little behind the curve on this. Better late than never.
  1. Get the filter pack here: http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en
  2. Install it
  3. Set up the item type in the SSP
  4. Set up a new key for .zip in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\
  5. Give it a value of {20E823C2-62F3-4638-96BD-90F4F6784EBC}
  6. Restart the search service
That is the shorthand version! This is done for you in SharePoint 2010."
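Steps 4 and 5 can be sketched as a generated .reg file instead of hand edits in regedit. This is a minimal sketch, not part of the original post: the helper name `build_filter_reg` is hypothetical, and it assumes the GUID belongs in the key's default (unnamed) string value. Steps 3 and 6 (the file type in the SSP and the search service restart) are still done by hand.

```python
# Sketch only: builds the text of a .reg file for steps 4 and 5.
# Assumption: the filter GUID is stored as the key's default string value.

REG_HEADER = "Windows Registry Editor Version 5.00"

# Key path copied from step 4 of the post.
FILTER_EXT_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"
    r"\ContentIndexCommon\Filters\Extension"
)

# GUID copied from step 5 of the post.
ZIP_FILTER_GUID = "{20E823C2-62F3-4638-96BD-90F4F6784EBC}"


def build_filter_reg(extension: str, guid: str) -> str:
    """Return .reg file text that maps a file extension to a filter GUID."""
    ext = extension if extension.startswith(".") else "." + extension
    lines = [
        REG_HEADER,
        "",
        f"[{FILTER_EXT_KEY}\\{ext}]",  # the new key, e.g. ...\Extension\.zip
        f'@="{guid}"',                 # "@" is the key's default value
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_filter_reg(".zip", ZIP_FILTER_GUID))
```

Save the output as a .reg file, look it over, and import it on the indexer before restarting the search service.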
That was all fine and interesting, but then the really interesting stuff turned up. My Foxit PDF ifilter and the standard MS text ifilters both seemed to work on files nested within ZIP files. But EML files did not (and EMLs can have attachments of their own!).
It turns out there is a property of the ifilter called "threading." It is the COM threading model registered for the ifilter's DLL, and it says whether the filter can be loaded by single-threaded callers, multi-threaded callers, or both.
A utility called IFilter Explorer will help you look at threading. I don't know if there is another tool like this anywhere or an alternative to this tool. Running this on my test indexer seemed to work OK, but it is old (in software years - like doggie years, it is about 80) and it seems to be an orphan (widow?). But it beats sifting through the registry manually.
Now it gets (unnecessarily) messy.
I was looking to find how EML indexing works. It uses:
  • c:\windows\system32\mimefilt.dll
  • content type message/rfc822
  • GUID of {5645C8C2-E277-11CF-8FDA-00AA00A14F93}
  • related PersistentAddinsRegistered of {89BCB740-6119-101A-BCB7-00DD010655AF}
I should have found the threading model set to "Both", indicating that both single-threaded and multi-threaded daemons can use it. Instead I found it marked as "Bo" in two places in the registry.
I changed the first "Bo" entry to "Both" and the second one was magically fixed along with it. Time to restart the services and re-try the indexing ... and ... it worked. Even PDFs attached to the EML inside the ZIP were indexed. This "Both" corrupted to "Bo" burned up a few hours of troubleshooting.
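For anyone who would rather script the check than eyeball it, here is a rough sketch of spotting a truncated threading model. It makes assumptions the post does not confirm: that the value lives in `ThreadingModel` under `HKEY_CLASSES_ROOT\CLSID\<GUID>\InprocServer32` (the usual home for a COM in-process server), and that the GUID listed above is the CLSID to look up. The function names are hypothetical.

```python
import sys

# The threading model names COM accepts for an in-process server.
VALID_THREADING_MODELS = ("Apartment", "Both", "Free", "Neutral")


def diagnose_threading_model(value):
    """Classify a ThreadingModel string as (status, suggested_value)."""
    if value is None:
        return ("missing", None)
    if value in VALID_THREADING_MODELS:
        return ("ok", value)
    # A non-empty proper prefix of exactly one valid name looks truncated.
    candidates = [
        m for m in VALID_THREADING_MODELS
        if value and len(value) < len(m) and m.lower().startswith(value.lower())
    ]
    if len(candidates) == 1:
        return ("truncated", candidates[0])
    return ("unknown", None)


def read_threading_model(clsid):
    """Read ThreadingModel for a CLSID (Windows only; location is assumed)."""
    import winreg  # imported here so the rest of the file loads anywhere

    path = "CLSID\\" + clsid + "\\InprocServer32"
    try:
        with winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, path) as key:
            value, _value_type = winreg.QueryValueEx(key, "ThreadingModel")
            return value
    except FileNotFoundError:
        return None  # key or value not present


if __name__ == "__main__" and sys.platform == "win32":
    # GUID from the post; treating it as the filter's CLSID is an assumption.
    clsid = "{5645C8C2-E277-11CF-8FDA-00AA00A14F93}"
    print(clsid, diagnose_threading_model(read_threading_model(clsid)))
```

The sketch only reads and reports. Fixing the value is left as a deliberate manual edit, followed by the same service restart as before.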
Dexter's usual actions would be justified if he could find a guilty party in this case.
