You know, there is always something more to learn about SPS03.
OK - I'm playing around some more with the sites directory content source. After working with it for a while, I realize that this content source *does not* crawl the sites directory in a portal, but rather the sites that have been enabled for crawling in the Manage Sites to be Crawled list.
Hmmm.
So, I further learn that the portal content source crawls the Sites Directory area in a portal, but *does not* follow the links to the sites - it merely picks up the fact that those sites exist by crawling the links themselves.
Another Hmmm.
OK - so when a new sites directory content source is created in the same portal, instead of ending in sites=$$$default$$$, it ends in sites=*.
Discussion Questions:
1. In what scenario would one want to create a second Sites Directory Content Source since there is only *one* Manage Sites to be Crawled list per portal (and this list does not appear to participate in shared services)?
2. If one does create a second sites directory content source, what is it crawling?
Other bits of information: When I create an embedded site in a portal via the sites directory, the site is automatically added to both the sites directory in the portal and the Manage Sites to be Crawled list. OK. If I then remove the site listing from the sites directory, the listing in the Manage...Crawled list remains. The sites directory content source will still crawl the site, but the portal content source will not crawl this list.
Furthermore, IIRC (sitting in a hotel room right now), there is no way to assign a source group to the sites directory content source - it gets its own source group automatically. But if I create a second sites directory content source, I don't remember how or if I can assign that to a different source group.
Also, the default portal content source crawls the portal every 10 minutes. The default sites directory content source does an Incremental (Inclusive) crawl once each night. This means that new site collections created in the sites directory will appear in the search results within 10 minutes of being created, but the *content* for these sites could take up to 24 hours (OK - 23 hours and 50 minutes) to appear in the search results. To me, this is a bad default design.
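To put numbers on that gap, here is a rough Python sketch of the timing. It is not anything SPS03 ships with. The 02:00 nightly start time, the function name and the sample dates are all made up for illustration, and it ignores how long a crawl actually takes to run and propagate. It only assumes a portal crawl every 10 minutes and a sites directory crawl once every 24 hours.

```python
from datetime import datetime, timedelta

def next_crawl_start(created, interval, anchor):
    """Return the first crawl start at or after `created`.

    Crawls are assumed to start at anchor, anchor + interval,
    anchor + 2 * interval, and so on (a simplification).
    """
    elapsed = created - anchor
    periods = -(-elapsed // interval)  # ceiling division on timedeltas
    return anchor + periods * interval

# Hypothetical schedule: both crawls line up with 02:00 (assumed start time).
anchor = datetime(2004, 1, 1, 2, 0)
portal_interval = timedelta(minutes=10)   # default portal content source
nightly_interval = timedelta(days=1)      # default sites directory content source

# Worst-ish case: the site is created one minute after the nightly crawl starts.
created = datetime(2004, 1, 5, 2, 1)

listing_visible = next_crawl_start(created, portal_interval, anchor)
content_visible = next_crawl_start(created, nightly_interval, anchor)
gap = content_visible - listing_visible

print("site listing crawled at:", listing_visible)
print("site content crawled at:", content_visible)
print("listing is searchable without content for:", gap)
```

With those made-up times, the listing gets picked up at 02:10 the same night and the content at 02:00 the following night, so the site sits in the search results with no content behind it for 23 hours and 50 minutes.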