I've been increasing the number and complexity of the content sources for our internal portal. For at least half of the web sites we crawl, I need to create one or more site path rules. That's fine: site path rules give us a way to ensure that we crawl only the portion of a source's content that we need. But I noticed an anomaly and wanted to see whether anyone else has experienced it, or has any advice or ideas.
Here's the scenario:
I crawl a regular web site, say FOO, at www.foo.com. I don't want the entire web site, just a subsection of it, which we'll call products. So all I want is www.foo.com/products and nothing else from the site. There are links on the products page that lead to other parts of the site, so it is best to use site path rules to limit the crawler to just that section, as follows:
www.foo.com include
www.foo.com/products/product.html include
www.foo.com/products/product.html/* include
www.foo.com/* exclude
Now, this works, as far as I can tell. When I run queries, I get back only content from the products portion of the foo web site.
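For anyone who wants to reason about the rule list above without waiting for a crawl, here is a minimal sketch in Python of how I am assuming the rules get evaluated: top to bottom, first matching rule wins, with * as a wildcard. This is only my mental model, not the crawler's actual logic. The decide function, the case-insensitive comparison, and the fallback of "exclude" when nothing matches are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# Rules copied from the post, in the order they are listed.
RULES = [
    ("www.foo.com", "include"),
    ("www.foo.com/products/product.html", "include"),
    ("www.foo.com/products/product.html/*", "include"),
    ("www.foo.com/*", "exclude"),
]


def decide(url, rules=RULES, default="exclude"):
    """Return the action of the first rule whose pattern matches url.

    Assumptions (hypothetical, not confirmed crawler behaviour):
    rules are checked in order, '*' matches any run of characters,
    matching ignores case, and unmatched URLs get `default`.
    """
    candidate = url.lower()
    for pattern, action in rules:
        if fnmatchcase(candidate, pattern.lower()):
            return action
    return default


if __name__ == "__main__":
    for url in (
        "www.foo.com",
        "www.foo.com/products/product.html",
        "www.foo.com/about.html",
        "www.foo.com/products/list.aspx",
    ):
        print(url, "->", decide(url))
```

Under this model, a URL such as www.foo.com/products/list.aspx falls through to the final exclude rule, because the only products URLs included explicitly are product.html and anything beneath it. Whether the real crawler behaves the same way is exactly what I can't confirm.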
Here's the part I don't get: when I did the exact same thing for a site that had .aspx pages, the site path rules didn't seem to “kick in” or “work” as I expected.
I need to test this some more, but has anyone else seen anything like this?