Interpretation of www.scan email results

The www.scan program can send its results by email as well as write them to standard output. The most convenient arrangement is probably a cron entry that emails the results to you periodically.

The results have three sections, each of which can use a little explanation.

Parameter

The first section lists all of the parameters the program ran with, so you know how to reproduce the run. It even lists the undocumented options.

Results

This section lists the errors found and the page that contains each offending link. Note that a link may fail for many reasons, and that the problem may go away of its own accord (e.g. the site was temporarily missing). Some won't go away. Some, upon inspection, look like they are working, but only work by virtue of a redirect. In that case you probably want to switch to the current URL.

Some example messages are:

           Missing file: http://www.seps.org/_vti_bin/fpcount.exe/ (Address 198.68.20.125 HTTP/1.1 500 Internal Server Error)
        referenced from: http://www.seps.org/
In this case, the program removed information from the URL after the fpcount.exe/. Everything from a ? onward is stripped before the link is checked. This means links to CGI scripts or other server-side packages may fail.

If the full URL does work in a browser, it is a good candidate for inclusion in an exception file.
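To see what the checker is left with after this stripping, here is a minimal Python sketch of the behavior described above. It is not taken from www.scan itself; the function name strip_query and the example URL with its query string are hypothetical.

```python
def strip_query(url: str) -> str:
    """Return url with everything from the first '?' onward removed."""
    # partition() splits at the first '?'; element 0 is the text before it.
    return url.partition("?")[0]


if __name__ == "__main__":
    # Hypothetical link to a server-side program that needs its arguments.
    full = "http://www.example.org/cgi-bin/counter.exe/?page=index"
    print(strip_query(full))
```

If the stripped URL fails but the full one works in a browser, the failure comes from the stripping rather than from the link.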

           Missing file: http://microsoft.com/ie/download/ (new location at /windows/ie/download/)
        referenced from: http://www.seps.org/tips.htm
In this case, Microsoft has moved its web page. When you check the link, you will find that it still works, but is rewritten to reflect the new location. When an organization does this, it is usually best to change your web page to the new link before they remove the redirection (after some period of time) and the content becomes much harder to find.

A redirect like this often occurs because the URL was written without a trailing slash where one is needed. It also occurs when a web site is reorganized and the owners are nice enough to indicate where things went. Occasionally redirects are used to distribute server load, in which case it is time to add an entry to the exception file.
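The "new location" in the report may be given relative to the original site, as in the Microsoft example. A small Python sketch, using only the standard library, shows how to turn it into the full URL to put in your page. The function name resolve_new_location is hypothetical.

```python
from urllib.parse import urljoin


def resolve_new_location(old_url: str, new_location: str) -> str:
    """Combine the reported link with its reported new location."""
    # urljoin keeps the scheme and host of old_url when new_location
    # is only a path, and returns new_location as-is when it is absolute.
    return urljoin(old_url, new_location)


if __name__ == "__main__":
    print(resolve_new_location("http://microsoft.com/ie/download/",
                               "/windows/ie/download/"))
    # prints http://microsoft.com/windows/ie/download/
```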

           Missing file: http://www.nasa.gov (Address 198.116.142.34 HTTP/1.1 500 Server Error)
        referenced from: http://www.seps.org/cvoracle/

It turns out that some people get a little lazy when writing URLs. The browser makes a request, is rebuffed by the server, and then automatically rewrites the URL in certain ways to see if it can find the data. In this case, the author left off the required trailing / of the URL. It should be http://www.nasa.gov/. This is a common error.
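If you keep a list of your links, you can catch this particular mistake before the scan does. The following Python sketch appends the missing / to URLs that name only a host. It is an illustration, not part of www.scan, and the function name add_root_slash is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def add_root_slash(url: str) -> str:
    """Append '/' when the URL names a host but has an empty path."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(add_root_slash("http://www.nasa.gov"))
    # prints http://www.nasa.gov/
```

URLs that already have a path, such as http://www.seps.org/tips.htm, are returned unchanged.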

           Missing file: http://www.mos.org/mos/sbm/sciencemail.html (Address 204.164.199.41 HTTP/1.1 404 Not Found)
        referenced from: http://www.seps.org/cvoracle/

Sometimes the friendly folks at web sites reorganize their sites, or just plain remove content. This is your cue to go see whether the content still exists somewhere else, or to give up and remove the link from your site.

           Missing file: http://www.seps.org/oracle/oracle.admin/Admin/data (Address 198.68.20.125 HTTP/1.1 401 Authorization Required)
        referenced from: http://www.seps.org/cvoracle/expert.html

Some pages require passwords. This program has no way to specify passwords, so this would be a good entry for the exception file.

           Missing file: http://www.seps.org/oracle/oracle.archive/Expert/merrillg/Life_Science.Health/1999.10 Timeout
        referenced from: http://www.seps.org/oracle/oracle.archive/Expert/merrillg/Life_Science.Health/1999.11

Sometimes a server or the network may be so slow that the program gives up. This might be a real problem, or a transient one. Keep track of these links and retest them later to see whether the page responds.
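If the reports get long, it can help to sort the entries by kind. The Python sketch below parses entries shaped like the samples above. The layout it expects is inferred only from those samples, not from any www.scan specification, so treat the regular expression as an assumption. The names ENTRY and parse_entries are hypothetical.

```python
import re

# Assumed layout: a "Missing file:" line with a URL and a detail
# (parenthesized text or the word Timeout), then a "referenced from:" line.
ENTRY = re.compile(
    r"^\s*Missing file:\s+(?P<url>\S+)\s+(?P<detail>.+?)\s*\n"
    r"\s*referenced from:\s+(?P<referrer>\S+)",
    re.MULTILINE,
)


def parse_entries(text: str) -> list:
    """Return one dict per reported link found in text."""
    entries = []
    for match in ENTRY.finditer(text):
        detail = match.group("detail").strip("()")
        status = re.search(r"HTTP/\d\.\d\s+(\d{3})", detail)
        moved = re.search(r"new location at (\S+)", detail)
        entries.append({
            "url": match.group("url"),
            "referrer": match.group("referrer"),
            "status": int(status.group(1)) if status else None,
            "new_location": moved.group(1) if moved else None,
            "timeout": detail == "Timeout",
        })
    return entries


SAMPLE = """\
           Missing file: http://www.mos.org/mos/sbm/sciencemail.html (Address 204.164.199.41 HTTP/1.1 404 Not Found)
        referenced from: http://www.seps.org/cvoracle/

           Missing file: http://microsoft.com/ie/download/ (new location at /windows/ie/download/)
        referenced from: http://www.seps.org/tips.htm
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["status"], entry["new_location"], entry["url"])
```

Entries with a status of 404 are usually the ones to fix first, while timeouts and 500 errors are worth retesting before you change anything.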

Trailer

At the end of the message, you will see a message which looks like:
2132 entries in the search queue at the worst case.
13668 checked, 13137 local, 531 remote.

                www.scan v0.05 regan@ao.com
                http://mordred.ao.com/www.scan/
This indicates how many queue entries were used (see --depth), the number of pages checked, etc. It also indicates that it was the www.scan program which generated the email message, and the web page for the program should you want to look for a current version.
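If you archive the emails, the trailer numbers can be pulled out to watch how the site grows between runs. The following Python sketch assumes the trailer wording matches the sample shown above. The function name parse_trailer and the dictionary keys are hypothetical. In the sample, local plus remote equals checked (13137 + 531 = 13668), but that is an observation from this one example, not a documented guarantee.

```python
import re


def parse_trailer(text: str):
    """Extract the queue and page counts, or return None if not found."""
    queue = re.search(r"(\d+) entries in the search queue", text)
    counts = re.search(r"(\d+) checked, (\d+) local, (\d+) remote", text)
    if not (queue and counts):
        return None
    checked, local, remote = (int(n) for n in counts.groups())
    return {
        "queue_peak": int(queue.group(1)),
        "checked": checked,
        "local": local,
        "remote": remote,
    }


TRAILER = """\
2132 entries in the search queue at the worst case.
13668 checked, 13137 local, 531 remote.
"""

if __name__ == "__main__":
    stats = parse_trailer(TRAILER)
    print(stats)
    print(stats["local"] + stats["remote"] == stats["checked"])
```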

I hope this answers most questions in using this program.