This file is an unsorted heap of various hints which don't fit into any other
documentation file; it also serves as a sort of FAQ.
---------------------------------------------------------------------------

General switches
~~~~~~~~~~~~~~~~
All programs (including scripts like `indexer') accept these switches:

	-C<file>	Load config file <file> instead of the default
			cf/sherlock. Must precede all other switches.
			If used multiple times, the configs are combined,
			the rightmost one having the highest priority.
	-S<section>.<var>=<value>  Change configuration variable
			(executed after loading the config files)
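The precedence rules can be pictured as a simple dictionary merge. This is
only an illustrative sketch, not Sherlock code, and the section/variable
names in it are made up:

```python
# Sketch of the precedence rules: later -C files override earlier ones,
# and -S overrides are applied last, after all config files are loaded.
def effective_config(config_files, s_overrides):
    """config_files: list of dicts, leftmost -C file first;
    s_overrides: dict built from all -S switches."""
    result = {}
    for cfg in config_files:        # rightmost file wins on conflicts
        result.update(cfg)
    result.update(s_overrides)      # -S always wins
    return result

# Hypothetical variables, for illustration only:
base  = {"Indexer.Threads": "4", "Log.File": "log/sherlock"}
local = {"Indexer.Threads": "8"}
print(effective_config([base, local], {"Log.File": "log/debug"}))
```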

There is a cf/trace config file which includes the default cf/sherlock
and enables all tracing options, so -Ccf/trace is very useful when you are
trying to solve mysteries.

Logfile names
~~~~~~~~~~~~~
Names of all log files can contain strftime() conversion specifiers which
get replaced according to the current date and time. But beware: in some
cases the switching of log files is delayed to avoid splitting related
entries over multiple files (e.g., the scheduler avoids splitting slots).
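For example, a name pattern with strftime() specifiers expands like this
(a plain Python illustration of the same library call; the pattern itself
is just an example):

```python
import time

# A log file name such as "log/gatherd-%Y%m%d" is expanded via strftime().
pattern = "log/gatherd-%Y%m%d"
print(time.strftime(pattern, time.gmtime(0)))   # -> log/gatherd-19700101
```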

Log format
~~~~~~~~~~
All modules use a unified log entry format:
I 2001-09-23 19:33:52 [scheduler] Waiting 98 seconds for end of the cycle
^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
class  timestamp       program name      the message itself

The classes are:

	D	debugging messages
	I	informational messages (default class)
	W	warnings
	E	errors
	i,w,e	the same, but caused by events external to Sherlock
	!	fatal errors (the program has died)

Program names are omitted when the entry occurs in a log file dedicated
to that particular program (e.g., gatherd messages in log/gatherd). If there
are multiple threads, the program name also contains the thread PID.
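Parsing the unified format is straightforward; the following is a sketch
(not part of Sherlock) which handles both the bracketed program name and
its absence in per-program log files:

```python
import re

# Matches the unified log entry format described above.  The program
# name in square brackets may be absent in dedicated log files.
LOG_RE = re.compile(
    r"^(?P<cls>[DIWEiwe!]) "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:\[(?P<prog>[^\]]+)\] )?"
    r"(?P<msg>.*)$"
)

m = LOG_RE.match("I 2001-09-23 19:33:52 [scheduler] "
                 "Waiting 98 seconds for end of the cycle")
print(m.group("cls"), m.group("prog"), m.group("msg"))
# -> I scheduler Waiting 98 seconds for end of the cycle
```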

Gatherd log lines
~~~~~~~~~~~~~~~~~
Logging of processed documents by gatherd is somewhat obscure, because we want
to show the maximum possible information while avoiding gigabytes of disk
space wasted on logs. It consists of two types of entries: those generated
by gatherd subprocesses (they carry a PID instead of the program name in the
log entry description above) and those printed by gatherd itself (they contain
no square bracket part). Among the latter, the most important are the entries
generated for each URL processed:

I 2001-09-23 19:33:43 http://www.ucw.cz/: 0003 Not modified [4762*] d=13

Read: <class> <timestamp> <URL>: <status-code> <message> "["<pid><refresh><flag>"]" d=<delay>

status-code	object status code as defined in doc/objects
pid		PID of the thread which was processing the URL (so that you can
		match log entries by that thread with the particular URL)
refresh		"*" if we were refreshing the URL, "" if no previous contents
		were available
flag		"+" if a new version of the document (or an error marker)
		has been stored, "=" if not changed since last version,
		"!" if it's a duplicate
delay		how long the URL spent in the queue

How to infer changes in the number of gathered documents from the flags:

  flag == "+" or "!"		increment number of documents if status-code
				says it wasn't an error
				if refresh == "*", decrement number of documents
  flag == "=" or ""		leave the number of documents as it is
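The rules above can be sketched in Python; the regex below is inferred
from the example entry (real entries may vary), and whether a particular
status code counts as an error is left to the caller, since the codes are
defined in doc/objects:

```python
import re

# Matches the per-URL gatherd entries described above.
URL_RE = re.compile(
    r"^(?P<cls>[DIWEiwe!]) "
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<url>\S+): (?P<status>\d{4}) (?P<msg>.*?) "
    r"\[(?P<pid>\d+)(?P<refresh>\*?)(?P<flag>[+=!]?)\] d=(?P<delay>\d+)$"
)

def doc_count_delta(flag, refresh, is_error):
    """Change in the number of gathered documents implied by one entry,
    following the rules above."""
    delta = 0
    if flag in ("+", "!"):
        if not is_error:
            delta += 1              # a new document (or error marker) stored
        if refresh == "*":
            delta -= 1              # it replaced a previous version
    return delta                    # "=" or "": no change

m = URL_RE.match("I 2001-09-23 19:33:43 http://www.ucw.cz/: "
                 "0003 Not modified [4762*] d=13")
print(m.group("status"), m.group("pid"),
      doc_count_delta(m.group("flag"), m.group("refresh"), False))
# -> 0003 4762 0
```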

Gatherd status lines
~~~~~~~~~~~~~~~~~~~~
On Linux, gatherd alters the command line displayed by `ps' to show the
current status of each thread (<status> <URL>). The states are as follows:

	R	resolving
	D	downloading
	P	parsing
	S	storing to bucket file & sending description to gatherd master

The checker mode
~~~~~~~~~~~~~~~~
Sherlock is also able to work as a web consistency checker, checking for
broken links and other errors and validating HTML using an external validator
(I like the one from the Web Design Group at http://www.htmlhelp.org/ best).

Just edit cf/checker* to make it suit your needs and run normal gathering
with this configuration (add -Ccf/checker to all commands you run); then
use the `checker' script to summarize the results.

If you don't turn the ignore_text switch on in cf/checker-filter, you can
use the gathered data for indexing as well.

Load balancing
~~~~~~~~~~~~~~
You can easily run multiple search servers, each with its own copy of the
index, and let the front-end balance the load among them. Just call the
send-index script from cf/timetable and use scontrol instead of gcontrol
on the search server machines.

Indexing databases
~~~~~~~~~~~~~~~~~~
If you want to index the contents of a database instead of a collection
of documents specified by URLs, it's better to write a simple program
which will bypass the gatherer and generate the objects directly.
The format of the objects is described in doc/objects; buckettool -i
is ideal for putting them into the bucket file. The minimal configuration
in the `bare' directory can also be handy.
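Such a generator can be very small. The sketch below is hypothetical: the
attribute letters (U for the URL, X for the body text) and the blank-line
separator are invented for illustration, so look up the real attribute
names and object layout in doc/objects before using anything like this:

```python
# Hypothetical object generator for direct indexing of database rows.
def make_object(url, text):
    # One attribute per line; letters U and X are placeholders only.
    return "U" + url + "\n" + "X" + text + "\n"

rows = [("db://products/1", "Red bicycle"),
        ("db://products/2", "Blue kettle")]

objects = "\n".join(make_object(url, text) for url, text in rows)
print(objects)

# Save the output to a file and feed it to the bucket file, e.g.:
#   bin/buckettool -i <objects.txt
```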
