Sherlock 3.x -- introduction and installation guide
===================================================

0. Prerequisites
~~~~~~~~~~~~~~~~
To build and run Sherlock, you need:

  o  The GNU toolchain (gcc 3.x or newer is required)
  o  GNU bison and flex
  o  GNU bash 2.0 or newer
  o  Perl (any reasonably new version; 5.6.1 is known to work, but Watson
     does not work with 5.8 or newer)
  o  ZLib (newer than stone-age)
  o  Linux (porting to other systems should be easy)
  o  GNUPlot (only if you want the Watson utility to generate nice graphs
     with various statistics; note that you need the development version,
     which is unfortunately missing from most Linux distributions)
  o  ImageMagick >= 5.5.7.9 if you build the image parser
     (the older versions have fatal bugs)

If you are installing Sherlock as root, the best and most secure way is to
create a dedicated user account, su to that user for the whole installation,
and also run all programs as that user.

1.a. Compiling
~~~~~~~~~~~~~~
First of all, edit the `custom/config.mk' file and select the feature set you
want. The defaults should be fine for Linux/i386; for 64-bit machines, you will
also need to edit lib/config.h and fix the type sizes (I hope to add a
configure script taking care of all this soon). Then run

	make

Now, you've built a working installation in the "run" directory and you can play
there.

You can also run "make install" to create the same tree in the INSTALL_DIR
specified in config.mk; if an installation already exists there, make install
will replace the binaries and show differences in the config files, prompting
you before replacing them.

1.b. Installing as a package
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have installed Sherlock as a (Debian) binary package, you will need to
instantiate it first.  (It is possible to run Sherlock in many instances on one
computer at the same time.)  Each instance dwells in its private directory
subtree.  To create a new instance, just run

	holmes-instantiate <target_directory>

This will create everything and print further instructions.  The same command
can also be used for upgrading an instance -- if you have upgraded the binary
package, you still need to run

	holmes-instantiate -u <target_directory>

for all instances running on the computer.  All daemons of the upgraded
instance must be stopped at that time, of course.  Configuration changes are
preserved.  If you do not alter the original configuration files in the
cf.orig directory, the configuration files can be upgraded automatically, too.
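
If you run several instances, the upgrade has to be repeated for each of
them. A small shell sketch (the /data/sherlock/* layout is only an example --
substitute your actual instance directories):

```
# Upgrade all instances after upgrading the binary package.
# Remember to stop the daemons of each instance first!
for dir in /data/sherlock/*; do
	holmes-instantiate -u "$dir"
done
```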

2. A brief look around
~~~~~~~~~~~~~~~~~~~~~~
Now, take a look at the run tree you've built. It contains several
subdirectories with commands, configuration and data files:

  bin/		programs and utilities
  cache/	image cache (used by Watson)
  cf/		configuration files
  db/		database files of the gatherer
  index/	main index
  lib/		libraries used by the programs
  lock/		lock files, pid files and similar stuff
  log/		log files
  tmp/		temporary files

Whenever you run any Sherlock command, make sure your current directory
is the root of the run tree -- all files are accessed relative to it,
allowing you to run multiple instances of Sherlock on the same machine.

The whole system consists of four principal components:

  gatherer	gathers documents (e.g., by spidering the WWW), parses
		them and stores the results to the gatherer database.
		Supervised by bin/scheduler which, according to cf/timetable,
		runs bin/gatherd (the heart of the gatherer) and bin/expire
		(takes care of refreshing and expiring old pages).

  indexer	takes the gatherer database and generates an index from it.
		Supervised by bin/indexer which is usually run by the scheduler.

  search server	answers queries according to the index (the gatherer
		database is no longer needed). Contained in bin/sherlockd,
		controlled by bin/scontrol which is usually run by the
		scheduler. Talks with clients via TCP, the protocol is
		described in doc/search.

  front-end	is just a user interface to the search server: e.g.,
		it can be a CGI script passing user queries to the
		search server and formatting the results in HTML.

This file describes the most common use of Sherlock -- indexing of web
pages. The system itself is much more powerful and, due to its modular
structure, it can easily be used or adapted for a large number of other
purposes, but that is outside the scope of this brief introduction. If you
suspect that any indexing and searching task could be accomplished using
Sherlock (and it's quite probable), just ask the authors for more hints.

3. Configuring
~~~~~~~~~~~~~~
Before you run Sherlock, you need to configure it.

All parts of Sherlock share a single config file, cf/sherlock;
each module has its own section there. Please read the whole
file and follow the comments, but here are a couple of things
you really should set to avoid later surprises:

  o  HTTP.From (i.e., item From in section HTTP) -- set it to your
     address, so the webmasters of the servers you index will know
     who's spidering them.
  o  GatherD.MinServerDelay -- this one sets the minimum interval
     between two accesses to the same server; if you're indexing
     your own server, you can safely decrease it to a few seconds.
  o  Section Expire -- here you set how often the documents should
     be checked to see if they have changed.
  o  Search.Allow -- if you want to run the front-end on a different
     machine, you need to enable external connections here.
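
As an example, after editing cf/sherlock you might end up with settings
equivalent to the following (the address, delay, and network below are
placeholders; the exact file syntax is described by the comments in
cf/sherlock itself):

```
HTTP.From		webmaster@example.com	# who to contact about the robot
GatherD.MinServerDelay	5			# seconds between hits on one server
Search.Allow		192.168.1.0/24		# front-end machines allowed to connect
```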

Also, you need to specify which documents should be gathered and indexed
and which should be left out (usually, you index your own domain and
ignore everything outside it). This is described by a so-called filter
(cf/filter), a program in a simple programming language (see doc/filter
for a complete definition) describing the exact rules, which is shared
by all components of Sherlock. In most cases, it should be sufficient to
take the default filter and edit the domain names there.
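
In pseudocode, the intent of a typical default filter is simply this (the
real syntax is defined in doc/filter, and example.com stands for your own
domain):

```
# Pseudocode only -- see doc/filter for the actual filter language.
if the URL belongs to www.example.com:
	accept the document
else:
	reject it
```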

When running any Sherlock command, you can select a different
configuration file or override any settings with the -C and -S switches;
see doc/hints for details.

If you want to log as much information as possible, use -Ccf/trace
which enables all available logging and tracing options.

4. Initializing
~~~~~~~~~~~~~~~
During normal operation, everything is controlled by the scheduler:
the gatherer automatically follows links to discover new documents,
the expirer takes care of refreshing the already known ones, the indexer
is run periodically to create indices, and the search server is run to
answer user queries. However, to start the whole cycle from scratch, you
need to take a couple of special steps:

  o  Make sure the "db" directory is empty.
  o  Run "bin/gc -i" and feed the starting set of URL's (usually,
     your home page is enough, the gatherer will discover everything
     else by following links) to its standard input. This also sets up
     the gatherer database.
  o  (Optional) Run "bin/gatherd" and let the gatherer work for a couple
     of minutes. In the meantime, watch the gatherer log (log/gatherd*)
     to see if everything runs well. Then just press Ctrl-C and wait
     for gatherd to terminate.
  o  Run "bin/indexer" to build an initial index (which will be empty
     if you left out the previous step).
  o  Run "bin/scontrol start" to test sherlockd.
  o  Run "bin/gcontrol start" to start the gatherer scheduler.
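
Put together, the bootstrap might look like this in a shell session (the
path and URL are examples; run everything from the root of the run tree):

```
cd ~sherlock/run
echo 'http://www.example.com/' | bin/gc -i	# set up the database, feed start URL's
bin/indexer					# build the initial index
bin/scontrol start				# test sherlockd
bin/gcontrol start				# start the gatherer scheduler
```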

From now on, everything should run automatically.

After something is gathered and the index generated, you can try to send
a query either by telnetting to the search server's port (by default it's
8192) or by using bin/query.

If you want to start Sherlock on system boot and your system uses
System V-like init scripts (and assuming you've compiled everything
as user `sherlock' in its home directory), just use gcontrol as the
init script; otherwise, use the following shell fragment with the
appropriate user name and path:

	export SH_USER=sherlock
	export SH_HOME=~sherlock/run
	exec "$SH_HOME/bin/gcontrol" "$@"

You can also use `gcontrol start <target>' if you want to perform some
special action upon scheduler startup (e.g., if you want to force
immediate regeneration of the index), see cf/timetable for a list
of such actions.

The control scripts also have built-in log rotation facilities -- just
run `bin/scontrol cron' daily and set the SKeeper.RotateLogs switch.
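
For the daily run, a crontab entry of the following shape would do (the
time and paths are just examples -- adjust them to your layout):

```
# crontab of the `sherlock' user -- rotate logs every night at 3:00
0 3 * * *	cd /home/sherlock/run && bin/scontrol cron
```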

5. Useful utilities
~~~~~~~~~~~~~~~~~~~
There are some more programs you can make use of when administering
Sherlock:

gcontrol	master control script of the gatherer scheduler
scontrol	master control script of the search server,
		usually run automatically by the scheduler
indexer		master control script of the indexer -- run to
		generate a new index; called automatically by the scheduler
gc		control the gatherer (view and modify the list of
		known URL's and the queue, insert URL's manually,
		and perform a lot of other tasks). The gatherer
		must be stopped during such manipulations.
query		send test queries to the search server
check-sherlockd	Netsaint plug-in for checking search servers
gtest		see how the gatherer processes a given URL,
		usually run with -Ccf/trace
gbatch		gather a list of URL's given on its standard input
		without following any links. Ideal for indexing
		document collections available locally.
analyse-log,	process gatherer logs and draw graphs with various
plot-log	statistics from them -- currently under development,
		not recommended for general use

These utilities are useful when debugging Sherlock or trying
to understand how it works:

buckettool	manipulate contents of bucket files (the large files
		like db/objects serving as bags with lots of small
		objects)
cs2cs		simple converter of character sets
db-tool		manipulate contents of database files (db/*.db)
idxdump		dump various parts of the index in human-readable form
objdump		dump object files and convert data streams to a
		human-readable form

6. More documentation
~~~~~~~~~~~~~~~~~~~~~
doc/file-formats	formats of all data files
doc/filter		the filter language
doc/hints		hints and frequently asked questions
doc/indexer		roadmap of indexer modules
doc/objects		format of documents and their attributes
doc/search		communication with the search server

7. Example front-end
~~~~~~~~~~~~~~~~~~~~
There is a simple CGI front-end in the free/front-end directory. It's written
in Perl using a couple of Perl modules (see lib/perl) for common CGI tasks
and communication with the search server. It should serve as an example
of how to use the modules or even as a skeleton of your own front-end.

If your documents contain non-ASCII characters, the front-end shows
them in the UTF-8 encoding. If you want anything else, just use an
appropriate web server module (mod_charset, mod_czech etc.) to convert
the output.

The front-end expects queries of this type:

	aleph beth gimel	any of these words
	"aleph beth"		a phrase
	aleph +beth -gimel	`beth' is mandatory, `gimel' forbidden
	? "aleph" or "beth"	use `?' to enter any full query in the
				native language of sherlockd (cf. doc/search)
