Shepherd: the new gathering daemon
==================================

Section 1: General ideas
------------------------

The new gathering daemon, Shepherd, is based on a different principle
than the old one from the free version.  It runs in cycles.  In the
beginning of every cycle, it plans what to gather, and then it
distributes the requests over all reaper daemons.  After the time period
has elapsed, all records are sorted, added to the database, and newly
discovered URL's are added to the known ones.

Server delays:

Shepherd ensures that no server is contacted too often.  A typical delay
is 10 seconds between the requests, however you can override this to a
smaller value in filters.  This is typically necessary for huge web
hosting sites.  Shepherd tries to refresh every document at least once
per month, however some documents (that are modified more often and/or
that have high page rank) are refreshed twice or four times more often.

Limitations on downloaded documents:

The disc space and network bandwidth is limited and there is a
potentially infinite number of web pages.  Shepherd takes page ranks
computed by the indexer and sorts all known URL's according to them.
The page ranks are computed for both downloaded documents and for URL's
the links point to.  The best, say, 80M documents are marked for
gathering, and the URL's of the next, say, 100M documents are just kept
in the database for a possible resurrection.  The worst URL's are
forgotten.  There are, however, also other limitations.  We limit the
maximum number of documents from one site (a protection again infinite
sites and again link farms), from one server (otherwise we cannot
refresh them all on time due to the minimum server delay), and also the
number of PDF files, images, etc.

Distributed gathering:

The gathering process is distributed.  The main server, Shepherd, uses
services of several reaper server, ReapD.  If no ReapD is available,
Shepherd downloads the documents in its own sub-processes, like the
old gatherer.  The reapers download the desired documents, parse them,
and compress them.  The communication is robust and if one of the
ReapD's crash, only its newly downloaded documents are lost.  In such a
case, Shepherd returns all the URL's into the main queue.

Remote indexing:

The indexation is an asynchronous process.  The indexer runs on a
dedicated machine and it downloads the documents via a gigabyte network
from Shepherd.  At the same time, Shepherd continues with gathering new
documents.  It is even possible to run multiple indexers and backupers
at the same time.  For each such an incoming connection, Shepherd
records a "database state", which is a snapshot of all active buckets at
the time.  After the state is unlocked, it is deleted.

Database cleanup:

If there is no locked state, it is possible to perform a cleanup
process.  A cleanup is run automatically if the database size exceeds a
given limit, or if a long time has elapsed from the last cleanup.
During this process, the least interesting buckets (according to their
pagerank) are deleted and the database is shaken down.

Site equivalences:

Several sites are just mirrors, e.g. www.domain.com and domain.com.
Shepherd uses a semi-automatical detection of equivalent sites.  Once
per time, it is necessary to gather root documents from all sites into a
special empty index (the old gatherer is more suitable for this task).
After running the indexer, a special script needs to be run, which
generates a text file with equivalence classes.  This file is then
stored into cf/url-equiv, where it is read be Shepherd.  When Shepherds
discovers a URL, it internally normalizes it according to this table and
only then it performs a lookup to decide whether the URL is known and
should be downloaded or not.

For every site, one equivalence is implicitly assumed --- the above
mentioned www.subdomain.domain.com = subdomain.domain.com.  All records
in cf/url-equiv MUST be recorded with stripped leading "www.".  Please
remember this is you manually add some records.

URL footprints and queue keys:

URL's are too long and of variable length, hence the whole gatherer
only works with their so-called footprints.  An URL is split into two
parts: site and path.  For each of them, an MD5 hash is computed.  We
take 64 bits from both of them and concatenate them into a 128-bit final
hash.  The probability of a collision is negligible.

A site (e.g. some.domain.com) is identified by the first half of its footprint.
A physical server is identified by its queue key (qkey).  Note that some
servers like huge hosting sites can host several sites.  If the server has
exactly one IP address, then qkey equal it.  If the server has more IP
addresses (e.g. a round-robin DNS is used), one of them is deterministically
selected.  If the IP address has not yet been resolved or it does not exist, a
qkey of the form 7fxxxxxx is used.

Limits on maximum number of documents should be set for sites and minimum
server delays should be set for queue keys.

Section 2: Data structures
--------------------------

1. Bucket file db/objects

It is basically the same as the bucket file of the free version.  Every
bucket can be stored in an uncompressed and compressed form.  The
gatherer decides between these two forms based on the achieved
compression level.  The decompression is transparent for all modules.
The buckets can be of several types:

- v2 bucket generated by the old gatherer: uncompressed, body only,
  attributes in a text format.  Note that the gatherer from newer
  version of Sherlock is capable of using the new formats too.
- v3.0 bucket generated by Shepherd: uncompressed, text format, but the
  header and the body are separated by a blank line.  Shepherd modules
  save time by only parsing the header.
- v3.3 bucket: uncompressed, with header and body separated, and with
  attributes stored in a new zero-copy optimized format (length of the
  attribute, contents, and the type of the attribute) that can be parsed
  in place
- v3.3 bucket with compression: like v3.3, but the body can be
  compressed using our very fast compression library LiZaRd based on LZO

2. States db/state/*

Every state is in its own directory and it consists of the following
files:

a. Control: specifies if it is a closed state or a work-in-progress
   state.  It is used for crash recovery.  Only one state (the newest
   one) can be opened.

b. Index: a list of all known signatures, each has a link to a bucket,
   pagerank, retry counter, flags, some meta-informations, and a type:

    - sleeping: know about the URL, but don't want to gather it
    - ungathererd: know about the URL and want to gather it
    - gathered
    - error: tried to gather, but didn't succeed
    - robot file: and it can be ungathered, gathered, or an error
    - queue key: it is just a note on what is the queue key of the site,
      we work with it like with documents

  The ordering is not defined.

c. Site list: a list of all known sites.  It is maintained, among
   others, for the filters.  One often needs to set values locally for a
   site (server delay, maximum number of documents, ...).  It is also
   contains the normalized footprint of the site and its queue key.

d. Contributions: a list of all URL's linked from the newly gathered
   pages with their computed footprints.

e. Journal: all changes of the bucket file against the last state.

f. Checkpoints: during gathering, new buckets and journal records are
   created.  If an error has occured, we can rollback to some previous
   consistent state without dumping all documents gathered in the
   meantime.  To be able to do this, we record a checkpoint once upon
   time that contains the current file length of the bucket file,
   contribution, and the journal.  The crash recovery just truncates
   these files.

3. Equivalence list cf/url-equiv

In the future, it will be maintained automatically in the co-operation
with the indexer.  However, now we use the old cf/url-equiv.

4. Feedback from the indexer

The indexer uploads it back to the gatherer and, in the beginning of the
next cycle, Shepherd updates its weights according to it and deletes the
feedback file.

Section 3: How the gatherer works
---------------------------------

There is always running one main process replacing the good old
scheduler.  It manages the states and lock files, it scans the request
queue, accepts incoming network connection from the indexer, and runs
the right modules at the right time.  These modules are independent
programs doing simple independent things and they are usually called in
the following order.  The scheduler always first creates a hardlink of
the old state and works with the new "copy".  All modules take into
account that the index and sites can be hardlinks, hence they never
overwrite them, but they instead create a temporary file and rename it
afterwards.

1. Plan -- planning of the gathering phase

It browses the state, decides what to gather (either some ungathered
document, or a gathered one that is sufficiently old), assignes
priorities (according to the age and weight), and chooses as many best
ones as could be gathered in the next phase.  It takes into account the
performance of the gatherer and the estimated number of documents that
can be gathered from one queue key.  It also maintains average download
times from each sites so that slow servers won't block the whole
gatherer.

2. Reap -- the gathering phase

It downloads the plan (which fits into the RAM) and gathers the
documents using all ReapD.  It stops after a given time, when the plan
is finished, or when it receives a signal from the master Shepherd.
Shepherd can send such a signal when the size of the bucket file exceeds
the limit (then a cleanup is necessary) or when someone asks for an
explicit shutdown.

Once upon time, the database is checkpoints, i.e. all files are synced
and the file length are written into a special checkpoint log.  If
shep-reap crashes, it is possible to rollback to the last checkpoint
using shep-recover.

3. Cork -- "corking" the new buckets

It combines the old state and the contents of the journal.  Beware that
it ignores the conttrib file.  A new state is created and it can be
closed.  It also updates several site parameters, e.g. the average
download time and the queue key.

Note that (1)--(3) can be repeated if you preserve the contrib file.

4. Merge -- adding new URL's

It takes a state with a contrib file (typically a hardlinked copy of
(3)) and combines old URL's from the index with the new ones from
contrib.  It merges according to the normalized footprints.  If we
already know an equivalent URL, the new one is ignored (the only
exceptions are robots.txt and the pages needed for maintaining the
equivalence classes).

It also deletes error and duplicates (i.e. pages with normalized
footprint equal to some other page).  For every new URL, a new record in
the site list is created, however we do not create new (thin) buckets
yet; we just record their position in the contrib file.

5. Equiv -- computing equivalence classes

We would like to assign every site its normalized footprint according to
the feedback from the indexer.  However, at the moment we just read
cf/url-equiv here.

6. Select -- selecting interesting URL's

This module decides which URL are worth gathering, which we should just
remember (sleeping URL's), and which should be dumped.  It also sets
limits for the individual sites.  It also removes from the site list the
sites without URL's (e.g. if we filter out all documents or if we have
just downloaded an error page and the link to it has disappeared).

7. Record -- recording new URL's into the database

For each new URL that has passed through select, a new (thin) bucket is
created.  Such a bucket only contains a header and no body.  It also
deletes queue keys for sites deleted by shep-select, and too old queue
keys that signalise that the server does not exist.  This phase creates
a new consistent state.

8. Cleanup -- cleaning the database and shaking down the bucket file

It is run once per a few phases, typically when the bucket file is too
big.  The following processes are performed:

  - shaking down the bucket file
  - updating the bucket numbers in the index file
  - deleting buckets the state does not refer to
  - deleting URL's filtered out by the filters
  - renumbering sections
  - computing the average bucket size and similar parameters so that one
    can watch how theit estimates in the configuration match the reality
  - sleeping buckets with downloaded data (i.e. they were recently
    converted from a gathered state into a sleeping state) are trimmed
  - checking whether the state does not refer an unexisting bucket

There are also the following additional modules that are not a part of
the standard gathering cycle.  They are run when needed:

A. Export -- preparation for indexing

It creates a list of buckets that are a part of the current state for
the indexer (see Indexer.Source=indexed:<buckets>:<exports>).

B. Feedback

It reads the feedback from the indexer (generated by feedback-gath) and
updates the weights and the flag "no longer referenced" in the state.

C. Dump -- manual dumper of all data structures

D. The `shep' utility -- manual configuration

It can insert new URL's, delete old ones, request a refresh, etc.  Some
operations are performed immediately and the other ones are recorded
into the contrib file, where they are read by Merge.

E. init -- creating a new empty datase and an ampty state

F. Recover -- crash recovery

It tries to repair a damaged database.  At the moment, it can happen for
the following reasons:

  a. the index file is lost
  b. shep-cleanup has been interrupted (the index file is correct, but
     the bucket numbers is mismatched)
  c. shep-reap has been interrupted (everything is all right except for
     that ther can be a garbage at the end of some files)

In the cases (a) and (b), shep-recover reads the whole bucket file,
creates corresponding index items to all buckets, and updated the
existing index.  In the case (c), it is sufficient to rollback to the
last checkpoint.

G. Mirror -- backup of the database

It connects via the network in the same way as the indexer does, and
downloads a copy of the database (the bucket file and other files).
Beware that it does not synchronize the bucket numbers, hence if one
wants to use the bucket, he must manually run shep-recover like after an
unsuccessfull cleanup.

H. Upgrade -- an upgrade of the database to the current format
