Parameters
~~~~~~~~~~
Single struct index_params

Fingerprints
~~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	id

LabelsByID
~~~~~~~~~~
Sequence of:	u32	orig_url_id
		Sequence of C strings
		byte	0

Attributes
~~~~~~~~~~
Array of struct card_attr[id]

Notes
~~~~~
Array of struct card_note[id]

Checksums
~~~~~~~~~
Sequence of:	u32	checksum[4]
		u32	id

Links
~~~~~
Sequence of:	u32	fingerprint[3]		<-- link destination
		u32	id			<-- link source  (top 2 bits are type: 0=ref, 1=redirect, 2=frame, 3=image)

LinkGraph
~~~~~~~~~
Sequence of:	u32	vertex
		u16	degree
		u32	dest_vertex[degree]	<-- (top 2 bits are link type)

URLList
~~~~~~~
Array of strings[id]

Labels
~~~~~~
Sequence of:	u32	merged_id
		u32	url_id
		u32	redir_id
		u32	count
		byte	labels[count]

LinkTexts
~~~~~~~~~
Sequence of:	u32	srcid
		u32	fingerprint[3]
		u16	length
		byte	text[length]		<-- including category switch

Signatures
~~~~~~~~~~
Sequence of:	u32	id
		u32	signature[Matcher.Signatures]

Merges
~~~~~~
Array of u32[id]

In fact, this array is two arrays overlaid on each other -- a redirect
array (for cards with CARD_FLAG_EMPTY set, which are either completely
empty or redirects) and a merge array (for all other cards).

The merge array has two uses: before the merger is run, it contains the parent
of the card in a Tarjan union-find tree, or ~0 for root nodes. The merger rewrites
merges[id] to the representative of id's equivalence class.

The redirect array contains the ultimate destination of the redirect (after
following the whole redirect chain), or ~0 if the redirect leads nowhere.
The backlinker uses Tarjan trees for this purpose internally.

LexRaw, LexByFreq, LexOrder
~~~~~~~~~~~~~~~~~~~~~~~~~~~
u32 word_count
Array [word_count] of:
		u32	id			<-- lowest 3 bits are word class
		u32	frequency
		context_t context_class
		byte	length
		byte	word[length]

LexRaw is in no particular order
LexByFreq is sorted by decreasing frequency
LexOrder is sorted by id; the first item has id 8 and subsequent ids increase by 8

WordIndex
~~~~~~~~~
Sequence of:	u32	wordid
		u32	size			<-- byte size of the following sequence
		Sequence {
		  u32	oid
		  u16	count
		  u16	ref[count]
		}

StringIndex
~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	size			<-- byte size of the following sequence
		Sequence {
		  u32	oid
		  u16	count
		  u16	ref[count]
		}

References
~~~~~~~~~~
Sequence of:	Sequence {
			u16	oidhi,oidlo
			u16	num_refs
			u16	ref[num_refs]
		}
		u32	0

ref:		either	0ttt pppp pppp pppp  for text (t=type, p=position)
		or	1ttt tppp pppp ppww  for meta-information (t=type, w=weight, p=position)

LexWords
~~~~~~~~
u32 word_count
Sequence of:	fpos	ref_pos
		u16	chainlen		<-- reference chain length in 4K pages
		byte	class
		byte	freq			<-- logarithmic frequency scaled to [0,255]
		context_t context_class
		byte	length
		byte	word[length]

LexWords is in the same order as LexOrder, and word ids are inherited from it as well

Lexicon
~~~~~~~
u32 word_count
u32 cplx_count
Array [word_count+cplx_count] of:
		/*
		 * Words come first, complexes next
		 * Words are sorted by length (in characters, not bytes), then
		 * lexicographically without accents and finally
		 * lexicographically with accents.
		 * Complexes are sorted by root and then by context slot
		 */
		fpos	ref_pos			<-- the same records as in LexWords
		u16	chainlen
		byte	class
		byte	freq
		context_t context_slot
		byte	length
		byte	word[length]

StringMap
~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		fpos	ref_pos
The last item has fingerprint 0xff[12] and points just past the last reference.

StringHash
~~~~~~~~~~
Array of:	u32	stringmap_pos		<-- measured in string map items

StemsOrder
~~~~~~~~~~
Sequence of:	u32	stemmer_id	<-- the same as in Stems
		u32	language_mask
		Sequence of:
			u32	stem_word_id
			u32	word_id
		u32	0xffffffff

Stems
~~~~~
Sequence of:	u32	stemmer_id	<-- 0x80000000 for synonyms, 0x80000001 for reverse synonyms
		u32	language_mask
		Sequence of:  /* sorted by stem_word_id */
			u32	stem_word_id | 0x80000000  <-- for type 0x80000000 stem_word_id is actually class ID
			u32	word_id[]		   <-- for type 0x80000001 word_id is actually class ID
		u32	0xffffffff

Keywords
~~~~~~~~
Text file with lines:	[UFC][0-9a-f]+ [0-9a-f]+ [0123] text...
		byte	type		<-- 'U'=URL, 'F'=File, 'C'=Catalog
		u32	redir_id
		u32	url_id
		byte	weight
		byte	text[]

Output file of the fingerprint resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	u32	source_id		<-- sorted on (source_id & 0x3fffffff)
		u32	dest_id			<-- id of the document, possibly OR-ed with 0x20000000 (for documents that were not downloaded)

Feedback from indexer to gatherer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	byte	footprint[16]
		u32	card_id			<-- id of the document, possibly OR-ed with 0x20000000 (for documents that were not downloaded)
		byte	flags			<-- from card notes
		byte	dynamic_weight

CardPrints
~~~~~~~~~~
Sequence of:	byte	fingerprint[12]
		u32	card_id

GatherD host file
~~~~~~~~~~~~~~~~~
Sequence of:	byte	protocol_id
		u16	port
		byte	hostname[], 0x0a
		u32	queue_read_pos
		u32	queue_write_pos
		u32	obj_count[SHERLOCK_NUM_SECTIONS]
		u32	robot_id
		u32	robot_time
		u32	rec_err_count
		u32	queue_key
		u32	queue_priority

GatherD queue file
~~~~~~~~~~~~~~~~~~
Sequence of pages of QUEUE_PAGE_SIZE bytes each; every page contains:
		u32	pos_of_next_page
		Sequence of:
			byte	url_rest[], 0
			u32	priority

Bucket index file
~~~~~~~~~~~~~~~~~
Sequence of:	u32	oid
		sh_time_t last_checked_time

Cards
~~~~~
Sequence of:	u32	bucket_type
		u32	stored_len
		/* If bucket_type == BUCKET_TYPE_V33_LIZARD, then also: */
			u32	orig_len
			u32	orig_adler32
		byte	data[stored_len]
		byte	padding[]		<-- so that the total length is a multiple of 1 << CARD_POS_SHIFT
