Formats of Various Files Used by Sherlock
*****************************************

### Buckets ###  (various representations of objects, see also sherlock/object.h)

Buckets usually appear as parts of the bucket file, with their type and length
stored in the bucket file's metadata.

Plain text bucket  (BUCKET_TYPE_PLAIN, has only a body with no header)
~~~~~~~~~~~~~~~~~
Sequence of:	byte	attr_name
		byte	attr_value[]
		byte	'\n'
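Under the layout above (a one-byte attribute name followed by the value and a
terminating '\n'), a minimal reader for a plain text bucket body could look like
this sketch; it assumes the body has already been decoded to a Python string:

```python
def parse_plain_bucket(body):
    """Split a plain text bucket body into (attr_name, value) pairs.

    Sketch only: assumes each non-empty line is a one-byte attribute
    name immediately followed by the attribute value.
    """
    return [(line[0], line[1:]) for line in body.split("\n") if line]
```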

V3.0 bucket  (BUCKET_TYPE_V30, textual bucket with header and body)
~~~~~~~~~~~
plain_text_bucket	header
byte			'\n'
plain_text_bucket	body

V3.3 bucket  (BUCKET_TYPE_V33, optimized for quick reading)
~~~~~~~~~~~
v3.3_sequence	header
byte		0				<-- separates header and body
v3.3_sequence	body

where v3.3_sequence is a sequence of:
utf-8		length+1			<-- so that the 1st byte is never 0
byte		attr_value[]
byte		attr_name
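A decoder for such a sequence might look like the following sketch. It assumes
each length+1 fits into a single byte (i.e. the utf-8 encoding of the length is
one byte); real buckets with longer values would need a full utf-8 integer
decoder:

```python
def parse_v33_sequence(buf):
    """Decode a v3.3 attribute sequence into (attr_name, value) pairs.

    Sketch only: handles the single-byte case of the utf-8 encoded
    length+1 field; stops at the 0 byte separating header and body.
    """
    attrs = []
    i = 0
    while i < len(buf) and buf[i] != 0:  # 0 separates header and body
        length = buf[i] - 1              # stored as length+1, never 0
        i += 1
        value = buf[i:i + length]
        name = chr(buf[i + length])
        i += length + 1
        attrs.append((name, value))
    return attrs
```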

V3.3 compressed bucket  (BUCKET_TYPE_V33_LIZARD)
~~~~~~~~~~~~~~~~~~~~~~
v3.3_sequence	header
byte		0				<-- separates header and body
u32		orig_len
u32		orig_adler32
byte		lizard_data[]			<-- v3.3_sequence compressed by Lizard

Linearized objects  (as generated by obj_linearize(), no external meta-data needed except for length)
~~~~~~~~~~~~~~~~~~
byte		bucket_type			<-- relative to BUCKET_TYPE_PLAIN
byte		bucket_data[]			<-- any of the V3.3 bucket formats, header is empty

Compressed blocks  (as generated by lizard_bwrite(), currently used only for storing buckets)
~~~~~~~~~~~~~~~~~
u32		total_len			<-- of the whole block except for this field
						<-- from this point on, the whole structure can be interpreted as a linearized object
byte		bucket_type			<-- relative to BUCKET_TYPE_PLAIN
byte		bucket_separator		<-- denotes an empty bucket header
Either:		u32	orig_len		<-- for BUCKET_TYPE_V33_LIZARD
		u32	orig_adler32
		byte	compressed_data[]
Or:		byte	uncompressed_data[]	<-- for all other bucket types

### Index and internal indexer files ###

Parameters
~~~~~~~~~~
Single struct index_params

Fingerprints and CardPrints
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	id
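Since each record is a fixed 16 bytes (three u32 fingerprint words plus a u32
id), iterating over such a file is straightforward. The little-endian byte
order in this sketch is an assumption; the spec does not state it here:

```python
import struct

def read_fingerprints(data):
    """Yield ((f0, f1, f2), id) records from a Fingerprints file.

    Each record is 16 bytes: u32 fingerprint[3] followed by u32 id.
    Little-endian byte order is assumed, not stated by the spec.
    """
    for off in range(0, len(data), 16):
        f0, f1, f2, doc_id = struct.unpack_from("<4I", data, off)
        yield (f0, f1, f2), doc_id
```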

LabelsByID
~~~~~~~~~~
Sequence of:	u32	orig_url_id
		byte	label_flags		<-- LABEL_TYPE_xxx | LABEL_FLAG_xxx
		Sequence of V3.3 attributes
		byte	0

Attributes
~~~~~~~~~~
Array of struct card_attr[id]

Notes
~~~~~
Array of struct card_note[id]

Checksums
~~~~~~~~~
Sequence of:	u32	checksum[4]
		u32	id

Links
~~~~~
Sequence of:	u32	fingerprint[3]		<-- link destination
		u32	id			<-- link source  (top 2 bits are type: 0=ref, 1=redirect, 2=frame, 3=image)

LinkGraph
~~~~~~~~~
Sequence of:	u32	vertex
		u16	degree
		u32	dest_vertex[degree]	<-- (top 2 bits are link type)
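Splitting the link type out of the top 2 bits of each dest_vertex can be
sketched as below (type values 0=ref, 1=redirect, 2=frame, 3=image, as listed
under Links above; little-endian byte order is an assumption):

```python
import struct

def read_link_graph(data):
    """Yield (vertex, [(dest, link_type), ...]) from LinkGraph bytes.

    Each entry is u32 vertex, u16 degree, then degree u32 destinations
    whose top 2 bits carry the link type.
    """
    off = 0
    while off < len(data):
        vertex, degree = struct.unpack_from("<IH", data, off)
        off += 6
        dests = struct.unpack_from("<%dI" % degree, data, off)
        off += 4 * degree
        yield vertex, [(d & 0x3fffffff, d >> 30) for d in dests]
```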

URLList
~~~~~~~
Array of strings[id]

Labels
~~~~~~
Sequence of:	u32	merged_id
		u32	url_id
		u32	redir_id
		u32	count
		byte	label_flags		<-- LABEL_TYPE_xxx | LABEL_FLAG_xxx
		byte	labels[count]		<-- stored as a sequence of V3.3 attributes

LinkTexts
~~~~~~~~~
Sequence of:	u32	srcid
		u32	fingerprint[3]
		u16	length
		byte	text[length]		<-- including category switch

Signatures
~~~~~~~~~~
Sequence of:	u32	id
		u32	signature[Matcher.Signatures]

Merges
~~~~~~
Array of u32[id]

In fact, this array is two arrays laid on top of each other -- a redirect
array (for cards with CARD_FLAG_EMPTY set, which are either completely empty
or redirects) and a merge array (for all other cards).

The merge array has two uses: before the merger is run, it contains the parent
of the card in a Tarjan union-find tree, or ~0 for root nodes. The merger then
rewrites merges[id] to the representative of id's equivalence class.

The redirect array contains the ultimate destination of the redirect (after
following the whole redirect chain) or ~0 if the redirect leads to nowhere.
The backlinker uses Tarjan trees for this purpose internally.
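The union-find lookup described above can be sketched as follows; ~0 is
written out as 0xffffffff, its value for a u32:

```python
def resolve_merge(merges, card_id):
    """Follow parent pointers up to the class representative.

    Sketch of the lookup before the merger runs: merges[id] holds
    the parent in the Tarjan tree, and ~0 (0xffffffff) marks a root.
    """
    ROOT = 0xffffffff
    while merges[card_id] != ROOT:
        card_id = merges[card_id]
    return card_id
```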

LexRaw, LexByFreq, LexOrder
~~~~~~~~~~~~~~~~~~~~~~~~~~~
u32 word_count
Array [word_count] of:
		u32	id			<-- lowest 3 bits are word class
		u32	frequency
		context_t context_class
		byte	length
		byte	word[length]

LexRaw is in no particular order.
LexByFreq is sorted by decreasing frequency.
LexOrder is sorted by id; the first item has id 8 and ids increase by 8.

WordIndex
~~~~~~~~~
Sequence of:	u32	wordid
		u32	size			<-- byte size of the following sequence
		Sequence RefChain {
		  u28	oid
		  u4	length			<-- 1..15, 0=special
		  if	(!length)
			utf8	length2		<-- replaces length
		  byte	ref[length]
		}

ref:	0tpp pppp			2 most frequent word-types, 6-bit delta (0 is behind the edge)
	10pp pttt, u8			word-type, 11-bit delta (0 is behind the edge)
	110p pttt, u16			word-type, 18-bit delta
	1110 tttt wwpp pppp		weight, meta-type, 6-bit position
	1111 0tww pppp pttt, u8		weight, meta-type, 13-bit position
	1111 10tt tppp pppp, u16	word-type, 23-bit delta
	1111 110w wppp tttt, u16	weight, meta-type, 19-bit position
	1111 111?			reserved for future use

	Note: the bits are scattered such that the number of bit shifts is minimized.
	See chewer.c for the exact algorithm.
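Unpacking the head of a RefChain item might look like the sketch below. The
spec only says u28 oid and u4 length; placing the oid in the upper 28 bits and
the length in the low 4 is one plausible packing, not necessarily the real one
(see chewer.c for the actual layout):

```python
def split_refchain_head(x):
    """Split the packed u32 head of a RefChain item into (oid, length).

    Assumed packing: oid in the upper 28 bits, length in the low 4.
    A length of 0 means a utf-8 encoded length2 follows instead.
    """
    return x >> 4, x & 0xf
```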

StringIndex
~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	size			<-- byte size of the following sequence
		Sequence RefChain		<-- as in WordIndex

References
~~~~~~~~~~
Sequence of:	Sequence RefChain		<-- as in WordIndex
		u32	0

LexWords
~~~~~~~~
u32 word_count
Sequence of:	fpos	ref_pos
		u16	chainlen		<-- reference chain length in 4K pages
		byte	class
		byte	freq			<-- logarithmic frequency scaled to [0,255]
		context_t context_class
		byte	length
		byte	word[length]

LexWords is in the same order as LexOrder, and ids are inherited from it as well.

Lexicon
~~~~~~~
u32 word_count
u32 cplx_count
Array [word_count+cplx_count] of:
		/*
		 * Words come first, complexes next
		 * Words are sorted by length (in characters, not bytes), then
		 * lexicographically without accents and finally
		 * lexicographically with accents.
		 * Complexes are sorted by root and then by context slot
		 */
		fpos	ref_pos			<-- the same records as in LexWords
		u16	chainlen
		byte	class
		byte	freq
		context_t context_slot
		byte	length
		byte	word[length]

StringMap
~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		fpos	ref_pos
The last item has fingerprint 0xff[12] and points just after the last reference.

StringHash
~~~~~~~~~~
Array of:	u32	stringmap_pos		<-- measured in string map items

StemsOrder
~~~~~~~~~~
Sequence of:	u32	stemmer_id	<-- the same as in Stems
		u32	language_mask
		Sequence of:
			u32	stem_word_id
			u32	word_id
		u32	0xffffffff

Stems
~~~~~
Sequence of:	u32	stemmer_id	<-- 0x80000000 for synonyms, 0x80000001 for reverse synonyms
		u32	language_mask
		Sequence of:  /* sorted by stem_word_id */
			u32	stem_word_id | 0x80000000  <-- for type 0x80000000 stem_word_id is actually class ID
			u32	word_id[]		   <-- for type 0x80000001 word_id is actually class ID
		u32	0xffffffff

Keywords
~~~~~~~~
Text file with lines:	[UFC][0-9a-f]+ [0-9a-f]+ [0123] text...
		byte	type		<-- URL, File, Catalog
		u32	redir_id
		u32	url_id
		byte	weight
		byte	text[]
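A line of this text file can be parsed with a regular expression. The mapping
of the regex groups to the fields listed above (type letter, hex redir_id, hex
url_id, weight digit, text) is an assumption drawn from their order:

```python
import re

# Assumed field order: type letter, redir_id, url_id, weight, text.
KEYWORD_RE = re.compile(r"^([UFC])([0-9a-f]+) ([0-9a-f]+) ([0123]) (.*)$")

def parse_keyword_line(line):
    """Parse one Keywords line; returns None if it does not match."""
    m = KEYWORD_RE.match(line)
    if not m:
        return None
    kind, redir, url, weight, text = m.groups()
    return kind, int(redir, 16), int(url, 16), int(weight), text
```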

Output file of the fingerprint resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	u32	source_id		<-- sorted on (source_id & 0x3fffffff)
		u32	dest_id			<-- id of the document, possibly | 0x20000000 (for documents not downloaded)

Feedback from indexer to gatherer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	byte	footprint[16]
		u32	card_id			<-- id of the document, possibly | 0x20000000 (for documents not downloaded)
		byte	flags			<-- from card notes
		byte	dynamic_weight

Bucket index file
~~~~~~~~~~~~~~~~~
Sequence of:	u32	oid
		sh_time_t last_checked_time

Cards
~~~~~
Sequence of:	u32	stored_len
		byte	data[stored_len]	<-- linearized object (see above)
		byte	padding[]		<-- so that the position of the next card is a multiple of 1 << CARD_POS_SHIFT
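Skipping from one card to the next therefore means rounding the end of the
record up to the alignment boundary. In this sketch the value of
CARD_POS_SHIFT is hypothetical; the real one comes from the index parameters:

```python
CARD_POS_SHIFT = 5  # hypothetical value; the real one is an index parameter

def next_card_pos(pos, stored_len):
    """Return the file position of the next card.

    Skips the u32 stored_len field and the data, then rounds up to
    a multiple of 1 << CARD_POS_SHIFT as the padding rule requires.
    """
    align = 1 << CARD_POS_SHIFT
    end = pos + 4 + stored_len
    return (end + align - 1) & ~(align - 1)
```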

Blacklist
~~~~~~~~~
Sequence of:	u32	card_id

### GatherD internals ###

GatherD host file
~~~~~~~~~~~~~~~~~
Sequence of:	byte	protocol_id
		u16	port
		byte	hostname[], 0x0a
		u32	queue_read_pos
		u32	queue_write_pos
		u32	obj_count[SHERLOCK_NUM_SECTIONS]
		u32	robot_id
		u32	robot_time
		u32	rec_err_count
		u32	queue_key
		u32	queue_priority

GatherD queue file
~~~~~~~~~~~~~~~~~~
Sequence of pages of QUEUE_PAGE_SIZE bytes each, where each page contains:
		u32	pos_of_next_page
		Sequence of:
			byte	url_rest[], 0
			u32	priority
