Formats of Various Files Used by Sherlock
*****************************************

### Buckets ###  (various representations of objects, see also sherlock/object.h)

Buckets usually appear as part of a bucket file, with their type and length
stored in the bucket file metadata.

Plain text bucket  (BUCKET_TYPE_PLAIN, has only a body with no header)
~~~~~~~~~~~~~~~~~
Sequence of:	byte	attr_name
		byte	attr_value[]
		byte	'\n'
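
As an illustration of the layout above, a plain text bucket can be split into
attributes as follows. This is only a sketch: it assumes that attribute values
never contain a newline, and the attribute letters in the usage line are just
sample data.

```python
def parse_plain_bucket(data):
    """Split a plain text bucket into (attr_name, attr_value) pairs."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty tail after the final '\n'
    attrs = []
    for line in lines:
        if not line:
            # An empty line would be a header/body separator (V3.0 bucket).
            raise ValueError("empty line inside a plain text bucket")
        attrs.append((chr(line[0]), line[1:]))
    return attrs


sample = parse_plain_bucket(b"Uhttp://example.com/\nXsome value\n")
```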

V3.0 bucket  (BUCKET_TYPE_V30, textual bucket with header and body)
~~~~~~~~~~~
plain_text_bucket	header
byte			'\n'
plain_text_bucket	body

V3.3 bucket  (BUCKET_TYPE_V33, optimized for quick reading)
~~~~~~~~~~~
v3.3_sequence	header
byte		0				<-- separates header and body
v3.3_sequence	body

where v3.3_sequence is a sequence of:
utf-8		length+1			<-- so that the first byte is never 0
byte		attr_value[]
byte		attr_name
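
A possible reader for a v3.3_sequence is sketched below. It assumes that
`length' counts only the bytes of attr_value (so the stored number is the value
length plus one) and that the number uses the generalized UTF-8 encoding of up
to 6 bytes. The function names are made up for this example.

```python
def get_utf8(buf, pos):
    """Decode one UTF-8 encoded integer at buf[pos]; return (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        raise ValueError("unexpected continuation byte")
    n = 2  # sequence length = number of leading 1 bits in the first byte
    while n < 7 and first & (0x80 >> n):
        n += 1
    if n > 6 or pos + n > len(buf):
        raise ValueError("invalid or truncated UTF-8 sequence")
    value = first & (0x7F >> n)
    for b in buf[pos + 1:pos + n]:
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value, pos + n


def parse_v33_sequence(buf, pos=0):
    """Read attributes until a 0 byte or the end of buf.

    Returns (list of (attr_name, attr_value), position where reading stopped).
    """
    attrs = []
    while pos < len(buf) and buf[pos] != 0:
        stored, pos = get_utf8(buf, pos)
        length = stored - 1          # assumption: length of attr_value only
        if pos + length >= len(buf):
            raise ValueError("truncated attribute")
        value = bytes(buf[pos:pos + length])
        name = chr(buf[pos + length])  # the name byte follows the value
        attrs.append((name, value))
        pos += length + 1
    return attrs, pos
```

For a V3.3 bucket, the position returned for the header points at the
separating 0 byte, so the body can be parsed starting one byte further.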

V3.3 compressed bucket  (BUCKET_TYPE_V33_LIZARD)
~~~~~~~~~~~~~~~~~~~~~~
v3.3_sequence	header
byte		0				<-- separates header and body
u32		orig_len
u32		orig_adler32
byte		lizard_data[]			<-- v3.3_sequence compressed by Lizard

Linearized objects  (as generated by obj_linearize(), no external meta-data needed except for length)
~~~~~~~~~~~~~~~~~~
byte		bucket_type			<-- relative to BUCKET_TYPE_PLAIN
byte		bucket_data[]			<-- any of the V3.3 bucket formats, header is empty

Compressed blocks  (as generated by lizard_bwrite(), currently used only for storing buckets)
~~~~~~~~~~~~~~~~~
u32		total_len			<-- of the whole block except for this field
						<-- from this point on, the whole structure can be interpreted as a linearized object
byte		bucket_type			<-- relative to BUCKET_TYPE_PLAIN
byte		bucket_separator		<-- denotes empty bucket hdr
Either:		u32	orig_len		<-- for BUCKET_TYPE_V33_LIZARD
		u32	orig_adler32
		byte	compressed_data[]
Or:		byte	uncompressed_data[]	<-- for all other bucket types
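
The following sketch splits one compressed block into its fields without
decompressing anything. It assumes little-endian integers and takes the
relative type number of BUCKET_TYPE_V33_LIZARD as a parameter; the default of 3
is a guess for illustration, check sherlock/object.h for the real value.

```python
import struct


def split_compressed_block(block, lizard_type=3):
    """Return the fields of one block written in the layout above.

    lizard_type is the (assumed) value of
    BUCKET_TYPE_V33_LIZARD - BUCKET_TYPE_PLAIN.
    """
    (total_len,) = struct.unpack_from("<I", block, 0)
    body = block[4:4 + total_len]
    if len(body) != total_len or total_len < 2:
        raise ValueError("truncated block")
    bucket_type, separator = body[0], body[1]
    if separator != 0:
        raise ValueError("expected an empty bucket header")
    if bucket_type == lizard_type:
        orig_len, orig_adler32 = struct.unpack_from("<II", body, 2)
        return {"type": bucket_type, "orig_len": orig_len,
                "orig_adler32": orig_adler32, "data": body[10:]}
    return {"type": bucket_type, "orig_len": None,
            "orig_adler32": None, "data": body[2:]}
```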

### Index and internal indexer files ###

Parameters
~~~~~~~~~~
Single struct index_params

Fingerprints and CardPrints
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	id

FPSplits
~~~~~~~~
Sequence of:	u32	node

This file describes how to recursively split the set of fingerprints so that
the corresponding intervals of the Fingerprints file fit in memory (as hash tables).

The recursion tree is dumped in prefix order (each node before its subtrees).
Possible node values are:
* RESOLVE_SPLIT | bits -- split FPs by their `bits' most significant bits
* count -- resolve FPs using the next `count' entries from Fingerprints
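
To illustrate the prefix-order dump, the sketch below rebuilds the recursion
tree from a list of node values. It rests on two assumptions not stated in this
file: that RESOLVE_SPLIT is a single flag bit (the value used here is
hypothetical) and that a split by `bits' bits has exactly 2^bits children.

```python
RESOLVE_SPLIT = 0x80000000  # hypothetical flag value, for illustration only


def read_split_tree(nodes, pos=0):
    """Rebuild the tree from its prefix-order dump.

    A leaf is returned as its `count'; a split node as a list of children.
    Returns (tree, position after the subtree).
    """
    node = nodes[pos]
    pos += 1
    if node & RESOLVE_SPLIT:
        bits = node & ~RESOLVE_SPLIT
        children = []
        for _ in range(1 << bits):  # assumption: 2^bits subtrees follow
            child, pos = read_split_tree(nodes, pos)
            children.append(child)
        return children, pos
    return node, pos


tree, used = read_split_tree(
    [RESOLVE_SPLIT | 1, 5, RESOLVE_SPLIT | 1, 2, 3])
```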

LabelsByID
~~~~~~~~~~
Sequence of:	u32	orig_url_id
		byte	label_flags		<-- LABEL_TYPE_xxx | LABEL_FLAG_xxx
		Sequence of V3.3 attributes
		byte	0

Attributes
~~~~~~~~~~
Array of struct card_attr[id]

Notes
~~~~~
Array of struct card_note[id]

CardInfo
~~~~~~~~
Array of struct card_info[stage2_id]

Checksums
~~~~~~~~~
Sequence of:	u32	checksum[4]
		u32	id

Links
~~~~~
Sequence of:	u32	fingerprint[3]		<-- link destination
		u32	id			<-- link source  (top 2 bits are type: 0=ref, 1=redirect, 2=frame, 3=image)
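
A short sketch of reading Links records and separating the link type from the
source id. Little-endian byte order is an assumption here; the files are
presumably written in the byte order of the indexing machine.

```python
import struct

LINK_TYPES = ("ref", "redirect", "frame", "image")  # values 0..3 as listed above


def read_links(data):
    """Yield (destination fingerprint, source id, link type) per 16-byte record."""
    for fp0, fp1, fp2, raw_id in struct.iter_unpack("<4I", data):
        yield (fp0, fp1, fp2), raw_id & 0x3FFFFFFF, LINK_TYPES[raw_id >> 30]
```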

LinkGraph  (sorted on src_vertex)
~~~~~~~~~
Sequence of:	u32	src_vertex | mask
		byte	degree[0..4]		<-- depending on mask
		u32	dest_vertex[degree]	<-- (top 2 bits are link type)

mask:		00xx xxxx xxxx xxxx		degree==1 and is 0 bytes long
		01xx xxxx xxxx xxxx		degree is 1 byte long
		10xx xxxx xxxx xxxx		degree is 2 bytes long
		11xx xxxx xxxx xxxx		degree is 4 bytes long
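
The variable-width degree field can be decoded as sketched below. The sketch
assumes that the mask occupies the top two bits of the 32-bit src_vertex word
and that all integers are little-endian.

```python
import struct

# Top two bits of the first word -> format of the degree field (None = absent).
DEGREE_FORMATS = {0: None, 1: "<B", 2: "<H", 3: "<I"}


def read_link_graph(data):
    """Yield (src_vertex, [(dest_vertex, link_type), ...]) for each record."""
    pos = 0
    while pos < len(data):
        (head,) = struct.unpack_from("<I", data, pos)
        pos += 4
        src = head & 0x3FFFFFFF
        fmt = DEGREE_FORMATS[head >> 30]
        if fmt is None:
            degree = 1  # mask 00: the degree is implicitly 1
        else:
            (degree,) = struct.unpack_from(fmt, data, pos)
            pos += struct.calcsize(fmt)
        dests = struct.unpack_from("<%dI" % degree, data, pos)
        pos += 4 * degree
        yield src, [(d & 0x3FFFFFFF, d >> 30) for d in dests]
```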

URLList
~~~~~~~
Array of strings[id]

URLIndex
~~~~~~~~
Sequence of:	byte	pos[BYTES_PER_O]	<-- offset in URLList

Labels
~~~~~~
Sequence of:	u32	merged_id
		u32	url_id
		u32	redir_id
		u32	count
		byte	label_flags		<-- LABEL_TYPE_xxx | LABEL_FLAG_xxx
		byte	labels[count]		<-- stored as a sequence of V3.3 attributes

RefTexts
~~~~~~~~
Sequence of:	u32	srcid
		u32	fingerprint[3]
		u16	length
		byte	text[length]		<-- including category switch

Signatures
~~~~~~~~~~
Sequence of:	u32	id
		u32	signature[Matcher.Signatures]

Merges
~~~~~~
Array of u32[id]

In fact, this array is two arrays overlaid on each other: a redirect array
(for cards with CARD_FLAG_EMPTY set, which are either completely empty or
redirects) and a merge array (for all other cards).

The merge array has two uses: before the merger is run, it contains the parent
of the card in a Tarjan union-find tree, or ~0 for root nodes. The merger rewrites
merges[id] to the representative of the id's equivalence class.

The redirect array contains the ultimate destination of the redirect (after
following the whole redirect chain), or ~0 if the redirect leads nowhere.
The backlinker uses Tarjan trees for this purpose internally.
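
The step from parent pointers to class representatives can be pictured with the
sketch below. It is not the merger's actual code: it ignores the redirect part
of the array and assumes that the representative of a class is the root of its
union-find tree.

```python
ROOT = 0xFFFFFFFF  # ~0 as a u32


def find_root(parents, i):
    """Follow parent pointers to the root, compressing the path on the way."""
    root = i
    while parents[root] != ROOT:
        root = parents[root]
    while parents[i] != ROOT:
        # Re-point i directly at the root, then continue with its old parent.
        parents[i], i = root, parents[i]
    return root


def resolve_merges(merges):
    """Return, for every id, the root of its union-find tree."""
    parents = list(merges)  # work on a copy, keep the input intact
    return [find_root(parents, i) for i in range(len(parents))]


resolved = resolve_merges([ROOT, 0, 1, ROOT, 3])
```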

LexRaw, LexByFreq, LexOrder
~~~~~~~~~~~~~~~~~~~~~~~~~~~
u32 word_count
Array [word_count] of:
		u32	id			<-- lowest 3 bits are word class
		u32	frequency
		context_t context_class
		byte	length
		byte	word[length]

LexRaw is in arbitrary order.
LexByFreq is sorted by decreasing frequency.
LexOrder is sorted by id; the first item has id 8 and ids then increase by 8.

WordIndex
~~~~~~~~~
Sequence of:	u32	wordid
		u32	size			<-- byte size of the following sequence
		Sequence RefChain {
		  u28	oid
		  u4	length			<-- 1..15, 0=special
		  if	(!length)
			utf8	length2		<-- replaces length
		  byte	ref[length]
		}

ref:	0tpp pppp			2 most frequent word-types, 6-bit delta (0 is behind the edge)
	10pp pttt, u8			word-type, 11-bit delta (0 is behind the edge)
	110p pttt, u16			word-type, 18-bit delta
	1110 tttt wwpp pppp		weight, meta-type, 6-bit position
	1111 0tww pppp pttt, u8		weight, meta-type, 13-bit position
	1111 10tt tppp pppp, u16	word-type, 23-bit delta
	1111 110w wppp tttt, u16	weight, meta-type, 19-bit position
	1111 111?			RFU

	Note: the bits are scattered such that the number of bit shifts is minimized.
	See chewer.c for the exact algorithm.
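
Since the leading bits of a ref determine its format, the size of a ref follows
from its first byte alone. The sketch below derives the sizes from the table
above, assuming that the leftmost byte of each row is the first byte in the
file; chewer.c remains the authoritative source.

```python
def ref_size(first_byte):
    """Return the total size in bytes of a ref starting with first_byte."""
    if first_byte < 0x80:
        return 1  # 0tpp pppp
    if first_byte < 0xC0:
        return 2  # 10pp pttt, u8
    if first_byte < 0xE0:
        return 3  # 110p pttt, u16
    if first_byte < 0xF0:
        return 2  # 1110 tttt wwpp pppp
    if first_byte < 0xF8:
        return 3  # 1111 0tww pppp pttt, u8
    if first_byte < 0xFC:
        return 4  # 1111 10tt tppp pppp, u16
    if first_byte < 0xFE:
        return 4  # 1111 110w wppp tttt, u16
    raise ValueError("1111 111? is reserved for future use")
```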

StringIndex
~~~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		u32	size			<-- byte size of the following sequence
		Sequence RefChain		<-- as in WordIndex

References
~~~~~~~~~~
Sequence of:	byte	slice_mask		<-- which slices are present (omitted if num_slices=1)
		utf8	slice_sizes[]		<-- sizes of all present slices but the last one; includes the terminating zeroes
		Sequence [] {			<-- for all present slices
			Sequence RefChain	<-- as in WordIndex
			u32	0
		}

LexWords
~~~~~~~~
u32 word_count
Sequence of:	fpos	ref_pos
		u16	chainlen		<-- reference chain length in 4K pages
		byte	class
		byte	freq			<-- logarithmic frequency scaled to [0,255]
		context_t context_class
		byte	length
		byte	word[length]

LexWords is in the same order as LexOrder, and its ids are inherited from it as well.

Lexicon
~~~~~~~
u32 word_count
u32 cplx_count
Array [word_count+cplx_count] of:
		/*
		 * Words come first, complexes next
		 * Words are sorted by length (in characters, not bytes), then
		 * lexicographically without accents and finally
		 * lexicographically with accents.
		 * Complexes are sorted by root and then by context slot
		 */
		fpos	ref_pos			<-- the same records as in LexWords
		u16	chainlen
		byte	class
		byte	freq
		context_t context_slot
		byte	length
		byte	word[length]

StringMap
~~~~~~~~~
Sequence of:	u32	fingerprint[3]
		fpos	ref_pos
Last item has hash 0xff[12] and points after the last reference.

StringHash
~~~~~~~~~~
Array of:	u32	stringmap_pos		<-- measured in string map items

StemsOrder
~~~~~~~~~~
Sequence of:	u32	stemmer_id	<-- the same as in Stems
		u32	language_mask
		Sequence of:
			u32	stem_word_id
			u32	word_id
		u32	0xffffffff

Stems
~~~~~
Sequence of:	u32	stemmer_id	<-- 0x80000000 for synonyms, 0x80000001 for reverse synonyms
		u32	language_mask
		Sequence of:  /* sorted by stem_word_id */
			u32	stem_word_id | 0x80000000  <-- for type 0x80000000 stem_word_id is actually class ID
			u32	word_id[]		   <-- for type 0x80000001 word_id is actually class ID
		u32	0xffffffff

Keywords
~~~~~~~~
Text file with lines:	[UFC][0-9a-f]+ [0-9a-f]+ [0123] text...
		byte	type		<-- URL, File, Catalog
		u32	redir_id
		u32	url_id
		byte	weight
		byte	text[]

Output file of the fingerprint resolver
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	u32	id			<-- fingerprint resolved to the first (lowest) ID from Fingerprints
		byte	data[record_size]	<-- copied from the source file

Feedback from indexer to gatherer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sequence of:	byte	footprint[16]
		u32	card_id			<-- id of the document, possibly ORed with 0x20000000 (for documents that were not downloaded)
		byte	flags			<-- from card notes
		byte	dynamic_weight
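
A feedback record as laid out above is 22 bytes long if it is stored without
padding. The sketch below reads such records; the lack of padding and the
little-endian byte order are both assumptions.

```python
import struct

RECORD = struct.Struct("<16sIBB")  # footprint, card_id, flags, dynamic_weight
NOT_DOWNLOADED = 0x20000000


def read_feedback(data):
    """Yield one dict per feedback record."""
    for footprint, card_id, flags, weight in RECORD.iter_unpack(data):
        yield {
            "footprint": footprint,
            "card_id": card_id & ~NOT_DOWNLOADED,
            "downloaded": not (card_id & NOT_DOWNLOADED),
            "flags": flags,
            "dynamic_weight": weight,
        }
```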

Bucket index file
~~~~~~~~~~~~~~~~~
Sequence of:	u32	oid
		ucw_time_t last_checked_time

Cards
~~~~~
Sequence of:	u32	stored_len
		byte	data[stored_len]	<-- linearized object (see above)
		byte	padding[]		<-- so that the position of the next card is a multiple of 1 << CARD_POS_SHIFT
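
The padding rule can be expressed as rounding the end of each card up to the
next multiple of 1 << CARD_POS_SHIFT. The sketch below walks a Cards file this
way; the shift is passed in as a parameter because its value is not given here,
and little-endian integers are assumed.

```python
import struct


def next_card_pos(pos, stored_len, card_pos_shift):
    """Position of the card following the one that starts at pos."""
    end = pos + 4 + stored_len          # u32 stored_len + data
    align = 1 << card_pos_shift
    return (end + align - 1) & ~(align - 1)


def read_cards(data, card_pos_shift):
    """Yield (file position, linearized object bytes) for each card."""
    pos = 0
    while pos + 4 <= len(data):
        (stored_len,) = struct.unpack_from("<I", data, pos)
        yield pos, data[pos + 4:pos + 4 + stored_len]
        pos = next_card_pos(pos, stored_len, card_pos_shift)
```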

Blacklist
~~~~~~~~~
Sequence of:	u32	card_id

ImageSignatures
~~~~~~~~~~~~~~~
Sequence of:	u32	card_id
		byte	len			<-- first byte of struct image_signature
		byte	data[image_signature_size(len) - 1]

The sequence is sorted primarily by cluster and secondarily by card_id.

ImageSignaturesUnsorted
~~~~~~~~~~~~~~~~~~~~~~~
u32	count		<-- number of signatures
Same sequence as ImageSignatures, but unsorted.

ImageClusters
~~~~~~~~~~~~~
u32	tree_depth
struct image_cluster[1 << tree_depth] cluster	<-- heap-like storage of binary search tree, see indexer/imagesigs.c

ImageThumbnails
~~~~~~~~~~~~~~~
Sequence of:	struct	image_thumb_hdr hdr
		byte	thumbnail_data[hdr.thumb_size]

### GatherD internals ###

GatherD host file
~~~~~~~~~~~~~~~~~
Sequence of:	byte	protocol_id
		u16	port
		byte	hostname[], 0x0a
		u32	queue_read_pos
		u32	queue_write_pos
		u32	obj_count[SHERLOCK_NUM_SECTIONS]
		u32	robot_id
		u32	robot_time
		u32	rec_err_count
		u32	queue_key
		u32	queue_priority

GatherD queue file
~~~~~~~~~~~~~~~~~~
Sequence of pages of QUEUE_PAGE_SIZE bytes each; each page contains:
		u32	pos_of_next_page
		Sequence of:
			byte	url_rest[], 0
			u32	priority

### ShepherD internals ###

URL database file (Shepherd.URLDatabaseFile)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
struct url_db_hdr header (see gather/shepherd/shepherd.h)

Sequence of:	byte	zero			<-- can be used to find start of a record
		byte	mask			<-- sum of masked bits below
		u16	len			<-- URL length in bytes
		u32	oid			<-- bucket | 0x01 (first byte, so 0x01000000 for big-endian CPU)
		u32	fp_site_0		<-- fp.site.x[0] | 0x02
		u32	fp_site_1		<-- fp.site.x[1] | 0x04
		u32	fp_rest_0		<-- fp.rest.x[0] | 0x08
		u32	fp_rest_1		<-- fp.rest.x[1] | 0x10
		byte	url[ALIGN_TO(len, 4)]	<-- URL padded with 0-3 zeros

Sorted text source
~~~~~~~~~~~~~~~~~~
Header:		u32	count			<-- number of records
		u32	idx_count		<-- number of blocks
		u64	idx_pos			<-- file offset of the index (see below)

Sequence of blocks:
		u32	block_size		<-- size of the block
		u32	buf_size		<-- size of decompressed block data
		byte	data[block_size]	<-- lizard-compressed sequence of records

		Record format:
			struct footprint fp	<-- record's FP
			byte[] attributes	<-- textual object representation (terminated by an empty line)

Sequence of idx_count index entries:
		struct footprint fp		<-- FP of the last record in the block
		u64	pos			<-- file offset of the block

The index can be followed by user data.
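
Because each index entry holds the FP of the last record of its block, the
block that may contain a given FP can be found by binary search. The sketch
below shows the idea under several assumptions: struct footprint is taken to be
16 bytes, footprints are compared bytewise, integers are little-endian and the
structures are stored without padding.

```python
import struct
from bisect import bisect_left

FP_SIZE = 16  # assumed size of struct footprint
HEADER = struct.Struct("<IIQ")            # count, idx_count, idx_pos
ENTRY = struct.Struct("<%dsQ" % FP_SIZE)  # fp of last record, block offset


def read_index(data):
    """Return the list of (last_fp, block_pos) entries of the file."""
    count, idx_count, idx_pos = HEADER.unpack_from(data, 0)
    return [ENTRY.unpack_from(data, idx_pos + i * ENTRY.size)
            for i in range(idx_count)]


def find_block(index, fp):
    """Offset of the first block whose last FP is >= fp, or None."""
    keys = [last_fp for last_fp, _ in index]
    i = bisect_left(keys, fp)
    return index[i][1] if i < len(index) else None
```

The block found this way still has to be decompressed and scanned for the
record itself.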

Sorted URL database (Shepherd.URLSortedFile)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'U' attributes compressed to the "Sorted text source" format (see above)
followed by struct sorted_trailer (see gather/shepherd/shep-urls.c).

