Sherlock 3.x -- Object Data Formats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OBJECT ATTRIBUTES
~~~~~~~~~~~~~~~~~
+--------- attribute name
|  +------ attribute source:
|  |       G=gatherer (bucket body), H=gatherer (bucket header), g=internal to gatherer, Q=gatherer (from queue), X=gatherer (bucket export)
|  |       I=indexer, i=indexer internal, S=search server, M=multiplexer
|  |+----- inside cards (with default settings): -=not present, +=present, U=in per-URL or per-redirect blocks [all in default configuration], x=in per-reftext block
|  ||+---- "#" if multiple occurences are possible, "@" if multiple occurences are just a long value split to more lines
|  |||
A  G-#	applet/object reference
B  S+	database the object was found in
C  G-	checksum of the document  [gatherd only]
D  GU	time of last download
E  GU#	content-encoding (can occur multiple times for nested encodings)
F  G-#	frame reference
G  G+	image attributes: orig_w orig_h orig_colorspace orig_ncolors \
	thn_w thn_h thn_type [A]  ([A] marks animated images)
I  G-#	image reference
J  G+	time of last significant change ("jump") of object contents
K  IU#	catalog category
L  GU	HTTP last modified timestamp (in server time)
M  G+#	meta-information (title, keywords, ...): [<weight(0-3)>]<type-tag><text>  [some are per-URL]
N  G+@	base224-encoded image thumbnail [except mux]
   M+	thumbnail filename in cache
O  g	OID reference (only old gatherd)
   H-	URL footprint (only shepherd)
   S+	card ID reference
Q  S+	quality factor
R  G-#	reference to URL
S  G-	HTTP Server header
T  GU	content-type
U  HU	url [mandatory]
V  XU	last visited timestamp [supported only by shepherd, not gatherd]
W  g	wait == retry after (seconds)
   I #	weight calculation history (<one-letter-of-weight-type><value>)
    +	types: s (scanner), o (oook), d (dynamic weight), m (merged), p (penalized), \
    U	u (URL record), y (redirect), \
    x	x (reftext; the number of links is appended)
X  G+@	document text
   S+@	XML-ized snippet of document text
Y  G-	redirected to
Z  G-@	source stream (debugging only)
a  G-#	iframe reference
b  IU	frame backlink (max weight page containing this as a frame)
c  GU	charset
d  G-#	reference to description (<img longdesc=...> etc.)
e  G+	expires
f  G-#	form reference
g  G-	HTTP ETag
h  G-	how long did it take to gather this document (in ms)
i  IU	image backlink (max weight page containing this obj as image)
j  G-@	validator's judgement
k  g	queue key
   I+	length of useful content (number of alnum characters)
l  G-	language list
m  S+	more documents from this site compressed
n  S+	area ID
p  g	date of previous version (`J' from previous version)
q  G+	archiving contents forbidden by META tags (just passed through sherlockd, to be handled by front-ends)
r  G-#	filtering rule (from robots.txt)
s  GU	document size (before any encoding was applied)
t  S+	site id
v  G-	card version (this format is v1)
w  S+	weight
x  i-#	reference text: <url> <weight> <count> [<weight(0-3)>]<type-tag><text>
y  IU	back-redirect
z  IU	backlink (max weight page pointing to this one)
    x	source of the reftext
0-9	user attributes [not touched anywhere, but propagated]
.  ?+#	remark [can be generated anywhere]
!  G-	gathering result: <error code> <message>
   S+	object not shown due to inconsistencies in database
+  S+	reserved for footer start mark
(U I+#	start of per-URL block (here dwell all attributes marked with "U")
(y I+#	start of per-redirect block (they are inside "(U" blocks and contain "U" type attrs for redirect to the real URL)
(x I+#  start of per-reftext block (here dwell all attributes marked with "x"), reftext blocks are behind the body
)	end of block
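
For orientation, the block markers combine like this (a schematic only --
attribute placement and ordering are illustrative, not taken from a real card):

```
<global attributes>
(U
<attributes of the first URL, marked "U" above>
(y
<attributes of a redirect leading to this URL>
)
)
(U
<attributes of the second URL>
)
<body attributes>
(x
<attributes of one reftext, marked "x" above>
)
```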

REFERENCES
~~~~~~~~~~
Each reference (attributes A, F, I, R, Y, a, d, f) has the following syntax:
<url>[ <ref-id>[ <dont-follow>]]

<url> = the URL referred to
<ref-id> = reference ID number used in URL brackets in the extracted text
<dont-follow> = 1 if the link must not be followed
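
A reference line is simple enough to split with a single sscanf. A minimal
sketch in C (the struct and function names are mine, not from the Sherlock
sources; it assumes a NUL-terminated line with space-separated fields as above):

```c
#include <stdio.h>
#include <string.h>

struct ref {
    char url[1024];   /* the URL referred to */
    int ref_id;       /* -1 if the optional field is absent */
    int dont_follow;  /* 0 unless the third field says 1 */
};

/* Parse "<url>[ <ref-id>[ <dont-follow>]]" into *r.
 * Returns 0 on success, -1 on a malformed (empty) line. */
static int parse_ref(const char *line, struct ref *r)
{
    r->ref_id = -1;
    r->dont_follow = 0;
    int n = sscanf(line, "%1023s %d %d", r->url, &r->ref_id, &r->dont_follow);
    return n >= 1 ? 0 : -1;
}
```

The optional fields simply stay at their defaults when the line carries only
the URL.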

EXTRACTED TEXT
~~~~~~~~~~~~~~
The extracted text stored in gatherer output is represented as a UTF-8 string
consisting of printable characters, whitespace characters, and tags. Each tag
is composed of bytes in the range 0x80--0xbf, so that it cannot be mistaken for
valid UTF-8 text. Whitespace (only spaces and newlines are permitted)
and certain tags serve as word separators.

There are two types of tags: attribute changers and brackets. Attribute
changers switch word types; the new type remains valid until the next
word-type tag (the default word type at the start of the document is WT_TEXT;
the B bit indicates a sentence break):

    +-+-+-+-+-+-+-+-+
    |1|0|0|B| type  |
    +-+-+-+-+-+-+-+-+

Brackets delimit URL references inside the document. Each start bracket must
be matched with a corresponding end bracket, and the brackets must nest
correctly. Start brackets look like this:

    +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
    |1|0|1|0| high  | |1|0|    low    |
    +-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+

(high and low together form the 10-bit number of the reference).
End brackets are different:

    +-+-+-+-+-+-+-+-+
    |1|0|1|1|0|0|0|0|
    +-+-+-+-+-+-+-+-+

Attribute changers automatically break words, brackets don't.
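
The three layouts above translate into a few bit operations. A sketch in C
(names are mine; the bit patterns follow the diagrams):

```c
#include <stdint.h>

/* Attribute changer: 1 0 0 B tttt  (B = sentence break, tttt = word type) */
static uint8_t wt_tag(unsigned type, int sentence_break)
{
    return 0x80 | (sentence_break ? 0x10 : 0) | (type & 0x0f);
}

/* Start bracket: 1 0 1 0 hhhh followed by 1 0 llllll --
 * high and low together carry the 10-bit reference number. */
static void start_bracket(unsigned ref_id, uint8_t out[2])
{
    out[0] = 0xa0 | ((ref_id >> 6) & 0x0f);
    out[1] = 0x80 | (ref_id & 0x3f);
}

/* End bracket: 1 0 1 1 0 0 0 0 */
#define END_BRACKET 0xb0
```

Every byte produced falls into 0x80--0xbf, i.e. the UTF-8 continuation-byte
range, which is what keeps the tags distinguishable from the text itself.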

BUCKETS
~~~~~~~
There are the following types of buckets (cf. lib/bucket.h):

BUCKET_TYPE_COMPAT	Old-style buckets as generated by gatherd.
			They don't distinguish between bucket header and body,
			"H" and "G" classes (see above) are mixed together.
			The attributes are stored as follows:
				byte	attribute_name
				byte	content[]
				byte	'\n'

BUCKET_TYPE_V30		New-style buckets as generated by shepherd.
			They consist of a header (class "H" attributes) and a body (class "G")
			separated by a blank line. If there is no body (which is perfectly
			valid if the document hasn't been gathered yet), the blank line can
			be omitted as well. The attributes are stored in the same way as in
			BUCKET_TYPE_COMPAT.
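
As an illustration, a minimal BUCKET_TYPE_V30 bucket might look like this
(all values made up; "U" belongs to the header, the attributes after the
blank line to the body):

```
U http://www.example.com/

D 1044035116
T text/html
X An example of extracted document text
```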

BUCKET_TYPE_V33		Advanced version of BUCKET_TYPE_V30, optimized for zero-copy reading.
			The attributes are stored as follows:
				UTF-8	length+1
				byte	content[length]
				byte	attribute_name
			The header and the body are separated by a zero byte.
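
The "UTF-8 length+1" field stores the content length plus one in UTF-8-style
variable-length encoding, which is why a lone zero byte can unambiguously
separate header and body. A sketch in C of writing one V33 attribute,
simplified to values shorter than 127 bytes so the length fits in a single
byte (names are mine; a real encoder must also emit multi-byte lengths):

```c
#include <string.h>

/* Write one BUCKET_TYPE_V33 attribute: <utf8 len+1> <content> <name>.
 * Sketch only: handles len+1 < 0x80, i.e. a single-byte length.
 * Returns the number of bytes written, or -1 if the value is too long
 * for this simplified encoder. */
static int put_v33_attr(unsigned char *buf, char name, const char *content)
{
    size_t len = strlen(content);
    if (len + 1 >= 0x80)
        return -1;              /* would need a multi-byte UTF-8 length */
    buf[0] = (unsigned char)(len + 1);
    memcpy(buf + 1, content, len);
    buf[1 + len] = (unsigned char)name;
    return (int)(len + 2);
}
```

Putting the attribute name last is what makes zero-copy reading possible: the
length comes first, so a reader can locate the content without scanning for a
terminator.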

BUCKET_TYPE_V33_LIZARD	Compressed version of BUCKET_TYPE_V33 (using lib/lizard.c).
			The header and the separating zero byte are the same, but the body
			stream is compressed by the lizard and preceded by the 32-bit length
			of the original uncompressed body and its Adler-32 checksum. The body
			can be omitted.

ERROR CODES
~~~~~~~~~~~
00xx	Document processed OK (0=normal document, 1=redirect, 2=resolved a queue key, 3=not modified since, 4=not changed)
1sxx	Temporary errors
2sxx	Fatal errors

Each code has four decimal digits: the first gives the severity, the second
("s") the error source, and the last two ("xx") the specific error. In the
lists below, "c" stands for the severity digit.

c0xx	URL parsing errors (see lib/url.h for a list)
c1xx	downloader errors (0=unknown proto, 1=auth not supported, 2=timeout, 3=no host, 4=DNS timeout, 5=DNS error, 6=???,
	7=connect failed, 8=invalid response hdr, 9=hdr unexp close, 10=hdr line too long, 11=hdr too long,
	12=hdr syntax error, 13=max hdr lines exc, 14=no resp hdr, 15=invalid version, 16=invalid rsph,
	17=already expired, 18=not modified, 19=length mismatch, 20=???, 21=unknown xfer-encoding,
	22=illegal chunked encoding, 23=unexp close in chunked enc, 24=must be localhost, 25=is-a-directory, 26=path too long,
	27=invalid file name, 28=expected to be a directory, 29=name normalization, 30=HTTP error, 31=FILE errno,
	32=special files not indexed, 33=HTTP downgrade, 34=invalid IP address, 35=would be too large, 36=configuration error)
c2xx	parser errors (all of them fatal):
	00-19  common to all parsers: 0=format error, 1=line too long, 2=image manipulation error, 3=internal error
	20-39  PDF parser (see gather/format/pdf/pdf.h)
c3xx	gatherd errors (0=subprocess exited with rc<>0, 1=subprocess died, 2=subprocess timed out, 3=incomplete object,
	4=forbidden by robots.txt, 5=robots.txt is a redirect, 6=host known to be nonexistent, 7=filtered out)
c4xx	gatherer errors (0=unknown content type, 1=forbidden by filter, 2=too many conversions, 3=unknown content encoding,
	5=object too large)
c5xx	decoder errors (0=invalid header, 1=cannot decode)
c6xx	validator errors (0=child exited with error, 1=read error)
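
Given that layout, a result code splits into severity, source, and a two-digit
error number with plain decimal arithmetic. A small helper in C (names are
mine):

```c
/* Split a four-digit gatherer result code into its parts:
 * first digit = severity (0=OK, 1=temporary, 2=fatal),
 * second digit = error source, last two digits = specific error. */
struct err_code {
    int severity;
    int source;
    int code;
};

static struct err_code split_err(int full)
{
    struct err_code e;
    e.severity = full / 1000;
    e.source = (full / 100) % 10;
    e.code = full % 100;
    return e;
}
```

For example, 2104 decomposes into severity 2 (fatal), source 1 (downloader),
error 04.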
