holmes (3.4.1) stable; urgency=high

  * An essential bugfix in the gatherer concerning compressed buckets.

 -- Robert Spalek <robert@ucw.cz>  Wed, 23 Feb 2005 17:30:00 +0100

holmes (3.4) stable; urgency=low

  * License changed from proprietary to GPL!  The free version is hence no
    longer limited to 100k documents.

  Major new features and changes:

  * Buckets and cards are now stored in a new format designed for fast
    zero-copy processing.  Moreover, the gatherer and the indexer can compress
    the bucket file and cards using the included, very fast compression
    library LiZaRd, saving about 30% of their size.
  * A lot of optimization work has been done, especially in the following
    modules: general sorter, bucket shakedown, computation of signatures, and
    chewer.  The use of regular expressions in time-critical places (URL
    keyword recognition and session ID testing) has been abandoned in
    favour of hard-coded C routines; the speedup is about 10x.  The memory
    requirements of the indexer have been decreased, and partial sorting
    (with some more seeking afterwards) is used where appropriate.  Sherlock
    as a whole has scaled up; we have successfully used it on 70M documents.
  * Texts of HTML links (reference texts) can now be indexed together with the
    text of a page.  They are attached at the end of the card together with
    the URL of the referring page.
  * Big cleanup in the customization interface; see doc/changes.  A new "bare"
    customization (as simple as possible) added.  Introduced late matchers,
    custom matchers, and custom statistics.  You can define your own filter
    functions.

  Minor new features and changes:

  * Gatherer:
    o HTML parser recognizes inline META robot control tags and it implements
      the new "rel=nofollow" attribute in tags containing links.  It also
      tries to avoid duplicate ALT and TITLE reference texts.
    o PDF parser supports new encryption formats.  Added a pdfdump utility for
      debugging.  Several problematic PDF documents are parsed correctly now.
      The parser has been split into a few modules and cleaned up.
    o Computation of a queue key (QKEY) works properly even for hosts with
      multiple IP-addresses.
    o Refreshing of documents is improved; we now check the differences
      against the original version before storing the new version.
    o The new charset guesser is more resistant against a few characters in a
      different charset inserted in the middle of the page (such as banner
      texts).  Content-type guesser logic is improved, too.
    o Added charset recognizer configuration for the Polish language.
    o Conformance to standards improved (URL's, HTTP protocol), but we also
      try to be more tolerant to others breaking the standards (especially in
      date parsing).
    o When a filter tries to reject downloading of robots.txt, its decision is
      ignored. Otherwise all sorts of strange things happen.

  * Indexer:
    o Indexer can now generate an index from a plain text file with buckets
      separated by an empty line instead of a bucket-file.
    o Several common singletons (such as & and +) added as context words.
    o Changed the format of the attributes file: file-type and language bits
      redistributed.
    o ireport reports filetype, language, and domain statistics.

  * Search server:
    o Search server can search on multiple indices, which are merged on-line.
    o Added a new CONTEXT FULL mode which dumps all the text we remember.
    o Added a sort mode which sorts only on a given attribute and ignores
      the Q-factor.
    o Added a hydra mode for SMP-systems, where several concurrent processes
      are run.
    o Fixed calculation of the 2nd best match.
    o Important bugfixes: no more stack overflows and SIGBUS's.
    o The example front-end works again.

  * Optimizations:
    o Several regular expression backends are supported; one engine (known to
      be _relatively_ stable and fast) has been imported into the source tree
      and set as the default.  However, as said above, we do not trust it
      enough and have rather rewritten some tests in C.
    o Filters: Added a new interval test: if (a =# 1 .. 10).  Switch commands
      consisting of == and =# tests are optimized to logarithmic complexity
      using red-black binary search trees.  We also added a new general
      utility for testing filters in various modes.

  * Cleanups:
    o Major cleanup of all libraries.  The old `libsh' has been split into a
      fairly generic `libucw' and parts specific to Holmes.  The charset
      library has been almost rewritten.
    o Cleanup of makefiles and configuration files.  The build system no longer
      writes into the source directories.
    o Numbers in configuration files can be specified in various units.
    o Improved and optimized debugging utilities (buckettool, idxdump, and
      dump-card).
    o All daemons create their lock files in a special lock-directory.
    o Added a brief introduction to the coding style (doc/codingstyle).
    o Added an automatic unit-testing framework (try `make tests').
    o Watson rewritten.

  * Other important issues:
    o Bugfix for some (especially Fedora) kernels: Sherlock now copes with
      pointers close to the 4GB limit.
    o Building of shared libraries now uses `gcc -shared' instead of talking
      directly with the linker, making shared libraries work on newer systems.

  * As usual, we have done many more bugfixes, optimizations, cleanups, and
    code polishing.

 -- Robert Spalek <robert@ucw.cz>  Tue, 22 Feb 2005 15:27:00 +0100

holmes (2.6.2) stable; urgency=low

  * Released a parser for PDF documents, previously only available in
    the commercial version.

 -- Martin Mares <mj@ucw.cz>  Sun,  8 Feb 2004 17:15:34 +0100

holmes (2.6.1) stable; urgency=low

  * Released as stable version, no changes against -beta2.

 -- Martin Mares <mj@ucw.cz>  Wed, 21 Jan 2004 14:22:10 +0100

holmes (2.6.0-beta2) unstable; urgency=low

  * Bugfix in refs.c: type restriction.
  * Customization: added meta-type names URLWORD and FILE.  They are always
    matched without accents.

 -- Robert Spalek <robert@ucw.cz>  Mon, 15 Dec 2003 13:48:00 +0200

holmes (2.6.0-beta1) unstable; urgency=low

  * Fixed bug in lexmapper and cards.c improperly handling words with
    ligatures.
  * Fixed magic complexes in search-server.
  * New auto-accent rules for words taken from URLs and one auto-accent
    bugfix.
  * EXPLAIN command also prints wordtypes of the words.
  * A few other cleanups.

 -- Robert Spalek <robert@ucw.cz>  Mon,  8 Dec 2003 16:43:00 +0200

holmes (2.6.0-alpha3) unstable; urgency=low

  * Fixed bug in lexmapper that prevented indexing of context words after a
    break.

 -- Robert Spalek <robert@ucw.cz>  Tue,  2 Dec 2003 14:14:00 +0200

holmes (2.6.0-alpha2) unstable; urgency=low

  * Script for finding equivalent sites (cf/url-equiv) updated.
  * A few tiny changes and fixes: penalization messages in cards, URL keywords
    handled properly for context words, urlkey does not remove WWW-prefix
    twice, cards do not crash on long documents, dump-card calls `less'
    properly, and url-equiv not present in CVS.

 -- Robert Spalek <robert@ucw.cz>  Thu, 27 Nov 2003 17:50:00 +0200

holmes (2.6.0-alpha1) unstable; urgency=low

  * Added interface to external parsers (e.g. application/postscript or
    application/msword) and added patterns for recognizing such files.  The
    parsers are commented out in the default configuration.
  * Fixes in gatherer: guessing content types according to patterns, resolving
    IP addresses, and HTML validation.
  * Charset scripts polished.
  * Added support for Unicode ligatures.

  * New module keywords collects URL and Filename keywords, prunes them
    according to a lot of criteria, and attaches them to the labels of the
    cards.
  * Documents can be penalized if they contain no title, no outgoing links, or
    too short contents.  Also, if the document is written mostly in
    high-valued word types (headers, bold text), we call it swindling and
    remap it to normal text.
  * History of document weight changes done in indexer modules (including all
    penalizations) is recorded in the output card.
  * Giant classes can also be detected by the number of incoming redirects.
  * '@' is now considered a word, hence searching in e-mail addresses is
    easier now.  Indexer uses a new general module alphabet for lexical
    mapping.
  * Introduced redirect brackets (y ...) inside URL brackets (U ...).  The
    output of the search server is much cleaner now, because all information
    assigned to the URL is placed in the corresponding block.  The format of
    several internal indexer files has been changed.
  * Indexer script has been polished and you can specify the file deletion
    level.
  * Fixes in indexer: backlinker split into 2 parts (each called at a
    different stage), document is not cut so aggressively when dumped into
    cards, context-dependent words, max_chain_len deleted, format of log
    files, and mkgraph rewritten almost from scratch so that it properly
    merges vertices that are merged by cf/url-equiv.

  * cards.c: shorter output by skipping superfluous <break>'s and renaming some
    attributes, adapted to redirect brackets, which allows proper word
    positioning, and weighting of URL's and redirects tuned a lot and can use
    pagerank.
  * query.c: changed reporting of per-filetype stats and number of matched
    documents.  Partial answer mode changed.
  * refs.c: second match calculation and EXPLAIN messages improved.
  * words.c: all found lemma's of search words.

  * A few fixes in the library: partmap.
  * Totally improved dump-card: can convert between ID<->OID, resolve URL's
    both from bucket-file and from search server, find duplicates, links, ...
  * idxdump and objdump: use fastbufs, adapted to the new formats, better
    formatting of cards and charset conversion.

 -- Robert Spalek <robert@ucw.cz>  Mon, 24 Nov 2003 17:30:00 +0200

holmes (2.5.0-alpha2) unstable; urgency=low

  * WWW-prefix hack: gatherer handles redirects between different URLs with
    the same urlkey better, e.g. domain.cz and www.domain.cz.
  * cards hack: cut URL records that are too long.  This can occur for huge
    equivalence classes that have gathered too much catalog data.
  * cards fix: handle words without a position better instead of inheriting
    the position of the last word.

 -- Robert Spalek <robert@ucw.cz>  Mon, 24 Nov 2003 17:25:00 +0200

holmes (2.5.0-alpha1) unstable; urgency=low

  * Implemented priority queueing in the gatherer. Priorities are calculated
    according to document ages and they can be altered by the filters. Gathering
    should run much smoother now. Also, gatherer queue keys are now accessible
    by the filters.
  * Introduced a table of equivalent servers which is used by both the gatherer
    and the indexer.
  * Added tables for many character sets including the whole ISO-8859-x repertoire.
  * A lot of optimizations in the search server, some of them needed changes
    of index file formats.
  * Replaced indexing of word complexes (word + pre-/postpositions) by a much
    more flexible and faster mechanism of context-dependent words.
  * Reworked calculation of document weights. The scanner now assigns
    Indexer.DefaultDocumentWeight to every document and modifies it according
    to a bonus defined by the filters (see bonus and final_bonus in doc/filter).
    Other indexer modules can contribute their bonuses and penalties as well.
    The history of weight calculation for every document is tracked in its
    "W" attribute to simplify debugging.
  * Document titles are accessible to the filters now.
  * Meta tags for robot control can now also use "noarchive" as Google does.
    This tag is just recorded in the object attributes and propagated to the
    front-end which should avoid offering the link to full document text
    in such cases.
  * Added EXPLAIN mode to the search server in which it dumps details of
    weight calculations for each document. However, this incurs a slight
    overhead to processing of all queries even if EXPLAIN is switched off,
    so this feature is available only if CONFIG_EXPLAIN is enabled in config.mk.
  * Added the Watson Monitoring System which replaces the ancient scripts
    for parsing of logs and drawing of performance graphs.
  * Query language: LINK has been renamed to EXT and REF aliased to LINK
    to be compatible with other search engines.
  * Changed interface to customization modules. Each customized version
    now dwells in its own directory which should be linked to "custom"
    in the package root. The config.mk has been moved to this directory.
    The free version also acts as one of the customizations.
  * Added generic red-black trees and binomial heaps to the library.
  * Debugging utilities have been moved from utils/ to debug/.
  * As usual, fixed a lot of bugs and sped many modules up.

 -- Martin Mares <mj@ucw.cz>  Fri,  3 Oct 2003 19:12:35 +0200

holmes (2.4.1) stable; urgency=low

  * Configuration files are now truly preprocessed during installation
    and they can contain blocks conditional on config.mk switches.
  * Added automatic detection of languages based on statistical methods,
    including a utility for automated tuning according to a large corpus.
  * Better processing of framesets -- you can choose to either use the
    noframe version like before or to treat the page as a redirect to
    the frame containing the largest amount of text.
  * Stemming and lemmatization. This release contains only the basic
    Porter's stemmer for English, but other stemmers can be added in
    a modular fashion.
  * Use of synonymic dictionaries.
  * Meta information has been separated from the main text of the document
    and its weighting can be much finer.
  * Spelling checker based on frequencies of words in the indexed data.
  * Finer weighting of word and phrase matches.
  * Speed optimizations of the filter interpreter and also in many other
    places.
  * Better locking strategy of the gatherer database -- read-only queries
    can be done even if the gatherer is running, although the consistency
    of query results is not guaranteed.
  * Automatic recovery of gatherer databases after incorrect shutdown.
  * All replies of the search server are now explicitly structured using
    bracket attributes (see doc/objects), so there is no need to use
    heuristic rules to group them. Also updated the front-end library to
    make use of that.
  * More robust HTML parser (added several work-arounds for common errors).
  * Added an "ireport" utility for generating reports on equivalence classes
    of documents in the generated index.
  * Lexicon configuration is now stored in the index, so the search server
    no longer needs to run with the same configuration as the indexer.
  * Added generic sorting routine generator to the library.
  * Searching according to file type and other parameters implemented.
  * Rewritten core of the reference processing in the search server, giving
    better performance; it will now be much easier to add new weighting
    rules local to words and index compression.
  * Lots of minor bug fixes and performance tweaks as usual.

 -- Martin Mares <mj@ucw.cz>  Mon,  2 Jun 2003 21:29:51 +0200

holmes (2.3.1) unstable; urgency=low

  * Tuned default values of configurable parameters and improved config
    file comments.
  * New near matcher giving much better results.
  * Filter language: "include" directive added.
  * Query language: SORTBY CARDID added.
  * Added a database recovery utility (grecover).
  * Minor bug fixes and improvements.

 -- Martin Mares <mj@ucw.cz>  Tue, 17 Dec 2002 15:12:01 +0100

holmes (2.3.0-alpha1) unstable; urgency=low

  * An alpha pre-release of Holmes 2.3.
  * Memory mapping of files is now available.
  * All libraries are now shared.
  * Minor bug fixes.

 -- Martin Mares <mj@ucw.cz>  Fri, 11 Oct 2002 13:35:12 +0200

holmes (2.2.1) unstable; urgency=low

  * Debian package by Robert.
  * Build for i386 by default.
  * Added support for HTTP authentication.
  * More work on the fastbuf library to speed it up. Also added a
    memory mapped file access module, but it isn't used anywhere yet.
  * Updated installation guide.

 -- Martin Mares <mj@ucw.cz>  Thu,  3 Oct 2002 23:00:37 +0200

holmes (2.2.0) unstable; urgency=low

  * First public release.

 -- Martin Mares <mj@ucw.cz>  Tue, 24 Sep 2002 22:06:12 +0200
