vectorscan/doc/dev-reference/performance.rst

.. _perf:

##########################
Performance Considerations
##########################

Vectorscan supports a wide range of patterns in all three scanning modes. It is
capable of extremely high levels of performance, but certain patterns can
reduce performance markedly.

The following guidelines will help construct patterns and pattern sets that
will perform better:

*****************************
Regular expression constructs
*****************************

.. tip:: Do not hand-optimize regular expression constructs.

Quite a large number of regular expressions can be written in multiple ways.
For example, caseless matching of :regexp:`/abc/` can be written as:

* :regexp:`/[Aa][Bb][Cc]/`
* :regexp:`/(A|a)(B|b)(C|c)/`
* :regexp:`/(?i)abc(?-i)/`
* :regexp:`/abc/i`

Vectorscan is capable of handling all these constructs. Unless there is a
specific reason otherwise, do not rewrite patterns from one form to another.

As another example, matching of :regexp:`/foo(bar|baz)(frotz)?/` can be
equivalently written as:

* :regexp:`/foobarfrotz|foobazfrotz|foobar|foobaz/`

This change will not improve performance or reduce overheads.

*************
Library usage
*************

.. tip:: Do not hand-optimize library usage.

The Vectorscan library is capable of dealing with small writes, unusually large
and small pattern sets, etc. Unless there is a specific performance problem
with some usage of the library, it is best to use Vectorscan in a simple and
direct fashion. For example, it is unlikely for there to be much benefit in
buffering input to the library into larger blocks unless streaming writes are
tiny (say, 1-2 bytes at a time).

Unlike many other pattern matching products, Vectorscan will run faster with
small numbers of patterns and slower with large numbers of patterns in a smooth
fashion (as opposed to, typically, running at a moderate speed up to some fixed
limit then either breaking or running half as fast).

Vectorscan also provides high-throughput matching with a single thread of
control per core; if a database runs at 3.0 Gbps in Vectorscan it means that a
3000-bit block of data will be scanned in 1 microsecond in a single thread of
control, not that it is required to scan 22 3000-bit blocks of data in 22
microseconds. Thus, it is not usually necessary to buffer data to supply
Vectorscan with available parallelism.

********************
Block-based matching
********************

.. tip:: Prefer block-based matching to streaming matching where possible.

Whenever input data appears in discrete records, or already requires some sort
of transformation (e.g. URI normalization) that requires all the data to be
accumulated before processing, it should be scanned in block rather than in
streaming mode.

Unnecessary use of streaming mode reduces the number of optimizations that can
be applied in Vectorscan and may make some patterns run slower.

If there is a mixture of 'block' and 'streaming' mode patterns, these should be
scanned in separate databases except in the case that the streaming patterns
vastly outnumber the block mode patterns.

*********************
Unnecessary databases
*********************

.. tip:: Avoid unnecessary 'union' databases.

If there are 5 different types of network traffic T1 through T5 that must
be scanned against 5 different signature sets, it will be far more efficient to
construct 5 separate databases and scan traffic against the appropriate one
than it will be to merge all 5 signature sets and remove inappropriate matches
after the fact.

This will be true even in the case where there is substantial overlap among the
signatures. Only if the common subset of the signatures is overwhelmingly large
(say, 90% of the signatures appear in all 5 traffic types) should a database
that merges all 5 signature sets be considered, and only then if there are no
performance issues with specific patterns that appear outside the common
subset.

******************************
Allocate scratch ahead of time
******************************

.. tip:: Do not allocate scratch space for your pattern database just before
   calling a scan function. Instead, do it just after the pattern database is
   compiled or deserialized.

Scratch allocation is not necessarily a cheap operation. Since it is the first
time (after compilation or deserialization) that a pattern database is used,
Vectorscan performs some validation checks inside :c:func:`hs_alloc_scratch` and
must also allocate memory.

Therefore, it is important to ensure that :c:func:`hs_alloc_scratch` is not
called in the application's scanning path just before :c:func:`hs_scan` (for
example).

Instead, scratch should be allocated immediately after a pattern database is
compiled or deserialized, then retained for later scanning operations.

***********************************************
Allocate one scratch space per scanning context
***********************************************

.. tip:: A scratch space can be allocated so that it can be used with any one of
   a number of databases. Each concurrent scan operation (such as a thread)
   needs its own scratch space.

The :c:func:`hs_alloc_scratch` function can accept an existing scratch space and
"grow" it to support scanning with another pattern database. This means that
instead of allocating one scratch space for every database used by an
application, one can call :c:func:`hs_alloc_scratch` with a pointer to the same
:c:type:`hs_scratch_t` and it will be sized appropriately for use with any of
the given databases. For example:

.. code-block:: c

    hs_database_t *db1 = buildDatabaseOne();
    hs_database_t *db2 = buildDatabaseTwo();
    hs_database_t *db3 = buildDatabaseThree();

    hs_error_t err;
    hs_scratch_t *scratch = NULL;
    err = hs_alloc_scratch(db1, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }
    err = hs_alloc_scratch(db2, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }
    err = hs_alloc_scratch(db3, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }

    /* scratch may now be used to scan against any of
       the databases db1, db2, db3. */

*****************
Anchored patterns
*****************

.. tip:: If a pattern is meant to appear at the start of data, be sure to
   anchor it.

Anchored patterns (:regexp:`/^.../`) are far simpler to match than other
patterns, especially patterns anchored to the start of the buffer (or stream, in
streaming mode). Anchoring patterns to the end of the buffer results in less of
a performance gain, especially in streaming mode.

There are a variety of ways to anchor a pattern to a particular offset:

- The :regexp:`^` and :regexp:`\\A` constructs anchor the pattern to the start
  of the buffer. For example, :regexp:`/^foo/` can *only* match at offset 3.

- The :regexp:`$`, :regexp:`\\z` and :regexp:`\\Z` constructs anchor the pattern
  to the end of the buffer. For example, :regexp:`/foo\\z/` can only match when
  the data buffer being scanned ends in ``foo``. (It should be noted that
  :regexp:`$` and :regexp:`\\Z` will also match before a newline at the end of
  the buffer, so :regexp:`/foo\\z/` would match against either ``abc foo`` or
  ``abc foo\n``.)

- The ``min_offset`` and ``max_offset`` extended parameters may also be used to
  constrain where a pattern could match. For example, the pattern
  :regexp:`/foo/` with a ``max_offset`` of 10 will only match at offsets less
  than or equal to 10 in the buffer. (This pattern could also be written as
  :regexp:`/^.{0,7}foo/`, compiled with the :c:member:`HS_FLAG_DOTALL` flag).


*******************
Matching everywhere
*******************

.. tip:: Avoid patterns that match everywhere, and remember that our semantics
   are 'match everywhere, end of match only'.

Pattern that match everywhere will run slowly due to the sheer number of
matches that they return.

Patterns like :regexp:`/.*/` in an automata-based matcher will match before and
after every single character position, so a buffer with 100 characters will
return 101 matches. Greedy pattern matchers such as libpcre will return a
single match in this case, but our semantics is to return all matches. This is
likely to be very expensive for our code and for the client code of the
library.

Another result of our semantics ("match everywhere") is that patterns that have
optional start or ending sections -- for example :regexp:`/x?abcd*/` -- may not
perform as expected.

Firstly, the :regexp:`x?` portion of the pattern is unnecessary, as it will not
affect the match results.

Secondly, the above pattern will match 'more' than :regexp:`/abc/` but
:regexp:`/abc/` will always detect any input data that will be matched by
:regexp:`/x?abcd*/` -- it will just produce fewer matches.

For example, input data ``0123abcdddd`` will match :regexp:`/abc/` once but
:regexp:`/abcd*/` five times (at ``abc``, ``abcd``, ``abcdd``, ``abcddd``, and
``abcdddd``).

*********************************
Bounded repeats in streaming mode
*********************************

.. tip:: Bounded repeats are expensive in streaming mode.

A bounded repeat construction such as :regexp:`/X.{1000,1001}abcd/` is extremely
expensive in streaming mode, of necessity. It requires us to take action on
each ``X`` character (itself expensive, relative to searching for longer strings)
and potentially record a history of hundreds of offsets where ``X`` occurred in
case the ``X`` and ``abcd`` characters are separated by a stream boundary.

Heavy and unnecessary use of bounded repeats should be avoided, especially
where other parts of a signature are quite specific. For example, a virus
signature that matches a virus payload may be sufficient without including a
prefix that includes, for example, a 2-character Windows executable prefix and
a bounded repeat beforehand.

***************
Prefer literals
***************

.. tip:: Where possible, prefer patterns which 'require' literals, especially
   longer literals, and in streaming mode, prefer signatures that 'require'
   literals earlier in the pattern.

Patterns which must match on a literal will run faster than patterns that do
not. For example:

- :regexp:`/\\wab\\d*\\w\\w\\w/` will run faster than
- :regexp:`/\\w\\w\\d*\\w\\w/`, or, for that matter
- :regexp:`/\\w(abc)?\\d*\\w\\w\\w/` (this contains a literal but it need
  not appear in the input).

Even implicit literals are better than none: :regexp:`/[0-2][3-5].*\\w\\w/`
still effectively contains 9 2-character literals. No hand-optimization of this
case is required; this pattern will not run faster if rewritten as:
:regexp:`/(03|04|05|13|14|15|23|24|25).*\\w\\w/`.

Under all circumstances it is better to use longer literals than shorter ones.
A database consisting of 100 14-character literals will scan considerably
faster than one consisting of 100 4-character literals and return fewer
positives.

Additionally, in streaming mode, a signature that contains a longer literal
early in the pattern is preferred to one that does not.

For example: :regexp:`/b\\w*foobar/` is not as good a pattern as
:regexp:`/blah\\w*foobar/`.

The disparity between these patterns is much smaller in block mode.

Longer literals anywhere in the pattern are still preferred in streaming mode.
For example, both of the above patterns are stronger and will scan faster than
:regexp:`/b\\w*fo/` even in streaming mode.

**************
"Dot all" mode
**************

.. tip:: Use "dot all" mode where possible.

Not using the :c:member:`HS_FLAG_DOTALL` pattern flag can be expensive, as
implicitly, it means that patterns of the form :regexp:`/A.*B/` become
:regexp:`/A[^\\n]*B/`.

It is likely that scanning tasks without the DOTALL flag are better done 'line
at a time', with the newline sequences marking the beginning and end of each
block.

This will be true in most use-cases (an exception being where the DOTALL flag
is off but the pattern contains either explicit newlines or constructs such as
:regexp:`\\s` that implicitly match a newline character).

*****************
Single-match flag
*****************

.. tip:: Consider using the single-match flag to limit matches to one match per
   pattern only if possible.

If only one match per pattern is required, use the flag provided to indicate
this (:c:member:`HS_FLAG_SINGLEMATCH`). This flag can allow a number of
optimizations to be applied, allowing both performance improvements and state
space reductions when streaming.

However, there is some overhead associated with tracking whether each pattern in
the pattern set has matched, and some applications with infrequent matches may
see reduced performance when the single-match flag is used.

********************
Start of Match flag
********************

.. tip:: Do not request Start of Match information if it is not not needed.

Start of Match (SOM) information can be expensive to gather and can require
large amounts of stream state to store in streaming mode. As such, SOM
information should only be requested with the :c:member:`HS_FLAG_SOM_LEFTMOST`
flag for patterns that require it.

SOM information is not generally expected to be cheaper (in either performance
terms or in stream state overhead) than the use of bounded repeats.
Consequently, :regexp:`/foo.*bar/L` with a check on start of match values after
the callback is considerably more expensive and general than
:regexp:`/foo.{300}bar/`.

Similarly, the :cpp:member:`hs_expr_ext::min_length` extended parameter can be
used to specify a lower bound on the length of the matches for a pattern. Using
this facility may be more lightweight in some circumstances than using the SOM
flag and post-confirming match length in the calling application.

********************
Approximate matching
********************

.. tip:: Approximate matching is an experimental feature.

There is generally a performance impact associated with approximate matching due
to the reduced specificity of the matches. This impact may vary significantly
depending on the pattern and edit distance.