.. _perf:

##########################
Performance Considerations
##########################

Hyperscan supports a wide range of patterns in all three scanning modes. It is
capable of extremely high levels of performance, but certain patterns can
reduce performance markedly.

The following guidelines will help construct patterns and pattern sets that
will perform better:

*****************************
Regular expression constructs
*****************************

.. tip:: Do not hand-optimize regular expression constructs.

Quite a large number of regular expressions can be written in multiple ways.
For example, caseless matching of :regexp:`/abc/` can be written as:

* :regexp:`/[Aa][Bb][Cc]/`
* :regexp:`/(A|a)(B|b)(C|c)/`
* :regexp:`/(?i)abc(?-i)/`
* :regexp:`/abc/i`

Hyperscan is capable of handling all these constructs. Unless there is a
specific reason otherwise, do not rewrite patterns from one form to another.

As another example, matching of :regexp:`/foo(bar|baz)(frotz)?/` can be
equivalently written as:

* :regexp:`/foobarfrotz|foobazfrotz|foobar|foobaz/`

This change will not improve performance or reduce overheads.

*************
Library usage
*************

.. tip:: Do not hand-optimize library usage.

The Hyperscan library is capable of dealing with small writes, unusually large
and small pattern sets, etc. Unless there is a specific performance problem
with some usage of the library, it is best to use Hyperscan in a simple and
direct fashion. For example, it is unlikely for there to be much benefit in
buffering input to the library into larger blocks unless streaming writes are
tiny (say, 1-2 bytes at a time).

Unlike many other pattern matching products, Hyperscan degrades smoothly as
the pattern count grows: it runs faster with small numbers of patterns and
slower with large numbers of patterns, rather than running at a moderate speed
up to some fixed limit and then either breaking or running half as fast.

Hyperscan also provides high-throughput matching with a single thread of
control per core; if a database runs at 3.0 Gbps in Hyperscan, it means that a
3000-bit block of data will be scanned in 1 microsecond in a single thread of
control, not that it is necessary to scan 22 3000-bit blocks of data in 22
microseconds to reach that rate. Thus, it is not usually necessary to buffer
data to supply Hyperscan with available parallelism.

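The arithmetic behind the throughput figure above can be checked directly. The
following is a minimal sketch in plain C; ``scan_time_us`` is a hypothetical
helper written for this illustration and is not part of the Hyperscan API.

.. code-block:: c

    /* Hypothetical helper (not part of Hyperscan): time in microseconds
     * needed to scan `bits` bits of input on a single thread running at
     * `gbps` gigabits per second. 1 Gbps is 1000 bits per microsecond. */
    double scan_time_us(double bits, double gbps) {
        return bits / (gbps * 1000.0);
    }

For example, ``scan_time_us(3000.0, 3.0)`` evaluates to 1.0: a single 3000-bit
block takes one microsecond at 3.0 Gbps, with no batching required.
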
********************
Block-based matching
********************

.. tip:: Prefer block-based matching to streaming matching where possible.

Whenever input data appears in discrete records, or already requires some sort
of transformation (e.g. URI normalization) that requires all the data to be
accumulated before processing, it should be scanned in block rather than in
streaming mode.

Unnecessary use of streaming mode reduces the number of optimizations that can
be applied in Hyperscan and may make some patterns run slower.

If there is a mixture of 'block' and 'streaming' mode patterns, these should be
scanned in separate databases except in the case that the streaming patterns
vastly outnumber the block mode patterns.

*********************
Unnecessary databases
*********************

.. tip:: Avoid unnecessary 'union' databases.

If there are 5 different types of network traffic T1 through T5 that must
be scanned against 5 different signature sets, it will be far more efficient to
construct 5 separate databases and scan traffic against the appropriate one
than it will be to merge all 5 signature sets and remove inappropriate matches
after the fact.

This will be true even in the case where there is substantial overlap among the
signatures. Only if the common subset of the signatures is overwhelmingly large
(say, 90% of the signatures appear in all 5 traffic types) should a database
that merges all 5 signature sets be considered, and only then if there are no
performance issues with specific patterns that appear outside the common
subset.

******************************
Allocate scratch ahead of time
******************************

.. tip:: Do not allocate scratch space for your pattern database just before
   calling a scan function. Instead, do it just after the pattern database is
   compiled or deserialized.

Scratch allocation is not necessarily a cheap operation. Since it is the first
time (after compilation or deserialization) that a pattern database is used,
Hyperscan performs some validation checks inside :c:func:`hs_alloc_scratch` and
must also allocate memory.

Therefore, it is important to ensure that :c:func:`hs_alloc_scratch` is not
called in the application's scanning path just before :c:func:`hs_scan` (for
example).

Instead, scratch should be allocated immediately after a pattern database is
compiled or deserialized, then retained for later scanning operations.

***********************************************
Allocate one scratch space per scanning context
***********************************************

.. tip:: A scratch space can be allocated so that it can be used with any one of
   a number of databases. Each concurrent scan operation (such as a thread)
   needs its own scratch space.

The :c:func:`hs_alloc_scratch` function can accept an existing scratch space and
"grow" it to support scanning with another pattern database. This means that
instead of allocating one scratch space for every database used by an
application, one can call :c:func:`hs_alloc_scratch` with a pointer to the same
:c:type:`hs_scratch_t` and it will be sized appropriately for use with any of
the given databases. For example:

.. code-block:: c

    hs_database_t *db1 = buildDatabaseOne();
    hs_database_t *db2 = buildDatabaseTwo();
    hs_database_t *db3 = buildDatabaseThree();

    hs_error_t err;
    hs_scratch_t *scratch = NULL; /* must be NULL for the first allocation */

    /* First call allocates the scratch space for db1. */
    err = hs_alloc_scratch(db1, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!\n");
        exit(1);
    }
    /* Later calls grow the same scratch space if db2 or db3 need more. */
    err = hs_alloc_scratch(db2, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!\n");
        exit(1);
    }
    err = hs_alloc_scratch(db3, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!\n");
        exit(1);
    }

    /* scratch may now be used to scan against any of
       the databases db1, db2, db3. */

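Because each concurrent scan needs its own scratch space, one possible
arrangement is to size a single prototype scratch for every database as shown
above and then clone it once per thread with :c:func:`hs_clone_scratch`. The
following is a minimal sketch under that assumption; the helper name
``make_thread_scratches`` and the array-based layout are illustrative only, and
the ``<hs.h>`` include path assumes the Hyperscan headers are on the include
path.

.. code-block:: c

    #include <hs.h>
    #include <stddef.h>

    /* Illustrative helper (not part of the Hyperscan API): fill out[0..n-1]
     * with independent scratch spaces cloned from `proto`, one per scanning
     * thread. Returns HS_SUCCESS, or the first error after freeing any
     * clones that were already made. */
    hs_error_t make_thread_scratches(const hs_scratch_t *proto,
                                     hs_scratch_t **out, size_t n) {
        for (size_t i = 0; i < n; i++) {
            out[i] = NULL;
            hs_error_t err = hs_clone_scratch(proto, &out[i]);
            if (err != HS_SUCCESS) {
                /* Roll back the clones created so far. */
                while (i > 0) {
                    hs_free_scratch(out[--i]);
                    out[i] = NULL;
                }
                return err;
            }
        }
        return HS_SUCCESS;
    }

Each thread then passes its own entry of ``out`` to the scan functions, and
every clone is released with :c:func:`hs_free_scratch` at shutdown.
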
*****************
Anchored patterns
*****************

.. tip:: If a pattern is meant to appear at the start of data, be sure to
   anchor it.

Anchored patterns (:regexp:`/^.../`) are far simpler to match than other
patterns, especially patterns anchored to the start of the buffer (or stream, in
streaming mode). Anchoring patterns to the end of the buffer results in less of
a performance gain, especially in streaming mode.

There are a variety of ways to anchor a pattern to a particular offset:

- The :regexp:`^` and :regexp:`\\A` constructs anchor the pattern to the start
  of the buffer. For example, :regexp:`/^foo/` can *only* match at offset 3.

- The :regexp:`$`, :regexp:`\\z` and :regexp:`\\Z` constructs anchor the pattern
  to the end of the buffer. For example, :regexp:`/foo\\z/` can only match when
  the data buffer being scanned ends in ``foo``. (It should be noted that
  :regexp:`$` and :regexp:`\\Z` will also match before a newline at the end of
  the buffer, so :regexp:`/foo\\Z/` would match against either ``abc foo`` or
  ``abc foo\n``.)

- The ``min_offset`` and ``max_offset`` extended parameters may also be used to
  constrain where a pattern could match. For example, the pattern
  :regexp:`/foo/` with a ``max_offset`` of 10 will only match at offsets less
  than or equal to 10 in the buffer. (This pattern could also be written as
  :regexp:`/^.{0,7}foo/`, compiled with the :c:member:`HS_FLAG_DOTALL` flag).

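The ``max_offset`` constraint described in the last item can be supplied
through :c:func:`hs_compile_ext_multi` and an :c:type:`hs_expr_ext_t`
structure. The following is a minimal sketch, assuming the Hyperscan headers
are available as ``<hs.h>``; the names ``compile_with_max_offset`` and
``count_match`` are hypothetical helpers written for this illustration.

.. code-block:: c

    #include <hs.h>
    #include <stddef.h>

    /* Match callback: counts matches in the unsigned int pointed to by ctx. */
    int count_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
        (void)id; (void)from; (void)to; (void)flags;
        (*(unsigned int *)ctx)++;
        return 0; /* zero means "continue scanning" */
    }

    /* Illustrative helper (not part of the Hyperscan API): compile `pattern`
     * as a block-mode database in which a match may only end at or before
     * offset `max_end`. */
    hs_error_t compile_with_max_offset(const char *pattern,
                                       unsigned long long max_end,
                                       hs_database_t **db) {
        hs_expr_ext_t ext = {0};
        ext.flags = HS_EXT_FLAG_MAX_OFFSET; /* only max_offset is in use */
        ext.max_offset = max_end;

        const char *exprs[] = { pattern };
        unsigned int flags[] = { 0 };
        unsigned int ids[] = { 0 };
        const hs_expr_ext_t *exts[] = { &ext };

        hs_compile_error_t *compile_err = NULL;
        hs_error_t err = hs_compile_ext_multi(exprs, flags, ids, exts, 1,
                                              HS_MODE_BLOCK, NULL, db,
                                              &compile_err);
        if (err != HS_SUCCESS) {
            hs_free_compile_error(compile_err);
        }
        return err;
    }

With ``compile_with_max_offset("foo", 10, &db)``, scanning ``0123456foo``
(where ``foo`` ends at offset 10) reports a match through ``count_match``,
while scanning ``01234567foo`` (where it ends at offset 11) does not.
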
*******************
Matching everywhere
*******************

.. tip:: Avoid patterns that match everywhere, and remember that our semantics
   are 'match everywhere, end of match only'.

Patterns that match everywhere will run slowly due to the sheer number of
matches that they return.

Patterns like :regexp:`/.*/` in an automata-based matcher will match before and
after every single character position, so a buffer with 100 characters will
return 101 matches. Greedy pattern matchers such as libpcre will return a
single match in this case, but our semantics are to return all matches. This is
likely to be very expensive for our code and for the client code of the
library.

Another result of our semantics ("match everywhere") is that patterns that have
optional start or ending sections -- for example :regexp:`/x?abcd*/` -- may not
perform as expected.

Firstly, the :regexp:`x?` portion of the pattern is unnecessary, as it will not
affect the match results.

Secondly, the above pattern will match 'more' than :regexp:`/abc/` but
:regexp:`/abc/` will always detect any input data that will be matched by
:regexp:`/x?abcd*/` -- it will just produce fewer matches.

For example, input data ``0123abcdddd`` will match :regexp:`/abc/` once but
:regexp:`/abcd*/` five times (at ``abc``, ``abcd``, ``abcdd``, ``abcddd``, and
``abcdddd``).

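The match counts in the example above can be reproduced with a small counting
callback. The following is a minimal sketch, assuming the Hyperscan headers are
available as ``<hs.h>``; ``on_match`` and ``count_matches`` are hypothetical
helper names, and the scratch space is allocated inside the helper only to keep
the illustration short.

.. code-block:: c

    #include <hs.h>
    #include <string.h>

    /* Match callback: increments the counter passed as the scan context. */
    int on_match(unsigned int id, unsigned long long from,
                 unsigned long long to, unsigned int flags, void *ctx) {
        (void)id; (void)from; (void)to; (void)flags;
        (*(unsigned int *)ctx)++;
        return 0; /* zero means "continue scanning" */
    }

    /* Illustrative helper (not part of the Hyperscan API): compile `pattern`
     * in block mode, scan the NUL-terminated string `data`, and return the
     * number of matches reported, or -1 on any error. */
    long count_matches(const char *pattern, const char *data) {
        hs_database_t *db = NULL;
        hs_compile_error_t *compile_err = NULL;
        if (hs_compile(pattern, HS_FLAG_DOTALL, HS_MODE_BLOCK, NULL, &db,
                       &compile_err) != HS_SUCCESS) {
            hs_free_compile_error(compile_err);
            return -1;
        }

        hs_scratch_t *scratch = NULL;
        if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS) {
            hs_free_database(db);
            return -1;
        }

        unsigned int count = 0;
        hs_error_t err = hs_scan(db, data, (unsigned int)strlen(data), 0,
                                 scratch, on_match, &count);
        hs_free_scratch(scratch);
        hs_free_database(db);
        return err == HS_SUCCESS ? (long)count : -1;
    }

Calling ``count_matches("abc", "0123abcdddd")`` and
``count_matches("abcd*", "0123abcdddd")`` shows the difference in callback
volume between the two patterns on the same input.
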
*********************************
Bounded repeats in streaming mode
*********************************

.. tip:: Bounded repeats are expensive in streaming mode.

A bounded repeat construction such as :regexp:`/X.{1000,1001}abcd/` is extremely
expensive in streaming mode, of necessity. It requires us to take action on
each ``X`` character (itself expensive, relative to searching for longer strings)
and potentially record a history of hundreds of offsets where ``X`` occurred in
case the ``X`` and ``abcd`` characters are separated by a stream boundary.

Heavy and unnecessary use of bounded repeats should be avoided, especially
where other parts of a signature are quite specific. For example, a virus
signature that matches a virus payload may be sufficient without including a
prefix that includes, for example, a 2-character Windows executable prefix and
a bounded repeat beforehand.

***************
Prefer literals
***************

.. tip:: Where possible, prefer patterns which 'require' literals, especially
   longer literals, and in streaming mode, prefer signatures that 'require'
   literals earlier in the pattern.

Patterns which must match on a literal will run faster than patterns that do
not. For example:

- :regexp:`/\\wab\\d*\\w\\w\\w/` will run faster than
- :regexp:`/\\w\\w\\d*\\w\\w/`, or, for that matter
- :regexp:`/\\w(abc)?\\d*\\w\\w\\w/` (this contains a literal but it need
  not appear in the input).

Even implicit literals are better than none: :regexp:`/[0-2][3-5].*\\w\\w/`
still effectively contains 9 2-character literals. No hand-optimization of this
case is required; this pattern will not run faster if rewritten as:
:regexp:`/(03|04|05|13|14|15|23|24|25).*\\w\\w/`.

Under all circumstances it is better to use longer literals than shorter ones.
A database consisting of 100 14-character literals will scan considerably
faster than one consisting of 100 4-character literals and return fewer
positives.

Additionally, in streaming mode, a signature that contains a longer literal
early in the pattern is preferred to one that does not.

For example: :regexp:`/b\\w*foobar/` is not as good a pattern as
:regexp:`/blah\\w*foobar/`.

The disparity between these patterns is much smaller in block mode.

Longer literals anywhere in the pattern are still preferred in streaming mode.
For example, both of the above patterns are stronger and will scan faster than
:regexp:`/b\\w*fo/` even in streaming mode.

**************
"Dot all" mode
**************

.. tip:: Use "dot all" mode where possible.

Not using the :c:member:`HS_FLAG_DOTALL` pattern flag can be expensive, as
implicitly, it means that patterns of the form :regexp:`/A.*B/` become
:regexp:`/A[^\\n]*B/`.

It is likely that scanning tasks without the DOTALL flag are better done 'line
at a time', with the newline sequences marking the beginning and end of each
block.

This will be true in most use-cases (an exception being where the DOTALL flag
is off but the pattern contains either explicit newlines or constructs such as
:regexp:`\\s` that implicitly match a newline character).

*****************
Single-match flag
*****************

.. tip:: Where possible, consider using the single-match flag to limit matching
   to one match per pattern.

If only one match per pattern is required, use the flag provided to indicate
this (:c:member:`HS_FLAG_SINGLEMATCH`). This flag can allow a number of
optimizations to be applied, allowing both performance improvements and state
space reductions when streaming.

However, there is some overhead associated with tracking whether each pattern in
the pattern set has matched, and some applications with infrequent matches may
see reduced performance when the single-match flag is used.

*******************
Start of Match flag
*******************

.. tip:: Do not request Start of Match information if it is not needed.

Start of Match (SOM) information can be expensive to gather and can require
large amounts of stream state to store in streaming mode. As such, SOM
information should only be requested with the :c:member:`HS_FLAG_SOM_LEFTMOST`
flag for patterns that require it.

SOM information is not generally expected to be cheaper (in either performance
terms or in stream state overhead) than the use of bounded repeats.
Consequently, :regexp:`/foo.*bar/L` with a check on start of match values after
the callback is considerably more expensive and general than
:regexp:`/foo.{300}bar/`.

Similarly, the :c:member:`hs_expr_ext::min_length` extended parameter can be
used to specify a lower bound on the length of the matches for a pattern. Using
this facility may be more lightweight in some circumstances than using the SOM
flag and post-confirming match length in the calling application.