mirror of
https://github.com/VectorCamp/vectorscan.git
synced 2025-06-28 16:41:01 +03:00
The generated documentation continues to refer to Hyperscan despite the project now being VectorScan. Lets replace many of the Hyperscan references with Vectorscan. At the same time, lets resync the documentation here with the vectorscan readme. This updates the supported platforms/compilers and build options. Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>

.. _perf:

##########################
Performance Considerations
##########################

Vectorscan supports a wide range of patterns in all three scanning modes. It is
capable of extremely high levels of performance, but certain patterns can
reduce performance markedly.

The following guidelines will help construct patterns and pattern sets that
will perform better:

*****************************
Regular expression constructs
*****************************

.. tip:: Do not hand-optimize regular expression constructs.

Quite a large number of regular expressions can be written in multiple ways.
For example, caseless matching of :regexp:`/abc/` can be written as:

* :regexp:`/[Aa][Bb][Cc]/`
* :regexp:`/(A|a)(B|b)(C|c)/`
* :regexp:`/(?i)abc(?-i)/`
* :regexp:`/abc/i`

Vectorscan is capable of handling all these constructs. Unless there is a
specific reason otherwise, do not rewrite patterns from one form to another.

As another example, matching of :regexp:`/foo(bar|baz)(frotz)?/` can be
equivalently written as:

* :regexp:`/foobarfrotz|foobazfrotz|foobar|foobaz/`

This change will not improve performance or reduce overheads.

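The equivalence of the caseless spellings can be checked with a small sketch.
It deliberately uses the POSIX ``regex.h`` interface rather than Vectorscan, so
it says nothing about performance; the helper names ``posix_matches`` and
``caseless_forms_agree`` are hypothetical, and ``REG_ICASE`` stands in for the
``/i`` form because POSIX has no inline ``(?i)`` syntax.

.. code-block:: c

    #include <assert.h>
    #include <regex.h>
    #include <stddef.h>

    /* Returns 1 if the POSIX extended regex `pattern` matches anywhere in
     * `text`, 0 otherwise. `cflags` are extra regcomp() flags. */
    static int posix_matches(const char *pattern, int cflags, const char *text)
    {
        regex_t re;
        int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags);
        assert(rc == 0); /* the illustrative patterns used here are valid */
        rc = regexec(&re, text, 0, NULL, 0);
        regfree(&re);
        return rc == 0;
    }

    /* Returns 1 if three spellings of caseless /abc/ agree on `text`. */
    static int caseless_forms_agree(const char *text)
    {
        int a = posix_matches("[Aa][Bb][Cc]", 0, text);
        int b = posix_matches("(A|a)(B|b)(C|c)", 0, text);
        int c = posix_matches("abc", REG_ICASE, text);
        return a == b && b == c;
    }

Since every spelling accepts the same inputs, the choice between them can be
made on readability alone.
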

*************
Library usage
*************

.. tip:: Do not hand-optimize library usage.

The Vectorscan library is capable of dealing with small writes, unusually large
and small pattern sets, etc. Unless there is a specific performance problem
with some usage of the library, it is best to use Vectorscan in a simple and
direct fashion. For example, it is unlikely for there to be much benefit in
buffering input to the library into larger blocks unless streaming writes are
tiny (say, 1-2 bytes at a time).

Unlike many other pattern matching products, Vectorscan will run faster with
small numbers of patterns and slower with large numbers of patterns in a smooth
fashion (as opposed to, typically, running at a moderate speed up to some fixed
limit then either breaking or running half as fast).

Vectorscan also provides high-throughput matching with a single thread of
control per core; if a database runs at 3.0 Gbps in Vectorscan it means that a
3000-bit block of data will be scanned in 1 microsecond in a single thread of
control, not that it is required to scan 22 3000-bit blocks of data in 22
microseconds. Thus, it is not usually necessary to buffer data to supply
Vectorscan with available parallelism.

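The throughput figure above is plain arithmetic: 1 Gbps is 1000 bits per
microsecond. A minimal sketch, with ``scan_time_us`` as a hypothetical helper
that is not part of the library:

.. code-block:: c

    #include <assert.h>

    /* Time in microseconds to scan `bits` of data on one thread at `gbps`
     * gigabits per second (1 Gbps == 1000 bits per microsecond). */
    static double scan_time_us(double bits, double gbps)
    {
        assert(gbps > 0.0);
        return bits / (gbps * 1000.0);
    }

``scan_time_us(3000.0, 3.0)`` evaluates to ``1.0``, so each 3000-bit block is
finished one microsecond after it is submitted, without waiting for further
blocks to arrive.
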

********************
Block-based matching
********************

.. tip:: Prefer block-based matching to streaming matching where possible.

Whenever input data appears in discrete records, or already requires some sort
of transformation (e.g. URI normalization) that requires all the data to be
accumulated before processing, it should be scanned in block rather than in
streaming mode.

Unnecessary use of streaming mode reduces the number of optimizations that can
be applied in Vectorscan and may make some patterns run slower.

If there is a mixture of 'block' and 'streaming' mode patterns, these should be
scanned in separate databases except in the case that the streaming patterns
vastly outnumber the block mode patterns.

*********************
Unnecessary databases
*********************

.. tip:: Avoid unnecessary 'union' databases.

If there are 5 different types of network traffic T1 through T5 that must
be scanned against 5 different signature sets, it will be far more efficient to
construct 5 separate databases and scan traffic against the appropriate one
than it will be to merge all 5 signature sets and remove inappropriate matches
after the fact.

This will be true even in the case where there is substantial overlap among the
signatures. Only if the common subset of the signatures is overwhelmingly large
(say, 90% of the signatures appear in all 5 traffic types) should a database
that merges all 5 signature sets be considered, and only then if there are no
performance issues with specific patterns that appear outside the common
subset.

******************************
Allocate scratch ahead of time
******************************

.. tip:: Do not allocate scratch space for your pattern database just before
   calling a scan function. Instead, do it just after the pattern database is
   compiled or deserialized.

Scratch allocation is not necessarily a cheap operation. Since it is the first
time (after compilation or deserialization) that a pattern database is used,
Vectorscan performs some validation checks inside :c:func:`hs_alloc_scratch` and
must also allocate memory.

Therefore, it is important to ensure that :c:func:`hs_alloc_scratch` is not
called in the application's scanning path just before :c:func:`hs_scan` (for
example).

Instead, scratch should be allocated immediately after a pattern database is
compiled or deserialized, then retained for later scanning operations.

***********************************************
Allocate one scratch space per scanning context
***********************************************

.. tip:: A scratch space can be allocated so that it can be used with any one of
   a number of databases. Each concurrent scan operation (such as a thread)
   needs its own scratch space.

The :c:func:`hs_alloc_scratch` function can accept an existing scratch space and
"grow" it to support scanning with another pattern database. This means that
instead of allocating one scratch space for every database used by an
application, one can call :c:func:`hs_alloc_scratch` with a pointer to the same
:c:type:`hs_scratch_t` and it will be sized appropriately for use with any of
the given databases. For example:

.. code-block:: c

    hs_database_t *db1 = buildDatabaseOne();
    hs_database_t *db2 = buildDatabaseTwo();
    hs_database_t *db3 = buildDatabaseThree();

    hs_error_t err;
    hs_scratch_t *scratch = NULL;
    err = hs_alloc_scratch(db1, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }
    err = hs_alloc_scratch(db2, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }
    err = hs_alloc_scratch(db3, &scratch);
    if (err != HS_SUCCESS) {
        printf("hs_alloc_scratch failed!");
        exit(1);
    }

    /* scratch may now be used to scan against any of
       the databases db1, db2, db3. */

*****************
Anchored patterns
*****************

.. tip:: If a pattern is meant to appear at the start of data, be sure to
   anchor it.

Anchored patterns (:regexp:`/^.../`) are far simpler to match than other
patterns, especially patterns anchored to the start of the buffer (or stream, in
streaming mode). Anchoring patterns to the end of the buffer results in less of
a performance gain, especially in streaming mode.

There are a variety of ways to anchor a pattern to a particular offset:

- The :regexp:`^` and :regexp:`\\A` constructs anchor the pattern to the start
  of the buffer. For example, :regexp:`/^foo/` can *only* match at offset 3
  (match offsets are reported at the end of the match).

- The :regexp:`$`, :regexp:`\\z` and :regexp:`\\Z` constructs anchor the pattern
  to the end of the buffer. For example, :regexp:`/foo\\z/` can only match when
  the data buffer being scanned ends in ``foo``. (It should be noted that
  :regexp:`$` and :regexp:`\\Z` will also match before a newline at the end of
  the buffer, so :regexp:`/foo\\Z/` would match against either ``abc foo`` or
  ``abc foo\n``.)

- The ``min_offset`` and ``max_offset`` extended parameters may also be used to
  constrain where a pattern could match. For example, the pattern
  :regexp:`/foo/` with a ``max_offset`` of 10 will only match at offsets less
  than or equal to 10 in the buffer. (This pattern could also be written as
  :regexp:`/^.{0,7}foo/`, compiled with the :c:member:`HS_FLAG_DOTALL` flag).

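The ``max_offset`` rule can be modelled without the library. The sketch below
assumes that the constrained offset is the end of the match (the position just
past the last matched byte); ``count_matches_up_to`` is a hypothetical helper
for a plain literal, not a Vectorscan function.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Counts occurrences of the literal `lit` in `buf` whose end offset
     * (one past the last matched byte) is <= max_offset. */
    static size_t count_matches_up_to(const char *buf, size_t len,
                                      const char *lit, size_t max_offset)
    {
        size_t lit_len = strlen(lit);
        size_t count = 0;
        assert(lit_len > 0);
        for (size_t start = 0; start + lit_len <= len; start++) {
            if (start + lit_len > max_offset)
                break; /* every later occurrence ends even further right */
            if (memcmp(buf + start, lit, lit_len) == 0)
                count++;
        }
        return count;
    }

With ``max_offset`` set to 10, ``foo`` may start no later than offset 7, which
is why :regexp:`/^.{0,7}foo/` describes the same set of matches: the buffer
``0123456foo`` is accepted, while ``01234567foo`` is not.
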

*******************
Matching everywhere
*******************

.. tip:: Avoid patterns that match everywhere, and remember that our semantics
   are 'match everywhere, end of match only'.

Patterns that match everywhere will run slowly due to the sheer number of
matches that they return.

Patterns like :regexp:`/.*/` in an automata-based matcher will match before and
after every single character position, so a buffer with 100 characters will
return 101 matches. Greedy pattern matchers such as libpcre will return a
single match in this case, but our semantics is to return all matches. This is
likely to be very expensive for our code and for the client code of the
library.

Another result of our semantics ("match everywhere") is that patterns that have
optional start or ending sections -- for example :regexp:`/x?abcd*/` -- may not
perform as expected.

Firstly, the :regexp:`x?` portion of the pattern is unnecessary, as it will not
affect the match results.

Secondly, the above pattern will match 'more' than :regexp:`/abc/` but
:regexp:`/abc/` will always detect any input data that will be matched by
:regexp:`/x?abcd*/` -- it will just produce fewer matches.

For example, input data ``0123abcdddd`` will match :regexp:`/abc/` once but
:regexp:`/abcd*/` five times (at ``abc``, ``abcd``, ``abcdd``, ``abcddd``, and
``abcdddd``).

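The match counts in this example can be reproduced by hand. The following
sketch hard-codes the two patterns instead of calling the library and counts
one match per distinct end offset, which is how the "end of match only"
semantics are assumed to behave; both function names are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Counts the end offsets at which /abcd*/ matches when every match end
     * is reported: one for "abc" itself plus one per trailing 'd'. */
    static size_t count_abcd_star_ends(const char *buf, size_t len)
    {
        size_t count = 0;
        assert(buf != NULL || len == 0);
        for (size_t i = 0; i + 3 <= len; i++) {
            if (memcmp(buf + i, "abc", 3) != 0)
                continue;
            count++; /* match ending right after "abc" */
            for (size_t j = i + 3; j < len && buf[j] == 'd'; j++)
                count++; /* one more match for each extra 'd' */
        }
        return count;
    }

    /* Counts the end offsets at which the plain literal /abc/ matches. */
    static size_t count_abc_ends(const char *buf, size_t len)
    {
        size_t count = 0;
        assert(buf != NULL || len == 0);
        for (size_t i = 0; i + 3 <= len; i++)
            if (memcmp(buf + i, "abc", 3) == 0)
                count++;
        return count;
    }

For the 11-byte input ``0123abcdddd``, ``count_abc_ends`` returns 1 and
``count_abcd_star_ends`` returns 5. The shorter pattern detects the same data
while invoking the match callback far less often.
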

*********************************
Bounded repeats in streaming mode
*********************************

.. tip:: Bounded repeats are expensive in streaming mode.

A bounded repeat construction such as :regexp:`/X.{1000,1001}abcd/` is extremely
expensive in streaming mode, of necessity. It requires us to take action on
each ``X`` character (itself expensive, relative to searching for longer strings)
and potentially record a history of hundreds of offsets where ``X`` occurred in
case the ``X`` and ``abcd`` characters are separated by a stream boundary.

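To make the cost concrete, here is a toy streaming matcher for this one
pattern. It is only a sketch of the bookkeeping involved and makes no claim
about how Vectorscan implements bounded repeats: the names are hypothetical,
the dot is assumed to match any byte, and one match is counted per end offset.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    #define REPEAT_MIN 1000
    #define REPEAT_MAX 1001
    #define WINDOW (REPEAT_MAX + 5) /* bytes of history that must be kept */

    /* Per-stream state; all of it must survive between writes. */
    struct repeat_stream {
        unsigned long long pos;      /* bytes consumed so far */
        unsigned char saw_x[WINDOW]; /* saw_x[o % WINDOW]: byte o was 'X' */
        unsigned tail;               /* how much of "abcd" was just seen */
    };

    static void repeat_stream_init(struct repeat_stream *s)
    {
        assert(s != NULL);
        memset(s, 0, sizeof(*s));
    }

    /* Feeds one chunk; returns the number of matches ending inside it. */
    static size_t repeat_stream_write(struct repeat_stream *s,
                                      const char *data, size_t len)
    {
        static const char lit[] = "abcd";
        size_t matches = 0;
        for (size_t i = 0; i < len; i++, s->pos++) {
            char c = data[i];
            s->saw_x[s->pos % WINDOW] = (c == 'X'); /* work on every byte */
            s->tail = (c == lit[s->tail]) ? s->tail + 1 : (c == 'a' ? 1u : 0u);
            if (s->tail < 4)
                continue;
            s->tail = 0;
            /* "abcd" starts at lit_start; an 'X' at offset x qualifies if
             * REPEAT_MIN <= lit_start - (x + 1) <= REPEAT_MAX. */
            unsigned long long lit_start = s->pos - 3;
            for (unsigned long long gap = REPEAT_MIN; gap <= REPEAT_MAX; gap++) {
                if (lit_start < gap + 1)
                    break; /* the 'X' would lie before the stream start */
                if (s->saw_x[(lit_start - gap - 1) % WINDOW]) {
                    matches++;
                    break;
                }
            }
        }
        return matches;
    }

Even this simplified version carries roughly a kilobyte of history per stream
for a single pattern, and updates it on every input byte, because an ``X`` seen
in one write may only be completed by an ``abcd`` that arrives in a later one.
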

Heavy and unnecessary use of bounded repeats should be avoided, especially
where other parts of a signature are quite specific. For example, a virus
signature that matches a virus payload may be sufficient without including a
prefix that includes, for example, a 2-character Windows executable prefix and
a bounded repeat beforehand.

***************
Prefer literals
***************

.. tip:: Where possible, prefer patterns which 'require' literals, especially
   longer literals, and in streaming mode, prefer signatures that 'require'
   literals earlier in the pattern.

Patterns which must match on a literal will run faster than patterns that do
not. For example:

- :regexp:`/\\wab\\d*\\w\\w\\w/` will run faster than
- :regexp:`/\\w\\w\\d*\\w\\w/`, or, for that matter
- :regexp:`/\\w(abc)?\\d*\\w\\w\\w/` (this contains a literal but it need
  not appear in the input).

Even implicit literals are better than none: :regexp:`/[0-2][3-5].*\\w\\w/`
still effectively contains 9 2-character literals. No hand-optimization of this
case is required; this pattern will not run faster if rewritten as:
:regexp:`/(03|04|05|13|14|15|23|24|25).*\\w\\w/`.

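The count of 9 implicit literals follows from expanding the two character
classes (3 choices times 3 choices). A small sketch of that expansion, using a
hypothetical helper that has nothing to do with the library's own analysis:

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Expands /[0-2][3-5]/ into the two-character literals it can match.
     * `out` must have room for 9 entries of 3 bytes (2 chars + NUL).
     * Returns the number of literals written. */
    static size_t expand_class_pair(char out[][3])
    {
        size_t n = 0;
        for (char first = '0'; first <= '2'; first++) {
            for (char second = '3'; second <= '5'; second++) {
                out[n][0] = first;
                out[n][1] = second;
                out[n][2] = '\0';
                n++;
            }
        }
        assert(n == 9); /* 3 choices for each of the two positions */
        return n;
    }

The generated list runs from ``03`` to ``25``, the same alternatives as in the
hand-expanded pattern above.
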

Under all circumstances it is better to use longer literals than shorter ones.
A database consisting of 100 14-character literals will scan considerably
faster than one consisting of 100 4-character literals and return fewer
positives.

Additionally, in streaming mode, a signature that contains a longer literal
early in the pattern is preferred to one that does not.

For example: :regexp:`/b\\w*foobar/` is not as good a pattern as
:regexp:`/blah\\w*foobar/`.

The disparity between these patterns is much smaller in block mode.

Longer literals anywhere in the pattern are still preferred in streaming mode.
For example, both of the above patterns are stronger and will scan faster than
:regexp:`/b\\w*fo/` even in streaming mode.

**************
"Dot all" mode
**************

.. tip:: Use "dot all" mode where possible.

Not using the :c:member:`HS_FLAG_DOTALL` pattern flag can be expensive, as
implicitly, it means that patterns of the form :regexp:`/A.*B/` become
:regexp:`/A[^\\n]*B/`.

It is likely that scanning tasks without the DOTALL flag are better done 'line
at a time', with the newline sequences marking the beginning and end of each
block.

This will be true in most use-cases (an exception being where the DOTALL flag
is off but the pattern contains either explicit newlines or constructs such as
:regexp:`\\s` that implicitly match a newline character).

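The difference between :regexp:`/A.*B/` with and without "dot all" semantics
can be shown with a hand-written check. This is a sketch for this one pattern
shape only, and ``a_then_b`` is a hypothetical helper rather than library code.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Returns 1 if `buf` contains an 'A' followed later by a 'B'. When
     * `dotall` is 0 the gap must not contain '\n' (like /A[^\n]*B/); when
     * it is non-zero any bytes may intervene (like /A.*B/ with DOTALL). */
    static int a_then_b(const char *buf, size_t len, int dotall)
    {
        int armed = 0; /* an 'A' has been seen and is still usable */
        assert(buf != NULL || len == 0);
        for (size_t i = 0; i < len; i++) {
            if (armed && buf[i] == 'B')
                return 1;
            if (buf[i] == 'A')
                armed = 1;
            else if (buf[i] == '\n' && !dotall)
                armed = 0; /* a newline cancels the pending 'A' */
        }
        return 0;
    }

On the three-byte input ``A``, newline, ``B`` the function returns 1 with
``dotall`` set and 0 without it. The version without "dot all" has to watch
for a second byte value that can cancel a pending match, which is the extra
work the flag avoids.
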

*****************
Single-match flag
*****************

.. tip:: Where possible, consider using the single-match flag to limit matching
   to one match per pattern.

If only one match per pattern is required, use the flag provided to indicate
this (:c:member:`HS_FLAG_SINGLEMATCH`). This flag can allow a number of
optimizations to be applied, allowing both performance improvements and state
space reductions when streaming.

However, there is some overhead associated with tracking whether each pattern in
the pattern set has matched, and some applications with infrequent matches may
see reduced performance when the single-match flag is used.

********************
Start of Match flag
********************

.. tip:: Do not request Start of Match information if it is not needed.

Start of Match (SOM) information can be expensive to gather and can require
large amounts of stream state to store in streaming mode. As such, SOM
information should only be requested with the :c:member:`HS_FLAG_SOM_LEFTMOST`
flag for patterns that require it.

SOM information is not generally expected to be cheaper (in either performance
terms or in stream state overhead) than the use of bounded repeats.
Consequently, :regexp:`/foo.*bar/L` with a check on start of match values after
the callback is considerably more expensive and general than
:regexp:`/foo.{300}bar/`.

Similarly, the :cpp:member:`hs_expr_ext::min_length` extended parameter can be
used to specify a lower bound on the length of the matches for a pattern. Using
this facility may be more lightweight in some circumstances than using the SOM
flag and post-confirming match length in the calling application.

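For comparison, post-confirming match length in the application looks roughly
like the sketch below. The handler takes the same parameters as the library's
match callback, so it compiles without the Vectorscan headers; the
``length_filter`` context struct and the handler name are hypothetical.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical application context: the minimum accepted match length
     * and a running count of matches that passed the check. */
    struct length_filter {
        unsigned long long min_length;
        size_t accepted;
    };

    /* `from` is only meaningful when the pattern was compiled with
     * HS_FLAG_SOM_LEFTMOST; otherwise no start offset is available. */
    static int on_match_check_length(unsigned int id, unsigned long long from,
                                     unsigned long long to, unsigned int flags,
                                     void *context)
    {
        struct length_filter *f = context;
        (void)id;
        (void)flags;
        assert(f != NULL && to >= from);
        if (to - from >= f->min_length)
            f->accepted++;
        return 0; /* zero asks the scan to continue */
    }

Every match, including those that are too short, still reaches this callback
and pays for SOM tracking. Setting ``min_length`` at compile time lets the
library discard short matches before the application is involved.
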

********************
Approximate matching
********************

.. tip:: Approximate matching is an experimental feature.

There is generally a performance impact associated with approximate matching due
to the reduced specificity of the matches. This impact may vary significantly
depending on the pattern and edit distance.