mirror of
https://github.com/VectorCamp/vectorscan.git
synced 2025-06-28 16:41:01 +03:00
Add approximate matching documentation
This commit is contained in:
parent
ebe849603b
commit
eed2743d04
@ -171,6 +171,8 @@ The following regex constructs are not supported by Hyperscan:
|
|||||||
* Callouts and embedded code.
|
* Callouts and embedded code.
|
||||||
* Atomic grouping and possessive quantifiers.
|
* Atomic grouping and possessive quantifiers.
|
||||||
|
|
||||||
|
.. _semantics:
|
||||||
|
|
||||||
*********
|
*********
|
||||||
Semantics
|
Semantics
|
||||||
*********
|
*********
|
||||||
@ -284,16 +286,24 @@ which provides the following fields:
|
|||||||
expression should match successfully.
|
expression should match successfully.
|
||||||
* ``min_length``: The minimum match length (from start to end) required to
|
* ``min_length``: The minimum match length (from start to end) required to
|
||||||
successfully match this expression.
|
successfully match this expression.
|
||||||
|
* ``edit_distance``: Match this expression within a given Levenshtein distance.
|
||||||
|
|
||||||
These parameters allow the set of matches produced by a pattern to be
|
These parameters either allow the set of matches produced by a pattern to be
|
||||||
constrained at compile time, rather than relying on the application to process
|
constrained at compile time (rather than relying on the application to process
|
||||||
unwanted matches at runtime.
|
unwanted matches at runtime), or allow matching a pattern approximately (within
|
||||||
|
a given edit distance) to produce more matches.
|
||||||
|
|
||||||
For example, the pattern :regexp:`/foo.*bar/` when given a ``min_offset`` of 10
|
For example, the pattern :regexp:`/foo.*bar/` when given a ``min_offset`` of 10
|
||||||
and a ``max_offset`` of 15 will not produce matches when scanned against
|
and a ``max_offset`` of 15 will not produce matches when scanned against
|
||||||
``foobar`` or ``foo0123456789bar`` but will produce a match against the data
|
``foobar`` or ``foo0123456789bar`` but will produce a match against the data
|
||||||
streams ``foo0123bar`` or ``foo0123456bar``.
|
streams ``foo0123bar`` or ``foo0123456bar``.
|
||||||
|
|
||||||
|
Similarly, the pattern :regexp:`/foobar/` when given an ``edit_distance`` of 2
|
||||||
|
will produce matches when scanned against ``foobar``, ``fooba``, ``fobr``,
|
||||||
|
``fo_baz``, ``foooobar``, and anything else that lies within edit distance of 2
|
||||||
|
(as defined by Levenshtein distance). For more details, see the
|
||||||
|
:ref:`approximate_matching` section.
|
||||||
|
|
||||||
=================
|
=================
|
||||||
Prefiltering Mode
|
Prefiltering Mode
|
||||||
=================
|
=================
|
||||||
@ -375,3 +385,74 @@ An :c:type:`hs_platform_info_t` structure targeted at the current host can be
|
|||||||
built with the :c:func:`hs_populate_platform` function.
|
built with the :c:func:`hs_populate_platform` function.
|
||||||
|
|
||||||
See :ref:`api_constants` for the full list of CPU tuning and feature flags.
|
See :ref:`api_constants` for the full list of CPU tuning and feature flags.
|
||||||
|
|
||||||
|
.. _approximate_matching:
|
||||||
|
|
||||||
|
********************
|
||||||
|
Approximate matching
|
||||||
|
********************
|
||||||
|
|
||||||
|
Hyperscan provides an experimental approximate matching mode, which will match
|
||||||
|
patterns within a given edit distance. The exact matching behavior is defined as
|
||||||
|
follows:
|
||||||
|
|
||||||
|
#. **Edit distance** is defined as Levenshtein distance. That is, there are
|
||||||
|
three possible edit types considered: insertion, removal and substitution.
|
||||||
|
More formal description can be found on
|
||||||
|
`Wikipedia <https://en.wikipedia.org/wiki/Levenshtein_distance>`_.
|
||||||
|
|
||||||
|
#. **Approximate matching** will match all *corpora* within a given edit
|
||||||
|
distance. That is, given a pattern, approximate matching will match anything
|
||||||
|
that can be edited to arrive at a corpus that exactly matches the original
|
||||||
|
pattern.
|
||||||
|
|
||||||
|
#. **Matching semantics** are exactly the same as described in :ref:`semantics`.
|
||||||
|
|
||||||
|
Here are a few examples of approximate matching:
|
||||||
|
|
||||||
|
* Pattern :regexp:`/foo/` can match ``foo`` when using regular Hyperscan
|
||||||
|
matching behavior. With approximate matching within edit distance 2, the
|
||||||
|
pattern will produce matches when scanned against ``foo``, ``foooo``, ``f00``,
|
||||||
|
``f``, and anything else that lies within edit distance 2 of matching corpora
|
||||||
|
for the original pattern (``foo`` in this case).
|
||||||
|
|
||||||
|
* Pattern :regexp:`/foo(bar)+/` with edit distance 1 will match ``foobarbar``,
|
||||||
|
``foobarb0r``, ``fooarbar``, ``foobarba``, ``f0obarbar``, ``fobarbar`` and
|
||||||
|
anything else that lies within edit distance 1 of matching corpora for the
|
||||||
|
original pattern (``foobarbar`` in this case).
|
||||||
|
|
||||||
|
* Pattern :regexp:`/foob?ar/` with edit distance 2 will match ``fooar``,
|
||||||
|
``foo``, ``fabar``, ``oar`` and anything else that lies within edit distance 2
|
||||||
|
of matching corpora for the original pattern (``fooar`` in this case).
|
||||||
|
|
||||||
|
Currently, there are trade-offs and limitations that come with approximate
|
||||||
|
matching support. Here they are, in a nutshell:
|
||||||
|
|
||||||
|
* Reduced pattern support:
|
||||||
|
|
||||||
|
* For many patterns, approximate matching is complex and can result in
|
||||||
|
Hyperscan failing to compile a pattern with a "Pattern too large" error,
|
||||||
|
even if the pattern is supported in normal operation.
|
||||||
|
* Additionally, some patterns cannot be approximately matched because they
|
||||||
|
reduce to so-called "vacuous" patterns (patterns that match everything). For
|
||||||
|
example, pattern :regexp:`/foo/` with edit distance 3, if implemented,
|
||||||
|
would reduce to matching zero-length buffers. Such patterns will result in a
|
||||||
|
"Pattern cannot be approximately matched" compile error.
|
||||||
|
* Finally, due to the inherent complexities of defining matching behavior,
|
||||||
|
approximate matching implements a reduced subset of regular expression
|
||||||
|
syntax. Approximate matching does not support UTF-8 (and other
|
||||||
|
multibyte character encodings), and word boundaries (that is, ``\b``, ``\B``
|
||||||
|
and other equivalent constructs). Patterns containing unsupported constructs
|
||||||
|
will result in "Pattern cannot be approximately matched" compile error.
|
||||||
|
* When using approximate matching in conjunction with SOM, all of the
|
||||||
|
restrictions of SOM also apply. See :ref:`som` for more
|
||||||
|
details.
|
||||||
|
* Increased stream state/byte code size requirements: due to approximate
|
||||||
|
matching byte code being inherently larger and more complex than exact
|
||||||
|
matching, the corresponding requirements also increase.
|
||||||
|
* Performance overhead: similarly, there is generally a performance cost
|
||||||
|
associated with approximate matching, both due to increased matching
|
||||||
|
complexity, and due to the fact that it will produce more matches.
|
||||||
|
|
||||||
|
Approximate matching is always disabled by default, and can be enabled on a
|
||||||
|
per-pattern basis by using an extended parameter described in :ref:`extparam`.
|
||||||
|
@ -333,3 +333,13 @@ Similarly, the :c:member:`hs_expr_ext::min_length` extended parameter can be
|
|||||||
used to specify a lower bound on the length of the matches for a pattern. Using
|
used to specify a lower bound on the length of the matches for a pattern. Using
|
||||||
this facility may be more lightweight in some circumstances than using the SOM
|
this facility may be more lightweight in some circumstances than using the SOM
|
||||||
flag and post-confirming match length in the calling application.
|
flag and post-confirming match length in the calling application.
|
||||||
|
|
||||||
|
********************
|
||||||
|
Approximate matching
|
||||||
|
********************
|
||||||
|
|
||||||
|
.. tip:: Approximate matching is an experimental feature.
|
||||||
|
|
||||||
|
There is generally a performance impact associated with approximate matching due
|
||||||
|
to the reduced specificity of the matches. This impact may vary significantly
|
||||||
|
depending on the pattern and edit distance.
|
||||||
|
Loading…
x
Reference in New Issue
Block a user