mirror of
https://github.com/VectorCamp/vectorscan.git
synced 2025-06-28 16:41:01 +03:00
243 lines
9.9 KiB
ReStructuredText
243 lines
9.9 KiB
ReStructuredText
.. _runtime:
|
|
|
|
#####################
|
|
Scanning for Patterns
|
|
#####################
|
|
|
|
Hyperscan provides three different scanning modes, each with its own scan
|
|
function beginning with ``hs_scan``. In addition, streaming mode has a number
|
|
of other API functions for managing stream state.
|
|
|
|
****************
|
|
Handling Matches
|
|
****************
|
|
|
|
All of these functions will call a user-supplied callback function when a match
|
|
is found. This function has the following signature:
|
|
|
|
.. doxygentypedef:: match_event_handler
|
|
:outline:
|
|
:no-link:
|
|
|
|
The *id* argument will be set to the identifier for the matching expression
|
|
provided at compile time, and the *to* argument will be set to the end-offset
|
|
of the match. If SOM was requested for the pattern (see :ref:`som`), the
|
|
*from* argument will be set to the leftmost possible start-offset for the match.
|
|
|
|
The match callback function has the capability to halt scanning
|
|
by returning a non-zero value.
|
|
|
|
See :c:type:`match_event_handler` for more information.
|
|
|
|
**************
|
|
Streaming Mode
|
|
**************
|
|
|
|
The core of the Hyperscan streaming runtime API consists of functions to open,
|
|
scan, and close Hyperscan data streams:
|
|
|
|
* :c:func:`hs_open_stream`: allocates and initializes a new stream for scanning.
|
|
|
|
* :c:func:`hs_scan_stream`: scans a block of data in a given stream, raising
|
|
matches as they are detected.
|
|
|
|
* :c:func:`hs_close_stream`: completes scanning of a given stream (raising any
|
|
matches that occur at the end of the stream) and frees the stream state. After
|
|
a call to :c:func:`hs_close_stream`, the stream handle is invalid and should
|
|
not be used again for any purpose.
|
|
|
|
Any matches detected in the data as it is scanned are returned to the calling
|
|
application via a function pointer callback.
|
|
|
|
The match callback function has the capability to halt scanning of the current
|
|
data stream by returning a non-zero value. In streaming mode, the result of
|
|
this is that the stream is then left in a state where no more data can be
|
|
scanned, and any subsequent calls to :c:func:`hs_scan_stream` for that stream
|
|
will return immediately with :c:member:`HS_SCAN_TERMINATED`. The caller must
|
|
still call :c:func:`hs_close_stream` to complete the clean-up process for that
|
|
stream.
|
|
|
|
Streams exist in the Hyperscan library so that pattern matching state can be
|
|
maintained across multiple blocks of target data -- without maintaining this
|
|
state, it would not be possible to detect patterns that span these blocks of
|
|
data. This, however, does come at the cost of requiring an amount of storage
|
|
per-stream (the size of this storage is fixed at compile time), and a slight
|
|
performance penalty in some cases to manage the state.
|
|
|
|
While Hyperscan does always support a strict ordering of multiple matches,
|
|
streaming matches will not be delivered at offsets before the current stream
|
|
write, with the exception of zero-width asserts, where constructs such as
|
|
:regexp:`\\b` and :regexp:`$` can cause a match on the final character of a
|
|
stream write to be delayed until the next stream write or stream close
|
|
operation.
|
|
|
|
=================
|
|
Stream Management
|
|
=================
|
|
|
|
In addition to :c:func:`hs_open_stream`, :c:func:`hs_scan_stream`, and
|
|
:c:func:`hs_close_stream`, the Hyperscan API provides a number of other
|
|
functions for the management of streams:
|
|
|
|
* :c:func:`hs_reset_stream`: resets a stream to its initial state; this is
|
|
equivalent to calling :c:func:`hs_close_stream` but will not free the memory
|
|
used for stream state.
|
|
|
|
* :c:func:`hs_copy_stream`: constructs a (newly allocated) duplicate of a
|
|
stream.
|
|
|
|
* :c:func:`hs_reset_and_copy_stream`: constructs a duplicate of a stream into
|
|
another, resetting the destination stream first. This call avoids the
|
|
allocation done by :c:func:`hs_copy_stream`.
|
|
|
|
==================
|
|
Stream Compression
|
|
==================
|
|
|
|
A stream object is allocated as a fixed size region of memory which has been
|
|
sized to ensure that no memory allocations are required during scan
|
|
operations. When the system is under memory pressure, it may be useful to reduce
|
|
the memory consumed by streams that are not expected to be used soon. The
|
|
Hyperscan API provides calls for translating a stream to and from a compressed
|
|
representation for this purpose. The compressed representation differs from the
|
|
full stream object as it does not reserve space for components which are not
|
|
required given the current stream state. The Hyperscan API functions for this
|
|
functionality are:
|
|
|
|
* :c:func:`hs_compress_stream`: fills the provided buffer with a compressed
|
|
representation of the stream and returns the number of bytes consumed by the
|
|
compressed representation. If the buffer is not large enough to hold the
|
|
compressed representation, :c:member:`HS_INSUFFICIENT_SPACE` is returned along
|
|
with the required size. This call does not modify the original stream in any
|
|
way: it may still be written to with :c:func:`hs_scan_stream`, used as part of
|
|
the various reset calls to reinitialise its state, or
|
|
:c:func:`hs_close_stream` may be called to free its resources.
|
|
|
|
* :c:func:`hs_expand_stream`: creates a new stream based on a buffer containing
|
|
a compressed representation.
|
|
|
|
* :c:func:`hs_reset_and_expand_stream`: constructs a stream based on a buffer
|
|
containing a compressed representation on top of an existing stream, resetting
|
|
the existing stream first. This call avoids the allocation done by
|
|
:c:func:`hs_expand_stream`.
|
|
|
|
Note: it is not recommended to use stream compression between every call to scan
|
|
for performance reasons as it takes time to convert between the compressed
|
|
representation and a standard stream.
|
|
|
|
|
|
**********
|
|
Block Mode
|
|
**********
|
|
|
|
The block mode runtime API consists of a single function: :c:func:`hs_scan`. Using
|
|
the compiled patterns this function identifies matches in the target data,
|
|
using a function pointer callback to communicate with the application.
|
|
|
|
This single :c:func:`hs_scan` function is essentially equivalent to calling
|
|
:c:func:`hs_open_stream`, making a single call to :c:func:`hs_scan_stream`, and
|
|
then :c:func:`hs_close_stream`, except that block mode operation does not
|
|
incur all the stream related overhead.
|
|
|
|
*************
|
|
Vectored Mode
|
|
*************
|
|
|
|
The vectored mode runtime API, like the block mode API, consists of a single
|
|
function: :c:func:`hs_scan_vector`. This function accepts an array of data
|
|
pointers and lengths, facilitating the scanning in sequence of a set of data
|
|
blocks that are not contiguous in memory.
|
|
|
|
From the caller's perspective, this mode will produce the same matches as if
|
|
the set of data blocks were (a) scanned in sequence with a series of streaming
|
|
mode scans, or (b) copied in sequence into a single block of memory and then
|
|
scanned in block mode.
|
|
|
|
*************
|
|
Scratch Space
|
|
*************
|
|
|
|
While scanning data, Hyperscan needs a small amount of temporary memory to store
|
|
on-the-fly internal data. This amount is unfortunately too large to fit on the
|
|
stack, particularly for embedded applications, and allocating memory dynamically
|
|
is too expensive, so a pre-allocated "scratch" space must be provided to the
|
|
scanning functions.
|
|
|
|
The function :c:func:`hs_alloc_scratch` allocates a large enough region of
|
|
scratch space to support a given database. If the application uses multiple
|
|
databases, only a single scratch region is necessary: in this case, calling
|
|
:c:func:`hs_alloc_scratch` on each database (with the same ``scratch`` pointer)
|
|
will ensure that the scratch space is large enough to support scanning against
|
|
any of the given databases.
|
|
|
|
While the Hyperscan library is re-entrant, the use of scratch spaces is not.
|
|
For example, if by design it is deemed necessary to run recursive or nested
|
|
scanning (say, from the match callback function), then an additional scratch
|
|
space is required for that context.
|
|
|
|
In the absence of recursive scanning, only one such space is required per thread
|
|
and can (and indeed should) be allocated before data scanning is to commence.
|
|
|
|
In a scenario where a set of expressions are compiled by a single "main"
|
|
thread and data will be scanned by multiple "worker" threads, the convenience
|
|
function :c:func:`hs_clone_scratch` allows multiple copies of an existing
|
|
scratch space to be made for each thread (rather than forcing the caller to pass
|
|
all the compiled databases through :c:func:`hs_alloc_scratch` multiple times).
|
|
|
|
For example:
|
|
|
|
.. code-block:: c
|
|
|
|
hs_error_t err;
|
|
hs_scratch_t *scratch_prototype = NULL;
|
|
err = hs_alloc_scratch(db, &scratch_prototype);
|
|
if (err != HS_SUCCESS) {
|
|
printf("hs_alloc_scratch failed!");
|
|
exit(1);
|
|
}
|
|
|
|
hs_scratch_t *scratch_thread1 = NULL;
|
|
hs_scratch_t *scratch_thread2 = NULL;
|
|
|
|
err = hs_clone_scratch(scratch_prototype, &scratch_thread1);
|
|
if (err != HS_SUCCESS) {
|
|
printf("hs_clone_scratch failed!");
|
|
exit(1);
|
|
}
|
|
err = hs_clone_scratch(scratch_prototype, &scratch_thread2);
|
|
if (err != HS_SUCCESS) {
|
|
printf("hs_clone_scratch failed!");
|
|
exit(1);
|
|
}
|
|
|
|
hs_free_scratch(scratch_prototype);
|
|
|
|
/* Now two threads can both scan against database db,
|
|
each with its own scratch space. */
|
|
|
|
*****************
|
|
Custom Allocators
|
|
*****************
|
|
|
|
By default, structures used by Hyperscan at runtime (scratch space, stream
|
|
state, etc) are allocated with the default system allocators, usually
|
|
``malloc()`` and ``free()``.
|
|
|
|
The Hyperscan API provides a facility for changing this behaviour to support
|
|
applications that use custom memory allocators.
|
|
|
|
These functions are:
|
|
|
|
- :c:func:`hs_set_database_allocator`, which sets the allocate and free functions
|
|
used for compiled pattern databases.
|
|
- :c:func:`hs_set_scratch_allocator`, which sets the allocate and free
|
|
functions used for scratch space.
|
|
- :c:func:`hs_set_stream_allocator`, which sets the allocate and free functions
|
|
used for stream state in streaming mode.
|
|
- :c:func:`hs_set_misc_allocator`, which sets the allocate and free functions
|
|
used for miscellaneous data, such as compile error structures and
|
|
informational strings.
|
|
|
|
The :c:func:`hs_set_allocator` function can be used to set all of the custom
|
|
allocators to the same allocate/free pair.
|