2015-10-20 09:13:35 +11:00

199 lines
7.8 KiB
ReStructuredText

.. _runtime:
#####################
Scanning for Patterns
#####################
Hyperscan provides three different scanning modes, each with its own scan
function beginning with ``hs_scan``. In addition, streaming mode has a number
of other API functions for managing stream state.
****************
Handling Matches
****************
All of these functions will call a user-supplied callback function when a match
is found. This function has the following signature:
.. doxygentypedef:: match_event_handler
:outline:
:no-link:
The *id* argument will be set to the identifier for the matching expression
provided at compile time, and the *to* argument will be set to the end-offset
of the match. If SOM was requested for the pattern (see :ref:`som`), the
*from* argument will be set to the leftmost possible start-offset for the match.
The match callback function has the capability to halt scanning
by returning a non-zero value.
See :c:type:`match_event_handler` for more information.
**************
Streaming Mode
**************
The streaming runtime API consists of functions to open, scan, and close
Hyperscan data streams -- these functions being :c:func:`hs_open_stream`,
:c:func:`hs_scan_stream`, and :c:func:`hs_close_stream`. Any matches detected
in the written data are returned to the calling application via a function
pointer callback.
The match callback function has the capability to halt scanning of the current
data stream by returning a non-zero value. In streaming mode, the result of
this is that the stream is then left in a state where no more data can be
scanned, and any subsequent calls to :c:func:`hs_scan_stream` for that stream
will return immediately with :c:member:`HS_SCAN_TERMINATED`. The caller must
still call :c:func:`hs_close_stream` to complete the clean-up process for that
stream.
Streams exist in the Hyperscan library so that pattern matching state can be
maintained across multiple blocks of target data -- without maintaining this
state, it would not be possible to detect patterns that span these blocks of
data. This, however, does come at the cost of requiring an amount of storage
per-stream (the size of this storage is fixed at compile time), and a slight
performance penalty in some cases to manage the state.
While Hyperscan does always support a strict ordering of multiple matches,
streaming matches will not be delivered at offsets before the current stream
write, with the exception of zero-width asserts, where constructs such as
:regexp:`\\b` and :regexp:`$` can cause a match on the final character of a
stream write to be delayed until the next stream write or stream close
operation.
=================
Stream Management
=================
In addition to :c:func:`hs_open_stream`, :c:func:`hs_scan_stream`, and
:c:func:`hs_close_stream`, the Hyperscan API provides a number of other
functions for the management of streams:
* :c:func:`hs_reset_stream`: resets a stream to its initial state; this is
equivalent to calling :c:func:`hs_close_stream` but will not free the memory
used for stream state.
* :c:func:`hs_copy_stream`: constructs a (newly allocated) duplicate of a
stream.
* :c:func:`hs_reset_and_copy_stream`: constructs a duplicate of a stream into
another, resetting the destination stream first. This call avoids the
allocation done by :c:func:`hs_copy_stream`.
**********
Block Mode
**********
The block mode runtime API consists of a single function: :c:func:`hs_scan`. Using
the compiled patterns this function identifies matches in the target data,
using a function pointer callback to communicate with the application.
This single :c:func:`hs_scan` function is essentially equivalent to calling
:c:func:`hs_open_stream`, making a single call to :c:func:`hs_scan_stream`, and
then :c:func:`hs_close_stream`, except that block mode operation does not
incur all the stream related overhead.
*************
Vectored Mode
*************
The vectored mode runtime API, like the block mode API, consists of a single
function: :c:func:`hs_scan_vector`. This function accepts an array of data
pointers and lengths, facilitating the scanning in sequence of a set of data
blocks that are not contiguous in memory.
From the caller's perspective, this mode will produce the same matches as if
the set of data blocks were (a) scanned in sequence with a series of streaming
mode scans, or (b) copied in sequence into a single block of memory and then
scanned in block mode.
*************
Scratch Space
*************
While scanning data, Hyperscan needs a small amount of temporary memory to store
on-the-fly internal data. This amount is unfortunately too large to fit on the
stack, particularly for embedded applications, and allocating memory dynamically
is too expensive, so a pre-allocated "scratch" space must be provided to the
scanning functions.
The function :c:func:`hs_alloc_scratch` allocates a large enough region of
scratch space to support a given database. If the application uses multiple
databases, only a single scratch region is necessary: in this case, calling
:c:func:`hs_alloc_scratch` on each database (with the same ``scratch`` pointer)
will ensure that the scratch space is large enough to support scanning against
any of the given databases.
Importantly, only one such space is required per thread and can (and indeed
should) be allocated before data scanning is to commence. In a scenario where a
set of expressions are compiled by a single "master" thread and data will be
scanned by multiple "worker" threads, the convenience function
:c:func:`hs_clone_scratch` allows multiple copies of an existing scratch space
to be made for each thread (rather than forcing the caller to pass all the
compiled databases through :c:func:`hs_alloc_scratch` multiple times).
For example:
.. code-block:: c
hs_error_t err;
hs_scratch_t *scratch_prototype = NULL;
err = hs_alloc_scratch(db, &scratch_prototype);
if (err != HS_SUCCESS) {
printf("hs_alloc_scratch failed!");
exit(1);
}
hs_scratch_t *scratch_thread1 = NULL;
hs_scratch_t *scratch_thread2 = NULL;
err = hs_clone_scratch(scratch_prototype, &scratch_thread1);
if (err != HS_SUCCESS) {
printf("hs_clone_scratch failed!");
exit(1);
}
err = hs_clone_scratch(scratch_prototype, &scratch_thread2);
if (err != HS_SUCCESS) {
printf("hs_clone_scratch failed!");
exit(1);
}
hs_free_scratch(scratch_prototype);
/* Now two threads can both scan against database db,
each with its own scratch space. */
While the Hyperscan library is re-entrant, the use of scratch spaces is not.
For example, if by design it is deemed necessary to run recursive or nested
scanning (say, from the match callback function), then an additional scratch
space is required for that context.
The easiest way to achieve this is to build up a single scratch space as a
prototype, then clone it for each context:
*****************
Custom Allocators
*****************
By default, structures used by Hyperscan at runtime (scratch space, stream
state, etc) are allocated with the default system allocators, usually
``malloc()`` and ``free()``.
The Hyperscan API provides a facility for changing this behaviour to support
applications that use custom memory allocators.
These functions are:
- :c:func:`hs_set_database_allocator`, which sets the allocate and free functions
used for compiled pattern databases.
- :c:func:`hs_set_scratch_allocator`, which sets the allocate and free
functions used for scratch space.
- :c:func:`hs_set_stream_allocator`, which sets the allocate and free functions
used for stream state in streaming mode.
- :c:func:`hs_set_misc_allocator`, which sets the allocate and free functions
used for miscellaneous data, such as compile error structures and
informational strings.
The :c:func:`hs_set_allocator` function can be used to set all of the custom
allocators to the same allocate/free pair.