mirror of
https://github.com/VectorCamp/vectorscan.git
synced 2025-06-28 16:41:01 +03:00
chimera: update dev-reference
This commit is contained in:
parent
c8ec0d0ec2
commit
746d1eafe5
333
doc/dev-reference/chimera.rst
Normal file
333
doc/dev-reference/chimera.rst
Normal file
@ -0,0 +1,333 @@
|
||||
.. _chimera:
|
||||
|
||||
#######
|
||||
Chimera
|
||||
#######
|
||||
|
||||
This section describes Chimera library.
|
||||
|
||||
************
|
||||
Introduction
|
||||
************
|
||||
|
||||
Chimera is a software regular expression matching engine that is a hybrid of
|
||||
Hyperscan and PCRE. The design goals of Chimera are to fully support PCRE
|
||||
syntax as well as to take advantage of the high performance nature of Hyperscan.
|
||||
|
||||
Chimera inherits the design guideline of Hyperscan with C APIs for compilation
|
||||
and scanning.
|
||||
|
||||
The Chimera API itself is composed of two major components:
|
||||
|
||||
===========
|
||||
Compilation
|
||||
===========
|
||||
|
||||
These functions take a group of regular expressions, along with identifiers and
|
||||
option flags, and compile them into an immutable database that can be used by
|
||||
the Chimera scanning API. This compilation process performs considerable
|
||||
analysis and optimization work in order to build a database that will match
|
||||
the given expressions efficiently.
|
||||
|
||||
See :ref:`chcompile` for more details
|
||||
|
||||
========
|
||||
Scanning
|
||||
========
|
||||
|
||||
Once a Chimera database has been created, it can be used to scan data in memory.
|
||||
Chimera only supports block mode in which we scan a single contiguous block in
|
||||
memory.
|
||||
|
||||
Matches are delivered to the application via a user-supplied callback function
|
||||
that is called synchronously for each match.
|
||||
|
||||
For a given database, Chimera provides several guarantees:
|
||||
|
||||
* No memory allocations occur at runtime with the exception of scratch space
|
||||
allocation, it should be done ahead of time for performance-critical
|
||||
applications:
|
||||
|
||||
- **Scratch space**: temporary memory used for internal data at scan time.
|
||||
Structures in scratch space do not persist beyond the end of a single scan
|
||||
call.
|
||||
|
||||
* The size of the scratch space required for a given database is fixed and
|
||||
determined at database compile time. This means that the memory requirement
|
||||
of the application are known ahead of time, and the scratch space can be
|
||||
pre-allocated if required for performance reasons.
|
||||
|
||||
* Any pattern that has successfully been compiled by the Chimera compiler can
|
||||
be scanned against any input. There could be internal resource limits or
|
||||
other limitations caused by PCRE at runtime that could cause a scan call to
|
||||
return an error.
|
||||
|
||||
.. note:: Chimera is designed to have the same matching behavior as PCRE,
|
||||
including greedy/ungreedy, capturing, etc. Chimera reports both
|
||||
**start offset** and **end offset** for each match like PCRE. Different
|
||||
from the fashion of reporting all matches in Hyperscan, Chimera only reports
|
||||
non-overlapping matches. For example, the pattern :regexp:`/foofoo/` will
|
||||
match ``foofoofoofoo`` at offsets (0, 6) and (6, 12).
|
||||
|
||||
.. note:: Since Chimera is a hybrid of Hyperscan and PCRE in order to support
|
||||
full PCRE syntax, there will be extra performance overhead compared to
|
||||
Hyperscan-only solution. Please always use Hyperscan for better performance
|
||||
unless you must need full PCRE syntax support.
|
||||
|
||||
See :ref:`chruntime` for more details
|
||||
|
||||
************
|
||||
Requirements
|
||||
************
|
||||
|
||||
The PCRE library (http://pcre.org/) version 8.41 is required for Chimera.
|
||||
|
||||
.. note:: Since Chimera needs to reference PCRE internal function, please place PCRE source
|
||||
directory under Hyperscan root directory in order to build Chimera.
|
||||
|
||||
Beside this, both hardware and software requirements of Chimera are the same to Hyperscan.
|
||||
See :ref:`hardware` and :ref:`software` for more details.
|
||||
|
||||
.. note:: Building Hyperscan will automatically generate Chimera library.
|
||||
Currently only static library is supported for Chimera, so please
|
||||
use static build type when configure CMake build options.
|
||||
|
||||
.. _chcompile:
|
||||
|
||||
******************
|
||||
Compiling Patterns
|
||||
******************
|
||||
|
||||
===================
|
||||
Building a Database
|
||||
===================
|
||||
|
||||
The Chimera compiler API accepts regular expressions and converts them into a
|
||||
compiled pattern database that can then be used to scan data.
|
||||
|
||||
The API provides two functions that compile regular expressions into
|
||||
databases:
|
||||
|
||||
#. :c:func:`ch_compile`: compiles a single expression into a pattern database.
|
||||
|
||||
#. :c:func:`ch_compile_multi`: compiles an array of expressions into a pattern
|
||||
database. All of the supplied patterns will be scanned for concurrently at
|
||||
scan time, with user-supplied identifiers returned when they match.
|
||||
|
||||
#. :c:func:`ch_compile_ext_multi`: compiles an array of expressions as above,
|
||||
but allows PCRE match limits to be specified for each expression.
|
||||
|
||||
Compilation allows the Chimera library to analyze the given pattern(s) and
|
||||
pre-determine how to scan for these patterns in an optimized fashion using
|
||||
Hyperscan and PCRE.
|
||||
|
||||
===============
|
||||
Pattern Support
|
||||
===============
|
||||
|
||||
Chimera fully supports the pattern syntax used by the PCRE library ("libpcre"),
|
||||
described at <http://www.pcre.org/>.The version of PCRE used to validate
|
||||
Chimera's interpretation of this syntax is 8.41.
|
||||
|
||||
=========
|
||||
Semantics
|
||||
=========
|
||||
|
||||
Chimera supports the exact same semantics of PCRE library. Moreover, it supports
|
||||
multiple simultaneous pattern matching like Hyperscan and the multiple matches
|
||||
will be reported in order by end offset.
|
||||
|
||||
.. _chruntime:
|
||||
|
||||
*********************
|
||||
Scanning for Patterns
|
||||
*********************
|
||||
|
||||
Chimera provides scan function with ``ch_scan``.
|
||||
|
||||
================
|
||||
Handling Matches
|
||||
================
|
||||
|
||||
``ch_scan`` will call a user-supplied callback function when a match
|
||||
is found. This function has the following signature:
|
||||
|
||||
.. doxygentypedef:: ch_match_event_handler
|
||||
:outline:
|
||||
:no-link:
|
||||
|
||||
The *id* argument will be set to the identifier for the matching expression
|
||||
provided at compile time, and the *from* argument will be set to the
|
||||
start-offset of the match the *to* argument will be set to the end-offset
|
||||
of the match. The *captured* stores offsets of entire pattern match as well as
|
||||
captured subexpressions. The *size* will be set to the number of valid entries in
|
||||
the *captured*.
|
||||
|
||||
The match callback function has the capability to continue or halt scanning
|
||||
by returning different values.
|
||||
|
||||
See :c:type:`ch_match_event_handler` for more information.
|
||||
|
||||
=======================
|
||||
Handling Runtime Errors
|
||||
=======================
|
||||
|
||||
``ch_scan`` will call a user-supplied callback function when a runtime error
|
||||
occurs in libpcre. This function has the following signature:
|
||||
|
||||
.. doxygentypedef:: ch_error_event_handler
|
||||
:outline:
|
||||
:no-link:
|
||||
|
||||
The *id* argument will be set to the identifier for the matching expression
|
||||
provided at compile time.
|
||||
|
||||
The match callback function has the capability to either halt scanning or
|
||||
continue scanning for the next pattern.
|
||||
|
||||
See :c:type:`ch_error_event_handler` for more information.
|
||||
|
||||
=============
|
||||
Scratch Space
|
||||
=============
|
||||
|
||||
While scanning data, Chimera needs a small amount of temporary memory to store
|
||||
on-the-fly internal data. This amount is unfortunately too large to fit on the
|
||||
stack, particularly for embedded applications, and allocating memory dynamically
|
||||
is too expensive, so a pre-allocated "scratch" space must be provided to the
|
||||
scanning functions.
|
||||
|
||||
The function :c:func:`ch_alloc_scratch` allocates a large enough region of
|
||||
scratch space to support a given database. If the application uses multiple
|
||||
databases, only a single scratch region is necessary: in this case, calling
|
||||
:c:func:`ch_alloc_scratch` on each database (with the same ``scratch`` pointer)
|
||||
will ensure that the scratch space is large enough to support scanning against
|
||||
any of the given databases.
|
||||
|
||||
While the Chimera library is re-entrant, the use of scratch spaces is not.
|
||||
For example, if by design it is deemed necessary to run recursive or nested
|
||||
scanning (say, from the match callback function), then an additional scratch
|
||||
space is required for that context.
|
||||
|
||||
In the absence of recursive scanning, only one such space is required per thread
|
||||
and can (and indeed should) be allocated before data scanning is to commence.
|
||||
|
||||
In a scenario where a set of expressions are compiled by a single "master"
|
||||
thread and data will be scanned by multiple "worker" threads, the convenience
|
||||
function :c:func:`ch_clone_scratch` allows multiple copies of an existing
|
||||
scratch space to be made for each thread (rather than forcing the caller to pass
|
||||
all the compiled databases through :c:func:`ch_alloc_scratch` multiple times).
|
||||
|
||||
For example:
|
||||
|
||||
.. code-block:: c
|
||||
|
||||
ch_error_t err;
|
||||
ch_scratch_t *scratch_prototype = NULL;
|
||||
err = ch_alloc_scratch(db, &scratch_prototype);
|
||||
if (err != CH_SUCCESS) {
|
||||
printf("ch_alloc_scratch failed!");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
ch_scratch_t *scratch_thread1 = NULL;
|
||||
ch_scratch_t *scratch_thread2 = NULL;
|
||||
|
||||
err = ch_clone_scratch(scratch_prototype, &scratch_thread1);
|
||||
if (err != CH_SUCCESS) {
|
||||
printf("ch_clone_scratch failed!");
|
||||
exit(1);
|
||||
}
|
||||
err = ch_clone_scratch(scratch_prototype, &scratch_thread2);
|
||||
if (err != CH_SUCCESS) {
|
||||
printf("ch_clone_scratch failed!");
|
||||
exit(1);
|
||||
}
|
||||
|
||||
ch_free_scratch(scratch_prototype);
|
||||
|
||||
/* Now two threads can both scan against database db,
|
||||
each with its own scratch space. */
|
||||
|
||||
|
||||
=================
|
||||
Custom Allocators
|
||||
=================
|
||||
|
||||
By default, structures used by Chimera at runtime (scratch space, etc) are
|
||||
allocated with the default system allocators, usually
|
||||
``malloc()`` and ``free()``.
|
||||
|
||||
The Chimera API provides a facility for changing this behaviour to support
|
||||
applications that use custom memory allocators.
|
||||
|
||||
These functions are:
|
||||
|
||||
- :c:func:`ch_set_database_allocator`, which sets the allocate and free functions
|
||||
used for compiled pattern databases.
|
||||
- :c:func:`ch_set_scratch_allocator`, which sets the allocate and free
|
||||
functions used for scratch space.
|
||||
- :c:func:`ch_set_misc_allocator`, which sets the allocate and free functions
|
||||
used for miscellaneous data, such as compile error structures and
|
||||
informational strings.
|
||||
|
||||
The :c:func:`ch_set_allocator` function can be used to set all of the custom
|
||||
allocators to the same allocate/free pair.
|
||||
|
||||
|
||||
************************
|
||||
API Reference: Constants
|
||||
************************
|
||||
|
||||
===========
|
||||
Error Codes
|
||||
===========
|
||||
|
||||
.. doxygengroup:: CH_ERROR
|
||||
:content-only:
|
||||
:no-link:
|
||||
|
||||
=============
|
||||
Pattern flags
|
||||
=============
|
||||
|
||||
.. doxygengroup:: CH_PATTERN_FLAG
|
||||
:content-only:
|
||||
:no-link:
|
||||
|
||||
==================
|
||||
Compile mode flags
|
||||
==================
|
||||
|
||||
.. doxygengroup:: CH_MODE_FLAG
|
||||
:content-only:
|
||||
:no-link:
|
||||
|
||||
|
||||
********************
|
||||
API Reference: Files
|
||||
********************
|
||||
|
||||
==========
|
||||
File: ch.h
|
||||
==========
|
||||
|
||||
.. doxygenfile:: ch.h
|
||||
|
||||
=================
|
||||
File: ch_common.h
|
||||
=================
|
||||
|
||||
.. doxygenfile:: ch_common.h
|
||||
|
||||
==================
|
||||
File: ch_compile.h
|
||||
==================
|
||||
|
||||
.. doxygenfile:: ch_compile.h
|
||||
|
||||
==================
|
||||
File: ch_runtime.h
|
||||
==================
|
||||
|
||||
.. doxygenfile:: ch_runtime.h
|
@ -50,6 +50,8 @@ Very Quick Start
|
||||
Requirements
|
||||
************
|
||||
|
||||
.. _hardware:
|
||||
|
||||
Hardware
|
||||
========
|
||||
|
||||
|
@ -758,7 +758,7 @@ WARN_LOGFILE =
|
||||
# spaces.
|
||||
# Note: If this tag is empty the current directory is searched.
|
||||
|
||||
INPUT = @CMAKE_SOURCE_DIR@/src/hs.h @CMAKE_SOURCE_DIR@/src/hs_common.h @CMAKE_SOURCE_DIR@/src/hs_compile.h @CMAKE_SOURCE_DIR@/src/hs_runtime.h
|
||||
INPUT = @CMAKE_SOURCE_DIR@/src/hs.h @CMAKE_SOURCE_DIR@/src/hs_common.h @CMAKE_SOURCE_DIR@/src/hs_compile.h @CMAKE_SOURCE_DIR@/src/hs_runtime.h @CMAKE_SOURCE_DIR@/chimera/ch.h @CMAKE_SOURCE_DIR@/chimera/ch_common.h @CMAKE_SOURCE_DIR@/chimera/ch_compile.h @CMAKE_SOURCE_DIR@/chimera/ch_runtime.h
|
||||
|
||||
# This tag can be used to specify the character encoding of the source files
|
||||
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
|
||||
|
@ -20,3 +20,4 @@ Hyperscan |version| Developer's Reference Guide
|
||||
tools
|
||||
api_constants
|
||||
api_files
|
||||
chimera
|
||||
|
Loading…
x
Reference in New Issue
Block a user