From 746d1eafe5cb26df6c920c6ad09a35d9173c8c99 Mon Sep 17 00:00:00 2001
From: "Wang, Xiang W"
Date: Wed, 27 Jun 2018 10:21:50 -0400
Subject: [PATCH] chimera: update dev-reference

---
 doc/dev-reference/chimera.rst           | 333 ++++++++++++++++++++++++
 doc/dev-reference/getting_started.rst   |   2 +
 doc/dev-reference/hyperscan.doxyfile.in |   2 +-
 doc/dev-reference/index.rst             |   1 +
 4 files changed, 337 insertions(+), 1 deletion(-)
 create mode 100644 doc/dev-reference/chimera.rst

diff --git a/doc/dev-reference/chimera.rst b/doc/dev-reference/chimera.rst
new file mode 100644
index 00000000..883cb5a0
--- /dev/null
+++ b/doc/dev-reference/chimera.rst
@@ -0,0 +1,333 @@
+.. _chimera:
+
+#######
+Chimera
+#######
+
+This section describes the Chimera library.
+
+************
+Introduction
+************
+
+Chimera is a software regular expression matching engine that is a hybrid of
+Hyperscan and PCRE. The design goals of Chimera are to fully support PCRE
+syntax as well as to take advantage of the high-performance nature of Hyperscan.
+
+Chimera follows Hyperscan's design guidelines, with C APIs for compilation
+and scanning.
+
+The Chimera API itself is composed of two major components:
+
+===========
+Compilation
+===========
+
+These functions take a group of regular expressions, along with identifiers and
+option flags, and compile them into an immutable database that can be used by
+the Chimera scanning API. This compilation process performs considerable
+analysis and optimization work in order to build a database that will match
+the given expressions efficiently.
+
+See :ref:`chcompile` for more details.
+
+========
+Scanning
+========
+
+Once a Chimera database has been created, it can be used to scan data in memory.
+Chimera only supports block mode, in which a single contiguous block of memory
+is scanned.
+
+Matches are delivered to the application via a user-supplied callback function
+that is called synchronously for each match.
+
+For a given database, Chimera provides several guarantees:
+
+* No memory allocations occur at runtime, with the exception of scratch space
+  allocation, which should be done ahead of time for performance-critical
+  applications:
+
+  - **Scratch space**: temporary memory used for internal data at scan time.
+    Structures in scratch space do not persist beyond the end of a single scan
+    call.
+
+* The size of the scratch space required for a given database is fixed and
+  determined at database compile time. This means that the memory requirements
+  of the application are known ahead of time, and the scratch space can be
+  pre-allocated if required for performance reasons.
+
+* Any pattern that has successfully been compiled by the Chimera compiler can
+  be scanned against any input. However, internal resource limits or other
+  limitations imposed by PCRE at runtime could cause a scan call to return an
+  error.
+
+.. note:: Chimera is designed to have the same matching behavior as PCRE,
+   including greedy/ungreedy, capturing, etc. Chimera reports both
+   **start offset** and **end offset** for each match like PCRE. Unlike
+   Hyperscan, which reports all matches, Chimera only reports
+   non-overlapping matches. For example, the pattern :regexp:`/foofoo/` will
+   match ``foofoofoofoo`` at offsets (0, 6) and (6, 12).
+
+.. note:: Since Chimera is a hybrid of Hyperscan and PCRE in order to support
+   full PCRE syntax, there will be extra performance overhead compared to a
+   Hyperscan-only solution. Prefer Hyperscan for better performance
+   unless you need full PCRE syntax support.
+
+See :ref:`chruntime` for more details.
+
+************
+Requirements
+************
+
+The PCRE library (http://pcre.org/) version 8.41 is required for Chimera.
+
+.. note:: Since Chimera needs to reference PCRE internal functions, please place the PCRE source
+   directory under the Hyperscan root directory in order to build Chimera.
+
+Besides this, the hardware and software requirements of Chimera are the same as those of Hyperscan.
+See :ref:`hardware` and :ref:`software` for more details.
+
+.. note:: Building Hyperscan will automatically generate the Chimera library.
+   Currently only a static library is supported for Chimera, so please
+   use the static build type when configuring CMake build options.
+
+.. _chcompile:
+
+******************
+Compiling Patterns
+******************
+
+===================
+Building a Database
+===================
+
+The Chimera compiler API accepts regular expressions and converts them into a
+compiled pattern database that can then be used to scan data.
+
+The API provides three functions that compile regular expressions into
+databases:
+
+#. :c:func:`ch_compile`: compiles a single expression into a pattern database.
+
+#. :c:func:`ch_compile_multi`: compiles an array of expressions into a pattern
+   database. All of the supplied patterns will be scanned for concurrently at
+   scan time, with user-supplied identifiers returned when they match.
+
+#. :c:func:`ch_compile_ext_multi`: compiles an array of expressions as above,
+   but allows PCRE match limits to be specified for each expression.
+
+Compilation allows the Chimera library to analyze the given pattern(s) and
+pre-determine how to scan for these patterns in an optimized fashion using
+Hyperscan and PCRE.
+
+===============
+Pattern Support
+===============
+
+Chimera fully supports the pattern syntax used by the PCRE library ("libpcre"),
+described at .The version of PCRE used to validate
+Chimera's interpretation of this syntax is 8.41.
+
+=========
+Semantics
+=========
+
+Chimera supports exactly the same semantics as the PCRE library. Moreover, it supports
+matching multiple patterns simultaneously, like Hyperscan, and matches
+are reported in order of end offset.
+
+.. _chruntime:
+
+*********************
+Scanning for Patterns
+*********************
+
+Chimera provides the scan function ``ch_scan``.
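The reporting behaviour described in the Introduction (non-overlapping matches, each delivered to a callback with a start and an end offset) can be sketched in plain C without linking against Chimera. This is only a stand-in under stated assumptions: `on_match_fn`, `match_list`, `record_match` and `report_non_overlapping` are hypothetical names, a literal search via `strstr` takes the place of real regular expression matching, and none of it is Chimera API.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical callback type (not Chimera API): receives the start and end
 * offsets of one match. Returning non-zero halts the scan. */
typedef int (*on_match_fn)(unsigned long long from, unsigned long long to,
                           void *ctx);

/* Hypothetical fixed-size store for the reported offset pairs. */
struct match_list {
    unsigned long long from[8];
    unsigned long long to[8];
    size_t count;
};

/* Callback that records each reported match and asks the scan to continue. */
static int record_match(unsigned long long from, unsigned long long to,
                        void *ctx) {
    struct match_list *ml = ctx;
    if (ml->count < 8) {
        ml->from[ml->count] = from;
        ml->to[ml->count] = to;
        ml->count++;
    }
    return 0;
}

/* Report every non-overlapping occurrence of `literal` in `data`.
 * After a match, the search resumes at the match's end offset, so a later
 * match can never start inside an earlier one. */
static void report_non_overlapping(const char *literal, const char *data,
                                   on_match_fn cb, void *ctx) {
    assert(literal != NULL && data != NULL && cb != NULL);
    size_t len = strlen(literal);
    const char *p = data;
    if (len == 0) {
        return; /* an empty literal would never advance */
    }
    while ((p = strstr(p, literal)) != NULL) {
        unsigned long long from = (unsigned long long)(p - data);
        if (cb(from, from + len, ctx) != 0) {
            break; /* callback asked to halt */
        }
        p += len; /* skip past this match: non-overlapping semantics */
    }
}
```

Called with the literal `foofoo` and the input `foofoofoofoo`, the recorded pairs are (0, 6) and (6, 12), as in the note's example. A matcher that reported every occurrence would also include (3, 9).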
+
+================
+Handling Matches
+================
+
+``ch_scan`` will call a user-supplied callback function when a match
+is found. This function has the following signature:
+
+    .. doxygentypedef:: ch_match_event_handler
+      :outline:
+      :no-link:
+
+The *id* argument will be set to the identifier for the matching expression
+provided at compile time, the *from* argument will be set to the
+start offset of the match, and the *to* argument will be set to the end offset
+of the match. The *captured* argument stores the offsets of the entire pattern match as well as
+captured subexpressions. The *size* argument will be set to the number of valid entries in
+*captured*.
+
+The match callback function has the capability to continue or halt scanning
+by returning different values.
+
+See :c:type:`ch_match_event_handler` for more information.
+
+=======================
+Handling Runtime Errors
+=======================
+
+``ch_scan`` will call a user-supplied callback function when a runtime error
+occurs in libpcre. This function has the following signature:
+
+    .. doxygentypedef:: ch_error_event_handler
+      :outline:
+      :no-link:
+
+The *id* argument will be set to the identifier for the matching expression
+provided at compile time.
+
+The error callback function has the capability to either halt scanning or
+continue scanning for the next pattern.
+
+See :c:type:`ch_error_event_handler` for more information.
+
+=============
+Scratch Space
+=============
+
+While scanning data, Chimera needs a small amount of temporary memory to store
+on-the-fly internal data. This amount is unfortunately too large to fit on the
+stack, particularly for embedded applications, and allocating memory dynamically
+is too expensive, so a pre-allocated "scratch" space must be provided to the
+scanning functions.
+
+The function :c:func:`ch_alloc_scratch` allocates a large enough region of
+scratch space to support a given database. If the application uses multiple
+databases, only a single scratch region is necessary: in this case, calling
+:c:func:`ch_alloc_scratch` on each database (with the same ``scratch`` pointer)
+will ensure that the scratch space is large enough to support scanning against
+any of the given databases.
+
+While the Chimera library is re-entrant, the use of scratch spaces is not.
+For example, if by design it is deemed necessary to run recursive or nested
+scanning (say, from the match callback function), then an additional scratch
+space is required for that context.
+
+In the absence of recursive scanning, only one such space is required per thread
+and can (and indeed should) be allocated before data scanning is to commence.
+
+In a scenario where a set of expressions is compiled by a single "master"
+thread and data will be scanned by multiple "worker" threads, the convenience
+function :c:func:`ch_clone_scratch` allows multiple copies of an existing
+scratch space to be made for each thread (rather than forcing the caller to pass
+all the compiled databases through :c:func:`ch_alloc_scratch` multiple times).
+
+For example:
+
+.. code-block:: c
+
+    ch_error_t err;
+    ch_scratch_t *scratch_prototype = NULL;
+    err = ch_alloc_scratch(db, &scratch_prototype);
+    if (err != CH_SUCCESS) {
+        printf("ch_alloc_scratch failed!");
+        exit(1);
+    }
+
+    ch_scratch_t *scratch_thread1 = NULL;
+    ch_scratch_t *scratch_thread2 = NULL;
+
+    err = ch_clone_scratch(scratch_prototype, &scratch_thread1);
+    if (err != CH_SUCCESS) {
+        printf("ch_clone_scratch failed!");
+        exit(1);
+    }
+    err = ch_clone_scratch(scratch_prototype, &scratch_thread2);
+    if (err != CH_SUCCESS) {
+        printf("ch_clone_scratch failed!");
+        exit(1);
+    }
+
+    ch_free_scratch(scratch_prototype);
+
+    /* Now two threads can both scan against database db,
+       each with its own scratch space. */
+
+
+=================
+Custom Allocators
+=================
+
+By default, structures used by Chimera at runtime (scratch space, etc.) are
+allocated with the default system allocators, usually
+``malloc()`` and ``free()``.
+
+The Chimera API provides a facility for changing this behaviour to support
+applications that use custom memory allocators.
+
+These functions are:
+
+- :c:func:`ch_set_database_allocator`, which sets the allocate and free functions
+  used for compiled pattern databases.
+- :c:func:`ch_set_scratch_allocator`, which sets the allocate and free
+  functions used for scratch space.
+- :c:func:`ch_set_misc_allocator`, which sets the allocate and free functions
+  used for miscellaneous data, such as compile error structures and
+  informational strings.
+
+The :c:func:`ch_set_allocator` function can be used to set all of the custom
+allocators to the same allocate/free pair.
+
+
+************************
+API Reference: Constants
+************************
+
+===========
+Error Codes
+===========
+
+.. doxygengroup:: CH_ERROR
+  :content-only:
+  :no-link:
+
+=============
+Pattern flags
+=============
+
+.. doxygengroup:: CH_PATTERN_FLAG
+  :content-only:
+  :no-link:
+
+==================
+Compile mode flags
+==================
+
+.. doxygengroup:: CH_MODE_FLAG
+  :content-only:
+  :no-link:
+
+
+********************
+API Reference: Files
+********************
+
+==========
+File: ch.h
+==========
+
+.. doxygenfile:: ch.h
+
+=================
+File: ch_common.h
+=================
+
+.. doxygenfile:: ch_common.h
+
+==================
+File: ch_compile.h
+==================
+
+.. doxygenfile:: ch_compile.h
+
+==================
+File: ch_runtime.h
+==================
+
+.. doxygenfile:: ch_runtime.h
diff --git a/doc/dev-reference/getting_started.rst b/doc/dev-reference/getting_started.rst
index 942d6e21..4e4d36f3 100644
--- a/doc/dev-reference/getting_started.rst
+++ b/doc/dev-reference/getting_started.rst
@@ -50,6 +50,8 @@ Very Quick Start
 Requirements
 ************
 
+.. _hardware:
+
 Hardware
 ========
 
diff --git a/doc/dev-reference/hyperscan.doxyfile.in b/doc/dev-reference/hyperscan.doxyfile.in
index a0173958..b9eaf078 100644
--- a/doc/dev-reference/hyperscan.doxyfile.in
+++ b/doc/dev-reference/hyperscan.doxyfile.in
@@ -758,7 +758,7 @@ WARN_LOGFILE =
 # spaces.
 # Note: If this tag is empty the current directory is searched.
 
-INPUT = @CMAKE_SOURCE_DIR@/src/hs.h @CMAKE_SOURCE_DIR@/src/hs_common.h @CMAKE_SOURCE_DIR@/src/hs_compile.h @CMAKE_SOURCE_DIR@/src/hs_runtime.h
+INPUT = @CMAKE_SOURCE_DIR@/src/hs.h @CMAKE_SOURCE_DIR@/src/hs_common.h @CMAKE_SOURCE_DIR@/src/hs_compile.h @CMAKE_SOURCE_DIR@/src/hs_runtime.h @CMAKE_SOURCE_DIR@/chimera/ch.h @CMAKE_SOURCE_DIR@/chimera/ch_common.h @CMAKE_SOURCE_DIR@/chimera/ch_compile.h @CMAKE_SOURCE_DIR@/chimera/ch_runtime.h
 
 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
diff --git a/doc/dev-reference/index.rst b/doc/dev-reference/index.rst
index 32f188dd..b5d6a54b 100644
--- a/doc/dev-reference/index.rst
+++ b/doc/dev-reference/index.rst
@@ -20,3 +20,4 @@ Hyperscan |version| Developer's Reference Guide
    tools
    api_constants
    api_files
+   chimera