Multiple AVX512VBMI-related fixes:
src/nfa/mcsheng_compile.cpp: No need for an assert here, impl_id can be set to 0
src/nfa/nfa_api_queue.h: Make sure this compiles on both C++ and C
src/nfagraph/ng_fuzzy.cpp: Fix compilation error when DEBUG_OUTPUT=on
src/runtime.c: Fix crash when data == NULL
unit/internal/sheng.cpp: Unit test has to enable AVX512VBMI manually as autodetection does not get trigger, this causes test to fail
src/fdr/teddy_fat.cpp: AVX512 loads need to be 64-bit aligned, caused a crash on clang-18
256b wide SVE vectors allow some simplification of truffle.
Up to 40% speedup on graviton3. Going from 12500 MB/s to 17000 MB/s
onhe microbenchmark.
SVE2 also offer this capability for 128b vector with a speedup around
25% compared to normal SVE
Add unit tests and benchmark for this wide variant
Signed-off-by: Yoan Picchi <yoan.picchi@arm.com>