Fast multipattern regular expression searching for digital forensics
Fast multipattern regular expression searching for digital forensics
Lightgrep is a regular expression search engine, designed specifically for digital forensics. Why another regexp engine?
Lightgrep:
Lightgrep is still pretty new and doesn't have all the regexp features you might be used to. But it has enough features to be more than a toy, and what is supported is well-tested.
liblightgrep is copyright (c) 2010-2015, Stroz Friedberg, LLC. liblightgrep is available under version 3 of the GNU Public License. See COPYING for details.
Lightgrep is implemented in portable C++11 but exposes a concise C API. The core of the API is defined in include/lightgrep/api.h. You can see a small example program at c_example/main.c.
Lightgrep depends on a number of Boost libraries and also on ICU. Currently you will need gcc 4.6+ or clang 3.1 to compile the libraries.
See the BUILD.md file for installation instructions.
Lightgrep is implemented in C++ but exposes a small C API which is considered the public interface.
In general, one must first create a few opaque objects: LG_HFSM, LG_HPATTERNMAP, LG_HPATTERN.
Given a UTF-8 string of the pattern, lg_parse_pattern()
validates the syntax of the pattern and populates the
LG_HPATTERN
with the abstract syntax tree representation of the pattern. This is then added to the finite state
machine and the pattern map with lg_add_pattern()
. lg_add_pattern()
takes a single encoding chain
string which is the encoding of the pattern that one wants to search for. The same pattern can be added multiple times with
different encoding chains, ensuring that lightgrep matches on the pattern in differently encoded raw data.
Once all of the patterns have been added to the finite state machine and the pattern map, then the finite state machine must be
transformed into a form that's amenable for searching. This is done with lg_create_program()
, which takes the
LG_HFSM
and returns a new LG_HPROGRAM
. The program is an opaque representation of the logic necessary
to match against all of the patterns in all of their respective encodings. lg_create_program()
is similar in
concept to the traditional REGCOMP function and it should generally only be called once and then reused with different search
streams; if not, you'll find that lightgrep is glacially slow.
After the program has been created, one can free memory by calling lg_destroy_fsm()
. In contrast, the pattern map
should be retained. Next, create a search context object with lg_create_context()
. Creating a context is not as
heavyweight as lg_create_program()
but one should generally reuse a context rather than creating a new one. With a
context, begin searching data with lg_search()
, which takes a buffer of data, a logical offset to associate with
the beginning of the buffer, a callback function for receiving hits, and a void*
value of one's choice which will
be passed through the to the callback when hits are discovered. lg_search()
can be invoked repeatedly on separate
buffers and it will treat them as a stream, automatically handling hits that cross boundaries. Lightgrep does not retain the
buffers and searching will not, with average pattern sets, allocate much memory. However, problematic patterns can cause great
increases in memory usage; more on this later.
Once a stream has been searched, it is essential to call lg_closeout_search()
. This tells Lightgrep that
no more data will be received on the stream and if there are any buffered hits, they will be flushed to the callback.
lg_reset_context()
should then be called to begin searching new streams. Once searching completes, call the
destroy_
functions on the pattern map, the program, and the contexts.
The pattern map is nothing more than a lookup table that can be indexed into by SearchHit.KeywordIndex
and will
return a pointer to an LH_PatternInfo
struct. This struct contains the pattern as its original UTF-8 string and a
string representing the encoding chain of the hit. LG_PatternInfo
also has a void*
pointer
which one can set to associate custom data. This can be used to funnel hits on different patterns to different handlers.
Lightgrep is thread agnostic. It performs no locking itself and in general is not thread safe. However, all API calls are
re-entrant. Additionally, once all patterns have been added, the program and pattern map are immutable. They can therefore be
shared safely between different threads. A typical threading scenario would be to have a pool of threads to handle searching
different streams, with each thread having its own LG_HCONTEXT
and sharing the pattern map and program. It is not
really possible to use multithreading to parallelize searching of a single stream.
Lightgrep can be used from within Python scripts by importing the pylightgrep
module located in the main lightgrep
source tree. The Lightgrep Python bindings are designed to work in Python 2.7+ and 3.4+
in both 32-bit and 64-bit versions. The liblightgrep.dll and support DLL architectures must, however,
match the Python architecture otherwise an error will be generated. The Python bindings use the ctypes module to
import liblightgrep and call into it via the C API.
The Python API is documented in more depth here.
The list of supported encodings can be found at the ICU Project website.
Expression | Description |
---|---|
\xhh | A single Unicode code point |
\zhh | A single hex byte, regardless of encoding |
\w | Match a "word" character [0-9A-Za-z_] |
\W | Match a non-"word" character [^0-9A-Za-z_] |
\s | Match a whitespace character [\t\n\f\r ] |
\S | Match a non-whitespace character [^\t\n\f\r ] |
\d | Match a decimal digit [0-9] |
\D | Match a non-digit character [^0-9] |
\. | Literal character (.) |
. | Any character |
* | Match 0 or more times |
+ | Match 1 or more times |
? | Match 1 or 0 times |
{n} | Match exactly n times |
{n,} | Match at least n times |
{n,m} | Match at least n but not more than m times |
[a-z] | a through z |
[xyz] | Either x, y, or z |
[^xyz] | Not x, y, or z |
(ab) | Group ab together for ?, +, *, | |
a|b | Either a or b |
------ Unicode Operations ------ | |
\x{hhhhhh} | A single Unicode code point |
\N{U+hhhhhh} | A single Unicode code point |
\N{name} | A single Unicode code point with the given name |
\p{property} | \p{Script=Cyrillic} matches any Cyrillic code point |
\P{property} | Equivalent to [^\p{property}] |
We've published a couple of papers about Lightgrep's implementation. They'll be here shortly
Lightgrep is not the fastest single-pattern engine. However, its performance degrades gracefully as patterns are added and it can support very large numbers of patterns. It is generally faster to search a stream of data once with lightgrep than to search it once for each pattern with a single-pattern engine.