liblightgrep

Fast multipattern regular expression searching for digital forensics

About

Lightgrep is a regular expression search engine, designed specifically for digital forensics. Why another regexp engine?

Lightgrep:

searches for many patterns simultaneously
searches binary data as a stream, not as discrete lines of text
searches for patterns in many different encodings; give it dirty data, lightgrep don't care
never, ever, ever, never, never looks at a byte twice or backs up in your input
implements a large number of standard regexp features
follows the same hit-reporting rules as PCRE
degrades gracefully in performance as patterns increase in number and complexity

Lightgrep is still pretty new and doesn't have all the regexp features you might be used to. But it has enough features to be more than a toy, and what is supported is well-tested.

Technical Info

Lightgrep is implemented in portable C++11 but exposes a concise C API. The core of the API is defined in include/lightgrep/api.h. You can see a small example program at c_example/main.c.

Lightgrep depends on a number of Boost libraries and also on ICU. Currently you will need gcc 4.6+ or clang 3.1 to compile the libraries.

Install

See the BUILD.md file for installation instructions.

Usage

C API

Lightgrep is implemented in C++ but exposes a small C API which is considered the public interface.

aph.h—the primary search API
search_hit.h—defines the SearchHit struct and the search hit function callback signature
util.h—functions for reading context around hits and safely transcoding to UTF-8
encodings.h—names, aliases, and integer codes for all encodings lightgrep supports
transforms.h—character and byte transformations that lightgrep supports

In general, one must first create a few opaque objects: LG_HFSM, LG_HPATTERNMAP, LG_HPATTERN. Given a UTF-8 string of the pattern, lg_parse_pattern() validates the syntax of the pattern and populates the LG_HPATTERN with the abstract syntax tree representation of the pattern. This is then added to the finite state machine and the pattern map with lg_add_pattern(). lg_add_pattern() takes a single encoding chain string which is the encoding of the pattern that one wants to search for. The same pattern can be added multiple times with different encoding chains, ensuring that lightgrep matches on the pattern in differently encoded raw data.

Once all of the patterns have been added to the finite state machine and the pattern map, then the finite state machine must be transformed into a form that's amenable for searching. This is done with lg_create_program(), which takes the LG_HFSM and returns a new LG_HPROGRAM. The program is an opaque representation of the logic necessary to match against all of the patterns in all of their respective encodings. lg_create_program() is similar in concept to the traditional REGCOMP function and it should generally only be called once and then reused with different search streams; if not, you'll find that lightgrep is glacially slow.

After the program has been created, one can free memory by calling lg_destroy_fsm(). In contrast, the pattern map should be retained. Next, create a search context object with lg_create_context(). Creating a context is not as heavyweight as lg_create_program() but one should generally reuse a context rather than creating a new one. With a context, begin searching data with lg_search(), which takes a buffer of data, a logical offset to associate with the beginning of the buffer, a callback function for receiving hits, and a void* value of one's choice which will be passed through the to the callback when hits are discovered. lg_search() can be invoked repeatedly on separate buffers and it will treat them as a stream, automatically handling hits that cross boundaries. Lightgrep does not retain the buffers and searching will not, with average pattern sets, allocate much memory. However, problematic patterns can cause great increases in memory usage; more on this later.

Once a stream has been searched, it is essential to call lg_closeout_search(). This tells Lightgrep that no more data will be received on the stream and if there are any buffered hits, they will be flushed to the callback. lg_reset_context() should then be called to begin searching new streams. Once searching completes, call the destroy_ functions on the pattern map, the program, and the contexts.

The pattern map is nothing more than a lookup table that can be indexed into by SearchHit.KeywordIndex and will return a pointer to an LH_PatternInfo struct. This struct contains the pattern as its original UTF-8 string and a string representing the encoding chain of the hit. LG_PatternInfo also has a void* pointer which one can set to associate custom data. This can be used to funnel hits on different patterns to different handlers.

Lightgrep is thread agnostic. It performs no locking itself and in general is not thread safe. However, all API calls are re-entrant. Additionally, once all patterns have been added, the program and pattern map are immutable. They can therefore be shared safely between different threads. A typical threading scenario would be to have a pool of threads to handle searching different streams, with each thread having its own LG_HCONTEXT and sharing the pattern map and program. It is not really possible to use multithreading to parallelize searching of a single stream.

Python API

Lightgrep can be used from within Python scripts by importing the pylightgrep module located in the main lightgrep source tree. The Lightgrep Python bindings are designed to work in Python 2.7+ and 3.4+ in both 32-bit and 64-bit versions. The liblightgrep.dll and support DLL architectures must, however, match the Python architecture otherwise an error will be generated. The Python bindings use the ctypes module to import liblightgrep and call into it via the C API.

The Python API is documented in more depth here.

The list of supported encodings can be found at the ICU Project website.

Syntax

The table below contains an overview of simple syntax supported by Lightgrep. For a more detailed look at the full syntax, download the Lightgrep Syntax Cheat Sheet.

Expression	Description
\xhh	A single Unicode code point
\zhh	A single hex byte, regardless of encoding
\w	Match a "word" character [0-9A-Za-z_]
\W	Match a non-"word" character [^0-9A-Za-z_]
\s	Match a whitespace character [\t\n\f\r ]
\S	Match a non-whitespace character [^\t\n\f\r ]
\d	Match a decimal digit [0-9]
\D	Match a non-digit character [^0-9]
\.	Literal character (.)
.	Any character
*	Match 0 or more times
+	Match 1 or more times
?	Match 1 or 0 times
{n}	Match exactly n times
{n,}	Match at least n times
{n,m}	Match at least n but not more than m times
[a-z]	a through z
[xyz]	Either x, y, or z
[^xyz]	Not x, y, or z
(ab)	Group ab together for ?, +, *, \|
a\|b	Either a or b
------ Unicode Operations ------
\x{hhhhhh}	A single Unicode code point
\N{U+hhhhhh}	A single Unicode code point
\N{name}	A single Unicode code point with the given name
\p{property}	\p{Script=Cyrillic} matches any Cyrillic code point
\P{property}	Equivalent to [^\p{property}]

Implementation

We've published a couple of papers about Lightgrep's implementation. They'll be here shortly

Performance

Lightgrep is not the fastest single-pattern engine. However, its performance degrades gracefully as patterns are added and it can support very large numbers of patterns. It is generally faster to search a stream of data once with lightgrep than to search it once for each pattern with a single-pattern engine.