Liblightgrep

Fast multipattern regular expression searching for digital forensics

View the Project on GitHub strozfriedberg/liblightgrep

Lightgrep Python API

Overview

The Lightgrep Python bindings are designed to work in Python 2.7+ and 3.4+ in both 32-bit and 64-bit versions. The liblightgrep.dll and support DLL architectures must, however, match the Python architecture otherwise an error will be generated.

The list of supported encodings can be found at the ICU Project website.

API Concepts

PatternA regular expression to be matched.
Pattern MapStores user-defined data for each pattern.
FSMFinite State Machine. An intemediate representation of the patterns.
ProgramA compiled version of the FSM.
Search ContextA handle for conducting a search using a program.

Simple Example in Python

First, let's create some sample text to search (this isn't part of Lightgrep):

text = "These aren't the droids you're looking for."
data = text.encode('utf-8')

Now define the patterns, or keywords, to search for. A keyword contains a regular expression, list of encodings, and keyword options. Available keyword options are "fixedString" and "caseInsensitive". The default value for both options is False, meaning if you do not specify them, the search for that keyword will be GREP and case sensitive.

pats = (
    ('droids', ['UTF-8', 'UTF-16LE'], KeyOpts(caseInsensitive = True)),
    ('a.*?t', ['ASCII'], KeyOpts())
)

Create an accumulator that will store all hits as they are generated:

accum = HitAccumulator()

Search the data with the previously defined patterns:

with Lightgrep(pats, accum.lgCallback) as lg:
    hitcount = lg.searchBuffer(data, accum)

Iterate the list of hits:

for h in accum.Hits:
    print('hit on {} in {} at [{} ,{}): "{}"'.format(
        h['pattern'], h['encChain'], h['start'], h['end'],
        data [h['start']:h['end']].decode('utf-8')
    ))

Run the script to see the output:

$ ./ex.py
hit on a.*?t in ASCII at [6 ,12): "aren't"
hit on droids in UTF-8 at [17 ,23): "droids"

Where are the program, pattern map, FSM, and search context?

Not in the simple example.

Advanced example:

text = "These aren't the droids you're looking for."
data = text.encode('utf-8')

pats = (
    ('droids', ['UTF-8', 'UTF-16LE'], KeyOpts(caseInsensitive = True)),
    ('a.*?t', ['ASCII'], KeyOpts())
)
accum = HitAccumulator()

Everything is the same up to this point. Next, though, we can manually create the program, pattern map, and search context:

prog, pmap = Lightgrep.createProgram(pats)
lg = Lightgrep()
accum = HitAccumulator()
lg.createContext(prog, pmap, accum.lgCallback)

We still call searchBuffer() and iterate the hits in the accumulator:

hitcount = lg.searchBuffer(data, accum)
for h in accum.Hits:
    # do stuff with hits

After we're done, we can reset the accumulator and Lightgrep:

accum.reset()
lg.reset()

Now we can search another data stream without having to reinitialize everything:

hitcount = lg.searchBuffer(otherdata, accum)
for h in accum.Hits :
    # do stuff with other hits

Finally, close the Lightgrep object:

lg.close()

Use Lightgrep in the language you want!*

*As long as the language you want is C, C++, or Python (or you can write your own bindings).