Fast multipattern regular expression searching for digital forensics
The Lightgrep Python bindings are designed to work with Python 2.7+ and 3.4+, in both 32-bit and 64-bit builds. However, the architecture of liblightgrep.dll and its support DLLs must match the architecture of the Python interpreter; otherwise, an error will be generated.
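Because a mismatch only shows up as an error at load time, it can help to check the interpreter's bitness before choosing which DLLs to install. The following is a minimal sketch using only the Python standard library; the helper name `python_bits` is illustrative and is not part of Lightgrep.

```python
import platform
import struct


def python_bits():
    """Return the pointer width of the running interpreter in bits (32 or 64)."""
    # The size of a C pointer ("P") reflects how the interpreter itself was
    # built, which is what the DLL architecture has to match.
    return struct.calcsize("P") * 8


print("Python {} is a {}-bit build".format(platform.python_version(), python_bits()))
```

A 32-bit Python on a 64-bit operating system reports 32 here, so it would need the 32-bit DLLs.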
The list of supported encodings can be found at the ICU Project website.
| Term | Definition |
| --- | --- |
| Pattern | A regular expression to be matched. |
| Pattern Map | Stores user-defined data for each pattern. |
| FSM | Finite State Machine. An intermediate representation of the patterns. |
| Program | A compiled version of the FSM. |
| Search Context | A handle for conducting a search using a program. |
First, let's create some sample text to search (this isn't part of Lightgrep):
text = "These aren't the droids you're looking for."
data = text.encode('utf-8')
Now define the patterns, or keywords, to search for. A keyword consists of a regular expression, a list of encodings, and keyword options. The available keyword options are "fixedString" and "caseInsensitive". Both default to False, so if you do not specify them, the keyword is treated as a regular expression (grep-style) and matched case-sensitively.
pats = (
    ('droids', ['UTF-8', 'UTF-16LE'], KeyOpts(caseInsensitive=True)),
    ('a.*?t', ['ASCII'], KeyOpts())
)
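A keyword lists several encodings because the same characters are stored as different bytes under each one. The sketch below uses only Python's built-in codecs, not Lightgrep, and Python's codec names rather than ICU's, to show the two byte sequences for "droids" and which of them occurs in the UTF-8 sample data.

```python
keyword = "droids"

# Byte sequences a searcher has to look for under each encoding.
utf8_bytes = keyword.encode("utf-8")
utf16le_bytes = keyword.encode("utf-16-le")

print(utf8_bytes)     # b'droids'
print(utf16le_bytes)  # b'd\x00r\x00o\x00i\x00d\x00s\x00'

# The sample text is encoded as UTF-8 only, so only the UTF-8 form occurs in it.
data = "These aren't the droids you're looking for.".encode("utf-8")
print(data.find(utf8_bytes))     # 17
print(data.find(utf16le_bytes))  # -1
```

Listing both encodings for one keyword lets a single search cover data whose encoding is not known in advance.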
Create an accumulator that will store all hits as they are generated:
accum = HitAccumulator()
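The accumulator follows a common callback pattern: the searcher calls a function once per hit, and that function appends the hit to a list. The class below is a hypothetical stand-in written for illustration only. Its name, its `callback` method, and the layout of the hit dictionary are assumptions and do not reproduce the actual implementation of `HitAccumulator`.

```python
class SimpleAccumulator(object):
    """Hypothetical stand-in for a hit accumulator (not Lightgrep's class)."""

    def __init__(self):
        self.hits = []

    def callback(self, hit):
        # Called once per hit by whatever is performing the search.
        self.hits.append(hit)

    def reset(self):
        # Forget earlier hits so the accumulator can be reused.
        self.hits = []


acc = SimpleAccumulator()
acc.callback({"pattern": "droids", "start": 17, "end": 23})
print(len(acc.hits))  # 1
acc.reset()
print(len(acc.hits))  # 0
```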
Search the data with the previously defined patterns:
with Lightgrep(pats, accum.lgCallback) as lg:
    hitcount = lg.searchBuffer(data, accum)
Iterate the list of hits:
for h in accum.Hits:
    print('hit on {} in {} at [{} ,{}): "{}"'.format(
        h['pattern'],
        h['encChain'],
        h['start'],
        h['end'],
        data[h['start']:h['end']].decode('utf-8')
    ))
Run the script to see the output:
$ ./ex.py
hit on a.*?t in ASCII at [6 ,12): "aren't"
hit on droids in UTF-8 at [17 ,23): "droids"
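The offsets in the output above are half-open byte ranges: `start` is the first byte of the hit and `end` is one past the last. As a sanity check, the same two patterns can be run against the sample data with Python's standard `re` module. This is a stand-in for comparison only; it does not use Lightgrep, and it finds just the first match of each pattern.

```python
import re

data = "These aren't the droids you're looking for.".encode("utf-8")

# (pattern, flags) pairs mirroring the two keywords; `re` stands in for Lightgrep.
checks = [
    (b"a.*?t", 0),
    (b"droids", re.IGNORECASE),
]

spans = []
for pattern, flags in checks:
    m = re.search(pattern, data, flags)
    spans.append((pattern, m.start(), m.end()))
    # Slicing with [start:end] recovers exactly the matched bytes.
    print("{!r} at [{}, {}): {!r}".format(
        pattern, m.start(), m.end(), data[m.start():m.end()]))
```

Both patterns produce the same ranges as the Lightgrep run, [6, 12) and [17, 23).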
Where are the program, pattern map, FSM, and search context?
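Not in the simple example, which does not expose them. The longer example below creates them explicitly, starting from the same text, patterns, and accumulator: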
text = "These aren't the droids you're looking for."
data = text.encode('utf-8')
pats = (
    ('droids', ['UTF-8', 'UTF-16LE'], KeyOpts(caseInsensitive=True)),
    ('a.*?t', ['ASCII'], KeyOpts())
)
accum = HitAccumulator()
Everything is the same up to this point. Next, though, we can manually create the program, pattern map, and search context:
prog, pmap = Lightgrep.createProgram(pats)
lg = Lightgrep()
accum = HitAccumulator()
lg.createContext(prog, pmap, accum.lgCallback)
We still call searchBuffer() and iterate the hits in the accumulator:
hitcount = lg.searchBuffer(data, accum)
for h in accum.Hits:
    pass  # do stuff with hits
After we're done, we can reset the accumulator and Lightgrep:
accum.reset()
lg.reset()
Now we can search another data stream without having to reinitialize everything:
hitcount = lg.searchBuffer(otherdata, accum)
for h in accum.Hits:
    pass  # do stuff with other hits
Finally, close the Lightgrep object:
lg.close()
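The point of the longer workflow is that the expensive setup happens once and is reused for every buffer, with only the collected hits cleared in between. The sketch below illustrates that compile-once, search-many shape using Python's standard `re` module as a stand-in. It does not call Lightgrep, and the `search_buffer` helper and the sample buffers are hypothetical.

```python
import re

# Compile once (stand-in for building the program and search context).
compiled = [re.compile(b"droids", re.IGNORECASE), re.compile(b"a.*?t")]


def search_buffer(buf, hits):
    """Append (pattern, start, end) for every match in buf; return the hit count."""
    before = len(hits)
    for rx in compiled:
        for m in rx.finditer(buf):
            hits.append((rx.pattern, m.start(), m.end()))
    return len(hits) - before


hits = []
buffers = [b"These aren't the droids", b"No DROIDS at all"]
counts = []
for buf in buffers:
    del hits[:]  # clear earlier hits, in the spirit of accum.reset()
    counts.append(search_buffer(buf, hits))

print(counts)  # [2, 2]
```

Skipping the reset would leave hits from the previous buffer mixed in with the new ones, and their offsets would refer to the wrong data.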
Use Lightgrep in the language you want!*
*As long as the language you want is C, C++, or Python (or you can write your own bindings).