Extract strings from a binary file in Python

I have a project where I am given a file and need to extract the strings from it. Basically, think of the "strings" command in Linux, but done in Python. A further condition is that the file is given to me as a stream (for example, a string), so the obvious answer of using one of the subprocess functions to run strings is not an option either.

I wrote this code:

def isStringChar(ch):
    if ord(ch) >= ord('a') and ord(ch) <= ord('z'): return True
    if ord(ch) >= ord('A') and ord(ch) <= ord('Z'): return True
    if ord(ch) >= ord('0') and ord(ch) <= ord('9'): return True

    if ch in ['/', '-', ':', '.', ',', '_', '$', '%', '\'', '(', ')', '[', ']', '<', '>', ' ']: return True

    # default out
    return False

def process(stream):
    dwStreamLen = len(stream)
    if dwStreamLen < 4: return None

    dwIndex = 0;
    strString = ''
    for ch in stream:
        if isStringChar(ch) == False:
            if len(strString) > 4:
                #print strString
                strString = ''
        else:
            strString += ch

This technically works, but it is WAY too slow. For example, I ran the strings command on a 500 MB executable, and it produced 300,000 lines of strings in less than 1 second. I ran the same file through the code above, and it took 16 minutes.

Is there a library out there that will let me do this without the overhead of Python's slowness?

Thanks!

+5

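This version uses re, Python's regular expression module. A library function that loops is implemented in C and will be much faster than an equivalent loop written in Python; the same goes for a char in set() test compared with checking each range yourself.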

import sys
import re

chars = r"A-Za-z0-9/\-:.,_$%'()[\]<> "
shortest_run = 4

regexp = '[%s]{%d,}' % (chars, shortest_run)
pattern = re.compile(regexp)

def process(stream):
    data = stream.read()
    return pattern.findall(data)

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print(found_str)

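Reading in 4k chunks would also be sensible, but be careful with re at the chunk edges (for example, when 2 characters of a string fall at the end of one 4k chunk and the rest at the start of the next).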
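A minimal sketch of one way to handle that edge case, assuming Python 3 and a binary stream that returns bytes. The names iter_strings, CHARS, MIN_RUN and RUN are made up for this illustration. The idea is to carry any run that touches the end of the buffer over into the next chunk instead of emitting it straight away:

```python
import io
import re

# Same character class as in the answer above, written as a bytes pattern.
CHARS = rb"A-Za-z0-9/\-:.,_$%'()[\]<> "
MIN_RUN = 4
# Match runs of any length; the length filter is applied after matching so
# that a short run at the end of a chunk can still grow in the next one.
RUN = re.compile(rb"[" + CHARS + rb"]+")

def iter_strings(stream, chunk_size=4096):
    """Yield runs of at least MIN_RUN allowed bytes from a binary stream."""
    tail = b""  # run left open at the end of the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        tail = b""
        for match in RUN.finditer(data):
            if match.end() == len(data):
                # The run touches the end of the buffer, so it may continue.
                tail = match.group()
            elif len(match.group()) >= MIN_RUN:
                yield match.group()
    # Flush a run that was still open when the stream ended.
    if len(tail) >= MIN_RUN:
        yield tail

if __name__ == "__main__":
    sample = io.BytesIO(b"\x00\x01hello world\x00ab\x00/usr/bin\xff")
    # A tiny chunk size forces runs to straddle chunk boundaries.
    for found in iter_strings(sample, chunk_size=5):
        print(found)
```

A very long run is re-scanned once per chunk in this sketch, which is fine for typical binaries but could be tightened if that matters.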

+7

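One problem is that you read the entire stream into memory (… = len(stream)); another is that isStringChar is slow (function calls are relatively expensive, and you make one per character).

Something like this should be better: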

import sys
import string

printable = set(string.printable)

def process(stream):
    found_str = ""
    while True:
        data = stream.read(1024*4)
        if not data:
            break
        for char in data:
            if char in printable:
                found_str += char
            elif len(found_str) >= 4:
                yield found_str
                found_str = ""
            else:
                found_str = ""
    # Flush a run that is still open when the stream ends.
    if len(found_str) >= 4:
        yield found_str

if __name__ == "__main__":
    for found_str in process(sys.stdin):
        print(found_str)

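This should be faster because:

  • The "is this character printable" check is a set lookup (O(1)), which is (probably) implemented in C (so it is fast).
  • It reads in 4k chunks, so the whole file never has to be held in memory at once.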
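A rough way to compare a per-character function call against a set membership test, as used in the code above, is a timing sketch like the following. It assumes Python 3; is_string_char, ALLOWED and the two count_* helpers are hypothetical names for this illustration, and the timings will vary by machine and interpreter:

```python
import string
import timeit

# Same characters the question's isStringChar accepts, as a set.
ALLOWED = set(string.ascii_letters + string.digits + "/-:.,_$%'()[]<> ")

def is_string_char(ch):
    # Per-character helper in the style of the question's isStringChar.
    return ('a' <= ch <= 'z' or 'A' <= ch <= 'Z' or '0' <= ch <= '9'
            or ch in "/-:.,_$%'()[]<> ")

def count_with_function(text):
    # One Python-level function call per character.
    return sum(1 for ch in text if is_string_char(ch))

def count_with_set(text):
    # One set membership test per character, no extra function call.
    return sum(1 for ch in text if ch in ALLOWED)

if __name__ == "__main__":
    sample = "".join(chr(i % 256) for i in range(200_000))
    for fn in (count_with_function, count_with_set):
        seconds = timeit.timeit(lambda: fn(sample), number=5)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Both helpers accept the same characters, so any difference in the printed times comes from how the test is performed rather than from what is tested.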
+5
