Python: matching regex between file boundaries

Huge file with text data

I read a huge file in chunks using Python and apply a regex to each chunk. For every identifier tag, I want to extract the corresponding value. Because the chunks have a fixed size, a tag/value pair that straddles a chunk boundary is cut in two and the regex misses it.

Requirements:

  • The file should be read in chunks.
  • Block sizes must be less than or equal to 1 GiB.


Python sample code

import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
with open('huge_file', 'r') as f:
    # Read a single chunk of at most 1 GiB (in text mode the size is counted in characters)
    data_chunk = f.read(1024*1024*1024)
    m = identifier_pattern.findall(data_chunk)


Examples of such chunks

Good: the number of tags equals the number of values

Identifier: value
Identifier: value
Identifier: value
Identifier: value


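Bad: because of the block size, a value at the block boundary is cut off. The first block ends with "v" of "value"; the remaining "alue" lands in the next block.

Example:

Identifier: value
Identifier: value
Identifier: v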
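
The effect can be reproduced on a small scale. This is only a minimal sketch, assuming a tiny in-memory string and a chunk size of 25 characters in place of the real file and the 1 GiB limit; the helper name findall_in_fixed_chunks is made up for illustration.

```python
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')

# Four complete tag/value lines, 18 characters each
text = 'Identifier: value\n' * 4


def findall_in_fixed_chunks(data, chunk_size):
    """Apply the pattern to each fixed-size slice independently."""
    found = []
    for start in range(0, len(data), chunk_size):
        found += identifier_pattern.findall(data[start:start + chunk_size])
    return found


whole = identifier_pattern.findall(text)
chunked = findall_in_fixed_chunks(text, 25)

# Lines 2 and 3 straddle a slice boundary, so they are missed
print(len(whole), len(chunked))  # prints: 4 2
```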


How can I match the regex across chunk boundaries?


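If every identifier is on its own line, there is no need for fixed-size chunks; iterate over the file line by line (only one line is held in memory at a time):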

import re
matches = []
identifier_pattern = re.compile(r'Identifier: (.*?)$')
with open('huge_file') as f:
    for line in f:
        matches += re.findall(identifier_pattern, line)

print("matches", matches)

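If you really need chunks of about 1024 * 1024 * 1024 characters, accumulate whole lines up to that size and then run the regex on the chunk: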

import re


identifier_pattern = re.compile(r'Identifier: (.*?)\n')
counter = 1024 * 1024 * 1024  # maximum chunk size, in characters
data_chunk = ''
with open('huge_file', 'r') as f:
    for line in f:
        # Flush before the chunk would exceed the limit, so it always ends on a line boundary
        if data_chunk and len(data_chunk) + len(line) > counter:
            m = identifier_pattern.findall(data_chunk)
            print(m)  # findall returns a list of values, not a match object
            data_chunk = ''
        data_chunk += line
    # Analyse last chunk of data
    m = identifier_pattern.findall(data_chunk)
    print(m)

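Alternatively, read the file with read() in overlapping chunks and store the matches in a dictionary with key=[start position of matched string in file], so that a match found twice in an overlap is stored only once.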
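
A possible sketch of the read()-based idea with matches keyed by start position. Note that it swaps overlapping chunks for a simpler variant: the incomplete last line of each chunk is carried over and prepended to the next one. The function name find_identifiers and its parameters are hypothetical, and the sketch assumes that lines are much shorter than chunk_size (the working buffer can exceed chunk_size by one partial line) and that offsets are counted in characters of the text as read, not in bytes.

```python
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')


def find_identifiers(path, chunk_size=1024 * 1024 * 1024):
    """Return a dict mapping the start offset of each match to its value."""
    matches = {}
    tail = ''    # incomplete last line carried over from the previous chunk
    offset = 0   # position of tail[0] within the text as a whole
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            # Only search up to the last complete line in the buffer
            cut = data.rfind('\n') + 1
            for m in identifier_pattern.finditer(data, 0, cut):
                matches[offset + m.start()] = m.group(1)
            tail = data[cut:]
            offset += cut
    # A final line without a trailing newline is still checked
    for m in identifier_pattern.finditer(tail + '\n'):
        matches[offset + m.start()] = m.group(1)
    return matches
```

Calling find_identifiers('huge_file') would return something like {0: 'value', 18: 'value', ...}; because only complete lines are searched, no match can be split by a chunk boundary and no deduplication is needed.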


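If the data is line-based, the file object can be iterated directly (it reads lazily, one line at a time), which gives a shorter version: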

import re
matches = []
with open('huge_file') as f:
    for line in f:
        # Raw string so that \s is passed to the regex engine unchanged
        matches += re.findall(r"Identifier:\s(.*?)$", line)

I have a solution very similar to Jack's answer:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        m.extend(identifier_pattern.findall(line))

You can use another part of the regexp API to get the same result:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        pattern_found = identifier_pattern.search(line)
        if pattern_found:
            value_found = pattern_found.group(1)  # group(0) would include the "Identifier: " prefix
            m.append(value_found)
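
For the two variants to give the same result, the search-based one has to read the capture group rather than the whole match. A small illustration, assuming a single sample line (the variable names are only for this example):

```python
import re

identifier_pattern = re.compile(r'Identifier: (.*)$')
line = 'Identifier: value\n'

found = identifier_pattern.search(line)
whole_match = found.group(0)   # entire matched text, prefix included
captured = found.group(1)      # only the first capture group

# With exactly one capture group, findall returns the captured text
from_findall = identifier_pattern.findall(line)

print(repr(whole_match))   # 'Identifier: value'
print(repr(captured))      # 'value'
print(from_findall)        # ['value']
```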

The search-based loop can be simplified with a generator expression and a list comprehension:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

with open('huge_file', 'r') as f:
    patterns_found = (identifier_pattern.search(line) for line in f)
    m = [pattern_found.group(1)
         for pattern_found in patterns_found if pattern_found]