Python: matching regex between file boundaries

Question

Huge file with text data

I read a huge file in chunks using Python and apply a regex to each chunk. Based on an identifier tag, I want to extract the corresponding value. Because the chunks are cut at a fixed size, a tag or its value can be split across the boundary between two chunks, so the match is incomplete or missing.

Requirements:

The file should be read in chunks.
Block sizes must be less than or equal to 1 GiB.

Python sample code

import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')
with open('huge_file', 'r') as f:
    # Read one chunk of at most 1 GiB (counted in characters in text mode)
    data_chunk = f.read(1024*1024*1024)
    m = identifier_pattern.findall(data_chunk)

Examples of chunks

Good: the number of tags equals the number of values

Identifier: value
Identifier: value
Identifier: value
Identifier: value

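Because of the chunk size, a value can be cut at the boundary. In the example below, the last identifier in the chunk yields "v" instead of "value". The next chunk starts with "alue". The value is lost after parsing.

Bad: the last value is incomplete

Identifier: value
Identifier: value
Identifier: v

How can I handle matches that fall on a chunk boundary?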

python regex

JodyK 27 '17 1:53

Jack · Answer 1 · 2017-05-27T02:11:09+0000

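If the file is line-based, you do not need fixed-size chunks at all: a file object is a lazy iterator over its lines, so you can process one line at a time (only the current line is held in memory):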

import re
matches = []
identifier_pattern = re.compile(r'Identifier: (.*?)$')
with open('huge_file') as f:
    for line in f:
        matches += re.findall(identifier_pattern, line)

print("matches", matches)

Andriy Ivaneyko · Answer 2 · 2017-05-27T10:12:54+0000

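If you need chunks of about 1024 * 1024 * 1024 characters, build each chunk from whole lines so that no line is ever split: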

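import re


identifier_pattern = re.compile(r'Identifier: (.*?)\n')
counter = 1024 * 1024 * 1024  # maximum chunk size, in characters
lines = []       # whole lines collected for the current chunk
chunk_size = 0
with open('huge_file', 'r') as f:
    for line in f:
        # Flush before this line would push the chunk past the limit
        if lines and chunk_size + len(line) > counter:
            m = identifier_pattern.findall(''.join(lines))
            print(m)  # findall returns a list of captured values
            lines, chunk_size = [], 0
        lines.append(line)
        chunk_size += len(line)
# Analyse last chunk of data
m = identifier_pattern.findall(''.join(lines))
print(m)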

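Alternatively, you can go over the file twice with a different start of read each time (first pass: from 0, second pass: from an offset, for example the maximum length of a matched string collected during the first pass), and store the values in a dictionary where key=[start position of matched string in file]. That position is the same in both passes, so merging the results is straightforward, although merging by start position and match length would be more accurate.

Good luck!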
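The chunked-read requirement from the question can also be met directly by carrying the incomplete last line of each chunk over to the next read. The following is only a minimal sketch under some assumptions: the helper name `find_identifiers` and its `chunk_size` parameter are hypothetical, every tag and its value sit on a single line, and the text handed to the regex can be slightly longer than one read because it includes the carried-over tail.

```python
import re

identifier_pattern = re.compile(r'Identifier: (.*?)\n')


def find_identifiers(path, chunk_size=1024 * 1024 * 1024):
    """Yield identifier values, reading at most chunk_size characters per read.

    Hypothetical helper for illustration; not taken from any answer here.
    """
    tail = ''  # incomplete last line carried over from the previous chunk
    with open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            # Everything after the last newline may be a truncated line,
            # so keep it back until the next chunk completes it.
            cut = data.rfind('\n') + 1
            complete, tail = data[:cut], data[cut:]
            for value in identifier_pattern.findall(complete):
                yield value
    # A final line without a trailing newline is still a complete record.
    if tail:
        for value in identifier_pattern.findall(tail + '\n'):
            yield value
```

Usage would look like `for value in find_identifiers('huge_file'): ...`. Because the function is a generator, the matches themselves are not accumulated in memory unless the caller collects them.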

Pedro Lobito · Answer 3 · 2017-05-27T03:03:23+0000

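If the data is line-based, note that iterating over a file object reads it lazily (one line at a time, not the whole file), so you can simply use: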

import re
matches = []
# The with block closes the file; the raw string keeps \s a regex escape
with open('huge_file') as f:
    for line in f:
        matches += re.findall(r"Identifier:\s(.*?)$", line)

Evensf · Answer 4 · 2017-05-27T02:57:56+0000

I have a solution very similar to Jack's answer:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        m.extend(identifier_pattern.findall(line))

You can use another part of the regexp API to get the same result:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

m = []
with open('huge_file', 'r') as f:
    for line in f:
        pattern_found = identifier_pattern.search(line)
        if pattern_found:
            # group(1) is the captured value; group(0) would be the whole match
            value_found = pattern_found.group(1)
            m.append(value_found)
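As a quick check of how the two parts of the `re` API relate, the sketch below runs both on a single sample line (the sample line is made up for illustration). `findall` returns the contents of the capture group, whereas a match object exposes both the whole match and the group.

```python
import re

identifier_pattern = re.compile(r'Identifier: (.*)$')
line = 'Identifier: value\n'  # made-up sample line

match = identifier_pattern.search(line)
whole = match.group(0)     # entire matched text: 'Identifier: value'
captured = match.group(1)  # first capture group only: 'value'
found = identifier_pattern.findall(line)  # capture-group values: ['value']

print(whole, captured, found)
```

So `group(1)` is the call that gives the same values as `findall` for this pattern.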

Which we can simplify with a generator expression and a list comprehension:

#!/usr/bin/env python3

import re

identifier_pattern = re.compile(r'Identifier: (.*)$')

with open('huge_file', 'r') as f:
    patterns_found = (identifier_pattern.search(line) for line in f)
    # Keep only successful matches and take the captured value
    m = [pattern_found.group(1)
         for pattern_found in patterns_found if pattern_found]
