Extract multiple line data between two characters - Regex and Python3

I have a huge file from which I need data for certain records. File structure:

>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------

I need to extract all the rows (the full record) for certain entries. For example, if I need the record for Entry1.1, I can use the entry name '>Entry1.1' and the next '>' as markers in a REGEX to extract the lines between them. But I do not know how to write such a complex REGEX expression. Once I have one, I will put it in a FOR loop:

For entry in entrylist:
    GET record from big_file
    DO some processing
    WRITE in result file

What REGEX would do this record retrieval for specific entries? Is there another, more Pythonic way to achieve this? I would appreciate your help with this.

AK

+4
3 answers

With regex

import re

ss = '''
>Entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>Entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>Entry3.2
#size=1777
AND SO ON-----------
'''

patbase = r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

while True:
    x = input('What entry do you want ? : ')
    found = re.findall(patbase % x, ss, re.DOTALL)
    if found:
        print('found ==', found)
        for each_entry in found:
            print('\n%s\n' % each_entry)
    else:
        print('\n ** There is no such an entry **\n')

Explanation of '(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))':

1)

%s is replaced by the entry reference you type in: 1.1, 2, 2.1, etc.

2)

The (?![^\n]+?\d) part completes the check.

(?![^\n]+?\d) is a negative lookahead assertion saying that right after %s there must not be [^\n]+?\d, that is, any characters [^\n]+? followed by a digit \d.

I write [^\n] to mean "any character except the newline \n".
I have to use it instead of just .+? because the re.DOTALL flag is set, so a .+? here would run on to the end of the record.
However, I only want to check that on the same line, after the reference you typed in (represented by %s in the pattern), no additional digits follow.

The point is that if the file contains Entry2.1 but not Entry2, and the user enters just 2 because he wants Entry2 and nothing else, without this lookahead the regex would match Entry2.1 and return it, even though the user really wants Entry2 (see the sketch after this explanation).

3)

In (>Entry *%s(?![^\n]+?\d).+?), the final .+? captures the rest of the Entry block, because with the re.DOTALL flag the dot matches any character including the newline \n.
That is exactly why I set the flag: so that .+? can run across newlines up to the end of a record.

4)

I want the match to stop at the end of the requested record, not inside the next one, so that the group defined by the parentheses in (>Entry *%s(?![^\n]+?\d).+?) captures exactly what we want. Therefore I end the pattern with the positive lookahead (?=>|(?:\s*\Z)), which says that the non-greedy .+? must stop right before either a > (the start of the next entry) or the end of the string \Z.
Since the end of the last record may not be exactly the end of the whole string, I add \s*, meaning "possibly some whitespace up to the very end".
Thus \s*\Z means there may be whitespace before hitting the end of the string. The whitespace characters are the blank (space), \f, \n, \r, \t and \v.
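
To see point 2) in action, here is a minimal sketch (it uses a shortened version of ss with the same two-line-header layout, not the full data): asking for 2 returns nothing because only Entry2.1 exists, while asking for 2.1 returns the whole record.

import re

# Shortened sample with the same layout as ss above (illustration only).
ss = ('>Entry1.1\n#size=1688\n704 1 1 1 4\n'
      '>Entry2.1\n#size=6251\n6110 3 1.5 0 2\n')
patbase = r'(>Entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

print(re.findall(patbase % '2', ss, re.DOTALL))
# [] -- there is no Entry2, and the lookahead rejects Entry2.1
print(re.findall(patbase % '2.1', ss, re.DOTALL))
# ['>Entry2.1\n#size=6251\n6110 3 1.5 0 2'] -- the full Entry2.1 record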

+4

I am not very good at regular expressions, so I tend to look for regex-free solutions whenever I can. In Python, the natural place to put the iteration logic is in a generator, so I would use something like this (a version that does not require itertools):

def group_by_marker(seq, marker):
    group = []
    # advance past negatives at start
    for line in seq:
        if marker(line):
            group = [line]
            break
    for line in seq:
        # found a new group start; yield what we've got
        # and start over
        if marker(line) and group:
            yield group
            group = []
        group.append(line)
    # might have extra bits left..
    if group:
        yield group

Applied to your example file, we get:

>>> with open("entry0.dat") as fp:
...     marker = lambda line: line.startswith(">Entry")
...     for group in group_by_marker(fp, marker):
...         print(repr(group[0]), len(group))
...
'>Entry1.1\n' 10
'>Entry2.1\n' 9
'>Entry3.2\n' 4

One advantage of this approach is that we never need to hold more than one group in memory, which makes it convenient for really large files. It is probably not as fast as a regular expression, but if the file is around 1 GB you are most likely I/O-bound anyway.
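
To tie this back to the loop sketched in the question, here is a rough sketch of how the generator could pull out only the wanted records and write them to a result file. The file names big_file.txt and result.txt and the entry list are made-up examples, and it assumes the entry name is the first whitespace-delimited token of the header line:

# Hypothetical file names and entry list, purely for illustration.
wanted = {'Entry1.1', 'Entry3.2'}
marker = lambda line: line.startswith('>Entry')

with open('big_file.txt') as src, open('result.txt', 'w') as out:
    for group in group_by_marker(src, marker):
        name = group[0][1:].split()[0]   # '>Entry1.1\n' -> 'Entry1.1'
        if name in wanted:
            # "DO some processing" from the question would go here
            out.writelines(group)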

+1

Not quite sure what you are asking. Does this get you closer? It puts each entry name in as a dictionary key, with a list of that entry's lines as the value, assuming the file is formatted the way I think it is. Does it have duplicate entries? Here is what I have:

entries = {}
key = ''
for entry in open('entries.txt'):
    if entry.startswith('>Entry'):
        key = entry[1:].strip()  # removes > and newline
        entries[key] = []
    else:
        entries[key].append(entry)
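
Once the dictionary is built, pulling out specific records is just a lookup. A small sketch (result.txt and the entry names are made-up examples; with the file layout shown in the question the keys come out as e.g. 'Entry1.1'):

# Hypothetical selection written back out; entries comes from the loop above.
wanted = ['Entry1.1', 'Entry3.2']

with open('result.txt', 'w') as out:
    for name in wanted:
        if name in entries:
            out.write('>' + name + '\n')     # rebuild the header line
            out.writelines(entries[name])    # data lines still have their newlines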
0
