Going back with bash or Python?

I have a text file with many occurrences of the @STRING_A marker, and I would like to write a short script that deletes only some of them. Specifically, one that scans the file and, as soon as it finds a line starting with this marker, for example

@STRING_A 

then checks whether, 3 lines back, there is another occurrence of a line starting with the same marker, for example

 @STRING_A @STRING_A 

and if so, deletes the entry 3 lines back. I was thinking about bash, but I don't know how to "go back" with it, so I am not sure this is possible in bash. I also thought about Python, but then I would have to keep all the information in memory in order to go back, and for long files that would not be feasible.

What do you think? Is it possible to do this in bash or python?

thanks

+6
python bash
11 answers

It's funny that after all these hours no one has yet given a solution to the problem as actually stated (as @John Machin points out in a comment): delete only the leading marker (if it is followed by another such marker 3 lines down), not the entire line containing it. It's not complicated, of course; here is the tiny mod needed to @truppo's fun solution, for example:

 from itertools import izip, chain

 f = "foo.txt"
 for third, line in izip(chain("   ", open(f)), open(f)):   # three dummy items give the 3-line offset
     if third.startswith("@STRING_A") and line.startswith("@STRING_A"):
         line = line[len("@STRING_A"):]
     print line,

Of course, in real life one would use itertools.tee instead of reading the file twice, put this code in a function rather than repeating the marker constant everywhere, &c ;-).
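
In case it helps, here is a rough sketch (mine, untested) of what that cleaned-up version might look like; the function name and the explicit marker/gap parameters are my own additions, not from the answer:

 from itertools import izip, chain, tee

 def strip_repeated_marker(path, marker="@STRING_A", gap=3):
     # two iterators over the same file; "behind" lags "current" by *gap* lines
     behind, current = tee(open(path))
     for old, line in izip(chain([""] * gap, behind), current):
         if old.startswith(marker) and line.startswith(marker):
             line = line[len(marker):]   # strip only the leading marker
         print line,

 strip_repeated_marker("foo.txt")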

+4

Of course, Python will work too. Just keep the last three lines in an array and check whether the first element of the array matches the value you are currently reading; if it does, delete it, otherwise print it. Then shift your elements along to make room for the new value and repeat: once the array is full, you keep shifting values out of it and reading new ones in, stopping each time to check whether the first value in the array matches the value currently being read.
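
For concreteness, here is a rough sketch (mine, untested) of that buffering scheme; the marker and file name are just taken from the question, and this variant drops the whole buffered line rather than only its marker:

 buf = []                                   # the last 3 lines read
 for line in open("foo.txt"):
     if len(buf) == 3:
         old = buf.pop(0)                   # the line read 3 lines ago
         # drop it only if both it and the current line start with the marker
         if not (old.startswith("@STRING_A") and line.startswith("@STRING_A")):
             print old,
     buf.append(line)
 for old in buf:                            # flush whatever is left at end of file
     print old,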

+2

Here is a more interesting solution, using two iterators with a three-element offset :)

 from itertools import izip, chain, tee

 f1, f2 = tee(open("foo.txt"))
 for third, line in izip(chain("   ", f1), f2):   # three dummy items give the 3-line offset
     if not (third.startswith("@STRING_A") and line.startswith("@STRING_A")):
         print line,
+2

Why would this not be possible in bash? You do not need to keep the entire file in memory, only the last three lines (if I understood correctly), and print whatever is appropriate to standard output. Redirect that into a temporary file, check that everything works as expected, and then overwrite the original file with the temporary one.

The same goes for Python.

I would offer a script of my own, but it would be untested ;-).
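
A minimal Python sketch (mine, untested) of that flow, keeping only the last three lines in memory and overwriting the original only once the temporary file is complete; the helper name and the .tmp suffix are assumptions:

 import os

 def rewrite(path, marker="@STRING_A", gap=3):
     src = open(path)
     dst = open(path + ".tmp", "w")
     buf = []                                # only the last *gap* lines in memory
     for line in src:
         buf.append(line)
         if len(buf) > gap:
             old = buf.pop(0)
             # keep the old line unless it and the current line both carry the marker
             if not (old.startswith(marker) and line.startswith(marker)):
                 dst.write(old)
     for old in buf:                         # flush the tail of the buffer
         dst.write(old)
     src.close()
     dst.close()
     os.rename(path + ".tmp", path)          # replace the original only when done

 rewrite("foo.txt")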

+1

This code scans the file and removes a line starting with the marker when another line starting with the marker follows it *gap* lines later. By default, it keeps only 3 lines in memory:

 from collections import deque

 def delete(fp, marker, gap=3):
     """Delete lines from *fp* if they start with *marker* and are
     followed by another line starting with *marker* *gap* lines later.
     """
     buf = deque()
     for line in fp:
         if len(buf) < gap:
             buf.append(line)
         else:
             old = buf.popleft()
             if not (line.startswith(marker) and old.startswith(marker)):
                 yield old
             buf.append(line)
     for line in buf:
         yield line

I tested it with

 >>> from StringIO import StringIO
 >>> fp = StringIO('''a
 ... b
 ... xxx 1
 ... c
 ... xxx 2
 ... d
 ... e
 ... xxx 3
 ... f
 ... g
 ... h
 ... xxx 4
 ... i''')
 >>> print ''.join(delete(fp, 'xxx'))
 a
 b
 xxx 1
 c
 d
 e
 xxx 3
 f
 g
 h
 xxx 4
 i
+1

As AlbertoPL said, save the lines in a FIFO for later use instead of "going back". For this I would definitely use Python over bash + sed/awk/whatever.

I took a few moments to code up this snippet:

 from collections import deque

 line_fifo = deque()
 for line in open("test"):
     line_fifo.append(line)
     if len(line_fifo) == 4:
         # "look 3 lines backward"
         if line_fifo[0] == line_fifo[-1] == "@STRING_A\n":
             # get rid of that match
             line_fifo.popleft()
         else:
             # print out the top of the fifo
             print line_fifo.popleft(),

 # don't forget to print out the fifo when the file ends
 for line in line_fifo:
     print line,
+1

My awk-fu has never been that good... but the following may give you what you are looking for in bash-shell/shell-utility form:

 sed `awk 'BEGIN{ ORS=";" }
           /@STRING_A/ { if (LAST != "" && LAST+3 >= NR) print LAST "d"
                         LAST = NR }' test_file` test_file

Basically, awk produces a command script for sed to cut certain lines. I'm sure there is a relatively easy way to make awk do all the processing, but this works.

The bad part? It reads test_file twice.

The good part? It is a bash-shell/shell-utility implementation.

Edit: Alex Martelli points out that the sample file above may have misled me (my code above removes the entire line, not just the @STRING_A marker).

This can be easily fixed by editing the sed command:

 sed `awk 'BEGIN{ ORS=";" }
           /@STRING_A/ { if (LAST != "" && LAST+3 >= NR) print LAST "s/@STRING_A//"
                         LAST = NR }' test_file` test_file
0

This "answer" is really a comment on the lines[i-3] answer below ... I will correct my previous comment: if the needle is in the first three lines of the file, your script will either raise an IndexError or (because negative indexes wrap around) access lines at the end of the file, sometimes with interesting side effects.

An example of your script raising an IndexError:

 >>> lines = "@string line 0\nblah blah\n".splitlines(True)
 >>> needle = "@string "
 >>> for i,line in enumerate(lines):
 ...     if line.startswith(needle) and lines[i-3].startswith(needle):
 ...         lines[i-3] = lines[i-3].replace(needle, "")
 ...
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
 IndexError: list index out of range

and this example shows not only the wrap-around problem but also why your "fix" for the "don't delete the entire line" problem should use .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")

 >>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
 >>> needle = "NEEDLE"
 >>> # Expected result: no change to the file
 ... for i,line in enumerate(lines):
 ...     if line.startswith(needle) and lines[i-3].startswith(needle):
 ...         lines[i-3] = lines[i-3].replace(needle, "")
 ...
 >>> print ''.join(lines)
  x  y               <<<=== whoops!
 noddle
 nuddle
                     <<<=== still got unwanted newline in here
 >>>
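
For comparison, a quick interactive check (mine, not from the answer above) of the two fixes just suggested, which remove only the leading marker; note they do not cure the lines[i-3] wrap-around for the first three lines, which still needs a guard such as i >= 3:

 >>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
 >>> needle = "NEEDLE"
 >>> lines[0].replace(needle, "", 1)   # only the first occurrence is removed
 ' x NEEDLE y\n'
 >>> lines[0][len(needle):]            # or just slice off the leading marker
 ' x NEEDLE y\n'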
0

In bash, you can use sort -r filename and tail -n filename to read the file backwards.

 $LINES=`tail -n filename | sort -r`
 # now iterate through the lines and do your checking
-1

Could this be what you are looking for?

 lines = open('sample.txt').readlines()
 needle = "@string "
 for i,line in enumerate(lines):
     if line.startswith(needle) and lines[i-3].startswith(needle):
         lines[i-3] = lines[i-3].replace(needle, "")
 print ''.join(lines)

this outputs:

 string 0 extra text
 string 1 extra text
 string 2 extra text
 string 3 extra text
 --replaced -- 4 extra text
 string 5 extra text
 string 6 extra text
 @string 7 extra text
 string 8 extra text
 string 9 extra text
 string 10 extra text
-1

I would consider using sed. GNU sed supports line-range addressing. If sed falls short, there is that other beast, awk, and I'm sure you can do it with awk.

OK, I feel I have to post my awk proof of concept. I could not figure out how to use sed addresses for this. I have not tried combining awk + sed, but that seems like overkill to me.

my awk script works as follows:

  • It reads the lines and saves them in a 3-line buffer

 • after the desired pattern is found (/^data.*/ in my case), the 3-line buffer is checked to see whether the pattern also occurred three lines back

 • if the pattern was spotted there, the matching line three lines back is dropped from the buffer

Frankly, I would probably go with Python too, given how clumsy awk is for this. The awk code follows:

 function max(a, b)
 {
     if (a > b)
         return a;
     else
         return b;
 }

 BEGIN {
     w = 0;  # write index
     r = 0;  # read index
     buf[0, 1, 2];  # buffer
 }

 END {
     # flush buffer
     # start at read index and print out up to w index
     for (k = r % 3; k < r - max(r - 3, 0); k--) {
         # search in 3 line history buf
         if (match(buf[k % 3], /^data.*/) != 0) {
             # found -> remove lines from history
             # by rewriting them -> adjust write index
             w -= max(r, 3);
         }
     }
     buf[w % 3] = $0;
     w++;
 }

 /^.*/ {
     # store line into buffer, if the history
     # is full, print out the oldest one.
     if (w > 2) {
         print buf[r % 3];
         r++;
         buf[w % 3] = $0;
     }
     else {
         buf[w] = $0;
     }
     w++;
 }
-2
