Sed: hold and change string

I'm not sure if I can do this with sed:

I'm trying to rearrange lines like this

 GF:001,GF:00012,GF:01223<TAB>XXR
 GF:001,GF:00012,GF:01223,GF:0666<TAB>XXXR3

to

 GF:001<TAB>XXR
 GF:00012<TAB>XXR
 GF:01223<TAB>XXR
 GF:001<TAB>XXXR3
 GF:00012<TAB>XXXR3
 GF:01223<TAB>XXXR3
 GF:0666<TAB>XXXR3

Any hints? Both the number of GF:XXXX entries and the length of each one vary.

I am stuck with sed -n '/\(XX.*\)$/ {s/,/\t\1\n/}' input, but I cannot reference the originally matched pattern. Any ideas? Cheers!

Update: I think this cannot be done simply using sed. So I used perl for this:

 perl -e 'open(IN, "< file"); while (<IN>) { @a = split(/\t/); @gos = split(/,/, $a[0]); foreach (@gos) { print $_."\t".$a[1]; } } close(IN);' > output

But if anyone knows a way to solve this with sed alone, please post it here.

5 answers

This can be done in sed, although I would probably use Perl (or Awk or Python) for it.

I make no claims to elegance for this solution, but brute force and ignorance sometimes pay off. I created a file called, unoriginally, sed.script containing:

 /\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/{
 :redo
 s/\(GF:[0-9]*\),\(.*\)<TAB>\(.*\)/\1<TAB>\3@@@@@\2<TAB>\3/
 h
 s/@@@@@.*//
 p
 x
 s/.*@@@@@//
 t redo
 d
 }

I ran it like this:

 sed -f sed.script input 

where input contains the two lines specified in the question. It produced the output:

 GF:001<TAB>XXR
 GF:00012<TAB>XXR
 GF:01223<TAB>XXR
 GF:001<TAB>XXXR3
 GF:00012<TAB>XXXR3
 GF:01223<TAB>XXXR3
 GF:0666<TAB>XXXR3

(I took the liberty of deliberately misinterpreting <TAB> as a 5-character sequence instead of a single tab character; you can easily adapt the answer to handle an actual tab character.)

sed script explanation:

  • Find lines with more than one occurrence of GF:nnn, separated by commas (we do not need to process lines containing a single occurrence). The rest of the script applies only to such lines. Everything else is passed through (printed) unchanged.
  • Create a label so that we can branch back to it.
  • Split the line into 3 remembered parts. The first part is the initial GF information; the second part is all the remaining GF information; the third part is the field after <TAB>. Replace this with the first field, <TAB>, the third field, an implausible marker pattern (@@@@@), the second field, <TAB>, the third field.
  • Copy the modified line to the hold space.
  • Remove everything from the marker pattern to the end.
  • Print.
  • Swap the hold space into the pattern space.
  • Delete everything up to and including the marker pattern.
  • If we did any work, jump back to the redo label.
  • Delete what remains (it has already been printed).
  • End of script block.

This is a simple loop that reduces the number of GF patterns by one in each iteration.
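As a follow-up to the note about real tab characters, here is a minimal sketch of the same script adapted to a literal tab. The `tab` variable and the `split_gf` function name are illustrative choices of mine, not part of the original script, and it assumes a POSIX shell and a sed that accepts a multi-line script argument.

```shell
# Hold a literal tab in a variable so the script does not rely on sed
# understanding \t (assumption: POSIX shell with printf).
tab=$(printf '\t')

# split_gf is a hypothetical wrapper; it reads records on stdin.
split_gf() {
  sed "/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/{
:redo
s/\(GF:[0-9]*\),\(.*\)${tab}\(.*\)/\1${tab}\3@@@@@\2${tab}\3/
h
s/@@@@@.*//
p
x
s/.*@@@@@//
t redo
d
}"
}

# One record with three GF entries and a real tab before the label.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | split_gf
```

Each GF entry should come out on its own line, followed by a tab and XXR; a line with a single GF entry does not match the address and is printed unchanged.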


You can do this with awk:

 $ awk '{gsub(/,/, "\t" $NF "\n");print}' input 

Here we simply replace each comma with a tab, followed by the last field (NF stores the number of fields in the record; $NF gets the NF-th field), followed by a newline. Then we print the result.

It can be solved with sed too, in a similar way, and IMHO a little better than Jonathan's solution (which is rather complicated, I should note).
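A small worked example may help here. This is only a sketch: the `expand_commas` function name is hypothetical, and it assumes the last field contains no whitespace and no `&` or backslash (both are special in a gsub replacement string).

```shell
# expand_commas is a hypothetical wrapper around the awk one-liner.
# With the default field separator (whitespace), $NF is the label
# after the tab, e.g. XXR.
expand_commas() {
  awk '{ gsub(/,/, "\t" $NF "\n"); print }'
}

# Sample record with a real tab between the GF list and the label.
printf 'GF:001,GF:00012,GF:01223\tXXR\n' | expand_commas
```

The last GF entry needs no substitution because it is already followed by the tab and the label in the input.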

 sed -n '
 :BEGIN
 h
 s/,.*<TAB>/<TAB>/
 p
 x
 s/^[^,]*,//
 t BEGIN' input

Here we define the label at the beginning of the script:

 :BEGIN 

Then we copy the contents of the pattern space to the hold space:

 h 

Now we replace everything from the first comma to the tab with only the tab:

  s/,.*<TAB>/<TAB>/ 

Print the result ...

 p 

... and get the contents of the hold space:

 x 

Since we printed the first line (which contains the first GF:XXX pattern, followed by the final XXR pattern), we now remove the first GF:XXX pattern from the line:

  s/^[^,]*,// 

If the substitution succeeded, we branch back to the beginning of the script:

 t BEGIN 

And everything is applied again to the same line, except that the line no longer has its first GF:XXX pattern. OTOH, if no substitution was made, processing of the current line is finished and we do not branch back to the beginning.
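Put together with a real tab character, the loop could look like the sketch below. The `tab` variable and the `split_loop` function name are illustrative and not from the answer; it assumes a POSIX shell and a sed that accepts a multi-line script argument.

```shell
# Literal tab, so the script does not depend on sed understanding \t.
tab=$(printf '\t')

# split_loop is a hypothetical wrapper; it reads records on stdin.
# For the record A,B<tab>X the pattern space evolves as:
#   h, s      -> A<tab>X     (printed by p)
#   x, s      -> B<tab>X     (substitution made, so t branches back)
#   h, s, p   -> B<tab>X     (no comma left; printed as is)
#   x, s      -> no substitution, so the loop ends
split_loop() {
  sed -n ":BEGIN
h
s/,.*${tab}/${tab}/
p
x
s/^[^,]*,//
t BEGIN"
}

printf 'GF:001,GF:00012\tXXR\n' | split_loop
```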


Unless you strictly need sed, awk does it well:

 awk -F'\t|,' '{ i=1; do { printf("%s\t%s\n",$i,$NF); i++;} while ( i<NF ); }' inputfile 

Well, it took me 3 hours to do this:

 sed -re ':a; s/(GF:[0-9]*[^,]*),([^<]*)(<TAB>[A-Z0-9]*)/\1\3\n\2\3/g;ta; ' file.txt

 awk -F'[,\t]' '{for (i=1;i<NF;i++) print $i"\t"$NF}' file 

Awk reads one line at a time (by default) and splits it into fields. I use -F to tell awk to split the line into fields at each comma or tab. NF is the number of fields in the line, and $i is the contents of field number i.
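To make the field splitting concrete, here is a small sketch that first lists the fields awk sees and then pairs each GF field with the last one. The function names `show_fields` and `pair_fields` are hypothetical, and the sample record assumes a real tab character before the label.

```shell
# show_fields (hypothetical name) prints every field awk produces when
# splitting on commas and tabs.
show_fields() {
  awk -F'[,\t]' '{ for (i = 1; i <= NF; i++) printf "field %d = %s\n", i, $i }'
}

# pair_fields (hypothetical name) is the loop from the answer: every field
# except the last is printed together with the last field ($NF).
pair_fields() {
  awk -F'[,\t]' '{ for (i = 1; i < NF; i++) print $i "\t" $NF }'
}

printf 'GF:001,GF:00012\tXXR\n' | show_fields
printf 'GF:001,GF:00012\tXXR\n' | pair_fields
```

The loop stops at i < NF because the last field is the label itself, not a GF entry.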
