What is the fastest way to split a gigabytes-long line into keywords using a bash shell?

For example, given the line a11b12c22d322 e..., where the fields are delimited by numbers or spaces, we want to convert it to

 a
 b
 c
 d
 e
 ...

sed needs to read the entire line into memory; for a line of several gigabytes this is inefficient, and the job cannot be done at all if there is not enough memory.

EDIT:

Can someone explain how grep, tr, awk, perl, and python manage memory while reading a large file? How much content do they read into memory at a time?

+4
4 answers

If you use gawk (which, I believe, is the default awk on Linux), you can set the RS variable so that runs of digits or spaces are treated as record separators instead of the newline.

 awk '{print}' RS='[[:digit:] ]+' file.txt

As for your second question: all of these programs read a fixed number of bytes at a time and search their internal buffer for what they consider a line separator, which gives the appearance of reading one line at a time. To keep them from reading too much data while searching for the end of a line, you need to change what the program treats as the line terminator.

Most languages allow this, but only a single character can be specified. gawk makes it easy by letting you specify a regular expression that recognizes the end of a record. This saves you from having to implement the fixed-size reads and the end-of-line search yourself.
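As a rough illustration of the point about fixed-size reads, here is a minimal sketch using tr, which never searches for a line terminator at all: it transforms bytes as they arrive, so a single enormous line does not have to fit in memory. The helper names `make_long_line` and `split_stream` are hypothetical, made up for this example.

```shell
# Hypothetical helper: print N keywords, each followed by a number and a
# space, all on ONE line (no newline is ever written).
make_long_line() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        printf 'key%d ' "$i"
        i=$((i + 1))
    done
}

# Hypothetical helper: turn every run of digits/whitespace into one newline.
# tr works on the byte stream block by block, not line by line.
split_stream() {
    tr -s '[:digit:][:space:]' '\n'
}

# Small demonstration; increase the count to stress-test memory use.
make_long_line 5 | split_stream
```

Raising the count to several million and watching the process in top is one way to check that memory use stays flat, although the exact buffer size depends on the tr implementation.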

+6

The fastest... You can do this with the help of gcc. Here is a version that reads data from the given file name if one is specified, otherwise from stdin. If that is still too slow, you can see whether replacing getchar() and putchar() (which may be macros and should be well optimized) with your own buffering code makes it faster. If you want to get ridiculous about speed, you could use three threads, so that the kernel can copy the next block of data on one core while a second core does the processing and a third core copies the processed output back to the kernel.

 #!/bin/bash
 set -e
 BINNAME=$(mktemp)
 trap 'rm -f "$BINNAME"' EXIT   # remove the temporary binary on exit
 gcc -xc -O3 -o "$BINNAME" - <<"EOF"
 #include <ctype.h>   /* isdigit, isspace */
 #include <stdio.h>
 #include <stdlib.h>

 int main(void) {
     int sep = 0;   /* nonzero while inside a run of separators */

     /* speed is a requirement, so let's reduce I/O overhead */
     const int bufsize = 1024*1024;
     setvbuf(stdin, malloc(bufsize), _IOFBF, bufsize);
     setvbuf(stdout, malloc(bufsize), _IOFBF, bufsize);
     /* above buffers intentionally not freed, it doesn't really matter here */

     int ch;
     while ((ch = getc(stdin)) != EOF) {
         if (isdigit(ch) || isspace(ch)) {
             /* emit a single newline per run of separators */
             if (!sep) {
                 if (putc('\n', stdout) == EOF) break;
                 sep = 1;
             }
         } else {
             sep = 0;
             if (putc(ch, stdout) == EOF) break;
         }
     }

     /* flush should happen by on-exit handler, as buffer is not freed,
        but this will detect write errors, for program exit code */
     fflush(stdout);
     return ferror(stdin) || ferror(stdout);
 }
 EOF
 if [ -z "$1" ] ; then
     "$BINNAME" <&0
 else
     "$BINNAME" <"$1"
 fi

Edit: I also took a look at the GNU/Linux stdio.h. Some notes: putchar/getchar are not macros, but putc/getc are, so using those instead may be a small optimization, probably avoiding one function call; I changed the code to reflect that. I also added a check of the putc return value while at it.

+4

With grep :

 $ grep -o '[^0-9 ]' <<< "a11b12c22d322 e"
 a
 b
 c
 d
 e

With sed :

 $ sed 's/[0-9 ]\+/\n/g' <<< "a11b12c22d322 e"
 a
 b
 c
 d
 e

With awk :

 $ awk 'gsub(/[0-9 ]+/,"\n")' <<< "a11b12c22d322 e"
 a
 b
 c
 d
 e

I will leave the benchmarking to you.
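Before timing anything, it may help to confirm that the candidates agree on a small sample. The following is only a sketch: the wrapper names (`via_grep`, `via_sed`, `via_awk`, `via_tr`) are hypothetical, GNU sed is assumed for `\+` and `\n`, and the grep pattern is extended to `[^0-9 ]+` so that multi-character keywords stay on one line.

```shell
# Each hypothetical wrapper reads stdin and writes one keyword per line.
via_grep() { grep -oE '[^0-9 ]+'; }
via_sed()  { sed 's/[0-9 ]\+/\n/g'; }            # GNU sed syntax assumed
via_awk()  { awk '{ gsub(/[0-9 ]+/, "\n"); print }'; }
via_tr()   { tr -s '[:digit:][:space:]' '\n'; }

sample='a11b12c22d322 e'
for f in via_grep via_sed via_awk via_tr; do
    # Join each tool's output onto one line for easy side-by-side comparison.
    printf '%s: %s\n' "$f" "$(printf '%s\n' "$sample" | "$f" | paste -sd ' ' -)"
done
```

Once the outputs match, each wrapper can be timed on real data from bash with something like `time via_tr < big_input > /dev/null`, where `big_input` is a placeholder for your own file. Note that the grep, sed, and awk variants still work line by line, so on a single gigabytes-long line they may hit the memory problem described in the question.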

+3

Try with tr :

 tr -s '[:digit:][:space:]' '\n' <<< "a11b12c22d322e" 

This gives:

 a
 b
 c
 d
 e
+2
