Finding and replacing a string in a very large file

My preference is to get everything done with shell commands. I have a very, very large file, about 2.8 GB, and the content is JSON. Everything is on one line, and I was told there are at least 1.5 million records in it.

I have to prepare the file for consumption. Each entry should be on a separate line. Example:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }} 

Or, using the following more realistic sample:

 {"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":" christian.bale@hollywood.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}} 

The end result should be:

 {"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]}, {"RecordId":"2",...}, {"RecordId":"3",...}, {"RecordId":"4",...}, {"RecordId":"5",...} }} 

Commands attempted:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

These commands work fine for small files, but they do not work on the 2.8 GB file that I have to manipulate. sed quits midway after about 10 minutes for no apparent reason, having written nothing. awk aborted with a segmentation fault (core dumped) after many hours. I tried a search-and-replace in perl and got an "Out of memory" error.

Any help / ideas would be great!

Additional information about my machine:

  • More than 105 GB of disk space is available.
  • 8 GB of memory
  • 4 CPU cores
  • Running Ubuntu 14.04
5 answers

Regarding perl: try setting the input record separator $/ to "}," like this:

 #!/usr/bin/perl
 $/ = "},";          # treat "}," as the end of each input record
 while (<>) {
     print "$_\n";   # print the record followed by a newline
 }

or, as a single line:

 $ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
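Note that with the bigger Accounts sample in the question, the record boundary appears to be "}]}," rather than "}," (the last answer on this page makes the same assumption), so the same one-liner would become:

 $ perl -e '$/="}]},";while(<>){print "$_\n"}' sample.dat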

Since you tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's somewhat off-topic, I believe jq is something you could use for this. It will do a better job than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming the content looks like this (based on your sample):

 {"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }} 

You can easily reformat this to prettify it:

 $ jq '.' < data.json
 {
   "RomanCharacters": {
     "Alphabet": [
       {
         "RecordId": "1",
         "data": "data"
       },
       {
         "RecordId": "2",
         "data": "data"
       },
       {
         "RecordId": "3",
         "data": "data"
       },
       {
         "RecordId": "4",
         "data": "data"
       },
       {
         "RecordId": "5",
         "data": "data"
       }
     ]
   }
 }

And we can dig into the data to retrieve only the records you are interested in (regardless of what they are wrapped in):

 $ jq '.[][][]' < data.json
 {
   "RecordId": "1",
   "data": "data"
 }
 {
   "RecordId": "2",
   "data": "data"
 }
 {
   "RecordId": "3",
   "data": "data"
 }
 {
   "RecordId": "4",
   "data": "data"
 }
 {
   "RecordId": "5",
   "data": "data"
 }
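If you know the key names, an equivalent but more explicit filter (my assumption, based on the sample's wrapping) spells out the path instead of using .[][][]:

 $ jq '.RomanCharacters.Alphabet[]' < data.json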

This is much more readable, both by humans and by tools like awk that process content line by line. If you want to join your lines for processing, as per your question, the awk becomes much simpler:

 $ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
 { "RecordId": "1", "data": "data" }
 { "RecordId": "2", "data": "data" }
 { "RecordId": "3", "data": "data" }
 { "RecordId": "4", "data": "data" }
 { "RecordId": "5", "data": "data" }

Or, as @peak suggested in the comments, eliminate the awk portion entirely by using jq's -c (compact output) option:

 $ jq -c '.[][][]' < data.json
 {"RecordId":"1","data":"data"}
 {"RecordId":"2","data":"data"}
 {"RecordId":"3","data":"data"}
 {"RecordId":"4","data":"data"}
 {"RecordId":"5","data":"data"}
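One caveat (an assumption on my part, not something this answer claims): plain jq parses the whole document into memory, so a 2.8 GB input will need several gigabytes of RAM; if that is too tight on your machine, look into jq's --stream mode (available from jq 1.5). Applied to the actual Accounts sample from the question, a compact-output sketch might look like:

 $ jq -c '.Accounts.Customer[]' < sample.dat > sample-lines.dat

where sample-lines.dat is an illustrative output name.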

Try using } as the record separator, e.g. in Perl:

 perl -l -0175 -ne 'print $_, $/' < input 

You may need to glue back together lines that contain only }.
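For anyone decoding the flags (my gloss, not the answerer's): 0175 is the octal character code for }, so -0175 sets the input record separator $/ to }; -l then chomps that separator from each record and prints a newline after each print. A commented version of the same one-liner:

 # -0175 : set $/ (input record separator) to octal 175, i.e. "}"
 # -l    : chomp $/ from each input record; print "\n" after each print
 # -n    : wrap the body in a while (<>) { ... } read loop
 # print $_, $/ re-appends the "}" that was chomped off
 $ perl -l -0175 -ne 'print $_, $/' < input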


This avoids the memory problem by not treating the data as a single record, but it may go too far in the other direction performance-wise (processing one character at a time). Note also that the RT built-in variable (the text matched by the current record separator) requires gawk:

 $ cat j.awk
 BEGIN { RS="[[:print:]]" }
 RT == "{" { bal++ }
 RT == "}" { bal-- }
 { printf "%s", RT }
 RT == "," && bal == 2 { print "" }
 END { print "" }

 $ gawk -f j.awk j.txt
 {"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
 {"RecordId":"2",...},
 {"RecordId":"3",...},
 {"RecordId":"4",...},
 {"RecordId":"5",...} }}

Using the sample data given here (the one that starts with {"Accounts":{"Customer"...), the solution I went with reads the file in and, while reading, counts occurrences of the delimiter assigned to $/. For every 10,000 delimiters counted it writes the accumulated records out to a new file, and it appends a newline after each delimiter found. The script looks like this:

 #!/usr/bin/perl
 $/ = "}]},";               # delimiter to find and insert a newline after
 $n = 0;
 $match = "";
 $filecount = 0;
 $recsPerFile = 10000;      # set the number of records per output file

 print "Processing " . ($ARGV[0] // "STDIN") . "\n";
 while (<>) {
     $match = $match . $_ . "\n";   # append this record plus a newline
     $n++;
     print ".";             # so that we know it has done something
     if ($n >= $recsPerFile) {
         my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
         open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
         print OUTPUT $match;
         close(OUTPUT);
         $match = "";
         $filecount++;
         $n = 0;
         print "Wrote file " . $newfile . "\n";
     }
 }
 if ($match ne "") {        # flush any remaining records to a final part file
     my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
     open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
     print OUTPUT $match;
     close(OUTPUT);
     print "Wrote file " . $newfile . "\n";
 }
 print "Finished\n\n";

I used this script against the large 2.8 GB file, whose content is unformatted, single-line JSON. The resulting output files will lack the proper JSON headers and footers, but this is easily fixed.
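As a sketch of that fix (my assumption about what "easily fixed" means here, with the header and footer taken from the Accounts sample): wrap each part file in the outer JSON that the split stripped away:

 for f in partfile10000-*.dat; do
     # prepend the opening wrapper, append the closing one
     { printf '%s' '{"Accounts":{"Customer":['; cat "$f"; printf '%s\n' ']}}'; } > "$f.json"
 done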

Thanks a lot, guys, for your contributions!

