Finding and replacing a string in a very large file

My preference is to get everything done with shell commands. I have a very, very large file, about 2.8 GB, and the content is JSON. Everything is on one line, and I was told there are at least 1.5 million records in it.

I have to prepare the file for consumption. Each entry should be on a separate line. Example:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }} 

Or, using the following more realistic sample:

 {"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":" christian.bale@hollywood.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":" acne.pimple@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":" swati.singh@microchimerism.com ","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}} 

The end result should be:

 {"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]}, {"RecordId":"2",...}, {"RecordId":"3",...}, {"RecordId":"4",...}, {"RecordId":"5",...} }} 

Commands attempted:

  • sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
  • awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat

These commands work fine for small files, but they do not work on the 2.8 GB file that I have to manipulate. sed quits midway after about 10 minutes for no apparent reason, having written nothing. awk aborted with a segmentation fault (core dumped) after many hours. I tried a search-and-replace in perl and got an "Out of memory" error.

Any help / ideas would be great!

Additional information about my machine:

  • More than 105 GB of disk space is available.
  • 8 GB of memory
  • 4 CPU cores
  • Running Ubuntu 14.04
5 answers

Regarding perl: try setting the input record separator $/ to "}," like this:

 #!/usr/bin/perl
 $/ = "},";          # treat "}," as the end of each input record
 while (<>) {
     print "$_\n";   # print the record followed by a newline
 }

or, as a single line:

 $ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
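Note that with the bigger Accounts sample in the question, the record boundary appears to be "}]}," rather than "}," (the last answer on this page makes the same assumption), so the same one-liner would become:

 $ perl -e '$/="}]},";while(<>){print "$_\n"}' sample.dat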

Since you tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's somewhat off-topic, I believe jq is something you could use for this. It will do a better job than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming the content looks like this (based on your sample):

 {"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }} 

You can easily reformat this to prettify it:

 $ jq '.' < data.json
 {
   "RomanCharacters": {
     "Alphabet": [
       {
         "RecordId": "1",
         "data": "data"
       },
       {
         "RecordId": "2",
         "data": "data"
       },
       {
         "RecordId": "3",
         "data": "data"
       },
       {
         "RecordId": "4",
         "data": "data"
       },
       {
         "RecordId": "5",
         "data": "data"
       }
     ]
   }
 }

And we can dig into the data to retrieve only the records you are interested in (regardless of what they are wrapped in):

 $ jq '.[][][]' < data.json
 {
   "RecordId": "1",
   "data": "data"
 }
 {
   "RecordId": "2",
   "data": "data"
 }
 {
   "RecordId": "3",
   "data": "data"
 }
 {
   "RecordId": "4",
   "data": "data"
 }
 {
   "RecordId": "5",
   "data": "data"
 }
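If you know the key names, an equivalent but more explicit filter (my assumption, based on the sample's wrapping) spells out the path instead of using .[][][]:

 $ jq '.RomanCharacters.Alphabet[]' < data.json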

This is much more readable, both by humans and by tools like awk that process content line by line. If you want to join your lines for processing, as per your question, the awk becomes much simpler:

 $ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
 { "RecordId": "1", "data": "data" }
 { "RecordId": "2", "data": "data" }
 { "RecordId": "3", "data": "data" }
 { "RecordId": "4", "data": "data" }
 { "RecordId": "5", "data": "data" }

Or, as @peak suggested in the comments, eliminate the awk portion entirely by using jq's -c (compact output) option:

 $ jq -c '.[][][]' < data.json
 {"RecordId":"1","data":"data"}
 {"RecordId":"2","data":"data"}
 {"RecordId":"3","data":"data"}
 {"RecordId":"4","data":"data"}
 {"RecordId":"5","data":"data"}
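One caveat (an assumption on my part, not something this answer claims): plain jq parses the whole document into memory, so a 2.8 GB input will need several gigabytes of RAM; if that is too tight on your machine, look into jq's --stream mode (available from jq 1.5). Applied to the actual Accounts sample from the question, a compact-output sketch might look like:

 $ jq -c '.Accounts.Customer[]' < sample.dat > sample-lines.dat

where sample-lines.dat is an illustrative output name.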

Try using } as the record separator, e.g. in Perl:

 perl -l -0175 -ne 'print $_, $/' < input 

You may need to glue back together lines that contain only }.
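For anyone decoding the flags (my gloss, not the answerer's): 0175 is the octal character code for }, so -0175 sets the input record separator $/ to }; -l then chomps that separator from each record and prints a newline after each print. A commented version of the same one-liner:

 # -0175 : set $/ (input record separator) to octal 175, i.e. "}"
 # -l    : chomp $/ from each input record; print "\n" after each print
 # -n    : wrap the body in a while (<>) { ... } read loop
 # print $_, $/ re-appends the "}" that was chomped off
 $ perl -l -0175 -ne 'print $_, $/' < input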


This avoids the memory problem by not treating the data as a single record, but it may go too far in the other direction performance-wise (processing one character at a time). Note also that the RT built-in variable (the text matched by the current record separator) requires gawk:

 $ cat j.awk
 BEGIN { RS="[[:print:]]" }
 RT == "{" { bal++ }
 RT == "}" { bal-- }
 { printf "%s", RT }
 RT == "," && bal == 2 { print "" }
 END { print "" }

 $ gawk -f j.awk j.txt
 {"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
 {"RecordId":"2",...},
 {"RecordId":"3",...},
 {"RecordId":"4",...},
 {"RecordId":"5",...} }}

Using the sample data given here (the one that starts with {"Accounts":{"Customer"...), the solution I went with reads the file in and, while reading, counts occurrences of the delimiter assigned to $/. For every 10,000 delimiters counted it writes the accumulated records out to a new file, and it appends a newline after each delimiter found. The script looks like this:

 #!/usr/bin/perl
 $/ = "}]},";               # delimiter to find and insert a newline after
 $n = 0;
 $match = "";
 $filecount = 0;
 $recsPerFile = 10000;      # set the number of records per output file

 print "Processing " . ($ARGV[0] // "STDIN") . "\n";
 while (<>) {
     $match = $match . $_ . "\n";   # append this record plus a newline
     $n++;
     print ".";             # so that we know it has done something
     if ($n >= $recsPerFile) {
         my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
         open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
         print OUTPUT $match;
         close(OUTPUT);
         $match = "";
         $filecount++;
         $n = 0;
         print "Wrote file " . $newfile . "\n";
     }
 }
 if ($match ne "") {        # flush any remaining records to a final part file
     my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
     open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
     print OUTPUT $match;
     close(OUTPUT);
     print "Wrote file " . $newfile . "\n";
 }
 print "Finished\n\n";

I used this script against the large 2.8 GB file, whose content is unformatted, single-line JSON. The resulting output files will lack the proper JSON headers and footers, but this is easily fixed.
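As a sketch of that fix (my assumption about what "easily fixed" means here, with the header and footer taken from the Accounts sample): wrap each part file in the outer JSON that the split stripped away:

 for f in partfile10000-*.dat; do
     # prepend the opening wrapper, append the closing one
     { printf '%s' '{"Accounts":{"Customer":['; cat "$f"; printf '%s\n' ']}}'; } > "$f.json"
 done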

Thanks a lot, guys, for your contributions!

