Shell script to rename a file using a string found inside the file

I searched for this in forums and on Stack Overflow; it should be here somewhere, but I could not find it.
I am on a Mac, using a terminal to run a shell script that renames some PDF files based on their contents.

I have a directory full of PDF files that I export to text files using an open-source PDF tool. The resulting files have the same name as the PDF file but end in .txt. I create the text files so that I can find a line inside each one with the format Page xx Question xx; for example, Page 43 Question 2. In this example, I would like to rename the PDF file to pg43_q2.pdf.

I think the regex that I want is the following: /Page\s+(\d+)\s+Question\s+(\d+)/ but I'm not sure how to read the two captured numbers and save them to a string that I can use as the file name.

The script I have so far:

 #!/bin/sh
 PDF_FILE_PATH=$1
 echo "Converting pdfs at $PDF_FILE_PATH"
 find "$PDF_FILE_PATH" -name '*.pdf' -print0 | while IFS= read -r -d '' filename; do
     echo $filename
     java -jar pdfbox-app-1.6.0.jar ExtractText "$filename" "$filename.txt"
     NEWNAME=$(sed -n -e '/Page/s/Page\s+\(\d+\)\s+Question\s+\(\d+\).*$/pg\1_q\2/p' "$filename.txt")
     echo "Renaming pdf $filename to $NEWNAME"
     # I would do this next but the $NEWNAME is empty
     # mv "filename" "PDF_FILE_PATH$NEWNAME"
 done

... but the sed command does not put anything in the NEWNAME variable.

I'm not particularly attached to sed; any suggestions would be appreciated.

Edit: the latest version of the script uses the following sed command:

 newname=$(sed -nE -e '/Page/s/^.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2.pdf/p' "$filename.txt") 

This works in about 50% of cases, but the rest of the time the newname variable is empty when I move on to renaming the file.

The third line of the converted file that works:

 Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3) 

The third line of the converted file that does not work:

 Unit 2 Review Page 258 Question 16 a) (a – 4)(a + 7) = a(a + 7) – 4(a + 7) = a2 + 7a – 4a – 28 = a2 + 3a – 28 b) (2x + 3)(5x + 2) = 2x(5x + 2) + 3(5x + 2) = 10x2 + 4x + 15x + 6 = 10x2 + 19x + 6 c) (–x + 5)(x + 5) = –x(x + 5) + 5(x + 5) = –x2 – 5x + 5x + 25 = –x2 + 25 d) (3y + 4)2 = (3y + 4)(3y + 4) = 3y(3y + 4) + 4(3y + 4) = 9y2 + 12y + 12y + 16 = 9y2 + 24y + 16 e) (a – 3b)(4a – b) = a(4a – b) – 3b(4a – b) = 4a2 – ab – 12ab + 3b2 = 4a2 – 13ab + 3b2 f) (v – 1)(2v2 – 4v – 9) = v(2v2 – 4v – 9) – 1(2v2 – 4v – 9) = 2v3 – 4v2 – 9v – 2v2 + 4v + 9 = 2v3 – 6v2 – 5v + 9 
1 answer

(Inappropriate original answer removed.)

 echo 'Unit 2 Review Page 257 Question 9 a) 12 (2)(2)(3)' \
 | sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'

Output:

 pg257_q9

 echo 'Unit 2 Review Page 258 Question 16 a) (a 4)(a + 7) = a(a + 7) 4(a + 7)' \
 | sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}'

Output:

 pg258_q16

Otherwise, you got it right!

(Note that the sed command is the same in both cases.)
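To connect this back to the loop in the question, here is a minimal sketch of how the sed command above might be wrapped so that a file is only renamed when a name was actually found. The function names extract_name and rename_pdf are made up for illustration, and the sketch assumes the text has already been extracted to "$filename.txt" as in the question. The case check is there because p prints the first "Page" line even when the substitution did not match.

```shell
#!/bin/sh
# Hypothetical helper (not from the question): print "pgN_qM" for the first
# line of text file $1 that contains "Page", using the sed command shown above.
extract_name() {
    sed -n '/Page/{s/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/;p;q;}' "$1"
}

# Hypothetical helper: rename PDF $1 based on the text already extracted
# to "$1.txt". Returns 1 and leaves the file alone if no name was found.
rename_pdf() {
    pdf=$1
    newname=$(extract_name "$pdf.txt")
    case $newname in
        pg[0-9]*_q[0-9]*)
            # Keep the renamed file in the same directory as the original.
            mv "$pdf" "$(dirname "$pdf")/$newname.pdf"
            ;;
        *)
            # Empty result, or a "Page" line that the substitution did not change.
            echo "No 'Page N Question M' line found for $pdf; skipping" >&2
            return 1
            ;;
    esac
}
```

Inside the while loop from the question, this would be called as rename_pdf "$filename" right after the ExtractText step.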

I added the leading { and the trailing ;p;q;} so that the sed script processes only the first line containing "Page" and then exits.
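A small sketch of why quitting after the first match matters when the result ends up in a file name. The three-line sample input is made up for illustration; it is not taken from the converted files in the question.

```shell
#!/bin/sh
# Made-up sample input: two lines that both match the pattern.
sample='Unit 2 Review
Page 257 Question 9 a) 12
Page 258 Question 16 b) 7'

pattern='.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$'

# Without q: every matching line is printed, so the variable holds two lines.
all_matches=$(printf '%s\n' "$sample" | sed -n "s/$pattern/pg\1_q\2/p")

# With {...;p;q;}: sed prints the first line containing "Page" and quits.
first_match=$(printf '%s\n' "$sample" | sed -n "/Page/{s/$pattern/pg\1_q\2/;p;q;}")

printf 'all:\n%s\n' "$all_matches"
printf 'first:\n%s\n' "$first_match"
```

Here first_match holds the single value pg257_q9, while all_matches holds one name per matching line, which would not be usable as a file name.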

I have expanded the POSIX character classes to basic terms, i.e. [[:digit:]] = [0-9], and replaced + with a repetition of the initial character class followed by the 'zero or more' character '*', making [0-9][0-9]*. My personal experience, having learned sed on a Sun 3 from the 2nd edition of O'Reilly's sed & awk (with comb binding!), is that all the POSIX stuff is a distraction and an additional source of errors. I am clearly in the minority about this here on SO ;-), but I will admit that the newer seds have some great features anyway.
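As a rough illustration of that equivalence, the sketch below runs the same substitution once in basic syntax and once with -E (extended syntax, as in the question's edited command) on a shortened sample line. Both forms should give the same name. For comparison, the question's first attempt used \d and an unescaped + without -E; in basic syntax + is a literal plus sign, and \d is not a digit class in sed.

```shell
#!/bin/sh
# Shortened sample line, based on the example in the question.
line='Unit 2 Review Page 258 Question 16 a) 12'

# Basic regular expression: escaped groups, "one or more" spelled as X X*.
bre=$(printf '%s\n' "$line" |
    sed -n 's/.*Page[ ][ ]*\([0-9][0-9]*\)[ ][ ]*Question[ ][ ]*\([0-9][0-9]*\).*$/pg\1_q\2/p')

# Extended regular expression (-E): bare groups, +, and POSIX classes.
ere=$(printf '%s\n' "$line" |
    sed -nE 's/.*Page[[:blank:]]+([0-9]+)[[:blank:]]+Question[[:blank:]]+([0-9]+).*$/pg\1_q\2/p')

printf 'basic:    %s\n' "$bre"
printf 'extended: %s\n' "$ere"
```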

Hope this helps.


Source: https://habr.com/ru/post/1413574/

