How can I check a large number of files using search and replace?

Question

I am currently validating a client's HTML source, and I am getting many validation errors for image and input tags that do not have the omittag. I would fix them manually, but this client literally has thousands of files, with many instances where the tag is not closed.

The client has closed some of the img tags correctly (for some reason).

I am just wondering whether there is a Unix command I could run to check whether each tag has the omittag, and add it where it is missing.

I have done simple search-and-replace jobs with the following command:

find . \! -path '*.svn*' -type f -exec sed -i -n '1h;1!H;${;g;s/<b>/<strong>/g;p}' {} \;

But I have never done anything this large. Any help would be appreciated.

html unix perl omittag

Patrick hankinson Oct 28 '08 at 2:35

2 answers

Try this. It goes through your files, makes a .orig backup copy of each file (perl -i.orig), and replaces <img> and <input> tags with <img /> and <input />.

 find . \! -path '*.svn*' -type f -exec perl -pi.orig -e 's{ ( <(?:img|input)\b ([^>]*?) ) \ ?/?> }{$1\ />}sgxi' {} \;

Given this input:

 <img>
 <img/>
 <img src="..">
 <img src="" >
 <input>
 <input/>
 <input id="..">
 <input id="" >

it changes the file to:

 <img />
 <img />
 <img src=".." />
 <img src="" />
 <input />
 <input />
 <input id=".." />
 <input id="" />
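Because the command above leaves a .orig copy next to every file it touches, the result can be reviewed, and undone if necessary, before the backups are deleted. The following is only a sketch under that assumption; the function names review_backups and restore_backups are hypothetical and not part of the original answer.

```shell
#!/bin/sh
# Sketch: work with the *.orig backups created by "perl -pi.orig".
# Function names are made up for illustration.

# Print a unified diff between each backup and its rewritten file.
review_backups() {
    # $1 is the directory that was processed.
    find "$1" -type f -name '*.orig' -exec sh -c '
        for backup do
            # diff exits with 1 when the files differ, which is expected here.
            diff -u "$backup" "${backup%.orig}" || true
        done
    ' sh {} +
}

# Undo the replacement by moving every backup over its rewritten file.
restore_backups() {
    find "$1" -type f -name '*.orig' -exec sh -c '
        for backup do
            mv -- "$backup" "${backup%.orig}"
        done
    ' sh {} +
}
```

For example, review_backups . | less shows every change made under the current directory, and restore_backups . puts the original files back.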

Here is what the regex does:

 s{(<(?:img|input)\b ([^>]*?)) # capture "<img" or "<input" followed by non-">" chars
   \ ?/?>}                     # optional space, optional slash, followed by ">"
  {$1\ />}sgxi                 # replace with: captured text, plus " />"
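Since the question also asks how to check the files, a read-only pass before any rewriting may be useful. The sketch below is not the Perl regex from this answer; it swaps in a grep extended regular expression that approximates it, and it assumes a tag never spans more than one line, because grep matches line by line. The names unclosed_pattern, list_unclosed and show_unclosed are hypothetical.

```shell
#!/bin/sh
# Sketch: report <img> and <input> tags that do not end in "/>".
# Assumption: each tag sits on a single line (grep works line by line).

# "<img" or "<input", then either nothing or whitespace plus attributes
# whose last character is not "/", then the closing ">".
unclosed_pattern='<(img|input)([[:space:]][^>]*[^/>])?>'

# Print the name of every file containing at least one unclosed tag.
list_unclosed() {
    # $1 is the directory to scan; .svn paths are skipped as in the answer.
    find "$1" \! -path '*.svn*' -type f \
        -exec grep -l -i -E "$unclosed_pattern" {} \;
}

# Print file:line:text for every offending line.
show_unclosed() {
    # The extra /dev/null argument makes grep print the file name.
    find "$1" \! -path '*.svn*' -type f \
        -exec grep -n -i -E "$unclosed_pattern" {} /dev/null \;
}
```

Running list_unclosed . | wc -l gives a rough count of the files that still need fixing, and running it again after the replacement should print nothing if every single-line tag was caught.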

Anirvan Oct 28 '08 at 6:15

joelhardi · Accepted Answer · 2008-10-28T06:16:18+0000

See the questions I asked in the comment above.

Assuming you are using GNU sed, and that you are trying to add a trailing / to your tags to make them XML-compliant <img /> and <input />, then replace the sed expression in your command with this one, and it should do the trick:

 '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}'

Here it is on a simple test file (the SO colorizer does stupid things):

 $ cat test.html
 This is an <img tag> without closing slash.
 Here is an <img tag /> with closing slash.
 This is an <input tag > without closing slash.
 And here one <input attrib="1"
   > that spans multiple lines.
 Finally one <input attrib="1" /> with closing slash.

 $ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
 This is an <img tag/> without closing slash.
 Here is an <img tag /> with closing slash.
 This is an <input tag /> without closing slash.
 And here one <input attrib="1"
   /> that spans multiple lines.
 Finally one <input attrib="1" /> with closing slash.

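This uses GNU sed syntax, and the buffering is what makes a multiline search/replace possible: 1h;1!H collects every line in the hold space, and on the last line ($) the g command copies the collected text back into the pattern space, so the substitution sees the whole file at once.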
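To see why the buffering matters, the same substitution can be run with and without it on a tag whose closing > sits on the next line. This is a small sketch that assumes GNU sed, as the answer does; the function names close_tags_per_line and close_tags_whole_file are made up for the comparison.

```shell
#!/bin/sh
# Sketch: the answer's substitution with and without hold-space buffering.
# Assumes GNU sed (\| alternation in a basic regular expression).

# Plain substitution: sed sees one line at a time, so a tag whose ">" is
# on the following line never matches.
close_tags_per_line() {
    sed 's/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g' "$@"
}

# Buffered substitution, as in the answer:
#   1h   on line 1, copy the line into the hold space
#   1!H  on every other line, append the line to the hold space
#   ${   on the last line only:
#     g    copy the hold space (the whole file) into the pattern space
#     s    run the substitution across all lines at once
#     p    print the result (-n suppresses all other output)
close_tags_whole_file() {
    sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' "$@"
}
```

Piping the two-line tag from the test file through close_tags_per_line leaves it unchanged, while close_tags_whole_file inserts the slash before the > on the second line. The trade-off is that the buffered form holds the entire file in memory, which is normally fine for HTML pages.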

Alternatively, you can use something like Tidy, which is designed for sanitizing bad HTML. That is what I would do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated quickly, so it is usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets whatever options you need.
