How to remove invalid characters from an XML file using sed or Perl

I want to get rid of all invalid characters, for example the hexadecimal value 0x1A, from an XML file using sed.
What regex and command line should I use?
EDIT
I have added the Perl tag in the hope of getting more answers. I would prefer a one-line solution.
EDIT
These are the valid XML characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
3 answers

Assuming UTF-8 XML documents:

perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > file_fixed.xml

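If you want to encode the offending characters as numeric character references instead (note that a strict XML 1.0 parser also rejects references to invalid characters, such as &#26;):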

perl -CSDA -pe'
   s/([^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}])/
      "&#".ord($1).";"
   /xeg;
' file.xml > file_fixed.xml

You can call it in several ways:

perl -CSDA     -pe'...' file.xml > file_fixed.xml
perl -CSDA -i~ -pe'...' file.xml     # In place, with backup
perl -CSDA -i  -pe'...' file.xml     # In place, without backup

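The tr command will be easier. So try something like:

tr -d '\032' < file.xml > file_fixed.xml

Note that the ASCII character 0x1a has the octal value 032, which is what is passed to tr here, because tr takes octal escapes rather than hex.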
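
The same approach can be extended from the single byte 0x1A to every control character that XML 1.0 disallows in the ASCII range. This is a minimal sketch, assuming a UTF-8 or ASCII file, where bytes below 0x80 never occur inside a multi-byte sequence. The function name strip_xml_controls is made up for illustration, and the sketch does not touch invalid non-ASCII code points such as U+FFFE and U+FFFF:

```shell
# Hypothetical helper: delete the control bytes that XML 1.0 forbids,
# keeping tab (\011), line feed (\012) and carriage return (\015).
strip_xml_controls() {
    # Octal ranges cover 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F.
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Usage: strip_xml_controls < file.xml > file_fixed.xml
```

Setting LC_ALL=C makes tr work on raw bytes, so multi-byte UTF-8 sequences pass through unchanged.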


Try:

perl -pi -e 's/[^\x9\xA\xD\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}]//g' file.xml
