Attempting to detect illegal XML characters in a PL/SQL procedure

Here is the puzzle. I want to write a procedure that checks tables for any characters that are illegal in XML. The allowed characters are listed in the W3C Recommendation, but that doesn't matter right now. The important things are:

1) The character 'ç' has code 135 in extended ASCII. This is a fact. However, when I run

  begin
    dbms_output.put_line(ascii('ç'));
  end;

I get 50087.
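(A side note on where 50087 comes from: in a UTF-8 database, ascii() returns the decimal value of the character's multibyte encoding rather than a code-page value. A quick sanity check of that arithmetic outside Oracle, as a Python sketch:)

```python
# 'ç' encodes to two bytes in UTF-8: 0xC3 0xA7.
b = 'ç'.encode('utf-8')
print(b.hex())                    # c3a7

# Read as one big-endian integer, those two bytes give exactly the
# value the question's ascii('ç') call reported:
print(int.from_bytes(b, 'big'))   # 50087  (0xC3A7)
```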

2) When I run

  begin
    dbms_output.put_line(chr(135));
  end;

I get nothing at all.

Well, apparently ascii() and chr() only handle values in the range 0..127. So my question is: how do I find the Unicode equivalents, or write custom functions that work with values like 'ç' and 135?
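(For reference, the character range the W3C Recommendation allows — XML 1.0, production [2] Char: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF — can be sketched outside the database like this; `is_xml_char` is a made-up helper name:)

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is allowed by XML 1.0 (production [2] Char)."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

print(is_xml_char('ç'))      # True  -- ç (U+00E7) is legal in XML
print(is_xml_char('\x07'))   # False -- the BEL control character is not
```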

Help would be greatly appreciated.

PS I am using Oracle SQL Developer.

3 answers

The PL/SQL functions for handling arbitrary character sets (well, as far as the RDBMS knows about them) live in the utl_i18n and utl_raw packages. For your specific problem, I would suggest a test like the following:

  select <pk_column_of_table_to_check>,
         instr( utl_i18n.string_to_raw( <column_to_test>, 'UTF8' ),
                hextoraw( <hex_rep_in_utf8> ) )
    from <table_to_check>;

If you want to check for Unicode characters whose UTF-8 representation is not available to you, use the expression

  utl_raw.convert( hextoraw( <hex_rep_in_utf16> ), 'UTF8', 'UTF16' )

as the second argument to instr. Do not rely on the absolute positions returned by instr, only on the 0 / non-0 dichotomy, since you are comparing at the byte level, not the character level.
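(The byte-level logic behind that query can be sanity-checked outside the database. This Python sketch mimics instr(string_to_raw(col, 'UTF8'), hextoraw('C3A7')), with C3A7 being the UTF-8 hex of 'ç' and 'façade' a made-up sample value:)

```python
column_value = 'façade'            # sample column value containing 'ç'
needle = bytes.fromhex('c3a7')     # UTF-8 bytes of 'ç', like hextoraw('C3A7')

# bytes.find() returns a 0-based byte offset, or -1 when absent;
# as the answer advises, use it only as a found / not-found flag.
pos = column_value.encode('utf-8').find(needle)
print(pos != -1)                   # True -- the byte sequence occurs
```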

UTF-8 and UTF-16 are two different byte-level encodings of the Unicode character set (in the sense of named character entities); details can be found on Wikipedia and unicode.org.

Note that the UTF-8 representation permits substring tests at the byte level by design. Also note that the UTF-16 encoding should be readily available to you, since for characters in the Basic Multilingual Plane it matches the familiar U+<4 hex digits> notation.
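(The conversion step the answer describes — turning a U+XXXX value into the UTF-8 byte sequence you search for — looks like this outside Oracle, using 'ç' = U+00E7 as the example:)

```python
utf16_hex = '00E7'                                   # U+00E7 = ç, big-endian UTF-16
ch = bytes.fromhex(utf16_hex).decode('utf-16-be')    # back to the character
utf8_hex = ch.encode('utf-8').hex().upper()          # its UTF-8 representation
print(utf8_hex)                                      # C3A7 -- what you'd pass to hextoraw()
```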

The byte-level representation of the offending characters should be obtainable from the (XML) standard. Otherwise, you need an idea of what the character is called, so you can look it up in the code point database at unicode.org or somewhere else. There are also online conversion tools for the case where you only know the encoding name but have sample text in a file; I can dig up URIs if you need them.

Hope this helps.

PS: after reading your first comment more closely, I think you may be on a mission impossible: correctly interpreting byte sequences from single-byte encodings requires knowing which encoding was used, and that information is lost when a user copies text from a word processor (set to some specific encoding) into the database (where it is stored in the database character set) — only the byte sequence survives the copy. You may be lucky if both ends are set to Unicode and the db character set is utf8 (then some bad copies will simply fail), but once the data is in the database, you will have a hard time recovering the original (perhaps with dictionary support).


It is not clear what problem you are trying to solve: converting user input to the correct encoding, or verifying that you have valid XML? If the latter, conversion to the built-in XMLType will check that the input is syntactically valid. You can even verify that it conforms to a given XML schema.
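(The same principle outside Oracle: feeding a string to an XML parser validates it as a side effect, since illegal characters or broken syntax make the parse fail. A rough Python analogue of casting to XMLType, with `is_well_formed` as a made-up helper:)

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Parse the text; report whether it is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed('<doc>ok</doc>'))     # True
print(is_well_formed('<doc>\x07</doc>'))   # False -- BEL is illegal in XML
```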


If you just want to get rid of weird characters, you can use the REGEXP_REPLACE function. For example,

  REGEXP_REPLACE(your_value, '[[:cntrl:]]', '')

will delete all control (unprintable) characters.
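(Python's re module has no POSIX character classes, so an equivalent sketch of this answer spells the range out explicitly; Oracle's [[:cntrl:]] covers the ASCII control characters 0x00-0x1F plus DEL, 0x7F. `strip_control_chars` is a made-up name:)

```python
import re

def strip_control_chars(s: str) -> str:
    # Same effect as REGEXP_REPLACE(s, '[[:cntrl:]]', ''):
    # remove ASCII control characters (0x00-0x1F and DEL, 0x7F).
    return re.sub(r'[\x00-\x1f\x7f]', '', s)

print(strip_control_chars('ab\x07c\n'))   # 'abc' -- BEL and newline both removed
```

Note this also strips tab, newline and carriage return, which are legal in XML; keep them by excluding \x09, \x0A and \x0D from the range if that matters.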

REGEXP_REPLACE is available from Oracle 10g Rel. 2 onwards. The documentation is at http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions130.htm

