Replace fullwidth punctuation characters with their normal-width equivalents

file1 contains some ： (the fullwidth colon, U+FF1A), which I would like to turn into a regular : (our ordinary colon). How can I do this in bash? Or maybe with a Python script?

5 answers

With all due respect, Python is not the right tool for this job; Perl is:

 perl -CSAD -Mutf8 -i.orig -pe 'tr[：][:]' file1 

or

 perl -CSAD -i.orig -pe 'tr[\x{FF1A}][:]' file1 

or

 perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH COLON}][:]' file1 

or

 perl -CSAD -i.orig -Mcharnames=:full -pe 'tr[\N{FULLWIDTH EXCLAMATION MARK}\N{FULLWIDTH QUOTATION MARK}\N{FULLWIDTH NUMBER SIGN}\N{FULLWIDTH DOLLAR SIGN}\N{FULLWIDTH PERCENT SIGN}\N{FULLWIDTH AMPERSAND}\N{FULLWIDTH APOSTROPHE}\N{FULLWIDTH LEFT PARENTHESIS}\N{FULLWIDTH RIGHT PARENTHESIS}\N{FULLWIDTH ASTERISK}\N{FULLWIDTH PLUS SIGN}\N{FULLWIDTH COMMA}\N{FULLWIDTH HYPHEN-MINUS}\N{FULLWIDTH FULL STOP}\N{FULLWIDTH SOLIDUS}][\N{EXCLAMATION MARK}\N{QUOTATION MARK}\N{NUMBER SIGN}\N{DOLLAR SIGN}\N{PERCENT SIGN}\N{AMPERSAND}\N{APOSTROPHE}\N{LEFT PARENTHESIS}\N{RIGHT PARENTHESIS}\N{ASTERISK}\N{PLUS SIGN}\N{COMMA}\N{HYPHEN-MINUS}\N{FULL STOP}\N{SOLIDUS}]' file1 

Perhaps you should take a look at Python's unicodedata.normalize().

This lets you take a Unicode string and normalize it to a specific form, for example:

unicodedata.normalize('NFKC', thestring)
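
To make the effect of NFKC concrete, here is a minimal Python 3 sketch. The helper name to_halfwidth_nfkc and the sample strings are only illustrative. Note that NFKC is broader than the question asks for: it rewrites every compatibility character, not just fullwidth punctuation.

```python
import unicodedata

def to_halfwidth_nfkc(text):
    """Apply NFKC, which folds fullwidth forms to their ordinary counterparts."""
    return unicodedata.normalize("NFKC", text)

# U+FF1A is the fullwidth colon, U+FF01 the fullwidth exclamation mark.
sample = "time\uFF1A 12\uFF1A30\uFF01"
print(to_halfwidth_nfkc(sample))

# Side effect to be aware of: other compatibility characters change too,
# for example the "fi" ligature (U+FB01) and superscript two (U+00B2).
print(to_halfwidth_nfkc("\uFB01 x\u00B2"))
```

If only a handful of characters should change, an explicit translation table is the safer choice.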

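Unicode Standard Annex #15 defines four normalization forms:

 NFD   Canonical Decomposition
 NFC   Canonical Decomposition, followed by Canonical Composition
 NFKD  Compatibility Decomposition
 NFKC  Compatibility Decomposition, followed by Canonical Composition

The compatibility forms (NFKC and NFKD) are the ones that fold fullwidth characters to their ordinary counterparts.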

If you want to replace only certain characters, you can use unicode.translate() (Python 2):

 >>> orig = u'\uFF1A:'
 >>> table = {0xFF1A: u':'}
 >>> print repr(orig)
 u'\uff1a:'
 >>> print repr(orig.translate(table))
 u'::'
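
Since the question is about a file rather than a string, here is a rough Python 3 sketch of the same translation applied to a file in place. The function name replace_in_file, the UTF-8 encoding, and the throwaway temporary file are all assumptions made for the demonstration; in Python 3 the method lives on str rather than unicode.

```python
import os
import tempfile

# Map the code point U+FF1A (fullwidth colon) to an ordinary colon.
TABLE = {0xFF1A: ":"}

def replace_in_file(path, table=TABLE, encoding="utf-8"):
    """Rewrite `path` with `table` applied. The encoding is an assumption."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    with open(path, "w", encoding=encoding) as f:
        f.write(text.translate(table))

# Demonstrate on a temporary file instead of touching a real file1.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("key\uFF1A value\n")

replace_in_file(tmp.name)
with open(tmp.name, encoding="utf-8") as f:
    result = f.read()
os.remove(tmp.name)

print(repr(result))
```

Unlike the perl -i.orig invocation, this sketch keeps no backup copy, so copy the file first if that matters.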

I would agree that Python is not the most efficient tool for this purpose. While the options presented so far are good, sed is another good tool that is likely to be available:

 sed -i 's/\xEF\xBC\x9A/:/g' file.txt 

The -i option causes sed to edit the file in place, as in tchrist's perl example. Note that \xEF\xBC\x9A is the UTF-8 encoding of the code point U+FF1A (which UTF-16 stores as the single code unit 0xFF1A). This page is a useful reference if you need to deal with different encodings of the same Unicode value.
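
If you want to check the byte sequence yourself rather than look it up, a short Python 3 sketch like the following prints the encodings of U+FF1A. The helper hex_escapes is a made-up name for formatting the output.

```python
FULLWIDTH_COLON = "\uFF1A"

utf8_bytes = FULLWIDTH_COLON.encode("utf-8")
utf16_bytes = FULLWIDTH_COLON.encode("utf-16-be")  # big-endian, no BOM

def hex_escapes(data):
    """Format bytes as \\xHH escapes, the notation used in the sed pattern."""
    return "".join(f"\\x{byte:02X}" for byte in data)

print(hex_escapes(utf8_bytes))   # three bytes in UTF-8
print(hex_escapes(utf16_bytes))  # one 16-bit code unit in UTF-16
```

The same approach gives the sed pattern for any other fullwidth character you need to replace.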


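You can try tr:

 cat file.ext | tr "：" ":" > file_new.ext 

Note that this only works if your tr is multibyte-aware. GNU tr operates on bytes, so it will replace each of the three UTF-8 bytes of the fullwidth colon separately instead of treating them as one character.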

In Python 2.x, you can use the unicode.translate method to translate one Unicode code point to 0, 1, or more code points using

 replacement_string = original_string.translate(table) 

The following code sets up a translation table that maps the fullwidth equivalents of all ASCII graphic characters to their ASCII equivalents:

 # ! is 0x21 (ASCII) 0xFF01 (full); ~ is 0x7E (ASCII) 0xFF5E (full)
 table = dict((x + 0xFF00 - 0x20, unichr(x)) for x in xrange(0x21, 0x7F))

(Reference: see Wikipedia.)

If you want to handle spaces the same way, add table[0x3000] = u' ' (U+3000 is the ideographic space).
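
For readers on Python 3, where unichr and xrange no longer exist, an equivalent table could look roughly like this. The names OFFSET and to_ascii_width are illustrative, and the last two tables only demonstrate the "0 or more code points" behaviour mentioned above.

```python
# Fullwidth forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E at a fixed offset.
OFFSET = 0xFF00 - 0x20

table = {code + OFFSET: chr(code) for code in range(0x21, 0x7F)}
table[0x3000] = " "  # ideographic space -> ASCII space

def to_ascii_width(text):
    """Replace fullwidth ASCII variants (and U+3000) with plain ASCII."""
    return text.translate(table)

# "Hello, world!" written with fullwidth characters and an ideographic space.
wide = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F\uFF0C\u3000\uFF57\uFF4F\uFF52\uFF4C\uFF44\uFF01"
print(to_ascii_width(wide))

# A table value of None deletes the character; a longer string expands it.
delete_colon = {0xFF1A: None}
spaced_colon = {0xFF1A: " : "}
print("a\uFF1Ab".translate(delete_colon))
print("a\uFF1Ab".translate(spaced_colon))
```

Because the table covers the whole range, fullwidth letters and digits are converted as well, not only punctuation.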
