I am in the process of converting files generated by the ancient DOS-based library program of our university's Chinese Studies department into something more useful and accessible.
One of the problems I am dealing with is that the exported text files (about 80 MB in size) are in mixed encodings. I am on Windows.
German umlauts and other high-ASCII characters seem to be encoded in cp1252, while the CJK characters are in GB18030. Because the encodings "overlap", I can't just drop the whole file into Word or the like and let it do the conversion. For example:
orig:
+Autor: -Yan, Lianke / ÑÖÁ¬¿Æ
result:
+Autor: -Yan, Lianke / 阎连科
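Just to make the overlap concrete: the very same six bytes produce either line, depending on which codepage you assume. A quick check (Encode::HanExtra supplies the GB18030 tables, the rest is core Encode):

use strict;
use warnings;
use Encode qw(decode);
use Encode::HanExtra;                      # provides the gb18030 encoding
binmode STDOUT, ':encoding(UTF-8)';

my $bytes = "\xD1\xD6\xC1\xAC\xBF\xC6";    # the six bytes after "Yan, Lianke / "
print decode('cp1252',  $bytes), "\n";     # ÑÖÁ¬¿Æ
print decode('gb18030', $bytes), "\n";     # 阎连科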
So I wrote a script with several routines that convert the non-ASCII characters in stages. Among other things, it does the following (a condensed sketch of the idea follows the list):
1. Replace certain high-ASCII characters (š, á, etc.) with alphanumeric codes that are unlikely to occur naturally anywhere in the file. Example: -Min, Jie / (šbers.) → -Min, Jie / (uumlautgrossbers.)
(Note: I built the "conversion table" by hand, so I only covered the special characters that actually occur in my document. The conversion is therefore not complete, but it gives adequate results in my case, since our books are mainly in German, English and Chinese; only very few are in languages such as Italian, Spanish or French, and almost none in, say, Czech.)
2. Replace á, £, ¢, ¡ and í with alphanumeric codes only if they are neither preceded nor followed by another character in the high-ASCII range \x80-\xFF. (These are the cp1252 renderings of ß, ú, ó, í and ø, the small Nordic o with a stroke, and they occur on both the cp1252 and the GB18030 lines.)
3. Read the whole file and convert it from GB18030 to UTF-8, which turns the encoded Chinese characters into real Chinese characters.
4. Convert the alphanumeric codes back to their Unicode equivalents.
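Condensed, the staged approach looks roughly like this. This is not my actual script: the temporary file name, the second table entry and every placeholder name except "uumlautgross" are made up for illustration.

use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra;    # gb18030 support

# Steps 1+2: work on the raw bytes and hide the single-byte special
# characters behind ASCII placeholders, so the GB18030 pass can't touch them.
my %codes = (
    "\x9A" => 'uumlautgross',    # the byte shown as "š" above, i.e. Ü
    "\x81" => 'uumlautklein',    # hypothetical second entry, i.e. ü
);

open my $in,  '<:raw', 'export.txt'     or die $!;
open my $out, '>:raw', 'export_tmp.txt' or die $!;

while ( my $line = <$in> ) {

    # unconditional replacements (step 1)
    for my $byte ( keys %codes ) {
        $line =~ s/\Q$byte\E/$codes{$byte}/g;
    }

    # ambiguous bytes (step 2): replace only when NOT next to another
    # high byte, i.e. when they are probably not half of a CJK character
    $line =~ s/(?<![\x80-\xFF])\xE1(?![\x80-\xFF])/eszett/g;    # ß

    print {$out} $line;
}
close $in;
close $out;

# Step 3: decode the remaining bytes as GB18030 and write UTF-8.
open $in,  '<:encoding(gb18030)', 'export_tmp.txt' or die $!;
open $out, '>:encoding(UTF-8)',   'export2.txt'    or die $!;

while ( my $line = <$in> ) {
    # Step 4: turn the placeholders back into the real characters
    $line =~ s/uumlautgross/\x{00DC}/g;    # Ü
    $line =~ s/uumlautklein/\x{00FC}/g;    # ü
    $line =~ s/eszett/\x{00DF}/g;          # ß
    print {$out} $line;
}
close $in;
close $out;

The placeholders are plain ASCII, so the GB18030 pass in step 3 leaves them alone; that is the whole point of doing the replacement before the decode.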
The script basically works, but the following problem remains:
- After converting the original 80 MB file, Notepad++ still considers it an ANSI file and displays it as such. I have to click "Encoding -> Encode in UTF-8" before it is displayed correctly.
What I would like to know:
Generally, is there a better way to convert a mixed encoding file to UTF-8?
If not, should I add "use utf8;" so that I can enter the characters directly instead of their hex representations in the codes2char routine?
Would a BOM at the beginning of the file solve the Notepad++ problem of initially displaying it as an ANSI file? If so, how do I modify my script so that the output file gets a BOM? (A sketch of what I have in mind follows below.)
After the conversion I will call a few more routines (for example to convert the whole file to CSV or ODS). Should I keep using the open statement from the codes2char routine for those?
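Regarding the BOM, I imagine something like this when opening the output file, though I'm not sure it's the right way to do it:

open my $out, '>:encoding(UTF-8)', $output or die $!;
print {$out} "\x{FEFF}";    # U+FEFF; the encoding layer writes it as the BOM bytes EF BB BF
# ... then print the converted lines as before ...

As far as I understand, Notepad++ detects a file that starts with EF BB BF as UTF-8 (with BOM) right away.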
The code consists of several routines that are called at the end:
#!perl -w
use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra;

our $input  = "export.txt";
our $output = "export2.txt";

sub switch_var {
Wow, that turned out long. I hope it's not too confusing.
EDIT
Here is a hexdump of the example above:
01A36596  2B 41                                                     +A
01A365A9  75 74 6F 72 3A 0D 0A 2D 59 61 6E 2C 20 4C 69 61 6E 6B 65  utor: -Yan, Lianke
01A365BC  20 2F 20 D1 D6 C1 AC BF C6 0D 0A 2B 43 6F 2D 41 75 74 6F   / ÑÖÁ¬¿Æ +Co-Auto
01A365CF  72 3A 0D 0A 2D 4D 69 6E 2C 20 4A 69 65 20 2F 20 28 9A 62  r: -Min, Jie / (šb
01A365E2  65 72 73 2E 29 0D 0A                                      ers.)
and two more to illustrate:
1.

000036B3  2D 52 75              -Ru
000036C6  E1 6C 61 6E 64 0D 0A  áland
2.
015FE030  2B 54 69 74 65 6C 3A 0D 0A 2D 57 65 6E 72 6F 75           +Titel: -Wenrou
015FE043  64 75 6E 68 6F 75 20 20 CE C2 C8 E1 B6 D8 BA F1 20 28 47  dunhou  ÎÂÈá¶Øºñ (G
015FE056  65 6E 74 6C 65 6E 65 73 73 20 61 6E 64 20 4B 69 6E 64 6E  entleness and Kindn
015FE069  65 73 73 29 2E 0D 0A                                      ess).
In both cases there is a hex value E1. In the first case it stands for the German sharp s (ß, "Rußland" = "Russia"); in the second case it is part of a multibyte CJK character (reading: "rou").
In the library program, Chinese characters are entered and displayed through an additional program that has to be loaded first. As far as I can tell, it hooks into the graphics driver at a low level, catches the encoded Chinese characters and displays them as characters, while leaving everything else alone. German umlauts etc. are handled by the library program itself.
I don't quite understand how this works, i.e. how the programs know whether hex E1 should be treated as a single character á (and thus converted according to codepage X) or whether it is part of a multibyte character (and thus converted according to codepage Y).
The closest approximation I have found is that a special character is probably part of a Chinese string if there are other special characters directly before or after it (e.g. ÎÂÈá¶Øºñ).
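In other words, the rule of thumb I am relying on boils down to this (a toy sketch using the two hexdump snippets above; whether the rule is actually reliable is exactly what I am unsure about):

use strict;
use warnings;
use Encode qw(decode);
use Encode::HanExtra;
binmode STDOUT, ':encoding(UTF-8)';

# the same byte 0xE1 in the two contexts from the hexdumps
my $german  = "Ru\xE1land";                          # "Rußland" from hexdump 1
my $chinese = "\xCE\xC2\xC8\xE1\xB6\xD8\xBA\xF1";    # "wenrou dunhou" from hexdump 2

for my $bytes ( $german, $chinese ) {
    if ( $bytes =~ /[\x80-\xFF][\x80-\xFF]/ ) {
        # two or more adjacent high bytes: probably CJK, decode as GB18030
        print decode('gb18030', $bytes), "\n";
    }
    else {
        # isolated high byte: a single special character, handled by my conversion table
        print "single special character -> conversion table\n";
    }
}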