Reading a text file in D

Is there any one-size-fits-all (more or less) way to read a text file in D?

The requirement is that the function automatically determine the encoding and give me all the file data in a consistent format, for example string or dstring. It should automatically detect byte order marks (BOMs) and interpret them as necessary.

I tried std.file.readText(), but it does not handle different encodings.

(Of course, this will have a non-zero failure rate and this is acceptable for my application.)

+8
d d2 phobos
2 answers

I believe that the only real options for file I/O in Phobos at this point (other than calling C functions) are std.file.readText and std.stdio.File. readText will read the file in as an array of chars, wchars, or dchars (by default immutable(char)[], i.e. string). The encodings are expected to be UTF-8, UTF-16, and UTF-32 for chars, wchars, and dchars respectively, though I would have to dig into the source code to be sure. Any encoding compatible with one of those (for example, ASCII is compatible with UTF-8) should work fine.
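As a minimal sketch (the filenames are placeholders, and the example writes its own sample file so it can run standalone):

```d
import std.file : readText, write;

void main() {
    // Set up a sample UTF-8 file so the example is self-contained.
    write("example.txt", "héllo");

    // Default: readText validates the bytes and returns UTF-8 (string).
    string s = readText("example.txt");
    assert(s == "héllo");

    // Other code-unit widths can be requested via the template parameter;
    // the file must then actually be UTF-16/UTF-32 encoded, or decoding fails:
    // wstring w = readText!wstring("utf16.txt");
    // dstring d = readText!dstring("utf32.txt");
}
```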

If you use File, you have several choices of reading functions, including readln and rawRead. However, you are still essentially reading the file as UTF-8, UTF-16, or UTF-32, just as with readText, or else reading it as binary data and managing it yourself.
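A sketch of the two File-based routes (again writing its own sample file; the filename is a placeholder):

```d
import std.file : write;
import std.stdio : File;

void main() {
    write("example.txt", "line one\nline two\n");

    // Decoded route: iterate the lines as UTF-8 char[] slices.
    size_t lines;
    foreach (line; File("example.txt", "r").byLine())
        ++lines;
    assert(lines == 2);

    // Raw route: read the bytes yourself and deal with the encoding manually.
    auto g = File("example.txt", "rb");
    auto buf = new ubyte[](cast(size_t) g.size);
    g.rawRead(buf);
    assert(buf.length == 18);
}
```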

Since D's character types char, wchar, and dchar are the code units of UTF-8, UTF-16, and UTF-32 respectively, unless you want to read the data in binary, the file must be encoded in something compatible with one of those three Unicode encodings. Given a string in one encoding, you can convert it to another with the functions in std.utf. However, I know of no way to query a file for its encoding, other than using readText to try to read it in a given encoding and seeing whether it succeeds.
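For example, std.conv.to (like the toUTF8/toUTF16/toUTF32 functions in std.utf) transcodes between the three string types:

```d
import std.conv : to;

void main() {
    string  s8  = "héllo";          // UTF-8:  6 code units (é takes 2 bytes)
    wstring s16 = to!wstring(s8);   // UTF-16: 5 code units
    dstring s32 = to!dstring(s8);   // UTF-32: 5 code units

    assert(s8.length  == 6);
    assert(s16.length == 5);
    assert(s32.length == 5);
    assert(to!string(s32) == s8);   // round-trips losslessly
}
```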

So, if you do not want to process the file yourself and work out on the fly how it is encoded, your best bet is probably to try readText with each successive string type, using the first one that succeeds. However, since text files are typically encoded in ASCII or UTF-8, I would expect readText used with a plain string to almost always work just fine.
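A hypothetical helper along those lines (the name readAnyText is mine, not Phobos's) might look like:

```d
import std.conv : to;
import std.file : readText, write;
import std.utf : UTFException;

// Hypothetical helper: try UTF-8, then UTF-16, then UTF-32, normalizing
// whatever succeeds to a UTF-8 string. readText throws UTFException when
// the bytes fail to validate as the requested encoding.
string readAnyText(string path) {
    try return readText!string(path);
    catch (UTFException) {}
    try return to!string(readText!wstring(path));
    catch (UTFException) {}
    return to!string(readText!dstring(path));
}

void main() {
    write("sample.txt", "héllo");            // a plain UTF-8 file
    assert(readAnyText("sample.txt") == "héllo");
}
```

Note the nonzero failure rate the question already allows for: some UTF-16 byte sequences also validate as UTF-8 (for instance, ASCII text in UTF-16LE is full of NUL bytes, which UTF-8 permits), so the UTF-8 attempt can "succeed" on a file that was really UTF-16.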

+8

Regarding BOM checking:

```d
char[] ConvertViaBOM(ubyte[] data) {
    // Decoder stubs, bodies elided:
    char[] UTF8()    { /*...*/ }
    char[] UTF16LE() { /*...*/ }
    char[] UTF16BE() { /*...*/ }
    char[] UTF32LE() { /*...*/ }
    char[] UTF32BE() { /*...*/ }

    // Check the longest BOMs first, cascading down to the shorter ones.
    switch (data.length) {
    default:
    case 4:
        if (data[0..4] == [cast(ubyte)0x00, 0x00, 0xFE, 0xFF]) return UTF32BE();
        if (data[0..4] == [cast(ubyte)0xFF, 0xFE, 0x00, 0x00]) return UTF32LE();
        goto case 3;
    case 3:
        if (data[0..3] == [cast(ubyte)0xEF, 0xBB, 0xBF]) return UTF8();
        goto case 2;
    case 2:
        if (data[0..2] == [cast(ubyte)0xFE, 0xFF]) return UTF16BE();
        if (data[0..2] == [cast(ubyte)0xFF, 0xFE]) return UTF16LE();
        goto case 1;
    case 1:
    case 0:  // too short for any BOM (case 0 added so empty input is safe): assume UTF-8
        return UTF8();
    }
}
```

Adding support for more obscure BOMs is left as an exercise for the reader.

+4

Source: https://habr.com/ru/post/651182/
